
### **Pre-Test**
**Course:** Stat 766 Machine Learning and Predictive Analytics

**Purpose:** This pre-test is designed to gauge your readiness in the core areas of *background knowledge on modeling*, *programming* and *practical implementation* skills that are crucial for a modern machine learning course. The goal is to identify areas where you might need to review *before* the course begins, ensuring your success. Please answer all questions to the best of your ability.



#### **Section A: Programming Mindset & Debugging (Practical Skills)**

**1. Debugging Attitude (Original Question 1, improved)**
Describe your thought process and the first two steps you take when you encounter an error in your code (e.g., a `ValueError` in Python or an `NA/NaN` error in R). Do you get frustrated, or do you see it as a puzzle? What is your strategy?


<p style="margin-bottom: 7em;">When my code raises an error at run time or compile time, I first want to know where it originates. First, I read the error message to see whether it points to a simple mistake. Second, I go to the line where the error is raised and trace backward to find the cause, such as malformed input or a misused function.</p>





**2. Vectorization and basic Linear Algebra**

Given two vectors:
    `a = [1, 2, 3]`
    `b = [4, 5, 6]`
Calculate the dot product (inner product) `a · b`.



<p style="margin-bottom: 5em;"><code>a · b</code> = (1×4) + (2×5) + (3×6) = 4 + 10 + 18 = 32</p>
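
The hand calculation above can be checked with a short NumPy sketch. It assumes NumPy is installed; the variable names `dot_manual` and `dot_numpy` are illustrative only.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise products summed: 1*4 + 2*5 + 3*6
dot_manual = sum(x * y for x, y in zip(a, b))

# The @ operator on two 1-D arrays gives the same inner product as np.dot(a, b)
dot_numpy = a @ b

print(dot_manual, dot_numpy)  # 32 32
```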



**3. Optimization Concept**
What is the most common objective function (the function we minimize) to find the best-fitting line in ordinary least squares (OLS) regression? (State its name or formula.)



<p style="margin-bottom: 4em;">We minimize the sum of squared residuals.</p>
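
As a formula, the sum of squared residuals is `SSR(β) = Σᵢ (yᵢ − ŷᵢ)²`, where `ŷᵢ` is the fitted value for observation `i`. The sketch below is one way to see that the least-squares fit minimizes it. The toy data and the helper name `sum_squared_residuals` are made up for illustration.

```python
import numpy as np

def sum_squared_residuals(beta, X, y):
    """OLS objective: sum of squared differences between y and X @ beta."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

# Toy data (invented for this example): y is roughly 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus x

# Least-squares solution for [beta0, beta1]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

ssr_at_fit = sum_squared_residuals(beta_hat, X, y)
# Nudging the slope away from the fitted value can only increase the objective
ssr_nearby = sum_squared_residuals(beta_hat + np.array([0.0, 0.1]), X, y)

print(ssr_at_fit < ssr_nearby)
```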




**4. Numerical Stability**
(1) For evaluating a model that estimates a probability `p`, the quantity `(y - p) / (p * (1 - p))` needs to be computed.
a) What **numerical** problem can occur when `p` is very close to 0 or 1?


<p style="margin-bottom: 6em;">
If p = 0:
    (y - 0) / (0 * (1 - 0)) = (y - 0) / 0

If p = 1:
    (y - 1) / (1 * (1 - 1)) = (y - 1) / 0

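Both cases divide by zero, which produces Inf or NaN (or raises an error, depending on the language). When p is only very close to 0 or 1, the denominator p * (1 - p) is tiny, so the result can become extremely large and small floating-point errors in p are greatly magnified.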
</p>



b) How would you protect your code against this? Suggest a simple fix.




<p style="margin-bottom: 5em;">Clip p to a safe range before dividing, for example restrict it to [ε, 1 − ε] for a small ε such as 1e-8, so that the denominator p * (1 − p) can never be zero.</p>
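
One possible guard is to clip `p` before dividing. This is a minimal sketch, assuming NumPy arrays as inputs; the function name `stable_ratio` and the default `eps=1e-8` are hypothetical choices, not values prescribed by the course.

```python
import numpy as np

def stable_ratio(y, p, eps=1e-8):
    """Compute (y - p) / (p * (1 - p)) with p clipped away from 0 and 1."""
    p_safe = np.clip(p, eps, 1.0 - eps)  # eps is an assumed tolerance
    return (y - p_safe) / (p_safe * (1.0 - p_safe))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.0, 1.0, 0.5])  # the first two would divide by zero if unclipped

out = stable_ratio(y, p)
print(np.isfinite(out).all())
```

Clipping changes the result only for probabilities within `eps` of the boundaries, so values such as `p = 0.5` pass through unchanged.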




(2) In a project, a method repeatedly models the mean of the data using `mu = exp(X * beta)` and computes `e = (y - mu) / mu`, where `y` and `mu` are both vectors.
a) What potential **programming** issue (e.g., error or warning) could arise if this formula is directly implemented in a language like R or Python? *(Hint: think about dimensions and element-wise operations)*



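<p style="margin-bottom: 7em;"><code>y</code> and <code>mu</code> could have different lengths or shapes, in which case the element-wise subtraction and division either fail (NumPy raises a <code>ValueError</code>) or silently recycle or broadcast values (R recycles the shorter vector, with a warning when the lengths are not multiples). In addition, <code>X * beta</code> is element-wise multiplication in both R and NumPy; the linear predictor needs matrix multiplication (<code>%*%</code> in R, <code>@</code> in NumPy).</p>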
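
The difference between element-wise and matrix multiplication, and the error raised for mismatched lengths, can be seen in a small NumPy sketch. The array values here are invented for illustration.

```python
import numpy as np

X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5]])   # 3 observations, 2 columns
beta = np.array([0.1, 0.2])  # one coefficient per column

elementwise = X * beta  # broadcasts to shape (3, 2): not the linear predictor
linear_pred = X @ beta  # matrix-vector product, shape (3,)

print(elementwise.shape, linear_pred.shape)

# Arrays whose lengths cannot be broadcast together raise a ValueError
try:
    np.array([1.0, 2.0, 3.0]) - np.array([1.0, 2.0])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)
```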


b) How would you correctly implement this calculation in Python using NumPy? Write a line of code.


<p style="margin-bottom: 7em;">
Assumptions:
<ul>
    <li>
        <code>X</code> is an n × p NumPy array, <code>beta</code> is a 1-D array of length p, and <code>y</code> is a 1-D array of length n, so <code>mu</code> also has length n.
    </li>
</ul>
<code><pre>
import numpy as np
mu = np.exp(X @ beta)  # matrix product, then element-wise exp
e = (y - mu) / mu      # element-wise subtraction and division
</pre></code>
</p>


---

#### **Section B: Linear Algebra & Model Fundamentals**

**5. Linear Regression Interpretation**
Consider a simple linear regression model defined by the equation `ŷ = β₀ + β₁x`.

 (1) What is the interpretation of the coefficient `β₁`? What does it mean if `β₁` is positive? What if it is negative?



<p style="margin-bottom: 5em;">β₁ is the slope of the fitted line: the expected change in ŷ for a one-unit increase in the predictor x. We are interested in it because it summarizes the relationship between x and the response variable. If β₁ is positive, ŷ increases by β₁ units for every one-unit increase in x. If β₁ is negative, ŷ decreases by |β₁| units for every one-unit increase in x.</p>


 (2) Suppose the x and y observations are in an Excel file. What will happen to the test of `β₁ = 0` if all the data points were accidentally copied and pasted twice in the Excel file?


<p style="margin-bottom: 10em;"></p>
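
One way to explore this question is a small simulation. The sketch below uses simulated data and a hypothetical helper `slope_and_se`. It assumes "pasted twice" means every row appears exactly two times. Under that assumption the slope estimate stays the same while its standard error shrinks by a factor of `sqrt((n - 2) / (2 * (n - 1)))`, roughly `1/√2`, so the t-statistic is inflated and the p-value looks smaller than it should.

```python
import numpy as np

def slope_and_se(x, y):
    """Return the OLS slope and its standard error for y = b0 + b1 * x."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    sigma2 = np.sum(resid ** 2) / (n - 2)  # residual variance estimate
    return b1, np.sqrt(sigma2 / sxx)

rng = np.random.default_rng(0)  # simulated data, for illustration only
n = 30
x = rng.normal(size=n)
y = 1.0 + 0.3 * x + rng.normal(size=n)

b1, se = slope_and_se(x, y)
# Duplicate every observation, as if the rows had been pasted a second time
b1_dup, se_dup = slope_and_se(np.tile(x, 2), np.tile(y, 2))

print(np.isclose(b1, b1_dup), se_dup < se)
```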




**6. Model Evaluation**
Name one metric for evaluating a linear regression model and one for evaluating a binary classification model. Briefly state what each measures.
- **Regression Metric:** Mean Absolute Error (Measures: the average absolute difference between observed and predicted values, in the units of the response.)

- **Classification Metric:** Accuracy (Measures: the proportion of all predictions, both positive (1) and negative (0), that match the true label.)
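
Both metrics can be computed in a line each. The sketch below uses made-up observed and predicted values purely for illustration.

```python
import numpy as np

# Regression: mean absolute error
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
mae = np.mean(np.abs(y_true - y_pred))  # (0.5 + 0.0 + 1.5) / 3

# Binary classification: accuracy
labels = np.array([1, 0, 1, 1])
predicted = np.array([1, 0, 0, 1])
accuracy = np.mean(labels == predicted)  # 3 of 4 predictions are correct

print(mae, accuracy)
```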

**7. Correlation and Multicollinearity (Original Question 4, improved)**
You are building a linear regression model with five features (X1, X2, X3, X4, X5). The correlation matrix shows:
- `cor(X1, X3) = -0.95`
- `cor(X1, X4) = 0.9`
- `cor(X4, X5) = -0.98`

a) What is the name of the **statistical problem** introduced by these high correlations?

<p style="margin-bottom: 4em;">Multicollinearity (also called collinearity)</p>


b) What is one **practical consequence** for the model's performance or interpretation if we include all these variables?



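<p style="margin-bottom: 4em;">The coefficient estimates become unstable: their standard errors are inflated, and their signs and magnitudes can change sharply with small changes in the data. This makes individual coefficients hard to interpret, even when the model's overall predictions remain reasonable.</p>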


c) Name one technique or tool we can use to **diagnose** this problem before building the model.


<p style="margin-bottom: 5em;">Compute the variance inflation factor (VIF) for each feature, or inspect the pairwise correlation matrix (for example, as a heatmap), and flag features with large VIFs or correlations near ±1.</p>
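
A variance inflation factor check can be sketched with NumPy alone, using the fact that the VIFs are the diagonal entries of the inverse of the feature correlation matrix. The features below are simulated and the helper name `vif_from_correlation` is hypothetical. A common rule of thumb treats VIFs above roughly 5 to 10 as a warning sign, though the cutoff is a convention rather than a strict rule.

```python
import numpy as np

def vif_from_correlation(corr):
    """VIF for each feature: the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)          # simulated features, illustration only
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)               # unrelated to x1
x3 = -x1 + 0.3 * rng.normal(size=500)   # strongly negatively correlated with x1
features = np.column_stack([x1, x2, x3])

corr = np.corrcoef(features, rowvar=False)  # columns are the variables
vif = vif_from_correlation(corr)

print(np.round(vif, 1))
```

In this setup the first and third features should show large VIFs, while the second stays close to 1.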



---

#### **Section C: Python Programming Readiness (Practical Implementation)**

**8. Basic Control Flow**
Write a Python function called `find_max` that takes a list of numbers as input and returns the maximum value. **Do not use the built-in `max()` function.**


<p style="margin-bottom: 6em;">
<code>
<pre>
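def find_max(x: list[float]) -> float:
    if not x:
        raise ValueError("find_max() requires a non-empty list")
    largest = x[0]  # start from the first element so all-negative lists work
    for value in x[1:]:
        if value > largest:
            largest = value
    return largest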
</pre>
</code>
</p>






**9. Data Manipulation with Pandas**
The code below creates a DataFrame and performs an operation.
```python
import pandas as pd
data = {'Student': ['Alice', 'Bob', 'Charlie'], 'Score': [95, 88, 72]}
df = pd.DataFrame(data)
print(df[df['Score'] > 80])
```
a) What is the **data type** of the object inside the `df[...]` selector? (e.g., boolean, integer, string)

<p style="margin-bottom: 4em;">It is a pandas Series of boolean values (dtype <code>bool</code>). The comparison <code>df['Score'] > 80</code> produces one True/False value per row, and the DataFrame keeps the rows where the value is True.</p>
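
The selector can be inspected on its own by assigning it to a variable. In this short sketch the name `mask` is illustrative and pandas is assumed to be installed.

```python
import pandas as pd

data = {'Student': ['Alice', 'Bob', 'Charlie'], 'Score': [95, 88, 72]}
df = pd.DataFrame(data)

mask = df['Score'] > 80  # a Series with one True/False value per row

print(type(mask).__name__)  # Series
print(mask.dtype)           # bool
print(mask.tolist())        # [True, True, False]
```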

b) What is the **output** of the `print` statement?


<p style="margin-bottom: 4em;">
<code>
<pre>
  Student  Score
0   Alice     95
1     Bob     88
</pre>
</code>
</p>


**10. Understanding Libraries**
Which Python library would you primarily use for the following tasks?
a) Efficient numerical computations on arrays: `NumPy`

b) Creating static, interactive, and animated visualizations: `Matplotlib` (or `Plotly` for interactive charts)

c) Training a complex model like a gradient boosting machine: `scikit-learn` (e.g., `sklearn.ensemble.GradientBoostingClassifier`)

---


