# Quiz 2 Preparation


## 1 Principal Component Analysis (PCA)
#### 1.1 What is PCA?

**Definition:** Principal Component Analysis is a dimensionality reduction technique that transforms correlated variables into a set of uncorrelated variables called principal components (PCs). The key goals are:

1. Reduce dimensionality while retaining maximum variance

2. Create uncorrelated features from correlated ones

3. Identify patterns in high-dimensional data

#### 1.2 Standardization: When and Why

**CRITICAL CONCEPT:** Before running PCA, you must understand whether to standardize (center and scale) your data.

**1.2.1 Centering**
Centering means subtracting the mean from each variable:

$x_{centered} = x - \bar{x}$ (1)

**1.2.2 Scaling**

Scaling means dividing by the standard deviation after centering:

$x_{scaled} = \frac{x_{centered}}{s_x}$ (2)

When to standardize:

1. Variables are measured in different units (e.g., weight in kg, height in cm)
2. Variables have vastly different variances
3. You want each variable to contribute equally to the analysis
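Equations (1) and (2) can be checked by hand with NumPy. The sketch below is a minimal illustration on a small made-up dataset; the choice `ddof=0` is an assumption made here to match the population standard deviation that scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Small illustrative dataset: 10 observations, 2 variables
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Equation (1): subtract each column's mean
X_centered = X - X.mean(axis=0)

# Equation (2): divide by each column's standard deviation.
# ddof=0 (population SD) matches StandardScaler; ddof=1 would give the sample SD.
X_scaled = X_centered / X.std(axis=0, ddof=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After this step every variable has mean 0 and standard deviation 1, so no variable dominates the PCA just because of its units.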

In [38]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = [[2.5, 2.4],
     [0.5, 0.7],
     [2.2, 2.9],
     [1.9, 2.2],
     [3.1, 3.0],
     [2.3, 2.7],
     [2, 1.6],
     [1, 1.1],
     [1.5, 1.6],
     [1.1, 0.9]]


# Standardize the data
scaler = StandardScaler() # This centers AND scales
X_scaled = scaler.fit_transform(X)

# Then run PCA
pca = PCA()
pca.fit(X_scaled)
print("Explained variance ratios:", pca.explained_variance_ratio_)


Explained variance ratios: [0.96296464 0.03703536]


#### 1.3 Interpreting PCA Results

**1.3.1 Principal Component Loadings**

The loadings (also called rotation matrix or components) tell you how each original variable
contributes to each PC.

**How to read loadings:**

1. Each column represents one principal component

2. Each row shows how much that original variable contributes

3. Positive values: variable increases with PC

4. Negative values: variable decreases with PC

5. Larger absolute values = stronger contribution


**Get the loadings:**

loadings = pca.components_.T  # Transpose to get variables x PCs

print(loadings)

Example loadings table (four variables):

| Variable | PC1 | PC2 | PC3 | PC4 |
| :------- | :---: | :---: | :---: | :---: |
| Var1 | 0.50 | -0.50 | 0.20 | -0.70 |
| Var2 | 0.49 | 0.51 | -0.30 | 0.10 |
| Var3 | 0.51 | 0.48 | 0.40 | 0.20 |
| Var4 | 0.50 | -0.49 | -0.30 | 0.67 |

**1.3.2 Computing PC Scores**

EXAM TIP: You must know how to write the formula for PC scores!

If data is NOT standardized, PC1 score for observation i is:

$PC1_i = w_1 \times x_{1i} + w_2 \times x_{2i} + \dots + w_p \times x_{pi}$ (3)


If data IS standardized, PC1 score for observation i is:

$PC1_i = w_1 \times \frac{x_{1i} - \bar{x}_1}{s_1} + w_2 \times \frac{x_{2i} - \bar{x}_2}{s_2} + \dots + w_p \times \frac{x_{pi} - \bar{x}_p}{s_p}$ (4)

where $w_j$ are the loadings for PC1, $\bar{x}_j$ are the variable means, and $s_j$ are the standard deviations.

Example: Given loadings [0.5, 0.5, 0.5, 0.5] and standardized data, PC1 equals:

$PC1 = 0.5 \times \frac{x_1 - \bar{x}_1}{s_1} + 0.5 \times \frac{x_2 - \bar{x}_2}{s_2} + 0.5 \times \frac{x_3 - \bar{x}_3}{s_3} + 0.5 \times \frac{x_4 - \bar{x}_4}{s_4}$ (5)
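As a rough check of equation (4), the following sketch computes PC1 scores by hand and compares them with the scores scikit-learn returns. It assumes the small two-variable dataset used earlier in this guide; `pc1_manual` and `pc1_sklearn` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # (x - mean) / sd for each column
pca = PCA().fit(X_scaled)

w = pca.components_[0]               # loadings for PC1, one weight per variable

# Equation (4) written out: weighted sum of the standardized variables
pc1_manual = ((X - scaler.mean_) / scaler.scale_) @ w

# Scores computed by scikit-learn; column 0 is PC1
pc1_sklearn = pca.transform(X_scaled)[:, 0]

print(np.allclose(pc1_manual, pc1_sklearn))  # True
```

The means and standard deviations come from the scaler, and the weights come from the first row of `pca.components_`.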

#### 1.4 Variance and PCA

**KEY FACT: Principal components are ordered by variance!**

- PC1 has the largest variance of all possible linear combinations

- PC2 has the second largest variance, uncorrelated with PC1

- PC3 has the third largest variance, uncorrelated with PC1 and PC2
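Both properties (decreasing variance and uncorrelated components) can be verified numerically. The sketch below uses synthetic data invented for this illustration; `X_demo` and its construction are assumptions, not data from the course.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 3 variables, two of them correlated
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.3 * rng.normal(size=(200, 1)),
    2 * base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

scores = PCA().fit_transform(X_demo)

# Variance of each PC's scores, largest first
pc_var = scores.var(axis=0, ddof=1)
print(pc_var)

# Correlation matrix of the scores: off-diagonal entries are essentially 0
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 6))
```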

In [40]:
# Get the variance explained by each PC
import numpy as np
var_explained = pca.explained_variance_
print(var_explained)

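# Total variance (about 2.22 rather than 2: StandardScaler divides by n,
# while explained_variance_ divides by n - 1, so the total is 2 * 10/9)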
total_variance = np.sum(var_explained)
print("Total Variance explained:", total_variance)

[2.13992141 0.08230081]
Total Variance explained: 2.2222222222222228


#### 1.5 Proportion of Variance Explained (PVE)

PVE tells you what percentage of total variance each PC captures.

$PVE_j = \frac{\text{Variance of PC}_j}{\text{Total Variance}} = \frac{\text{Variance of PC}_j}{\sum_{i=1}^{p} \text{Variance of PC}_i}$

Total Variance = $\sum_{i=1}^{p} \text{Variance of PC}_i$

In [41]:
# Calculate PVE

pve = pca.explained_variance_ratio_
print(pve)

# Illustrative output for a 4-variable dataset:
# [0.80, 0.125, 0.05, 0.025]
# This means:
# PC1 explains 80% of total variance
# PC2 explains 12.5% of total variance
# PC3 explains 5% of total variance
# PC4 explains 2.5% of total variance
# They sum to 1 (100%)

print(np.sum(pve)) # 1.0 (up to floating-point rounding)

[0.96296464 0.03703536]
0.9999999999999999
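The PVE formula can also be applied directly to a list of PC variances. The numbers below are hypothetical and chosen only to keep the arithmetic easy to follow; the cumulative sum is a common follow-up when deciding how many PCs to keep.

```python
import numpy as np

# Hypothetical PC variances, for illustration only
pc_variances = np.array([5.0, 3.0, 1.5, 0.5])

# PVE_j = variance of PC j / total variance
pve = pc_variances / pc_variances.sum()

# Cumulative PVE: share of variance captured by the first k PCs together
cumulative_pve = np.cumsum(pve)

print(pve)             # PVE values: 0.5, 0.3, 0.15, 0.05
print(cumulative_pve)  # cumulative values: 0.5, 0.8, 0.95, 1.0
```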


#### 1.6 Practice Problems: PCA

Problem 1: You run PCA on 5 variables after centering and scaling. The loadings for PC1 are
[0.45, 0.45, 0.44, 0.46, 0.45]. Write the formula for computing PC1 scores. Include the means and
standard deviations in your formula.

Problem 2: Given PVE values [0.65, 0.20, 0.10, 0.05], what percentage of variance does PC1
capture? What about PC1 and PC2 combined?

Problem 3: True or False: If you run PCA on standardized data and all variables have
equal weight (similar loadings) in PC1, then PC1 is approximately an average of the standardized
variables.

Problem 4: You have 4 PCs with variances [16, 3, 0.8, 0.2]. Calculate the PVE for each PC.
Which PC explains the most variance?

## 2 K-Means Clustering

##### 2.1 What is K-Means?
K-means is an unsupervised learning algorithm that partitions data into k clusters. Each observation belongs to the cluster with the nearest mean (centroid).

##### 2.2 How K-Means Works
1. Choose number of clusters k
2. Randomly initialize k cluster centroids
3. Assign each point to nearest centroid
4. Recalculate centroids based on assigned points
5. Repeat steps 3-4 until convergence
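The five steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch` is made up here, and it initializes centroids by picking random observations, whereas `KMeans` uses the k-means++ scheme by default.

```python
import numpy as np

def kmeans_sketch(X, k, seed=0, max_iter=100):
    """Minimal k-means sketch following steps 1-5; not a replacement for sklearn."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: each centroid becomes the mean of its assigned points
        # (a cluster that ends up empty keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(np.bincount(labels, minlength=2))  # cluster sizes
```

Changing `seed` changes the starting centroids, which is why results can differ between runs.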

In [42]:
from sklearn.cluster import KMeans

# Fit k-means with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Get cluster assignments (0 or 1 for 2 clusters)
labels = kmeans.labels_
print(labels) # one label per observation

# Count observations in each cluster
unique, counts = np.unique(labels, return_counts=True)
print(counts) # [6 4] means 6 in cluster 0, 4 in cluster 1

# Get cluster centers
centers = kmeans.cluster_centers_
print(centers)

[0 1 0 0 0 0 0 1 1 1]
[6 4]
[[2.33333333 2.46666667]
 [1.025      1.075     ]]


##### 2.4 Important Notes

EXAM TIP: Pay attention to:

- The `random_state` parameter - affects initialization

- Cluster labels are arbitrary (0, 1, 2, ...) - no inherent ordering

- Cluster sizes are NOT necessarily equal

- Results depend on initialization and can vary
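One way to see these points in practice is to refit with several seeds and compare the labels and sizes. The sketch below assumes the small dataset from this guide; `n_init=1` is set here so that each run uses a single initialization, and `sizes_by_seed` is a name introduced for the example. On data this well separated the grouping may come out the same every time, with only the label numbers swapped.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

sizes_by_seed = {}
for seed in range(3):
    # A single initialization per run, controlled by random_state
    km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X)
    sizes_by_seed[seed] = np.bincount(km.labels_, minlength=2)
    # Labels 0 and 1 may swap between seeds; the sizes always sum to n
    print(seed, km.labels_, sizes_by_seed[seed])
```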

##### 2.5 Interpreting cluster sizes

In [43]:
# If I get an output like:
unique, counts = np.unique(labels, return_counts = True)
print(counts)
# We get [6, 4]

[6 4]


This means:
- There are two clusters (length of array)
- Cluster 0 has 6 observations
- Cluster 1 has 4 observations
- Total: 6 + 4 = 10 observations

**Common Question Format: "There are X subjects in cluster 1 and Y in cluster 2"**

- Make sure X + Y = total number of observations
- Check which number corresponds to which cluster
- Labels can be 0-indexed or 1-indexed depending on context

##### 2.6 Practice Problems: Clustering

**Problem 1:** You run k-means with k=3 on 300 observations. The output shows counts [100, 125,
75]. How many observations are in each cluster? Do clusters have equal sizes?
- 100, 125, 75 observations
- clusters do not have same sizes

**Problem 2:** True or False: K-means always produces clusters of equal size.
- False

**Problem 3:** If you run k-means twice with different random seeds, will you get the same cluster
sizes? Why or why not?
- Not necessarily. Different seeds change the initial centroids, so the algorithm can converge to a different local optimum with different cluster sizes.


## 3 Linear Regression
##### 3.1 Simple Linear Regression
Simple linear regression models the relationship between one predictor X and response Y:

Y = β0 + β1X + ε

where:

- β0 = intercept
- β1 = slope (coefficient)
- ε = random error

##### 3.2 Interpreting Coefficients

**The Slope (β1):** "On average, for every 1-unit increase in X, Y changes by β1 units."

CRITICAL DISTINCTION:
- Average effect: On average, Y changes by β1 when X increases by 1 (an increase if β1 > 0, a decrease if β1 < 0)
- Individual predictions: For specific observations, actual values can vary around the predicted mean

In [44]:
# Output:
# Intercept: 35.82
# Coefficient: -0.044
# This means:
# Predicted MPG_Hwy = 35.82 - 0.044 * Horsepower

Interpretation:

- On average, MPG Hwy decreases by 0.044 for each 1 unit increase in Horsepower
- A car with Horsepower=200 has predicted MPG = 35.82 - 0.044(200) = 27.02
- A car with Horsepower=201 has predicted MPG = 35.82 - 0.044(201) = 26.976

**EXAM TIP:** Watch out for tricky questions!

***TRUE Statement:*** "On average, MPG decreases by 0.044 when Horsepower increases by 1."

***FALSE Statement:*** "For any two specific cars where one has Horsepower=200 and another has Horsepower=201, the first car is GUARANTEED to have higher MPG."

Why is the second false? Because individual observations have variability around the regression
line. The model predicts the mean, not individual values exactly.
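A small simulation makes the distinction concrete. The sketch below reuses the fitted line from the output above; the residual standard deviation `sigma = 3.0` is a hypothetical value chosen only for illustration, since the guide does not report one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Fitted line from the output above
b0, b1 = 35.82, -0.044
# Hypothetical residual standard deviation (an assumption, not from the data)
sigma = 3.0

n_pairs = 1_000_000
mpg_200 = b0 + b1 * 200 + rng.normal(0, sigma, n_pairs)  # cars with Horsepower=200
mpg_201 = b0 + b1 * 201 + rng.normal(0, sigma, n_pairs)  # cars with Horsepower=201

# Average gap is close to 0.044, as the slope says
mean_gap = mpg_200.mean() - mpg_201.mean()
print(mean_gap)

# Share of pairs where the 200-hp car actually has higher MPG: about one half
share_higher = (mpg_200 > mpg_201).mean()
print(share_higher)
```

Under this assumed noise level, the 200-horsepower car wins only slightly more often than a coin flip, even though the average difference matches the slope.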

##### 3.3 Multiple Linear Regression

Multiple regression includes multiple predictors:


Y = β0 + β1X1 + β2X2 + ... + βpXp + ε 

**CRITICAL CONCEPT:** Interpretation of coefficients in multiple regression

β1 = "The average change in Y for a 1-unit increase in X_1, ***holding all other variables constant.***"



In [45]:
# Output:
# Intercept: 46.99
# Horsepower: -0.028
# Weight: -4.01

# Model: MPG_Hwy = 46.99 - 0.028* Horsepower - 4.01* Weight

How to interpret -0.028 for Horsepower:

- "On average, for each 1 unit increase in Horsepower, MPG_Hwy decreases by 0.028, when Weight is held constant."

- You cannot say "1 unit increase in Horsepower always decreases MPG by 0.028" without the "holding Weight constant" qualifier
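The "holding Weight constant" qualifier can be seen by plugging numbers into the fitted equation. In this sketch, `predict_mpg` is a helper written for the example, and the input values (200, 201, 3.5, 3.6) are arbitrary.

```python
def predict_mpg(horsepower, weight):
    """Fitted model from the output above."""
    return 46.99 - 0.028 * horsepower - 4.01 * weight

# Same Weight, Horsepower one unit higher: the change equals the Horsepower coefficient
diff_fixed_weight = predict_mpg(201, 3.5) - predict_mpg(200, 3.5)
print(round(diff_fixed_weight, 3))  # -0.028

# If Weight changes as well, the change in the prediction is no longer -0.028
diff_both = predict_mpg(201, 3.6) - predict_mpg(200, 3.5)
print(round(diff_both, 3))  # -0.429
```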

##### 3.4 Making Predictions

To predict: Plug values into the equation

 **IMPORTANT: You can only use variables that are in the model!**
- If Seats and Length are not in the model, you don’t need their values to predict
- Having extra information doesn’t hurt - just ignore variables not in the model
- You cannot make predictions if required variables are missing
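These three rules can be written as a small helper. The sketch below hard-codes the coefficients of the MPG_Hwy model from Section 3.3; `COEFS`, `INTERCEPT`, and `predict` are names made up for this illustration.

```python
# Coefficients of the fitted model: MPG_Hwy = 46.99 - 0.028*Horsepower - 4.01*Weight
INTERCEPT = 46.99
COEFS = {"Horsepower": -0.028, "Weight": -4.01}

def predict(row):
    """Predict from a dict of values, using only the variables in the model."""
    missing = [name for name in COEFS if name not in row]
    if missing:
        # A required variable is absent, so no prediction is possible
        raise ValueError(f"Cannot predict: missing {missing}")
    # Extra keys (e.g. Seats) are simply never looked at
    return INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)

with_extra = predict({"Horsepower": 240, "Weight": 3.5, "Seats": 5})
print(round(with_extra, 3))  # 26.235

try:
    predict({"Horsepower": 240})  # Weight is missing
except ValueError as err:
    print(err)
```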

In [46]:
# Given: MPG_Hwy = 46.99 - 0.028*Horsepower - 4.01*Weight
# Predict for: Horsepower=240, Weight=3.5

predicted_MPG = 46.99 - 0.028*240 - 4.01*3.5
# = 46.99 - 6.72 - 14.035
# = 26.235


from sklearn.linear_model import LinearRegression
import pandas as pd

# Assumes car_data is a pandas DataFrame with columns
# 'Horsepower', 'Weight', and 'MPG_Hwy' (it is not defined in this guide)

# Simple regression: MPG_Hwy ~ Horsepower
X = car_data[['Horsepower']] # Need double brackets for DataFrame
y = car_data['MPG_Hwy']

model = LinearRegression()
model.fit(X, y)
print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_[0]}")

# Multiple regression: MPG_Hwy ~ Horsepower + Weight
X_multi = car_data[['Horsepower', 'Weight']]
model2 = LinearRegression()
model2.fit(X_multi, y)

print(f"Intercept: {model2.intercept_}")
print(f"Coefficients: {model2.coef_}")
# model2.coef_[0] is coefficient for Horsepower
# model2.coef_[1] is coefficient for Weight

# Make predictions (column names must match those used in fit)
new_data = pd.DataFrame({'Horsepower': [240], 'Weight': [3.5]})
prediction = model2.predict(new_data)
print(f"Predicted MPG: {prediction[0]}")


##### 3.6 Model Quality: R²

R² (R-squared) measures the proportion of variance in Y explained by the model.

$$
R^2 \;=\; 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}
\;=\;
1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

Interpretation:
- R² = 0.465 means “The model explains 46.5% of the variance in Y.”
- Range: 0 to 1 (0% to 100%).
- Higher R² = better fit.
- R² always increases (or stays the same) when adding more variables.

Adjusted R² (penalizes unhelpful variables):
$$
R^2_{\text{adj}} \;=\; 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

where:
- n = number of observations
- p = number of predictors (excluding the intercept)
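Both formulas translate directly into code. The sketch below uses made-up observed and predicted values, and the function names `r_squared` and `adjusted_r_squared` are introduced only for this example.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_residual / SS_total."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical observed and predicted values
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([3.5, 4.5, 7.0, 9.5, 10.5])

r2 = r_squared(y, y_hat)               # SS_res = 1, SS_tot = 40
r2_adj = adjusted_r_squared(r2, n=len(y), p=1)

print(round(r2, 4))      # 0.975
print(round(r2_adj, 4))  # 0.9667
```

Adjusted R² is lower than R² here because the factor (n - 1)/(n - p - 1) = 4/3 inflates the unexplained share.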

##### 3.7 F-Test for Overall Model Significance

The F-test checks whether at least one predictor is useful.

Null and alternative hypotheses:
$$
H_0:\ \beta_1 = \beta_2 = \cdots = \beta_p = 0
\qquad
H_A:\ \text{at least one } \beta_j \neq 0
$$

F-statistic:
$$
F \;=\; \frac{R^2/p}{(1 - R^2)/(n - p - 1)}
$$
where:
- $p$ = number of predictors (excluding intercept)
- $n$ = sample size

Exam tip: Relationship between F-test and $R^2$
- Larger $R^2$ generally leads to a larger $F$-statistic.
- You still need the F-test to assess if $R^2$ is “large enough” to be statistically significant.
- Large $R^2$ alone does not prove significance; use the F-test p-value.

How to reject $H_0$:
- If p-value < $\alpha$ (significance level), reject $H_0$.
- Example: p-value = 0.0001, $\alpha$ = 0.001 → 0.0001 < 0.001 → reject $H_0$.
- “p-value < $2\times 10^{-16}$” means the p-value is extremely small (essentially 0).

In [None]:
import scipy.stats as stats
from sklearn.metrics import r2_score

# Assumes model2 is a fitted LinearRegression, X_multi its predictor
# DataFrame, and y the response

n = len(y)
p = X_multi.shape[1] # number of predictors

# Get R-squared
y_pred = model2.predict(X_multi)
r2 = r2_score(y, y_pred)

# Calculate F-statistic
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

# Get p-value (upper tail of the F distribution; sf = 1 - cdf)
f_pvalue = stats.f.sf(f_stat, p, n - p - 1)

print(f"F-statistic: {f_stat}")
print(f"p-value: {f_pvalue}")

# Interpretation:
if f_pvalue < 0.001:
    print("Reject H0 at alpha=0.001")
    print("At least one predictor is significant")



##### 3.8 Practice Problems: Regression

**Problem 1: Given the model: $\hat{Y} = 50 - 0.05X$**

- What is the predicted Y when X=100?
    - Answer: 45
- What is the predicted Y when X=101?
    - Answer: 44.95
- On average, what happens to Y when X increases by 1?
    - Answer: On average, Y decreases by 0.05 for every 1-unit increase in X.
- If person A has X=100 and person B has X=101, is Y guaranteed to be higher for person A?
    - Answer: No. The predicted (mean) value is higher for person A, but regression estimates the mean; individual observations vary around it, so person B could still have a higher Y.

**Problem 2: Given: $\hat{Y} = 30 + 2X_1 - 5X_2$**

- Interpret the coefficient of X1
    - Answer: Holding X2 constant, Y increases by 2 on average for every 1-unit increase in X1
- Interpret the coefficient of X2
    - Answer: Holding X1 constant, Y decreases by 5 units on average for every 1-unit increase in X2
- Predict Y when X1 = 10, X2 = 3
    - Answer: 30 + 2(10) - 5(3) = 30 + 20 - 15 = 35
- Can you predict Y if you’re given X1 = 10, X2 = 3, X3 = 5?
    - Answer: Yes! Y would still be 35, because X3 is not in the model. Extra information is nice, but unnecessary in this case.

**Problem 3: A model has R² = 0.65 and p=2 predictors, n=200 observations. Calculate the F-statistic.**
- Answer: F = (0.65/2) / (0.35/197) ≈ 183

**Problem 4: True or False: "If R² = 0.8, we can automatically reject H0: β1 = β2 = 0 at α = 0.001." Explain.**
- Answer: False. A large R² alone does not establish significance; you need the F-test p-value (which also depends on n and p) to decide whether H0 should be rejected



## 4 Model Diagnostics
##### 4.1 Residual Plots

Residuals are the differences between observed and predicted values:

$$
e_i = y_i - \hat{y}_i
$$

In [None]:
# 4.1.1 Residuals vs. Fitted Values Plot
# This plot helps check the linearity assumption.
import matplotlib.pyplot as plt

# Assumes model is a fitted regression and X, y are the data it was fit on

# Create residual plot
y_pred = model.predict(X)
residuals = y - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()


**What to look for:**

- ***Good:*** Random scatter around horizontal line at 0

- ***Bad:*** Clear patterns, curves, or systematic deviations

##### 4.2 Interpreting Residual Patterns

Pattern 1: Underestimation
- If residuals are mostly positive (above 0) for certain fitted values
- This means $y_i - \hat{y}_i > 0$, so $y_i > \hat{y}_i$
- The actual values are higher than predictions
- We are underestimating (predicting too low)

Pattern 2: Overestimation

- If residuals are mostly negative (below 0) for certain fitted values
- This means $y_i - \hat{y}_i < 0$, so $y_i < \hat{y}_i$
- The actual values are lower than predictions
- We are overestimating (predicting too high)

**EXAM TIP: Common question pattern**

"For cars with smaller MPG_Hwy (left side of plot), residuals are positive. This suggests:"
- Positive residuals = actual > predicted
- We are predicting too low
- We are underestimating
- Linearity might be a problem
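The sign patterns can be reproduced on synthetic data. The sketch below fits a straight line to a deliberately curved relationship (an assumption made for illustration, not the car data) and averages the residuals in the low, middle, and high thirds of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical curved relationship, fitted with a straight line
x = np.linspace(0, 10, 200)
y = 2 + 0.5 * x**2 + rng.normal(0, 1, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line
fitted = intercept + slope * x
residuals = y - fitted                       # e_i = y_i - y_hat_i

# Average residual in the lowest, middle, and highest third of fitted values
order = np.argsort(fitted)
low, mid, high = np.array_split(residuals[order], 3)
print(low.mean(), mid.mean(), high.mean())
```

Positive averages at the low and high ends mean the line underestimates there, while the negative average in the middle means it overestimates. A systematic pattern like this suggests the linearity assumption is violated.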

##### 4.4 Practice Problems: Diagnostics

**Problem 1:** You fit a model predicting house prices. The residual plot shows that for expensive
houses (right side, high fitted values), most residuals are negative. Are you overestimating or
underestimating expensive houses?

- Answer: negative residuals mean you are overestimating  

**Problem 2:** For a model predicting test scores, you observe positive residuals for students with
low predicted scores. What does this suggest about the model’s predictions for low-performing
students?

- Answer: positive residuals mean you are underestimating

**Problem 3:** True or False: "If all residuals on the left side of the plot are above 0, the model
is overestimating for small values."

- Answer: False, you are underestimating for small values
