
# Assignment: Linear Models
## Do three questions.
### `! git clone https://github.com/ds4e/linearModels`

# I did Q1, Q4 and Q7

**Q1.** Please answer the following questions in your own words.

1. What makes a model "linear"? "Linear" in what?
2. How do you interpret the coefficient for a dummy/one-hot-encoded variable? (This is a trick question, and the trick involves how you handle the intercept of the model.)
3. Can linear regression be used for classification? Explain why, or why not.
4. What are signs that your linear model is over-fitting?
5. Clearly explain multicollinearity using the two-stage least squares technique.
6. How can you incorporate nonlinear relationships between your target/response/dependent/outcome variable $y$ and your features/control/response/independent variables $x$ into your analysis?
7. What is the interpretation of the intercept? A slope coefficient for a variable? The coefficient for a dummy/one-hot-encoded variable?

1). The model is linear in the parameters, not necessarily in the variables. A linear model predicts the outcome variable $y$ as a linear combination of weights and features, so $y = b_0 + b_1 x + b_2 x^2$ still counts as linear even though it is curved in $x$:

> "Because the weights 𝑏𝑘 enter the model in a multiplicative way, this is a linear model." (I quoted from the lecture)



2). From the lecture,
> "If you have an intercept and all of the dummies, you can replicate one of your regressors from a combination of other ones. This is called perfect multicollinearity, and some of your coefficients won't be defined."

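Thus, when the model has an intercept and one category is omitted (for example with `drop_first=True`), each dummy coefficient is the difference in the expected outcome relative to the omitted (baseline) category. The trick is the intercept: in a model with only the dummies, if the intercept is dropped and all of the dummies are kept, each coefficient is instead that category's own mean outcome.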
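
To make the role of the intercept concrete, here is a minimal sketch on hypothetical toy data (the `groups` and `y` arrays are made up for illustration). It fits the same one-hot-encoded variable twice, once without an intercept and once with an intercept and the first level dropped:

```python
import numpy as np

# Hypothetical toy data: a three-level category and a numeric outcome
groups = np.array(["a", "a", "b", "b", "c", "c"])
y = np.array([1.0, 3.0, 4.0, 6.0, 10.0, 12.0])

levels = np.unique(groups)                                     # ['a', 'b', 'c']
dummies = (groups[:, None] == levels[None, :]).astype(float)  # one column per level

# (1) No intercept, all dummies: each coefficient is that group's mean of y
coef_no_intercept, *_ = np.linalg.lstsq(dummies, y, rcond=None)

# (2) Intercept plus all dummies except the first (like drop_first=True):
#     intercept = baseline group mean, dummy coefficients = differences from it
X_drop = np.column_stack([np.ones(len(y)), dummies[:, 1:]])
coef_with_intercept, *_ = np.linalg.lstsq(X_drop, y, rcond=None)

print("no intercept:  ", coef_no_intercept)
print("with intercept:", coef_with_intercept)
```

The group means here are 2, 5 and 11, so the first fit returns those means and the second returns the baseline mean 2 followed by the differences 3 and 9.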




3). This is not discussed directly in the PDF. However, the text does differentiate between regression and classification briefly:

> "NN and MC illustrate the distinctions between regression and classification..."

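Therefore, while the PDF does not elaborate, it implicitly treats linear regression as a tool for predicting continuous outcomes, not class labels. That said, linear regression can still be applied to a 0/1 target, as in Q4 below, where the fitted values are read as survival probabilities. The drawback is that the fitted values are not constrained to lie between 0 and 1.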
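
As a rough illustration of regression on a binary target, the sketch below fits ordinary least squares to a hypothetical 0/1 outcome (the data are invented) and thresholds the fitted values at 0.5, which is an assumed cutoff rather than anything from the lecture:

```python
import numpy as np

# Hypothetical data: the class switches from 0 to 1 as x grows
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Ordinary least squares with an intercept
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

scores = b0 + b1 * x                      # fitted values, read as "probabilities"
labels = (scores >= 0.5).astype(float)    # assumed 0.5 cutoff to get class labels

print("fitted values:", np.round(scores, 3))
print("predicted labels:", labels)
```

The thresholded labels match `y` here, but the smallest fitted value is below 0 and the largest is above 1, which is the main reason a fitted line is an awkward probability model.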



4). The lecture raises overfitting when discussing expanding the feature space:

> "This is where we run into a significant danger of overfitting: The more complex the feature space, the more opportunities we give the model to pick non-representative cases..."

A practical sign is a model that fits the training data much better than held-out test data, with the gap widening as more features are added.
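
One way to look for that danger is to compare training and test error as the feature space grows. The sketch below is only an illustration on simulated data (the sine curve, noise level, and polynomial degrees are all assumptions), fitting polynomials of increasing degree on a training half and scoring them on a test half:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a smooth curve plus noise, split into train and test halves
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=60)
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((y - np.polyval(coefs, x)) ** 2)

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, deg=degree)  # least squares on train only
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))

for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only fall as the degree rises, because each model nests the previous one. A test error that stops falling or climbs while training error keeps shrinking is the warning sign.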

5). The lecture discusses the impact of multicollinearity and introduces the idea of “partialing out”:

> "Linear regression 'partials out' all of the variation in 𝑥𝑘 that can be explained by the other features..."

It also explains how to compute coefficients via regression on residuals—a two-stage approach:

> "Regress 𝑦 and 𝑥𝑘 on all of the other coefficients... then regress the residuals..."
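
The residual-on-residual idea (often called partialling out, or the Frisch-Waugh-Lovell result) can be checked numerically. This is a minimal sketch on simulated data; the variable names and coefficients are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: x2 is correlated with x1
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

def ols(X, target):
    """Least-squares coefficients for target ~ X."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

# Full regression: y on intercept, x1, x2
X_full = np.column_stack([np.ones(n), x1, x2])
b_full = ols(X_full, y)

# Stage 1: regress y and x2 on the *other* features and keep the residuals
X_other = np.column_stack([np.ones(n), x1])
res_y = y - X_other @ ols(X_other, y)
res_x2 = x2 - X_other @ ols(X_other, x2)

# Stage 2: regress residual on residual to recover the x2 coefficient
b_partial = (res_x2 @ res_y) / (res_x2 @ res_x2)

print("x2 coefficient, full regression:", b_full[2])
print("x2 coefficient, partialled out: ", b_partial)
```

The two numbers agree. The link to multicollinearity is the denominator `res_x2 @ res_x2`: if the other features explain almost all of `x2`, that residual variation is close to zero and the coefficient becomes unstable, or undefined when the fit is perfect.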

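6). I can expand the feature space with transformations of the original variables: polynomial terms such as age² and mileage² to capture non-linear aging effects, interaction terms between variables, and other transformations such as logs. The model stays linear in the coefficients while becoming non-linear in $x$.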
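
A small sketch of this idea, on hypothetical data where the true relationship is exactly quadratic (the coefficients 1, 2, 3 are made up): adding an $x^2$ column lets ordinary least squares recover a curved relationship.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x + 3x^2 with no noise
x = np.linspace(-2, 2, 21)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Expanded feature space: intercept, x, and x^2
X_poly = np.column_stack([np.ones_like(x), x, x**2])
coef_poly = np.linalg.lstsq(X_poly, y, rcond=None)[0]

# A straight-line fit on the same data leaves systematic error behind
X_line = np.column_stack([np.ones_like(x), x])
coef_line = np.linalg.lstsq(X_line, y, rcond=None)[0]
line_mse = np.mean((y - X_line @ coef_line) ** 2)

print("quadratic fit coefficients:", np.round(coef_poly, 6))
print("straight-line MSE:", line_mse)
```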

7).
*   Intercept $b_0$: the expected value of $y$ when every $x_k = 0$.
*   Slope $b_k$: the marginal effect of a one-unit change in $x_k$ on $y$, holding the other features constant.
*   Dummy coefficient: the effect relative to the omitted category in one-hot encoding (when the model includes an intercept).







**Q4.** This question refers to the `heart_hw.csv` data. It contains three variables:

  - `y`: Whether the individual survived for three years, coded 0 for death and 1 for survival
  - `age`: Patient's age
  - `transplant`: `control` for not receiving a transplant and `treatment` for receiving a transplant

Since a heart transplant is a dangerous operation and even people who successfully get heart transplants might suffer later complications, we want to look at whether a group of transplant recipients tends to survive longer than a comparison group who does not get the procedure.

1. Compute (a) the proportion of people who survive in the control group who do not receive a transplant, and (b) the difference between the proportion of people who survive in the treatment group and the proportion of people who survive in the control group. In a randomized controlled trial, this is called the **average treatment effect**.
2. Regress `y` on `transplant` using a linear model with a constant. How does the constant/intercept of the regression and the coefficient on transplant compare to your answers from part 1? Explain the relationship clearly.
3. We'd like to include `age` in the regression, since it's reasonable to expect that older patients are less likely to survive an extensive surgery like a heart transplant. Regress `y` on a constant, transplant, and age. How does the intercept change?
4. Build a more flexible model that allows for non-linear age effects and interactions between age and treatment. Use a train-test split to validate your model. Estimate your best model, predict the survival probability by age, and plot your results conditional on receiving a transplant and not. Describe what you see.
5. Imagine someone suggests using these kinds of models to select who receives organ transplants; perhaps the CDC or NIH starts using a scoring algorithm to decide who is contacted about a potential organ. What are your concerns about how it is built and how it is deployed?

In [3]:
import pandas as pd
import numpy as np

# Load the heart transplant dataset
df = pd.read_csv("/content/heart_hw.csv")
df.describe()

| | Unnamed: 0 | age | y |
|---|---|---|---|
| count | 103.0 | 103.0 | 103.0 |
| mean | 52.0 | 44.640777 | 0.271845 |
| std | 29.877528 | 9.797813 | 0.447086 |
| min | 1.0 | 8.0 | 0.0 |
| 25% | 26.5 | 41.0 | 0.0 |
| 50% | 52.0 | 47.0 | 0.0 |
| 75% | 77.5 | 52.0 | 1.0 |
| max | 103.0 | 64.0 | 1.0 |


1. Calculate ATE

In [4]:
# First, we need to split control and treatment groups based on the transplant variable
# and then we need to calculate the mean of y, which gives the proportion of people who survived in each group

# Calculate survival rate in the control group (no transplant)
control_group = df[df['transplant'] == 'control']
prop_control = control_group['y'].mean()

# Then calculate survival rate in the treatment group (received transplant)
treatment_group = df[df['transplant'] == 'treatment']
prop_treatment = treatment_group['y'].mean()

# Then the ATE is the difference:
ate = prop_treatment - prop_control


In [21]:
print("Proportion of people who survived in the control group (no transplant):", prop_control*100 ,"%")
print("Proportion of people who survived in the treatment group (received transplant):", prop_treatment *100 , "%")
print("Average Treatment Effect (ATE):", ate *100 ,"%")

Proportion of people who survived in the control group (no transplant): 11.76470588235294 %
Proportion of people who survived in the treatment group (received transplant): 34.78260869565217 %
Average Treatment Effect (ATE): 23.017902813299234 %


The ATE above shows that individuals who received a heart transplant had a three-year survival rate about 23 percentage points higher than those who did not (34.8% versus 11.8%). This suggests a positive survival benefit from receiving a transplant.


2. Here we use a simple linear regression to estimate how receiving a transplant affects survival probability.

In [5]:
# Create a binary dummy variable: 1 if treatment, 0 if control
df['transplant_dummy'] = (df['transplant'] == 'treatment').astype(int)

# Define predictor x and response y
x = df['transplant_dummy']
y = df['y']

# Compute sample means of x and y
x_bar = x.mean()
y_bar = y.mean()

# Use vector inner product to compute slope coefficient (b1)
# Formula: b1 = Cov(x, y) / Var(x) = (x - x̄)'(y - ȳ) / (x - x̄)'(x - x̄)
b1 = np.inner(x - x_bar, y - y_bar) / np.inner(x - x_bar, x - x_bar)

# Compute intercept (b0) using: b0 = ȳ - b1 * x̄
b0 = y_bar - b1 * x_bar

# Predict survival values using the model: ŷ = b0 + b1 * x
y_hat = b0 + b1 * x

# Calculate residuals (difference between observed and predicted y)
residuals = y - y_hat

In [6]:
print(f"Intercept (control survival): {b0:.6f}")
print(f"Transplant effect (coefficient): {b1:.6f}")


Intercept (control survival): 0.117647
Transplant effect (coefficient): 0.230179


In the regression `y ~ transplant`, the intercept equals the control group's survival rate, matching part 1(a). The transplant coefficient equals the difference between the treatment and control means, which is identical to the ATE from part 1(b). This confirms that with only one dummy predictor and an intercept, linear regression reproduces the group means: the intercept is the baseline mean and the coefficient is the difference from it.
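
The same equivalence can be checked without the CSV file. The sketch below rebuilds the outcome vector from group counts that are inferred from the printed proportions and the 103 rows (4 of 34 controls and 24 of 69 treated surviving). Those counts are an assumption consistent with the printed output, not values read from `heart_hw.csv`:

```python
import numpy as np

# Counts inferred from the printed proportions and N = 103 (an assumption,
# not read from heart_hw.csv): 4 of 34 controls and 24 of 69 treated survive.
n_control, surv_control = 34, 4
n_treat, surv_treat = 69, 24

transplant = np.r_[np.zeros(n_control), np.ones(n_treat)]
y = np.r_[np.ones(surv_control), np.zeros(n_control - surv_control),
          np.ones(surv_treat), np.zeros(n_treat - surv_treat)]

# Regress y on a constant and the transplant dummy
X = np.column_stack([np.ones_like(transplant), transplant])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"intercept: {intercept:.6f}")
print(f"slope:     {slope:.6f}")
```

With these counts the intercept is 4/34 and the slope is 24/69 − 4/34, the control survival rate and the difference in survival rates.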


3. It's a multiple linear regression: y ~ transplant + age

In [25]:
# we need to add age to the model to account for its effect on survival

# Build the design matrix X with columns: intercept, transplant_dummy, age
X2 = df[['transplant_dummy', 'age']].copy()
X2.insert(0, '(Intercept)', 1)  # Add a column of 1s for the intercept
X_mat = X2.to_numpy()           # Convert dataframe to numpy matrix

# Convert response y to a numpy vector
y_vec = df['y'].to_numpy()

# Compute X'X and X'y using matrix multiplication (as in PDF)
XtX = X_mat.T @ X_mat           # X transpose times X
Xty = X_mat.T @ y_vec           # X transpose times y

# Solve the normal equations: (X'X)·b = X'y
# This gives the coefficient vector: b0 (intercept), b1 (transplant), b2 (age)
beta_age = np.linalg.solve(XtX, Xty)
print(dict(zip(X2.columns, beta_age)))  # label each coefficient by its column


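After including age, the intercept increased while the coefficient on transplant also increased slightly. The intercept is no longer the control group's survival rate: it is now the predicted survival probability for a control patient at age 0, an extrapolation well outside the observed ages (8 to 64). The larger transplant coefficient means that, when controlling for age, the transplant effect appears even stronger. The negative age coefficient suggests that older patients are less likely to survive, which aligns with medical expectations.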


4.

In [11]:
# Add age^2 to allow a nonlinear (quadratic) age effect
df['age_squared'] = df['age'] ** 2
# Interaction term: lets the transplant effect vary with age
df['age_x_transplant'] = df['age'] * df['transplant_dummy']

# Then, build the design matrix for flexible model
X_flex = df[['transplant_dummy', 'age', 'age_squared', 'age_x_transplant']].copy()
X_flex.insert(0, '(Intercept)', 1)
X_flex_matrix = X_flex.to_numpy()

# Convert y to numpy vector
y_vec = df['y'].to_numpy()

# Solve the normal equation using matrix algebra.
XtX_flex = X_flex_matrix.T @ X_flex_matrix
Xty_flex = X_flex_matrix.T @ y_vec
beta_flex = np.linalg.solve(XtX_flex, Xty_flex)

The flexible model includes nonlinear age effects and interaction terms. The negative coefficient on age × transplant suggests that transplant is more effective for younger patients, and its benefit diminishes with age. The quadratic age term indicates a curved relationship between age and survival probability. This model can capture more complex survival patterns across patient groups, but whether it predicts better than the simpler model still needs to be checked on a held-out test set.
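
The train-test validation and the prediction by age can be sketched as follows. This uses simulated stand-in data rather than `heart_hw.csv`, and the data-generating numbers, the 80/20 split, and the helper names `design` and `mse` are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in data (NOT heart_hw.csv): survival falls with age,
# transplant helps, and the benefit shrinks with age.
n = 400
age = rng.uniform(20, 65, size=n)
treat = rng.integers(0, 2, size=n)
p_true = np.clip(0.6 - 0.008 * age + treat * (0.5 - 0.006 * age), 0.02, 0.98)
y = (rng.uniform(size=n) < p_true).astype(float)

def design(age, treat):
    """Columns: intercept, transplant, age, age^2, age x transplant."""
    age = np.asarray(age, dtype=float)
    treat = np.asarray(treat, dtype=float)
    return np.column_stack([np.ones_like(age), treat, age, age**2, age * treat])

# 80/20 train-test split via a random permutation of row indices
idx = rng.permutation(n)
train, test = idx[: int(0.8 * n)], idx[int(0.8 * n):]

# Fit on the training rows only
beta, *_ = np.linalg.lstsq(design(age[train], treat[train]), y[train], rcond=None)

def mse(rows):
    """Mean squared prediction error on the given rows."""
    pred = design(age[rows], treat[rows]) @ beta
    return np.mean((y[rows] - pred) ** 2)

train_mse, test_mse = mse(train), mse(test)
print(f"train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Predicted survival probability by age for each arm (ready for plotting)
age_grid = np.linspace(20, 65, 10)
pred_control = design(age_grid, np.zeros_like(age_grid)) @ beta
pred_treatment = design(age_grid, np.ones_like(age_grid)) @ beta
```

Plotting `pred_control` and `pred_treatment` against `age_grid` gives the two survival curves the question asks for. Comparing `test_mse` across candidate specifications is one way to pick the best model.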


5.
* Transparency: Complex models with nonlinear terms are difficult to explain to patients and doctors, reducing trust.
* Ethical use of variables: Age may be statistically useful but ethically problematic as a basis for organ allocation.  
* Fairness: If the model learns from biased data (e.g., overrepresenting certain groups), it may allocate organs unfairly.


**Q7.** In class, we showed that for the single linear regression model,
\begin{alignat*}{3}
a^* &=& \bar{y} \\
b^* &=& \dfrac{\sum_{i=1}^N(y_i - \bar{y})(x_i-\bar{x})}{\sum_{i=1}^N (x_i-\bar{x})^2},
\end{alignat*}

1. When will $b^*$ be large or small, depending on the relationship between $X$ and $Y$ and the variance of $X$?
2. Suppose you have measurement error in $X$ which artificially inflates its variance (e.g. bad data cleaning). We'll model this as saying the "real" value of $X$ for observation $i$ is $z_i$, but we observe $x_i = z_i + n_i$, where $n_i$ is the added noise. Does this affect the intercept of the regression? What happens to the $b^*$ coefficient relative to a noise-less model? How will this affect your ability to predict? (This phenomenon is called **attenuation**.)
3. Suppose the noise $n_i$ is independent of $z_i$ and $y_i$, so that (approximately)
$$
\dfrac{1}{N} \sum_{i=1}^N (y_i - \bar{y})(n_i - \bar{n}) =0, \quad \dfrac{1}{N} \sum_{i=1}^N (z_i - \bar{z})(n_i - \bar{n}) =0.
$$
and that the mean of the noise is zero, so that
$$
\dfrac{1}{N} \sum_{i=1}^N n_i = 0.
$$
In this case, the noise $n_i$ is zero on average and independent of the values of $z_i$ and $y_i$: It's just measurement error or lazy data cleaning.
Explain the intuition of your result.

4. How does attenuation factor into the cost-benefit analysis of gathering higher quality data or cleaning it more carefully?

1.
The slope coefficient b* becomes larger when X and Y have a strong relationship—meaning they move together closely (high covariance). It also increases when the values of X are less spread out (low variance), since a smaller denominator makes the slope steeper.

On the other hand, if the relationship between X and Y is weak (low covariance), or if X has a large spread (high variance), the slope b* will be smaller and flatter.
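
A quick numerical check of the variance effect, on made-up numbers: stretching $X$ by a factor of 10 multiplies the covariance by 10 and the variance by 100, so the slope shrinks by a factor of 10. The `slope` helper below simply implements the $b^*$ formula from the question.

```python
import numpy as np

def slope(x, y):
    """b* = sum((y - ybar)(x - xbar)) / sum((x - xbar)^2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

# Hypothetical data with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b = slope(x, y)            # slope on the original scale
b_wide = slope(10 * x, y)  # same relationship, X spread out 10 times wider

print(b, b_wide)
```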

2.
When the observed predictor includes measurement error, so that $x_i = z_i + n_i$, the variance of $x_i$ is larger than the variance of the true signal $z_i$. Under the formula above, the intercept $a^* = \bar{y}$ does not involve $x_i$, so it is unaffected. The inflated variance sits in the denominator of $b^*$ and shrinks the estimated slope toward zero, a phenomenon known as attenuation bias. As a result, the model underestimates the true effect of $X$ on $Y$, which reduces predictive accuracy and weakens inference.
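
A small simulation shows the size of the effect. Everything here is an assumption chosen for illustration: a true slope of 2, a standard normal signal, and noise with the same variance as the signal, so the slope should shrink by roughly half.

```python
import numpy as np

def slope(x, y):
    """b* = sum((y - ybar)(x - xbar)) / sum((x - xbar)^2)."""
    return np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

rng = np.random.default_rng(1)
n = 100_000

z = rng.normal(0.0, 1.0, size=n)            # true predictor
y = 2.0 * z + rng.normal(0.0, 1.0, size=n)  # outcome depends on z, true slope 2
noise = rng.normal(0.0, 1.0, size=n)        # measurement error, independent of z and y
x = z + noise                               # what we actually observe

b_clean = slope(z, y)   # slope using the true predictor
b_noisy = slope(x, y)   # slope using the noisy predictor

print(f"clean slope: {b_clean:.3f}")
print(f"noisy slope: {b_noisy:.3f}")
```

Because the noise variance equals the signal variance here, the noisy slope lands near 1 rather than 2, while the mean of `y` (and so $a^* = \bar{y}$) is untouched.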



3. If the noise $n_i$ is independent of both $z_i$ and $y_i$ and has zero mean, it does not change the covariance between the predictor and $y$, but it still dilutes the signal. The added noise increases the variance of the regressor, which flattens the estimated regression line. Intuitively, even though the noise cancels out on average, it spreads out the predictor values without moving $y$ along with them, making the true relationship harder to detect.
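
A short derivation makes this concrete. It is a sketch that uses only the formula for $b^*$ and the approximate orthogonality conditions stated in the question, with $x_i = z_i + n_i$. The labels $b^*_x$ and $b^*_z$ (the slope estimated from the noisy and the clean predictor) are introduced here for illustration:

$$
\begin{aligned}
\sum_{i=1}^N (y_i - \bar{y})(x_i - \bar{x}) &= \sum_{i=1}^N (y_i - \bar{y})(z_i - \bar{z}) + \sum_{i=1}^N (y_i - \bar{y})(n_i - \bar{n}) \approx \sum_{i=1}^N (y_i - \bar{y})(z_i - \bar{z}), \\
\sum_{i=1}^N (x_i - \bar{x})^2 &= \sum_{i=1}^N (z_i - \bar{z})^2 + 2\sum_{i=1}^N (z_i - \bar{z})(n_i - \bar{n}) + \sum_{i=1}^N (n_i - \bar{n})^2 \approx \sum_{i=1}^N (z_i - \bar{z})^2 + \sum_{i=1}^N (n_i - \bar{n})^2, \\
b^*_x &\approx \dfrac{\sum_{i=1}^N (y_i - \bar{y})(z_i - \bar{z})}{\sum_{i=1}^N (z_i - \bar{z})^2 + \sum_{i=1}^N (n_i - \bar{n})^2} = b^*_z \cdot \dfrac{\sum_{i=1}^N (z_i - \bar{z})^2}{\sum_{i=1}^N (z_i - \bar{z})^2 + \sum_{i=1}^N (n_i - \bar{n})^2}.
\end{aligned}
$$

The final fraction lies between 0 and 1, so the noisy slope is the clean slope scaled toward zero, and the scaling gets more severe as the noise variance grows relative to the signal variance.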



4. Attenuation bias highlights the importance of high-quality, clean data. Reducing measurement error leads to more accurate slope estimates and better predictive performance. While collecting or cleaning better data incurs cost, the benefits (improved inference, more trustworthy models, and higher decision-making quality) often justify the investment. In short, cleaner data yields stronger signals and more reliable conclusions.

