# **SA1 - Multiple regression**  



#### **Objective:**  
To assess and strengthen understanding of multiple regression concepts, including coefficient interpretation, model validation, multicollinearity, and residual diagnostics.  

---



### **Exercise 1: Interpretation of Coefficients**  
A researcher conducts a study to predict apartment prices in a city based on three variables:  

- **X₁** = Size in square meters  
- **X₂** = Number of rooms  
- **X₃** = Location (1 if in a central area, 0 otherwise)  

The estimated model is:  
$$
\text{Price} = 50000 + 2000X_1 + 15000X_2 + 30000X_3
$$

1. What is the estimated price for an apartment of 80 m², with 3 rooms, located in a central area?  
2. What is the interpretation of the coefficient for variable $X_3$?  
3. What happens if $X_3$ takes the value of 0?  

---



In [7]:
estimated_price = 50000 + 2000*80 + 15000*3 + 30000*1
print("The estimated price is:")
print(estimated_price)

The estimated price is:
285000


Question 2:

The coefficient of $X_3$ (30,000) means that, holding size and number of rooms constant, an apartment in a central area is estimated to cost 30,000 monetary units more than an otherwise identical apartment outside the central area.

Question 3:

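If $X_3 = 0$, the apartment is not in a central area, so the location term drops out of the model and the estimated price depends only on size and number of rooms. The estimated price is then 30,000 lower than for an otherwise identical apartment in a central area.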
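
The answers to Questions 1 and 3 can be checked together by wrapping the estimated model in a small function. This is only a sketch: the function name `estimated_price` and its argument names are chosen here for illustration and do not come from the exercise.

```python
def estimated_price(size_m2, rooms, central):
    """Evaluate the fitted model from Exercise 1.

    central is the dummy X3: 1 for a central area, 0 otherwise.
    """
    return 50000 + 2000 * size_m2 + 15000 * rooms + 30000 * central


# Same apartment (80 m², 3 rooms), inside and outside the central area.
price_central = estimated_price(80, 3, 1)
price_non_central = estimated_price(80, 3, 0)

print("Central:", price_central)
print("Non-central:", price_non_central)
print("Difference:", price_central - price_non_central)
```

The difference between the two predictions equals the coefficient of $X_3$, whatever values are chosen for size and rooms.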

### **Exercise 2: Multicollinearity**  
A dataset contains information on student exam performance with the following variables:  

- **Y** = Exam score  
- **X₁** = Study hours  
- **X₂** = Sleep hours  
- **X₃** = Socioeconomic level  

The researcher finds that the correlation between $X_1$ and $X_2$ is -0.95.  

1. Why can high correlation between $X_1$ and $X_2$ be a problem in multiple regression?  
2. What technique could be used to reduce the impact of multicollinearity?  

---



**1. Why can high correlation between $X_1$ and $X_2$ be a problem in multiple regression?**

A high correlation between two predictor variables, such as $X_1$ (study hours) and $X_2$ (sleep hours), can cause multicollinearity in multiple regression. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to separate their individual effects on the dependent variable.

Specifically, a correlation of -0.95 between $X_1$ and $X_2$ suggests that they are almost perfectly inversely related. This high correlation can cause the following problems:

- **Unstable coefficients**: The estimated coefficients for $X_1$ and $X_2$ may be very sensitive to small changes in the data. This can lead to large standard errors for the coefficients and unreliable estimates.
- **Difficulty in interpretation**: When two variables are highly correlated, it becomes challenging to interpret the individual impact of each variable on the dependent variable (exam score, $Y$) because their effects are confounded.
- **Overfitting**: The model might overfit the data, capturing noise rather than true relationships, and this can reduce the generalizability of the model to new data.
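
One common way to quantify this problem is the variance inflation factor (VIF), which is not mentioned in the exercise and is introduced here only as an illustration. The sketch below simulates hypothetical data in which study and sleep hours have a correlation of roughly -0.95, then computes each predictor's VIF as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors. All numbers in the simulation are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: sleep hours fall as study hours rise (corr near -0.95).
study = rng.normal(5.0, 1.5, n)
sleep = 9.0 - 0.5 * study + rng.normal(0.0, 0.25, n)
socio = rng.normal(0.0, 1.0, n)  # unrelated to the other two predictors
X = np.column_stack([study, sleep, socio])


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


corr = np.corrcoef(study, sleep)[0, 1]
vifs = [vif(X, j) for j in range(X.shape[1])]

print("corr(study, sleep):", round(corr, 3))
for name, value in zip(["study", "sleep", "socio"], vifs):
    print(name, "VIF:", round(value, 2))
```

A frequently quoted rule of thumb treats a VIF above about 5 or 10 as a sign of problematic multicollinearity. In this simulation the two correlated predictors should land in that range, while the unrelated one stays close to 1.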


**2. What technique could be used to reduce the impact of multicollinearity?**

To reduce the impact of multicollinearity, several techniques can be applied:

- **Remove one of the correlated variables**: If $X_1$ and $X_2$ are highly correlated, removing one of them from the model can help resolve the issue.
- **Principal Component Analysis (PCA)**: PCA is a technique that transforms the correlated variables into a smaller set of uncorrelated variables (principal components). These new components can then be used in the regression model.
- **Ridge regression or Lasso regression**: These are regularization techniques that add a penalty to the model to reduce the size of the coefficients, thus mitigating the impact of multicollinearity. Ridge regression adds an L2 penalty, while Lasso adds an L1 penalty, which can also perform variable selection.

By using these techniques, you can address multicollinearity and improve the stability and interpretability of your regression model.
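
As a rough illustration of the ridge option, the sketch below fits ordinary least squares and ridge regression on the same simulated data using the closed-form solution $(X^\top X + \alpha I)^{-1} X^\top y$ on standardized predictors. The data, the true coefficients, the penalty `alpha=10.0`, and the helper name `ridge_fit` are all assumptions made for this example. In practice a library implementation such as scikit-learn's `Ridge` would usually be preferred.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge on standardized predictors; alpha=0 gives OLS."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()  # centering y removes the need for an intercept
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ yc)


rng = np.random.default_rng(1)
n = 200

# Hypothetical data with strongly (negatively) correlated predictors.
study = rng.normal(5.0, 1.5, n)
sleep = 9.0 - 0.5 * study + rng.normal(0.0, 0.25, n)
score = 50 + 4.0 * study + 2.0 * sleep + rng.normal(0.0, 5.0, n)
X = np.column_stack([study, sleep])

beta_ols = ridge_fit(X, score, alpha=0.0)
beta_ridge = ridge_fit(X, score, alpha=10.0)

print("OLS coefficients (standardized):  ", beta_ols)
print("Ridge coefficients (standardized):", beta_ridge)
print("Coefficient norm, OLS vs ridge:",
      np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The L2 penalty always pulls the coefficient vector toward zero, so the ridge norm is smaller than the OLS norm. The price is some bias in the estimates, which is why the penalty strength is normally chosen by cross-validation rather than fixed in advance.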

### **Exercise 3: Model Fit Assessment**  
A financial analyst builds a multiple regression model to predict investment performance. The results include:  

- **Adjusted $R^2$ = 0.78**  
- **Model p-value < 0.01**  
- **Coefficient for a key variable = 0.001, with p-value = 0.45**  

1. Is the model statistically significant overall? Why?  
2. How would you interpret the coefficient of the key variable?  
3. What would you do if the key variable is not significant but should theoretically be included in the model?  

---



### **Exercise 4: Residual Diagnostics**  
A researcher wants to check whether the residuals of their model meet the assumptions of normality and homoscedasticity. The following plots are generated:  

- **Histogram of residuals**: Shows a right-skewed distribution.  
- **Residuals vs. fitted values plot**: Displays a fan-shaped pattern.  

1. What does the skewness in the residuals histogram indicate?  
2. How would you interpret the fan-shaped pattern in the residuals plot?  
3. What transformations could you apply to the model to correct these issues?  

---



### **Exercise 5: Variable Selection**  
A team of economists is developing a model to predict monthly household electricity consumption. Initially, they include the following variables:  

- **X₁** = Number of people in the household  
- **X₂** = House size in square meters  
- **X₃** = Household income level  
- **X₄** = Age of the head of the household  

After training the model, they obtain the following results:  

| Variable | Coefficient | p-value |
|----------|------------|---------|
| X₁       | 25         | 0.01    |
| X₂       | 10         | 0.03    |
| X₃       | 0.5        | 0.48    |
| X₄       | -3         | 0.65    |

1. Which of these variables would you remove from the model? Justify your answer.  
2. What method could you use to automatically select the best variables for the model?  

---



This exercise bulletin will help reinforce your understanding of multiple regression. Once completed, review your answers based on theory, and if possible, test some of these concepts using Python.