# 📝 Exercises: Simple Linear Regression

**Module 1: Machine Learning with Python - Lesson 02**

---

## 🎯 Goals

These exercises will help you consolidate:
- Linear regression implementation
- Interpretation of coefficients and metrics
- Residual analysis
- Model validation
- Application to real problems

---

## 📋 Instructions

1. Read each exercise carefully
2. Solve in the cells provided
3. Check your results by running the code
4. **Do NOT look at the solutions until you have tried**
5. Compare your answers at the end

---

## ⚙️ Initial Configuration

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from scipy import stats

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (10, 6)
np.random.seed(42)

print("✅ Libraries imported successfully")

---

## Exercise 1: Car Price Prediction (⭐)

### Context:
A dealer wants to predict the price of used cars based on their age.

### Data:

In [None]:
# Run this cell to generate the data
np.random.seed(42)

antiguedad = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
precio = np.array([28000, 25000, 22000, 19000, 17000, 15000, 13000, 11000, 9000, 7500])

df_autos = pd.DataFrame({
    'Antiguedad_años': antiguedad,
    'Precio_USD': precio
})

print("Car dataset:")
print(df_autos)

### Tasks:

**a) Exploration:**
1. Create an Age vs Price scatter plot
2. Calculate the correlation between both variables
3. Do you expect a positive or negative relationship?
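
As a reminder of how a correlation coefficient can be computed, here is a minimal sketch using `np.corrcoef`. The arrays `x_toy` and `y_toy` are made-up values chosen for illustration, not the exercise data.

```python
import numpy as np

# Hypothetical toy data (not the exercise dataset): y falls as x rises
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables
r = np.corrcoef(x_toy, y_toy)[0, 1]
print(f"Pearson r = {r:.3f}")  # Pearson r = -1.000
```

A value near -1 or +1 indicates a strong linear relationship; the sign gives its direction.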

In [None]:
# Your code here - Exploration


**b) Manual Implementation:**
1. Calculate the slope (m) and intercept (b) using the formulas
2. Write the equation of the line
3. Interpret the meaning of m and b
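
The sketch below assumes that "the formulas" are the standard least-squares ones: slope m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and intercept b = ȳ - m·x̄. The helper `fit_line` and the toy arrays are hypothetical names and values used only for illustration, not the exercise data.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    b = y_mean - m * x_mean
    return m, b

# Hypothetical toy data lying exactly on y = 2x + 1
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

m, b = fit_line(x_toy, y_toy)
print(f"y = {m:.2f} * x + {b:.2f}")  # y = 2.00 * x + 1.00
```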

In [None]:
# Your code here - Manual implementation


**c) Model with Scikit-learn:**
1. Train a linear regression model
2. Compare the coefficients with your manual calculation
3. Calculate R²

In [None]:
# Your code here - sklearn model


**d) Prediction:**
1. What would be the expected price of a 12-year-old car?
2. And a new one (0 years)?
3. Create a graph showing data and regression line

In [None]:
# Your code here - Prediction


---

## Exercise 2: Advertising Effectiveness (⭐⭐)

### Context:
A company wants to determine the ROI of its digital advertising investment.

### Data:

In [None]:
# Run this cell
np.random.seed(42)

gasto_publicidad = np.random.uniform(5, 50, 30)
ventas = 20 + 2.5 * gasto_publicidad + np.random.normal(0, 8, 30)

df_marketing = pd.DataFrame({
    'Gasto_Publicidad_k': gasto_publicidad,
    'Ventas_k': ventas
})

print("Marketing dataset:")
print(df_marketing.head(10))

### Tasks:

**a) Exploratory Analysis:**
1. Show descriptive statistics
2. Create a scatter plot
3. Calculate and report the correlation

In [None]:
# Your code here


**b) Model and Evaluation:**
1. Split the data into train (70%) and test (30%)
2. Train a linear regression model
3. Calculate R², RMSE and MAE for train and test
4. Is there overfitting?
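
One possible shape for a split-and-evaluate workflow is sketched below on synthetic data. The data, the `report` helper and the `random_state` value are assumptions made for illustration. RMSE is taken as the square root of `mean_squared_error`, which works across scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: y = 3x + 5 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))  # scikit-learn expects a 2D feature array
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

def report(name, X_part, y_part):
    """Print and return R², RMSE and MAE of `model` on one data split."""
    pred = model.predict(X_part)
    metrics = {
        "r2": r2_score(y_part, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
        "mae": mean_absolute_error(y_part, pred),
    }
    print(f"{name}: R²={metrics['r2']:.3f}  "
          f"RMSE={metrics['rmse']:.3f}  MAE={metrics['mae']:.3f}")
    return metrics

train_metrics = report("train", X_train, y_train)
test_metrics = report("test", X_test, y_test)
```

Train metrics that are much better than test metrics are the usual sign of overfitting; similar values on both splits suggest the model generalizes.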

In [None]:
# Your code here


**c) Business Interpretation:**
1. How much do sales increase for each additional $1,000 in advertising?
2. What is the ROI? (Sales generated / Investment)
3. If the budget is $30k, how many sales do you expect?
4. Answer: Is it worth investing more in advertising?

In [None]:
# Your code here


**d) Full View:**
Create a figure with 2 subplots:
1. Data + regression line
2. Predictions vs actual values

In [None]:
# Your code here


---

## Exercise 3: Fuel Consumption (⭐⭐)

### Context:
Analyze the relationship between a vehicle's weight and its fuel consumption.

### Data:

In [None]:
# Run this cell
np.random.seed(42)

peso_kg = np.random.uniform(1000, 2500, 50)
consumo_lkm = 3 + 0.004 * peso_kg + np.random.normal(0, 0.5, 50)

df_vehiculos = pd.DataFrame({
    'Peso_kg': peso_kg,
    'Consumo_L_100km': consumo_lkm
})

print("Vehicle dataset:")
print(df_vehiculos.head(10))
print(f"\nTotal: {len(df_vehiculos)} vehicles")

### Tasks:

**a) Exploration and Model:**
1. Visualize the weight-consumption relationship
2. Train a model (80/20 split)
3. Report complete metrics

In [None]:
# Your code here


**b) Residual Analysis:**
Create a figure with 4 subplots:
1. Residuals vs predictions
2. Histogram of residuals
3. Q-Q plot
4. Residuals vs order
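
A 2x2 diagnostic figure of this kind could be laid out as in the sketch below. It uses synthetic data and fits the line with `np.polyfit` to keep the example short; both choices are assumptions for illustration, not the expected solution.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical synthetic data: y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 60)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 60)

# np.polyfit returns coefficients from highest degree down: slope, intercept
m, b = np.polyfit(x, y, deg=1)
y_pred = m * x + b
residuals = y - y_pred

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Residuals vs predictions: look for curves or funnel shapes
axes[0, 0].scatter(y_pred, residuals)
axes[0, 0].axhline(0, color="red", linestyle="--")
axes[0, 0].set_title("Residuals vs predictions")

# 2. Histogram: should look roughly bell-shaped
axes[0, 1].hist(residuals, bins=15, edgecolor="black")
axes[0, 1].set_title("Histogram of residuals")

# 3. Q-Q plot: points close to the line suggest normal residuals
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title("Q-Q plot")

# 4. Residuals in observation order: look for trends or cycles
axes[1, 1].plot(residuals, marker="o")
axes[1, 1].axhline(0, color="red", linestyle="--")
axes[1, 1].set_title("Residuals vs order")

fig.tight_layout()
```

In a notebook with `%matplotlib inline` the figure renders at the end of the cell; in a plain script, add `plt.show()`.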

In [None]:
# Your code here


**c) Interpretation:**
1. Do the residuals appear normally distributed?
2. Are there worrying patterns?
3. Does the model meet the linear regression assumptions?

In [None]:
# Answer here (you can use print() or markdown)


**d) Practical Cases:**
Predict consumption for:
1. Small car: 1200 kg
2. Medium sedan: 1600 kg
3. SUV: 2200 kg

In [None]:
# Your code here


---

## Exercise 4: Grade Prediction (⭐⭐⭐)

### Context:
A university wants to predict the final grade based on weekly study hours.

### Data:

In [None]:
# Run this cell
np.random.seed(42)

horas_estudio = np.random.uniform(5, 40, 100)
calificacion = 40 + 1.2 * horas_estudio + np.random.normal(0, 5, 100)
calificacion = np.clip(calificacion, 0, 100)  # Clip to the 0-100 range

df_estudiantes = pd.DataFrame({
    'Horas_Estudio_Semanal': horas_estudio,
    'Calificacion_Final': calificacion
})

print("Student dataset:")
print(df_estudiantes.head(10))
print("\nStatistics:")
print(df_estudiantes.describe())

### Tasks:

**Part 1: Complete Analysis**
1. Visual exploration with scatter plot
2. Train/test split (75/25)
3. Model training
4. Complete evaluation (R², RMSE, MAE)

In [None]:
# Your code here - Part 1


**Part 2: Deep Diagnosis**
1. Create the 4 residual plots
2. Perform a normality test (Shapiro-Wilk)
3. Identify outliers (|residual| > 2 standard deviations)
4. Report if there is heteroscedasticity
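
The normality test and the 2-standard-deviation rule can be sketched as follows. The residuals here are simulated, with one outlier planted on purpose at index 10; in the exercise they would come from your fitted model. The 0.05 cutoff mentioned in the comments is a common convention, assumed here rather than prescribed by the lesson.

```python
import numpy as np
from scipy import stats

# Hypothetical residuals: standard normal noise plus one planted outlier
rng = np.random.default_rng(2)
residuals = rng.normal(0, 1.0, 80)
residuals[10] = 6.0

# Shapiro-Wilk: the null hypothesis is that the sample is normally
# distributed, so a small p-value (e.g. below 0.05) is evidence against it
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {stat:.3f}, p-value = {p_value:.4f}")

# Flag observations whose residual is more than 2 standard deviations from zero
threshold = 2 * residuals.std(ddof=1)
outlier_idx = np.flatnonzero(np.abs(residuals) > threshold)
print(f"Threshold: {threshold:.2f}, outlier indices: {outlier_idx}")
```

For heteroscedasticity, a visual check of the residuals-vs-predictions plot (constant spread versus a funnel shape) is usually the first step.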

In [None]:
# Your code here - Part 2


**Part 3: Academic Interpretation**
Answer:
1. How many points does the grade improve for each additional hour of study?
2. How many hours does a student need to study to pass (≥60)?
3. How many hours to obtain excellence (≥90)?
4. Is the model reliable? Justify with metrics
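
Questions of the form "what input gives a target output?" amount to inverting the fitted line: from y = m·x + b it follows that x = (y - b) / m. The sketch below uses hypothetical coefficients (`m_example`, `b_example`), not the ones your model will produce.

```python
def hours_needed(target_grade, slope, intercept):
    """Solve target = slope * hours + intercept for hours (slope must be non-zero)."""
    return (target_grade - intercept) / slope

# Hypothetical coefficients, for illustration only
m_example = 1.5   # grade points gained per weekly study hour
b_example = 30.0  # predicted grade at zero study hours

for target in (60, 90):
    hours = hours_needed(target, m_example, b_example)
    print(f"Target {target}: {hours:.1f} hours")
# Target 60: 20.0 hours
# Target 90: 40.0 hours
```

Check whether the answer falls inside the range of hours seen in the data; outside that range the result is an extrapolation.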

In [None]:
# Your code here - Part 3


**Part 4: Professional Visualization**
Create a figure with 3 subplots:
1. Data with regression line + confidence bands
2. Predictions vs actuals with diagonal line
3. Distribution of residuals with superimposed normal curve

In [None]:
# Your code here - Part 4


---

## Exercise 5: Integrative Project - Housing Prices (⭐⭐⭐)

### Context:
You are a data scientist at a real estate company. You must create a model to estimate house prices based on size.

### Data:

In [None]:
# Run this cell
np.random.seed(42)

tamano_m2 = np.random.uniform(50, 300, 150)
precio_usd = 50000 + 1500 * tamano_m2 + np.random.normal(0, 30000, 150)

df_casas = pd.DataFrame({
    'Tamano_m2': tamano_m2,
    'Precio_USD': precio_usd
})

print("Housing dataset:")
print(df_casas.head(10))
print(f"\nTotal: {len(df_casas)} properties")
print("\nStatistics:")
print(df_casas.describe())

### COMPLETE PROJECT:

Develop a complete professional analysis that includes:

#### 1. Data Exploration (EDA)
- Complete descriptive statistics
- Detection of outliers
- Informative visualizations
- Correlation analysis

In [None]:
# Section 1: EDA


#### 2. Preparation and Modeling
- Strategic train/test split
- Model training
- Saving important parameters

In [None]:
# Section 2: Modeling


#### 3. Rigorous Evaluation
- All metrics (R², RMSE, MAE, MAPE)
- Train vs test comparison
- Analysis of percentage errors
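
MAPE (mean absolute percentage error) is the only metric in this list that does not appear in the earlier exercises. A minimal sketch, assuming the usual definition mean(|y - ŷ| / |y|) × 100 and using made-up numbers, could look like this:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined if any y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Hypothetical values: errors of 10%, 10% and 0%
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

mape_value = mape(y_true, y_pred)
print(f"MAPE: {mape_value:.2f}%")  # MAPE: 6.67%
```

scikit-learn also provides `sklearn.metrics.mean_absolute_percentage_error`, which returns the value as a fraction rather than a percentage.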

In [None]:
# Section 3: Evaluation


#### 4. Complete Residual Diagnosis
- 4 diagnostic plots
- Statistical tests
- Problem identification
- Suggestions for improvement

In [None]:
# Section 4: Diagnostics


#### 5. Business Interpretation
Answer:
1. What is the price per m²?
2. What is the base price (intercept)?
3. Create a table of estimated prices for houses of 80, 120, 160, 200 and 240 m²
4. What is the model's margin of error?
5. Would you recommend using this model in production? Why?

In [None]:
# Section 5: Interpretation


#### 6. Professional Visualization
Create a dashboard with 6 graphs:
1. Scatter plot with regression line
2. Predictions vs actuals
3. Residuals vs predictions
4. Residual histogram
5. Q-Q plot
6. Distribution of percentage errors

In [None]:
# Section 6: Visual Dashboard


#### 7. Executive Report
Write a professional summary (use markdown) that includes:
- Objective of the analysis
- Main findings
- Key metrics
- Model limitations
- Recommendations
- Next steps

**EXECUTIVE REPORT**

---

[Write your report here]

---

---

## 🎯 Self-assessment

Before looking at the solutions, check:

### Exercise 1: Automobiles
- [ ] I correctly implemented the model
- [ ] I interpreted the negative sign of the coefficient
- [ ] Predictions make logical sense

### Exercise 2: Advertising
- [ ] I calculated ROI correctly
- [ ] I compared train vs test metrics
- [ ] I can explain the business value

### Exercise 3: Fuel
- [ ] I analyzed the residuals thoroughly
- [ ] I checked whether the assumptions are met
- [ ] Predictions are realistic

### Exercise 4: Grades
- [ ] I performed an in-depth diagnosis
- [ ] I identified outliers, if any
- [ ] I interpreted results academically

### Exercise 5: Housing Project
- [ ] I completed all 7 sections
- [ ] I created professional visualizations
- [ ] I wrote a clear executive report
- [ ] I identified limitations and improvements

---

## 💡 Final Tips

1. **Always interpret**: Numbers without context have no value
2. **Visualize**: A graph is worth a thousand numbers
3. **Validate assumptions**: Residuals tell stories
4. **Think business**: How is this used in practice?
5. **Be critical**: Every model has limitations

---

## 🚀 Next Steps

1. **Review** solutions after trying
2. **Compare** your approach with the proposed solutions
3. **Practice more** with Kaggle datasets
4. **Advance** to Multiple Linear Regression

---

**Good luck! 🎓**