<h1><center>Supervised ML Regression Competition</center></h1>


<img align="center" src="https://compraracciones.com/wp-content/uploads/2021/04/insurance.jpg" style="height:200px" style="width:100px"/>

<hr style="border:2px solid pink"> </hr>

You have been assigned the task of building a model that will predict the insurance cost

You'll find the data in the csv file `insurance`


- target col: "charges"


<hr style="border:2px solid pink"> </hr>


**Guidelines:** 


- train_test_split
    - random state = 42
    - test size = 0.3


- The one who gets the highest r2-score on test data wins


## 1. Initial Data Exploration

Let's start by loading our dataset and taking a first look at it.


In [None]:
import pandas as pd

insurance = pd.read_csv("insurance.csv")

## 2. Checking for Missing Values

It's important to know if our data has any missing values. Let's check that next.


In [None]:
insurance.isnull().sum()


## 3. Descriptive Statistics

Now, let's move on to some descriptive statistics.

Understanding the distribution of our data is crucial. Let's calculate some descriptive statistics.


In [None]:
insurance.describe()

## 4. Distribution Analysis

Visualizing the distributions of our features can provide valuable insights. Let's plot the distributions for 'age', 'bmi', and 'charges'.

### Task:
- Plot the histogram for 'age'
- Plot the histogram for 'bmi'
- Plot the histogram for 'charges'


In [None]:
import matplotlib.pyplot as plt

# Angenommen, df ist dein DataFrame mit numerischen Spalten

insurance.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

## 5. Relationship Between Variables

Let's explore the relationship between some of our features and the target variable 'charges'. We'll create scatter plots to visualize these relationships.

### Task:
- Create a scatter plot for 'age' vs 'charges'
- Create a scatter plot for 'bmi' vs 'charges'
- Create a scatter plot for 'children' vs 'charges'


In [None]:
import matplotlib.pyplot as plt

plt.scatter(insurance['age'], insurance['charges'])
plt.xlabel('Age')
plt.ylabel('Charges')
plt.title('Scatterplot Age and Charges')
plt.show()

plt.scatter(insurance['bmi'], insurance['charges'])
plt.xlabel('BMI')
plt.ylabel('Charges')
plt.title('Scatterplot BMI and Charges')
plt.show()

plt.scatter(insurance['children'], insurance['charges'])
plt.xlabel('Children')
plt.ylabel('Charges')
plt.title('Scatterplot Children and Charges')
plt.show()


## 6. Categorical Analysis

Let's analyze the categorical features 'sex', 'smoker', and 'region' to see how they relate to 'charges'.

### Task:
- Plot the distribution of 'charges' for different 'sex'
- Plot the distribution of 'charges' for different 'smoker'
- Plot the distribution of 'charges' for different 'region'


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution von 'charges' nach 'sex'
sns.boxplot(x='sex', y='charges', data=insurance)
plt.title('Charges distribution by sex')
plt.show()

# Distribution von 'charges' nach 'smoker'
sns.boxplot(x='smoker', y='charges', data=insurance)
plt.title('Charges distribution by smoker')
plt.show()

# Distribution von 'charges' nach 'region'
sns.boxplot(x='region', y='charges', data=insurance)
plt.title('Charges distribution by region')
plt.show()


## 7. Correlation Analysis

To understand how our numerical features relate to each other and to the target variable, let's calculate and visualize the correlation matrix.

### Task:
- Calculate the correlation matrix for the dataset
- Visualize the correlation matrix using a heatmap


In [None]:
corr_matrix = insurance[["age", "bmi", "children", "charges"]].corr()
print(corr_matrix)

# Modelling time!

## 1. Find the Naive Baseline

Before we build any models, let's establish a naive baseline. This will help us understand how well our models perform compared to a simple approach. In regression problems, the naive baseline is often the mean of the target variable.

### Task:
- Calculate the mean of the target variable 'charges'
- Explain why it's important to establish a naive baseline


In [None]:
naive_baseline = insurance["charges"].mean()
naive_baseline

## 2. Initial Modelling Without GridSearch or Pipeline

Let's build a simple linear regression model without any feature engineering, grid search, or pipeline. This will serve as our initial baseline for comparison.

### Task:
- Split the data into training and test sets
- Train a simple linear regression model
- Evaluate its performance using regression metrics
- Write it down as a markdown below so you can keep track. This is a scientific experiment


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

target = insurance["charges"]
features = insurance.drop(columns=["charges"])
features_encoded = pd.get_dummies(features, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(features_encoded, target, test_size=0.3, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Root Mean Squared Error (RMSE)
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f"Root Mean Squared Error: {rmse:.2f}")

# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")

# R² Score (Bestimmtheitsmaß)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2f}")

## 3. Feature Engineering

Now, let's brainstorm and create some new features to see if we can improve the model's performance.

### Questions:
1. Should we create an interaction feature between 'bmi' and 'children'? 
2. Should we create age groups to see if the model improves by categorizing age?
3. Should we create a high-risk indicator based on 'smoker' and 'bmi'?

- Remember nothing is set in stone, this is your experiment, your hypothesis. You may not need to, but its important to explore these questions

### Task:
- Create new features based on the questions above
- Explain the rationale behind each feature



In [None]:
# interaction feature bmi and children
bmi_children = insurance["bmi"]*insurance["children"]

# age groups
bins = [0, 18, 30, 45, 60, 100]  # Altersschnittpunkte
labels = ['0-17', '18-29', '30-44', '45-59', '60+']  # Labels für die Gruppen
insurance['age_group'] = pd.cut(insurance['age'], bins=bins, labels=labels, right=False)
print(insurance[['age', 'age_group']].head())

# high risk indicator for smoker and high bmi
high_bmi_threshold = 30
insurance['high_risk'] = ((insurance['smoker'] == 'yes') & (insurance['bmi'] > high_bmi_threshold)).astype(int)
print(insurance[['smoker', 'bmi', 'high_risk']].head())

## 4. Modelling with Feature Engineering

Now that we have new features, let's see if they improve our model's performance.
Did it improve the performance? Yes? No? Why

### Task:
- Split the data into training and test sets
- Train a linear regression model with the new features
- Evaluate its performance using regression metrics


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

target = insurance["charges"]
features = insurance.drop(columns=["charges"])
features_encoded = pd.get_dummies(features, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(features_encoded, target, test_size=0.3, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Root Mean Squared Error (RMSE)
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f"Root Mean Squared Error: {rmse:.2f}")

# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")

# R² Score (Bestimmtheitsmaß)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2f}")

## 5. Modelling with Pipeline and Grid Search

Now, let's see how using pipelines can simplify our workflow and prevent data leakage. We'll also use GridSearchCV to find the best hyperparameters.

### Task:
- Create a pipeline that includes scaling and linear regression
- Define a parameter grid for hyperparameter tuning
- Use GridSearchCV to find the best parameters and evaluate the model performance


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# pipeline including scaling and linear regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),   
    ('model', Ridge())    
])
pipeline.fit(X_train, y_train)

# define parameter grid
param_grid = {
    'model__alpha': range(0, 1001)
}

# initialize GridSearch
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Optimal Parameter:", grid.best_params_)
print("Best CV-result:", grid.best_score_)


## 6. Trying Another Model with Pipeline

Let's try using a Gradient Boosting Regressor to see if it performs better.

### Task:
- Create and use a pipeline for Gradient Boosting Regressor
- Define a parameter grid for grid search
- Use GridSearchCV to find the best parameters and evaluate the model


In [74]:
from sklearn.ensemble import GradientBoostingRegressor

# pipeline including scaling and linear regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),   
    ('model', GradientBoostingRegressor(random_state=42))    
])
pipeline.fit(X_train, y_train)

# define parameter grid
param_grid = {
    "model__n_estimators": [100, 200, 300],     # Anzahl Bäume
    "model__learning_rate": [0.01, 0.05, 0.1],  # Lernrate
    "model__max_depth": [3, 4, 5]               # Baumtiefe
}

# initialize GridSearch
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Optimal Parameter:", grid.best_params_)
print("Best CV-result:", grid.best_score_)

y_pred_final = grid.predict(X_test)

Optimal Parameter: {'model__learning_rate': 0.05, 'model__max_depth': 3, 'model__n_estimators': 100}
Best CV-result: 0.8538321530350546


## 7. GridSearch with Several Models

Finally, let's compare several models using GridSearchCV to find the best one.

### Task:
- Define multiple models and their parameter grids
- Use GridSearchCV to find the best model and parameters


In [None]:
from sklearn.linear_model import Lasso

# LASSO
# pipeline including scaling and linear regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),   
    ('model', Lasso(random_state=42))    
])
pipeline.fit(X_train, y_train)

# define parameter grid
param_grid = {
    "model__alpha": range(0,1001)
}

# initialize GridSearch
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Optimal Parameter:", grid.best_params_)
print("Best CV-result:", grid.best_score_)

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Decision Tree Regressor

# pipeline including scaling and linear regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),   
    ('model', DecisionTreeRegressor(random_state=42))    
])
pipeline.fit(X_train, y_train)

# define parameter grid
param_grid = {
    'model__max_depth': [3, 5, 10, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__max_leaf_nodes': [None, 10, 20, 30]
}

# initialize GridSearch
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Optimal Parameter:", grid.best_params_)
print("Best CV-result:", grid.best_score_)

In [None]:
# best performing linear regression model is GradientBoosterRegressor with R2 of 0.85

# Machine Learning: Master Challenge

## 8. Calculating Potential Cost or Loss

### Challenge:
Now that you've built and optimized your models, it's time for the final challenge! Your task is to minimize the Root Mean Squared Error (RMSE) of your model's predictions and calculate the potential financial impact of your model's errors.

### Task:
1. Calculate the RMSE of your final model's predictions.
2. Break down the errors into underestimation and overestimation.
3. Calculate the total potential cost or loss to the company.
4. Compete with your classmates to see who can achieve the lowest RMSE and financial impact!

### Explanation:
The RMSE provides an estimate of the average error in your model's predictions. We will also analyze the errors by categorizing them into underestimations and overestimations to understand their financial impact.

#### Steps to Calculate Underestimation and Overestimation Errors:

1. **Calculate RMSE**:
   - Use the `mean_squared_error` function from `sklearn.metrics` and pass your actual values (`y_test`) and predicted values (`y_pred_final`) to it.
   - Take the square root of the result to get the RMSE.
   
2. **Calculate Underestimation Error**:
   - Identify the instances where the actual charges (`y_test`) are greater than the predicted charges (`y_pred_final`).
   - For these instances, calculate the difference between the actual and predicted charges.
   - Sum these differences to get the total underestimation error.

3. **Calculate Overestimation Error**:
   - Identify the instances where the actual charges (`y_test`) are less than the predicted charges (`y_pred_final`).
   - For these instances, calculate the difference between the predicted and actual charges.
   - Sum these differences to get the total overestimation error.

4. **Calculate Total Potential Cost or Loss**:
   - Add the total underestimation error and the total overestimation error to get the total potential cost or loss.

### Let's see who can build the best model!

#### Detailed Instructions:

1. **Calculate RMSE**:
   - Use `mean_squared_error` with `y_test` and `y_pred_final`.
   - Use `np.sqrt` to take the square root of the result.

2. **Calculate Underestimation Error**:
   - Use a boolean condition to filter `y_test` values that are greater than `y_pred_final`.
   - Subtract the predicted values from the actual values for these instances.
   - Sum these differences.

3. **Calculate Overestimation Error**:
   - Use a boolean condition to filter `y_test` values that are less than `y_pred_final`.
   - Subtract the actual values from the predicted values for these instances.
   - Sum these differences.

4. **Calculate Total Potential Cost or Loss**:
   - Add the results of the underestimation error and overestimation error to get the total potential cost or loss.

### Example Walkthrough:

1. **Calculate RMSE**:
   - `rmse = np.sqrt(mean_squared_error(y_test, y_pred_final))`
   - This gives you the average prediction error in dollars.

2. **Calculate Underestimation Error**:
   - `underestimation_error = np.sum(y_test[y_test > y_pred_final] - y_pred_final[y_test > y_pred_final])`
   - This gives you the total amount by which the model undercharged.

3. **Calculate Overestimation Error**:
   - `overestimation_error = np.sum(y_pred_final[y_test < y_pred_final] - y_test[y_test < y_pred_final])`
   - This gives you the total amount by which the model overcharged.

4. **Calculate Total Potential Cost or Loss**:
   - `total_potential_loss = underestimation_error + overestimation_error`
   - This gives you the total financial impact of the model's errors.

### Leaderboard:
Post your RMSE score and total potential cost or loss on the class leaderboard. The student with the lowest RMSE and total potential cost or loss wins bragging rights

### Post Your Results 

- Name
- Model Type
- RMSE
- Underestimation Error
- Overestimation Error
- Total Potential Cost/Loss

In [82]:
# RMSE
import numpy as np
rmse_final_model = np.sqrt(mean_squared_error(y_test, y_pred_final))
print(f"Root Mean Squared Error: {rmse:.2f}")

# Underestimation Error
underestimation_error = np.sum(y_pred_final[y_test > y_pred_final] - y_test[y_test > y_pred_final])
print(f"undererstimation error: {underestimation_error:.2f}")

# Overestimation Error
overestimation_error = np.sum(y_pred_final[y_test < y_pred_final] - y_test[y_test < y_pred_final])
print(f"overerstimation error: {overestimation_error:.2f}")

# Total Potential Cost or Loss
total_potential_loss = underestimation_error + overestimation_error

print(f"total potential loss: {total_potential_loss:.2f}")

Root Mean Squared Error: 5220.14
undererstimation error: -445172.67
overerstimation error: 562864.53
total potential loss: 117691.86


## Conclusion

Congratulations! You've completed the lab. Here's a summary of what we've covered:
1. Established a naive baseline using the mean of the target variable.
2. Built an initial linear regression model without any feature engineering or optimization.
3. Performed feature engineering to create new, potentially useful features.
4. Used pipelines and GridSearchCV to optimize the model.
5. Evaluated the final model's performance using RMSE to understand its business impact.

By following these steps, you now have a robust understanding of how to approach a regression problem, from initial exploration to model optimization and business impact assessment. Great job!
