# Introduction to the "Modeling.ipynb" Notebook

This notebook builds a predictive model that forecasts medical charges from a set of patient features. The process involves selecting a suitable machine learning algorithm, training the model, and tuning its hyperparameters to improve performance.

The notebook will guide you through the steps of:

1. **Data Preparation**: Ensuring that the data is preprocessed and ready for modeling.
2. **Model Creation**: Exploring different modeling techniques such as linear regression, decision trees, or ensemble methods to predict medical charges.
3. **Hyperparameter Tuning**: Applying techniques like GridSearchCV or RandomizedSearchCV to identify the hyperparameters that give the chosen model the lowest prediction error and the best generalization.

By the end of this notebook, the best-performing model is selected and ready to make predictions on unseen data, which also helps reveal the factors that influence medical charges.


## Data Load and First Visualization

In [16]:
import pandas as pd

path = '../medical_insurance_project/data/preprocessed.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,age,bmi,children,charges,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,0.0,0.414597,0.0,16884.924,0,1,0,0,1
1,0.249979,0.572109,0.6,4449.462,1,0,0,1,0
2,0.331601,0.234754,0.0,21984.47061,1,0,1,0,0
3,0.316796,0.446201,0.0,3866.8552,1,0,1,0,0
4,0.300919,0.342121,0.0,3756.6216,0,0,0,1,0


## Feature Selection

According to the study conducted in the preprocessing notebook, the most important variables are:

- Age
- BMI
- Smoker_Yes
- Children

Therefore, these will be the selected features for the creation of the model.

In [17]:
selected_features = df[['age','bmi','smoker_yes','children','charges']]
selected_features.head()

Unnamed: 0,age,bmi,smoker_yes,children,charges
0,0.0,0.414597,1,0.0,16884.924
1,0.249979,0.572109,0,0.6,4449.462
2,0.331601,0.234754,0,0.0,21984.47061
3,0.316796,0.446201,0,0.0,3866.8552
4,0.300919,0.342121,0,0.0,3756.6216


## Saving the selected_features dataset to a CSV file

In [18]:
selected_features.to_csv('../medical_insurance_project/data/selected_features.csv', index=False)

# Model Training and Results Analysis

The training process involved running multiple machine learning models with various combinations of hyperparameters. These models and their respective hyperparameter configurations were specified in the `models.py` file.

## Workflow Overview
1. **Model Definitions:** 
   - All model architectures and hyperparameter ranges were defined in the `models.py` script.
   - The script includes a systematic approach to iterate over different combinations of hyperparameters for each model.

2. **Training Process:**
   - Each model was trained using the specified configurations.
   - Regression metrics (MSE, RMSE, and R²) were recorded for each model on both the training and test sets.

3. **Validation:**
   - Cross-validation was used during the hyperparameter search so that model selection did not depend on a single train/validation split.
   - The results were stored for further analysis.
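
The `pca__n_components` and `model__alpha` keys that appear in the training output suggest a scikit-learn `Pipeline` with steps named `pca` and `model`, searched with `GridSearchCV`. The contents of `models.py` are not shown in this notebook, so the following is only a minimal sketch under that assumption: the synthetic data, the grid values, the 5-fold setting, and the `build_search` helper are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def build_search():
    """Hypothetical helper: a PCA + Ridge pipeline wrapped in a grid search."""
    pipeline = Pipeline([
        ("pca", PCA()),       # step names give the 'pca__' / 'model__' prefixes
        ("model", Ridge()),
    ])
    param_grid = {
        "pca__n_components": [None, 2, 3],   # None keeps all components
        "model__alpha": [0.001, 0.1, 1.0],   # illustrative values only
    }
    # scikit-learn maximizes scores, so MSE is used in its negated form.
    return GridSearchCV(
        pipeline, param_grid, scoring="neg_mean_squared_error", cv=5
    )


# Synthetic stand-in for the four selected features and the charges target.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 10_000 * X[:, 0] + 20_000 * X[:, 2] + rng.normal(0, 500, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

search = build_search()
search.fit(X_train, y_train)

test_r2 = r2_score(y_test, search.predict(X_test))
print("Best parameters:", search.best_params_)
print("Best cross-validation score (negative MSE):", search.best_score_)
print("Test R2:", test_r2)
```

Keeping PCA and the estimator inside one pipeline means every cross-validation fold refits both steps on its own training portion, which avoids leaking information from the validation fold into the transformation.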

## Next Steps: Results Analysis
We will now analyze the results to:
- Identify the best-performing model and hyperparameter combination.
- Compare models based on their performance metrics.
- Discuss potential insights and next steps for improving performance.

The raw output of the training run is reproduced below, followed by a summary for each model.


In [None]:
"""
################################ Evaluating: linear_regression ################################ 

Best parameters: {'pca__n_components': None}
Best cross-validation score (MSE): -38769726.84336049
Train MSE: 38259673.17514955
Train RMSE: 6185.440418850509
Test MSE: 34909257.98721024
Test RMSE: 5908.405706043741
Train R2: 0.7532775051417095
Test R2: 0.7102604070256613

 ################################ Evaluating: ridge_regression ################################ 

Best parameters: {'model__alpha': 0.1, 'pca__n_components': None}
Best cross-validation score (MSE): -38769850.42859865
Train MSE: 38259792.648339495
Train RMSE: 6185.450076456805
Test MSE: 34897648.36215247
Test RMSE: 5907.423157532603
Train R2: 0.7532767347032545
Test R2: 0.7103567645030976

 ################################ Evaluating: lasso_regression ################################ 

Best parameters: {'model__alpha': 0.001, 'pca__n_components': None}
Best cross-validation score (MSE): -38769729.87413704
Train MSE: 38259673.175218716
Train RMSE: 6185.440418856099
Test MSE: 34909252.083462305
Test RMSE: 5908.405206437885
Train R2: 0.7532775051412636
Test R2: 0.7102604560255432

 ################################ Evaluating: random_forest ################################ 

Best parameters: {'model__max_depth': 5, 'model__min_samples_leaf': 2, 'model__min_samples_split': 2, 'model__n_estimators': 150, 'pca__n_components': None}
Best cross-validation score (MSE): -25393477.256383758
Train MSE: 16248053.292774929
Train RMSE: 4030.8874076033094
Test MSE: 20417339.381819565
Test RMSE: 4518.555010378823
Train R2: 0.8952223081825053
Test R2: 0.8305403224475671

 ################################ Evaluating: lightgbm ################################ 

Best parameters: {'model__learning_rate': 0.13, 'model__max_depth': 3, 'model__n_estimators': 28, 'model__num_leaves': 13, 'model__verbosity': -1, 'pca__n_components': None}
Best cross-validation score (MSE): -25341414.417343054
Train MSE: 20251299.69063321
Train RMSE: 4500.144407753291
Test MSE: 21547969.14003336
Test RMSE: 4641.979011158211
Train R2: 0.8694068514144745
Test R2: 0.8211563302106207
-------------------------------- 
Best model identified: --------------------------------
Parameters: {'model__learning_rate': 0.13, 'model__max_depth': 3, 'model__n_estimators': 28, 'model__num_leaves': 13, 'model__verbosity': -1, 'pca__n_components': None}
Score (MSE): -25341414.417343054
----------------------------------------------------------------
"""
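
The cross-validation scores above are negative, which is what scikit-learn's `neg_mean_squared_error` scorer would produce: it reports MSE with the sign flipped so that a higher score is always better. Assuming that scorer was used, negating the score and taking the square root gives a cross-validated RMSE in the same units as `charges`. The `cv_scores` dictionary and the `cv_rmse` helper are introduced here only for illustration; the values are copied from the output above.

```python
import math

# Best cross-validation scores copied from the training output.
cv_scores = {
    "linear_regression": -38769726.84336049,
    "ridge_regression": -38769850.42859865,
    "lasso_regression": -38769729.87413704,
    "random_forest": -25393477.256383758,
    "lightgbm": -25341414.417343054,
}


def cv_rmse(neg_mse):
    """Convert a negated MSE score into an RMSE."""
    return math.sqrt(-neg_mse)


# Highest (least negative) score first, i.e. best model first.
for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} CV RMSE ~ {cv_rmse(score):,.2f}")

best_by_cv = max(cv_scores, key=cv_scores.get)
print("Best by cross-validation:", best_by_cv)
```

Read this way, the two tree-based models are separated by only a few units of RMSE, while both are well ahead of the linear models.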

# Model Training and Evaluation Summary

The training and evaluation of multiple models yielded the following insights:

## **1. Linear Regression**
- **Performance:**
  - Train MSE: `38,259,673.18`
  - Train RMSE: `6,185.44`
  - Test MSE: `34,909,257.99`
  - Test RMSE: `5,908.41`
  - Train R²: `0.7533`
  - Test R²: `0.7103`

Linear Regression demonstrated moderate performance, explaining about 71% of the variance in test charges with a test RMSE of roughly 5,908.

---

## **2. Ridge Regression**
- **Performance:**
  - Train MSE: `38,259,792.65`
  - Train RMSE: `6,185.45`
  - Test MSE: `34,897,648.36`
  - Test RMSE: `5,907.42`
  - Train R²: `0.7533`
  - Test R²: `0.7104`

Ridge Regression provided nearly identical results to Linear Regression, suggesting minimal impact from regularization.

---

## **3. Lasso Regression**
- **Performance:**
  - Train MSE: `38,259,673.18`
  - Train RMSE: `6,185.44`
  - Test MSE: `34,909,252.08`
  - Test RMSE: `5,908.41`
  - Train R²: `0.7533`
  - Test R²: `0.7103`

Similar to Ridge, Lasso Regression offered comparable results with minimal regularization impact.

---

## **4. Random Forest**
- **Performance:**
  - Train MSE: `16,248,053.29`
  - Train RMSE: `4,030.89`
  - Test MSE: `20,417,339.38`
  - Test RMSE: `4,518.56`
  - Train R²: `0.8952`
  - Test R²: `0.8305`

Random Forest significantly outperformed linear models, capturing more variance and achieving higher R² scores.

---

## **5. LightGBM**
- **Performance:**
  - Train MSE: `20,251,299.69`
  - Train RMSE: `4,500.14`
  - Test MSE: `21,547,969.14`
  - Test RMSE: `4,641.98`
  - Train R²: `0.8694`
  - Test R²: `0.8212`

LightGBM performed similarly to Random Forest, with slightly lower variance captured on the training set but still competitive on test data.

---

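## Best Model: LightGBM

- Best Cross-Validation Score (negative MSE): `-25,341,414.42`

LightGBM was identified as the best-performing model because it achieved the best cross-validation score, narrowly ahead of Random Forest (`-25,393,477.26`). On the held-out test set, Random Forest scored slightly better (test RMSE `4,518.56` versus `4,641.98`), so the two models are effectively close competitors.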

### Model Comparison

- **Linear Models:** Performed reasonably (test R² of about 0.71) but showed limited ability to capture non-linear patterns.
- **Random Forest & LightGBM:** Captured more variance (test R² of about 0.82 to 0.83). LightGBM slightly edged out Random Forest on cross-validation score, while Random Forest had the lower test error.
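
To make the comparison concrete, the reported metrics can be ranked by test RMSE and by the gap between training and test R². The sketch below only rearranges numbers already listed in this summary; the `results` structure and the variable names are illustrative.

```python
# (train R2, test R2, test RMSE) as reported in the summary above.
results = {
    "Linear Regression": (0.7533, 0.7103, 5908.41),
    "Ridge Regression": (0.7533, 0.7104, 5907.42),
    "Lasso Regression": (0.7533, 0.7103, 5908.41),
    "Random Forest": (0.8952, 0.8305, 4518.56),
    "LightGBM": (0.8694, 0.8212, 4641.98),
}

# Sort models from lowest to highest test RMSE.
ranked = sorted(results.items(), key=lambda item: item[1][2])

print(f"{'Model':<18} {'Test RMSE':>10} {'Test R2':>8} {'R2 gap':>7}")
for name, (train_r2, test_r2, test_rmse) in ranked:
    gap = train_r2 - test_r2  # larger gap hints at more overfitting
    print(f"{name:<18} {test_rmse:>10,.2f} {test_r2:>8.4f} {gap:>7.4f}")

lowest_test_rmse = ranked[0][0]
```

On these numbers Random Forest has the lowest test RMSE, while LightGBM shows a smaller train/test R² gap (about 0.048 versus 0.065).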

# Conclusions

In this notebook, several machine learning models were trained and evaluated to predict medical charges. The following steps summarize the process:

1. **Data Preprocessing**: The dataset was preprocessed, including scaling numerical features and handling categorical variables.
2. **Model Training**: Multiple models were trained using GridSearchCV to find the optimal hyperparameters. Models included:
   - Linear Regression
   - Ridge Regression
   - Lasso Regression
   - Random Forest
   - LightGBM
   
3. **Evaluation**: Models were evaluated using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² on both the training and test sets. The best-performing model by cross-validation score was **LightGBM**, with the following characteristics:
   - Parameters: `{'model__learning_rate': 0.13, 'model__max_depth': 3, 'model__n_estimators': 28, 'model__num_leaves': 13}`
   - Test RMSE: `4,641.98`
   - Test R²: `0.8212`

4. **Exporting the Model**: The best model was serialized and exported as a file, ensuring it is ready to be deployed in a production environment.
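
The notebook does not show how the model was serialized. One common option for scikit-learn compatible estimators is `joblib`; the sketch below assumes that approach. The file name `best_model.joblib`, the temporary directory, and the stand-in `Ridge` pipeline are hypothetical placeholders for the tuned LightGBM pipeline.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Stand-in estimator; in the notebook this would be the fitted best pipeline.
rng = np.random.default_rng(0)
X = rng.random((50, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0])
best_model = Pipeline([("model", Ridge(alpha=0.1))]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_model.joblib"  # hypothetical file name
    joblib.dump(best_model, model_path)               # serialize to disk
    restored = joblib.load(model_path)                # load it back

# The restored pipeline should reproduce the original predictions.
same_predictions = bool(np.allclose(best_model.predict(X), restored.predict(X)))
print("Restored model reproduces predictions:", same_predictions)
```

Loading a serialized model generally requires the same library versions that were used to save it, so recording the scikit-learn and LightGBM versions alongside the file is advisable.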

---

The selected model explains about 82% of the variance in test-set charges and is now prepared for production deployment. Further monitoring and evaluation will be conducted in the production phase to ensure its real-world performance aligns with expectations.
