<a href="https://colab.research.google.com/github/RafaelAnga/Artificial-Intelligence/blob/main/Supervised-Learning/Regression/Regression_LightBGM_Insurance_Cost_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Insurance Costs Using LightGBM Regressor

This notebook implements a LightGBM regressor to predict insurance costs from customer data. The dataset (insurance.csv) includes age, sex, BMI, number of children, smoking status, and region. The code preprocesses the data, trains a LightGBM regression model, and evaluates it with R-squared and Adjusted R-squared on a held-out test set and with 10-fold cross-validation. It then runs a grid search over hyperparameters to look for a better-performing configuration.

**Business Applications:**

This model can be used in various business scenarios, such as:

1. Insurance Premium Prediction: Predicting the cost of insurance premiums for new customers based on their demographic and health-related data.
2. Risk Assessment: Identifying high-risk customers (e.g., smokers or individuals with high BMI) to adjust premiums or offer targeted health programs.
3. Customer Segmentation: Grouping customers based on predicted costs to design personalized insurance plans.
4. Policy Optimization: Helping insurance companies optimize their pricing strategies to remain competitive while managing risk.

## Part 1 - Data Preprocessing

### Importing the dataset

In [33]:
# Mount Google Drive so the notebook can read the dataset
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [34]:
# Needed to change into the folder that holds the datasets
import os
os.chdir('/content/drive/MyDrive/Machine Learning/Regression Templates/DataSets')

# List the files in the current directory
os.listdir()

['Data.csv',
 'Salary_Data.csv',
 '50_Startups.csv',
 'Position_Salaries.csv',
 'insurance.csv']

In [35]:
import pandas as pd
dataset = pd.read_csv('insurance.csv')

In [36]:
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### Checking missing data

In [37]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


### Handling categorical variables

Sex column

In [38]:
dataset['sex'].unique()

array(['female', 'male'], dtype=object)

In [39]:
dataset['sex'] = dataset['sex'].apply(lambda x: 0 if x == 'female' else 1)

In [40]:
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,yes,southwest,16884.924
1,18,1,33.77,1,no,southeast,1725.5523
2,28,1,33.0,3,no,southeast,4449.462
3,33,1,22.705,0,no,northwest,21984.47061
4,32,1,28.88,0,no,northwest,3866.8552


Smoker column

In [41]:
dataset['smoker'].unique()

array(['yes', 'no'], dtype=object)

In [42]:
dataset['smoker'] = dataset['smoker'].apply(lambda x: 0 if x == 'no' else 1)

In [43]:
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,southwest,16884.924
1,18,1,33.77,1,0,southeast,1725.5523
2,28,1,33.0,3,0,southeast,4449.462
3,33,1,22.705,0,0,northwest,21984.47061
4,32,1,28.88,0,0,northwest,3866.8552


Region column

In [44]:
dataset['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [45]:
region_dummies = pd.get_dummies(dataset['region'], drop_first = True)

In [46]:
region_dummies

Unnamed: 0,northwest,southeast,southwest
0,False,False,True
1,False,True,False
2,False,True,False
3,True,False,False
4,True,False,False
...,...,...,...
1333,True,False,False
1334,False,False,False
1335,False,True,False
1336,False,False,True


In [47]:
dataset = pd.concat([region_dummies, dataset], axis = 1)

In [48]:
dataset.head()

Unnamed: 0,northwest,southeast,southwest,age,sex,bmi,children,smoker,region,charges
0,False,False,True,19,0,27.9,0,1,southwest,16884.924
1,False,True,False,18,1,33.77,1,0,southeast,1725.5523
2,False,True,False,28,1,33.0,3,0,southeast,4449.462
3,True,False,False,33,1,22.705,0,0,northwest,21984.47061
4,True,False,False,32,1,28.88,0,0,northwest,3866.8552


In [49]:
dataset.drop(['region'], axis = 1, inplace = True)

In [50]:
dataset.head()

Unnamed: 0,northwest,southeast,southwest,age,sex,bmi,children,smoker,charges
0,False,False,True,19,0,27.9,0,1,16884.924
1,False,True,False,18,1,33.77,1,0,1725.5523
2,False,True,False,28,1,33.0,3,0,4449.462
3,True,False,False,33,1,22.705,0,0,21984.47061
4,True,False,False,32,1,28.88,0,0,3866.8552
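
The encoding steps above can be sketched in one place on a small made-up frame. This is an illustrative alternative, not the notebook's own code: the `toy` DataFrame and its values are hypothetical, and `map` with an explicit dictionary is swapped in for the `apply`/`lambda` calls so that the 0/1 coding is visible at a glance.

```python
import pandas as pd

# Tiny made-up frame with the same categorical columns as insurance.csv
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Explicit dictionaries show the 0/1 coding; unmapped values would become NaN
toy["sex"] = toy["sex"].map({"female": 0, "male": 1})
toy["smoker"] = toy["smoker"].map({"no": 0, "yes": 1})

# drop_first=True drops the alphabetically first category (northeast),
# so a northeast row is represented by all remaining dummies being False/0
region_dummies = pd.get_dummies(toy["region"], drop_first=True)
toy = pd.concat([region_dummies, toy.drop(columns="region")], axis=1)

print(toy.columns.tolist())
```

One design note: with `map`, an unexpected category surfaces as a missing value instead of being silently coded as 1, which is what the `else 1` branch of the lambda would do.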


### Creating the Training Set and the Test Set

Getting the inputs and output

In [51]:
X = dataset.iloc[:, :-1].values

In [52]:
y = dataset.iloc[:, -1].values

In [53]:
X

array([[False, False, True, ..., 27.9, 0, 1],
       [False, True, False, ..., 33.77, 1, 0],
       [False, True, False, ..., 33.0, 3, 0],
       ...,
       [False, True, False, ..., 36.85, 0, 0],
       [False, False, True, ..., 25.8, 0, 0],
       [True, False, False, ..., 29.07, 0, 1]], dtype=object)

In [54]:
y

array([16884.924 ,  1725.5523,  4449.462 , ...,  1629.8335,  2007.945 ,
       29141.3603])

Getting the Training Set and the Test Set

In [55]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
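
As a quick sanity check on the split sizes, the arithmetic can be reproduced by hand. This sketch assumes that scikit-learn rounds the test count up and gives the remaining rows to the training set, which is my understanding of `train_test_split` when only `test_size` is given as a fraction.

```python
import math

n_samples = 1338   # rows reported by dataset.info()
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the training count agrees with the 1070 data points that LightGBM reports in its training log.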

## Part 2 - Building and training the model

### Building the model

In [56]:
import lightgbm as lgb
model = lgb.LGBMRegressor()

### Training the model

In [57]:
model.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000223 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1070, number of used features: 8
[LightGBM] [Info] Start training from score 13201.182046




### Inference

In [58]:
y_pred = model.predict(X_test)



## Part 3 - Evaluating the model

### R-Squared

In [59]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)

In [60]:
r2

0.8875426023265389
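
For intuition about what this score measures, here is a minimal pure-Python sketch of the usual definition, R-squared = 1 - SS_res / SS_tot. The helper `r_squared` and the sample values are hypothetical and only illustrate the formula; the notebook itself relies on `sklearn.metrics.r2_score`.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the targets
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]           # made-up targets
perfect = r_squared(y_true, y_true)          # predictions equal to the targets
baseline = r_squared(y_true, [25.0] * 4)     # always predict the mean
print(perfect, baseline)
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model that does worse than the mean scores below 0.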

### Adjusted R-Squared

In [61]:
k = X_test.shape[1]
n = len(X_test)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

In [62]:
adj_r2

0.8840690147536134

### k-Fold Cross Validation

In [71]:
import warnings
warnings.filterwarnings("ignore")

import lightgbm as lgb
model = lgb.LGBMRegressor(verbose=-1)

from sklearn.model_selection import cross_val_score
r2s = cross_val_score(estimator = model,
                      X = X,
                      y = y,
                      scoring = 'r2',
                      cv = 10)
print("R-Squared: {:.2f} %".format(r2s.mean()*100))
print("Standard Deviation of R-Squared: {:.2f} ".format(r2s.std()))

R-Squared: 84.18 %
Standard Deviation of R-Squared: 0.05 
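
To make the fold structure concrete, the sketch below partitions 1338 row indices into 10 contiguous folds. It assumes that `cross_val_score` with an integer `cv` and a regressor uses unshuffled K-fold splitting, which is my understanding of the scikit-learn default; `kfold_indices` is a hypothetical helper written for illustration, not a library function.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional row
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(1338, 10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Each row lands in exactly one test fold, so every observation is used for scoring once and for training nine times.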


### Grid Search

In [72]:
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

# Create the LightGBM model
model = lgb.LGBMRegressor(verbose=-1)

parameters = [{'num_leaves': [29, 30, 31, 32, 33],
               'learning_rate': [0.08, 0.09, 0.1, 0.11, 0.12],
               'n_estimators': [80, 90, 100, 110, 120]}]

grid_search = GridSearchCV(estimator=model,
                           param_grid=parameters,
                           scoring='r2',
                           cv=10,
                           verbose=0)

grid_search.fit(X, y)

best_r2 = grid_search.best_score_
best_parameters = grid_search.best_params_

print("Best R-Squared: {:.2f} %".format(best_r2 * 100))
print("Best Parameters:", best_parameters)

Best R-Squared: 84.89 %
Best Parameters: {'learning_rate': 0.08, 'n_estimators': 80, 'num_leaves': 30}
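
To get a feel for the cost of this search, the grid can be enumerated by hand with the standard library. This sketch only counts combinations and does not train anything; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "num_leaves": [29, 30, 31, 32, 33],
    "learning_rate": [0.08, 0.09, 0.1, 0.11, 0.12],
    "n_estimators": [80, 90, 100, 110, 120],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 10
print(len(combinations), len(combinations) * n_folds)
```

Each combination is fitted once per fold, and with the default `refit=True` the search fits the best combination one more time on all the data passed to `fit`. Note also that the selected learning_rate (0.08) and n_estimators (80) are the smallest values in their ranges, so extending the grid toward lower values may be worth trying.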


## Explanation of the Code
**1. Data Preprocessing:**
1. Handling Categorical Variables:
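* The sex and smoker columns are converted into binary values (0 for female/no and 1 for male/yes).
* The region column is one-hot encoded with drop_first = True, which produces three dummy columns (northwest, southeast, southwest). Northeast is the dropped baseline and is represented by all three columns being False.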

2. Splitting the Data:

* The dataset is split into training and test sets (80% training, 20% testing) to evaluate the model's performance.

**2. Model Building and Training:**
1. LightGBM Regressor:
* A gradient boosting model optimized for speed and performance is used to predict insurance costs.
* The model is trained on the training set using default parameters.

**3. Model Evaluation:**
1. R-Squared:
* Measures how well the model explains the variance in the target variable. A higher R-squared indicates better performance.
2. Adjusted R-Squared:
* Adjusts the R-squared value for the number of predictors in the model, so that adding features which contribute little lowers the score instead of inflating it.
3. k-Fold Cross-Validation:
* Splits the full dataset into 10 folds and scores the model on each held-out fold, giving a more robust estimate of its R-squared than a single train/test split.

**4. Hyperparameter Tuning:**
1. Grid Search:
* Optimizes the model by testing combinations of hyperparameters such as num_leaves, learning_rate, and n_estimators.
* The best parameters and the corresponding cross-validated R-squared are reported. In this run the tuned model reaches 84.89%, compared with 84.18% for the default model, a modest gain.

**Key Metrics Explained:**

1.  R-Squared:
* Indicates the proportion of variance in the target variable explained by the model.
* Example: An R-squared of 0.85 means the model explains 85% of the variance in insurance costs.
2.  Adjusted R-Squared:
* Accounts for the number of predictors in the model.
* Penalizes models with features that do not improve performance. It discourages unnecessary predictors but does not by itself prevent overfitting.
3.  k-Fold Cross-Validation:
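* Splits the data into 10 subsets (folds), trains the model on 9 folds, and tests on the remaining fold, repeating until each fold has served as the test fold once.
* Provides an average R-squared score and its standard deviation to assess model stability. Here the mean is 84.18% with a standard deviation of 0.05.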
4.  Grid Search:
* Systematically tests combinations of hyperparameters to find the best configuration for the model.
* Example: In this run the best parameters were num_leaves = 30, learning_rate = 0.08, and n_estimators = 80.