<a href="https://colab.research.google.com/github/RafaelAnga/Artificial-Intelligence/blob/main/Supervised-Learning/Regression/Catboost_Regressor_Insurance_charges.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Insurance Costs Using CatBoost Regressor

This notebook implements a CatBoost Regressor to predict insurance costs from customer data. The dataset (insurance.csv) contains the features age, sex, BMI, number of children, smoking status, and region, with charges as the target. The code preprocesses the data, trains a CatBoost regression model, and evaluates it with R-squared, Adjusted R-squared, and 10-fold cross-validation. It also runs a Grid Search over selected hyperparameters to look for a better-performing configuration.

**Business Applications:**

This model can be used in various business scenarios, such as:

* Insurance Premium Prediction: Predicting the cost of insurance premiums for new customers based on their demographic and health-related data.
* Risk Assessment: Identifying high-risk customers (e.g., smokers or individuals with high BMI) to adjust premiums or offer targeted health programs.
* Customer Segmentation: Grouping customers based on predicted costs to design personalized insurance plans.
* Policy Optimization: Helping insurance companies optimize their pricing strategies to remain competitive while managing risk.

## Part 1 - Data Preprocessing

### Importing the dataset

In [1]:
# Used to connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Library necessary to access the folder path
import os
os.chdir('/content/drive/MyDrive/Machine Learning/Regression Templates/DataSets')

# Lists the files available in the current directory
os.listdir()

['Data.csv',
 'Salary_Data.csv',
 '50_Startups.csv',
 'Position_Salaries.csv',
 'insurance.csv']

In [3]:
import pandas as pd
dataset = pd.read_csv('insurance.csv')

In [4]:
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### Checking missing data

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


### Handling categorical variables

Sex column

In [6]:
dataset['sex'].unique()

array(['female', 'male'], dtype=object)

In [7]:
dataset['sex'] = dataset['sex'].apply(lambda x: 0 if x == 'female' else 1)

In [8]:
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,yes,southwest,16884.924
1,18,1,33.77,1,no,southeast,1725.5523
2,28,1,33.0,3,no,southeast,4449.462
3,33,1,22.705,0,no,northwest,21984.47061
4,32,1,28.88,0,no,northwest,3866.8552


Smoker column

In [9]:
dataset['smoker'].unique()

array(['yes', 'no'], dtype=object)

In [10]:
dataset['smoker'] = dataset['smoker'].apply(lambda x: 0 if x == 'no' else 1)

In [11]:
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,southwest,16884.924
1,18,1,33.77,1,0,southeast,1725.5523
2,28,1,33.0,3,0,southeast,4449.462
3,33,1,22.705,0,0,northwest,21984.47061
4,32,1,28.88,0,0,northwest,3866.8552
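
The lambdas used for `sex` and `smoker` send every value other than the named one to 1, so a typo or a missing entry would be encoded silently. A stricter variant could look like the plain-Python sketch below. `encode_strict` and `SMOKER_CODES` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
# Hypothetical helper (not in the notebook): encode a binary column with an
# explicit lookup table and fail loudly on unexpected labels.
SMOKER_CODES = {"no": 0, "yes": 1}


def encode_strict(values, codes):
    """Return the integer code for each value, raising on unknown labels."""
    unknown = sorted(set(values) - set(codes), key=str)
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [codes[value] for value in values]


# First five 'smoker' values shown by dataset.head() before encoding.
sample = ["yes", "no", "no", "no", "no"]
encoded = encode_strict(sample, SMOKER_CODES)
print(encoded)
```

With pandas, `dataset['smoker'].map({'no': 0, 'yes': 1})` is similar in spirit: labels missing from the dictionary become NaN instead of 1, so they show up in a later missing-value check.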


Region column

In [12]:
dataset['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [13]:
region_dummies = pd.get_dummies(dataset['region'], drop_first = True)

In [14]:
region_dummies

Unnamed: 0,northwest,southeast,southwest
0,False,False,True
1,False,True,False
2,False,True,False
3,True,False,False
4,True,False,False
...,...,...,...
1333,True,False,False
1334,False,False,False
1335,False,True,False
1336,False,False,True


In [15]:
dataset = pd.concat([region_dummies, dataset], axis = 1)

In [16]:
dataset.head()

Unnamed: 0,northwest,southeast,southwest,age,sex,bmi,children,smoker,region,charges
0,False,False,True,19,0,27.9,0,1,southwest,16884.924
1,False,True,False,18,1,33.77,1,0,southeast,1725.5523
2,False,True,False,28,1,33.0,3,0,southeast,4449.462
3,True,False,False,33,1,22.705,0,0,northwest,21984.47061
4,True,False,False,32,1,28.88,0,0,northwest,3866.8552


In [17]:
dataset.drop(['region'], axis = 1, inplace = True)

In [18]:
dataset.head()

Unnamed: 0,northwest,southeast,southwest,age,sex,bmi,children,smoker,charges
0,False,False,True,19,0,27.9,0,1,16884.924
1,False,True,False,18,1,33.77,1,0,1725.5523
2,False,True,False,28,1,33.0,3,0,4449.462
3,True,False,False,33,1,22.705,0,0,21984.47061
4,True,False,False,32,1,28.88,0,0,3866.8552
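
To make the effect of `drop_first = True` concrete, the following pure-Python sketch reproduces the idea on a handful of region labels. It assumes categories are ordered alphabetically before the first one is dropped, which is consistent with the three columns shown above. `one_hot_drop_first` is a hypothetical helper, not a pandas function.

```python
# Hypothetical pure-Python illustration of one-hot encoding with the first
# (alphabetically sorted) category dropped, mirroring drop_first=True.
def one_hot_drop_first(values):
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the baseline
    rows = [[value == category for category in kept] for value in values]
    return kept, rows


regions = ["southwest", "southeast", "southeast", "northwest", "northeast"]
columns, rows = one_hot_drop_first(regions)
print(columns)
for region, row in zip(regions, rows):
    print(region, row)
```

The `northeast` row comes out as all `False`, which is how the baseline region is represented once its own column has been dropped.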


### Creating the Training Set and the Test Set

Getting the inputs and output

In [19]:
X = dataset.iloc[:, :-1].values

In [20]:
y = dataset.iloc[:, -1].values

In [21]:
X

array([[False, False, True, ..., 27.9, 0, 1],
       [False, True, False, ..., 33.77, 1, 0],
       [False, True, False, ..., 33.0, 3, 0],
       ...,
       [False, True, False, ..., 36.85, 0, 0],
       [False, False, True, ..., 25.8, 0, 0],
       [True, False, False, ..., 29.07, 0, 1]], dtype=object)

In [22]:
y

array([16884.924 ,  1725.5523,  4449.462 , ...,  1629.8335,  2007.945 ,
       29141.3603])

Getting the Training Set and the Test Set

In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
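
As a quick sanity check on the split sizes, the arithmetic can be sketched as below. It assumes, as scikit-learn does to my understanding, that a fractional `test_size` is rounded up to a whole number of rows and that the training set receives the remainder.

```python
import math

n_samples = 1338  # rows reported by dataset.info()
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up, and the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

After running the cell above, `X_train.shape` and `X_test.shape` can be compared against these counts.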

## Part 2 - Building and training the model

### Building the model

In [27]:
!pip install catboost



In [28]:
import catboost as cb
model = cb.CatBoostRegressor()

Suppressing warnings for display purposes

In [37]:
import warnings
warnings.filterwarnings("ignore")

### Training the model

In [38]:
model.fit(X_train, y_train)

Learning rate set to 0.041383
0:	learn: 11611.5326660	total: 908us	remaining: 908ms
1:	learn: 11297.2362282	total: 1.82ms	remaining: 907ms
2:	learn: 10987.8561010	total: 2.65ms	remaining: 880ms
3:	learn: 10664.1180964	total: 3.46ms	remaining: 862ms
4:	learn: 10377.3027972	total: 4.46ms	remaining: 888ms
5:	learn: 10078.6082882	total: 5.22ms	remaining: 866ms
6:	learn: 9809.1374130	total: 6.08ms	remaining: 863ms
7:	learn: 9571.6815432	total: 6.69ms	remaining: 830ms
8:	learn: 9319.9322507	total: 7.47ms	remaining: 822ms
9:	learn: 9081.2252419	total: 8.24ms	remaining: 816ms
10:	learn: 8862.0378680	total: 8.89ms	remaining: 799ms
11:	learn: 8630.0769266	total: 9.76ms	remaining: 804ms
12:	learn: 8437.0370569	total: 10.5ms	remaining: 799ms
13:	learn: 8239.7925079	total: 11.3ms	remaining: 794ms
14:	learn: 8052.4841061	total: 12ms	remaining: 790ms
15:	learn: 7864.5778395	total: 12.8ms	remaining: 788ms
16:	learn: 7693.6490256	total: 13.6ms	remaining: 785ms
17:	learn: 7521.2723681	total: 14.5ms	rema

<catboost.core.CatBoostRegressor at 0x7bef393bc6d0>

### Inference

In [39]:
y_pred = model.predict(X_test)

## Part 3: Evaluating the model

### R-Squared

In [40]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)

In [41]:
r2

0.8943206977287299
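
For readers who want to see what `r2_score` computes, here is a minimal from-scratch sketch of the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The toy values are made up for illustration and are not taken from insurance.csv; `r_squared` is a hypothetical helper name.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
y_true_toy = [3.0, -0.5, 2.0, 7.0]
y_pred_toy = [2.5, 0.0, 2.0, 8.0]
toy_r2 = r_squared(y_true_toy, y_pred_toy)
print(round(toy_r2, 4))
```

A perfect prediction gives 1.0, and always predicting the mean of the targets gives 0.0.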

### Adjusted R-Squared

In [42]:
k = X_test.shape[1]
n = len(X_test)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

In [43]:
adj_r2

0.8910564721759494
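
The same calculation can be wrapped in a small function, which makes the roles of n (number of test rows) and k (number of feature columns) explicit. This is a sketch: `adjusted_r2` is a hypothetical helper, and the values 268 and 8 are assumptions derived from a 20% split of 1338 rows and the eight feature columns left after preprocessing.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# R-squared from the cell above; n and k are assumed from the shape of X_test.
result = adjusted_r2(0.8943206977287299, n_samples=268, n_features=8)
print(result)
```

Because (n - 1) / (n - k - 1) is greater than 1 whenever k > 0, the adjusted value is below the plain R-squared for any R-squared under 1.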

### k-Fold Cross Validation

In [44]:
from sklearn.model_selection import cross_val_score
r2s = cross_val_score(estimator = model,
                      X = X,
                      y = y,
                      scoring = 'r2',
                      cv = 10)
print("R-Squared: {:.2f} %".format(r2s.mean()*100))
print("Standard Deviation: {:.2f} %".format(r2s.std()*100))

Streaming output truncated to the last 5000 lines.
6:	learn: 9882.1673821	total: 14.8ms	remaining: 2.1s
7:	learn: 9606.3664267	total: 16.5ms	remaining: 2.04s
8:	learn: 9344.0173099	total: 19.8ms	remaining: 2.18s
9:	learn: 9090.5544258	total: 24.6ms	remaining: 2.44s
10:	learn: 8860.6987446	total: 27ms	remaining: 2.42s
11:	learn: 8619.8829962	total: 28.7ms	remaining: 2.36s
12:	learn: 8410.8987434	total: 41.4ms	remaining: 3.14s
13:	learn: 8194.6867796	total: 44.6ms	remaining: 3.14s
14:	learn: 8000.5011819	total: 48.1ms	remaining: 3.16s
15:	learn: 7808.4554127	total: 49.9ms	remaining: 3.07s
16:	learn: 7633.2294400	total: 52.1ms	remaining: 3.01s
17:	learn: 7458.7752491	total: 54.8ms	remaining: 2.99s
18:	learn: 7297.3863227	total: 56.3ms	remaining: 2.91s
19:	learn: 7146.5091822	total: 58.4ms	remaining: 2.86s
20:	learn: 7002.1975514	total: 60.5ms	remaining: 2.82s
21:	learn: 6861.3190982	total: 62.8ms	remaining: 2.79s
22:	learn: 6730.2741190	total: 65ms	remaining: 2.76s
23:	learn
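
To illustrate how the 1338 rows are divided when `cv = 10`, the sketch below builds contiguous, unshuffled folds in plain Python. It assumes that an integer `cv` for a regressor leads to an unshuffled K-fold split in which the first folds absorb the remainder, which matches my understanding of scikit-learn's `KFold` defaults. `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional row each.
        size = base + 1 if fold < extra else base
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


fold_sizes = [len(test) for _, test in kfold_indices(1338, 10)]
print(fold_sizes)
```

Each row is used for testing exactly once and for training in the other nine rounds, which is why the mean and standard deviation of the ten scores say something about model stability.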

### Grid Search

In [45]:
from sklearn.model_selection import GridSearchCV
parameters = [{'learning_rate': [0.008,0.009,0.01],
               'depth': [4,7,10],
               'l2_leaf_reg': [2,6,10],
               'random_strength': [0,5,10]}]
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'r2',
                           cv = 10)
grid_search.fit(X, y)
best_r2 = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best R-Squared: {:.2f} %".format(best_r2*100))
print("Best Parameters:", best_parameters)

Streaming output truncated to the last 5000 lines.
2:	learn: 11908.9722187	total: 3.78ms	remaining: 1.26s
3:	learn: 11839.4207532	total: 4.62ms	remaining: 1.15s
4:	learn: 11774.4016368	total: 6.42ms	remaining: 1.28s
5:	learn: 11706.1588478	total: 6.77ms	remaining: 1.12s
6:	learn: 11660.5412094	total: 11.9ms	remaining: 1.69s
7:	learn: 11596.2159838	total: 13.2ms	remaining: 1.64s
8:	learn: 11554.8437783	total: 19.7ms	remaining: 2.17s
9:	learn: 11493.5671220	total: 20.6ms	remaining: 2.04s
10:	learn: 11444.8518259	total: 27.5ms	remaining: 2.47s
11:	learn: 11397.6528643	total: 36.6ms	remaining: 3.01s
12:	learn: 11352.5957777	total: 42.5ms	remaining: 3.23s
13:	learn: 11289.6865290	total: 43.9ms	remaining: 3.09s
14:	learn: 11226.0242630	total: 44.2ms	remaining: 2.9s
15:	learn: 11180.8116003	total: 50.8ms	remaining: 3.12s
16:	learn: 11124.8895976	total: 52.1ms	remaining: 3.01s
17:	learn: 11081.9980149	total: 57.6ms	remaining: 3.14s
18:	learn: 11021.6078484	total: 58.1ms	remaining
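
The length of the log above follows from the size of the search. The sketch below enumerates the same grid with `itertools.product` to count the candidate configurations and the cross-validation fits. It only illustrates the combinatorics and does not train anything; the enumeration order may differ from the one GridSearchCV uses internally.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "learning_rate": [0.008, 0.009, 0.01],
    "depth": [4, 7, 10],
    "l2_leaf_reg": [2, 6, 10],
    "random_strength": [0, 5, 10],
}

# Every combination of one value per hyperparameter.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_folds = 10
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```

With four hyperparameters of three values each, the grid contains 81 candidates, so 10-fold cross-validation trains 810 models, plus one final refit under GridSearchCV's default `refit=True`.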

# Explanation of the Code:
**Data Preprocessing:**

1. Handling Categorical Variables:
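* The sex and smoker columns are converted into binary values: 0 for female and 1 for male in sex, and 0 for no and 1 for yes in smoker.
* The region column is one-hot encoded with the first category (northeast) dropped, leaving three dummy columns. Dropping one category avoids a redundant, perfectly collinear column; northeast becomes the baseline, represented by all three dummies being False.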

2. Splitting the Data:
* The dataset is split into training and test sets (80% training, 20% testing) to evaluate the model's performance.

**Model Building and Training:**

1. CatBoost Regressor:
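* CatBoost, a gradient boosting library with built-in support for categorical features, is used to predict insurance costs. In this notebook the categorical columns are encoded manually beforehand, so that built-in handling is not used.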
* The model is trained on the training set using default parameters.

**Model Evaluation:**
1. R-Squared:
* Measures how well the model explains the variance in the target variable. A higher R-squared indicates better performance.
2. Adjusted R-Squared:
* Adjusts the R-squared value for the number of predictors in the model, so adding features that do not improve the fit lowers the score.
3. k-Fold Cross-Validation:
* Splits the data into 10 folds to evaluate the model's performance across different subsets, providing a more robust estimate of its accuracy.

**Hyperparameter Tuning:**
1. Grid Search:
* Optimizes the model by testing combinations of hyperparameters such as learning_rate, depth, l2_leaf_reg, and random_strength.
* The best parameters and corresponding R-squared score are identified to improve the model's performance.

## Key Metrics Explained:
1. R-Squared:
* Indicates the proportion of variance in the target variable explained by the model.
* Example: An R-squared of 0.85 means the model explains 85% of the variance in insurance costs.

2. Adjusted R-Squared:
* Accounts for the number of predictors in the model.
* Penalizes models with many features that do not improve performance. It does not prevent overfitting by itself, but it makes uninformative features visible in the score.

3. k-Fold Cross-Validation:
* Splits the data into 10 subsets (folds) and trains the model on 9 folds while testing on the remaining fold.
* Provides an average R-squared score and its standard deviation to assess model stability.

4. Grid Search:
* Systematically tests combinations of hyperparameters to find the best configuration for the model.
* Example: The best parameters might include learning_rate = 0.01, depth = 7, and l2_leaf_reg = 6.