# **Insurance Pricing Prediction Using XGBoost Regressor**

## **Project Overview**
Health insurance companies offer financial protection by covering medical expenses in exchange for premiums. To remain profitable, insurers must ensure that the total premiums collected exceed payouts made on valid claims.

Traditional premium calculations involve manual, expert-driven processes, which are increasingly challenged by the growing complexity of healthcare data. This project addresses that challenge by developing a machine learning model to predict individual medical expenses based on factors like age, BMI, and smoking status.

**Objectives:**

*  Establish a baseline using Linear Regression.

*  Analyze feature relationships, including categorical correlations.

*  Build and evaluate an advanced XGBoost Regressor for improved prediction accuracy.

*  Compare models and communicate findings to non-technical stakeholders.

The goal is to enable data-driven pricing strategies that enhance profitability and decision-making in the health insurance sector.



![image](https://images.unsplash.com/photo-1637763723578-79a4ca9225f7?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1171&q=80)

## **Problem Statement**


For health insurance companies to remain profitable, the total premiums collected must exceed the payouts made on valid claims. Achieving this requires an accurate method for estimating expected healthcare costs for individuals based on various personal and lifestyle factors.

The goal of this project is to develop a machine learning model that can accurately predict individual healthcare expenses using provided features. This will support more informed and data-driven premium pricing decisions.

### Business Objective

The primary business objective is to help health insurance companies make informed, data-driven decisions when determining premiums for individuals. By accurately predicting future medical expenses using key features such as age, BMI, and smoking status, insurance providers can:

- Set fair and profitable insurance premiums.
- Reduce the risk of undercharging high-cost individuals.
- Improve overall financial planning and risk management.
- Enhance customer satisfaction by offering data-backed pricing.

Ultimately, the goal is to ensure that the company remains profitable while continuing to provide adequate financial protection to policyholders.

### 2. Data Understanding
- Dataset overview  
- Feature descriptions  
- Initial observations

Before we start, let's import the necessary third-party libraries:

In [1]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import sys
from src.stats import chi2, anova

from src.eda import (
    plot_histograms, plot_univariate_numeric, plot_univariate_categorical,
    plot_heatmap, plot_paired_boxplots, plot_paired_scatterplots,
    plot_residuals, plot_pearson_wrt_target,
)

from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import math
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.feature_selection import RFE

## **Exploratory Data Analysis (EDA)**

Exploratory Data Analysis (EDA) is the process of examining a dataset using statistical summaries and visualizations to uncover patterns, relationships, anomalies, or assumptions before building models.

EDA is a crucial step in the machine learning pipeline, as it helps us:

- Understand feature distributions

- Identify relationships between variables

- Detect missing values or outliers

- Gain insights that guide feature engineering and model selection

First, let us read in the dataset, which is stored in the `insurance.csv` file in the `data` folder, and then proceed with the exploratory analysis:

### Import the CSV Data as Pandas DataFrame

In [3]:
df = pd.read_csv('data/insurance.csv')

### Let us eyeball the data by showing Top 5 Records

In [4]:
df.head()

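|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |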


### Shape of the dataset

In [5]:
df.shape

(1338, 7)

### Dataset information

The column definitions are below:

* `age`: Age of primary beneficiary.
* `sex`: Gender of primary beneficiary.
* `bmi`: Body mass index of primary beneficiary: $\frac{weight_{kg}}{(height_{metres})^2}$
* `children`: Number of children that the primary beneficiary has.
* `smoker`: Whether the primary beneficiary smokes.
* `region`: The primary beneficiary's residential area in the US.
* `charges`: Individual medical costs billed by health insurance.
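As a quick worked example of the BMI formula in the list above, the following sketch computes BMI from weight and height. The function name `bmi` and the sample person (70 kg, 1.75 m) are illustrative assumptions, not values taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person, not a row from insurance.csv.
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))  # 22.86
```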

Let us return the datatypes of the columns:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


So we have three numeric features (`age`, `bmi` and `children`) and three categorical features (`sex`, `smoker` and `region`).
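The same numeric/categorical split can be derived programmatically rather than read off the `df.info()` output. The sketch below assumes a small frame called `toy`, built from the first three rows shown by `df.head()`; the variable names are illustrative.

```python
import pandas as pd

# Toy frame built from the first three rows shown by df.head().
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Exclude the target, then split the remaining columns by dtype.
features = toy.drop(columns="charges")
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['age', 'bmi', 'children']
print(categorical_cols)  # ['sex', 'smoker', 'region']
```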

### Check Missing values

In [7]:
df.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

There are no missing values in the dataset.

### Check Duplicates

In [8]:
df.duplicated().sum()

np.int64(1)

There is 1 duplicate row, which should be dropped before modeling.
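A minimal sketch of how a duplicate row could be removed with pandas is shown here. It runs on a small made-up frame (`toy`) in which the last row repeats the second one; applying the same call to `df` is an assumption about how the notebook would proceed, not something the cells in this section do.

```python
import pandas as pd

# Made-up frame: the third row is an exact copy of the second.
toy = pd.DataFrame({
    "age": [19, 18, 18],
    "bmi": [27.9, 33.77, 33.77],
    "charges": [16884.924, 1725.5523, 1725.5523],
})

n_dupes = int(toy.duplicated().sum())                    # rows that repeat an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)   # keep the first occurrence only

print(n_dupes)        # 1
print(deduped.shape)  # (2, 3)
```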

### Checking the number of unique values of each column

In [9]:
df.nunique()

age           47
sex            2
bmi          548
children       6
smoker         2
region         4
charges     1337
dtype: int64

### Descriptive Statistics
Check statistics of data set


In [10]:
df.describe()

|       | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


Key Insights

**Age:**
The ages of individuals range from 18 to 64 years, with a mean of 39.2 years. The distribution appears fairly spread out with a standard deviation of 14 years. This indicates a good representation across different age groups.

**BMI (Body Mass Index):**
BMI values range from 15.96 to 53.13, with an average of 30.66. The average individual therefore sits just above the obesity threshold (BMI ≥ 30), well inside the overweight range (BMI ≥ 25). The median of 30.4 implies that at least half of the individuals have a BMI of 30 or more.

**Children:**
The number of children covered by insurance ranges from 0 to 5. The median is 1, and at least 25% of the policyholders have no children (the 25th percentile is 0).

**Charges:**
Insurance charges (payouts, in dollars) range widely from approximately 1,122 to over 63,770. The mean charge is 13,270, while the median is significantly lower at 9,382. This indicates a right-skewed distribution, i.e., a small portion of individuals incur very high medical expenses.
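The "mean above median" argument for right skew can be checked numerically. The sketch below uses a made-up series of five charges (not data from `insurance.csv`) to show how one large value pulls the mean above the median and produces a positive sample skewness.

```python
import pandas as pd

# Made-up charges: four modest values and one very large one.
charges = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 50000.0], name="charges")

mean_charge = charges.mean()      # pulled upwards by the large value
median_charge = charges.median()  # robust to the large value
skewness = charges.skew()         # positive for a right-skewed sample

print(mean_charge, median_charge)  # 12000.0 3000.0
print(skewness > 0)                # True
```

The same three calls can be run on `df['charges']` to quantify the skew described above.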

### 3. Exploratory Data Analysis (EDA)
- Univariate analysis (e.g., distributions of age, charges, BMI)  
- Bivariate analysis (e.g., age vs charges, smoking vs charges)  
- Categorical vs numerical features  
- Correlation analysis (heatmaps, scatter plots)

The target (i.e. the variable that we want to predict) is the `charges` column, so let's split the dataset into features (`X`) and the target (`y`):

In [11]:
target = 'charges'
X = df.drop(target, axis=1)
y = df[target]

In [12]:
# check shape
X.shape, y.shape

((1338, 6), (1338,))

### **Distributions**

Let us now look at the distribution of each feature by plotting a histogram for each:

In [13]:
# Plots histogram for each feature using plotly library
plot_histograms(X)

Feature Distribution Insights

`age:` Roughly uniform distribution. This suggests good coverage across different age groups.

`sex:` The dataset has nearly equal numbers of male and female individuals, i.e., it is balanced with respect to gender.

`bmi:` Approximately normal distribution centered around 30, with some higher values indicating possible obesity.

`children:` This feature is right-skewed, with most individuals having between 0 and 2 children. Few individuals have 4 or more children.

`smoker:` Majority are non-smokers. This could be an important predictor of insurance charges.

`region:` The four regions are almost equally represented, which helps reduce bias based on geographical location.

### Distribution of target variable

In [14]:
# Plots histogram for target using plotly library
plot_histograms(pd.DataFrame(y), height=300)

Target Variable Distribution (charges)

The distribution of insurance charges is right-skewed. This indicates that most individuals incur relatively low medical expenses while a few have significantly high costs.

### **Univariate analysis (with respect to the target)**

In this step, we examine how each independent feature individually relates to the target variable (charges).

- For numeric features, we use scatterplots to visualize potential correlations with the target.

- For categorical features, we use boxplots to compare the distribution of charges across different categories.

This analysis helps identify which features might have a strong influence on insurance charges and provides direction for further modeling.

#### Numeric features

In [15]:
plot_univariate_numeric(
    X.select_dtypes(include=np.number),
    y
)

Insights from Univariate Analysis (Numeric Features)

`age:` As age increases, insurance charges generally increase as well. However, there is significant variance in charges across all age groups.

`bmi:` No strong trend is observed. However, individuals with a BMI over 30 (classified as obese) appear more likely to have charges exceeding $30,000. This trend may become clearer during bivariate analysis.

`children:` There is no obvious relationship between the number of children and charges. Interestingly, charges tend to slightly decrease as the number of children increases.
**Since children has only six unique values, we will treat it as a categorical variable for univariate analysis.**

#### Categorical features

In [16]:
plot_univariate_categorical(
    X[['sex', 'smoker', 'region', 'children']],
    y
)

Insights from Univariate Analysis (Categorical Features)

`sex:` There is no meaningful difference in insurance charges between male and female policyholders.

`smoker:` This is a highly influential feature. Smokers tend to incur significantly higher charges compared to non-smokers. This strong signal will likely be critical for modeling.

`region:` No major variation in charges across different regions, suggesting that geographic location may not be a key driver in determining healthcare costs.

`children:` No consistent pattern observed. Although individuals with 4 or more children appear to have lower charges, this could be attributed to the small number of observations in those categories (as noted in the distribution analysis).

### **Bivariate analysis (with respect to the target)**

In this step, we explore how pairs of features jointly relate to the target variable (charges). This can help uncover hidden interactions or compound effects between variables that may not be apparent from univariate analysis alone.

The approach depends on the data types of the feature pairs:

`Numeric–Numeric Pairs:` We use a correlation heatmap to visualize the strength and direction of linear relationships with the target.

`Categorical–Categorical Pairs:` We use boxplots to analyze how different category combinations relate to variations in charges.

`Categorical–Numeric Pairs:` We use scatterplots to investigate how a numeric feature’s relationship with the target varies across categories.

This analysis will guide us in selecting important features and help improve the performance of our predictive model.

#### **Numeric pairs: Correlation Heatmap**

We use a correlation heatmap to visualize linear relationships between numeric variables. This helps identify which features are strongly related to the target variable (`charges`) and to each other, which can inform feature selection and modeling decisions.

In [17]:
plot_heatmap(
    X[['age', 'bmi', 'children']],
    y,
    bins=10
)

Heatmap Insights:

The heatmap helps visualize relationships between numeric features and the target (`charges`). However, in this case, it does not reveal any new insights beyond what was already observed in the univariate analysis.

#### **Categorical pairs: Box Plots**

Box plots provide a visual summary of the distribution of `charges` across different categorical feature combinations. These plots help reveal patterns or interactions between categorical variables and the target.




In [18]:
plot_paired_boxplots(
    X[['sex', 'smoker', 'region']],
    y
)

**Key insights:**
- **Sex × Smoker**: Male smokers have a higher median charge (~36k) than female smokers (~29k).
- **Smoker × Region**: Smokers in the **southeast** and **southwest** regions tend to have higher median charges (~35–37k) than those in the **northeast** and **northwest** (~27–28k).
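The medians read off the box plots can also be tabulated directly. The sketch below uses a small made-up frame (`toy`, with invented charges) and `pivot_table` to compute the median charge for every sex × smoker combination; it is not the implementation of `plot_paired_boxplots`, which is not shown in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "smoker": ["yes", "no", "yes", "no", "yes", "no"],
    "charges": [36000.0, 7000.0, 29000.0, 8000.0, 38000.0, 6000.0],
})

# One row per sex, one column per smoker status, median charge in each cell.
medians = toy.pivot_table(
    index="sex", columns="smoker", values="charges", aggfunc="median"
)

print(medians.loc["male", "yes"])    # 37000.0
print(medians.loc["female", "yes"])  # 29000.0
```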

#### **Numeric-categorical pairs**

#### Paired Scatterplots (Categorical-Numeric Pairs)

These scatterplots help us visualize how pairs of features—particularly combinations of numeric and categorical variables—interact with the target (`charges`).

In [19]:
plot_paired_scatterplots(X, y)



#### Insights:
- **age-smoker**: There is a dense cluster of non-smokers under age 50 with `charges` consistently below \$10,000, indicating lower healthcare costs in this group.
- **bmi-smoker**: A distinct group of smokers with `BMI` > 30 shows `charges` exceeding \$30,000, suggesting high healthcare costs among obese smokers.

These interactions may highlight potential high-risk groups and will be important in downstream modeling and interpretation.
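One way to isolate such a group is a boolean flag that combines both conditions. The sketch below uses made-up rows and a hypothetical column name `obese_smoker`; the BMI threshold of 30 comes from the insight above, while everything else is an illustrative assumption.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "bmi": [27.9, 33.8, 35.2, 22.7, 31.4],
    "smoker": ["yes", "no", "yes", "no", "yes"],
    "charges": [17000.0, 1700.0, 42000.0, 4000.0, 38500.0],
})

# Flag rows that satisfy both conditions: BMI above 30 and smoker.
flagged = toy.assign(obese_smoker=(toy["bmi"] > 30) & (toy["smoker"] == "yes"))

# Compare group size and mean charge inside and outside the flagged group.
summary = flagged.groupby("obese_smoker")["charges"].agg(["count", "mean"])

print(summary.loc[True, "count"])  # 2
print(summary.loc[True, "mean"])   # 40250.0
```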

### **Collinearity (between features)**

Understanding collinearity—how strongly features relate to each other—is crucial for feature selection and model performance.

We will approach this based on the data types involved:

- **Numeric-Numeric pairs** ➝ *Pearson's correlation*
- **Categorical-Categorical pairs** ➝ *Chi-squared ($\chi^2$) test*
- **Categorical-Numeric pairs** ➝ *ANOVA (Analysis of Variance) test*

#### Numeric Features

We start by visualizing the relationships among numeric features using a **pairplot**. This allows us to spot potential linear relationships and outliers visually before calculating Pearson’s correlation.

In [20]:
px.scatter_matrix(
    X.select_dtypes(include=np.number)
)

Insights

It doesn't look like there is much correlation between any of the numeric features. To be sure, let us calculate and plot the **Pearson's correlation matrix**:

**Correlation**

The **correlation coefficient** quantifies the strength and direction of a linear relationship between two continuous variables. Its value ranges from –1 to +1:

- **+1**: Perfect positive linear relationship  
- **–1**: Perfect negative linear relationship  
- **0**: No linear relationship

It helps identify whether features move together (collinearity) and guides feature selection.
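To make the definition concrete, the sketch below computes Pearson's coefficient by hand and can be compared against `np.corrcoef`. The helper name `pearson_r` is illustrative; the `age` and `bmi` values are the five rows shown by `df.head()`, so the resulting number says nothing about the full dataset.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# First five rows of the dataset, as shown by df.head().
age = [19, 18, 28, 33, 32]
bmi = [27.9, 33.77, 33.0, 22.705, 28.88]

r = pearson_r(age, bmi)
print(np.isclose(r, np.corrcoef(age, bmi)[0, 1]))  # True

# Sanity checks at the two extremes of the range.
print(round(pearson_r([1, 2, 3], [2, 4, 6]), 6))   # 1.0
print(round(pearson_r([1, 2, 3], [3, 2, 1]), 6))   # -1.0
```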

In [21]:
px.imshow(X.select_dtypes(include=np.number).corr())

Insights:

Visualizing the Pearson correlation matrix confirms minimal collinearity among numeric features. The highest correlation coefficient is only 0.11, indicating that `age`, `bmi`, and `children` each provide largely independent information to the model.
This means we can include all three in our model without worrying about redundancy.

### Collinearity in Categorical pairs


#### Chi-Squared Test of Independence

To assess whether two categorical variables are related, we use the **chi-squared test of independence**.

The chi-squared test returns a p-value, which tells us whether the observed association is statistically significant. The null hypothesis is that the two features are independent.

* p < 0.05: the result is significant, i.e., we reject the null hypothesis and conclude that the two features are related.
* p ≥ 0.05: the result is not significant, i.e., we fail to reject the null hypothesis, so there is no clear evidence that the two features are related.

Let us calculate the $\chi^2$ values, p-values and degrees of freedom:

In [22]:
X_chi2 = chi2(X.select_dtypes(object))
X_chi2

|   | column1 | column2 | chi_squared | p_value | dof |
|---|---|---|---|---|---|
| 0 | sex | smoker | 7.392911 | 0.006548 | 1 |
| 1 | sex | region | 0.435137 | 0.932892 | 3 |
| 2 | smoker | region | 7.343478 | 0.06172 | 3 |


The only feature pair with a p-value less than 0.05 is `sex` and `smoker`, so we have enough evidence to conclude that these two features are associated.
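The `chi2` helper from `src.stats` is not shown in this notebook, so the following is only a sketch of how such a test could be run for a single pair of columns, assuming SciPy is available. It builds a contingency table with `pd.crosstab` and passes it to `scipy.stats.chi2_contingency`; the `toy` data is made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 10 males (6 smokers) and 10 females (2 smokers).
toy = pd.DataFrame({
    "sex": ["male"] * 10 + ["female"] * 10,
    "smoker": ["yes"] * 6 + ["no"] * 4 + ["yes"] * 2 + ["no"] * 8,
})

# Observed counts for every sex x smoker combination.
table = pd.crosstab(toy["sex"], toy["smoker"])

# Returns the test statistic, the p-value, the degrees of freedom
# and the counts expected under independence.
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(dof)  # 1, i.e. (2 - 1) * (2 - 1) for a 2x2 table
```

Note that for 2×2 tables `chi2_contingency` applies Yates' continuity correction by default, so its statistic can differ from an uncorrected calculation.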

### ANOVA Test (for Numeric-Categorical Feature Pairs)

We use ANOVA (Analysis of Variance) to check if the average values of a numeric feature (like age or BMI) differ significantly across the groups of a categorical feature (like sex or region).

If the test gives us a `p-value less than 0.05`, it means there is a statistically significant difference in group means i.e., the numeric variable depends on the categorical one.


In [23]:
X_anova = anova(X)
X_anova

|   | num_column | cat_column | f_stat | p_value |
|---|---|---|---|---|
| 0 | age | sex | 0.581369 | 0.4459107 |
| 1 | age | smoker | 0.836777 | 0.3604853 |
| 2 | age | region | 0.079782 | 0.9709891 |
| 3 | bmi | sex | 2.87897 | 0.08997637 |
| 4 | bmi | smoker | 0.018792 | 0.890985 |
| 5 | bmi | region | 39.495057 | 1.881839e-24 |
| 6 | children | sex | 0.393659 | 0.5304898 |
| 7 | children | smoker | 0.078664 | 0.7791596 |
| 8 | children | region | 0.717493 | 0.5415543 |


Insight

Only `bmi` and `region` show a statistically significant relationship (i.e., a p-value less than 0.05). This means the average BMI differs across regions.
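The `anova` helper from `src.stats` is not shown either, so the sketch below only illustrates the idea for one numeric-categorical pair, assuming SciPy is available. It groups a made-up `bmi` column by a made-up `region` column and runs a one-way ANOVA with `scipy.stats.f_oneway`.

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up BMI values with clearly different means per region.
toy = pd.DataFrame({
    "region": ["southeast"] * 4 + ["northwest"] * 4 + ["southwest"] * 4,
    "bmi": [33.8, 35.1, 36.0, 34.2,
            27.5, 28.9, 26.8, 29.3,
            30.1, 31.0, 29.5, 30.6],
})

# One array of BMI values per region.
groups = [g["bmi"].to_numpy() for _, g in toy.groupby("region")]

# One-way ANOVA: are the group means equal?
f_stat, p_value = f_oneway(*groups)

print(len(groups))     # 3
print(p_value < 0.05)  # True: the group means differ in this toy data
```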

### **Correlation (with respect to the target)**
Checking the correlation of each feature with the target is important because our baseline model is linear regression, which assumes a linear relationship between the predictors and the target.

#### Numeric Features: Pearson’s Correlation
This shows how strongly each numeric feature correlates with the target charges.

In [24]:
X_numeric = X.select_dtypes(include=np.number)
X_numeric['charges'] = y  # Add target for correlation
X_numeric.corr()['charges'].sort_values(ascending=False)


charges     1.000000
age         0.299008
bmi         0.198341
children    0.067998
Name: charges, dtype: float64
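An alternative that avoids adding the target as a column is `DataFrame.corrwith`, sketched below. The frame `X_num` and series `y` here hold only the five rows shown by `df.head()`, so the printed coefficients will not match the full-dataset output above.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, as shown by df.head().
X_num = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
})
y = pd.Series([16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
              name="charges")

# Pearson correlation of every column with y; X_num itself is left unchanged.
corr_with_target = X_num.corrwith(y).sort_values(ascending=False)
print(corr_with_target)

# Cross-check one value against NumPy.
print(np.isclose(corr_with_target["age"],
                 np.corrcoef(X_num["age"], y)[0, 1]))  # True
```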

Let us visualise the result:

In [25]:
plot_pearson_wrt_target(X, y)

#### Categorical Features: Boxplots or ANOVA

In [26]:
data_anova = anova(df) # Use df as it contains the target
anova_wrt_target = data_anova[data_anova['num_column']=='charges']
anova_wrt_target

|    | num_column | cat_column | f_stat | p_value |
|---|---|---|---|---|
| 9 | charges | sex | 4.399702 | 0.03613272 |
| 10 | charges | smoker | 2177.614868 | 8.271436e-283 |
| 11 | charges | region | 2.969627 | 0.03089336 |


Insight

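All three p-values are less than 0.05, so each of these categorical variables is significantly associated with the target (`charges`). The evidence is overwhelming for `smoker` but only marginal for `sex` (p ≈ 0.036) and `region` (p ≈ 0.031).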

##  Exploratory Data Analysis (EDA) Summary

---

###  Feature Distribution

- **Age** is roughly uniformly distributed, ensuring diverse age representation.
- **Sex** is balanced between males and females.
- **BMI** follows a near-normal distribution centered around 30, with a tail suggesting obesity in some individuals.
- **Children** is right-skewed, with most individuals having 0–2 children.
- **Smoker** status shows most individuals are non-smokers.
- **Region** is evenly distributed across all four geographic regions.

---

###  Target Variable (`charges`)

- `charges` is **right-skewed**, indicating most individuals incur low costs, while a few have **very high medical expenses**.

---

###  Univariate & Bivariate Analysis

#### **Numeric Features**
- **Age** and **charges** show a positive trend — older individuals tend to have higher medical costs.
- **BMI** shows no strong linear trend, but **BMI > 30** correlates with high charges (>30k), especially for smokers.
- **Children** has no strong pattern, with a slight decrease in charges for individuals with more children.

#### **Categorical Features**
- **Sex** does not significantly impact charges.
- **Smoker** status is highly predictive — smokers incur much higher costs.
- **Region** shows minimal impact on charges.
- **Boxplots** reveal:
  - Male smokers tend to incur higher charges than female smokers.
  - Smokers in the southeast and southwest incur the highest median charges.

---

###  Feature Relationships

#### **Correlation Analysis**
- **Pearson correlation** confirms **minimal collinearity** among numeric features (highest = 0.11). All numeric features can be included in a linear model.
- **Chi-Square test** shows a **statistically significant association** between `sex` and `smoker`.
- **ANOVA** reveals:
  - A significant difference in **BMI across regions**.
  - All three categorical variables (**sex**, **smoker**, **region**) are significantly associated with `charges`.






###  Key Takeaways for Modeling

- `smoker` is the strongest predictor of healthcare costs.
- `age`, `bmi`, and `smoker` show clear patterns with the target and should be prioritized in modeling.
- All numeric and categorical features are relevant and can be included, with no concerning multicollinearity or redundancy.