------------------------------------------------------------------------------------------

<h1 style="font-size:40px; background-color:darkorange;color:white;font-weight:800;padding:12px;border-radius:8px;text-align:center">Predicting Medical Expenses: Linear Regression Analysis</h1>

------------------------------------------------------------------------------------------

![](https://i.imgur.com/1EzyZvj.png)

------------------------------------------------------------------------------------------
<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">
<h3 style="font-size:20px; font-weight:bold"> About Dataset</h3>

**Introduction:**

This dataset provides insights into factors influencing customer charges. Each record represents a customer with various attributes. The goal is to predict the value of the "charges" column based on these attributes, such as age, sex, BMI, number of children, smoking habits, and region.

</div>

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">
Variables Description

------------------------------------------------------------------------------------------

Here’s an analysis of each variable (column) from the dataset:

1. **age (numeric)**:  
   - **Description**: Represents the age of the customer in years.
   - **Example Values**: 19, 18, 28, 33, 50, etc.
   - **Analysis**: Age is a key factor that could impact insurance charges. Older customers may have higher charges due to increased health risks.

2. **sex (categorical)**:  
   - **Description**: Represents the gender of the customer.
   - **Categories**: 
     - male
     - female
   - **Example Values**: "female", "male".
   - **Analysis**: Gender may play a role in healthcare costs, though this could vary depending on the region or the specific insurance policy.

3. **bmi (numeric)**:  
   - **Description**: Body Mass Index (BMI) of the customer, which is a measure of body fat based on height and weight.
   - **Example Values**: 27.9, 33.77, 22.705, 36.85, etc.
   - **Analysis**: BMI is an important factor in health insurance costs. Higher BMI values may be associated with higher charges due to increased health risks such as obesity, diabetes, or heart disease.

4. **children (numeric)**:  
   - **Description**: Number of children/dependents covered by the insurance plan.
   - **Example Values**: 0, 1, 3, etc.
   - **Analysis**: The number of children may affect insurance costs as plans with more dependents could have higher premiums or medical expenses.

5. **smoker (categorical)**:  
   - **Description**: Indicates whether the customer is a smoker.
   - **Categories**: 
     - yes
     - no
   - **Example Values**: "yes", "no".
   - **Analysis**: Smoking is strongly correlated with health risks, leading to significantly higher charges for smokers due to the potential for more severe health issues like lung disease, cancer, etc.

6. **region (categorical)**:  
   - **Description**: Geographic region where the customer resides.
   - **Categories**: 
     - southwest
     - southeast
     - northwest
     - northeast
   - **Example Values**: "southwest", "southeast", "northwest", "northeast".
   - **Analysis**: The region could influence healthcare costs due to differences in regional healthcare availability, cost of living, or state-level healthcare policies.

7. **charges (numeric)**:  
   - **Description**: The medical charges billed to the customer or their insurance provider.
   - **Example Values**: 16884.92, 1725.55, 4449.46, 29141.36, etc.
   - **Analysis**: This is the target variable we want to predict. It represents the total medical expenses incurred by the customer. Factors like age, BMI, and smoking status are likely to have a strong impact on this value.

### General Insights:
- **Age, BMI, and Smoking Habits** are likely to be the most influential factors affecting the **charges** due to their direct impact on health risks.
- **Gender, Number of Children, and Region** may also contribute to variations in healthcare costs but possibly to a lesser extent than the other factors.

</div>

----------------------------
<a id="contents_tabel"></a>

## Contents

----------------------------

<span style="font-size: 1.2em;line-height:1.3em">
    

- **[1 : Introduction](#1)**
- **[2 : Purpose](#2)**
- **[3 : Import Libraries](#3)** 
- **[4 : Dataset Preparation](#4)** 
- **[5 : Exploratory Data Analysis EDA](#5)** 
    - **[5.1 : Univariate Analysis](#5.1)** 
    
- **[8 : Conclusion](#8)** 


----------------------------

<a id="1"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">1 : Question</p>

How can ACME Insurance Inc. predict the annual medical expenditure for new customers based on their demographic and lifestyle information, and ensure that these predictions are transparent and explainable?

⬆️ [Go to Contents](#contents_tabel)

----------------------------

<a id="1"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">2 : Introduction</p>

ACME Insurance Inc. provides health insurance to customers across the United States, offering premiums based on individual risk factors. As the lead data scientist, you are tasked with creating a predictive system that estimates the annual medical expenses for new customers using factors such as age, sex, BMI, number of children, smoking habits, and region of residence. These estimates are crucial for determining the monthly insurance premiums offered to customers. The system must also comply with regulatory standards by being able to explain its predictions.

⬆️ [Go to Contents](#contents_tabel)

----------------------------

<a id="1"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">3 : Purpose</p>

The purpose of this project is to develop an automated, explainable system to estimate annual medical expenditures for ACME's new customers. This will allow ACME to tailor insurance premiums based on individual risk profiles, ensuring fair and accurate pricing, while maintaining transparency and regulatory compliance. The model will utilize customer attributes such as age, sex, BMI, smoking status, children, and region to predict medical charges.

⬆️ [Go to Contents](#contents_tabel)

----------------------------

<a id="3"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">4 : Import Libraries</p>

⬆️ [Go to Contents](#contents_tabel)

----------------------------

In [13]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.facecolor'] = '#00000000'
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy.stats import shapiro
import nbformat
import warnings
import joblib
warnings.filterwarnings('ignore')


----------------------------
<a id="4"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">5 : Dataset Preparation</p>

⬆️ [Go to Contents](#contents_tabel)

----------------------------

In [14]:
# Section: Data Loading
file_path = 'medical.csv'
df = pd.read_csv(file_path)
print("Data loaded successfully. Here's a preview:")


Data loaded successfully. Here's a preview:


In [15]:
df.head()

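|   | age | sex    | bmi    | children | smoker | region    | charges     |
|---|-----|--------|--------|----------|--------|-----------|-------------|
| 0 | 19  | female | 27.9   | 0        | yes    | southwest | 16884.924   |
| 1 | 18  | male   | 33.77  | 1        | no     | southeast | 1725.5523   |
| 2 | 28  | male   | 33.0   | 3        | no     | southeast | 4449.462    |
| 3 | 33  | male   | 22.705 | 0        | no     | northwest | 21984.47061 |
| 4 | 32  | male   | 28.88  | 0        | no     | northwest | 3866.8552   |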


In [16]:
df.shape

(1338, 7)

The dataset contains 1338 rows and 7 columns. Each row of the dataset contains information about one customer.

Our objective is to find a way to estimate the value in the "charges" column using the values in the other columns. If we can do so for the historical data, then we should be able to estimate charges for new customers too, simply by asking for information like their age, sex, BMI, number of children, smoking habits and region.

Let's check the data type for each column.

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB



Looks like "age", "children", "bmi" ([body mass index](https://en.wikipedia.org/wiki/Body_mass_index)) and "charges" are numbers, whereas "sex", "smoker" and "region" are strings (possibly categories). None of the columns contain any missing values, which saves us a fair bit of work!
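
If the string columns are indeed categories, they can be converted to pandas' `category` dtype. The snippet below is a minimal sketch on a few rows copied from the preview above; the names `sample`, `categorical_cols` and `converted` are illustrative and are not part of the original notebook.

```python
import pandas as pd

# A few rows copied from the df.head() preview; in the notebook, use `df` instead.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
})

# Columns that df.info() reports as `object` and that hold a small set of labels.
categorical_cols = ["sex", "smoker", "region"]

# astype with a dict converts only the listed columns and returns a new frame.
converted = sample.astype({col: "category" for col in categorical_cols})

print(converted.dtypes)
print(converted["region"].cat.categories.tolist())
```

Whether the conversion is worthwhile depends on what follows: it reduces memory use and makes the set of allowed labels explicit, but most encoders accept plain strings as well.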

In [18]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

This confirms that no column has missing values.

In [19]:
df.duplicated().sum()

1
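
The count above shows one row that repeats an earlier row. One possible way to inspect and remove such rows is sketched below. The rows in `sample` are hypothetical and only illustrate the calls; whether a repeated row is a true duplicate or two different customers with identical attributes is a judgment the data alone cannot settle.

```python
import pandas as pd

# Hypothetical rows for illustration only; in the notebook this would be `df`.
sample = pd.DataFrame({
    "age": [19, 18, 19, 28],
    "sex": ["male", "male", "male", "male"],
    "bmi": [30.59, 33.77, 30.59, 33.0],
    "charges": [1639.5631, 1725.5523, 1639.5631, 4449.462],
})

# keep=False marks every member of a duplicated group, not only the later copies,
# so both the original row and its repeat are visible side by side.
duplicate_rows = sample[sample.duplicated(keep=False)]
print(duplicate_rows)

# drop_duplicates keeps the first occurrence by default and returns a new frame.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), "->", len(deduplicated))
```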

In [20]:
df.describe()

|       | age       | bmi       | children | charges      |
|-------|-----------|-----------|----------|--------------|
| count | 1338.0    | 1338.0    | 1338.0   | 1338.0       |
| mean  | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std   | 14.04996  | 6.098187  | 1.205493 | 12110.011237 |
| min   | 18.0      | 15.96     | 0.0      | 1121.8739    |
| 25%   | 27.0      | 26.29625  | 0.0      | 4740.28715   |
| 50%   | 39.0      | 30.4      | 1.0      | 9382.033     |
| 75%   | 51.0      | 34.69375  | 2.0      | 16639.912515 |
| max   | 64.0      | 53.13     | 5.0      | 63770.42801  |



------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

1. **age**:

</div>

------------------------------------------------------------------------------------------


   - **Count**: 1338 records in the dataset.
   - **Mean**: The average age is approximately 39.21 years.
   - **Standard Deviation (std)**: 14.05, indicating that ages vary quite a bit around the mean.
   - **Min/Max**: The youngest customer is 18, while the oldest is 64.
   - **Percentiles**:
     - **25%**: 27 years, meaning 25% of the customers are younger than this.
     - **50% (Median)**: 39 years, indicating that half the customers are younger than 39, half are older.
     - **75%**: 51 years, meaning 25% of the customers are older than 51.

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

2. **bmi (Body Mass Index)**:

</div>

------------------------------------------------------------------------------------------


   - **Mean**: The average BMI is around 30.66, indicating many customers may be in the overweight or obese category.
   - **Standard Deviation (std)**: 6.10, showing variation in BMI values.
   - **Min/Max**: The lowest BMI is 15.96 (which may be underweight), while the highest is 53.13 (suggesting severe obesity).
   - **Percentiles**:
     - **25%**: A BMI of 26.30, indicating that 25% of the customers have a BMI below this.
     - **50% (Median)**: 30.4, indicating that half the customers have a BMI below 30.4.
     - **75%**: 34.69, meaning 25% of the customers have a BMI above this value.

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

3. **children**:

</div>

------------------------------------------------------------------------------------------

   - **Mean**: On average, customers have about 1.09 children.
   - **Standard Deviation (std)**: 1.21, showing variation in the number of children.
   - **Min/Max**: Some customers have no children (min = 0), while the maximum number of children is 5.
   - **Percentiles**:
     - **25%**: 0 children, meaning at least 25% of customers have no children.
     - **50% (Median)**: 1 child, indicating that at least half the customers have one or fewer children.
     - **75%**: 2 children, meaning at least 75% of customers have 2 or fewer children.

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

4. **charges** (Medical Costs):

</div>

------------------------------------------------------------------------------------------

   - **Mean**: The average medical charge is approximately $13,270.42.
   - **Standard Deviation (std)**: $12,110.01, nearly as large as the mean, showing a large variation in medical expenses.
   - **Min/Max**: The lowest medical charge is $1,121.87, while the highest is $63,770.43.
   - **Percentiles**:
     - **25%**: $4,740.29, meaning 25% of customers are charged below this amount.
     - **50% (Median)**: $9,382.03, indicating that half the customers are charged less than this amount.
     - **75%**: $16,639.91, meaning 25% of customers are charged more than this amount.

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

5. **Summary:**

</div>

------------------------------------------------------------------------------------------

- **Age and BMI** show moderate variation among customers, with ages ranging from 18 to 64 and BMIs between 15.96 and 53.13.
- **Number of children**: most customers have between 0 and 2 children.
- **Charges** vary widely. The mean ($13,270.42) is well above the median ($9,382.03), so the distribution is skewed to the right by some very high charges, likely influenced by factors like age, BMI, and smoking status.
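
The spread in charges can be made concrete with the figures from the `describe()` output. The sketch below applies Tukey's 1.5 × IQR rule of thumb to flag unusually high values. That rule is a convention chosen here for illustration, not something the notebook itself applies, and the variable names are illustrative.

```python
# Summary statistics for "charges", copied from the describe() output above.
mean, median = 13270.422265, 9382.033
q1, q3 = 4740.28715, 16639.912515
maximum = 63770.42801

# Interquartile range: the width of the middle 50% of the data.
iqr = q3 - q1

# Tukey's rule of thumb: values above Q3 + 1.5 * IQR are treated as high outliers.
upper_fence = q3 + 1.5 * iqr

print(f"mean / median       = {mean / median:.2f}")
print(f"IQR                 = {iqr:,.2f}")
print(f"upper fence         = {upper_fence:,.2f}")
print(f"max above the fence = {maximum > upper_fence}")
```

A mean well above the median together with a maximum far beyond the upper fence is consistent with a right-skewed distribution, which is worth keeping in mind when fitting a linear model to the raw charges.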

----------------------------

<a id="5"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">6 :  Exploratory Analysis and Visualization</p>

⬆️ [Go to Contents](#contents_tabel)

----------------------------

Let's explore the data by visualizing the distribution of values in some columns of the dataset, and the relationships between "charges" and other columns.

### Age

Age is a numeric column. The minimum age in the dataset is 18 and the maximum age is 64. Thus, we can visualize the distribution of age using a histogram with 47 bins (one for each year) and a box plot. We'll use plotly to make the chart interactive, but you can create similar charts using Seaborn.

In [22]:
df.age.describe()

count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

In [27]:
sns.set(style="whitegrid")
fig = px.histogram(df, 
                   x='age', 
                   marginal='box', 
                   nbins=47, 
                   title='Distribution of Age')
fig.update_layout(bargap=0.1)
fig.update_traces(marker_color="darkorange")
fig.show()

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:30px; font-weight:bold">

**Summary:**

</div>

------------------------------------------------------------------------------------------

The distribution of ages in the dataset is almost uniform, with 20-30 customers at every age, except for the ages 18 and 19, which seem to have over twice as many customers as other ages. The uniform distribution might arise from the fact that there isn't a big variation in the [number of people of any given age](https://www.statista.com/statistics/241488/population-of-the-us-by-sex-and-age/) (between 18 & 64) in the USA.
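
The per-age counts described above can be checked numerically rather than read off the chart. The following is a minimal sketch: `age_counts` is a hypothetical helper, and `sample` holds made-up ages purely to show the mechanics. In the notebook, the real `df` would be passed instead.

```python
import pandas as pd

def age_counts(frame: pd.DataFrame) -> pd.Series:
    """Return the number of rows for each age, sorted by age."""
    return frame["age"].value_counts().sort_index()

# Hypothetical ages for illustration; in the notebook, call age_counts(df).
sample = pd.DataFrame({"age": [18, 18, 18, 19, 19, 20, 21, 22]})
counts = age_counts(sample)

# Compare each age's count with the median count across all ages.
ratio_to_typical = counts / counts.median()

# Ages with at least twice the typical number of customers.
overrepresented = ratio_to_typical[ratio_to_typical >= 2]
print(overrepresented)
```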


The reason why there are over twice as many customers at ages 18 and 19 compared to other ages could be due to a combination of life-stage factors and health insurance-related policies:

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

1. **New Independence**:

</div>

------------------------------------------------------------------------------------------

At age 18, individuals in the U.S. become legal adults and can take out a health insurance policy in their own name. Although dependents can generally remain on a parent's plan until age 26, some young adults may choose or need to buy their own coverage at this point.

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

2. **Affordable Premiums for Younger Age Groups**: 

</div>

------------------------------------------------------------------------------------------

Younger individuals, particularly those aged 18 and 19, generally have lower health risks and therefore lower premiums. Health insurance providers often target these age groups with attractive pricing, making insurance more accessible and affordable at these ages.

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

3. **Student Enrollment**: 

</div>

------------------------------------------------------------------------------------------

Many people aged 18 and 19 are either starting or continuing their college education. Health insurance is often mandatory for students, and many of them might purchase new plans either through their schools or independently, contributing to the larger number of customers in this age group.

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

4. **Policy Renewals and First-Time Buyers**: 

</div>

------------------------------------------------------------------------------------------

Individuals in this age group might also be transitioning from dependent to individual plans. As a result, a significant portion of customers in this dataset might be first-time health insurance buyers, which can explain the higher customer count for these ages.

These factors, none of which can be verified from the dataset itself, may collectively contribute to the overrepresentation of customers aged 18 and 19 compared to other age groups.

### Body Mass Index

Let's look at the distribution of BMI (Body Mass Index) of customers, using a histogram and box plot.

In [28]:
fig = px.histogram(df, 
                   x='bmi', 
                   marginal='box', 
                   # Set the color once here instead of passing 'red' and overriding it later.
                   color_discrete_sequence=['darkorange'], 
                   title='Distribution of BMI (Body Mass Index)')
fig.update_layout(bargap=0.1)
fig.show()

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:30px; font-weight:bold">

**Summary:**

</div>

------------------------------------------------------------------------------------------

The BMI values appear to follow an approximately [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) centered around the value 30, with a few outliers towards the right. Here's how BMI values can be interpreted ([source](https://study.com/academy/lesson/what-is-bmi-definition-formula-calculation.html)):


------------------------------------------------------------------------------------------

![](https://i.imgur.com/lh23OiY.jpg)
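
The interpretation of BMI values can also be expressed as a small function. The sketch below assumes the commonly used WHO cut-offs (18.5, 25 and 30); the chart above may draw its bands slightly differently, and `bmi_category` is a hypothetical helper rather than part of the original notebook.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value to a weight category using the assumed WHO cut-offs."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

# BMI values that appear earlier in this notebook (example values, min and max).
for value in [15.96, 22.705, 27.9, 33.77, 53.13]:
    print(value, "->", bmi_category(value))
```

With these cut-offs, the dataset's median BMI of 30.4 falls just inside the "obese" band, which matches the earlier remark that many customers are overweight or obese.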


------------------------------------------------------------------------------------------

The difference in the distributions of **ages** and **BMIs** can be explained by the nature of how these variables are spread in the population and the factors influencing each:

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

### **1. Age Distribution (Uniform Distribution)**:

</div>

------------------------------------------------------------------------------------------


- **Why it's uniform**: Age is a **demographic factor** that tends to be evenly distributed across a population within a specific range (in this case, 18 to 64). In this dataset, the number of customers at each age is fairly consistent because there is not much variation in the overall population numbers for different ages between 18 and 64 in the U.S. 
- **Reason for uniformity**: This uniformity arises from the fact that people are continuously aging and there is no significant drop or peak in the number of people at different ages in this range. Hence, each age group has roughly the same number of individuals.

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

### **2. BMI Distribution (Gaussian Distribution)**:

</div>

------------------------------------------------------------------------------------------

- **Why it's Gaussian**: BMI is a **health-related measurement** that tends to follow a normal distribution because most people’s BMI clusters around an average (around 30 in this dataset), with fewer individuals at the extreme ends (either very low or very high BMI). 
- **Reason for normality**: BMI is influenced by a combination of factors such as genetics, lifestyle, and diet, and most people tend to have values around the average due to shared biological and environmental influences. The few individuals with very low or very high BMI values represent outliers, possibly due to unusual circumstances like health conditions, extreme fitness, or obesity.
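
Whether BMI is actually close to normal can be tested rather than judged by eye. The sketch below uses the Shapiro-Wilk test from `scipy.stats`, which the notebook already imports. Because the CSV is not available here, the data is a simulated stand-in with the mean and standard deviation reported by `describe()`; in the notebook, `df["bmi"]` would be passed instead, and the result may differ because of the right-hand outliers.

```python
import numpy as np
from scipy.stats import shapiro

# Simulated stand-in for df["bmi"]: mean and std come from describe(),
# but the individual values are randomly generated, not real customers.
rng = np.random.default_rng(seed=0)
simulated_bmi = rng.normal(loc=30.66, scale=6.10, size=1338)

# Null hypothesis of Shapiro-Wilk: the sample was drawn from a normal distribution.
statistic, p_value = shapiro(simulated_bmi)
print(f"W = {statistic:.4f}, p = {p_value:.4f}")

# A small p-value (for example below 0.05) is evidence against normality.
looks_normal = p_value >= 0.05
print("No evidence against normality" if looks_normal else "Normality rejected")
```

With over a thousand observations the test is sensitive to small departures, so a rejection on the real data would not necessarily contradict the "roughly bell-shaped" description above.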

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">

### **Summary**:

</div>

------------------------------------------------------------------------------------------


- **Age** is distributed uniformly because it is a demographic attribute that does not vary much in number across different ages within the working population.
- **BMI** is distributed normally because it is a health metric influenced by many variables that cluster most people around an average value, with fewer people in the extreme ranges.