<h1 style="font-size:40px; background-color:darkorange;color:white;font-weight:800;padding:12px;border-radius:8px;text-align:center">Predicting Medical Expenses: Linear Regression Analysis</h1>

------------------------------------------------------------------------------------------
<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">
<h3 style="font-size:20px; font-weight:bold"> About Dataset</h3>

**Introduction:**

This dataset describes factors that influence the medical charges billed to health insurance customers. Each record represents one customer. The goal is to predict the "charges" column from the other attributes: age, sex, BMI, number of children, smoking habits, and region.

</div>

------------------------------------------------------------------------------------------

<div style="background-color:darkorange; color:white; padding:12px;border-radius:8px; font-size:20px; font-weight:bold">
Variables Description

------------------------------------------------------------------------------------------

Here’s an analysis of each variable (column) from the dataset:

1. **age (numeric)**:  
   - **Description**: Represents the age of the customer in years.
   - **Example Values**: 19, 18, 28, 33, 50, etc.
   - **Analysis**: Age is a key factor that could impact insurance charges. Older customers may have higher charges due to increased health risks.

2. **sex (categorical)**:  
   - **Description**: Represents the gender of the customer.
   - **Categories**: 
     - male
     - female
   - **Example Values**: "female", "male".
   - **Analysis**: Gender may play a role in healthcare costs, though this could vary depending on the region or the specific insurance policy.

3. **bmi (numeric)**:  
   - **Description**: Body Mass Index (BMI) of the customer: weight in kilograms divided by the square of height in metres, used as a rough proxy for body fat.
   - **Example Values**: 27.9, 33.77, 22.705, 36.85, etc.
   - **Analysis**: BMI is an important factor in health insurance costs. Higher BMI values may be associated with higher charges due to increased health risks such as obesity, diabetes, or heart disease.

4. **children (numeric)**:  
   - **Description**: Number of children/dependents covered by the insurance plan.
   - **Example Values**: 0, 1, 3, etc.
   - **Analysis**: The number of children may affect insurance costs as plans with more dependents could have higher premiums or medical expenses.

5. **smoker (categorical)**:  
   - **Description**: Indicates whether the customer is a smoker.
   - **Categories**: 
     - yes
     - no
   - **Example Values**: "yes", "no".
   - **Analysis**: Smoking is strongly correlated with health risks, leading to significantly higher charges for smokers due to the potential for more severe health issues like lung disease, cancer, etc.

6. **region (categorical)**:  
   - **Description**: Geographic region where the customer resides.
   - **Categories**: 
     - southwest
     - southeast
     - northwest
     - northeast
   - **Example Values**: "southwest", "southeast", "northwest", "northeast".
   - **Analysis**: The region could influence healthcare costs due to differences in regional healthcare availability, cost of living, or state-level healthcare policies.

7. **charges (numeric)**:  
   - **Description**: The medical charges billed to the customer or their insurance provider.
   - **Example Values**: 16884.92, 1725.55, 4449.46, 29141.36, etc.
   - **Analysis**: This is the target variable we want to predict. It represents the total medical expenses incurred by the customer. Factors like age, BMI, and smoking status are likely to have a strong impact on this value.

### General Insights:
- **Age, BMI, and Smoking Habits** are likely to be the most influential factors affecting the **charges** due to their direct impact on health risks.
- **Gender, Number of Children, and Region** may also contribute to variations in healthcare costs but possibly to a lesser extent than the other factors.

</div>
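
For reference, BMI is conventionally computed as weight in kilograms divided by the square of height in metres. The dataset stores only the resulting value, so the helper below is purely illustrative; the function name `bmi` and the sample inputs are assumptions, not part of the notebook.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))
```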

----------------------------
<a id="contents_tabel"></a>

## Contents

----------------------------

<span style="font-size: 1.2em;line-height:1.3em">
    

- **[1 : Introduction](#1)**
- **[2 : Purpose](#2)**
- **[3 : Import Libraries](#3)** 
- **[4 : Dataset Preparation](#4)** 
- **[5 : Exploratory Data Analysis (EDA)](#5)** 
    - **[5.1 : Univariate Analysis](#5.1)** 
    
- **[8 : Conclusion](#8)** 


----------------------------

<a id="1"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">1 : Question</p>

How can ACME Insurance Inc. predict the annual medical expenditure for new customers based on their demographic and lifestyle information, and ensure that these predictions are transparent and explainable?

⬆️ [Go to Contents](#contents_tabel)

----------------------------

<a id="1"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">2 : Introduction</p>

ACME Insurance Inc. provides health insurance to customers across the United States, offering premiums based on individual risk factors. As the lead data scientist, you are tasked with creating a predictive system that estimates the annual medical expenses for new customers using factors such as age, sex, BMI, number of children, smoking habits, and region of residence. These estimates are crucial for determining the monthly insurance premiums offered to customers. The system must also comply with regulatory standards by being able to explain its predictions.

⬆️ [Go to Contents](#contents_tabel)

----------------------------

<a id="2"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">3 : Purpose</p>

The purpose of this project is to develop an automated, explainable system to estimate annual medical expenditures for ACME's new customers. This will allow ACME to tailor insurance premiums based on individual risk profiles, ensuring fair and accurate pricing, while maintaining transparency and regulatory compliance. The model will utilize customer attributes such as age, sex, BMI, smoking status, children, and region to predict medical charges.

⬆️ [Go to Contents](#contents_tabel)

----------------------------

<a id="3"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">4 : Import Libraries</p>

⬆️ [Go to Contents](#contents_tabel)

----------------------------

In [2]:
# Importing libraries
# Data handling
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
# Modeling and evaluation (scikit-learn)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.linear_model import LinearRegression  # regression model for the continuous "charges" target
from sklearn.metrics import mean_squared_error, r2_score  # regression metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Statistics and utilities
from scipy.stats import shapiro
import nbformat
import joblib
import warnings
# Notebook display settings
%matplotlib inline
sns.set_style('darkgrid')
warnings.filterwarnings('ignore')


----------------------------
<a id="4"></a>
# <p style="background-color:darkorange; color:white; font-family:calibri; font-size:130%; color:white; text-align:left; border-radius:8px; padding:10px">5 : Dataset Preparation</p>

⬆️ [Go to Contents](#contents_tabel)

----------------------------

In [5]:
# Section: Data Loading
file_path = 'medical.csv'
df = pd.read_csv(file_path)
print("Data loaded successfully. Here's a preview:")


Data loaded successfully. Here's a preview:


In [6]:
df.head()

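|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |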


Our objective is to estimate the value in the "charges" column from the values in the other columns. If we can do so for the historical data, then we should be able to estimate charges for new customers too, simply by asking for their age, sex, BMI, number of children, smoking habits, and region.
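
As a minimal sketch of this framing, the snippet below separates the predictors from the target. It rebuilds only the five preview rows shown above so that it runs on its own; in the notebook, `df` would be the full DataFrame loaded from `medical.csv`. The names `preview`, `X`, and `y` are illustrative assumptions.

```python
import pandas as pd

# The five rows from the df.head() preview (not the full dataset).
preview = pd.DataFrame(
    {
        "age": [19, 18, 28, 33, 32],
        "sex": ["female", "male", "male", "male", "male"],
        "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
        "children": [0, 1, 3, 0, 0],
        "smoker": ["yes", "no", "no", "no", "no"],
        "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
        "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
    }
)

# Predictors: every column except the target.
X = preview.drop(columns=["charges"])
# Target: the value we want to estimate.
y = preview["charges"]

print(X.shape, y.shape)
```

Note that "sex", "smoker", and "region" are still strings at this point, so they would need to be encoded as numbers before a linear model could use them.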

Let's check the data type for each column.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB



The dataset contains 1338 rows and 7 columns. Each row of the dataset contains information about one customer. 

It looks like "age" and "children" are integers, "bmi" ([body mass index](https://en.wikipedia.org/wiki/Body_mass_index)) and "charges" are floating-point numbers, whereas "sex", "smoker" and "region" are strings (possibly categories). None of the columns contains missing values, which saves us a fair bit of work!
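
The same checks can be done programmatically rather than by reading the `df.info()` output. The sketch below uses only the five preview rows so that it is self-contained; `preview`, `numeric_cols`, `categorical_cols`, and `typed` are illustrative names. Converting the string columns to the pandas `category` dtype is one optional way to make the categorical reading explicit, not a step the notebook itself takes here.

```python
import pandas as pd

# The five rows from the df.head() preview (not the full dataset).
preview = pd.DataFrame(
    {
        "age": [19, 18, 28, 33, 32],
        "sex": ["female", "male", "male", "male", "male"],
        "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
        "children": [0, 1, 3, 0, 0],
        "smoker": ["yes", "no", "no", "no", "no"],
        "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
        "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
    }
)

# Count missing values per column; all zeros means nothing to impute.
missing_per_column = preview.isna().sum()

# Split column names by kind: numeric vs. everything else (the string columns).
numeric_cols = preview.select_dtypes(include="number").columns.tolist()
categorical_cols = preview.select_dtypes(exclude="number").columns.tolist()

# Optionally store the string columns as categoricals.
typed = preview.astype({col: "category" for col in categorical_cols})

print(missing_per_column.sum())
print(numeric_cols)
print(categorical_cols)
```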