# Healthcare and Insurance Cost Analysis

#### 🩺 Introduction

Healthcare costs vary widely depending on demographics, lifestyle, and health conditions.  
This dataset contains information on **100,000 individuals**, including:

- Age, gender, and region  
- Income and socioeconomic factors  
- Lifestyle habits (BMI, smoking, exercise)  
- Health conditions  
- Insurance information  
- Annual medical expenditures  

The goal of this analysis is to explore the factors influencing medical costs and identify patterns that can support prediction and risk assessment.

---

#### 🎯 Key Questions

1. **Demographics and Costs**: How do demographic factors (age, gender, region) relate to annual medical costs?  
2. **Lifestyle Impact**: How do lifestyle factors (BMI, smoking, exercise) affect healthcare expenditures?  
3. **Health Conditions**: Which health conditions are associated with higher medical costs?  
4. **Cost Prediction**: Can we predict annual medical costs using demographic, lifestyle, and health features?  
5. **Risk Classification**: Can we classify individuals by risk level based on their profiles?


#### 📦 Next Steps

- Set up the project structure.
- Install the required Python libraries.
- Begin exploratory data analysis (EDA).

---

#### ⚙️ Required Libraries

In [2]:
%pip install -q -U pandas numpy matplotlib seaborn

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install -q -U watermark

Note: you may need to restart the kernel to use updated packages.


### 📦 Project Libraries: Imports

This section imports all the Python libraries used in this project. Keeping the imports in one place helps with reproducibility and makes the project's dependencies easy to see.

In [4]:
# Importing the library for data manipulation in tables
import pandas as pd 

# Importing the NumPy library for mathematical operations and arrays
import numpy as np  

# Importing the Matplotlib library for generating plots
import matplotlib.pyplot as plt  

# Importing the Seaborn library for statistical data visualization
import seaborn as sns  

# Jupyter Notebook magic command to display plots directly in the notebook
%matplotlib inline

In [5]:
# Load the watermark extension
%reload_ext watermark

# Display metadata for your notebook
%watermark -a "Maykon Analysis" -d -u -v -p numpy,pandas,matplotlib,seaborn

Author: Maykon Analysis

Last updated: 2025-11-01

Python implementation: CPython
Python version       : 3.13.7
IPython version      : 9.6.0

numpy     : 2.3.4
pandas    : 2.3.3
matplotlib: 3.10.7
seaborn   : 0.13.2
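
If the `watermark` extension is not available, a similar version report can be produced with the standard library alone. The snippet below is a minimal sketch, not a replacement for the cell above; the helper name `package_versions` is made up for illustration.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print("Python", platform.python_version())
for name, ver in package_versions(["numpy", "pandas", "matplotlib", "seaborn"]).items():
    print(f"{name:<10}: {ver or 'not installed'}")
```

Recording these versions alongside the analysis makes it easier to rebuild the same environment later.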



#### Loading the Dataset with pandas

In [6]:
data = r"C:\Users\LarTI\OneDrive\Desktop\Projects\Data-Analysis-Medical-Insurance-Cost-Prediction\data\medical_insurance.csv"

df = pd.read_csv(data)
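
The absolute path above only works on one machine. A possible alternative, sketched here under the assumption that the notebook is run from the project root and the CSV sits in a `data` subfolder, is to build the path with `pathlib` and fail early with a clear message when the file is missing. The helper `load_csv` and the constant `DATA_PATH` are illustrative names, not part of the original project.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, raising a clear error if the file does not exist."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path.resolve()}")
    return pd.read_csv(path)


# Assumed layout: <project root>/data/medical_insurance.csv.
# Adjust this to match the real project structure.
DATA_PATH = Path("data") / "medical_insurance.csv"

if DATA_PATH.is_file():
    df = load_csv(DATA_PATH)
    print(df.shape)
else:
    print(f"Adjust DATA_PATH; nothing found at {DATA_PATH.resolve()}")
```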

In [7]:
df.shape  # Returns a tuple (rows, columns) with the DataFrame's dimensions

(100000, 54)

In [10]:
df.head()  # Show the first 5 rows

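|  | person_id | age | sex | region | urban_rural | income | education | marital_status | employment_status | household_size | ... | liver_disease | arthritis | mental_health | proc_imaging_count | proc_surgery_count | proc_physio_count | proc_consult_count | proc_lab_count | is_high_risk | had_major_procedure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 75722 | 52 | Female | North | Suburban | 22700.0 | Doctorate | Married | Retired | 3 | ... | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 |
| 1 | 80185 | 79 | Female | North | Urban | 12800.0 | No HS | Married | Employed | 3 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | 19865 | 68 | Male | North | Rural | 40700.0 | HS | Married | Retired | 5 | ... | 0 | 0 | 1 | 1 | 0 | 2 | 1 | 0 | 1 | 0 |
| 3 | 76700 | 15 | Male | North | Suburban | 15600.0 | Some College | Married | Self-employed | 5 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 92992 | 53 | Male | Central | Suburban | 89600.0 | Doctorate | Married | Self-employed | 2 | ... | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 0 | 1 | 0 |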


In [11]:
df.tail()  # Show the last 5 rows

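|  | person_id | age | sex | region | urban_rural | income | education | marital_status | employment_status | household_size | ... | liver_disease | arthritis | mental_health | proc_imaging_count | proc_surgery_count | proc_physio_count | proc_consult_count | proc_lab_count | is_high_risk | had_major_procedure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 99995 | 6266 | 50 | Male | West | Urban | 127200.0 | No HS | Married | Employed | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 99996 | 54887 | 42 | Male | East | Suburban | 21600.0 | HS | Married | Employed | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 99997 | 76821 | 41 | Male | West | Rural | 81900.0 | HS | Divorced | Unemployed | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 99998 | 861 | 51 | Female | South | Urban | 43400.0 | Doctorate | Single | Unemployed | 3 | ... | 0 | 0 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | 0 |
| 99999 | 15796 | 44 | Female | South | Rural | 43700.0 | Some College | Married | Employed | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |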


In [12]:
df.info()  # Show a summary of the DataFrame's structure and data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 54 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   person_id                    100000 non-null  int64  
 1   age                          100000 non-null  int64  
 2   sex                          100000 non-null  object 
 3   region                       100000 non-null  object 
 4   urban_rural                  100000 non-null  object 
 5   income                       100000 non-null  float64
 6   education                    100000 non-null  object 
 7   marital_status               100000 non-null  object 
 8   employment_status            100000 non-null  object 
 9   household_size               100000 non-null  int64  
 10  dependents                   100000 non-null  int64  
 11  bmi                          100000 non-null  float64
 12  smoker                       100000 non-null  object 
 13  

In [20]:
df.describe(include='all')  # Summary of all columns (numeric stats + categorical info)

|  | person_id | age | sex | region | urban_rural | income | education | marital_status | employment_status | household_size | ... | liver_disease | arthritis | mental_health | proc_imaging_count | proc_surgery_count | proc_physio_count | proc_consult_count | proc_lab_count | is_high_risk | had_major_procedure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 100000.0 | 100000.0 | 100000 | 100000 | 100000 | 100000.0 | 100000 | 100000 | 100000 | 100000.0 | ... | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| unique |  |  | 3 | 5 | 3 |  | 6 | 4 | 4 |  | ... |  |  |  |  |  |  |  |  |  |  |
| top |  |  | Female | South | Urban |  | Bachelors | Married | Employed |  | ... |  |  |  |  |  |  |  |  |  |  |
| freq |  |  | 49193 | 28029 | 60019 |  | 27996 | 53252 | 55269 |  | ... |  |  |  |  |  |  |  |  |  |  |
| mean | 50000.5 | 47.5215 |  |  |  | 49873.9 |  |  |  | 2.4309 | ... | 0.01477 | 0.10831 | 0.13014 | 0.50853 | 0.15869 | 0.50839 | 0.50933 | 0.50914 | 0.36781 | 0.1697 |
| std | 28867.657797 | 15.988752 |  |  |  | 46800.21 |  |  |  | 1.075126 | ... | 0.120632 | 0.310773 | 0.336459 | 0.749755 | 0.463562 | 0.747218 | 0.75363 | 0.750455 | 0.482212 | 0.375371 |
| min | 1.0 | 0.0 |  |  |  | 1100.0 |  |  |  | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 25000.75 | 37.0 |  |  |  | 21100.0 |  |  |  | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 50000.5 | 48.0 |  |  |  | 36200.0 |  |  |  | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 75000.25 | 58.0 |  |  |  | 62200.0 |  |  |  | 3.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
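
Because `include='all'` mixes numeric statistics with categorical ones, most cells in the combined summary are empty. One way to get two denser tables is to describe numeric and non-numeric columns separately. The sketch below uses a small stand-in frame built from the first five rows shown by `df.head()`; on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Stand-in frame: four columns from the first five rows of df.head().
sample = pd.DataFrame({
    "age": [52, 79, 68, 15, 53],
    "income": [22700.0, 12800.0, 40700.0, 15600.0, 89600.0],
    "sex": ["Female", "Female", "Male", "Male", "Male"],
    "region": ["North", "North", "North", "North", "Central"],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = sample.describe(include="number")

# Everything else: count, unique, top, freq.
categorical_summary = sample.describe(exclude="number")

print(numeric_summary)
print(categorical_summary)
```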


In [None]:
df.dtypes  # Show the data type of each column


person_id                        int64
age                              int64
sex                             object
region                          object
urban_rural                     object
income                         float64
education                       object
marital_status                  object
employment_status               object
household_size                   int64
dependents                       int64
bmi                            float64
smoker                          object
alcohol_freq                    object
visits_last_year                 int64
hospitalizations_last_3yrs       int64
days_hospitalized_last_3yrs      int64
medication_count                 int64
systolic_bp                    float64
diastolic_bp                   float64
ldl                            float64
hba1c                          float64
plan_type                       object
network_tier                    object
deductible                       int64
copay                    
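
Several columns are stored as `object` even though they hold a small set of repeated labels (`sex`, `region`, and so on). Converting such columns to the `category` dtype usually reduces memory use on a table of this size. This is a hedged sketch on synthetic data; whether the conversion is worthwhile for this project is a judgment call, and the row count below is arbitrary.

```python
import pandas as pd

# Synthetic stand-in with two low-cardinality text columns and one numeric column.
sample = pd.DataFrame({
    "sex": ["Female", "Male"] * 5000,
    "region": ["North", "South", "East", "West", "Central"] * 2000,
    "age": list(range(20, 70)) * 200,
})

# Pick every non-numeric column and convert it to the category dtype.
text_cols = sample.select_dtypes(exclude="number").columns
converted = sample.astype({col: "category" for col in text_cols})

before = sample.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```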

In [None]:
df.isnull().sum()  # Count missing values in each column


person_id                          0
age                                0
sex                                0
region                             0
urban_rural                        0
income                             0
education                          0
marital_status                     0
employment_status                  0
household_size                     0
dependents                         0
bmi                                0
smoker                             0
alcohol_freq                   30083
visits_last_year                   0
hospitalizations_last_3yrs         0
days_hospitalized_last_3yrs        0
medication_count                   0
systolic_bp                        0
diastolic_bp                       0
ldl                                0
hba1c                              0
plan_type                          0
network_tier                       0
deductible                         0
copay                              0
policy_term_years                  0
p
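
With 54 columns, the full listing is long and only `alcohol_freq` stands out (30,083 of 100,000 rows, about 30%). A compact report that keeps only columns with missing values and adds a percentage can be easier to scan. The sketch below runs on a tiny made-up frame; on the real data, `sample` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one fully populated and one partly missing column.
sample = pd.DataFrame({
    "age": [52, 79, 68, 15],
    "alcohol_freq": ["Occasional", np.nan, "Weekly", np.nan],
})

missing = sample.isnull().sum()
report = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(sample) * 100).round(2),
})

# Keep only columns that actually have missing values, largest first.
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```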

In [None]:
df['alcohol_freq'].value_counts(dropna=False)  # Count occurrences of each category in 'alcohol_freq', including missing values

alcohol_freq
Occasional    45078
NaN           30083
Weekly        19833
Daily          5006
Name: count, dtype: int64
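
Before treating these 30,083 values as unknown, it may be worth checking how they are written in the raw file. This is only an assumption, since the CSV itself is not shown here: if non-drinkers are stored as the literal text `None`, recent pandas versions parse that string as a missing value by default, and the `NaN` rows would actually be a real category. The sketch below demonstrates the effect on a made-up four-row CSV.

```python
import io

import pandas as pd

# Made-up miniature CSV in which non-drinkers are stored as the text "None".
raw = "person_id,alcohol_freq\n1,Occasional\n2,None\n3,Weekly\n4,None\n"

# Default parsing: "None" is on pandas' default list of NA markers.
default_read = pd.read_csv(io.StringIO(raw))

# Treat only empty fields as missing, so the text "None" survives.
literal_read = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])

print(default_read["alcohol_freq"].value_counts(dropna=False))
print(literal_read["alcohol_freq"].value_counts(dropna=False))
```

If the second read shows a `None` category on the real file, relabeling it (for example as a non-drinker group) would preserve information that imputation would otherwise overwrite.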

In [18]:
# Fill missing alcohol_freq values with the mode
mode_value = df['alcohol_freq'].mode()[0]  # 'Occasional'
df['alcohol_freq'] = df['alcohol_freq'].fillna(mode_value)

# Check if missing values remain
df['alcohol_freq'].isnull().sum()


np.int64(0)

In [22]:
df['alcohol_freq'].value_counts() # Count occurrences of each category in 'alcohol_freq', ignoring missing values

alcohol_freq
Occasional    75161
Weekly        19833
Daily          5006
Name: count, dtype: int64
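
Filling roughly 30% of the column with the mode raises the share of `Occasional` from about 45% of all rows to about 75%, which can distort later comparisons between drinking groups. One alternative, shown here as a sketch and not as the approach this notebook takes, is to keep missingness as its own category. The label `Unknown` is a made-up placeholder, and the column is rebuilt from the counts reported above so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the original column from the counts reported earlier in the notebook.
counts = {"Occasional": 45078, "Weekly": 19833, "Daily": 5006}
values = [label for label, n in counts.items() for _ in range(n)]
values += [np.nan] * 30083
alcohol = pd.Series(values, name="alcohol_freq")

mode_filled = alcohol.fillna(alcohol.mode()[0])  # approach used above
flagged = alcohol.fillna("Unknown")              # alternative: explicit category

# Compare category shares; "original" ignores the missing rows.
shares = pd.DataFrame({
    "original": alcohol.value_counts(normalize=True),
    "mode_filled": mode_filled.value_counts(normalize=True),
    "flagged": flagged.value_counts(normalize=True),
}).round(3)
print(shares)
```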