## Insurance Cost Analysis

This project analyzes a medical insurance dataset to understand the factors influencing insurance charges and to build predictive regression models.

## 1. Data Loading

In this section, we load the dataset with pandas. The CSV file already includes a header row, so `read_csv` picks up the column names directly and no manual naming is needed.

In [13]:
import pandas as pd
pd.__version__

'3.0.1'

In [14]:
df = pd.read_csv("../data/raw/insurance.csv")
df.head()

   age     sex     bmi  children  smoker     region      charges
0   19  female    27.9         0     yes  southwest    16884.924
1   18    male   33.77         1      no  southeast    1725.5523
2   28    male    33.0         3      no  southeast     4449.462
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male   28.88         0      no  northwest    3866.8552
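
If a copy of the file did not include a header row, the column names could be supplied explicitly at load time. The sketch below is only an illustration: it reads a small in-memory sample rather than the real file, and the `COLUMNS` list and `sample` text are assumptions made for the example.

```python
import io

import pandas as pd

# Assumed column order, matching the preview above.
COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

# Hypothetical headerless sample standing in for the raw file.
sample = io.StringIO(
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

# header=None treats the first line as data; names= supplies the labels.
df_named = pd.read_csv(sample, header=None, names=COLUMNS)
print(df_named.columns.tolist())
```

Passing `header=None` matters here: without it, pandas would consume the first data row as the header.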


## 2. Initial Data Inspection

We perform an initial inspection of the dataset to:
- Check the number of rows and columns
- Review data types
- Identify potential data quality issues

In [18]:
print("Shape:", df.shape)
print("\nData types:")
df.info()
print("\nMissing values:")
print(df.isnull().sum())
print("\nDuplicate rows:")
print(df.duplicated().sum())

Shape: (1338, 7)

Data types:
<class 'pandas.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   str    
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   str    
 5   region    1338 non-null   str    
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), str(3)
memory usage: 73.3 KB

Missing values:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Duplicate rows:
1
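
Before dropping anything, it can help to look at the duplicated rows themselves rather than only their count. The following is a minimal sketch on a toy frame (the `toy` data is invented for illustration, not taken from the dataset). By default, `duplicated()` flags only the later copies of a repeated row, while `keep=False` flags every copy.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "age": [19, 18, 19],
    "smoker": ["yes", "no", "yes"],
    "charges": [16884.924, 1725.5523, 16884.924],
})

# Default keep="first": only the second and later copies are counted.
extra_copies = int(toy.duplicated().sum())

# keep=False: select every row that has at least one identical twin.
all_copies = toy[toy.duplicated(keep=False)]

print(extra_copies)
print(all_copies)
```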


## 3. Data Cleaning

In this step, we:
- Remove duplicate rows
- Verify dataset integrity after cleaning

In [20]:
df = df.drop_duplicates()
print("\nDuplicate rows after dropping duplicates:", df.duplicated().sum())


Duplicate rows after dropping duplicates: 0
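
One way to make the integrity check more explicit is to compare row counts before and after cleaning and to reset the index so that it stays contiguous. This is a sketch on invented toy data; the variable names (`toy`, `cleaned`, `rows_removed`) are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "age": [19, 18, 19],
    "charges": [16884.924, 1725.5523, 16884.924],
})

rows_before = len(toy)

# drop_duplicates keeps the first occurrence of each row;
# reset_index(drop=True) renumbers the remaining rows from 0.
cleaned = toy.drop_duplicates().reset_index(drop=True)
rows_removed = rows_before - len(cleaned)

# Fail loudly if any duplicate survived the cleaning step.
assert cleaned.duplicated().sum() == 0
print(f"Removed {rows_removed} duplicate row(s); {len(cleaned)} remain.")
```

Applied to the dataset above, the same check would be expected to report one removed row, leaving 1337 of the original 1338.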
