## Insurance Cost Analysis

This project analyzes a medical insurance dataset to understand the factors influencing insurance charges and to build predictive regression models.

## 1. Data Loading

In this section, we load the dataset and assign appropriate column names. The original file does not contain headers, so column names are defined manually.

In [26]:
import pandas as pd
pd.__version__
columns = ["age", "gender", "bmi", "children", "smoker", "region", "charges"]
df = pd.read_csv("../data/raw/insurance.csv", names=columns, header=None)
df.head()

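| | age | gender | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.9 | 0 | 1 | 3 | 16884.924 |
| 1 | 18 | 2 | 33.77 | 1 | 0 | 4 | 1725.5523 |
| 2 | 28 | 2 | 33.0 | 3 | 0 | 4 | 4449.462 |
| 3 | 33 | 2 | 22.705 | 0 | 0 | 1 | 21984.47061 |
| 4 | 32 | 2 | 28.88 | 0 | 0 | 1 | 3866.8552 |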


## 2. Initial Data Inspection

We perform an initial inspection of the dataset to:
- Check the number of rows and columns
- Review data types
- Identify potential data quality issues

In [27]:
df.shape

(2772, 7)

In [28]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 2772 entries, 0 to 2771
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       2772 non-null   str    
 1   gender    2772 non-null   int64  
 2   bmi       2772 non-null   float64
 3   children  2772 non-null   int64  
 4   smoker    2772 non-null   str    
 5   region    2772 non-null   int64  
 6   charges   2772 non-null   float64
dtypes: float64(2), int64(3), str(2)
memory usage: 151.7 KB
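
In the `info()` output, `age` and `smoker` are reported as `str` even though their previewed values look numeric, which is a hint that something non-numeric is hiding in them. The following is a minimal sketch of how such columns could be flagged programmatically. The frame `sample` and its values are hypothetical stand-ins for the real data, not rows taken from the file.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
sample = pd.DataFrame({
    "age": ["19", "18", "?"],       # stored as strings
    "gender": [1, 2, 2],
    "bmi": [27.9, 33.77, 33.0],
    "smoker": ["1", "0", "?"],      # stored as strings
})

# Keep only the columns whose dtype is not numeric.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Any column listed here that is expected to be numeric is a candidate for the closer look taken in the next section.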


## 3. Data Quality Check - Unique Values

We inspect unique values in selected columns to detect potential data inconsistencies or invalid entries that may affect further analysis.

In [29]:
print("Age unique:", df["age"].unique())
print("Smoker unique:", df["smoker"].unique())

Age unique: <StringArray>
['19', '18', '28', '33', '32', '31', '46', '37', '60', '25', '62', '23', '56',
 '27', '52', '30', '34', '59', '63', '55', '22', '26', '35', '24', '41', '38',
 '36', '21', '48', '40', '58', '53', '43', '64', '20', '61', '44', '57', '29',
 '45', '54', '49', '47', '51', '42', '50', '39',  '?']
Length: 48, dtype: str
Smoker unique: <StringArray>
['1', '0', '?']
Length: 3, dtype: str


### Observations

- The dataset contains unexpected placeholder values `"?"` in at least two columns: `smoker` and `age`.
- These values likely represent missing/invalid entries and must be handled before converting data types.
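
Reading through `unique()` output works for two columns but does not scale, and it does not show how many rows are affected. As a rough sketch, values that fail numeric conversion can be collected and counted per column with `pd.to_numeric(..., errors="coerce")`. The frame `sample` and the dictionary name `invalid_entries` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
sample = pd.DataFrame({
    "age": ["19", "18", "?", "33"],
    "smoker": ["1", "?", "0", "0"],
})

invalid_entries = {}
for col in sample.columns:
    # Values that cannot be parsed as numbers become NaN.
    converted = pd.to_numeric(sample[col], errors="coerce")
    # Flag entries that were present originally but failed conversion.
    mask = converted.isna() & sample[col].notna()
    invalid_entries[col] = sample.loc[mask, col].value_counts().to_dict()

print(invalid_entries)
```

This approach would also surface placeholders other than `"?"`, should any exist.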

## 4. Handling Missing or Invalid Values

The dataset contains invalid placeholder values `"?"` in multiple columns. We replace `"?"` with `NaN` and remove the affected rows to ensure consistent data types.

In [30]:
import numpy as np

df = df.replace("?", np.nan)
df = df.dropna()

### Distribution of `smoker` values after cleaning

In [31]:
df["smoker"].value_counts()

smoker
0    2198
1     563
Name: count, dtype: int64

In [32]:
df.shape

(2761, 7)

The 11 rows containing `"?"` have been removed, reducing the dataset from 2,772 to 2,761 rows. The `smoker` column now holds only the values 0 and 1.
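
An alternative to replacing `"?"` after loading is to declare it as a missing-value marker at read time through the `na_values` parameter of `pd.read_csv`. The sketch below is not what this notebook does. It reads from an in-memory string instead of the real file, and the rows, including the position of each `"?"`, are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the headerless CSV file.
raw = io.StringIO(
    "19,1,27.9,0,1,3,16884.924\n"
    "?,2,33.77,1,0,4,1725.5523\n"
    "28,2,33.0,3,?,4,4449.462\n"
    "33,2,22.705,0,0,1,21984.47061\n"
)
columns = ["age", "gender", "bmi", "children", "smoker", "region", "charges"]

# Treat "?" as missing while parsing, so affected columns load as float with NaN.
df_alt = pd.read_csv(raw, names=columns, header=None, na_values="?")

rows_before = len(df_alt)
df_alt = df_alt.dropna()
rows_removed = rows_before - len(df_alt)

# With the NaN rows gone, the float columns can be cast to integers.
df_alt = df_alt.astype({"age": int, "smoker": int})
print(rows_removed, df_alt.dtypes.to_dict())
```

Tracking `rows_removed` makes the cost of dropping rows explicit, which helps when deciding whether imputation would be a better choice.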

## 5. Data Type Correction

After removing invalid entries, `age` and `smoker` are still stored as strings and must be converted to integers for further analysis.

In [33]:
df["age"] = df["age"].astype(int)
df["smoker"] = df["smoker"].astype(int)
df.info()

<class 'pandas.DataFrame'>
Index: 2761 entries, 0 to 2771
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       2761 non-null   int64  
 1   gender    2761 non-null   int64  
 2   bmi       2761 non-null   float64
 3   children  2761 non-null   int64  
 4   smoker    2761 non-null   int64  
 5   region    2761 non-null   int64  
 6   charges   2761 non-null   float64
dtypes: float64(2), int64(5)
memory usage: 172.6 KB
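
The `info()` output reports 2761 entries with index labels still running from 0 to 2771, because `dropna()` keeps the original labels and leaves gaps. If later steps rely on contiguous positions, the index could be rebuilt with `reset_index(drop=True)`. This is a small sketch on a hypothetical frame `sample`, not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
sample = pd.DataFrame({
    "age": ["19", "?", "28"],
    "smoker": ["1", "0", "0"],
})

# Same cleaning steps as above: mark "?" as missing, then drop those rows.
cleaned = sample.replace("?", np.nan).dropna()
labels_before = cleaned.index.tolist()   # the label of the dropped row is missing

# Rebuild a contiguous 0..n-1 index and discard the old labels.
cleaned = cleaned.reset_index(drop=True)
labels_after = cleaned.index.tolist()

print(labels_before, labels_after)
```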
