# Project Context

You have just launched a data and AI consulting company with your partner. After some prospecting, you are competing in your first call for tenders: Assur'aimant, a French insurer that has historically operated only at the national level, has decided to enter the United States market. The insurer is asking several AI companies to propose a solution that can estimate the insurance premiums of its subscribers in this market. Currently, brokers estimate premiums using ratios and their own experience, but this method is slow and expensive.

Following several discussions, you visited the Assur'aimant offices in Houston to build a dataset that can be used for your modeling. In particular, you extracted the following information:

- Body mass index (BMI): relates a person's weight to their height. Ideally, it should be between 18.5 and 24.9
- Sex: the gender of the person taking out the insurance, male or female
- Age: the age of the main beneficiary
- Number of dependent children (children): Number of children covered by the insurance 
- Smoker: smoker or non-smoker 
- Region: the residential area in the US (northeast, southeast, southwest or northwest)
- Charges: the insurance premium billed (target)
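
As a reminder of how BMI relates weight to height, here is a minimal sketch of the standard formula (weight in kilograms divided by the square of height in metres). The dataset already contains a `bmi` column and no height or weight columns, so the `bmi` function and the example person below are purely illustrative assumptions.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person (not taken from the dataset): 70 kg, 1.75 m
example_bmi = bmi(70, 1.75)

# Check against the 18.5 to 24.9 range quoted above
is_in_ideal_range = 18.5 <= example_bmi <= 24.9

print(round(example_bmi, 2), is_in_ideal_range)  # 22.86 True
```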

The Assur'aimant management team also asks you to perform a data analysis so that it can better understand its customers. Your objective is therefore twofold:

- Conduct an exploratory study of the data 
- Create a machine learning model that will estimate the insurance premiums of customers based on their demographic data.

Because your company is small, you both need to be versatile and take on the roles of data scientist, data analyst and data engineer: you are full-stack data professionals.

In [1]:
import pandas as pd
import missingno as msno

### Data Cleaning:

1. **Check missing information and duplicates** (tool: `missingno`).

In [2]:
# Load dataset
df = pd.read_csv("assurance_dataset.csv")

In [3]:
# Make a copy of raw dataset
dfi = df.copy()

In [None]:
# View first 10 data rows
dfi.head(10)

In [None]:
# View data types for each variable
dfi.dtypes

In [None]:
# Check for missing data (in numbers)
dfi.isna().sum()

In [None]:
# Check for missing data (visualization)
msno.bar(dfi)

In [None]:
# Check for duplicates
dfi.duplicated().sum()

In [9]:
# Drop duplicates
dfi.drop_duplicates(keep='first', inplace=True)
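
To make the duplicate check and removal concrete, here is a small sketch on a toy frame. The values in `toy` are invented for illustration and are not rows from the insurance dataset. Resetting the index afterwards is optional, but it avoids gaps in the row labels.

```python
import pandas as pd

# Toy data (hypothetical values): rows 1 and 3 repeat row 0 exactly
toy = pd.DataFrame({
    "age": [19, 19, 33, 19],
    "bmi": [27.9, 27.9, 22.7, 27.9],
    "charges": [16884.92, 16884.92, 21984.47, 16884.92],
})

# duplicated() flags every row that is identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index from 0
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates, len(deduped))  # 2 2
```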

In [10]:
# Round bmi to 2 decimal places
dfi['bmi'] = dfi['bmi'].round(2)

In [11]:
# Round charges to 2 decimal places
dfi['charges'] = dfi['charges'].round(2)

In [13]:
# Convert smokers column to numeric
dfi['smoker'] = dfi['smoker'].map({'yes': 1, 'no': 0})

In [14]:
# Strip sex column data
dfi['sex'] = dfi['sex'].str.strip()

In [15]:
# Convert sex column to numeric
dfi['sex'] = dfi['sex'].map({'male': 1, 'female': 0})
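
One thing worth checking after a dictionary-based `map` is that no value was left unmapped: any label missing from the dictionary (for example, a different capitalization) silently becomes `NaN`. The sketch below shows one way to detect this on a toy frame. The values in `toy` are made up, and whether the real dataset contains such variants is an assumption to verify.

```python
import pandas as pd

# Toy data (hypothetical): stray whitespace and one capitalized label
toy = pd.DataFrame({"sex": [" male", "female ", "Female"]})

# Strip whitespace first, as in the cells above
sex_clean = toy["sex"].str.strip()

# Labels that are not keys of the dictionary become NaN
sex_encoded = sex_clean.map({"male": 1, "female": 0})

# List the original labels that failed to map
unmapped = sex_clean[sex_encoded.isna()].unique().tolist()

print(unmapped)  # ['Female']
```

If the list is not empty, normalizing the case with `.str.lower()` before mapping is one possible fix.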

In [17]:
# Export cleaned data to CSV (index=False avoids writing the row index as an extra column)
dfi.to_csv("cleaned_insurance_data.csv", index=False)