# Project 5: Healthcare Insurance Analysis

# Prefactory Remarks

- [x] **Create a virtual environment to download the packages**

In [None]:
# You don't have to do this, it's just safer.

# Install virtualenv (virtual environment):

# !pip install virtualenv

# Create a virtual environment named "myenv":

# !python -m venv myenv

# Activate the virtual environment:

# myenv\Scripts\activate (Windows)
# source myenv/bin/activate (macOS/Linux)

# Upgrade pip and install essential data science libraries inside the virtual environment:

# !myenv/bin/python -m pip install --upgrade pip  
# !myenv/bin/python -m pip install numpy pandas matplotlib seaborn scikit-learn scipy statsmodels jupyterlab plotly openpyxl xlrd tensorflow keras torch torchvision pyspark ipykernel

# Add the virtual environment as a Jupyter kernel:

# !myenv/bin/python -m ipykernel install --user --name=myenv --display-name "Python (myenv)"

# Deactivate the virtual environment (Run this in the terminal):

# deactivate

- [x] **Libraries we might need to install or upgrade**

In [None]:
# If you don't care to create a virtual environment, here is what you need to do to download the libraries

# Run these directly in a cell to download the libraries:

#!pip install tensorflow
#!pip install pyspark
#!pip install scikit-optimize (for skopt)
#!pip install missingno
#!pip install seaborn
#!pip install numpy
#!pip install pandas
#!pip install matplotlib
#!pip install scikit-learn

# To update them, run this (with your desired library):

#!pip install --upgrade scikit-learn

- [x] **Tips for rearranging your Notebook**

- Hold ctrl+shift and click on the various cells you want to move, then press the arrow keys to move them up or down.

# Data Analysis (Using the Pandas Library)

## 1. Visualize the data

- [x] **View the data**

In [72]:
import pandas as pd
import numpy as np
import math as ma
import re
import matplotlib.pyplot as plt
import seaborn as sns

df_i = pd.read_csv("insurance.csv")
df_v = pd.read_csv("validation_dataset.csv")

df_i

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.900,0.0,yes,southwest,16884.924
1,18.0,male,33.770,1.0,no,Southeast,1725.5523
2,28.0,male,33.000,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.880,0.0,no,northwest,$3866.8552
...,...,...,...,...,...,...,...
1333,50.0,male,30.970,3.0,no,Northwest,$10600.5483
1334,-18.0,female,31.920,0.0,no,Northeast,2205.9808
1335,18.0,female,36.850,0.0,no,southeast,$1629.8335
1336,21.0,female,25.800,0.0,no,southwest,2007.945


- [x] **Check the data types**

In [22]:
df_v.dtypes
df_i.dtypes

age         float64
sex          object
bmi         float64
children    float64
smoker       object
region       object
charges      object
dtype: object

- [x] **Count Occurrences**

In [124]:
df_i = pd.read_csv("insurance.csv")
df_v = pd.read_csv("validation_dataset.csv")

df_v["sex"].value_counts()

# We can do all of them instead of one by one:

for column in df_i.columns:
    print(f"Value counts for {column}:\n{df_v[column].value_counts()}\n{'-'*40}\n")  # something wrong here

Value counts for age:
age
18.0    3
27.0    3
29.0    3
28.0    2
70.0    2
19.0    2
63.0    2
44.0    2
57.0    2
92.0    2
46.0    1
74.0    1
45.0    1
43.0    1
52.0    1
78.0    1
40.0    1
35.0    1
89.0    1
26.0    1
58.0    1
21.0    1
23.0    1
47.0    1
33.0    1
49.0    1
39.0    1
55.0    1
84.0    1
83.0    1
20.0    1
60.0    1
51.0    1
48.0    1
42.0    1
71.0    1
61.0    1
Name: count, dtype: int64
----------------------------------------

Value counts for sex:
sex
female    25
male      25
Name: count, dtype: int64
----------------------------------------

Value counts for bmi:
bmi
38.060000    2
24.090000    1
38.600000    1
25.800000    1
25.740000    1
33.700000    1
32.395000    1
33.110000    1
20.235000    1
26.220000    1
24.700000    1
40.375000    1
66.370173    1
21.780000    1
35.720000    1
60.617535    1
34.400000    1
39.710000    1
27.200000    1
68.736874    1
32.490000    1
36.955000    1
84.973279    1
44.880000    1
65.454749    1
32.300000    1


KeyError: 'charges'

## 2. Reorganizing and Cleaning the Data

- [x] **Check for missing values**

- [x] **Check for NaN values in all of the columns, then in a specific set of columns**

- [x] **Rearranje and Rename Columns**

- [x] **Check for erronious values (age should be only positive integers equal or above 18, region should have consistent names, charges sholdn't have units missing)**

In [108]:
df_i = pd.read_csv("insurance.csv")

df_i["age"].value_counts()  # We see thatthey're all integers but we have negative ages, so let's use the absolute value function

def absolute_value(x):
    return abs(x)

df_i["age"]= df_i["age"].apply(absolute_value)

df_i["age"].value_counts()

# Another way to do it is using the lambda function

df_i["age"] = df_i["age"].apply(lambda x: abs(x))

df_i["age"].value_counts()

# Now, let's fix the region

# df_i["region"].value_counts()  # Some of them are not properly capitalized, let's make it consistent

df_i["region"] = df_i["region"].str.capitalize()

df_i["region"].value_counts()

# Now let's see the charges column

df_i["charges"].value_counts()

df_i["charges"] = df_i["charges"].str.replace("$", "", regex=False)
df_i["charges"] = pd.to_numeric(df_i["charges"], errors="coerce")  # Converts to float, handling "$nan" as NaN

df_i.rename(columns={"charges": "charges (in American dollars)"}, inplace=True)

# Use imputation to fill in the NaN values (put the mean of the available values in the NaN slots)

df_i["charges (in American dollars)"] = df_i["charges (in American dollars)"].fillna(df_i["charges (in American dollars)"].mean())


df_i["charges (in American dollars)"]


0       16884.92400
1        1725.55230
2        4449.46200
3       21984.47061
4        3866.85520
           ...     
1333    10600.54830
1334     2205.98080
1335     1629.83350
1336     2007.94500
1337    29141.36030
Name: charges (in American dollars), Length: 1338, dtype: float64

## 3. Analysis and Visualizations

## 4. Data Merging

# Data Science (Using the Pandas Library)

## 5. Inferential Statistics

## 6. PCA (Principal Component Analysis)

## 7. Random Forest

- [] **Use a Random Forest model, coupled with feature importance, to help us understand which features (columns) in this dataset contribute the most to the model’s predictions**

## 8. KMeans Clustering

## 9. Logistic Regression

## 10. Gradient Boost

# Transfering the data to MySQL

- [x] **Save the original dataset with fixed columns**

- [x] **Save the clean dataset**