# 1. Exploratory Data Analysis: Medical Insurance

- This dataset is from Kaggle: https://www.kaggle.com/mirichoi0218/insurance

- This project was inspired by the Codecademy portfolio project.
- The goals of this project were to:
    - Discover trends
    - Clean and refine data using pandas
    - Optimize data for machine learning using scipy and scikit-learn
    - Create insightful visualizations using matplotlib and seaborn
    - Create a regression model and a predictive analysis
    
##  Project Contributors  
- **Corey Arrington**
- **Jeremy Cruzado**

## Profiles
- Linkedin: 
- Github:

## 1.1 Import Required Libraries

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# -- -- -- -- --
# Preprocessing
from sklearn import preprocessing
# -- -- -- -- -- 
pd.options.display.max_rows = None

## 1.2 Loading DataFrame

In [2]:
df = pd.read_csv('/Users/jtc/Data_Science_Projects/Medical_Insurance/Medical-Insurance-Analysis-EDA/datasets/insurance.csv')


FileNotFoundError: [Errno 2] No such file or directory: '/Users/jtc/Data_Science_Projects/Medical_Insurance/insurance.csv'

In [None]:
df.head(5)

# 2. Early EDA



## 2.1 Identifying key statistical data.

In [None]:
df.describe()

## 2.2 Evaluating Columns for Null Values.

In [None]:
df.info()
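
`df.info()` reports non-null counts per column, so missing values have to be inferred by comparing against the row total. A more direct check is sketched below on a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from the insurance file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the insurance data (illustrative values only).
toy = pd.DataFrame({
    "age": [19, 18, np.nan, 33],
    "sex": ["female", "male", "male", None],
    "bmi": [27.9, 33.77, 33.0, 22.705],
})

# Count missing entries per column; a zero means the column is complete.
null_counts = toy.isnull().sum()
print(null_counts)
```

Running the same `isnull().sum()` call on `df` would give one count per column of the real dataset.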

## 2.3 Inspecting the Max Value

In [None]:
df.max()

## 2.4 Inspecting the Min Value

In [None]:
df.min()

## 2.5 Inspecting the Average Value

In [None]:
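# numeric_only=True skips the text columns (sex, smoker, region), which have no mean
df.mean(numeric_only=True)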

## 2.6 Evaluating Distribution of Males and Females

In [None]:
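# The sex column still holds the raw string labels here; encoding to 0/1 happens in section 5.
print(df.loc[df.sex == 'male'].count())    # 676 rows; encoded as 1 after section 5
print(df.loc[df.sex == 'female'].count())  # 662 rows; encoded as 0 after section 5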
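
A more compact way to get the same split is `value_counts()`. The sketch below assumes the `sex` column still holds the raw labels `male` and `female`; the `toy` frame is made up for illustration:

```python
import pandas as pd

# Toy column standing in for df.sex (illustrative values only).
toy = pd.DataFrame({"sex": ["male", "female", "male", "male", "female"]})

# Absolute counts per label.
sex_counts = toy["sex"].value_counts()
# Relative share per label (fractions that sum to 1).
sex_share = toy["sex"].value_counts(normalize=True)

print(sex_counts)
print(sex_share)
```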

# 3. Data Cleaning

- Round the BMI and charges columns to 2 decimal places.


In [None]:
df.bmi = np.round(df.bmi,2) # bmi column
df.charges = np.round(df.charges,2) # charges column

In [None]:
df.head()

# 4. Data Visualizations: Histogram Distribution

## 4.1 Histogram: BMI

In [None]:
plt.hist(df.bmi)

## 4.2 Histogram: Charges

In [None]:
plt.hist(df.charges)

## 4.3 Histogram: Age

In [None]:
plt.hist(df.age)

## 4.4 Histogram: Children

In [None]:
plt.hist(df.children)

# 5. Feature Engineering: Label Encoding 

- To compute correlations between features, the categorical (object) values are converted to numeric codes.

In [None]:
le = preprocessing.LabelEncoder()
df['sex'] = le.fit_transform(df.sex.values)
df['smoker'] = le.fit_transform(df.smoker.values)
#df['region'] = le.fit_transform(df.region.values)  # regions have no corr
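
`LabelEncoder` assigns integer codes in the sorted order of the labels, so it helps to record which code means what before the original strings are overwritten. The sketch below shows that bookkeeping on a made-up frame. It uses pandas category codes as a stand-in that follows the same sorted ordering; it is not the notebook's encoder, and `toy` and `mappings` are hypothetical names:

```python
import pandas as pd

# Toy frame standing in for the raw insurance columns (illustrative values only).
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
})

mappings = {}
for col in ["sex", "smoker"]:
    cat = toy[col].astype("category")          # categories are sorted alphabetically
    mappings[col] = dict(enumerate(cat.cat.categories))  # code -> original label
    toy[col] = cat.cat.codes                   # replace strings with integer codes

print(mappings)
print(toy)
```

Keeping `mappings` around makes later tables and plot legends easier to read, since 0 and 1 can be translated back to the original labels.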

## 5.1 Drop Region Column
- Dropped because, as noted in the encoding step, region shows no meaningful correlation with charges.

In [None]:
df.drop(['region'], axis=1, inplace = True)

In [None]:
df.head()

## 5.2 Checking Column Correlations to the Charges feature

In [None]:
corr_matrix = df.corr()
corr_matrix['charges'].sort_values(ascending=False)

## 5.3 Data Visualization: Heatmap (Correlation)

In [None]:
fig = plt.gcf()
fig.set_size_inches(15,8)
sns.heatmap(corr_matrix,annot = True,fmt='.1g',vmin=-1,vmax=1,center=0,cmap="cubehelix")

# 6. Identifying Trends: From Highest Correlation
  - **Male vs Female**
  - **Smoker vs Non-Smoker**


In [None]:
check_rates_mean = df.groupby(['sex','smoker'])['charges'].mean()
check_rates_mean

##  6.1 **Analysis Insight:** 
- Among non-smokers, males have lower average medical insurance charges than females. Among smokers, females have the lower average charges.

In [None]:
check_rates_median = df.groupby(['sex','smoker'])['charges'].median()
check_rates_median

## 6.2 Preparing Data: Category Plot

- Created a DataFrame to display the median charges for male and female smokers and non-smokers.
- The median is used to limit exposure to heavily weighted outliers.

In [None]:
rates = pd.DataFrame(data={'Gender': ['Female','Male'], 
                         'Smoker': [28950.47,36085.22], 
                         'Non-Smoker': [7639.42, 6985.51]})

rates_plot = pd.melt(rates, id_vars = "Gender")
# rename columns to fit descriptor.
rates_plot.rename(columns={"Gender":"Gender", "variable":"Smokes","value":"Charges"},inplace=True)
rates_plot
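
The medians above are typed in by hand. As a hedged alternative, the same long-format table could be derived directly from a groupby result, which keeps it in sync if the data changes. The sketch below uses a made-up `toy` frame with string labels; with the encoded `df`, the 0/1 codes would first need mapping back to labels:

```python
import pandas as pd

# Toy frame standing in for the insurance data (illustrative values only).
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "female", "male"],
    "smoker": ["yes", "yes", "no", "yes", "no", "no", "no", "yes"],
    "charges": [30000.0, 28000.0, 7000.0, 36000.0, 6500.0, 7500.0, 8000.0, 38000.0],
})

# Median charges per (sex, smoker) group.
medians = toy.groupby(["sex", "smoker"])["charges"].median()

# Pivot smoker status into columns, then melt back to long format for plotting.
rates = medians.unstack("smoker").reset_index()
rates_plot = rates.melt(id_vars="sex", var_name="Smokes", value_name="Charges")
rates_plot = rates_plot.rename(columns={"sex": "Gender"})

print(rates_plot)
```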

## 6.3 Data Visualization: Category side by side Bar plot 

In [None]:
# catplot is a figure-level function and creates its own figure, so no plt.figure() call is needed.

sns.catplot(x = 'Gender', y = 'Charges',
              hue = 'Smokes', data=rates_plot, kind='bar', palette="coolwarm")
plt.show()

# 7. Data Visualization: ScatterPlot

- Visualizing feature correlation to charges.

In [None]:
fig = plt.gcf()
fig.set_size_inches(15,12)
sns.scatterplot(
        data = df, x = 'age', y = 'charges', hue = 'bmi', size = 'smoker',
        sizes = (20,200), legend = 'auto',palette="icefire"
)
plt.show()

##  7.1 **Analysis Insight:** 

- The bottom band shows a clear regression pattern: charges increase with age, and increase further with higher BMI.

- The second band shows that smokers with a low BMI pay nearly the same as non-smokers with a high BMI.

- The third band shows that smokers with a high BMI pay a lot more as they get older.
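
One way to put numbers on these bands is to fit a separate straight line of charges against age for each group. The sketch below is a simplified illustration on synthetic data: the BMI cutoff of 30, the three band definitions, and every value in `toy` are assumptions, not results from the insurance dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: three bands with the same slope but different baselines (made up).
ages = np.arange(20, 65, 5)
frames = []
for smoker, bmi, base in [(0, 25.0, 1000.0), (1, 25.0, 15000.0), (1, 35.0, 30000.0)]:
    frames.append(pd.DataFrame({
        "age": ages,
        "smoker": smoker,
        "bmi": bmi,
        "charges": base + 250.0 * ages,
    }))
toy = pd.concat(frames, ignore_index=True)

# Assumed band definitions; the BMI threshold of 30 is a hypothetical choice.
bands = {
    "non-smoker": toy["smoker"] == 0,
    "smoker, bmi < 30": (toy["smoker"] == 1) & (toy["bmi"] < 30),
    "smoker, bmi >= 30": (toy["smoker"] == 1) & (toy["bmi"] >= 30),
}

# Fit charges = slope * age + intercept within each band.
fits = {}
for name, mask in bands.items():
    slope, intercept = np.polyfit(toy.loc[mask, "age"], toy.loc[mask, "charges"], deg=1)
    fits[name] = (slope, intercept)

for name, (slope, intercept) in fits.items():
    print(f"{name}: slope={slope:.1f}, intercept={intercept:.1f}")
```

On real data, comparing the fitted slopes and intercepts across bands would show whether the visual separation in the scatter plot holds up numerically.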

## 7.2 Exporting CSV Prepared for Machine Learning

In [None]:
df.to_csv('/Users/jtc/Data_Science_Projects/Medical_Insurance/insurance_ml.csv')
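
Note that `to_csv` writes the row index as an extra unnamed column by default, which then shows up as a feature when the file is reloaded. The sketch below demonstrates the difference using in-memory buffers instead of a file path; the `toy` frame is made up for illustration:

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned insurance data (illustrative values only).
toy = pd.DataFrame({"age": [19, 18], "charges": [16884.92, 1725.55]})

# Default export: the index is written as a leading unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so only the real columns round-trip.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Passing `index=False` in the export step avoids having to drop the stray column before model training.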