### Health Insurance Fees

<b>The goal</b> of this analysis is to use <b>patient data</b> to estimate the <b>average medical care expenses</b> for different population segments. These estimates could be used to build tables that set yearly premiums higher or lower depending on the expected treatment costs.

## 1. Getting to know our data

We will use a reproduced dataset that contains medical expenses for patients in the United States.

### 1.1 Importing

In [None]:
#Import relevant libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import scipy.stats as stats
import statsmodels.api as sm
import numpy as np
import plotly.offline as pyo
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# import plotly.plotly as py
import plotly.graph_objs as go
import chart_studio.plotly

In [None]:
#Read in csv file as dataframe
insurance = pd.read_csv('insurance.csv')
insurance.head(1)

### 1.2 How our data looks 

In [None]:
#print info
insurance.info()

In [None]:
#print shape
insurance.shape

### 1.3 Description of our data



<h1><center>Type of data</center></h1> 

| Continuous | Discrete (count) | Categorical | Binary |
| --- | --- | --- | --- |
| age | children | region | sex |
| bmi | - | - | smoker |
| charges | - | - | - |
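One way to sanity-check this classification is to look at each column's dtype and its number of distinct values. The sketch below is illustrative only: `sample` is a small hand-made frame with the same column names as the dataset (its values are invented, not rows from `insurance.csv`), and `describe_columns` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical stand-in rows that reuse the dataset's column names
sample = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 35.2],
    "children": [0, 1, 3, 0],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "northwest", "southeast", "northeast"],
    "charges": [16000.0, 21000.0, 7000.0, 11000.0],
})


def describe_columns(df):
    """Return the dtype and distinct-value count of every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })


summary = describe_columns(sample)
print(summary)

# Columns with exactly two distinct values are candidates for "binary"
binary_cols = summary.index[summary["n_unique"] == 2].tolist()
print(binary_cols)
```

Running `describe_columns(insurance)` on the real frame would give the same kind of summary; a column with more than two distinct integer values is better treated as a count than as a binary flag.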


<h1><center>Feature description</center></h1> 

| Feature | Description |
| --- | --- |
| age | This is an integer indicating the age of the primary beneficiary <br> (excluding those above 64 years, since they are generally covered by the government). |
| sex | This is the policy holder's gender, either male or female. |
| bmi | This is the body mass index (BMI), which provides a sense of how over <br> or under-weight a person is relative to their height. BMI is equal to weight (in kilograms) <br> divided by height (in meters) squared. An ideal BMI is within the range of 18.5 to 24.9. |
| children | This is an integer indicating the number of children / <br> dependents covered by the insurance plan. |
| smoker | This is yes or no depending on whether the insured regularly smokes tobacco. |
| region | This is the beneficiary's place of residence in the U.S., <br> divided into four geographic regions: northeast, southeast, southwest, or northwest. |
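The BMI definition in the table can be written out as a short worked example. This is a minimal sketch: the function names `bmi` and `in_ideal_range` and the example person are hypothetical, and the 18.5 to 24.9 bounds are simply the ones quoted in the feature description.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the BMI falls inside the range quoted in the feature table."""
    return low <= value <= high


# Hypothetical person: 70 kg and 1.75 m tall
example = bmi(70, 1.75)
print(round(example, 2), in_ideal_range(example))
```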

<h1><center>Response description</center></h1> 

| Response | Description |
| --- | --- |
| <font color='black'> charges </font>| <font color='black'>  Yearly medical expenses in dollars </font> | 

## 2. Cleaning data

In [None]:
# Checking for missing values
insurance.isnull().sum()

- There appear to be no missing values in our dataset.

In [None]:
#Dropping duplicate rows from the dataset
insurance.drop_duplicates(inplace=True)

- Any duplicate rows in the dataset have been dropped.
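Before dropping rows it can be useful to count how many are affected. The sketch below shows one way to do that with `DataFrame.duplicated()`; the frame `toy` and its values are invented for illustration and are not taken from `insurance.csv`.

```python
import pandas as pd

# Hypothetical three-row frame in which the first two rows are identical
toy = pd.DataFrame({
    "age": [25, 25, 40],
    "bmi": [22.0, 22.0, 31.5],
    "charges": [1000.0, 1000.0, 8000.0],
})

# Total number of missing cells across all columns
n_missing = int(toy.isnull().sum().sum())

# Rows that repeat an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_missing, n_duplicates, len(deduplicated))
```

The same three calls applied to `insurance` would report how many rows the cleaning step removes.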

In [None]:
#Examining outliers
sns.set()
sns.set(style="whitegrid")
fig, axes = plt.subplots(2,2, figsize=(15, 15))
sns.boxplot(x=insurance["age"], ax=axes[0,0], data = insurance)
sns.boxplot(x=insurance["bmi"], ax=axes[0,1],data = insurance)
sns.boxplot(x=insurance["charges"], ax=axes[1,0], data = insurance);

In [None]:
#Creating an interquartile variable
iqr = np.subtract(*np.percentile(insurance['charges'], [75, 25]))
print(iqr)

# identify outliers for charges

q25, q75 = np.percentile(insurance['charges'], 25), np.percentile(insurance['charges'], 75)
iqr = q75 - q25
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

outliers = [x for x in insurance['charges'] if x < lower or x > upper]
print('Identified outliers for charges out of %d records: %d' % (len(insurance), len(outliers)))


# identify outliers for bmi
q25, q75 = np.percentile(insurance['bmi'], 25), np.percentile(insurance['bmi'], 75)
iqr = q75 - q25
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
outliers = [x for x in insurance['bmi'] if x < lower or x > upper]
print('Identified outliers for bmi out of %d records: %d' % (len(insurance), len(outliers)))


# identify outliers for age
q25, q75 = np.percentile(insurance['age'], 25), np.percentile(insurance['age'], 75)
iqr = q75 - q25
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

outliers = [x for x in insurance['age'] if x < lower or x > upper]
print('Identified outliers for age out of %d records: %d' % (len(insurance), len(outliers)))

From the calculations above we can see that:
- The charges column has 139 identified outliers
- The bmi column has 9 identified outliers
- No outliers were identified in the age column

No outliers will be removed, as they contain information that will be needed for further predictions.
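The 1.5 × IQR rule used above is repeated once per column; it could be wrapped in a small helper. This is a sketch under assumptions: `iqr_outliers` is a hypothetical function name and `demo` is invented data chosen so the arithmetic is easy to follow.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (outliers, lower, upper) using the k * IQR fences."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = k * (q75 - q25)
    lower, upper = q25 - cut_off, q75 + cut_off
    mask = (values < lower) | (values > upper)
    return values[mask], lower, upper


# Hypothetical data: six ordinary values and one extreme value
demo = [10, 12, 11, 13, 12, 11, 100]
outliers, lower, upper = iqr_outliers(demo)
print(outliers, lower, upper)
```

With the real data, a loop such as `for col in ['charges', 'bmi', 'age']: iqr_outliers(insurance[col])` would reproduce the three counts without duplicating the code.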

## 3. Distribution

In [None]:
#Describe the center of the data: charges
#Mean: 
mean= insurance["charges"].mean()
print("The mean charges is:",mean)
    
#Median:    
median = insurance["charges"].median()
print("The median charges is:",median)

#Mode:
mode = 1639.5631 #insurance["charges"].mode()
print("The charges that appears the most is:", mode)

#Graphics
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})

x = insurance["charges"]

sns.boxplot(x=x, ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
ax_box.axvline(mode, color='b', linestyle='-')

#histplot replaces the deprecated distplot
sns.histplot(x=x, kde=True, stat='density', ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--', label='Mean')
ax_hist.axvline(median, color='g', linestyle='-', label='Median')
ax_hist.axvline(mode, color='b', linestyle='-', label='Mode')

#Build the legend from the labelled reference lines
ax_hist.legend()

ax_box.set(xlabel='charges')
ax_hist.set(xlabel='Charges')

plt.show()

- The large majority of individuals in our data have yearly medical expenses between $0 and $15,000, although the tail of the distribution extends far beyond this range.
- The mean is greater than the median, which implies that the distribution of insurance charges is right-skewed.
- Linear regression assumes normally distributed errors rather than a normal dependent variable, but a response this skewed often produces skewed residuals, so the distribution is not ideal and the assumption is likely to be violated.
- The most frequently occurring charge in the data is $1,639.56.
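Right skew can be quantified with a skewness statistic, and a log transform is one common way to reduce it before fitting a linear model. The sketch below is an assumption-laden illustration, not part of the original analysis: it draws hypothetical log-normal values in place of the real `charges` column and compares the skewness before and after the transform.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed "charges": log-normal draws, not the real data
charges = rng.lognormal(mean=9.0, sigma=0.9, size=2000)

# In a right-skewed sample the mean sits above the median
mean_above_median = charges.mean() > np.median(charges)

# Sample skewness before and after a natural-log transform
skew_raw = stats.skew(charges)
skew_log = stats.skew(np.log(charges))

print(mean_above_median, skew_raw, skew_log)
```

Whether a log transform is appropriate for the real `charges` column would need to be checked against the residuals of the fitted model.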

In [None]:
#Describe the center of the data: age
#Mean: 
mean= insurance["age"].mean()
print("The mean age is:",mean)
    
#Median:    
median = insurance["age"].median()
print("The median age is:",median)

#Mode:
mode = 18 #insurance["age"].mode()
print("The age that appears the most is:", mode)

#Graphics
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})

x = insurance["age"]

sns.boxplot(x=x, ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
ax_box.axvline(mode, color='b', linestyle='-')

#histplot replaces the deprecated distplot
sns.histplot(x=x, kde=True, stat='density', ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--', label='Mean')
ax_hist.axvline(median, color='g', linestyle='-', label='Median')
ax_hist.axvline(mode, color='b', linestyle='-', label='Mode')

#Build the legend from the labelled reference lines
ax_hist.legend()

ax_box.set(xlabel='Age')
ax_hist.set(xlabel='Age')

plt.show()

- The most common ages in our data are around 20, although the distribution extends well beyond this peak.
- The mean is close enough to the median that we can treat them as roughly equal; however, the histogram shows more than one peak, which suggests a bimodal distribution.

In [None]:
#Describe the center of the data: bmi
#Mean: 
mean= insurance["bmi"].mean()
print("The mean bmi is:",mean)
    
#Median:    
median = insurance["bmi"].median()
print("The median bmi is:",median)

#Mode:
mode = 32.3 #insurance["bmi"].mode()
print("The bmi that appears the most is:", mode)

#Graphics
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})

x = insurance["bmi"]

sns.boxplot(x=x, ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
ax_box.axvline(mode, color='b', linestyle='-')

#histplot replaces the deprecated distplot
sns.histplot(x=x, kde=True, stat='density', ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--', label='Mean')
ax_hist.axvline(median, color='g', linestyle='-', label='Median')
ax_hist.axvline(mode, color='b', linestyle='-', label='Mode')

#Build the legend from the labelled reference lines
ax_hist.legend()

ax_box.set(xlabel='bmi')
ax_hist.set(xlabel='bmi')

plt.show()

- The large majority of individuals in our data have a body mass index between 25 and 35.
- The mean is close enough to the median that we can treat them as roughly equal, which implies that the distribution of body mass index is approximately symmetric.

In [None]:
# scatter plot with regression plot (target)

print('The correlation and the pvalue of age and charges are respectively:',stats.pearsonr(insurance['age'], insurance['charges']))
print('The correlation and the pvalue of bmi and charges are respectively:',stats.pearsonr(insurance['bmi'], insurance['charges']))

sns.set()
# sns.set(style = 'whitegrid')
fig, axes = plt.subplots(1,2, figsize=(16,16))

sns.regplot(x=insurance['age'], y=insurance['charges'], ax= axes[0])
sns.regplot(x=insurance['bmi'], y=insurance['charges'],ax= axes[1]);
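`stats.pearsonr` returns the correlation coefficient and the p-value together, so the result can be unpacked into two names instead of printing the raw pair. The sketch below uses synthetic data (the slope, noise level and sample size are arbitrary assumptions, not estimates from the dataset) and cross-checks the coefficient against `np.corrcoef`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: ages 18 to 64 and charges that rise with age plus noise
age = rng.integers(18, 65, size=300)
charges = 250.0 * age + rng.normal(0, 4000, size=300)

# pearsonr yields (correlation coefficient, two-sided p-value)
r, p_value = stats.pearsonr(age, charges)

# Cross-check: the off-diagonal entry of the correlation matrix is the same r
r_manual = np.corrcoef(age, charges)[0, 1]

print(f"r = {r:.3f}, p-value = {p_value:.3g}")
```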

## 4. Statistical Inferences

In [None]:
fig, ax = plt.subplots(2,2, figsize=(15,7))
# use unstack()
insurance.groupby(['children','region']).count()['charges'].unstack().plot(ax=ax[0,0])
insurance.groupby(['children','sex']).count()['charges'].unstack().plot(ax=ax[0,1])
insurance.groupby(['children','smoker']).count()['charges'].unstack().plot(ax=ax[1,0]);

In [None]:
fig, ax = plt.subplots(2,2, figsize=(15,7))
# use unstack()
insurance.groupby(['age','region']).count()['charges'].unstack().plot(ax=ax[0,0])
insurance.groupby(['age','sex']).count()['charges'].unstack().plot(ax=ax[0,1])
insurance.groupby(['age','smoker']).count()['charges'].unstack().plot(ax=ax[1,0]);

In [None]:
fig, ax = plt.subplots(2,2, figsize=(15,7))
# use unstack()
insurance.groupby(['bmi','region']).count()['charges'].unstack().plot(ax=ax[0,0])
insurance.groupby(['bmi','sex']).count()['charges'].unstack().plot(ax=ax[0,1])
insurance.groupby(['bmi','smoker']).count()['charges'].unstack().plot(ax=ax[1,0]);
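The cells above all follow the same pattern: group by two columns, count, then `unstack()` so the second key becomes the columns of a table that `plot()` draws as one line per category. A minimal sketch of that reshaping on a hypothetical six-row frame (`toy` and its values are invented), with `pd.crosstab` shown as an equivalent way to get the same counts:

```python
import pandas as pd

# Hypothetical rows reusing three of the dataset's column names
toy = pd.DataFrame({
    "children": [0, 0, 1, 1, 1, 2],
    "smoker": ["no", "yes", "no", "no", "yes", "no"],
    "charges": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Count rows per (children, smoker) pair, then pivot smoker into columns
counts = (
    toy.groupby(["children", "smoker"])["charges"]
    .count()
    .unstack(fill_value=0)
)
print(counts)

# crosstab produces the same table of counts directly
ct = pd.crosstab(toy["children"], toy["smoker"])
print(ct)
```

Passing `fill_value=0` keeps combinations that never occur as zero counts rather than missing values, which avoids gaps in the plotted lines.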

In [None]:
import plotly.express as px
fig = px.histogram(insurance, x="age", y="bmi", color="smoker",
                   marginal="box", 
                   hover_data=insurance.columns)
fig.show()

In [None]:
import plotly.express as px
fig = px.histogram(insurance, x="age", y="charges", color="children",
                   marginal="box",
                   hover_data=insurance.columns)
fig.show()

In [None]:
import plotly.express as px
fig = px.histogram(insurance, x="sex", y="bmi", color="region",
                   marginal="box", 
                   hover_data=insurance.columns)
fig.show()

In [None]:
import plotly.express as px
fig = px.histogram(insurance, x="age", y="bmi", color="region",
                   marginal="box", 
                   hover_data=insurance.columns)
fig.show()

In [None]:
import plotly.graph_objects as go # or plotly.express as px
fig = go.Figure() # or any Plotly Express function e.g. px.bar(...)
# fig.add_trace( ... )
# fig.update_layout( ... )

import dash
# dash_core_components and dash_html_components are deprecated standalone
# packages; Dash 2.0 and later bundle both modules
from dash import dcc, html

app = dash.Dash()
app.layout = html.Div([
    dcc.Graph(figure=fig)
])

app.run_server(debug=True, use_reloader=False)  # Turn off reloader if inside Jupyter