# Medical Cost Personal Dataset
**Columns**

**age**: age of primary beneficiary

**sex**: insurance contractor gender: female, male

**bmi**: bmi is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m², resulting from mass in kilograms and height in metres

**children**: Number of children covered by health insurance / Number of dependents

**smoker**: whether the beneficiary smokes: yes, no

**region**: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

**charges**: Individual medical costs billed by health insurance
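
The BMI definition in the column list above can be written as a one-line formula. The sketch below is only an illustration; the function name `bmi` and the sample values are assumptions, not part of the dataset.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2: mass divided by the square of height."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return mass_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))
```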

In this notebook I try to identify predictive variables by analysing the dataset visually.

In [None]:
import numpy as np # linear algebra
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('../input/insurance/insurance.csv')

In [None]:
df.head()

## Data Wrangling
#### What is the purpose of Data Wrangling?
Data Wrangling is the process of converting data from the initial format to a format that may be better for analysis.

In [None]:
df.columns

In [None]:
# Count the number of missing values in each column
df.isnull().sum()

So we don't have any missing values.

In [None]:
# Calculate the average of a column, here the mean age
avg_age = df['age'].mean()
avg_age

In [None]:
# Replace NaN with the mean value (a no-op here, since no values are missing)
df['age'] = df['age'].fillna(avg_age)
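
Because this dataset has no missing values, the replacement above does not change anything. A minimal sketch on a made-up frame (the `toy` frame and its values are assumptions for illustration) shows what the same steps do when a value really is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({"age": [20.0, np.nan, 40.0], "bmi": [22.0, 27.5, 31.0]})

missing_before = toy.isnull().sum()   # missing values per column
toy_avg_age = toy["age"].mean()       # mean() skips NaN by default
toy["age"] = toy["age"].fillna(toy_avg_age)
missing_after = toy.isnull().sum()

print(missing_before["age"], missing_after["age"], toy["age"].tolist())
```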

In [None]:
# Let's check which regions we have
df['region'].unique()

In [None]:
# df['region'] and df.region are equivalent ways to select the column
df.region.value_counts()

In [None]:
df['age'].value_counts().sort_values(ascending=False)
# 18 and 19 year olds are the most common ages

In [None]:
df['smoker'].value_counts()

### Let's convert data types to proper format



In [None]:
df.dtypes

In [None]:
# We will use numerical data, so we convert 'sex', 'smoker' and 'region' to numeric codes
# Encoding the data with the map function

df['sex'] = df['sex'].map({'female':0,'male':1})
df['smoker'] = df['smoker'].map({'yes':1,'no':0})
df['region'] = df['region'].map({'southeast':0,'southwest':1,'northwest':2,'northeast':3})
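
Mapping `region` to the integers 0 to 3 imposes an order on the regions that a linear model will treat as meaningful. One common alternative, not used in this notebook, is one-hot encoding with `pd.get_dummies`. The sketch below uses a small made-up frame (`toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"region": ["southeast", "southwest", "northwest", "northeast"]})

# One indicator column per region instead of a single ordered integer code
encoded = pd.get_dummies(toy, columns=["region"], dtype=int)
print(encoded.columns.tolist())
```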

In [None]:
df.head()

## Linear Regression

As we can see, all the data is numeric now.

In [None]:
df.describe().T

In [None]:
sns.regplot(x ='age', y = 'charges', data =df)
# Weak Linear Relationship

In [None]:
df['region'].value_counts()

In [None]:
# Repeat the step above, but save the result to the dataframe "region_value_counts"
# with the counts column named 'value_counts'. Renaming the Series first works whether
# pandas calls the counts 'region' (1.x) or 'count' (2.x).
region_value_counts = df['region'].value_counts().rename('value_counts').to_frame()
region_value_counts

In [None]:
# Same approach for 'smoker': name the counts column 'value_counts'
smoker_value_counts = df['smoker'].value_counts().rename('value_counts').to_frame()
smoker_value_counts

In [None]:
smoker_value_counts.index.name = 'Smoker'

In [None]:
smoker_value_counts

## Basic Grouping

In [None]:
df['age'].unique()

In [None]:
# To see which age and sex groups are charged more on average, first select the relevant columns
df_group_one = df[['age','sex','charges']]

In [None]:
df_group_one = df_group_one.groupby(['age','sex'], as_index =False).mean()

In [None]:
df_group_one

This grouped data is much easier to visualize when it is made into a pivot table.



In [None]:
group_pivot = df_group_one.pivot(index = 'age',columns = 'sex')

In [None]:
group_pivot.T
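
As a small illustration of what `pivot` does to grouped data, here is a sketch on made-up numbers (the `toy` frame and its values are assumptions, not results from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [18, 18, 19, 19],
    "sex": [0, 1, 0, 1],
    "charges": [100.0, 200.0, 300.0, 400.0],
})

# Long format: one row per (age, sex) pair with the mean charge
grouped = toy.groupby(["age", "sex"], as_index=False).mean()

# Wide format: one row per age, one column per sex
wide = grouped.pivot(index="age", columns="sex", values="charges")
print(wide)
```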

## Exploratory Data Analysis

In [None]:
df.corr()
#weak correlation

In [None]:
sns.boxplot(x="sex", y="charges", data=df)

In [None]:
from scipy import stats


In [None]:
#Let's calculate the Pearson Correlation Coefficient and P-value of 'age' and 'charges'.
pearson_coef, p_value = stats.pearsonr(df['age'], df['charges'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

Conclusion:

Since the p-value is far below 0.001, the correlation between age and charges is statistically significant, although the linear relationship is weak (a coefficient of roughly 0.3 on this dataset).
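
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch with NumPy on made-up numbers (the arrays `x` and `y` are assumptions for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx and dy the deviations from the means
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns the full correlation matrix; entry [0, 1] is r for (x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```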

In [None]:
#sex vs charges
#Let's calculate the Pearson Correlation Coefficient and P-value of 'sex' and 'charges'.
pearson_coef, p_value = stats.pearsonr(df['sex'], df['charges'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

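Conclusion: The coefficient for sex and charges is close to zero (roughly 0.06 on this dataset), so the linear relationship is very weak. The p-value is well above 0.001 (around 0.04), so the evidence for a linear association between sex and charges is marginal at best.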

## Binning

In [None]:
#Binning the age column.
bins = [17,35,55,1000]
slots = ['Young adult','Senior Adult','Elder']

df['Age_range']=pd.cut(df['age'],bins=bins,labels=slots)
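
Note that `pd.cut` treats each bin as open on the left and closed on the right by default, so with the edges above an age of 35 falls in the first bin and 36 in the second. A quick check on a few made-up ages (the `ages` series is an assumption for illustration):

```python
import pandas as pd

bins = [17, 35, 55, 1000]
slots = ['Young adult', 'Senior Adult', 'Elder']

ages = pd.Series([18, 35, 36, 55, 56])
# Intervals are (17, 35], (35, 55] and (55, 1000]
age_range = pd.cut(ages, bins=bins, labels=slots)
print(age_range.tolist())
```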

In [None]:
df.head()

In [None]:
# Check the number of unique values in each column
# Rule of thumb: about 40 or fewer unique values suggests a categorical column,
# more than about 50 suggests a continuous one

df.nunique().sort_values()

In [None]:
plt.figure(figsize=(25, 16))
plt.subplot(2,3,1)
sns.boxplot(x = 'smoker', y = 'charges', data = df)
plt.title('Smoker vs Charges',fontweight="bold", size=20)
plt.subplot(2,3,2)
sns.boxplot(x = 'children', y = 'charges', data = df,palette="husl")
plt.title('Children vs Charges',fontweight="bold", size=20)
plt.subplot(2,3,3)
sns.boxplot(x = 'sex', y = 'charges', data = df, palette= 'husl')
plt.title('Sex vs Charges',fontweight="bold", size=20)
plt.subplot(2,3,4)
sns.boxplot(x = 'region', y = 'charges', data = df,palette="bright")
plt.title('Region vs Charges',fontweight="bold", size=20)
plt.subplot(2,3,5)
sns.boxplot(x = 'Age_range', y = 'charges', data = df, palette= 'husl')
plt.title('Age vs Charges',fontweight="bold", size=20)
plt.show()

1. Medical charges are higher for smokers than for non-smokers.
2. Medical charges are higher in the Southeast region (encoded as 0).
3. Senior Adults are charged more than Young adults.

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='region', y='charges', hue='sex', data=df, palette='Paired')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x = 'region', y = 'charges',hue='smoker', data=df, palette='cool')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='region', y='charges', hue='children', data=df, palette='Set1')
plt.show()

As these bar plots show, the highest charges for smokers are still in the Southeast, and the lowest are in the Northeast. People in the Southwest generally smoke more than people in the Northeast, but charges by gender are higher in the Northeast than in the Southwest and Northwest. People with children also tend to have higher medical costs overall.

In [None]:
plt.figure(figsize=(12,6))
sns.violinplot(x = 'children', y = 'charges', data=df, hue='smoker', palette='inferno')
plt.show()

From the plot above we can see that smoking has the largest impact on medical costs, even though costs also grow with age, BMI and number of children. People who have children also generally smoke less.

In [None]:
#Heatmap to see correlation between variables
plt.figure(figsize=(12, 8))
# numeric_only=True skips the categorical Age_range column added during binning
sns.heatmap(df.corr(numeric_only=True), cmap='RdYlGn', annot=True)
plt.title("Correlation between Variables")
plt.show()

In [None]:
sns.scatterplot(x=df['bmi'], y=df['charges'], hue=df['smoker'])

The scatter plot shows that while nonsmokers do tend to pay slightly more with increasing BMI, smokers pay much more. To emphasize this, the next plot adds two regression lines, one for smokers and one for nonsmokers.

In [None]:
sns.lmplot(x="bmi", y="charges", hue="smoker", data=df)

We can see that the regression line for smokers has a much steeper slope than the line for nonsmokers. Let's conclude with a categorical scatter plot.

In [None]:
sns.swarmplot(x=df['smoker'],y=df['charges'])

On average, non-smokers are charged less than smokers: the customers who pay the most are smokers, and the customers who pay the least are non-smokers. Hence, smoking status is strongly associated with insurance charges.

## ANOVA

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption relative to the variation within the groups, and reports it as the F-test score. A larger score means a larger difference between the means.

P-value: the probability of seeing an F-test score at least this large if the group means were in fact equal. A small P-value means the difference is statistically significant.
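
To make the F-test score concrete, here is a minimal sketch that computes it by hand with NumPy. The three groups are made-up numbers chosen for illustration, not charges from the dataset:

```python
import numpy as np

groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([3.0, 4.0, 5.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)       # number of groups
n = all_values.size   # total number of observations

# Between-group variation: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: spread of the values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between groups / mean square within groups
f_score = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_score)
```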

In [None]:
# Group the raw rows by age so that each group holds every individual's charges,
# not just the two per-sex means stored in df_group_one
df_group_two = df[['age', 'charges']].groupby('age')
df_group_two.head()

In [None]:
df_group_two.get_group(18)['charges']
# charges of all 18-year-old beneficiaries

In [None]:
# ANOVA: compare mean charges across ages 20, 40 and 60
f_val, p_val = stats.f_oneway(
    df_group_two.get_group(20)['charges'],
    df_group_two.get_group(40)['charges'],
    df_group_two.get_group(60)['charges'],
)

print("ANOVA results: F=", f_val, ", P =", p_val)

## Model Development

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()
lm

In [None]:
X = df[['age','sex','bmi','region']]
Y = df['charges']

In [None]:
lm.fit(X,Y)

In [None]:
Yhat = lm.predict(X)
Yhat[0:5]

In [None]:
lm.intercept_

In [None]:
lm.coef_
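
The fitted model predicts charges as the intercept plus the sum of each coefficient times its feature. The sketch below illustrates that relationship with a plain NumPy least-squares fit on made-up data; it is a stand-in for illustration, not the scikit-learn code used above, and all names in it are assumptions:

```python
import numpy as np

# Hypothetical data generated exactly from y = 2 + 3*x1 - 1*x2
X_toy = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0], [5.0, 7.0]])
y_toy = 2.0 + 3.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

# Add a column of ones so the first fitted weight plays the role of the intercept
design = np.column_stack([np.ones(len(X_toy)), X_toy])
weights, *_ = np.linalg.lstsq(design, y_toy, rcond=None)
intercept, coef = weights[0], weights[1:]

# A prediction is intercept + coef[0]*x1 + coef[1]*x2
y_hat = intercept + X_toy @ coef
print(intercept, coef)
```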

In [None]:
width = 6
height = 4
plt.figure(figsize=(width, height))
sns.regplot(x="bmi", y="charges", data=df)
plt.ylim(0,)

## Training and Testing

In [None]:
y_data = df['charges']

In [None]:
x_data =df.drop('charges',axis =1)

In [None]:
from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
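
Conceptually, `train_test_split` shuffles the row indices and cuts them into two parts. A rough NumPy sketch of that idea follows; it is not scikit-learn's implementation, will not reproduce the same rows for a given `random_state`, and the names and sizes in it are assumptions for illustration:

```python
import numpy as np

n_samples = 20
test_size = 0.25

rng = np.random.default_rng(1)          # fixed seed so the split is repeatable
shuffled = rng.permutation(n_samples)   # shuffled row indices 0..n_samples-1

n_test = int(round(n_samples * test_size))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
print(len(test_idx), len(train_idx))
```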

In [None]:
# Use the function "train_test_split" to split the data set so that 40% of the samples are used for testing, with the parameter "random_state" set to zero. The outputs are "x_train1", "x_test1", "y_train1" and "y_test1".

In [None]:

x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.4, random_state=0) 
print("number of test samples :", x_test1.shape[0])
print("number of training samples:",x_train1.shape[0])

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lre=LinearRegression()  

In [None]:
lre.fit(x_train[['bmi']], y_train)   # we fit the model using the feature bmi

In [None]:
lre.score(x_test[['bmi']], y_test)   # calculates the R^2 on the test data

In [None]:
lre.score(x_train[['bmi']], y_train)  # R^2 on the training data, for comparison with the test score above
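
For a regression model, `score` returns the coefficient of determination R^2, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that formula on made-up values (the arrays are assumptions for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```

R^2 is 1 for perfect predictions and can be negative when the model does worse than always predicting the mean.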

In [None]:
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.1, random_state=0)
lre.fit(x_train1[['bmi']],y_train1)
lre.score(x_test1[['bmi']],y_test1)