# U.S. Medical Insurance Costs

**Project Scope and research questions**
* average age of the patients in the dataset
* average bmi
* where a majority of the individuals are from
* different costs between smokers vs. non-smokers
* what the average age is for someone who has at least one child in this dataset
* features that are the most influential for an individual’s medical insurance charges based on analysis
* build a regression model

We will explore how a particular set of factors (age, sex, smoking, region of the US) influences insurance costs and build a regression model based on the available data.

In [None]:
import pandas as pd
import numpy as np 
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/insurance/insurance.csv')
df.head()

There are seven columns: some are numerical and some are categorical. Let's check whether any data is missing.

In [None]:
df.isnull().sum()

Great! Let's then get summary statistics for the numerical columns (`describe()` skips the categorical ones by default).

In [None]:
df.describe()

# Inferences based on summary statistics

* age

The age range is limited: we have data for adults only, with none on children or the elderly. Excluding older adults may bias our dataset.

* bmi

Both the mean BMI and the BMI values within the IQR are higher than what is considered a healthy BMI (commonly given as 18.5 to 24.9).

* children 

We need to explore whether having children correlates with greater insurance cost, as children may be covered by their parents' health insurance.

* charges 

The mean is higher than the median, which indicates right skewness.
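
The mean-versus-median check can also be backed by a skewness coefficient. Below is a minimal sketch on a small invented sample (the values are made up and are not taken from `insurance.csv`); on the real data, the same calls would be applied to `df['charges']`.

```python
import pandas as pd

# Hypothetical right-skewed sample of charges (invented values for illustration).
charges = pd.Series(
    [1200.0, 2500.0, 3100.0, 4800.0, 6200.0, 9100.0, 14500.0, 38000.0, 47000.0]
)

mean_charge = charges.mean()
median_charge = charges.median()
skewness = charges.skew()  # sample skewness; a positive value means a longer right tail

print(f"mean={mean_charge:.0f}, median={median_charge:.0f}, skew={skewness:.2f}")
```

A mean well above the median together with a positive skewness both point to a long right tail.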
 

To get a better grasp of the data, we will visualize it with graphs.
First, let's look at the distribution of cost.

In [None]:
plt.figure(figsize=(14,5))
plt.hist(df.charges, bins=20)
plt.xlabel('Insurance Cost, $')
plt.title('Insurance Cost Distribution')
plt.show()

As noted, the distribution of charges is right-skewed, with some outliers that have extremely expensive health insurance.
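
One common way to make "outlier" concrete is the 1.5 × IQR rule. The sketch below applies it to an invented sample; the rule and the 1.5 multiplier are assumptions chosen here for illustration, not something the notebook itself uses.

```python
import pandas as pd

# Hypothetical charges (invented values, not from insurance.csv).
charges = pd.Series(
    [1200.0, 2500.0, 3100.0, 4800.0, 6200.0, 9100.0, 14500.0, 38000.0, 47000.0]
)

q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this fence are flagged as outliers

outliers = charges[charges > upper_fence]
print(f"upper fence: {upper_fence:.0f}, flagged: {len(outliers)} of {len(charges)}")
```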

In [None]:
sns.countplot(data=df, x='region')
plt.show()

In [None]:
# `fill=True` replaces `shade=True`, which is deprecated in recent seaborn versions
for region in ['southwest', 'southeast', 'northwest', 'northeast']:
    sns.kdeplot(df[df.region == region]["charges"], fill=True, label=region)
plt.legend()
plt.show()

In [None]:
sns.countplot(data=df, x='sex')
plt.show()

In [None]:
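# `fill=True` replaces `shade=True`, which is deprecated in recent seaborn versions
sns.kdeplot(df[df.sex == 'male']["charges"], fill=True, label='male')
sns.kdeplot(df[df.sex == 'female']["charges"], fill=True, label='female')
plt.legend()
plt.show()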

In [None]:
plt.figure(figsize=(14,5))
plt.title("Distribution of age")
ax = sns.histplot(df["age"], bins=20, color = 'g')

In [None]:
plt.figure(figsize=(14,5))
ax = plt.scatter(df.age, df.charges)
plt.show()

So we have a roughly even representation of males and females and of all four regions; however, our dataset contains more data points for young adults.
That may help explain the cost distribution we saw earlier, and we can see a slight positive correlation between age and charges.
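
The balance across regions and the age/charges relationship can be quantified rather than read off the plots. Below is a minimal sketch on a small invented frame that only reuses the column names of `insurance.csv` (the values are made up for illustration):

```python
import pandas as pd

# Invented rows that only mimic the column names of insurance.csv.
toy = pd.DataFrame({
    "age": [19, 23, 31, 45, 52, 60],
    "region": ["southeast", "southwest", "southeast",
               "northwest", "southeast", "northeast"],
    "charges": [1700.0, 2400.0, 4100.0, 7900.0, 10200.0, 13100.0],
})

region_counts = toy["region"].value_counts()   # rows per region
most_common_region = region_counts.idxmax()    # region with the most rows
age_charge_corr = toy["age"].corr(toy["charges"])  # Pearson correlation by default

print(region_counts)
print(most_common_region, round(age_charge_corr, 3))
```

On the real data, `df['region'].value_counts()` would answer the scope question about where most individuals are from.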

Let's look into costs of smokers vs non-smokers.

In [None]:
plt.figure(figsize=(14,5))
plt.hist(df[(df.smoker == 'yes')]["charges"], bins=20, alpha=0.5, color='red', label='smoker')
plt.hist(df[(df.smoker == 'no')]["charges"], bins=20, alpha=0.5, color='green', label='non-smoker')
plt.xlabel('Insurance Cost, $')
plt.legend()
plt.show()

Smokers spend more on insurance, but it looks like there are more non-smokers. Let's check this.

In [None]:
sns.catplot(x="smoker", kind="count",hue = 'sex', palette="rocket", data=df)
plt.show()

There are significantly more non-smokers than smokers, and there are more male smokers than female smokers. Let's see if costs are higher for them.

In [None]:
sns.catplot(x="sex", y="charges", hue="smoker", kind="violin", data=df, palette = 'husl')
plt.show()

It's clearly better for your finances not to smoke: not only do you pay for cigarettes, but your insurance is also more expensive.
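
The smoker versus non-smoker gap can be put into numbers with a grouped aggregation. Below is a minimal sketch, assuming a frame with `smoker` and `charges` columns as in `insurance.csv`; the rows are invented for illustration.

```python
import pandas as pd

# Invented rows; only the column names follow insurance.csv.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "charges": [32000.0, 4000.0, 6000.0, 28000.0, 5000.0, 9000.0],
})

# Count, mean and median of charges per smoking status.
by_smoker = toy.groupby("smoker")["charges"].agg(["count", "mean", "median"])

# How many times higher the average charge is for smokers.
ratio = by_smoker.loc["yes", "mean"] / by_smoker.loc["no", "mean"]

print(by_smoker)
print(f"smokers pay {ratio:.1f}x the non-smoker mean in this toy sample")
```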

What about BMI: is there a similarly strong relationship between BMI and cost? Let's look into it.

In [None]:
plt.figure(figsize=(14,5))
plt.title("Distribution of bmi")
ax = sns.histplot(df["bmi"], color = 'r')

In [None]:
plt.figure(figsize=(14,5))
ax = plt.scatter(df.bmi, df.charges)
plt.show()

Quite interesting: BMI follows an almost normal distribution, but there seems to be no obvious relationship between BMI and insurance cost.

Let us highlight smokers on this scatterplot.

In [None]:
plt.figure(figsize=(14,5))
ax = sns.scatterplot(x='bmi',y='charges',data=df, hue='smoker')

plt.show()

Now we see that smoking has the biggest effect on cost. Let's see if the number of children influences cost.

In [None]:
plt.figure(figsize=(14,5))
plt.hist(df[(df.children > 0)]["charges"], bins=25, alpha=0.5, color='blue', label='has children')
plt.hist(df[(df.children == 0)]["charges"], bins=25, alpha=0.5, color='yellow', label='no children')
plt.xlabel('Insurance Cost, $')
plt.legend()
plt.show()

It seems that having children doesn't influence insurance cost much.
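
The comparison between people with and without children, and the scope question about the average age of people with at least one child, can both be checked with a few pandas calls. Below is a minimal sketch on invented rows; the helper column `has_children` is hypothetical and not part of the dataset.

```python
import pandas as pd

# Invented rows; only the column names follow insurance.csv.
toy = pd.DataFrame({
    "age": [22, 35, 41, 28, 50, 33],
    "children": [0, 2, 1, 0, 3, 0],
    "charges": [2000.0, 6000.0, 7000.0, 3000.0, 11000.0, 4000.0],
})

# Hypothetical helper column: "yes" if the person has at least one child.
toy["has_children"] = toy["children"].gt(0).map({True: "yes", False: "no"})

# Average age of people with at least one child.
mean_age_parents = toy.loc[toy["has_children"] == "yes", "age"].mean()

# Median charges for each group; the median is less sensitive to outliers.
median_charges = toy.groupby("has_children")["charges"].median()

print(mean_age_parents)
print(median_charges)
```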

Now that we have a feel for our dataset, let's prepare the data for building a regression model.
Let's begin by encoding the categorical features.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integers (LabelEncoder sorts the classes alphabetically)
le = LabelEncoder()
for column in ['sex', 'smoker', 'region']:
    df[column] = le.fit_transform(df[column])

In [None]:
df.corr()['charges'].sort_values()

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
corr = df.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(240,10,as_cmap=True),
            square=True, ax=ax)
plt.show()

A strong correlation with charges is observed only for smoking.
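
One caveat when reading these coefficients: `LabelEncoder` maps the four regions to the integers 0 to 3 in alphabetical order, so the correlation for `region` reflects an arbitrary ordering. A possible alternative, shown here only as a sketch on invented rows, is one-hot encoding with `pd.get_dummies`, which gives each region its own 0/1 column.

```python
import pandas as pd

# Invented rows; only the column names follow insurance.csv.
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast", "southeast"],
    "charges": [3000.0, 9000.0, 4000.0, 5000.0, 12000.0],
})

# One indicator column per region instead of a single ordinal code.
encoded = pd.get_dummies(toy, columns=["region"], dtype=int)

# Correlation of each indicator column with charges.
region_corr = encoded.corr()["charges"].drop("charges")

print(encoded.columns.tolist())
print(region_corr)
```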

Let's predict insurance cost using different regression models.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
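# Use separate scalers so each one keeps the statistics of the data it was fit on
x_scaler = StandardScaler()
y_scaler = StandardScaler()
x = x_scaler.fit_transform(df[['age', 'bmi', 'children', 'smoker']])
y = y_scaler.fit_transform(df[['charges']])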

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y, random_state = 0)
mlr = LinearRegression().fit(x_train,y_train)

y_train_pred = mlr.predict(x_train)
y_test_pred = mlr.predict(x_test)

print(mlr.score(x_test,y_test))  # R^2 on the held-out test set

Let's see if we can do better with other algorithms.

In [None]:
regressor = DecisionTreeRegressor(max_depth = 4)
regressor.fit(x_train, y_train)
predictions = regressor.predict(x_test)
print(regressor.score(x_test, y_test))

A little bit of tweaking of the max_depth parameter and we got ourselves an R² score of about 0.89.
Not ideal, so let's try a random forest.

By tuning only max_depth and n_estimators, we found the best score for our random forest on this dataset.
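
The notebook does not show how the tuning was done. One common approach is a cross-validated grid search, sketched below with `GridSearchCV` on synthetic stand-in data; the data, the parameter grid and the number of folds are all assumptions chosen for illustration, not the values behind the results in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the four scaled features and the scaled target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 3] + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical grid over the two hyperparameters mentioned in the text.
param_grid = {"max_depth": [2, 4, 6], "n_estimators": [16, 32]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training split only
    scoring="r2",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # R^2 of the refit best model on the test split
```

Because the search only sees the training split, the test score remains an honest estimate of how the chosen settings generalize.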

In [None]:
tree = RandomForestRegressor(n_estimators = 32, random_state = 0, max_depth=4, n_jobs=-1)
tree.fit(x_train, y_train.ravel())
print(tree.score(x_test, y_test))
print(tree.score(x_train, y_train))
print(tree.feature_importances_)

Let's plot the feature importances.

In [None]:
col = ['age', 'bmi', 'children', 'smoker']
importances = tree.feature_importances_  # separate name so the target `y` is not overwritten
# plot
fig, ax = plt.subplots(figsize=(25,5))
width = 0.6  # the width of the bars
ind = np.arange(len(importances))  # the y locations for the bars
ax.barh(ind, importances, width)
ax.set_yticks(ind)
ax.set_yticklabels(col, minor=False)
plt.title('Feature importance in Random Forest Regressor')
plt.xlabel('Relative importance')
plt.ylabel('Feature')
plt.show()

### Conclusion

The fit is not ideal, but our model reaches an R² score of roughly 0.9 when predicting insurance cost from age, BMI, number of children, and smoking status. Sex and region were explored but not used as model features.
 
Bear in mind that our predictions rely heavily on smoking status, and that the data may be biased because it contains no elderly patients.