
# U.S. Medical Insurance Costs

In this project, the goal is to analyze a CSV file titled 'insurance.csv' to find patterns in the data. The objective is to see how different parameters affect the charges (insurance costs) that people pay. The analysis covers four questions:
1. How region affects insurance charges
2. How region relates to the number of children, and how both relate to charges
3. How region relates to smoking status, and how both relate to charges
4. How all the variables together relate to insurance costs

In [75]:
# Import the required libraries
import csv 
import pandas as pd
import numpy as np
from sklearn import linear_model

In [100]:
# Load and check the data (pd.read_csv opens the file itself, so no separate open() is needed)
insurance = pd.read_csv('insurance.csv')

print(insurance)

      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 7 columns]


Before analyzing the data, we first split it into smaller DataFrames, one for each pair of columns we want to compare:

In [77]:
#For region and insurance
RandI = insurance[["region", "charges"]]

In [78]:
#For region and children
RandC = insurance[["region", "children"]]

In [79]:
#For smoker status and region
RandS = insurance[["region", "smoker"]]

#### 1st Problem: How different regions affect insurance costs.

In [80]:
#We first find the total unique locations of the dataset
UniqueLocations = list(insurance["region"].unique())
print(UniqueLocations)

['southwest', 'southeast', 'northwest', 'northeast']


In [81]:
#We can then allocate each unique value with their own datasets
SW = RandI.loc[RandI["region"] == "southwest", "charges"]
SE = RandI.loc[RandI["region"] == "southeast", "charges"]
NW = RandI.loc[RandI["region"] == "northwest", "charges"]
NE = RandI.loc[RandI["region"] == "northeast", "charges"]

In [82]:
#We can then find the average for each of these values 
SW_avg = round(SW.sum()/len(SW), 2)
SE_avg = round(SE.sum()/len(SE), 2)
NW_avg = round(NW.sum()/len(NW), 2)
NE_avg = round(NE.sum()/len(NE), 2)

In [83]:
# Collect the averages in a list of dictionaries, then sort by average (highest first):
average_cost = [{"Region": "Southwest", "Average": SW_avg}, 
                {"Region": "Southeast", "Average": SE_avg}, 
                {"Region": "Northwest", "Average": NW_avg},
                {"Region": "Northeast", "Average": NE_avg},
]

def get_money(entry):
    # Sort key: the "Average" value of one region entry
    return entry.get("Average")

average_cost.sort(key=get_money, reverse=True)
print(average_cost)

[{'Region': 'Southeast', 'Average': 14735.41}, {'Region': 'Northeast', 'Average': 13406.38}, {'Region': 'Northwest', 'Average': 12417.58}, {'Region': 'Southwest', 'Average': 12346.94}]


Therefore, we can see that the Southeast has the highest average insurance cost (14,735.41), while the Southwest has the lowest (12,346.94). Before drawing a conclusion, however, we first have to look at the other variables.
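
The per-region averages can also be computed in one step with `groupby`. The block below is a minimal sketch, not the notebook's own method: the DataFrame `toy` and its values are made up for illustration and only mimic the `region` and `charges` columns of 'insurance.csv'.

```python
import pandas as pd

# Hypothetical toy rows standing in for insurance.csv (values are made up).
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "southwest"],
    "charges": [1000.0, 3000.0, 5000.0, 2000.0, 2000.0],
})

# Mean charges per region, rounded to 2 decimals and sorted from highest to lowest.
region_means = (
    toy.groupby("region")["charges"]
    .mean()
    .round(2)
    .sort_values(ascending=False)
)
print(region_means)
```

Applied to the real data, the same pattern would replace the four separate region filters and the manual `sum()/len()` averages with a single expression.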

#### 2nd Problem: How regions affect number of children and correlation to charges

In [84]:
# Since we already know the different unique locations of the dataset, we can focus on the average no. of children
SW_C = RandC.loc[RandC["region"] == "southwest", "children"]
SE_C = RandC.loc[RandC["region"] == "southeast", "children"]
NW_C = RandC.loc[RandC["region"] == "northwest", "children"]
NE_C = RandC.loc[RandC["region"] == "northeast", "children"]

In [85]:
SW_C_avg = round(SW_C.sum()/len(SW_C), 2)
SE_C_avg = round(SE_C.sum()/len(SE_C), 2)
NW_C_avg = round(NW_C.sum()/len(NW_C), 2)
NE_C_avg = round(NE_C.sum()/len(NE_C), 2)
print(SW_C_avg, SE_C_avg, NW_C_avg, NE_C_avg)

1.14 1.05 1.15 1.05


As we can see, the average number of children differs only slightly between regions. However, to get a better description of the relationship between region, children, and insurance charges, we need to create a regression model. To do so, we can do the following:

In [86]:
# Putting the different columns in a new dataframe
RIC = insurance[["region", "children", "charges"]]

In [93]:
# Encode each region as an integer. assign() returns a new DataFrame,
# so the column is not modified in place on a slice of `insurance`.
RIC = RIC.assign(region=RIC["region"].map({"southwest": 1, "southeast": 2, "northwest": 3, "northeast": 4}))

In [99]:
# Creating the linear regression model
X = RIC[["region", "children"]]
y = RIC["charges"]

RIC_regr = linear_model.LinearRegression()
RIC_regr.fit(X,y)

region_weight = RIC_regr.coef_[0]
children_weight = RIC_regr.coef_[1]

print(region_weight)
print(children_weight)

80.41602393563173
684.3106281072279


Based on the model using both region and number of children, the weight for region (about 80) is much lower than the weight for children (about 684 per child). Note that the region codes 1 to 4 are arbitrary labels, so the region weight depends on the order in which the regions were numbered. The larger weight for children could be because, compared with where a person lives, having children puts a bigger burden on the cost of insurance: there are more basic needs to cover and more people consuming.
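
Because integer codes impose an order on the regions, a common alternative is one-hot encoding, where each region gets its own 0/1 column. The block below is a hedged sketch of that idea, not the notebook's method. The DataFrame `toy` and its values are invented, and the fit uses NumPy's least-squares solver instead of scikit-learn only to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (made-up values), built so that
# charges = 1000 + 500 * children + a fixed amount per region.
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southwest", "southeast", "northwest", "northeast"],
    "children": [0, 1, 2, 3, 1, 0, 3, 2],
    "charges": [1200.0, 1800.0, 2100.0, 2500.0, 1700.0, 1300.0, 2600.0, 2000.0],
})

# One 0/1 column per region. drop_first=True drops the first region in
# alphabetical order (northeast), which becomes the baseline.
region_dummies = pd.get_dummies(toy["region"], drop_first=True, dtype=float)
X = pd.concat([region_dummies, toy[["children"]]], axis=1)

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
coef, *_ = np.linalg.lstsq(design, toy["charges"].to_numpy(), rcond=None)

# Coefficient order: intercept, northwest, southeast, southwest, children.
for name, value in zip(["intercept", *X.columns], coef):
    print(f"{name}: {value:.2f}")
```

Each region weight is then read as the difference from the baseline region, which does not depend on any numbering of the regions.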

#### 3rd Problem: Correlation of smoking with region and with insurance costs

In [103]:
# Encode smoker status as 0/1 so that the column mean is the share of smokers.
# assign() returns a new DataFrame instead of modifying a slice in place.
RandS = RandS.assign(smoker=RandS["smoker"].map({"no": 0, "yes": 1}))


SW_S = RandS.loc[RandS["region"] == "southwest", "smoker"]
SE_S = RandS.loc[RandS["region"] == "southeast", "smoker"]
NW_S = RandS.loc[RandS["region"] == "northwest", "smoker"]
NE_S = RandS.loc[RandS["region"] == "northeast", "smoker"]

In [104]:
# Finding the share of smokers in each region
SW_S_avg = round(SW_S.sum()/len(SW_S), 2)
SE_S_avg = round(SE_S.sum()/len(SE_S), 2)
NW_S_avg = round(NW_S.sum()/len(NW_S), 2)
NE_S_avg = round(NE_S.sum()/len(NE_S), 2)
print(SW_S_avg, SE_S_avg, NW_S_avg, NE_S_avg)

0.18 0.25 0.18 0.21


From the averages above, we can see that the share of smokers differs by region, with the southeast having the highest share (0.25). This suggests some association between region and smoking status. However, to see how this relates to insurance cost, we need to apply multivariable linear regression.
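
The share of smokers per region can also be computed without recoding the column first. This is a minimal sketch under the assumption that `smoker` holds the strings "yes" and "no". The DataFrame `toy` is hypothetical and its values are made up.

```python
import pandas as pd

# Hypothetical toy rows (made-up values), not taken from insurance.csv.
toy = pd.DataFrame({
    "region": ["southeast", "southeast", "southeast", "southeast",
               "northwest", "northwest"],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
})

# Comparing with "yes" gives True/False; the mean of booleans is the share of True.
smoker_share = (toy["smoker"] == "yes").groupby(toy["region"]).mean().round(2)
print(smoker_share)
```

Here `smoker_share` is a Series indexed by region, so the highest share can be looked up with `smoker_share.idxmax()`.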

In [106]:
# Putting the different columns in a new dataframe
RIS = insurance[["region", "charges", "smoker"]]

# Encode region and smoker as integers. assign() returns a new DataFrame,
# so the columns are not modified in place on a slice of `insurance`.
RIS = RIS.assign(
    region=RIS["region"].map({"southwest": 1, "southeast": 2, "northwest": 3, "northeast": 4}),
    smoker=RIS["smoker"].map({"no": 0, "yes": 1}),
)

# Linear regression model:
X1 = RIS[["region", "smoker"]]
y1 = RIS["charges"]

RIS_regr = linear_model.LinearRegression()
RIS_regr.fit(X1,y1)

region_weight = RIS_regr.coef_[0]
smoker_weight = RIS_regr.coef_[1]

print(region_weight)
print(smoker_weight)

49.22888378068346
23615.669716591332


Based on the result, being a smoker raises the predicted insurance cost by about 23,616, far more than the weight of the region (about 49). Compared with region, smoking status affects the insurance cost much more, plausibly because of higher medical bills and a greater risk of death. For smokers in particular, continuing to smoke is therefore costly.

#### 4th Problem: How do all the parameters correlate with one another?

In [108]:
# First, we need to replace the string values with integer codes.
# Assigning the result back avoids relying on in-place changes to a selected column.
insurance["region"] = insurance["region"].map({"southwest": 1, "southeast": 2, "northwest": 3, "northeast": 4})
insurance["smoker"] = insurance["smoker"].map({"no": 0, "yes": 1})
insurance["sex"] = insurance["sex"].map({"male": 1, "female": 2})

# We then can set the X and y values:
X_final = insurance.drop(["charges"], axis=1)
y_final = insurance["charges"]

In [109]:
# Applying linear regression model from SKlearn
insurance_regr = linear_model.LinearRegression()
insurance_regr.fit(X_final,y_final)

print(insurance_regr.coef_)

[  257.28807486   131.11057962   332.57013224   479.36939355
 23820.43412267   353.64001656]


In [110]:
# Assigning each weight to a named variable:
age_w = insurance_regr.coef_[0]
sex_w = insurance_regr.coef_[1]
bmi_w = insurance_regr.coef_[2]
chi_w = insurance_regr.coef_[3]
smo_w = insurance_regr.coef_[4]
reg_w = insurance_regr.coef_[5]

print(age_w, sex_w, bmi_w, chi_w, smo_w, reg_w)

257.2880748580624 131.11057962208807 332.57013224229644 479.36939354512776 23820.434122672934 353.6400165588391


In [111]:
# Checking the fit of the model (R^2 on the same data it was trained on)
print(insurance_regr.score(X_final, y_final))

0.7507372027994937


Based on the coefficients above, the different parameters carry different weights. The R^2 value of about 0.75 means the model explains roughly three quarters of the variation in charges on the data it was fitted to, so the fit is moderate rather than strong and the coefficients should be read with some caution. Nevertheless, some conclusions can be drawn from this data analysis. First, being a smoker affects insurance costs by a far wider margin than any other variable. This may be due to the direct impact smoking has on a person's health. Second, sex has the smallest weight of all the variables, so a different sex does not necessarily mean much higher insurance costs. However, if sex had no relationship with charges at all, its weight would be close to 0. The weight here is about 131, which, with male coded as 1 and female as 2, corresponds to slightly higher predicted charges for women when the other variables are held equal. This suggests that there is still some difference between the insurance costs of men and women, although this analysis does not test whether that difference is statistically significant. With those two conclusions, this data analysis ends.
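
For reference, the value returned by `LinearRegression.score` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The block below is a small worked example of that formula using made-up numbers. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the insurance data.

```python
import numpy as np

# Hypothetical observed values and model predictions (made-up numbers).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 2.0])

# Residual sum of squares: squared error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared error of always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```

An R^2 of 1 means the predictions match the data exactly, while 0 means the model does no better than predicting the mean every time.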

# Thank you for reading the file ＜（＾－＾）＞