# U.S. Medical Insurance Costs

## Project Scope
### Goals: 
 	1.	Percentage Analysis of Smokers by Region:
Calculate the percentage of smokers and non-smokers for each region.
	
	2.	Average Charges for Smokers and Non-Smokers:
Determine the average insurance charges for smokers and non-smokers, broken down by region.
	
	3.	Correlation Between Age and Smoking Habit:
Check if there’s a statistical relationship (e.g., correlation coefficient) between age and smoking status.
	
		Preconditions

		Before calculating  r_{pb} , ensure the following:
			1.	Binary Variable (Smoking Habit):
			•	Smoking status must be coded as a binary variable (e.g., 1 for smoker, 0 for non-smoker).
			2.	Continuous Variable (Age):
			•	The age data should be numerical and continuous (not grouped into ranges).
			3.	No Extreme Outliers:
			•	Significant outliers in age might distort the correlation and should be addressed.
			4.	Sufficient Sample Size:
			•	Both groups (smokers and non-smokers) should have enough samples for a meaningful comparison.

	4.	BMI Impact on Smoking and Insurance Costs:
Explore how BMI varies between smokers and non-smokers and how it influences insurance charges.
	
	5.	Children’s Influence on Costs:
Analyze whether the number of children affects insurance costs differently for smokers versus non-smokers.


Import data

In [1]:
import csv

with open("../data/processed/insurance.csv") as data:
    records = csv.DictReader(data)
    data = [row for row in records]
        

Calculate the percentage of smokers and non-smokers for each region.
- total in each region

- percentage of smokes and non-smokes

In [None]:
# number of smokers, non-smokers and the total number of entries
total_entries = len(data)
smokers = [data[i]["smoker"] for i, _ in enumerate(data) if data[i]["smoker"] == "yes"]
non_smokers = [data[i]["smoker"] for i, _ in enumerate(data) if data[i]["smoker"] == "no"]
num_smokers = len(smokers)
num_non_smokers = len(non_smokers)

# set of regions checks how many unique regions we have
regions = {data[i]["region"] for i, _ in enumerate(data)}
regions

# four lists to represent NE, NW, SE, SW
ne = [data[i]["region"] for i, _ in enumerate(data) if data[i]["region"] == "northeast"]
nw = [data[i]["region"] for i, _ in enumerate(data) if data[i]["region"] == "northwest"]
se = [data[i]["region"] for i, _ in enumerate(data) if data[i]["region"] == "southeast"]
sw = [data[i]["region"] for i, _ in enumerate(data) if data[i]["region"] == "southwest"]
# create 8 dict lists from data dict 
ne_smokers = []
ne_non_smokers = []
nw_smokers = []
nw_non_smokers = []
se_smokers = []
se_non_smokers = []
sw_smokers = []
sw_non_smokers = []


for i, entry in enumerate(data):
    if data[i]["region"] == "northeast" and data[i]["smoker"] == "yes":
        ne_smokers.append(entry)
    elif data[i]["region"] == "northeast" and data[i]["smoker"] == "no":
        ne_non_smokers.append(entry)
    if data[i]["region"] == "northwest" and data[i]["smoker"] == "yes":
        nw_smokers.append(entry)
    elif data[i]["region"] == "northwest" and data[i]["smoker"] == "no":
        nw_non_smokers.append(entry)
    if data[i]["region"] == "southeast" and data[i]["smoker"] == "yes":
        se_smokers.append(entry)
    elif data[i]["region"] == "southeast" and data[i]["smoker"] == "no":
        se_non_smokers.append(entry)
    if data[i]["region"] == "southwest" and data[i]["smoker"] == "yes":
        sw_smokers.append(entry)
    elif data[i]["region"] == "southwest" and data[i]["smoker"] == "no":
        sw_non_smokers.append(entry)

percentages_smoke_non_smoke = {
    "northeast": [],
    "northwest": [],
    "southeast": [],
    "southweast": []
}

percentages_smoke_non_smoke["northeast"].append([round(len(ne_smokers)/len(ne), 2), round(len(ne_non_smokers)/len(ne), 2)])
percentages_smoke_non_smoke["northwest"].append([round(len(nw_smokers)/len(nw), 2), round(len(nw_non_smokers)/len(nw), 2)])
percentages_smoke_non_smoke["southeast"].append([round(len(se_smokers)/len(se), 2), round(len(se_non_smokers)/len(se), 2)])
percentages_smoke_non_smoke["southweast"].append([round(len(sw_smokers)/len(sw), 2), round(len(sw_non_smokers)/len(sw), 2)])

percentages_smoke_non_smoke


{'northeast': [[0.21, 0.79]],
 'northwest': [[0.18, 0.82]],
 'southeast': [[0.25, 0.75]],
 'southweast': [[0.18, 0.82]]}

	2.	Average Charges for Smokers and Non-Smokers:
Determine the average insurance charges for smokers and non-smokers, broken down by region.

In [4]:
ne_smokers_charges = [float(entry["charges"]) for entry in ne_smokers]
ne_smokers_avg_charges = round(sum(ne_smokers_charges)/len(ne_smokers_charges), 2)
ne_non_smokers_charges = [float(entry["charges"]) for entry in ne_non_smokers]
ne_non_smokers_avg_charges = round(sum(ne_non_smokers_charges)/len(ne_non_smokers_charges), 2)
ne_delta_smoke_charges = round(ne_non_smokers_avg_charges - ne_smokers_avg_charges, 2)

nw_smokers_charges = [float(entry["charges"]) for entry in nw_smokers]
nw_smokers_avg_charges = round(sum(nw_smokers_charges)/len(nw_smokers_charges), 2)
nw_non_smokers_charges = [float(entry["charges"]) for entry in nw_non_smokers]
nw_non_smokers_avg_charges = round(sum(nw_non_smokers_charges)/len(nw_non_smokers_charges), 2)
nw_delta_smoke_charges = round(nw_non_smokers_avg_charges - nw_smokers_avg_charges, 2)

se_smokers_charges = [float(entry["charges"]) for entry in se_smokers]
se_smokers_avg_charges = round(sum(se_smokers_charges)/len(se_smokers_charges), 2)
se_non_smokers_charges = [float(entry["charges"]) for entry in se_non_smokers]
se_non_smokers_avg_charges = round(sum(se_non_smokers_charges)/len(se_non_smokers_charges), 2)
se_delta_smoke_charges = round(se_non_smokers_avg_charges - se_smokers_avg_charges, 2)

sw_smokers_charges = [float(entry["charges"]) for entry in sw_smokers]
sw_smokers_avg_charges = round(sum(ne_smokers_charges)/len(sw_smokers_charges), 2)
sw_non_smokers_charges = [float(entry["charges"]) for entry in sw_non_smokers]
sw_non_smokers_avg_charges = round(sum(sw_non_smokers_charges)/len(sw_non_smokers_charges), 2)
sw_delta_smoke_charges = round(sw_non_smokers_avg_charges - sw_smokers_avg_charges, 2)

ne_delta_smoke_charges, nw_delta_smoke_charges, se_delta_smoke_charges, sw_delta_smoke_charges


(-20508.01, -21635.54, -26812.78, -26258.77)

3.	Correlation Between Age and Smoking Habit:
Check if there’s a statistical relationship (e.g., correlation coefficient) between age and smoking status.
	
		Preconditions

		Before calculating  r_{pb} , ensure the following:
			1.	Binary Variable (Smoking Habit):
			•	Smoking status must be coded as a binary variable (e.g., 1 for smoker, 0 for non-smoker).
			2.	Continuous Variable (Age):
			•	The age data should be numerical and continuous (not grouped into ranges).
			3.	No Extreme Outliers:
			•	Significant outliers in age might distort the correlation and should be addressed.
			4.	Sufficient Sample Size:
			•	Both groups (smokers and non-smokers) should have enough samples for a meaningful comparison.

In [None]:
age_smoking = [[entry["age"], entry["smoker"]] for entry in data]
for entry in age_smoking:
    if entry[1] == "yes":
        entry[1] = 1
    elif entry[1] == "no":
        entry[1] = 0
    else:
        entry[1] == "n/a"
    entry[0] = int(entry[0])
    


Interquartile Range (IQR) Method

	•	Steps:
	1.	Calculate the IQR:  \text{IQR} = Q3 - Q1 , where  Q1  is the first quartile (25th percentile) and  Q3  is the third quartile (75th percentile).
	2.	Define the bounds for outliers:
	•	Lower Bound:  Q1 - 1.5 \times \text{IQR} 
	•	Upper Bound:  Q3 + 1.5 \times \text{IQR} 
	3.	Any value outside these bounds is considered an outlier.

In [6]:
# Interquartile Range (IQR) Method
def median(a):
    if len(a) % 2 != 0:
        return a[len(a) // 2]
    result = ((a[len(a) // 2] + a[len(a) // 2 - 1]))/ 2
    return int(result)


def quartiles(arr):
    # Write your code here
    arr = sorted(arr)
    q1 = []
    q2 = []
    q3 = []
    n = len(arr)
    if n % 2 == 0:
        for i, e in enumerate(arr):
            if i < n // 2:
                q1.append(e)
            if i >= n // 2:
                q3.append(e)
        return median(q1), median(arr), median(q3)
    else:
        for i, e in enumerate(arr):
            if i < n // 2:
                q1.append(e)
            if i == n // 2:
                q2.append(e)
            if i > n // 2:
                q3.append(e)
        return median(q1), median(q2), median(q3)

sorted_age_smoking = sorted(age_smoking)
unzipped_age = [entry[0] for entry in sorted_age_smoking]
unzipped_smoking = [entry[1] for entry in sorted_age_smoking]

q1, q2, q3 = quartiles(unzipped_age)
print(q1, q2, q3)
iqr = q3 - q1
print(iqr)
lower_bound = q1 - 1.5 * iqr
upper_bound = q1 + 1.5 * iqr
print(lower_bound, upper_bound)

# no value is outside of the upper and lower bound hence there are no outliers in our dataset



27 39 51
24
-9.0 63.0


### Point Biserial Correlation Coefficient

In [11]:
# checking for statistical relationship (correlation coefficient) between smoking and age
#	 \bar{X}_1 : Mean age of smokers.
	# •	 \bar{X}_0 : Mean age of non-smokers.
	# •	 s : Standard deviation of the age variable.
	# •	 n_1 : Number of smokers.
	# •	 n_0 : Number of non-smokers.
	# •	 n : Total number of observations.
 # stdev = np.std(data, ddof=1)
 
 
import numpy as np
smokers_age = [int(data[i]["age"]) for i, _ in enumerate(data) if data[i]["smoker"] == "yes"]
non_smokers_age = [int(data[i]["age"]) for i, _ in enumerate(data) if data[i]["smoker"] == "no"]
non_smokers_mean = np.mean(non_smokers_age)
smokers_mean = np.mean(smokers_age)
age_stdev = np.std(unzipped_age, ddof=0)
num_of_observations = len(data)

pbc_coeff = ((smokers_mean - non_smokers_mean) / age_stdev) * ((num_smokers * num_non_smokers) / (num_of_observations * (num_of_observations - 1))) ** 0.5
pbc_coeff








-0.025028106089439197

*No significant correlation between age and smoking has been proven as pbc_coefficient is close to zero*

	4.	BMI Impact on Smoking and Insurance Costs:
Explore how BMI varies between smokers and non-smokers and how it influences insurance charges.

1. Steps for gauging variance:

	1.	Compute the mean BMI ( \mu ).
	2.	Subtract the mean from each BMI value to find the deviations.
	3.	Square the deviations and compute their average (adjusted by  n - 1  for a sample).

2. Gauge BMI’s Influence on Insurance Charges

## Step 1: Calculate Pearson Correlation Coefficient ( r )

The Pearson correlation coefficient quantifies the linear relationship between BMI and insurance charges.

In [35]:
# bmi variance
bmis = [float(data[i]["bmi"]) for i, entry in enumerate(data)]
bmi_mean = np.mean(bmis)
dev_sum = 0
for e in bmis:
    dev_sum += (e - bmi_mean) ** 2
bmi_var = dev_sum / (len(bmis) - 1)
print(f"BMI variance: {bmi_var:.2f}")

# Pearson Correlation Coefficient (pcc_r) to quantify the linear relationship bw BMI and insurance charges
insurance_charges = [float(data[i]["charges"]) for i, entry in enumerate(data)]
charges_mean = np.mean(insurance_charges)

# top sum of the equation
top_sum = 0
for i, _ in enumerate(bmis):
    top_sum = (bmis[i] - bmi_mean) * (insurance_charges[i] - charges_mean)

left_bottom_sum = 0
for var in bmis:
    left_bottom_sum += (var - bmi_mean) ** 2
right_bottom_sum = 0
for var in insurance_charges:
    right_bottom_sum += (var - charges_mean) ** 2

pcc_r = top_sum / ((left_bottom_sum * right_bottom_sum) ** 0.5)

print(f"Correlation is effectively non-existent: {pcc_r}")



BMI variance: 37.19
Correlation is effectively non-existent: -0.00025612375974814555


## Step 2: Fit a Linear Regression Model

y = b0 + b1 * x + e

In [40]:
# calculate slope b1

b1 = top_sum / left_bottom_sum

b0 = charges_mean - b1 * bmi_mean

b1, b0

(-0.5086202921386634, 13286.018291010656)

Resulting linear regression equation: 

y = 13286 - 0.5 * x

	5.	Children’s Influence on Costs:
Analyze whether the number of children affects insurance costs differently for smokers versus non-smokers.
	
		- for each children number calculate mean insurance costs for smokers and non-smokers, and
		
		- compare the resulting values

In [116]:
# build a dictionary with children num as keys and values as dictionaries of corresponding persons

children_num_list = [entry["children"] for i, entry in enumerate(data)]
max(children_num_list)
# max 5 children

children_num_dict = {
    0: [],
    1: [],
    2: [],
    3: [],
    4: [],
    5: []
}
for index, entry in enumerate(data):
    if entry["children"] == "0":
        children_num_dict[0].append((entry["charges"], entry["smoker"]))
    if entry["children"] == "1":
        children_num_dict[1].append((entry["charges"], entry["smoker"]))
    if entry["children"] == "2":
        children_num_dict[2].append((entry["charges"], entry["smoker"]))
    if entry["children"] == "3":
        children_num_dict[3].append((entry["charges"], entry["smoker"]))
    if entry["children"] == "4":
        children_num_dict[4].append((entry["charges"], entry["smoker"]))
    if entry["children"] == "5":
        children_num_dict[5].append((entry["charges"], entry["smoker"]))


# build lists for costs of smokers and non-smokers per each children group
zero_children_lst_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[0])) if key == 0 and value[i][1] == "yes"]
zero_children_lst_non_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[0])) if key == 0 and value[i][1] == "no"]

one_children_lst_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[1])) if key == 1 and value[i][1] == "yes"]
one_children_lst_non_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[1])) if key == 1 and value[i][1] == "no"]


two_children_lst_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[2])) if key == 2 and value[i][1] == "yes"]
two_children_lst_non_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[2])) if key == 2 and value[i][1] == "no"]


three_children_lst_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[3])) if key == 3 and value[i][1] == "yes"]
three_children_lst_non_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[3])) if key == 3 and value[i][1] == "no"]


four_children_lst_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[4])) if key == 4 and value[i][1] == "yes"]
four_children_lst_non_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[4])) if key == 4 and value[i][1] == "no"]


five_children_lst_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[5])) if key == 5 and value[i][1] == "yes"]
five_children_lst_non_smokers = [float(value[i][0]) for key, value in children_num_dict.items() for i in range(len(children_num_dict[5])) if key == 5 and value[i][1] == "no"]


# # append mean insurance costs for smokers and non smokers to the children num dict
children_num_dict[0].extend([np.mean(zero_children_lst_smokers), np.mean(zero_children_lst_non_smokers)])
children_num_dict[1].extend([np.mean(one_children_lst_smokers), np.mean(one_children_lst_non_smokers)])
children_num_dict[2].extend([np.mean(two_children_lst_smokers), np.mean(two_children_lst_non_smokers)])
children_num_dict[3].extend([np.mean(three_children_lst_smokers), np.mean(three_children_lst_non_smokers)])
children_num_dict[4].extend([np.mean(four_children_lst_smokers), np.mean(four_children_lst_non_smokers)])
children_num_dict[5].extend([np.mean(five_children_lst_smokers), np.mean(five_children_lst_non_smokers)])

for key, value in children_num_dict.items():
    print(f"{key} children: mean of insurance costs for: smokers = {value[-2]:.2f}; non-smokers = {value[-1]:.2f}")

0 children: mean of insurance costs for: smokers = 31341.36; non-smokers = 7611.79
1 children: mean of insurance costs for: smokers = 31822.65; non-smokers = 8303.11
2 children: mean of insurance costs for: smokers = 33844.24; non-smokers = 9493.09
3 children: mean of insurance costs for: smokers = 32724.92; non-smokers = 9614.52
4 children: mean of insurance costs for: smokers = 26532.28; non-smokers = 12121.34
5 children: mean of insurance costs for: smokers = 19023.26; non-smokers = 8183.85


What do the above gauges tell me about the data?

Key Observations:

	1.	Significant Cost Disparity Between Smokers and Non-Smokers:
	•	For all child counts, smokers pay substantially higher insurance costs than non-smokers. This aligns with the expectation that smoking is a major health risk, leading to higher insurance premiums.
	•	The gap is stark:
	•	For 0 children, smokers pay about 4 times more on average than non-smokers.
	•	For 1 and 2 children, the difference remains high, though slightly less pronounced.
	2.	Cost Trends as the Number of Children Increases:
	•	For smokers:
	•	Costs initially increase with more children (peaking at 2 children,  \sim33,844 ), then decline notably as the number of children reaches 4 or 5.
	•	This could indicate:
	•	Discounts or adjustments offered for families with more dependents.
	•	Potential data anomalies or fewer smokers with larger families (causing sample size issues).
	•	For non-smokers:
	•	Costs rise steadily up to 4 children ( \sim12,121 ), reflecting a more consistent relationship between the number of dependents and costs.
	3.	Lower Costs for Smokers with More Children (4 or 5):
	•	Smokers with 4 or 5 children incur notably lower insurance costs than smokers with fewer children.
	•	Possible explanations:
	•	Insurers may apply family-specific pricing structures that offset smoking penalties for larger households.
	•	Smokers with larger families might have other mitigating factors (e.g., lower BMI, younger age, different regions).