# Project Scoping

The dataset has 1338 rows with no missing data and contains the following columns:
* age
* sex
* bmi
* children
* smoker
* region
* charges

I want to know answers to the following questions:

* Does BMI correlate with charges? Is the correlation significant? Can we describe it with a line of best fit?
* How do the average costs for males and females compare?
* Which region paid the most in charges?
* What region had the highest per capita charges?

I begin by reading the dataset with csv.DictReader, then reorganize it into a column-oriented dictionary: each key is a column name, and each value is a list in which index i holds the value from data row i + 1:

In [118]:
import csv
import math
insurance_data = {
    'rows': []
}

with open(r"C:\Users\bobro\Downloads\python-portfolio-project-starter-files\python-portfolio-project-starter-files\insurance.csv") as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        insurance_data['rows'].append(row)

clean_data ={
    'age': [],
    'sex': [],
    'bmi': [],
    'children': [],
    'smoker': [],
    'region': [],
    'charges': [],
    }

# Copy each row's values into the matching column list.
for row in insurance_data['rows']:
    for column in clean_data:
        clean_data[column].append(row[column])




print(clean_data.keys())



dict_keys(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'])


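Our data is now organized into a dictionary with one key per column, where each key's value is a list holding one entry per row. Every value is still a string, so numeric columns are converted with float() wherever they are used. We are ready to analyze this data. Below we begin our analysis: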
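
As a minimal sketch of the same row-to-column reshaping, the example below reads a tiny in-memory CSV and converts the numeric columns while loading. The two sample rows are made up for illustration and are not records from insurance.csv; the helper name rows_to_columns and the NUMERIC_COLUMNS mapping are assumptions introduced here, not part of the notebook above.

```python
import csv
import io

# Hypothetical sample rows for illustration only; not real records from insurance.csv.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,22.5,0,no,northwest,4000.0
45,male,31.0,2,yes,southeast,21000.5
"""

# Assumed mapping of numeric columns to converters; other columns stay as strings.
NUMERIC_COLUMNS = {'age': int, 'bmi': float, 'children': int, 'charges': float}


def rows_to_columns(rows, converters):
    """Turn an iterable of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            convert = converters.get(name, str)
            columns.setdefault(name, []).append(convert(value))
    return columns


sample_columns = rows_to_columns(csv.DictReader(io.StringIO(SAMPLE_CSV)), NUMERIC_COLUMNS)
print(sample_columns['bmi'])
```

Converting once at load time would remove the need for the repeated float(...) calls in later cells, at the cost of deciding the column types up front.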

# Charges by Region

First, what region paid the most in charges?

In [109]:
charges_by_region = {
    'southwest': 0,
    'southeast': 0,
    'northwest': 0,
    'northeast': 0,
}
# Walk the regions and charges in parallel, adding each charge to its region's total.
for region, charge in zip(clean_data['region'], clean_data['charges']):
    if region in charges_by_region:
        charges_by_region[region] += float(charge)
    else:
        print('Error')

for item in charges_by_region:
    charges_by_region[item] = round(charges_by_region[item], 2)

print(charges_by_region)


{'southwest': 4012754.65, 'southeast': 5363689.76, 'northwest': 4035712.0, 'northeast': 4343668.58}


By total charges, the southeast paid the most by a wide margin: about $5.36 million, compared with $4.0 to $4.3 million for each of the other regions.
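
The per-region totals above follow a general "sum by group" pattern. Here is a minimal sketch of that pattern using collections.defaultdict, so the group labels do not have to be listed in advance. The function name sum_by_group and the example values are assumptions for illustration, not data from insurance.csv.

```python
from collections import defaultdict


def sum_by_group(groups, values):
    """Sum values per group label, pairing the two lists position by position."""
    totals = defaultdict(float)
    for group, value in zip(groups, values):
        totals[group] += float(value)
    return {group: round(total, 2) for group, total in totals.items()}


# Made-up example data (not from insurance.csv); values are strings, as csv returns them.
example_regions = ['southwest', 'southeast', 'southwest', 'northeast']
example_charges = ['100.50', '200.25', '50.25', '10.00']
example_totals = sum_by_group(example_regions, example_charges)
print(example_totals)
```

One trade-off: unlike the explicit dictionary of four regions, this version silently accepts a misspelled region as a new group instead of printing an error.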

# Per Capita Charges By Region

I'm curious how this might be affected by discrepancies in regional sample size. Let's count the records in each region and calculate a per capita cost:

In [110]:
record_count_by_region = {
    'southwest': 0,
    'southeast': 0,
    'northwest': 0,
    'northeast': 0, 
}

for value in clean_data['region']:
    record_count_by_region[value] += 1

print('People in study by region: \n', record_count_by_region, '\n \n')

regional_per_capita_charges = {
    'southwest': 0,
    'southeast': 0,
    'northwest': 0,
    'northeast': 0,
}

for item in charges_by_region:
    regional_per_capita_charges[item] = round((charges_by_region[item] / record_count_by_region[item]), 2)

print('Per capita insurance cost by region: \n', regional_per_capita_charges)


People in study by region: 
 {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324} 
 

Per capita insurance cost by region: 
 {'southwest': 12346.94, 'southeast': 14735.41, 'northwest': 12417.58, 'northeast': 13406.38}


This confirms that the southeast's lead is not just a matter of sample size: people in the southeast also have the highest per capita charges in our dataset ($14,735.41), although the data so far does not tell us why.
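
As a quick arithmetic check, the per capita figures can be recomputed directly from the totals and counts printed above and then ranked. The variable names below are introduced here for illustration only.

```python
# Totals and record counts copied from the printed output above.
printed_totals = {'southwest': 4012754.65, 'southeast': 5363689.76,
                  'northwest': 4035712.0, 'northeast': 4343668.58}
printed_counts = {'southwest': 325, 'southeast': 364,
                  'northwest': 325, 'northeast': 324}

per_capita = {region: round(printed_totals[region] / printed_counts[region], 2)
              for region in printed_totals}

# Rank regions from highest to lowest per capita charge.
ranking = sorted(per_capita, key=per_capita.get, reverse=True)
print(ranking)
```

The ranking by per capita charges differs slightly from the ranking by totals: the northeast has the fewest records but the second-highest per capita charge.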

# Average Cost: Males vs. Females

I want to see how the average cost of insurance compares between males and females. I calculate both averages below:

In [111]:
Female_vs_Male_Aggregates = {
    'Sum Cost Females': 0,
    'Count Females': 0,
    'Sum Cost Males': 0,
    'Count Males': 0,
}

# Walk sex and charges in parallel, accumulating a sum and a count for each sex.
for sex, charge in zip(clean_data['sex'], clean_data['charges']):
    if sex == 'female':
        Female_vs_Male_Aggregates['Sum Cost Females'] += float(charge)
        Female_vs_Male_Aggregates['Count Females'] += 1
    elif sex == 'male':
        Female_vs_Male_Aggregates['Sum Cost Males'] += float(charge)
        Female_vs_Male_Aggregates['Count Males'] += 1

Female_vs_Male_Aggregates['Average Cost Females'] = round(Female_vs_Male_Aggregates['Sum Cost Females'] / Female_vs_Male_Aggregates['Count Females'], 2)
Female_vs_Male_Aggregates['Average Cost Males'] = round(Female_vs_Male_Aggregates['Sum Cost Males'] / Female_vs_Male_Aggregates['Count Males'], 2)

print('The average cost for males in our dataset was', Female_vs_Male_Aggregates['Average Cost Males'], 'compared to the average cost for females in our dataset which is', Female_vs_Male_Aggregates['Average Cost Females'] )


The average cost for males in our dataset was 13956.75 compared to the average cost for females in our dataset which is 12569.58
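
To put the two averages in proportion, here is a small worked calculation of the gap between them, using only the figures printed above. Expressing the gap relative to the female average is a choice made here for illustration.

```python
# Averages copied from the printed output above.
avg_male = 13956.75
avg_female = 12569.58

absolute_gap = avg_male - avg_female
# Express the gap relative to the female average.
relative_gap_pct = absolute_gap / avg_female * 100

print(f"Males averaged {absolute_gap:.2f} more, about {relative_gap_pct:.1f}% above the female average")
```

This is only a descriptive comparison of two means; it does not say whether the difference is statistically significant.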


I'm going to calculate the correlation between BMI and charges next. Since we just calculated the averages for males and females, I want to anticipate a potential insight: is the average cost for each sex influenced by the distribution of BMI within that sex?

I'll calculate the mean, median, and standard deviation of BMI for each sex below:

In [112]:
def take_mean(list_of_numbers):
    mean = sum(list_of_numbers) / len(list_of_numbers)
    return mean

def take_median(list_of_numbers):
    # Assumes the list is already sorted. For an even-length list this
    # returns the lower of the two middle values, not their average.
    mid = (len(list_of_numbers) - 1) // 2
    median = (list_of_numbers[mid])
    return median

def take_std_dev(list_of_numbers):
    # Population standard deviation: divides by N rather than N - 1.
    list_mean = take_mean(list_of_numbers)
    sum_distances_sq = 0
    for number in list_of_numbers:
        distance_to_mean_sq = abs(number - list_mean)**2
        sum_distances_sq += distance_to_mean_sq
    std_dev = math.sqrt(sum_distances_sq / len(list_of_numbers))
    return std_dev
        

bmi_distribution_lists = {
    'male BMIs': [],
    'female BMIs': [],

}
# Split the BMI values into one list per sex.
for sex, bmi in zip(clean_data['sex'], clean_data['bmi']):
    if sex == 'male':
        bmi_distribution_lists['male BMIs'].append(float(bmi))
    elif sex == 'female':
        bmi_distribution_lists['female BMIs'].append(float(bmi))
# take_median expects sorted input, so sort both lists first.
bmi_distribution_lists['male BMIs'].sort()
bmi_distribution_lists['female BMIs'].sort()



bmi_distribution_stats ={
'male bmi mean': take_mean(bmi_distribution_lists['male BMIs']),
'male bmi median': take_median(bmi_distribution_lists['male BMIs']),
'male bmi std dev': take_std_dev(bmi_distribution_lists['male BMIs']),
'female bmi mean': take_mean(bmi_distribution_lists['female BMIs']),
'female bmi median': take_median(bmi_distribution_lists['female BMIs']),
'female bmi std dev': take_std_dev(bmi_distribution_lists['female BMIs'])
}

print(bmi_distribution_stats)






{'male bmi mean': 30.943128698224843, 'male bmi median': 30.685, 'male bmi std dev': 6.135891193330869, 'female bmi mean': 30.377749244713023, 'female bmi median': 30.1, 'female bmi std dev': 6.041454877245923}


BMI is distributed very similarly for males and females in our dataset. Within each sex the mean is close to the median, which suggests the distribution is not heavily skewed by outliers, and the means, medians, and standard deviations are nearly the same across the two groups. This suggests that the difference in average charges between males and females is unlikely to be explained by differences in BMI, so we can analyze sex and BMI as separate variables without conflating their correlations with charges.
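
The take_median helper above relies on a pre-sorted list and picks the lower middle value when the list has an even length. As a hedged alternative sketch, the version below sorts its own input and averages the two middle values, and both helpers are cross-checked against Python's standard statistics module. The function names and the example values are assumptions for illustration, not part of the original analysis.

```python
import math
import statistics


def median_any_order(numbers):
    """Median that sorts its input and averages the two middle values when needed."""
    ordered = sorted(numbers)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def population_std_dev(numbers):
    """Population standard deviation (divides by N, as take_std_dev does)."""
    mean = sum(numbers) / len(numbers)
    return math.sqrt(sum((x - mean) ** 2 for x in numbers) / len(numbers))


# Made-up BMI-like values for illustration; deliberately unsorted and even in length.
example_bmis = [31.0, 22.5, 27.0, 35.5]
print(median_any_order(example_bmis))
print(math.isclose(population_std_dev(example_bmis), statistics.pstdev(example_bmis)))
```

With several hundred records per sex, the difference between the two median conventions is likely small, but the averaged version matches what statistics.median returns.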

# Is BMI Correlated with Charges?

First, we sort people into bins based on their BMI and take the average charges for each bin, then look for a clear trend. Binning makes the linear regression we will perform soon more robust to outliers and to extraneous variables such as the extremely high cost of insurance for smokers. If more smokers fall into one bin than another, they still count toward that bin's average, but the line is fit to the averages, so it is not skewed toward a cluster of smokers in a few bins or pulled in any direction by a single outlier.

In [113]:
bmi_aggregates = {
    '25+ Count': 0,
    '25+ Total Cost': 0,
    '24 Count': 0,
    '24 Total Cost': 0,
    '23 Count': 0,
    '23 Total Cost': 0,
    '22 Count': 0,
    '22 Total Cost': 0,
    '21 Count': 0,
    '21 Total Cost': 0,
    '20 Count': 0,
    '20 Total Cost': 0,
    '19 Count': 0,
    '19 Total Cost': 0,
    '18 Count': 0,
    '18 Total Cost': 0,
    '17 Count': 0,
    '17 Total Cost': 0,
    '16 Count': 0,
    '16 Total Cost': 0,
    '15- Count': 0,
    '15- Total Cost': 0,
}

bmi_avg_charges = {}

i=0
for item in clean_data['bmi']:
    if float(item) >= 25:
        bmi_aggregates['25+ Count'] += 1
        bmi_aggregates['25+ Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 24:
        bmi_aggregates['24 Count'] += 1
        bmi_aggregates['24 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 23:
        bmi_aggregates['23 Count'] += 1
        bmi_aggregates['23 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 22:
        bmi_aggregates['22 Count'] += 1
        bmi_aggregates['22 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 21:
        bmi_aggregates['21 Count'] += 1
        bmi_aggregates['21 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 20:
        bmi_aggregates['20 Count'] += 1
        bmi_aggregates['20 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 19:
        bmi_aggregates['19 Count'] += 1
        bmi_aggregates['19 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 18:
        bmi_aggregates['18 Count'] += 1
        bmi_aggregates['18 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 17:
        bmi_aggregates['17 Count'] += 1
        bmi_aggregates['17 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) >= 16:
        bmi_aggregates['16 Count'] += 1
        bmi_aggregates['16 Total Cost'] += float(clean_data['charges'][i])
    elif float(item) <= 16:
        bmi_aggregates['15- Count'] += 1
        bmi_aggregates['15- Total Cost'] += float(clean_data['charges'][i])
    i+=1
try: 
    bmi_avg_charges['25+ Avg Cost'] = round(bmi_aggregates['25+ Total Cost'] / bmi_aggregates['25+ Count'], 2)
    bmi_avg_charges['24 Avg Cost'] = round(bmi_aggregates['24 Total Cost'] / bmi_aggregates['24 Count'], 2)
    bmi_avg_charges['23 Avg Cost'] = round(bmi_aggregates['23 Total Cost'] / bmi_aggregates['23 Count'], 2)
    bmi_avg_charges['22 Avg Cost'] = round(bmi_aggregates['22 Total Cost'] / bmi_aggregates['22 Count'], 2)
    bmi_avg_charges['21 Avg Cost'] = round(bmi_aggregates['21 Total Cost'] / bmi_aggregates['21 Count'], 2)
    bmi_avg_charges['20 Avg Cost'] = round(bmi_aggregates['20 Total Cost'] / bmi_aggregates['20 Count'], 2)
    bmi_avg_charges['19 Avg Cost'] = round(bmi_aggregates['19 Total Cost'] / bmi_aggregates['19 Count'], 2)
    bmi_avg_charges['18 Avg Cost'] = round(bmi_aggregates['18 Total Cost'] / bmi_aggregates['18 Count'], 2)
    bmi_avg_charges['17 Avg Cost'] = round(bmi_aggregates['17 Total Cost'] / bmi_aggregates['17 Count'], 2)
    bmi_avg_charges['16 Avg Cost'] = round(bmi_aggregates['16 Total Cost'] / bmi_aggregates['16 Count'], 2)
    bmi_avg_charges['15- Avg Cost'] = round(bmi_aggregates['15- Total Cost'] / bmi_aggregates['15- Count'], 2)
except ZeroDivisionError:
    # Raised when a bin has no records, so its count is 0.
    print('error: Divide by 0')
   
print(bmi_avg_charges)



{'25+ Avg Cost': 13940.24, '24 Avg Cost': 12537.9, '23 Avg Cost': 9773.29, '22 Avg Cost': 10998.38, '21 Avg Cost': 9906.55, '20 Avg Cost': 7693.38, '19 Avg Cost': 8965.81, '18 Avg Cost': 10701.77, '17 Avg Cost': 8511.96, '16 Avg Cost': 4904.0, '15- Avg Cost': 1694.8}


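There is a visible positive trend between BMI and insurance cost: generally, as BMI increases, so does the average charge. The rise is not strictly monotonic, though; for example, the 18 bin has a higher average than the 19 and 20 bins.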
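
The long elif chain used for binning can be expressed as a single labeling function. The sketch below reproduces the same bin boundaries ('25+' for 25 and above, one bin per whole BMI unit from 24 down to 16, and '15-' for anything below 16). The function name bmi_bin_label and its low and high parameters are assumptions introduced here for illustration.

```python
import math


def bmi_bin_label(bmi, low=16, high=25):
    """Return the bin label used above: '25+', '24' ... '16', or '15-'."""
    bmi = float(bmi)
    if bmi >= high:
        return f'{high}+'
    if bmi < low:
        return f'{low - 1}-'
    return str(math.floor(bmi))


print([bmi_bin_label(value) for value in ['27.9', '24.99', '16', '15.96']])
```

A label function like this could be combined with a dictionary of running sums and counts keyed by label, which would also make it easy to try different bin boundaries.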

# Linear Regression for Correlation

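Let's test our intuition with a simple linear regression. The code below does a brute-force search over integer slopes and intercepts and keeps the line with the smallest total absolute vertical distance to the binned points: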




In [114]:
def get_y(m, x, b):
    y = m*x + b
    return y

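def get_error(m, b, point):
    # Absolute vertical distance between the point and the line y = m*x + b.
    x_point = point[0]
    y_point = point[1]
    y_line = get_y(m, x_point, b)
    y_dist = abs(y_point - y_line)
    return y_dist

# Pair each bin's average charge with a BMI value, counting down from 25
# (the '25+' bin) to 15 (the '15-' bin). This relies on bmi_avg_charges
# having been filled in that order.
i=25
points = []
for item in bmi_avg_charges:
    new_point = (i, bmi_avg_charges[item])
    points.append(new_point)
    i -= 1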


def get_total_error(m, b, point_list):
    total_error = 0
    for point in point_list:
        total_error = total_error + get_error(m, b, point)
    return total_error



def simple_linear_regression(point_list, slope_max = 1000, slope_min = -1000, intercept_max = 2000, intercept_min = -2000):
    smallest_error = float('inf')
    best_slope = 0
    best_intercept = 0

    possible_slopes = range(slope_min, slope_max + 1)
    possible_intercepts = range(intercept_min, intercept_max + 1)
    for slope in possible_slopes:
        for intercept in possible_intercepts:
            test_error = get_total_error(slope, intercept, point_list)
            if test_error < smallest_error:
                best_slope = slope
                best_intercept = intercept
                smallest_error = test_error
    return {
        'points_evaluated': point_list,
        'slope': best_slope,
        'intercept': best_intercept,
        'total_error': smallest_error
    }

bmi_best_fit = simple_linear_regression(points)

print(bmi_best_fit)

    

{'points_evaluated': [(25, 13940.24), (24, 12537.9), (23, 9773.29), (22, 10998.38), (21, 9906.55), (20, 7693.38), (19, 8965.81), (18, 10701.77), (17, 8511.96), (16, 4904.0), (15, 1694.8)], 'slope': 577, 'intercept': -1997, 'total_error': 16371.419999999998}


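The best line found by this search has a slope of 577, meaning each additional BMI point is associated with about $577 more in average charges across these bins. Note that the best intercept (-1997) sits almost at the edge of the search range (-2000), so a wider intercept range might produce a different line.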
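
For comparison, here is a hedged sketch of a closed-form ordinary least squares fit on the same binned points, together with the Pearson correlation coefficient r. This is a different technique from the grid search above: it minimizes squared error rather than absolute error and is not limited to integer values or a bounded range, so its slope and intercept need not match the search result. The function name least_squares_line is an assumption introduced here.

```python
import math

# Binned points copied from the 'points_evaluated' output above.
binned_points = [(25, 13940.24), (24, 12537.9), (23, 9773.29), (22, 10998.38),
                 (21, 9906.55), (20, 7693.38), (19, 8965.81), (18, 10701.77),
                 (17, 8511.96), (16, 4904.0), (15, 1694.8)]


def least_squares_line(points):
    """Closed-form ordinary least squares fit; returns (slope, intercept, r)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Pearson correlation coefficient of the binned points.
    r = s_xy / math.sqrt(s_xx * s_yy)
    return slope, intercept, r


ols_slope, ols_intercept, pearson_r = least_squares_line(binned_points)
print(round(ols_slope, 2), round(ols_intercept, 2), round(pearson_r, 3))
```

Keep in mind that r here describes eleven bin averages, not the individual records, and that this sketch does not include a significance test, so it only partly answers the question of whether the correlation is significant.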