# U.S. Medical Insurance Costs

> [!IMPORTANT]
> This is the first project created after completing the Data Science Fundamentals module of the Data Scientist Career Path on the Codecademy platform. The module covered only Python fundamentals and excluded Python data science libraries such as pandas, numpy and matplotlib. The project was done to practise the following Python elements:
> 1. Python functions
> 2. Python control flows
> 3. Python lists
> 4. Python dictionaries
> 5. Python loops
> 6. Python string manipulation

The data is first loaded with the `csv` module and the built-in `open` function.

In [36]:
#Importing necessary libraries and modules:
import csv

In [59]:
#Opening the document as csv for investigating the data
with open('insurance.csv', 'r') as insurance:
    csv_data = insurance.read()

## Initial data investigation:
1. The Medical Insurance Costs dataset contains 7 variables:
* age - numerical discrete variable
* sex - categorical dichotomous variable with possible values female and male
* bmi - numerical continuous variable
* children - numerical discrete variable
* smoker - categorical dichotomous variable with possible values yes and no
* region - categorical nominal variable
* charges - numerical continuous variable

2. There is no missing data in this dataset.
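
The claim about missing data can be checked with the `csv` module alone. The snippet below is a minimal sketch, not the check that was actually run for this project: `count_missing` is a hypothetical helper, and `sample_csv` is a made-up three-row sample (with one deliberately empty `bmi` field) that only mimics the column layout of `insurance.csv`.

```python
import csv
import io

# Made-up sample in the same column layout as insurance.csv (an assumption);
# the second data row has an empty bmi field on purpose.
sample_csv = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,yes,southwest,15000.0
40,male,,1,no,southeast,2000.0
50,male,31.5,3,no,northwest,4500.0
"""

def count_missing(csv_file):
    """Return {column name: number of empty or absent values}."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # DictReader fills short rows with None; empty fields are ''
            if value is None or value.strip() == '':
                missing[name] += 1
    return missing

missing_counts = count_missing(io.StringIO(sample_csv))
print(missing_counts)
```

For the real file, the same helper could be called inside `with open('insurance.csv', newline='') as f:`; a dictionary of all zeros would support the statement above.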


In [38]:
#Creating the empty lists for all variables in dataset
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

#Filling the lists with data from dataset
with open('insurance.csv',newline='') as medical_costs_csv:
    reader = csv.DictReader(medical_costs_csv, delimiter=',')
    for row in reader:
        age.append(int(row['age']))
        sex.append(row['sex'])
        bmi.append(float(row['bmi']))
        children.append(int(row['children']))
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(float(row['charges']))

In [39]:
#Calculating the frequency and proportion of males and females in the dataset
number_of_females = 0
number_of_males = 0
for s in sex:
    if s == 'female':
        number_of_females += 1
    else:
        number_of_males += 1
total = number_of_females + number_of_males
if total > 0 :
    print('The frequency of women in dataset is ' + str(number_of_females))
    print('The frequency of men in dataset is ' + str(number_of_males))
    print('The proportion of women in whole dataset ' + str(number_of_females / total))
    print('The proportion of men in whole dataset ' + str(number_of_males / total))

The frequency of women in dataset is 662
The frequency of men in dataset is 676
The proportion of women in whole dataset 0.4947683109118087
The proportion of men in whole dataset 0.5052316890881914


According to the analysis above, the proportions of women and men in the dataset are almost equal, with slightly more men.
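
The same counting logic applies to any categorical column, not only `sex`. As a rough sketch, the hypothetical helper `frequency_table` below returns the count and proportion for every label in a list; `sample_smoker` is made-up data standing in for lists such as `smoker` or `region`.

```python
def frequency_table(values):
    """Return {label: (count, proportion)} for a list of category labels."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    total = len(values)
    # For an empty list the comprehension has nothing to iterate over,
    # so no division by zero can happen.
    return {label: (count, count / total) for label, count in counts.items()}

# Made-up sample; in the notebook this could be the sex, smoker or region list.
sample_smoker = ['yes', 'no', 'no', 'no']
for label, (count, proportion) in frequency_table(sample_smoker).items():
    print('{}: count={}, proportion={:.2f}'.format(label, count, proportion))
```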

In [40]:
#Analyzing the distribution of the number of children
children_distribution_dictionary = {}
for c in children:
    if children_distribution_dictionary.get(c) is None:
        children_distribution_dictionary[c] = 1
    else:
        children_distribution_dictionary[c] +=1
print(children_distribution_dictionary)


{0: 574, 1: 324, 3: 157, 2: 240, 5: 18, 4: 25}


According to the results above, about 43% of the people in the dataset (574 of 1338) have no children at all. The maximum number of children is 5, which applies to only 18 people. The distribution is probably right-skewed, with people having 4 or 5 children as outliers.
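
One simple way to support the skewness remark without any statistics library is to compare the mean with the median: a mean above the median points towards a right-skewed distribution. The sketch below uses the counts printed above; `median_from_counts` is a hypothetical helper written for this illustration.

```python
# Counts copied from the printed dictionary: {number of children: number of people}
children_counts = {0: 574, 1: 324, 2: 240, 3: 157, 4: 25, 5: 18}

total_people = sum(children_counts.values())
mean_children = sum(k * n for k, n in children_counts.items()) / total_people

def median_from_counts(counts):
    """Median of the values described by a {value: count} dictionary."""
    total = sum(counts.values())
    # 0-based positions of the middle element(s) in the sorted data
    middle_positions = [(total - 1) // 2, total // 2]
    middle_values = []
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        while middle_positions and middle_positions[0] < seen:
            middle_positions.pop(0)
            middle_values.append(value)
    return sum(middle_values) / len(middle_values)

median_children = median_from_counts(children_counts)
print('Mean number of children: {:.3f}'.format(mean_children))
print('Median number of children: {}'.format(median_children))
```

With these counts the mean is roughly 1.09 and the median is 1, which is consistent with a mild right skew.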

In [50]:
#Analyzing what is the difference in cost of the insurance for people with no children and people with 5 children

number_of_people_without_children = 0
number_of_people_with_5 = 0
sum_of_charges_0_child = 0.0
sum_of_charges_5_child = 0.0
for c, ch in zip(children, charges):
    if c == 0:
        number_of_people_without_children += 1
        sum_of_charges_0_child += ch
    elif c == 5:
        number_of_people_with_5  += 1
        sum_of_charges_5_child += ch
if number_of_people_without_children > 0:
    print('Average medical insurance charges for people without children ' + str(sum_of_charges_0_child/number_of_people_without_children))
if number_of_people_with_5 > 0:   
    print('Average medical insurance charges for people having 5 children ' + str(sum_of_charges_5_child/number_of_people_with_5))


Average medical insurance charges for people without children 12365.975601635882
Average medical insurance charges for people having 5 children 8786.035247222222


The analysis above shows that the average medical insurance charges are higher for people without children than for people with 5 children. This comparison considers only this one variable and does not account for confounders such as BMI, age or smoking.
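
A first step towards accounting for a confounder is to split the averages by a second variable. The following sketch groups charges by the pair (number of children, smoker status); `mean_by_group` is a hypothetical helper and the three `sample_*` lists are made-up values, so the numbers say nothing about the real dataset.

```python
def mean_by_group(keys, values):
    """Return {key: mean of the values that share that key}."""
    sums = {}
    counts = {}
    for key, value in zip(keys, values):
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

# Made-up sample; in the notebook these would be children, smoker and charges.
sample_children = [0, 0, 5, 5]
sample_smoker = ['yes', 'no', 'no', 'no']
sample_charges = [30000.0, 4000.0, 8000.0, 10000.0]

# Tuples work as dictionary keys, so two variables can form one group label.
group_keys = list(zip(sample_children, sample_smoker))
group_means = mean_by_group(group_keys, sample_charges)
for (n_children, smoker_flag), mean_charge in group_means.items():
    print('children={}, smoker={}: {}'.format(n_children, smoker_flag, mean_charge))
```

In this made-up sample the childless group has the higher overall mean only because it contains the single smoker, which is the kind of effect a one-variable comparison can hide.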

In [55]:
#Analyzing top 10 highest insurance costs in the dataset
top_10_charges = sorted(zip(charges, age, bmi, smoker, children), reverse=True)[:10]
print("Top 10 highest insurance costs in the dataset:")
for charge, age_val, bmi_val, smoker_val, children_val in top_10_charges:
    print(f"Charge: {charge}, Age: {age_val}, BMI: {bmi_val}, Smoker: {smoker_val}, Children: {children_val}")

Top 10 highest insurance costs in the dataset:
Charge: 63770.42801, Age: 54, BMI: 47.41, Smoker: yes, Children: 0
Charge: 62592.87309, Age: 45, BMI: 30.36, Smoker: yes, Children: 0
Charge: 60021.39897, Age: 52, BMI: 34.485, Smoker: yes, Children: 3
Charge: 58571.07448, Age: 31, BMI: 38.095, Smoker: yes, Children: 1
Charge: 55135.40209, Age: 33, BMI: 35.53, Smoker: yes, Children: 0
Charge: 52590.82939, Age: 60, BMI: 32.8, Smoker: yes, Children: 0
Charge: 51194.55914, Age: 28, BMI: 36.4, Smoker: yes, Children: 1
Charge: 49577.6624, Age: 64, BMI: 36.96, Smoker: yes, Children: 2
Charge: 48970.2476, Age: 59, BMI: 41.14, Smoker: yes, Children: 1
Charge: 48885.13561, Age: 44, BMI: 38.06, Smoker: yes, Children: 0


Based on the top 10 highest insurance costs above, we can draw the following conclusions:
1. Smokers appear to be charged more for medical insurance than non-smokers: all ten observations are smokers.
2. A high BMI also appears to be associated with a higher cost of medical insurance: all ten observations have a BMI in the Obesity range.
3. The number of children does not appear to be relevant to the cost.
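
Ten rows are a small basis for such conclusions, so it can help to compare how common a trait is among the top N charges with how common it is overall. The sketch below shows one possible way to do that; `top_n_labels` and `share_with_label` are hypothetical helpers and the sample lists are made up.

```python
def top_n_labels(values, labels, n):
    """Return the labels that belong to the n largest values."""
    ranked = sorted(zip(values, labels), reverse=True)
    return [label for _, label in ranked[:n]]

def share_with_label(labels, target):
    """Return the fraction of labels equal to target (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == target) / len(labels)

# Made-up sample; in the notebook these would be the charges and smoker lists.
sample_charges = [40000.0, 35000.0, 9000.0, 7000.0, 5000.0, 3000.0]
sample_smoker = ['yes', 'yes', 'no', 'yes', 'no', 'no']

top_share = share_with_label(top_n_labels(sample_charges, sample_smoker, 2), 'yes')
overall_share = share_with_label(sample_smoker, 'yes')
print('Share of smokers in top 2: {}'.format(top_share))
print('Share of smokers overall: {}'.format(overall_share))
```

A top-N share far above the overall share would back up the first conclusion more firmly than the top-10 listing alone.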


In [52]:
age_stats = {}
for idx in range(0,len(charges)):
    a = age[idx]
    bmi_val = bmi[idx]
    cost_val = charges[idx]

    if a not in age_stats:
        age_stats[a] = {
            'min_bmi': bmi_val,
            'min_cost': cost_val,
            'max_bmi': bmi_val,
            'max_cost': cost_val
        }
    else:
        if bmi_val < age_stats[a]['min_bmi']:
            age_stats[a]['min_bmi'] = bmi_val
            age_stats[a]['min_cost'] = cost_val
        if bmi_val > age_stats[a]['max_bmi']:
            age_stats[a]['max_bmi'] = bmi_val
            age_stats[a]['max_cost'] = cost_val

age_biggest_range = 0
biggest_range = 0

#Finding the age with the biggest BMI range (highest BMI minus lowest BMI) in this dataset:
for a, stats in age_stats.items():
    min_bmi = stats['min_bmi']
    max_bmi = stats['max_bmi']
    bmi_range = max_bmi - min_bmi
    if bmi_range > biggest_range:
        biggest_range = bmi_range
        age_biggest_range = a
if age_biggest_range != 0:
    print("The biggest BMI difference for people weight in this dataset has been reported for age {} and equals {}".format(age_biggest_range,biggest_range))

#Finding the age with the biggest insurance cost difference between the highest-BMI and lowest-BMI person in this dataset:
age_biggest_range_cost = 0
biggest_range_cost = 0
for a, stats in age_stats.items():
    min_cost = stats['min_cost']
    max_cost = stats['max_cost']
    if (max_cost - min_cost) > biggest_range_cost:
        biggest_range_cost = max_cost - min_cost
        age_biggest_range_cost = a
if age_biggest_range_cost != 0:
    print("The biggest insurance cost difference for people weight in this dataset has been reported for age {} and equals {}".format(age_biggest_range_cost,biggest_range_cost))

The biggest BMI difference for people weight in this dataset has been reported for age 18 and equals 37.17
The biggest insurance cost difference for people weight in this dataset has been reported for age 54 and equals 52756.71611


Based on the results above, the biggest difference in insurance cost between the person with the highest BMI and the person with the lowest BMI was found at age 54. Further analysis of that age group will be performed to determine whether the BMI value influences the charges for medical insurance.
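
The comparison above pairs each age's charges with its BMI extremes, which is not necessarily the same as the full spread of charges at that age. As a complementary sketch, the hypothetical helper `charge_range_by_age` below tracks the lowest and highest charge per age directly; the sample lists are made up and do not reproduce the dataset.

```python
def charge_range_by_age(ages, costs):
    """Return {age: (lowest charge, highest charge)}."""
    ranges = {}
    for a, cost in zip(ages, costs):
        if a not in ranges:
            ranges[a] = (cost, cost)
        else:
            low, high = ranges[a]
            ranges[a] = (min(low, cost), max(high, cost))
    return ranges

# Made-up sample; in the notebook these would be the age and charges lists.
sample_age = [18, 18, 54, 54, 54]
sample_charges = [1500.0, 2500.0, 10000.0, 60000.0, 12000.0]

ranges = charge_range_by_age(sample_age, sample_charges)
# Pick the age whose (highest - lowest) charge difference is largest.
widest_age = max(ranges, key=lambda a: ranges[a][1] - ranges[a][0])
widest_spread = ranges[widest_age][1] - ranges[widest_age][0]
print('Widest charge range: age {} with a spread of {}'.format(widest_age, widest_spread))
```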

The generic BMI ranges are as follows:
1. < 18.5 - underweight
2. 18.5 - 24.9 - healthy weight
3. 25.0 - 29.9 - overweight
4. ≥ 30.0 - obese

In [46]:
#Define a function that determines the BMI category for a given BMI value.
#Upper bounds are 25.0 and 30.0 so that values such as 24.95 or 29.95
#do not fall through to 'Obesity':
def bmi_category(bmi_value):
    if bmi_value < 18.5:
        return 'Underweight'
    elif bmi_value < 25.0:
        return 'Normal weight'
    elif bmi_value < 30.0:
        return 'Overweight'
    else:
        return 'Obesity'

In [58]:
#Analyzing the average insurance cost for 54-year-old people based on their BMI category
category_sums = {'Obesity': 0.0, 'Overweight': 0.0, 'Normal weight': 0.0, 'Underweight': 0.0}
category_counts = {'Obesity': 0, 'Overweight': 0, 'Normal weight': 0, 'Underweight': 0}

for a, b, ch in zip(age, bmi, charges):
    if a == 54:
        cat = bmi_category(b)
        category_sums[cat] += ch
        category_counts[cat] += 1

for cat in category_sums:
    if category_counts[cat] > 0:
        print('Average charges for 54yo, {}: {}'.format(cat, category_sums[cat] / category_counts[cat]))

Average charges for 54yo, Obesity: 21223.057896111113
Average charges for 54yo, Overweight: 16947.618476
Average charges for 54yo, Normal weight: 11697.233360000002


According to the analysis above, there are no underweight 54-year-old people in the dataset, and the higher the BMI category, the higher the average insurance cost, which is consistent with the general understanding of the health risks associated with high BMI.