# U.S. Medical Insurance Costs

An unguided Python project investigating a medical insurance costs dataset provided by Codecademy in its *Data Scientist – Natural Language Processing* Career Path.

To demonstrate a basic understanding of Python fundamentals, I'll be working with the information in the dataset and performing my own independent analysis. Skills to be demonstrated include:
- Working locally through Jupyter Notebooks
- Importing a dataset into my program
- Performing data cleaning and extraction for analysis
- Analyzing a dataset by building out functions
- Using libraries to assist in my analysis

Here are some of the questions I'll attempt to answer using this dataset, grouped by area of interest:
- Demographics 
  - What is the average age of patients?
  - Are men and women equally represented in the dataset?
  - What is the age distribution of the sample?
- Insurance costs
  - What is the range of costs paid by patients in this dataset?
  - How much do women pay compared to men on average?
  - How much does smoking increase the cost of medical insurance?
  - Are there geographical differences in the costs for U.S. medical insurance?

### Ideas to perfect portfolio project
- [ ] Create an average function that takes in a list of numbers and calculates their average using `sum()` and `len()`.
- [ ] Explain the code using comments.
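
The first idea above could look something like the following minimal sketch. The function name `average` and the decision to raise `ValueError` on an empty list are assumptions made for illustration; they are not part of the notebook.

```python
def average(numbers):
    """Return the arithmetic mean of a list of numbers."""
    if len(numbers) == 0:
        # Assumption: an empty list is treated as an error instead of returning 0
        raise ValueError("cannot average an empty list")
    return sum(numbers) / len(numbers)


# Usage with the first five ages from the sample print-out
print(average([19, 18, 28, 33, 32]))  # 26.0
```

A helper like this could replace the repeated `sum(...) / len(...)` expressions in the analysis cells.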

# Importing and Preparing the Dataset

## Save Dataset Variables

In [38]:
import csv

with open('insurance.csv', 'r') as insurance_data:
    data = csv.DictReader(insurance_data)

    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []

    for row in data:
        age.append(int(row['age']))
        sex.append(row['sex'])
        bmi.append(float(row['bmi']))
        children.append(int(row['children']))
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(round(float(row['charges']),2))

# Prints a sample of the dataset with an index before the corresponding variables
for i in range(0,10):
    print(i, ":", age[i], sex[i], bmi[i], children[i], smoker[i], region[i], charges[i])

0 : 19 female 27.9 0 yes southwest 16884.92
1 : 18 male 33.77 1 no southeast 1725.55
2 : 28 male 33.0 3 no southeast 4449.46
3 : 33 male 22.705 0 no northwest 21984.47
4 : 32 male 28.88 0 no northwest 3866.86
5 : 31 female 25.74 0 no southeast 3756.62
6 : 46 female 33.44 1 no southeast 8240.59
7 : 37 female 27.74 3 no northwest 7281.51
8 : 37 male 29.83 2 no northeast 6406.41
9 : 60 female 25.84 0 no northwest 28923.14


# Analysis

## Demographics

### Average Age of Patients

In [11]:
## Calculate average using loops

# sum_of_ages = 0
# for i in age:
#    sum_of_ages += i

## Calculate average using the sum() function (the easier way) <---
sum_of_ages = sum(age)

age_average = sum_of_ages / len(age)

print("The average age in the sample is " + str(round(age_average,2)))

The average age in the sample is 39.21


### Representation of Men to Women in Sample

In [5]:
def sex_distribution(sex_field):
    """Take a list of 'male'/'female' values, count both sexes, and print an
    analysis of their distribution.
    """
    num_men = 0
    num_women = 0
    total_sample = len(sex_field)

    # Count from the list passed in, not from the global `sex` variable
    for s in sex_field:
        if s == 'male':
            num_men += 1
        elif s == 'female':
            num_women += 1
    diff = abs(num_men - num_women)
    diff_percentage = round((diff / total_sample) * 100, 2)

    print("Summary:\nThe sample consists of {} men and {} women,".format(num_men, num_women) +
        " representing a " + str(diff_percentage) + "% difference.")

    # Print exactly one verdict; a difference of 10% or more counts as significant
    if diff_percentage == 0:
        print("There is the same number of men and women in the sample.")
    elif diff_percentage < 10:
        print("There's a fairly good distribution of men and women in the sample.")
    elif num_women > num_men:
        print("There are significantly more women in the sample.")
    else:
        print("There are significantly more men in the sample.")



sex_distribution(sex)


Summary:
The sample consists of 676 men and 662 women, representing a 1.05% difference.
There's a fairly good distribution of men and women in the sample.
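
The counting step can also be handed to the standard library. The sketch below uses `collections.Counter`; the helper name `count_categories` is hypothetical, and the sample list holds the `sex` values of the first ten rows printed earlier.

```python
from collections import Counter


def count_categories(values):
    """Return {value: (count, percentage)} for each distinct value in a list."""
    counts = Counter(values)
    total = len(values)
    return {value: (n, round(n / total * 100, 2)) for value, n in counts.items()}


# The `sex` column of the first ten rows from the sample print-out
sample = ['female', 'male', 'male', 'male', 'male',
          'female', 'female', 'female', 'male', 'female']
print(count_categories(sample))  # {'female': (5, 50.0), 'male': (5, 50.0)}
```

The same helper would work for any categorical list in the notebook, such as `region` or `smoker`.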


### Age Distribution

In [16]:
def analyze_age(age_field):
    """Print a descriptive analysis of the age field, including its range, distribution, average, and median."""

    print("Data Summary:\n")

    # Range (named age_range to avoid shadowing the built-in range())
    age_range = min(age_field), max(age_field)
    print("Range: The ages range from {} to {} years old.".format(age_range[0], age_range[1]))

    # Age Distribution
    unique_ages = sorted(set(age_field))
    age_distribution = {}

    print("Age Distribution:")
    for unique_age in unique_ages:
        age_count = age_field.count(unique_age)
        age_count_percentage = round((age_count / len(age_field)) * 100, 2)
        age_distribution[unique_age] = [str(age_count), str(age_count_percentage) + "%"]
    for k, v in age_distribution.items():
        print(k, ":", v)

    # Average
    average_age = sum(age_field) / len(age_field)

    # Median (computed on a sorted copy so the original list keeps its row order
    # and stays aligned with the other columns)
    sorted_ages = sorted(age_field)
    middle = len(sorted_ages) // 2
    if len(sorted_ages) % 2:
        median = sorted_ages[middle]
    else:
        # Even number of values: average the two middle values
        median = sum(sorted_ages[middle - 1:middle + 1]) / 2

    print("Average: The average age from this sample is {}, with a median of {}.".format(round(average_age, 2), round(median, 2)))



analyze_age(age)


Data Summary:

Range: The ages range from 18 to 64 years old.
Age Distribution:
18 : ['69', '5.16%']
19 : ['68', '5.08%']
20 : ['29', '2.17%']
21 : ['28', '2.09%']
22 : ['28', '2.09%']
23 : ['28', '2.09%']
24 : ['28', '2.09%']
25 : ['28', '2.09%']
26 : ['28', '2.09%']
27 : ['28', '2.09%']
28 : ['28', '2.09%']
29 : ['27', '2.02%']
30 : ['27', '2.02%']
31 : ['27', '2.02%']
32 : ['26', '1.94%']
33 : ['26', '1.94%']
34 : ['26', '1.94%']
35 : ['25', '1.87%']
36 : ['25', '1.87%']
37 : ['25', '1.87%']
38 : ['25', '1.87%']
39 : ['25', '1.87%']
40 : ['27', '2.02%']
41 : ['27', '2.02%']
42 : ['27', '2.02%']
43 : ['27', '2.02%']
44 : ['27', '2.02%']
45 : ['29', '2.17%']
46 : ['29', '2.17%']
47 : ['29', '2.17%']
48 : ['29', '2.17%']
49 : ['28', '2.09%']
50 : ['29', '2.17%']
51 : ['29', '2.17%']
52 : ['29', '2.17%']
53 : ['28', '2.09%']
54 : ['28', '2.09%']
55 : ['26', '1.94%']
56 : ['26', '1.94%']
57 : ['26', '1.94%']
58 : ['25', '1.87%']
59 : ['25', '1.87%']
60 : ['23', '1.72%']
61 : ['23', '1.72%']
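
For the median step, here is a small standalone sketch that leaves the input list untouched. The helper name `median` is an assumption, and `statistics.median` from the standard library is shown as a cross-check; the sample values are the first ten ages from the print-out at the top of the notebook.

```python
import statistics


def median(values):
    """Return the median of a list without modifying it."""
    ordered = sorted(values)  # sorted() builds a new list; the input keeps its order
    n = len(ordered)
    middle = n // 2
    if n % 2:
        # Odd number of values: the middle one is the median
        return ordered[middle]
    # Even number of values: average the two middle ones
    return (ordered[middle - 1] + ordered[middle]) / 2


ages = [19, 18, 28, 33, 32, 31, 46, 37, 37, 60]  # first ten ages in the sample
print(median(ages))             # 32.5
print(statistics.median(ages))  # 32.5, the standard-library equivalent
print(ages[0])                  # 19, the original order is preserved
```

Keeping the input unsorted matters here because `age`, `sex`, `charges`, and the other columns are parallel lists that are matched up by index.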

### Location of Patients

In [None]:
# Counters for regions
patient_locations = {'northeast': 0, 'northwest': 0, 'southeast': 0, 'southwest': 0}

# Iterate through items and update counters
for r in region:
    if r in patient_locations:
        patient_locations[r] += 1

# Calculate each region's share of the sample as a percentage
locations_percentages = {}
for name, count in patient_locations.items():
    locations_percentages[name] = round((count / len(region)) * 100, 2)

# Print-out: one line per region
for name, count in patient_locations.items():
    print("There are {} patients ({}%) living in the {}.".format(count, locations_percentages[name], name))

There are 324 patients (24.22%) living in the northeast.
There are 325 patients (24.29%) living in the northwest.
There are 364 patients (27.2%) living in the southeast.
There are 325 patients (24.29%) living in the southwest.

## Insurance costs

### Range and Average of Charges

In [81]:
# Range of costs paid by patients
cost_range = min(charges), max(charges)

# Average costs
average_cost = round(sum(charges) / len(charges), 2)

print("The costs of medical insurance range between {} and {} dollars, with an overall average of {} dollars.".format(cost_range[0], cost_range[1], average_cost))

The costs of medical insurance range between 1121.87 and 63770.43 dollars, with an overall average of 13270.42 dollars.


### Insurance Charges to Women vs Men

In [13]:
# Empty lists to save charges in
charges_female = []
charges_male = []

# List of charges to men and women
for i in range(0,len(sex)):
    if sex[i] == 'female': charges_female.append(float(charges[i]))
    if sex[i] == 'male': charges_male.append(float(charges[i]))

# Calculate averages based on lists sum and length
avg_female_charges = round(sum(charges_female) / len(charges_female), 2)
avg_male_charges = round(sum(charges_male) / len(charges_male), 2)

# Print-out
print("The average insurance cost for women is {} dollars.".format(avg_female_charges))
print("The average insurance cost for men is {} dollars.".format(avg_male_charges))
if avg_female_charges > avg_male_charges:
    print("Based on this dataset, women pay more for medical insurance than men.")
else:
    print("Based on this dataset, men pay more for medical insurance than women.")

The average insurance cost for women is 12569.58 dollars.
The average insurance cost for men is 13956.75 dollars.
Based on this dataset, men pay more for medical insurance than women.


### Role of Smoking on Cost of Medical Insurance

In [29]:
# Calculate average cost for smokers and non-smokers
charges_smoker = []
charges_nonsmoker = []

for i in range(0,len(smoker)):
    if smoker[i] == 'yes': charges_smoker.append(float(charges[i]))
    elif smoker[i] == 'no': charges_nonsmoker.append(float(charges[i]))

avg_cost_smoker = round(sum(charges_smoker) / len(charges_smoker), 2)
avg_cost_nonsmoker = round(sum(charges_nonsmoker) / len(charges_nonsmoker), 2)
diff_smoker_vs_nonsmoker = round(abs(avg_cost_smoker - avg_cost_nonsmoker), 2)

# Print-out
# The 'verb' variable changes the sentence according to whether smoking 'increases' or 'decreases'
# the cost of insurance.
verb = "increases" if avg_cost_smoker > avg_cost_nonsmoker else "decreases"

print("According to this dataset, smoking " + verb + " the cost of medical insurance by "
        + str(diff_smoker_vs_nonsmoker) + " dollars on average.")

According to this dataset, smoking increases the cost of medical insurance by 23615.96 dollars on average.
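
An absolute difference in dollars can be complemented by a ratio between the two group averages. The sketch below shows one way to compute both; `compare_groups` is a hypothetical helper, and the charges are made-up round numbers for illustration, not values from the dataset.

```python
def compare_groups(group_a, group_b):
    """Return (absolute difference, ratio) between the averages of two groups."""
    avg_a = sum(group_a) / len(group_a)
    avg_b = sum(group_b) / len(group_b)
    absolute = round(avg_a - avg_b, 2)  # how many dollars more group A pays
    ratio = round(avg_a / avg_b, 2)     # how many times more group A pays
    return absolute, ratio


# Made-up charges for illustration only
smokers = [30000.0, 34000.0]
nonsmokers = [8000.0, 8000.0]
print(compare_groups(smokers, nonsmokers))  # (24000.0, 4.0)
```

In the notebook, `charges_smoker` and `charges_nonsmoker` could be passed in directly.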


### Geographical differences in cost

In [83]:
# Empty lists to save costs to
northeast_cost = []
northwest_cost = []
southeast_cost = []
southwest_cost = []

# Iterate through items and save costs in their respective region lists
for i in range(0,len(region)):
    if region[i] == 'northeast':
        northeast_cost.append(round(float(charges[i]), 2))
    elif region[i] == 'northwest':
        northwest_cost.append(round(float(charges[i]), 2))
    elif region[i] == 'southeast':
        southeast_cost.append(round(float(charges[i]), 2))
    elif region[i] == 'southwest':
        southwest_cost.append(round(float(charges[i]), 2))

# Calculate the averages
northeast_cost = round(sum(northeast_cost) / len(northeast_cost), 2)
northwest_cost = round(sum(northwest_cost) / len(northwest_cost), 2)
southeast_cost = round(sum(southeast_cost) / len(southeast_cost), 2)
southwest_cost = round(sum(southwest_cost) / len(southwest_cost), 2)

# Print-out
print("These are the average costs of medical insurance per region:"
        + "\nNortheast: " + str(northeast_cost) + " dollars."
        + "\nNorthwest: " + str(northwest_cost) + " dollars."
        + "\nSoutheast: " + str(southeast_cost) + " dollars."
        + "\nSouthwest: " + str(southwest_cost) + " dollars."
    )

These are the average costs of medical insurance per region:
Northeast: 13406.38 dollars.
Northwest: 12417.58 dollars.
Southeast: 14735.41 dollars.
Southwest: 12346.94 dollars.
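
The per-region averages follow the same pattern as the averages by sex and by smoking status: split the charges by a label, then average each group. A generic version could look like the sketch below, where `average_by_group` is a hypothetical helper and the sample data is the first five rows of the print-out at the top of the notebook.

```python
from collections import defaultdict


def average_by_group(groups, values):
    """Return the average of `values` for each distinct label in `groups`."""
    totals = defaultdict(float)  # running sum of values per label
    counts = defaultdict(int)    # number of values seen per label
    for label, value in zip(groups, values):
        totals[label] += value
        counts[label] += 1
    return {label: round(totals[label] / counts[label], 2) for label in totals}


# Region and charges of the first five rows in the sample print-out
regions = ['southwest', 'southeast', 'southeast', 'northwest', 'northwest']
costs = [16884.92, 1725.55, 4449.46, 21984.47, 3866.86]
print(average_by_group(regions, costs))
```

Calling `average_by_group(region, charges)`, `average_by_group(sex, charges)`, or `average_by_group(smoker, charges)` would cover the three comparisons without a separate list per category.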


## Conclusions

- The average age in the sample is 39.21.
- The sample consists of 676 men and 662 women, representing a 1.05% difference, which means there's a fairly good distribution of men and women in the sample.
- The ages in the sample range from 18 to 64 years old, with an average of 39.21 and a median of 39.
- The patients are distributed among the four locations as follows:
  - 324 patients (24.22%) live in the northeast.
  - 325 patients (24.29%) live in the northwest.
  - 364 patients (27.2%) live in the southeast.
  - 325 patients (24.29%) live in the southwest.
- The costs of medical insurance range between 1121.87 and 63770.43 dollars, with an overall average of 13270.42 dollars.
- The average insurance cost for women is 12569.58 dollars, while the average insurance cost for men is 13956.75 dollars. Based on this dataset, men pay more for medical insurance than women.
- According to this dataset, smoking increases the cost of medical insurance by 23615.96 dollars on average.
- These are the average costs of medical insurance per region:
  - Northeast: 13406.38 dollars.
  - Northwest: 12417.58 dollars.
  - Southeast: 14735.41 dollars.
  - Southwest: 12346.94 dollars.