# U.S. Medical Insurance Costs

In this project, a CSV file of medical insurance costs is investigated using Python. The goal is to analyze the attributes in insurance.csv to learn more about the patient information in the file and to gain insight into potential use cases for the dataset.
The analysis will cover the following:
* Find out the average age of patients in the dataset
* Analyze where a majority of individuals are from 
* Look at different costs between smokers and non-smokers 
* What the average age is for someone who has at least one child 
* Return the number of males vs females 
* Return the average cost of the patients 
* Analyze mean and standard deviation (on ages, for example) 
* Create a dictionary that contains all patient information

Extra:
* Organize findings into dictionaries, lists or another datatype

# **Look over the dataset**

This project uses a CSV file that holds the medical insurance information for each patient.
Some notes:
* Column names: age, sex, bmi, children, smoker, region, charges
* There is no missing data.
* Some columns are numerical (age, bmi, children, charges) while some are categorical (sex, smoker, region).
* There are seven columns.
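
The notes above can be checked programmatically before any analysis. The following is a minimal sketch, not part of the original notebook: `check_dataset`, `EXPECTED_COLUMNS` and `SAMPLE_CSV` are hypothetical names, and the sample text holds only the first three rows shown later in this notebook so that the block runs without the file.

```python
import csv
import io

EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

# Three rows copied from the notebook output, used here in place of the full file.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def check_dataset(csv_file):
    """Return (column names, row count, number of empty cells) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    row_count = 0
    empty_cells = 0
    for row in reader:
        row_count += 1
        # A cell counts as empty when it is missing or contains only whitespace.
        empty_cells += sum(
            1 for value in row.values() if value is None or not str(value).strip()
        )
    return reader.fieldnames, row_count, empty_cells

columns, row_count, empty_cells = check_dataset(io.StringIO(SAMPLE_CSV))
print(columns == EXPECTED_COLUMNS, row_count, empty_cells)
```

To run the same check on the real file, the sample could be swapped for `open("insurance.csv", newline="")`.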

In [2]:
import csv

In [3]:
Age = []
Sex = []
Bmi = []
Children = []
Smoker = []
Region = []
Charges = []

In [4]:
#Create a function that reads one column of the insurance.csv file into a list
#(the first parameter is named values so it does not shadow the built-in list)
def insurance_info(values, insurance_file, column_name):
    with open(insurance_file, newline="") as insurance_csv:
        insurance = csv.DictReader(insurance_csv)
        for row in insurance:
            values.append(row[column_name])
    return values

In [5]:
#Create lists for all the insurance information
age_list = insurance_info(Age, "insurance.csv", "age")
sex_list = insurance_info(Sex, "insurance.csv", "sex")
bmi_list = insurance_info(Bmi, "insurance.csv", "bmi")
children_list = insurance_info(Children, "insurance.csv", "children")
smoker_list = insurance_info(Smoker, "insurance.csv", "smoker")
region_list = insurance_info(Region, "insurance.csv", "region")
charges_list = insurance_info(Charges, "insurance.csv", "charges")

Patient_info = list(zip(Age, Sex, Bmi, Children, Smoker, Region, Charges))
print(Patient_info[0: 3])

[('19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'), ('18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'), ('28', 'male', '33', '3', 'no', 'southeast', '4449.462')]


## Organize patient information in a dictionary

In [6]:
#Number the patients from 1 up to the number of rows that were read
Patient_number = range(1, len(Age) + 1)

In [7]:
Patient = [{"age": a, "sex": s, "bmi": b, "children": c, "smoker": sm, "region": r, "charges": ch} for a, s, b, c, sm, r, ch in zip(Age, Sex, Bmi, Children, Smoker, Region, Charges)]
Patient_final = {number: i for number, i in zip(Patient_number, Patient)}

In [8]:
example = dict(list(Patient_final.items())[0: 3]) 
print(example)

{1: {'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, 2: {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, 3: {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}}


# **Analyze Dataset**

## Find out the average age of patients in the dataset

In [9]:
#Create a function to calculate the average age:
def age_average(Age):
    age_count = 0
    for row in Age:
        age_count += int(row)
    return age_count/len(Age)

#The ages are stored as strings, so compare them as numbers with key=int
older_age = max(Age, key=int)
younger_age = min(Age, key=int)
average = int(age_average(Age))

print("The average age is", average, "years. The oldest patients are", older_age, "years old and the youngest are", younger_age, "years old.")

The average age is 39 years. The oldest patients are 64 years old and the youngest are 18 years old.


## Analyze where a majority of individuals are from

In [10]:
#Create a function to find the regions that patients came from
def region_origin(Region):
    list_of_regions = []
    for region in Region:
        if region not in list_of_regions:
            list_of_regions.append(region)
    return list_of_regions
regions = region_origin(Region)
print("Patients are from the following regions: " + ", ".join(regions) + ".")

Patients are from the following regions: southwest, southeast, northwest, northeast.


In [11]:
#Create a function to count how many patients are from each region
def region_count (Region):
    Southwest = 0
    Southeast = 0
    Northwest = 0
    Northeast = 0
    for region in Region:
        if region == "southwest":
            Southwest += 1
        elif region == "southeast":
            Southeast += 1
        elif region == "northwest":
            Northwest += 1
        else:
            Northeast +=1
    return Southwest, Southeast, Northwest, Northeast

Southwest, Southeast, Northwest, Northeast = region_count(Region)

print("There are " + str(Southwest) + " patients from Southwest, " + str(Southeast) + " from Southeast, " + str(Northwest) + " from Northwest and " + str(Northeast) + " from Northeast.")

There are 325 patients from Southwest, 364 from Southeast, 325 from Northwest and 324 from Northeast.
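
The counts above show that no region holds an outright majority: the largest, Southeast, has 364 of 1338 patients, or about 27%. As a possible alternative to the hand-written counters, the sketch below uses `collections.Counter` from the standard library to find the most frequent region. `most_common_region` and `sample_regions` are hypothetical names, and the sample holds only the regions of the first three patients shown earlier.

```python
from collections import Counter

# Regions of the first three patients shown earlier, standing in for the full Region list.
sample_regions = ["southwest", "southeast", "southeast"]

def most_common_region(regions):
    """Return (region, count) for the region that appears most often."""
    counts = Counter(regions)
    return counts.most_common(1)[0]

top_region, top_count = most_common_region(sample_regions)
print(top_region, top_count)
```

Calling `most_common_region(Region)` on the full list would be expected to agree with the counts printed above.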


## Look at different costs between smokers and non-smokers

In [12]:
#Define a function to count the smokers and total their charges, grouped by sex:
def smoker_analysis_by_sex(Patient_info):
    female_smokers = 0
    female_smokers_charges = 0
    male_smokers = 0
    male_smokers_charges = 0
    for i in Patient_info:
        #Only smokers are counted; non-smokers are skipped
        if i[4] != "yes":
            continue
        if i[1] == "female":
            female_smokers += 1
            female_smokers_charges += float(i[-1])
        elif i[1] == "male":
            male_smokers += 1
            male_smokers_charges += float(i[-1])
    return female_smokers, female_smokers_charges, male_smokers, male_smokers_charges
    
female_smokers, female_smokers_charges, male_smokers, male_smokers_charges = smoker_analysis_by_sex(Patient_info)
avg_female_smoker_cost = round(female_smokers_charges / female_smokers, 2)
avg_male_smoker_cost = round(male_smokers_charges / male_smokers, 2)

print("There are " + str(female_smokers) + " female smokers with an average insurance cost of " + str(avg_female_smoker_cost) + "$ and " + str(male_smokers) + " male smokers with an average insurance cost of " + str(avg_male_smoker_cost) + "$.")

There are 115 female smokers with an average insurance cost of 30679.0$ and 159 male smokers with an average insurance cost of 33042.01$.


In [13]:
#Define a function to calculate the number of smokers and the average insurance_cost:
def smoker_analysis (Patient_info):
    smokers = 0
    smokers_cost = 0
    non_smokers = 0
    non_smokers_cost = 0
    for i in Patient_info:
        if i[4] == "yes":
            smokers += 1
            smokers_cost += float(i[-1])
        if i[4] == "no":
            non_smokers += 1 
            non_smokers_cost += float(i[-1])
    return smokers, non_smokers, smokers_cost, non_smokers_cost
        
smokers, non_smokers, smokers_cost, non_smokers_cost = smoker_analysis(Patient_info)

avg_smoker_cost = round(smokers_cost / smokers, 2)
avg_non_smoker_cost = round(non_smokers_cost / non_smokers, 2)

print("There are " + str(smokers) + " smokers and " + str(non_smokers) + " non-smokers.")
print("The insurance average cost for a smoker is " + str(avg_smoker_cost) + "$ while the average insurance cost for a non-smoker patient is " + str(avg_non_smoker_cost) + "$.")

There are 274 smokers and 1064 non-smokers.
The insurance average cost for a smoker is 32050.23$ while the average insurance cost for a non-smoker patient is 8434.27$.
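
To put the two averages side by side, the gap can be expressed as a ratio and as a difference. The sketch below is only illustrative arithmetic on the two figures printed above; `cost_ratio`, `ratio` and `difference` are hypothetical names that do not appear in the notebook.

```python
# Averages copied from the output above.
avg_smoker_cost = 32050.23
avg_non_smoker_cost = 8434.27

def cost_ratio(cost_a, cost_b):
    """Return how many times larger cost_a is than cost_b."""
    return cost_a / cost_b

ratio = cost_ratio(avg_smoker_cost, avg_non_smoker_cost)
difference = avg_smoker_cost - avg_non_smoker_cost
print(round(ratio, 2), round(difference, 2))
```

On these figures, a smoker's average insurance cost is roughly 3.8 times that of a non-smoker.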


## What the average age is for someone who has at least one child

In [14]:
#Create a function to count patients by number of children and to total the ages of patients with at least one child
def patient_with_children(Patient_info):
    no_child = 0
    one_child = 0
    two_children = 0
    more_children = 0
    count_patient = 0
    total_age_patient = 0
    for i in Patient_info:
        #Convert the number of children once per patient
        children = int(i[3])
        if children == 0:
            no_child += 1
        elif children == 1:
            one_child += 1
        elif children == 2:
            two_children += 1
        elif children > 2:
            more_children += 1
        if children >= 1:
            count_patient += 1
            total_age_patient += int(i[0])
    return no_child, one_child, two_children, more_children, count_patient, total_age_patient

no_child, one_child, two_children, more_children, count_patient, total_age_patient = patient_with_children (Patient_info)
age_avg_with_children = round(total_age_patient / count_patient, 1)

print("There are " + str(count_patient) + " patients with children and the average age for a patient with one or more children is " + str(age_avg_with_children) + ".")
print("Number of patients with:\n  No child: {}\n  One child: {}\n  Two children: {}\n  More than two children: {}".format(no_child, one_child,two_children, more_children))

There are 764 patients with children and the average age for a patient with one or more children is 39.8.
Number of patients with:
  No child: 574
  One child: 324
  Two children: 240
  More than two children: 200


## Return the number of males vs females

In [15]:
#Determine the number of females and males on the dataset
def males_vs_females(Sex):
    female_count = 0
    male_count = 0
    for sex in Sex:
        if sex == "female":
            female_count += 1
        else:
            male_count +=1
    return female_count, male_count

female_count, male_count = males_vs_females(Sex)

print("The population of this dataset consists of {} women and {} men.".format(female_count, male_count))


The population of this dataset consists of 662 women and 676 men.


## Return the average cost of the patients

In [16]:
#Calculate the average insurance cost and find the patients with the lowest and highest cost:
def average_cost(Patient_info):
    cost = 0
    count_patient = 0
    #Start from -inf and +inf so the first row always sets both extremes
    max_charge = float("-inf")
    min_charge = float("inf")
    max_charge_info = []
    min_charge_info = []
    for i in Patient_info:
        charge = float(i[-1])
        count_patient += 1
        cost += charge
        if charge > max_charge:
            max_charge = charge
            max_charge_info = i[0:6]
        if charge < min_charge:
            min_charge = charge
            min_charge_info = i[0:6]
    return cost, count_patient, max_charge, min_charge, max_charge_info, min_charge_info

cost, count_patient, max_charge, min_charge, max_charge_info, min_charge_info = average_cost(Patient_info)
avg_cost = round(cost / count_patient, 2)

print("The average insurance cost for each patient is ", str(avg_cost),"$.")
print("The minimum insurance cost is {}$ and the maximum insurance cost is {}$.".format(round(min_charge, 2), round(max_charge, 2)))
print("These are the patient information for:\n  Minimum insurance cost:{}\n  Maximum insurance cost:{}".format(min_charge_info, max_charge_info))

The average insurance cost for each patient is  13270.42 $.
The minimum insurance cost is 1121.87$ and the maximum insurance cost is 63770.43$.
These are the patient information for:
  Minimum insurance cost:('18', 'male', '23.21', '0', 'no', 'southeast')
  Maximum insurance cost:('54', 'female', '47.41', '0', 'yes', 'southeast')


In [17]:
#Calculate the median of insurance cost:
def cal_median(Charges):
    #Use the list passed in rather than the global Patient_info
    charges = sorted(float(charge) for charge in Charges)
    n = len(charges)
    m = n // 2
    if n % 2 == 0:
        return (charges[m - 1] + charges[m]) / 2
    else:
        return charges[m]
median = cal_median(Charges)
print("The median of insurance cost is",round(median, 2),"$.")

The median of insurance cost is 9382.03 $.


## Analyze mean and standard deviation for Age

In [18]:
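import math

#Create a function to calculate the mean and the population standard deviation of the ages
def mean_stddev(Patient_info):
    info = [float(i[0]) for i in Patient_info]
    mean = sum(info) / len(info)
    #Squared deviation of each age from the mean
    deviations = [(i - mean) ** 2 for i in info]
    #Dividing by the number of values gives the population standard deviation
    std_dev = math.sqrt(sum(deviations)/len(deviations))
    return mean, std_dev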


mean, std_dev = mean_stddev(Patient_info)

In [19]:
print("The average age is " + str(round(mean, 1)) + " and the standard deviation is " + str(round(std_dev, 1)))

The average age is 39.2 and the standard deviation is 14.0
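
The hand-written formula divides by the number of values, so it computes the population standard deviation. One way to cross-check it is with the standard library's `statistics` module, where `pstdev` is the population version and `stdev` is the sample version (which divides by n - 1). The sketch below is an assumption about how such a check could look; `summarize` and `sample_ages` are hypothetical names, and the sample holds only the ages of the first three patients shown earlier.

```python
import statistics

# Ages of the first three patients shown earlier, standing in for the full Age list.
sample_ages = [float(age) for age in ["19", "18", "28"]]

def summarize(values):
    """Return (mean, population standard deviation, median) of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.pstdev(values),
        statistics.median(values),
    )

mean_age, std_age, median_age = summarize(sample_ages)
print(round(mean_age, 1), round(std_age, 1), median_age)
```

Running `summarize([float(a) for a in Age])` on the full list would be expected to agree with the mean and standard deviation printed above.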


In [20]:
#Calculate the median of ages:
def cal_median(Age):
    #Use the list passed in rather than the global Patient_info
    ages = sorted(float(age) for age in Age)
    n = len(ages)
    m = n // 2
    if n % 2 == 0:
        return (ages[m - 1] + ages[m]) / 2
    else:
        return ages[m]
median = cal_median(Age)
print("The median of patient ages is",round(median, 2))

The median of patient ages is 39.0


# Conclusion

In this project, the medical insurance costs in a dataset were investigated using Python. The attributes recorded for each patient, alongside the insurance cost, are age, sex, BMI, number of children, smoker status and region.

The dataset has a population of 1338 patients, consisting of 662 women and 676 men.
The average age is 39 years with a standard deviation of 14.0; the oldest patients are 64 years old and the youngest are 18 years old.

Patients are spread across four regions of the USA: southwest, southeast, northwest and northeast.
There are 325 patients from Southwest, 364 from Southeast, 325 from Northwest and 324 from Northeast.

One key parameter for insurance cost is the smoking status of a patient. There are 274 smokers and 1064 non-smokers.
Among the smokers, 115 are female, with an average insurance cost of 30679.0$, and 159 are male, with an average insurance cost of 33042.01$.
Overall, the average insurance cost for a smoker is 32050.23$, while the average insurance cost for a non-smoker is 8434.27$.

Another attribute that may affect the insurance cost is the number of children. There are 764 patients with one or more children, and their average age is 39.8 years.
Number of patients with:
  No child: 574
  One child: 324
  Two children: 240
  More than two children: 200
 

Overall, the average insurance cost for each patient is 13270.42$.
The minimum insurance cost is 1121.87$ and the maximum insurance cost is 63770.43$.
Patient information for:
  Minimum insurance cost:('18', 'male', '23.21', '0', 'no', 'southeast')
  Maximum insurance cost:('54', 'female', '47.41', '0', 'yes', 'southeast')

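These results highlight the influence of smoking status on the insurance cost. The influence of age is suggested only indirectly (the patient with the lowest cost is 18 years old and the patient with the highest cost is 54) and was not measured directly in this analysis.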
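
One way to examine the relationship between age and insurance cost more directly would be to average the charges within age bands. The sketch below is a suggestion rather than part of the original analysis: `average_charge_by_age_band`, `band_width` and `sample_patients` are hypothetical names, and the sample holds only the first three patients shown earlier, which is far too few to show any trend.

```python
from collections import defaultdict

# (age, charges) for the first three patients shown earlier; a full analysis
# would pass every row of Patient_info instead.
sample_patients = [("19", "16884.924"), ("18", "1725.5523"), ("28", "4449.462")]

def average_charge_by_age_band(patients, band_width=10):
    """Group patients into age bands and return {band start: average charge}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for age, charge in patients:
        # An age of 28 with band_width=10 falls in the band starting at 20.
        band_start = (int(age) // band_width) * band_width
        totals[band_start] += float(charge)
        counts[band_start] += 1
    return {band: totals[band] / counts[band] for band in sorted(totals)}

averages = average_charge_by_age_band(sample_patients)
print(averages)
```

With the notebook's data, the call could look like `average_charge_by_age_band((i[0], i[-1]) for i in Patient_info)`. Because smokers cost far more on average, splitting the bands by smoker status as well would probably give a clearer picture.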
