# Portfolio Project: US Medical Insurance Costs

This project analyses data in a CSV file using Python fundamentals. The CSV file contains data on patients and their medical insurance costs in the U.S. The attributes found in **insurance.csv** will be used to gain insight into how patient information influences medical costs.

To begin, Python's built-in csv library is needed to work with the **insurance.csv** data.

The next step is to look over the data and check certain aspects in order to know how to import the data into a Python file. These aspects are:

* The names of columns
* Missing data
* Which columns are numerical or categorical

In [4]:
#import csv library
import csv 

In [5]:
# Confirm that the file opens and can be handed to a CSV reader
with open('insurance.csv', mode='r', newline='') as file:
    csv_reader = csv.DictReader(file)

In [6]:
def print_first_10_rows(filename):
    # Open the CSV file
    with open(filename, mode='r') as file:
        csv_reader = csv.reader(file)
        
        # Get the header
        headers = next(csv_reader)
        print(f'Headers: {headers}')
        
        # Initialize a counter
        row_count = 0
        
        # Iterate over the rows in the CSV file
        for row in csv_reader:
            if row_count < 10:
                print(row)
                row_count += 1
            else:
                break

# Usage
filename = 'insurance.csv'
print_first_10_rows(filename)

Headers: ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924']
['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523']
['28', 'male', '33', '3', 'no', 'southeast', '4449.462']
['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061']
['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552']
['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216']
['46', 'female', '33.44', '1', 'no', 'southeast', '8240.5896']
['37', 'female', '27.74', '3', 'no', 'northwest', '7281.5056']
['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107']
['60', 'female', '25.84', '0', 'no', 'northwest', '28923.13692']


In [7]:
def check_missing_data(filename):
    # Open the CSV file
    with open(filename, mode='r') as file:
        csv_reader = csv.reader(file)
        
        # Get the header
        headers = next(csv_reader)
        
        # Initialize a list to keep track of rows with missing data
        missing_data_rows = []
        
        # Iterate over each row in the CSV file
        for row_num, row in enumerate(csv_reader, start=1):
            # Check each cell in the row
            for col_num, cell in enumerate(row):
                if cell.strip() == '':
                    missing_data_rows.append((row_num, headers[col_num]))
        
        return missing_data_rows

# Usage
filename = 'insurance.csv'
missing_data = check_missing_data(filename)
if missing_data:
    print("Missing data found in the following cells (row, column):")
    for row, col in missing_data:
        print(f"Row {row}, Column '{col}'")
else:
    print('No missing data found.')


No missing data found.


**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. 
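The third aspect listed earlier, which columns are numerical and which are categorical, can also be checked in code. The following is a minimal sketch rather than part of the original analysis: `SAMPLE_CSV`, `is_number` and `classify_columns` are hypothetical names, and the sample holds only the first three data rows printed above. For the full file, the same function could be passed a file object from `open('insurance.csv', newline='')`.

```python
import csv
import io

# First three data rows as printed above (assumed sample, not the full file)
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def classify_columns(file_obj):
    """Label a column 'numerical' if every value parses as a number, else 'categorical'."""
    reader = csv.DictReader(file_obj)
    column_types = {name: 'numerical' for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if not is_number(value):
                column_types[name] = 'categorical'
    return column_types

column_types = classify_columns(io.StringIO(SAMPLE_CSV))
for name, kind in column_types.items():
    print(f"{name}: {kind}")
```

On this sample, age, bmi, children and charges are labelled numerical, while sex, smoker and region are labelled categorical.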

To store this information, seven empty lists, one per column, will be created inside a dictionary to hold the data from **insurance.csv**.

In [86]:
def load_data_into_lists(filename):
    # Initialize a dictionary to hold the data lists
    patient_dict = {
        "age": [],
        "sex": [],
        "bmi": [],
        "children": [],
        "smoker": [],
        "region": [],
        "charges": []
    }
    
    # Open and read the CSV file
    with open(filename, mode='r') as file:
        csv_reader = csv.DictReader(file)
        
        # Iterate over each row in the CSV file
        for row in csv_reader:
            patient_dict["age"].append(int(row["age"]))
            patient_dict["sex"].append(row["sex"])
            patient_dict["bmi"].append(float(row["bmi"]))
            patient_dict["children"].append(int(row["children"]))
            patient_dict["smoker"].append(row["smoker"])
            patient_dict["region"].append(row["region"])
            patient_dict["charges"].append(float(row["charges"]))
    
    return patient_dict

# Print summary of the data
def print_summary(data, num_samples=5):
    """Print a summary of the data."""
    print("Data Summary:")
    for key, values in data.items():
        print(f"\n{key}:")
        print(f"  Number of entries: {len(values)}")
        print(f"  Sample values: {values[:num_samples]}")  # Show a sample of values

filename = 'insurance.csv'
data = load_data_into_lists(filename)
print_summary(data)

Data Summary:

age:
  Number of entries: 1338
  Sample values: [19, 18, 28, 33, 32]

sex:
  Number of entries: 1338
  Sample values: ['female', 'male', 'male', 'male', 'male']

bmi:
  Number of entries: 1338
  Sample values: [27.9, 33.77, 33.0, 22.705, 28.88]

children:
  Number of entries: 1338
  Sample values: [0, 1, 3, 0, 0]

smoker:
  Number of entries: 1338
  Sample values: ['yes', 'no', 'no', 'no', 'no']

region:
  Number of entries: 1338
  Sample values: ['southwest', 'southeast', 'southeast', 'northwest', 'northwest']

charges:
  Number of entries: 1338
  Sample values: [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]


# Data Analysis

Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can begin. This is where one must plan what to investigate and how to perform the analysis. There are many aspects of the data that could be looked into. The following operations will be implemented:

* find average age of all patients
* find average annual medical insurance costs for all patients
* return the number of males vs. females counted in the dataset
* calculate the average insurance costs for smokers vs. non-smokers
* calculate the average insurance costs for males compared to females
* find the average bmi for males compared to females
* find out how many males smoke compared to females
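Several of these operations share one pattern: split a numeric column by a categorical column, then average each group. As a hedged sketch of that pattern, the hypothetical helper `average_by_group` below works on the same dictionary-of-lists shape returned by `load_data_into_lists`. The values in `toy_data` are made up purely for illustration and are not taken from **insurance.csv**.

```python
def average_by_group(patient_data, group_key, value_key):
    """Return {group label: average of value_key} for each label found in group_key."""
    totals = {}
    counts = {}
    for group, value in zip(patient_data[group_key], patient_data[value_key]):
        totals[group] = totals.get(group, 0) + value
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up values, used only to show the shape of the input and output
toy_data = {
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "charges": [100.0, 200.0, 400.0, 300.0],
}

charges_by_sex = average_by_group(toy_data, "sex", "charges")
charges_by_smoker = average_by_group(toy_data, "smoker", "charges")
print(charges_by_sex)
print(charges_by_smoker)
```

The dedicated functions in the sections below spell out each comparison separately, which is easier to follow step by step. A helper like this one is simply a more compact alternative.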



### Average Age of All Patients

In [13]:
def calculate_average_age(patient_data):
    # Extract the list of ages
    ages = patient_data["age"]
    
    # Calculate the average age
    average_age = sum(ages) / len(ages)
    
    return average_age

filename = 'insurance.csv'
patient_data = load_data_into_lists(filename)
average_age = calculate_average_age(patient_data)

print(f"Average age of all patients: {average_age}")

Average age of all patients: 39.20702541106129


The average patient age is around 39 years old.

### Average Annual Medical Insurance Cost for all Patients

In [16]:
def calculate_average_insurance_cost(patient_data):
    # Extract the list of insurance charges
    insurance_charges = patient_data['charges']
    
    # Calculate the average insurance cost
    average_cost = sum(insurance_charges) / len(insurance_charges)
    
    rounded_average_cost = round(average_cost, 2)
    
    return rounded_average_cost

# Usage
filename = 'insurance.csv'
patient_data = load_data_into_lists(filename)
average_insurance_cost = calculate_average_insurance_cost(patient_data)

print(f"Average insurance cost: {average_insurance_cost} U.S. dollars.")

Average insurance cost: 13270.42 U.S. dollars.


### Average Insurance Cost for Smoker vs. Non-Smoker

In [18]:
def calculate_average_charges_by_smoker_status(patient_data):
    # Separate charges by smoker status
    smoker_charges = []
    non_smoker_charges = []
    
    for smoker, charge in zip(patient_data['smoker'], patient_data['charges']):
        if smoker == 'yes':
            smoker_charges.append(charge)
        elif smoker == 'no':
            non_smoker_charges.append(charge)
    
    # Calculate averages
    average_smoker_charges = sum(smoker_charges) / len(smoker_charges)
    average_non_smoker_charges = sum(non_smoker_charges) / len(non_smoker_charges)
    
    rounded_average_smoker_charges = round(average_smoker_charges, 2)
    rounded_average_non_smoker_charges = round(average_non_smoker_charges, 2)
    
    return rounded_average_smoker_charges, rounded_average_non_smoker_charges

# Usage
filename = 'insurance.csv'
patient_data = load_data_into_lists(filename)
average_smoker_charges, average_non_smoker_charges = calculate_average_charges_by_smoker_status(patient_data)

print(f"Average charges for smokers: {average_smoker_charges} U.S. dollars.")
print(f"Average charges for non-smokers: {average_non_smoker_charges} U.S. dollars.")

Average charges for smokers: 32050.23 U.S. dollars.
Average charges for non-smokers: 8434.27 U.S. dollars.


The average insurance cost for smokers is significantly higher than that for non-smokers, which is consistent with the negative health implications of smoking.
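To put a number on the gap, the two averages can be compared directly. This small sketch copies the rounded averages from the output above instead of recomputing them from the file, so the results are approximate.

```python
# Rounded averages copied from the printed output above
average_smoker_charges = 32050.23
average_non_smoker_charges = 8434.27

# Absolute and relative gap between the two groups
difference = round(average_smoker_charges - average_non_smoker_charges, 2)
ratio = round(average_smoker_charges / average_non_smoker_charges, 2)

print(f"Difference in average charges: {difference} U.S. dollars.")
print(f"Smokers are charged about {ratio} times as much on average.")
```

On these figures, smokers are charged roughly 3.8 times as much as non-smokers on average.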

### Males vs. Females

In [21]:
def count_genders(patient_data):
    # Extract the list of genders
    genders = patient_data["sex"]
    
    # Count the number of males and females
    num_males = genders.count("male")
    num_females = genders.count("female")
    
    return num_males, num_females

# Usage
filename = 'insurance.csv'
patient_data = load_data_into_lists(filename)
num_males, num_females = count_genders(patient_data)

print(f"Number of males: {num_males}")
print(f"Number of females: {num_females}")

Number of males: 676
Number of females: 662


### Average Insurance Costs for Males vs. Females

In [23]:
def calculate_average_charges(patient_data):
    # Separate charges by gender
    female_charges = []
    male_charges = []
    
    for sex, charge in zip(patient_data['sex'], patient_data['charges']):
        if sex == 'female':
            female_charges.append(charge)
        elif sex == 'male':
            male_charges.append(charge)
    
    # Calculate averages
    average_female_charges = sum(female_charges) / len(female_charges)
    average_male_charges = sum(male_charges) / len(male_charges)
    
    rounded_average_female_charges = round(average_female_charges, 2)
    rounded_average_male_charges = round(average_male_charges, 2)
    
    return rounded_average_female_charges, rounded_average_male_charges

# Usage
filename = 'insurance.csv'
patient_data = load_data_into_lists(filename)
average_female_charges, average_male_charges = calculate_average_charges(patient_data)

print(f"Average charges for females: {average_female_charges} U.S. dollars.")
print(f"Average charges for males: {average_male_charges} U.S. dollars.")

Average charges for females: 12569.58 U.S. dollars.
Average charges for males: 13956.75 U.S. dollars.


There are 14 more males than females in this dataset, and the average insurance cost for males is higher than that for females. The reason for this is not yet obvious, so it is explored below by comparing the average age and BMI of males and females and by checking for a link between sex and smoking. A link between sex and number of children could also be checked, but is not covered here.
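As a quick consistency check on the figures so far, the overall average should equal the two group averages weighted by group size. The sketch below recomputes it from the counts and rounded averages printed above, so agreement is only expected to within rounding.

```python
# Counts and rounded averages copied from the printed output above
num_males, num_females = 676, 662
average_male_charges, average_female_charges = 13956.75, 12569.58

# Weight each group's average by the number of patients in that group
weighted_average = (
    num_males * average_male_charges + num_females * average_female_charges
) / (num_males + num_females)

print(f"Overall average from group averages: {round(weighted_average, 2)} U.S. dollars.")
```

The result matches the overall average of 13270.42 U.S. dollars computed earlier.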

### Average Ages of Males vs. Females

In [26]:
def calculate_average_ages_by_sex(patient_data):
    # Separate ages by gender
    female_ages = []
    male_ages = []
    
    for sex, age in zip(patient_data['sex'], patient_data['age']):
        if sex == 'female':
            female_ages.append(age)
        elif sex == 'male':
            male_ages.append(age)
    
    # Calculate averages
    average_female_age = sum(female_ages) / len(female_ages)
    average_male_age = sum(male_ages) / len(male_ages)
    
    return average_female_age, average_male_age

# Usage
filename = 'insurance.csv'
patient_data = load_data_into_lists(filename)
average_female_age, average_male_age = calculate_average_ages_by_sex(patient_data)

print(f"Average age for females: {average_female_age}")
print(f"Average age for males: {average_male_age}")

Average age for females: 39.503021148036254
Average age for males: 38.917159763313606


The average age of the males is slightly lower than that of the females. This does not explain the difference in costs, as we would expect younger patients to be healthier and therefore to have lower insurance costs.

### Average BMI for Males vs. Females

In [29]:
def calculate_average_bmi_by_sex(patient_data):
    # Separate bmi by gender
    female_bmi = []
    male_bmi = []
    
    for sex, bmi in zip(patient_data['sex'], patient_data['bmi']):
        if sex == 'female':
            female_bmi.append(bmi)
        elif sex == 'male':
            male_bmi.append(bmi)
    
    # Calculate averages
    average_female_bmi = sum(female_bmi) / len(female_bmi)
    average_male_bmi = sum(male_bmi) / len(male_bmi)
    
    return average_female_bmi, average_male_bmi

# Usage
filename = 'insurance.csv'
patient_data = load_data_into_lists(filename)
average_female_bmi, average_male_bmi = calculate_average_bmi_by_sex(patient_data)

print(f"Average bmi for females: {average_female_bmi}")
print(f"Average bmi for males: {average_male_bmi}")

Average bmi for females: 30.377749244713023
Average bmi for males: 30.943128698224832


Males in this dataset have a slightly higher Body Mass Index than the females. This could be a contributing factor to the higher insurance cost for males.

### Who Smokes More, Males or Females?

Next, we will analyse the data further to see whether more males than females in this dataset smoke, as this could provide another reason why their insurance costs are higher.

In [33]:
def number_of_smokers_by_sex(patient_data):
    # Separate by gender
    female_smokers = 0
    male_smokers = 0
    
    for sex, smoker in zip(patient_data['sex'], patient_data['smoker']):
        if sex == 'female' and smoker == 'yes':
            female_smokers += 1
        elif sex == 'male' and smoker == 'yes':
            male_smokers += 1
    
    return female_smokers, male_smokers

# Usage
filename = 'insurance.csv'
patient_data = load_data_into_lists(filename)
female_smokers, male_smokers = number_of_smokers_by_sex(patient_data)

print(f"Number of female smokers: {female_smokers}")
print(f"Number of male smokers: {male_smokers}")

Number of female smokers: 115
Number of male smokers: 159


We can see that there are 44 more male smokers than female smokers in this dataset. This is likely to contribute to the higher average insurance cost for males, as we previously saw that smoking has a large impact on insurance costs.
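Because the two groups differ slightly in size, smoking rates are a fairer comparison than raw counts. The sketch below divides the smoker counts above by the group sizes found in the Males vs. Females section. The variable names `female_rate` and `male_rate` are introduced here for illustration.

```python
# Counts copied from the printed outputs above
female_smokers, male_smokers = 115, 159
num_females, num_males = 662, 676

# Share of each group that smokes
female_rate = female_smokers / num_females
male_rate = male_smokers / num_males

print(f"Female smoking rate: {female_rate:.1%}")
print(f"Male smoking rate: {male_rate:.1%}")
```

About 17.4% of the females and 23.5% of the males in the dataset smoke, so the difference holds for rates as well as for counts.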

# Conclusion 

In this project, data on U.S. medical insurance costs were analysed. The data supported the assumption that patients who smoke have higher insurance costs than non-smokers, which is consistent with the health risks of smoking. The average age of the patients was around 39 years, with males slightly younger than females. Despite this, males had a higher average insurance cost than females. Investigating this led to the discovery that more of the males in the dataset were smokers, which is likely to have contributed to their higher insurance costs.