*U.S. Medical Insurance Costs*

In this project I would like to explore which regions have the highest medical insurance costs and what might be driving those higher costs.

First, I need to prepare the data for analysis. It is currently in CSV format and will need to be imported and stored in list variables in Python.

Second, I will analyze the data to find correlations and possible trends by splitting the insurance costs into groups and identifying commonalities in the patient information. This will also let me identify the regions with the highest insurance costs.

Lastly, I will summarize my findings and report on what I set out to find.

In [2]:
# Import csv
import csv

In [3]:
# Create lists for the data
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
cost = []
insurance_data = []

In [4]:
# Populate Lists with data
with open('insurance.csv') as insurance_csv:
    insurance_info = csv.DictReader(insurance_csv)
    for row in insurance_info:
        insurance_data.append(row)
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        cost.append(row['charges'])
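
`csv.DictReader` returns every field as a string, so the lists above hold text such as `'19'` rather than numbers. A minimal sketch of converting the numeric columns while reading is shown here, assuming the same column names as `insurance.csv`. The helper name `load_typed_rows` is hypothetical, and the two-row sample is made up for demonstration.

```python
import csv
import io

def load_typed_rows(csv_file):
    """Read insurance rows and convert numeric columns from str to int/float."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append({
            'age': int(row['age']),
            'sex': row['sex'],
            'bmi': float(row['bmi']),
            'children': int(row['children']),
            'smoker': row['smoker'],
            'region': row['region'],
            'charges': float(row['charges']),
        })
    return rows

# Made-up sample standing in for insurance.csv (not real records)
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16000.0\n"
    "18,male,30.0,1,no,southeast,1700.0\n"
)
typed_rows = load_typed_rows(sample)
print(min(r['age'] for r in typed_rows))  # compares integers, not strings
```

With the real file, the same helper could be called inside `with open('insurance.csv') as insurance_csv:`.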

In [5]:
# counting how many males and females have medical insurance
def sex_count(sex):
    sex_male = 0
    sex_female = 0
    for gender in sex:
        if gender == 'male':
            sex_male += 1
        elif gender == 'female':
            sex_female += 1
    print('Male: {}'.format(sex_male))
    print( 'Female: {}'.format(sex_female))
sex_count(sex)



Male: 676
Female: 662


In [6]:
# identifying different regions
def count_regions(region):
    regions = []
    counts = []
    for place in region:
        if place not in regions:
            regions.append(place)
    # counting how many patients in each region
    for i in regions:
        counts.append(region.count(i))
    # create dictionary showing each region and number of patients
    region_count = dict(zip(regions, counts))
    return region_count
# the function has its own name, so re-running this cell does not overwrite it
unique_region = count_regions(region)
print(unique_region)
    

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}


In [7]:
# calculate the min, average and max ages
def age_summary(age):
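    # convert to int first: the CSV values are strings, and min/max on
    # strings compare alphabetically rather than numerically
    ages = [int(i) for i in age]
    average_age = round(sum(ages)/len(ages))
    min_age = min(ages)
    max_age = max(ages)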
    print('Minimum Age: {}'.format(min_age))
    print('Average Age: {}'.format(average_age))
    print('Maximum Age: {}'.format(max_age))
age_summary(age)

Minimum Age: 18
Average Age: 39
Maximum Age: 64


In [8]:
# calculate the average cost of insurance
def average_cost(cost):
    cost_total = 0
    for i in cost:
        cost_total += float(i)
    average_cost = round(cost_total/len(cost), 2)
    print('Average Insurance Cost: {}'.format(average_cost))
average_cost(cost)

Average Insurance Cost: 13270.42
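
The introduction mentions splitting the insurance costs into groups and looking for commonalities. One possible way to do that, sketched here under assumptions, is to split the rows at a cost threshold (for example the average above) and compare a trait such as smoking between the two groups. The helper names `split_by_cost` and `smoker_share` are hypothetical, and the demo rows are made up.

```python
def split_by_cost(data, threshold):
    """Split row dicts into (above, at_or_below) groups by their charges."""
    above, at_or_below = [], []
    for row in data:
        if float(row['charges']) > threshold:
            above.append(row)
        else:
            at_or_below.append(row)
    return above, at_or_below

def smoker_share(rows):
    """Fraction of rows whose 'smoker' field is 'yes' (0.0 for an empty group)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row['smoker'] == 'yes') / len(rows)

# Made-up rows for demonstration only, not taken from insurance.csv
demo_rows = [
    {'charges': '20000.0', 'smoker': 'yes'},
    {'charges': '18000.0', 'smoker': 'no'},
    {'charges': '4000.0', 'smoker': 'no'},
    {'charges': '3000.0', 'smoker': 'no'},
]
high, low = split_by_cost(demo_rows, threshold=10000.0)
print(smoker_share(high), smoker_share(low))
```

In this notebook the call would be `split_by_cost(insurance_data, 13270.42)`, after which the same comparison could be repeated for BMI or number of children.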


In [9]:
# create function to calculate average insurance cost for one region
# (named differently from average_cost above so neither definition overwrites the other)
def average_cost_in_region(data, region):
    cost_total = 0
    cost_len = 0
    for info in data:
        if info['region'] == region:
            cost_total += float(info['charges'])
            cost_len += 1
    return cost_total/cost_len

# create function to iterate through unique regions
def average_cost_by_region(data, unique_region):
    cost_by_region = {}
    for key in unique_region:
        cost_by_region[key] = round(average_cost_in_region(data, key), 2)
    return cost_by_region

average_cost_by_region(insurance_data, unique_region)

{'southwest': 12346.94,
 'southeast': 14735.41,
 'northwest': 12417.58,
 'northeast': 13406.38}
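
The per-region average follows a pattern that recurs for other columns: filter by region, sum, divide by the count. A minimal sketch of one generic helper is shown here, assuming rows are dictionaries of strings as produced by `csv.DictReader`. The name `mean_by_group` is hypothetical and the demo rows are made up.

```python
from collections import defaultdict

def mean_by_group(data, group_key, value_key):
    """Average value_key for each distinct value of group_key, in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in data:
        totals[row[group_key]] += float(row[value_key])
        counts[row[group_key]] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

# Made-up rows for demonstration only, not taken from insurance.csv
demo_rows = [
    {'region': 'southwest', 'charges': '100.0', 'bmi': '30.0'},
    {'region': 'southwest', 'charges': '300.0', 'bmi': '32.0'},
    {'region': 'northeast', 'charges': '50.0', 'bmi': '25.0'},
]
demo_cost = mean_by_group(demo_rows, 'region', 'charges')
demo_bmi = mean_by_group(demo_rows, 'region', 'bmi')
print(demo_cost)
print(demo_bmi)
```

Because the grouping column is a parameter, the same call could group by `'smoker'` or `'sex'` instead of `'region'`.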

In [10]:
# create function to calculate average bmi
def average_bmi(data, region):
    bmi_total = 0
    bmi_len = 0
    for info in data:
        if info['region'] == region:
            bmi_total += float(info['bmi'])
            bmi_len += 1
    average_bmi = bmi_total/bmi_len
    return average_bmi
# create function to iterate through unique regions
def average_bmi_by_region(data, unique_region):
    bmi_by_region = {}
    for key in unique_region:
        bmi_by_region[key] = round(average_bmi(data, key), 2)
    return bmi_by_region

average_bmi_by_region(insurance_data, unique_region)


{'southwest': 30.6, 'southeast': 33.36, 'northwest': 29.2, 'northeast': 29.17}

In [11]:
# number of smokers
smoker_count = 0 
for i in smoker:
    if i == 'yes':
        smoker_count += 1
print(smoker_count)

# counting number of smokers by region
def smokers_by_region(data, unique_region):
    smokers = {}
    for region_name in unique_region:
        # count the smokers among the patients in this region
        count = 0
        for info in data:
            if info['region'] == region_name and info['smoker'] == 'yes':
                count += 1
        smokers[region_name] = count
    return smokers
    
smokers_by_region(insurance_data, unique_region)

274


{'southwest': 58, 'southeast': 91, 'northwest': 58, 'northeast': 67}
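
The regions differ in size, so raw counts are easier to compare once they are expressed as a share of each region's patients. A short sketch using the counts printed above is shown here; the helper name `rate_by_region` is hypothetical.

```python
def rate_by_region(counts, totals):
    """Return each region's count as a percentage of that region's total."""
    return {region: round(100 * counts[region] / totals[region], 1)
            for region in counts}

# Counts copied from the notebook outputs above
patients = {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
smokers = {'southwest': 58, 'southeast': 91, 'northwest': 58, 'northeast': 67}

smoker_rates = rate_by_region(smokers, patients)
print(smoker_rates)
# {'southwest': 17.8, 'southeast': 25.0, 'northwest': 17.8, 'northeast': 20.7}
```

On these numbers the Southeast has the highest share of smokers as well as the highest count.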

In [12]:
# the total number of patients with children
patient_with_children = 0 
for i in children:
    if int(i) > 0:
        patient_with_children += 1
print(patient_with_children)
        
# counting number of patients with children in each region
def children_by_region(data, unique_region):
    with_children = {}
    for region_name in unique_region:
        # count the patients in this region who have at least one child
        count = 0
        for info in data:
            if info['region'] == region_name and int(info['children']) > 0:
                count += 1
        with_children[region_name] = count
    return with_children

children_by_region(insurance_data, unique_region)

764


{'southwest': 187, 'southeast': 207, 'northwest': 193, 'northeast': 177}

**Summary of findings**

The dataset contains 1338 patient entries, with a gender split of:
    males: 676
    females: 662
    
The average age of the patients was 39 years.
The average insurance cost for a patient was 13270.42 dollars.
There were 274 smokers in total, and 764 patients had one or more children.

The patient pool comprised 4 regions:
    Southwest
    Southeast
    Northwest
    Northeast 

The number of patients in each region was:
    Southwest: 325
    Southeast: 364
    Northwest: 325
    Northeast: 324

The average insurance cost for each region was:
    Southwest: 12346.94 dollars
    Southeast: 14735.41 dollars
    Northwest: 12417.58 dollars
    Northeast: 13406.38 dollars

The average BMI for each region was:
    Southwest: 30.6
    Southeast: 33.36
    Northwest: 29.2
    Northeast: 29.17

The number of smokers in each region was:
    Southwest: 58
    Southeast: 91
    Northwest: 58
    Northeast: 67

The number of patients with a child/children was:
    Southwest: 187
    Southeast: 207
    Northwest: 193
    Northeast: 177

**Conclusion**

This project was an exploratory data analysis (EDA) of a medical insurance dataset containing 1,338 patient records. Using only Python's built-in `csv` module with plain lists and dictionaries, key demographic and cost-related metrics were extracted and summarized.

The dataset was relatively balanced in terms of gender (676 males vs. 662 females), with an overall mean age of 39. The average insurance cost per patient was $13,270.42. Smoking status, BMI, and number of children were summarized by region as potential factors influencing insurance charges.

Regional analysis showed that the Southeast had the highest average insurance cost ($14,735.41), the highest average BMI (33.36), and the most smokers (91 of 364 patients, or 25%, which is also the highest share of any region). The Northeast had the second-highest average cost ($13,406.38) despite having the lowest average BMI (29.17), while the Southwest and Northwest had the lowest average costs ($12,346.94 and $12,417.58 respectively). The Southeast also had the most patients with children (207), but it is the largest region; as a share of patients, the Northwest was highest (193 of 325, about 59%).
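
The regional comparison rests on averages alone. One way to quantify how strongly two per-patient variables move together is the Pearson correlation coefficient. The sketch below is a plain-Python illustration, not something computed in this notebook; the function name `pearson` is hypothetical and the demo values are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up, perfectly linear values for demonstration only
demo_bmi = [20.0, 25.0, 30.0, 35.0]
demo_charges = [1000.0, 2000.0, 3000.0, 4000.0]
r = pearson(demo_bmi, demo_charges)
print(round(r, 3))
```

With the notebook's lists, the inputs would first need converting, for example `pearson([float(b) for b in bmi], [float(c) for c in cost])`. A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship.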

These findings suggest that region, lifestyle factors (such as BMI and smoking), and dependents may be associated with insurance charges, although region-level averages alone cannot show which factor drives the differences. This project served to strengthen proficiency in data wrangling and descriptive statistics using Python, providing a foundation for future work in predictive modeling or health risk analytics.
