# U.S. Medical Insurance Costs

## Data Exploration

1. What is the range of ages in the dataset? What is the average age?

In [176]:
import csv

ages = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        ages.append(int(row[0]))

age_range = max(ages) - min(ages)
average_age = sum(ages) / len(ages)

print(f"Range of Ages: {age_range}")
print(f"Average Age: {average_age:.2f}")

Range of Ages: 46
Average Age: 39.21
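
Every cell in this notebook re-opens the file and picks columns by position. As a possible alternative, the sketch below reads the rows once with `csv.DictReader` and converts the numeric columns up front. The column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`) are an assumption, since the notebook only shows positional indices, and the three rows are a made-up sample standing in for `insurance.csv`.

```python
import csv
import io

# Made-up sample standing in for insurance.csv; the column names are assumed.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
33,male,22.7,0,no,northwest,21984.47
60,female,25.8,0,no,northwest,28923.14
"""

def load_records(file_obj):
    """Read every row once and convert the numeric columns."""
    records = []
    for row in csv.DictReader(file_obj):
        records.append({
            "age": int(row["age"]),
            "sex": row["sex"],
            "bmi": float(row["bmi"]),
            "children": int(row["children"]),
            "smoker": row["smoker"],
            "region": row["region"],
            "charges": float(row["charges"]),
        })
    return records

records = load_records(io.StringIO(SAMPLE_CSV))
ages = [record["age"] for record in records]

print(f"Range of Ages: {max(ages) - min(ages)}")
print(f"Average Age: {sum(ages) / len(ages):.2f}")
```

With the real file, `load_records(open('insurance.csv', newline=''))` would take the place of the `io.StringIO` call, provided the header matches the assumed names.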


2. How many males and females are in the dataset?

In [177]:
import csv

sexes = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        sexes.append(row[1])

male_count = sexes.count("male")
female_count = sexes.count("female")

print(f"Number of Males: {male_count}")
print(f"Number of Females: {female_count}")

Number of Males: 676
Number of Females: 662


3. What is the distribution of BMI values? What are the highest and lowest BMI values? What is the average BMI?

In [178]:
import csv

bmis = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        bmis.append(float(row[2]))

# Calculate the number of different BMI values
unique_bmis = set(bmis)
num_unique_bmis = len(unique_bmis)

# Find the lowest and highest BMI
lowest_bmi = min(bmis)
highest_bmi = max(bmis)

# Calculate the average BMI
average_bmi = sum(bmis) / len(bmis)

print(f"Number of Different BMI Values: {num_unique_bmis}")
print(f"Lowest BMI: {lowest_bmi:.2f}")
print(f"Highest BMI: {highest_bmi:.2f}")
print(f"Average BMI: {average_bmi:.2f}")

Number of Different BMI Values: 548
Lowest BMI: 15.96
Highest BMI: 53.13
Average BMI: 30.66
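
The count of distinct values says little about how BMI is spread across its range. One way to get a rough picture without extra libraries is to group the values into fixed-width bins. The sketch below assumes a bin width of 5 and uses a short illustrative list (only the endpoints 15.96 and 53.13 come from the output above; the other values are made up).

```python
BIN_WIDTH = 5.0  # assumed bin width, chosen only for illustration

def bin_counts(values, bin_width=BIN_WIDTH):
    """Count values per half-open bin [start, start + bin_width)."""
    counts = {}
    for value in values:
        start = (value // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Illustrative values, mostly made up and not read from insurance.csv.
sample_bmis = [15.96, 22.7, 24.9, 27.9, 28.88, 30.0, 33.77, 53.13]

# Print a simple text histogram, one row per non-empty bin.
for start, count in bin_counts(sample_bmis).items():
    print(f"{start:4.1f} to {start + BIN_WIDTH:4.1f}: {'#' * count} ({count})")
```

Passing the full `bmis` list from the cell above instead of `sample_bmis` would show where most of the BMI values fall.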


4. How many individuals have each number of children? What is the most common number of children?

In [179]:
import csv

children = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        children.append(int(row[3]))

child_count = {}

for child in children:
    if child in child_count:
        child_count[child] += 1
    else:
        child_count[child] = 1

most_common_children = max(child_count, key=child_count.get)

print("Number of Individuals with Different Numbers of Children:")
for num_children, count in child_count.items():
    print(f"{num_children} children: {count} individuals")

print(f"Most Common Number of Children: {most_common_children}")

Number of Individuals with Different Numbers of Children:
0 children: 574 individuals
1 children: 324 individuals
3 children: 157 individuals
2 children: 240 individuals
5 children: 18 individuals
4 children: 25 individuals
Most Common Number of Children: 0
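
The rows above appear in the order each value was first seen in the file (0, 1, 3, 2, 5, 4). A minimal sketch of the same tally using `collections.Counter`, with the rows sorted before printing, is shown below. The input list is hypothetical and not taken from `insurance.csv`.

```python
from collections import Counter

# Hypothetical values for illustration only, not read from insurance.csv.
sample_children = [0, 1, 0, 3, 2, 0, 1, 5]

child_count = Counter(sample_children)

# Sort by number of children so the rows print in ascending order.
for num_children, count in sorted(child_count.items()):
    print(f"{num_children} children: {count} individuals")

# most_common(1) returns a list holding the single (value, count) pair with the highest count.
most_common_children, top_count = child_count.most_common(1)[0]
print(f"Most Common Number of Children: {most_common_children}")
```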


## Smoking Analysis

1. What percentage of individuals are smokers?

In [180]:
import csv

smokers = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        smokers.append(row[4])

total_individuals = len(smokers)
smoker_count = smokers.count("yes")
percentage_smokers = (smoker_count / total_individuals) * 100

print(f"Percentage of Individuals Who Are Smokers: {percentage_smokers:.2f}%")

Percentage of Individuals Who Are Smokers: 20.48%


2. Are there any differences in charges between smokers and non-smokers?

In [181]:
import csv

smokers = []
charges = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        smokers.append(row[4])
        charges.append(float(row[6]))

smoker_charges = [charge for smoke, charge in zip(smokers, charges) if smoke == "yes"]
non_smoker_charges = [charge for smoke, charge in zip(smokers, charges) if smoke == "no"]

average_smoker_charges = sum(smoker_charges) / len(smoker_charges)
average_non_smoker_charges = sum(non_smoker_charges) / len(non_smoker_charges)

print(f"Average Charges for Smokers: {average_smoker_charges:.2f}")
print(f"Average Charges for Non-Smokers: {average_non_smoker_charges:.2f}")

Average Charges for Smokers: 32050.23
Average Charges for Non-Smokers: 8434.27


## Regional Analysis

1. How many individuals are from each region?

In [182]:
import csv

regions = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        regions.append(row[5])

region_count = {}

for region in regions:
    if region in region_count:
        region_count[region] += 1
    else:
        region_count[region] = 1

print("Number of Individuals from Each Region:")
for region, count in region_count.items():
    print(f"{region}: {count} individuals")

Number of Individuals from Each Region:
southwest: 325 individuals
southeast: 364 individuals
northwest: 325 individuals
northeast: 324 individuals


2. Is there any variation in charges based on the region?

In [183]:
import csv

regions = []
charges = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        regions.append(row[5])
        charges.append(float(row[6]))

region_charges = {}

for region, charge in zip(regions, charges):
    if region in region_charges:
        region_charges[region].append(charge)
    else:
        region_charges[region] = [charge]

print("Average Charges Based on Region:")
for region, charge_list in region_charges.items():
    average_charge = sum(charge_list) / len(charge_list)
    print(f"{region}: {average_charge:.2f}")

Average Charges Based on Region:
southwest: 12346.94
southeast: 14735.41
northwest: 12417.58
northeast: 13406.38


## Relationship Between Features

1. Is there a relationship between age and charges?

In [184]:
import csv

ages = []
charges = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        ages.append(int(row[0]))
        charges.append(float(row[6]))

n = len(ages)
mean_age = sum(ages) / n
mean_charges = sum(charges) / n

numerator = sum((age - mean_age) * (charge - mean_charges) for age, charge in zip(ages, charges))
denominator_age = sum((age - mean_age)**2 for age in ages)
denominator_charges = sum((charge - mean_charges)**2 for charge in charges)

correlation_coefficient = numerator / (denominator_age**0.5 * denominator_charges**0.5)

print(f"Correlation Coefficient between Age and Charges: {correlation_coefficient:.4f}")

Correlation Coefficient between Age and Charges: 0.2990
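
The calculation above can be wrapped in a function so that the same check can be run on other numeric columns, for example BMI (`row[2]`) against charges, which this notebook does not compute. The sketch below is one way to do that; the function name `pearson` and the toy inputs are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    # A constant column has zero spread, which would raise ZeroDivisionError here.
    return covariance / math.sqrt(spread_x * spread_y)

# Toy data, not from insurance.csv: one perfectly increasing pair, one perfectly decreasing.
xs = [1, 2, 3, 4]
rising = pearson(xs, [10, 20, 30, 40])
falling = pearson(xs, [40, 30, 20, 10])

print(f"Rising pair: {rising:.4f}")
print(f"Falling pair: {falling:.4f}")
```

The two toy pairs sit at the extremes of the scale (1 and -1), which gives a reference point for reading a value such as 0.2990.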


2. Is there a difference in charges based on sex?

In [185]:
import csv

sexes = []
charges = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        sexes.append(row[1])
        charges.append(float(row[6]))

def calculate_average_charges(group):
    total_charges = 0
    count = 0
    for sex, charge in zip(sexes, charges):
        if sex == group:
            total_charges += charge
            count += 1
    return total_charges / count if count > 0 else 0

average_charges_male = calculate_average_charges("male")
average_charges_female = calculate_average_charges("female")

print(f"Average Charges for Males: {average_charges_male:.2f}")
print(f"Average Charges for Females: {average_charges_female:.2f}")

Average Charges for Males: 13956.75
Average Charges for Females: 12569.58


3. How does the number of children an individual has affect the cost of insurance?

In [186]:
import csv

children = []
charges = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        children.append(int(row[3]))
        charges.append(float(row[6]))

def calculate_average_charges(group):
    total_charges = 0
    count = 0
    for num_children, charge in zip(children, charges):
        if num_children == group:
            total_charges += charge
            count += 1
    return total_charges / count if count > 0 else 0

average_charges_with_children_0 = calculate_average_charges(0)
average_charges_with_children_1 = calculate_average_charges(1)
average_charges_with_children_2 = calculate_average_charges(2)
average_charges_with_children_3 = calculate_average_charges(3)
average_charges_with_children_4 = calculate_average_charges(4)

print(f"Average Charges for Individuals with 0 Children: {average_charges_with_children_0:.2f}")
print(f"Average Charges for Individuals with 1 Child: {average_charges_with_children_1:.2f}")
print(f"Average Charges for Individuals with 2 Children: {average_charges_with_children_2:.2f}")
print(f"Average Charges for Individuals with 3 Children: {average_charges_with_children_3:.2f}")
print(f"Average Charges for Individuals with 4 Children: {average_charges_with_children_4:.2f}")

Average Charges for Individuals with 0 Children: 12365.98
Average Charges for Individuals with 1 Child: 12731.17
Average Charges for Individuals with 2 Children: 15073.56
Average Charges for Individuals with 3 Children: 15355.32
Average Charges for Individuals with 4 Children: 13850.66
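
The cell above lists groups 0 through 4 by hand, so the individuals with 5 children are not reported. A minimal sketch of a helper that averages every group found in the data is shown below. The name `average_by_group` and the sample values are hypothetical.

```python
from collections import defaultdict

def average_by_group(keys, values):
    """Return {key: mean of the values that share that key}, sorted by key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in sorted(totals)}

# Hypothetical values for illustration only, not read from insurance.csv.
sample_children = [0, 0, 1, 5, 5]
sample_charges = [1000.0, 3000.0, 2500.0, 4000.0, 6000.0]

averages = average_by_group(sample_children, sample_charges)
for num_children, average in averages.items():
    print(f"Average Charges for Individuals with {num_children} Children: {average:.2f}")
```

Calling `average_by_group(children, charges)` with the lists built in the cell above would cover every group, including 5 children. The same helper would also work for the sex and region comparisons.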


4. Is there a difference in the number of individuals with and without children between smokers and non-smokers?

In [187]:
import csv

smokers = []
children = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        smokers.append(row[4])
        children.append(int(row[3]))

smokers_with_children = sum(1 for smoke, num_children in zip(smokers, children) if smoke == "yes" and num_children > 0)
smokers_no_children = sum(1 for smoke, num_children in zip(smokers, children) if smoke == "yes" and num_children == 0)
non_smokers_with_children = sum(1 for smoke, num_children in zip(smokers, children) if smoke == "no" and num_children > 0)
non_smokers_no_children = sum(1 for smoke, num_children in zip(smokers, children) if smoke == "no" and num_children == 0)

print(f"Number of Smokers with Children: {smokers_with_children}")
print(f"Number of Smokers without Children: {smokers_no_children}")
print(f"Number of Non-Smokers with Children: {non_smokers_with_children}")
print(f"Number of Non-Smokers without Children: {non_smokers_no_children}")

Number of Smokers with Children: 159
Number of Smokers without Children: 115
Number of Non-Smokers with Children: 605
Number of Non-Smokers without Children: 459
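
Raw counts are hard to compare because the dataset has far more non-smokers than smokers. Converting each group to a percentage makes the comparison direct. The sketch below reuses the four counts printed above; the helper name `share_with_children` is an assumption.

```python
# Counts copied from the output above.
smokers_with_children = 159
smokers_no_children = 115
non_smokers_with_children = 605
non_smokers_no_children = 459

def share_with_children(with_children, without_children):
    """Percentage of a group that has at least one child."""
    return with_children / (with_children + without_children) * 100

smoker_share = share_with_children(smokers_with_children, smokers_no_children)
non_smoker_share = share_with_children(non_smokers_with_children, non_smokers_no_children)

print(f"Smokers with Children: {smoker_share:.1f}%")
print(f"Non-Smokers with Children: {non_smoker_share:.1f}%")
```

The shares come out at about 58.0% for smokers and 56.9% for non-smokers, so the two groups look very similar in this respect.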


## Statistical Analysis

1. Calculate the mean, median, and standard deviation of insurance charges.

In [188]:
import csv
import math

charges = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        charges.append(float(row[6]))

# Calculate mean (average) charge
mean_charge = sum(charges) / len(charges)

# Calculate median charge
sorted_charges = sorted(charges)
num_charges = len(charges)
median_index = num_charges // 2

if num_charges % 2 == 0:
    median_charge = (sorted_charges[median_index - 1] + sorted_charges[median_index]) / 2
else:
    median_charge = sorted_charges[median_index]

# Calculate standard deviation of charges (population form, dividing by N)
sum_squared_diff = sum((charge - mean_charge) ** 2 for charge in charges)
std_deviation = math.sqrt(sum_squared_diff / len(charges))

print(f"Mean (Average) Charge: {mean_charge:.2f}")
print(f"Median Charge: {median_charge:.2f}")
print(f"Standard Deviation of Charges: {std_deviation:.2f}")

Mean (Average) Charge: 13270.42
Median Charge: 9382.03
Standard Deviation of Charges: 12105.48
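
The cell above divides the sum of squared differences by N, which gives the population standard deviation. Many tools report the sample form (dividing by N - 1) by default, so results can differ slightly. The standard library's `statistics` module provides both, as the sketch below shows on a small made-up list.

```python
import statistics

# Hypothetical values for illustration only, not read from insurance.csv.
sample_charges = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean_charge = statistics.mean(sample_charges)
median_charge = statistics.median(sample_charges)
population_std = statistics.pstdev(sample_charges)  # divides by N
sample_std = statistics.stdev(sample_charges)       # divides by N - 1

print(f"Mean (Average) Charge: {mean_charge:.2f}")
print(f"Median Charge: {median_charge:.2f}")
print(f"Population Standard Deviation: {population_std:.2f}")
print(f"Sample Standard Deviation: {sample_std:.2f}")
```

With more than a thousand rows the two forms differ only slightly, but it helps to state which one is being reported.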


2. Are charges significantly different between smokers and non-smokers? You could perform a t-test for this.

In [189]:
import csv
import math

smoker_charges = []
non_smoker_charges = []

with open('insurance.csv', 'r') as insurance_csv:
    csv_reader = csv.reader(insurance_csv)
    header = next(csv_reader)
    
    for row in csv_reader:
        if row[4] == "yes":
            smoker_charges.append(float(row[6]))
        else:
            non_smoker_charges.append(float(row[6]))

# Calculate mean charges for smokers and non-smokers
mean_smoker_charges = sum(smoker_charges) / len(smoker_charges)
mean_non_smoker_charges = sum(non_smoker_charges) / len(non_smoker_charges)

# Calculate pooled standard deviation from the two sample variances
n_smoker = len(smoker_charges)
n_non_smoker = len(non_smoker_charges)
var_smoker = sum((charge - mean_smoker_charges) ** 2 for charge in smoker_charges) / (n_smoker - 1)
var_non_smoker = sum((charge - mean_non_smoker_charges) ** 2 for charge in non_smoker_charges) / (n_non_smoker - 1)
pooled_std_dev = math.sqrt(((n_smoker - 1) * var_smoker + (n_non_smoker - 1) * var_non_smoker) / (n_smoker + n_non_smoker - 2))

# Calculate t-statistic
t_statistic = (mean_smoker_charges - mean_non_smoker_charges) / (pooled_std_dev * math.sqrt((1 / n_smoker) + (1 / n_non_smoker)))

# Calculate degrees of freedom
degrees_of_freedom = n_smoker + n_non_smoker - 2

# Critical value for a two-tailed test at the 0.05 significance level
# (normal approximation, reasonable with this many degrees of freedom)
critical_t_value = 1.96

if abs(t_statistic) > critical_t_value:
    print("Result: There is a significant difference in charges between smokers and non-smokers.")
else:
    print("Result: There is no significant difference in charges between smokers and non-smokers.")

print(f"T-Statistic: {t_statistic:.2f}")
print(f"Critical T-Value: {critical_t_value}")

Result: There is a significant difference in charges between smokers and non-smokers.
T-Statistic: 46.66
Critical T-Value: 1.96
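
A pooled t-test assumes both groups have similar variance, which may not hold for smoker and non-smoker charges. Welch's t-test is a common alternative that keeps the two variances separate. The sketch below is a minimal version under that assumption; the function name `welch_t` and the toy groups are hypothetical, and it is not the method used in the cell above.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance uses the sample (n - 1) denominator.
    se_a = statistics.variance(group_a) / n_a
    se_b = statistics.variance(group_b) / n_b
    mean_difference = statistics.mean(group_a) - statistics.mean(group_b)
    t_statistic = mean_difference / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_statistic, dof

# Toy groups for illustration only, not read from insurance.csv.
group_a = [10.0, 12.0, 14.0, 16.0, 18.0]
group_b = [1.0, 2.0, 3.0, 4.0, 5.0]

t_statistic, dof = welch_t(group_a, group_b)
print(f"Welch T-Statistic: {t_statistic:.4f}")
print(f"Approximate Degrees of Freedom: {dof:.2f}")
```

Passing `smoker_charges` and `non_smoker_charges` from the cell above would give the Welch version of the comparison.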


## Potential Business Insights

1. How might these insights be used by insurance companies to set premiums?

These results suggest how an insurer might weight different factors when setting premiums. Smoking status separates charges most clearly in this dataset: smokers average 32050.23 in charges against 8434.27 for non-smokers. Age shows a positive but modest correlation with charges (0.2990), while sex, number of children, and region show smaller differences in average charges. An insurer could use such patterns to price plans more closely to expected costs, keeping in mind that these are group averages from one dataset and do not explain any individual's expenses.

2. What factors (age, smoking, BMI, etc.) seem to contribute most to higher insurance charges?

Smoking stands out: smokers' average charges (32050.23) are roughly 3.8 times those of non-smokers (8434.27). Age also matters, although its correlation with charges is modest (0.2990). Family size shows no steady trend: average charges rise from 12365.98 with no children to 15355.32 with three, then fall to 13850.66 with four. Males average somewhat higher charges than females (13956.75 versus 12569.58). Higher BMI is plausibly linked to higher charges because of the associated health risks, but this analysis did not compare BMI against charges, so that link remains untested here.

## Summary

This exploration of insurance charges points to a few main drivers of cost. Smoking has the largest effect: smokers average 32050.23 in charges against 8434.27 for non-smokers, and the t-test finds the difference significant. Smoking status is therefore the most important single factor to consider when setting premiums.

Age is another factor. Charges tend to increase with age, with a correlation coefficient of 0.2990, which indicates a positive but modest linear relationship. This is consistent with older individuals needing more medical attention, although age alone explains only part of the variation in charges.

Regional differences also appear. The southeast has the highest average charges (14735.41), while the other three regions range from 12346.94 to 13406.38. Regional factors such as healthcare resources and lifestyle may contribute, but this analysis does not test those explanations. Overall, these findings give insurance companies a starting point for more tailored and equitable premium plans that account for the mix of factors influencing healthcare expenses.