# U.S. Medical Insurance Costs

## Project Overview

This project is part of the **Data Scientist: Machine Learning** course at Codecademy. In this analysis, I will be investigating a dataset containing medical insurance costs using Python. The goal is to apply the Python skills I've developed throughout the course to gain insights from real-world data.

## Dataset Description

The dataset, `insurance.csv`, contains information about medical insurance costs for individuals. It includes the following attributes:

- Age
- Sex
- BMI (Body Mass Index)
- Number of Children
- Smoking Status
- Geographical Region
- Insurance Charges
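
A quick way to confirm these columns before running the analysis is to read only the header and the first few rows with `csv.DictReader`. This is a minimal sketch: `preview_rows` is a hypothetical helper, the sample rows are made up for illustration, and the lowercase column names (in particular `children`, which the analysis code never reads) are assumed to match the file's header.

```python
import csv
import io
from itertools import islice

# Made-up sample in the assumed layout of insurance.csv (illustration only).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
40,female,25.0,2,no,northwest,7000.00
55,male,31.5,0,yes,southeast,42000.00
"""

def preview_rows(file_obj, n=5):
    """Return (column_names, first_n_rows) from an open CSV file object."""
    reader = csv.DictReader(file_obj)
    rows = list(islice(reader, n))  # reads the header plus at most n data rows
    return reader.fieldnames, rows

columns, rows = preview_rows(io.StringIO(SAMPLE_CSV), n=1)
print(columns)
print(rows[0])
```

With the real file, pass `open('insurance.csv', newline='')` in place of the `StringIO` object.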

## Project Objectives

1. Analyze the impact of BMI on insurance costs
2. Compare insurance costs for smokers vs. non-smokers across age groups
3. Investigate regional variations in insurance costs
4. Draw meaningful conclusions about factors influencing medical insurance charges

## Tools and Libraries

For this project, I will use only Python's standard library (the `csv` module) to demonstrate fundamental data analysis skills.

import csv

class InsuranceAnalysis:
    def __init__(self, csv_file):
        # Initialize lists to store data
        self.ages = []
        self.sexes = []
        self.bmis = []
        self.smoker_statuses = []
        self.regions = []
        self.insurance_charges = []
        self.load_data(csv_file)

    def load_data(self, csv_file):
        # Load data from CSV file into respective lists
        # (newline='' is what the csv module documentation recommends when opening files)
        with open(csv_file, 'r', newline='') as file:
            csv_reader = csv.DictReader(file)
            for row in csv_reader:
                self.ages.append(int(row['age']))
                self.sexes.append(row['sex'])
                self.bmis.append(float(row['bmi']))
                self.smoker_statuses.append(row['smoker'])
                self.regions.append(row['region'])
                self.insurance_charges.append(float(row['charges']))

    def bmi_impact_analysis(self):
        # Analyze the impact of BMI on insurance costs
        bmi_categories = {
            'Underweight': {'count': 0, 'total_charge': 0},
            'Normal': {'count': 0, 'total_charge': 0},
            'Overweight': {'count': 0, 'total_charge': 0},
            'Obese': {'count': 0, 'total_charge': 0}
        }

        # Categorize BMI and sum up charges
        for bmi, charge in zip(self.bmis, self.insurance_charges):
            category = 'Obese' if bmi >= 30 else 'Overweight' if bmi >= 25 else 'Normal' if bmi >= 18.5 else 'Underweight'
            bmi_categories[category]['count'] += 1
            bmi_categories[category]['total_charge'] += charge

        # Print average cost for each BMI category
        print("Average Insurance Cost by BMI Category:")
        for category, data in bmi_categories.items():
            avg_cost = data['total_charge'] / data['count'] if data['count'] > 0 else 0
            print(f"{category}: ${avg_cost:.2f}")

        # Calculate correlation between BMI and insurance charges
        bmi_mean = sum(self.bmis) / len(self.bmis)
        charge_mean = sum(self.insurance_charges) / len(self.insurance_charges)
        numerator = sum((b - bmi_mean) * (c - charge_mean) for b, c in zip(self.bmis, self.insurance_charges))
        denominator = (sum((b - bmi_mean)**2 for b in self.bmis) * sum((c - charge_mean)**2 for c in self.insurance_charges))**0.5
        correlation = numerator / denominator if denominator != 0 else 0

        print(f"\nCorrelation between BMI and Insurance Charges: {correlation:.2f}")

    def smoking_comparison(self):
        # Compare insurance costs for smokers vs. non-smokers across age groups
        smoker_data = {'yes': {'count': 0, 'total_charge': 0}, 'no': {'count': 0, 'total_charge': 0}}
        age_groups = {'18-30': {'yes': {'count': 0, 'total_charge': 0}, 'no': {'count': 0, 'total_charge': 0}},
                      '31-45': {'yes': {'count': 0, 'total_charge': 0}, 'no': {'count': 0, 'total_charge': 0}},
                      '46-60': {'yes': {'count': 0, 'total_charge': 0}, 'no': {'count': 0, 'total_charge': 0}},
                      '60+': {'yes': {'count': 0, 'total_charge': 0}, 'no': {'count': 0, 'total_charge': 0}}}

        # Categorize data by smoking status and age group
        for age, smoker, charge in zip(self.ages, self.smoker_statuses, self.insurance_charges):
            smoker_data[smoker]['count'] += 1
            smoker_data[smoker]['total_charge'] += charge

            # Note: age 60 is assigned to '60+', so the '46-60' bucket effectively covers ages 46-59
            age_group = '60+' if age >= 60 else '46-60' if age >= 46 else '31-45' if age >= 31 else '18-30'
            age_groups[age_group][smoker]['count'] += 1
            age_groups[age_group][smoker]['total_charge'] += charge

        # Print average cost for smokers vs. non-smokers
        print("Average Insurance Cost for Smokers vs Non-Smokers:")
        for status, data in smoker_data.items():
            avg_cost = data['total_charge'] / data['count'] if data['count'] > 0 else 0
            print(f"{'Smoker' if status == 'yes' else 'Non-Smoker'}: ${avg_cost:.2f}")

        # Print average cost by age group and smoking status
        print("\nAverage Insurance Cost by Age Group and Smoking Status:")
        for age_group, data in age_groups.items():
            print(f"\nAge Group: {age_group}")
            for status in ['yes', 'no']:
                avg_cost = data[status]['total_charge'] / data[status]['count'] if data[status]['count'] > 0 else 0
                print(f"{'Smoker' if status == 'yes' else 'Non-Smoker'}: ${avg_cost:.2f}")

    def regional_cost_variations(self):
        # Analyze regional variations in insurance costs
        region_data = {}
        for region, charge in zip(self.regions, self.insurance_charges):
            if region not in region_data:
                region_data[region] = {'count': 0, 'total_charge': 0}
            region_data[region]['count'] += 1
            region_data[region]['total_charge'] += charge

        # Print average cost by region, sorted from highest to lowest
        print("Average Insurance Cost by Region:")
        for region, data in sorted(region_data.items(), key=lambda x: x[1]['total_charge'] / x[1]['count'], reverse=True):
            avg_cost = data['total_charge'] / data['count']
            print(f"{region}: ${avg_cost:.2f}")

        # Calculate and print overall average cost
        overall_avg = sum(self.insurance_charges) / len(self.insurance_charges)
        print(f"\nOverall Average Insurance Cost: ${overall_avg:.2f}")

        # Compare each region to the overall average
        print("\nRegional Cost Compared to Overall Average:")
        for region, data in region_data.items():
            avg_cost = data['total_charge'] / data['count']
            difference = ((avg_cost - overall_avg) / overall_avg) * 100
            print(f"{region}: {'Higher' if difference > 0 else 'Lower'} by {abs(difference):.2f}%")

    def run_all_analyses(self):
        print("BMI Impact Analysis:")
        self.bmi_impact_analysis()
        print("\nSmoking Comparison:")
        self.smoking_comparison()
        print("\nRegional Cost Variations:")
        self.regional_cost_variations()

# Create instance and run analyses
analysis = InsuranceAnalysis('insurance.csv')
analysis.run_all_analyses()

BMI Impact Analysis:
Average Insurance Cost by BMI Category:
Underweight: $8852.20
Normal: $10409.34
Overweight: $10987.51
Obese: $15552.34

Correlation between BMI and Insurance Charges: 0.20

Smoking Comparison:
Average Insurance Cost for Smokers vs Non-Smokers:
Smoker: $32050.23
Non-Smoker: $8434.27

Average Insurance Cost by Age Group and Smoking Status:

Age Group: 18-30
Smoker: $27528.08
Non-Smoker: $4462.31

Age Group: 31-45
Smoker: $31707.16
Non-Smoker: $7246.17

Age Group: 46-60
Smoker: $35554.52
Non-Smoker: $12046.40

Age Group: 60+
Smoker: $40630.70
Non-Smoker: $15232.71

Regional Cost Variations:
Average Insurance Cost by Region:
southeast: $14735.41
northeast: $13406.38
northwest: $12417.58
southwest: $12346.94

Overall Average Insurance Cost: $13270.42

Regional Cost Compared to Overall Average:
southwest: Lower by 6.96%
southeast: Higher by 11.04%
northwest: Lower by 6.43%
northeast: Higher by 1.02%
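
The correlation figure in the BMI output comes from the Pearson formula inside `bmi_impact_analysis`. As a sketch, the same calculation can be pulled out into a standalone function and checked on small inputs whose answer is known in advance. The name `pearson` and the toy lists are illustrative and not part of the original notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Sum of products of deviations (numerator of the formula).
    cross = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Sums of squared deviations (under the square root in the denominator).
    x_sq = sum((x - x_mean) ** 2 for x in xs)
    y_sq = sum((y - y_mean) ** 2 for y in ys)
    denominator = math.sqrt(x_sq * y_sq)
    return cross / denominator if denominator != 0 else 0.0

# Perfectly linear toy data: one increasing pair, one decreasing pair.
r_up = pearson([1, 2, 3, 4], [10, 20, 30, 40])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

A coefficient of 0.20, as reported for BMI and charges, corresponds to r² = 0.04, so a straight-line fit on BMI alone would account for only about 4% of the variance in charges.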


## Final Reflection
This analysis reveals several key factors influencing medical insurance costs:
1. BMI: The correlation between BMI and insurance charges is positive but weak (r = 0.20). The difference shows up mainly in the obese category, which averages $15,552.34 compared with $10,409.34 for the normal category.
2. Smoking: Smoking status shows the largest difference in average charges of the factors examined. Smokers average $32,050.23, nearly four times the $8,434.27 average for non-smokers.
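3. Age: Average charges rise with age for both groups. The dollar increase from the 18-30 group to the 60+ group is similar (about $13,100 for smokers and $10,800 for non-smokers), but the relative increase is far larger for non-smokers, whose average more than triples, than for smokers, whose average rises by roughly half.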
4. Region: Average charges vary by region, with the southeast highest at about 11% above the overall mean and the other three regions within 7% of it.
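
The smoking and age observations can be checked directly from the printed averages. The sketch below copies those figures from the output and computes the smoker to non-smoker ratio and dollar gap for each age group; the variable names are illustrative.

```python
# Average charges copied from the printed output of the analysis.
smoker_avg, non_smoker_avg = 32050.23, 8434.27
by_age = {
    "18-30": (27528.08, 4462.31),   # (smoker, non-smoker)
    "31-45": (31707.16, 7246.17),
    "46-60": (35554.52, 12046.40),
    "60+":   (40630.70, 15232.71),
}

overall_ratio = smoker_avg / non_smoker_avg
print(f"Overall smoker / non-smoker ratio: {overall_ratio:.2f}")

age_ratios = {}
for group, (smoker, non_smoker) in by_age.items():
    age_ratios[group] = smoker / non_smoker
    gap = smoker - non_smoker
    print(f"{group}: ratio {age_ratios[group]:.2f}, gap ${gap:,.2f}")
```

Run on these figures, the ratio falls steadily from the youngest to the oldest group while the dollar gap stays in a fairly narrow band, which suggests smoking adds a roughly constant amount to average charges rather than multiplying them.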

These insights could be valuable both to insurance companies assessing risk and to individuals who want to understand the factors that may affect their insurance premiums.
Further analysis could involve multiple linear regression to quantify the impact of each factor while controlling for the others.
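
As a starting point for the regression suggested above, here is a minimal sketch of ordinary least squares in plain Python, in keeping with the rest of the project. `fit_ols` is a hypothetical helper, the toy rows are generated from a known formula rather than taken from `insurance.csv`, and encoding smoking status as 0/1 is an assumption about how the categorical column would be prepared.

```python
def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y.

    rows: list of feature lists (without intercept); targets: list of numbers.
    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept column
    k = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    # The left block is now diagonal, so each coefficient is a simple division.
    return [A[i][k] / A[i][i] for i in range(k)]

# Toy data generated from charges = 1000 + 200*age + 300*bmi + 20000*smoker (no noise).
features = [
    [20, 22.0, 0],
    [30, 28.0, 1],
    [40, 31.0, 0],
    [50, 25.0, 1],
    [60, 35.0, 0],
]
charges = [1000 + 200 * a + 300 * b + 20000 * s for a, b, s in features]

coefficients = fit_ols(features, charges)
print([round(c, 2) for c in coefficients])
```

Because the toy data has no noise, the fitted coefficients should recover the generating values (intercept 1000, then 200, 300, and 20000) up to floating-point error. On the real dataset, a numerical library such as NumPy or statsmodels would likely be the more robust choice.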