# U.S. Medical Insurance Costs

# Predictive Insurance Cost Analysis

## Project Overview

This project focuses on analyzing medical insurance costs to uncover key insights and develop a predictive model for estimating individual insurance charges. Using a dataset titled "Medical Cost Personal Data Sets," we delve into various attributes such as age, sex, BMI, number of children, smoking status, region, and individual medical charges.

### Objectives

- **Data Exploration**: Understand the basic distribution and characteristics of the dataset through descriptive statistics.
- **Comparative Analysis**: Investigate the impact of smoking on insurance costs and analyze demographic patterns, such as age and region, in relation to insurance charges.
- **Predictive Modeling**: Develop a function capable of predicting insurance costs based on patient attributes, leveraging the insights gained from our analyses.

### Approach

The project is structured into several key steps, executed within a Jupyter notebook to facilitate both the analysis and documentation process:

1. **Data Parsing**: Load and organize the dataset into structured lists for analysis.
2. **Data Analysis**: Use the `PatientInfo` class to methodically explore and analyze patient data across various dimensions.
3. **Statistical Insights**: Provide descriptive statistics to get an overview of the data's central tendencies and distributions.
4. **Predictive Model Development**: Construct a predictive model to estimate insurance costs based on relevant patient attributes.
5. **Summary of Findings**: Consolidate and communicate the key insights and implications of our analysis, including how accurate and reliable the predictive model proves to be.

Through this project, we aim to demonstrate the application of data science methodologies to real-world datasets, offering valuable insights and predictive capabilities that can inform both individuals and organizations in the healthcare sector.


### Step 1: Importing Necessary Libraries
In this step, we import the libraries used throughout the notebook. Python's built-in `csv` module reads the dataset, which is stored in CSV (Comma-Separated Values) format, a widely used representation for tabular data. The `statistics` and `math` modules supply the descriptive measures, while `numpy` and `pandas` support numeric work and a quick check of the column types.


In [66]:
import csv
import statistics
import math
import numpy as np
import pandas as pd

df = pd.read_csv('insurance.csv')
print(df.dtypes)

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
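
To make the role of the `csv` module concrete, here is a minimal sketch of how `csv.DictReader` exposes rows. The two sample rows are made up for illustration and are read from an in-memory string rather than from `insurance.csv`. Every field arrives as a string, which is why the later steps convert numeric columns explicitly.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as insurance.csv
sample_csv = """age,sex,bmi,children,smoker,region,charges
40,female,25.5,2,no,northwest,7000.0
55,male,31.2,0,yes,southeast,42000.0
"""

# DictReader uses the header line as keys and yields one dict per data row
rows = list(csv.DictReader(io.StringIO(sample_csv)))

first = rows[0]
print(first['age'], type(first['age']).__name__)   # values are strings, not numbers
print(int(first['age']) + 1, float(first['bmi']))  # explicit conversion is required
```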


### Step 2: Parsing the Dataset
Before parsing `insurance.csv`, we create one empty list for each column (age, sex, bmi, children, smoker, region, charges). The `csv` module fills these lists in Step 3, which prepares the data for analysis.


In [67]:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []


### Step 3: Efficient Data Loading with a Helper Function
To avoid repeating the parsing logic for every column, we create a helper function that reads one named column from the CSV file into a list, converting numeric columns to `int` or `float` along the way. The helper is called once per column, so the file is read seven times; that is acceptable for a dataset of this size and keeps the code reusable.


In [68]:
def load_insurance_data(lst, csv_file, column_name):
    """Append the values of one CSV column to lst and return it."""
    int_columns = {'age', 'children'}    # These columns should be integers
    float_columns = {'bmi', 'charges'}   # These columns should be floats
    with open(csv_file, newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            value = row[column_name]
            if value == '':  # Skip rows with missing values
                continue
            try:
                if column_name in int_columns:
                    value = int(value)
                elif column_name in float_columns:
                    value = float(value)
            except ValueError:
                continue  # Skip rows where the value is not a valid number
            # Note: a row skipped for one column but kept for another
            # would leave the seven lists misaligned.
            lst.append(value)
    return lst
    
load_insurance_data(age, 'insurance.csv', 'age')
load_insurance_data(sex, 'insurance.csv', 'sex')
load_insurance_data(bmi, 'insurance.csv', 'bmi')
load_insurance_data(children, 'insurance.csv', 'children')
load_insurance_data(smoker, 'insurance.csv', 'smoker')
load_insurance_data(region, 'insurance.csv', 'region')
load_insurance_data(charges, 'insurance.csv', 'charges')


[16884.924,
 1725.5523,
 4449.462,
 21984.47061,
 3866.8552,
 3756.6216,
 8240.5896,
 7281.5056,
 6406.4107,
 28923.13692,
 2721.3208,
 27808.7251,
 1826.843,
 11090.7178,
 39611.7577,
 1837.237,
 10797.3362,
 2395.17155,
 10602.385,
 36837.467,
 13228.84695,
 4149.736,
 1137.011,
 37701.8768,
 6203.90175,
 14001.1338,
 14451.83515,
 12268.63225,
 2775.19215,
 38711.0,
 35585.576,
 2198.18985,
 4687.797,
 13770.0979,
 51194.55914,
 1625.43375,
 15612.19335,
 2302.3,
 39774.2763,
 48173.361,
 3046.062,
 4949.7587,
 6272.4772,
 6313.759,
 6079.6715,
 20630.28351,
 3393.35635,
 3556.9223,
 12629.8967,
 38709.176,
 2211.13075,
 3579.8287,
 23568.272,
 37742.5757,
 8059.6791,
 47496.49445,
 13607.36875,
 34303.1672,
 23244.7902,
 5989.52365,
 8606.2174,
 4504.6624,
 30166.61817,
 4133.64165,
 14711.7438,
 1743.214,
 14235.072,
 6389.37785,
 5920.1041,
 17663.1442,
 16577.7795,
 6799.458,
 11741.726,
 11946.6259,
 7726.854,
 11356.6609,
 3947.4131,
 1532.4697,
 2755.02095,
 6571.02435,
 4441
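
Because the helper above opens the file once per column and skips bad values column by column, the seven lists could drift out of alignment if any row were incomplete. A possible alternative, sketched here under the assumption that a row with any missing or malformed field should be dropped entirely, reads the data once and keeps all columns in step. The function name `load_columns` and the `CONVERTERS` table are illustrative and not part of the original notebook; the sample rows are made up.

```python
import csv

# Converter per column; column names follow insurance.csv
CONVERTERS = {
    'age': int, 'sex': str, 'bmi': float, 'children': int,
    'smoker': str, 'region': str, 'charges': float,
}

def load_columns(lines):
    """Read CSV text and return {column: list}, dropping incomplete rows.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    columns = {name: [] for name in CONVERTERS}
    for row in csv.DictReader(lines):
        try:
            converted = {}
            for name, convert in CONVERTERS.items():
                raw = row[name]
                if raw is None or raw.strip() == '':
                    raise ValueError(f'missing value for {name}')
                converted[name] = convert(raw)
        except (ValueError, KeyError):
            continue  # skip the whole row so every list keeps the same length
        for name, value in converted.items():
            columns[name].append(value)
    return columns

# Made-up sample: the second data row has an empty bmi and is dropped as a whole
sample = [
    'age,sex,bmi,children,smoker,region,charges',
    '40,female,25.5,2,no,northwest,7000.0',
    '33,male,,1,no,southwest,5000.0',
    '55,male,31.2,0,yes,southeast,42000.0',
]
columns = load_columns(sample)
print(columns['age'], columns['bmi'])
```

With the real file, this could be called as `with open('insurance.csv', newline='') as f: columns = load_columns(f)`.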

### Step 4: Analysis with the `PatientsInfo` Class
Now that our data is organized, we're ready to analyze it. We will define a class called `PatientInfo` with methods to investigate various attributes of the dataset.

In [69]:
class PatientInfo:
    def __init__(self, age, sex, bmi, children, smoker, region, charges):
        self.age = age
        self.sex = sex
        self.bmi = bmi
        self.children = children
        self.smoker = smoker
        self.region = region  
        self.charges = charges

    def analyze_age(self):
        age_sum = 0
        for i in self.age:
            age_sum += int(i)
        print("The average age is " + str(round(age_sum / len(self.age))) + ".")
    
    def analyze_sexes(self):
        females = 0
        males = 0 
        for sex in self.sex:
            if sex == 'female':
                females += 1
            elif sex == 'male':
                males += 1
        print("Number of females: " + str(females) + "." )
        print("Number of males: " + str(males) + ".")
    
    def analyze_bmi(self):
        bmi_sum = 0
        for i in self.bmi:
            bmi_sum += float(i)
        print("The average BMI is " + str(round(bmi_sum / len(self.bmi), 2)) + ".")
    
    def analyze_children(self):
        children_sum = 0
        for i in self.children:
            children_sum += int(i)
        print("The average number of children is " + str(round(children_sum / len(self.children), 1)) + ".")
    
    def analyze_smoker(self):
        smokers = 0
        non_smokers = 0
        for smoker in self.smoker:
            if smoker == 'yes':
                smokers += 1
            elif smoker == 'no':
                non_smokers += 1
        print("Number of smokers: " + str(smokers) + ".")
        print("Number of non-smokers: " + str(non_smokers) + ".")
    
    def analyze_region(self):
        regions = {}
        for region in self.region:
            if region in regions:
                regions[region] += 1
            else:
                regions[region] = 1
        print("Number of patients from each region: " + str(regions) + ".")

    def analyze_charges(self):  
        charges_sum = 0
        for i in self.charges:
            charges_sum += float(i)
        print("The average charge is " + str(round(charges_sum / len(self.charges), 2)) + ".")

    def create_dictionary(self):
        self.patients_dictionary = {}
        self.patients_dictionary['age'] = self.age
        self.patients_dictionary['sex'] = self.sex 
        self.patients_dictionary['bmi'] = self.bmi
        self.patients_dictionary['children'] = self.children
        self.patients_dictionary['smoker'] = self.smoker
        self.patients_dictionary['region'] = self.region
        self.patients_dictionary['charges'] = self.charges
        return self.patients_dictionary  

patient_info = PatientInfo(age, sex, bmi, children, smoker, region, charges)

patient_info.analyze_age()
patient_info.analyze_sexes()
patient_info.analyze_bmi()
patient_info.analyze_children()
patient_info.analyze_smoker()
patient_info.analyze_region()
patient_info.analyze_charges()

The average age is 39.
Number of females: 662.
Number of males: 676.
The average BMI is 30.66.
The average number of children is 1.1.
Number of smokers: 274.
Number of non-smokers: 1064.
Number of patients from each region: {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}.
The average charge is 13270.42.
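
The counting and averaging loops in `PatientInfo` can also be expressed with the standard library's `collections.Counter` and `statistics.mean`. The sketch below uses small made-up lists in place of the real columns, and the helper name `summarize_patients` is hypothetical.

```python
from collections import Counter
from statistics import mean

def summarize_patients(age, sex, smoker, region, charges):
    """Return the same kind of summary PatientInfo prints, as a dictionary."""
    return {
        'average_age': round(mean(age)),
        'sex_counts': dict(Counter(sex)),
        'smoker_counts': dict(Counter(smoker)),
        'region_counts': dict(Counter(region)),
        'average_charge': round(mean(charges), 2),
    }

# Made-up sample values, not taken from insurance.csv
summary = summarize_patients(
    age=[20, 30, 40],
    sex=['female', 'male', 'female'],
    smoker=['no', 'yes', 'no'],
    region=['northeast', 'southwest', 'northeast'],
    charges=[2000.0, 30000.0, 7000.0],
)
for key, value in summary.items():
    print(f'{key}: {value}')
```

Returning a dictionary instead of printing inside each method makes the results easier to reuse in later steps.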


### Step 5: Descriptive Statistics of the Dataset
To gain a comprehensive understanding of the data, we compute descriptive statistics for the numeric columns (age, BMI, children, charges): mean, median, mode, standard deviation, and value counts. This step gives us an overview of the dataset's distribution and central tendencies.


In [70]:
class DescriptiveStats:
    def __init__(self, age, bmi, children, charges):
        self.age = age
        self.bmi = bmi
        self.children = children
        self.charges = charges

    def calculate_mean(self, name, values):
        print(f"Mean of {name}: {round(statistics.mean(values), 2)}")
    
    def calculate_median(self, name, values):
        print(f"Median of {name}: {round(statistics.median(values), 2)}")
    
    def calculate_mode(self, name, values):
        try:
            print(f"Mode of {name}: {round(statistics.mode(values))}")
        except statistics.StatisticsError:
            print(f"Mode of {name}: No unique mode found")
    
    def calculate_standard_deviation(self, name, values):
        print(f"Standard deviation of {name}: {round(statistics.stdev(values), 2)}")
    
    def calculate_counts(self, name, values):
        counts = {}
        for value in values:
            if value in counts:
                counts[value] += 1
            else:
                counts[value] = 1
        print(f"Counts of {name}: {counts}")

stats = DescriptiveStats(age, bmi, children, charges)

# Calculate mean
stats.calculate_mean('Age', stats.age)
stats.calculate_mean('Bmi', stats.bmi)
stats.calculate_mean('Children', stats.children)
stats.calculate_mean('Charges', stats.charges)

# Calculate median
stats.calculate_median('Age', stats.age)
stats.calculate_median('Bmi', stats.bmi)
stats.calculate_median('Children', stats.children)
stats.calculate_median('Charges', stats.charges)

# Calculate mode
stats.calculate_mode('Age', stats.age)
stats.calculate_mode('Bmi', stats.bmi)
stats.calculate_mode('Children', stats.children)
stats.calculate_mode('Charges', stats.charges)

# Calculate standard deviation
stats.calculate_standard_deviation('Age', stats.age)
stats.calculate_standard_deviation('Bmi', stats.bmi)
stats.calculate_standard_deviation('Children', stats.children)
stats.calculate_standard_deviation('Charges', stats.charges)

# Calculate counts
stats.calculate_counts('Age', stats.age)
stats.calculate_counts('Bmi', stats.bmi)
stats.calculate_counts('Children', stats.children)
stats.calculate_counts('Charges', stats.charges)

Mean of Age: 39.21
Mean of Bmi: 30.66
Mean of Children: 1.09
Mean of Charges: 13270.42
Median of Age: 39.0
Median of Bmi: 30.4
Median of Children: 1.0
Median of Charges: 9382.03
Mode of Age: 18
Mode of Bmi: 32
Mode of Children: 0
Mode of Charges: 1640
Standard deviation of Age: 14.05
Standard deviation of Bmi: 6.1
Standard deviation of Children: 1.21
Standard deviation of Charges: 12110.01
Counts of Age: {19: 68, 18: 69, 28: 28, 33: 26, 32: 26, 31: 27, 46: 29, 37: 25, 60: 23, 25: 28, 62: 23, 23: 28, 56: 26, 27: 28, 52: 29, 30: 27, 34: 26, 59: 25, 63: 23, 55: 26, 22: 28, 26: 28, 35: 25, 24: 28, 41: 27, 38: 25, 36: 25, 21: 28, 48: 29, 40: 27, 58: 25, 53: 28, 43: 27, 64: 22, 20: 29, 61: 23, 44: 27, 57: 26, 29: 27, 45: 29, 54: 28, 49: 28, 47: 29, 51: 29, 42: 27, 50: 29, 39: 25}
Counts of Bmi: {27.9: 1, 33.77: 2, 33.0: 6, 22.705: 3, 28.88: 8, 25.74: 4, 33.44: 4, 27.74: 6, 29.83: 6, 25.84: 5, 26.22: 4, 26.29: 1, 34.4: 4, 39.82: 3, 42.13: 4, 24.6: 3, 30.78: 5, 23.845: 3, 40.3: 1, 35.3: 4, 36.
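
`statistics.mode` returns a single value even when several values tie, and value counts are mainly informative for discrete columns such as `children`. As a sketch of one way to handle the first point, the hypothetical helper below reports every tied mode via `statistics.multimode` (available from Python 3.8). It also flags whether the mean exceeds the median, a rough sign of right skew like the gap between the mean (13270.42) and median (9382.03) of charges above. The sample list is made up.

```python
import statistics

def describe(values, digits=2):
    """Return mean, median, all modes, and sample standard deviation of a numeric list."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return {
        'mean': round(mean, digits),
        'median': round(median, digits),
        'modes': statistics.multimode(values),  # every most-common value, in first-seen order
        'stdev': round(statistics.stdev(values), digits),
        'mean_above_median': mean > median,     # rough indicator of right skew
    }

# Made-up children counts with a tie between 0 and 1
sample_children = [0, 1, 0, 1, 3, 2]
print(describe(sample_children))
```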

### Step 6: Comparative Analysis of Costs
One of the key aspects we're interested in is the difference in medical charges between smokers and non-smokers. This analysis will help us understand the impact of smoking on medical expenses. We will compare the average costs for both groups and use statistical testing to determine if the differences are significant.


In [76]:
# `stats` already names the DescriptiveStats instance from Step 5,
# so scipy.stats is imported under a different name.
from scipy import stats as scipy_stats

class SmokingAnalysis:
    def __init__(self, age, sex, bmi, children, smoker, region, charges):
        # Initialize the attributes from the parameters
        self.age = age
        self.sex = sex
        self.bmi = bmi
        self.children = children
        self.smoker = smoker
        self.region = region
        self.charges = charges
        # Create an instance of PatientInfo using the provided attributes
        patient_info = PatientInfo(age, sex, bmi, children, smoker, region, charges)
        # Call the create_dictionary() method from the PatientInfo instance
        # and store the result in an attribute of SmokingAnalysis
        self.patients_dictionary = patient_info.create_dictionary()

    # Split the column-wise dictionary into per-patient records for smokers and non-smokers
    def create_smokers_lists(self):
        smokers = []
        non_smokers = []

        for i in range(len(self.patients_dictionary['smoker'])):
            patient = {key: self.patients_dictionary[key][i] for key in self.patients_dictionary}
            if patient['smoker'] == 'yes':
                smokers.append(patient)
            else:
                non_smokers.append(patient)
        return smokers, non_smokers

    # Calculate the mean, median, and mode of the charges for smokers and non-smokers
    def calculate_smoker_stats(self):
        smokers, non_smokers = self.create_smokers_lists()
        smoker_charges = [float(patient['charges']) for patient in smokers]
        smoker_mean = statistics.mean(smoker_charges)
        smoker_median = statistics.median(smoker_charges)
        smoker_mode = statistics.mode(smoker_charges)

        non_smoker_charges = [float(patient['charges']) for patient in non_smokers]
        non_smoker_mean = statistics.mean(non_smoker_charges)
        non_smoker_median = statistics.median(non_smoker_charges)
        non_smoker_mode = statistics.mode(non_smoker_charges)

        # Print the results
        print("The smoker mean charges are " + str(round(smoker_mean, 2)) + ".")
        print("The smoker median charges are " + str(round(smoker_median, 2)) + ".")
        print("The smoker mode charges are " + str(round(smoker_mode, 2)) + ".")
        print("The non-smoker mean charges are " + str(round(non_smoker_mean, 2)) + ".")
        print("The non-smoker median charges are " + str(round(non_smoker_median, 2)) + ".")
        print("The non-smoker mode charges are " + str(round(non_smoker_mode, 2)) + ".")

        return smoker_mean, non_smoker_mean, smoker_median, non_smoker_median, smoker_mode, non_smoker_mode, smoker_charges, non_smoker_charges

    # Print the smoker minus non-smoker differences in mean, median, and mode
    def calculate_difference(self, smoker_mean, non_smoker_mean, smoker_median, non_smoker_median, smoker_mode, non_smoker_mode):
        mean_diff = smoker_mean - non_smoker_mean
        median_diff = smoker_median - non_smoker_median
        mode_diff = smoker_mode - non_smoker_mode
        print(f"Mean difference: {round(mean_diff, 2)}")
        print(f"Median difference: {round(median_diff, 2)}")
        print(f"Mode difference: {round(mode_diff, 2)}")

    # Calculate the sample variance of the charges for smokers and non-smokers
    def calculate_variance(self, smoker_charges, non_smoker_charges):
        smoker_variance = statistics.variance(smoker_charges)
        non_smoker_variance = statistics.variance(non_smoker_charges)
        print(f"Smoker variance: {round(smoker_variance, 2)}")
        print(f"Non-smoker variance: {round(non_smoker_variance, 2)}")

    # Calculate the sample standard deviation of the charges for smokers and non-smokers
    def calculate_standard_deviation(self, smoker_charges, non_smoker_charges):
        smoker_std_dev = statistics.stdev(smoker_charges)
        non_smoker_std_dev = statistics.stdev(non_smoker_charges)
        print(f"Smoker standard deviation: {round(smoker_std_dev, 2)}")
        print(f"Non-smoker standard deviation: {round(non_smoker_std_dev, 2)}")

    # Run an independent two-sample t-test on the charges of smokers and non-smokers
    def calculate_t_test(self, smoker_charges, non_smoker_charges):
        t_test = scipy_stats.ttest_ind(smoker_charges, non_smoker_charges)
        print(f"T-test: {t_test}")

    # Report the p-value of the same t-test
    def calculate_p_value(self, smoker_charges, non_smoker_charges):
        t_test = scipy_stats.ttest_ind(smoker_charges, non_smoker_charges)
        print(f"P-value: {t_test.pvalue}")

    # Calculate a 95% confidence interval for the mean charges of each group
    def calculate_confidence_interval(self, smoker_charges, non_smoker_charges):
        confidence_interval = scipy_stats.t.interval(0.95, len(smoker_charges)-1, loc=np.mean(smoker_charges), scale=scipy_stats.sem(smoker_charges))
        print(f"Confidence interval for smokers: {confidence_interval}")
        confidence_interval = scipy_stats.t.interval(0.95, len(non_smoker_charges)-1, loc=np.mean(non_smoker_charges), scale=scipy_stats.sem(non_smoker_charges))
        print(f"Confidence interval for non-smokers: {confidence_interval}")

    # Calculate the effect size (Cohen's d) using the pooled standard deviation
    def calculate_effect_size(self, smoker_charges, non_smoker_charges):
        n1, n2 = len(smoker_charges), len(non_smoker_charges)
        pooled_variance = ((n1 - 1) * statistics.variance(smoker_charges)
                           + (n2 - 1) * statistics.variance(non_smoker_charges)) / (n1 + n2 - 2)
        effect_size = (statistics.mean(smoker_charges) - statistics.mean(non_smoker_charges)) / math.sqrt(pooled_variance)
        print(f"Effect size: {effect_size}")

    # Calculate the correlation between smoking status (1 = smoker, 0 = non-smoker) and charges.
    # The two groups differ in size, so their charges cannot be paired with each other directly.
    def calculate_correlation(self, smoker_charges, non_smoker_charges):
        indicator = [1] * len(smoker_charges) + [0] * len(non_smoker_charges)
        all_charges = smoker_charges + non_smoker_charges
        correlation = scipy_stats.pearsonr(indicator, all_charges)
        print(f"Correlation: {correlation}")

    # Regress charges on smoking status; the slope equals the difference in group means
    def calculate_regression(self, smoker_charges, non_smoker_charges):
        indicator = [1] * len(smoker_charges) + [0] * len(non_smoker_charges)
        all_charges = smoker_charges + non_smoker_charges
        regression = scipy_stats.linregress(indicator, all_charges)
        print(f"Regression: {regression}")

    # Perform all the calculations
    def perform_all_calculations(self):
        (smoker_mean, non_smoker_mean, smoker_median, non_smoker_median, 
         smoker_mode, non_smoker_mode, smoker_charges, non_smoker_charges) = self.calculate_smoker_stats()

        self.calculate_difference(smoker_mean, non_smoker_mean, smoker_median, non_smoker_median, smoker_mode, non_smoker_mode)
        self.calculate_variance(smoker_charges, non_smoker_charges)
        self.calculate_standard_deviation(smoker_charges, non_smoker_charges)
        self.calculate_t_test(smoker_charges, non_smoker_charges)
        self.calculate_p_value(smoker_charges, non_smoker_charges)
        self.calculate_confidence_interval(smoker_charges, non_smoker_charges)
        self.calculate_effect_size(smoker_charges, non_smoker_charges)
        self.calculate_correlation(smoker_charges, non_smoker_charges)
        self.calculate_regression(smoker_charges, non_smoker_charges)


smoking_analysis = SmokingAnalysis(age, sex, bmi, children, smoker, region, charges)

smoking_analysis.perform_all_calculations()

The smoker mean charges are 32050.23.
The smoker median charges are 34456.35.
The smoker mode charges are 16884.92.
The non-smoker mean charges are 8434.27.
The non-smoker median charges are 7345.41.
The non-smoker mode charges are 1639.56.
Mean difference: 23615.96
Median difference: 27110.94
Mode difference: 15245.36
Smoker variance: 133207311.21
Non-smoker variance: 35925420.5
Smoker standard deviation: 11541.55
Non-smoker standard deviation: 5993.78


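AttributeError: 'DescriptiveStats' object has no attribute 'ttest_ind'

The recorded run stopped at this point: inside `SmokingAnalysis`, the name `stats` resolved to the `DescriptiveStats` instance created in Step 5 rather than to `scipy.stats`, so the t-test and the later calculations produced no output.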
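
For reference, a test statistic can also be computed by hand from the summary values already printed above. The sketch below applies Welch's unequal-variance formula, `t = (mean1 - mean2) / sqrt(var1/n1 + var2/n2)`, and Cohen's d with a pooled standard deviation. Choosing Welch's variant, rather than the equal-variance test that `scipy.stats.ttest_ind` runs by default, is an assumption motivated by the very different group variances. The function names are illustrative.

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error

def cohens_d(mean1, var1, n1, mean2, var2, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Summary statistics copied from the outputs above: (mean, variance, group size)
smokers = (32050.23, 133207311.21, 274)
non_smokers = (8434.27, 35925420.5, 1064)

t_statistic = welch_t(*smokers, *non_smokers)
effect_size = cohens_d(*smokers, *non_smokers)
print(f"Welch t statistic: {t_statistic:.2f}")
print(f"Cohen's d: {effect_size:.2f}")
```

Working from summary statistics keeps the calculation independent of SciPy, which makes it a convenient cross-check on library results.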

### Step 7: Predictive Modeling
To build on our analysis, we will create a function for predictive modeling. This function will use the attributes of the dataset to predict individual medical costs. We may use linear regression or another suitable model for this purpose. The goal is to develop a model that accurately predicts costs based on patient characteristics.
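
As one possible shape for such a function, the sketch below fits an ordinary least squares linear model on age, BMI, number of children, and smoking status using only the standard library. Everything in it is an assumption made for illustration: the function names (`encode`, `solve`, `fit_linear_model`, `predict_charges`), the choice of features, and the made-up patients and coefficients used to check that the fit recovers a known rule. It is not the notebook's final model.

```python
def encode(age, bmi, children, smoker):
    """Build a feature row; the leading 1.0 is the intercept term."""
    return [1.0, float(age), float(bmi), float(children),
            1.0 if smoker == 'yes' else 0.0]

def solve(matrix, vector):
    """Solve matrix * x = vector by Gauss-Jordan elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_linear_model(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict_charges(weights, age, bmi, children, smoker):
    """Estimate charges as the weighted sum of the encoded attributes."""
    return sum(w * x for w, x in zip(weights, encode(age, bmi, children, smoker)))

# Made-up patients and coefficients, used only to check the fitting code
true_weights = [1000.0, 250.0, 300.0, 500.0, 20000.0]
patients = [(19, 27.9, 0, 'yes'), (18, 33.8, 1, 'no'), (28, 33.0, 3, 'no'),
            (33, 22.7, 0, 'no'), (32, 28.9, 0, 'no'), (31, 25.7, 0, 'no'),
            (46, 33.4, 1, 'no'), (62, 26.3, 0, 'yes')]
rows = [encode(*p) for p in patients]
targets = [sum(w * x for w, x in zip(true_weights, row)) for row in rows]

weights = fit_linear_model(rows, targets)
print(round(predict_charges(weights, 40, 30.0, 2, 'no'), 2))
```

With the lists from Step 3, the same functions could be applied as `rows = [encode(a, b, c, s) for a, b, c, s in zip(age, bmi, children, smoker)]` followed by `fit_linear_model(rows, charges)`. The `sex` and `region` columns are left out here purely to keep the sketch short.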


### Step 8: Summary of Findings
Finally, we will summarize the key findings from our analyses. This summary will highlight the most significant insights, including the impact of smoking on costs, demographic patterns, and the performance of our predictive model. This overview will provide clear, actionable insights derived from our data.
