# U.S. Medical Insurance Costs

## Aim:
To develop a predictive model that accurately estimates insurance charges based on personal demographic and lifestyle **attributes** such as **age, sex, BMI, number of children, region, and smoking status**. This project aims to enhance the understanding of how these factors influence insurance premiums and to provide a reliable tool for insurance companies to assess potential costs.

## Goal:
The goal of this project is to analyze various attributes within **insurance.csv** to predict insurance charges accurately, gaining insights into the factors that influence these charges.

In [None]:
# importing csv library
import csv


To begin, import the necessary libraries. This project needs only the **csv** library, which is used to read the **insurance.csv** dataset. Although the project could benefit from other libraries, the csv library alone suffices for this analysis.

The next step is to examine the insurance.csv file to become familiar with the data. We will check the following aspects to plan how to import the data into a Python script:

* The names of the columns and the layout of the rows
* Any obvious missing data
* How to extract the data from the CSV file into one list per attribute
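
The checks listed above can be sketched with the csv module itself. The helper below, `preview_csv`, is a hypothetical name that is not part of the original notebook. It takes an already opened file and returns the column names plus the first few rows, so the structure can be inspected before everything is loaded.

```python
import csv
from itertools import islice


def preview_csv(csv_file, n_rows=3):
    """Return (column_names, first n_rows rows) from an open CSV file."""
    reader = csv.DictReader(csv_file)
    # fieldnames is read from the header line on first access
    column_names = list(reader.fieldnames or [])
    # islice stops after n_rows data rows without reading the rest
    first_rows = list(islice(reader, n_rows))
    return column_names, first_rows
```

Assuming insurance.csv sits next to the notebook, opening it with `open('insurance.csv', newline='')` and passing the file object to `preview_csv` would return the header and the first three records.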

In [None]:
# Creating one empty list for each column of the csv file (insurance.csv)
age=[]
sex=[]
BMI=[]
num_children=[]
region=[]
smoking_status=[]
charges=[]

**insurance.csv** includes the following columns:
* Patient Age
* Patient Sex
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost
  
There is no evidence of missing data. To store this information, seven empty lists are created above, one for each column of insurance.csv.
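
The claim that no data is missing can also be checked in code rather than by eye. The sketch below is one possible way to do it, not the method used in this notebook. The function name `count_missing` is an assumption, and it treats blank cells and rows that are too short as missing.

```python
import csv


def count_missing(csv_file):
    """Count empty or absent values per column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in missing:
            value = row.get(name)
            # DictReader fills short rows with None; blank cells are ''
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing
```

If every count in the returned dictionary is zero, the file has no blank cells in any column.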

In [None]:
# helper function to load one column of csv data into a list
def extract_data(data_list, filename, column_name):
    # open csv file (newline='' is the setting the csv module recommends)
    with open(filename, newline='') as csv_file:
        # read each row of the csv file as a dictionary keyed by column name
        insurance_data = csv.DictReader(csv_file)
        # loop through the data in each row of the csv
        for row in insurance_data:
            # add this row's value for the requested column to the list
            data_list.append(row[column_name])
    # return the list (the same object that was passed in)
    return data_list

In [None]:
# filling each list with its column by calling extract_data
# note: re-running this cell appends the rows again, because the lists are filled in place
extract_data(age,'insurance.csv','age')
extract_data(sex,'insurance.csv','sex')
extract_data(BMI,'insurance.csv','bmi')
extract_data(num_children,'insurance.csv','children')
extract_data(region,'insurance.csv','region')
extract_data(smoking_status,'insurance.csv','smoker')
extract_data(charges,'insurance.csv','charges')
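
The seven calls above each open and read the file again. As an alternative sketch, the whole file can be read in a single pass into a dictionary of lists. The name `load_columns` is hypothetical and not part of the original notebook. The values stay as strings, exactly as with `extract_data`.

```python
import csv


def load_columns(csv_file):
    """Read an open CSV file once and return {column_name: [values]}."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            # every value is kept as a string, as csv.DictReader returns it
            columns[name].append(row[name])
    return columns
```

With this approach, `columns['charges']` would play the role of the `charges` list used in the rest of the notebook.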

1. **Calculating the Average Insurance Charge:**
To determine the average insurance charge across the entire dataset, I will compute the mean of the 'charges' column. This gives a baseline for typical yearly insurance costs (in USD) in the available data.




In [None]:
# function to calculate average charge cost per year:
def average_charges(charge):
    total_cost = 0
    # values read from the csv are strings, so convert each one to float
    for cost in charge:
        total_cost = total_cost + float(cost)
    average_charge_cost = total_cost / len(charge)
    return average_charge_cost


print("The Average Charge Cost of whole data set including Smokers and Non-Smokers: " + str(average_charges(charges)))
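
A mean can be pulled upward by a small number of very large charges, so it can help to look at the median next to it. The sketch below uses Python's standard `statistics` module. The function name `summarize_charges` is an assumption for illustration and is not part of the original analysis.

```python
from statistics import mean, median


def summarize_charges(charges):
    """Return the mean and median of charges given as strings or numbers."""
    # convert first, because values read from the csv are strings
    values = [float(c) for c in charges]
    return {"mean": mean(values), "median": median(values)}
```

If the mean is far above the median, a few expensive records are likely driving the average. Both functions raise `statistics.StatisticsError` on an empty list.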


    

2. **Is the Average Cost for a Smoker Higher than for a Non-Smoker?**
The goal is to determine whether average insurance charges differ between smokers and non-smokers.
By comparing the 'smoker' column against the 'charges' column in the dataset, I will investigate whether smokers tend to have higher average insurance costs than non-smokers.

In [None]:
def diff_average_cost(smoker, charge):
    # pair each smoking status with the matching charge
    list_smoking = list(zip(smoker, charge))

    # using list comprehensions to split the charges into smokers and non-smokers
    smoker_list = [cost[1] for cost in list_smoking if cost[0] == 'yes']
    non_smoker_list = [cost[1] for cost in list_smoking if cost[0] == 'no']

    # compute each average once and reuse it below
    smoker_average = average_charges(smoker_list)
    non_smoker_average = average_charges(non_smoker_list)

    print("The Average Insurance Cost of Smoking Person is :" + str(round(smoker_average, 2)) + "$")

    print("The Average Insurance Cost of Non-Smoking Person is :" + str(round(non_smoker_average, 2)) + "$")

    difference = round(smoker_average - non_smoker_average, 2)

    print("The Difference between Smoker and Non_Smoker insurance cost is : {difference}$".format(difference=difference))

    if smoker_average > non_smoker_average:
        print("Smoking is injurious to Health!!!! and Wealth!!!!")

diff_average_cost(smoking_status, charges)
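
The function above reports the gap in dollars. A ratio can make the same comparison easier to read, for example "smokers are charged N times as much on average". The sketch below is one way to compute it. The name `smoker_cost_ratio` is hypothetical, and it assumes the 'smoker' column holds only 'yes' and 'no', as the code above does.

```python
def smoker_cost_ratio(smoker, charge):
    """Return average smoker charge divided by average non-smoker charge."""
    smoker_costs = [float(c) for s, c in zip(smoker, charge) if s == "yes"]
    other_costs = [float(c) for s, c in zip(smoker, charge) if s == "no"]
    # both groups must be present, otherwise an average is undefined
    if not smoker_costs or not other_costs:
        raise ValueError("need at least one smoker and one non-smoker")
    smoker_avg = sum(smoker_costs) / len(smoker_costs)
    other_avg = sum(other_costs) / len(other_costs)
    return smoker_avg / other_avg
```

A result above 1 means smokers are charged more on average, and a result of 2.0 would mean twice as much.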

3. **Gender-Based Analysis:**
In this analysis, we count how many records in the dataset belong to female and male patients. The counts show whether the two groups are evenly represented, which matters when averages are compared between them.

In [None]:
# function to count the records for each gender
def gender_analysis(gender):
    # gender.count('female') would give the same count for females
    female = 0
    male = 0
    for sex in gender:
        if sex == 'female':
            female += 1
        elif sex == 'male':
            # count only explicit 'male' values so unexpected entries are not miscounted
            male += 1
    print("The Count of Female: {female}".format(female=female))
    print("The Count of Male: {male}".format(male=male))

# testing function
gender_analysis(sex)
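
The same count can be sketched with `collections.Counter` from the standard library, which tallies every distinct value it sees. The function name `gender_counts` is an assumption for illustration. One advantage of this approach is that any unexpected value in the 'sex' column shows up as its own key instead of being hidden.

```python
from collections import Counter


def gender_counts(gender):
    """Count every distinct value found in the sex column."""
    # Counter maps each distinct value to the number of times it occurs
    return Counter(gender)
```

Calling `gender_counts(sex)` would return a mapping such as `Counter({'male': ..., 'female': ...})`, with the actual numbers depending on the dataset.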

4. **Age-Based Analysis:**
In this analysis, the objective is to summarize the ages in the dataset. By focusing on the 'age' column, we calculate the average age of the insured people.

In [None]:
# function to calculate the average age
def average_age(ages):
    total_age = 0
    # values read from the csv are strings, so convert each one to int
    for value in ages:
        total_age = total_age + int(value)
    avg_age = total_age / len(ages)
    return avg_age

print("The Average age of People: " + str(average_age(age)))
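
The average age describes the dataset but does not show how charges change with age. As a possible extension, the sketch below groups records into age brackets and averages the charges in each. The function name `average_charge_by_age_bracket` and the default bracket width of 10 years are assumptions, not part of the original analysis.

```python
def average_charge_by_age_bracket(ages, charges, width=10):
    """Average charge per age bracket, e.g. 20 covers ages 20-29 for width=10."""
    totals = {}
    for age, cost in zip(ages, charges):
        bracket = (int(age) // width) * width  # lower bound of the bracket
        total, count = totals.get(bracket, (0.0, 0))
        totals[bracket] = (total + float(cost), count + 1)
    # sorted() orders the brackets from youngest to oldest
    return {b: total / count for b, (total, count) in sorted(totals.items())}
```

Calling `average_charge_by_age_bracket(age, charges)` would return one average per decade of age present in the data.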
    

In [None]:
# number of records from the southwest region
print(region.count('southwest'))

5. **Smoking and Gender-Based Analysis:**
The aim is to see whether smoking differs between males and females. The code below counts the smokers of each gender and reports their average insurance charges.



In [None]:
# function to count male and female smokers and average their charges
def smoker_Gender_rate(gender, smoking_status, charges):
    # combine the three columns so each record is (gender, smoker, charge)
    smoker_gender_data = list(zip(gender, smoking_status, charges))

    # Creating a list of females who smoke and storing their charges to female_smoker
    female_smoker = [i[2] for i in smoker_gender_data if i[0] == 'female' and i[1] == 'yes']
    avg_female_smoker_charges = average_charges(female_smoker)
    print("There are total " + str(len(female_smoker)) + " Female Smokers and their average insurance charges are " + str(avg_female_smoker_charges))

    # Creating a list of males who smoke and storing their charges to male_smoker
    male_smoker = [i[2] for i in smoker_gender_data if i[0] == 'male' and i[1] == 'yes']
    avg_male_smoker_charges = average_charges(male_smoker)
    print("There are total " + str(len(male_smoker)) + " Male Smokers and their average insurance charges are " + str(avg_male_smoker_charges))


smoker_Gender_rate(sex, smoking_status, charges)
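
The function above reports how many smokers each gender has, but a raw count does not account for how many people of that gender are in the dataset. A smoking rate, meaning smokers divided by all records of that gender, makes the groups comparable. The sketch below shows one way to compute it. The name `smoking_rate_by_gender` is hypothetical.

```python
def smoking_rate_by_gender(gender, smoking_status):
    """Return {gender: share of that gender recorded as smokers}."""
    totals = {}
    smokers = {}
    for sex, status in zip(gender, smoking_status):
        # count every record of this gender
        totals[sex] = totals.get(sex, 0) + 1
        # count only the records marked as smokers
        if status == "yes":
            smokers[sex] = smokers.get(sex, 0) + 1
    return {sex: smokers.get(sex, 0) / totals[sex] for sex in totals}
```

Calling `smoking_rate_by_gender(sex, smoking_status)` would return a fraction between 0 and 1 for each gender. Whether a gap between the two rates is statistically significant is a separate question that this sketch does not answer.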

6. **Region-Based Analysis:**
In this analysis, we compare insurance charges across the four geographical regions in the dataset. By grouping the 'charges' column by the 'region' column, we calculate the average charge for each region.

In [None]:
# function to print the average charge for each region
def region_analyse(region, charge):
    # pair each region with the matching charge
    region_data = list(zip(region, charge))
    # (label used in the output, value used in the csv file)
    regions = [('SouthEast', 'southeast'), ('SouthWest', 'southwest'),
               ('NorthEast', 'northeast'), ('NorthWest', 'northwest')]
    for label, name in regions:
        # collect the charges recorded for this region
        region_charge = [i[1] for i in region_data if i[0] == name]
        print("The Average cost of Insurance From {label} Region: {average}".format(label=label, average=average_charges(region_charge)))


region_analyse(region, charges)
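
The region names above are written out by hand. As a more general sketch, the charges can be grouped by whatever labels appear in a column and then ranked by average. The function name `average_by_group` is an assumption for illustration. It would work for the 'region' column and equally for 'smoker' or 'sex'.

```python
from collections import defaultdict


def average_by_group(labels, charges):
    """Return {label: average charge}, highest average first."""
    grouped = defaultdict(list)
    for label, cost in zip(labels, charges):
        # values read from the csv are strings, so convert to float
        grouped[label].append(float(cost))
    averages = {label: sum(v) / len(v) for label, v in grouped.items()}
    # sort by average charge, highest first
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))
```

Calling `average_by_group(region, charges)` would list the regions from most to least expensive on average.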
    

This analysis examines insurance charges by smoking status, gender, and region, and summarizes the age and gender makeup of the dataset. Understanding how demographic and lifestyle factors affect insurance costs can support more effective and equitable health and insurance policies.