# U.S. Medical Insurance Costs

In this project, a **CSV** file with medical insurance costs will be investigated using Python fundamentals. The goal with this project will be to analyze various attributes within **insurance.csv** to learn more about the patient information in the file and gain insight into potential use cases for the dataset.

In [1]:
# import the csv and statistics libraries
import csv
import statistics

## Scope
The main aim of this project is to work with real data so that the reader can practice the tools Python offers for handling CSV files.
The file `insurance.csv` contains seven columns. Each one is analyzed in relation to the `charges` column, because charges are what vary from one patient to the next. The analysis covers the following criteria:

##### 1.-Analyze ages
   * Find max and min
       * Relation to charges
       * Groups by age
       * Standard deviation
       
##### 2.-Analyze sexes
   * Count of each
   * Relation to charges
   
##### 3.-Percentage of smokers
   * Relation to charges
   * Percentage of smokers and non-smokers
   
##### 4.-Average number of children per patient
   * Average number of children overall
     * Gather into specific groups
   * Relation to charges
   
##### 5.-BMI
   * Groups by BMI category
   * Relation to charges

##### 6.-Most common region
   * Count of each
   * Unique regions
   
##### 7.-Charge average
   * The max and min amounts
   * Average
   * Variance
   * Standard deviation
   
   
### Important questions to answer:
   * How do certain factors affect the charge amount?
   * What is the impact of certain factors on charge amount for smokers compared to non-smokers?

The next step is to look through **insurance.csv** in order to get acquainted with the data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)

In [2]:
limit = 3
count = 0 
with open('insurance.csv') as csv_info:
    csv_dict = csv.DictReader(csv_info)
    print(csv_dict)
    for row in csv_dict:
        count +=1
        print(row)
        if count >= limit:
            break

<csv.DictReader object at 0x0000024BC9F9F790>
{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}
{'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}


**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, seven empty lists will be created to hold each individual column of data from **insurance.csv**.
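The missing-data check described above can also be done programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: `count_missing` is a hypothetical helper, and the in-memory `SAMPLE` text (three rows copied from the preview) stands in for the real **insurance.csv** file.

```python
import csv
import io

# Three rows copied from the preview; a real run would open 'insurance.csv'
# instead of this in-memory stand-in.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def count_missing(csv_file):
    """Return {column: number of empty or absent values} for an open CSV file."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name)
            # DictReader fills short rows with None; empty cells arrive as ''
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing

missing_counts = count_missing(io.StringIO(SAMPLE))
print(missing_counts)
```

A column whose count is greater than zero would need attention before its values are converted to numbers.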


This preview shows the kind of data we are going to work with. It is useful because it reveals what the dataset contains, how it is arranged, and which analyses are needed.

The next step is to create an empty list for every column name to hold all of its values.

In [3]:
#Create empty lists for the various attributes in insurance.csv
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_charges = []

In [4]:
# helper function to load csv data
def load_list_data(lst, csv_file, column_name):
    with open(csv_file) as csv_info:
        csv_dict = csv.DictReader(csv_info)
        for row in csv_dict:
            # add the data from each row to a list
            lst.append(row[column_name])
        return lst

The helper function above was created to make loading data into the lists as efficient as possible. Without this function, one would have to open **insurance.csv** and rewrite the `for` loop seven times; however, with this function, one can simply call `load_list_data()` each time as shown below.
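Calling the helper once per column reopens and rereads the file each time. As an alternative, the sketch below reads the file a single time and fills every requested column in one pass. `load_columns` is a hypothetical name that does not appear in the notebook, and the two-row in-memory sample stands in for **insurance.csv**.

```python
import csv
import io

def load_columns(csv_file, column_names):
    """Read an open CSV file once and return {column_name: list of string values}."""
    columns = {name: [] for name in column_names}
    for row in csv.DictReader(csv_file):
        for name in column_names:
            columns[name].append(row[name])
    return columns

# Two rows taken from the preview, used here instead of the real file
sample = io.StringIO("age,sex,charges\n19,female,16884.924\n18,male,1725.5523\n")
columns = load_columns(sample, ["age", "charges"])
print(columns["age"])  # ['19', '18']
```

The trade-off is that the values live in one dictionary rather than in seven separate lists, so the rest of the notebook would have to index into it.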

In [5]:
# load each column of insurance.csv into its list
load_list_data(ages, 'insurance.csv', 'age')
load_list_data(sexes, 'insurance.csv', 'sex')
load_list_data(bmis, 'insurance.csv', 'bmi')
load_list_data(num_children, 'insurance.csv', 'children')
load_list_data(smoker_statuses, 'insurance.csv', 'smoker')
load_list_data(regions, 'insurance.csv', 'region')
load_list_data(insurance_charges, 'insurance.csv', 'charges')
# end the cell with a statement so the notebook does not echo the last returned list
pass

In [6]:
# Convert the numeric columns from strings to numbers to make the analysis easier
insurance_charges = [float(charge) for charge in insurance_charges ]
num_children = [int(child) for child in num_children] 
ages = [int(age) for age in ages] 
bmis = [float(bmi) for bmi in bmis] 

Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can be started. 

In [7]:
class PatientsInfo:
    # init method that takes in each list parameter
    def __init__(self, patients_ages, patients_sexes, patients_bmis, patients_num_children,patients_smoker_statuses, patients_regions, patients_charges):
        self.patients_ages = patients_ages
        self.patients_sexes = patients_sexes
        self.patients_bmis = patients_bmis
        self.patients_num_children = patients_num_children
        self.patients_smoker_statuses = patients_smoker_statuses
        self.patients_regions = patients_regions
        self.patients_charges = patients_charges  
        
    # 1. method that relates patient ages to charges in insurance.csv
    def analyze_ages(self):
        major_charge = 0
        min_charge = 0
        # note: min_age starts at 0, so the `age < min_age` test below is never
        # true for real ages and the reported minimum (and its charge) stays 0
        min_age = 0
        max_age = 0
        count = 0
        # find the oldest patient (first occurrence) and the matching charge
        for age, charges in zip(self.patients_ages, self.patients_charges):
            if age > max_age:
                max_age = age
                major_charge = charges
        for age, charges in zip(self.patients_ages, self.patients_charges):
            if age < min_age:
                min_age = age
                min_charge = charges
        # split the range between min_age and max_age into three equal-width groups
        range_age = round((max_age - min_age)/3, 2)
        first_group = min_age + range_age
        ft_age = 0
        ft_charges = 0 
        count_1 = 0
        second_group = first_group + range_age
        sc_age  = 0
        sc_charges = 0 
        count_2 = 0
        td_age  = 0
        td_charges = 0
        count_3 = 0
        # accumulate ages and charges per group so the averages can be printed
        for age, charges in zip(self.patients_ages, self.patients_charges):
            if age <= first_group:
                ft_age += age
                ft_charges += charges
                count_1 += 1
            elif age <= second_group:
                sc_age += age
                sc_charges += charges
                count_2 += 1
            else:
                td_age += age
                td_charges += charges
                count_3 += 1
            count += 1   

        print("Thecount maximun age fund is {max_age} years, his charges: ${major_charge}. The minimum age fund is: {min_age} years, his charges: ${min_charge}".format(max_age=max_age, min_age=min_age, major_charge=round(major_charge,2), min_charge=round(min_charge,2)))
        print( "The average age and charge by clump is:")
        print("     First group:\n {count_1} individuals = age: {ft_age}, charges: ${ft_charges}".format(count_1=count_1, ft_age=round(ft_age/count_1,0), ft_charges=round(ft_charges/count_1,3)))      
        print("     Second group: \n {count_2} individuals = age: {sc_age}, charges: ${sc_charges}".format(count_2=count_2, sc_age=round(sc_age/count_2,0), sc_charges=round(sc_charges/count_2,3)))      
        print("     Third group:\n  {count_3} individuals = age: {td_age}, charges: ${td_charges}".format(count_3=count_3, td_age=round(td_age/count_3,0), td_charges=round(td_charges/count_3,3)))      
        print("the total sample ", count)
    #def standart_derivation(self):
        
    #2.-  method that calculates the number of males and females in insurance.csv
    def analyze_sexes(self):
        females = 0
        males = 0
        charges_females = 0
        charges_males = 0
        count = 0 
        # iterate through each sex in the sexes and charges list
        for sex, charge in zip(self.patients_sexes, self.patients_charges):
            if sex == 'female':
                females += 1
                charges_females += charge
            elif sex == 'male':
                males += 1
                charges_males += charge
        # print out the number of each
        print("Count for female: {females}, charge average: ${charges_females}".format(females=females, charges_females=round(charges_females/females,3)))
        print("Count for male: {males}, charge average: ${charges_males}".format(males=males, charges_males=round(charges_males/males,3)))
        print("the total sample ", males+females)
        
    # 3. method that groups patients by BMI category and averages their charges
    def bmis(self):
        maximum = max(self.patients_bmis)
        minimum = min(self.patients_bmis)
        # charges of the patients with the highest and lowest BMI (collected, not printed)
        dictt = {}
        under_bmi = []
        under_charge = []
        healthy_bmi = []
        healthy_charge = []
        over_bmi = []
        over_charge = []
        obese_bmi = []
        obese_charge = []
        for bmi, charges in zip(self.patients_bmis, self.patients_charges):
            if bmi == maximum or bmi == minimum:
                dictt[bmi] = charges

        # note: the ranges below leave gaps, so a BMI of exactly 18.5, from 24.9
        # to 25, or from 29.9 upward falls through to the obese branch
        for bmi, charges in zip(self.patients_bmis, self.patients_charges):
            if bmi < 18.50:
                under_bmi.append(bmi)
                under_charge.append(round(charges,3))
            elif bmi > 18.50 and bmi < 24.90:
                healthy_bmi.append(bmi)
                healthy_charge.append(round(charges,3))
            elif bmi > 25 and bmi < 29.90:
                over_bmi.append(bmi)
                over_charge.append(round(charges,3))
            else: 
                obese_bmi.append(bmi)
                obese_charge.append(round(charges,3))    
                
        print("Patients underweight: {lenght} \n   Average BMI: {under_bmi} \n   Average Charges: {under_charge} \n".format(lenght=len(under_bmi), under_bmi=round(statistics.mean(under_bmi),2), under_charge=round(statistics.mean(under_charge),2)))
        print("Patients healthyweight: {lenght1} \n   Average BMI: {healthy_bmi} \n   Average Charges: {healthy_charge} \n".format(lenght1=len(healthy_bmi), healthy_bmi=round(statistics.mean(healthy_bmi),2), healthy_charge=round(statistics.mean(healthy_charge),2)))
        print("Patients Overweight: {lenght1} \n   Average BMI: {over_bmi} \n   Average Charges: {over_charge} \n".format(lenght1=len(over_bmi), over_bmi=round(statistics.mean(over_bmi),2), over_charge=round(statistics.mean(over_charge),2)))
        print("Patients Obeseweight: {lenght1} \n   Average BMI: {obese_bmi} \n   Average Charges: {obese_charge} \n".format(lenght1=len(obese_bmi), obese_bmi=round(statistics.mean(obese_bmi),2), obese_charge=round(statistics.mean(obese_charge),2)))
    
    def smokers(self):
        smoke = 0
        nonsmoke = 0
        charge_smoke = 0
        charge_nonsmoke = 0
        for smk, charge in zip(self.patients_smoker_statuses,self.patients_charges):
            if smk == "yes":
                smoke += 1
                charge_smoke += charge 
            else:
                nonsmoke += 1
                charge_nonsmoke += charge 
        print("Smokers Patients Percentage:")          
        print("   Smokers: %{smoke}".format(smoke=round((smoke * 100)/(smoke + nonsmoke),2)))
        print("   Non-Smokers: %{nonsmoke} \n".format(nonsmoke=round((nonsmoke * 100)/(smoke + nonsmoke),2)))
        print("The average annual charge for smokers and non smokers:")
        print("   Smokers: ${charge_smoke}".format(charge_smoke=round((charge_smoke/smoke),2)))
        print("   Non-Smokers: ${charge_nonsmoke} \n".format(charge_nonsmoke=round((charge_nonsmoke/nonsmoke),2)))
        print("Diferiencies on charges: $",round((charge_smoke/smoke)-(charge_nonsmoke/nonsmoke),3))
    
    # method that groups patients by number of children and averages their charges
    def childrens(self):
        count_no_child = 0 
        charge_no_child = 0
        count_1 = 0
        count_1to2_child = 0
        charge_1to2 = 0
        count_2 = 0
        count_more2_child = 0
        charge_more2 = 0
        count_no = 0
        for child, charges in zip(self.patients_num_children, self.patients_charges):
            if child >= 1 and child <= 2:
                count_1to2_child += child
                charge_1to2 += charges 
                count_1 += 1
            elif child >= 3:
                count_more2_child += child
                charge_more2 += charges
                count_2 += 1
            else: 
                charge_no_child += charges
                count_no += 1
            total_count = count_no + count_2 + count_1
            t_charges = charge_no_child + charge_more2 +charge_1to2
        
        print("The clasification correspond to the next categories: \n")
        print("With no childrens \n Patients in this clump: {count_no} \n Average Charges: ${charge_no_child}\n".format(count_no=count_no, charge_no_child=(round((charge_no_child/count_no),2))))
        print("With one to tow childrens \n Patients in this clump: {count_1} \n Average Charges: ${charge_1to2}\n".format(count_1=count_1, charge_1to2=(round((charge_1to2/count_1),2))))
        print("With more than tow childrens \n Patients in this clump: {count_2} \n Average Charges: ${charge_more2}\n".format(count_2=count_2, charge_more2=(round((charge_more2/count_2),2))))
    
    # method that averages charges per region, sorted from highest to lowest
    def regions(self):
        unique = {}
        for region, charges in zip(self.patients_regions, self.patients_charges):
                if region  in unique:
                    unique[region] += [charges]
                else: 
                    unique[region] = [charges]
        average = {}
        for region, list_charges in unique.items():
            #print(sum(list_charges))
            #print(len(unique.get(region)))
            average[region] = round((sum(list_charges)) / (len(unique.get(region))),3)
        sortdict =  dict(sorted(average.items(), key=lambda item:item[1], reverse=True))
        return sortdict
        
         
    # method that summarizes annual charges: mean, population variance and standard deviation
    def charges(self):
        mean = statistics.mean(self.patients_charges)
        variance = statistics.pvariance(self.patients_charges)
        st_dev = statistics.pstdev(self.patients_charges)
        print("Annual charges. \n  Mean: {mean} \n  Variance:{variance}".format(mean=round(mean,2), variance=round(variance)))
        print("  Standart Derivation: {st_dev}".format(st_dev= round(st_dev,2)))
        
    # method to create dictionary with all patients information
    def create_dictionary(self):
        self.patients_dictionary = {}
        self.patients_dictionary["age"] = [int(age) for age in self.patients_ages]
        self.patients_dictionary["sex"] = self.patients_sexes
        self.patients_dictionary["bmi"] = self.patients_bmis
        self.patients_dictionary["children"] = self.patients_num_children
        self.patients_dictionary["smoker"] = self.patients_smoker_statuses
        self.patients_dictionary["regions"] = self.patients_regions
        self.patients_dictionary["charges"] = self.patients_charges
        return self.patients_dictionary

The next step is to create an instance of the class called `patient_info`. With this instance, each method can be used to see the results of the analysis.

In [8]:
patient_info = PatientsInfo(ages, sexes, bmis, num_children, smoker_statuses, regions, insurance_charges)

#### Analyzing Patient Ages

In [17]:
patient_info.analyze_ages()

Thecount maximun age fund is 64 years, his charges: $30166.62. The minimum age fund is: 0 years, his charges: $0
The average age and charge by clump is:
     First group:
 194 individuals = age: 19.0, charges: $8138.614
     Second group: 
 561 individuals = age: 32.0, charges: $11048.527
     Third group:
  583 individuals = age: 53.0, charges: $17116.141
the total sample  1338


The output above shows a clear pattern: average annual charges rise with age, from about `$8139` in the youngest group to about `$17116` in the oldest.
The oldest patient is 64 and the youngest is 18. The output reports a minimum of 0 years and `$0` only because `min_age` starts at 0 in `analyze_ages()` and no patient is younger than that, so the three groups were built over the range 0 to 64. For each group we retrieved the average age and the average annual charge to visualize the results.
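A minimal sketch of how the true minimum and maximum ages, and equal-width age groups built between them, could be computed with the built-in `min()` and `max()`. The function names are hypothetical and the sample ages and charges are made-up illustration values, not figures from **insurance.csv**.

```python
def age_group_bounds(ages, groups=3):
    """Return (youngest, oldest, upper bounds of equal-width age groups)."""
    youngest, oldest = min(ages), max(ages)
    width = (oldest - youngest) / groups
    bounds = [youngest + width * i for i in range(1, groups + 1)]
    return youngest, oldest, bounds

def average_charge_by_group(ages, charges, bounds):
    """Average charge per group; each patient joins the first group whose bound covers the age."""
    totals = [0.0] * len(bounds)
    counts = [0] * len(bounds)
    for age, charge in zip(ages, charges):
        # fall back to the last group if rounding leaves the oldest age above the final bound
        index = next((i for i, upper in enumerate(bounds) if age <= upper), len(bounds) - 1)
        totals[index] += charge
        counts[index] += 1
    return [total / n if n else None for total, n in zip(totals, counts)]

# Made-up values for illustration only
sample_ages = [18, 19, 28, 40, 64]
sample_charges = [100.0, 300.0, 500.0, 700.0, 900.0]

youngest, oldest, bounds = age_group_bounds(sample_ages)
averages = average_charge_by_group(sample_ages, sample_charges, bounds)
print(youngest, oldest, averages)
```

Because the bounds start from the youngest age actually present, the groups cover only the range that contains patients.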

In [10]:
print("The Mean by age is:", round(statistics.mean(ages), 0))
print("his Standart Derivation is:",round(statistics.pstdev(ages), 0))

The Mean by age is: 39.0
his Standart Derivation is: 14.0


The cell above reports two basic statistics for age, the mean and the standard deviation. Together they help us understand how spread out the population is: ages average 39 years with a standard deviation of about 14 years.
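To make the standard deviation less of a black box, the sketch below computes the population standard deviation by hand and compares it with `statistics.pstdev`. `population_std` is a hypothetical helper, and the three ages are the ones shown in the data preview rather than the full column.

```python
import math
import statistics

def population_std(values):
    """Square root of the mean squared distance from the mean."""
    mean = sum(values) / len(values)
    squared_distances = [(value - mean) ** 2 for value in values]
    return math.sqrt(sum(squared_distances) / len(values))

# The three ages from the preview rows, not the full dataset
sample_ages = [19, 18, 28]
manual = population_std(sample_ages)
library = statistics.pstdev(sample_ages)
print(math.isclose(manual, library))
```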


#### Analyzing Patient Sexes

In [11]:
patient_info.analyze_sexes()

Count for female: 662, charge average: $12569.579
Count for male: 676, charge average: $13956.751
the total sample  1338


The two groups are balanced in size (662 females and 676 males). Average annual charges are also close, although males pay about `$1387` more on average.

#### Analyzing Patient BMI Ranges

For this method we divide the population into four BMI categories and compute the average BMI and the average annual charge of each, with the following results:

In [12]:
patient_info.bmis()

Patients underweight: 20 
   Average BMI: 17.57 
   Average Charges: 8852.2 

Patients healthyweight: 221 
   Average BMI: 22.6 
   Average Charges: 10404.9 

Patients Overweight: 372 
   Average BMI: 27.54 
   Average Charges: 11020.18 

Patients Obeseweight: 725 
   Average BMI: 35.08 
   Average Charges: 15420.4 



Most of the population (725 of 1338 patients) falls into the fourth category, "obese", which also has the highest average annual charges.
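A minimal sketch of a BMI classifier whose ranges are contiguous, so every value lands in exactly one category. It assumes the commonly used cutoffs of 18.5, 25 and 30, which differ slightly from the thresholds in the notebook's `bmis()` method, and `bmi_category` is a hypothetical helper.

```python
def bmi_category(bmi):
    """Map a BMI value to one of four categories using contiguous ranges."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy"
    if bmi < 30:
        return "overweight"
    return "obese"

# BMI values from the preview rows
for value in (27.9, 33.77, 33):
    print(value, bmi_category(value))
```

Because each test only checks the upper bound, boundary values such as 18.5 or 24.95 cannot fall between two categories.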

#### Analyzing Patient Smoking Status

   The average yearly medical insurance charge is `$32050.23` for a smoker and `$8434.27` for a non-smoker, a considerable difference of `$23615.964`. Non-smokers are the larger group at 79.52% of patients, while smokers make up just 20.48%.
 Smoking is therefore a factor that drastically affects an individual's annual charges.

In [13]:
patient_info.smokers()

Smokers Patients Percentage:
   Smokers: %20.48
   Non-Smokers: %79.52 

The average annual charge for smokers and non smokers:
   Smokers: $32050.23
   Non-Smokers: $8434.27 

Diferiencies on charges: $ 23615.964


#### Analyzing Patient Children

Grouping patients by number of children shows only a modest difference in average annual charges (about `$2210` between patients with no children and those with more than two), so the number of children does not appear to affect charges greatly.

In [14]:
patient_info.childrens()

The clasification correspond to the next categories: 

With no childrens 
 Patients in this clump: 574 
 Average Charges: $12365.98

With one to tow childrens 
 Patients in this clump: 564 
 Average Charges: $13727.93

With more than tow childrens 
 Patients in this clump: 200 
 Average Charges: $14576.0



#### Analyzing Regions   

In [15]:
patient_info.regions()

{'southeast': 14735.411,
 'northeast': 13406.385,
 'northwest': 12417.575,
 'southwest': 12346.937}

To analyze this question we built a dictionary in which each region is a key and its value is that region's average charge.
The highest average charges are in the southeast, at roughly `$14735`, and the lowest are in the southwest, at roughly `$12347`.
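The same group-then-average pattern works for any categorical column. The sketch below is one way to write it with `collections.defaultdict`. `average_by_category` is a hypothetical helper, and the sample uses only the three preview rows, so its averages are not the dataset's.

```python
from collections import defaultdict
from statistics import mean

def average_by_category(categories, charges):
    """Return {category: average charge}, sorted from highest to lowest average."""
    grouped = defaultdict(list)
    for category, charge in zip(categories, charges):
        grouped[category].append(charge)
    averages = {category: round(mean(values), 3) for category, values in grouped.items()}
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))

# Regions and charges from the three preview rows only
sample_regions = ["southwest", "southeast", "southeast"]
sample_charges = [16884.924, 1725.5523, 4449.462]
region_averages = average_by_category(sample_regions, sample_charges)
print(region_averages)
```

Passing the smoker or sex column instead of the region column would give the corresponding averages without writing a new method.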

#### Analyzing Patient Annual Charges

In [16]:
patient_info.charges()

Annual charges. 
  Mean: 13270.42 
  Variance:146542766
  Standart Derivation: 12105.48


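These statistics help us see how widely charges are spread across the population: the standard deviation (`12105.48`) is almost as large as the mean (`13270.42`), so annual charges vary a great deal from patient to patient.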
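Two relationships sit behind these figures: the standard deviation is the square root of the variance, and dividing it by the mean gives a unit-free measure of spread. The sketch below checks the first on the three preview charges and computes the second from the mean and standard deviation printed above; the variable names are illustrative only.

```python
import math
import statistics

# Charges from the three preview rows, used only to check the relationship
sample_charges = [16884.924, 1725.5523, 4449.462]
variance = statistics.pvariance(sample_charges)
std_dev = statistics.pstdev(sample_charges)
print(math.isclose(std_dev, math.sqrt(variance)))

# Ratio of the reported standard deviation to the reported mean
reported_mean = 13270.42
reported_std_dev = 12105.48
ratio = round(reported_std_dev / reported_mean, 2)
print(ratio)  # 0.91
```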

### Conclusion 
Having completed this analysis, we can answer the questions posed at the beginning of the article, namely which factors directly affect annual charges. The list below ranks the factors by their weight on annual charges (1 being the factor with the most weight):
 * 1.- Smoking status
 * 2.- Age
 * 3.- BMI
 * 4.- Region
 * 5.- Sex
 
To choose this order we ranked the factors by how strongly average charges contrast between their groups. Smoking status shows the largest difference: patients who smoke pay about `$23600` more per year on average than patients who do not. Sex carries the least weight: the difference between males and females is only about `$1387`, which is small next to the other factors.
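The ranking can be reproduced from the group averages printed in the earlier sections. The sketch below is one simple way to do it, taking the gap between the highest and lowest group average as the measure of contrast. That measure is an assumption of this sketch, and the number of children is included even though it is not in the list above.

```python
# Group averages copied from the outputs printed earlier in this notebook
reported_averages = {
    "smoker": [32050.23, 8434.27],
    "age": [8138.614, 11048.527, 17116.141],
    "bmi": [8852.2, 10404.9, 11020.18, 15420.4],
    "children": [12365.98, 13727.93, 14576.0],
    "region": [14735.411, 13406.385, 12417.575, 12346.937],
    "sex": [12569.579, 13956.751],
}

# Contrast of a factor: highest group average minus lowest group average
spreads = {
    factor: round(max(values) - min(values), 2)
    for factor, values in reported_averages.items()
}
ranking = sorted(spreads, key=spreads.get, reverse=True)
print(ranking)  # ['smoker', 'age', 'bmi', 'region', 'children', 'sex']
```

This comparison ignores group sizes and interactions between factors, so it is a rough ordering rather than a measure of causal effect.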