# U.S. Medical Insurance Costs

### Look over the dataset

There are 7 columns in the dataset:
1. age is an integer. It is a numeric variable (this notebook loosely calls the numeric columns "ordinal").
2. sex is a string. It is categorical with two values: "male" and "female".
3. bmi is a float. It is a continuous numeric variable.
4. children is an integer. It is a numeric count variable.
5. smoker is a string. It is categorical with two values: "yes" and "no".
6. region is a string. It is also a categorical variable, taking four values: "northeast", "northwest", "southeast", and "southwest".
7. charges is a float. It is a continuous numeric variable.

There do not seem to be many missing values.
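
As a rough illustration of how an overview like the one above could be produced, the sketch below guesses each column's type and lists the distinct values of the text columns. The helpers `infer_type` and `summarize_columns` and the three sample rows are made up for this example; the rows only mimic the layout of `insurance.csv` and are not taken from it.

```python
import csv
import io

def infer_type(values):
    """Guess whether a column of strings holds ints, floats, or text."""
    for caster, label in ((int, 'int'), (float, 'float')):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return 'str'

def summarize_columns(rows):
    """Map each column name to (guessed type, sorted distinct values if text)."""
    summary = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        kind = infer_type(values)
        categories = sorted(set(values)) if kind == 'str' else None
        summary[name] = (kind, categories)
    return summary

# Hypothetical rows that only mimic the layout of the dataset
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.5,0,no,northwest,4000.5\n"
    "45,male,31.2,2,yes,southeast,21000.75\n"
    "52,female,28.0,1,no,northeast,9000.0\n"
)
sample_rows = list(csv.DictReader(sample_csv))
for column, (kind, categories) in summarize_columns(sample_rows).items():
    print(column, kind, categories)
```

Because `csv.DictReader` returns every field as a string, the type has to be guessed by attempting a conversion; a real dataset may need stricter rules than this.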

Since there is no information linking the personal features to illnesses, we can instead try to figure out how the personal characteristics contribute to the insurance costs.

### Import Data

In [7]:
import csv

# newline='' is the recommended way to open files for the csv module
with open('insurance.csv', newline='') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv, delimiter=',')
    # Each row becomes a dictionary keyed by the header names
    insurance_records = list(insurance_reader)

print(insurance_records[0].keys())

dict_keys(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'])


### Saving the columns as variables
Each row is a dictionary, so we need to get the values from the key: value pairs. Note that `csv.DictReader` returns every value as a string, so the numeric columns are converted with `float()` when they are used.
1. age should be saved as is.
2. sex should be saved as 0 or 1, which will make later analysis easier. We encode 'male' = 0 and 'female' = 1.
3. bmi should be saved as is.
4. children should be saved as is.
5. smoker should be saved as an indicator variable, where 'no' = 0 and 'yes' = 1.
6. region should be saved as is. It may make sense to condition the analysis on this variable, but it cannot easily be interpreted as ordinal.
7. charges should be saved as is.
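
The encoding rules in the list above could also be written as lookup tables. This is a minimal sketch and not the approach used in the next cell; `SEX_CODES`, `SMOKER_CODES`, and `encode` are names invented for the example.

```python
# Lookup tables for the two binary columns (names are illustrative)
SEX_CODES = {'male': 0, 'female': 1}
SMOKER_CODES = {'no': 0, 'yes': 1}

def encode(value, codes):
    """Return the numeric code for value, or raise if the label is unexpected."""
    if value not in codes:
        raise ValueError('Unexpected label: ' + repr(value))
    return codes[value]

# A hypothetical record in the same shape as one DictReader row
record = {'sex': 'female', 'smoker': 'no'}
encoded = (encode(record['sex'], SEX_CODES), encode(record['smoker'], SMOKER_CODES))
print(encoded)  # (1, 0)
```

Raising an error on an unknown label stops the run immediately, whereas appending a marker such as 'Check data' lets the loop finish and leaves the check for later. Which is preferable depends on how messy the input is expected to be.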

In [12]:
ages = []
sexs = []
bmis = []
childrens = []
smokers = []
regions = []
charges = []

for row in insurance_records:
    # Values stay as strings here; numeric columns are converted when used
    ages.append(row['age'])
    # Encode sex as 0/1 and flag anything unexpected for inspection
    if row['sex'] == 'male':
        sexs.append(0)
    elif row['sex'] == 'female':
        sexs.append(1)
    else:
        sexs.append('Check data')
    bmis.append(row['bmi'])
    childrens.append(row['children'])
    # Encode smoker as a 0/1 indicator, again flagging unexpected labels
    if row['smoker'] == 'no':
        smokers.append(0)
    elif row['smoker'] == 'yes':
        smokers.append(1)
    else:
        smokers.append('Check data')
    regions.append(row['region'])
    charges.append(row['charges'])

def has_missing(values):
    # DictReader gives '' for an empty field and None for a field absent from a short row
    return any(value in (None, '') for value in values)

print('Missing in ages: ', has_missing(ages))
print('Missing in sexs: ', 'Check data' in sexs)
print('Missing in bmis: ', has_missing(bmis))
print('Missing in childrens: ', has_missing(childrens))
print('Missing in smokers: ', 'Check data' in smokers)
print('Missing in regions: ', has_missing(regions))
print('Missing in charges: ', has_missing(charges))

Missing in ages:  False
Missing in sexs:  False
Missing in bmis:  False
Missing in childrens:  False
Missing in smokers:  False
Missing in regions:  False
Missing in charges:  False


#### There does not seem to be any missing data

## Exploratory analysis: the mean and median of the ordinal (numeric) variables, and the fractions of the categorical variables

### Make functions for the exploratory analysis

In [46]:
def mean_function(alist):
    # Values arrive as strings from the CSV, so convert before summing
    total = 0
    for item in alist:
        total += float(item)
    average = total/len(alist)
    return average

def median_function(alist):
    # Sort numerically: sorting the raw strings would order them lexicographically
    sorted_list = sorted(float(item) for item in alist)
    middle = len(sorted_list)//2
    if len(sorted_list) % 2 == 1:
        median = sorted_list[middle]
    else:
        # Even length: average the two middle values
        median = (sorted_list[middle - 1] + sorted_list[middle])/2
    return median

def variance_function(alist, mean_list):
    # Population variance: the mean squared deviation from the mean
    diff = 0
    for item in alist:
        diff += (float(item) - mean_list)**2
    variance = diff/len(alist)
    return variance

def fraction_function(alist, category):
    # Share of items in alist that equal category
    count = 0
    for item in alist:
        if item == category:
            count += 1
    fraction = count/len(alist)
    return fraction

def print_ordinal(name, mean, variance, median):
    print(name + ' has a mean of ' + str(round(mean,2)) + ' and a variance of ' + str(round(variance, 2))
              + '. The median is ' + str(round(median, 2)) + '.')
    # A mean below the median suggests a longer left tail, a mean above it a longer right tail
    if mean < median:
        print('The distribution seems to be left/negatively skewed')
    elif mean > median:
        print('The distribution seems to be right/positively skewed')

def ordinal_variables(name, alist):
    mean = mean_function(alist)
    variance = variance_function(alist, mean)
    median = median_function(alist)
    print_ordinal(name, mean, variance, median)


### Use the functions to do exploratory analysis

In [48]:
ordinal_variables('Age', ages)
ordinal_variables('BMI', bmis)
ordinal_variables('Children', childrens)
ordinal_variables('Charges', charges)

Age has a mean of 39.21 and a variance of 197.25. The median is 39.0.
The distribution seems to be right/positively skewed
BMI has a mean of 30.66 and a variance of 37.16. The median is 30.4.
The distribution seems to be right/positively skewed
Children has a mean of 1.09 and a variance of 1.45. The median is 1.0.
The distribution seems to be right/positively skewed
Charges has a mean of 13270.42 and a variance of 146542766.49. The median is 9382.03.
The distribution seems to be right/positively skewed
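
One way to sanity-check hand-written helpers like these is to compare them against Python's standard `statistics` module. The sketch below uses a small made-up list rather than the insurance data. `statistics.pvariance` is the population variance, which corresponds to dividing by `len(alist)`.

```python
import statistics

# Made-up values stored as strings, as csv.DictReader would return them
values = ['2', '10', '4', '8']
numbers = [float(v) for v in values]

mean = statistics.mean(numbers)
median = statistics.median(numbers)       # averages the two middle values for even lengths
variance = statistics.pvariance(numbers)  # population variance (divides by n)

print(mean, median, variance)  # 6.0 6.0 10.0

# Sorting the strings instead of the numbers gives a different order
print(sorted(values))  # ['10', '2', '4', '8']
```

The last line shows why a median should be taken over numbers rather than strings: '10' sorts before '2' when the values are compared as text.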


In [52]:
share_males = fraction_function(sexs, 0)
print('The fraction of males in the insurance records is ' + str(share_males) + '.')

share_smokers = fraction_function(smokers, 1)
print('The fraction of smokers in the insurance records is ' + str(share_smokers) + '.')

share_northwest = fraction_function(regions, 'northwest')
share_northeast = fraction_function(regions, 'northeast')
share_southwest = fraction_function(regions, 'southwest')
share_southeast = fraction_function(regions, 'southeast')
print('The fraction of people coming from the Northwest is ' + str(share_northwest) + ', from the Northeast is ' + str(share_northeast)
      + ', from the Southwest is ' + str(share_southwest) + ', and from the Southeast is ' + str(share_southeast) + '.')

The fraction of males in the insurance records is 0.5052316890881914.
The fraction of smokers in the insurance records is 0.20478325859491778.
The fraction of people coming from the Northwest is 0.2428998505231689, from the Northeast is 0.242152466367713, from the Southwest is 0.2428998505231689, and from the Southeast is 0.27204783258594917.
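
The category shares can also be computed in one pass with `collections.Counter`. This is a minimal sketch with a made-up list of regions; `category_fractions` is a hypothetical helper name and not part of the notebook.

```python
from collections import Counter

def category_fractions(values):
    """Return a dict mapping each distinct value to its share of the list."""
    counts = Counter(values)
    total = len(values)
    return {category: count/total for category, count in counts.items()}

# Made-up sample, not the insurance records
sample_regions = ['northwest', 'southeast', 'southeast', 'northeast']
print(category_fractions(sample_regions))
```

Compared with calling a fraction helper once per category, this avoids typing each category name by hand, which removes one source of labelling mistakes.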


### From exploratory analysis
The insurance records seem to be balanced between men and women, as well as across regions. The charges are right-skewed: the mean is well above the median. Next we look into the extent to which this is due to the 20.5% of people in the records who smoke.
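
One way to examine the smoker effect is to split the charges by the smoker indicator and compare the group means. The sketch below is an assumption about how such a split could look, not the notebook's own `split_function`; `split_by_indicator` and the sample values are invented for illustration.

```python
def split_by_indicator(values, indicators):
    """Split values into two lists according to a parallel list of 0/1 indicators."""
    group_zero, group_one = [], []
    for value, indicator in zip(values, indicators):
        if indicator == 1:
            group_one.append(float(value))
        else:
            group_zero.append(float(value))
    return group_zero, group_one

# Made-up charges and smoker flags (1 = smoker), for illustration only
sample_charges = ['1000.0', '5000.0', '3000.0', '7000.0']
sample_smokers = [0, 1, 0, 1]

non_smoker_charges, smoker_charges = split_by_indicator(sample_charges, sample_smokers)
print(sum(non_smoker_charges)/len(non_smoker_charges))  # 2000.0
print(sum(smoker_charges)/len(smoker_charges))          # 6000.0
```
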

In [None]:
def split_function(alist, smokers):
    for index in range(len(smokers)):