# U.S. Medical Insurance Costs

## Initial thoughts about the data
* We have complete data (no missing data points)
* The columns age, bmi, children and charges are numerical; bmi could also be mapped to a categorical variable (obese, normal weight, etc.)
* region, sex and smoker are strings with the latter two being binary
* It would be interesting to see how charges vary between male - female and smoker - non-smoker
* The dependence of the charges on number of children could also be interesting
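
The bmi mapping mentioned in the list could be sketched as follows. This is a minimal illustration, not part of the notebook: `bmi_category` is a hypothetical helper, and the cut-offs (18.5, 25, 30) are assumed to follow the usual WHO adult categories.

```python
# Hypothetical helper (not part of the notebook): bin a BMI value into
# categories, assuming the usual WHO adult cut-offs of 18.5, 25 and 30.
def bmi_category(bmi):
    bmi = float(bmi)  # values read from the CSV are strings, so convert first
    if bmi < 18.5:
        return 'underweight'
    elif bmi < 25:
        return 'normal weight'
    elif bmi < 30:
        return 'overweight'
    return 'obese'

sample_bmis = ['17.2', '22.0', '27.9', '33.77']  # made-up values
categories = [bmi_category(bmi) for bmi in sample_bmis]
print(categories)
```

With such a mapping, bmi could be treated like the other categorical columns when grouping charges.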

## Scope of the Project

### Goals
- Explore data separated along the binary variables sex and smoker: average charge, variance, dependence on numerical variables...
- Explore dependence of charge on region
- Investigate which, if any, combination of values of the categorical variables impacts the insurance charge the most
- Calculate covariance/correlation for numerical variables

## Importing Data

In [141]:
import csv
import math

data_list = []

with open('insurance.csv') as data:
    reader = csv.DictReader(data)
    for row in reader:
        data_dict = {}
        for fieldname in reader.fieldnames:
            data_dict[fieldname] = row[fieldname]
        data_list.append(data_dict)
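
Since `csv.DictReader` already yields one mapping per row, the field-by-field copy in the loop can be shortened. The sketch below assumes a made-up two-row sample in place of `insurance.csv`; the column names are taken from the data description, the values are illustrative only.

```python
import csv
import io

# Made-up two-row sample standing in for insurance.csv (values are illustrative).
sample_csv = io.StringIO(
    'age,sex,bmi,children,smoker,region,charges\n'
    '30,female,25.5,1,no,southwest,4000.5\n'
    '45,male,31.2,2,yes,northeast,21000.0\n'
)

# DictReader yields one mapping per row, so dict(row) is enough to copy it.
rows = [dict(row) for row in csv.DictReader(sample_csv)]

print(rows[0]['age'], type(rows[0]['age']))
```

Note that every value comes back as a string, which is why the numerical columns have to be converted before any arithmetic.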

## Analysis Functions

### Data Handling Functions

In [142]:
#hardcode the data categories and numericals
data_categories = ['sex', 'smoker', 'region']
data_numericals = ['age', 'bmi', 'children', 'charges']

#function for picking columns in data - input is a number of strings representing picks
def pick_data(data, *picks):
    valid_picks = list(filter(lambda pick: pick in data[0].keys(), picks))
    
    return [{pick:datum[pick] for pick in valid_picks} for datum in data]

#function to convert data to dict with keys being the headers and values being lists of data
def pivot(data):
    return {key:[datum[key] for datum in data] for key in data[0].keys()}

#function for extracting the different category values in a list for a given category in data
def category_values(data, category):
    if category in data_categories:
        return list(set([datum[category] for datum in data]))
    else:
        raise ValueError('invalid category')

#filters data by a category
def category_filter(data, category):
    if category in data_categories:
        return {category_value: list(filter(lambda datum: datum[category] == category_value, data)) for category_value in category_values(data, category)}
    else:
        raise ValueError('invalid category')

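#normalizes data to the range [0, 1]
def normalize(data):
    if type(data) == list:
        #convert first so that max/min compare numbers rather than strings
        values = [float(datum) for datum in data]
        upper = max(values)
        lower = min(values)
        if upper == lower:
            raise ValueError('data must contain at least two distinct values')
        return [(value - lower)/(upper - lower) for value in values]
    else:
        raise ValueError('data must be of type list')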

#normalizes every list in a dict of lists (such as the output of pivot)
def normalize_rows(data):
    if type(data) == dict:
        return {key: normalize(data[key]) for key in data.keys()}
    else:
        raise ValueError('data must be of type dict')

#converts 'string' numbers to floats
def convert_to_float(data):
    for datum in data:
        for key in datum.keys():
            datum[key] = float(datum[key])

#picks the numerical columns, converts them to floats and pivots them into a dict of lists
def prepare_numerical(data):
    picked_data = pick_data(data, *data_numericals)
    convert_to_float(picked_data)
    return pivot(picked_data)
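
To make the data flow concrete, here is a small sketch of picking, converting, pivoting and grouping on toy rows. The rows are made up, and `pick_data` and `pivot` are compact restatements of the functions above rather than new behaviour.

```python
# Toy rows in the same shape as data_list (values are made up).
rows = [
    {'age': '30', 'sex': 'female', 'bmi': '25.5', 'charges': '4000.5'},
    {'age': '45', 'sex': 'male', 'bmi': '31.2', 'charges': '21000.0'},
    {'age': '52', 'sex': 'female', 'bmi': '28.0', 'charges': '11000.0'},
]

# Compact restatements of pick_data and pivot from the cell above.
def pick_data(data, *picks):
    valid_picks = [pick for pick in picks if pick in data[0]]
    return [{pick: datum[pick] for pick in valid_picks} for datum in data]

def pivot(data):
    return {key: [datum[key] for datum in data] for key in data[0]}

# Unknown column names are dropped silently rather than raising an error.
picked = pick_data(rows, 'age', 'charges', 'not_a_column')
columns = pivot([{key: float(value) for key, value in row.items()} for row in picked])
print(columns)

# Group rows by a category value, which is what category_filter does.
by_sex = {}
for row in rows:
    by_sex.setdefault(row['sex'], []).append(row)
print({key: len(group) for key, group in by_sex.items()})
```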

### Numerical Functions

In [143]:
#finds median in data set
def median(data):
    if(type(data) == list and (type(data[0]) == float or type(data[0]) == int)):
        data_copy = data[:]
        data_copy.sort()
        if(len(data_copy) % 2 == 0):
            index = int((len(data_copy) - 1) / 2)
            return mean([data_copy[index], data_copy[index+1]])
        else:
            index = int((len(data_copy) - 1) / 2)
            return data_copy[index]
    else:
        raise ValueError('data must be a list of numbers')

#calculates the raw moment of numerical data (moment=1 gives the mean)
def mean(data, moment=1):
    if(type(data) == list):
        total = 0 #named total to avoid shadowing the built-in sum
        for datum in data:
            total += datum**moment
        return total/len(data)
    else:
        raise ValueError('data must be a list data type')

#calculates covariance of two variables
def cov(dataX, dataY):
    if(type(dataX) == list and type(dataY) == list):
        dataXYzip = zip(dataX, dataY)
        dataXY = [datumX*datumY for datumX, datumY in dataXYzip]
        return mean(dataXY) - mean(dataX)*mean(dataY)
    else:
        raise ValueError('data must be a list data type')

#calculates the variance of a variable
def var(data):
    return cov(data, data)

#calculates the standard deviation of a variable
def std(data):
    return var(data)**(1/2)

#calculates the correlation of two variables
def corr(dataX, dataY):
    return cov(dataX, dataY)/(std(dataX)*std(dataY))
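
The covariance above uses the identity cov(X, Y) = E[XY] - E[X]E[Y] and divides by n, so it is a population quantity rather than a sample one. As a rough cross-check, the sketch below compares it with the equivalent two-pass form and with `statistics.pstdev` from the standard library. The data values are made up and `cov_two_pass` is a hypothetical name.

```python
import statistics

def mean(data):
    return sum(data) / len(data)

# Population covariance, E[XY] - E[X]E[Y], as in the notebook's cov.
def cov(data_x, data_y):
    products = [x * y for x, y in zip(data_x, data_y)]
    return mean(products) - mean(data_x) * mean(data_y)

# Equivalent two-pass form, E[(X - E[X])(Y - E[Y])], which is less prone to
# cancellation error when the means are large compared with the spread.
def cov_two_pass(data_x, data_y):
    mean_x, mean_y = mean(data_x), mean(data_y)
    return mean([(x - mean_x) * (y - mean_y) for x, y in zip(data_x, data_y)])

ages = [19.0, 28.0, 33.0, 46.0, 60.0]                  # made-up values
charges = [1700.0, 4400.0, 3900.0, 8200.0, 13000.0]    # made-up values

print(cov(ages, charges), cov_two_pass(ages, charges))
print(cov(ages, ages) ** 0.5, statistics.pstdev(ages))
```

Because of the division by n, the standard deviation agrees with `statistics.pstdev` and not with `statistics.stdev`, which divides by n - 1.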

### Aggregate Functions

In [144]:
#calculates two-component quantities such as correlation between several variables
def matrix(data, func):
    if type(data) == dict:
        return {key:{innerKey:func(data[key], data[innerKey]) for innerKey in data.keys()} for key in data.keys()}
    else:
        raise ValueError('data must be a dict')

#calculates one-component quantities such as func = mean, median, variance...
def vector(data, *funcs):
    if type(data) == dict:
        return {key: {func.__name__:func(data[key]) for func in funcs} for key in data.keys()}
    else:
        raise ValueError('data must be a dict')
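
A small usage sketch shows the shape of the nested dicts these two helpers return. The column values are made up, `spread` and `dot` are hypothetical functions chosen only for illustration, and `matrix` and `vector` are restated without the type checks.

```python
def mean(data):
    return sum(data) / len(data)

def spread(data):  # hypothetical one-column summary: max minus min
    return max(data) - min(data)

def dot(data_x, data_y):  # hypothetical two-column function for illustration
    return sum(x * y for x, y in zip(data_x, data_y))

# Restatements of matrix and vector from the cell above, without type checks.
def matrix(data, func):
    return {key: {inner: func(data[key], data[inner]) for inner in data} for key in data}

def vector(data, *funcs):
    return {key: {func.__name__: func(data[key]) for func in funcs} for key in data}

columns = {'a': [1.0, 2.0, 3.0], 'b': [2.0, 0.0, 4.0]}  # made-up values

summary = vector(columns, mean, spread)  # one inner dict per column, keyed by function name
pairwise = matrix(columns, dot)          # one inner dict per column, keyed by the other column
print(summary)
print(pairwise)
```

Both results are dicts of dicts, which is the structure `print_matrix` expects.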

### Presentational Functions

In [145]:
#finds the length of the longest string in a matrix dict
def longest_string_length(matrix, decimal_length):
    rowKeys = list(matrix.keys())
    columnKeys = list(matrix[rowKeys[0]].keys())
    longest_len = 0
    for element in rowKeys + columnKeys:
        length = len(element)
        if longest_len < length:
            longest_len = length
            
    for rowKey in rowKeys:
        for columnKey in columnKeys:
            length = len(decimal_length%(matrix[rowKey][columnKey]))
            if longest_len < length:
                longest_len = length
    
    return longest_len

#fits an element within a matrix row and column, given a pad_ref
def fit_element(element, pad_ref):
    element_len = len(element) if len(element) % 2 == 0 else len(element) + 1
    pad = int((pad_ref - element_len)/2) *' '
    element_string = pad + element + pad + '|' if len(element) % 2 == 0 else pad + element + pad + ' |'
    return element_string

#prints matrix dict
def print_matrix(matrix):
    rowKeys = list(matrix.keys())
    columnKeys = list(matrix[rowKeys[0]].keys())
    decimal_length = '%0.4f'
    longest_string = longest_string_length(matrix, decimal_length)
    pad_ref = longest_string + 2 if longest_string % 2 == 0 else longest_string + 3
    columnKeys_string = pad_ref*' ' + '|'
    
    for columnKey in columnKeys:
        columnKeys_string += fit_element(columnKey, pad_ref)
    print(columnKeys_string)
    
    for rowKey in rowKeys:
        row = (pad_ref - len(rowKey) - 1)*' ' + rowKey + ' |'
        for columnKey in columnKeys:
            row += fit_element(decimal_length%(matrix[rowKey][columnKey]), pad_ref)
        print(row)
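
The manual padding in `fit_element` could also be delegated to `str.center` and `str.rjust`. The following is only a sketch of that alternative: `fit_element_centered` and `format_row` are hypothetical names, and where the spare space goes for odd widths may differ from the hand-written version.

```python
# Hypothetical alternative to fit_element: let str.center do the padding.
def fit_element_centered(element, width):
    return element.center(width) + '|'

# Builds one table row: a right-aligned label followed by centered cells.
def format_row(label, values, width, number_format='%0.4f'):
    cells = [fit_element_centered(number_format % value, width) for value in values]
    return label.rjust(width - 1) + ' |' + ''.join(cells)

row = format_row('age', [1.0, 0.1093], 10)
print(row)
```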

### Executive Functions

In [146]:
#performs correlation analysis on data for category
def do_corr_analysis(data, category=None):
    if not category:
        numerical = prepare_numerical(data)
        print_matrix(matrix(numerical, corr))
    else:
        for category_value in category_values(data, category):
            #filter the data passed in, not the global data_list
            numerical = prepare_numerical(category_filter(data, category)[category_value])
            
            print(category + ': ' + category_value)
            print_matrix(matrix(numerical, corr))
            print('')

#performs full correlation analysis on data
def do_full_corr_analysis(data):
    print('All categories')
    do_corr_analysis(data)
    print('')
    for category in data_categories:
        do_corr_analysis(data, category)
        
#performs numbers analysis on data for category
def do_numbs_analysis(data, category=None):
    if not category:
        numerical = prepare_numerical(data)
        print_matrix(vector(numerical, median, mean, std))
    else:
        for category_value in category_values(data, category):
            #filter the data passed in, not the global data_list
            numerical = prepare_numerical(category_filter(data, category)[category_value])
            
            print(category + ': ' + category_value)
            print_matrix(vector(numerical, median, mean, std))
            print('')

#performs full numbers analysis on data
def do_full_numbs_analysis(data):
    print('All categories')
    do_numbs_analysis(data)
    print('')
    for category in data_categories:
        do_numbs_analysis(data, category)

## Analysis

### Correlations

#### Correlations between numerical data for all categories

Below are the correlations between the numerical variables, first for the whole data set and then for each category value. For women, charges are more strongly positively correlated with age (0.3246) than for the general population (0.2990) or for men (0.2824). For men, charges are more strongly positively correlated with bmi and with the number of children than in the general population, although in both cases the correlation with the number of children is several times smaller than the correlations with the other numerical variables. It is perhaps not surprising that the number of children correlates more strongly and positively with age and bmi for women than for the general population, but it is worth noting that for men these same correlations are about an order of magnitude lower, while still positive.

A particularly strong positive correlation is found between bmi and charges for smokers. This is perhaps not surprising, since a high bmi is probably more harmful to health when combined with smoking. Also for smokers, charges are more weakly positively correlated with age than for non-smokers, even though the smoker value (0.3682) is above that of the general population (0.2990); this might reflect that aging plays a smaller role than smoking in a smoker's health. It is also not very surprising that bmi is only weakly positively correlated with age for smokers (0.0597). Finally, for smokers there is a weak negative correlation between bmi and the number of children. Perhaps fertility is more affected by bmi when one is a smoker.

Lastly, a number of negative correlations are seen for the different regions. In two regions the number of children and age are weakly negatively correlated, while in the other two regions the correlation is positive and stronger than in the general population. Perhaps some regions are better suited to larger families: people with few children may stay there for a longer time, while people with many children may stay put when they are younger and therefore have less money for moving.

In [147]:
do_full_corr_analysis(data_list)

All categories
          |   age    |   bmi    | children | charges  |
      age |  1.0000  |  0.1093  |  0.0425  |  0.2990  |
      bmi |  0.1093  |  1.0000  |  0.0128  |  0.1983  |
 children |  0.0425  |  0.0128  |  1.0000  |  0.0680  |
  charges |  0.2990  |  0.1983  |  0.0680  |  1.0000  |

sex: female
          |   age    |   bmi    | children | charges  |
      age |  1.0000  |  0.0972  |  0.0785  |  0.3246  |
      bmi |  0.0972  |  1.0000  |  0.0222  |  0.1614  |
 children |  0.0785  |  0.0222  |  1.0000  |  0.0585  |
  charges |  0.3246  |  0.1614  |  0.0585  |  1.0000  |

sex: male
          |   age    |   bmi    | children | charges  |
      age |  1.0000  |  0.1231  |  0.0087  |  0.2824  |
      bmi |  0.1231  |  1.0000  |  0.0024  |  0.2258  |
 children |  0.0087  |  0.0024  |  1.0000  |  0.0745  |
  charges |  0.2824  |  0.2258  |  0.0745  |  1.0000  |

smoker: yes
          |   age    |   bmi    | children | charges  |
      age |  1.0000  |  0.0597  |  0.0812  |  0.3682

### Numbers

#### Statistical numbers of different categories

Below are the summary statistics found for this data set. Note that for discrete variables such as age and children, the median might be more informative than the mean, and the std is also less useful. For the two continuous variables, the std of charges is especially large and in many cases comparable to the mean. It is worth noting that the mean charges differ greatly between smokers and non-smokers, while for the latter the standard deviation is around three times the size of the mean. Another point is that the mean charge is generally considerably larger than the median charge (13270.4223 against 9382.0330 for the whole data set), which indicates a right-skewed distribution: most people in a given group pay less than the group mean, while a few very large charges pull the mean up.
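
The effect of a few large values on the mean can be illustrated with a small made-up sample. The numbers below are not taken from the data set, and the sketch uses the standard library `statistics` module rather than the notebook's own functions.

```python
import statistics

# Made-up charges: most are modest, one is very large (a right-skewed sample).
charges = [2000.0, 3000.0, 4000.0, 5000.0, 46000.0]

mean_charge = statistics.mean(charges)      # pulled upward by the one large value
median_charge = statistics.median(charges)  # depends only on the middle value
below_mean = sum(1 for charge in charges if charge < mean_charge)

print(mean_charge, median_charge, below_mean)
```

In this sample four of the five values lie below the mean, which is the same pattern a mean well above the median suggests for the charges.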

In [148]:
do_full_numbs_analysis(data_list)

All categories
            |   median   |    mean    |    std     |
        age |  39.0000   |  39.2070   |  14.0447   |
        bmi |  30.4000   |  30.6634   |   6.0959   |
   children |   1.0000   |   1.0949   |   1.2050   |
    charges | 9382.0330  | 13270.4223 | 12105.4850 |

sex: female
            |   median   |    mean    |    std     |
        age |  40.0000   |  39.5030   |  14.0436   |
        bmi |  30.1075   |  30.3777   |   6.0415   |
   children |   1.0000   |   1.0740   |   1.1912   |
    charges | 9412.9625  | 12569.5788 | 11120.2953 |

sex: male
            |   median   |    mean    |    std     |
        age |  39.0000   |  38.9172   |  14.0397   |
        bmi |  30.6875   |  30.9431   |   6.1359   |
   children |   1.0000   |   1.1154   |   1.2181   |
    charges | 9369.6158  | 13956.7512 | 12961.4284 |

smoker: yes
            |   median   |    mean    |    std     |
        age |  38.0000   |  38.5146   |  13.8978   |
        bmi |  30.4475   |  30.7084   |   6.307