# Medical Cost Personal Datasets
## **Insurance Forecast by using Linear Regression**

## Scope
NOTE: the scope is defined during the scoping process and will be refined during the project.
### Step 1: Goals
NOTE: Projects should ideally be problem-centric and not data-centric. Starting with data may lead to analysis that is not actionable or relevant to the goals of our organization.
#### 1. Problem-centric view: From abstract to optimized
    - Abstract: (good) lower the charges / (evil) increase the charges
    - Improve: alter the variables that most impact the charges
    - Optimized: get as many people as possible to change their BMI and smoking status.

#### 2. Data-centric view: Project Objectives
    - Import a dataset into your program
    - Analyze a dataset by building out functions or class methods
    - Use libraries to assist in your analysis
    - Document and organize your findings
    - Make predictions about a dataset’s features based on your findings

### Step 2: What actions/interventions are you informing?
NOTE: actions need to be concrete in order to achieve the goal.
I will focus only on the good side of the goals.

Possible Actions (+ possible action break down):
#### 1. Improved BMI -> Data Science key input -> who should improve their BMI and would benefit from doing so?
    - healthy habits and eating advice aimed at 75% of the clients
    - fitness/sport programs for 40% of the clients
    - medical professional care for 10% of the clients
#### 2. Smoker to Non-Smoker -> Data Science key input -> who should quit smoking and is most likely to do so?
    - health advice to reduce frequency of smoking and expose the risks of smoking for 90% of all the population
    - medical advice to 50% of affected population
    - advanced support to quit smoking for 10% of the affected population

### Step 3: What Data do you have and what Data do you need?
#### 1. DATA WE HAVE: The file insurance.csv is where data is organized
   **It's good practice to evaluate each data source regarding:**
    - how is it stored? -> a well structured CSV file (see details below)
    - how was it collected? -> more details www.kaggle.com/mirichoi0218/insurance
    - what is the level of granularity? -> we have data about individuals and actions are aimed at individuals
    - how far back does it go? -> unknown, not needed
    - how often does new data come in? -> unknown, not needed
    - does it overwrite old fields? -> possible, once the status of an individual changes; does not affect the code
    - does it add new rows? -> possible, but current sample size is sufficient; does not affect the code
    - is there any collection bias? -> to be determined

   **The file has 1339 rows, including the header, and 7 columns:**
    - age (integer)
    - sex (female / male)
    - bmi (float) -> body mass index: weight divided by height squared (kg / m ^ 2), ideally 18.5 to 24.9
    - children (integer) -> Number of children covered by health insurance / Number of dependents
    - smoker (yes / no)
    - region (northeast, northwest, southeast, southwest) -> the residential area in the US
    - charges (float) -> individual medical costs billed by health insurance -> formula known from previous projects, which might not hold for this data: estimated_cost = 250 * age − 128 * sex + 370 * bmi + 425 * children + 24000 * smoker − 12500
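
The formula quoted for `charges` can be written as a small function, so that its estimate can later be compared with the billed amount. This is a minimal sketch: the function name `estimate_cost` is hypothetical, the coefficients are the unverified ones quoted above, and `sex` and `smoker` are assumed to be encoded as 1 for male/yes and 0 for female/no, as in the import cell of this notebook.

```python
def estimate_cost(age, sex, bmi, children, smoker):
    """Estimate charges with the formula quoted above (coefficients unverified).

    Assumes sex is 1 for male / 0 for female and smoker is 1 for yes / 0 for no.
    """
    return (250 * age - 128 * sex + 370 * bmi
            + 425 * children + 24000 * smoker - 12500)


# Hypothetical person: 40-year-old male nonsmoker, BMI 30, two children.
print(estimate_cost(age=40, sex=1, bmi=30, children=2, smoker=0))  # 9322
```

If the formula holds, switching `smoker` from 0 to 1 raises the estimate by 24000 regardless of the other inputs.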

   **Biases that should be evaluated:**
    - male to female ratio
    - smoker or nonsmoker prevalence
    - regional distribution
    - age distribution 

### Possible ideas for analysis are the following:
- Find out the average age of the patients in the dataset.
- Analyze where a majority of the individuals are from.
- Look at the different costs between smokers vs. non-smokers.
- Figure out what the average age is for someone who has at least one child in this dataset.

- Organize your findings into dictionaries, lists, or another convenient datatype.
- Make predictions about what features are the most influential for an individual’s medical insurance charges based on your analysis.
- Explore areas where the data may include bias and how that would impact potential use cases.

# IMPORT YOUR DATASET
# SAVE YOUR DATASET

In [173]:
import csv

# One list per column; the same index refers to the same person in every list.
list_id = []
list_age = []
list_sex = []
list_bmi = []
list_children = []
list_smoker = []
list_region = []
list_charges = []
with open("insurance.csv", newline='') as insurance_csv:
    user_data = csv.DictReader(insurance_csv)
    for i, row in enumerate(user_data):
        list_id.append(i)  # running row index, used as an ID
        # Numeric columns are kept as strings and converted with float() where they are used.
        list_age.append(row['age'])
        list_sex.append(1 if row['sex'] == 'male' else 0)  # male = 1, female = 0
        list_bmi.append(row['bmi'])
        list_children.append(row['children'])
        list_smoker.append(1 if row['smoker'] == 'yes' else 0)  # smoker = 1, nonsmoker = 0
        list_region.append(row['region'])
        list_charges.append(row['charges'])

# Build out analysis functions or class methods
Find out the average age of the patients in the dataset.

In [174]:
def average(lyst):
    total = 0
    for x in lyst:
        total = total + float(x)
    avg = total / len(lyst)
    return avg

print(average(list_age))
print(average(list_sex))
print(average(list_bmi))
print(average(list_charges))
print(average(list_smoker))
print(average(list_children))

39.20702541106129
0.5052316890881914
30.663396860986538
13270.422265141257
0.20478325859491778
1.0949177877429


- Female:male = 49.5:50.5 % -> no bias
- Smoker:nonsmoker = 20.4:79.6 % -> there are more nonsmokers
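
The "no bias" reading of the sex ratio can be checked with an approximate confidence interval for a sample proportion. This is a minimal sketch using the normal approximation; the function name `proportion_ci` is hypothetical, and n = 1338 is the number of records (1339 rows minus the header).

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Approximate 95% interval for a sample proportion (normal approximation)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se


# Male share taken from the output above, with n = 1338 records.
low, high = proportion_ci(0.5052316890881914, 1338)
print(round(low, 3), round(high, 3))
```

The interval runs from roughly 0.48 to 0.53 and contains 0.5, which is consistent with a balanced male to female ratio.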

Analyze where a majority of the individuals are from.

In [175]:
def location(lyst):
    geo_loc={}
    for loc in lyst:
        if loc in geo_loc:
            geo_loc[loc] += 1
        else: geo_loc[loc] = 1
    p_hat={}
    for loc in geo_loc:
        p_hat[loc] = str(round(geo_loc[loc]/len(lyst)*100,2))+'%'
    return geo_loc, p_hat
            
location(list_region)


({'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324},
 {'southwest': '24.29%',
  'southeast': '27.2%',
  'northwest': '24.29%',
  'northeast': '24.22%'})
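
The counting done by `location` can also be sketched with `collections.Counter` from the standard library. The helper name `share_by_category` and the toy input are hypothetical; with the notebook's data the input would be `list_region`.

```python
from collections import Counter


def share_by_category(values):
    """Return {category: (count, percent of all values)}."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(n / total * 100, 2)) for cat, n in counts.items()}


# Hypothetical toy input, not taken from insurance.csv.
print(share_by_category(["southwest", "southeast", "southeast", "northwest"]))
# {'southwest': (1, 25.0), 'southeast': (2, 50.0), 'northwest': (1, 25.0)}
```

Keeping the percentage as a number rather than a string such as `'27.2%'` makes it easier to sort or compare the shares later.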

Look at the different costs between smokers vs. non-smokers.

In [176]:
def smokers():
    anti = 0
    charges_anti = 0.0
    pro = 0
    charges_pro = 0.0
    for num in list_id:
        if list_smoker[num] == 1:
            pro += 1
            charges_pro += float(list_charges[num])
        elif list_smoker[num] == 0:
            anti += 1
            charges_anti += float(list_charges[num])
    smoker_stats = {"nonsmokers":(anti, round(charges_anti, 2), round(charges_anti/anti, 2)), \
                    "smokers":(pro, round(charges_pro, 2), round(charges_pro/pro, 2))}
    return smoker_stats

smokers()

{'nonsmokers': (1064, 8974061.47, 8434.27),
 'smokers': (274, 8781763.52, 32050.23)}

Smokers are charged 32050.23 on average, compared with 8434.27 for nonsmokers, which is about 3.8 times as much.
Smoking status therefore has a strong influence on insurance charges.
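
The same comparison can be repeated for any categorical column (sex, region, number of children) with a generic helper. This is a minimal sketch; the name `mean_by_group` and the toy data are hypothetical, and values are converted with `float()` because the notebook stores charges as strings.

```python
def mean_by_group(groups, values):
    """Average of values for each distinct label in groups."""
    totals = {}
    for label, value in zip(groups, values):
        count, running = totals.get(label, (0, 0.0))
        totals[label] = (count + 1, running + float(value))
    return {label: round(s / c, 2) for label, (c, s) in totals.items()}


# Hypothetical toy data: two smokers (1) and two nonsmokers (0).
print(mean_by_group([1, 0, 1, 0], ["300.0", "100.0", "500.0", "200.0"]))
# {1: 400.0, 0: 150.0}
```

With the notebook's lists, `mean_by_group(list_smoker, list_charges)` would be expected to reproduce the two averages returned by `smokers()`.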

Figure out what the average age is for someone who has at least one child in this dataset.

In [178]:
def average_age_w_child():
    total_age_w_child = 0
    index_age_w_child = 0    
    for num in list_id:
        if float(list_children[num]) > 0:
            total_age_w_child += float(list_age[num])
            index_age_w_child += 1
    proportion_w_child = round(index_age_w_child / len(list_id) * 100, 2)
    average_age_child = round(total_age_w_child / index_age_w_child, 2)
    return average_age_child, proportion_w_child
average_age_w_child()

(39.78, 57.1)

In [179]:
def age_dict():
    dictionary = {'18-24':0,'25-34':0, '35-44':0, '45-54':0, '55-64':0 ,'65-74':0, '75-100':0}
    for num in list_id:        
        age = float(list_age[num])
        if age>17 and age<25:
            dictionary['18-24'] += 1
        elif age>24 and age<35:
            dictionary['25-34'] += 1
        elif age>34 and age<45:
            dictionary['35-44'] += 1
        elif age>44 and age<55:
            dictionary['45-54'] += 1
        elif age>54 and age<65:
            dictionary['55-64'] += 1
        elif age>64 and age<75:
            dictionary['65-74'] += 1
        elif age>74:
            dictionary['75-100'] += 1
    return dictionary
age_dict()

{'18-24': 278,
 '25-34': 271,
 '35-44': 260,
 '45-54': 287,
 '55-64': 242,
 '65-74': 0,
 '75-100': 0}

Between the ages of 18 and 64, all age groups are fairly evenly represented in the sample (242 to 287 people each). Nobody in the sample is 65 or older.
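
The age grouping can also be driven by a table of band limits, which keeps the bands in one place if they are reused. This is a minimal sketch; `AGE_BANDS`, `age_band` and `count_by_band` are hypothetical names. It assumes whole-number ages and, unlike the open-ended last branch of `age_dict`, it ignores ages above 100.

```python
# Inclusive band limits mirroring the age groups used in this notebook.
AGE_BANDS = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64), (65, 74), (75, 100)]


def age_band(age):
    """Return the label of the band containing age, or None if out of range."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return None


def count_by_band(ages):
    """Count how many ages fall into each band; out-of-range ages are skipped."""
    counts = {f"{low}-{high}": 0 for low, high in AGE_BANDS}
    for age in ages:
        label = age_band(float(age))
        if label is not None:
            counts[label] += 1
    return counts


# Hypothetical toy input; 17 is outside every band and is not counted.
print(count_by_band(["18", "24", "25", "64", "17"]))
```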

In [180]:
def child_dict():
    dictionary = {'0':0, '1':0, '2':0, '3':0, '4':0, '5':0}
    for num in list_id:
        num_child = list_children[num]
        if num_child not in dictionary:
            dictionary[num_child] = 1
        else: dictionary[num_child] += 1
    return dictionary
child_dict()

{'0': 574, '1': 324, '2': 240, '3': 157, '4': 25, '5': 18}

In [195]:
def smoker_age_dict():
    dictionary = {'18-24':[0, 0],'25-34':[0, 0], '35-44':[0, 0], '45-54':[0, 0], \
                  '55-64':[0, 0] ,'65-74':[0, 0], '75-100':[0, 0]}
    for num in list_id:      
        if list_smoker[num] == 1:
            age = float(list_age[num])
            if age>17 and age<25:
                dictionary['18-24'][0] += 1
            elif age>24 and age<35:
                dictionary['25-34'][0] += 1
            elif age>34 and age<45:
                dictionary['35-44'][0] += 1
            elif age>44 and age<55:
                dictionary['45-54'][0] += 1
            elif age>54 and age<65:
                dictionary['55-64'][0] += 1
            elif age>64 and age<75:
                dictionary['65-74'][0] += 1
            elif age>74:
                dictionary['75-100'][0] += 1
    totals = age_dict()  # people per age group, computed once instead of once per group
    for group in dictionary:
        if totals[group] > 0:  # empty groups (65-74, 75-100) keep 0 to avoid dividing by zero
            dictionary[group][1] = round(dictionary[group][0] / totals[group] * 100, 2)
    return dictionary
smoker_age_dict()

{'18-24': [60, 21.58],
 '25-34': [56, 20.66],
 '35-44': [61, 23.46],
 '45-54': [55, 19.16],
 '55-64': [42, 17.36],
 '65-74': [0, 0],
 '75-100': [0, 0]}

'age range' : [number of smokers, percent of that age group who smoke]


The share of smokers is similar across age groups (17.36% to 23.46%), so smoking is not concentrated in a particular age range.
This means that we can keep smokers in the sample when considering the influence of other factors (age, sex, bmi) upon insurance charges.
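
One way to put a number on how strongly a numeric feature moves with charges is the Pearson correlation coefficient. This is a minimal sketch in plain Python, in keeping with the rest of the notebook; the function name `pearson` and the toy data are hypothetical, and it is one possible measure rather than the method used in this project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    xs = [float(x) for x in xs]  # the notebook stores numbers as strings
    ys = [float(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data with a perfectly linear relationship.
print(pearson([1, 2, 3, 4], [10, 20, 30, 40]))
```

With the notebook's lists, `pearson(list_age, list_charges)` and `pearson(list_bmi, list_charges)` could be compared to rank the numeric features; a value near 0 would suggest little linear relationship.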