<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/03_regression/09_regression_project/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression Project

We have learned about regression and how to build regression models using both scikit-learn and TensorFlow. Now we'll build a regression model from start to finish. We will acquire data and perform exploratory data analysis and data preprocessing. We'll build and tune our model and measure how well our model generalizes.

**Team Members:**
1. Alejandra Barroso
1. Sam Lefforge
1. A'Darius Lee

## Framing the Problem

### Overview

*Friendly Insurance, Inc.* has requested we do a study for them to help predict the cost of their policyholders. They have provided us with sample [anonymous data](https://www.kaggle.com/mirichoi0218/insurance) about some of their policyholders for the previous year. The dataset includes the following information:

Column   | Description
---------|-------------
age      | age of primary beneficiary
sex      | gender of the primary beneficiary (male or female)
bmi      | body mass index of the primary beneficiary
children | number of children covered by the plan
smoker   | is the primary beneficiary a smoker (yes or no)
region   | geographic region of the beneficiaries (northeast, southeast, southwest, or northwest)
charges  | costs to the insurance company

We have been asked to create a model that, given the first six columns, can predict the charges the insurance company might incur.

The company wants to see how accurate we can get with our predictions. If we can make a case for our model, they will provide us with the full dataset of all of their customers for the last ten years to see if we can improve on our model and possibly even predict cost per client year over year.

### Exercise 1: Thinking About the Data

Before we dive into the data, let's think about the problem space and the dataset. Consider the questions below.

#### Question 1

Is this problem actually a good fit for machine learning? Why or why not?

##### **Student Solution**

> *This problem is a good fit for machine learning because the policyholders' personal attributes are important factors to consider when estimating an insurance cost.*

---

#### Question 2

If we do build the machine learning model, what biases might exist in the data? Is there anything that might cause the model to have trouble generalizing to other data? If so, how might we make the model more resilient?

##### **Student Solution**

> *One bias that might exist in the data is implicit bias. Whoever decided that only these seven attributes would determine a reasonable price was mistaken, because more information is needed. The model might have trouble generalizing to other data because the personal information the insurance company collects is not enough to determine a price or an insurance plan for a person. To make the model more resilient, the insurance company could also collect financial information. More information needs to be taken into consideration to meet the needs of the insured and to set a reasonable price.*

---

#### Question 3

We have been asked to take input features about people who are insured and predict costs, but we haven't been given much information about how these predictions will be used. What effect might our predictions have on decisions made by the insurance company? How might this affect the insured?

##### **Student Solution**

> *Our predictions could introduce a lot of bias into the decisions made by the insurance company. The insured could be overcharged because of their age, their sex, their weight, whether they have children, whether they smoke, or the area they live in. Collecting this personal information is also risky because an employee could overcharge someone based on a personal attribute they do not like. As a result, the insured could be overcharged or denied the full insurance benefits because of a personal attribute.*

---

## Exploratory Data Analysis

Now that we have considered the societal implications of our model, we can start looking at the data to get a better understanding of what we are working with.

The data we'll be using for this project can be [found on Kaggle](https://www.kaggle.com/mirichoi0218/insurance). Upload your `kaggle.json` file and run the code block below.

In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

### Exercise 2: EDA and Data Preprocessing

Using as many code and text blocks as you need, download the dataset, explore it, and do any model-independent preprocessing that you think is necessary. Feel free to use any of the tools for data analysis and visualization that we have covered in this course so far. Be sure to do individual column analysis and cross-column analysis. Explain your findings.

#### **Student Solution**

In [None]:
# Add code and text blocks to explore the data and explain your work
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

insurance_df = pd.read_csv('insurance.csv')
print(insurance_df.dtypes)
print(insurance_df.describe())

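# Check for missing data (count of missing values per column)
print(insurance_df.isna().sum())

# One-hot encode the categorical columns (sex, smoker, region) as numeric 0/1 values
insurance_df = pd.get_dummies(insurance_df, dtype=float)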

insurance_df



---

# **Column Analysis**

In [None]:
# Correlation heatmap across all columns
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(insurance_df.corr(), annot = True, linewidths=.5, ax=ax)


**Cross Column Analysis**

As the heatmap shows, there is a very strong positive correlation between charges and whether someone is a smoker. That correlation stands out the most, but there are other, weaker positive correlations: BMI and living in the southeast, charges and age, and charges and BMI.
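To read the strongest relationships without scanning the whole heatmap, the correlations with `charges` can be sorted by absolute value. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical, not taken from `insurance.csv`); in the notebook, the same expression could be applied to the encoded `insurance_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the one-hot encoded insurance_df.
toy_df = pd.DataFrame({
    'age': [19, 33, 45, 52, 61, 28],
    'bmi': [27.9, 22.7, 30.1, 35.2, 29.0, 33.0],
    'smoker_yes': [1, 0, 0, 1, 0, 0],
    'charges': [16884.0, 4449.0, 8240.0, 44000.0, 13000.0, 4300.0],
})

# Correlation of each feature with the target, strongest (by magnitude) first.
charge_corr = (
    toy_df.corr()['charges']
    .drop('charges')
    .sort_values(key=abs, ascending=False)
)
print(charge_corr)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a plain descending sort would push to the bottom.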

# **Individual Chart Analysis**

**Charges Histogram**

In [None]:
import matplotlib.pyplot as plt

plt.hist(insurance_df['charges'])
plt.title('Charges')
plt.xlabel("Charge")
plt.ylabel("Number of People")
plt.show()

As the histogram shows, most policyholders cost the insurance company less than $15,000, while a relatively small but substantial number of people cost significantly more.
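A claim like "most people cost less than $15,000" can be checked numerically rather than read off the plot. The sketch below uses a handful of made-up values (`toy_charges` is hypothetical, not the real column) to show the share of rows below a threshold and the skewness of the distribution; in the notebook, `insurance_df['charges']` would take its place.

```python
import pandas as pd

# Hypothetical charges in dollars, for illustration only.
toy_charges = pd.Series(
    [1200.0, 3400.0, 5600.0, 8900.0, 12000.0, 14500.0, 21000.0, 39000.0]
)

threshold = 15000
# Fraction of policyholders whose charges fall below the threshold.
share_below = (toy_charges < threshold).mean()
print(f'{share_below:.1%} of charges are below ${threshold}')

# Positive skewness indicates a long right tail of expensive policyholders.
print(f'skewness: {toy_charges.skew():.2f}')
```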



**Ages Histogram**

In [None]:
import matplotlib.pyplot as plt

plt.hist(insurance_df['age'])
plt.title('Ages')
plt.xlabel("Age")
plt.ylabel("Number of People")
plt.show()


The histogram above shows that the insured are concentrated roughly between the ages of 18 and 28 and between the ages of 42 and 50.

**BMI Histogram**

In [None]:
import matplotlib.pyplot as plt

plt.hist(insurance_df['bmi'])
plt.title('BMI')
plt.xlabel("BMI")
plt.ylabel("Number of People")
plt.show()

**Children Pie Chart**

In [None]:
import matplotlib.pyplot as plt

# Count policyholders by number of children, ordered from 0 to 5
children_counts = insurance_df['children'].value_counts().sort_index()
children_sizes = [children_counts[n] for n in range(6)]
children_labels = ['Has No Children', 'Has One Child', 'Has Two Children',
                   'Has Three Children', 'Has Four Children', 'Has Five Children']
plt.pie(children_sizes, labels=children_labels, autopct='%1.1f%%')
plt.title('Children')
plt.show()

The chart above shows that the largest group of the insured has no children, and that the fewest people have four or five children.

**Smoke Pie Chart**

In [None]:
import matplotlib.pyplot as plt

# Count non-smokers and smokers from the one-hot encoded column
smoker_count = int((insurance_df['smoker_yes'] == 1).sum())
smoke_sizes = [len(insurance_df) - smoker_count, smoker_count]
plt.pie(smoke_sizes, labels=["Doesn't Smoke", 'Smokes'], autopct='%1.1f%%')
plt.title('Smoker')
plt.show()

The chart above shows that 79.5% of the insured do not smoke, while 20.5% of the insured do smoke.

**Sex Pie Chart**

In [None]:

import matplotlib.pyplot as plt

# Count male and female policyholders from the one-hot encoded column
female_count = int((insurance_df['sex_female'] == 1).sum())
sex_sizes = [len(insurance_df) - female_count, female_count]
plt.pie(sex_sizes, labels=['Male', 'Female'], autopct='%1.1f%%')
plt.title('Sex')
plt.show()


The chart above shows that, of the insured, 50.5% are male and 49.5% are female.

**Region Pie Chart**

In [None]:
# Individual column analysis: number of policyholders per region
import matplotlib.pyplot as plt

# Each one-hot column sums to the number of policyholders in that region
region_sizes = [insurance_df['region_northeast'].sum(),
                insurance_df['region_northwest'].sum(),
                insurance_df['region_southeast'].sum(),
                insurance_df['region_southwest'].sum()]
plt.pie(region_sizes, labels=['Northeast', 'Northwest', 'Southeast', 'Southwest'], autopct='%1.1f%%')
plt.title('Region')
plt.show()

The chart above shows that the southeast has the most insured people and the northeast has the fewest. The northwest and the southwest have the same number of insured people.

## Modeling

Now that we understand our data a little better, we can build a model. We are trying to predict 'charges', which is a continuous variable. We'll use a regression model to predict 'charges'.

### Exercise 3: Modeling

Using as many code and text blocks as you need, build a model that can predict 'charges' given the features that we have available. To do this, feel free to use any of the toolkits and models that we have explored so far.

You'll be expected to:
1. Prepare the data for the model (or models) that you choose. Remember that some of the data is categorical. In order for your model to use it, you'll need to convert the data to some numeric representation.
1. Build a model or models and adjust parameters.
1. Validate your model with holdout data. Hold out some percentage of your data (10-20%), and use it as a final validation of your model. Print the root mean squared error. We were able to get an RMSE between `3500` and `4000`, but your final RMSE will likely be different.
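The three steps above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the expected solution: the data is synthetic (`toy_df` and the coefficients used to generate it are invented), the model is a plain `LinearRegression`, and the printed RMSE says nothing about what the real dataset will produce.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook itself would load insurance.csv.
rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    'age': rng.integers(18, 65, size=n),
    'smoker': rng.choice(['yes', 'no'], size=n),
})
toy_df['charges'] = (
    250 * toy_df['age']
    + 20000 * (toy_df['smoker'] == 'yes')
    + rng.normal(0, 1000, size=n)
)

# Convert the categorical column to a numeric (one-hot) representation.
encoded = pd.get_dummies(toy_df, columns=['smoker'], dtype=float)
X = encoded.drop(columns='charges')
y = encoded['charges']

# Hold out 20% of the rows for the final validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows only, then score on the held-out rows.
baseline = LinearRegression().fit(X_train, y_train)
holdout_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
print(f'Holdout RMSE: {holdout_rmse:.0f}')
```

Fitting only on the training rows matters: an RMSE computed on rows the model has already seen would overstate how well it generalizes.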

#### **Student Solution**

In [None]:
#TensorFlow version
%tensorflow_version 2.x

import tensorflow as tf
tf.__version__

In [None]:
# Add code and text blocks to build and validate a model and explain your work
import numpy as np
import math
from sklearn import metrics
from tensorflow import keras
from tensorflow.keras import layers

# Separate the target from the features
target_column = 'charges'
feature_columns = [c for c in insurance_df.columns if c != target_column]
# Only the numeric features are standardized; the target is scaled separately
# so that the RMSE can be converted back to dollars with TARGET_FACTOR alone
numeric_feature_columns = ['age', 'bmi', 'children']

# Scale the target down so its values are close to 1
TARGET_FACTOR = 10000
insurance_df[target_column] /= TARGET_FACTOR

# Standardize the numeric features to zero mean and unit variance
mean = insurance_df[numeric_feature_columns].mean()
std = insurance_df[numeric_feature_columns].std()
insurance_df[numeric_feature_columns] = (insurance_df[numeric_feature_columns] - mean) / std

# Shuffle
insurance_df = insurance_df.sample(frac=1)

# Calculate test set size
test_set_size = int(len(insurance_df) * 0.2)

# Split the data
testing_df = insurance_df[:test_set_size]
training_df = insurance_df[test_set_size:]

# print(training_df.keys())
print(f'Holding out {len(testing_df)} records for testing. ')
print(f'Using {len(training_df)} records for training.')

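# A single-unit linear model, summarized for reference only;
# it is replaced by the deeper network below before training
model = keras.Sequential(layers=[
    layers.Dense(
        1,
        input_shape=[len(feature_columns)],
        name='Insurance'
    )
])

model.summary()

feature_count = len(feature_columns)

# Two hidden layers with ReLU activations, then a single linear output for the regression
model = keras.Sequential([
  layers.Dense(64, input_shape=[feature_count], activation='relu'),
  layers.Dense(64, activation='relu'),
  layers.Dense(1)
])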

model.compile(
  loss='mse',
  optimizer='Adam',
  metrics=['mae', 'mse'],
)

EPOCHS = 50

history = model.fit(
  training_df[feature_columns],
  training_df[target_column],
  epochs=EPOCHS,
  verbose=0,                     # New parameter to make model training silent
  validation_split=0.2,
)

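# Predict on the holdout set; flatten the (n, 1) output to match the target's shape
predictions = model.predict(testing_df[feature_columns]).flatten()

rmse = tf.keras.metrics.RootMeanSquaredError()
rmse.update_state(testing_df[target_column], predictions)

# RMSE in the target's scaled units, then multiplied by TARGET_FACTOR
print(rmse.result().numpy())
print(rmse.result().numpy() * TARGET_FACTOR)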

insurance_df


---
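Multiplying the RMSE by `TARGET_FACTOR` recovers an error in dollars only when the target was divided by that constant and not transformed in any other way (for example, standardized). The sketch below checks this with a few made-up numbers (`y_true`, `y_pred`, and the helper `rmse` are illustrative, not part of the notebook).

```python
import numpy as np

TARGET_FACTOR = 10000  # the same constant the notebook divides `charges` by

# Hypothetical true charges and predictions in dollars (illustrative values).
y_true = np.array([4000.0, 12000.0, 30000.0, 8000.0])
y_pred = np.array([5000.0, 10000.0, 33000.0, 8000.0])


def rmse(actual, predicted):
    """Root mean squared error between two arrays of equal length."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# RMSE computed directly in dollars.
rmse_dollars = rmse(y_true, y_pred)

# RMSE computed on the scaled target, as the model sees it.
rmse_scaled = rmse(y_true / TARGET_FACTOR, y_pred / TARGET_FACTOR)

# Because RMSE scales linearly with the target, these two values agree.
print(rmse_dollars, rmse_scaled * TARGET_FACTOR)
```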