
#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression Project

We have learned about regression and how to build regression models using both scikit-learn and TensorFlow. Now we'll build a regression model from start to finish. We will acquire data and perform exploratory data analysis and data preprocessing. We'll build and tune our model and measure how well our model generalizes.

## Framing the Problem

### Overview

*Friendly Insurance, Inc.* has requested we do a study for them to help predict the cost of their policyholders. They have provided us with sample [anonymous data](https://www.kaggle.com/mirichoi0218/insurance) about some of their policyholders for the previous year. The dataset includes the following information:

Column   | Description
---------|-------------
age      | age of primary beneficiary
sex      | gender of the primary beneficiary (male or female)
bmi      | body mass index of the primary beneficiary
children | number of children covered by the plan
smoker   | is the primary beneficiary a smoker (yes or no)
region   | geographic region of the beneficiaries (northeast, southeast, southwest, or northwest)
charges  | costs to the insurance company

We have been asked to create a model that, given the first six columns, can predict the charges the insurance company might incur.

The company wants to see how accurate we can get with our predictions. If we can make a case for our model, they will provide us with the full dataset of all of their customers for the last ten years to see if we can improve on our model and possibly even predict cost per client year over year.

### Exercise 1: Thinking About the Data


#### Question 1

Is this problem actually a good fit for machine learning? Why or why not?

##### **Student Solution**

### *Response: The categories in the data are not varied enough, so any model we build will contain many biases. In addition, the historical charges may have been set by human decisions, which will introduce more bias when the model is used to predict prices.*

---

#### Question 2

If we do build the machine learning model, what biases might exist in the data? Is there anything that might cause the model to have trouble generalizing to other data? If so, how might we make the model more resilient?

##### **Student Solution**

### *Response: The data we are using to build the model does not include clients' medical records or incomes, which may introduce bias. Medical records: a client may have had surgery in the past, yet medical records are not required when enrolling for the insurance. Income: how much the insurance company charges a client should be based on the client's income.*

---

#### Question 3

We have been asked to take input features about people who are insured and predict costs, but we haven't been given much information about how these predictions will be used. What effect might our predictions have on decisions made by the insurance company? How might this affect the insured?

##### **Student Solution**

### *Response: The model is trained on the given categories. If the insurance company later tries to use it to predict prices with additional categories, the predictions will not be accurate.*

---

## Exploratory Data Analysis

Now that we have considered the societal implications of our model, we can start looking at the data to get a better understanding of what we are working with.

The data we'll be using for this project can be [found on Kaggle](https://www.kaggle.com/mirichoi0218/insurance). Upload your `kaggle.json` file and run the code block below.

In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

### Download Data

In [None]:
! kaggle datasets download mirichoi0218/insurance
! ls

### Exercise 2: EDA and Data Preprocessing

Using as many code and text blocks as you need, download the dataset, explore it, and do any model-independent preprocessing that you think is necessary. Feel free to use any of the tools for data analysis and visualization that we have covered in this course so far. Be sure to do individual column analysis and cross-column analysis. Explain your findings.

#### **Student Solution**

### Import Libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

### Load Data to Python Object

In [None]:
insurance_df = pd.read_csv('insurance.zip')
insurance_df

### One-hot encoding for Columns

In [None]:
#sex encoding
#insurance_df.loc[insurance_df['sex'] == 'female', 'sex'] = 1
#insurance_df.loc[insurance_df['sex'] == 'male', 'sex'] = 0
#smoker encoding
#insurance_df.loc[insurance_df['smoker'] == 'yes', 'smoker'] = 1
#insurance_df.loc[insurance_df['smoker'] == 'no', 'smoker'] = 0
#region encoding

#sorted(insurance_df['region'].unique())
#target_column = 'charges'

#feature_columns = [c for c in insurance_df.columns if c != target_column]
#numeric_feature_columns = [c for c in feature_columns if c != 'region']
#target_column, feature_columns, numeric_feature_columns
#for reg in sorted(insurance_df['region'].unique()):
  #insurance_df[reg] = (insurance_df['region'] == reg).astype(int)
  #feature_columns.append(reg)
#feature_columns.remove('region')

#insurance_df

insurance_df = pd.get_dummies(insurance_df)
insurance_df
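
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame that reuses the dataset's categorical column names. The toy values are assumptions for illustration only, and `dtype=int` is passed so the indicator columns come out as integers whichever pandas version is installed.

```python
import pandas as pd

# Hypothetical toy rows; not taken from the real dataset.
toy_df = pd.DataFrame({
    'age': [19, 33, 45],
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'northwest', 'southeast'],
})

# Each string column is replaced by one 0/1 indicator column per category.
# Numeric columns such as 'age' are left untouched.
toy_encoded = pd.get_dummies(toy_df, dtype=int)
print(toy_encoded.columns.tolist())
```

Every category gets its own column, so a two-valued column such as `sex` produces two perfectly complementary indicators. That redundancy is harmless for a neural network but worth keeping in mind when reading the correlations.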

### Heatmap

In [None]:
sns.heatmap(insurance_df.corr())
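
The heatmap shows every pairwise correlation at once. When the main interest is which features move with the target, it can help to pull out only the `charges` column of the correlation matrix and sort it. The sketch below does this on a small made-up frame; the values are hypothetical, and in the notebook `insurance_df` would take the place of `toy_df`.

```python
import pandas as pd

# Hypothetical toy values; the real notebook would use insurance_df instead.
toy_df = pd.DataFrame({
    'age': [20, 30, 40, 50, 60],
    'smoker_yes': [0, 1, 0, 1, 0],
    'charges': [2000.0, 20000.0, 6000.0, 30000.0, 10000.0],
})

# Correlation of every feature with the target, strongest (by magnitude) first.
target_corr = (
    toy_df.corr()['charges']
    .drop('charges')
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```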

### Explore Data Type

In [None]:
insurance_df.dtypes

---

## Modeling

Now that we understand our data a little better, we can build a model. We are trying to predict 'charges', which is a continuous variable. We'll use a regression model to predict 'charges'.

### Exercise 3: Modeling

### Import TensorFlow

In [None]:
#%tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
tf.__version__

Using as many code and text blocks as you need, build a model that can predict 'charges' given the features that we have available. To do this, feel free to use any of the toolkits and models that we have explored so far.

You'll be expected to:
1. Prepare the data for the model (or models) that you choose. Remember that some of the data is categorical. In order for your model to use it, you'll need to convert the data to some numeric representation.
1. Build a model or models and adjust parameters.
1. Validate your model with holdout data. Hold out some percentage of your data (10-20%), and use it as a final validation of your model. Print the root mean squared error. We were able to get an RMSE between `3500` and `4000`, but your final RMSE will likely be different.
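
For reference, the root mean squared error asked for in the last step can be sketched in a few lines of NumPy. The helper name `rmse` and the sample numbers are assumptions made for this illustration. The call to `ravel()` flattens both inputs so that a column of predictions shaped `(n, 1)` is not broadcast against a target shaped `(n,)`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical charges and predictions.
actual = [1000.0, 2000.0, 3000.0]
predicted = [1000.0, 2000.0, 6000.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss dominates the score, which is why one bad prediction out of three still yields a large RMSE here.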

#### **Student Solution**

### Target and Feature Columns

In [None]:
target_column = 'charges'
feature_columns = [c for c in insurance_df.columns if c != target_column]

target_column, feature_columns

### Target Factor to Reduce Training Time

In [None]:
TARGET_FACTOR = 10000

insurance_df[target_column] /= TARGET_FACTOR

insurance_df[target_column].describe()

### Standardization

In [None]:
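# Assign whole columns (rather than using .loc) so the integer and indicator
# columns are replaced by their floating-point standardized values.
insurance_df[feature_columns] = (insurance_df[feature_columns] - insurance_df[feature_columns].mean()) / (insurance_df[feature_columns].std())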

insurance_df[feature_columns]
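
The cell above standardizes with the mean and standard deviation of the whole dataset, before any data is held out. A common alternative, sketched below on made-up numbers, computes those statistics on the training rows only and reuses them for the held-out rows, so that nothing about the test data influences preprocessing. The frame and the three-to-one split are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy feature table; the first three rows play the role of training data.
toy_df = pd.DataFrame({
    'age': [20.0, 30.0, 40.0, 50.0],
    'bmi': [22.0, 26.0, 30.0, 34.0],
})
toy_train = toy_df.iloc[:3]
toy_test = toy_df.iloc[3:]

# Compute the statistics on the training rows only...
train_mean = toy_train.mean()
train_std = toy_train.std()

# ...and apply the same shift and scale to both partitions.
toy_train_scaled = (toy_train - train_mean) / train_std
toy_test_scaled = (toy_test - train_mean) / train_std
print(toy_test_scaled)
```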

### Splitting Data to data_train, data_test

In [None]:
# Shuffle the rows, then hold out 20% for testing and train on the remaining 80%.
insurance_df = insurance_df.sample(frac=1)
test_set_size = int(len(insurance_df) * 0.2)
data_test = insurance_df[:test_set_size]
data_train = insurance_df[test_set_size:]

print(f'Holding out {len(data_test)} records for testing. ')
print(f'Using {len(data_train)} records for training.')
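
Since `sample(frac=1)` shuffles differently on every run, the held-out rows, and therefore the reported error, change from run to run. A minimal sketch of a repeatable variant follows, assuming a fixed `random_state` (the value 42 is arbitrary) and a made-up frame standing in for `insurance_df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for insurance_df.
toy_df = pd.DataFrame({'x': range(10), 'charges': range(100, 110)})

# A fixed random_state makes the shuffle, and so the split, repeatable.
shuffled = toy_df.sample(frac=1, random_state=42)
holdout_size = int(len(shuffled) * 0.2)
toy_test = shuffled[:holdout_size]
toy_train = shuffled[holdout_size:]

print(len(toy_train), len(toy_test))
```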

### Building the Model

In [None]:
# Create the Sequential model.
model = keras.Sequential()
# Determine the "input shape", which is the number
# of features that we will feed into the model.
input_shape = len(feature_columns)
# Create a layer that accepts our features and outputs
# a single value, the predicted insurance charges.
layer = layers.Dense(1, input_shape=[input_shape])
# Add the layer to our model.
model.add(layer)
# Print out a model summary.
model.summary()

### Making Deep Neural Network

In [None]:
feature_count = len(feature_columns)

model = keras.Sequential([
  layers.Dense(256, input_shape=[feature_count], activation='relu'),
  layers.Dense(256, activation='relu'),
  layers.Dense(1, activation='relu')
])

# Three Dense layers: two hidden layers of 256 units and a single-unit output layer.

model.summary()


model.compile(
  loss='mse',
  optimizer='Adam',
  metrics=['mae', 'mse'],
)


### Training the Model

In [None]:
EPOCHS = 50

history = model.fit(
  data_train[feature_columns],
  data_train[target_column],
  epochs=EPOCHS,
  validation_split=0.2,
)

### Validate the Model

In [None]:
predictions = model.predict(data_test[feature_columns])
#predictions

### RMSE

In [None]:
# Your Code Goes here
import math
from sklearn import metrics
import numpy as np


# scikit-learn expects the true values first, then the predictions.
# Multiplying by TARGET_FACTOR converts both back to the original units.
mean_squared_error = metrics.mean_squared_error(
    data_test[target_column] * TARGET_FACTOR,
    np.array(predictions) * TARGET_FACTOR
)
print("Mean Squared Error (on test data): %0.3f" % mean_squared_error)

root_mean_squared_error = math.sqrt(mean_squared_error)
print("Root Mean Squared Error (on test data): %0.3f" % root_mean_squared_error)
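
An RMSE is easier to judge next to a naive baseline. One simple choice, sketched here with made-up numbers, ignores every feature and always predicts the mean of the training target; a useful model should score clearly below it. The arrays are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical charges; not real results.
train_charges = np.array([2000.0, 4000.0, 12000.0, 30000.0])
test_charges = np.array([3000.0, 9000.0, 21000.0])

# The baseline ignores every feature and always predicts the training mean.
baseline_prediction = train_charges.mean()
baseline_rmse = float(np.sqrt(np.mean((test_charges - baseline_prediction) ** 2)))
print(f'Baseline RMSE: {baseline_rmse:.3f}')
```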

In [None]:
# Plot the training and validation MSE recorded at each epoch.
x = list(range(EPOCHS))
val_mse = history.history['val_mse']
mse = history.history['mse']


plt.title("Error Plot")
plt.xlabel("Epoch")
plt.ylabel("MSE")

plt.plot(
    x, mse, 'b-',
    x, val_mse, 'r-'
)

plt.legend(["Training MSE", "Validation MSE"])

plt.show()
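
Besides inspecting the plot, the epoch with the lowest validation error can be read directly from the recorded history. The sketch below uses a made-up list in place of `history.history['val_mse']`; the epoch index is zero-based, matching the x-axis of the plot.

```python
import numpy as np

# Hypothetical per-epoch validation MSE values, standing in for history.history['val_mse'].
toy_val_mse = [0.90, 0.45, 0.30, 0.28, 0.31, 0.35]

# Index of the smallest validation error.
best_epoch = int(np.argmin(toy_val_mse))
print(f'Lowest validation MSE {toy_val_mse[best_epoch]:.2f} at epoch {best_epoch}')
```

If the validation curve bottoms out well before the final epoch while the training curve keeps falling, that gap is a sign of overfitting, and training for fewer epochs may generalize better.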

---