
#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression Project

In this project you will be divided into small groups (two or three people). You will be pointed to a dataset and asked to create a model to solve a problem. Over the course of the day, your team will explore the data and train the best model you can for solving the problem. At the end of the day, your team will give a short presentation about your solution.

## Overview

### Learning Objectives

* Apply scikit-learn or TensorFlow to a dataset to create a regression model.
* Preprocess data for feeding into a model.
* Use a hand-built model to make predictions.
* Measure the quality of predictions from your model.

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Intermediate Pandas
* Visualizations
* Regression
* Regression with scikit-learn
* Regression with TensorFlow

### Estimated Duration

330 minutes (285 minutes working time, 45 minutes for presentations)

### Deliverables

1. A copy of this Colab notebook containing your code and responses to the ethical considerations below.
1. A group presentation. After everyone is done, we will ask each group to stand in front of the class and give a brief presentation about what they have done in this lab. The presentation can be a code walkthrough, a group discussion, a slide show, or any other means that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is graded in separate sections that each contribute a percentage of the total score:

1. Building and Using a Model (80%)
1. Ethical Implications (10%)
1. Project Presentation (10%)

#### Building and Using a Model

There are 6 demonstrations of competency listed in the problem statement below. Each competency is graded on a 3-point scale for a total of 18 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successful demonstration of competency |

The demonstrations of competency show that the team knows how to use the tools of a data scientist, but they are not a good measure of "thinking like a data scientist". 3 additional points will be awarded for the team's skillful application of data science concepts, graded on the following rubric:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Created a generic model with little insight |
| 2      | Performed some basic data science processes and patterns |
| 3      | Demonstrated mastery of data science and exploration concepts learned so far |

#### Ethical Implications

There are six questions in the **Ethical Implications** section. Each question is worth 2 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was off-topic or didn't make sense |
| 1      | Question was answered, but answer missed important considerations  |
| 2      | Answer adequately considered ethical implications |

#### Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members' names in the placeholders in this text area:

*   Rosemary Austin
*   Amanda Ma
*   Lynn He



# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing intake and outcome data](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes) for the [Austin Animal Care Shelter](http://www.austintexas.gov/department/aac). In this project we will **use intake data to predict the number of days that an animal is likely to stay in the shelter before being adopted**.

You are free to use any toolkit that we have covered in this class to solve the problem. At a minimum, that includes scikit-learn and TensorFlow.

Important details:

* The [dataset](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes) offers three files, one for intakes, one for outcomes, and one that joins the two and adds some additional columns. Feel free to use any combination of the files.
* The column we are trying to predict is 'time_in_shelter_days'.
* Do not use any outcome data as features for training the model. We want to be able to predict the time in shelter for any given animal at intake.
* Not all animals have outcomes. Not all outcomes are adoption.
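
The last two constraints can be illustrated with a minimal sketch on an invented toy frame. It assumes that outcome-only columns can be recognised by the substring `outcome` in their name, which is a simplification: in the real file, any other column derived from the outcome would have to be dropped by hand. The toy rows and the `LABEL` constant are hypothetical.

```python
import pandas as pd

# Toy stand-in for the joined intakes/outcomes file; the rows are invented.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat", "Dog", "Bird"],
    "sex_upon_intake": ["Intact Male", "Spayed Female", None, "Unknown"],
    "intake_month": [1, 6, 6, 12],
    "outcome_type": ["Adoption", "Transfer", "Adoption", None],
    "sex_upon_outcome": ["Neutered Male", "Spayed Female", None, "Unknown"],
    "time_in_shelter_days": [12.5, 3.0, 40.2, None],
})

LABEL = "time_in_shelter_days"

# Keep only animals whose recorded outcome was an adoption.
adopted = toy[toy["outcome_type"] == "Adoption"]

# Assumption: columns known only after the outcome contain "outcome" in their name.
outcome_cols = [c for c in adopted.columns if "outcome" in c]

# Features are what is known at intake; the label is kept separately.
features = adopted.drop(columns=outcome_cols + [LABEL])
labels = adopted[LABEL]

print(list(features.columns))
```

Filtering on the outcome type is acceptable here because it only selects training rows; the outcome columns themselves never reach the model as features.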

**Graded** demonstrations of competency:
1. Get the data into a Python object.
1. Examine the data programmatically and visually.
1. Perform at least one preprocessing transformation on the data.
1. Create and train a regression model.
1. Test and/or score a model.
1. Experiment with and tune the model: record parameters and objects used along with resulting scores.

### Student Solution

In [0]:
import pandas as pd

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [0]:
# intakes = pd.read_csv('aac_intakes.csv')
in_out = pd.read_csv('aac_intakes_outcomes.csv')
# outcomes = pd.read_csv('aac_outcomes.csv')

In [0]:
in_out.head()

In [0]:
in_out['time_in_shelter_days'].describe()


In [0]:
in_out.info()

In [0]:
# Map each categorical value to an integer code (its position in the list).
animal_list = ['Dog', 'Cat', 'Bird', 'Other']
sex_categories = ['Neutered Male', 'Spayed Female', 'Intact Female', 'Intact Male', 'Unknown']
in_out['animal_categories'] = in_out.animal_type.apply(lambda x: animal_list.index(x))
# Missing (non-string) values are assigned the 'Unknown' code, 4.
in_out['sex_intakes_categories'] = in_out.sex_upon_intake.apply(
    lambda x: sex_categories.index(x) if isinstance(x, str) else 4)

# Keep stays between 0 and 57 days, then select the label and feature columns.
in_out = in_out[(in_out['time_in_shelter_days'] >= 0) & (in_out['time_in_shelter_days'] <= 57)]
in_out = in_out[['intake_month', 'intake_year', 'time_in_shelter_days',
                 'animal_categories', 'sex_intakes_categories']]
in_out.head()

## Preparing Data for TensorFlow Regression Models

We shuffled the rows by drawing a 100% random sample, so that the train/test split below does not depend on the file's original row order.
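
As a small illustration of this shuffling step, here is a sketch on an invented toy frame. The `random_state` argument is an addition that is not used in the notebook; it is shown only because a fixed seed makes the shuffle repeatable between runs.

```python
import pandas as pd

# Invented rows, ordered by year the way a raw export might be.
df = pd.DataFrame({
    "intake_year": [2013, 2014, 2015, 2016, 2017],
    "time_in_shelter_days": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# frac=1 returns every row in random order; random_state fixes that order.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Repeating the call with the same seed yields the same permutation.
again = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

Without a seed, every run produces a different split downstream, which makes small score differences between models harder to interpret.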

In [0]:
in_out = in_out.sample(frac = 1)
in_out.head()

## Splitting the DataFrame

We split the data into training and testing sets: 20% for testing, 80% for training.
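
For comparison, the same 80/20 split can be sketched with scikit-learn's `train_test_split`, which shuffles and splits in one call. This is an alternative to the manual slicing used in this notebook, not what the team ran; the toy frame and the names `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented ten-row frame standing in for the prepared data.
df = pd.DataFrame({
    "intake_month": range(1, 11),
    "time_in_shelter_days": [float(d) for d in range(10)],
})

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
toy_train, toy_test = train_test_split(df, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))
```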

In [0]:
test_set_size = int(len(in_out) * .2)

testing_df = in_out[:test_set_size]
training_df = in_out[test_set_size:]

print("Holding out {} records for testing. Using {} records for training.".format(len(testing_df), len(training_df)))

## Translating DataFrames to Datasets

To create the models, we used **TensorFlow** and converted the DataFrames into Datasets so that we could feed them into the model.

In [0]:
from tensorflow.data import Dataset

testing_ds = Dataset.from_tensor_slices(testing_df)
training_ds = Dataset.from_tensor_slices(training_df)

testing_ds, training_ds

## Implementing a [Gradient Descent Optimizer](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer)

In [0]:
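import tensorflow as tf

# Note: tf.train and tf.contrib are TensorFlow 1.x APIs.
gd_optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-5)

# Clip gradients so that their norm never exceeds 5.0.
gd_optimizer = tf.contrib.estimator.clip_gradients_by_norm(gd_optimizer, clip_norm=5.0)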

## Linear Regression with Animal Type

We first tested whether a categorical column could be used as a feature, using only animal type and a linear regression model. We measured model accuracy with the **RMSE**, keeping in mind that the average time in shelter was around 16 days.

We found that a batch size of 100, with the dataset repeated 5 times (5 epochs), was computationally efficient. Altering these values did not significantly change the **RMSE of around 14**.
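
To put an RMSE value in context, it helps to compare it with a trivial baseline that always predicts the mean. The sketch below uses invented numbers (not the shelter data) and a hypothetical helper `rmse`; it shows that the RMSE of the mean baseline equals the standard deviation of the labels, so a model is only useful if its RMSE falls below that figure.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented stays in days; not taken from the dataset.
days = np.array([2.0, 5.0, 9.0, 16.0, 48.0])

# Baseline: predict the mean stay for every animal.
baseline = np.full_like(days, days.mean())

# The two printed numbers are equal: baseline RMSE is the population standard deviation.
print(rmse(days, baseline), days.std())
```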

In [0]:
import numpy as np
import math
from sklearn import metrics

animal_type_feature_col = tf.feature_column.categorical_column_with_identity(
key = 'animal_type',
num_buckets = 4)

def training_input():
  features = {
      'animal_type': training_df['animal_categories']
  }
  
  labels = training_df['time_in_shelter_days']
  training_ds = Dataset.from_tensor_slices((features,labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(5)

  return training_ds

def testing_input():
  features = {
      'animal_type': testing_df['animal_categories']
  }
  
  testing_ds = Dataset.from_tensor_slices(features)
  testing_ds = testing_ds.batch(1)

  return testing_ds

features = [animal_type_feature_col]

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=features,
    optimizer = gd_optimizer
    # TODO: Use a custom optimizer and explore other hyperparameters if you would like 
)

# Train the model
linear_regressor.train(
 input_fn=training_input
)

# Make predictions
predictions = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predictions = np.array([item['predictions'][0] for item in predictions])

# Find the RMSE (arguments in scikit-learn's order: y_true, then y_pred)
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(testing_df['time_in_shelter_days'], predictions))
root_mean_squared_error
  

## Linear Regression with Intake Month and Intake Year

Now that we know that categorical data works as a feature, we wanted to try different features to improve on the RMSE of the model that used only the animal type (~14.7).

We wanted to try the top features that the lasso regression found:

1. **intake_month**
1. **intake_year**
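
The lasso run that produced this ranking is not included in the notebook. As a rough sketch of how such a ranking can be obtained, the example below fits scikit-learn's `Lasso` on standardized synthetic data and sorts features by absolute coefficient. The data, the `noise` column, and the `alpha` value are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features; "noise" is deliberately unrelated to the label.
X = pd.DataFrame({
    "intake_month": rng.integers(1, 13, n),
    "intake_year": rng.integers(2013, 2019, n),
    "noise": rng.normal(size=n),
})

# Synthetic label that depends on month and year only.
y = (2.0 * X["intake_month"]
     - 3.0 * (X["intake_year"] - 2013)
     + rng.normal(scale=0.5, size=n))

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Larger absolute coefficient means a stronger linear contribution.
ranking = pd.Series(np.abs(lasso.coef_), index=X.columns).sort_values(ascending=False)
print(ranking)
```

Standardizing first matters: without it, a feature measured in large units such as a calendar year would receive a small coefficient regardless of its importance.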



In [0]:
import math
from sklearn import metrics

def training_input():
  features = {
      'intake_month': training_df['intake_month'],
      'intake_year': training_df['intake_year']
  }
  
  labels = training_df['time_in_shelter_days']
  training_ds = Dataset.from_tensor_slices((features,labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(5)

  return training_ds

def testing_input():
  features = {
      'intake_month': testing_df['intake_month'],
      'intake_year': testing_df['intake_year']
  }
  
  testing_ds = Dataset.from_tensor_slices(features)
  testing_ds = testing_ds.batch(1)

  return testing_ds

features = [tf.feature_column.numeric_column('intake_month'),
           tf.feature_column.numeric_column('intake_year')]

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=features,
    optimizer = gd_optimizer
    # TODO: Use a custom optimizer and explore other hyperparameters if you would like 
)

# Train the model
linear_regressor.train(
 input_fn=training_input
)

# Make predictions
predictions = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predictions = np.array([item['predictions'][0] for item in predictions])

# Find the RMSE (arguments in scikit-learn's order: y_true, then y_pred)
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(testing_df['time_in_shelter_days'], predictions))
root_mean_squared_error
  

## Linear Regression with Categorical Variables: Animal Type and Sex Upon Intake

The RMSE is much improved: **~11.9**. Intake month and intake year are better predictors than animal type alone. Now let's try **animal_type** combined with **sex_upon_intake**.

In [0]:
animal_type_feature_col = tf.feature_column.categorical_column_with_identity(
key = 'animal_type',
num_buckets = 4)

sex_upon_intake_feature_col = tf.feature_column.categorical_column_with_identity(
key = 'sex_upon_intake',
num_buckets = 5)

def training_input():
  features = {
      'animal_type': training_df['animal_categories'],
      'sex_upon_intake': training_df['sex_intakes_categories']
  }
  
  labels = training_df['time_in_shelter_days']
  training_ds = Dataset.from_tensor_slices((features,labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(5)

  return training_ds

def testing_input():
  features = {
      'animal_type': testing_df['animal_categories'],
      'sex_upon_intake': testing_df['sex_intakes_categories']
  }
  
  testing_ds = Dataset.from_tensor_slices(features)
  testing_ds = testing_ds.batch(1)

  return testing_ds

features = [animal_type_feature_col,sex_upon_intake_feature_col]

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=features,
    optimizer = gd_optimizer
    # TODO: Use a custom optimizer and explore other hyperparameters if you would like 
)

# Train the model
linear_regressor.train(
 input_fn=training_input
)

# Make predictions
predictions = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predictions = np.array([item['predictions'][0] for item in predictions])

# Find the RMSE (arguments in scikit-learn's order: y_true, then y_pred)
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(testing_df['time_in_shelter_days'], predictions))
root_mean_squared_error
  




Oh no! The RMSE is worse: **~14.7**. It is still better than the model with just the animal type, though.

## Linear Regression with Animal Type, Sex Upon Intake, Intake Month, Intake Year

Let's try all of them!


In [0]:
animal_type_feature_col = tf.feature_column.categorical_column_with_identity(
key = 'animal_type',
num_buckets = 4)

sex_upon_intake_feature_col = tf.feature_column.categorical_column_with_identity(
key = 'sex_upon_intake',
num_buckets = 5)

def training_input():
  features = {
      'animal_type': training_df['animal_categories'],
      'sex_upon_intake': training_df['sex_intakes_categories'],
      'intake_year': training_df['intake_year'],
      'intake_month': training_df['intake_month']
  }
  
  labels = training_df['time_in_shelter_days']
  training_ds = Dataset.from_tensor_slices((features,labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(5)

  return training_ds

def testing_input():
  features = {
      'animal_type': testing_df['animal_categories'],
      'sex_upon_intake': testing_df['sex_intakes_categories'],
      'intake_year': testing_df['intake_year'],
      'intake_month': testing_df['intake_month']
  }
  
  testing_ds = Dataset.from_tensor_slices(features)
  testing_ds = testing_ds.batch(1)

  return testing_ds

features = [animal_type_feature_col,
            sex_upon_intake_feature_col,
           tf.feature_column.numeric_column('intake_year'),
           tf.feature_column.numeric_column('intake_month')]

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=features,
    optimizer = gd_optimizer
    # TODO: Use a custom optimizer and explore other hyperparameters if you would like 
)

# Train the model
linear_regressor.train(
 input_fn=training_input
)

# Make predictions
predictions = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predictions = np.array([item['predictions'][0] for item in predictions])

# Find the RMSE (arguments in scikit-learn's order: y_true, then y_pred)
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(testing_df['time_in_shelter_days'], predictions))
root_mean_squared_error
  

The RMSE is similar to that of the model with intake month and intake year: **~11.9**.

However, after re-running the cell that splits the DataFrame into training and testing sets several times to vary the random split, we found that the model with all four variables usually achieves an RMSE about 0.03 lower than the model with just intake month and intake year.
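
A comparison across repeated random splits can be made systematic by looping over seeds and summarising the scores. The sketch below is only an illustration of that idea: it substitutes scikit-learn's `LinearRegression` for the TensorFlow estimator used in this notebook, and it runs on synthetic data, so its numbers say nothing about the shelter dataset. The helper `rmse_over_splits` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 400

# Synthetic stand-in for the prepared DataFrame.
data = pd.DataFrame({
    "intake_month": rng.integers(1, 13, n),
    "intake_year": rng.integers(2013, 2019, n),
    "animal_categories": rng.integers(0, 4, n),
})
data["time_in_shelter_days"] = (
    0.5 * data["intake_month"]
    + 3.0 * data["animal_categories"]
    + rng.normal(scale=3.0, size=n)
)

def rmse_over_splits(columns, seeds=range(10)):
    """Train and score on one 80/20 split per seed; return the RMSE of each split."""
    scores = []
    for seed in seeds:
        train, test = train_test_split(data, test_size=0.2, random_state=seed)
        model = LinearRegression().fit(train[columns], train["time_in_shelter_days"])
        pred = model.predict(test[columns])
        scores.append(np.sqrt(mean_squared_error(test["time_in_shelter_days"], pred)))
    return np.array(scores)

two = rmse_over_splits(["intake_month", "intake_year"])
three = rmse_over_splits(["intake_month", "intake_year", "animal_categories"])

# Compare the means, and check the spread before trusting a small gap.
print(two.mean(), two.std(), three.mean(), three.std())
```

If the gap between the two means is smaller than the spread across splits, as may well be the case for a difference of about 0.03, the improvement should be treated as inconclusive.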

## Troubleshooting RMSE: Looking into Predictions for Individual Variables

In [0]:
import matplotlib.pyplot as plt

# Plot actual values against predictions.
# The vertical axis is time in shelter in days; the horizontal axis is the type of animal.
plt.scatter(testing_df['animal_categories'], testing_df['time_in_shelter_days'])
plt.scatter(testing_df['animal_categories'], predictions)
 

In [0]:
# The horizontal axis is the sex of the animal upon intake; the vertical axis is time in shelter in days.
plt.scatter(testing_df['sex_intakes_categories'], testing_df['time_in_shelter_days'])
plt.scatter(testing_df['sex_intakes_categories'], predictions)


In [0]:

# The horizontal axis is the intake year; the vertical axis is time in shelter in days.
plt.scatter(testing_df['intake_year'], testing_df['time_in_shelter_days'])
plt.scatter(testing_df['intake_year'], predictions)


In [0]:
# The horizontal axis is the intake month; the vertical axis is time in shelter in days.
plt.scatter(testing_df['intake_month'], testing_df['time_in_shelter_days'])
plt.scatter(testing_df['intake_month'], predictions)

**Iterations**

Record different attempts at model configurations here:

| Model                        | Parameters                | Score         |
|------------------------------|---------------------------|---------------|
| sklearn LinearRegressor      | none                      | R^2 = 0.00123 |
| sklearn SGDRegressor         | batch_size=50, epochs=100 | R^2 = 0.00011 |
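
One lightweight way to keep such a record while experimenting is to append each run to a list and view it as a DataFrame. The sketch below is a suggestion rather than part of the original solution; the helper `record` is hypothetical, and the two example rows reuse the approximate RMSE values and training settings quoted in the write-up above.

```python
import pandas as pd

experiments = []

def record(model, parameters, rmse):
    """Append one experiment to the running log."""
    experiments.append({"Model": model, "Parameters": parameters, "RMSE": rmse})

# Settings shared by the runs in this notebook.
shared = "lr=1e-5, clip_norm=5.0, batch=100, repeat=5"

# Approximate scores as reported in the narrative above.
record("LinearRegressor (intake_month, intake_year)", shared, 11.9)
record("LinearRegressor (animal_type, sex_upon_intake)", shared, 14.7)

# Lowest RMSE first.
results = pd.DataFrame(experiments).sort_values("RMSE").reset_index(drop=True)
print(results)
```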

## Exercise 2: Ethical Implications

Even the most basic of models have the potential to affect segments of the population in different ways. It is important to consider how your model might positively and negatively affect different types of users.

In this section of the project you will reflect on the positive and negative implications of your model.

### Student Solution

**Positive Impact**

Your model is trying to solve a problem. Think about who will benefit from that problem being solved and write a brief narrative about how the model will help.

---

*Hypothetical entities will benefit because...*

**Negative Impact**

Models don't often have universal benefit. Think about who might be negatively impacted by the predictions your model is making. This person or persons might not be directly using the model, but instead might be impacted indirectly.

---

*Hypothetical entity will be negatively impacted because...*

**Bias**

Models can be biased for many reasons. The bias can come from the data used to build the model (e.g. sampling, data collection methods, available sources) and from the interpretation of the predictions generated by the model.

Think of at least two ways that bias might have been introduced to your model and explain both below.

---

*One source of bias in the model could be...*

*Another source of bias in the model could be...*

**Changing the Dataset to Mitigate Bias**

Biased datasets are one of the primary ways in which bias is introduced to a machine learning model. Look back at the input data that you fed to your model. Think about how you might change something about the data to reduce bias in your model.

What change or changes could you make to your dataset to make it less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of the changes that could be made to your input data.

---

*Since the data has potential bias A we can adjust...*

**Changing the Model to Mitigate Bias**

Is there any way to reduce bias by changing the model itself? This could include modifying algorithmic choices, tweaking hyperparameters, etc.

Write a brief summary of changes that you could make to help reduce bias in your model.

---

*Since the model has potential bias A we can adjust...*

**Mitigating Bias Downstream**

Models make predictions. Downstream processes make decisions. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your model to reduce the bias? Describe these below.

---

*Since the predictions have potential bias A we can adjust...*