#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Image Classification Project

In this project, we will built an image classification model and use the model to identify if the image contain a particular object.  The outcome of the model will be true of false for each images.

## Overview

### Learning Objectives

* Use a classification toolkits (scikit-learn, TensorFlow, or Keras) and build an image classification model.
* Prepare image data to the appropriate format and quality to be a suitable input to the model.

### Prerequisites

* Classification with scikit-learn
* Classification with TensorFlow
* Neural Networks
* Image Classificaion with Keras
* Image Manipulation with Python

### Estimated Duration

330 minutes (285 minutes working time, 45 minutes for presentations)

### Deliverables

1. A copy of this Colab notebook containing your code and responses to the ethical considerations below. The code should produce a functional labeled video.
1. A group presentation. After everyone is done, we will ask each group to stand in front of the class and give a brief presentation about what they have done in this lab. The presentation can be a code walkthrough, a group discussion, a slide show, or any other means that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is graded in separate sections that each contribute a percentage of the total score:

1. Building and Using a Model (80%)
1. Ethical Implications (10%)
1. Project Presentation (10%)

#### Building and Using a Model

There are 6 demonstrations of competency listed in the problem statement below. Each competency is graded on a 3 point scale for a total of 18 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successful demonstration of competency |


#### Ethical Implications

There are six questions in the **Ethical Implications** secion. Each question is worth 2 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was off-topic or didn't make sense |
| 1      | Question was answered, but answer missed important considerations  |
| 2      | Answer adequately considered ethical implications |

#### Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members names in the placeholders in this text area:

*   Jesse
*   Qianti
*   Michael



# Exercises

## Exercise 1: Coding

There are very few more important questions in life than "[Hot dog or not hot dog?](https://www.youtube.com/watch?v=ACmydtFDTGs)". For this workshop you will be tasked with creating a machine learning model that can **take an input image and determine if the image is of a hot dog or not a hot dog**.

Train your model with the [Kaggle Hot Dog/Not Hot Dog](https://www.kaggle.com/dansbecker/hot-dog-not-hot-dog/data) data set. Feel free to [do some background research](https://medium.com/@timanglade/how-hbos-silicon-valley-built-not-hotdog-with-mobile-tensorflow-keras-react-native-ef03260747f3) on the topic.

We have looked at regression, classification, and clustering models. We have used the Scikit Learn, TensorFlow, and Keras toolkits. Feel free to use the model and toolkit that you feel is the most appropriate.
 
**Graded** demonstrations of competency:
1. Pick a classification toolkit (eg: scikit-learn, TensorFlow or Keras) and provide jusitifaction for your choice. 
1. Obtain, prepare and load the dataset.
1. Define and train a classification model.
1. Test and evaluation your classification model. 
1. Apply your classification model to an image sourced from outside the Kaggle dataset. 
1. Test multiple models and/or sets of hyper-parameters and record the results. 
  
Some tips:
 
* Think about how to pre-process the images prior to use them as training data.  Should you train with images in color or grayscale, how many pixels should the image contains, etc. Clearly explain the reasoning for all of your choices in your Colab. 
 

### Student Solution

In [0]:
import zipfile
zip_ref = zipfile.ZipFile('hot-dog-not-hot-dog.zip', 'r')
zip_ref.extractall('./')
zip_ref.close()

In [0]:
import cv2 as cv
import glob as gb
import matplotlib.pyplot as plt


filenames = gb.glob("train/hot_dog/*.jpg")
train_imgs = [cv.imread(img) for img in filenames]

filenames = gb.glob("train/not_hot_dog/*.jpg")
train_not_imgs = [cv.imread(img) for img in filenames]

In [0]:
#augment data for hot_dog images using keras
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

datagen = ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

#reshape images to be fed into keras datagen
for _ in range(len(train_imgs)):
  train_imgs[_] = train_imgs[_].reshape((1,) + train_imgs[_].shape)
  
i = 0
for j in range(len(train_imgs)):
  
  for batch in datagen.flow(train_imgs[j], batch_size=1,
                            save_to_dir='train/hot_dog', save_prefix='rand', save_format='jpg'):
    i += 1
    if i > 4:
      break 

In [0]:
#augment data for not_hot_dog images using keras
for _ in range(len(train_not_imgs)):
  train_not_imgs[_] = train_not_imgs[_].reshape((1,) + train_not_imgs[_].shape)
  
i = 0
for j in range(len(train_not_imgs)):
  
  for batch in datagen.flow(train_not_imgs[j], batch_size=1,
                            save_to_dir='train/not_hot_dog', save_prefix='rand', save_format='jpg'):
    i += 1
    if i > 4:
      break 

In [0]:
import cv2 as cv
import glob as gb
import matplotlib.pyplot as plt
  
filenames = gb.glob("train/hot_dog/*.jpg")
train_hot = [cv.imread(img) for img in filenames]

filenames = gb.glob("train/not_hot_dog/*.jpg")
train_not = [cv.imread(img) for img in filenames]

filenames = gb.glob("test/hot_dog/*.jpg")
test_hot = [cv.imread(img) for img in filenames]

filenames = gb.glob("test/not_hot_dog/*.jpg")
test_not = [cv.imread(img) for img in filenames]

In [0]:
train_hot = [cv.cvtColor(img, cv.COLOR_BGR2GRAY) for img in train_hot]
train_not = [cv.cvtColor(img, cv.COLOR_BGR2GRAY) for img in train_not]
test_hot = [cv.cvtColor(img, cv.COLOR_BGR2GRAY) for img in test_hot]
test_not = [cv.cvtColor(img, cv.COLOR_BGR2GRAY) for img in test_not]

train_hot = [cv.resize(img, (100, 100)) for img in train_hot] #<---- these were originally 40,40
train_not = [cv.resize(img, (100, 100)) for img in train_not]
test_hot = [cv.resize(img, (100, 100)) for img in test_hot]
test_not = [cv.resize(img, (100, 100)) for img in test_not]

train_hot = [img.reshape(-1) for img in train_hot]
train_not = [img.reshape(-1) for img in train_not]
test_hot = [img.reshape(-1) for img in test_hot]
test_not = [img.reshape(-1) for img in test_not]

In [0]:
import numpy as np

hotdog_images = np.array([i.reshape(100,100) for i in train_hot])
notdog_images = np.array([i.reshape(100,100) for i in train_not])
train_images = np.concatenate((hotdog_images, notdog_images))

In [0]:
print(hotdog_images.shape)
print(notdog_images.shape)

In [0]:
train_labels = [1] * 498 + [0] * 498


In [0]:
import pandas as pd
import numpy as np

train_df = pd.DataFrame(train_hot)
train_df_not = pd.DataFrame(train_not)

test_df = pd.DataFrame(test_hot)
test_df_not = pd.DataFrame(test_not)

train_df['hot_dog'] = np.ones(len(train_hot), dtype=int)
train_df_not['hot_dog'] = np.zeros(len(train_not), dtype=int)

train_df = pd.concat([train_df, train_df_not], axis=0, sort=True)
train_df = train_df.reset_index(drop=True)

test_df['hot_dog'] = np.ones(len(test_hot), dtype=int)
test_df_not['hot_dog'] = np.zeros(len(test_not), dtype=int)

test_df = pd.concat([test_df, test_df_not], axis=0, sort=True)
test_df = test_df.reset_index(drop=True)

#Neural Networks in Scikit-learn

In [0]:
from sklearn.neural_network import MLPClassifier 
clf = MLPClassifier(activation='relu', solver='adam',
                     hidden_layer_sizes=(20, 20), random_state=4321)

clf.fit(np.array(train_df.iloc[:,0:-1]), np.array(train_df.iloc[:,-1])) 

In [0]:
from sklearn import metrics
from sklearn import model_selection

scores = model_selection.cross_val_predict(
  clf, 
  np.array(test_df.iloc[:,0:-1]),
  np.array(test_df.iloc[:,-1]),
  cv=3
)

precisions, recalls, thresholds = metrics.precision_recall_curve(
  np.array(test_df.iloc[:,-1]),
  scores
)

plt.plot(thresholds, precisions[:-1], "g--", label="Precision")
plt.plot(thresholds, recalls[:-1], "r-", label="Recall")
plt.xlabel("Threshold")
plt.legend(loc="upper right")
plt.show()

In [0]:
predictions = clf.predict(test_df.iloc[:, 0:-1])

In [0]:
accuracy = metrics.accuracy_score(test_df.iloc[:,-1], predictions)
accuracy

**Results for Scikit Learn Multi-Layer Perceptron:**

Recall: *approx* .52

Precision: *approx* .65

Accuracy: 0.5

Note: Recall and Precision are read from the graph at threshold of 0.7. 

#Neural Network in Keras

We attempted 4 trials to try and find the ideal neural network model. The code shows the last trial used. But the results of each are shown at the bottom of this section. We varied number of hidden layers, number of neurons, and optimizer used.

In [0]:
kerasTrain = train_df
# kerasTrain.iloc[:,0:-1] = kerasTrain.iloc[:,0:-1] / 255.0

kerasTest = test_df
# kerasTest.iloc[:,0:-1] = kerasTest.iloc[:,0:-1] / 255.0


In [0]:
import tensorflow as tf


model = tf.keras.Sequential([
    
    # Layer 1
    tf.keras.layers.Flatten(input_shape=(10000,)),
    
    # Layer 2
    tf.keras.layers.Dense(100, activation=tf.nn.relu), 
    #Layer 3
    tf.keras.layers.Dense(100, activation=tf.nn.relu),
    
    # Layer 4
    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

In [0]:
# Set model's settings, including loss function, optimizer, and metrics

model.compile(
    
    # Specify loss function
    
    loss='binary_crossentropy',
    
    # Specify optimizer
    optimizer='adam',
    
    # Specify metrics
    metrics=['accuracy']
)

In [0]:
# Start model training


history = model.fit(kerasTrain.iloc[:,0:-1], kerasTrain.iloc[:,-1], epochs=15)

In [0]:
# Set the overall dimension of the chart area

plt.figure(figsize=(16,5))

# Set plot to go to first grid of a 1x2 grid

plt.subplot(1,2,1)

# Plot training set accuracy values for each training iteration

plt.plot(history.history['acc'])
plt.title('Training Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train_accuracy'], loc='best')

# Set subplot to go to second grid of a 1x2 grid

plt.subplot(1,2,2)

# Plot training set loss values for each training iteration

plt.plot(history.history['loss'])
plt.title('Training Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss'], loc='best')

In [0]:
probability = model.predict(kerasTest.iloc[:, 0:-1])

In [0]:
y_pred = []
threshold = 0.7

for i in probability:
  if i > threshold:
    y_pred.append(1)
  else:
    y_pred.append(0)

In [0]:
from sklearn import metrics
from sklearn import model_selection
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

print("recall: {}".format(recall_score(kerasTest.iloc[:,-1], y_pred)))
print("precision: {}".format(precision_score(kerasTest.iloc[:, -1], y_pred)))
print("accuracy: {}".format(accuracy_score(kerasTest.iloc[:,-1], y_pred)))


In [0]:
# Use test dataset to evaluate model accuracy

# (test_loss, test_acc) = model.evaluate(kerasTest.iloc[:,0:-1], kerasTest.iloc[:,-1])

# print('Test accuracy:', test_acc)
# #print('Test loss:', test_loss)

**Using Leaky Relu in 1 hidden layer (100) and adam optimizer:**

recall: 0.02

precision: 0.5555555555555556

accuracy: 0.502

**Using Leaky Relu in two hidden layers (128, 10) and adagrad optimizer:**

recall: 0.168

precision: 0.5526315789473685

accuracy: 0.516

**Using Relu in two hidden layers (128, 10) and adam optimizer:**

recall: 0.604

precision: 0.5261324041811847

accuracy: 0.53

**Using Relu in two hidden layers (100,100) and adam optimizer:**

recall: 0.036

precision: 0.75

accuracy: 0.512

#Binary Classifier

In [0]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(
  train_df,                    
  stratify=train_df['hot_dog'],
  shuffle=True,
  test_size=0.2
)

In [0]:
from sklearn import linear_model

binary_classifier = linear_model.SGDClassifier(
  tol=1e-3, 
  max_iter=500) 
 
binary_classifier.fit(train[train.columns[0:1599]], train['hot_dog'])

In [0]:
from sklearn import metrics
from sklearn import model_selection

scores = model_selection.cross_val_predict(
  binary_classifier, 
  test[test.columns[0:1599]],
  test['hot_dog'],
  cv=3,
  method="decision_function"
)

precisions, recalls, thresholds = metrics.precision_recall_curve(
  test['hot_dog'],
  scores
)

plt.plot(thresholds, precisions[:-1], "g--", label="Precision")
plt.plot(thresholds, recalls[:-1], "r-", label="Recall")
plt.xlabel("Threshold")
plt.legend(loc="upper right")
plt.show()

In [0]:
threshold = -1
predictions = (scores > threshold)

accuracy = metrics.accuracy_score(test['hot_dog'], predictions)
precision = metrics.precision_score(test['hot_dog'], predictions)
recall = metrics.recall_score(test['hot_dog'], predictions)

print("Precision: {}\nRecall: {}".format(precision, recall))
print("Accuracy: {}".format(accuracy))

**Binary Classifier Results: **

Precision: 0.47413793103448276

Recall: 0.55

Accuracy: 0.47

## Exercise 2: Ethical Implications (OPTIONAL)


Even the most basic of models have the potential to affect segments of the population in different ways. It is important to consider how your model might positively and negative effect different types of users.

In this section of the project you will reflect on the positive and negative implications of your model.

Frame the context of your models creation using this narriative:

  > The city of Seattle is attempting to reduce traffic congestion in their downtown area. As part of this project, they plan on allowing each local driver one free downtown trip per week. After that the driver will have to pay a $50 toll for each extra day per week driven. As an early proof-of-concept for this project your team is tasked with using machine learning to correctly identify automobiles on the road. The next phase of the project will involve detecting license plate numbers and then cross-referencing that data with RFID chips that should be mounted in all local drivers cars.

### Student Solution

**Positive Impact**

Your model is trying to solve a problem. Think about who will benefit from that problem being solved and write a brief narrative about how the model will help.

---

*Hypothetical entities will benefit because...*

**Negative Impact**

Models don't often have universal benefit. Think about who might be negatively impacted by the predictions your model is making. This person or persons might not be directly using the model, but instead might be impacted indirectly.

---

*Hypothetical entity will be negatively impacted because...*

**Bias**

Models can be bias for many reasons. The bias can come from the data used to build the model (eg. sampling, data collection methods, available sources) and from the interpretation of the predictions generated by the model.

Think of at least two ways that bias might have been introduced to your model and explain both below.

---

*One source of bias in the model could be...*

*Another source of bias in the model could be...*

**Changing the Dataset to Mitigate Bias**

Bias datasets are one of the primary ways in which bias is introduced to a machine learning model. Look back at the input data that you fed to your model. Think about how you might change something about the data to reduce bias in your model.

What change or changes could you make to your dataset less bias? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of change that could be made to your input data.

---

*Since the data has potential bias A we can adjust...*

**Changing the Model to Mitigate Bias**

Is there any way to reduce bias by changing the model itself? This could include modifying algorithmic choices, tweaking hyperparameters, etc.

Write a brief summary of changes that you could make to help reduce bias in your model.

---

*Since the model has potential bias A we can adjust...*

**Mitigating Bias Downstream**

Models make predictions. Downstream processes make decisions. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your model to reduce the bias? Describe these below.

---

*Since the predictions have potential bias A we can adjust...*