# Mini Project 3: Deep Learning Pets

Make a copy of this notebook and share it with your instructor.

Student Name:    Christopher Schroeder

## Watch the following videos
- [Project 4 Walkthrough Video by Tom](https://www.youtube.com/watch?v=JeF-OTbDcGg)
- [Hints for improving your score](https://www.youtube.com/watch?v=fSROOv7S6Vo)


In [None]:
# Display video link below
from IPython.lib.display import YouTubeVideo
YouTubeVideo('JeF-OTbDcGg')

# Main Task

In this project you will try to complete the [PetFinder.my Adoption Prediction challenge on Kaggle](https://www.kaggle.com/c/petfinder-adoption-prediction).

This notebook sets up the basic data and a basic neural network. You must modify this notebook to improve its performance.

You should make a number of different attempts at this. At the end of the notebook you will write up what you tried and how it worked.



# Section 0

=== *You must run this section to set up things for any of the sections below* ===
### Setting up Python tools



We'll use five libraries for this tutorial: 
- [pandas](http://pandas.pydata.org/) : dataframes for spreadsheet-like data analysis, reading CSV files, time series
- [numpy](http://www.numpy.org/) : for multidimensional data and linear algebra tools
- [matplotlib](http://matplotlib.org/) : simple plotting and graphing
- [seaborn](http://stanford.edu/~mwaskom/software/seaborn/) : more advanced graphing
- [scikit-learn](https://scikit-learn.org/stable/) : provides many machine learning algorithms and tools for training and testing




In [None]:
# First, we'll import pandas and numpy, two data processing libraries
import pandas as pd
import numpy as np

# We'll also import seaborn and matplotlib, two Python graphing libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Import the needed sklearn libraries
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

# Use the updated Keras library from TensorFlow, which provides support for neural networks and deep learning
import tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Lambda, Flatten, LSTM
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.optimizers import Adam, RMSprop
#from tensorflow.keras.utils import np_utils
from tensorflow.keras.utils import to_categorical


# We will turn off some warnings in this notebook to make it easier to read for new students
import warnings
warnings.filterwarnings('ignore')

print ("All libraries imported")

## Task 1: PetFinder Kaggle Challenge

Sign up for the [PetFinder.my Adoption Prediction challenge on Kaggle](https://www.kaggle.com/c/petfinder-adoption-prediction).

In the writeup below you will need to enter your Kaggle user name.

Here is a summary of the data.

### Pet Data Fields
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
### AdoptionSpeed
Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way: 
- 0 - Pet was adopted on the same day as it was listed. 
- 1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 
- 2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 
- 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 
- 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
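
The category boundaries above can be sketched as a small function. This is only an illustration of the rules as listed: `adoption_speed` is a hypothetical helper that is not part of the Kaggle data or this notebook, and treating anything over 90 days (or never adopted) as category 4 is an assumption based on the note that no pets waited between 90 and 100 days.

```python
def adoption_speed(days_to_adoption):
    """Map days-until-adoption to an AdoptionSpeed category (hypothetical helper).

    Pass None for a pet that was never adopted.
    """
    # Assumption: anything beyond 90 days counts as "no adoption" (category 4)
    if days_to_adoption is None or days_to_adoption > 90:
        return 4
    if days_to_adoption == 0:
        return 0      # adopted on the day it was listed
    if days_to_adoption <= 7:
        return 1      # first week
    if days_to_adoption <= 30:
        return 2      # first month
    return 3          # second or third month


for days in [0, 3, 8, 45, None]:
    print(days, "->", adoption_speed(days))
```

Because lower numbers mean faster adoption, the categories are ordered, which is worth keeping in mind when judging how far off a prediction is.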

### File descriptions
- train.csv - Tabular/text data for the training set
- test.csv - Tabular/text data for the test set
- sample_submission.csv - A sample submission file in the correct format
- breed_labels.csv - Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.
- color_labels.csv - Contains ColorName for each ColorID
- state_labels.csv - Contains StateName for each StateID

# Section 1:  Set up Pet Data



### Set up the Input and output

- Training data: Information on 14,993 pets up for adoption
- Submission data: Information on 3,948 pets where we need to predict adoption time

### NOTE: This dataset is somewhat large and loading it may take a minute or two 


In [None]:
# Read data from the actual Kaggle download files stored in a raw file in GitHub
github_folder = 'https://raw.githubusercontent.com/CIS3115-Machine-Learning-Scholastica/CIS3115ML-Units7and8/master/petfinder-adoption/'
kaggle_folder = '../input/'

data_folder = github_folder
# Uncomment the next line to switch from using the github files to the kaggle files for a submission
#data_folder = kaggle_folder

train = pd.read_csv(data_folder + 'train/train.csv')
submit = pd.read_csv(data_folder + 'test/test.csv')

sample_submission = pd.read_csv(data_folder + 'test/sample_submission.csv')
labels_breed = pd.read_csv(data_folder + 'breed_labels.csv')
labels_color = pd.read_csv(data_folder + 'color_labels.csv')
labels_state = pd.read_csv(data_folder + 'state_labels.csv')

print ("training data shape: " ,train.shape)
print ("submission data shape: " ,submit.shape)

In [None]:
train.head(5)

## Task 2: Select features

Select which pet features to include in the training data. You should also select the same features for the submission.

Note that you may want to modify some features and you can add them in future cells.

### For the writeup

Describe which pet features you think are most important in determining when a pet will get adopted. Provide some justification for this.

In [None]:
#from tensorflow.keras.utils import to_categorical

# Select which features to use
pet_train = train[['Type','Gender','Age','Health','Sterilized','Vaccinated','Dewormed']]
# Everything we do to the training data we also should do to the submission data
pet_submit = submit[['Type','Gender','Age','Health','Sterilized','Vaccinated','Dewormed']]

# Convert output to one-hot encoding
pet_adopt_speed = to_categorical( train['AdoptionSpeed'] )

print ("pet_train data shape: " ,pet_train.shape)
print ("pet_submit data shape: " ,pet_submit.shape)
print ("pet_adopt_speed data shape: " ,pet_adopt_speed.shape)


## Task 3: Encode some features

Some numeric features like color and breed are called categorical features. Even though they may be encoded as numbers, the numbers do not relate numerically to each other. So if dog one has color 2 and dog two has color 4, this does not mean dog two is twice as colorful as dog one. It only means that they have different colors. One might be light brown and the other gray.


Neural networks work better when categorical data is one-hot encoded. See [An Overview of Categorical Input Handling for Neural Networks](https://towardsdatascience.com/an-overview-of-categorical-input-handling-for-neural-networks-c172ba552dee) for more details.


Since this is a common need, the pandas library has a built-in method, [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), for generating dummy variables based on a categorical variable. For additional information on this, see 
[The Dummy’s Guide to Creating Dummy Variables](https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40)
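
As a small illustration of dummy variables, the sketch below uses made-up color IDs rather than the real pet data. The names `toy_train` and `toy_submit` are hypothetical, and `dtype=int` is passed only so the output shows 0/1 instead of True/False. It also shows one way to give the submission data the same columns as the training data when a category is missing from it.

```python
import pandas as pd

# Toy data: the values are category IDs, not quantities
toy_train = pd.DataFrame({'Color1': [2, 4, 2]})
toy_submit = pd.DataFrame({'Color1': [2, 2]})   # color 4 never appears here

# One new 0/1 column per distinct category value
train_dummies = pd.get_dummies(toy_train, columns=['Color1'], dtype=int)
submit_dummies = pd.get_dummies(toy_submit, columns=['Color1'], dtype=int)

# Give the submission data the same columns, in the same order, filling gaps with 0
submit_dummies = submit_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies)
print(submit_dummies)
```

Each distinct color becomes its own column, so the network no longer sees color 4 as "twice" color 2.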


### For the writeup

Describe which of the pet features you want to use are categorical. Change the first line of the code below to list the categorical features you want to change to dummy variables.


In [None]:
# Add any columns to the list below that you want dummy variables created
cat_columns = ['Type','Gender','Health','Sterilized','Vaccinated','Dewormed']

# You should not need to change any code below this line
# =======================================================

# Create the dummy variables for the columns listed above
dfTemp = pd.get_dummies( train[cat_columns], columns=cat_columns )
pet_train = pd.concat([pet_train, dfTemp], axis='columns')

# Do the same to the submission data
dfSubmit = pd.get_dummies( submit[cat_columns], columns=cat_columns )
pet_submit = pd.concat([pet_submit, dfSubmit], axis='columns')
# Get missing columns in the submission data
missing_cols = set( pet_train.columns ) - set( pet_submit.columns )
# Add a missing column to the submission set with default value equal to 0
for c in missing_cols:
    pet_submit[c] = 0
# Ensure the columns in the submission set are in the same order as in the training set
pet_submit = pet_submit[pet_train.columns]



In [None]:
# We should check that the number of features is not too large and that the training and submission data still have the same number of features



# print out the current data
print ("Size of pet_train = ", pet_train.shape)
print ("Size of pet_submit = ", pet_submit.shape)
pet_train.head(5)

## Modify some features

Neural networks perform best when the numeric data has relative meaning. For example, a pet's age is used since we know that a pet that's 9 months old is similar to a pet that's 10 months old since 9 and 10 are relatively close to each other numerically. Likewise, a 60 month old pet is different from a 9 month old pet since 60 and 9 are far apart numerically.

Some features may need to be modified. For example, Vaccinated has values of (1 = Yes, 2 = No, 3 = Not Sure) but this will confuse a neural network because these numbers are not relative to each other. So, it might be better to encode Yes as +1 and No as -1 and then have Not Sure as 0 since that would be relatively in the middle between -1 and +1.

The code for doing this is beyond the scope of this course, so you may want to simply avoid features where the numerical values are not relative. If you are interested in coding, the following code defines a function to do so, and then the map method applies this function to each pet's Vaccinated feature.

Again, everything we do to the training data, we also need to do to the submission data.
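
As an alternative sketch of the same re-encoding idea, a dictionary lookup with the pandas `Series.map` method can replace a helper function. The series `vaccinated` below is a made-up stand-in for the real column, and `vaccinated_map` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the Vaccinated column (1 = Yes, 2 = No, 3 = Not Sure)
vaccinated = pd.Series([1, 2, 3, 1])

# Yes -> +1, No -> -1, Not Sure -> 0, so "Not Sure" sits between the two
vaccinated_map = {1: 1, 2: -1, 3: 0}
encoded = vaccinated.map(vaccinated_map)

print(encoded.tolist())
```

One thing to watch with this approach: any value that is missing from the dictionary becomes NaN, so the dictionary should cover every value that appears in the column.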



In [None]:
# Type - Pet type (1 = dog, 2 = cat)
#encodedType = train[['Type']] 
def fixVac( value ):
    if value == 1: return +1
    elif value == 2: return -1
    else: return 0

#train['encodedType'] = list(map(lambda a: 0 if (a>1) else a,train['Type']))
pet_train['encodedType'] = list(map(fixVac,train['Type']))
# Do the same thing to the submission data
pet_submit['encodedType'] = list(map(fixVac,submit['Type']))

# Gender - Pet gender (1 = M, 2 = F, 3=group)
#encodedGender = train[['Gender']] 
def fixVac( value ):
    if value == 1: return +1
    elif value == 2: return -1
    else: return 0

#train['encodedGender'] = list(map(lambda a: 0 if (a>1) else a,train['Gender']))
pet_train['encodedGender'] = list(map(fixVac,train['Gender']))
# Do the same thing to the submission data
pet_submit['encodedGender'] = list(map(fixVac,submit['Gender']))

# Health - Pet Health (1 = health, 2 = minor injury, 3 = serious injury, 0 = N/A)
#encodedHealth = train[['Health']] 
def fixVac( value ):
    if value == 1: return +1
    elif value == 2: return -1
    elif value == 3: return -2   # serious injury sits further from healthy than a minor injury
    else: return 0

#train['encodedHealth'] = list(map(lambda a: 0 if (a>1) else a,train['Health']))
pet_train['encodedHealth'] = list(map(fixVac,train['Health']))
# Do the same thing to the submission data
pet_submit['encodedHealth'] = list(map(fixVac,submit['Health']))

# Sterilized - Pet has been fixed (1 = Yes, 2 = No, 3 = Not Sure)
#encodedSterilized = train[['Sterilized']] 
def fixVac( value ):
    if value == 1: return +1
    elif value == 2: return -1
    else: return 0

#train['encodedSterilized'] = list(map(lambda a: 0 if (a>1) else a,train['Sterilized']))
pet_train['encodedSterilized'] = list(map(fixVac,train['Sterilized']))
# Do the same thing to the submission data
pet_submit['encodedSterilized'] = list(map(fixVac,submit['Sterilized']))

# Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
#encodedVaccinated = train[['Vaccinated']] 
def fixVac( value ):
    if value == 1: return +1
    elif value == 2: return -1
    else: return 0

#train['encodedVaccinated'] = list(map(lambda a: 0 if (a>1) else a,train['Vaccinated']))
pet_train['encodedVaccinated'] = list(map(fixVac,train['Vaccinated']))
# Do the same thing to the submission data
pet_submit['encodedVaccinated'] = list(map(fixVac,submit['Vaccinated']))

# Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
#encodedDewormed = train[['Dewormed']] 
def fixVac( value ):
    if value == 1: return +1
    elif value == 2: return -1
    else: return 0

#train['encodedDewormed'] = list(map(lambda a: 0 if (a>1) else a,train['Dewormed']))
pet_train['encodedDewormed'] = list(map(fixVac,train['Dewormed']))
# Do the same thing to the submission data
pet_submit['encodedDewormed'] = list(map(fixVac,submit['Dewormed']))

pet_train.head(10)

In [None]:
print ("pet_train data shape: " ,pet_train.shape)
print ("pet_adopt_speed data shape: " ,pet_adopt_speed.shape)
print ("pet_submit data shape: " ,pet_submit.shape)



### Scale and Split the data

**Scale Data:** Neural networks work best when all inputs are on a similar scale, but a feature such as Age can take much larger values than the 0/1 dummy variables. So, each feature is standardized to have a mean of 0 and a standard deviation of 1.

**Submission:** We do the same thing for the submission data

**Split the Data:** The training data is split with 95% used for training and 5% used for testing.
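
Whatever scaling is used, the scaling statistics should come from the training data and then be reused for the submission data, so both sets are transformed in exactly the same way. The sketch below shows the arithmetic of standardization (the calculation that scikit-learn's `StandardScaler` performs) using plain numpy and made-up ages; the array names are hypothetical.

```python
import numpy as np

# Made-up values for a single feature such as Age (in months)
train_age = np.array([[2.0], [4.0], [6.0]])
submit_age = np.array([[4.0], [8.0]])

# Learn the mean and standard deviation from the training data only
mean = train_age.mean(axis=0)
std = train_age.std(axis=0)

# Apply the same shift and scale to both data sets
train_scaled = (train_age - mean) / std
submit_scaled = (submit_age - mean) / std

print(train_scaled.ravel())
print(submit_scaled.ravel())
```

After scaling, the training values have a mean of 0 and a standard deviation of 1. The submission values generally will not, and that is expected, because they are measured against the training statistics.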

In [None]:
# Scale the data to put large features like Age on the same footing as small features like the 0/1 dummy variables
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse it for the submission data
pet_train_scaled = scaler.fit_transform(pet_train)
pet_submit_scaled = scaler.transform(pet_submit)

pet_train_scaled

In [None]:
# Split the data into 95% for training and 5% for testing out the models
X_train, X_test, y_train, y_test = train_test_split(pet_train_scaled, pet_adopt_speed, test_size=0.05)

print ("X_train training data shape: " ,X_train.shape)
print ("X_test testing data shape: " ,X_test.shape)

print ("y_train training data shape: " ,y_train.shape)
print ("y_test testing data shape: " ,y_test.shape)

### Neural Network

Set up the layers of the Neural Network

One possible configuration would be:

```
model = Sequential()
model.add(Dense(20, activation='relu', input_dim=(input_Size)))
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(10, activation='relu'))
model.add(Dense(output_Size, activation='softmax'))
```

Though, you should try your own configuration. We will eventually look at networks of 50+ layers, but for now I suggest you limit yourself to 3-5 hidden layers. 


*Note: You should not change the input or output layers, they are fixed by our problem definition*

In [None]:
# Set up the Neural Network
input_Size = X_test.shape[1]     # This is the number of features you selected for each pet
output_Size = y_train.shape[1]   # This is the number of categories for adoption speed, should be 5

model = Sequential()
model.add(Dense(500, activation='relu', input_dim=(input_Size)))
#model.add(Dropout(0.3))
model.add(Dense(300, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(output_Size, activation='softmax'))

# Compile neural network model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

print ("Neural Network created")
model.summary()

### Callbacks

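-  [ReduceLROnPlateau](https://keras.io/callbacks/#reducelronplateau): reduces the learning rate when the monitored metric stops improving.

-  [EarlyStopping callback](https://keras.io/callbacks/#earlystopping): stops training when the monitored metric stops improving.

-  [ModelCheckpoint callback](https://keras.io/callbacks/#modelcheckpoint): saves the model or its weights during training.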

In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint

learning_rate_reduction = ReduceLROnPlateau(monitor='val_loss', 
                                            patience=5, 
                                            verbose=2, 
                                            factor=0.5,                                            
                                            min_lr=0.000001)

early_stops = EarlyStopping(monitor='val_loss', 
                            min_delta=0, 
                            patience=20, 
                            verbose=2, 
                            mode='auto')

checkpointer = ModelCheckpoint(filepath = 'cis6115_PetFinder.{epoch:02d}-{accuracy:.6f}.hdf5',
                               verbose=2,
                               save_best_only=True, 
                               save_weights_only = True)


### Train the Neural Network

The code below allows up to 500 epochs, but the EarlyStopping callback ends training once the validation loss has stopped improving for 20 epochs.

In [None]:
# Fit model on training data for network with dense input layer

history = model.fit(X_train, y_train,
          epochs=500,
          verbose=1,
          callbacks=[learning_rate_reduction, early_stops],
          validation_data=(X_test, y_test))


In [None]:
# Evaluate the model on the held-out test data
print ("Running final scoring on test data")
score = model.evaluate(X_test, y_test, verbose=1)
print ("The accuracy for this model is ", format(score[1], ",.2f"))

## Plot the Training History

We store the performance during training in a variable named 'history'. The x-axis is the training time or number of epochs.

- Accuracy: Accuracy of the predictions; hopefully this is increasing to near 1.0
- Loss: How close the output is to the desired output; this should decrease to near 0.0
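
To make these two numbers concrete, the sketch below computes them by hand for three made-up pets. It assumes the loss is categorical cross-entropy, which is the loss this notebook passes to `model.compile`; the arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np

# Made-up example: 3 pets, 5 adoption speed classes
y_true = np.array([[0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 1, 0, 0, 0]])
y_pred = np.array([[0.1, 0.1, 0.6, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 0.2, 0.5],
                   [0.2, 0.2, 0.4, 0.1, 0.1]])

# Categorical cross-entropy: average of -log(probability given to the correct class)
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Accuracy: fraction of pets whose most probable class is the correct one
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

print("loss:", loss)
print("accuracy:", accuracy)
```

The third pet is classified incorrectly, so the accuracy is two out of three, and the low probability given to its true class is what contributes most to the loss.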

In [None]:
# We will display the loss and the accuracy of the model for each epoch
# NOTE: this is a slightly fancier display than is shown in the textbook
def display_training_curves(training, validation, title, subplot):
    if subplot%10==1: # set up the subplots on the first call
        plt.subplots(figsize=(10,10), facecolor='#F0F0F0')
        plt.tight_layout()
    ax = plt.subplot(subplot)
    ax.set_facecolor('#F8F8F8')
    ax.plot(training)
    ax.plot(validation)
    ax.set_title('model '+ title)
    ax.set_ylabel(title)
    #ax.set_ylim(0.28,1.05)
    ax.set_xlabel('epoch')
    ax.legend(['train', 'valid.'])
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 211)
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 212)

# Section 2: Create the Submission for Kaggle

The following code generates a file named submission.csv for the [PetFinder.my Adoption Prediction challenge on Kaggle](https://www.kaggle.com/c/petfinder-adoption-prediction).

Once you have this notebook working, you must load it up as a kernel in the Kaggle challenge.
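
The network produces one probability per adoption speed class, while the submission file needs a single class label per pet. A minimal sketch of that conversion, using made-up probabilities and hypothetical PetID values, could look like this:

```python
import numpy as np
import pandas as pd

# Made-up softmax outputs for 3 pets (one row per pet, one column per class 0-4)
probabilities = np.array([[0.05, 0.10, 0.60, 0.15, 0.10],
                          [0.10, 0.10, 0.10, 0.20, 0.50],
                          [0.40, 0.30, 0.10, 0.10, 0.10]])

# The predicted AdoptionSpeed is the index of the largest probability in each row
predicted_speed = np.argmax(probabilities, axis=1)

# Hypothetical PetIDs; the real ones come from the PetID column of test.csv
toy_submission = pd.DataFrame({'PetID': ['pet_a', 'pet_b', 'pet_c'],
                               'AdoptionSpeed': predicted_speed})
print(toy_submission)
```

The resulting table has the same two columns, PetID and AdoptionSpeed, as the submission file this section writes.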







In [None]:
print ("pet_train data shape: " ,pet_train.shape)
print ("submit data shape: " ,submit.shape)
print ("pet_submit data shape: " ,pet_submit_scaled.shape)


In [None]:
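# predict_classes was removed from recent TensorFlow releases, so take the most probable class directly
predictions = np.argmax(model.predict(pet_submit_scaled, verbose=1), axis=1)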

submissions=pd.DataFrame({'PetID': submit.PetID})
submissions['AdoptionSpeed'] = predictions

submissions.to_csv("submission.csv", index=False, header=True)

submissions.head(10)

## Task 4: Submit to Kaggle

Upload this notebook to Kaggle and submit the results.




### For the writeup

Describe your different attempts to improve this notebook. What pet features did you try and how did you modify them? What neural networks did you try?

How did these different attempts perform with the test data? How did they perform as a Kaggle submission?


 

## Writeup : Improve Pet Finder Performance
Create a document addressing the following questions and submit it for this assignment.

Also remember to share this notebook with your instructor and submit a link to the notebook itself.

You can answer these questions separately or in one unified paper.

# Writeup 1: Sign up for the Kaggle Pet Finder Challenge
---
You should have signed up for the [PetFinder.my Adoption Prediction challenge on Kaggle](https://www.kaggle.com/c/petfinder-adoption-prediction).

Describe what you thought of the Kaggle environment and what your user name is in Kaggle.

# Writeup 2 & 3: Feature Selection
---
Describe which pet features you think are most important in determining when a pet will get adopted. Provide some justification for this. Also describe which of the pet features you want to use are categorical and which of them you changed to dummy variables.

# Writeup 4: Kaggle Submission
---
Describe your different attempts to improve this notebook. What pet features did you try and how did you modify them? What neural networks did you try?

How did these different attempts perform with the test data? How did they perform as a Kaggle submission?