# Machine Learning Nanodegree Capstone Project

Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 

Approximately 2.7 million shelter animals are euthanized in the US every year.

This project treats the task as a multi-class classification problem: a dataset of intake information (breed, color, sex, age, etc.) provided by the Austin Animal Center is used to train a supervised learning algorithm. The trained model can then be used to predict the outcome (adoption, died, euthanasia, return to owner or transfer) of future shelter animals.

Predicted outcomes can help shelters identify and understand trends in animal outcomes. Such insights could help shelters focus their resources on the animals that need extra help finding a new home. For example, if the predicted outcome for a certain animal or breed is euthanasia, the shelter could direct additional effort toward finding those animals a new home.

I intend to follow the workflow outlined below as closely as possible:

- Step 1: Problem Preparation
  - Load libraries
  - Load dataset

- Step 2: Data Summarization
  - Descriptive statistics such as .info(), .describe(), .head() and .shape
  - Data visualization such as histograms, density plots, box plots, scatter matrix and correlation matrix

- Step 3: Data Preparation
  - Data cleaning such as handling missing values
  - Feature preparation and data transforms such as one-hot encoding

- Step 4: Evaluate Algorithm(s)
  - Split-out validation dataset
  - Test options and evaluation metric
  - Spot check and compare algorithms

- Step 5: Improve Algorithm(s)
  - Algorithm tuning
  - Compare selected algorithm against Ensembles

- Step 6: Model Finalization
  - Predictions on validation / test dataset
  - Save model for later use

## Problem Preparation

This step loads the necessary Python libraries and the dataset.

In [None]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import util
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
%matplotlib inline

# Load dataset
filepath = 'data/train.csv'
data = pd.read_csv(filepath)

## Data Exploration

In [None]:
# Displaying the first five records of the dataset
data.head()

In [None]:
# Displaying the dimensions of the dataset
print('Number of observations: {}'.format(data.shape[0]))
print('Number of attributes: {}'.format(data.shape[1]))

In [None]:
# Displaying detailed information about dataset
data.info()

In [None]:
data['AnimalType'].value_counts()

In [None]:
data['SexuponOutcome'].value_counts()

In [None]:
data['AgeuponOutcome'].value_counts()

In [None]:
data['Breed'].value_counts()

In [None]:
data['Color'].value_counts()

In [None]:
# Identify which observations are null for the AgeuponOutcome feature
data.AgeuponOutcome[data.AgeuponOutcome.isnull()]

In [None]:
# Identify which observation is null for the SexuponOutcome feature
data.SexuponOutcome[data.SexuponOutcome.isnull()]

In [None]:
# Display class distribution
# Leaving the expression unassigned lets the notebook display the result
data.groupby('OutcomeType').size()
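
Because the models below are scored on accuracy, it can help to know how far a trivial "always predict the most frequent class" model would get. The sketch below illustrates that calculation on a small toy frame; the `toy` data and its counts are made up for illustration and are not taken from the shelter dataset.

```python
import pandas as pd

# Toy stand-in for the shelter data (illustrative values only);
# in the notebook, `data` would be used instead of `toy`.
toy = pd.DataFrame({
    'OutcomeType': ['Adoption', 'Adoption', 'Adoption', 'Transfer',
                    'Transfer', 'Return to owner', 'Euthanasia', 'Died'],
})

# Share of each outcome class, sorted from most to least frequent
class_share = toy['OutcomeType'].value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the most frequent class
majority_baseline = class_share.iloc[0]
print('Majority-class baseline accuracy: {:.3f}'.format(majority_baseline))
```

Any model worth keeping should score above this baseline; with strongly imbalanced classes, accuracy alone can look better than it is.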

## Data Preparation

### Cat

In [None]:
# Create new dataframe
cat = data.copy()

# Narrow dataframe to 'Cat' only
cat = cat[cat['AnimalType'] == 'Cat']

# Drop observations from features with NaN
cat = cat.dropna(subset=['AgeuponOutcome', 'SexuponOutcome'])

# Filter out observations with 'Unknown'
cat = cat[cat.SexuponOutcome != 'Unknown']

# Cat names that occur fewer than 50 times are replaced with 'Known'
cat_names = cat['Name'].value_counts()
cat_names = cat_names[cat_names < 50]
cat_names = list(cat_names.index)
# Assign the result back; in-place replace on a selected column may not update the dataframe
cat['Name'] = cat['Name'].replace(to_replace=cat_names, value='Known')

# Cat names with NaN replaced with 'Unknown'
cat['Name'] = cat['Name'].fillna('Unknown')

# Split dataset into features and target variable
cat_y = cat[['OutcomeType']]
cat = cat.drop(['AnimalID', 'DateTime', 'OutcomeType', 'OutcomeSubtype'], axis=1)

# Convert age to number of days
cat['AgeuponOutcome'] = cat['AgeuponOutcome'].apply(util.convertAgeToDays)

# 'Mix' and '/' removed from Breed and Color features
cat['Breed'] = cat['Breed'].apply(util.getBreed)
cat['Color'] = cat['Color'].apply(util.getColor)

# Cat breeds that occur fewer than 50 times are replaced with 'Other'
cat_breeds = cat['Breed'].value_counts()
cat_breeds = cat_breeds[cat_breeds < 50]
cat_breeds = list(cat_breeds.index)
cat['Breed'] = cat['Breed'].replace(to_replace=cat_breeds, value='Other')

# Cat colors that occur fewer than 50 times are replaced with 'Other'
cat_colors = cat['Color'].value_counts()
cat_colors = cat_colors[cat_colors < 50]
cat_colors = list(cat_colors.index)
cat['Color'] = cat['Color'].replace(to_replace=cat_colors, value='Other')

# Scale AgeuponOutcome for Cats
scaler = MinMaxScaler()
cat_scaled = pd.DataFrame(data=cat)
numerical = ['AgeuponOutcome']
cat_scaled[numerical] = scaler.fit_transform(cat[numerical])

# Implement one-hot encoding for categorical features
cat_final = pd.get_dummies(cat_scaled)
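
The `util` module is not shown in this notebook, so the exact behavior of `util.convertAgeToDays`, `util.getBreed` and `util.getColor` is unknown. The sketch below is one plausible reading of the comments above, written under explicit assumptions: ages look like `'2 years'` or `'3 weeks'`, a month is approximated as 30 days and a year as 365, and only the first entry before a `/` is kept. The snake_case names are hypothetical and deliberately differ from the real helpers.

```python
def convert_age_to_days(age):
    """Convert a string such as '2 years' or '3 weeks' to a number of days.

    Assumption: the value is '<count> <unit>' with unit in day(s), week(s),
    month(s) or year(s); months and years are approximated.
    """
    days_per_unit = {'day': 1, 'week': 7, 'month': 30, 'year': 365}
    count, unit = age.split()
    unit = unit.rstrip('s')  # 'years' -> 'year'
    return int(count) * days_per_unit[unit]


def get_breed(breed):
    """Keep the first listed breed and drop a trailing ' Mix' (assumed behavior)."""
    first = breed.split('/')[0]
    return first.replace(' Mix', '').strip()


def get_color(color):
    """Keep the first listed color (assumed behavior)."""
    return color.split('/')[0].strip()


print(convert_age_to_days('2 years'))
print(get_breed('Domestic Shorthair Mix'))
print(get_color('Brown Tabby/White'))
```

Collapsing mixed breeds and two-tone colors to their first component reduces the number of one-hot columns, at the cost of discarding the secondary breed or color.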

### Dog

In [None]:
# Create new dataframe
dog = data.copy()

# Narrow dataframe to 'Dog' only
dog = dog[dog['AnimalType'] == 'Dog']

# Drop observations from features with NaN
dog = dog.dropna(subset=['AgeuponOutcome', 'SexuponOutcome'])

# Filter out observations with 'Unknown'
dog = dog[dog['SexuponOutcome'] != 'Unknown']

# Dog names that occur fewer than 200 times are replaced with 'Known'
dog_names = dog['Name'].value_counts()
dog_names = dog_names[dog_names < 200]
dog_names = list(dog_names.index)
# Assign the result back; in-place replace on a selected column may not update the dataframe
dog['Name'] = dog['Name'].replace(to_replace=dog_names, value='Known')

# Dog names with NaN replaced with 'Unknown'
dog['Name'] = dog['Name'].fillna('Unknown')

# Split dataset into features and target variable
dog_y = dog[['OutcomeType']]
dog = dog.drop(['AnimalID', 'DateTime', 'OutcomeType', 'OutcomeSubtype'], axis=1)

# Convert AgeuponOutcome to number of days
dog['AgeuponOutcome'] = dog['AgeuponOutcome'].apply(util.convertAgeToDays)

# 'Mix' and '/' removed from Breed and Color features
dog['Breed'] = dog['Breed'].apply(util.getBreed)
dog['Color'] = dog['Color'].apply(util.getColor)

# Dog breeds that occur fewer than 50 times are replaced with 'Other'
dog_breeds = dog['Breed'].value_counts()
dog_breeds = dog_breeds[dog_breeds < 50]
dog_breeds = list(dog_breeds.index)
dog['Breed'] = dog['Breed'].replace(to_replace=dog_breeds, value='Other')

# Dog colors that occur fewer than 50 times are replaced with 'Other'
dog_colors = dog['Color'].value_counts()
dog_colors = dog_colors[dog_colors < 50]
dog_colors = list(dog_colors.index)
dog['Color'] = dog['Color'].replace(to_replace=dog_colors, value='Other')

# Scale AgeuponOutcome for Dogs
scaler = MinMaxScaler()
dog_scaled = pd.DataFrame(data=dog)
numerical = ['AgeuponOutcome']
dog_scaled[numerical] = scaler.fit_transform(dog[numerical])

# Implement one-hot encoding for categorical features
dog_final = pd.get_dummies(dog_scaled)

### Animal (Cat and Dog Combined)

In [None]:
# Create new dataframe
animal = data.copy()

# Remove rows from dataset that have null for specified features
animal = animal.dropna(subset=['AgeuponOutcome', 'SexuponOutcome'])
animal = animal[animal.SexuponOutcome != 'Unknown']

# Animal names that occur fewer than 100 times are replaced with 'Known'
animal_names = animal['Name'].value_counts()
animal_names = animal_names[animal_names < 100]
animal_names = list(animal_names.index)
# Use animal_names here (not dog_names) and assign the result back
animal['Name'] = animal['Name'].replace(to_replace=animal_names, value='Known')

# Animal names with NaN replaced with 'Unknown'
animal['Name'] = animal['Name'].fillna('Unknown')

# Split dataset into features and target variable
animal_y = animal[['OutcomeType']]
animal = animal.drop(['AnimalID', 'DateTime', 'OutcomeType', 'OutcomeSubtype'], axis=1)

# Convert AgeuponOutcome to number of days
animal['AgeuponOutcome'] = animal['AgeuponOutcome'].apply(util.convertAgeToDays)

# 'Mix' and '/' removed from Breed and Color features
animal['Breed'] = animal['Breed'].apply(util.getBreed)
animal['Color'] = animal['Color'].apply(util.getColor)

# Animal breeds that occur fewer than 50 times are replaced with 'Other'
animal_breeds = animal['Breed'].value_counts()
animal_breeds = animal_breeds[animal_breeds < 50]
animal_breeds = list(animal_breeds.index)
animal['Breed'] = animal['Breed'].replace(to_replace=animal_breeds, value='Other')

# Animal colors that occur fewer than 50 times are replaced with 'Other'
animal_colors = animal['Color'].value_counts()
animal_colors = animal_colors[animal_colors < 50]
animal_colors = list(animal_colors.index)
animal['Color'] = animal['Color'].replace(to_replace=animal_colors, value='Other')

# Scale AgeuponOutcome
scaler = MinMaxScaler()
animal_scaled = pd.DataFrame(data=animal)
numerical = ['AgeuponOutcome']
animal_scaled[numerical] = scaler.fit_transform(animal[numerical])

# Implement one-hot encoding for categorical features
animal_final = pd.get_dummies(animal_scaled)
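
The cat, dog and combined cells repeat the same "replace rare values" step with different thresholds, which is where copy-paste slips are easy to make. One way to reduce that risk is a small helper; the sketch below is an illustration only, and `lump_rare` and the toy frame are hypothetical names and data that do not appear elsewhere in this notebook.

```python
import pandas as pd


def lump_rare(series, min_count, other_label):
    """Replace values that occur fewer than min_count times with other_label.

    Missing values are left untouched so they can be filled separately.
    """
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)


# Toy data for illustration
toy = pd.DataFrame({
    'Name': ['Max', 'Max', 'Max', 'Bella', None, 'Rex'],
    'Color': ['Black', 'Black', 'White', 'Black', 'Tan', 'Black'],
})

# Same order as the cells above: lump rare names first, then fill missing ones
toy['Name'] = lump_rare(toy['Name'], min_count=2, other_label='Known').fillna('Unknown')
toy['Color'] = lump_rare(toy['Color'], min_count=2, other_label='Other')
print(toy)
```

With a helper like this, each dataset needs only one line per feature, and the threshold is the only thing that varies between cats, dogs and the combined set.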

## Evaluate Algorithms

In [None]:
def modelSelection(final_x, final_y):
    
    # Split into train and test set
    X_train, X_test, y_train, y_test = train_test_split(final_x, final_y, test_size=0.40, random_state=42)

    # Spot-check algorithms
    models = []
    models.append(('LG', LogisticRegression(solver='liblinear', multi_class='ovr'))) # Benchmark model
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('XGB', XGBClassifier()))
    models.append(('MLP', MLPClassifier()))
    models.append(('SVM', SVC(gamma='auto')))

    results = []
    names = []

    for name, model in models:
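        # shuffle=True is needed for random_state to take effect; recent scikit-learn versions raise an error otherwise
        kfold = KFold(n_splits=5, shuffle=True, random_state=42)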
        cv_results = cross_val_score(model, X_train, y_train.values.ravel(), cv=kfold, scoring='accuracy')
        results.append(cv_results)
        names.append(name)
    
        print('{}: {}'.format(name, cv_results.mean()))

In [None]:
modelSelection(cat_final, cat_y)

In [None]:
modelSelection(dog_final, dog_y)

In [None]:
modelSelection(animal_final, animal_y)
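
In the preparation cells, `MinMaxScaler` is fit on the full dataset before any split, so the cross-validation folds share scaling information with the data they are evaluated on. A common alternative is to place the scaler inside a pipeline so that it is refit on the training part of each fold. The sketch below shows that pattern on synthetic data from `make_classification`, with logistic regression as the model; the data, the stratified folds and the `max_iter` value are assumptions for illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded features and outcome labels
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# The scaler is refit on the training portion of every fold
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class proportions similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
print('Accuracy: {:.3f} (+/- {:.3f})'.format(scores.mean(), scores.std()))
```

For a single min-max scaled column the effect of this leakage is probably small, but the pipeline form keeps the evaluation honest as more preprocessing is added.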

## Algorithm Improvement

In [None]:
# Cat
depth = [1, 2, 3, 4, 5, 6, 7, 8]
child_weight = [1, 2, 3]
sample = [0.25, 0.50, 0.75]
param_grid = dict(max_depth=depth, min_child_weight=child_weight, subsample=sample)
model = XGBClassifier()
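# shuffle=True is needed for random_state to take effect; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# The iid argument was removed from GridSearchCV in recent scikit-learn versions
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kfold)
grid_result = grid.fit(cat_final, cat_y.values.ravel())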

print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('%f, (%f) with: %r' % (mean, stdev, param))

In [None]:
# Dog
depth = [1, 2, 3, 4, 5, 6, 7, 8]
child_weight = [1, 2, 3]
sample = [0.25, 0.50, 0.75]
param_grid = dict(max_depth=depth, min_child_weight=child_weight, subsample=sample)
model = XGBClassifier()
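# shuffle=True is needed for random_state to take effect; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# The iid argument was removed from GridSearchCV in recent scikit-learn versions
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kfold)
grid_result = grid.fit(dog_final, dog_y.values.ravel())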

print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('%f, (%f) with: %r' % (mean, stdev, param))

In [None]:
# Animal
depth = [1, 2, 3, 4, 5, 6, 7, 8]
child_weight = [1, 2, 3]
sample = [0.25, 0.50, 0.75]
param_grid = dict(max_depth=depth, min_child_weight=child_weight, subsample=sample)
model = XGBClassifier()
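# shuffle=True is needed for random_state to take effect; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# The iid argument was removed from GridSearchCV in recent scikit-learn versions
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kfold)
grid_result = grid.fit(animal_final, animal_y.values.ravel())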

print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('%f, (%f) with: %r' % (mean, stdev, param))
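
The grid searches above are fit on the complete feature sets, so the best cross-validation score is not an estimate on unseen data. A variation is to tune on a training split only and then score the refit best model on the held-out part. The sketch below illustrates this with scikit-learn's `GradientBoostingClassifier` swapped in for `XGBClassifier` (to keep the example dependent on scikit-learn only), a much smaller grid, and synthetic data; all of these are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the encoded features and outcome labels
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# Hold out 40% of the data, mirroring the split used in modelSelection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42, stratify=y)

# Deliberately small grid so the example runs quickly
param_grid = {'max_depth': [1, 2, 3], 'subsample': [0.5, 0.75]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid = GridSearchCV(GradientBoostingClassifier(n_estimators=20, random_state=42),
                    param_grid=param_grid, scoring='accuracy', cv=cv)

# Tune on the training split only
grid.fit(X_train, y_train)

print('Best CV accuracy: {:.3f} using {}'.format(grid.best_score_, grid.best_params_))
# grid.score uses the best estimator, refit on the whole training split
print('Held-out accuracy: {:.3f}'.format(grid.score(X_test, y_test)))
```

The gap between the best cross-validation score and the held-out score gives a rough sense of how much the tuning has overfit to the folds.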

In [None]:
def modelEnsemble(final_x, final_y):
    
    # Split into train and test set
    X_train, X_test, y_train, y_test = train_test_split(final_x, final_y, test_size=0.40, random_state=42)

    # Ensembles
    ensembles = []
    ensembles.append(('AB', AdaBoostClassifier()))
    ensembles.append(('GBM', GradientBoostingClassifier()))
    ensembles.append(('RF', RandomForestClassifier(n_estimators=10)))
    ensembles.append(('ET', ExtraTreesClassifier(n_estimators=10)))

    results = []
    names = []

    for name, model in ensembles:
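        # shuffle=True is needed for random_state to take effect; recent scikit-learn versions raise an error otherwise
        kfold = KFold(n_splits=5, shuffle=True, random_state=42)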
        cv_results = cross_val_score(model, X_train, y_train.values.ravel(), cv=kfold, scoring='accuracy')
        results.append(cv_results)
        names.append(name)
    
        print('{}: {}'.format(name, cv_results.mean()))

In [None]:
modelEnsemble(cat_final, cat_y)

In [None]:
modelEnsemble(dog_final, dog_y)

In [None]:
modelEnsemble(animal_final, animal_y)
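
Step 6 of the workflow (predictions on the held-out set and saving the model) is listed in the outline but not yet shown in code. The sketch below illustrates one way that step might look, using a random forest on synthetic data and `joblib` for persistence; the model choice, the data, and the file name `final_model.joblib` are placeholders for illustration, not the project's final selection.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and outcome labels
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42)

# Fit the chosen model on the training split and predict on the held-out split
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print('Held-out accuracy: {:.3f}'.format(accuracy_score(y_test, predictions)))

# Save the fitted model to a temporary directory and load it back
model_path = os.path.join(tempfile.mkdtemp(), 'final_model.joblib')
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
print((restored.predict(X_test) == predictions).all())
```

In the notebook itself, the same pattern would apply to whichever tuned model is selected, with the scaler and encoded column layout saved alongside it so that new records can be prepared the same way.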