# Machine Learning Nanodegree Capstone Project

Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 

Approximately 2.7 million shelter animals are euthanized in the US every year.

This is a multi-class classification problem: a dataset of intake information (breed, color, sex, age, etc.) provided by the Austin Animal Center is used to train a supervised learning algorithm. The trained model is then used to predict the outcome (adoption, died, euthanasia, return to owner, or transfer) of future shelter animals.

Knowing the predicted outcomes can help shelters identify and understand trends in animal outcomes. Such insights could help shelters focus their resources on the animals that most need extra help finding a new home. For example, if the predicted outcome for a certain animal or breed is euthanasia, the shelter could direct its efforts toward finding that animal a home before it reaches that point.

I intend to follow the workflow outlined below as closely as possible:

- Step 1: Problem Preparation
  - Load libraries
  - Load dataset

- Step 2: Data Summarization
  - Descriptive statistics such as .info(), .describe(), .head() and .shape
  - Data visualization such as histograms, density plots, box plots, scatter matrix and correlation matrix

- Step 3: Data Preparation
  - Data cleaning such as handling missing values
  - Feature preparation and data transforms such as one-hot encoding

- Step 4: Evaluate Algorithm(s)
  - Split-out validation dataset
  - Test options and evaluation metric
  - Spot check and compare algorithms

- Step 5: Improve Algorithm(s)
  - Algorithm tuning
  - Compare selected algorithm against Ensembles

- Step 6: Model Finalization
  - Predictions on validation / test dataset
  - Save model for later use

## Problem Preparation

In this step, I load the necessary Python libraries and the dataset.

In [55]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import util
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
%matplotlib inline

# Load dataset
filepath = 'data/train.csv'
data = pd.read_csv(filepath)

## Data Summarization

In [None]:
# Displaying the first five records of the dataset
data.head()

In [None]:
# Displaying the dimensions of the dataset
print('Number of observations: {}'.format(data.shape[0]))
print('Number of attributes: {}'.format(data.shape[1]))

In [None]:
# Displaying detailed information about dataset
data.info()

In [None]:
# Identify which observations are null for the AgeuponOutcome feature
data.AgeuponOutcome[data.AgeuponOutcome.isnull()]

In [None]:
# Identify which observation is null for the SexuponOutcome feature
data.SexuponOutcome[data.SexuponOutcome.isnull()]

In [None]:
# Display class distribution
data.groupby('OutcomeType').size()

## Data Preparation

In [56]:
# Remove rows from dataset that have null for specified features
data = data.dropna(subset=['AgeuponOutcome', 'SexuponOutcome'])

In [None]:
# Split dataset into features and target variable
# (select the target as a 1-D Series, the shape scikit-learn estimators expect for y)
y = data['OutcomeType']
data = data.drop(['AnimalID', 'Name', 'DateTime', 'OutcomeType', 'OutcomeSubtype'], axis=1)

In [None]:
# Convert AgeuponOutcome to number of days
data.AgeuponOutcome = data.AgeuponOutcome.apply(util.convertAgeToDays)
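
The `util.convertAgeToDays` helper lives in a local module that is not shown here. As a rough illustration only, a conversion of this kind could look like the sketch below. It assumes the age strings have the form `'<number> <unit>'` (for example `'2 years'`), and it assumes approximate unit lengths of 30 days per month and 365 days per year. The name `convert_age_to_days` is hypothetical and is not the project's actual helper.

```python
# Hypothetical stand-in for util.convertAgeToDays (the real helper is not shown).
# Assumed unit lengths: 30-day months and 365-day years.
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}


def convert_age_to_days(age):
    """Convert a string such as '2 years' or '3 weeks' to a number of days."""
    count, unit = age.split()
    unit = unit.rstrip('s')  # normalize plural units: 'years' -> 'year'
    return int(count) * DAYS_PER_UNIT[unit]


print(convert_age_to_days('2 years'))
print(convert_age_to_days('3 weeks'))
print(convert_age_to_days('1 month'))
```

Because rows with a null `AgeuponOutcome` were dropped earlier, a helper like this would only ever receive strings.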

In [57]:
# Scale AgeuponOutcome
scaler = MinMaxScaler()
data_scaled = data.copy()  # work on a copy so the unscaled frame is left untouched
numerical = ['AgeuponOutcome']

data_scaled[numerical] = scaler.fit_transform(data[numerical])
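
For intuition, min-max scaling maps each value linearly so that the smallest becomes 0 and the largest becomes 1. The sketch below reproduces that arithmetic in plain Python on a few made-up ages. The function name `min_max_scale` and the sample values are illustrative assumptions, not part of the project code.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; return zeros rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages_in_days = [30, 365, 730, 3650]  # toy ages, not taken from the dataset
scaled = min_max_scale(ages_in_days)
print(scaled)
```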

In [58]:
# Implement one-hot encoding for categorical features
features_final = pd.get_dummies(data_scaled)
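
To make the effect of one-hot encoding concrete, the sketch below applies `pd.get_dummies` to a tiny hand-made frame. The three rows are invented for illustration. The numeric column passes through unchanged, while the categorical column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values are made up).
toy = pd.DataFrame({
    'AgeuponOutcome': [0.1, 0.5, 0.9],
    'SexuponOutcome': ['Neutered Male', 'Spayed Female', 'Neutered Male'],
})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded)
```

A high-cardinality column such as breed or color would produce one new column per distinct value in the same way, so the encoded frame can become very wide.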

## Evaluate Algorithm(s)

In [59]:
# Split into train and test set
X_train, X_test, y_train, y_test = train_test_split(features_final, y, test_size=0.35, random_state=42)
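
If the outcome classes turn out to be imbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in the train and test sets. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels. The 60/40 label mix and the variable names are assumptions made for illustration, and this is not the split used above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with a 60/40 class mix (illustrative only).
X = [[i] for i in range(100)]
y = ['Adoption'] * 60 + ['Transfer'] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.35, random_state=42, stratify=y)

# With stratification the 35 test samples keep the 60/40 ratio.
test_counts = Counter(y_te)
print(test_counts)
```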

In [None]:
# Spot-check algorithms
models = []
models.append(('LG', LogisticRegression())) # Benchmark model 
models.append(('CART', DecisionTreeClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

results = []
names = []

for name, model in models:
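    # random_state only takes effect (and is only accepted by recent scikit-learn) when shuffle=True
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)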
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    
    print('{}: {}'.format(name, cv_results.mean()))
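
The loop above collects per-fold scores in `results` and the model labels in `names`. One possible way to compare the algorithms is to rank them by mean accuracy and report the spread across folds alongside it, as sketched below. The helper `rank_models` is hypothetical, and the fold scores in the demo are placeholders rather than results from this project.

```python
from statistics import mean, pstdev


def rank_models(names, results):
    """Return (name, mean score, std of scores) tuples, best mean first."""
    rows = [(name, mean(scores), pstdev(scores))
            for name, scores in zip(names, results)]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder fold scores for two imaginary models (NOT project results).
demo_names = ['A', 'B']
demo_results = [[0.50, 0.60, 0.70], [0.80, 0.80, 0.80]]

ranking = rank_models(demo_names, demo_results)
for name, avg, spread in ranking:
    print('{}: {:.3f} (+/- {:.3f})'.format(name, avg, spread))
```

In the notebook, the same helper could be called as `rank_models(names, results)` once the spot-check loop has finished.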