# CS 363D Final Project: Predicting Adoption in Austin Animal Center

## Mrityunjay Mishra, Rohit Neppalli, Ziyi Zhao, Justin Leong 

### Project Description

Many different types of animals (from dogs to livestock) are taken in by the Austin Animal Center each year for various reasons. Some are adopted, some are transferred, and some are euthanized. The goal of this project is to predict whether an animal taken in by the Austin Animal Center will be adopted. This could be useful to the Austin Animal Center: they could use our findings to estimate the probability of adoption for an incoming animal and care for it accordingly. Other organizations could use our findings to identify animals with a low probability of adoption and give them extra attention. In the end, we hope that our findings provide insight into adoption patterns in Austin and, consequently, help different organizations take care of these animals appropriately.

In [10]:
# Import libraries
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Enable inline mode for matplotlib so that IPython displays graphs.
%matplotlib inline

### Dataset

To develop our classifier(s), we use the animal intake and outcome data from the City of Austin open data portal. To find out more about the intake data, [click here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm). To find out more about the outcome data, [click here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238).

In [11]:
# Intakes data
intakes_df = pd.read_csv('Austin_Animal_Center_Intakes.csv')
intakes_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,06/29/2014 10:38:00 AM,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [12]:
# Outcomes data
outcomes_df = pd.read_csv('Austin_Animal_Center_Outcomes.csv')
outcomes_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,May 2019,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,Jul 2018,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A821648,,08/16/2020 11:38:00 AM,Aug 2020,08/16/2019,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
3,A720371,Moose,02/13/2016 05:59:00 PM,Feb 2016,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
4,A674754,,03/18/2014 11:47:00 AM,Mar 2014,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby


In [None]:
# Basic data prep / cleaning: drop unused columns and integer-encode the categorical ones.
# Note: pd.factorize encodes missing values as -1, so the dropna() calls below
# only remove rows with missing values in columns that were not factorized.
outcomes_df = outcomes_df.drop(columns=['Name', 'Date of Birth', 'Outcome Subtype', 'DateTime'])
cat_columns = ['Outcome Type', 'Animal Type', 'Sex upon Outcome', 'Breed', 'Color']
outcomes_df[cat_columns] = outcomes_df[cat_columns].apply(lambda x: pd.factorize(x)[0])
outcomes_df.dropna(inplace=True)

intakes_df = intakes_df.drop(columns=['Name', 'DateTime', 'Found Location'])
cat_columns = ['Intake Type', 'Intake Condition', 'Animal Type', 'Sex upon Intake', 'Breed', 'Color']
intakes_df[cat_columns] = intakes_df[cat_columns].apply(lambda x: pd.factorize(x)[0])
intakes_df.dropna(inplace=True)

def standardize_age(age):
  # Convert an age string such as '2 years' or '11 months' to whole years.
  # Ages reported in months, weeks, or days are treated as 0 years.
  if 'year' in age:
    return int(age.split(' ')[0])
  return 0

# Outcomes abbreviate month names ('Jul 2018'); intakes spell them out ('July 2015').
OUTCOME_MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
INTAKE_MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Split 'MonthYear' into a numeric month (1-12) and an integer year.
outcomes_df['Outcome Age (Years)'] = outcomes_df['Age upon Outcome'].apply(standardize_age)
outcomes_df['Outcome Month'] = outcomes_df['MonthYear'].apply(lambda x: OUTCOME_MONTHS.index(x.split(' ')[0]) + 1)
outcomes_df['Outcome Year'] = outcomes_df['MonthYear'].apply(lambda x: int(x.split(' ')[1]))
outcomes_df = outcomes_df.drop(columns=['Age upon Outcome', 'MonthYear'])

intakes_df['Intake Age (Years)'] = intakes_df['Age upon Intake'].apply(standardize_age)
intakes_df['Intake Month'] = intakes_df['MonthYear'].apply(lambda x: INTAKE_MONTHS.index(x.split(' ')[0]) + 1)
intakes_df['Intake Year'] = intakes_df['MonthYear'].apply(lambda x: int(x.split(' ')[1]))
intakes_df = intakes_df.drop(columns=['Age upon Intake', 'MonthYear'])

# Breed has many distinct values, so it may not be a useful feature for classification

intakes_df

Unnamed: 0,Animal ID,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Breed,Color,Intake Age (Years),Intake Month,Intake Year
0,A786884,0,0,0,0,0,0,2,1,2019
1,A706918,0,0,0,1,1,1,8,7,2015
2,A724273,0,0,0,2,2,2,0,4,2016
3,A665644,0,1,1,3,3,3,0,10,2013
4,A682524,0,0,0,0,4,4,4,6,2014
...,...,...,...,...,...,...,...,...,...,...
138232,A855357,0,0,2,4,121,104,0,4,2022
138233,A855352,0,0,0,2,295,0,4,4,2022
138234,A855359,0,0,1,2,7,12,0,4,2022
138235,A833526,2,0,0,1,9,16,2,4,2022
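
`standardize_age` maps every age reported in months, weeks, or days to 0, so a 4-week-old kitten and an 11-month-old dog become indistinguishable. A minimal sketch of a finer conversion to fractional years follows. The helper name `age_to_years` and the approximate days-per-unit table are assumptions for illustration, not part of this notebook, and the sketch assumes age strings of the form `'<number> <unit>'` as in the sample rows; any other unit raises a `KeyError`.

```python
# Hypothetical helper (not in the original notebook): convert age strings
# such as '4 weeks' or '11 months' to fractional years.
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}  # approximate lengths

def age_to_years(age):
    count, unit = age.split(' ')[:2]
    unit = unit.rstrip('s')  # 'years' -> 'year', 'months' -> 'month'
    return int(count) * DAYS_PER_UNIT[unit] / 365

ages = ['2 years', '11 months', '4 weeks', '6 days', '1 year']
years = [round(age_to_years(a), 3) for a in ages]
print(years)
```

It could be applied in the same way as the existing helper, for example with `intakes_df['Age upon Intake'].apply(age_to_years)`.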


In [None]:
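# Combine intakes and outcomes into a single dataframe.
# Note: Animal ID can repeat (the merged frame has more rows than intakes_df),
# so this merge pairs every intake with every outcome that shares the same ID.
combined_df = intakes_df.merge(outcomes_df[['Animal ID', 'Outcome Type', 'Sex upon Outcome', 'Outcome Age (Years)', 'Outcome Month', 'Outcome Year']], on='Animal ID')
combined_df = combined_df.drop('Animal ID', axis=1)
combined_df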

Unnamed: 0,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Breed,Color,Intake Age (Years),Intake Month,Intake Year,Outcome Type,Sex upon Outcome,Outcome Age (Years),Outcome Month,Outcome Year
0,0,0,0,0,0,0,2,1,2019,3,0,2,1,2019
1,0,0,0,1,1,1,8,7,2015,4,3,8,7,2015
2,0,0,0,2,2,2,0,4,2016,4,0,1,4,2016
3,0,1,1,3,3,3,0,10,2013,3,4,0,10,2013
4,0,0,0,0,4,4,4,6,2014,4,0,4,7,2014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178439,3,1,2,4,37,28,2,4,2022,2,1,2,4,2022
178440,1,0,0,2,136,8,3,8,2021,4,2,2,12,2019
178441,0,0,0,2,136,8,2,12,2019,4,2,2,12,2019
178442,0,0,0,1,182,148,1,4,2022,0,3,1,1,2022
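
The merged frame has more rows than `intakes_df`, and at least one row pairs a 2021 intake with a 2019 outcome, which suggests that `Animal ID` repeats and the merge is many-to-many. The sketch below reproduces the effect on toy data and shows one possible remedy: numbering each animal's visits with `groupby(...).cumcount()` and merging on both keys. The toy frames and the `visit` column are hypothetical, and the remedy assumes both tables are in chronological order so that the k-th intake matches the k-th outcome.

```python
import pandas as pd

# Toy data (hypothetical): animal A1 was taken in and released twice.
intakes = pd.DataFrame({'Animal ID': ['A1', 'A1', 'A2'],
                        'Intake Year': [2019, 2021, 2020]})
outcomes = pd.DataFrame({'Animal ID': ['A1', 'A1', 'A2'],
                         'Outcome Year': [2019, 2021, 2020]})

# Merging on Animal ID alone pairs every intake of A1 with every outcome of A1.
naive = intakes.merge(outcomes, on='Animal ID')
print(len(naive))  # 5 rows: 2 x 2 for A1, plus 1 for A2

# Number each animal's visits in order of appearance, then merge on both keys.
intakes['visit'] = intakes.groupby('Animal ID').cumcount()
outcomes['visit'] = outcomes.groupby('Animal ID').cumcount()
paired = intakes.merge(outcomes, on=['Animal ID', 'visit'])
print(len(paired))  # 3 rows: one per visit
```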


In [None]:
# Try out some classification approaches, with outcome type as the label
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(criterion='entropy')
grid_params = {
    'max_depth': [5, 8, 13], 
    'min_samples_leaf': [5, 10, 15, 20], 
    'max_features': [5, 8, 13]
}
grid_search = GridSearchCV(decision_tree, grid_params, cv=5, scoring='accuracy')
grid_search.fit(combined_df.drop('Outcome Type', axis=1), combined_df['Outcome Type'])
print(grid_search.best_params_)
print(grid_search.best_score_)

{'max_depth': 13, 'max_features': 13, 'min_samples_leaf': 20}
0.7189875917466759
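
The grid search above treats each outcome type as its own class, while the stated goal is to predict whether an animal is adopted. One way to set that up, sketched here under assumptions, is to derive a binary label from the raw `Outcome Type` strings before they are factorized. The column name `Adopted` is hypothetical, the sample values are taken from the `outcomes_df.head()` output, and counting only `'Adoption'` (not `'Rto-Adopt'`) as adopted is a modelling choice that this notebook does not specify.

```python
import pandas as pd

# Outcome types from the first five rows of the outcomes data shown earlier.
sample = pd.DataFrame({'Outcome Type': ['Rto-Adopt', 'Adoption', 'Euthanasia',
                                        'Adoption', 'Transfer']})

# 1 if the animal was adopted, 0 for every other outcome (including missing values).
sample['Adopted'] = sample['Outcome Type'].eq('Adoption').astype(int)
print(sample['Adopted'].tolist())  # [0, 1, 0, 1, 0]
```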


In [None]:
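from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Passing grid_search as the estimator re-runs the full grid search inside each
# outer fold (nested cross-validation), so this cell can be slow.
predictions = cross_val_predict(grid_search, combined_df.drop('Outcome Type', axis=1), combined_df['Outcome Type'])
# Rows are true labels and columns are predicted labels, in sorted label order (-1, 0, ..., 8).
print(confusion_matrix(combined_df['Outcome Type'], predictions))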

[[    0     0    15     4     3    11     0     1     0     0]
 [    0    14   967     3    65   586     1     0     0     0]
 [    0    40 70736   141  3427  8609     2     5     0     0]
 [    0     0   855  5990  1716   865     7    44     0     0]
 [    0     5 12557   690 24426  6055    12    15     0     0]
 [    0    14  7749   203  3427 27081     1     2     0     0]
 [    0     1   174   259   801    88     9     8     0     0]
 [    0     0    14   454    94    21     0    54     0     0]
 [    0     0    23     2    35    39     0     0     0     0]
 [    0     0     5    14     5     0     0     0     0     0]]
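
Overall accuracy hides how unevenly the classes are handled. The sketch below copies the confusion matrix printed above and computes per-class recall (diagonal divided by row sum), the overall accuracy, and the share of the largest class as a majority-class baseline. It assumes rows are true labels ordered by the sorted label values -1 through 8, which is how `confusion_matrix` orders labels by default.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [0,  0,    15,    4,     3,    11,  0,  1, 0, 0],
    [0, 14,   967,    3,    65,   586,  1,  0, 0, 0],
    [0, 40, 70736,  141,  3427,  8609,  2,  5, 0, 0],
    [0,  0,   855, 5990,  1716,   865,  7, 44, 0, 0],
    [0,  5, 12557,  690, 24426,  6055, 12, 15, 0, 0],
    [0, 14,  7749,  203,  3427, 27081,  1,  2, 0, 0],
    [0,  1,   174,  259,   801,    88,  9,  8, 0, 0],
    [0,  0,    14,  454,    94,    21,  0, 54, 0, 0],
    [0,  0,    23,    2,    35,    39,  0,  0, 0, 0],
    [0,  0,     5,   14,     5,     0,  0,  0, 0, 0],
])
labels = list(range(-1, 9))       # assumed label order: -1, 0, ..., 8

support = cm.sum(axis=1)          # number of true samples per class
recall = np.diag(cm) / support    # fraction of each class predicted correctly
for label, n, r in zip(labels, support, recall):
    print(f'label {label:>2}: support {n:>6}, recall {r:.3f}')

accuracy = np.trace(cm) / cm.sum()
baseline = support.max() / cm.sum()   # always predicting the largest class
print(f'accuracy {accuracy:.3f}, majority-class baseline {baseline:.3f}')
```

On these numbers the tree is well above the majority-class baseline overall, but several of the small classes have recall at or near zero.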


In [None]:
combined_df['Outcome Type'].value_counts()

 1    82960
 3    43760
 4    38477
 2     9477
 0     1636
 5     1340
 6      637
 7       99
-1       34
 8       24
Name: Outcome Type, dtype: int64
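
The `-1` entry in the counts above is not a real outcome type: `pd.factorize` encodes missing values as -1, and because the cleaning cell factorizes before calling `dropna`, those rows are never dropped. A minimal sketch on toy data (the values are hypothetical) shows the behaviour and one way to avoid it, dropping missing labels first.

```python
import numpy as np
import pandas as pd

outcome = pd.Series(['Adoption', np.nan, 'Transfer', 'Adoption'])

# factorize assigns -1 to missing values, so a later dropna() no longer sees them.
codes, uniques = pd.factorize(outcome)
print(codes.tolist())  # [0, -1, 1, 0]

# Dropping missing values before factorizing keeps -1 out of the labels.
clean_codes, _ = pd.factorize(outcome.dropna())
print(clean_codes.tolist())  # [0, 1, 0]
```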