In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")

In [2]:
%matplotlib inline

# Purpose of Our Project

The purpose of this project is to predict the outcomes of animals that enter the Austin Animal Shelter by using labeled, historical data to create a classification model.

# Data Cleaning, Exploration, and Feature Engineering

We started by cleaning some of the data directly in Excel. 

* We removed the ID column and the Name column.
    * ID clearly should have no effect on the animal's outcome because it is just a made up ID.
    * Name could have an affect if, for example, all animals with a certain name are consistenly adopted. However, because there is no way to know when these names were given to the animals, we can't know whether the names were even in place long enough to realistically have an impact on the outcome.


* We removed the original Age Upon Outcome column and replaced it with a new Age Upon Outcome column. 
    * The new column was calculated by subtracting Date of Birth from MonthYear to get the new Age Upon Outcome. We did this because we observed that the original Age Upon Outcome was just a rounded version of MonthYear minus Date of Birth, so using MonthYear minus Date of Birth without rounding provided us with a more accurate age of the animal. 
    * Because in Excel MonthYear minus Date of Birth returns the age of the animal in days, we divided this number by 365 to get the age of the animal in years. 


* We removed the DateTime column because it was exactly the same as the MonthYear column. 


* We created four binary columns from the Outcome Subtype column. We then dropped the Outcome Subtype column.
    * We created the TNR binary column which stands for Trap, Neuter, Release. For this column, any stray animal that was trapped, neutered, and released back to the wild has a value of 1 in the TNR column.
    * We created the Suffering column. For this column, any animal that was suffering at the shelter is marked with a 1. 
    * We created the Aggressive column. For this column, any aggressive animal is marked with a 1.
    * We created the Rabies Risk column. For this column, any animal that had a risk of rabies is marked with a 1.


We then read in our data and explored the shape of the data and the number of null attributes in the data.



In [3]:
#If these values are in a row, count them as a null value
missing_values = ["na", "--", "", "n/a", "NA", "na", "Unknown", "unknown", "NULL", "null"]

#Read in the data with the specified missing values
data = pd.read_csv("Austin_Animal_Center_Outcomes EDITED V3 11.22.2019.csv", na_values=missing_values)


print("\n\nData shape: ", data.shape)
print("\n\nFirst 10 rows of data: ")
display(data.head(10))

#Print the sum of null values in each column
print("\n\nSum of null values in each column: \n", data.isnull().sum())

#Display the rows for which at least one attribute of the row is null
print("\n\nThese are some of the rows that contain null attributes: ")
display(data[data.isnull().any(axis=1)])



Data shape:  (111649, 12)


First 10 rows of data: 


Unnamed: 0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies
0,2/17/2019 11:44,2/13/2017,Adoption,Dog,Neutered Male,Chihuahua Shorthair Mix,Cream,2.012298,0,0,0,0
1,2/13/2016 17:59,10/8/2015,Adoption,Dog,Neutered Male,Anatol Shepherd/Labrador Retriever,Buff,0.352738,0,0,0,0
2,3/18/2014 11:47,3/12/2014,Transfer,Cat,Intact Male,Domestic Shorthair Mix,Orange Tabby,0.017783,0,0,0,0
3,10/18/2014 18:52,8/1/2014,Adoption,Cat,Neutered Male,Domestic Shorthair Mix,Black,0.215852,0,0,0,0
4,8/5/2014 16:59,6/3/2014,Adoption,Cat,Neutered Male,Domestic Shorthair Mix,White/Orange Tabby,0.174541,0,0,0,0
5,7/27/2014 9:00,7/26/2012,Transfer,Cat,Intact Female,Domestic Shorthair Mix,Black,2.003767,1,0,0,0
6,1/22/2017 11:56,1/20/2010,Return to Owner,Cat,Neutered Male,Domestic Shorthair Mix,Blue/White,7.012321,0,0,0,0
7,6/11/2014 17:11,6/9/2014,Transfer,Cat,Intact Male,Domestic Shorthair Mix,Brown Tabby,0.007441,0,0,0,0
8,3/16/2015 14:50,6/5/2014,Transfer,Cat,Spayed Female,Domestic Medium Hair Mix,Black/White,0.779775,0,0,0,0
9,3/10/2019 12:25,2/13/2017,Adoption,Dog,Neutered Male,Chihuahua Shorthair Mix,Cream,2.069911,0,0,0,0




Sum of null values in each column: 
 MonthYear              0
Date of Birth          0
Outcome Type           6
Animal Type            0
Sex upon Outcome    9275
Breed                  0
Color                  0
Age Upon Outcome       0
TNR                    0
Suffering              0
Aggressive             0
Rabies                 0
dtype: int64


These are some of the rows that contain null attributes: 


Unnamed: 0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies
10,5/3/2016 12:39,4/15/2016,Euthanasia,Other,,Opossum,Gray,0.050759,0,0,0,0
23,4/3/2015 16:16,4/3/2013,Euthanasia,Cat,,Domestic Shorthair Mix,Gray,2.001857,0,1,0,0
29,7/5/2016 12:47,7/5/2015,Euthanasia,Other,,Bat,Black/Brown,1.004199,0,0,0,1
51,9/1/2016 8:05,8/31/2015,Euthanasia,Other,,Bat Mix,Brown,1.006402,0,0,0,1
52,8/7/2015 8:35,8/6/2014,Euthanasia,Other,,Bat Mix,Black,1.003720,0,0,0,1
56,5/26/2016 18:09,5/26/2014,Euthanasia,Other,,Bat Mix,Brown/Black,2.004812,0,0,0,1
70,3/29/2014 8:42,3/1/2014,Euthanasia,Other,,Bat,Brown/Black,0.077705,0,0,0,1
81,9/8/2014 18:53,9/3/2014,Transfer,Cat,,Domestic Shorthair Mix,Blue Tabby,0.015854,0,0,0,0
95,3/4/2015 9:00,11/18/2014,Transfer,Cat,,Siamese Mix,Lynx Point,0.291438,1,0,0,0
105,5/16/2019 15:54,5/16/2019,Euthanasia,Cat,,Domestic Shorthair,White/Black,0.001815,0,0,0,0


## Possible Attribute Values

We discovered the possible attribute values for the Outcome Type column, the Animal Type column, and the Sex Upon Outcome column. We did this using the groupby method. There are so many attribute values for Breed and Color that they are not included here. Later in the code we will condense the potential attributes for Breed and Color.

The possible outcome types: Adoption, Died, Disposal, Euthanasia, Missing, Relocate, Return to Owner, Rto-Adopt and Transfer.

The possible animal types: Bird, Cat, Dog, Livestock, and Other.

The possible categories for sex upon outcome: Intact Female, Intact Male, Neutered Male, Spayed Female, and Unknown.



In [4]:
print(data.groupby('Outcome Type').count())
print(data.groupby('Animal Type').count())
print(data.groupby('Sex upon Outcome').count())

                 MonthYear  Date of Birth  Animal Type  Sex upon Outcome  \
Outcome Type                                                               
Adoption             48734          48734        48734             48612   
Died                  1040           1040         1040               736   
Disposal               443            443          443                64   
Euthanasia            7572           7572         7572              3279   
Missing                 66             66           66                62   
Relocate                20             20           20                 1   
Return to Owner      19738          19738        19738             19600   
Rto-Adopt              517            517          517               516   
Transfer             33513          33513        33513             29504   

                 Breed  Color  Age Upon Outcome    TNR  Suffering  Aggressive  \
Outcome Type                                                                    
A

## Dropping Rows with Certain Attributes

We decided to only try to classify dogs and cats, so in the below code we drop all rows that have an animal type attribute other than Dog or Cat. 

We also are not interested in classifying dogs and cats into Died, Disposal, Missing, or Rto-adopt. There is not enough information about the dataset to fully determine what these outcomes mean, and there is only a relatively small number of animals with these outcomes. So, in the code below we dropped any row with an outcome type attribute of Died, Disposal, Missing, or Rto-adopt.

In [5]:
# Get names of indexes where animal type = Bird, Livestock, or Other
# Delete these row indexes from data
indexNames = data[ data['Animal Type'] == 'Bird'].index
    
data.drop(indexNames , inplace=True)

indexNames = data[ data['Animal Type'] == 'Livestock'].index
    
data.drop(indexNames , inplace=True)

indexNames = data[ data['Animal Type'] == 'Other'].index
    
data.drop(indexNames , inplace=True)

#Prove we dropped all animals besides cats and dogs
print("\nWe only want to classify outcomes for dogs or cats, so we dropped all other animal types")
display(data.groupby('Animal Type').count())

#Check how the outcome types changed
print("\nThis is how the outcome counts changed when looking at just cats and dogs")
display(data.groupby('Outcome Type').count())


# Get names of indexes where outcome type = Died, Disposal, Missing, Rto-Adopt
# Delete these row indexes from dataFrame
indexNames = data[ data['Outcome Type'] == 'Died'].index
    
data.drop(indexNames , inplace=True)

indexNames = data[ data['Outcome Type'] == 'Disposal'].index
    
data.drop(indexNames , inplace=True)

indexNames = data[ data['Outcome Type'] == 'Missing'].index
    
data.drop(indexNames , inplace=True)

indexNames = data[ data['Outcome Type'] == 'Rto-Adopt'].index
    
data.drop(indexNames , inplace=True)


#Prove we dropped outcome types
print("\nWe only want to classify into the Adoption, Euthanasia, Return to Owner, and Transfer outcomes, \
so we dropped all other outome types")
display(data.groupby('Outcome Type').count())

print("\n\nThis is now the shape of our dataset")
print(data.shape)

#Print the sum of null values in each column
print("\n\nSum of null values in each column now: \n", data.isnull().sum())

#Display the rows for which at least one attribute of the row is null
print("\n\nThese are the rows that contain null attributes now: ")
display(data[data.isnull().any(axis=1)])




We only want to classify outcomes for dogs or cats, so we dropped all other animal types


Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies
Animal Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Cat,42191,42191,42191,38491,42191,42191,42191,42191,42191,42191,42191
Dog,63165,63165,63163,62746,63165,63165,63165,63165,63165,63165,63165



This is how the outcome counts changed when looking at just cats and dogs


Unnamed: 0_level_0,MonthYear,Date of Birth,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies
Outcome Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Adoption,48135,48135,48135,48132,48135,48135,48135,48135,48135,48135,48135
Died,849,849,849,725,849,849,849,849,849,849,849
Disposal,64,64,64,41,64,64,64,64,64,64,64
Euthanasia,3464,3464,3464,3128,3464,3464,3464,3464,3464,3464,3464
Missing,63,63,63,61,63,63,63,63,63,63,63
Return to Owner,19668,19668,19668,19561,19668,19668,19668,19668,19668,19668,19668
Rto-Adopt,517,517,517,516,517,517,517,517,517,517,517
Transfer,32594,32594,32594,29073,32594,32594,32594,32594,32594,32594,32594



We only want to classify into the Adoption, Euthanasia, Return to Owner, and Transfer outcomes, so we dropped all other outome types


Unnamed: 0_level_0,MonthYear,Date of Birth,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies
Outcome Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Adoption,48135,48135,48135,48132,48135,48135,48135,48135,48135,48135,48135
Euthanasia,3464,3464,3464,3128,3464,3464,3464,3464,3464,3464,3464
Return to Owner,19668,19668,19668,19561,19668,19668,19668,19668,19668,19668,19668
Transfer,32594,32594,32594,29073,32594,32594,32594,32594,32594,32594,32594




This is now the shape of our dataset
(103863, 12)


Sum of null values in each column now: 
 MonthYear              0
Date of Birth          0
Outcome Type           2
Animal Type            0
Sex upon Outcome    3969
Breed                  0
Color                  0
Age Upon Outcome       0
TNR                    0
Suffering              0
Aggressive             0
Rabies                 0
dtype: int64


These are the rows that contain null attributes now: 


Unnamed: 0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies
23,4/3/2015 16:16,4/3/2013,Euthanasia,Cat,,Domestic Shorthair Mix,Gray,2.001857,0,1,0,0
81,9/8/2014 18:53,9/3/2014,Transfer,Cat,,Domestic Shorthair Mix,Blue Tabby,0.015854,0,0,0,0
95,3/4/2015 9:00,11/18/2014,Transfer,Cat,,Siamese Mix,Lynx Point,0.291438,1,0,0,0
105,5/16/2019 15:54,5/16/2019,Euthanasia,Cat,,Domestic Shorthair,White/Black,0.001815,0,0,0,0
149,10/7/2016 9:42,10/7/2014,Euthanasia,Cat,,Domestic Shorthair Mix,Black,2.003847,0,1,0,0
198,5/16/2019 18:16,5/16/2019,Transfer,Cat,,Domestic Shorthair,White/Black,0.002085,0,0,0,0
214,5/16/2019 19:46,5/12/2018,Transfer,Cat,,Domestic Shorthair,Brown Tabby,1.013215,1,0,0,0
235,4/11/2014 12:55,9/10/2013,Transfer,Cat,,Domestic Shorthair Mix,Black/White,0.585036,1,0,0,0
295,10/2/2016 17:37,9/26/2016,Transfer,Cat,,Domestic Medium Hair Mix,Black/White,0.018449,0,0,0,0
304,10/16/2015 12:42,10/1/2015,Transfer,Cat,,Domestic Medium Hair Mix,Black,0.042546,0,0,0,0


## Dealing with Null Values

Because there are only two rows with Outcome Types that are null, we decided to drop these.

We then replaced all the null values for Sex Upon Outcome with the most common Sex Upon Outcome for each row's respective Animal Type.

In [6]:
print("Because there are only two rows with outcome types that are null, we decided to drop these.")
indexeNames = data[data['Outcome Type'].isnull()]
data.dropna(subset = ['Outcome Type'], inplace = True)

print("\n\nThis is now the shape of our dataset")
print(data.shape)

#Print the sum of null values in each column
print("\n\nSum of null values in each column now: \n", data.isnull().sum())

#Display the rows for which at least one attribute of the row is null
print("\n\nThese are the rows that contain null attributes now: ")
display(data[data.isnull().any(axis=1)])


print("\n\nWhat is the most common sex upon outcome for cats and dogs?")
print(data.groupby('Animal Type')['Sex upon Outcome'].agg(pd.Series.mode))
print("\n\nWe are going to replace the null sex upon outcome values with the most common sex for cats or dogs,\
so Neutered Malein either case")
data.fillna('Neutered Male', inplace = True)


#Print the sum of null values in each column
print("\n\nSum of null values in each column now: \n", data.isnull().sum())

Because there are only two rows with outcome types that are null, we decided to drop these.


This is now the shape of our dataset
(103861, 12)


Sum of null values in each column now: 
 MonthYear              0
Date of Birth          0
Outcome Type           0
Animal Type            0
Sex upon Outcome    3967
Breed                  0
Color                  0
Age Upon Outcome       0
TNR                    0
Suffering              0
Aggressive             0
Rabies                 0
dtype: int64


These are the rows that contain null attributes now: 


Unnamed: 0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies
23,4/3/2015 16:16,4/3/2013,Euthanasia,Cat,,Domestic Shorthair Mix,Gray,2.001857,0,1,0,0
81,9/8/2014 18:53,9/3/2014,Transfer,Cat,,Domestic Shorthair Mix,Blue Tabby,0.015854,0,0,0,0
95,3/4/2015 9:00,11/18/2014,Transfer,Cat,,Siamese Mix,Lynx Point,0.291438,1,0,0,0
105,5/16/2019 15:54,5/16/2019,Euthanasia,Cat,,Domestic Shorthair,White/Black,0.001815,0,0,0,0
149,10/7/2016 9:42,10/7/2014,Euthanasia,Cat,,Domestic Shorthair Mix,Black,2.003847,0,1,0,0
198,5/16/2019 18:16,5/16/2019,Transfer,Cat,,Domestic Shorthair,White/Black,0.002085,0,0,0,0
214,5/16/2019 19:46,5/12/2018,Transfer,Cat,,Domestic Shorthair,Brown Tabby,1.013215,1,0,0,0
235,4/11/2014 12:55,9/10/2013,Transfer,Cat,,Domestic Shorthair Mix,Black/White,0.585036,1,0,0,0
295,10/2/2016 17:37,9/26/2016,Transfer,Cat,,Domestic Medium Hair Mix,Black/White,0.018449,0,0,0,0
304,10/16/2015 12:42,10/1/2015,Transfer,Cat,,Domestic Medium Hair Mix,Black,0.042546,0,0,0,0




What is the most common sex upon outcome for cats and dogs?
Animal Type
Cat    Neutered Male
Dog    Neutered Male
Name: Sex upon Outcome, dtype: object


We are going to replace the null sex upon outcome values with the most common sex for cats or dogs,so Neutered Malein either case


Sum of null values in each column now: 
 MonthYear           0
Date of Birth       0
Outcome Type        0
Animal Type         0
Sex upon Outcome    0
Breed               0
Color               0
Age Upon Outcome    0
TNR                 0
Suffering           0
Aggressive          0
Rabies              0
dtype: int64


## Feature Engineering with Color and Breed

There are so many potential values for color and breed that we decided to combine similar attribute values, such as combining all the different kinds of tabby colored cats into just "Tabby." This helps simplify our data. Combining attribute values also makes sense because many of these colors and breeds are almost exactly the same thing but were just entered by different people and consequently given slightly different names. 

In [7]:
print("\n\nWhat are the colors of the animals we have left in our data?\n\n")
display(data.groupby('Color').count()['Outcome Type'])

print("\n\nWhat are the breeds of the animals we have left in our data?\n\n")
display(data.groupby('Breed').count()['Outcome Type'])



What are the colors of the animals we have left in our data?




Color
Agouti                           11
Agouti/Brown Tabby                1
Agouti/Cream                      1
Agouti/White                      1
Apricot                          72
Apricot/Brown                     3
Apricot/Tricolor                  2
Apricot/White                    12
Black                          8745
Black Brindle                   104
Black Brindle/Black              10
Black Brindle/Blue                1
Black Brindle/Blue Tick           1
Black Brindle/Brown              17
Black Brindle/Brown Brindle       2
Black Brindle/Tan                 3
Black Brindle/White             232
Black Smoke                     160
Black Smoke/Black                 1
Black Smoke/Black Tiger           1
Black Smoke/Blue Tick             1
Black Smoke/Brown                 1
Black Smoke/Brown Tabby           1
Black Smoke/Gray                  1
Black Smoke/White                48
Black Tabby                     222
Black Tabby/Black                 1
Black Tabby/Gray      



What are the breeds of the animals we have left in our data?




Breed
Abyssinian                                         6
Abyssinian Mix                                     7
Affenpinscher Mix                                  8
Afghan Hound Mix                                   1
Afghan Hound/German Shepherd                       1
Afghan Hound/Labrador Retriever                    1
Airedale Terrier                                   2
Airedale Terrier Mix                              27
Airedale Terrier/Irish Terrier                     1
Airedale Terrier/Labrador Retriever                2
Airedale Terrier/Miniature Schnauzer               1
Airedale Terrier/Otterhound                        2
Airedale Terrier/Standard Poodle                   1
Akbash Mix                                         6
Akita                                             10
Akita Mix                                         51
Akita/Australian Cattle Dog                        3
Akita/Belgian Malinois                             2
Akita/Border Collie                     

### Defining Functions to Feature Engineer Color Attribute

Below we defined a function to help create a Multicolor column which has a value of 1 if an animal has a "/" in their Color attribute (indicating they have more than one color) and has a value of 0 if there is not a "/" in their Color attribute.

We also defined the mainColor function which assumes that, if an animal has a "/" in its Color attribute, the first color listed for an animal is the animal's predominate color. We came to this assumption because some animals were entered, for example, as Black/White while others were entered as White/Black. This led us to believe that the first color may be the most visable color on the animal. However, because we are making an assumption here, this could be a weakness in our model.

We then applied these two functions to all the rows in our data.

In [8]:
def multicolor (row):
    if '/' in row['Color']:
        return 1
    else:
        return 0

def mainColor (row):
    if '/' in row['Color']:
        color = ""
        for i in row['Color']:
            if i != '/':
                color += i
            else:
                return color.strip()
    else:
        return row['Color']
    
data['Multicolor'] = data.apply (lambda row: multicolor(row), axis=1)

#We are making the assumption that the color listed first for multicolored animals is the main color of the animal
data['MainColor'] = data.apply (lambda row: mainColor(row), axis=1)

print("What are the main colors of the animals we have in our data?")
display(data.groupby('MainColor').count())

What are the main colors of the animals we have in our data?


Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor
MainColor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Agouti,14,14,14,14,14,14,14,14,14,14,14,14,14
Apricot,89,89,89,89,89,89,89,89,89,89,89,89,89
Black,25585,25585,25585,25585,25585,25585,25585,25585,25585,25585,25585,25585,25585
Black Brindle,370,370,370,370,370,370,370,370,370,370,370,370,370
Black Smoke,214,214,214,214,214,214,214,214,214,214,214,214,214
Black Tabby,297,297,297,297,297,297,297,297,297,297,297,297,297
Black Tiger,5,5,5,5,5,5,5,5,5,5,5,5,5
Blue,5007,5007,5007,5007,5007,5007,5007,5007,5007,5007,5007,5007,5007
Blue Cream,73,73,73,73,73,73,73,73,73,73,73,73,73
Blue Merle,479,479,479,479,479,479,479,479,479,479,479,479,479


### Combining Colors

Below we defined another function that takes all the similar MainColor attribute values and combines them into a single color which is then returned into the column CombinedColor.

Calicos, Torties, and Torbie colors are all tricolor coats, so we called these colors Tricolor. 

Any color with the word "Point" is, according to Wikipedia, an "animal coat coloration with a pale body and relatively darker extremities, i.e. the face, ears, feet, tail." So, we combined all colors with the word "Point" into one color called Point. 

In [9]:
def combineColor (row):
    if 'Black' in row['MainColor']:
        return 'Black'
    elif 'Blue' in row['MainColor']:
        return 'Blue'
    elif 'Tabby' in row['MainColor']:
        return 'Tabby'
    elif 'Point' in row['MainColor']:
        return 'Point'
    elif 'Brown' in row['MainColor'] or 'Buff' in row['MainColor'] or 'Tan' in row['MainColor'] \
    or 'Chocolate' in row['MainColor'] or 'Tan' in row['MainColor'] or 'Ruddy' in row['MainColor']\
    or 'Fawn' in row['MainColor']: 
        return 'Brown'
    elif 'Cream' in row['MainColor']:
        return 'White'
    elif 'Agouti' in row['MainColor'] and 'Cream' in row['MainColor']:
        return 'White'
    elif 'Agouti' in row['MainColor'] and 'White' in row['MainColor']:
        return 'White'
    elif 'Agouti' in row['MainColor']:
        return 'Brown'
    elif 'Apricot' in row['MainColor']:
        return 'Yellow'
    elif 'Gold' in row['MainColor']:
        return 'Yellow'
    elif 'Liver' in row['MainColor'] or 'Gray' in row['MainColor']:
        return 'Gray'
    elif 'Red' in row['MainColor'] or 'Pink' in row['MainColor'] or 'Orange' in row['MainColor']:
        return 'Red'
    elif 'Sable' in row['MainColor']:
        return 'Brown'
    elif 'Silver' in row['MainColor']:
        return 'Gray'
    elif 'Yellow' in row['MainColor']:
        return 'Yellow'
    elif 'White' in row['MainColor']:
        return 'White'
    elif 'Tricolor' in row['MainColor'] or 'Torbie' in row['MainColor'] or 'Tortie' in row['MainColor']\
    or 'Calico' in row['MainColor']:
        return 'Tricolor'
    else:
        return row['MainColor']
    
    
    
data['CombinedColor'] = data.apply (lambda row: combineColor(row), axis=1)

### The Results for Color

Combining colors left us with the following 10 attribute values for CombinedColor:

Black, Blue, Brown, Gray, Point, Red, Tabby, Tricolor, White, and Yellow.

In [10]:
data.head(20)

data.groupby('CombinedColor').count()

Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor
CombinedColor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Black,26471,26471,26471,26471,26471,26471,26471,26471,26471,26471,26471,26471,26471,26471
Blue,8238,8238,8238,8238,8238,8238,8238,8238,8238,8238,8238,8238,8238,8238
Brown,22929,22929,22929,22929,22929,22929,22929,22929,22929,22929,22929,22929,22929,22929
Gray,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162
Point,1808,1808,1808,1808,1808,1808,1808,1808,1808,1808,1808,1808,1808,1808
Red,3251,3251,3251,3251,3251,3251,3251,3251,3251,3251,3251,3251,3251,3251
Tabby,15824,15824,15824,15824,15824,15824,15824,15824,15824,15824,15824,15824,15824,15824
Tricolor,8365,8365,8365,8365,8365,8365,8365,8365,8365,8365,8365,8365,8365,8365
White,14648,14648,14648,14648,14648,14648,14648,14648,14648,14648,14648,14648,14648,14648
Yellow,1165,1165,1165,1165,1165,1165,1165,1165,1165,1165,1165,1165,1165,1165


## Feature Engineering with Breed

Below we defined a function which assumes that, if an animal has a "/" or the word "Mix" in its breed attribute, it's a mutt (not purebred). If the animal is a mutt, it returns "Mutt" into the breed attribute.

We also defined a function that adds a pitbull attribute. This attribute has a 1 if the animal's breed attribute contains "Pitbull" or any other breed similar to a pitbull. The reason we made this attribute is because of the common stereotype of aggression surrounding pitbulls, which could have an effect on the outcomes for dogs with pitbull features. 

We then applied these functions to all the rows in our data.

In [11]:
def breed (row):
    if '/' in row['Breed'] or 'Mix' in row['Breed']:
        return 'Mutt'
    else:
        return row['Breed']

def pitbull (row):
    if 'Pit Bull' in row['Breed'] or 'Pitbull' in row['Breed'] or 'Dogo' in row['Breed']\
    or 'Presa' in row['Breed'] or 'Staffordshire' in row['Breed']:
        return 1
    else:
        return 0    

data['Pitbull'] = data.apply (lambda row: pitbull(row), axis=1)
data['Breed'] = data.apply (lambda row: breed(row), axis=1)

print("What are the breeds of the animals we have in our data now?")
display(data.groupby('Breed').count())



What are the breeds of the animals we have in our data now?


Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor,CombinedColor,Pitbull
Breed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Abyssinian,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
Airedale Terrier,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
Akita,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10
Alaskan Husky,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42
Alaskan Malamute,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14
American Bulldog,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52
American Curl Shorthair,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
American Eskimo,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13
American Foxhound,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
American Pit Bull Terrier,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48


### Combining Breeds for Purebred Dogs and Cats

#### For Dogs

Below we defined another function that takes all the breed attributes and, if the breed is a type of dog breed, returns one of 7 different dog breed groups described here: https://www.akc.org/expert-advice/lifestyle/7-akc-dog-breed-groups-explained/ and here https://www.akc.org/public-education/resources/general-tips-information/dog-breeds-sorted-groups/. 


#### For Cats

Below we defined another function that takes all the breed attributes and, if the breed is a type of cat breed, returns a cat breed group based on the groups described here: https://www.purina.com.au/cats/ownership/pedigree-cat-breed-groups#.XdePG9VMHb0

In [12]:
def combineBreed (row):
    if 'Akita' in row['Breed'] or 'mastiff' in row['Breed'] or 'Mastiff' in row['Breed']\
    or 'Mountain' in row['Breed'] or 'Malamute' in row['Breed'] or 'Husky' in row['Breed']\
    or 'Schnauzer' in row['Breed'] or 'St. Bernard' in row['Breed'] or 'Black Mouth Cur' in row['Breed']\
    or 'Lacy' in row['Breed'] or 'Boerboel' in row['Breed'] or 'Boxer' in row['Breed']\
    or 'Briard' in row['Breed'] or 'Cane Corso' in row['Breed'] or 'Catahoula'in row['Breed']\
    or 'Doberman' in row['Breed'] or 'Dogue' in row['Breed'] or 'Dane' in row['Breed']\
    or 'Pyrenees' in row['Breed'] or 'Landseer' in row['Breed'] or 'Newfoundland' in row['Breed']\
    or 'Leonberger' in row['Breed'] or 'Rottweiler' in row['Breed'] or 'Samoyed' in row['Breed']:
        
        return 'Working'
    
    if 'Bruss Griffon' in row['Breed'] or 'Chihuahua' in row['Breed'] or 'Chinese Crested' in row['Breed']\
    or 'Havanese' in row['Breed'] or 'Yorkshire' in row['Breed'] or 'Silky' in row['Breed']\
    or 'Manchester' in row['Breed'] or 'Italian' in row['Breed'] or 'Japanese Chin' in row['Breed']\
    or 'Maltese' in row['Breed'] or 'Miniature Pinscher' in row['Breed'] or 'Miniature Poodle' in row['Breed']\
    or 'Papillon' in row['Breed'] or 'Pekingese' in row['Breed'] or 'Pomeranian' in row['Breed']\
    or 'Pug' in row['Breed'] or 'Shih' in row['Breed'] or 'Toy Poodle' in row['Breed']:
        
        return 'Toy'
    
    if 'Terrier' in row['Breed'] or 'Miniture Schnauzer' in row['Breed'] or 'Dandie' in row['Breed']\
    or 'West' in row['Breed'] or 'Pit Bull' in row['Breed'] or 'Dogo' in row['Breed'] or 'Pitbull' in row['Breed'] \
    or 'Presa' in row['Breed'] or 'Staffordshire' in row['Breed']:
        
        return 'Terrier'
    
    if 'Bulldog' in row['Breed'] or 'Bichon' in row['Breed'] or 'Eskimo' in row['Breed']\
    or 'Carolina' in row['Breed'] or 'Sharpei' in row['Breed'] or 'Chow' in row['Breed']\
    or 'Coton' in row['Breed'] or 'Dalmatian' in row['Breed'] or 'Spitz' in row['Breed']\
    or 'Hovawart' in row['Breed'] or 'Jindo' in row['Breed'] or 'Keeshond' in row['Breed']\
    or 'Lhasa Apso' in row['Breed'] or 'Mexican Hairless' in row['Breed'] or 'Shiba' in row['Breed']\
    or 'Standard Poodle' in row['Breed']:
        
        return 'Non-Sporting'
    
    
    if 'Shepherd' in row['Breed'] or 'Cattle' in row['Breed'] or 'Sheepdog' in row['Breed']\
    or 'Collie' in row['Breed'] or 'Corgi' in row['Breed'] or 'Heeler' in row['Breed']\
    or 'Kelpie' in row['Breed'] or 'Beauceron'in row['Breed'] or 'Malinois' in row['Breed'] \
    or 'Vallhund' in row['Breed']:
        
        return 'Herding'
    
    if 'Hound' in row['Breed'] or 'Foxhound' in row['Breed'] or 'Beagle' in row['Breed']\
    or 'Bloodhound' in row['Breed'] or 'Basenji' in row['Breed'] or 'Dachshund' in row['Breed'] \
    or 'Harrier' in row['Breed'] and 'Greyhound' not in row['Breed'] or 'hound' in row['Breed']\
    or 'Pbgv' in row['Breed'] or 'Rhod' in row['Breed'] or 'Saluki' in row['Breed']\
    or 'Whippet' in row['Breed']:
        
        return 'Hound'
    
    if 'Boykin' in row['Breed'] or 'Brittany' in row['Breed'] or 'Span' in row['Breed']\
    or 'Spaniel' in row['Breed'] or 'Retr' in row['Breed'] or 'Retriever' in row['Breed'] \
    or 'Pointer' in row['Breed'] or 'Treeing' in row['Breed'] or 'Vizsla' in row['Breed'] \
    or 'Weimar' in row['Breed'] or 'Irish Setter' in row['Breed']:
        
        return 'Sporting'   
   
    
    if 'Abyssinian' in row['Breed'] or 'Balinese' in row['Breed'] or 'Medium' in row['Breed']\
    or 'Maine' in row['Breed'] or 'Ragdoll' in row['Breed'] or 'Angora' in row['Breed']:
        return 'Semi-Longhair'
    
    if 'Shorthair' in row['Breed'] or 'Bengal' in row['Breed'] or 'Burmese' in row['Breed']\
    or 'Devon' in row['Breed'] or 'Havana' in row['Breed'] or 'Manx' in row['Breed']\
    or 'Russian' in row['Breed'] or 'Siamese' in row['Breed'] or 'Snow' in row['Breed']\
    or 'Sphynx' in row['Breed']:
        return 'Short Hair'
    
    if 'Balinese' in row['Breed'] or 'Long' in row['Breed'] or 'Himalayan' in row['Breed']\
    or 'Persian' in row['Breed']:
        return 'Long Hair'
    
    else:
        return row['Breed']
    
data['CombinedBreed'] = data.apply (lambda row: combineBreed(row), axis=1)

### The Results for Breed

After combining breeds, we ended up with only 11 potential breed attribute values.

Mutt is any animal, cat or dog, that is not purebred.

Purebred dog categories: Herding, Hound, Non-Sporting, Sporting, Terrier, Toy, and Working

Purebred cat categories: Short Hair, Semi-Long Hair, Long Hair

In [13]:
display(data.groupby('Pitbull').count())
display(data.groupby('CombinedBreed').count())

Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor,CombinedColor,CombinedBreed
Pitbull,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0,92453,92453,92453,92453,92453,92453,92453,92453,92453,92453,92453,92453,92453,92453,92453,92453
1,11408,11408,11408,11408,11408,11408,11408,11408,11408,11408,11408,11408,11408,11408,11408,11408


Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor,CombinedColor,Pitbull
CombinedBreed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Herding,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157
Hound,491,491,491,491,491,491,491,491,491,491,491,491,491,491,491,491
Long Hair,142,142,142,142,142,142,142,142,142,142,142,142,142,142,142,142
Mutt,92524,92524,92524,92524,92524,92524,92524,92524,92524,92524,92524,92524,92524,92524,92524,92524
Non-Sporting,297,297,297,297,297,297,297,297,297,297,297,297,297,297,297,297
Semi-Longhair,553,553,553,553,553,553,553,553,553,553,553,553,553,553,553,553
Short Hair,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056
Sporting,888,888,888,888,888,888,888,888,888,888,888,888,888,888,888,888
Terrier,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259
Toy,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334


#### Checking the Results for Breed

To make sure that all the cats are falling into their correct breeds and all the dogs are falling into their correct breeds, we made copies of our dataset and dropped dogs from one and cats from the other. After, we looked at the breeds of just cats and just dogs. As you can see, cats and dogs were sorted into their correct breeds, with the only common "breed" between them being mutt.

In [14]:
dataCopy1 = data.copy(deep = True)
dataCopy2 = data.copy(deep = True)

indexNames = data[ data['Animal Type'] == 'Dog'].index

dataCopy1.drop(indexNames , inplace=True)

catData = dataCopy1

print("\n\nBreeds for cats:")
display(catData.groupby('CombinedBreed').count())

indexNames = data[ data['Animal Type'] == 'Cat'].index
    
dataCopy2.drop(indexNames , inplace=True)

dogData = dataCopy2

print("\n\nBreeds for dogs:")
display(dogData.groupby('CombinedBreed').count())



Breeds for cats:


Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor,CombinedColor,Pitbull
CombinedBreed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Long Hair,142,142,142,142,142,142,142,142,142,142,142,142,142,142,142,142
Mutt,36648,36648,36648,36648,36648,36648,36648,36648,36648,36648,36648,36648,36648,36648,36648,36648
Semi-Longhair,553,553,553,553,553,553,553,553,553,553,553,553,553,553,553,553
Short Hair,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056,4056




Breeds for dogs:


Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor,CombinedColor,Pitbull
CombinedBreed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Herding,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157,1157
Hound,491,491,491,491,491,491,491,491,491,491,491,491,491,491,491,491
Mutt,55876,55876,55876,55876,55876,55876,55876,55876,55876,55876,55876,55876,55876,55876,55876,55876
Non-Sporting,297,297,297,297,297,297,297,297,297,297,297,297,297,297,297,297
Sporting,888,888,888,888,888,888,888,888,888,888,888,888,888,888,888,888
Terrier,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259
Toy,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334,1334
Working,1160,1160,1160,1160,1160,1160,1160,1160,1160,1160,1160,1160,1160,1160,1160,1160


## Feature Engineering with MonthYear

We determined that MonthYear is the date on which the outcome for the animal occured. We found this out by comparing the Date of Birth attribute to the MonthYear attribute. MonthYear minus Date of Birth consistently resulted in the same age as is shown in the Age Upon Outcome attribute. 

Considering this, we decided to use the MonthYear attribute to create a Season column. The potential values for this attribute are Winter, Spring, Summer, and Fall. 

In [15]:
def month(row):
    val = int(row['MonthYear'].split('/')[0])
    if val in range(3,6):
        return 'Spring'
    elif val in range(6,9):
        return 'Summer'
    elif val in range(9,12):
        return 'Fall'
    else:
        return 'Winter'

data['Season'] = data.apply (lambda row: month(row), axis=1)
data.groupby("Season").count()

Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor,CombinedColor,Pitbull,CombinedBreed
Season,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Fall,28115,28115,28115,28115,28115,28115,28115,28115,28115,28115,28115,28115,28115,28115,28115,28115,28115
Spring,23828,23828,23828,23828,23828,23828,23828,23828,23828,23828,23828,23828,23828,23828,23828,23828,23828
Summer,30289,30289,30289,30289,30289,30289,30289,30289,30289,30289,30289,30289,30289,30289,30289,30289,30289
Winter,21629,21629,21629,21629,21629,21629,21629,21629,21629,21629,21629,21629,21629,21629,21629,21629,21629


## Feature Engineering with Sex Upon Outcome

We decided to use an animal's sex upon outcome to create two separate columns. 

* One column is the Gender column. If an animal has female in its Sex Upon Outcome column, for example, then "Female" will be the animal's gender attribute.

* The other column is the Spayed/Neutered column. If an animal was spayed (for females) or neutered (for males), then this column will have a value of 1. If the animal was not spayed or neutered, the value will be 0.

In [16]:
def gender (row):
    if 'Female' in row['Sex upon Outcome']:
        return 'Female'
    elif 'Male' in row['Sex upon Outcome']:
        return 'Male'
    else:
        return row['Sex upon Outcome']
    
def sn (row):
    if 'Spayed' in row['Sex upon Outcome'] or 'Neutered' in row['Sex upon Outcome']:
        return 1
    else:
        return 0
    

data['Gender'] = data.apply (lambda row: gender(row), axis=1)
data['Spayed/Neutered'] = data.apply (lambda row: sn(row), axis=1)

display(data.groupby('Gender').count())
display(data.groupby('Spayed/Neutered').count())


Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor,CombinedColor,Pitbull,CombinedBreed,Season,Spayed/Neutered
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Female,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996,47996
Male,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865,55865


Unnamed: 0_level_0,MonthYear,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,MainColor,CombinedColor,Pitbull,CombinedBreed,Season,Gender
Spayed/Neutered,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
0,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547,26547
1,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314,77314


## Getting Our (Almost) Final Dataset and Creating a One-Hot Encoded Dataset

Because we were done with the MonthYear, Sex upon Outcome, Date of Birth, Color, MainColor, and Breed Columns, we dropped these. Our final dataset without one-hot encoding is called df.

We also created a dataset that is one-hot encoded. This dataset is called dfOHE.

In [17]:
df = data.copy(deep = True)
df = df.drop(['MonthYear','Date of Birth', 'Sex upon Outcome', 'Breed', 'Color', 'MainColor'], axis = 1)

print("\n\nFinal dataframe:")
display(df.head(25))
dfOHE = df.copy(deep = True)

dfOHE = pd.concat([dfOHE,pd.get_dummies(dfOHE['Animal Type'], prefix='AnimalType',dummy_na=True)],axis=1).drop(['Animal Type'],axis=1)
dfOHE = pd.concat([dfOHE,pd.get_dummies(dfOHE['CombinedColor'], prefix='CombinedColor',dummy_na=True)],axis=1).drop(['CombinedColor'],axis=1)
dfOHE = pd.concat([dfOHE,pd.get_dummies(dfOHE['CombinedBreed'], prefix='CombinedBreed',dummy_na=True)],axis=1).drop(['CombinedBreed'],axis=1)
dfOHE = pd.concat([dfOHE,pd.get_dummies(dfOHE['Season'], prefix='Season',dummy_na=True)],axis=1).drop(['Season'],axis=1)
dfOHE = pd.concat([dfOHE,pd.get_dummies(dfOHE['Gender'], prefix='Gender',dummy_na=True)],axis=1).drop(['Gender'],axis=1)

print("\n\nFinal one-hot encoded dataframe:")
display(dfOHE.head(25))

print("\n\nWhen using the get_dummies method a 'nan' column is created. However, as seen below, we don't have any \
nans in our data. So, we decided to drop these nan rows.\n\n")

print("\n\ndfOHE with nans:\n\n")
display(dfOHE.sum())

print("\n\ndfOHE without nans:\n\n")
dfOHE = dfOHE.drop(['AnimalType_nan','CombinedColor_nan', 'CombinedBreed_nan', 'Season_nan', 'Gender_nan'], axis = 1)
display(dfOHE.sum())



Final dataframe:


Unnamed: 0,Outcome Type,Animal Type,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,CombinedColor,Pitbull,CombinedBreed,Season,Gender,Spayed/Neutered
0,Adoption,Dog,2.012298,0,0,0,0,0,White,0,Mutt,Winter,Male,1
1,Adoption,Dog,0.352738,0,0,0,0,0,Brown,0,Mutt,Winter,Male,1
2,Transfer,Cat,0.017783,0,0,0,0,0,Tabby,0,Mutt,Spring,Male,0
3,Adoption,Cat,0.215852,0,0,0,0,0,Black,0,Mutt,Fall,Male,1
4,Adoption,Cat,0.174541,0,0,0,0,1,White,0,Mutt,Summer,Male,1
5,Transfer,Cat,2.003767,1,0,0,0,0,Black,0,Mutt,Summer,Female,0
6,Return to Owner,Cat,7.012321,0,0,0,0,1,Blue,0,Mutt,Winter,Male,1
7,Transfer,Cat,0.007441,0,0,0,0,0,Tabby,0,Mutt,Summer,Male,0
8,Transfer,Cat,0.779775,0,0,0,0,1,Black,0,Mutt,Spring,Female,1
9,Adoption,Dog,2.069911,0,0,0,0,0,White,0,Mutt,Spring,Male,1




Final one-hot encoded dataframe:


Unnamed: 0,Outcome Type,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,...,CombinedBreed_Working,CombinedBreed_nan,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Season_nan,Gender_Female,Gender_Male,Gender_nan
0,Adoption,2.012298,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,1,0
1,Adoption,0.352738,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,1,0
2,Transfer,0.017783,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
3,Adoption,0.215852,0,0,0,0,0,0,1,1,...,0,0,1,0,0,0,0,0,1,0
4,Adoption,0.174541,0,0,0,0,1,0,1,1,...,0,0,0,0,1,0,0,0,1,0
5,Transfer,2.003767,1,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,1,0,0
6,Return to Owner,7.012321,0,0,0,0,1,0,1,1,...,0,0,0,0,0,1,0,0,1,0
7,Transfer,0.007441,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,1,0
8,Transfer,0.779775,0,0,0,0,1,0,1,1,...,0,0,0,1,0,0,0,1,0,0
9,Adoption,2.069911,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0




When using the get_dummies method a 'nan' column is created. However, as seen below, we don't have any nans in our data. So, we decided to drop these nan rows.




dfOHE with nans:




Outcome Type                   AdoptionAdoptionTransferAdoptionAdoptionTransf...
Age Upon Outcome                                                          234419
TNR                                                                         5518
Suffering                                                                   2338
Aggressive                                                                   525
Rabies                                                                       166
Multicolor                                                                 55996
Pitbull                                                                    11408
Spayed/Neutered                                                            77314
AnimalType_Cat                                                             41399
AnimalType_Dog                                                             62462
AnimalType_nan                                                                 0
CombinedColor_Black         



dfOHE without nans:




Outcome Type                   AdoptionAdoptionTransferAdoptionAdoptionTransf...
Age Upon Outcome                                                          234419
TNR                                                                         5518
Suffering                                                                   2338
Aggressive                                                                   525
Rabies                                                                       166
Multicolor                                                                 55996
Pitbull                                                                    11408
Spayed/Neutered                                                            77314
AnimalType_Cat                                                             41399
AnimalType_Dog                                                             62462
CombinedColor_Black                                                        26471
CombinedColor_Blue          

## Exploring Our Data

Now that we have a cleaned up dataset we decided to look at if we had any class imbalances.

We discovered that the Euthanasia outcome had very few records compared to the other classes.

In [18]:
display(df.groupby('Outcome Type').count()['Animal Type'])

Outcome Type
Adoption           48135
Euthanasia          3464
Return to Owner    19668
Transfer           32594
Name: Animal Type, dtype: int64

In [19]:
dataCopy1 = df.copy(deep = True)
dataCopy2 = df.copy(deep = True)

indexNames = df[ df['Animal Type'] == 'Dog'].index

dataCopy1.drop(indexNames , inplace=True)

catData = dataCopy1

print("\n\nOutcomes for cats:")
display(catData.groupby('Outcome Type').count()['Animal Type'])

indexNames = df[ df['Animal Type'] == 'Cat'].index
    
dataCopy2.drop(indexNames , inplace=True)

dogData = dataCopy2

print("\n\nOutcomes for dogs:")
display(dogData.groupby('Outcome Type').count()['Animal Type'])



Outcomes for cats:


Outcome Type
Adoption           18751
Euthanasia          1781
Return to Owner     1995
Transfer           18872
Name: Animal Type, dtype: int64



Outcomes for dogs:


Outcome Type
Adoption           29384
Euthanasia          1683
Return to Owner    17673
Transfer           13722
Name: Animal Type, dtype: int64

### Splitting Our Data and Outliers

After finding out about the class imbalance, we decided to separate our features from our classes and to look at outliers. 

We noticed that a few records had errors in their MonthYear columns where the MonthYear was a date before the animal's date of birth. We decided to remove these records because there were only 19 of them.

Otherwise, although there are a significant number of other outliers in our data, we decided to leave these in because they did not seem to be illegitimate points. They mainly seemed to occur because of colors and breeds for which there are comparatively few animals with those colors and breeds. This does not mean they are invalid data points, however. 

In [20]:
print('\n\nHow many Age Upon Outcome values are equal to or below 0?')
print(dfOHE.loc[(dfOHE['Age Upon Outcome'] <= 0)].count()['Age Upon Outcome'])

#Drop incorrect ages from dfOHE
indexNames = dfOHE[ dfOHE['Age Upon Outcome'] <= 0].index
dfOHE.drop(indexNames , inplace=True)
print('\n\nHow many Age Upon Outcome values are equal to or below 0 in dfOHE now?')
print(dfOHE.loc[(dfOHE['Age Upon Outcome'] <= 0)].count()['Age Upon Outcome'])

#Drop incorrect ages from df
indexNames = df[ df['Age Upon Outcome'] <= 0].index
df.drop(indexNames , inplace=True)
print('\n\nHow many Age Upon Outcome values are equal to or below 0 in df now?')
print(df.loc[(df['Age Upon Outcome'] <= 0)].count()['Age Upon Outcome'])

#Try method for splitting the data
data_x = dfOHE.drop('Outcome Type', axis = 1)
data_y = dfOHE.copy(deep = True)['Outcome Type']
print("\n\n Our features, or X data in dfOHE:")
display(data_x.head())
print("\n\nOur classes, or Y data:")
display(data_y.head())
print("\n\nDescribe the x data:")
display(data_x.describe())
print('\n\nHow many Age Upon Outcome values are equal to or below 0 in x?')
print(data_x.loc[(data_x['Age Upon Outcome'] <= 0)].count()['Age Upon Outcome'])

from sklearn.preprocessing import StandardScaler

print("\n\nStandardize the x data:")
scalar = StandardScaler()
data_x_stand = pd.DataFrame(scalar.fit_transform(data_x), columns = ['Age Upon Outcome', 'TNR', 'Suffering', 'Aggressive', 'Rabies', 'Multicolor', 'Pitbull', 'Spayed/Neutered', 'AnimalType_Cat', 'AnimalType_Dog', 'CombinedColor_Black', 'CombinedColor_Blue', 'CombinedColor_Brown', 'CombinedColor_Gray', 'CombinedColor_Point', 'CombinedColor_Red', 'CombinedColor_Tabby', 'CombinedColor_Tricolor', 'CombinedColor_White', 'CombinedColor_Yellow', 'CombinedBreed_Herding', 'CombinedBreed_Hound', 'CombinedBreed_Long Hair', 'CombinedBreed_Mutt', 'CombinedBreed_Non-Sporting', 'CombinedBreed_Semi-Longhair', 'CombinedBreed_Short Hair', 'CombinedBreed_Sporting', 'CombinedBreed_Terrier', 'CombinedBreed_Toy', 'CombinedBreed_Working', 'Season_Fall', 'Season_Spring', 'Season_Summer', 'Season_Winter', 'Gender_Female', 'Gender_Male'])
display(data_x_stand.head())

print("How many attributes are greater than 4 standard deviations away from their means?")
for i in ['Age Upon Outcome', 'TNR', 'Suffering', 'Aggressive', 'Rabies', 'Multicolor', 'Pitbull', \
          'Spayed/Neutered', 'AnimalType_Cat', 'AnimalType_Dog', 'CombinedColor_Black', \
          'CombinedColor_Blue', 'CombinedColor_Brown', 'CombinedColor_Gray', 'CombinedColor_Point', \
          'CombinedColor_Red', 'CombinedColor_Tabby', 'CombinedColor_Tricolor', 'CombinedColor_White', \
          'CombinedColor_Yellow', 'CombinedBreed_Herding', 'CombinedBreed_Hound', 'CombinedBreed_Long Hair', \
          'CombinedBreed_Mutt', 'CombinedBreed_Non-Sporting', 'CombinedBreed_Semi-Longhair', \
          'CombinedBreed_Short Hair', 'CombinedBreed_Sporting', 'CombinedBreed_Terrier', \
          'CombinedBreed_Toy', 'CombinedBreed_Working', 'Season_Fall', 'Season_Spring', \
          'Season_Summer', 'Season_Winter', 'Gender_Female', 'Gender_Male']:
    print(i, ":", data_x_stand.loc[(data_x_stand[i] >= 4)].count()[i])



How many Age Upon Outcome values are equal to or below 0?
19


How many Age Upon Outcome values are equal to or below 0 in dfOHE now?
0


How many Age Upon Outcome values are equal to or below 0 in df now?
0


 Our features, or X data in dfOHE:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
0,2.012298,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,1,0,1
1,0.352738,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,1,0,1
2,0.017783,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,1
3,0.215852,0,0,0,0,0,0,1,1,0,...,0,0,0,0,1,0,0,0,0,1
4,0.174541,0,0,0,0,1,0,1,1,0,...,0,0,0,0,0,0,1,0,0,1




Our classes, or Y data:


0    Adoption
1    Adoption
2    Transfer
3    Adoption
4    Adoption
Name: Outcome Type, dtype: object



Describe the x data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
count,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,...,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0,103842.0
mean,2.257554,0.053138,0.022505,0.005056,0.001599,0.539146,0.109859,0.744419,0.398596,0.601404,...,0.008551,0.012124,0.012846,0.011171,0.2707,0.229416,0.291645,0.208239,0.462135,0.537865
std,3.001397,0.224311,0.148321,0.070924,0.03995,0.498468,0.312716,0.436189,0.489612,0.489612,...,0.092078,0.109441,0.112612,0.105101,0.444323,0.420459,0.454522,0.406051,0.498567,0.498567
min,0.001077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.253082,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.018485,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,3.004391,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
max,24.018213,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0




How many Age Upon Outcome values are equal to or below 0 in x?
0


Standardize the x data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
0,-0.081714,-0.236898,-0.151735,-0.071284,-0.040014,-1.081612,-0.351309,0.585943,-0.81411,0.81411,...,-0.092872,-0.110784,-0.114077,-0.106287,-0.609243,-0.545634,-0.641655,1.949914,-0.926931,0.926931
1,-0.634646,-0.236898,-0.151735,-0.071284,-0.040014,-1.081612,-0.351309,0.585943,-0.81411,0.81411,...,-0.092872,-0.110784,-0.114077,-0.106287,-0.609243,-0.545634,-0.641655,1.949914,-0.926931,0.926931
2,-0.746246,-0.236898,-0.151735,-0.071284,-0.040014,-1.081612,-0.351309,-1.706652,1.228335,-1.228335,...,-0.092872,-0.110784,-0.114077,-0.106287,-0.609243,1.832729,-0.641655,-0.512843,-0.926931,0.926931
3,-0.680254,-0.236898,-0.151735,-0.071284,-0.040014,-1.081612,-0.351309,0.585943,1.228335,-1.228335,...,-0.092872,-0.110784,-0.114077,-0.106287,1.641381,-0.545634,-0.641655,-0.512843,-0.926931,0.926931
4,-0.694018,-0.236898,-0.151735,-0.071284,-0.040014,0.924546,-0.351309,0.585943,1.228335,-1.228335,...,-0.092872,-0.110784,-0.114077,-0.106287,-0.609243,-0.545634,1.558469,-0.512843,-0.926931,0.926931


How many attributes are greater than 4 standard deviations away from their means?
Age Upon Outcome : 672
TNR : 5518
Suffering : 2337
Aggressive : 525
Rabies : 166
Multicolor : 0
Pitbull : 0
Spayed/Neutered : 0
AnimalType_Cat : 0
AnimalType_Dog : 0
CombinedColor_Black : 0
CombinedColor_Blue : 0
CombinedColor_Brown : 0
CombinedColor_Gray : 1162
CombinedColor_Point : 1808
CombinedColor_Red : 3248
CombinedColor_Tabby : 0
CombinedColor_Tricolor : 0
CombinedColor_White : 0
CombinedColor_Yellow : 1165
CombinedBreed_Herding : 1156
CombinedBreed_Hound : 488
CombinedBreed_Long Hair : 142
CombinedBreed_Mutt : 0
CombinedBreed_Non-Sporting : 297
CombinedBreed_Semi-Longhair : 553
CombinedBreed_Short Hair : 4053
CombinedBreed_Sporting : 888
CombinedBreed_Terrier : 1259
CombinedBreed_Toy : 1334
CombinedBreed_Working : 1160
Season_Fall : 0
Season_Spring : 0
Season_Summer : 0
Season_Winter : 0
Gender_Female : 0
Gender_Male : 0


### Testing the Process for Fixing Our Class Imbalance

We decided to test out using SMOTE for the minority class, Euthanasia, and to randomly under sample from the other classes. To do this we used the python library imblearn to carry out random under-sampling and SMOTE. If you don't have imblearn installed, remove the # before !pip. 

In [21]:
# import SMOTE module from imblearn library 
# Use the below code to pip install imblearn if it is not already installed
#!pip install imblearn 
from imblearn.over_sampling import SMOTE 
#We want to have 7000 records for Euthanasia
sm = SMOTE(sampling_strategy= {'Euthanasia': 7000})
x_sam, y_sam = sm.fit_sample(data_x, data_y) 


from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy= {'Adoption': 10000, 'Transfer': 10000, 'Return to Owner': 10000})
x_sam, y_sam = rus.fit_resample(x_sam, y_sam)

In [22]:
#Convert the data with synthetic minority over sampling back into pandas dataframes so that
# we can make sure we have the right count for Euthanasia

#for col in dfOHE.columns: 
    #lst.append(col)
#print(lst)

x_sam = pd.DataFrame(x_sam, columns = ['Age Upon Outcome', 'TNR', 'Suffering', 'Aggressive', 'Rabies', 'Multicolor', \
                                       'Pitbull', 'Spayed/Neutered', 'AnimalType_Cat', 'AnimalType_Dog', \
                                       'CombinedColor_Black', 'CombinedColor_Blue', 'CombinedColor_Brown', \
                                       'CombinedColor_Gray', 'CombinedColor_Point', 'CombinedColor_Red', \
                                       'CombinedColor_Tabby', 'CombinedColor_Tricolor', 'CombinedColor_White', \
                                       'CombinedColor_Yellow', 'CombinedBreed_Herding', 'CombinedBreed_Hound', \
                                       'CombinedBreed_Long Hair', 'CombinedBreed_Mutt', 'CombinedBreed_Non-Sporting', \
                                       'CombinedBreed_Semi-Longhair', 'CombinedBreed_Short Hair', \
                                       'CombinedBreed_Sporting', 'CombinedBreed_Terrier', 'CombinedBreed_Toy', 
                                       'CombinedBreed_Working', 'Season_Fall', 'Season_Spring', 'Season_Summer', \
                                       'Season_Winter', 'Gender_Female', 'Gender_Male'])
#display(X_sam)

y_sam = pd.DataFrame(y_sam, columns = ['Outcome Type'])
#display(y_sam.head())

x_sam['Outcome Type'] = y_sam['Outcome Type']

print('\n\nDid SMOTE and random undersamlping work? We should have 7000 samples for Euthanasia and 10000 for \
everything else now:')
display(x_sam.groupby('Outcome Type').count()['Age Upon Outcome'])



Did SMOTE and random undersamlping work? We should have 7000 samples for Euthanasia and 10000 for everything else now:


Outcome Type
Adoption           10000
Euthanasia          7000
Return to Owner    10000
Transfer           10000
Name: Age Upon Outcome, dtype: int64

In [23]:
print("\n\nStandardize the x_sam data:")
scalar = StandardScaler()
x_sam.drop('Outcome Type', axis =1, inplace = True)
x_sam = pd.DataFrame(scalar.fit_transform(x_sam), columns = ['Age Upon Outcome', 'TNR', 'Suffering', 'Aggressive', 'Rabies', 'Multicolor', 'Pitbull', 'Spayed/Neutered', 'AnimalType_Cat', 'AnimalType_Dog', 'CombinedColor_Black', 'CombinedColor_Blue', 'CombinedColor_Brown', 'CombinedColor_Gray', 'CombinedColor_Point', 'CombinedColor_Red', 'CombinedColor_Tabby', 'CombinedColor_Tricolor', 'CombinedColor_White', 'CombinedColor_Yellow', 'CombinedBreed_Herding', 'CombinedBreed_Hound', 'CombinedBreed_Long Hair', 'CombinedBreed_Mutt', 'CombinedBreed_Non-Sporting', 'CombinedBreed_Semi-Longhair', 'CombinedBreed_Short Hair', 'CombinedBreed_Sporting', 'CombinedBreed_Terrier', 'CombinedBreed_Toy', 'CombinedBreed_Working', 'Season_Fall', 'Season_Spring', 'Season_Summer', 'Season_Winter', 'Gender_Female', 'Gender_Male'])
display(x_sam.describe())

print("How many attributes are greater than 4 standard deviations away from their means?")
for i in ['Age Upon Outcome', 'TNR', 'Suffering', 'Aggressive', 'Rabies', 'Multicolor', 'Pitbull', \
          'Spayed/Neutered', 'AnimalType_Cat', 'AnimalType_Dog', 'CombinedColor_Black', \
          'CombinedColor_Blue', 'CombinedColor_Brown', 'CombinedColor_Gray', 'CombinedColor_Point', \
          'CombinedColor_Red', 'CombinedColor_Tabby', 'CombinedColor_Tricolor', 'CombinedColor_White', \
          'CombinedColor_Yellow', 'CombinedBreed_Herding', 'CombinedBreed_Hound', 'CombinedBreed_Long Hair', \
          'CombinedBreed_Mutt', 'CombinedBreed_Non-Sporting', 'CombinedBreed_Semi-Longhair', \
          'CombinedBreed_Short Hair', 'CombinedBreed_Sporting', 'CombinedBreed_Terrier', \
          'CombinedBreed_Toy', 'CombinedBreed_Working', 'Season_Fall', 'Season_Spring', \
          'Season_Summer', 'Season_Winter', 'Gender_Female', 'Gender_Male']:
    print(i, ":", x_sam.loc[(x_sam[i] >= 4)].count()[i])



Standardize the x_sam data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
count,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,...,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0,37000.0
mean,1.158279e-14,1.242134e-13,9.866369e-14,4.159114e-14,-1.160281e-14,1.464091e-14,3.621454e-15,5.430506e-14,7.708089e-15,-7.645962e-15,...,-8.470432e-16,6.192627e-15,8.308631e-15,1.11281e-15,-5.920969e-16,2.175071e-15,-3.441235e-15,-4.506881e-15,-2.234753e-15,2.224731e-15
std,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,...,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014
min,-0.8109013,-0.2192477,-0.3886581,-0.174016,-0.09354217,-1.088472,-0.3902398,-1.42075,-0.7928165,-1.262497,...,-0.09438442,-0.1154088,-0.1127691,-0.1094553,-0.6086772,-0.5658498,-0.6205785,-0.5181681,-0.888717,-1.1272
25%,-0.7124111,-0.2192477,-0.3886581,-0.174016,-0.09354217,-1.088472,-0.3902398,-1.42075,-0.7928165,-1.262497,...,-0.09438442,-0.1154088,-0.1127691,-0.1094553,-0.6086772,-0.5658498,-0.6205785,-0.5181681,-0.888717,-1.1272
50%,-0.3708305,-0.2192477,-0.3886581,-0.174016,-0.09354217,0.927645,-0.3902398,0.7116803,-0.7928165,0.7928165,...,-0.09438442,-0.1154088,-0.1127691,-0.1094553,-0.6086772,-0.5658498,-0.6205785,-0.5181681,-0.888717,0.888717
75%,0.2864963,-0.2192477,-0.3886581,-0.174016,-0.09354217,0.927645,-0.3902398,0.7116803,1.262497,0.7928165,...,-0.09438442,-0.1154088,-0.1127691,-0.1094553,1.648834,-0.5658498,1.616821,-0.5181681,1.1272,0.888717
max,5.624993,4.561051,2.607516,5.892578,11.48665,0.927645,2.589315,0.7116803,1.262497,0.7928165,...,10.75451,8.782805,8.94466,9.206836,1.648834,1.773101,1.616821,1.936601,1.1272,0.888717


How many attributes are greater than 4 standard deviations away from their means?
Age Upon Outcome : 130
TNR : 1697
Suffering : 0
Aggressive : 1035
Rabies : 314
Multicolor : 0
Pitbull : 0
Spayed/Neutered : 0
AnimalType_Cat : 0
AnimalType_Dog : 0
CombinedColor_Black : 0
CombinedColor_Blue : 0
CombinedColor_Brown : 0
CombinedColor_Gray : 450
CombinedColor_Point : 599
CombinedColor_Red : 1170
CombinedColor_Tabby : 0
CombinedColor_Tricolor : 0
CombinedColor_White : 0
CombinedColor_Yellow : 445
CombinedBreed_Herding : 391
CombinedBreed_Hound : 179
CombinedBreed_Long Hair : 48
CombinedBreed_Mutt : 0
CombinedBreed_Non-Sporting : 127
CombinedBreed_Semi-Longhair : 183
CombinedBreed_Short Hair : 1258
CombinedBreed_Sporting : 324
CombinedBreed_Terrier : 479
CombinedBreed_Toy : 459
CombinedBreed_Working : 438
Season_Fall : 0
Season_Spring : 0
Season_Summer : 0
Season_Winter : 0
Gender_Female : 0
Gender_Male : 0


## Keeping Only Adoption and Euthanasia

After all our cleaning, exploration, and feature engineering, we decided that we needed to drop records with any outcome besides Adoption and Euthanasia for the sake of cutting down records and making it easier to use under-sampling and over-sampling techniques.

In [24]:
indexNames = dfOHE[ dfOHE['Outcome Type'] == 'Return to Owner'].index
    
dfOHE.drop(indexNames , inplace=True)

indexNames = dfOHE[ dfOHE['Outcome Type'] == 'Transfer'].index
    
dfOHE.drop(indexNames , inplace=True)

display(dfOHE.groupby('Outcome Type').count())

indexNames = df[ df['Outcome Type'] == 'Return to Owner'].index
    
df.drop(indexNames , inplace=True)

indexNames = df[ df['Outcome Type'] == 'Transfer'].index
    
df.drop(indexNames , inplace=True)

display(df.groupby('Outcome Type').count())

display(df.shape)

Unnamed: 0_level_0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
Outcome Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Adoption,48129,48129,48129,48129,48129,48129,48129,48129,48129,48129,...,48129,48129,48129,48129,48129,48129,48129,48129,48129,48129
Euthanasia,3461,3461,3461,3461,3461,3461,3461,3461,3461,3461,...,3461,3461,3461,3461,3461,3461,3461,3461,3461,3461


Unnamed: 0_level_0,Animal Type,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,CombinedColor,Pitbull,CombinedBreed,Season,Gender,Spayed/Neutered
Outcome Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Adoption,48129,48129,48129,48129,48129,48129,48129,48129,48129,48129,48129,48129,48129
Euthanasia,3461,3461,3461,3461,3461,3461,3461,3461,3461,3461,3461,3461,3461


(51590, 14)

# Creating Classification Models with Our Data

Below we made a...

* K-nearest neighbor classification model
* Naive bayes classification model
* Decision tree classification model
* Random forest classification model
* SVM classification model
* Neural net classification model

The imports required for these models are below. 

Because we have so many records, even after limiting our classification to Adoption and Euthanasia, we had to build some of our models off of random samples rather than our entire dataset. 

To deal with the class imbalance, we decided to try both SMOTE and random under-sampling for most of our models.

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE 
from imblearn.under_sampling import RandomUnderSampler

## K-Nearest Neighbor

Below we implemented a K-nearest neighbor classifier.

Because KNN is so computationally expensive when there are a lot of records (and therefore distances to be calculated), we decided to use a random sample of our data in order to use a KNN model. 

To create and test the accuracy of our model, we followed the below steps:


* We took a random sample of the records in our data


* We split our data into features and labels


* We created a pipeline with the following steps:
    * Scaling using StandardScaler()
    * SMOTE or random under-sampling using SMOTE() or RandomUnderSampler()
    * Principle component analysis using PCA()
    * K-nearest neighbors classifier using KNeighborsClassifier()
    
    
* We created a parameter grid and tuned the following:
    * We tuned the ratio of minority class to majority class to have in our data
    * We tuned the number of principle components to keep
    * We tuned the number to use for k
    
    
* We then passed our pipe into a GridSearchCV


* We then passed our GridSearchCV into a cross_val_predict and used these predictions to create a classification report

In [26]:
data_sample = dfOHE.sample(5000)
display(data_sample.shape)
display(data_sample.groupby('Outcome Type').count()['Age Upon Outcome'])
x = data_sample.drop('Outcome Type', axis = 1)
y = data_sample.copy(deep = True)['Outcome Type']
print("\n\nDescribe the x data:")
display(x.describe())
print("\n\nDescribe the y data:")
display(y.describe())

(5000, 38)

Outcome Type
Adoption      4668
Euthanasia     332
Name: Age Upon Outcome, dtype: int64



Describe the x data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,...,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,1.967512,0.0,0.0432,0.0102,0.0036,0.558,0.0998,0.931,0.394,0.606,...,0.0074,0.0096,0.0112,0.0084,0.267,0.202,0.307,0.224,0.4858,0.5142
std,2.719363,0.0,0.203327,0.100489,0.059898,0.496674,0.299763,0.253479,0.488684,0.488684,...,0.085713,0.097518,0.105246,0.091275,0.442437,0.401532,0.461296,0.416964,0.499848,0.499848
min,0.00688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.240472,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.968868,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2.138445,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
max,18.448052,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0




Describe the y data:


count         5000
unique           2
top       Adoption
freq          4668
Name: Outcome Type, dtype: object

In [27]:
#Create a StandardScaler
scale = StandardScaler()

#create SMOTE
sm = SMOTE()

#create a PCA
pca = PCA()

#create a KNN classifier
knn = KNeighborsClassifier()

#create a pipeline that does a SMOTE a PCA and a KNN
pipe = Pipeline(steps=[('sm',sm), ('scale',scale), ('pca', pca), ('knn', knn)])

#Set up the parameters you want to tune for each of your pipeline steps
#Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = { 
    'sm__sampling_strategy': [0.20,0.50,0.70],
    'pca__n_components': [5,10,15,20,25,30,35], 
    'knn__n_neighbors': [2,4,8,10,20,30,40],  
}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

# call fit() on the GridSearchCV and pass in the normalized data (X_values, Y_values)
#grid_search.fit(x, y)

# print out the best_score_ and best_params_ from the GridSearchCV
#print(grid_search.best_params_)
#print(grid_search.best_score_)

In [28]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.98      1.00      0.99      4668
  Euthanasia       0.97      0.73      0.84       332

    accuracy                           0.98      5000
   macro avg       0.97      0.87      0.91      5000
weighted avg       0.98      0.98      0.98      5000



In [29]:
#Create a StandardScaler
scale = StandardScaler()

#create random undersampling
rus = RandomUnderSampler()

#create a PCA
pca = PCA()

#create a KNN classifier
knn = KNeighborsClassifier()

#create a pipeline that does a RUS a PCA and a KNN
pipe = Pipeline(steps=[('rus',rus),('scale',scale),('pca', pca), ('knn', knn)])

#Set up the parameters you want to tune for each of your pipeline steps
#Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = { 
    'rus__sampling_strategy': [0.20,0.50,0.70],
    'pca__n_components': [5,10,15,20,25,30,35], #find how many principal componenet to keep
    'knn__n_neighbors': [2,4,8,10,20,30,40],  #find the best value of k
}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

# call fit() on the GridSearchCV and pass in the normalized data (X_values, Y_values)
#grid_search.fit(x, y)

# print out the best_score_ and best_params_ from the GridSearchCV
#print(grid_search.best_params_)
#print(grid_search.best_score_)

In [30]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.98      1.00      0.99      4668
  Euthanasia       0.99      0.69      0.81       332

    accuracy                           0.98      5000
   macro avg       0.98      0.85      0.90      5000
weighted avg       0.98      0.98      0.98      5000



## Naive Bayes

To create and test the accuracy of our model, we followed the below steps:


* We split our data into features and labels


* We created a pipeline with the following steps:
    * SMOTE or random under-sampling using SMOTE() or RandomUnderSampler()
    * Naive bayes classifier using GaussianNB()
    
    
* We created a parameter grid and tuned the following:
    * We tuned the ratio of minority class to majority class to have in our data
    
    
* We then passed our pipe into a GridSearchCV


* We then passed our GridSearchCV into a cross_val_predict and used these predictions to create a classification report

In [31]:
x = dfOHE.drop('Outcome Type', axis = 1)
y = dfOHE.copy(deep = True)['Outcome Type']
print("\n\nDescribe the x data:")
display(x.describe())



Describe the x data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
count,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,...,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0
mean,1.913578,0.0,0.045299,0.010176,0.003218,0.553673,0.101337,0.927951,0.397887,0.602113,...,0.006862,0.010079,0.009207,0.00913,0.262435,0.203412,0.308238,0.225916,0.48961,0.51039
std,2.700683,0.0,0.207962,0.100364,0.056634,0.497116,0.301778,0.258571,0.489467,0.489467,...,0.082552,0.09989,0.095512,0.095113,0.439962,0.40254,0.46177,0.418188,0.499897,0.499897
min,0.001077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.240167,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.931234,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2.099821,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
max,22.018396,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [32]:
#create a Naive Bayes classifier
nb = GaussianNB()

#create SMOTE
sm = SMOTE()

#create a pipeline that does a SMOTE and NB
pipe = Pipeline(steps=[('sm',sm),('nb', nb)])

#Set up the parameters you want to tune for each of your pipeline steps
#Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = { 
    'sm__sampling_strategy': [0.10,0.20,0.50,0.70]
}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

# call fit() on the GridSearchCV and pass in the normalized data (X_values, Y_values)
#grid_search.fit(x, y)

# print out the best_score_ and best_params_ from the GridSearchCV
#print(grid_search.best_params_)
#print(grid_search.best_score_)

In [33]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.99      1.00      1.00     48129
  Euthanasia       1.00      0.88      0.93      3461

    accuracy                           0.99     51590
   macro avg       0.99      0.94      0.96     51590
weighted avg       0.99      0.99      0.99     51590



In [34]:
#create a Naive Bayes classifier
nb = GaussianNB()

#create random undersampling
rus = RandomUnderSampler()

#create a pipeline that does a SMOTE and NB
pipe = Pipeline(steps=[('rus',rus),('nb', nb)])

#Set up the parameters you want to tune for each of your pipeline steps
#Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = { 
    'rus__sampling_strategy': [0.10,0.20,0.50,0.70]
}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

# call fit() on the GridSearchCV and pass in the normalized data (X_values, Y_values)
#grid_search.fit(x, y)

# print out the best_score_ and best_params_ from the GridSearchCV
#print(grid_search.best_params_)
#print(grid_search.best_score_)

In [35]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.99      0.99      0.99     48129
  Euthanasia       0.90      0.88      0.89      3461

    accuracy                           0.99     51590
   macro avg       0.95      0.94      0.94     51590
weighted avg       0.99      0.99      0.99     51590



## Decision Tree

Next we decided to try a decision tree.

To create and test the accuracy of our model, we followed the below steps:


* We split our data into features and labels


* We created a pipeline with the following steps:
    * SMOTE or random under-sampling using SMOTE() or RandomUnderSampler()
    * Decision tree classifier using DecisionTreeClassifier()
    
    
* We created a parameter grid and tuned the following:
    * We tuned the ratio of minority class to majority class to have in our data
    * We tuned the depth of the tree
    * We tuned the minimum number of samples to have in the leaves of the tree
    
    
* We then passed our pipe into a GridSearchCV


* We then passed our GridSearchCV into a cross_val_predict and used these predictions to create a classification report

In [36]:
x = dfOHE.drop('Outcome Type', axis = 1)
y = dfOHE.copy(deep = True)['Outcome Type']
print("\n\nDescribe the x data:")
display(x.describe())



Describe the x data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
count,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,...,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0
mean,1.913578,0.0,0.045299,0.010176,0.003218,0.553673,0.101337,0.927951,0.397887,0.602113,...,0.006862,0.010079,0.009207,0.00913,0.262435,0.203412,0.308238,0.225916,0.48961,0.51039
std,2.700683,0.0,0.207962,0.100364,0.056634,0.497116,0.301778,0.258571,0.489467,0.489467,...,0.082552,0.09989,0.095512,0.095113,0.439962,0.40254,0.46177,0.418188,0.499897,0.499897
min,0.001077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.240167,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.931234,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2.099821,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
max,22.018396,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [37]:
#create a decision tree classifier
tree = DecisionTreeClassifier()

#create SMOTE
sm = SMOTE()

#create a pipeline that does a SMOTE and NB
pipe = Pipeline(steps=[('sm',sm),('tree', tree)])

#Set up the parameters you want to tune for each of your pipeline steps
#Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = { 
    'tree__max_depth' : [5,10,20],
    'tree__min_samples_leaf': [2,6,10],
    'sm__sampling_strategy': [0.10,0.20,0.50,0.70]
}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

# call fit() on the GridSearchCV and pass in the normalized data (X_values, Y_values)
#grid_search.fit(x, y)

# print out the best_score_ and best_params_ from the GridSearchCV
#print(grid_search.best_params_)
#print(grid_search.best_score_)

In [38]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.99      1.00      1.00     48129
  Euthanasia       0.99      0.89      0.94      3461

    accuracy                           0.99     51590
   macro avg       0.99      0.95      0.97     51590
weighted avg       0.99      0.99      0.99     51590



In [39]:
#create a decision tree classifier
tree = DecisionTreeClassifier()

#create RUS
rus = RandomUnderSampler()

#create a pipeline that does a SMOTE and NB
pipe = Pipeline(steps=[('rus',rus),('tree', tree)])

#Set up the parameters you want to tune for each of your pipeline steps
#Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = { 
    'tree__max_depth' : [5,10,20],
    'tree__min_samples_leaf': [2,6,10],
    'rus__sampling_strategy': [0.10,0.20,0.50,0.70]
}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

# call fit() on the GridSearchCV and pass in the normalized data (X_values, Y_values)
#grid_search.fit(x, y)

# print out the best_score_ and best_params_ from the GridSearchCV
#print(grid_search.best_params_)
#print(grid_search.best_score_)

In [40]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.99      1.00      1.00     48129
  Euthanasia       0.99      0.89      0.94      3461

    accuracy                           0.99     51590
   macro avg       0.99      0.94      0.97     51590
weighted avg       0.99      0.99      0.99     51590



## Random Forest Classifier

Next we decided to try a random forest classifier.


To create and test the accuracy of our model, we followed the below steps:


* We split our data into features and labels


* We created a pipeline with the following steps:
    * Random under-sampling using RandomUnderSampler()
        * Note that we only tried random under-sampling to help cut down on run-time
    * Random forest classifier using RandomForestClassifier()
    
    
* We created a parameter grid and tuned the following:
    * We tuned the ratio of minority class to majority class to have in our data
    * We tuned the max depth
    * We tuned the max samples
    * We tuned the max features
    
    
* We then passed our pipe into a GridSearchCV


* We then passed our GridSearchCV into a cross_val_predict and used these predictions to create a classification report

In [41]:
x = dfOHE.drop('Outcome Type', axis = 1)
y = dfOHE.copy(deep = True)['Outcome Type']
print("\n\nDescribe the x data:")
display(x.describe())



Describe the x data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
count,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,...,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0,51590.0
mean,1.913578,0.0,0.045299,0.010176,0.003218,0.553673,0.101337,0.927951,0.397887,0.602113,...,0.006862,0.010079,0.009207,0.00913,0.262435,0.203412,0.308238,0.225916,0.48961,0.51039
std,2.700683,0.0,0.207962,0.100364,0.056634,0.497116,0.301778,0.258571,0.489467,0.489467,...,0.082552,0.09989,0.095512,0.095113,0.439962,0.40254,0.46177,0.418188,0.499897,0.499897
min,0.001077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.240167,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.931234,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2.099821,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
max,22.018396,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [42]:
#Create a random forest classifier
rfc = RandomForestClassifier()

#Create a standard scaler
#scaler = sk.preprocessing.StandardScaler()

#Create a random under-sampler
rus = RandomUnderSampler()

#Create a PCA
#pca = sk.decomposition.PCA()

pipe = Pipeline(steps = [('rus', rus), ('rfc', rfc)])
param_grid = {'rus__sampling_strategy': [.25,.5,.75], 
              'rfc__max_depth': range(5,15), 
              'rfc__min_samples_leaf' : [2,6,10], 
              'rfc__max_features' : ["sqrt", "log2"]}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

In [43]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.99      1.00      0.99     48129
  Euthanasia       0.95      0.89      0.92      3461

    accuracy                           0.99     51590
   macro avg       0.97      0.95      0.96     51590
weighted avg       0.99      0.99      0.99     51590



## SVM

Next we tried an SVM classifier.


To create and test the accuracy of our model, we followed the below steps:

* We took a sample of our data


* We split our data into features and labels


* We created a pipeline with the following steps:
    * SMOTE or random under-sampling using SMOTE() or RandomUnderSampler()
    * Scaling using StandardScaler()
    * Principle component analysis using PCA()
    * SVM classifier using SVC()
    
    
* We created a parameter grid and tuned the following:
    * We tuned the ratio of minority class to majority class to have in our data
    * We tuned the number of principle components to keep
    * We tuned the kernal to use for SVM
    
    
* We then passed our pipe into a GridSearchCV


* We then passed our GridSearchCV into a cross_val_predict and used these predictions to create a classification report

In [44]:
data_sample = dfOHE.sample(5000)
x = data_sample.drop('Outcome Type', axis = 1)
y = data_sample.copy(deep = True)['Outcome Type']
print("\n\nDescribe the x data:")
display(x.describe())



Describe the x data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,...,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,1.912169,0.0,0.044,0.0106,0.0026,0.5562,0.0936,0.9302,0.4136,0.5864,...,0.0068,0.0084,0.0072,0.007,0.2592,0.199,0.3122,0.2296,0.4842,0.5158
std,2.686797,0.0,0.205116,0.102419,0.050929,0.496881,0.291301,0.254835,0.492528,0.492528,...,0.082189,0.091275,0.084555,0.083381,0.438239,0.399288,0.463437,0.420618,0.4998,0.4998
min,0.001743,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.237811,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.926435,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2.10887,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
max,18.026568,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [45]:
#Create the SVM classifier
svc = SVC()

#Create the standard  scaler
scaler = StandardScaler()

#Create the SMOTE
sm = SMOTE()

#Create the principle component analysis
pca = PCA()

pipe = Pipeline(steps = [('sm',sm),('StandardScaler',scaler),('pca',pca,),('svc',svc)])
param_grid = {
    'pca__n_components':[10,15,20,25,30,35], 'sm__sampling_strategy':[.25,.5,.75], 
    'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid'], 'svc__gamma':['auto']
}

GSCV = GridSearchCV(pipe,param_grid=param_grid,cv=5)

In [46]:
predictions = cross_val_predict(GSCV,x,y, cv=5)
print(classification_report(y,predictions))

              precision    recall  f1-score   support

    Adoption       0.99      1.00      1.00      4672
  Euthanasia       0.99      0.87      0.93       328

    accuracy                           0.99      5000
   macro avg       0.99      0.94      0.96      5000
weighted avg       0.99      0.99      0.99      5000



In [47]:
#Create the SVM classifier
svc = SVC()

#Create the standard  scaler
scaler = StandardScaler()

#Create the random under-sampler
rus = RandomUnderSampler()

#Create the principle component analysis
pca = PCA()

pipe = Pipeline(steps = [('rus',rus),('StandardScaler',scaler),('pca',pca,),('svc',svc)])
param_grid = {
    'pca__n_components':[10,15,20,25,30,35], 'rus__sampling_strategy':[.25,.5,.75], 
    'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid'], 'svc__gamma':['auto']
}

GSCV = GridSearchCV(pipe,param_grid=param_grid,cv=5)

In [48]:
predictions = cross_val_predict(GSCV,x,y, cv=5)
print(classification_report(y,predictions))

              precision    recall  f1-score   support

    Adoption       0.99      1.00      0.99      4672
  Euthanasia       0.95      0.88      0.91       328

    accuracy                           0.99      5000
   macro avg       0.97      0.94      0.95      5000
weighted avg       0.99      0.99      0.99      5000



## Neural Net

Finally, the last model we used was a neural net.


To create and test the accuracy of our model, we followed the below steps:


* We took a sample of our data


* We split our data into features and labels


* We created a pipeline with the following steps:
    * SMOTE or random under-sampling using SMOTE() or RandomUnderSampler()
    * Scaling using StandardScaler()
    * Principle component analysis using PCA()
    * Neural net classifier using MLPClassifier()
    
    
* We created a parameter grid and tuned the following:
    * We tuned the ratio of minority class to majority class to have in our data
    * We tuned the number of principle components to keep
    * We tuned the number of hiddlayers for the nn
    * We tuned the activation function for the nn
    
    
* We then passed our pipe into a GridSearchCV


* We then passed our GridSearchCV into a cross_val_predict and used these predictions to create a classification report


In [49]:
data_sample = dfOHE.sample(1000)
x = data_sample.drop('Outcome Type', axis = 1)
y = data_sample.copy(deep = True)['Outcome Type']
print("\n\nDescribe the x data:")
display(x.describe())



Describe the x data:


Unnamed: 0,Age Upon Outcome,TNR,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,...,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Season_Fall,Season_Spring,Season_Summer,Season_Winter,Gender_Female,Gender_Male
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,1.769384,0.0,0.04,0.013,0.003,0.559,0.101,0.937,0.383,0.617,...,0.007,0.013,0.013,0.009,0.28,0.184,0.321,0.215,0.485,0.515
std,2.606872,0.0,0.196057,0.113331,0.054717,0.496755,0.30148,0.243085,0.486362,0.486362,...,0.083414,0.113331,0.113331,0.094488,0.449224,0.387678,0.467094,0.411028,0.500025,0.500025
min,0.034658,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.241885,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.863634,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2.029321,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
max,16.371436,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [50]:
#create scaler
scaler = StandardScaler()

#create SMOTE
sm = SMOTE()

#create a PCA
pca = PCA()

#create neural net
net = MLPClassifier()

pipe = Pipeline(steps=[('sm',sm),('scaler',scaler),('pca',pca),('net', net)])

param_grid = {'pca__n_components':[8,10,15,20],'sm__sampling_strategy':[.25,.5,.75],
                  'net__hidden_layer_sizes':[(5,), (10,), (15,)], 
                  'net__activation':['logistic', 'tanh', 'relu']
                 }

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1_macro')

In [51]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.98      0.98      0.98       936
  Euthanasia       0.76      0.73      0.75        64

    accuracy                           0.97      1000
   macro avg       0.87      0.86      0.86      1000
weighted avg       0.97      0.97      0.97      1000



In [52]:
#create scaler
scaler = StandardScaler()

#create random undersampling
rus = RandomUnderSampler()

#create a PCA
pca = PCA()

#create neural net
net = MLPClassifier()

pipe = Pipeline(steps=[('rus',rus),('scaler',scaler),('pca',pca),('net', net)])

param_grid = {'pca__n_components':[8,10,15,20],'rus__sampling_strategy':[.25,.5,.75],
                  'net__hidden_layer_sizes':[(5,), (10,), (15,)], 
                  'net__activation':['logistic', 'tanh', 'relu']
                 }

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1_macro')

In [53]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.98      0.97      0.98       936
  Euthanasia       0.66      0.72      0.69        64

    accuracy                           0.96      1000
   macro avg       0.82      0.85      0.83      1000
weighted avg       0.96      0.96      0.96      1000



# Selecting a Final Model and One Last Challenge

To determine which model worked best for our data, we considered the minority (Euthanasia) class f-1 scores. 

Although multiple models had the same highest f-1 score for Euthanasia (0.94), we decided to use the decision tree model because it is the most simple and fast.

It was at this point that we realized that including the Age Upon Outcome and Season attributes might not make sense because these attributes wouldn't be known until the outcome occurred. It could be argued that, because it is likely that not too much time will pass between the animal entering into the shelter and the outcome of the animal, the age of the animal upon entering the facility could be used as the input to the Age Upon Outcome attribute in our model, as an approximation of the animal's Age Upon Outcome. However, this argument is not quite so true for Season, because an animal could, for example, enter the shelter at the very end of Winter, in which case Winter would not be a good approximation of the Season in which the outcome would occur for the animal. 

Considering the issues with these two attributes, we decided to try our chosen model without these attributes to see if the f-1 scores were significantly impacted. Ideally we could try all the models with these attributes dropped, but because the results of the decision tree model were still good even after dropping these attributes, we decided not to for the sake of time. 

As a side note, we also realized that TNR had a count of 0 at this point (which makes sense because none of the animals that were planned to be trapped, neutered, and released were adopted or euthanized). So, we also removed this column.

In [54]:
dfNew = df.drop(['Season', 'Age Upon Outcome', 'TNR'], axis = 1)
dfOHENew = dfNew.copy(deep = True)

dfOHENew = pd.concat([dfOHENew,pd.get_dummies(dfOHENew['Animal Type'], prefix='AnimalType',dummy_na=True)],axis=1).drop(['Animal Type'],axis=1)
dfOHENew = pd.concat([dfOHENew,pd.get_dummies(dfOHENew['CombinedColor'], prefix='CombinedColor',dummy_na=True)],axis=1).drop(['CombinedColor'],axis=1)
dfOHENew = pd.concat([dfOHENew,pd.get_dummies(dfOHENew['CombinedBreed'], prefix='CombinedBreed',dummy_na=True)],axis=1).drop(['CombinedBreed'],axis=1)
dfOHENew = pd.concat([dfOHENew,pd.get_dummies(dfOHENew['Gender'], prefix='Gender',dummy_na=True)],axis=1).drop(['Gender'],axis=1)



dfOHENew = dfOHENew.drop(['AnimalType_nan','CombinedColor_nan', 'CombinedBreed_nan', 'Gender_nan'], axis = 1)
display(dfOHENew.sum())

print("\n\nOne-hot encoded dataframe:")
display(dfOHENew.head(25))

x = dfOHENew.drop('Outcome Type', axis = 1)
y = dfOHENew.copy(deep = True)['Outcome Type']
print("\n\nDisplay the x data:")
display(x.head(25))

Outcome Type                   AdoptionAdoptionAdoptionAdoptionAdoptionAdopti...
Suffering                                                                   2337
Aggressive                                                                   525
Rabies                                                                       166
Multicolor                                                                 28564
Pitbull                                                                     5228
Spayed/Neutered                                                            47873
AnimalType_Cat                                                             20527
AnimalType_Dog                                                             31063
CombinedColor_Black                                                        13353
CombinedColor_Blue                                                          4219
CombinedColor_Brown                                                        11632
CombinedColor_Gray          



One-hot encoded dataframe:


Unnamed: 0,Outcome Type,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,CombinedColor_Black,...,CombinedBreed_Mutt,CombinedBreed_Non-Sporting,CombinedBreed_Semi-Longhair,CombinedBreed_Short Hair,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Gender_Female,Gender_Male
0,Adoption,0,0,0,0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
1,Adoption,0,0,0,0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
3,Adoption,0,0,0,0,0,1,1,0,1,...,1,0,0,0,0,0,0,0,0,1
4,Adoption,0,0,0,1,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,1
9,Adoption,0,0,0,0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
11,Adoption,0,0,0,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,1,0
12,Euthanasia,0,0,1,1,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,1
13,Euthanasia,0,0,1,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,1
15,Euthanasia,0,0,1,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,1
16,Adoption,0,0,0,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,1




Display the x data:


Unnamed: 0,Suffering,Aggressive,Rabies,Multicolor,Pitbull,Spayed/Neutered,AnimalType_Cat,AnimalType_Dog,CombinedColor_Black,CombinedColor_Blue,...,CombinedBreed_Mutt,CombinedBreed_Non-Sporting,CombinedBreed_Semi-Longhair,CombinedBreed_Short Hair,CombinedBreed_Sporting,CombinedBreed_Terrier,CombinedBreed_Toy,CombinedBreed_Working,Gender_Female,Gender_Male
0,0,0,0,0,0,1,0,1,0,0,...,1,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,1,0,1,0,0,...,1,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
4,0,0,0,1,0,1,1,0,0,0,...,1,0,0,0,0,0,0,0,0,1
9,0,0,0,0,0,1,0,1,0,0,...,1,0,0,0,0,0,0,0,0,1
11,0,0,0,0,0,1,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
12,0,0,1,1,1,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,1
13,0,0,1,1,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,1
15,0,0,1,1,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,1
16,0,0,0,0,0,1,1,0,0,0,...,1,0,0,0,0,0,0,0,0,1


In [55]:
#create a decision tree classifier
tree = DecisionTreeClassifier()

#create SMOTE
sm = SMOTE()

#create a pipeline that does a SMOTE and NB
pipe = Pipeline(steps=[('sm',sm),('tree', tree)])

#Set up the parameters you want to tune for each of your pipeline steps
#Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = { 
    'tree__max_depth' : [5,10,20],
    'tree__min_samples_leaf': [2,6,10],
    'sm__sampling_strategy': [0.10,0.20,0.50,0.70]
}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

In [56]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.99      1.00      1.00     48129
  Euthanasia       1.00      0.87      0.93      3461

    accuracy                           0.99     51590
   macro avg       1.00      0.94      0.96     51590
weighted avg       0.99      0.99      0.99     51590



In [57]:
#create a decision tree classifier
tree = DecisionTreeClassifier()

#create RUS
rus = RandomUnderSampler()

#create a pipeline that does a SMOTE and NB
pipe = Pipeline(steps=[('rus',rus),('tree', tree)])

#Set up the parameters you want to tune for each of your pipeline steps
#Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = { 
    'tree__max_depth' : [5,10,20],
    'tree__min_samples_leaf': [2,6,10],
    'rus__sampling_strategy': [0.10,0.20,0.50,0.70]
}

# pass the pipeline and the parameters into a GridSearchCV with a 5-fold cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)

In [58]:
predictions = cross_val_predict(grid_search, x, y, cv=5)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

    Adoption       0.99      1.00      1.00     48129
  Euthanasia       1.00      0.87      0.93      3461

    accuracy                           0.99     51590
   macro avg       1.00      0.94      0.96     51590
weighted avg       0.99      0.99      0.99     51590



## Final Result

Luckily, even with Age Upon Outcome and Season dropped, we still had good results, 0.93, for the minority class f-1 score. 