This is the EDA, cleaning, and modeling notebook for the GA Hackathon Kaggle competition. I've selected the Shelter Animal Outcomes classification dataset. Let's load the libraries I'll need:

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Let's read in the data:

In [3]:
pets = pd.read_csv('./train.csv.gz')
pets.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


In [7]:
pets['OutcomeType'].value_counts(normalize = True)

Adoption           0.402896
Transfer           0.352501
Return_to_owner    0.179056
Euthanasia         0.058177
Died               0.007370
Name: OutcomeType, dtype: float64

So here, we can see that our majority class is 'Adoption' (thankfully). If we just predicted Adoption for every animal, we would be right about 40.3% of the time. That's our baseline accuracy, and any model we build needs to beat it.

We can also see that there are five possible outcomes, so we're dealing with a multiclass classification problem.
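
To make the baseline idea concrete, here's a minimal sketch on a made-up outcome column (the ten values below are invented for illustration, not taken from the dataset). It just checks that the majority class share equals the accuracy of always predicting that class:

```python
import pandas as pd

# Toy stand-in for pets['OutcomeType']; these values are made up.
outcomes = pd.Series(
    ["Adoption", "Adoption", "Transfer", "Adoption", "Return_to_owner",
     "Transfer", "Euthanasia", "Adoption", "Transfer", "Died"],
    name="OutcomeType",
)

# Share of each class; the largest share is the majority-class baseline.
class_shares = outcomes.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

# Accuracy of a "model" that predicts the majority class for every row.
always_majority = pd.Series(majority_class, index=outcomes.index)
naive_accuracy = (always_majority == outcomes).mean()

print(majority_class, baseline_accuracy, naive_accuracy)
```

On the real data, the same logic gives the 40.3% figure for 'Adoption' shown in the value counts.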

In [6]:
pets.shape

(26729, 10)

In [16]:
pets.dtypes

AnimalID          object
Name              object
DateTime          object
OutcomeType       object
OutcomeSubtype    object
AnimalType        object
SexuponOutcome    object
AgeuponOutcome    object
Breed             object
Color             object
dtype: object

In [32]:
pets.isnull().sum()

AnimalID              0
Name                  0
DateTime              0
OutcomeType           0
OutcomeSubtype    13611
AnimalType            0
SexuponOutcome        0
AgeuponOutcome        0
Breed                 0
Color                 0
dtype: int64

In [58]:
pets['Breed'].value_counts()

Domestic Shorthair Mix                      8794
Pit Bull Mix                                1906
Chihuahua Shorthair Mix                     1766
Labrador Retriever Mix                      1363
Domestic Medium Hair Mix                     839
                                            ... 
Labrador Retriever/Alaskan Husky               1
Dachshund Longhair/Pembroke Welsh Corgi        1
Boxer/Harrier                                  1
Pembroke Welsh Corgi/Australian Shepherd       1
Cairn Terrier/Affenpinscher                    1
Name: Breed, Length: 1380, dtype: int64

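We're gonna do a little data cleaning here: we'll fill in the animal names we don't know with 'X', and drop the rows that are missing AgeuponOutcome or SexuponOutcome, because there are only 19 of them. (The null counts above come from a cell that was run after this cleaning, which is why those columns show zero missing values.)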

In [19]:
pets['Name'] = pets['Name'].fillna('X')

In [30]:
pets = pets.dropna(subset=['AgeuponOutcome', 'SexuponOutcome'])

In [31]:
pets.shape

(26710, 10)
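
Here's a small sketch of the same two cleaning steps on a made-up frame (the `toy` dataframe and its values are hypothetical), counting how many rows the `dropna` call removes:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with the same column names as the real data.
toy = pd.DataFrame({
    "Name": ["Hambone", np.nan, "Emily", np.nan],
    "AgeuponOutcome": ["1 year", "3 weeks", np.nan, "2 years"],
    "SexuponOutcome": ["Neutered Male", "Intact Male", "Spayed Female", np.nan],
})

rows_before = len(toy)

# Unknown names get a placeholder instead of being dropped.
toy["Name"] = toy["Name"].fillna("X")

# Rows missing either age or sex are removed entirely.
toy = toy.dropna(subset=["AgeuponOutcome", "SexuponOutcome"])

rows_dropped = rows_before - len(toy)
print(rows_dropped, toy["Name"].tolist())
```

Comparing the row count before and after is the same check as the two `pets.shape` outputs: 26729 rows before, 26710 after.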

In [33]:
X = pets.drop(columns = ['OutcomeType', 'OutcomeSubtype'])
y = pets.OutcomeType

In [34]:
X.shape

(26710, 8)

To use most of my X dataframe, I'll have to dummify the categorical variables. However, several of the columns have a very large number of categories (Breed alone has 1,380 distinct values), which would create too many features. First, I'll pull the year and month out of the DateTime column, so I don't lose that information:

In [41]:
X['DateTime'] = pd.to_datetime(X['DateTime'])

In [45]:
X['year_month'] = X['DateTime'].map(lambda x: 100*x.year + x.month)

In [46]:
X['year'] = X['DateTime'].map(lambda x: x.year)

In [47]:
X['Month'] = X['DateTime'].map(lambda x: x.month)

In [48]:
X.dtypes

AnimalID                  object
Name                      object
DateTime          datetime64[ns]
AnimalType                object
SexuponOutcome            object
AgeuponOutcome            object
Breed                     object
Color                     object
year_month                 int64
year                       int64
Month                      int64
dtype: object
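
As an aside, the same date features can be built without `map` and a lambda by using the pandas `.dt` accessor. A minimal sketch on two timestamps copied from the first rows of the data (the `toy` frame is just for illustration; depending on the pandas version, `.dt.year` may come back as int32 rather than int64):

```python
import pandas as pd

# Two timestamps taken from the head of the dataset.
toy = pd.DataFrame({"DateTime": ["2014-02-12 18:22:00", "2013-10-13 12:44:00"]})
toy["DateTime"] = pd.to_datetime(toy["DateTime"])

# Vectorized equivalents of the lambda-based columns.
toy["year"] = toy["DateTime"].dt.year
toy["Month"] = toy["DateTime"].dt.month

# year_month packs both into one integer, e.g. February 2014 -> 201402.
toy["year_month"] = 100 * toy["year"] + toy["Month"]

print(toy[["year", "Month", "year_month"]])
```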

Okay, now we need to drop the features whose cardinality is too high to dummify. Let's set the maximum cardinality for a given feature to 100 distinct values:

In [49]:
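max_cardinality = 100

# Non-numeric columns (this includes the DateTime column) with more than
# max_cardinality distinct values
high_cardinality = [col for col in X.select_dtypes(exclude=np.number)
                   if X[col].nunique() > max_cardinality]

X = X.drop(columns=high_cardinality)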

In [50]:
X.dtypes

AnimalType        object
SexuponOutcome    object
AgeuponOutcome    object
year_month         int64
year               int64
Month              int64
dtype: object

In [51]:
X.shape

(26710, 6)
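
To see what that filter does, here's a sketch on a tiny made-up frame, with a hypothetical threshold of 3 instead of 100 so the effect is visible with only four rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: an ID column (unique per row), a low-cardinality
# categorical column, and a numeric column.
toy = pd.DataFrame({
    "AnimalID": ["A1", "A2", "A3", "A4"],
    "AnimalType": ["Dog", "Cat", "Dog", "Cat"],
    "Month": [2, 10, 1, 7],
})

max_cardinality = 3  # assumption for the toy data only

# Count distinct values in every non-numeric column.
cardinality = toy.select_dtypes(exclude=np.number).nunique()

# Keep the names of the columns that exceed the threshold, then drop them.
high_cardinality = cardinality[cardinality > max_cardinality].index.tolist()
reduced = toy.drop(columns=high_cardinality)

print(high_cardinality, reduced.columns.tolist())
```

Numeric columns are never candidates for dropping here, which is why year_month, year, and Month survive in the real data no matter how many distinct values they have.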

In [52]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 42)

In [53]:
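X_train = pd.get_dummies(X_train)
# Encode the test set, then align it to the training columns so both
# matrices have the same features in the same order
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)

# One-hot encode the target; `columns` only applies to DataFrames, so it is omitted
y_train = pd.get_dummies(y_train)
y_test = pd.get_dummies(y_test)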

In [40]:
y_train.head()

Unnamed: 0,Adoption,Died,Euthanasia,Return_to_owner,Transfer
3821,1,0,0,0,0
15493,1,0,0,0,0
16713,1,0,0,0,0
20984,0,0,0,0,1
8990,0,0,0,1,0
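
One thing to watch when dummifying the train and test sets separately: a category that shows up in only one split produces mismatched columns. Here's a minimal sketch on made-up data of one way to handle that, by reindexing the test dummies to the training columns (the `train` and `test` frames below are hypothetical):

```python
import pandas as pd

# Hypothetical splits: "5 months" appears only in test,
# "2 years" and "3 weeks" appear only in train.
train = pd.DataFrame({"AgeuponOutcome": ["1 year", "2 years", "3 weeks"]})
test = pd.DataFrame({"AgeuponOutcome": ["1 year", "5 months"]})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# Encoded separately, the two frames disagree on their columns.
print(train_d.columns.tolist())
print(test_d.columns.tolist())

# Align test to train: columns missing from test are filled with 0,
# and columns unseen in train are discarded.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned.shape)
```

The trade-off is that a category unseen in training is silently dropped from the test set, which is usually what you want, since a model fit on the training columns has no weight for it anyway.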


In [61]:
X_train.shape

(20032, 54)

Okay, so now all of our features are numeric. Time to scale them, fitting the scaler on the training set only:

In [62]:
sc = StandardScaler()

X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
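
Here's a small sketch of what fit-on-train, transform-on-test means in practice, using made-up numbers. The test row is scaled with the training mean and standard deviation, so it doesn't have to end up centered at zero:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: three training rows, one test row.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

sc = StandardScaler()

# fit_transform learns the per-column mean and std from the training data.
X_train_sc = sc.fit_transform(X_train_toy)

# transform reuses those training statistics on the test data.
X_test_sc = sc.transform(X_test_toy)

print(X_train_sc.mean(axis=0), X_train_sc.std(axis=0))
print(X_test_sc)
```

The scaled training columns have mean 0 and standard deviation 1. The test row's second feature lands well above zero because 40 is far above the training mean of 20.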