<img src="https://upload.wikimedia.org/wikipedia/commons/7/7c/Kaggle_logo.png" align="left" height=100 width=200>

🚀 [**`Kaggle - Shelter Animal Outcomes`**](https://www.kaggle.com/competitions/shelter-animal-outcomes) 🚀

# 📚 Libraries

In [1]:
# DATA MANIPULATION
import pandas as pd 
import numpy as np
import gzip

# DATA VIZ
import matplotlib.pyplot as plt
import seaborn as sns

# STATS
from scipy import stats
from statsmodels.graphics.gofplots import qqplot

# MACHINE LEARNING
## PREPROCESSING
from sklearn.impute import SimpleImputer, KNNImputer
## MODEL SELECTION
from sklearn.model_selection import train_test_split, cross_validate, cross_val_predict
## SCALERS
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
## CLASSIFICATION MODELS
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
## EVALUATION
from sklearn.metrics import accuracy_score, recall_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import precision_recall_curve
## MODEL TUNING
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer, make_column_selector
from sklearn import set_config; set_config(display="diagram")  

# 🐈 Dataset

In [2]:
data_train = pd.read_csv('train.csv')
data_train.head()

| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | AgeuponOutcome | Breed | Color |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | 1 year | Shetland Sheepdog Mix | Brown/White |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | 1 year | Domestic Shorthair Mix | Cream Tabby |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | 2 years | Pit Bull Mix | Blue/White |
| 3 | A683430 | | 2014-07-11 19:09:00 | Transfer | Partner | Cat | Intact Male | 3 weeks | Domestic Shorthair Mix | Blue Cream |
| 4 | A667013 | | 2013-11-15 12:52:00 | Transfer | Partner | Dog | Neutered Male | 2 years | Lhasa Apso/Miniature Poodle | Tan |


In [3]:
data_test = pd.read_csv('test.csv')
data_test.head()

| | ID | Name | DateTime | AnimalType | SexuponOutcome | AgeuponOutcome | Breed | Color |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Summer | 2015-10-12 12:15:00 | Dog | Intact Female | 10 months | Labrador Retriever Mix | Red/White |
| 1 | 2 | Cheyenne | 2014-07-26 17:59:00 | Dog | Spayed Female | 2 years | German Shepherd/Siberian Husky | Black/Tan |
| 2 | 3 | Gus | 2016-01-13 12:20:00 | Cat | Neutered Male | 1 year | Domestic Shorthair Mix | Brown Tabby |
| 3 | 4 | Pongo | 2013-12-28 18:12:00 | Dog | Intact Male | 4 months | Collie Smooth Mix | Tricolor |
| 4 | 5 | Skooter | 2015-09-24 17:59:00 | Dog | Neutered Male | 2 years | Miniature Poodle Mix | White |


In [4]:
print(f'The shape of the data_train dataset is: {data_train.shape}')
print(f'The shape of the data_test dataset is: {data_test.shape}')

The shape of the data_train dataset is: (26729, 10)
The shape of the data_test dataset is: (11456, 8)


In [5]:
print(data_train.dtypes.value_counts())
print('--'*50)
print(data_test.dtypes.value_counts())

object    10
dtype: int64
----------------------------------------------------------------------------------------------------
object    7
int64     1
dtype: int64


# 📊 Data visualisation/analysis

🐈‍⬛ Let's see the types of animals in the dataset

In [None]:
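# Pass the column by keyword: recent seaborn versions treat the first positional argument as `data`
sns.countplot(data=data_train, x='AnimalType', palette='Set3')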



<AxesSubplot:xlabel='AnimalType', ylabel='count'>

🐈 What is the distribution of the outcomes for the animals?

In [None]:
sns.countplot(data=data_train, x='OutcomeType', palette='Set3')

The vast majority of animals are either adopted or transferred
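
To put numbers on what the count plot shows, the share of each outcome can be computed with `value_counts(normalize=True)`. The snippet below is only a sketch on a small hypothetical sample (the `outcomes` series is made up for illustration); on the real data the same call would be applied to `data_train['OutcomeType']`.

```python
import pandas as pd

# Hypothetical toy sample standing in for data_train['OutcomeType']
outcomes = pd.Series(
    ['Adoption', 'Transfer', 'Adoption', 'Return_to_owner',
     'Transfer', 'Euthanasia', 'Adoption', 'Transfer'],
    name='OutcomeType',
)

# normalize=True returns proportions instead of raw counts
shares = outcomes.value_counts(normalize=True)
print(shares.round(3))
```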

Another interesting column to plot is SexuponOutcome

In [None]:
sns.countplot(data=data_train, x='SexuponOutcome', palette='Set3')

This plot combines two pieces of information: whether the animal is male or female, and whether it has been spayed/neutered or not<br />
The other columns are not that meaningful to plot

We can therefore split this column into two separate ones: Sex and Neutered

In [None]:
def get_sex(x):
    x = str(x)
    if x.find('Male') >= 0: return 'Male'
    if x.find('Female') >= 0: return 'Female'
    return 'Unknown'
def get_neutered(x):
    x = str(x)
    if x.find('Neutered') >= 0: return 'Neutered'
    if x.find('Spayed') >= 0: return 'Spayed'
    if x.find('Intact') >= 0: return 'Intact'
    return 'Unknown'

In [None]:
data_train['Sex'] = data_train.SexuponOutcome.apply(get_sex)
data_train['Neutered'] = data_train.SexuponOutcome.apply(get_neutered)
data_test['Sex'] = data_test.SexuponOutcome.apply(get_sex)
data_test['Neutered'] = data_test.SexuponOutcome.apply(get_neutered)

In [None]:
data_train.drop('SexuponOutcome', axis=1, inplace=True)
data_test.drop('SexuponOutcome', axis=1, inplace=True)

In [None]:
print(f"The proportion of Female is: {(data_train['Sex']=='Female').sum()/len(data_train)}")
print(f"The proportion of Male is: {(data_train['Sex']=='Male').sum()/len(data_train)}")
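
As an aside, the split of `SexuponOutcome` into two columns can also be done in a single vectorized pass with `Series.str.extract`. This is a sketch under assumptions, not the approach used in this notebook: the `sample` series is hypothetical, and the regular expression assumes the values always follow the "status then sex" order seen in the data preview.

```python
import pandas as pd

# Hypothetical sample mimicking the SexuponOutcome column, including a missing value
sample = pd.Series(['Neutered Male', 'Spayed Female', 'Intact Male', 'Unknown', None])

# Two optional named groups: the sterilisation status, then the sex
pattern = r'(?P<Neutered>Neutered|Spayed|Intact)?\s*(?P<Sex>Male|Female)?'
parts = sample.str.extract(pattern)

# Anything that did not match either group is labelled 'Unknown', as in the helper functions
parts = parts.fillna('Unknown')
print(parts)
```

Both new columns come back in one DataFrame, so they could be attached with `pd.concat` or assigned column by column.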

The column 'Breed' also tells us something about the animal: whether it is purebred or mixed

In [None]:
def get_mix(x):
    x=str(x)
    if x.find('Mix') >= 0: return 'Mix'
    return 'Not'
data_train['Breed'] = data_train.Breed.apply(get_mix)
data_test['Breed'] = data_test.Breed.apply(get_mix)

In [None]:
sns.countplot(data=data_train, x='Breed', palette='Set3')

We can see that mixed-breed animals make up a much larger part of the shelter population

But how much do these parameters influence the outcome?

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(data=data_train, x='OutcomeType',hue='Sex', ax=ax1)
sns.countplot(data=data_train, x='Sex',hue='OutcomeType', ax=ax2)

We can see that the sex of the animal does not seem to matter much for the outcome
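
A visual impression like this one can be checked with a chi-square test of independence on the contingency table. The sketch below uses a tiny hypothetical table (the `toy` frame is invented for illustration) and `scipy.stats.chi2_contingency`; on the full dataset, keep in mind that with tens of thousands of rows even small differences can come out as statistically significant.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for data_train[['Sex', 'OutcomeType']]
toy = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'OutcomeType': ['Adoption', 'Adoption', 'Transfer', 'Transfer',
                    'Adoption', 'Transfer', 'Transfer', 'Adoption'],
})

# Contingency table of counts: one row per sex, one column per outcome
table = pd.crosstab(toy['Sex'], toy['OutcomeType'])

# A large p-value means the data give no evidence against independence
chi2, p_value, dof, expected = stats.chi2_contingency(table.to_numpy())
print(table)
print(f'chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}')
```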

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(data=data_train, x='OutcomeType',hue='AnimalType', ax=ax1)
sns.countplot(data=data_train, x='AnimalType',hue='OutcomeType', ax=ax2)

However, the type of the animal does: dogs are most likely to be adopted or returned to their owner, while cats are most likely to be transferred

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(data=data_train, x='OutcomeType',hue='Breed', ax=ax1)
sns.countplot(data=data_train, x='Breed',hue='OutcomeType', ax=ax2)

The breed does not seem to play a significant part in the process: the counts differ in the graphs, but the two groups are already very different in size from the start, so the raw counts are not directly comparable
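
Because the groups differ so much in size, comparing proportions within each group is more telling than comparing raw counts. A minimal sketch, on hypothetical toy data, using `pd.crosstab` with `normalize='index'` so that every row sums to 1:

```python
import pandas as pd

# Hypothetical toy data: six 'Mix' animals and two 'Not', with the same outcome split
toy = pd.DataFrame({
    'Breed': ['Mix'] * 6 + ['Not'] * 2,
    'OutcomeType': ['Adoption', 'Adoption', 'Adoption',
                    'Transfer', 'Transfer', 'Transfer',
                    'Adoption', 'Transfer'],
})

# normalize='index' divides each row by its total, giving outcome shares per breed group
row_shares = pd.crosstab(toy['Breed'], toy['OutcomeType'], normalize='index')
print(row_shares)
```

In this toy example the raw counts differ by a factor of three, yet the shares per group are identical, which is the situation described above.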

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(data=data_train, x='OutcomeType',hue='Neutered', ax=ax1)
sns.countplot(data=data_train, x='Neutered',hue='OutcomeType', ax=ax2)

Last but not least: animals left intact have a much higher chance of being either euthanised or transferred

# 🏛️ Classification model

In [None]:
data_train

## ⌚ Convert DateTime to numerical features

In [None]:
train_time = pd.to_datetime(data_train['DateTime'])
test_time = pd.to_datetime(data_test['DateTime'])

In [None]:
data_train['Year'] = train_time.dt.year
data_train['Month'] = train_time.dt.month
data_train['Day'] = train_time.dt.day
data_train['Hour'] = train_time.dt.hour
data_train['Minute'] = train_time.dt.minute
data_train.drop('DateTime', axis=1, inplace=True)

In [None]:
data_test['Year'] = test_time.dt.year
data_test['Month'] = test_time.dt.month
data_test['Day'] = test_time.dt.day
data_test['Hour'] = test_time.dt.hour
data_test['Minute'] = test_time.dt.minute
data_test.drop('DateTime', axis=1, inplace=True)
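
Since the same five columns are created for both datasets, the conversion could be wrapped in one helper to avoid duplicating the cell. The function name `add_datetime_parts` and the toy frame are hypothetical; this is a sketch of one way to do it, and it returns a copy instead of modifying the frame in place.

```python
import pandas as pd

def add_datetime_parts(df, column='DateTime'):
    """Return a copy of df where `column` is replaced by Year/Month/Day/Hour/Minute."""
    out = df.copy()
    stamps = pd.to_datetime(out[column])
    for part in ('year', 'month', 'day', 'hour', 'minute'):
        # e.g. stamps.dt.year is stored in the 'Year' column
        out[part.capitalize()] = getattr(stamps.dt, part)
    return out.drop(columns=column)

# Hypothetical two-row frame using timestamps in the same format as the dataset
toy = pd.DataFrame({'DateTime': ['2014-02-12 18:22:00', '2013-10-13 12:44:00']})
toy_parts = add_datetime_parts(toy)
print(toy_parts)
```

With such a helper, both frames could be processed with `data_train = add_datetime_parts(data_train)` and the same call for `data_test`.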

In [None]:
data_test

## 📅 Convert AgeuponOutcome to numerical feature

In [None]:
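def get_age(x):
    """Convert an age string such as '2 years' or '3 weeks' to an age in days."""
    age_list = str(x).split()
    # Missing values (NaN becomes 'nan') have no unit, so they stay missing
    if len(age_list) != 2:
        return np.nan
    number, unit = int(age_list[0]), age_list[1]
    # `in` replaces str.find, which returns -1 (truthy) when nothing is found
    if 'year' in unit:
        return number * 360  # 12 months of 30 days
    if 'month' in unit:
        return number * 30
    if 'week' in unit:
        return number * 7
    if 'day' in unit:
        return number
    return np.nan
data_train['Age'] = data_train.AgeuponOutcome.apply(get_age)
data_test['Age'] = data_test.AgeuponOutcome.apply(get_age)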
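
The same conversion can be sketched in a vectorized form with `Series.str.extract` and a unit-to-days mapping. The names `DAYS_PER_UNIT` and `age_to_days` are hypothetical, the sample values are made up, and the 30-day month and 360-day year are assumptions carried over from the function above. Missing or unparseable ages come out as `NaN`, which keeps the column numeric and leaves the choice of imputation open.

```python
import numpy as np
import pandas as pd

# Assumed conversion factors: 30-day months and 360-day years
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 360}

def age_to_days(ages):
    """Convert strings like '2 years' or '3 weeks' to a number of days (NaN if unparseable)."""
    parts = ages.str.extract(r'^(?P<number>\d+)\s+(?P<unit>day|week|month|year)s?$')
    return pd.to_numeric(parts['number']) * parts['unit'].map(DAYS_PER_UNIT)

# Hypothetical sample in the format of the AgeuponOutcome column
sample = pd.Series(['1 year', '3 weeks', '10 months', '5 days', np.nan])
days = age_to_days(sample)
print(days)
```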

In [None]:
data_train