# **machine_learning.ipynb**

### Objectives

* train a machine learning model to predict whether an animal is likely to be adopted
* use the scikit-learn Python library

In [19]:
# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [27]:
# import machine learning libraries
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

The machine learning model we will use is a **Classification** model because the target variable, the adoption likelihood, is binary: either likely or not likely. 

Because the target labels are available in the data, this is a **Supervised** learning problem.

In [21]:
# load the cleaned data into a dataframe
df = pd.read_csv('../data/data_clean.csv')
df.head()

| | PetType | Breed | AgeMonths | Colour | Size | WeightKg | Vaccinated | HealthCondition | TimeInShelterDays | AdoptionFee | PreviousOwner | AdoptionLikelihood | AgeInYears |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bird | Parakeet | 131 | Orange | Large | 5.039768 | 1 | 0 | 27 | 140 | 0 | 0 | 10.916667 |
| 1 | Rabbit | Rabbit | 73 | White | Large | 16.086727 | 0 | 0 | 8 | 235 | 0 | 0 | 6.083333 |
| 2 | Dog | Golden Retriever | 136 | Orange | Medium | 2.076286 | 0 | 0 | 85 | 385 | 0 | 0 | 11.333333 |
| 3 | Bird | Parakeet | 97 | White | Small | 3.339423 | 0 | 0 | 61 | 217 | 1 | 0 | 8.083333 |
| 4 | Rabbit | Rabbit | 123 | Gray | Large | 20.4981 | 0 | 0 | 28 | 14 | 1 | 0 | 10.25 |


In [None]:
# convert categorical columns to category data type for use in machine learning
category_cols = ["PetType", "Breed", "Colour", "Size"]
for cat in category_cols:
    df[cat] = df[cat].astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2007 entries, 0 to 2006
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   PetType             2007 non-null   category
 1   Breed               2007 non-null   category
 2   AgeMonths           2007 non-null   int64   
 3   Colour              2007 non-null   category
 4   Size                2007 non-null   category
 5   WeightKg            2007 non-null   float64 
 6   Vaccinated          2007 non-null   int64   
 7   HealthCondition     2007 non-null   int64   
 8   TimeInShelterDays   2007 non-null   int64   
 9   AdoptionFee         2007 non-null   int64   
 10  PreviousOwner       2007 non-null   int64   
 11  AdoptionLikelihood  2007 non-null   int64   
 12  AgeInYears          2007 non-null   float64 
dtypes: category(4), float64(2), int64(7)
memory usage: 150.0 KB
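
To illustrate what the `category` conversion does, here is a minimal sketch on a made-up five-row `Size` column (the `toy` dataframe is hypothetical, not the shelter data). pandas stores each distinct label once and keeps a small integer code per row, which is why memory usage drops for columns with few distinct values.

```python
import pandas as pd

# Hypothetical toy column; the values only imitate the notebook's Size column
toy = pd.DataFrame({"Size": ["Large", "Large", "Medium", "Small", "Large"]})
toy["Size"] = toy["Size"].astype("category")

print(toy["Size"].dtype)                 # category
print(list(toy["Size"].cat.categories))  # ['Large', 'Medium', 'Small']
print(toy["Size"].cat.codes.tolist())    # [0, 0, 1, 2, 0]
```

The integer codes are an internal storage detail. They are not a substitute for one-hot encoding, which the pipeline later applies to these columns.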


We need to drop the adoption fee column as it is not useful for prediction; please see notebook 02 for the explanation. We will also drop the age in months column, as it duplicates the age in years column.

In [23]:
# drop the AdoptionFee and AgeMonths columns as they are not needed for prediction
df = df.drop(['AdoptionFee','AgeMonths'], axis=1)
df.head()

| | PetType | Breed | Colour | Size | WeightKg | Vaccinated | HealthCondition | TimeInShelterDays | PreviousOwner | AdoptionLikelihood | AgeInYears |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bird | Parakeet | Orange | Large | 5.039768 | 1 | 0 | 27 | 0 | 0 | 10.916667 |
| 1 | Rabbit | Rabbit | White | Large | 16.086727 | 0 | 0 | 8 | 0 | 0 | 6.083333 |
| 2 | Dog | Golden Retriever | Orange | Medium | 2.076286 | 0 | 0 | 85 | 0 | 0 | 11.333333 |
| 3 | Bird | Parakeet | White | Small | 3.339423 | 0 | 0 | 61 | 1 | 0 | 8.083333 |
| 4 | Rabbit | Rabbit | Gray | Large | 20.4981 | 0 | 0 | 28 | 1 | 0 | 10.25 |


We will split the data into train and test sets: 80% of the data will be used for training and 20% for testing.

In [24]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['AdoptionLikelihood'], axis=1),
    df['AdoptionLikelihood'],
    test_size=0.2,
    random_state=101
)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

* Train set: (1605, 10) (1605,) 
* Test set: (402, 10) (402,)
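
As a quick sanity check on those sizes, the sketch below splits 2007 placeholder row indices (the row count reported by `df.info()`) with the same `test_size` and `random_state`. The `rows` array is a stand-in for illustration, not the real dataframe. Since 20% of 2007 is 401.4, scikit-learn rounds the test set up to 402 rows and the remaining 1605 go to training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder row indices; only the count (2007) comes from the notebook
rows = np.arange(2007)

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=101)

print(len(train_rows), len(test_rows))  # 1605 402

# Every row lands in exactly one of the two sets
overlap = set(train_rows) & set(test_rows)
print(len(overlap))  # 0
```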


The data is scaled after splitting, because fitting the scaler on the full dataset before splitting causes data leakage: the test rows would influence the scaling parameters (the mean and standard deviation), which can make the evaluation overly optimistic.
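
The difference can be sketched on synthetic numbers. The example below is an illustration only: the `X` array is randomly generated as a stand-in for a numeric feature such as `WeightKg`, and `leaky_scaler` is a hypothetical name for the approach to avoid. The scaler is fitted on the training rows and then reused, unchanged, on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one numeric feature (not the shelter data)
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=5.0, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=101)

# Correct order: learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training parameters

# Leaky order (to avoid): the test rows contribute to the learned statistics
leaky_scaler = StandardScaler().fit(X)

print("mean learned from train only:", scaler.mean_[0])
print("mean learned from all rows:  ", leaky_scaler.mean_[0])
```

Placing the scaler inside a scikit-learn `Pipeline` gives the same guarantee automatically, because `fit` only ever sees the data passed to it.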

In [None]:
# check the dataframe info for use in the next step
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2007 entries, 0 to 2006
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   PetType             2007 non-null   category
 1   Breed               2007 non-null   category
 2   Colour              2007 non-null   category
 3   Size                2007 non-null   category
 4   WeightKg            2007 non-null   float64 
 5   Vaccinated          2007 non-null   int64   
 6   HealthCondition     2007 non-null   int64   
 7   TimeInShelterDays   2007 non-null   int64   
 8   PreviousOwner       2007 non-null   int64   
 9   AdoptionLikelihood  2007 non-null   int64   
 10  AgeInYears          2007 non-null   float64 
dtypes: category(4), float64(2), int64(5)
memory usage: 118.6 KB


In [30]:
# create a preprocessing and modeling pipeline
# note: a generative AI model suggested code to help with the ordering of steps

categorical_features = ['PetType', 'Breed', 'Colour', 'Size']
numeric_features = ['WeightKg', 'Vaccinated', 'HealthCondition', 'TimeInShelterDays',
                    'PreviousOwner', 'AgeInYears']

preprocessor = ColumnTransformer(
    transformers=[
        # ignore categories (e.g. a rare Breed) that appear only in the test set
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('num', StandardScaler(), numeric_features)
    ]
)

model = Pipeline([
    ('prep', preprocessor),
    # fixed random_state (same seed as the split) so results are reproducible
    ('clf', DecisionTreeClassifier(random_state=101))
])

model.fit(X_train, y_train)
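
Once fitted, a pipeline like this can be scored on the held-out test set. The sketch below shows one possible way to do that. It is self-contained and runs on a small synthetic table rather than the shelter data: the `toy` dataframe, its made-up target rule, and the `toy_model` name are all assumptions for illustration, and the accuracy it prints says nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: column names mirror the notebook, values are made up
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "PetType": rng.choice(["Dog", "Cat", "Bird", "Rabbit"], size=n),
    "WeightKg": rng.uniform(1, 30, size=n),
    "Vaccinated": rng.integers(0, 2, size=n),
})
# Made-up target rule so the toy model has a pattern to learn
toy["AdoptionLikelihood"] = (
    (toy["Vaccinated"] == 1) & (toy["WeightKg"] < 15)
).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="AdoptionLikelihood"),
    toy["AdoptionLikelihood"],
    test_size=0.2,
    random_state=101,
)

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["PetType"]),
    ("num", StandardScaler(), ["WeightKg", "Vaccinated"]),
])
toy_model = Pipeline([
    ("prep", preprocessor),
    ("clf", DecisionTreeClassifier(random_state=101)),
])
toy_model.fit(X_train, y_train)

# Predict on rows the model has not seen and compare with the true labels
y_pred = toy_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy on the toy data: {test_accuracy:.3f}")
```

Because the preprocessing lives inside the pipeline, `predict` applies the encoder and scaler fitted on the training rows to the test rows, so no separate transform step is needed.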