# Implement a KNN model to classify the animals into categories
---

## Data Gathering

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Loading the data
df = pd.read_csv('Zoo.csv')
df

Unnamed: 0,animal name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
0,aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,wallaby,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,1,1
97,wasp,1,0,1,0,1,0,0,0,0,1,1,0,6,0,0,0,6
98,wolf,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
99,worm,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,7


## Data Exploration

In [3]:
# Getting information on each feature
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   animal name  101 non-null    object
 1   hair         101 non-null    int64 
 2   feathers     101 non-null    int64 
 3   eggs         101 non-null    int64 
 4   milk         101 non-null    int64 
 5   airborne     101 non-null    int64 
 6   aquatic      101 non-null    int64 
 7   predator     101 non-null    int64 
 8   toothed      101 non-null    int64 
 9   backbone     101 non-null    int64 
 10  breathes     101 non-null    int64 
 11  venomous     101 non-null    int64 
 12  fins         101 non-null    int64 
 13  legs         101 non-null    int64 
 14  tail         101 non-null    int64 
 15  domestic     101 non-null    int64 
 16  catsize      101 non-null    int64 
 17  type         101 non-null    int64 
dtypes: int64(17), object(1)
memory usage: 14.3+ KB


'animal name', which has the object data type, provides a label for each record. All other features are numeric.

In [4]:
# Checking unique values for each feature
df.nunique()

animal name    100
hair             2
feathers         2
eggs             2
milk             2
airborne         2
aquatic          2
predator         2
toothed          2
backbone         2
breathes         2
venomous         2
fins             2
legs             6
tail             2
domestic         2
catsize          2
type             7
dtype: int64

There are 101 entries but only 100 unique animal names, so one name appears twice. All features except 'legs' are binary, and 'type' has 7 unique classes.

In [5]:
# Checking for na values
df.isna().sum()

animal name    0
hair           0
feathers       0
eggs           0
milk           0
airborne       0
aquatic        0
predator       0
toothed        0
backbone       0
breathes       0
venomous       0
fins           0
legs           0
tail           0
domestic       0
catsize        0
type           0
dtype: int64

No na values

In [6]:
# Checking for duplicated records
df[df.duplicated()].shape[0]

0

No fully duplicated records (rows that are identical in every column)

In [7]:
# Value Counts for 'animal name'
df['animal name'].value_counts()

frog        2
pony        1
sealion     1
seal        1
seahorse    1
           ..
gorilla     1
goat        1
gnat        1
girl        1
wren        1
Name: animal name, Length: 100, dtype: int64

Except 'frog', all names appear once

In [8]:
df[df['animal name'] == 'frog']

Unnamed: 0,animal name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
25,frog,0,0,1,0,0,1,1,1,1,1,0,0,4,0,0,0,5
26,frog,0,0,1,0,0,1,1,1,1,1,1,0,4,0,0,0,5


The two frog entries differ only in 'venomous': one is non-venomous and the other is venomous. They are distinct records, not duplicates
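This also explains why the earlier duplicate check returned 0 even though a name repeats: by default `duplicated()` compares whole rows. A minimal sketch on a hypothetical three-row frame (`toy` is not the Zoo data) shows the difference between a full-row check and a check restricted to the name column:

```python
import pandas as pd

# Hypothetical toy frame mimicking the two frog records
toy = pd.DataFrame({
    'animal name': ['frog', 'frog', 'bear'],
    'venomous': [0, 1, 0],
})

# Full-row comparison: the two frogs differ in 'venomous', so nothing is flagged
full_row_dupes = int(toy.duplicated().sum())

# Name-only comparison: keep=False marks every row that shares a name
name_dupes = toy[toy.duplicated(subset=['animal name'], keep=False)]

print(full_row_dupes)
print(name_dupes)
```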

In [9]:
# Checking value counts for unique values for 'legs'
df.legs.value_counts()

4    38
2    27
0    23
6    10
8     2
5     1
Name: legs, dtype: int64

0, 2, 4, 6, 8 legs make sense. We need to investigate the 5-legged creature

In [10]:
df[df.legs == 5]

Unnamed: 0,animal name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
85,starfish,0,0,1,0,0,1,1,0,0,0,0,0,5,0,0,0,7


Starfish can be considered to have 5 legs. The record is valid

In [11]:
# Checking value count for target label
df.type.value_counts()

1    41
2    20
4    13
7    10
6     8
3     5
5     4
Name: type, dtype: int64

## Feature Engineering

Let's convert 'legs' into dummies as all other features are binary

In [12]:
df = pd.get_dummies(df, columns=['legs'])
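
A brief sketch of why this encoding can matter for a distance-based model, using a hypothetical three-row frame (`toy`) rather than the Zoo data. With the raw count, a difference in 'legs' can outweigh several binary features; after encoding, any two different leg counts contribute the same amount to the distance. The helper `manhattan` is illustrative only:

```python
import pandas as pd

# Hypothetical toy frame, not the Zoo data
toy = pd.DataFrame({'hair': [1, 0, 0], 'legs': [4, 0, 8]})

# One indicator column per distinct leg count: legs_0, legs_4, legs_8
encoded = pd.get_dummies(toy, columns=['legs']).astype(int)

def manhattan(a, b):
    """Sum of absolute feature differences between two rows."""
    return int((a - b).abs().sum())

# Raw 'legs': rows 1 and 2 differ by 8 on a single feature
raw_dist = manhattan(toy.loc[1], toy.loc[2])
# Encoded: the same pair differs in exactly two indicator columns
enc_dist = manhattan(encoded.loc[1], encoded.loc[2])

print(raw_dist, enc_dist)
```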

## Hyperparameter Tuning

For this classification task, we can use KNeighborsClassifier or RadiusNeighborsClassifier. However, RadiusNeighborsClassifier tends to become less effective as the number of features grows because of the curse of dimensionality, so we use KNeighborsClassifier

In [13]:
# Importing knn classifier models
from sklearn.neighbors import KNeighborsClassifier
# Importing Classes for Cross Validation and Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [14]:
# Defining multiple values of k
k = [i for i in range(1,41)]

In [15]:
# Defining various values for hyperparameters for knn classifier
param_grid = {'n_neighbors':k, 
              'weights' : ['uniform', 'distance'], 
              'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'], 
              'p' : [1,2]}
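
In scikit-learn, `p` is the power parameter of the default Minkowski metric: `p=1` gives Manhattan distance and `p=2` gives Euclidean distance. As a minimal sketch (the helper `minkowski` and the sample vectors are illustrative), on purely binary features the Manhattan distance is the number of differing features and the Euclidean distance is its square root, so both rank neighbors in the same order:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

query = [1, 0, 0, 1, 0]
candidates = {
    'a': [1, 0, 0, 1, 1],  # differs from query in 1 feature
    'b': [0, 1, 0, 1, 0],  # differs from query in 2 features
    'c': [0, 1, 1, 0, 1],  # differs from query in all 5 features
}

def ranking(p):
    """Candidate names ordered from nearest to farthest for a given p."""
    return sorted(candidates, key=lambda name: minkowski(query, candidates[name], p))

print(ranking(1), ranking(2))
```

This suggests that, once every feature is binary, the choice of `p` mainly affects distance-weighted voting and tie-breaking rather than which neighbors are found.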

In [16]:
# Creating a knn classifier model
knnc = KNeighborsClassifier()

In [17]:
# Creating a stratified k fold cross validator object for 4 splits as the lowest value count for a target class is 4
skf_cv = StratifiedKFold(n_splits=4)
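
To illustrate what stratification does, here is a small sketch on hypothetical labels (`y_toy`, not the Zoo target) where the rare class has exactly 4 samples. With 4 splits, each test fold can receive one sample of the rare class:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 8 samples of class 0 and 4 of the rare class 1
y_toy = np.array([0] * 8 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; only y drives the folds

skf = StratifiedKFold(n_splits=4)

# Collect the set of classes present in each test fold
test_classes = [
    set(y_toy[test_idx].tolist()) for _, test_idx in skf.split(X_toy, y_toy)
]
print(test_classes)
```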

In [18]:
# Creating a grid search object
grid = GridSearchCV(estimator = knnc, param_grid = param_grid, cv = skf_cv)
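
As a quick sanity check on the cost of this search, the sketch below counts the hyperparameter combinations in the grid defined above and the resulting number of cross-validation fits. The names `n_candidates` and `n_fits` are illustrative:

```python
from math import prod

# Same grid as defined in the notebook
param_grid = {
    'n_neighbors': list(range(1, 41)),
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2],
}

# Number of combinations: product of the number of options per hyperparameter
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold (4 folds in the notebook)
n_fits = n_candidates * 4

print(n_candidates, n_fits)
```

The `algorithm` options only change how neighbors are searched for, so dropping that entry would likely cut the grid to a quarter of its size with little effect on the scores.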

In [19]:
# Splitting data into independent and dependent features
X = df.drop(columns=['animal name', 'type'])
y = df.type

In [20]:
# Fitting the data to the grid
grid.fit(X,y)

In [21]:
# Best score obtained from the grid search
grid.best_score_

0.9603846153846154

In [22]:
# Parameter which gave the best score
grid.best_params_

{'algorithm': 'auto', 'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}

## Model Training

The best values for 'algorithm' ('auto') and 'weights' ('uniform') are the defaults, so only 'n_neighbors' and 'p' need to be set explicitly

In [23]:
# Creating a knn classifier model with the optimal parameters
knnc = KNeighborsClassifier(n_neighbors=1, p = 1)

In [24]:
# Fitting the data to the model
knnc.fit(X,y)

In [25]:
# Accuracy for the model
knnc.score(X,y)

1.0

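We get perfect accuracy, but this is accuracy on the same data the model was fitted on. With n_neighbors=1, each sample is its own nearest neighbor, so a perfect training score is expected and says little about generalization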
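
To see why a 1-nearest-neighbor model scores so well on its own training data, here is a minimal from-scratch sketch. The data, the random labels, and the helper `nearest_label` are all hypothetical. Each stored point is at distance 0 from itself, so it gets its own label back even when the labels carry no signal:

```python
import random

def nearest_label(query, points, labels):
    """Label of the stored point with the smallest Manhattan distance to query."""
    distances = [sum(abs(q - x) for q, x in zip(query, pt)) for pt in points]
    return labels[distances.index(min(distances))]

random.seed(0)

# Hypothetical data: all 8 distinct binary vectors of length 3, with random labels
points = [[(i >> bit) & 1 for bit in range(3)] for i in range(8)]
labels = [random.randint(1, 7) for _ in points]

# Score the model on the same points it has memorized
predictions = [nearest_label(pt, points, labels) for pt in points]
train_accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

print(train_accuracy)
```

The cross-validated score from the grid search is therefore the more informative number.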

Now, let's try splitting the data into training and testing sets. With such a small dataset, we would expect accuracy on the held-out samples to be lower and to vary from split to split

In [26]:
from sklearn.model_selection import train_test_split

Some target labels have as few as 4 samples, so we use a stratified train_test_split. Otherwise, these labels might be missing from either the training or the testing set

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y)
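
The sketch below reproduces only the label side of this split, using the class counts reported by `df.type.value_counts()` and placeholder features. `random_state=0` is an added assumption to make the example repeatable; the notebook's own call does not set it. It shows the size of each part under the default 25% test share and that every class appears on both sides:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Class counts taken from the value counts of 'type' shown earlier
class_counts = {1: 41, 2: 20, 4: 13, 7: 10, 6: 8, 3: 5, 5: 4}
y_demo = np.repeat(list(class_counts.keys()), list(class_counts.values()))
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

# random_state is an assumption added for repeatability
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(len(y_tr), len(y_te))
print(sorted(Counter(y_te.tolist()).items()))
```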

In [28]:
# Fitting the model to the training data
knnc.fit(X_train, y_train)

In [29]:
# Accuracy for training data
knnc.score(X_train, y_train)

1.0

In [30]:
# Accuracy for testing data
knnc.score(X_test, y_test)

0.8846153846153846

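Training accuracy is again perfect, but accuracy on the test set drops to about 0.885 (23 of the 26 test samples, given the default 25% test split). Because the split is random and no random_state is set, this number will vary between runs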
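
Note that the grid search above was fitted on all 101 records, so the test samples had already influenced the choice of hyperparameters. One possible refinement, which is not what this notebook does, is to split first and tune on the training part only. The sketch below uses hypothetical stand-in data (`X_demo`, `y_demo`) and a reduced grid purely for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 120 samples, 6 binary features, 3 classes
X_demo = rng.integers(0, 2, size=(120, 6))
y_demo = X_demo[:, 0] + X_demo[:, 1]  # label derived from two of the features

# Hold out the test set before any tuning takes place
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Tune on the training part only (reduced grid for illustration)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [1, 3, 5], 'p': [1, 2]},
    cv=StratifiedKFold(n_splits=4),
)
search.fit(X_tr, y_tr)

# Score the refitted best model on samples it has never seen
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```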

## Conclusion

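The dataset has only 101 entries, so the results should be read with care. Hyperparameter tuning selected n_neighbors=1 and p=1, with a mean cross-validated accuracy of about 0.96. The resulting K Neighbors Classifier scores 1.0 on its training data but about 0.88 on a held-out stratified test split, so it is accurate but not perfect on unseen samples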