# Zoo Animal Classification

In this notebook we will look at a dataset containing 101 animals.

Using a decision tree model, we will aim to predict what class an animal is in, taking into account various variables.

*Data from: https://www.kaggle.com/uciml/zoo-animal-classification*

*Same data from UCI ML: https://archive.ics.uci.edu/ml/datasets/Zoo*

In [1]:
# General
import pandas as pd

# Ignoring warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Data viz
import seaborn as sn
import matplotlib.pyplot as plt

# Decision Tree
from sklearn import tree

# Label Encoding - Converting labels to numbers
from sklearn.preprocessing import LabelEncoder

# Confusion matrix
from sklearn.metrics import confusion_matrix

# Splitting data; train & test
from sklearn.model_selection import train_test_split

In [3]:
# Call data
## Model - Classes
df_class = pd.read_csv(r"C:\Users\ssc44611\Documents\L4 Projects\4. ML Practice Projects\Zoo Classification\class.csv")

## Model - Features
df_features = pd.read_csv(r"C:\Users\ssc44611\Documents\L4 Projects\4. ML Practice Projects\Zoo Classification\zoo.csv")

In [4]:
# Let's look at the CLASSES in further detail
df_class

| | Class_Number | Number_Of_Animal_Species_In_Class | Class_Type | Animal_Names |
|---|---|---|---|---|
| 0 | 1 | 41 | Mammal | aardvark, antelope, bear, boar, buffalo, calf,... |
| 1 | 2 | 20 | Bird | chicken, crow, dove, duck, flamingo, gull, haw... |
| 2 | 3 | 5 | Reptile | pitviper, seasnake, slowworm, tortoise, tuatara |
| 3 | 4 | 13 | Fish | bass, carp, catfish, chub, dogfish, haddock, h... |
| 4 | 5 | 4 | Amphibian | frog, frog, newt, toad |
| 5 | 6 | 8 | Bug | flea, gnat, honeybee, housefly, ladybird, moth... |
| 6 | 7 | 10 | Invertebrate | clam, crab, crayfish, lobster, octopus, scorpi... |


In [5]:
# Let's look at the FEATURES in further detail
df_features.head()

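| | animal_name | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize | class_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardvark | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| 1 | antelope | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |
| 2 | bass | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 4 |
| 3 | bear | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| 4 | boar | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |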


As we can see from the data above, most of the features are boolean. 

If _all_ features were boolean, we could've opted for a logistic regression model.

However, __the "legs" column has values that aren't boolean (it holds a leg count, so its relationship to the class is not linear), hence we should use a decision tree__.
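One way to confirm which columns are not boolean is to test whether every value in a column is 0 or 1. The snippet below is only a sketch: it rebuilds a few columns from the five rows shown by `df_features.head()` instead of reading the full CSV, and the names `sample` and `non_boolean` are made up for this illustration.

```python
import pandas as pd

# Subset of the five rows displayed by df_features.head() (illustration only,
# not the full dataset)
sample = pd.DataFrame(
    {
        "animal_name": ["aardvark", "antelope", "bass", "bear", "boar"],
        "hair": [1, 1, 0, 1, 1],
        "milk": [1, 1, 0, 1, 1],
        "fins": [0, 0, 1, 0, 0],
        "legs": [4, 4, 0, 4, 4],
        "tail": [0, 1, 1, 0, 1],
    }
)

# Drop the name column, then flag every column with a value other than 0 or 1
numeric = sample.drop(columns="animal_name")
non_boolean = [
    col for col in numeric.columns if not numeric[col].isin([0, 1]).all()
]

print(non_boolean)
```

On the real data the same check could be run on `df_features` after dropping `animal_name` and `class_type`.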

In [62]:
# Let's see whether our class data requires cleaning
df_class.isnull().sum()

Class_Number                         0
Number_Of_Animal_Species_In_Class    0
Class_Type                           0
Animal_Names                         0
dtype: int64

In [63]:
# Let's see whether our feature data requires cleaning
df_features.isnull().sum()

animal_name    0
hair           0
feathers       0
eggs           0
milk           0
airborne       0
aquatic        0
predator       0
toothed        0
backbone       0
breathes       0
venomous       0
fins           0
legs           0
tail           0
domestic       0
catsize        0
class_type     0
dtype: int64

Both of our datasets are fine and don't require any further cleaning.

In [7]:
# Let's take a deeper dive into our data
print(f'There are: {df_class.shape[0]} unique classes.')
print("-----------------------------------------")
print(f'The unique classes are: {df_class.Class_Type.unique()}')
print("-----------------------------------------")
print(f'There are: {df_features.shape[1] - 1} unique features.') # -1 drops the target column (class_type); the count still includes animal_name, which is an identifier rather than a predictive feature.
print("-----------------------------------------")
print(f'The unique feature names are: {list(df_features.columns[:-1])}')
print("-----------------------------------------")
print(f'There are: {df_features.shape[0]} unique animals.')

There are: 7 unique classes.
-----------------------------------------
The unique classes are: ['Mammal' 'Bird' 'Reptile' 'Fish' 'Amphibian' 'Bug' 'Invertebrate']
-----------------------------------------
There are: 17 unique features.
-----------------------------------------
The unique feature names are: ['animal_name', 'hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail', 'domestic', 'catsize']
-----------------------------------------
There are: 101 unique animals.
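The feature list printed above still contains `animal_name`, which identifies a row rather than describing it. Below is a rough sketch of how the predictive columns could be separated from the identifier and the target. It builds a one-row frame from the aardvark row shown earlier, and the names `sample`, `feature_cols`, `X` and `y` are assumptions for this illustration, not part of the original notebook.

```python
import pandas as pd

# Column names as printed above, plus the target column class_type
columns = [
    "animal_name", "hair", "feathers", "eggs", "milk", "airborne", "aquatic",
    "predator", "toothed", "backbone", "breathes", "venomous", "fins", "legs",
    "tail", "domestic", "catsize", "class_type",
]

# The aardvark row from df_features.head(), used only to get a frame
# with the right columns
row = ["aardvark", 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 0, 0, 1, 1]
sample = pd.DataFrame([row], columns=columns)

# Keep everything except the identifier and the target
feature_cols = [
    col for col in sample.columns if col not in ("animal_name", "class_type")
]
X = sample[feature_cols]   # model inputs
y = sample["class_type"]   # value to predict

print(len(feature_cols))
```

Excluding both columns leaves the attributes that actually describe each animal.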


## What to do from here?
 
 
1. In the "class_type" column of the feature df, subtract 1 from every value so the labels start at 0, not 1.
2. Encode the "Class_Number" column of the class df so that it also starts at 0, not 1.
3. Fit the decision tree (DT) model.
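The three steps above can be sketched as follows. This is a minimal illustration, not the final model: it uses three columns from the five rows shown by `df_features.head()`, fits on all of them without a train/test split, and the names `sample`, `class_numbers`, `encoded` and `model` are made up for the example.

```python
import pandas as pd
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

# A few columns from the five rows shown by df_features.head() (illustration only)
sample = pd.DataFrame(
    {
        "hair": [1, 1, 0, 1, 1],
        "fins": [0, 0, 1, 0, 0],
        "legs": [4, 4, 0, 4, 4],
        "class_type": [1, 1, 4, 1, 1],
    }
)

# Step 1: shift class_type so the labels start at 0 instead of 1
sample["class_type"] = sample["class_type"] - 1

# Step 2: encode Class_Number (1..7 in class.csv) so it also starts at 0
class_numbers = pd.Series([1, 2, 3, 4, 5, 6, 7], name="Class_Number")
encoded = LabelEncoder().fit_transform(class_numbers)

# Step 3: fit a decision tree on the features and the shifted labels
X = sample[["hair", "fins", "legs"]]
y = sample["class_type"]
model = tree.DecisionTreeClassifier(random_state=0)
model.fit(X, y)

predictions = model.predict(X)
print(list(encoded))
print(list(predictions))
```

With the full dataset, the fit would normally be done on the training portion returned by `train_test_split`, keeping the test portion for evaluation.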