https://www.kaggle.com/uciml/zoo-animal-classification

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def setseed(seed=9871243):
    np.random.seed(seed)

In [5]:
zoo = pd.read_csv("./data/zoo-data.csv")
class_ = pd.read_csv("./data/zoo-class.csv")

In [6]:
zoo.head()

,animal_name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,class_type
0,aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1


In [7]:
zoo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   animal_name  101 non-null    object
 1   hair         101 non-null    int64 
 2   feathers     101 non-null    int64 
 3   eggs         101 non-null    int64 
 4   milk         101 non-null    int64 
 5   airborne     101 non-null    int64 
 6   aquatic      101 non-null    int64 
 7   predator     101 non-null    int64 
 8   toothed      101 non-null    int64 
 9   backbone     101 non-null    int64 
 10  breathes     101 non-null    int64 
 11  venomous     101 non-null    int64 
 12  fins         101 non-null    int64 
 13  legs         101 non-null    int64 
 14  tail         101 non-null    int64 
 15  domestic     101 non-null    int64 
 16  catsize      101 non-null    int64 
 17  class_type   101 non-null    int64 
dtypes: int64(17), object(1)
memory usage: 14.3+ KB


In [8]:
class_

,Class_Number,Number_Of_Animal_Species_In_Class,Class_Type,Animal_Names
0,1,41,Mammal,"aardvark, antelope, bear, boar, buffalo, calf,..."
1,2,20,Bird,"chicken, crow, dove, duck, flamingo, gull, haw..."
2,3,5,Reptile,"pitviper, seasnake, slowworm, tortoise, tuatara"
3,4,13,Fish,"bass, carp, catfish, chub, dogfish, haddock, h..."
4,5,4,Amphibian,"frog, frog, newt, toad"
5,6,8,Bug,"flea, gnat, honeybee, housefly, ladybird, moth..."
6,7,10,Invertebrate,"clam, crab, crayfish, lobster, octopus, scorpi..."


The `class_` dataframe has one row per class, not one row per animal, so it cannot serve as our target `y` as it stands.  
It would be cleaner and easier to have a dataframe such as:
```
+    name  +  class   +
----------------------
+ aardvark |     1    +
+ antelope |     1    +
```

We'll build it ourselves as a dictionary with two keys:
* `animal_name`: a copy of the `animal_name` column from the `zoo` dataframe
* `class_number`: the class each animal belongs to, looked up in `class_`

In [9]:
animal_names = list(zoo.animal_name)
animal_names[:5]

['aardvark', 'antelope', 'bass', 'bear', 'boar']

In [10]:
class_numbers = []

for animal in animal_names:
    for i in range(len(class_)):
        if animal in class_.Animal_Names[i].split(', '):
            class_numbers.append(i+1)
            break

class_numbers[:5]

[1, 1, 4, 1, 1]
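
The nested loop above silently skips any animal that appears in no class, which would leave `class_numbers` shorter than `animal_names`, and it relies on row position (`i+1`) matching `Class_Number`. A minimal alternative sketch builds the lookup once as a dictionary and fails loudly on a missing name. The `class_rows` list is a shortened, hypothetical stand-in for the rows of `class_`, and `animal_to_class` is a name introduced here for illustration only.

```python
# Stand-in for the (Class_Number, Animal_Names) pairs of `class_`;
# the name strings are shortened here for illustration.
class_rows = [
    (1, "aardvark, antelope, bear, boar"),
    (4, "bass, carp, catfish"),
    (5, "frog, frog, newt, toad"),
]

# Build the animal -> class lookup once instead of rescanning every class per animal.
animal_to_class = {}
for class_number, names in class_rows:
    for name in names.split(", "):
        animal_to_class[name] = class_number

animal_names = ["aardvark", "antelope", "bass", "bear", "boar"]

# Fail loudly if an animal has no class, rather than dropping it silently.
missing = [name for name in animal_names if name not in animal_to_class]
if missing:
    raise KeyError(f"no class found for: {missing}")

class_numbers = [animal_to_class[name] for name in animal_names]
print(class_numbers)  # [1, 1, 4, 1, 1]
```

With the notebook's dataframes, the pairs could come from `zip(class_.Class_Number, class_.Animal_Names)` and the names from `zoo.animal_name`.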

In [11]:
classifications = pd.DataFrame({
    'animal_name': animal_names,
    'class_number': class_numbers
})
classifications.head()

,animal_name,class_number
0,aardvark,1
1,antelope,1
2,bass,4
3,bear,1
4,boar,1


Now let's create `X` and `y`.  
The `animal_name` column is only an identifier and carries no predictive information, so we `drop()` it from both dataframes.

In [12]:
X = zoo.drop("animal_name", axis=1)
y = classifications.drop("animal_name", axis=1)

In [13]:
X.head()

,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,class_type
0,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
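
Note that `X` as printed above still contains `class_type`, which in the rows shown matches the `class_number` we are trying to predict. Leaving it in lets a model read the answer straight from the features. A minimal sketch of dropping it follows, using a three-row stand-in copied from `zoo.head()` with only a few of the feature columns. The names `zoo_sample`, `X_clean` and `y_clean` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Tiny stand-in for the notebook's `zoo` dataframe (values copied from zoo.head()).
zoo_sample = pd.DataFrame({
    "animal_name": ["aardvark", "antelope", "bass"],
    "hair": [1, 1, 0],
    "eggs": [0, 0, 1],
    "legs": [4, 4, 0],
    "class_type": [1, 1, 4],
})

# Drop both the identifier and the label column so no feature encodes the target.
X_clean = zoo_sample.drop(columns=["animal_name", "class_type"])

# Keep the target as a 1-D Series, the shape scikit-learn estimators expect for y.
y_clean = zoo_sample["class_type"]

print(list(X_clean.columns))  # ['hair', 'eggs', 'legs']
```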


In [14]:
y.head()

,class_number
0,1
1,1
2,4
3,1
4,1


Now let's train two models, a linear SVM and a random forest, and compare their accuracy on the test set.  


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [16]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

setseed()

svc = LinearSVC()
rfc = RandomForestClassifier()

svc.fit(X_train, y_train)
rfc.fit(X_train, y_train)

svc.score(X_test, y_test), rfc.score(X_test, y_test)

(0.9047619047619048, 1.0)
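
In the cells above, `setseed()` runs after `train_test_split`, so the split itself is not seeded and these scores can change from run to run. One way to pin the split down is to pass `random_state` directly, sketched below on synthetic stand-in data. `X_demo`, `y_demo` and `split` are hypothetical names, the seed reuses the default from `setseed()`, and `stratify` is an optional extra that keeps class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 9871243  # same default value as the notebook's setseed()

# Synthetic stand-in data: 100 samples, 4 binary features, 4 balanced classes.
rng = np.random.default_rng(SEED)
X_demo = rng.integers(0, 2, size=(100, 4))
y_demo = np.repeat([1, 2, 3, 4], 25)

def split(seed):
    # random_state fixes the shuffle; stratify keeps class proportions in both parts.
    return train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )

X_train_a, X_test_a, y_train_a, y_test_a = split(SEED)
X_train_b, X_test_b, y_train_b, y_test_b = split(SEED)

# The same seed yields the same split on every call.
print(np.array_equal(X_test_a, X_test_b))  # True
# Each of the 4 classes contributes 5 of the 20 test samples.
print(np.bincount(y_test_a)[1:])  # [5 5 5 5]
```

On the real data, stratifying needs at least two animals in every class, which holds for the class counts listed in `class_`.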

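Well... either there are some really strong patterns in the data or both of our models are overfitting. Keep in mind that the test set holds only 21 animals (0.905 is 19 of 21 correct), so a single split is a noisy estimate, and that `X` still includes the `class_type` column, which appears to carry the label itself.  
Let's cross-validate on our training data to check.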