This dataset consists of 101 animals from a zoo.  
There are 16 variables with various traits to describe the animals.  
The 7 Class Types are: Mammal, Bird, Reptile, Fish, Amphibian, Bug, and Invertebrate.  

The purpose of this dataset is to predict the classification of the animals based on the variables.  
It is a good starter dataset for those who are new to Machine Learning.  

# zoo.csv  
Attribute Information: (name of attribute and type of value domain)  

1. animal_name: Unique for each instance  
2. hair: Boolean  
3. feathers: Boolean  
4. eggs: Boolean  
5. milk: Boolean  
6. airborne: Boolean  
7. aquatic: Boolean  
8. predator: Boolean  
9. toothed: Boolean  
10. backbone: Boolean  
11. breathes: Boolean  
12. venomous: Boolean  
13. fins: Boolean  
14. legs: Numeric (set of values: {0,2,4,5,6,8})  
15. tail: Boolean  
16. domestic: Boolean  
17. catsize: Boolean  
18. class_type: Numeric (integer values in range [1,7])  

# class.csv  
This csv describes the seven classes used in zoo.csv.  

1. Class_Number Numeric (integer values in range [1,7])  
2. NumberOfAnimalSpeciesIn_Class Numeric  
3. Class_Type character -- The actual word description of the class  
4. Animal_Names character -- The list of the animals that fall in the category of the class  
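
The two files can be joined on the class number to attach a readable class name to each animal. The following is a minimal sketch, assuming the column names listed above; the three-row frames are toy stand-ins for zoo.csv and class.csv, not the real data.

```python
import pandas as pd

# Toy stand-ins for zoo.csv and class.csv (only a few columns and rows).
zoo_toy = pd.DataFrame({
    "animal_name": ["bear", "crow", "bass"],
    "class_type": [1, 2, 4],
})
class_toy = pd.DataFrame({
    "Class_Number": [1, 2, 4],
    "Class_Type": ["Mammal", "Bird", "Fish"],
})

# A left join keeps every animal and preserves the row order of zoo_toy.
merged = zoo_toy.merge(
    class_toy[["Class_Number", "Class_Type"]],
    how="left",
    left_on="class_type",
    right_on="Class_Number",
)
print(merged[["animal_name", "Class_Type"]])
```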

# Acknowledgements  
UCI Machine Learning: https://archive.ics.uci.edu/ml/datasets/Zoo  

Source Information
-- Creator: Richard Forsyth  
-- Donor: Richard S. Forsyth  
8 Grosvenor Avenue  
Mapperley Park  
Nottingham NG3 5DX  
0602-621676  
-- Date: 5/15/1990  

# Inspiration  
What are the best machine learning ensembles/methods for classifying these animals based upon the variables given?

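This DataSet collects 16 traits for 101 zoo animals.  
zoo.csv lists each animal and the values of its traits.  
class.csv lists the labels assigned to the animals.  
I picked this data to try out a Dummy Data Encoder, but every value except the number of legs is already encoded, so I will just analyze the data instead.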
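
For reference, dummy (one-hot) encoding of the one remaining multi-valued column could look like the sketch below. It uses `pd.get_dummies` on a toy frame; the three rows are made up for illustration, and only the `legs` column name is taken from zoo.csv.

```python
import pandas as pd

# Toy rows standing in for zoo.csv; hair is already 0/1, legs is a count.
toy = pd.DataFrame({
    "animal_name": ["bear", "crow", "bass"],
    "hair": [1, 0, 0],
    "legs": [4, 2, 0],
})

# One indicator column per distinct leg count; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["legs"], prefix="legs", dtype=int)
print(encoded)
```

Whether this helps depends on the model: distance-based methods treat the indicator columns very differently from a single numeric leg count.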

# Loading the DataSet

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
DATA_PATH = os.path.join('data', 'zoo') # Folder where the data is saved

def load_zoo_data(): # Load the per-animal trait table
    csv_path = os.path.join(DATA_PATH,'zoo.csv')
    return pd.read_csv(csv_path)

def load_label_data(): # Load the class description table
    csv_path = os.path.join(DATA_PATH,'class.csv')
    return pd.read_csv(csv_path)

In [11]:
zoo = load_zoo_data()
label = load_label_data()
zoo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   animal_name  101 non-null    object
 1   hair         101 non-null    int64 
 2   feathers     101 non-null    int64 
 3   eggs         101 non-null    int64 
 4   milk         101 non-null    int64 
 5   airborne     101 non-null    int64 
 6   aquatic      101 non-null    int64 
 7   predator     101 non-null    int64 
 8   toothed      101 non-null    int64 
 9   backbone     101 non-null    int64 
 10  breathes     101 non-null    int64 
 11  venomous     101 non-null    int64 
 12  fins         101 non-null    int64 
 13  legs         101 non-null    int64 
 14  tail         101 non-null    int64 
 15  domestic     101 non-null    int64 
 16  catsize      101 non-null    int64 
 17  class_type   101 non-null    int64 
dtypes: int64(17), object(1)
memory usage: 14.3+ KB


# Organizing the Label Data

In [34]:
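new_label=[] # Animal names grouped by label
labelling=[] # Label index (0-6) of each animal, in zoo row order

# Each row of class.csv holds a comma-separated list of animal names.
for i in range(0,7,1):
    new_label.append(label.Animal_Names[i].split(", "))

# Look up each animal's group. Note these indices are 0-based,
# while zoo.class_type runs from 1 to 7.
for i in zoo.animal_name:
    for j in range(0,7,1):
        if i in new_label[j]:
            labelling.append(j)
            break # one label per animal keeps labelling aligned with zoo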
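
The nested loop above can also be written as a dictionary lookup. This is a sketch on toy frames, assuming (as the loop does) that the rows of class.csv are ordered by `Class_Number`; under that assumption the result should match `zoo.class_type - 1`.

```python
import pandas as pd

# Toy stand-ins for class.csv and zoo.csv.
label_toy = pd.DataFrame({
    "Class_Number": [1, 2],
    "Animal_Names": ["aardvark, bear", "chicken, crow"],
})
zoo_toy = pd.DataFrame({
    "animal_name": ["bear", "crow", "aardvark"],
    "class_type": [1, 2, 1],
})

# Map every animal name to the 0-based row index of its class.
name_to_index = {
    name: idx
    for idx, names in enumerate(label_toy["Animal_Names"])
    for name in names.split(", ")
}
labelling_toy = zoo_toy["animal_name"].map(name_to_index).tolist()

# The same labels, derived directly from the class_type column.
from_class_type = (zoo_toy["class_type"] - 1).tolist()
print(labelling_toy, from_class_type)
```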

# pre-processing

In [4]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
# + log transform

In [51]:
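# Drop animal_name. Note that this slice still contains class_type (the label);
# zoo.iloc[:, 1:-1] would keep only the 16 trait columns.
zoo_val = zoo.iloc[:,1:]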

mm = MinMaxScaler()
ma = MaxAbsScaler()
ss = StandardScaler()
rs = RobustScaler()
no = Normalizer()

zoo_1 = pd.DataFrame(mm.fit_transform(zoo_val), columns=zoo_val.columns)
zoo_2 = pd.DataFrame(ma.fit_transform(zoo_val), columns=zoo_val.columns)
zoo_3 = pd.DataFrame(ss.fit_transform(zoo_val), columns=zoo_val.columns)
zoo_4 = pd.DataFrame(rs.fit_transform(zoo_val), columns=zoo_val.columns)
zoo_5 = pd.DataFrame(no.fit_transform(zoo_val), columns=zoo_val.columns)
zoo_6 = pd.DataFrame(np.log1p(zoo_val))
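
To see what these transforms do to this kind of data, here is a small sketch on a toy frame with one 0/1 column and one leg-count column (the values are made up). It only illustrates `MinMaxScaler` and `np.log1p`; the other scalers follow the same `fit_transform` pattern.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({"hair": [1, 0, 0, 0], "legs": [4, 2, 0, 8]})

# MinMaxScaler maps each column to [0, 1]: a 0/1 column is unchanged,
# while legs is divided by its maximum (8) because its minimum is 0.
minmax = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# log1p(x) = log(1 + x) compresses the larger leg counts and keeps 0 at 0.
logged = np.log1p(toy)

print(minmax)
print(logged)
```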

# K-Means

In [52]:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

km = KMeans(n_clusters=4, random_state=1) # 4 clusters, although the data has 7 class types
predict0 = pd.DataFrame(km.fit_predict(zoo_val), columns=['predict'])
predict1 = pd.DataFrame(km.fit_predict(zoo_1), columns=['predict'])
predict2 = pd.DataFrame(km.fit_predict(zoo_2), columns=['predict'])
predict3 = pd.DataFrame(km.fit_predict(zoo_3), columns=['predict'])
predict4 = pd.DataFrame(km.fit_predict(zoo_4), columns=['predict'])
predict5 = pd.DataFrame(km.fit_predict(zoo_5), columns=['predict'])
predict6 = pd.DataFrame(km.fit_predict(zoo_6), columns=['predict'])

# While studying I found accuracy_score, which computes (number of correct samples / total number of samples),
# and used it to evaluate the clustering result on the training data.
# Caveat: K-Means cluster ids are arbitrary, so this only counts the samples whose cluster id
# happens to equal their label index.

print('not_Scaled: ', accuracy_score(labelling,predict0))
print('MinMax_Scaled: ', accuracy_score(labelling,predict1))
print('MaxAbs_Scaled: ', accuracy_score(labelling,predict2))
print('Standard_Scaled: ', accuracy_score(labelling,predict3))
print('Robust_Scaled: ', accuracy_score(labelling,predict4))
print('Normalizer_Scaled: ', accuracy_score(labelling,predict5))
print('Log_Scaled: ', accuracy_score(labelling,predict6))

not_Scaled:  0.0
MinMax_Scaled:  0.039603960396039604
MaxAbs_Scaled:  0.13861386138613863
Standard_Scaled:  0.039603960396039604
Robust_Scaled:  0.5247524752475248
Normalizer_Scaled:  0.40594059405940597
Log_Scaled:  0.6039603960396039


When clustering with the K-Means algorithm, the log-transformed data gives the highest accuracy, at roughly 60%.
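
Because cluster ids carry no meaning, a clustering can be perfect and still score 0 with `accuracy_score`. The sketch below, on made-up labels, shows two alternatives that do not depend on the numbering: `adjusted_rand_score`, and accuracy after matching clusters to classes with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`). It is an illustration of the idea, not a re-run of the notebook.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix

# Made-up example: the grouping is perfect, but the cluster ids are permuted.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

raw_acc = accuracy_score(y_true, y_pred)   # compares ids literally
ari = adjusted_rand_score(y_true, y_pred)  # invariant to renaming clusters

# Find the one-to-one cluster-to-class matching with the most agreements.
cm = confusion_matrix(y_true, y_pred)      # rows: true classes, columns: cluster ids
rows, cols = linear_sum_assignment(-cm)    # negate to maximise the matched count
matched_acc = cm[rows, cols].sum() / cm.sum()

print(raw_acc, ari, matched_acc)
```

With fewer clusters than classes (4 versus 7 here), a one-to-one matching leaves some classes unmatched, so the matched accuracy has a ceiling below 1.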

# Decision Tree Classifier

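The Decision Tree we covered last session separates samples with yes/no questions, which seemed well suited to deciding an animal's class, so I tried it.  
One advantage of a Decision Tree is that it is not affected by the scaler, so I want to check directly whether the results actually change, leaving out Normalizer.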
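
One way to run that check is sketched below on synthetic data shaped like the zoo features (three 0/1 traits plus a leg count). The labelling rule and the train/test split are hypothetical, chosen only so the tree has something to learn. Normalizer is left out because it rescales each row rather than each column, so it can change the order of values within a feature.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the zoo features: three 0/1 traits plus a leg count.
X = np.column_stack([
    rng.integers(0, 2, size=(200, 3)),
    rng.choice([0, 2, 4, 6, 8], size=200),
]).astype(float)
# Hypothetical labelling rule, only so that the tree has something to learn.
y = (X[:, 0] + 2 * (X[:, 3] > 3)).astype(int)

X_train, X_test = X[:150], X[150:]
y_train = y[:150]

def tree_predictions(scaler=None):
    """Fit a tree on (optionally scaled) training data and predict the test rows."""
    if scaler is None:
        train, test = X_train, X_test
    else:
        train = scaler.fit_transform(X_train)
        test = scaler.transform(X_test)
    tree = DecisionTreeClassifier(random_state=0).fit(train, y_train)
    return tree.predict(test)

baseline = tree_predictions()
same_as_baseline = {
    type(s).__name__: bool(np.array_equal(baseline, tree_predictions(s)))
    for s in (MinMaxScaler(), StandardScaler())
}
print(same_as_baseline)
```

Both scalers apply an increasing per-column transform, which preserves the ordering the tree's threshold splits depend on, so the predictions are expected to match the unscaled baseline.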
