# Problem Statement

1. Title: Zoo database

2. Source Information
   -- Creator: Richard Forsyth
   -- Donor: Richard S. Forsyth 
             8 Grosvenor Avenue
             Mapperley Park
             Nottingham NG3 5DX
             0602-621676
   -- Date: 5/15/1990
 
3. Past Usage:
   -- None known other than what is shown in Forsyth's PC/BEAGLE User's Guide.

4. Relevant Information:
   -- A simple database containing 17 Boolean-valued attributes.  The "type"
      attribute appears to be the class attribute.  Here is a breakdown of
      which animals are in which type: (I find it unusual that there are
      2 instances of "frog" and one of "girl"!)

      Class# Set of animals:
      ====== ===============================================================
           1 (41) aardvark, antelope, bear, boar, buffalo, calf,
                  cavy, cheetah, deer, dolphin, elephant,
                  fruitbat, giraffe, girl, goat, gorilla, hamster,
                  hare, leopard, lion, lynx, mink, mole, mongoose,
                  opossum, oryx, platypus, polecat, pony,
                  porpoise, puma, pussycat, raccoon, reindeer,
                  seal, sealion, squirrel, vampire, vole, wallaby,wolf
           2 (20) chicken, crow, dove, duck, flamingo, gull, hawk,
                  kiwi, lark, ostrich, parakeet, penguin, pheasant,
                  rhea, skimmer, skua, sparrow, swan, vulture, wren
           3 (5)  pitviper, seasnake, slowworm, tortoise, tuatara 
           4 (13) bass, carp, catfish, chub, dogfish, haddock,
                  herring, pike, piranha, seahorse, sole, stingray, tuna
           5 (4)  frog, frog, newt, toad 
           6 (8)  flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp
           7 (10) clam, crab, crayfish, lobster, octopus,
                  scorpion, seawasp, slug, starfish, worm

5. Number of Instances: 101

6. Number of Attributes: 18 (animal name, 15 Boolean attributes, 2 numerics)

7. Attribute Information: (name of attribute and type of value domain)
   1. animal name:      Unique for each instance
   2. hair		Boolean
   3. feathers		Boolean
   4. eggs		Boolean
   5. milk		Boolean
   6. airborne		Boolean
   7. aquatic		Boolean
   8. predator		Boolean
   9. toothed		Boolean
  10. backbone		Boolean
  11. breathes		Boolean
  12. venomous		Boolean
  13. fins		Boolean
  14. legs		Numeric (set of values: {0,2,4,5,6,8})
  15. tail		Boolean
  16. domestic		Boolean
  17. catsize		Boolean
  18. type		Numeric (integer values in range [1,7])

8. Missing Attribute Values: None

9. Class Distribution: Given above
   


# IMPORTING BASIC LIBRARIES

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# READING THE DATASET 

In [5]:
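# zoo.data has no header row; without header=None the first record (aardvark) becomes the column names
data = pd.read_table(r"C:\Users\Pragya\Downloads\zoo\zoo.data",delimiter=",")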
data.head()

Unnamed: 0,aardvark,1,0,0.1,1.1,0.2,0.3,1.2,1.3,1.4,1.5,0.4,0.5,4,0.6,0.7,1.6,1.7
0,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
1,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
2,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
3,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
4,buffalo,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
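The column labels in the output above are the values of the `aardvark` record, which leaves 100 of the 101 instances in the frame. A minimal sketch of reading headerless data so that every record is kept, using a three-row inline sample in place of the real file (the inline sample and the names `sample`, `columns` and `df` are assumptions for illustration):

```python
import io
import pandas as pd

# Three records taken from the outputs above, standing in for zoo.data.
sample = io.StringIO(
    "aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1\n"
    "antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1\n"
    "bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4\n"
)

columns = ["animal name", "hair", "feathers", "eggs", "milk", "airborne",
           "aquatic", "predator", "toothed", "backbone", "breathes",
           "venomous", "fins", "legs Numeric", "tail", "domestic",
           "catsize", "type Numeric"]

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(sample, header=None, names=columns)

print(df.shape)
print(df["animal name"].tolist())
```

Applied to the full file, the same arguments should yield all 101 rows and make the separate column-renaming step unnecessary.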


**ASSIGNING NAMES TO THE COLUMNS**

In [6]:
data.columns = ["animal name","hair","feathers","eggs","milk","airborne","aquatic","predator","toothed","backbone","breathes","venomous",
"fins","legs Numeric","tail","domestic","catsize","type Numeric" ]

In [7]:
data.head()

Unnamed: 0,animal name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs Numeric,tail,domestic,catsize,type Numeric
0,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
1,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
2,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
3,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
4,buffalo,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1


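# DATA PREPROCESSING

The cells in this section (In [28] to In [33]) were run after the `animal name` column was dropped in the feature-selection step below, which is why their outputs show 17 columns.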

**SHAPE**

In [28]:
data.shape

(100, 17)

**HEAD**

In [29]:
data.head()

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs Numeric,tail,domestic,catsize,type Numeric
0,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
1,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
2,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
3,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
4,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1


**TAIL**

In [31]:
data.tail()

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs Numeric,tail,domestic,catsize,type Numeric
95,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,1,1
96,1,0,1,0,1,0,0,0,0,1,1,0,6,0,0,0,6
97,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
98,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,7
99,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2


**DESCRIPTIVE STATISTICS**

In [32]:
data.describe()

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs Numeric,tail,domestic,catsize,type Numeric
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,0.42,0.2,0.59,0.4,0.24,0.36,0.55,0.6,0.82,0.79,0.08,0.17,2.83,0.75,0.13,0.43,2.85
std,0.496045,0.402015,0.494311,0.492366,0.429235,0.482418,0.5,0.492366,0.386123,0.40936,0.27266,0.377525,2.040276,0.435194,0.337998,0.49757,2.105188
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2.0,0.75,0.0,0.0,1.0
50%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,3.0,1.0,0.0,0.0,2.0
75%,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,4.0,1.0,0.0,1.0,4.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0,1.0,1.0,1.0,7.0


**SUMMARIZED INFORMATION**

In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   hair          100 non-null    int64
 1   feathers      100 non-null    int64
 2   eggs          100 non-null    int64
 3   milk          100 non-null    int64
 4   airborne      100 non-null    int64
 5   aquatic       100 non-null    int64
 6   predator      100 non-null    int64
 7   toothed       100 non-null    int64
 8   backbone      100 non-null    int64
 9   breathes      100 non-null    int64
 10  venomous      100 non-null    int64
 11  fins          100 non-null    int64
 12  legs Numeric  100 non-null    int64
 13  tail          100 non-null    int64
 14  domestic      100 non-null    int64
 15  catsize       100 non-null    int64
 16  type Numeric  100 non-null    int64
dtypes: int64(17)
memory usage: 13.4 KB


**FEATURE SELECTION**

Dropping the column 'animal name' because it is almost entirely unique (only `frog` appears twice), so it acts as an identifier rather than a feature

In [8]:
data["animal name"].value_counts()

animal name
frog        2
pony        1
sealion     1
seal        1
seahorse    1
           ..
gull        1
gorilla     1
goat        1
gnat        1
wren        1
Name: count, Length: 99, dtype: int64

In [9]:
data.drop(["animal name"],axis=1,inplace = True)

In [10]:
data.head()

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs Numeric,tail,domestic,catsize,type Numeric
0,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
1,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
2,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
3,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
4,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1


**Checking for missing values**

In [11]:
data.isnull().sum()

hair            0
feathers        0
eggs            0
milk            0
airborne        0
aquatic         0
predator        0
toothed         0
backbone        0
breathes        0
venomous        0
fins            0
legs Numeric    0
tail            0
domestic        0
catsize         0
type Numeric    0
dtype: int64

**Class Distribution of Target Variable**

In [12]:
data["type Numeric"].value_counts()

type Numeric
1    40
2    20
4    13
7    10
6     8
3     5
5     4
Name: count, dtype: int64

**No need to apply a label encoder because all variables are already numeric**

In [13]:
data.head()

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs Numeric,tail,domestic,catsize,type Numeric
0,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
1,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
2,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
3,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
4,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1


**Creating X and Y**

In [14]:
X = data.values[:,0:-1]
Y = data.values[:,-1]

In [15]:
print(X.shape)
print(Y.shape)

(100, 16)
(100,)


**Scaling the data**

In [16]:
from sklearn.preprocessing import StandardScaler

scaler= StandardScaler()
scaler.fit(X)
X= scaler.transform(X)


**Splitting data into test and train**

In [17]:
# split the data into train and test sets (80/20), stratified by class
from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state = 10,stratify=Y)

In [18]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(80, 16)
(20, 16)
(80,)
(20,)
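In the cells above the scaler is fit on all rows before the split, so the test rows contribute to the means and standard deviations used for scaling. A leakage-free variant splits first and fits the scaler on the training rows only. The sketch below uses random binary data as a stand-in for the zoo features; `X_demo`, `y_demo` and the other names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the zoo features (assumption: 100 rows, 16 columns).
rng = np.random.default_rng(10)
X_demo = rng.integers(0, 2, size=(100, 16)).astype(float)
y_demo = rng.integers(1, 3, size=100)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # the same statistics are reused on test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```

On a dataset this small the effect on the reported scores is probably minor, but the split-then-fit order keeps the test set fully held out.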


# MODEL BUILDING

# MODEL_1:- Applying Logistic Regression

**CREATING MODEL OBJECT**

In [19]:
from sklearn.linear_model import LogisticRegression
# create a model
Animals =LogisticRegression(random_state=10)

# fit the model on the training data (inputs X_train, labels Y_train)
Animals.fit(X_train,Y_train)

# the fitted parameters of the logistic regression: one intercept and one coefficient row per class
print(Animals.intercept_)   # intercepts (beta_0), one per class
print(Animals.coef_)     # coefficients (beta_1 ... beta_16), shape (7, 16)

[ 1.6695224   0.34269181  0.34006503 -0.40055089 -0.35472001 -1.22966581
 -0.36734255]
[[ 8.31143841e-01 -1.15872917e-01 -7.91429118e-01  8.80988460e-01
  -6.33670660e-02 -1.84651321e-01 -4.24093338e-02  4.02234474e-01
   2.06443708e-01  1.76334222e-01 -2.16636579e-01  1.15940096e-02
   3.90157137e-02  2.91124543e-02  1.13396802e-01  3.34300596e-01]
 [-1.21142427e-01  1.18624684e+00  1.25827359e-01 -8.99776280e-02
   5.56703578e-01 -5.81973993e-02  9.98226576e-02 -4.81827294e-01
   2.98597834e-01  1.52319509e-01 -1.19122886e-01 -9.83932484e-02
  -2.42493104e-01  2.91081841e-01  6.29256088e-02 -7.71108953e-04]
 [-4.15481930e-01 -5.33541481e-01  4.64898628e-04 -4.13479281e-01
  -2.68566644e-01 -5.27102662e-01  2.09704928e-02  1.56421061e-01
   5.37019142e-01 -1.46027991e-01  2.78256330e-01 -5.34770101e-01
  -3.14595641e-01  5.42204809e-01 -1.20356318e-01  6.01850522e-02]
 [-1.10031271e-01 -1.05702427e-01  3.00832043e-01 -1.09231530e-01
  -7.82341296e-02  3.37957505e-01 -4.70866413e-02  2

**FITTING MODEL OBJECT**

In [34]:
Animals.fit(X_train,Y_train)

**GENERATING BETA PARAMETERS**

In [35]:
print(Animals.intercept_)   
print(Animals.coef_)     

[ 1.6695224   0.34269181  0.34006503 -0.40055089 -0.35472001 -1.22966581
 -0.36734255]
[[ 8.31143841e-01 -1.15872917e-01 -7.91429118e-01  8.80988460e-01
  -6.33670660e-02 -1.84651321e-01 -4.24093338e-02  4.02234474e-01
   2.06443708e-01  1.76334222e-01 -2.16636579e-01  1.15940096e-02
   3.90157137e-02  2.91124543e-02  1.13396802e-01  3.34300596e-01]
 [-1.21142427e-01  1.18624684e+00  1.25827359e-01 -8.99776280e-02
   5.56703578e-01 -5.81973993e-02  9.98226576e-02 -4.81827294e-01
   2.98597834e-01  1.52319509e-01 -1.19122886e-01 -9.83932484e-02
  -2.42493104e-01  2.91081841e-01  6.29256088e-02 -7.71108953e-04]
 [-4.15481930e-01 -5.33541481e-01  4.64898628e-04 -4.13479281e-01
  -2.68566644e-01 -5.27102662e-01  2.09704928e-02  1.56421061e-01
   5.37019142e-01 -1.46027991e-01  2.78256330e-01 -5.34770101e-01
  -3.14595641e-01  5.42204809e-01 -1.20356318e-01  6.01850522e-02]
 [-1.10031271e-01 -1.05702427e-01  3.00832043e-01 -1.09231530e-01
  -7.82341296e-02  3.37957505e-01 -4.70866413e-02  2

**PREDICTING VALUES OF X_TEST**

In [20]:
Y_pred=Animals.predict(X_test)
print(Y_pred)

[3 4 1 2 5 1 6 2 1 1 1 7 7 6 1 1 2 2 4 1]


**COMPARING Y_ACTUAL V/S Y_PREDICTED**

In [33]:
print(list(zip(Y_test,Y_pred)))

[(3, 3), (4, 4), (1, 1), (2, 2), (5, 5), (1, 1), (6, 6), (2, 2), (1, 1), (1, 1), (1, 1), (7, 7), (7, 7), (6, 6), (1, 1), (1, 1), (2, 2), (2, 2), (4, 4), (1, 1)]


**GENERATING EVALUATION METRICS**

In [22]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
# confusion matrix
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))


# accuracy_score
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[8 0 0 0 0 0 0]
 [0 4 0 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 2 0 0 0]
 [0 0 0 0 1 0 0]
 [0 0 0 0 0 2 0]
 [0 0 0 0 0 0 2]]
Classification report: 
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         8
           2       1.00      1.00      1.00         4
           3       1.00      1.00      1.00         1
           4       1.00      1.00      1.00         2
           5       1.00      1.00      1.00         1
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         2

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

Accuracy of the model:  1.0


# MODEL_2:- Applying Decision Tree

**CREATING MODEL OBJECT**

In [36]:
from sklearn.tree import DecisionTreeClassifier

# create a model
model_DT =DecisionTreeClassifier(random_state=10 , criterion = "gini")

**FITTING MODEL OBJECT**

In [37]:
model_DT.fit(X_train,Y_train)

**PREDICTING VALUES OF X_TEST**

In [38]:
Y_pred = model_DT.predict(X_test)
print(Y_pred)

[3 4 1 2 5 1 6 2 1 1 1 7 7 6 1 1 2 2 4 1]


**COMPARING Y_ACTUAL V/S Y_PREDICTED**

In [24]:
print(list(zip(Y_test,Y_pred)))

[(3, 3), (4, 4), (1, 1), (2, 2), (5, 5), (1, 1), (6, 6), (2, 2), (1, 1), (1, 1), (1, 1), (7, 7), (7, 7), (6, 6), (1, 1), (1, 1), (2, 2), (2, 2), (4, 4), (1, 1)]


**GENERATING EVALUATION METRICS**

In [25]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
# confusion matrix
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))


# accuracy_score
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[8 0 0 0 0 0 0]
 [0 4 0 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 2 0 0 0]
 [0 0 0 0 1 0 0]
 [0 0 0 0 0 2 0]
 [0 0 0 0 0 0 2]]
Classification report: 
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         8
           2       1.00      1.00      1.00         4
           3       1.00      1.00      1.00         1
           4       1.00      1.00      1.00         2
           5       1.00      1.00      1.00         1
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         2

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

Accuracy of the model:  1.0


**CHECK OVERFITTING OF THE MODEL**

In [39]:
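# Note: `Animals` is the logistic regression model, so this is its training accuracy;
# the decision tree's training accuracy would be model_DT.score(X_train,Y_train)
Animals.score(X_train,Y_train)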

1.0

# MODEL_3:- PRUNING THE DECISION TREE MODEL

In [45]:
#predicting using the Decision_Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DT=DecisionTreeClassifier(random_state=10, 
                                         criterion="gini",
                                         splitter="best", 
                                         min_samples_leaf=2,
                                         min_samples_split=2,
                                         max_depth=2, 
                                         )


#fit the model on the data and predict the values
model_DT.fit(X_train,Y_train)
Y_pred=model_DT.predict(X_test)


from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
#confusion matrix
print(confusion_matrix(Y_test,Y_pred))
print(accuracy_score(Y_test,Y_pred))
print(classification_report(Y_test,Y_pred))

[[8 0 0 0 0 0 0]
 [0 4 0 0 0 0 0]
 [0 0 0 1 0 0 0]
 [0 0 0 2 0 0 0]
 [0 0 0 1 0 0 0]
 [0 0 0 2 0 0 0]
 [0 0 0 2 0 0 0]]
0.7
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         8
           2       1.00      1.00      1.00         4
           3       0.00      0.00      0.00         1
           4       0.25      1.00      0.40         2
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         2
           7       0.00      0.00      0.00         2

    accuracy                           0.70        20
   macro avg       0.32      0.43      0.34        20
weighted avg       0.62      0.70      0.64        20
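One likely reason for the drop to 0.70 is that a binary tree with `max_depth=2` has at most 2^2 = 4 leaves, so it can predict at most four of the seven classes. A small sketch on synthetic seven-class data (the data and the names `X_demo`, `y_demo` and `shallow` are made up for illustration, not taken from the zoo set):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one feature, seven well-separated classes (illustrative only).
X_demo = np.arange(70, dtype=float).reshape(-1, 1)
y_demo = np.repeat(np.arange(1, 8), 10)

shallow = DecisionTreeClassifier(random_state=10, max_depth=2).fit(X_demo, y_demo)

n_leaves = shallow.get_n_leaves()                       # capped at 4 for depth 2
n_predicted = len(np.unique(shallow.predict(X_demo)))   # distinct classes the tree can output

print(n_leaves, n_predicted)
```

Even on perfectly separable data, the depth limit leaves some classes unreachable, which matches the all-zero rows in the report above.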



In [47]:
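# Note: this again scores the logistic regression model (`Animals`), not the pruned tree;
# the pruned tree's training accuracy would be model_DT.score(X_train,Y_train)
Animals.score(X_train,Y_train)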

1.0

# MODEL_4:- EXTRA TREES WITH GRID SEARCH

In [48]:
from sklearn.ensemble import ExtraTreesClassifier

model_EXT=ExtraTreesClassifier( random_state=10, bootstrap=True) # fixed parameters are passed here

# parameters to be searched over are passed here
parameter_space = {
    'n_estimators':[100,300,500,1000],       #np.arange(100, 1001,50),
    'max_depth':[10,15,8, 12],
    'min_samples_leaf':[1,3,4,5,6,7]
    }
from sklearn.model_selection import GridSearchCV #RandomizedSearchCV
clf = GridSearchCV(model_EXT, parameter_space, n_jobs=-1, cv=5)


In [49]:
clf.fit(X_train,Y_train)

In [50]:
print('Best parameters found:\n', clf.best_params_)

Best parameters found:
 {'max_depth': 10, 'min_samples_leaf': 1, 'n_estimators': 300}


In [51]:
# mean cross-validated accuracy (5 folds of the training data) for the best parameters
clf.best_score_

0.95
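Because `best_score_` is averaged over cross-validation folds of the training data, it is not directly comparable with the test-set accuracies reported for the other models. A hedged sketch of scoring a tuned model on held-out rows, using scikit-learn's bundled iris data and a deliberately small grid as stand-ins (both are assumptions chosen to keep the example quick):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data; the zoo features and labels would take its place.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

search = GridSearchCV(
    ExtraTreesClassifier(random_state=10, bootstrap=True),
    {"n_estimators": [50, 100], "min_samples_leaf": [1, 3]},
    cv=5,
)
search.fit(X_tr, y_tr)

cv_acc = search.best_score_                            # mean CV accuracy on training folds
test_acc = accuracy_score(y_te, search.predict(X_te))  # accuracy on held-out rows

print(cv_acc, test_acc)
```

With the default `refit=True`, `search.predict` uses the best parameter combination refit on the whole training set, so `test_acc` is the figure comparable to the other models' test accuracies.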

# MODEL_5:- RANDOM FOREST

In [52]:
#predicting using the Random_Forest_Classifier
from sklearn.ensemble import RandomForestClassifier

model_RandomForest=RandomForestClassifier(n_estimators=100,
                                          random_state=10, bootstrap=True,
                                         n_jobs=-1)

#fit the model on the data and predict the values
model_RandomForest.fit(X_train,Y_train)

Y_pred=model_RandomForest.predict(X_test)

In [53]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
# confusion matrix
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))


# accuracy_score
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[8 0 0 0 0 0 0]
 [0 4 0 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 2 0 0 0]
 [0 0 0 0 1 0 0]
 [0 0 0 0 0 2 0]
 [0 0 0 0 0 0 2]]
Classification report: 
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         8
           2       1.00      1.00      1.00         4
           3       1.00      1.00      1.00         1
           4       1.00      1.00      1.00         2
           5       1.00      1.00      1.00         1
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         2

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

Accuracy of the model:  1.0


# CONCLUSION

In [None]:
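Logistic regression, the unpruned decision tree and the random forest all reach an accuracy of 1.0 on the 20-sample test set, while the pruned decision tree (max_depth=2) drops to 0.70.
The grid-searched ExtraTrees model reports a best score of 0.95, but this is a mean cross-validation accuracy on the training data rather than a test-set accuracy, so it is not directly comparable with the other figures.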