Maple Leaves Ltd is a start-up company that makes herbs from different types of plants and their leaves. The system it currently uses to classify the trees it imports in batches is manual: a labourer decides the leaf type and the plant family subtype based on experience. The company has asked us to automate this process and remove the manual intervention from it.

**Objective:** To classify plant leaves from their measured metrics using several classifiers, and to choose the best classifier for future use.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing
import datetime as dt #work with date type
from scipy import stats #Stats frameworks
from sklearn.model_selection import train_test_split # Split

#Visualisation frameworks
import matplotlib.pyplot as plt
%matplotlib inline  
import seaborn as sns

**1) Import the train and test csv.**

In [2]:
#Loading train csv
train = pd.read_csv('C:/Users/gabri/Desktop/Data Science/Python/Python for Data Science - Edureka/Certification Project/train.csv')

#Printing first five rows and shape
print(train.shape)
train.head()

(990, 194)


Unnamed: 0,id,species,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,Acer_Opalus,0.007812,0.023438,0.023438,0.003906,0.011719,0.009766,0.027344,0.0,...,0.007812,0.0,0.00293,0.00293,0.035156,0.0,0.0,0.004883,0.0,0.025391
1,2,Pterocarya_Stenoptera,0.005859,0.0,0.03125,0.015625,0.025391,0.001953,0.019531,0.0,...,0.000977,0.0,0.0,0.000977,0.023438,0.0,0.0,0.000977,0.039062,0.022461
2,3,Quercus_Hartwissiana,0.005859,0.009766,0.019531,0.007812,0.003906,0.005859,0.068359,0.0,...,0.1543,0.0,0.005859,0.000977,0.007812,0.0,0.0,0.0,0.020508,0.00293
3,5,Tilia_Tomentosa,0.0,0.003906,0.023438,0.005859,0.021484,0.019531,0.023438,0.0,...,0.0,0.000977,0.0,0.0,0.020508,0.0,0.0,0.017578,0.0,0.047852
4,6,Quercus_Variabilis,0.005859,0.003906,0.048828,0.009766,0.013672,0.015625,0.005859,0.0,...,0.09668,0.0,0.021484,0.0,0.0,0.0,0.0,0.0,0.0,0.03125


In [3]:
#Loading test csv
test = pd.read_csv('C:/Users/gabri/Desktop/Data Science/Python/Python for Data Science - Edureka/Certification Project/test.csv')

#Printing first five rows and shape
print(test.shape)
test.head()

(594, 193)


Unnamed: 0,id,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,margin9,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,4,0.019531,0.009766,0.078125,0.011719,0.003906,0.015625,0.005859,0.0,0.005859,...,0.006836,0.0,0.015625,0.000977,0.015625,0.0,0.0,0.0,0.003906,0.053711
1,7,0.007812,0.005859,0.064453,0.009766,0.003906,0.013672,0.007812,0.0,0.033203,...,0.0,0.0,0.006836,0.001953,0.013672,0.0,0.0,0.000977,0.037109,0.044922
2,9,0.0,0.0,0.001953,0.021484,0.041016,0.0,0.023438,0.0,0.011719,...,0.12891,0.0,0.000977,0.0,0.0,0.0,0.0,0.015625,0.0,0.0
3,12,0.0,0.0,0.009766,0.011719,0.017578,0.0,0.003906,0.0,0.003906,...,0.012695,0.015625,0.00293,0.036133,0.013672,0.0,0.0,0.089844,0.0,0.008789
4,13,0.001953,0.0,0.015625,0.009766,0.039062,0.0,0.009766,0.0,0.005859,...,0.0,0.042969,0.016602,0.010742,0.041016,0.0,0.0,0.007812,0.009766,0.007812
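The shapes above, (990, 194) for train and (594, 193) for test, suggest that the test set has the same columns as the train set minus `species`. A minimal sketch of how that could be checked, using small hypothetical frames (`train_demo`, `test_demo`) in place of the real CSV files:

```python
import pandas as pd

# Hypothetical miniature frames standing in for train and test
train_demo = pd.DataFrame({
    'id': [1, 2],
    'species': ['Acer_Opalus', 'Quercus_Rubra'],
    'margin1': [0.007812, 0.005859],
})
test_demo = pd.DataFrame({
    'id': [4, 7],
    'margin1': [0.019531, 0.007812],
})

# Feature columns are every train column except the target
feature_cols = train_demo.columns.drop('species')

# Columns the model would expect but the test set lacks
missing_in_test = feature_cols.difference(test_demo.columns)

# Column order matters when the frame is passed straight to predict()
same_order = list(feature_cols) == list(test_demo.columns)

print(list(missing_in_test), same_order)
```

Running the same check on the real `train` and `test` frames would confirm that a model fitted on the train features can be applied to `test` unchanged.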


**2) Import the required classification libraries along with pandas, numpy, seaborn, etc.**

**3) Then import the classifiers from them (Randomforest, SVM, NaiveBayes, DecisionTrees)**

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import LabelEncoder

**4) After this create a function to encode the labels of the strings given in the dataset.**

**5) You can do the above step using a label encoder. With this you are creating labels from the train set, part of which will be held out as a test set. The test set we imported is for testing the best classifier once we have chosen it.**

In [5]:
cols_object = train.select_dtypes(include=['object']).columns  # np.object was removed in NumPy 1.24
cols_object

Index(['species'], dtype='object')

In [6]:
train.species.unique()

array(['Acer_Opalus', 'Pterocarya_Stenoptera', 'Quercus_Hartwissiana',
       'Tilia_Tomentosa', 'Quercus_Variabilis', 'Magnolia_Salicifolia',
       'Quercus_Canariensis', 'Quercus_Rubra', 'Quercus_Brantii',
       'Salix_Fragilis', 'Zelkova_Serrata', 'Betula_Austrosinensis',
       'Quercus_Pontica', 'Quercus_Afares', 'Quercus_Coccifera',
       'Fagus_Sylvatica', 'Phildelphus', 'Acer_Palmatum',
       'Quercus_Pubescens', 'Populus_Adenopoda', 'Quercus_Trojana',
       'Alnus_Sieboldiana', 'Quercus_Ilex', 'Arundinaria_Simonii',
       'Acer_Platanoids', 'Quercus_Phillyraeoides', 'Cornus_Chinensis',
       'Liriodendron_Tulipifera', 'Cytisus_Battandieri',
       'Rhododendron_x_Russellianum', 'Alnus_Rubra',
       'Eucalyptus_Glaucescens', 'Cercis_Siliquastrum',
       'Cotinus_Coggygria', 'Celtis_Koraiensis', 'Quercus_Crassifolia',
       'Quercus_Kewensis', 'Cornus_Controversa', 'Quercus_Pyrenaica',
       'Callicarpa_Bodinieri', 'Quercus_Alnifolia', 'Acer_Saccharinum',
       'Prun

In [7]:
train.species

0                Acer_Opalus
1      Pterocarya_Stenoptera
2       Quercus_Hartwissiana
3            Tilia_Tomentosa
4         Quercus_Variabilis
               ...          
985     Magnolia_Salicifolia
986              Acer_Pictum
987       Alnus_Maximowiczii
988            Quercus_Rubra
989           Quercus_Afares
Name: species, Length: 990, dtype: object

So this is our target column. Since it's categorical and not ordinal, let's label encode it.

In [8]:
le = LabelEncoder()

train['species'] = le.fit_transform(train['species'].values)

train.species

0       3
1      49
2      65
3      94
4      84
       ..
985    40
986     5
987    11
988    78
989    50
Name: species, Length: 990, dtype: int32
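To make the encoding step concrete, here is a minimal sketch of how `LabelEncoder` maps species names to integers and back. The four labels in `species_demo` are a hypothetical sample, not the full species column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for train['species']
species_demo = pd.Series(
    ['Quercus_Rubra', 'Acer_Opalus', 'Quercus_Rubra', 'Tilia_Tomentosa']
)

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(species_demo)

# Classes are stored in sorted order; each code is an index into classes_
print(list(le_demo.classes_))  # ['Acer_Opalus', 'Quercus_Rubra', 'Tilia_Tomentosa']
print(encoded.tolist())        # [1, 0, 1, 2]

# inverse_transform recovers the original species names from the codes
decoded = le_demo.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around is useful later, because the integer predictions of a classifier can be turned back into species names with `inverse_transform`.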

**6) Then extract the values from the train set by stratifying them and dividing them in an 80:20 ratio**

In [9]:
X = train.drop('species',axis=1)
y = train['species']

#Printing results
print('X shape:', X.shape, '\ny shape:', y.shape)
X.head()

X shape: (990, 193) 
y shape: (990,)


Unnamed: 0,id,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,margin9,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,0.007812,0.023438,0.023438,0.003906,0.011719,0.009766,0.027344,0.0,0.001953,...,0.007812,0.0,0.00293,0.00293,0.035156,0.0,0.0,0.004883,0.0,0.025391
1,2,0.005859,0.0,0.03125,0.015625,0.025391,0.001953,0.019531,0.0,0.0,...,0.000977,0.0,0.0,0.000977,0.023438,0.0,0.0,0.000977,0.039062,0.022461
2,3,0.005859,0.009766,0.019531,0.007812,0.003906,0.005859,0.068359,0.0,0.0,...,0.1543,0.0,0.005859,0.000977,0.007812,0.0,0.0,0.0,0.020508,0.00293
3,5,0.0,0.003906,0.023438,0.005859,0.021484,0.019531,0.023438,0.0,0.013672,...,0.0,0.000977,0.0,0.0,0.020508,0.0,0.0,0.017578,0.0,0.047852
4,6,0.005859,0.003906,0.048828,0.009766,0.013672,0.015625,0.005859,0.0,0.0,...,0.09668,0.0,0.021484,0.0,0.0,0.0,0.0,0.0,0.0,0.03125


In [10]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
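Step 6 asks for a stratified split, while the call above splits at random. As a sketch of what stratification does, the example below passes `stratify` to `train_test_split` on hypothetical toy data (`X_demo`, `y_demo`), not on the leaf dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 classes with 10 samples each
rng = np.random.default_rng(0)
X_demo = rng.random((50, 4))
y_demo = np.repeat(np.arange(5), 10)

# stratify=y_demo keeps the class proportions equal in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Count how many samples of each class landed in each part
train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_va)
print(train_counts, val_counts)
```

With stratification every class appears in both parts in the same 80:20 proportion, so the validation part cannot end up missing a class that the model was trained on.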

**7) Now your X train, X test, Y train, Y test are ready.**

**8) We currently don’t know which is the best classifier for the dataset, so we apply all four of them.**

In [27]:
#Creating model and fitting
RFC = RandomForestClassifier(random_state=5).fit(X_train, y_train)
SVC_model = SVC(probability=True).fit(X_train, y_train)
Gaussian = GaussianNB().fit(X_train, y_train)
DTC = DecisionTreeClassifier().fit(X_train, y_train)

In [43]:
#Predicting probabilities (for log loss). This is done on the training data so that log loss works,
#because the validation split contains a different number of classes
RFC_probs = RFC.predict_proba(X_train)
SVC_probs = SVC_model.predict_proba(X_train)
Gaussian_probs = Gaussian.predict_proba(X_train)
DTC_probs = DTC.predict_proba(X_train)

#Predicting
RFC_predict = RFC.predict(X_train)
SVC_predict = SVC_model.predict(X_train)
Gaussian_predict = Gaussian.predict(X_train)
DTC_predict = DTC.predict(X_train)

**9) Create the classifiers class and initialize all the respective classifiers**

**10) Then run the X train & X test datasets through the classifiers, calculating the log loss and accuracy of the result**

**11)Choose the classifier which has the best accuracy**

In [44]:
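from sklearn.metrics import accuracy_score  # needed here because this cell runs before the metrics import below
accuracy_score(y_train, RFC_predict)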

1.0

In [49]:
from sklearn.metrics import log_loss, accuracy_score


print('RandomForestClassifier log loss / accuracy score:\n',
      log_loss(y_train, RFC_probs),' / ', accuracy_score(y_train, RFC_predict))

print('SVC log loss / accuracy score:\n',
      log_loss(y_train, SVC_probs),' / ', accuracy_score(y_train, SVC_predict))

print('GaussianNB log loss / accuracy score:\n',
      log_loss(y_train, Gaussian_probs),' / ', accuracy_score(y_train, Gaussian_predict))

print('DecisionTreeClassifier log loss / accuracy score:\n',
      log_loss(y_train, DTC_probs),' / ', accuracy_score(y_train, DTC_predict))

RandomForestClassifier log loss / accuracy score:
 0.2250476436181116  /  1.0
SVC log loss / accuracy score:
 4.678758783793658  /  0.025252525252525252
GaussianNB log loss / accuracy score:
 0.045577898097906966  /  0.9949494949494949
DecisionTreeClassifier log loss / accuracy score:
 9.88692853737702e-14  /  1.0


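RandomForestClassifier seems to be the best fit. Note that these scores are measured on the training data, so they show how well each model fits data it has already seen rather than how well it generalises. The four result pairs above are, in order, for RandomForestClassifier, SVC, GaussianNB and DecisionTreeClassifier: the first and last both reach an accuracy of 1.0, and the last has the lowest log loss.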
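Steps 9 to 11 can also be sketched as a loop over a dictionary of classifiers that scores each one on held-out data. The example below is a minimal sketch on hypothetical toy data, not the leaf dataset. It assumes the log loss problem mentioned earlier comes from a validation split that lacks some classes, and it handles that case by passing `labels=model.classes_` to `log_loss`. One class is removed from the validation part on purpose to show this.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 6 well-separated classes, 20 samples each
rng = np.random.default_rng(0)
y_demo = np.repeat(np.arange(6), 20)
X_demo = rng.normal(loc=y_demo[:, None] * 3.0, scale=0.5, size=(120, 4))

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Drop class 5 from the validation part to mimic a split with a missing class
keep = y_va != 5
X_va, y_va = X_va[keep], y_va[keep]

models = {
    'RandomForestClassifier': RandomForestClassifier(random_state=5),
    'SVC': SVC(probability=True, random_state=5),
    'GaussianNB': GaussianNB(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=5),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    probs = model.predict_proba(X_va)
    results[name] = {
        # labels tells log_loss which class each probability column refers to,
        # so a class that is absent from y_va does not raise an error
        'log_loss': log_loss(y_va, probs, labels=model.classes_),
        'accuracy': accuracy_score(y_va, model.predict(X_va)),
    }

for name, scores in results.items():
    print(name, scores)
```

Scoring on the held-out part rather than on the training data gives an estimate of how each classifier behaves on leaves it has not seen, which is the quantity that matters when choosing one for `test.csv`.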

**12) Then try to predict the result on the imported test.csv dataset**

In [51]:
final_predict = DTC.predict(test)
final_predict

array([51, 58,  1, 19, 49, 43, 58, 28, 84,  1, 69, 74, 46, 10, 52, 46, 45,
       30, 13, 71, 61, 68, 57, 27,  1, 70, 28, 15, 35, 70, 53, 74,  6,  2,
        4, 36, 14, 55,  3, 10,  8, 32, 48,  9, 71, 70, 54, 96,  8, 89, 17,
       80, 54, 94, 14, 30, 62, 33, 51, 10, 88, 56, 21, 59, 65, 12, 48, 84,
       13,  4, 54, 57, 29,  7, 21, 98, 79, 84, 25, 10, 61, 97, 58, 24,  1,
        2, 55, 84, 40, 22, 48, 90, 25, 21, 36, 56, 50, 95,  7, 89, 98, 27,
        3, 85, 31, 84, 12, 96, 64, 72, 92, 93, 67, 29,  8, 88, 69, 40,  6,
       57, 34, 90, 28, 17, 88, 27, 56, 44, 38, 96, 68, 34, 41, 61, 18, 97,
       29, 28, 85, 81, 64, 56, 86, 62, 60, 28, 95, 64, 34, 34, 95, 20, 59,
       35, 86,  1, 83, 38, 43, 83, 20, 60, 46, 96, 22, 79, 86, 87, 54, 97,
       75, 21, 22, 21, 17, 49, 37, 94, 27, 29, 15, 45,  7, 54, 43, 77, 30,
       41, 40, 48, 89, 72, 42, 11, 30, 95, 18, 91, 29, 64, 80,  6, 78, 45,
       43,  9, 78, 90, 44, 89, 73, 91,  2, 59,  0, 96, 70, 50, 22, 78, 60,
       55, 44, 38,  5, 60