<a href="https://colab.research.google.com/github/MKolaksazov/Machine-Learning/blob/master/Kaggle%20competitions/Leaf_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Leaf Classification
#### **Can you see the random forest for the leaves?**

There are estimated to be nearly half a million species of plant in the world. Classification of species has been historically problematic and often results in duplicate identifications.

The objective of this playground competition is to use binary leaf images and extracted features, including shape, margin & texture, to accurately identify 99 species of plants. Leaves, due to their volume, prevalence, and unique characteristics, are an effective means of differentiating plant species. They also provide a fun introduction to applying techniques that involve image-based features.

As a first step, try building a classifier that uses the provided pre-extracted features. Next, try creating a set of your own features. Finally, examine the errors you're making and see what you can do to improve.

# Downloading the data set from Kaggle

1) Install the Kaggle command-line client;
<br>2) Upload the API token file `kaggle.json` and copy it to `~/.kaggle/`;
<br>3) Download and unzip the competition data.

In [1]:
! pip install kaggle # Run this only if the kaggle command in the next cell is not found



In [2]:
! mkdir -p ~/.kaggle  # -p avoids an error if the directory already exists

from google.colab import files
files.upload()  # prompts for the kaggle.json API token file

! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions download -c leaf-classification
! mkdir unzip
! unzip train.csv.zip -d unzip
! unzip test.csv.zip -d unzip

Saving kaggle.json to kaggle.json
Downloading test.csv.zip to /content
  0% 0.00/215k [00:00<?, ?B/s]
100% 215k/215k [00:00<00:00, 57.0MB/s]
Downloading sample_submission.csv.zip to /content
  0% 0.00/6.15k [00:00<?, ?B/s]
100% 6.15k/6.15k [00:00<00:00, 6.51MB/s]
Downloading train.csv.zip to /content
  0% 0.00/371k [00:00<?, ?B/s]
100% 371k/371k [00:00<00:00, 52.7MB/s]
Downloading images.zip to /content
 95% 32.0M/33.8M [00:00<00:00, 38.8MB/s]
100% 33.8M/33.8M [00:00<00:00, 76.2MB/s]
Archive:  train.csv.zip
  inflating: unzip/train.csv         
Archive:  test.csv.zip
  inflating: unzip/test.csv          


In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math

In [4]:
## Read data from the CSV file

data = pd.read_csv('unzip/train.csv')
parent_data = data.copy()    ## Always a good idea to keep a copy of original data
ID = data.pop('id')

In [5]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases

y = data.pop('species')
y = LabelEncoder().fit(y).transform(y)
print(y.shape)

X = StandardScaler().fit(data).transform(data)
print(X.shape)

y_cat = to_categorical(y)
print(y_cat.shape)

(990,)
(990, 192)
(990, 99)
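
The shapes above follow from the two encoding steps: `LabelEncoder` maps the species names to integers in sorted order, and `to_categorical` expands each integer into a one-hot row with one column per class. A minimal NumPy sketch of the same idea follows. It is an illustration on made-up labels, not the library implementations, and the species names are placeholders rather than rows from `train.csv`.

```python
import numpy as np

# Toy labels; placeholder names chosen only for illustration.
species = np.array(["Quercus_rubra", "Acer_opalus", "Quercus_rubra", "Betula_pendula"])

# np.unique returns the sorted class names and, for each sample, its index
# into that sorted array, which mirrors how LabelEncoder assigns integers.
classes, y_int = np.unique(species, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
y_onehot = np.eye(len(classes))[y_int]

print(classes)         # sorted class names
print(y_int)           # integer label per sample
print(y_onehot.shape)  # (n_samples, n_classes)
```

With 990 samples and 99 species, the same steps give the `(990,)` and `(990, 99)` shapes printed above.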


In [6]:
from keras.models import Sequential
from keras.layers import Dense,Dropout,Activation

model = Sequential()
model.add(Dense(1024, input_dim=192, activation='tanh'))  # 192 pre-extracted features per leaf
model.add(Dropout(0.2))
model.add(Dense(1024, activation='sigmoid'))
model.add(Dropout(0.1))
model.add(Dense(99, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 1024)              197632    
                                                                 
 dropout (Dropout)           (None, 1024)              0         
                                                                 
 dense_1 (Dense)             (None, 1024)              1049600   
                                                                 
 dropout_1 (Dropout)         (None, 1024)              0         
                                                                 
 dense_2 (Dense)             (None, 99)                101475    
                                                                 
Total params: 1,348,707
Trainable params: 1,348,707
Non-trainable params: 0
_________________________________________________________________
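
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output unit, and dropout layers add none. A short sketch of that arithmetic, where `dense_params` is a helper name introduced here for illustration:

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Input features, two hidden layers, 99 output classes (as in the model above).
layer_sizes = [192, 1024, 1024, 99]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # per-layer parameter counts
print(total)   # total trainable parameters
```

The three counts match the `Param #` column and sum to the reported total of 1,348,707. That is more than a thousand parameters per training sample, so overfitting is a real risk for this model.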


In [7]:
from keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 280 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=280)
history = model.fit(X, y_cat, batch_size=192, epochs=800, verbose=0,
                    validation_split=0.1, callbacks=[early_stopping])
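
To make the role of `patience` concrete, here is a simplified sketch of the stopping rule in plain Python. It is an approximation of the idea and not the Keras implementation; `early_stop_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None
    if the loss keeps improving often enough to run all epochs."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Invented validation losses, for illustration only.
val_losses = [0.9, 0.5, 0.4, 0.45, 0.41, 0.42, 0.39]
stop = early_stop_epoch(val_losses, patience=3)
print(stop)
```

With `patience=280` and at most 800 epochs, training continues long after the best validation epoch. The `restore_best_weights=True` option of `EarlyStopping` may be worth trying so that the final model is the one from the best epoch and not the last one.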

In [9]:
print()
print("train/val loss ratio: ", min(history.history['loss'])/min(history.history['val_loss']))
print('train_acc: ',max(history.history['accuracy']))
print('train_loss: ',min(history.history['loss']))


train/val loss ratio:  4.075133727211505e-05
train_acc:  1.0
train_loss:  3.3581962100015517e-08
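
The summary above reports training metrics only, and the ratio compares two minima that may come from different epochs. Dividing the two printed numbers suggests a minimum validation loss of roughly 8e-4. A sketch of a summary centred on the best validation epoch is shown below. The dictionary stands in for `history.history` with invented numbers, and the `val_accuracy` key is an assumption about what Keras records when `validation_split` is used with `metrics=["accuracy"]`.

```python
# Stand-in for history.history; all values are invented for illustration.
hist = {
    "loss": [2.0, 0.8, 0.2, 0.05],
    "val_loss": [2.1, 1.0, 0.6, 0.7],
    "accuracy": [0.3, 0.8, 0.95, 1.0],
    "val_accuracy": [0.25, 0.7, 0.85, 0.84],
}

# Epoch (0-based) with the lowest validation loss.
best_epoch = min(range(len(hist["val_loss"])), key=hist["val_loss"].__getitem__)

summary = {
    "best_epoch": best_epoch,
    "val_loss": hist["val_loss"][best_epoch],
    "val_accuracy": hist["val_accuracy"][best_epoch],
    "train_loss_at_best": hist["loss"][best_epoch],
}
print(summary)
```

Reading training and validation metrics at the same epoch gives a fairer picture of the generalization gap than comparing the two minima.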


In [10]:
test = pd.read_csv('unzip/test.csv')
index = test.pop('id')
test = StandardScaler().fit(data).transform(test)  # scale test rows with the training-set mean and variance
yPred = model.predict(test)

yPred = pd.DataFrame(yPred,index=index,columns=sorted(parent_data.species.unique()))

# Write the submission directly; pandas opens and closes the file itself
yPred.to_csv('submission_nn_kernel.csv')
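
Before uploading, it can help to sanity-check the predicted probabilities and estimate a score on held-out rows. The sketch below assumes the competition is scored with multi-class logarithmic loss; `multiclass_log_loss` is a helper written here for illustration, not the official scorer, and the labels and probabilities are made up.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1 - eps)               # avoid log(0)
    probs = probs / probs.sum(axis=1, keepdims=True)   # renormalise each row
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

# Invented example: 3 samples, 3 classes.
y_true = np.array([0, 2, 1])
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
])

# Each row of a softmax output should sum to 1.
rows_ok = np.allclose(probs.sum(axis=1), 1.0)
loss = multiclass_log_loss(y_true, probs)
print(rows_ok, loss)
```

Applied to the validation split, the same function gives a rough preview of the leaderboard metric under that assumption.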