# Solution to Kaggle Problem 'Plant Seedlings Classification'

***

**Name: AI-23**

**Submission Date: 16-01-2018**

***

# Abstract

In this problem, images of plant seedlings from twelve different species are provided. The task is to accurately predict the species name of the images from the 'test' image directory.

The main purposes of solving this problem are:
* Using a Keras deep model such as a CNN to classify image files
* Using a pretrained Keras model to reduce the training time and increase efficiency

# Convolutional Neural Network Approach

A CNN (Convolutional Neural Network) is a state-of-the-art model for image classification problems. Since the goal of this experiment is to classify the test images into twelve species, a CNN is used as the backbone of our solution. In the [first approach](#f), a CNN model is built from scratch, consisting of six convolution layers, three max-pooling layers, four dropout layers, one flatten layer, one batch-normalization layer and two fully connected layers. In the [second approach](#s), the pretrained Keras CNN model 'Xception' is imported and its weights are used as the initial weights of our CNN model. At first, the weights of all layers, including the convolutional layers, are tuned, since this raises the training and validation accuracy per epoch more quickly. Later, once the training and validation accuracy have increased enough and the training accuracy starts to overtake the validation accuracy, all layers of the Xception model are frozen. This gives better fine-tuning in terms of validation accuracy, as it limits overfitting, and gives good accuracy in less time.

<a id="f"></a>
### First Approach

In this approach, a homemade untrained CNN model is used to solve the problem. First, the necessary libraries are imported.

In [7]:
import numpy as np 
import pandas as pd 
import os, cv2
from tqdm import tqdm
from sklearn.model_selection import StratifiedShuffleSplit

from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Flatten,MaxPool2D
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from keras.applications import *


Then the structures for the training and test data and the label mapping are created.

In [2]:
x_train = []
x_test = []
y_train = []

df_test = pd.read_csv('sample_submission.csv')

label_map = {   "Black-grass"               :0,
                "Charlock"                  :1,
                "Cleavers"                  :2,
                "Common Chickweed"          :3,
                "Common wheat"              :4,
                "Fat Hen"                   :5,
                "Loose Silky-bent"          :6,
                "Maize"                     :7,
                "Scentless Mayweed"         :8,
                "Shepherds Purse"           :9,
                "Small-flowered Cranesbill" :10,
                "Sugar beet"                :11}

dim = 256

Then the training data are prepared by reading the image files, resizing them and labelling them according to their directories.

In [3]:
dirs = os.listdir("../seeddata/train/")
for k in tqdm(range(len(dirs))):    # Directory
    files = os.listdir("../seeddata/train/{}".format(dirs[k]))
    for f in range(len(files)):     # Files
        img = cv2.imread('../seeddata/train/{}/{}'.format(dirs[k], files[f]))
        targets = np.zeros(12)
        targets[label_map[dirs[k]]] = 1 
        x_train.append(cv2.resize(img, (dim, dim)))
        y_train.append(targets)
        
y_train = np.array(y_train, np.uint8)
x_train = np.array(x_train, np.float32)

print(x_train.shape)
print(y_train.shape)


100%|██████████| 12/12 [00:50<00:00,  4.09s/it]


(4750, 256, 256, 3)
(4750, 12)
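
The loop above builds each one-hot target by hand. As a minimal sketch of the same encoding, assuming only NumPy and a hypothetical list of directory names (`image_dirs` is not part of the notebook), the rows of an identity matrix can be indexed by class id:

```python
import numpy as np

# Subset of the notebook's label map, for illustration only.
label_map = {"Black-grass": 0, "Charlock": 1, "Cleavers": 2}

# Hypothetical directory names, one per image.
image_dirs = ["Charlock", "Black-grass", "Cleavers", "Charlock"]

class_ids = np.array([label_map[d] for d in image_dirs])
# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(label_map), dtype=np.uint8)[class_ids]
print(one_hot)
```

Each row contains a single 1 in the column given by the label map, which is the format expected by the `categorical_crossentropy` loss used later.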


StratifiedShuffleSplit is used to create the training and validation data with a balanced split across all classes.

In [4]:
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.16, random_state=42) # Want a balanced split for all the classes
for train_index, test_index in sss.split(x_train, y_train):
    print("Using {} for training and {} for validation".format(len(train_index), len(test_index)))
    x_train, x_valid = x_train[train_index], x_train[test_index]
    y_train, y_valid = y_train[train_index], y_train[test_index]


Using 3990 for training and 760 for validation
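
With 4750 images and `test_size=0.16`, 760 images are held out and 3990 remain for training. One way to confirm that a split is balanced is to compare the per-class fractions of both parts. The sketch below assumes one-hot labels like `y_train` and `y_valid`; the helper `class_fractions` and the toy arrays are hypothetical and only illustrate the check:

```python
import numpy as np

def class_fractions(one_hot_labels):
    """Return the fraction of samples belonging to each class."""
    counts = one_hot_labels.sum(axis=0)
    return counts / counts.sum()

# Toy one-hot labels standing in for y_train and y_valid (hypothetical data).
y_train_demo = np.eye(3, dtype=np.uint8)[[0, 0, 1, 1, 2, 2, 2, 2]]
y_valid_demo = np.eye(3, dtype=np.uint8)[[0, 1, 2, 2]]

train_frac = class_fractions(y_train_demo)
valid_frac = class_fractions(y_valid_demo)
print(train_frac, valid_frac)
```

For a stratified split the two vectors should be nearly identical, up to rounding of the per-class counts.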


Then the image augmentation, number of epochs, learning rate, batch size and callback functions are defined.

In [5]:
datagen = ImageDataGenerator( horizontal_flip=True, 
                              vertical_flip=True)
                                      
weights = os.path.join('', 'weights.h5')

epochs = 3
learning_rate = 0.0001
batch_size = 32

callbacks = [ EarlyStopping(monitor='val_loss', patience=5, verbose=0), 
              ModelCheckpoint(weights, monitor='val_loss', save_best_only=True, verbose=0),
              ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=0, mode='auto', epsilon=0.0001, cooldown=0, min_lr=0)]
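
The generator above enables horizontal and vertical flips, each of which is applied at random during training. As a rough illustration of what the two flips do to an image array, here is a NumPy-only sketch on a tiny hypothetical image; it does not reproduce the generator's random sampling:

```python
import numpy as np

# A tiny 2x3 single-channel "image" (hypothetical data) with shape (H, W, C).
image = np.arange(6).reshape(2, 3, 1)

flipped_lr = image[:, ::-1, :]  # horizontal flip: reverse the width axis
flipped_ud = image[::-1, :, :]  # vertical flip: reverse the height axis

print(flipped_lr[..., 0])
print(flipped_ud[..., 0])
```

Seedlings photographed from above have no preferred orientation, so both flips should produce plausible training images without changing the label.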


The CNN architecture is defined after observing its performance in terms of training and validation accuracy.

In [26]:
# Set the CNN model 
# The CNN architecture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*3 -> Flatten -> Dense -> BatchNormalization -> Dropout -> Out

model = Sequential()

model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (dim,dim,3)))
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))


model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(BatchNormalization())
model.add(Dropout(0.8))
model.add(Dense(12, activation = "softmax"))
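
To see why the first fully connected layer dominates the size of this model, the feature-map shape can be followed through the three blocks. The sketch below is plain arithmetic under the assumption that 'same' padding keeps height and width unchanged and that each 2x2 max-pool with stride 2 halves them; `trace_shapes` is a hypothetical helper, not a Keras function:

```python
def trace_shapes(dim, blocks):
    """Follow the spatial size and channel count through conv/pool blocks."""
    size = dim
    channels = 3
    for filters in blocks:
        channels = filters  # two 'same'-padded convolutions set the channel count
        size = size // 2    # one 2x2 max-pool halves height and width
    return size, channels

size, channels = trace_shapes(256, [32, 64, 32])
flat_features = size * size * channels
dense_params = flat_features * 256 + 256  # weights plus biases of Dense(256)
print(size, channels, flat_features, dense_params)
```

Under these assumptions the flattened vector has 32 * 32 * 32 = 32768 entries, so `Dense(256)` alone holds roughly 8.4 million parameters.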

The model is trained on the dataset for 10 epochs in total, in two runs of five epochs each because of resource constraints.

In [28]:
model.compile(loss='categorical_crossentropy', optimizer=optimizers.Adam(lr=learning_rate), metrics=['accuracy'])

# ------ TRAINING ------
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    steps_per_epoch=len(x_train)/batch_size, 
                    validation_data=datagen.flow(x_valid, y_valid, batch_size=batch_size), 
                    validation_steps=len(x_valid)/batch_size,
                    callbacks=callbacks,
                    epochs=epochs, 
                    verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fa3897a23c8>
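
The training call passes `len(x_train)/batch_size` as `steps_per_epoch`, which is a fractional value for this dataset. As a small sketch, assuming the intent is to visit every sample once per epoch, the step counts can be rounded up explicitly:

```python
import math

num_train, num_valid, batch_size = 3990, 760, 32

# Plain division yields a fractional number of batches.
fractional_steps = num_train / batch_size
print(fractional_steps)

# Rounding up counts the final, smaller batch as one step.
steps_per_epoch = math.ceil(num_train / batch_size)
validation_steps = math.ceil(num_valid / batch_size)
print(steps_per_epoch, validation_steps)
```

How a fractional step count is handled depends on the Keras version, so passing an integer is the less ambiguous choice.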

Then the labels of the test images are predicted and saved in the 'rawcnnsubmission.csv' file.

In [29]:
for f, species in tqdm(df_test.values, miniters=100):
    img = cv2.imread('../seeddata/test/{}'.format(f))
    x_test.append(cv2.resize(img, (dim, dim)))

x_test = np.array(x_test, np.float32)
print(x_test.shape)

if os.path.isfile(weights):
    model.load_weights(weights)

p_test = model.predict(x_test, verbose=1)

preds = []
for i in range(len(p_test)):
    pos = np.argmax(p_test[i])
    preds.append(list(label_map.keys())[list(label_map.values()).index(pos)])
    
df_test['species'] = preds
df_test.to_csv('rawcnnsubmission.csv', index=False)

100%|██████████| 794/794 [00:02<00:00, 271.04it/s]


(794, 256, 256, 3)
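
The prediction loop above searches the label map's values for every image. A possible simplification, sketched here with a subset of the label map and hypothetical softmax outputs (`p_demo` is not from the notebook), is to invert the dictionary once and take the argmax over all rows in a single call:

```python
import numpy as np

# Subset of the notebook's label map, for illustration only.
label_map = {"Black-grass": 0, "Charlock": 1, "Cleavers": 2}
# Invert once so that class index -> species name is a direct lookup.
index_to_label = {index: name for name, index in label_map.items()}

# Hypothetical softmax outputs for three test images.
p_demo = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.2, 0.3, 0.5]])

preds_demo = [index_to_label[i] for i in p_demo.argmax(axis=1)]
print(preds_demo)
```

The result is the same list of species names, built with one dictionary lookup per image.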


#### Kaggle Submission Score

The Kaggle score (Mean F-Score) is 0.50629. (The screenshot of the score was taken later.)

![Kaggle Score](plantseed050.png)
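
For intuition about the score, the sketch below computes a micro-averaged F1 on hypothetical labels. It assumes that the competition's Mean F-Score is a micro-averaged F1, which is not stated in this notebook; `micro_f1` is an illustrative helper, not Kaggle's implementation:

```python
from collections import Counter

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all classes for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    total_tp = sum(tp.values())
    if total_tp == 0:
        return 0.0
    precision = total_tp / (total_tp + sum(fp.values()))
    recall = total_tp / (total_tp + sum(fn.values()))
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and predictions for four images.
y_true_demo = ["Maize", "Charlock", "Maize", "Cleavers"]
y_pred_demo = ["Maize", "Maize", "Maize", "Cleavers"]
score = micro_f1(y_true_demo, y_pred_demo)
print(score)
```

When every image has exactly one true and one predicted label, each error adds one false positive and one false negative, so the micro-averaged F1 equals plain accuracy (three of four correct in this toy case).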

<a id="s"></a>
### Second Approach

In this approach, a fully connected top layer is mounted on the pretrained Keras model 'Xception'. The initial weights are taken from the pretrained model. At first, all layers, including the convolutional layers, are tuned, since this increases the accuracy faster. After reaching sufficient accuracy without any occurrence of overfitting, tuning of the base model (Xception) layers is turned off by freezing them, to obtain a more generalized and fine-tuned model. The preprocessing of the data, such as preparing the training and test data, is done in the same way as in the [first approach](#f).

In [1]:
import numpy as np 
import pandas as pd 
import os, cv2
from tqdm import tqdm
from sklearn.model_selection import StratifiedShuffleSplit

from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from keras.applications import *


Using TensorFlow backend.


In [2]:
x_train = []
x_test = []
y_train = []

df_test = pd.read_csv('sample_submission.csv')

label_map = {   "Black-grass"               :0,
                "Charlock"                  :1,
                "Cleavers"                  :2,
                "Common Chickweed"          :3,
                "Common wheat"              :4,
                "Fat Hen"                   :5,
                "Loose Silky-bent"          :6,
                "Maize"                     :7,
                "Scentless Mayweed"         :8,
                "Shepherds Purse"           :9,
                "Small-flowered Cranesbill" :10,
                "Sugar beet"                :11}

dim = 256

In [3]:
# Preparing training data
dirs = os.listdir("../seeddata/train/")
for k in tqdm(range(len(dirs))):    # Directory
    files = os.listdir("../seeddata/train/{}".format(dirs[k]))
    for f in range(len(files)):     # Files
        img = cv2.imread('../seeddata/train/{}/{}'.format(dirs[k], files[f]))
        targets = np.zeros(12)
        targets[label_map[dirs[k]]] = 1 
        x_train.append(cv2.resize(img, (dim, dim)))
        y_train.append(targets)
        
y_train = np.array(y_train, np.uint8)
x_train = np.array(x_train, np.float32)

print(x_train.shape)
print(y_train.shape)


100%|██████████| 12/12 [00:49<00:00,  4.79s/it]


(4750, 256, 256, 3)
(4750, 12)


In [4]:
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.16, random_state=42) 
for train_index, test_index in sss.split(x_train, y_train):
    print("Using {} for training and {} for validation".format(len(train_index), len(test_index)))
    x_train, x_valid = x_train[train_index], x_train[test_index]
    y_train, y_valid = y_train[train_index], y_train[test_index]


Using 3990 for training and 760 for validation


In [13]:
datagen = ImageDataGenerator( horizontal_flip=True, 
                              vertical_flip=True)
                                      
weights = os.path.join('', 'weights.h5')

epochs = 5
learning_rate = 0.0001
batch_size = 32

callbacks = [ EarlyStopping(monitor='val_loss', patience=5, verbose=0), 
              ModelCheckpoint(weights, monitor='val_loss', save_best_only=True, verbose=0),
              ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=0, mode='auto', epsilon=0.0001, cooldown=0, min_lr=0)]
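
The `ReduceLROnPlateau` callback above multiplies the learning rate by `factor` once the validation loss has failed to improve for `patience` epochs. The following is a simplified simulation of that rule, ignoring cooldown and not Keras's actual implementation; the function name and the loss values are hypothetical:

```python
def simulate_reduce_on_plateau(val_losses, lr, factor=0.1, patience=2,
                               min_delta=1e-4, min_lr=0.0):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss  # a real improvement resets the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # shrink, but not below min_lr
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses: improvement stalls after the second epoch.
losses = [1.00, 0.80, 0.81, 0.80, 0.79]
lr_history = simulate_reduce_on_plateau(losses, lr=1e-4)
print(lr_history)
```

In this toy run the third and fourth epochs bring no improvement, so the learning rate drops by a factor of ten after the fourth epoch.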


The pretrained 'Xception' model is added as the base model of the fully connected neural network.

In [6]:
base_model = Xception(input_shape=(dim, dim, 3), include_top=False, weights='imagenet', pooling='avg') # Average pooling reduces output dimensions
x = base_model.output
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(12, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
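
With `pooling='avg'`, the Xception base outputs a 2048-dimensional feature vector per image, so the new top is small compared with the base. The sketch below counts the parameters of the two added `Dense` layers by hand; `dense_params` is a hypothetical helper, and `Dropout` contributes no parameters:

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

# Xception with pooling='avg' yields 2048 features per image.
features = 2048
hidden = dense_params(features, 256)  # Dense(256, activation='relu')
output = dense_params(256, 12)        # Dense(12, activation='softmax')
head_total = hidden + output
print(hidden, output, head_total)
```

These are the only weights that keep training once the base model is frozen.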

Optionally, the layers of the 'Xception' model are frozen and the best saved weights are loaded.

In [14]:
# Freeze layers not in classifier due to loading imagenet weights
for layer in base_model.layers:
    layer.trainable = False

# print(model.summary())

# Load any existing weights
# if os.path.isfile(weights):
#     model.load_weights(weights)
    

All layers are tuned for three epochs only, because this phase is slow and prone to overfitting; the convolutional layers are then frozen and the top network is tuned for five epochs.

In [15]:
model.compile(loss='categorical_crossentropy', optimizer=optimizers.Adam(lr=learning_rate), metrics=['accuracy'])

# ------ TRAINING ------
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    steps_per_epoch=len(x_train)/batch_size, 
                    validation_data=datagen.flow(x_valid, y_valid, batch_size=batch_size), 
                    validation_steps=len(x_valid)/batch_size,
                    callbacks=callbacks,
                    epochs=epochs, 
                    verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f5d24d676a0>

Preparing the test data and predicting the results.

In [16]:
for f, species in tqdm(df_test.values, miniters=100):
    img = cv2.imread('../seeddata/test/{}'.format(f))
    x_test.append(cv2.resize(img, (dim, dim)))

x_test = np.array(x_test, np.float32)
print(x_test.shape)

if os.path.isfile(weights):
    model.load_weights(weights)

p_test = model.predict(x_test, verbose=1)

preds = []
for i in range(len(p_test)):
    pos = np.argmax(p_test[i])
    preds.append(list(label_map.keys())[list(label_map.values()).index(pos)])
    
df_test['species'] = preds
df_test.to_csv('submission.csv', index=False)

100%|██████████| 794/794 [00:02<00:00, 271.57it/s]


(794, 256, 256, 3)


#### Kaggle Score

The Kaggle score (Mean F-Score) for this approach is 0.94836.

![Kaggle Score](preperfreeze94836.png)

### Conclusion

This solution reaches better accuracy quickly by importing a pretrained model, tuning all of its layers at first, and freezing them once sufficient accuracy is reached without any occurrence of overfitting.