# Using image classification to label plant species
First, we import the necessary packages and set some plot style preferences.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import cv2

In [2]:
plt.rcParams['figure.figsize'] = (21.0, 13.0)
plt.rcParams['font.size'] = 18.0
sns.set_style('darkgrid')
sns.set_palette('pastel')

As I have access to a GPU on my PC, I will quickly check whether TensorFlow is able to find it.

In [3]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


That all checks out. Now we can pre-process the labeled image data downloaded from the [Kaggle competition page](https://www.kaggle.com/c/plant-seedlings-classification/overview). This is a multi-class problem: each image carries exactly one species label. First, we need to check how many different plant labels we have, as this defines the set of classification labels that our model can assign. In the image data available under $\texttt{train.zip}$, each subdirectory is named after the plant species whose images it contains, so listing the subdirectories gives us all classification labels, which we save as a Python list.

In [4]:
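# os.listdir does not guarantee any ordering, so sort to keep the numeric labels reproducible
species_list = sorted(os.listdir('./train'))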
species_list

['Black-grass',
 'Charlock',
 'Cleavers',
 'Common Chickweed',
 'Common wheat',
 'Fat Hen',
 'Loose Silky-bent',
 'Maize',
 'Scentless Mayweed',
 'Shepherds Purse',
 'Small-flowered Cranesbill',
 'Sugar beet']
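
Because the numerical labels assigned later follow the order of this list, it can help to make that order explicit and to ignore anything in the folder that is not a species subdirectory. The following is a minimal sketch of one way to do this, not part of the original workflow: the helper names `list_species` and `build_label_maps` are hypothetical, and the demonstration runs on a throwaway directory rather than on `./train`.

```python
import os
import tempfile


def list_species(train_dir):
    """Return the sorted names of the immediate subdirectories of train_dir."""
    return sorted(
        entry.name for entry in os.scandir(train_dir) if entry.is_dir()
    )


def build_label_maps(species):
    """Map each species name to an integer index, and each index back to its name."""
    species_to_index = {name: i for i, name in enumerate(species)}
    index_to_species = {i: name for name, i in species_to_index.items()}
    return species_to_index, index_to_species


# Demonstration on a temporary directory (the folder names are only examples).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['Maize', 'Charlock', 'Black-grass']:
        os.mkdir(os.path.join(demo_dir, name))
    # A stray file that should not be treated as a label.
    open(os.path.join(demo_dir, 'notes.txt'), 'w').close()
    demo_species = list_species(demo_dir)

species_to_index, index_to_species = build_label_maps(demo_species)
print(demo_species)
print(species_to_index)
```

Keeping the reverse mapping around makes it easy to turn a predicted index back into a species name later on.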

Then we can construct a coherent data set (here a pandas DataFrame) that contains the plant species label as the first column, the numerical species label as the second column and the path to the image file as the third column. This saves us from repeatedly going into each of the above subdirectories. After some research into how this can be done, I found the following to be the most straightforward way to iterate through each file in each subdirectory while keeping track of the subdirectory label.

In [5]:
full_data = list()
for numerical_species, species in enumerate(species_list):
    for file_name in os.listdir(os.path.join('./train', species)):
        full_data.append([species, numerical_species, './train/{}/{}'.format(species, file_name)])
full_data = pd.DataFrame(full_data, columns=['species', 'numerical_species', 'file_name'])
full_data.head()

| | species | numerical_species | file_name |
|---|---|---|---|
| 0 | Black-grass | 0 | ./train/Black-grass/0050f38b3.png |
| 1 | Black-grass | 0 | ./train/Black-grass/0183fdf68.png |
| 2 | Black-grass | 0 | ./train/Black-grass/0260cffa8.png |
| 3 | Black-grass | 0 | ./train/Black-grass/05eedce4d.png |
| 4 | Black-grass | 0 | ./train/Black-grass/075d004bc.png |


We can observe the total image count.

In [6]:
full_data.shape

(4750, 3)

We can also look at the number of images for each species.

In [7]:
full_data['species'].value_counts()

Loose Silky-bent             654
Common Chickweed             611
Scentless Mayweed            516
Small-flowered Cranesbill    496
Fat Hen                      475
Charlock                     390
Sugar beet                   385
Cleavers                     287
Black-grass                  263
Shepherds Purse              231
Common wheat                 221
Maize                        221
Name: species, dtype: int64
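
To put these counts in perspective, the short sketch below turns them into class shares and an imbalance ratio, and estimates how many images of each species a 30% hold-out (the test size used for the split later in this notebook) would leave for testing. The counts are copied from the output above; the variable names are only illustrative.

```python
# Image counts per species, copied from the value_counts() output above.
species_counts = {
    'Loose Silky-bent': 654,
    'Common Chickweed': 611,
    'Scentless Mayweed': 516,
    'Small-flowered Cranesbill': 496,
    'Fat Hen': 475,
    'Charlock': 390,
    'Sugar beet': 385,
    'Cleavers': 287,
    'Black-grass': 263,
    'Shepherds Purse': 231,
    'Common wheat': 221,
    'Maize': 221,
}

total_images = sum(species_counts.values())
largest = max(species_counts.values())
smallest = min(species_counts.values())

# Ratio between the most and least represented species.
imbalance_ratio = largest / smallest

# Assumed hold-out fraction for the testing data.
test_fraction = 0.3
smallest_expected_test = smallest * test_fraction

for name, count in species_counts.items():
    share = count / total_images
    expected_test = count * test_fraction
    print('{:<26} {:>4}  {:5.1%}  ~{:.0f} test images'.format(
        name, count, share, expected_test))

print('Total images: {}'.format(total_images))
print('Largest / smallest class: {:.2f}'.format(imbalance_ratio))
```

With these counts the largest class is roughly three times the size of the smallest, and the two smallest species would each contribute only about 66 images to a 30% test set.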

It is important to note that two of the plant species have only 221 images each. This matters when defining a training/testing data split, as we need to ensure that a good number of images of each plant species is present in the training data so that the model can learn every species.

Next we can consider the quality of the provided images. After a brief look through some of the photos, I noticed that the images vary quite a bit in size. To rectify this, we can set a base image size and resize all images to it. Arbitrarily, we select a square size of 300 by 300 pixels and resize using OpenCV's Python package ($\texttt{cv2}$).

In [8]:
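for file_name in full_data['file_name']:
    image = cv2.imread(file_name)
    if image is None:  # cv2.imread returns None instead of raising for unreadable files
        print('Skipping unreadable file: {}'.format(file_name))
        continue
    resized_image = cv2.resize(image, (300, 300), interpolation=cv2.INTER_LINEAR)
    # Save next to the original, e.g. 0050f38b3.png -> 0050f38b3-resized.png
    resized_file = os.path.splitext(file_name)[0] + '-resized.png'
    cv2.imwrite(resized_file, resized_image)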

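Finally, we split the data into training and testing sets, holding out 30% of the images for testing, and check how many images of each species end up in the training data.

In [9]: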
from sklearn.model_selection import train_test_split
training_data, testing_data = train_test_split(full_data, test_size = 0.3, random_state = 6)
training_data['species'].value_counts()

Loose Silky-bent             442
Common Chickweed             432
Scentless Mayweed            363
Small-flowered Cranesbill    350
Fat Hen                      336
Charlock                     266
Sugar beet                   255
Cleavers                     211
Black-grass                  191
Shepherds Purse              169
Maize                        159
Common wheat                 151
Name: species, dtype: int64
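
The split above is purely random, so the share of each species that lands in the training data varies slightly: 151 of the 221 Common wheat images end up in training, compared with 159 of the 221 Maize images. If keeping the species proportions equal in both sets matters, one option is the `stratify` argument of `train_test_split`. The sketch below illustrates it on a small synthetic DataFrame (the names `demo_data`, `demo_train` and `demo_test` are made up for this example); it is not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for full_data with deliberately unequal class sizes.
demo_data = pd.DataFrame({
    'species': ['Maize'] * 20 + ['Charlock'] * 40 + ['Fat Hen'] * 60,
    'file_name': ['image_{}.png'.format(i) for i in range(120)],
})

# stratify keeps the class proportions (approximately) equal in both parts.
demo_train, demo_test = train_test_split(
    demo_data,
    test_size=0.3,
    random_state=6,
    stratify=demo_data['species'],
)

# Compare class shares in the full data and in the training part.
full_share = demo_data['species'].value_counts(normalize=True).sort_index()
train_share = demo_train['species'].value_counts(normalize=True).sort_index()
max_share_gap = (full_share - train_share).abs().max()

print(pd.DataFrame({'full': full_share, 'train': train_share}))
print('Largest difference in class share: {:.3f}'.format(max_share_gap))
```

Applied to the real data, the same call would take `full_data` and `full_data['species']` in place of the synthetic frame.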