# Capstone Project

"There are estimated to be nearly half a million species of plant in the world. Classification of species has been historically problematic and often results in duplicate identifications. Automating plant recognition might have many applications, including:

- Species population tracking and preservation
- Plant-based medicinal research
- Crop and food supply management

Leaf Classification

The objective of this playground competition is to use binary leaf images and extracted features, including shape, margin & texture, to accurately identify 99 species of plants. Leaves, due to their volume, prevalence, and unique characteristics, are an effective means of differentiating plant species. They also provide a fun introduction to applying techniques that involve image-based features.

As a first step, try building a classifier that uses the provided pre-extracted features. Next, try creating a set of your own features. Finally, examine the errors you're making and see what you can do to improve.

Acknowledgments:

Kaggle is hosting this competition for the data science community to use for fun and education. This dataset originates from leaf images collected by James Cope, Thibaut Beghin, Paolo Remagnino, & Sarah Barman of the Royal Botanic Gardens, Kew, UK.

Charles Mallah, James Cope, James Orwell. Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features. Signal Processing, Pattern Recognition and Applications, in press. 2013.

We thank the UCI machine learning repository for hosting the dataset."

In [1]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC 
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder  # needed by the data-loading cell

### Import Data and Data Exploration

In [35]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

#print test, train

le = LabelEncoder().fit(train.species) 

#save values for submission
classes = list(le.classes_)
test_ids = test.id 

test = test.drop(['id'], axis=1)
train = train.drop([ 'id'], axis=1)
print train
print test

# Count the feature columns. After dropping 'id', train still holds the
# 'species' label in column 0, while test contains feature columns only.
n_features = len(train.columns[1:])
n_featurest = len(test.columns)

print n_features, n_featurest




                          species   margin1   margin2   margin3   margin4  \
0                     Acer_Opalus  0.007812  0.023438  0.023438  0.003906   
1           Pterocarya_Stenoptera  0.005859  0.000000  0.031250  0.015625   
2            Quercus_Hartwissiana  0.005859  0.009766  0.019531  0.007812   
3                 Tilia_Tomentosa  0.000000  0.003906  0.023438  0.005859   
4              Quercus_Variabilis  0.005859  0.003906  0.048828  0.009766   
5            Magnolia_Salicifolia  0.070312  0.093750  0.033203  0.001953   
6             Quercus_Canariensis  0.021484  0.031250  0.017578  0.009766   
7                   Quercus_Rubra  0.000000  0.000000  0.037109  0.050781   
8                 Quercus_Brantii  0.005859  0.001953  0.033203  0.015625   
9                  Salix_Fragilis  0.000000  0.000000  0.009766  0.037109   
10                Zelkova_Serrata  0.019531  0.031250  0.001953  0.005859   
11          Betula_Austrosinensis  0.001953  0.001953  0.023438  0.025391   

### Data Visualization

### File descriptions

train.csv - the training set

test.csv - the test set

sample_submission.csv - a sample submission file in the correct format

images - the image files (each image is named with its corresponding id)


### Data fields


id - an anonymous id unique to an image

margin1, margin2, margin3, ..., margin64 - the 64 attributes of the margin feature vector

shape1, shape2, shape3, ..., shape64 - the 64 attributes of the shape feature vector

texture1, texture2, texture3, ..., texture64 - the 64 attributes of the texture feature vector
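
The three feature groups can be pulled apart by column-name prefix, which is handy when comparing margin, shape and texture features separately. The following is a minimal sketch on a small synthetic frame rather than on `train.csv`; the helper `columns_by_group` and the frame `demo` are hypothetical names, and the column naming (`margin1`, `texture55`, and so on) is assumed from the printed output in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: 3 rows with the assumed column naming
rng = np.random.RandomState(0)
names = ["{}{}".format(prefix, i)
         for prefix in ("margin", "shape", "texture")
         for i in range(1, 65)]
demo = pd.DataFrame(rng.rand(3, len(names)), columns=names)
demo.insert(0, "species", ["Acer_Opalus", "Quercus_Rubra", "Salix_Fragilis"])
demo.insert(0, "id", [1, 2, 3])


def columns_by_group(frame):
    """Map each feature group name to the list of its column names."""
    groups = {}
    for prefix in ("margin", "shape", "texture"):
        groups[prefix] = [c for c in frame.columns if c.startswith(prefix)]
    return groups


groups = columns_by_group(demo)
# Number of columns found for each group
group_sizes = {name: len(cols) for name, cols in groups.items()}
```

Selecting `demo[groups["shape"]]` then gives only the shape columns, so a classifier can be fitted on one group at a time.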


### Preprocessing - Identifying features & target columns

In [24]:

from sklearn.preprocessing import LabelEncoder

# Extract feature columns
feature_cols = list(train.columns[1:])

# Extract target column 
target_col = train.columns[0] 

# Show the list of columns
#print "Feature columns:\n{}".format(feature_cols)
#print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = train[feature_cols]
y_all = train[target_col] #Must use label encoder to get LogLoss
#print X_all    #, y_all

#le = LabelEncoder().fit(y_all)
#y_all = le.transform(y_all)
print len(y_all), len(X_all)
# Show the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()

990 990

Feature values:
    margin1   margin2   margin3   margin4   margin5   margin6   margin7  \
0  0.007812  0.023438  0.023438  0.003906  0.011719  0.009766  0.027344   
1  0.005859  0.000000  0.031250  0.015625  0.025391  0.001953  0.019531   
2  0.005859  0.009766  0.019531  0.007812  0.003906  0.005859  0.068359   
3  0.000000  0.003906  0.023438  0.005859  0.021484  0.019531  0.023438   
4  0.005859  0.003906  0.048828  0.009766  0.013672  0.015625  0.005859   

   margin8   margin9  margin10    ...      texture55  texture56  texture57  \
0      0.0  0.001953  0.033203    ...       0.007812   0.000000   0.002930   
1      0.0  0.000000  0.007812    ...       0.000977   0.000000   0.000000   
2      0.0  0.000000  0.044922    ...       0.154300   0.000000   0.005859   
3      0.0  0.013672  0.017578    ...       0.000000   0.000977   0.000000   
4      0.0  0.000000  0.005859    ...       0.096680   0.000000   0.021484   

   texture58  texture59  texture60  texture61  texture6
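
The `LabelEncoder` fitted earlier maps species names to integer codes, and its `classes_` attribute holds the names in sorted order. A small sketch of that round trip, using a made-up list of four labels (the variable names here are illustrative and not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration; the notebook fits on train.species instead
species = ["Quercus_Rubra", "Acer_Opalus", "Salix_Fragilis", "Acer_Opalus"]

encoder = LabelEncoder().fit(species)

# classes_ is sorted alphabetically, so codes follow that order
codes = encoder.transform(species)

# inverse_transform recovers the original names from the integer codes
decoded = encoder.inverse_transform(codes)
```

Because `classes_` is sorted, it lines up with the column order that scikit-learn classifiers use for `predict_proba`, provided the classifier saw every class during training.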

In [25]:
# Hold out 20% of the labelled data for validation. stratify=y_all keeps the
# relative class frequencies approximately equal in the training and validation sets.
# A plain random split (without stratify) does not guarantee that every class
# appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=1, stratify=y_all)

print len(y_test)

198
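
One way to confirm what `stratify` does is to count the samples per class on each side of the split. The sketch below uses synthetic labels, assuming 99 classes with 10 samples each (which matches the 990 rows and 99 species reported here, but the even distribution is an assumption); the feature values are random placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 99 classes x 10 samples (assumed), 4 dummy features
labels = pd.Series(np.repeat(np.arange(99), 10))
features = pd.DataFrame(np.random.RandomState(0).rand(990, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=1, stratify=labels)

# Samples per class on each side of the split
train_counts = y_tr.value_counts()
test_counts = y_te.value_counts()
```

Under that assumption every class ends up with 8 training and 2 validation samples, so `train_counts` and `test_counts` each contain all 99 classes.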


In [26]:

from sklearn.metrics import f1_score
from time import time

def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print "Trained model in {:.4f} seconds".format(end - start)

    
def predict_labels(clf, features, target):
    ''' Makes predictions with a fitted classifier and returns the weighted F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print "Made predictions in {:.4f} seconds.".format(end - start)
    return f1_score(target.values, y_pred, average='weighted')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Trains a classifier and reports its F1 score on the training and test sets. '''
    
    # Indicate the classifier and the training set size
    print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print "F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test))

In [34]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss


clf = SVC(probability=True)

# train_predict fits clf, so no separate clf.fit call is needed
train_predict(clf, X_train, y_train, X_test, y_test)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print accuracy

# Log loss must be computed from class probabilities, not from predicted labels
#pred_logloss = clf.predict_proba(X_test)
#logloss = log_loss(y_test, pred_logloss, labels=clf.classes_)
#print logloss

Training a SVC using a training set size of 792. . .
Trained model in 1.7300 seconds
Made predictions in 0.2230 seconds.
F1 score for training set: 0.8321.
Made predictions in 0.0540 seconds.
F1 score for test set: 0.8051.
0.823232323232
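
Accuracy and F1 are computed from hard predictions, whereas log loss is computed from the predicted class probabilities. The sketch below shows the log loss calculation on synthetic three-class data from `make_blobs`, not on the leaf features; the species-like names and all variable names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with string labels, standing in for the leaf features
X, y_int = make_blobs(n_samples=120, centers=3, random_state=0)
names = np.array(["species_a", "species_b", "species_c"])
y = names[y_int]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

model = SVC(probability=True, random_state=0).fit(X_tr, y_tr)

# One row per sample, one column per class, in the order of model.classes_
proba = model.predict_proba(X_te)

# Passing labels makes the column order explicit
loss = log_loss(y_te, proba, labels=model.classes_)
```

Passing `clf.predict(X_test)` to `log_loss` instead of the probability matrix would not measure the same thing, since the metric needs a probability for every class.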


In [40]:
pred = clf.predict_proba(test)

# Submission
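# One probability column per species; clf.classes_ and le.classes_ are both
# sorted, so the columns of pred line up with the names in classes
df = pd.DataFrame(pred, columns=classes)
df.insert(0, 'id', test_ids)
df.to_csv('submission.csv', index=False)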
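
Before uploading, it can help to run a few sanity checks on the submission frame. The helper below, `check_submission`, is a hypothetical sketch and not part of the competition tooling; the checks it performs (column order, unique ids, rows summing to 1) are assumptions about what a well-formed file looks like, not rules quoted from the source.

```python
import numpy as np
import pandas as pd


def check_submission(frame, class_names):
    """Return a list of problems found in a submission-style frame (empty if none)."""
    problems = []
    expected = ["id"] + list(class_names)
    if list(frame.columns) != expected:
        problems.append("columns do not match ['id'] + class names")
    if "id" in frame.columns and frame["id"].duplicated().any():
        problems.append("duplicate ids")
    prob_cols = [c for c in frame.columns if c != "id"]
    if not np.allclose(frame[prob_cols].sum(axis=1), 1.0):
        problems.append("row probabilities do not sum to 1")
    return problems


# Tiny made-up example with two classes
classes_demo = ["Acer_Opalus", "Quercus_Rubra"]
good = pd.DataFrame(
    {"id": [4, 7], "Acer_Opalus": [0.9, 0.2], "Quercus_Rubra": [0.1, 0.8]},
    columns=["id"] + classes_demo)

bad = good.copy()
bad["Quercus_Rubra"] = [0.5, 0.8]  # first row now sums to 1.4

good_problems = check_submission(good, classes_demo)
bad_problems = check_submission(bad, classes_demo)
```

In this notebook the equivalent call would be `check_submission(df, classes)` just before `df.to_csv`.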