 # SoyBean Predictions

In this project, we will try to predict soybean diseases. There are 19 classes, each a soybean disease, and 35 features. Let's get started.

Start by importing the libraries.

In [18]:
import pandas as pd
import numpy as np
import sklearn as sl
from collections import defaultdict


 ## Load and Transform Data

 ### Create Column Names

In [19]:
Columns = ['Disease','date','plant-stand','precip','temp','hail','crop-hist','area-damaged','severity','seed-tmt',
           'germination','plant-growth','leaves','leafspots-halo','leafspots-marg','leafspots-size','leaf-shread',
           'leaf-malf','leaf-mild','stem','lodging','stem-cankers','canker-lesion','fruiting-bodies','external decay',
           'mycelium','int-discolor','sclerotia','fruit-pods','fruit spots','seed','mold-growth','seed-discolor',
           'seed-size','shriveling','roots']


 One thing to note: the missing values in this dataset are denoted by a "?", so we will change those to NaN.

 ### Read Data

In [20]:
soybeansU = pd.read_csv('Soybean.csv')
soybeansU.columns = Columns
soybeansU = soybeansU.replace('?',np.nan)
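
One thing worth checking here: `pd.read_csv` treats the first row of a file as a header by default. If `Soybean.csv` has no header row (an assumption, since the file itself is not shown), the first record is consumed as column names before `soybeansU.columns = Columns` overwrites them. The class distribution listed further down sums to 307 records while `info()` reports 306 rows, which would be consistent with that. A minimal sketch of a loader that avoids this and converts `?` while parsing, using a small in-memory sample in place of the real file:

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for Soybean.csv (no header row).
sample = io.StringIO(
    "diaporthe-stem-canker,6,0\n"
    "charcoal-rot,?,1\n"
    "brown-spot,4,?\n"
)
names = ["Disease", "date", "plant-stand"]

# header=None keeps the first row as data; names= supplies the column labels;
# na_values="?" converts the missing-value marker to NaN while parsing.
df = pd.read_csv(sample, header=None, names=names, na_values="?")
print(df.shape)
print(df.isna().sum())
```

With `na_values="?"` the affected columns are parsed as numbers rather than strings, so a separate `replace('?', np.nan)` step is not needed in this variant.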


 Now let's check the info about this data.

In [21]:
soybeansU.info()
soybeansU.shape


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 36 columns):
Disease            306 non-null object
date               305 non-null object
plant-stand        298 non-null object
precip             295 non-null object
temp               299 non-null object
hail               265 non-null object
crop-hist          305 non-null object
area-damaged       305 non-null object
severity           265 non-null object
seed-tmt           265 non-null object
germination        270 non-null object
plant-growth       305 non-null object
leaves             306 non-null int64
leafspots-halo     281 non-null object
leafspots-marg     281 non-null object
leafspots-size     281 non-null object
leaf-shread        280 non-null object
leaf-malf          281 non-null object
leaf-mild          276 non-null object
stem               305 non-null object
lodging            265 non-null object
stem-cankers       295 non-null object
canker-lesion      295 non-null object

(306, 36)

 So the dataframe is 306 rows by 36 columns, with the classes (the diseases) in the first column and the attributes in the rest of the columns. Now let's take a look at the data.

In [22]:
soybeansU.head(35)  # Check to make sure it was imported properly


Unnamed: 0,Disease,date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,...,int-discolor,sclerotia,fruit-pods,fruit spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots
0,diaporthe-stem-canker,4,0,2,1,0.0,2,0,2.0,1.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
1,diaporthe-stem-canker,3,0,2,1,0.0,1,0,2.0,1.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
2,diaporthe-stem-canker,3,0,2,1,0.0,1,0,2.0,0.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
3,diaporthe-stem-canker,6,0,2,1,0.0,2,0,1.0,0.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
4,diaporthe-stem-canker,5,0,2,1,0.0,3,0,1.0,0.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
5,diaporthe-stem-canker,5,0,2,1,0.0,2,0,1.0,1.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
6,diaporthe-stem-canker,4,0,2,1,1.0,1,0,1.0,0.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
7,diaporthe-stem-canker,6,0,2,1,0.0,3,0,1.0,1.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
8,diaporthe-stem-canker,4,0,2,1,0.0,2,0,2.0,0.0,...,0,0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0
9,charcoal-rot,6,0,0,2,0.0,1,3,1.0,1.0,...,2,1,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0


 Looks good. The '?' entries appear to have been replaced with NaN.
 
 If you look at the data, you will notice that all of the values are numerical. These are actually categories encoded as numbers: 0 represents the first category, 1 the second, and so on. This is the breakdown
  (note: `?` was changed to NaN, so `?` here stands for missing values):
 

    1. date:april,may,june,july,august,september,october,?.
    2. plant-stand:	normal,lt-normal,?.
    3. precip:		lt-norm,norm,gt-norm,?.
    4. temp:		lt-norm,norm,gt-norm,?.
    5. hail:		yes,no,?.
    6. crop-hist:	diff-lst-year,same-lst-yr,same-lst-two-yrs,same-lst-sev-yrs,?.
    7. area-damaged:	scattered,low-areas,upper-areas,whole-field,?.
    8. severity:	minor,pot-severe,severe,?.
    9. seed-tmt:	none,fungicide,other,?.
    10. germination:	90-100%,80-89%,lt-80%,?.
    11. plant-growth:	norm,abnorm,?.
    12. leaves:		norm,abnorm.
    13. leafspots-halo:	absent,yellow-halos,no-yellow-halos,?.
    14. leafspots-marg:	w-s-marg,no-w-s-marg,dna,?.
    15. leafspot-size:	lt-1/8,gt-1/8,dna,?.
    16. leaf-shread:	absent,present,?.
    17. leaf-malf:	absent,present,?.
    18. leaf-mild:	absent,upper-surf,lower-surf,?.
    19. stem:		norm,abnorm,?.
    20. lodging:    	yes,no,?.
    21. stem-cankers:	absent,below-soil,above-soil,above-sec-nde,?.
    22. canker-lesion:	dna,brown,dk-brown-blk,tan,?.
    23. fruiting-bodies:	absent,present,?.
    24. external decay:	absent,firm-and-dry,watery,?.
    25. mycelium:	absent,present,?.
    26. int-discolor:	none,brown,black,?.
    27. sclerotia:	absent,present,?.
    28. fruit-pods:	norm,diseased,few-present,dna,?.
    29. fruit spots:	absent,colored,brown-w/blk-specks,distort,dna,?.
    30. seed:		norm,abnorm,?.
    31. mold-growth:	absent,present,?.
    32. seed-discolor:	absent,present,?.
    33. seed-size:	norm,lt-norm,?.
    34. shriveling:	absent,present,?.
    35. roots:		norm,rotted,galls-cysts,?.
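
Since each integer is just the position of a label in the lists above, the codes can be turned back into readable labels with a lookup. A minimal sketch, assuming the codes are zero-based list positions as described; the `levels` dictionary and the toy rows are illustrative, not part of the notebook:

```python
import pandas as pd

# Category lists copied from the breakdown above; position in the list is the code.
levels = {
    "date": ["april", "may", "june", "july", "august", "september", "october"],
    "precip": ["lt-norm", "norm", "gt-norm"],
}

# Toy rows for illustration only (not taken from the dataset).
coded = pd.DataFrame({"date": [4, 3, 6], "precip": [2, 2, 0]})

decoded = coded.copy()
for col, names in levels.items():
    # Map each integer code to its label by list position.
    decoded[col] = coded[col].map(dict(enumerate(names)))
print(decoded)
```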


 ### Handling missing values

Let's check how many missing values each attribute has.

In [23]:
soybeansU.isna().sum()


Disease             0
date                1
plant-stand         8
precip             11
temp                7
hail               41
crop-hist           1
area-damaged        1
severity           41
seed-tmt           41
germination        36
plant-growth        1
leaves              0
leafspots-halo     25
leafspots-marg     25
leafspots-size     25
leaf-shread        26
leaf-malf          25
leaf-mild          30
stem                1
lodging            41
stem-cankers       11
canker-lesion      11
fruiting-bodies    35
external decay     11
mycelium           11
int-discolor       11
sclerotia          11
fruit-pods         25
fruit spots        35
seed               29
mold-growth        29
seed-discolor      35
seed-size          29
shriveling         35
roots               7
dtype: int64

 There are quite a few missing values. The highest count is 41, shared by hail, severity, seed-tmt, and lodging. We will handle this shortly, but first, let's look at the classes.
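
 Raw counts are easier to judge as a share of the rows, and several columns can tie for the highest count. A small sketch on a made-up frame (the column names are borrowed from the dataset, the values are not):

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the values are made up.
df = pd.DataFrame({
    "hail": [0, np.nan, np.nan, 1],
    "lodging": [np.nan, 1, np.nan, 0],
    "leaves": [1, 1, 0, 1],
})

missing = df.isna().sum()
# All columns that share the highest missing count, not just the first one.
worst = missing[missing == missing.max()].index.tolist()
# Fraction of rows missing per column.
fraction = df.isna().mean()
print(worst)
print(fraction)
```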

 There are 19 classes and they are described as such:

 Class Distribution:
 1. diaporthe-stem-canker: 10
 2. charcoal-rot: 10
 3. rhizoctonia-root-rot: 10
 4. phytophthora-rot: 40
 5. brown-stem-rot: 20
 6. powdery-mildew: 10
 7. downy-mildew: 10
 8. brown-spot: 40
 9. bacterial-blight: 10
 10. bacterial-pustule: 10
 11. purple-seed-stain: 10
 12. anthracnose: 20
 13. phyllosticta-leaf-spot: 10
 14. alternarialeaf-spot: 40
 15. frog-eye-leaf-spot: 40
 16. diaporthe-pod-&-stem-blight: 6
 17. cyst-nematode: 6
 18. 2-4-d-injury: 1
 19. herbicide-injury: 4

 Let's map these classes.

In [24]:
class_mapping = {label:idx for idx, label in 
                 enumerate(np.unique(soybeansU["Disease"]), start = 1)}
class_mapping


{'2-4-d-injury': 1,
 'alternarialeaf-spot': 2,
 'anthracnose': 3,
 'bacterial-blight': 4,
 'bacterial-pustule': 5,
 'brown-spot': 6,
 'brown-stem-rot': 7,
 'charcoal-rot': 8,
 'cyst-nematode': 9,
 'diaporthe-pod-&-stem-blight': 10,
 'diaporthe-stem-canker': 11,
 'downy-mildew': 12,
 'frog-eye-leaf-spot': 13,
 'herbicide-injury': 14,
 'phyllosticta-leaf-spot': 15,
 'phytophthora-rot': 16,
 'powdery-mildew': 17,
 'purple-seed-stain': 18,
 'rhizoctonia-root-rot': 19}

 As you can see, each class name now maps to an integer from 1 to 19, assigned in sorted (alphabetical) order because `np.unique` returns sorted values.
 
 So, now the new mapping is:
 - '2-4-d-injury': 1,
 - 'alternarialeaf-spot': 2,
 - 'anthracnose': 3,
 - 'bacterial-blight': 4,
 - 'bacterial-pustule': 5,
 - 'brown-spot': 6,
 - 'brown-stem-rot': 7,
 - 'charcoal-rot': 8,
 - 'cyst-nematode': 9,
 - 'diaporthe-pod-&-stem-blight': 10,
 - 'diaporthe-stem-canker': 11,
 - 'downy-mildew': 12,
 - 'frog-eye-leaf-spot': 13,
 - 'herbicide-injury': 14,
 - 'phyllosticta-leaf-spot': 15,
 - 'phytophthora-rot': 16,
 - 'powdery-mildew': 17,
 - 'purple-seed-stain': 18,
 - 'rhizoctonia-root-rot': 19
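
 Predictions will come back as these integers, so it helps to keep the reverse lookup around. A minimal sketch on a handful of class names; `inverse_mapping` is a name introduced here for illustration:

```python
import numpy as np

# A few class names from the list above; np.unique returns them sorted,
# so codes are assigned in alphabetical order.
labels = np.array(["charcoal-rot", "brown-spot", "anthracnose", "brown-spot"])
class_mapping = {str(label): idx
                 for idx, label in enumerate(np.unique(labels), start=1)}

# Invert the dictionary to translate integer predictions back to names.
inverse_mapping = {idx: label for label, idx in class_mapping.items()}
print(class_mapping)
print(inverse_mapping[2])
```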
 
 

 Now let's apply it to the disease column.

In [25]:
soybeansU['Disease'] = soybeansU['Disease'].map(class_mapping)
soybeansU.head(10)


Unnamed: 0,Disease,date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,...,int-discolor,sclerotia,fruit-pods,fruit spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots
0,11,4,0,2,1,0,2,0,2,1,...,0,0,0,4,0,0,0,0,0,0
1,11,3,0,2,1,0,1,0,2,1,...,0,0,0,4,0,0,0,0,0,0
2,11,3,0,2,1,0,1,0,2,0,...,0,0,0,4,0,0,0,0,0,0
3,11,6,0,2,1,0,2,0,1,0,...,0,0,0,4,0,0,0,0,0,0
4,11,5,0,2,1,0,3,0,1,0,...,0,0,0,4,0,0,0,0,0,0
5,11,5,0,2,1,0,2,0,1,1,...,0,0,0,4,0,0,0,0,0,0
6,11,4,0,2,1,1,1,0,1,0,...,0,0,0,4,0,0,0,0,0,0
7,11,6,0,2,1,0,3,0,1,1,...,0,0,0,4,0,0,0,0,0,0
8,11,4,0,2,1,0,2,0,2,0,...,0,0,0,4,0,0,0,0,0,0
9,8,6,0,0,2,0,1,3,1,1,...,2,1,0,4,0,0,0,0,0,0


 As you can see, the column `Disease` now holds numbers.
 Next, we will impute the NaNs with the most frequent value in each column.
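
 To see what most-frequent imputation does before applying it to the full frame, here is a small sketch on made-up values. Each NaN is replaced by the mode of its own column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy columns for illustration; the values are made up.
toy = pd.DataFrame({"hail": [0, 0, np.nan, 1], "severity": [2, np.nan, 1, 1]})

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)
print(filled)
```

 Mode imputation keeps every row but pushes each column toward its majority level, which is worth remembering for columns with many missing entries.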

In [26]:
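from sklearn.impute import SimpleImputer
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Fit and transform the same array so each column's mode fills its own NaNs
imputed_data = imp.fit_transform(soybeansU.values)
imputed_df = pd.DataFrame(imputed_data, columns=soybeansU.columns)
soybeansU.update(imputed_df)
# Every column holds integer category codes, so cast them all to int
for col in soybeansU.columns:
    soybeansU[col] = soybeansU[col].astype(int)
soybeansU.isna().sum()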


Disease            0
date               0
plant-stand        0
precip             0
temp               0
hail               0
crop-hist          0
area-damaged       0
severity           0
seed-tmt           0
germination        0
plant-growth       0
leaves             0
leafspots-halo     0
leafspots-marg     0
leafspots-size     0
leaf-shread        0
leaf-malf          0
leaf-mild          0
stem               0
lodging            0
stem-cankers       0
canker-lesion      0
fruiting-bodies    0
external decay     0
mycelium           0
int-discolor       0
sclerotia          0
fruit-pods         0
fruit spots        0
seed               0
mold-growth        0
seed-discolor      0
seed-size          0
shriveling         0
roots              0
dtype: int64

As you can see, none of the columns contain NaNs anymore. Now, let's look at the data once more.

In [27]:
soybeansU.head(35)

Unnamed: 0,Disease,date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,...,int-discolor,sclerotia,fruit-pods,fruit spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots
0,11,4,0,2,1,0,2,0,2,1,...,0,0,0,4,0,0,0,0,0,0
1,11,3,0,2,1,0,1,0,2,1,...,0,0,0,4,0,0,0,0,0,0
2,11,3,0,2,1,0,1,0,2,0,...,0,0,0,4,0,0,0,0,0,0
3,11,6,0,2,1,0,2,0,1,0,...,0,0,0,4,0,0,0,0,0,0
4,11,5,0,2,1,0,3,0,1,0,...,0,0,0,4,0,0,0,0,0,0
5,11,5,0,2,1,0,2,0,1,1,...,0,0,0,4,0,0,0,0,0,0
6,11,4,0,2,1,1,1,0,1,0,...,0,0,0,4,0,0,0,0,0,0
7,11,6,0,2,1,0,3,0,1,1,...,0,0,0,4,0,0,0,0,0,0
8,11,4,0,2,1,0,2,0,2,0,...,0,0,0,4,0,0,0,0,0,0
9,8,6,0,0,2,0,1,3,1,1,...,2,1,0,4,0,0,0,0,0,0


Everything is encoded, so now let's see how many levels each column has.

In [28]:
soybeansU.nunique()

Disease            19
date                7
plant-stand         2
precip              3
temp                3
hail                2
crop-hist           4
area-damaged        4
severity            3
seed-tmt            3
germination         3
plant-growth        2
leaves              2
leafspots-halo      3
leafspots-marg      3
leafspots-size      3
leaf-shread         2
leaf-malf           2
leaf-mild           3
stem                2
lodging             2
stem-cankers        4
canker-lesion       4
fruiting-bodies     2
external decay      2
mycelium            2
int-discolor        3
sclerotia           2
fruit-pods          4
fruit spots         4
seed                2
mold-growth         2
seed-discolor       2
seed-size           2
shriveling          2
roots               3
dtype: int64

As you can see, each column has at least 2 levels. Many classifiers would treat these integer codes as ordinal rather than nominal, so we have to encode them. Let's use `OneHotEncoder` to one-hot encode each feature column.

In [29]:
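from sklearn.preprocessing import OneHotEncoder
Y_df = soybeansU.iloc[:,0]   # class labels
X_df = soybeansU.iloc[:,1:]  # the 35 features
# Sparse output is the default; the `sparse` keyword was renamed `sparse_output` in newer scikit-learn
X = OneHotEncoder(categories='auto').fit_transform(X_df.values)
Y = Y_df.values
print(X)
type(X)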

  (0, 4)	1.0
  (0, 7)	1.0
  (0, 11)	1.0
  (0, 13)	1.0
  (0, 15)	1.0
  (0, 19)	1.0
  (0, 21)	1.0
  (0, 27)	1.0
  (0, 29)	1.0
  (0, 32)	1.0
  (0, 35)	1.0
  (0, 37)	1.0
  (0, 38)	1.0
  (0, 43)	1.0
  (0, 46)	1.0
  (0, 47)	1.0
  (0, 49)	1.0
  (0, 51)	1.0
  (0, 55)	1.0
  (0, 56)	1.0
  (0, 61)	1.0
  (0, 63)	1.0
  (0, 67)	1.0
  (0, 69)	1.0
  (0, 70)	1.0
  :	:
  (305, 35)	1.0
  (305, 37)	1.0
  (305, 40)	1.0
  (305, 42)	1.0
  (305, 45)	1.0
  (305, 47)	1.0
  (305, 50)	1.0
  (305, 51)	1.0
  (305, 55)	1.0
  (305, 56)	1.0
  (305, 58)	1.0
  (305, 62)	1.0
  (305, 66)	1.0
  (305, 68)	1.0
  (305, 70)	1.0
  (305, 72)	1.0
  (305, 75)	1.0
  (305, 80)	1.0
  (305, 81)	1.0
  (305, 85)	1.0
  (305, 87)	1.0
  (305, 89)	1.0
  (305, 91)	1.0
  (305, 93)	1.0
  (305, 96)	1.0


scipy.sparse.csr.csr_matrix

There are 306 rows (indices 0 to 305) and now 98 columns. The printout lists only the coordinates that hold non-zero values; every coordinate not shown is zero, i.e. a sparse matrix, as expected. Now let's split this into training and test sets.
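
The width of the encoded matrix is the sum of the level counts of the original columns (for the 35 features, the `nunique()` values above add up to 98), and each row stores exactly one 1 per original feature. A minimal sketch on toy integer codes that checks both properties:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy integer-coded features: column 0 has 3 levels, column 1 has 2 levels.
codes = np.array([[0, 1], [2, 0], [1, 1], [0, 0]])

encoded = OneHotEncoder().fit_transform(codes)  # sparse matrix by default
dense = encoded.toarray()

# One output column per level: 3 + 2 = 5.
print(encoded.shape)
# One stored value per original feature per row: 4 rows * 2 features = 8.
print(encoded.nnz)
print(dense)
```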

In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)


Let's see the shapes of the training and test sets.

In [31]:

X_train.shape,X_test.shape

((244, 98), (62, 98))
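
These sizes follow from `test_size = 0.2` on 306 rows: the test set gets 62 rows and the remaining 244 go to training. The sketch below reproduces the shapes on stand-in data. The `random_state` argument is an addition here (the call above does not set it) that makes the split repeatable from run to run:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the dataset (306); values are arbitrary.
X_demo = np.arange(306 * 2).reshape(306, 2)
y_demo = np.arange(306)

# random_state fixes the shuffle so the same rows land in each split every time.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)
```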

There is no 'one size fits all' classifier for datasets, so we are going to implement a handful of classifiers, tune the hyperparameters of each to get its best result, and compare the classifiers with one another. Let's start with a Perceptron classifier.

A perceptron is essentially a single-layer neural network. Let's begin.
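
To make the idea concrete, here is a minimal sketch of the textbook binary perceptron rule on a toy, linearly separable problem. This is an illustration only: `train_perceptron` is a hypothetical helper, and scikit-learn's `Perceptron` differs in its details (multiclass handling, shuffling, stopping criteria).

```python
import numpy as np

# Toy, linearly separable data: the logical AND of two inputs, labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Textbook perceptron rule: adjust the weights only on misclassified samples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # on or across the wrong side of the boundary
                w += eta * yi * xi
                b += eta * yi
                errors += 1
        if errors == 0:  # a full pass with no mistakes: stop
            break
    return w, b

w, b = train_perceptron(X, y)
pred = np.where(X @ w + b > 0, 1, -1)
print(pred)
```

The learning rate `eta` plays the same role as `eta0` in the scikit-learn call that follows.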

In [32]:
from sklearn.linear_model import Perceptron
ppn = Perceptron(max_iter= 40, eta0= 0.001, tol=0.001,alpha= 0.000000001, random_state=0)
ppn.fit(X_train, y_train)

Perceptron(alpha=1e-09, class_weight=None, early_stopping=False, eta0=0.001,
           fit_intercept=True, max_iter=40, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=0, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

Now that we have fit a perceptron model, let's see how many samples are misclassified.

In [33]:
y_pred= ppn.predict(X_test)
print('Misclassified samples: %d' % (y_test != y_pred).sum())

Misclassified samples: 8


It misclassified 8 of the 62 samples in the test set, so the misclassification error is 8/62 $\approx$ 0.13.

In practice, though, it is more common to report the classification accuracy.

In [34]:
from sklearn.metrics import accuracy_score
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))  

Accuracy: 0.87
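
The error and the accuracy are two views of the same count: accuracy is 1 minus the misclassification error. A short check using the numbers from the outputs above:

```python
# Counts taken from the outputs above.
n_test = 62
n_wrong = 8

error = n_wrong / n_test
accuracy = 1 - error
print(f"error={error:.2f}, accuracy={accuracy:.2f}")
```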


So with the perceptron, we got 82-87 percent accuracy (the split is random, so the score varies from run to run) with `max_iter=40`, `eta0=0.001`, and `alpha=1e-9`; note that `alpha` only scales the regularization term, so with `penalty=None` it has no effect. Not bad for a perceptron. Let's try something else: Naive Bayes, which works well with sparse data like this.
Naive Bayes is a classification algorithm that assumes the features are independent of each other given the class. In particular, we are going to use Bernoulli Naive Bayes, as our one-hot encoded features are binary.

In [41]:
from sklearn.naive_bayes import BernoulliNB

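nb = BernoulliNB()
# Fit on the training split only; fitting on all of X and Y would let the model
# see the test rows. (The outputs recorded below came from a run that fit on the
# full X and Y, so they are optimistic.)
nb.fit(X_train, y_train)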

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [43]:
y2_pred = nb.predict(X_test)

Now that we have fit the model and made predictions, let's see how well it did.

In [45]:
print('Misclassified samples: %d' % (y_test != y2_pred).sum())

Misclassified samples: 5


We misclassified 5 samples. Let's see the accuracy.

In [46]:
print('Accuracy: %.2f' % accuracy_score(y_test, y2_pred))

Accuracy: 0.92


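Even better than the perceptron: we got 92% accuracy. Bear in mind that this figure comes from a model fit on the full `X` and `Y`, test rows included, so it overstates performance on unseen data. Still, Naive Bayes is well suited to sparse matrices with discrete binarized features, so it is no surprise that this algorithm performs well.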
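
A single random split gives a noisy comparison between classifiers. One common alternative is k-fold cross-validation, where every model is scored on the same held-out folds. A minimal sketch on synthetic binary features (not the soybean data; the data generator and the names `X_demo` and `y_demo` are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary features for illustration only:
# class 0 tends to switch on the first 5 features, class 1 the last 5.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
p_on = np.where(
    y_demo[:, None] == 0,
    np.r_[np.full(5, 0.9), np.full(5, 0.1)],
    np.r_[np.full(5, 0.1), np.full(5, 0.9)],
)
X_demo = (rng.random((200, 10)) < p_on).astype(int)

models = {
    "Perceptron": Perceptron(max_iter=40, tol=1e-3, random_state=0),
    "BernoulliNB": BernoulliNB(),
}
scores = {}
for name, model in models.items():
    # Each fold fits on 4/5 of the rows and scores on the held-out 1/5.
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5)
    print(f"{name}: mean accuracy {scores[name].mean():.2f}")
```

On the soybean data, classes with very few rows (one class has a single record) may need extra care with stratified folds.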