# Soybean Classification

The primary goal of this project is to develop a predictive model capable of accurately diagnosing soybean diseases. Utilizing machine learning techniques, we aim to analyze a given soybean dataset to predict the type of disease affecting the crop based on various symptomatic features.

### Dataset Overview

The dataset for this project is obtained in ARFF format, a standard data file format for machine learning tasks. It comprises categorical features that describe the physical symptoms observed in soybean crops, each potentially indicative of specific diseases. The features encompass a range of plant symptoms such as lesion characteristics, plant growth, and environmental conditions during plant development. The dataset's target variable is a multi-class label representing various soybean diseases.

#### Loading the data

In [3]:
import numpy as np
import pandas as pd
from scipy.io.arff import loadarff
import matplotlib.pyplot as plt
import seaborn as sns

raw_data = loadarff('dataset_42_soybean.arff')
df = pd.DataFrame(raw_data[0])

df

Unnamed: 0,date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,...,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots,class
0,b'october',b'normal',b'gt-norm',b'norm',b'yes',b'same-lst-yr',b'low-areas',b'pot-severe',b'none',b'90-100',...,b'absent',b'norm',b'dna',b'norm',b'absent',b'absent',b'norm',b'absent',b'norm',b'diaporthe-stem-canker'
1,b'august',b'normal',b'gt-norm',b'norm',b'yes',b'same-lst-two-yrs',b'scattered',b'severe',b'fungicide',b'80-89',...,b'absent',b'norm',b'dna',b'norm',b'absent',b'absent',b'norm',b'absent',b'norm',b'diaporthe-stem-canker'
2,b'july',b'normal',b'gt-norm',b'norm',b'yes',b'same-lst-yr',b'scattered',b'severe',b'fungicide',b'lt-80',...,b'absent',b'norm',b'dna',b'norm',b'absent',b'absent',b'norm',b'absent',b'norm',b'diaporthe-stem-canker'
3,b'july',b'normal',b'gt-norm',b'norm',b'yes',b'same-lst-yr',b'scattered',b'severe',b'none',b'80-89',...,b'absent',b'norm',b'dna',b'norm',b'absent',b'absent',b'norm',b'absent',b'norm',b'diaporthe-stem-canker'
4,b'october',b'normal',b'gt-norm',b'norm',b'yes',b'same-lst-two-yrs',b'scattered',b'pot-severe',b'none',b'lt-80',...,b'absent',b'norm',b'dna',b'norm',b'absent',b'absent',b'norm',b'absent',b'norm',b'diaporthe-stem-canker'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678,b'april',b'?',b'?',b'?',b'?',b'?',b'upper-areas',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'2-4-d-injury'
679,b'april',b'lt-normal',b'?',b'lt-norm',b'?',b'diff-lst-year',b'scattered',b'?',b'?',b'?',...,b'?',b'dna',b'?',b'?',b'?',b'?',b'?',b'?',b'rotted',b'herbicide-injury'
680,b'june',b'lt-normal',b'?',b'lt-norm',b'?',b'diff-lst-year',b'scattered',b'?',b'?',b'?',...,b'?',b'dna',b'?',b'?',b'?',b'?',b'?',b'?',b'rotted',b'herbicide-injury'
681,b'april',b'lt-normal',b'?',b'lt-norm',b'?',b'same-lst-yr',b'whole-field',b'?',b'?',b'?',...,b'?',b'dna',b'?',b'?',b'?',b'?',b'?',b'?',b'rotted',b'herbicide-injury'


The values are loaded as byte literals (`b'...'`), so we need to decode them into regular strings.

In [4]:
# Decode byte strings column by column; this avoids DataFrame.applymap, which is deprecated in recent pandas releases
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.decode('utf-8')

df.head(3)


Unnamed: 0,date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,...,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots,class
0,october,normal,gt-norm,norm,yes,same-lst-yr,low-areas,pot-severe,none,90-100,...,absent,norm,dna,norm,absent,absent,norm,absent,norm,diaporthe-stem-canker
1,august,normal,gt-norm,norm,yes,same-lst-two-yrs,scattered,severe,fungicide,80-89,...,absent,norm,dna,norm,absent,absent,norm,absent,norm,diaporthe-stem-canker
2,july,normal,gt-norm,norm,yes,same-lst-yr,scattered,severe,fungicide,lt-80,...,absent,norm,dna,norm,absent,absent,norm,absent,norm,diaporthe-stem-canker


### EDA

With the data loaded and decoded, we can move on to exploratory data analysis (EDA).

##### Cleaning and checking the data

We started by checking the data for missing and duplicate values and found the following:

The dataset contains no null (NaN) values in its features. Missing entries are instead encoded as the string '?', which we handle during data preparation.  
There are, however, 52 duplicate rows. Duplicates could skew both the analysis and model training, so they need to be removed.

After removing the duplicate rows, the dataset contains 631 unique entries.
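
The checks described above can be sketched with pandas. This is a minimal illustration on a small made-up frame (the values are placeholders, not rows from the soybean file). It also shows why a NaN check can report nothing missing while gaps stored as the string '?' are still present.

```python
import pandas as pd

# Toy frame standing in for the decoded soybean data (illustrative values only)
toy = pd.DataFrame({
    "plant-stand": ["normal", "normal", "?", "lt-normal"],
    "precip": ["gt-norm", "gt-norm", "norm", "?"],
    "class": ["brown-spot", "brown-spot", "anthracnose", "charcoal-rot"],
})

# NaN-style missing values: none here, because gaps are stored as the string '?'
nan_count = int(toy.isna().sum().sum())

# Placeholder-style missing values
placeholder_count = int((toy == "?").sum().sum())

# Exact duplicate rows (the second row repeats the first)
duplicate_count = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(nan_count, placeholder_count, duplicate_count, len(deduplicated))
```

On the real data, the same calls would be applied to `df` after decoding.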

#### Feature Analysis

Next, we move to feature analysis. We start by looking at the distribution of categories within a few selected features and the target variable 'class' to understand the diversity and frequency of the diseases. Let's start with plant-stand, precip, and class (the target variable).

**Feature Analysis Summary**

Plant-Stand:

    Normal: 333 instances
    Less than Normal: 266 instances
    Missing (denoted by '?'): 32 instances

Precipitation (Precip):

    Greater than Normal: 422 instances
    Normal: 103 instances
    Less than Normal: 72 instances
    Missing (denoted by '?'): 34 instances

**Target Variable (Class) Distribution Summary**

The target variable, representing different soybean diseases, shows a varied distribution among 19 classes:

    Alternaria Leaf Spot: 88 instances (the most frequent class)
    Frog Eye Leaf Spot: 81 instances
    Brown Spot: 77 instances
    Phytophthora Rot: 71 instances
    Diaporthe Stem Canker, Purple Seed Stain, Phyllosticta Leaf Spot, Bacterial Pustule, Charcoal Rot, Bacterial Blight, Downy Mildew, Powdery Mildew, and Rhizoctonia Root Rot: 20 instances each
    Herbicide Injury: 8 instances (the least common class)

The presence of missing values denoted by '?' in features such as plant-stand and precip suggests we should decide how to handle these during preprocessing. Options include imputation, removal, or treating them as a separate category, depending on their impact on the model.
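
Counts like the ones above can be obtained with `value_counts`. The snippet below is a small sketch on invented values, so the numbers it prints are not the ones reported for the soybean data.

```python
import pandas as pd

# Illustrative values only; in the notebook this would be the decoded, deduplicated DataFrame
toy = pd.DataFrame({
    "plant-stand": ["normal", "lt-normal", "normal", "?", "normal"],
    "class": ["brown-spot", "brown-spot", "anthracnose", "brown-spot", "anthracnose"],
})

# Frequency of each category, most common first ('?' is counted like any other value)
plant_stand_counts = toy["plant-stand"].value_counts()
class_counts = toy["class"].value_counts()

print(plant_stand_counts)
print(class_counts)
```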


<img src="Plant_Stand_Prec.png" width="500" >


Distribution of Plant Stand: 

    The dataset contains more instances with a 'normal' plant stand than 'lt-normal' (less than normal), with a small number of entries labeled as unknown ('?').

Distribution of Precipitation: 

    Most of the entries have 'gt-norm' (greater than normal) precipitation, with fewer instances having 'norm' (normal) and 'lt-norm' (less than normal) levels. As with 'plant-stand', there are some unknown values here as well.

Distribution of Soybean Diseases (Class): 

    The target variable shows a varied distribution of diseases, with some being more common than others. 'alternarialeaf-spot', 'frog-eye-leaf-spot', and 'brown-spot' are among the most frequent diseases, whereas 'herbicide-injury' is the least common. This uneven distribution may partly reflect that some diseases are more prevalent under certain conditions. It also means the classes are imbalanced, which is worth keeping in mind when splitting the data and when reading per-class metrics.

Both features contain '?' values, which we handle in the data preparation step.

### Exploring Relationships


We'll start by summarizing the relationship between plant-stand, precip, and a few selected diseases to provide insights that can guide further analysis or modeling decisions.

Based on the simplified analysis focusing on the top three diseases (alternarialeaf-spot, brown-spot, and frog-eye-leaf-spot), we can observe how these diseases relate to plant-stand and precip conditions:

#### **Plant Stand Summary:**

Alternarialeaf Spot: 

    Appears more frequently in normal plant stand conditions (55 occurrences) compared to less than normal conditions (33 occurrences).


Brown Spot: 

    Also more common in normal plant stand conditions (48 occurrences) than in less than normal (29 occurrences).

    
Frog-Eye Leaf Spot:

    Has the highest count of the three under normal plant stand conditions (57 occurrences), compared to 24 occurrences under less than normal conditions.

#### **Precipitation Summary:**

Alternarialeaf Spot: 

    Significantly more prevalent under conditions of greater than normal precipitation (79 occurrences), with fewer occurrences under normal precipitation (9 occurrences).

Brown Spot: 

    Similar to Alternarialeaf Spot, more common in greater than normal precipitation (67 occurrences) compared to normal precipitation (10 occurrences).

Frog-Eye Leaf Spot:

    Follows the same trend, with most occurrences under greater than normal precipitation (71 occurrences) and fewer under normal conditions (10 occurrences).

**These summaries suggest that these top soybean diseases occur more often under normal plant stand and greater than normal precipitation. This could imply that certain environmental conditions, such as ample water supply, favor the development of these diseases. Note, however, that 'normal' plant stand and 'gt-norm' precipitation are also the most common categories in the dataset overall (333 and 422 instances respectively), so part of this pattern reflects base rates rather than a disease-specific effect.**
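
A cross-tabulation is one way to produce per-disease counts like those above. The sketch below uses invented rows and also computes row-normalised shares, which help separate a disease-specific pattern from a category simply being common overall. It is not necessarily how the original counts were produced.

```python
import pandas as pd

# Illustrative rows only, not taken from the soybean file
toy = pd.DataFrame({
    "precip": ["gt-norm", "gt-norm", "norm", "gt-norm", "norm", "gt-norm"],
    "class": ["brown-spot", "brown-spot", "brown-spot",
              "frog-eye-leaf-spot", "frog-eye-leaf-spot", "frog-eye-leaf-spot"],
})

# Raw counts per (disease, precipitation level)
counts = pd.crosstab(toy["class"], toy["precip"])

# Row-normalised shares: each disease's rows sum to 1
shares = pd.crosstab(toy["class"], toy["precip"], normalize="index")

print(counts)
print(shares.round(2))
```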

### Data Preparation

Next comes data preparation, an important step that consists of the following:

##### Encoding Categorical Variables: 

Since an SVM (Support Vector Machine) works with numerical data, we need to convert all categorical variables into a numerical format. One common method is one-hot encoding, where each category value becomes a new binary column holding 1 or 0 (true/false). Label encoding, which assigns a unique integer to each category value, is more space-efficient because it keeps a single column per feature, but the integers imply an ordering that most of these categories do not have.

##### Handling Missing Values:

Missing values are labeled '?'.

The dataset contains a total of 1,913 '?' values spread across various features, with no single feature overwhelmingly affected.

Missing values were replaced with the most frequent value within each column. This keeps every imputed value within the set of categories already observed for that feature and lets the imputed data flow directly into the later steps of the analysis pipeline, including feature encoding and selection. Mode imputation does shift each column somewhat toward its most common category, so it is a pragmatic choice rather than a bias-free one.
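
Most-frequent imputation of '?' placeholders can be sketched with scikit-learn's `SimpleImputer`. The frame below is invented for illustration, and the cast to `object` is a precaution for string columns rather than something the original text prescribes.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative values only; '?' marks a missing entry
toy = pd.DataFrame({
    "hail": ["yes", "no", "?", "no"],
    "roots": ["norm", "?", "norm", "rotted"],
}).astype(object)

# Replace every '?' with the most frequent value of its column
imputer = SimpleImputer(missing_values="?", strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputed)
```

In a stricter setup the imputer would be fitted on the training rows only and then applied to the test rows, so that the test data does not influence the imputed values.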

##### Feature Scaling

SVMs are sensitive to the scale of the input features, so it's important to scale the features to a similar range.

We'll start with encoding the categorical variables. We'll use label encoding for simplicity and efficiency, given the dataset's structure; note that the pipeline in the SVM section below uses one-hot encoding instead, whose 0/1 columns already share a common scale. After encoding, we'll proceed with feature scaling and then split the dataset into training and testing sets.

##### Splitting the Data Set

After scaling we'll split the dataset into training and testing sets to prepare for model training and evaluation.

Training Set: 504 samples

Testing Set: 127 samples
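
These sizes follow from how `train_test_split` rounds: with 631 rows and `test_size=0.2`, the test part is rounded up to 127 rows and the remaining 504 form the training set. The sketch below reproduces the arithmetic on placeholder arrays; the array contents are dummies, and only the row count comes from the text above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 631 placeholder rows, matching the deduplicated row count reported above
X_demo = np.arange(631).reshape(-1, 1)
y_demo = np.arange(631) % 19  # 19 dummy class labels

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(len(X_train_demo), len(X_test_demo))
```

Passing `stratify=y_demo` would additionally keep the class proportions similar in both parts, which can help with rare classes. The split call used later in this notebook does not pass it.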

### Understanding the Features

Cramér's V is well suited to assessing the association between categorical variables. It measures the strength of association between two nominal variables on a scale from 0 (no association) to 1 (perfect association), regardless of the number of categories.

To apply Cramer's V, we'll compute the statistic for each pair of features in the dataset, then visualize these associations in a heatmap similar to the correlation matrix heatmap. This approach will help us understand the relationships between features better.
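
A pairwise computation along these lines might look as follows. This is a sketch using the standard definition V = sqrt(chi2 / (n * min(r - 1, c - 1))) without bias correction. The helper name `cramers_v` and the toy columns are assumptions, and the original analysis may have used a different implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series (hypothetical helper, no bias correction)."""
    table = pd.crosstab(x, y)
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        return 0.0  # a constant column carries no association
    # correction=False disables the Yates correction applied to 2x2 tables by default
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * min_dim)))

# Illustrative columns only
toy = pd.DataFrame({
    "a": ["x", "y", "x", "y"],
    "b": ["u", "v", "u", "v"],  # perfectly tied to "a"
    "c": ["p", "p", "q", "q"],  # independent of "a"
})

# Symmetric matrix of pairwise associations, ready to pass to a heatmap
matrix = pd.DataFrame(
    [[cramers_v(toy[r], toy[c]) for c in toy.columns] for r in toy.columns],
    index=toy.columns,
    columns=toy.columns,
)
print(matrix.round(2))
```

The resulting matrix can be passed to a plotting call such as `sns.heatmap(matrix)` to obtain a figure of the kind shown here.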

<img src="Cramer.png" width="700">

Now that we've seen which features are associated, we can apply Recursive Feature Elimination (RFE), a feature selection method that recursively removes the least important features based on the model's coefficients or feature importances.

<img src="RFE.png" width="700" >

The Recursive Feature Elimination (RFE) process with cross-validation (RFECV) suggests that the optimal number of features for predicting soybean diseases using an SVM model is 19. The plot shows how the cross-validation score varies with the number of features selected, indicating that having around 19 features provides the best balance between model complexity and accuracy.

This result demonstrates the potential to reduce the dataset from 35 features to 19 without compromising model performance, which can lead to a simpler, more interpretable model.
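
The optimal feature count is read from a fitted `RFECV` object. The sketch below runs it on synthetic data generated with `make_classification`, so the selected count it prints says nothing about the soybean result and only shows where the number comes from.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data: 10 features, only a few of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# A linear kernel exposes coefficients, which RFE uses to rank features
selector = RFECV(
    estimator=SVC(kernel="linear", C=1),
    step=1,
    cv=StratifiedKFold(5),
    scoring="accuracy",
)
selector.fit(X_demo, y_demo)

print("Selected feature count:", selector.n_features_)
print("Kept features mask:", selector.support_)
```

`n_features_` holds the count with the best cross-validated score, and `support_` marks which columns were kept.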

### The SVM model 

Importing the resources we need

In [5]:
import numpy as np
import pandas as pd
from scipy.io import arff
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.impute import SimpleImputer

Loading the dataset

In [6]:
data, meta = arff.loadarff('dataset_42_soybean.arff')
df = pd.DataFrame(data)

Decoding byte strings and handling missing values

In [7]:
# Decode byte strings column by column; this avoids DataFrame.applymap, which is deprecated in recent pandas releases
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.decode('utf-8')


Imputing missing values with the most frequent value in each column

In [8]:
imputer = SimpleImputer(missing_values='?', strategy='most_frequent')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

Removing duplicate rows

In [9]:
df_imputed = df_imputed.drop_duplicates()

Separating features and target variable

In [10]:
y = df_imputed['class']
X = df_imputed.drop('class', axis=1)

Defining categorical features for one-hot encoding

In [11]:
categorical_features = X.columns

Preprocessor for one-hot encoding categorical features

In [12]:
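preprocessor = ColumnTransformer(transformers=[
    # handle_unknown='ignore' keeps predict() from failing on categories absent from the training split
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])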

SVM classifier and RFECV for feature selection

In [13]:
svm_clf = SVC(kernel='linear', C=1, random_state=42)
rfecv = RFECV(estimator=svm_clf, step=1, cv=StratifiedKFold(5), scoring='accuracy')

Pipeline including preprocessor, RFECV, and SVM classifier

In [14]:
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('feature_selection', rfecv),
                           ('svm', svm_clf)])

Splitting data into training and testing sets

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Fitting the pipeline to the training data

In [16]:
pipeline.fit(X_train, y_train)

Making predictions on the test data

In [17]:
y_pred = pipeline.predict(X_test)

Evaluating the model

In [18]:
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)

Accuracy: 0.9444444444444444
Classification Report:
                              precision    recall  f1-score   support

               2-4-d-injury       1.00      1.00      1.00         3
        alternarialeaf-spot       0.86      0.90      0.88        20
                anthracnose       1.00      1.00      1.00         7
           bacterial-blight       1.00      1.00      1.00         4
          bacterial-pustule       1.00      1.00      1.00         5
                 brown-spot       1.00      0.87      0.93        15
             brown-stem-rot       1.00      1.00      1.00        10
               charcoal-rot       1.00      1.00      1.00         3
              cyst-nematode       1.00      1.00      1.00         1
diaporthe-pod-&-stem-blight       1.00      1.00      1.00         3
      diaporthe-stem-canker       1.00      1.00      1.00         5
               downy-mildew       1.00      1.00      1.00         1
         frog-eye-leaf-spot       0.84      0.84 