# Problem Identification

<b>The classification of dry beans using high-resolution images poses a multifaceted challenge due to the diverse features extracted from the grains. <i>The objective of this project is to develop an effective classification model for seven different registered varieties of dry beans</i>. The dataset, comprising 16 features derived from 13,611 grain images, introduces complexities such as varied dimensions and shape forms. Challenges include:</b>

> 1. <i>Multivariate Nature</i>
> 2. <i>Image-based Classification</i>
> 3. <i>Data Preprocessing</i>
> 4. <i>Data Pipelining</i>
<br>

<b>By addressing these challenges, the project aims to contribute to the development of a reliable and accurate classification system for dry beans, facilitating uniform seed classification based on high-resolution images.</b>



In [2]:
!pip install scikit-learn

## A. Data manipulation
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1.2 for data splitting
from sklearn.model_selection import train_test_split

## B. Transformers for predictors:

# 1.3 Class for imputing missing values
# https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
from sklearn.impute import SimpleImputer

# 1.4 One hot encode categorical data--Convert to dummy
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
from sklearn.preprocessing import OneHotEncoder

# 1.5 Scale numeric data
from sklearn.preprocessing import StandardScaler

## C. Transformer for target:

# 1.6 Label encode target column
from sklearn.preprocessing import LabelEncoder


## D. Composite Transformers:

# 1.7 Class for applying different data transformations
#     to different column subsets in parallel
from sklearn.compose import ColumnTransformer

# 1.8 Pipeline class: Class for applying multiple
#     data transformations sequentially
from sklearn.pipeline import Pipeline

## E. Estimator

# 1.9 Estimator
# Ref: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
# User guide: https://scikit-learn.org/stable/modules/ensemble.html
from sklearn.ensemble import RandomForestClassifier

# 1.10 To plot pipeline diagram
from sklearn import set_config



In [3]:
# 1.11 Display outputs of all commands from a cell--not just of the last command
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
# Import warnings module
import warnings
# Do not print warnings on screen
warnings.filterwarnings("ignore")

In [5]:
raw_excel=pd.read_excel("C://SOUPARNA//Data Analytics//Machine Learning//Project//End Term Project//Part-1//Dry_Bean_Dataset.xlsx")

In [6]:
df=raw_excel.copy()
df

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715,190.141097,0.763923,0.988856,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,SEKER
1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172,191.272750,0.783968,0.984986,0.887034,0.953861,0.006979,0.003564,0.909851,0.998430,SEKER
2,29380,624.110,212.826130,175.931143,1.209713,0.562727,29690,193.410904,0.778113,0.989559,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,SEKER
3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724,195.467062,0.782681,0.976696,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,SEKER
4,30140,620.134,201.847882,190.279279,1.060798,0.333680,30417,195.896503,0.773098,0.990893,0.984877,0.970516,0.006697,0.003665,0.941900,0.999166,SEKER
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13606,42097,759.696,288.721612,185.944705,1.552728,0.765002,42508,231.515799,0.714574,0.990331,0.916603,0.801865,0.006858,0.001749,0.642988,0.998385,DERMASON
13607,42101,757.499,281.576392,190.713136,1.476439,0.735702,42494,231.526798,0.799943,0.990752,0.922015,0.822252,0.006688,0.001886,0.676099,0.998219,DERMASON
13608,42139,759.321,281.539928,191.187979,1.472582,0.734065,42569,231.631261,0.729932,0.989899,0.918424,0.822730,0.006681,0.001888,0.676884,0.996767,DERMASON
13609,42147,763.779,283.382636,190.275731,1.489326,0.741055,42667,231.653248,0.705389,0.987813,0.907906,0.817457,0.006724,0.001852,0.668237,0.995222,DERMASON


In [8]:
# Find the number of rows and columns
df.shape

(13611, 17)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13611 entries, 0 to 13610
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Area             13611 non-null  int64  
 1   Perimeter        13611 non-null  float64
 2   MajorAxisLength  13611 non-null  float64
 3   MinorAxisLength  13611 non-null  float64
 4   AspectRation     13611 non-null  float64
 5   Eccentricity     13611 non-null  float64
 6   ConvexArea       13611 non-null  int64  
 7   EquivDiameter    13611 non-null  float64
 8   Extent           13611 non-null  float64
 9   Solidity         13611 non-null  float64
 10  roundness        13611 non-null  float64
 11  Compactness      13611 non-null  float64
 12  ShapeFactor1     13611 non-null  float64
 13  ShapeFactor2     13611 non-null  float64
 14  ShapeFactor3     13611 non-null  float64
 15  ShapeFactor4     13611 non-null  float64
 16  Class            13611 non-null  object 
dtypes: float64(1

In [10]:
#The total number of elements present in this dataset
df.size

231387

In [11]:
#Check the datatypes of the different columns in the dataframe
df.dtypes

Area                 int64
Perimeter          float64
MajorAxisLength    float64
MinorAxisLength    float64
AspectRation       float64
Eccentricity       float64
ConvexArea           int64
EquivDiameter      float64
Extent             float64
Solidity           float64
roundness          float64
Compactness        float64
ShapeFactor1       float64
ShapeFactor2       float64
ShapeFactor3       float64
ShapeFactor4       float64
Class               object
dtype: object

In [12]:
df.nunique()

Area               12011
Perimeter          13416
MajorAxisLength    13543
MinorAxisLength    13543
AspectRation       13543
Eccentricity       13543
ConvexArea         12066
EquivDiameter      12011
Extent             13535
Solidity           13526
roundness          13543
Compactness        13543
ShapeFactor1       13543
ShapeFactor2       13543
ShapeFactor3       13543
ShapeFactor4       13543
Class                  7
dtype: int64

<b>Every column except Class is numeric (int64 or float64) with thousands of distinct values, while Class has only 7. The target, Class, is therefore the only categorical variable in the dataset.</b>

In [13]:
df['Class'].value_counts()

DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: Class, dtype: int64
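
The counts above are uneven across the seven varieties. As a rough sketch of how that imbalance could be quantified, the snippet below re-enters the printed counts by hand (the names `class_counts`, `class_share` and `imbalance_ratio` are illustrative and not part of the notebook) and computes each class's share and the largest-to-smallest ratio.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
class_counts = pd.Series({
    "DERMASON": 3546, "SIRA": 2636, "SEKER": 2027, "HOROZ": 1928,
    "CALI": 1630, "BARBUNYA": 1322, "BOMBAY": 522,
})

# Share of each class in the full dataset
class_share = (class_counts / class_counts.sum()).round(3)

# Ratio between the most and the least frequent class
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share)
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.2f}")
```

On the real DataFrame the same shares should be obtainable with `df['Class'].value_counts(normalize=True)`.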

In [14]:
from sklearn.preprocessing import LabelEncoder


le=LabelEncoder()
df['Class']=le.fit_transform(df['Class'])
# Display the resulting DataFrame
df.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715,190.141097,0.763923,0.988856,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,5
1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172,191.27275,0.783968,0.984986,0.887034,0.953861,0.006979,0.003564,0.909851,0.99843,5
2,29380,624.11,212.82613,175.931143,1.209713,0.562727,29690,193.410904,0.778113,0.989559,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,5
3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724,195.467062,0.782681,0.976696,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,5
4,30140,620.134,201.847882,190.279279,1.060798,0.33368,30417,195.896503,0.773098,0.990893,0.984877,0.970516,0.006697,0.003665,0.9419,0.999166,5


In [41]:
df['Class'].value_counts()

3    3546
6    2636
5    2027
4    1928
2    1630
0    1322
1     522
Name: Class, dtype: int64
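
`LabelEncoder` assigns integer codes in sorted order of the class names, which is consistent with the counts above (for example, code 3 has 3546 rows, the DERMASON count). The following minimal sketch rebuilds that mapping on a standalone list of the seven names; `encoder` and `code_to_name` are illustrative names, and in the notebook the fitted `le` object already holds the same information in `le.classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# The seven class names, deliberately given in unsorted order
class_names = ["SEKER", "BARBUNYA", "BOMBAY", "CALI", "HOROZ", "SIRA", "DERMASON"]

encoder = LabelEncoder().fit(class_names)

# classes_ is sorted, so the position of a name is its integer code
code_to_name = {code: str(name) for code, name in enumerate(encoder.classes_)}
print(code_to_name)

# inverse_transform converts integer codes back to the original names
decoded = encoder.inverse_transform([3, 6])
print(decoded)
```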

In [15]:
df.columns

Index(['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
       'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter', 'Extent',
       'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2',
       'ShapeFactor3', 'ShapeFactor4', 'Class'],
      dtype='object')

In [16]:
# Check for duplicated values
duplicated_values = df.duplicated().sum()
print("\nDuplicated Values:", duplicated_values)


Duplicated Values: 68
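
One possible way to handle the duplicated rows is to drop them from the DataFrame before the data is split. The sketch below shows the idea on a small hypothetical frame (`toy` is made up for illustration and reuses a few values from the table above); on the real data the equivalent call would be `df.drop_duplicates()`.

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
toy = pd.DataFrame({
    "Area": [28395, 28734, 28395, 29380],
    "Perimeter": [610.291, 638.018, 610.291, 624.110],
    "Class": ["SEKER", "SEKER", "SEKER", "SEKER"],
})

# Count rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)

print("Duplicates found:", n_duplicates)
print("Rows after removal:", len(deduplicated))
```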


In [17]:
#X = df.iloc[:,:-1]
#y = df.Class

In [19]:
# Split the data into features (X) and target variable (y)
X = df.drop(['Class'],axis=1)
y = df['Class']

In [20]:
X.columns

Index(['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
       'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter', 'Extent',
       'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2',
       'ShapeFactor3', 'ShapeFactor4'],
      dtype='object')

In [21]:
y.name

'Class'

# Data Pipelining

In [22]:
# Define preprocessing steps for numeric and categorical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [23]:
categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [44]:
# 10.5
# Test transformers:
# Feed data to each pipe to see if it is working
# It is like testing a sub-component
# before full-plumbing is done.
# No error should come.

numeric_transformer.fit_transform(X[numeric_features])



array([[-0.84074853, -1.1433189 , -1.30659814, ...,  2.40217287,
         1.92572347,  0.83837103],
       [-0.82918764, -1.01392388, -1.39591111, ...,  3.10089314,
         2.68970162,  0.77113842],
       [-0.80715717, -1.07882906, -1.25235661, ...,  2.23509147,
         1.84135576,  0.91675514],
       ...,
       [-0.37203825, -0.44783294, -0.45047814, ...,  0.28920441,
         0.33632829,  0.39025114],
       [-0.37176543, -0.42702856, -0.42897404, ...,  0.22837538,
         0.2489734 ,  0.03644001],
       [-0.37135619, -0.38755718, -0.2917356 , ..., -0.12777587,
        -0.2764814 ,  0.71371948]])
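
A quick way to confirm that an impute-then-scale pipeline behaves as intended is to check that every transformed column has mean 0 and standard deviation 1 and that no missing values remain. The sketch below does this on synthetic data (`toy_X` and `toy_transformer` are hypothetical stand-ins for `X[numeric_features]` and `numeric_transformer`).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data on very different scales, with one missing value
rng = np.random.default_rng(0)
toy_X = rng.normal(loc=[50000.0, 0.8], scale=[20000.0, 0.05], size=(200, 2))
toy_X[0, 0] = np.nan

toy_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

scaled = toy_transformer.fit_transform(toy_X)

# After standardization each column should have mean 0 and std 1
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```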

In [25]:
## Combine the transformers into a preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [26]:
# 10.7 Test column transformer
preprocessor.fit_transform(X)

array([[-0.84074853, -1.1433189 , -1.30659814, ...,  2.40217287,
         1.92572347,  0.83837103],
       [-0.82918764, -1.01392388, -1.39591111, ...,  3.10089314,
         2.68970162,  0.77113842],
       [-0.80715717, -1.07882906, -1.25235661, ...,  2.23509147,
         1.84135576,  0.91675514],
       ...,
       [-0.37203825, -0.44783294, -0.45047814, ...,  0.28920441,
         0.33632829,  0.39025114],
       [-0.37176543, -0.42702856, -0.42897404, ...,  0.22837538,
         0.2489734 ,  0.03644001],
       [-0.37135619, -0.38755718, -0.2917356 , ..., -0.12777587,
        -0.2764814 ,  0.71371948]])

In [27]:
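# Duplicate-removal placeholder: a scikit-learn Pipeline step cannot drop rows
# (X and y must stay aligned), so this step passes the data through unchanged.
# The 68 duplicated rows found earlier are therefore still present; removing
# them would require df.drop_duplicates() before the train/test split.
duplicate_removal_pipeline = Pipeline([
    ('drop_duplicates', 'passthrough')
])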

In [28]:
# Random Forest Classifier Pipeline
rf_classifier_pipeline = Pipeline([
    ('classifier', RandomForestClassifier(n_estimators=100,random_state=42))
])

In [29]:
# Create the pipeline by combining preprocessor and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('duplicate_removal', duplicate_removal_pipeline),
                            ('rf_classifier', rf_classifier_pipeline)])

In [30]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
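
Because the classes are imbalanced, a stratified split is one option for keeping the class proportions the same in the training and test sets. The sketch below illustrates the `stratify` argument of `train_test_split` on synthetic labels built from the class counts reported earlier (`toy_X` and `toy_y` are placeholders, not the notebook's data); it is not what the notebook ran, so the reported metrics do not reflect it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Encoded class counts (codes 0..6) taken from the value_counts() output
counts = [1322, 522, 1630, 3546, 1928, 2027, 2636]
toy_y = np.repeat(np.arange(7), counts)
toy_X = np.zeros((toy_y.size, 1))  # placeholder features

# stratify=toy_y keeps each class's share equal in both subsets
_, _, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_share = np.bincount(y_tr) / y_tr.size
test_share = np.bincount(y_te) / y_te.size
max_gap = np.abs(train_share - test_share).max()
print(f"Largest train/test share difference: {max_gap:.5f}")
```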

In [31]:
X_train.columns

Index(['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
       'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter', 'Extent',
       'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2',
       'ShapeFactor3', 'ShapeFactor4'],
      dtype='object')

In [33]:
y_train.name

'Class'

In [34]:
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [35]:
# Make predictions on the test set
y_pred = pipeline.predict(X_test)

In [36]:
# Evaluate the model
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')

Model Accuracy: 0.93


In [37]:
from sklearn.metrics import classification_report, confusion_matrix

In [367]:
# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.89      0.92       261
           1       1.00      1.00      1.00       117
           2       0.93      0.93      0.93       317
           3       0.90      0.91      0.91       671
           4       0.97      0.95      0.96       408
           5       0.97      0.93      0.95       413
           6       0.88      0.86      0.87       536

   micro avg       0.93      0.91      0.92      2723
   macro avg       0.94      0.93      0.93      2723
weighted avg       0.93      0.91      0.92      2723
 samples avg       0.91      0.91      0.91      2723



In [38]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[236   0  17   0   1   1   6]
 [  0 117   0   0   0   0   0]
 [ 13   0 299   0   3   1   1]
 [  0   0   0 620   2   4  45]
 [  2   0   7   3 390   0   6]
 [  3   0   0  14   0 387   9]
 [  0   0   1  52   5   7 471]]
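
Accuracy, per-class recall and per-class precision can all be derived from this matrix, since scikit-learn places true classes on the rows and predicted classes on the columns. The sketch below re-enters the printed matrix by hand (the variable names are illustrative) and computes the three quantities.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
conf = np.array([
    [236,   0,  17,   0,   1,   1,   6],
    [  0, 117,   0,   0,   0,   0,   0],
    [ 13,   0, 299,   0,   3,   1,   1],
    [  0,   0,   0, 620,   2,   4,  45],
    [  2,   0,   7,   3, 390,   0,   6],
    [  3,   0,   0,  14,   0, 387,   9],
    [  0,   0,   1,  52,   5,   7, 471],
])

# Accuracy: correctly classified samples (diagonal) over all samples
accuracy = np.trace(conf) / conf.sum()

# Recall: diagonal over row sums; precision: diagonal over column sums
recall = np.diag(conf) / conf.sum(axis=1)
precision = np.diag(conf) / conf.sum(axis=0)

print(f"Accuracy: {accuracy:.4f}")
print("Recall:   ", recall.round(3))
print("Precision:", precision.round(3))
```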


# Predicting with a new data point

In [39]:
# Now, let's use a sample new data point to make predictions
new_data_point = pd.DataFrame({
    'Area': [34875],
    'Perimeter': [683.2],
    'MajorAxisLength': [450],
    'MinorAxisLength': [189.7458],
    'AspectRation': [1.43],
    'Eccentricity': [0.6317],
    'ConvexArea': [30105],
    'EquivDiameter': [204.16],
    'Extent': [0.63124],
    'Solidity': [0.7845],
    'roundness': [0.9153],
    'Compactness': [0.8928],
    'ShapeFactor1': [0.007345],
    'ShapeFactor2': [0.00245],
    'ShapeFactor3': [0.97123],
    'ShapeFactor4': [0.954963]
})

In [43]:
# Use the pipeline to make predictions on the new data point
predicted_class = pipeline.predict(new_data_point)
#print("\nSample Data for Prediction:\n", predicted_class)
print(f'Predicted Class for New Data Point: {predicted_class[0]}')
print(f'Class {predicted_class[0]} represents: {le.inverse_transform(predicted_class)[0]}')

Predicted Class for New Data Point: 3
Class 3 represents: DERMASON
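
Before trusting a prediction for a hand-written data point, it can help to check that its features are mutually consistent. The sketch below assumes three relations that hold for the first row of the table shown earlier (AspectRation = MajorAxisLength / MinorAxisLength, Solidity = Area / ConvexArea, EquivDiameter = sqrt(4 * Area / pi)); these formulas are inferred from the data rather than stated in the notebook, and `consistency_report` is a hypothetical helper.

```python
import math

def consistency_report(row, rel_tol=1e-3):
    """Return, per derived feature, whether it matches its assumed formula."""
    expected = {
        "AspectRation": row["MajorAxisLength"] / row["MinorAxisLength"],
        "Solidity": row["Area"] / row["ConvexArea"],
        "EquivDiameter": math.sqrt(4 * row["Area"] / math.pi),
    }
    return {
        name: math.isclose(row[name], value, rel_tol=rel_tol)
        for name, value in expected.items()
    }

# First row of the dataset as displayed above
first_row = {
    "Area": 28395, "MajorAxisLength": 208.178117, "MinorAxisLength": 173.888747,
    "AspectRation": 1.197191, "ConvexArea": 28715,
    "EquivDiameter": 190.141097, "Solidity": 0.988856,
}

# The hand-written sample used for the prediction
sample = {
    "Area": 34875, "MajorAxisLength": 450, "MinorAxisLength": 189.7458,
    "AspectRation": 1.43, "ConvexArea": 30105,
    "EquivDiameter": 204.16, "Solidity": 0.7845,
}

first_row_report = consistency_report(first_row)
sample_report = consistency_report(sample)
print("Dataset row:", first_row_report)
print("New sample: ", sample_report)
```

Under these assumed relations the hand-written sample is not internally consistent (for instance, its ConvexArea is smaller than its Area), so its predicted class is best read as a demonstration of the pipeline's interface rather than as a realistic classification.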


# Observation & Interpretation

>1. <b>Precision, Recall, and F1-Score:</b>
•	Overall, the model demonstrates high performance, with precision, recall, and F1-scores above 0.90 for most classes.         

>2. <b>Confusion Matrix:</b>
•	Class 3 (DERMASON) and Class 6 (SIRA) are confused with each other most often: 45 DERMASON samples are predicted as SIRA and 52 SIRA samples are predicted as DERMASON, as the off-diagonal entries in the corresponding rows show.

>3. <b>Sample Prediction:</b>
•	Utilizing the pipeline, a new data point is predicted to belong to Class 3 (DERMASON).
•	The accuracy of the model on the test set is reported as 0.93, indicating a high level of correct predictions.

>4. <b>Interpretation:</b>
•	The model exhibits a strong ability to differentiate between the seven dry bean classes based on the provided features.
•	Misclassifications, particularly between certain classes, may indicate areas for potential improvement.
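•	The sample prediction shows that the fitted pipeline accepts a raw, unscaled data point and returns a class label. Because the true class of that point is unknown, the evidence for the model's generalization ability is its test-set performance rather than this single prediction.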
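
To make the misclassification pattern concrete, the class pairs with the largest off-diagonal counts can be listed directly from the confusion matrix. The sketch below re-enters the printed matrix and assumes the alphabetical code-to-name mapping produced by `LabelEncoder` (0 = BARBUNYA through 6 = SIRA); `top_pairs` is an illustrative name.

```python
import numpy as np

# Assumed mapping: LabelEncoder sorts class names alphabetically
class_names = ["BARBUNYA", "BOMBAY", "CALI", "DERMASON", "HOROZ", "SEKER", "SIRA"]

# Confusion matrix from the notebook output (rows: true, columns: predicted)
conf = np.array([
    [236,   0,  17,   0,   1,   1,   6],
    [  0, 117,   0,   0,   0,   0,   0],
    [ 13,   0, 299,   0,   3,   1,   1],
    [  0,   0,   0, 620,   2,   4,  45],
    [  2,   0,   7,   3, 390,   0,   6],
    [  3,   0,   0,  14,   0, 387,   9],
    [  0,   0,   1,  52,   5,   7, 471],
])

# Zero the diagonal so only misclassifications remain
off_diag = conf.copy()
np.fill_diagonal(off_diag, 0)

# Flat indices of the three largest error counts, largest first
order = np.argsort(off_diag, axis=None)[::-1][:3]
top_pairs = [
    (class_names[t], class_names[p], int(off_diag[t, p]))
    for t, p in zip(*np.unravel_index(order, off_diag.shape))
]

for true_name, pred_name, n in top_pairs:
    print(f"{true_name} predicted as {pred_name}: {n}")
```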
