# 1) What is the data about?

The dataset consists of images of 13,611 grains from 7 registered varieties of dry beans, captured with a high-resolution camera. A user-friendly interface was designed using the MATLAB graphical user interface (GUI). The bean images obtained by the computer vision system (CVS) went through segmentation and feature extraction stages, yielding 16 features per grain: 12 dimension features and 4 shape forms.

Attribute information:

1. Area (A): The area of a bean zone and the number of pixels within its boundaries.
2. Perimeter (P): Bean circumference is defined as the length of its border.
3. Major axis length (L): The distance between the ends of the longest line that can be drawn from a bean.
4. Minor axis length (l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. Aspect ratio (K): The ratio of the major axis length to the minor axis length, K = L/l.
6. Eccentricity (Ec): Eccentricity of the ellipse having the same moments as the region.
7. Convex area (C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. Equivalent diameter (Ed): The diameter of a circle having the same area as a bean seed area.
9. Extent (Ex): The ratio of the bean area to the number of pixels in its bounding box (the values in the data are below 1).
10. Solidity (S): Also known as convexity. The ratio of the pixels in the bean to those in its convex shell, S = A/C.
11. Roundness (R): Calculated with the following formula: R = (4πA)/(P^2)
12. Compactness (CO): Measures the roundness of an object: CO = Ed/L
13. ShapeFactor1 (SF1)
14. ShapeFactor2 (SF2)
15. ShapeFactor3 (SF3)
16. ShapeFactor4 (SF4)
17. Class (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira)
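
Several of these attributes are simple functions of the others. As a minimal sketch (the function names are illustrative and not part of the original study), the formulas for K, Ed, R and CO can be checked against the first row of the dataset printed later in this notebook:

```python
import math

def aspect_ratio(major, minor):
    """K = L / l."""
    return major / minor

def equivalent_diameter(area):
    """Diameter of the circle whose area equals the bean area."""
    return math.sqrt(4.0 * area / math.pi)

def roundness(area, perimeter):
    """R = 4*pi*A / P^2; equals 1 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter ** 2

def compactness(area, major):
    """CO = Ed / L."""
    return equivalent_diameter(area) / major

# First row of the dataset (class SEKER)
area, perimeter = 28395, 610.291
major, minor = 208.178117, 173.888747

print(round(aspect_ratio(major, minor), 6))
print(round(equivalent_diameter(area), 6))
print(round(roundness(area, perimeter), 6))
print(round(compactness(area, major), 6))
```

The printed values agree with the stored AspectRation, EquivDiameter, roundness and Compactness columns of that row (1.197191, 190.141097, 0.958027, 0.913358) to about four decimal places.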

# 2) What type of benefit might you hope to get from data mining?

The primary objective is to provide a method for obtaining uniform seed varieties from crop production, which comes as a mixed population of varieties. Classifying the beans helps determine which types of product reach the market and supplies the product parameters that determine the price. The parameters obtained in the study also form a data set to which data mining and artificial intelligence methods can be applied.

# 3) Discuss data quality issues 

a) Are there problems with the data?

No. The data is ready to be fed into an algorithm. There are no significant issues such as missing values or outliers in this dataset.

b) What might be an appropriate response to the quality issues?

The dataset is fine as it stands and does not have quality issues.
If such issues did exist, we could treat them as follows:

i) Missing values: use scikit-learn's SimpleImputer, or impute the mean for continuous attributes and the mode for categorical attributes.

ii) Outliers: remove the outlying data points.

iii) Categorical attributes: one-hot encoding is a suitable solution.

iv) Multicollinearity: use Lasso (L1) regularization, which can shrink the coefficients of redundant features to zero.
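
Remedies i) to iii) can be sketched with pandas alone. The toy frame below is invented for illustration: the `Region` column and the injected missing values and outlier are assumptions, not part of the Dry Bean dataset, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing number, one missing category, one outlier
toy = pd.DataFrame({
    "Area": [28395.0, 28734.0, np.nan, 30008.0, 30140.0, 250000.0],
    "Region": ["north", "north", "north", None, "south", "north"],
})

# i) Missing values: mean for the continuous column, mode for the categorical one
toy["Area"] = toy["Area"].fillna(toy["Area"].mean())
toy["Region"] = toy["Region"].fillna(toy["Region"].mode()[0])

# ii) Outliers: drop rows outside 1.5 * IQR of the quartiles
q1, q3 = toy["Area"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned = toy[toy["Area"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# iii) Categorical attributes: one-hot encode
encoded = pd.get_dummies(cleaned, columns=["Region"])
print(encoded)
```

Note that the mean used for imputation here is itself pulled upward by the outlier; using the median, or handling outliers first, are reasonable variations.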

# Reading Data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from random import randrange
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
df=pd.read_excel("Dry_Bean_Dataset.xlsx",header=0)

In [3]:
df

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715,190.141097,0.763923,0.988856,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,SEKER
1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172,191.272750,0.783968,0.984986,0.887034,0.953861,0.006979,0.003564,0.909851,0.998430,SEKER
2,29380,624.110,212.826130,175.931143,1.209713,0.562727,29690,193.410904,0.778113,0.989559,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,SEKER
3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724,195.467062,0.782681,0.976696,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,SEKER
4,30140,620.134,201.847882,190.279279,1.060798,0.333680,30417,195.896503,0.773098,0.990893,0.984877,0.970516,0.006697,0.003665,0.941900,0.999166,SEKER
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13606,42097,759.696,288.721612,185.944705,1.552728,0.765002,42508,231.515799,0.714574,0.990331,0.916603,0.801865,0.006858,0.001749,0.642988,0.998385,DERMASON
13607,42101,757.499,281.576392,190.713136,1.476439,0.735702,42494,231.526798,0.799943,0.990752,0.922015,0.822252,0.006688,0.001886,0.676099,0.998219,DERMASON
13608,42139,759.321,281.539928,191.187979,1.472582,0.734065,42569,231.631261,0.729932,0.989899,0.918424,0.822730,0.006681,0.001888,0.676884,0.996767,DERMASON
13609,42147,763.779,283.382636,190.275731,1.489326,0.741055,42667,231.653248,0.705389,0.987813,0.907906,0.817457,0.006724,0.001852,0.668237,0.995222,DERMASON


In [4]:
X=df.iloc[:,:-1]
y=df.iloc[:,-1]

# SCALING DATA

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled=scaler.fit_transform(X)

In [7]:
X_scaled = pd.DataFrame(X_scaled)

In [8]:
df_scaled=pd.concat([X_scaled,y], axis=1)

In [9]:
df_scaled

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,Class
0,-0.840749,-1.143319,-1.306598,-0.631153,-1.565053,-2.185720,-0.841451,-1.063341,0.289087,0.367613,1.423867,1.839116,0.680786,2.402173,1.925723,0.838371,SEKER
1,-0.829188,-1.013924,-1.395911,-0.434445,-1.969784,-3.686040,-0.826102,-1.044217,0.697477,-0.462907,0.231054,2.495449,0.367967,3.100893,2.689702,0.771138,SEKER
2,-0.807157,-1.078829,-1.252357,-0.585735,-1.514291,-2.045336,-0.808704,-1.008084,0.578195,0.518417,1.252865,1.764843,0.603129,2.235091,1.841356,0.916755,SEKER
3,-0.785741,-0.977215,-1.278825,-0.439290,-1.741618,-2.742211,-0.773975,-0.973337,0.671260,-2.241767,0.515049,2.081715,0.401718,2.515075,2.204250,-0.197985,SEKER
4,-0.781239,-1.097384,-1.380471,-0.266663,-2.117993,-4.535028,-0.784286,-0.966080,0.476020,0.804772,1.874992,2.765330,0.118268,3.270983,3.013462,0.939640,SEKER
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13606,-0.373471,-0.446083,-0.366669,-0.363055,-0.123703,0.153343,-0.378191,-0.364148,-0.716284,0.684173,0.727872,0.032433,0.261425,0.055630,-0.006086,0.760813,DERMASON
13607,-0.373334,-0.456336,-0.450053,-0.257015,-0.432979,-0.165141,-0.378662,-0.363962,1.022933,0.774384,0.818807,0.362794,0.110384,0.285117,0.328393,0.722659,DERMASON
13608,-0.372038,-0.447833,-0.450478,-0.246456,-0.448618,-0.182940,-0.376143,-0.362197,-0.403392,0.591370,0.758468,0.370533,0.104269,0.289204,0.336328,0.390251,DERMASON
13609,-0.371765,-0.427029,-0.428974,-0.266742,-0.380735,-0.106960,-0.372851,-0.361825,-0.903414,0.143717,0.581753,0.285098,0.141906,0.228375,0.248973,0.036440,DERMASON
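
`StandardScaler` rescales each column to zero mean and unit variance, z = (x - mean) / std, where std is the population standard deviation (`ddof=0`). A minimal NumPy sketch of the same transformation, applied to the first five Area values from the unscaled table:

```python
import numpy as np

# First five Area values from the unscaled table above
area = np.array([28395, 28734, 29380, 30008, 30140], dtype=float)

# np.std defaults to ddof=0, the same convention StandardScaler uses
z = (area - area.mean()) / area.std()

print(z)
print(z.mean(), z.std())
```

The z-values here differ from the scaled table because the mean and standard deviation are computed from five rows rather than from all 13,611.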


# CHECK FOR NULL VALUES

In [10]:
df.isnull().sum()

Area               0
Perimeter          0
MajorAxisLength    0
MinorAxisLength    0
AspectRation       0
Eccentricity       0
ConvexArea         0
EquivDiameter      0
Extent             0
Solidity           0
roundness          0
Compactness        0
ShapeFactor1       0
ShapeFactor2       0
ShapeFactor3       0
ShapeFactor4       0
Class              0
dtype: int64

# TRAIN-TEST SPLIT

In [11]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y,test_size=0.2,random_state=42)
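
In the cells above the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data (the `_demo` names and the generated values are assumptions for illustration only), is to split first and fit the scaler on the training rows alone:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first; stratify keeps class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on the training rows only, then apply the same transform to both parts
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

With this ordering the training columns have exactly zero mean, while the test columns are only approximately centred, which is the expected behaviour.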

# K-FOLD IMPLEMENTATION

In [22]:
class kFold:
    """Minimal k-fold cross-validation helper."""

    def Split(self, dataset, Folds):
        # Randomly partition the rows into `Folds` equally sized folds.
        # Rows left over when the row count is not divisible by `Folds` are unused.
        dataSplit = list()
        dataCopy = dataset
        foldSize = dataCopy.shape[0] // Folds
        for _ in range(Folds):
            fold = list()
            while len(fold) < foldSize:
                # Draw a random remaining row, then drop it so it cannot be reused.
                r = randrange(dataCopy.shape[0])
                index = dataCopy.index[r]
                fold.append(dataCopy.loc[index].values.tolist())
                dataCopy = dataCopy.drop(index)
            # dtype=object stops NumPy from casting the numeric features to
            # strings alongside the class label in the last column.
            dataSplit.append(np.array(fold, dtype=object))
        return dataSplit

    def crossvalscore(self, dataset, Folds, mod, *args):
        model = mod
        data = self.Split(dataset, Folds)
        scores = []
        for i in range(Folds):
            # Train on every fold except fold i, then test on fold i.
            file = np.concatenate([data[j] for j in range(Folds) if j != i], axis=0)
            model.fit(file[:, :-1], file[:, -1])
            predicted = model.predict(data[i][:, :-1])
            actual = data[i][:, -1]
            scores.append(accuracy_score(actual, predicted) * 100)

        print('KFold-Scores: %s' % scores)
        print('\nMean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))
        # Note: this report covers only the last fold, not all folds combined.
        print('\n %s' % classification_report(actual, predicted))
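
The `Split` method draws one row at a time and drops it from the frame, which copies the DataFrame on every draw, and it leaves out the remainder when the row count is not divisible by the number of folds (13,611 rows in 10 folds leaves one row unused). A possible alternative, shown only as a sketch and not the method used for the results in this notebook, is to shuffle the row positions once and slice them. `kfold_indices` is a hypothetical helper name:

```python
import numpy as np

def kfold_indices(n_rows, n_folds, seed=None):
    """Shuffle row positions once and cut them into n_folds nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_rows), n_folds)

folds = kfold_indices(n_rows=13611, n_folds=10, seed=42)

for i, test_idx in enumerate(folds):
    # Every fold except fold i forms the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # df_scaled.iloc[train_idx] and df_scaled.iloc[test_idx] would select the rows

print([len(f) for f in folds])
```

Every row lands in exactly one fold, and passing a seed makes the partition reproducible between runs.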
    



# 1. DECISION TREE CLASSIFIER

In [23]:
from sklearn.tree import DecisionTreeClassifier

params_DT = {'criterion' : ['gini', 'entropy'],
             'max_leaf_nodes': list(range(2, 100)) , 
             'min_samples_split': list(range(2, 10)), 
             'max_depth': list(range(2, 12))}
DT_cv = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), params_DT, verbose=1, cv=10, n_jobs=-1,random_state=42)
DT_cv.fit(X_train, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


RandomizedSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=42),
                   n_jobs=-1,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11],
                                        'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8,
                                                           9, 10, 11, 12, 13,
                                                           14, 15, 16, 17, 18,
                                                           19, 20, 21, 22, 23,
                                                           24, 25, 26, 27, 28,
                                                           29, 30, 31, ...],
                                        'min_samples_split': [2, 3, 4, 5, 6, 7,
                                                              8, 9]},
                   random_state=42, verbose=1)
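
The log line "Fitting 10 folds for each of 10 candidates" reflects the default `n_iter=10` of `RandomizedSearchCV`: only 10 parameter combinations are sampled from the search space. A quick count of how many combinations that space contains (plain arithmetic, no model fitting):

```python
from math import prod

params_DT = {'criterion': ['gini', 'entropy'],
             'max_leaf_nodes': list(range(2, 100)),
             'min_samples_split': list(range(2, 10)),
             'max_depth': list(range(2, 12))}

# Number of distinct combinations an exhaustive grid search would try
grid_size = prod(len(values) for values in params_DT.values())
n_iter = 10  # RandomizedSearchCV default

print(grid_size)  # 15680
print(f"sampled fraction: {n_iter / grid_size:.4%}")
```

Because such a small share of the space is explored, raising `n_iter` is one option to consider if the tuned model looks under-fitted.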

# Best Parameters for Decision Tree

In [24]:
print('Best Estimator: %s' % DT_cv.best_estimator_)

Best Estimator: DecisionTreeClassifier(max_depth=9, max_leaf_nodes=99, min_samples_split=3,
                       random_state=42)


In [25]:
DT=kFold()

In [26]:
DT.crossvalscore(df_scaled,10,DT_cv.best_estimator_)

KFold-Scores: [91.62380602498163, 91.6972814107274, 90.66862601028656, 89.27259368111683, 91.18295371050698, 91.10947832476121, 91.03600293901543, 90.52167523879501, 91.25642909625276, 91.6972814107274]

Mean Accuracy: 91.007%

               precision    recall  f1-score   support

    BARBUNYA       0.96      0.86      0.91       125
      BOMBAY       0.98      1.00      0.99        51
        CALI       0.89      0.91      0.90       141
    DERMASON       0.92      0.92      0.92       371
       HOROZ       0.94      0.97      0.95       215
       SEKER       0.94      0.94      0.94       180
        SIRA       0.87      0.87      0.87       278

    accuracy                           0.92      1361
   macro avg       0.93      0.93      0.93      1361
weighted avg       0.92      0.92      0.92      1361
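
Two numbers in this output can be reproduced by hand: the mean accuracy is the arithmetic mean of the ten fold scores, and each f1-score is the harmonic mean of precision and recall. The values below are copied from the output above; the helper name `f1` is illustrative.

```python
from statistics import mean

fold_scores = [91.62380602498163, 91.6972814107274, 90.66862601028656,
               89.27259368111683, 91.18295371050698, 91.10947832476121,
               91.03600293901543, 90.52167523879501, 91.25642909625276,
               91.6972814107274]

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(mean(fold_scores), 3))   # 91.007
print(round(f1(0.96, 0.86), 2))      # 0.91, the BARBUNYA row
print(round(fold_scores[-1] / 100, 2))
```

The last line shows why the report's accuracy (0.92) differs slightly from the mean accuracy: the classification report is computed on the final fold only, whose score is about 91.7%.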



# 2. RANDOM FOREST CLASSIFIER

In [41]:
from sklearn.ensemble import RandomForestClassifier

n_estimators = [int(x) for x in np.linspace(start = 10, stop = 80, num = 10)]
max_features = ['auto', 'sqrt']  # note: 'auto' (an alias of 'sqrt' for classifiers) is rejected by scikit-learn >= 1.3
max_depth = [2,4]
min_samples_split = [2, 5]
min_samples_leaf = [1, 2]
bootstrap = [True, False]
params_RF = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

RF_cv = RandomizedSearchCV(RandomForestClassifier(random_state=42,n_jobs=-1), params_RF, verbose=1, cv=10, n_jobs=-1,random_state=42)
RF_cv.fit(X_train, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


RandomizedSearchCV(cv=10,
                   estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [2, 4],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2],
                                        'min_samples_split': [2, 5],
                                        'n_estimators': [10, 17, 25, 33, 41, 48,
                                                         56, 64, 72, 80]},
                   random_state=42, verbose=1)

# Best Parameters for Random Forest

In [30]:
print('Best Estimator: %s' % RF_cv.best_estimator_)

Best Estimator: RandomForestClassifier(max_depth=4, max_features='sqrt', min_samples_leaf=2,
                       n_estimators=33, n_jobs=-1, random_state=42)


In [31]:
RF=kFold()

In [32]:
RF.crossvalscore(df_scaled,10,RF_cv.best_estimator_)

KFold-Scores: [88.53783982365907, 87.80308596620132, 88.09698750918442, 87.14180749448934, 88.09698750918442, 88.53783982365907, 88.02351212343865, 89.6399706098457, 87.95003673769287, 90.52167523879501]

Mean Accuracy: 88.435%

               precision    recall  f1-score   support

    BARBUNYA       0.83      0.69      0.76       108
      BOMBAY       1.00      1.00      1.00        46
        CALI       0.82      0.90      0.86       152
    DERMASON       0.93      0.93      0.93       361
       HOROZ       0.98      0.92      0.95       221
       SEKER       0.96      0.94      0.95       201
        SIRA       0.84      0.90      0.87       272

    accuracy                           0.91      1361
   macro avg       0.91      0.90      0.90      1361
weighted avg       0.91      0.91      0.91      1361



# 3. KNN CLASSIFIER

In [33]:
from sklearn.neighbors import KNeighborsClassifier

leaf_size = list(range(1,50))
n_neighbors = list(range(1,30))
p=[1,2]

params_KNN = {'leaf_size':leaf_size, 
              'n_neighbors':n_neighbors, 
              'p':p}

KNN_cv = RandomizedSearchCV(KNeighborsClassifier(), params_KNN, verbose=1, cv=10, n_jobs=-1,random_state=42)
KNN_cv.fit(X_train, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


RandomizedSearchCV(cv=10, estimator=KNeighborsClassifier(), n_jobs=-1,
                   param_distributions={'leaf_size': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19, 20, 21,
                                                      22, 23, 24, 25, 26, 27,
                                                      28, 29, 30, ...],
                                        'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8,
                                                        9, 10, 11, 12, 13, 14,
                                                        15, 16, 17, 18, 19, 20,
                                                        21, 22, 23, 24, 25, 26,
                                                        27, 28, 29],
                                        'p': [1, 2]},
                   random_state=42, verbose=1)

# Best Parameters for KNN

In [34]:
print('Best Estimator: %s' % KNN_cv.best_estimator_)

Best Estimator: KNeighborsClassifier(leaf_size=38, n_neighbors=12)
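
The `p` entry in `params_KNN` is the power of the Minkowski distance: `p=1` gives the Manhattan distance and `p=2` the Euclidean distance. The best estimator printed above does not list `p`, which suggests the default `p=2` was selected. A small sketch of the distance itself (the `minkowski` helper is written here for illustration and is not a scikit-learn function):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

point_a, point_b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(point_a, point_b, p=1))  # Manhattan: |3| + |4|
print(minkowski(point_a, point_b, p=2))  # Euclidean: sqrt(3^2 + 4^2)
```

Because both metrics sum over all features, they are sensitive to feature scale, which is one reason the features were standardized before fitting KNN.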


In [35]:
KNN=kFold()

In [36]:
KNN.crossvalscore(df_scaled,10,KNN_cv.best_estimator_)

KFold-Scores: [92.87288758265981, 93.24026451138869, 92.65246142542249, 92.13813372520205, 92.21160911094783, 92.2850844966936, 92.94636296840558, 92.43203526818516, 92.65246142542249, 92.35855988243938]

Mean Accuracy: 92.579%

               precision    recall  f1-score   support

    BARBUNYA       0.95      0.86      0.90       129
      BOMBAY       1.00      1.00      1.00        56
        CALI       0.87      0.95      0.91       157
    DERMASON       0.92      0.95      0.93       358
       HOROZ       0.96      0.90      0.93       204
       SEKER       0.97      0.95      0.96       197
        SIRA       0.87      0.88      0.88       260

    accuracy                           0.92      1361
   macro avg       0.93      0.93      0.93      1361
weighted avg       0.92      0.92      0.92      1361



# 4. NAIVE BAYES CLASSIFIER

In [37]:
from sklearn.naive_bayes import GaussianNB

params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
NB_cv = RandomizedSearchCV(GaussianNB(), params_NB, cv=10,verbose=1,n_jobs=-1,random_state=42)
NB_cv.fit(X_train , y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


RandomizedSearchCV(cv=10, estimator=GaussianNB(), n_jobs=-1,
                   param_distributions={'var_smoothing': array([1.00000000e+00, 8.11130831e-01, 6.57933225e-01, 5.33669923e-01,
       4.32876128e-01, 3.51119173e-01, 2.84803587e-01, 2.31012970e-01,
       1.87381742e-01, 1.51991108e-01, 1.23284674e-01, 1.00000000e-01,
       8.11130831e-02, 6.57933225e-02, 5.33669923e-02, 4.32876128e-02,
       3.511191...
       1.23284674e-07, 1.00000000e-07, 8.11130831e-08, 6.57933225e-08,
       5.33669923e-08, 4.32876128e-08, 3.51119173e-08, 2.84803587e-08,
       2.31012970e-08, 1.87381742e-08, 1.51991108e-08, 1.23284674e-08,
       1.00000000e-08, 8.11130831e-09, 6.57933225e-09, 5.33669923e-09,
       4.32876128e-09, 3.51119173e-09, 2.84803587e-09, 2.31012970e-09,
       1.87381742e-09, 1.51991108e-09, 1.23284674e-09, 1.00000000e-09])},
                   random_state=42, verbose=1)

# Best Parameters for Naive Bayes

In [38]:
print('Best Estimator: %s' % NB_cv.best_estimator_)

Best Estimator: GaussianNB(var_smoothing=0.0002848035868435802)
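
`var_smoothing` is the fraction of the largest feature variance that `GaussianNB` adds to every variance for numerical stability. The sketch below locates the selected value in the search grid and shows the resulting additive term; the unit variances are an assumption that holds for standardized columns.

```python
import numpy as np

# Same grid as params_NB: 100 values from 1 down to 1e-9, evenly spaced in log10
grid = np.logspace(0, -9, num=100)
best = 0.0002848035868435802  # value reported in the output above

# Position of the selected value within the grid
idx = int(np.argmin(np.abs(grid - best)))
print(idx, grid[idx])

# Standardized features have unit variance, so the largest variance is 1
feature_variances = np.array([1.0, 1.0, 1.0])
epsilon = best * feature_variances.max()
print(epsilon)
```

On standardized data the term added to each variance is therefore roughly equal to `var_smoothing` itself.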


In [39]:
NB=kFold()

In [40]:
NB.crossvalscore(df_scaled,10,NB_cv.best_estimator_)

KFold-Scores: [88.90521675238794, 89.86039676708303, 88.31741366642176, 88.53783982365907, 90.30124908155767, 90.30124908155767, 91.03600293901543, 90.66862601028656, 89.12564290962528, 90.52167523879501]

Mean Accuracy: 89.758%

               precision    recall  f1-score   support

    BARBUNYA       0.91      0.87      0.89       137
      BOMBAY       1.00      1.00      1.00        51
        CALI       0.87      0.91      0.89       158
    DERMASON       0.91      0.90      0.90       359
       HOROZ       0.93      0.95      0.94       182
       SEKER       0.94      0.96      0.95       215
        SIRA       0.85      0.83      0.84       259

    accuracy                           0.91      1361
   macro avg       0.92      0.92      0.92      1361
weighted avg       0.90      0.91      0.90      1361
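
Collecting the four mean accuracies printed above in one place makes the comparison easier to read. This is only a bookkeeping sketch: the numbers are copied from the outputs above, and each model was scored on a different random partition, so small gaps should not be over-interpreted.

```python
# Mean 10-fold accuracies (%) copied from the outputs above
mean_accuracy = {
    "Decision Tree": 91.007,
    "Random Forest": 88.435,
    "KNN": 92.579,
    "Naive Bayes": 89.758,
}

# Sort from highest to lowest mean accuracy
ranking = sorted(mean_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<14} {acc:.3f}%")
```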

