# Classification Model For Predicting The Class of Wheat Kernels

Author: Nishant Sahni

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import normalize
from sklearn.svm import SVC
from sklearn.kernel_ridge import KernelRidge
from math import ceil, floor

The data is now loaded into a pandas DataFrame using the read_table() function.

In [2]:
# sep=r'\s+' splits on any run of whitespace (it replaces the deprecated delim_whitespace=True)
data = pd.read_table('/Users/Nishant/Desktop/Machine Learning/Exam/classification_wheat_kernel_data.txt',
                     sep=r'\s+', header=0, names=('area', 'perimeter', 'compactness',
                                                  'kernel_length', 'kernel_width', 'asym',
                                                  'groove_length', 'type'))

The summary statistics of the data are then examined with the following code. The data.describe() function reports the count, mean, standard deviation, minimum, quartiles and maximum of each column. The data.corr() function gives the pairwise correlation between all columns. The data.isna() function returns a boolean table that marks missing entries, which helps to find missing or erroneous values in the data.

In [38]:
data.head()
print(data.keys())

print(data.describe())
print(data.corr())
print(data.isna())

Index(['area', 'perimeter', 'compactness', 'kernel_length', 'kernel_width',
       'asym', 'groove_length', 'type'],
      dtype='object')
             area   perimeter  compactness  kernel_length  kernel_width  \
count  210.000000  210.000000   210.000000     210.000000    210.000000   
mean    14.847524   14.559286     0.870999       5.628533      3.258605   
std      2.909699    1.305959     0.023629       0.443063      0.377714   
min     10.590000   12.410000     0.808100       4.899000      2.630000   
25%     12.270000   13.450000     0.856900       5.262250      2.944000   
50%     14.355000   14.320000     0.873450       5.523500      3.237000   
75%     17.305000   15.715000     0.887775       5.979750      3.561750   
max     21.180000   17.250000     0.918300       6.675000      4.033000   

             asym  groove_length        type  
count  210.000000     210.000000  210.000000  
mean     3.700201       5.408071    2.000000  
std      1.503557       0.491480    0.818448

Examining the information above, together with the raw data file, showed that the data contained missing values and badly formatted entries. These were rectified before proceeding.
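The check for missing or badly formatted values can be made more compact by counting them per column. The snippet below is only a sketch: the frame `toy` and its values are hypothetical stand-ins, not the wheat kernel data, and dropping incomplete rows is just one possible way to rectify the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the wheat kernel data.
toy = pd.DataFrame({
    'area': ['15.26', '14.88', '?', '13.84'],   # '?' is a badly formatted entry
    'perimeter': [14.84, np.nan, 14.09, 13.94],  # one genuinely missing value
    'type': [1, 1, 2, 3],
})

# Convert every column to numbers; entries that cannot be parsed become NaN.
cleaned = toy.apply(pd.to_numeric, errors='coerce')

# Count the missing values per column instead of printing the full boolean table.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# One simple policy (an assumption here): keep only the complete rows.
complete = cleaned.dropna()
print(complete.shape)
```

Counting with isna().sum() scales better than reading the boolean table by eye, because the output has one line per column regardless of the number of rows.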

Next, we compute the correlation of each feature with the target (type).

In [4]:
for item in data:
	if item != 'type':
		corr = float(data[item].corr(data['type']))
		print(item, corr)

area -0.3460578672033167
perimeter -0.32789969778257677
compactness -0.5310070238941204
kernel_length -0.2572687006481211
kernel_width -0.4234628716721287
asym 0.5772727110447099
groove_length 0.024301043067281567


Based on the information above, we decide not to drop any features.
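The same feature-to-target correlations can be obtained without an explicit loop. This is a minimal sketch on randomly generated data; the frame `toy` is hypothetical and only borrows some of the column names used above. Since the target is a class label coded as 1, 2, 3, a Pearson correlation is at best a rough screening tool.

```python
import numpy as np
import pandas as pd

# Hypothetical random data that only reuses some of the column names.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'area': rng.normal(15, 3, 60),
    'asym': rng.normal(3.7, 1.5, 60),
    'type': np.repeat([1, 2, 3], 20),
})

# Correlate every feature column with the target in one call.
feature_corr = toy.drop(columns='type').corrwith(toy['type'])

# Rank the features by the strength of the correlation, ignoring the sign.
print(feature_corr.abs().sort_values(ascending=False))
```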

In [5]:
feature_columns = ['area', 'perimeter', 'compactness', 'kernel_length', 'kernel_width', 'asym', 'groove_length']
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same NumPy arrays
X = data[feature_columns].to_numpy()
y = data[['type']].to_numpy()

As the code above shows, the feature matrix X and the target y are extracted from the data set. The data is then divided into a training set and a testing set with an 80:20 split, so that the model can be trained on the training data and its prediction accuracy measured on the unseen testing data.

In [6]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)
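With 210 samples, a test_size of 0.20 holds out 42 samples for testing. The sketch below illustrates this on placeholder arrays and additionally passes the stratify argument, which is not used in the cell above, to keep the class proportions equal in both parts. The arrays `X_toy` and `y_toy` are hypothetical, and the assumption of 70 samples per class is ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 210 samples, 7 features, and (assumed) 70 samples per class.
X_toy = np.arange(210 * 7, dtype=float).reshape(210, 7)
y_toy = np.repeat([1, 2, 3], 70)

# stratify=y_toy keeps the class proportions the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=10, stratify=y_toy)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te)[1:])  # test samples per class (labels 1, 2, 3)
```

Without stratification the class counts in the test set depend on the random seed, which matters for a test set of only 42 samples.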

# Logistic Regression

We now apply Logistic Regression to our data and use GridSearchCV to perform 5-fold cross validation. Cross validation guards against overfitting by repeatedly training the model on a subset of the training data and evaluating it on the held-out remainder. Note that param_grid is left empty here, so no hyperparameters are tuned: GridSearchCV only cross-validates the default LogisticRegression settings.

In [22]:
log = LogisticRegression()
logreg = GridSearchCV(log, cv=5, param_grid={})
logreg.fit(x_train, np.ravel(y_train))

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
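Because param_grid is empty in the cell above, the search has nothing to choose between. For comparison, here is a minimal sketch of a search over the regularization strength C. It runs on synthetic data from make_classification rather than on the wheat data, and the candidate values of C, the scaling step and max_iter=1000 are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 7 features (a stand-in for the wheat data).
X_demo, y_demo = make_classification(
    n_samples=210, n_features=7, n_informative=4, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0)

# Scaling lives inside the pipeline, so it is refitted on every training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical candidate values for the inverse regularization strength C.
grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid=grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```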

The cross validation score, the training set accuracy and the test set accuracy (printed as "Linear accuracy score") are then obtained for the model.

In [23]:
print("")
print("SCORES FOR LOGISTIC REGRESSION:")
print("")

print("Gridsearch CV score: ", logreg.best_score_)
print("Training set score: ", logreg.score(x_train, y_train))
print("Linear accuracy score: ", logreg.score(x_test, y_test))


SCORES FOR LOGISTIC REGRESSION:

Gridsearch CV score:  0.89880952381
Training set score:  0.922619047619
Linear accuracy score:  0.952380952381


We now use the trained model to predict the classes of the test data.

In [24]:
predictions = logreg.predict(x_test)

The best parameters selected by GridSearchCV and the prediction scores for Logistic Regression are as follows:

In [25]:
print("Best Parameters Selected: ", logreg.best_params_)
print("Accuracy score for the prediction: ", accuracy_score(y_test, predictions))
print("Confusion Matrix:") 
print(confusion_matrix(y_test, predictions))

Best Parameters Selected:  {}
Accuracy score for the prediction:  0.952380952381
Confusion Matrix:
[[14  1  1]
 [ 0 16  0]
 [ 0  0 10]]


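As we can see from above, the model classifies 40 of the 42 test samples correctly, which gives the accuracy score of 0.952. Both errors are class 1 kernels: one is predicted as class 2 and one as class 3.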
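The accuracy score can be recovered directly from the confusion matrix, which makes the link between the two outputs explicit. The sketch below copies the matrix printed above; the per-class recall and precision are extra quantities that the notebook itself does not report.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
cm = np.array([[14, 1, 1],
               [0, 16, 0],
               [0, 0, 10]])

# Accuracy is the number of correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the actual class size (row sums).
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions divided by the predicted class size (column sums).
precision = np.diag(cm) / cm.sum(axis=0)

print(accuracy)
print(recall)
print(precision)
```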

# SVM (Hard and Soft Margin)

Now we will try both hard and soft margin SVM with our data.

In [27]:
svc = SVC()
svm_soft = GridSearchCV(svc, cv=5, param_grid={'C': [0.1, 0.5, 1, 2, 5], 'kernel': ['linear', 'poly', 'rbf']})

As the code above shows, GridSearchCV with 5-fold cross validation is used to select the best parameters for Soft Margin SVM. The candidate values of C are 0.1, 0.5, 1, 2 and 5, as specified in the question, and the candidate kernels are linear, polynomial and Gaussian (rbf). We then train the model with our training data.

In [28]:
svm_soft.fit(x_train, np.ravel(y_train))

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 0.5, 1, 2, 5], 'kernel': ['linear', 'poly', 'rbf']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

A set of scores is then obtained for our model: the cross validation score, the training set accuracy and the test set accuracy (printed as "Linear accuracy score").

In [30]:
print("")
print("SCORES FOR SOFT MARGIN SVM:")
print("")

print("Gridsearch CV score: ", svm_soft.best_score_)
print("Training set score: ", svm_soft.score(x_train, y_train))
print("Linear accuracy score: ", svm_soft.score(x_test, y_test))
print("Best Parameters Selected: ", svm_soft.best_params_)


SCORES FOR SOFT MARGIN SVM:

Gridsearch CV score:  0.964285714286
Training set score:  0.988095238095
Linear accuracy score:  0.97619047619
Best Parameters Selected:  {'C': 1, 'kernel': 'poly'}


We can see from above that the best parameters selected for Soft Margin SVM by GridSearchCV are C = 1 and kernel = polynomial.
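GridSearchCV also stores the mean cross validation score of every parameter combination, not only of the winner, in its cv_results_ attribute. The sketch below tabulates these scores for the same grid of C values and kernels. It is fitted on synthetic data from make_classification, so the numbers it prints say nothing about the wheat data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 3-class problem with 7 features (a stand-in for the wheat data).
X_demo, y_demo = make_classification(
    n_samples=210, n_features=7, n_informative=4, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0)

search = GridSearchCV(
    SVC(), cv=5,
    param_grid={'C': [0.1, 0.5, 1, 2, 5], 'kernel': ['linear', 'poly', 'rbf']})
search.fit(X_demo, y_demo)

# One row per parameter combination; pivot into a C-by-kernel table of mean CV scores.
results = pd.DataFrame(search.cv_results_)
table = results.pivot(index='param_C', columns='param_kernel',
                      values='mean_test_score')
print(table)
```

Looking at the whole table shows how close the runner-up combinations are to the selected one, which is useful when several settings score almost identically.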

The classes of the test data are then predicted with our trained SVM model.

In [31]:
svm_soft_predictions = svm_soft.predict(x_test)

Some prediction scores are then obtained.

In [36]:
print("Accuracy score for the prediction: ", accuracy_score(y_test, svm_soft_predictions))
print("Confusion Matrix:") 
print(confusion_matrix(y_test, svm_soft_predictions))

Accuracy score for the prediction:  0.97619047619
Confusion Matrix:
[[15  0  1]
 [ 0 16  0]
 [ 0  0 10]]


As seen above, 41 of the 42 test samples are classified correctly. The single error is a class 1 kernel that is predicted as class 3.

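We next move on to try Hard Margin SVM with our data. SVC has no explicit hard margin option, so the hard margin is approximated with large values of C, which penalize margin violations heavily. GridSearchCV is used to select between C values of 100 and 1000, and between linear, polynomial and Gaussian kernels.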

In [20]:
svm_hard = GridSearchCV(svc, cv=5, param_grid={'C': [100, 1000], 'kernel': ['linear', 'poly', 'rbf']})

The model is trained with the training data.

In [21]:
svm_hard.fit(x_train, np.ravel(y_train))

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [100, 1000], 'kernel': ['linear', 'poly', 'rbf']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
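The effect of C on the margin can be made visible with a linear kernel, where the margin width is 2 / ||w||. The following sketch compares a very small and a very large C on two well separated synthetic clusters. The data, the helper `margin_width` and the two C values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well separated synthetic clusters, so that a hard margin exists.
X_demo, y_demo = make_blobs(n_samples=80, centers=[[-2, -2], [2, 2]],
                            cluster_std=0.5, random_state=0)

def margin_width(C):
    """Fit a linear SVC and return (margin width, training accuracy)."""
    clf = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    width = 2.0 / np.linalg.norm(clf.coef_)
    return width, clf.score(X_demo, y_demo)

soft_width, soft_acc = margin_width(0.001)  # heavily regularized, soft margin
hard_width, hard_acc = margin_width(1000)   # large C approximates a hard margin

print(soft_width, soft_acc)
print(hard_width, hard_acc)
```

A small C tolerates points inside the margin in exchange for a wider margin, while a large C shrinks the margin until no training point violates it.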

The GridSearchCV cross validation score and the best hyperparameters selected are shown as follows:

In [33]:
print("")
print("SCORES FOR HARD MARGIN SVM:")
print("")
print("Gridsearch CV score: ", svm_hard.best_score_)
print("Training set score: ", svm_hard.score(x_train, y_train))
print("Linear accuracy score: ", svm_hard.score(x_test, y_test))
print("Best Parameters Selected: ", svm_hard.best_params_)


SCORES FOR HARD MARGIN SVM:

Gridsearch CV score:  0.964285714286
Training set score:  1.0
Linear accuracy score:  0.97619047619
Best Parameters Selected:  {'C': 100, 'kernel': 'poly'}


The value of C is selected as 100 and the kernel is selected as polynomial by GridSearchCV. We then make predictions based on the test data.

In [34]:
svm_hard_predictions = svm_hard.predict(x_test)

The accuracy score for the prediction and the confusion matrix are then obtained.

In [35]:
print("Accuracy score for the prediction: ", accuracy_score(y_test, svm_hard_predictions))
print("Confusion Matrix:") 
print(confusion_matrix(y_test, svm_hard_predictions))

Accuracy score for the prediction:  0.97619047619
Confusion Matrix:
[[15  0  1]
 [ 0 16  0]
 [ 0  0 10]]


# Kernelized Ridge Regression

A modified version of Kernelized Ridge Regression is then attempted, where the continuous predictions are rounded to class labels so that the regressor can be used for classification.

In [31]:
kernreg = KernelRidge()

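The training and testing data are then normalized. Note that normalize() with norm='l2' rescales each sample (row) to unit length. It does not bring the individual features (columns) onto a common scale.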

In [32]:
x_train_normalized = normalize(x_train, norm='l2')
x_test_normalized = normalize(x_test, norm='l2')
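The difference between row-wise normalization and column-wise standardization is easy to miss, so here is a small sketch that contrasts the two. The numbers in `X_demo` are hypothetical, and StandardScaler is shown only as a commonly used alternative, not as the method used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical samples with three features on different scales.
X_demo = np.array([[15.26, 14.84, 0.871],
                   [14.88, 14.57, 0.881],
                   [11.42, 12.86, 0.868],
                   [19.11, 16.26, 0.908]])

# normalize() works per row: every sample is rescaled to unit L2 length.
row_normalized = normalize(X_demo, norm='l2')
print(np.linalg.norm(row_normalized, axis=1))

# StandardScaler works per column: every feature gets zero mean and unit variance.
scaler = StandardScaler().fit(X_demo)
col_standardized = scaler.transform(X_demo)
print(col_standardized.mean(axis=0).round(6))
print(col_standardized.std(axis=0).round(6))
```

When a scaler is fitted, it should be fitted on the training data only and then applied to the test data, which normalize() sidesteps because it has no fitted state.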

GridSearchCV is then used with 5-fold cross validation along with the specified parameters to obtain the best ones. The model is then trained.

In [33]:
ridge = GridSearchCV(kernreg, cv=5, param_grid=[{'kernel': ['linear']}, {'alpha': [1], 'kernel': ['poly'], 'gamma': [1], 'degree': [2, 3, 4]}, {'kernel': ['rbf'], 'gamma': [0.1, 0.5, 1, 2, 4]}])
ridge.fit(x_train_normalized, np.ravel(y_train))

GridSearchCV(cv=5, error_score='raise',
       estimator=KernelRidge(alpha=1, coef0=1, degree=3, gamma=None, kernel='linear',
      kernel_params=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'kernel': ['linear']}, {'alpha': [1], 'kernel': ['poly'], 'gamma': [1], 'degree': [2, 3, 4]}, {'kernel': ['rbf'], 'gamma': [0.1, 0.5, 1, 2, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

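A series of scores is then obtained, along with the parameter values selected by GridSearchCV. Because KernelRidge is a regressor, its score() method (and therefore best_score_) reports the coefficient of determination R^2, not the classification accuracy.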

In [34]:
print("")
print("SCORES FOR KERNALIZED RIDGE:")
print("")

print("Gridsearch score: ", ridge.best_score_)
print("Training set score: ", ridge.score(x_train_normalized, y_train))
print("Linear accuracy score: ", ridge.score(x_test_normalized, y_test))
print("Best Parameters Selected: ", ridge.best_params_)


SCORES FOR KERNALIZED RIDGE:

Gridsearch score:  0.515191774704
Training set score:  0.564450288044
Linear accuracy score:  0.521164690278
Best Parameters Selected:  {'alpha': 1, 'degree': 4, 'gamma': 1, 'kernel': 'poly'}
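The scores printed above are R^2 values, because KernelRidge is a regressor, so they are not directly comparable with the accuracy scores of the classifiers. The sketch below shows both quantities side by side on synthetic data. The data, the kernel settings and the rounding to the nearest label are all assumptions made for illustration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import accuracy_score, r2_score

# Synthetic data: three classes coded 1, 2, 3, with one informative feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 3))
y_demo = np.repeat([1, 2, 3], 30)
X_demo[:, 0] += y_demo

model = KernelRidge(kernel='rbf', gamma=0.5, alpha=1.0).fit(X_demo, y_demo)
pred = model.predict(X_demo)

# score() of a regressor is the coefficient of determination R^2.
r2 = model.score(X_demo, y_demo)

# Accuracy only exists after the continuous outputs are mapped to class labels.
labels = np.clip(np.rint(pred), 1, 3).astype(int)
acc = accuracy_score(y_demo, labels)

print(r2, r2_score(y_demo, pred))
print(acc)
```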


The best parameters selected by GridSearchCV are shown above. We then predict the class values and round them to integer labels to obtain the accuracy score and confusion matrix.

In [35]:
ridge_predictions = ridge.predict(x_test_normalized)
ridge_predictions = ridge_predictions.tolist()
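# Round the continuous outputs to integer labels: values in [1, 1.5] are floored to 1,
# everything else is rounded up (so values above 2 become 3 and values above 3 become 4)
for i in range(len(ridge_predictions)):
    if ridge_predictions[i] < 1 or ridge_predictions[i] > 1.5:
        ridge_predictions[i] = ceil(ridge_predictions[i])
    else:
        ridge_predictions[i] = floor(ridge_predictions[i])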

ridge_predictions = np.asarray(ridge_predictions)
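The thresholds in the loop above round every value greater than 1.5 upwards, so a raw prediction of 2.3 becomes 3 and a raw prediction above 3 becomes 4, a label that does not exist in the data. The sketch below reproduces that rule on a few hypothetical raw values and compares it with rounding to the nearest valid label. The helper `nearest_label` is our own illustration, not part of the original notebook, and applying it to the wheat data would change the reported scores.

```python
import numpy as np
from math import ceil, floor

# Hypothetical raw regression outputs.
raw = [0.4, 1.2, 1.6, 2.3, 2.7, 3.4]

def original_rule(p):
    """The rule used in the loop above."""
    if p < 1 or p > 1.5 or p > 2.5:
        return ceil(p)
    return floor(p)

def nearest_label(values, lowest=1, highest=3):
    """Round to the nearest integer and clip to the valid label range."""
    return np.clip(np.rint(values), lowest, highest).astype(int)

print([original_rule(p) for p in raw])   # [1, 1, 2, 3, 3, 4]
print(nearest_label(raw).tolist())       # [1, 1, 2, 2, 3, 3]
```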

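The following scores are obtained. The fourth row and column of the confusion matrix belong to the label 4, which the rounding step assigns to raw predictions greater than 3.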

In [36]:
print("Accuracy score for the prediction: ", accuracy_score(y_test, ridge_predictions))
print("Confusion Matrix:") 
print(confusion_matrix(y_test, ridge_predictions))

Accuracy score for the prediction:  0.52380952381
Confusion Matrix:
[[5 8 3 0]
 [3 9 4 0]
 [0 0 8 2]
 [0 0 0 0]]


# Conclusion

By observing the accuracy scores and confusion matrices for the models, it is concluded that both Hard Margin SVM (C = 100, polynomial kernel) and Soft Margin SVM (C = 1, polynomial kernel) are good models for this data, as each gives an accuracy score of 0.97619047619 on the test set. The accuracy score is the number of correct predictions divided by the total number of predictions made. The confusion matrix, on the other hand, is a matrix whose rows correspond to the actual classes and whose columns correspond to the predicted classes. Each cell holds the number of test samples with that combination of actual and predicted class, so the correct predictions lie on the diagonal.