# Classification Model For Predicting Class Labels Based on B, G and R Values

Author: Nishant Sahni

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import normalize
from sklearn.svm import SVC
from sklearn.kernel_ridge import KernelRidge
from math import ceil, floor

The data is first loaded into a pandas DataFrame, and summary statistics and the correlation between columns are obtained. The data.describe() function reports the count, mean, standard deviation, minimum, quartiles and maximum of each column. The data.isna() function returns a boolean mask marking missing values; it flags only missing entries (NaN), not other kinds of erroneous values.

In [3]:
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
data = pd.read_table('/Users/Nishant/Desktop/Machine Learning/Exam/classification_data.tsv', sep=r'\s+', header=0)
data.head()
print(data.keys())
print(data.describe())
print(data.corr())
print(data.isna())

Index(['Red', 'Green', 'Blue', 'Class'], dtype='object')
                 Red          Green           Blue          Class
count  245057.000000  245057.000000  245057.000000  245057.000000
mean      125.065446     132.507327     123.177151       1.792461
std        62.255653      59.941197      72.562165       0.405546
min         0.000000       0.000000       0.000000       1.000000
25%        68.000000      87.000000      70.000000       2.000000
50%       139.000000     153.000000     128.000000       2.000000
75%       176.000000     177.000000     164.000000       2.000000
max       255.000000     255.000000     255.000000       2.000000
            Red     Green      Blue     Class
Red    1.000000  0.855250  0.496376  0.092030
Green  0.855250  1.000000  0.660098 -0.120327
Blue   0.496376  0.660098  1.000000 -0.569958
Class  0.092030 -0.120327 -0.569958  1.000000
          Red  Green   Blue  Class
0       False  False  False  False
1       False  False  False  False
2       False 
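Printing the full boolean mask is hard to read for a frame of about 245,000 rows. A more compact check, sketched here on a small made-up frame (the name `toy` and its values are illustrative and not taken from the dataset), sums the mask per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    'Red': [10, np.nan, 30],
    'Green': [5, 6, 7],
    'Blue': [np.nan, np.nan, 9],
})

# isna() gives a boolean mask; summing it counts missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if any value anywhere in the frame is missing
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

Applied to the real data, the same two lines would be `data.isna().sum()` and `data.isna().any().any()`.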

After examining the above information, the correlation between each feature and the target (Class) is computed separately to analyse the data further.

In [6]:
for item in data:
	if item != 'Class':
		corr = float(data[item].corr(data['Class']))
		print(item, corr)

Red 0.0920300916444954
Green -0.12032744045817019
Blue -0.5699582232198895


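Blue has the strongest correlation with Class (about -0.57), while Red and Green are only weakly correlated with it. Since weakly correlated features may still be useful in combination, we decide not to drop any features.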

The feature matrix X and the target vector y are then extracted from the DataFrame as NumPy arrays.

In [7]:
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same NumPy array
X = data[['Red', 'Blue', 'Green']].to_numpy()
y = data[['Class']].to_numpy()

The data is then split into training and test sets with an 80:20 split. This is done so that the model can be trained on the training data and its prediction accuracy measured on test data it has not seen.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)
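As a quick illustration of what the split produces, the following sketch runs the same call on made-up data (`X_demo` and `y_demo` are hypothetical stand-ins for the real arrays). The `stratify` argument is an optional extra that the notebook itself does not use; it keeps the class ratio the same in both parts, which can help when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 samples, 3 colour channels, labels 1 or 2
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 256, size=(1000, 3))
y_demo = rng.choice([1, 2], size=1000, p=[0.2, 0.8])

# 80:20 split; stratify (not used in the notebook) preserves the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=10, stratify=y_demo)

print(X_tr.shape, X_te.shape)
```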

# Logistic Regression

Logistic Regression is first tried on the data.

In [8]:
log = LogisticRegression()

GridSearchCV is then used to run 5-fold cross-validation and select the hyperparameter C. Cross-validation guards against overfitting to a single split: the training data is divided into five folds, and each candidate model is trained on four of them and evaluated on the remaining one, rotating through all five folds. The model is then refit on the full training data with the best C.

In [9]:
logreg = GridSearchCV(log, cv=5, param_grid={'C': [0.1, 1, 10, 100, 1000]})
logreg.fit(x_train, np.ravel(y_train))

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10, 100, 1000]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)
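The search that GridSearchCV performs can be approximated by hand. The sketch below is a simplified illustration on synthetic data (`X_demo`, `y_demo` and the shortened C list are assumptions made for the example): for each candidate C it averages the five fold scores and keeps the best one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data with 3 features, standing in for the RGB values
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    random_state=0)

# For each candidate C, average the accuracy over 5 folds
mean_scores = {}
for C in [0.1, 1, 10]:
    fold_scores = cross_val_score(LogisticRegression(C=C), X_demo, y_demo, cv=5)
    mean_scores[C] = fold_scores.mean()

# The candidate with the highest mean fold score is the one a grid search keeps
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_C)
```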

The cross validation score and some other statistics are obtained as follows.

In [10]:
print("")
print("SCORES FOR LOGISTIC REGRESSION:")
print("")
print("Gridsearch CV score: ", logreg.best_score_)
print("Training set score: ", logreg.score(x_train, y_train))
print("Linear accuracy score: ", logreg.score(x_test, y_test))


SCORES FOR LOGISTIC REGRESSION:

Gridsearch CV score:  0.918896171797
Training set score:  0.918885970058
Linear accuracy score:  0.918672978046


The test data is then used to predict the class with the trained model.

In [11]:
predictions = logreg.predict(x_test)

The best parameters selected by GridSearchCV, the accuracy score of the model and the confusion matrix are then obtained as follows.

In [12]:
print("Best Parameters Selected: ", logreg.best_params_)
print("Accuracy score for the prediction: ", accuracy_score(y_test, predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))

Best Parameters Selected:  {'C': 0.1}
Accuracy score for the prediction:  0.918672978046
Confusion Matrix:
[[ 8494  1854]
 [ 2132 36532]]
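In scikit-learn's convention, the rows of the confusion matrix are the actual classes and the columns are the predicted classes. A short sketch, using the matrix printed above, shows how the accuracy and the per-class recall and precision follow from it (the variable names are chosen for this example):

```python
import numpy as np

# Confusion matrix reported above: rows = actual class (1, 2),
# columns = predicted class (1, 2)
cm = np.array([[8494, 1854],
               [2132, 36532]])

# Correct predictions lie on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Recall: correct predictions divided by the number of actual samples per class
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print(accuracy, recall_per_class, precision_per_class)
```

The accuracy computed this way agrees with the value returned by accuracy_score.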


# SVM (Hard and Soft Margin)

Next, SVC is tried on the data with both hard and soft margins.

In [13]:
svc = SVC()

GridSearchCV is used to select C from the list provided in the question and the kernel from linear, polynomial and Gaussian (RBF), again with 5-fold cross-validation. How hard or soft the margin is depends on C: the larger the value, the more heavily margin violations are penalized and the harder the margin. The values 100 and 1000 are added to param_grid for C to approximate a hard-margin SVM.

In [14]:
svm = GridSearchCV(svc, cv=5, param_grid={'C': [0.1, 0.5, 1, 2, 5, 100, 1000], 'kernel': ['linear', 'poly', 'rbf']})
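To make the effect of C concrete, the sketch below fits two SVCs on overlapping synthetic blobs (the data and the choice of `gamma='scale'` are assumptions for this illustration, not part of the notebook) and counts their support vectors. A softer margin (small C) tolerates more violations and typically keeps more support vectors than a harder one (large C).

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=3.0,
                            random_state=0)

# Fit a soft-margin (C=0.1) and a near hard-margin (C=1000) model
support_counts = {}
for C in [0.1, 1000]:
    model = SVC(C=C, kernel='rbf', gamma='scale').fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors per class
    support_counts[C] = int(model.n_support_.sum())

print(support_counts)
```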

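The data is then normalized to improve SVM performance and reduce convergence time. With norm='l2', normalize rescales each sample (row) to unit length rather than scaling each feature (column).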

In [15]:
x_train_normalized = normalize(x_train, norm='l2')
x_test_normalized = normalize(x_test, norm='l2')
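A small sketch on made-up pixel values (the array `pixels` is hypothetical) shows what this call does: each row is divided by its Euclidean length, so only the ratios between the colour channels are kept, and an all-zero row is left unchanged.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical pixels: one ordinary, one all-black, one all-white
pixels = np.array([[3.0, 4.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [255.0, 255.0, 255.0]])

# Row-wise L2 normalization, as used in the notebook
unit_rows = normalize(pixels, norm='l2')

# Every non-zero row now has length 1; the zero row stays at 0
row_lengths = np.linalg.norm(unit_rows, axis=1)
print(unit_rows)
print(row_lengths)
```

Per-feature scaling, for example with StandardScaler, is a different operation and would keep brightness differences between pixels.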

The model is then trained with the training data.

In [16]:
svm.fit(x_train_normalized, np.ravel(y_train))

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 0.5, 1, 2, 5, 100, 1000], 'kernel': ['linear', 'poly', 'rbf']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

The cross validation score and some other statistics are then calculated.

In [17]:
print("")
print("SCORES FOR SVM:")
print("")
print("Gridsearch CV score:", svm.best_score_)
print("Training set score: ", svm.score(x_train_normalized, y_train))
print("Linear accuracy score: ", svm.score(x_test_normalized, y_test))
print("Best Parameters Selected: ", svm.best_params_)


SCORES FOR SVM:

Gridsearch CV score: 0.997046596445
Training set score:  0.997087403402
Linear accuracy score:  0.997470007345
Best Parameters Selected:  {'C': 1000, 'kernel': 'rbf'}


As seen above, GridSearchCV selects C = 1000 and an RBF kernel as the best parameters. Thus a (near) hard-margin SVM gives the best results for this data. The test data is then used to predict and calculate the final model accuracy.

In [18]:
svm_predictions = svm.predict(x_test_normalized)

The accuracy score and the confusion matrix for the model are then obtained.

In [19]:
print("Accuracy score for the prediction: ", accuracy_score(y_test, svm_predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_test, svm_predictions))

Accuracy score for the prediction:  0.997470007345
Confusion Matrix:
[[10337    11]
 [  113 38551]]


# Kernelized Ridge Regression

A modified version of Kernelized Ridge Regression is then attempted, in which the continuous predictions are rounded to the nearest class label so that the model can be used for classification.

In [9]:
kernreg = KernelRidge()

Since this is a non-linear model, the data is normalized to improve performance.

In [10]:
x_train_normalized = normalize(x_train, norm='l2')
x_test_normalized = normalize(x_test, norm='l2')

GridSearchCV is then used to conduct 5-fold cross validation and to select between the hyperparameters specified in the question.

In [11]:
ridge = GridSearchCV(kernreg, cv=5, param_grid=[{'kernel': ['linear']}, {'alpha': [1], 'kernel': ['poly'], 'gamma': [1], 'degree': [2, 3]}, {'kernel': ['rbf'], 'gamma': [0.1, 0.5, 1, 2, 4]}])

The model is then trained with the training data and the hyperparameters selected by GridSearchCV.

In [None]:
ridge.fit(x_train_normalized, np.ravel(y_train))

The cross validation score, the best hyperparameters selected and some other statistics are then obtained.

In [None]:
print("")
print("SCORES FOR KERNELIZED RIDGE:")
print("")
# KernelRidge is a regressor, so best_score_ and score() report R^2, not accuracy
print("Gridsearch CV score: ", ridge.best_score_)
print("Training set score: ", ridge.score(x_train_normalized, y_train))
print("Test set score: ", ridge.score(x_test_normalized, y_test))
print("Best Parameters Selected: ", ridge.best_params_)

We then predict on the test data and map each continuous prediction to the nearest class label (1 or 2), so that the accuracy score and confusion matrix can be computed.

In [None]:
ridge_predictions = ridge.predict(x_test_normalized)
# Map each continuous prediction to the nearest class label:
# values up to 1.5 become class 1, larger values become class 2.
# Out-of-range predictions (below 1 or above 2) are clamped to the nearest label.
ridge_predictions = np.where(ridge_predictions > 1.5, 2, 1)
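The mapping from regression outputs to class labels can be checked on a few made-up values. This sketch assumes a threshold of 1.5, halfway between the two labels, and the array `raw` is hypothetical:

```python
import numpy as np

# Hypothetical regression outputs, including values outside [1, 2]
raw = np.array([0.3, 1.0, 1.5, 1.51, 2.0, 2.7])

# Values above 1.5 map to class 2, everything else to class 1
labels = np.where(raw > 1.5, 2, 1)
print(labels.tolist())  # [1, 1, 1, 2, 2, 2]
```

Every output is a valid label, including for predictions below 1 or above 2.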

The accuracy score and confusion matrix for the prediction are then obtained.

In [None]:
print("Accuracy score for the prediction: ", accuracy_score(y_test, ridge_predictions))
print("Confusion Matrix:") 
print(confusion_matrix(y_test, ridge_predictions))

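This model was left to train for over 12 hours in a Jupyter notebook but still did not complete training. A likely reason is that kernel ridge regression works with a dense n x n kernel matrix over the training samples, which is impractical for roughly 196,000 rows.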
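A back-of-the-envelope estimate makes the scale of the problem visible. The sketch below assumes that a dense float64 kernel matrix is built for each fit and that each of the 5 folds trains on four fifths of the training set; the row count is taken from the describe() output above.

```python
from math import ceil

n_samples = 245057                 # row count from data.describe()
n_test = ceil(0.20 * n_samples)    # size of the 20% test split
n_train = n_samples - n_test

# In 5-fold cross-validation each fit sees about 4/5 of the training set
n_fold_train = n_train * 4 // 5

# Assumed cost: one 8-byte float per pair of training samples
gram_bytes = n_fold_train ** 2 * 8
gram_gib = gram_bytes / 2 ** 30

print(n_train, n_fold_train, round(gram_gib, 1))
```

Under these assumptions a single fit already needs far more memory than a typical workstation has, before the cubic-time solve is even considered. Training on a random subsample of the data would be one way to make the experiment feasible.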

# Conclusion

After observing the prediction results and accuracy scores of the above models, it is concluded that the hard-margin SVM with a Gaussian kernel and C = 1000 gives the best results on this dataset, with an accuracy score of 0.997470007345. The accuracy score is the fraction of predictions that are correct out of all predictions made. The confusion matrix, on the other hand, shows the actual classes along the rows and the predicted classes along the columns. Each cell holds the number of test samples with that combination of actual and predicted class, so correct predictions lie on the diagonal.