We have seen that we can fit an SVM with a non-linear kernel in order
to perform classification using a non-linear decision boundary. We will
now see that we can also obtain a non-linear decision boundary by
performing logistic regression using non-linear transformations of the
features.

## Preprocessing

In [0]:
import numpy as np
import pandas as pd

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**a. Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary
between them. For instance, you can do this as follows:**

In [0]:
x1 = np.random.uniform(size=500) - 0.5
x2 = np.random.uniform(size=500) - 0.5
y = 1 * (x1**2 - x2**2 > 0)

**b. Plot the observations, colored according to their class labels.
Your plot should display X1 on the x-axis, and X2 on the y-axis.**

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(x=x1[y==0], y=x2[y==0], cmap='viridis', c='orange', s=500, marker='o', alpha=0.75)
plt.scatter(x=x1[y==1], y=x2[y==1], cmap='viridis', c='green', s=500, marker='o', alpha=0.75)
plt.title('observations', color='m', fontsize=30)
plt.xlabel('X1', color='orange', fontsize=20)
plt.ylabel('X2', color='green', fontsize=20)

Clearly, there is a non-linear decision boundary.
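The shape of this boundary can be read off the labelling rule: $x_1^2 - x_2^2 > 0$ holds exactly when $|x_1| > |x_2|$, so the classes are split by the two diagonals $x_2 = \pm x_1$. The sketch below checks that equivalence numerically. It regenerates the data with a fixed seed and uses `_demo` variable names; both are assumptions made for reproducibility and are not part of the original cells.

```python
import numpy as np

# Assumption: a fixed seed (the original cell is unseeded) so the check is reproducible.
rng = np.random.default_rng(0)
x1_demo = rng.uniform(size=500) - 0.5
x2_demo = rng.uniform(size=500) - 0.5

# Labels exactly as defined in part (a).
y_demo = 1 * (x1_demo**2 - x2_demo**2 > 0)

# x1^2 > x2^2 is equivalent to |x1| > |x2|, so the classes are separated
# by the diagonals x2 = x1 and x2 = -x1.
y_from_abs = 1 * (np.abs(x1_demo) > np.abs(x2_demo))
labels_match = bool(np.array_equal(y_demo, y_from_abs))
print(labels_match)
```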

**c. Fit a logistic regression model to the data, using X1 and X2 as
predictors.**

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [0]:
# Build the design matrix directly from the two feature arrays.
X = pd.DataFrame({'X1': x1, 'X2': x2})
X.head()

In [0]:
logfit = LogisticRegression(solver='liblinear').fit(X, y.ravel())

**d. Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The
decision boundary should be linear.**

In [0]:
from sklearn.metrics import confusion_matrix, classification_report

In [0]:
Y = pd.DataFrame([y]).T
df = pd.concat([Y, X], axis=1)
df.columns = ['Y', 'X1', 'X2']
df.head()

In [0]:
logpred = pd.DataFrame([logfit.predict(X)]).T
logpred.columns = ['Y_PRED']
logpred.head()

In [0]:
df = pd.concat([logpred, df], axis=1,  sort=False)
df.head()

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(x=df.X1[df.Y_PRED==0], y=df.X2[df.Y_PRED==0], cmap='viridis', c='orange', s=500, marker='o', alpha=0.75)
plt.scatter(x=df.X1[df.Y_PRED==1], y=df.X2[df.Y_PRED==1], cmap='viridis', c='green', s=500, marker='o', alpha=0.75)
plt.title('observations', color='m', fontsize=30)
plt.xlabel('X1', color='orange', fontsize=20)
plt.ylabel('X2', color='green', fontsize=20)

In [0]:
conf_mat = pd.DataFrame(confusion_matrix(df.Y, df.Y_PRED).T, index = [0, 1], columns = [0, 1])
conf_mat

In [0]:
class_rep = classification_report(df.Y, df.Y_PRED)
print(class_rep)

Therefore, there is a linear decision boundary.
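The linearity follows from the model form: logistic regression on X1 and X2 predicts class 1 whenever $\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0$, and the set where that score equals zero is a straight line. A minimal sketch, again on regenerated data with a fixed seed (an assumption, as are the `_demo` names), confirms that the fitted model's score is this linear expression and that the predicted labels are its sign.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumption: seeded copy of the part (a) data.
rng = np.random.default_rng(0)
x1_demo = rng.uniform(size=500) - 0.5
x2_demo = rng.uniform(size=500) - 0.5
y_demo = 1 * (x1_demo**2 - x2_demo**2 > 0)
X_demo = np.column_stack([x1_demo, x2_demo])

model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)
b0 = model.intercept_[0]
b1, b2 = model.coef_[0]

# The model's score is a linear function of X1 and X2 ...
scores = model.decision_function(X_demo)
score_is_linear = bool(np.allclose(scores, b0 + b1 * x1_demo + b2 * x2_demo))

# ... and the predicted class is 1 exactly where that score is positive.
pred_is_sign = bool(np.array_equal((scores > 0).astype(int), model.predict(X_demo)))
print(score_is_linear, pred_is_sign)
```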

**e. Now fit a logistic regression model to the data using non-linear
functions of X1 and X2 as predictors (e.g. X_1^2, X1×X2, log(X2),
and so forth).**

$X_1 \times X_2$

In [0]:
x1x2 = x1*x2

# Design matrix with the interaction term added.
X = pd.DataFrame({'X1': x1, 'X2': x2, 'X1 x X2': x1x2})
X.head()

In [0]:
logfitX1X2 = LogisticRegression(solver='liblinear').fit(X, y.ravel())

In [0]:
Y = pd.DataFrame([y]).T
df = pd.concat([Y, X], axis=1)
df.columns = ['Y', 'X1', 'X2', 'X1 x X2']
df.head()

In [0]:
logpred = pd.DataFrame([logfitX1X2.predict(X)]).T
logpred.columns = ['Y_PRED']
logpred.head()

In [0]:
df = pd.concat([logpred, df], axis=1,  sort=False)
df.head()

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(x=df.X1[df.Y_PRED==0], y=df['X1 x X2'][df.Y_PRED==0], cmap='viridis', c='orange', s=500, marker='o', alpha=0.75)
plt.scatter(x=df.X1[df.Y_PRED==1], y=df['X1 x X2'][df.Y_PRED==1], cmap='viridis', c='green', s=500, marker='o', alpha=0.75)
plt.title('observations', color='m', fontsize=30)
plt.xlabel('X1', color='orange', fontsize=20)
plt.ylabel('X1 x X2', color='green', fontsize=20)

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(x=df.X2[df.Y_PRED==0], y=df['X1 x X2'][df.Y_PRED==0], cmap='viridis', c='orange', s=500, marker='o', alpha=0.75)
plt.scatter(x=df.X2[df.Y_PRED==1], y=df['X1 x X2'][df.Y_PRED==1], cmap='viridis', c='green', s=500, marker='o', alpha=0.75)
plt.title('observations', color='m', fontsize=30)
plt.xlabel('X2', color='orange', fontsize=20)
plt.ylabel('X1 x X2', color='green', fontsize=20)

In [0]:
conf_mat = pd.DataFrame(confusion_matrix(df.Y, df.Y_PRED).T, index = [0, 1], columns = [0, 1])
conf_mat

In [0]:
class_rep = classification_report(df.Y, df.Y_PRED)
print(class_rep)

$X_1^2$

In [0]:
x12 = x1**2

# Design matrix with the squared X1 term added.
X = pd.DataFrame({'X1': x1, 'X2': x2, 'X1^2': x12})
X.head()

In [0]:
logfitX12 = LogisticRegression(solver='liblinear').fit(X, y.ravel())

In [0]:
Y = pd.DataFrame([y]).T
df = pd.concat([Y, X], axis=1)
df.columns = ['Y', 'X1', 'X2', 'X1^2']
df.head()

In [0]:
logpred = pd.DataFrame([logfitX12.predict(X)]).T
logpred.columns = ['Y_PRED']
logpred.head()

In [0]:
df = pd.concat([logpred, df], axis=1,  sort=False)
df.head()

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(x=df.X1[df.Y_PRED==0], y=df['X1^2'][df.Y_PRED==0], cmap='viridis', c='orange', s=500, marker='o', alpha=0.75)
plt.scatter(x=df.X1[df.Y_PRED==1], y=df['X1^2'][df.Y_PRED==1], cmap='viridis', c='green', s=500, marker='o', alpha=0.75)
plt.title('observations', color='m', fontsize=30)
plt.xlabel('X1', color='orange', fontsize=20)
plt.ylabel('X1^2', color='green', fontsize=20)

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(x=df.X2[df.Y_PRED==0], y=df['X1^2'][df.Y_PRED==0], cmap='viridis', c='orange', s=500, marker='o', alpha=0.75)
plt.scatter(x=df.X2[df.Y_PRED==1], y=df['X1^2'][df.Y_PRED==1], cmap='viridis', c='green', s=500, marker='o', alpha=0.75)
plt.title('observations', color='m', fontsize=30)
plt.xlabel('X2', color='orange', fontsize=20)
plt.ylabel('X1^2', color='green', fontsize=20)

In [0]:
conf_mat = pd.DataFrame(confusion_matrix(df.Y, df.Y_PRED).T, index = [0, 1], columns = [0, 1])
conf_mat

In [0]:
class_rep = classification_report(df.Y, df.Y_PRED)
print(class_rep)

$X_2^2$

In [0]:
x22 = x2**2

# Design matrix with the squared X2 term added.
X = pd.DataFrame({'X1': x1, 'X2': x2, 'X2^2': x22})
X.head()

In [0]:
logfitX22 = LogisticRegression(solver='liblinear').fit(X, y.ravel())

In [0]:
Y = pd.DataFrame([y]).T
df = pd.concat([Y, X], axis=1)
df.columns = ['Y', 'X1', 'X2', 'X2^2']
df.head()

In [0]:
# Predict with the model fitted on X2^2 (logfitX22), not the X1^2 model.
logpred = pd.DataFrame([logfitX22.predict(X)]).T
logpred.columns = ['Y_PRED']
logpred.head()

In [0]:
df = pd.concat([logpred, df], axis=1,  sort=False)
df.head()

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(x=df.X1[df.Y_PRED==0], y=df['X2^2'][df.Y_PRED==0], cmap='viridis', c='orange', s=500, marker='o', alpha=0.75)
plt.scatter(x=df.X1[df.Y_PRED==1], y=df['X2^2'][df.Y_PRED==1], cmap='viridis', c='green', s=500, marker='o', alpha=0.75)
plt.title('observations', color='m', fontsize=30)
plt.xlabel('X1', color='orange', fontsize=20)
plt.ylabel('X2^2', color='green', fontsize=20)

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(x=df.X2[df.Y_PRED==0], y=df['X2^2'][df.Y_PRED==0], cmap='viridis', c='orange', s=500, marker='o', alpha=0.75)
plt.scatter(x=df.X2[df.Y_PRED==1], y=df['X2^2'][df.Y_PRED==1], cmap='viridis', c='green', s=500, marker='o', alpha=0.75)
plt.title('observations', color='m', fontsize=30)
plt.xlabel('X2', color='orange', fontsize=20)
plt.ylabel('X2^2', color='green', fontsize=20)

In [0]:
conf_mat = pd.DataFrame(confusion_matrix(df.Y, df.Y_PRED).T, index = [0, 1], columns = [0, 1])
conf_mat

In [0]:
class_rep = classification_report(df.Y, df.Y_PRED)
print(class_rep)

Therefore, none of these models, each with a single added non-linear term, captures the true decision boundary well, although the model with $X_1^2$ approximates it somewhat.
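Each model above adds only one transformed feature, while the true boundary depends on the difference $X_1^2 - X_2^2$. A natural follow-up, sketched here as an assumption rather than as part of the original analysis, is to include both squared terms at once. The data are regenerated with a fixed seed, and `C=1e4` and `max_iter=1000` are hypothetical choices that weaken the default L2 penalty, since the squared features never exceed 0.25.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Assumption: seeded copy of the part (a) data.
rng = np.random.default_rng(0)
x1_demo = rng.uniform(size=500) - 0.5
x2_demo = rng.uniform(size=500) - 0.5
y_demo = 1 * (x1_demo**2 - x2_demo**2 > 0)

# Both squared terms together, alongside the original features.
X_quad = pd.DataFrame({
    'X1': x1_demo,
    'X2': x2_demo,
    'X1^2': x1_demo**2,
    'X2^2': x2_demo**2,
})

# Assumption: a large C (weak regularisation) so the small-valued squared
# features can receive large coefficients.
quad_fit = LogisticRegression(solver='liblinear', C=1e4, max_iter=1000).fit(X_quad, y_demo)
quad_acc = quad_fit.score(X_quad, y_demo)
print(f'training accuracy with X1^2 and X2^2: {quad_acc:.3f}')
```

With both terms present the classes become linearly separable in the transformed feature space, which is the point made in the introduction about non-linear transformations of the features.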

**g. Fit a support vector classifier to the data with X1 and X2 as
predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted
class labels.**

In [0]:
from sklearn.svm import SVC

In [0]:
# Back to the two original features only.
X = pd.DataFrame({'X1': x1, 'X2': x2})
X.head()

In [0]:
Y = pd.DataFrame([y]).T
df = pd.concat([Y, X], axis=1)
df.columns = ['Y', 'X1', 'X2']
df.head()

In [0]:
svmfit = SVC(kernel='linear', C=10).fit(X, y)

In [0]:
def svmplot(svc, X, y, height=0.02, buffer=0.25):
    x_min, x_max = X.X1.min()-buffer, X.X1.max()+buffer
    y_min, y_max = X.X2.min()-buffer, X.X2.max()+buffer
    xx, yy = np.meshgrid(np.arange(x_min, x_max, height), np.arange(y_min, y_max, height))
    Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.2)
    
plt.xkcd()
plt.figure(figsize=(25, 10))
svmplot(svmfit, X, y)
# Points are colored by true class (orange = 0, green = 1, as in the earlier plots);
# the shaded regions drawn by svmplot show the predicted class.
colors = ['green' if label == 1 else 'orange' for label in y]
plt.scatter(X.X1, X.X2, marker='o', s=250, c=colors, alpha=0.75)
plt.title('support vector classifier', color='m', fontsize=30)
plt.xlabel('X1', color='orange', fontsize=20)
plt.ylabel('X2', color='green', fontsize=20)

In [0]:
svmpred = pd.DataFrame([svmfit.predict(X)]).T
svmpred.columns = ['Y_PRED']
svmpred.head()

In [0]:
df = pd.concat([svmpred, df], axis=1,)
df.head()

In [0]:
conf_mat = pd.DataFrame(confusion_matrix(df.Y, df.Y_PRED).T, index = [0, 1], columns = [0, 1])
conf_mat

In [0]:
class_rep = classification_report(df.Y, df.Y_PRED)
print(class_rep)

A support vector classifier with a linear decision boundary offers no significant improvement over logistic regression.
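This is expected: both methods draw a single straight line, and no straight line can separate classes that occupy opposite pairs of triangles in the square. The sketch below compares the training accuracy of the two linear fits on regenerated, seeded data (the seed and the `_demo` names are assumptions, not part of the original cells).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Assumption: seeded copy of the part (a) data.
rng = np.random.default_rng(0)
x1_demo = rng.uniform(size=500) - 0.5
x2_demo = rng.uniform(size=500) - 0.5
y_demo = 1 * (x1_demo**2 - x2_demo**2 > 0)
X_demo = np.column_stack([x1_demo, x2_demo])

# Same model settings as in parts (c) and (g).
log_model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)
svc_model = SVC(kernel='linear', C=10).fit(X_demo, y_demo)

log_acc = accuracy_score(y_demo, log_model.predict(X_demo))
svc_acc = accuracy_score(y_demo, svc_model.predict(X_demo))
print(f'logistic regression:       {log_acc:.3f}')
print(f'support vector classifier: {svc_acc:.3f}')
```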

**h. Fit a SVM using a non-linear kernel to the data. Obtain a class
prediction for each training observation. Plot the observations,
colored according to the predicted class labels.**

In [0]:
svmfit = SVC(C=10, kernel='rbf', gamma=1).fit(X, y)

In [0]:
# Rebuild the two-feature design matrix.
X = pd.DataFrame({'X1': x1, 'X2': x2})
X.head()

In [0]:
Y = pd.DataFrame([y]).T
df = pd.concat([Y, X], axis=1)
df.columns = ['Y', 'X1', 'X2']
df.head()

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
svmplot(svmfit, X, y)
# Points are colored by true class (orange = 0, green = 1, as in the earlier plots);
# the shaded regions drawn by svmplot show the predicted class.
colors = ['green' if label == 1 else 'orange' for label in y]
plt.scatter(X.X1, X.X2, marker='o', s=250, c=colors, alpha=0.75)
plt.title('support vector machine', color='m', fontsize=30)
plt.xlabel('X1', color='orange', fontsize=20)
plt.ylabel('X2', color='green', fontsize=20)

In [0]:
svmpred = pd.DataFrame([svmfit.predict(X)]).T
svmpred.columns = ['Y_PRED']
svmpred.head()

In [0]:
df = pd.concat([svmpred, df], axis=1,)
df.head()

In [0]:
conf_mat = pd.DataFrame(confusion_matrix(df.Y, df.Y_PRED).T, index = [0, 1], columns = [0, 1])
conf_mat

In [0]:
class_rep = classification_report(df.Y, df.Y_PRED)
print(class_rep)

The SVM with a radial kernel predicts the training labels far more accurately than either logistic regression or the support vector classifier.

**i. Comment on results.**

This exercise shows the advantage of a support vector machine with a non-linear kernel over linear methods, such as logistic regression on the original features and the support vector classifier, when the true decision boundary is non-linear. The difference is visible in the precision reported for each model above.
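All of the figures above are training-set metrics. As a hedged complement, the sketch below scores the same three model settings on a held-out split using `train_test_split`, which the notebook imports but does not otherwise use. The seed, the 70/30 split, and the `_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: seeded copy of the part (a) data.
rng = np.random.default_rng(0)
x1_demo = rng.uniform(size=500) - 0.5
x2_demo = rng.uniform(size=500) - 0.5
y_demo = 1 * (x1_demo**2 - x2_demo**2 > 0)
X_demo = np.column_stack([x1_demo, x2_demo])

# Assumption: a 70/30 split with a fixed random_state.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Same hyperparameters as the models fitted in parts (c), (g) and (h).
models = {
    'logistic regression': LogisticRegression(solver='liblinear'),
    'support vector classifier': SVC(kernel='linear', C=10),
    'SVM (RBF kernel)': SVC(kernel='rbf', C=10, gamma=1),
}

# Fit on the training part, report accuracy on the held-out part.
test_acc = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, acc in test_acc.items():
    print(f'{name:28s} {acc:.3f}')
```

Held-out accuracy guards against a flexible model looking good only because it memorised the training points.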