#<H1 align=center> __A STUDY ON PERFORMANCE AND ANALYSIS OF__ *SKLEARN* __MODELS ON__ *DIGITS* __DATASET__


##<H4> <B>Introduction and importing necessary libraries</B></H4>

<p>Machine Learning (ML) is a powerful tool that has found applications in many fields. One such application lies in Computer Vision: object detection and recognition.</p>
<p>In this notebook, several learning approaches are discussed and their performance is analysed on a curated dataset for the task of handwritten digit recognition.</p>

<H4> Libraries used: </H4>
<OL>
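<LI> scikit-learn: scikit-learn or sklearn is one of the most widely used libraries for Machine Learning applications in Python. It features various classification, regression and clustering algorithms such as support-vector machines (SVMs), multi-layer perceptrons (MLPs), random forests and k-means.
<LI> numpy: the array library used here for numerical operations on the image data.
<LI> matplotlib: the plotting library used to display the images and curves.
<LI> seaborn: a statistical visualisation library built on top of matplotlib, used here for scatter plots.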
</OL>

In [None]:
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd
%matplotlib inline

In [None]:
!pip install wandb -qqqU

In [None]:
import wandb
wandb.init(project="MNIST",name="Run-1")

##<H4> <B>Dataset Analysis </B></H4>

The <b>Digits dataset of scikit-learn library</b> has been selected for analysis in this experiment. It is a dataset containing <b>1797 images of size 8 X 8 pixels</b>. These images are of handwritten digits ranging from 0 to 9.

Before starting the analysis, let us first understand the dataset. Understanding the dataset in question is one of the key tasks for an ML engineer.

So first, we will load the dataset. After that let us go through the documentation maintained in 'DESCR' attribute.

In [None]:
#load the dataset
digits = load_digits()
# Get the documentation
print(digits.DESCR)


In [None]:
print(type(digits))

The dataset contains images of 0-9 digits. It also has a 'target_names' attribute that gives the labels that were used to label the dataset.

In [None]:
# target_names is a NumPy array holding the integers 0 to 9
print(digits.target_names)

Similar to the other scikit-learn datasets, this dataset is also built and maintained in a numpy ndarray object. <br>Let us confirm the same.

In [None]:
print(type(digits.data))
print(type(digits.target))

The dataset consists of 1797 images of size 8X8. These images are each stacked in one dimension as a 64 dimensional vector. Hence, <b>the size of the dataset is 1797 X 64</b>. <br>The ground truth labels are maintained in the <b>target vector of size 1797</b>. Each image has a corresponding ground truth label. <b>There are no missing labels in this dataset</b>

In [None]:
print("Size of the dataset: ",digits.data.shape)
print("Size of the target vector: ",digits.target.shape)

The 64 features are the 64 pixels of the images. They are named as 'pixel_r_c' where *r* stands for row number and *c* stands for column number.

In [None]:
count = 0
# It can be observed that the 64 dim feature vector is indeed a 1D stacked version of the corresponding image
for feature in digits.feature_names:
    print(feature, end=", ")
    count += 1
    if count == 8:
        print("\n")
        count = 0

The fact that the 64 dim feature vector is the 1D stacked version of the corresponding image can be further confirmed by the code in the following cell.

In [None]:
print(digits.images.shape)
# It can be observed that the 64 dimensional feature vector is obtained by stacking the corresponding image in 1D
(np.reshape(digits.images,(digits.images.shape[0],-1)) == digits.data).all()

<B> SUMMARY</B>
1. Number of images in dataset=1797.
2. Total number of labels=10 ( 0 to 9)
3. The images being 8X8, 64 different features are considered per image( each pixel is considered as a feature).
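
As a quick sanity check on the summary above, the sketch below counts how many images each label has and confirms that no pixel value is missing. It reloads the data so that it can run on its own; the names `labels`, `counts` and `n_missing` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Count the number of images per label (0 to 9).
labels, counts = np.unique(digits.target, return_counts=True)
for label, count in zip(labels, counts):
    print(f"digit {label}: {count} images")

# Every image should have a label and no NaN pixel values.
n_missing = int(np.isnan(digits.data).sum())
print("Total images:", counts.sum(), "| missing pixel values:", n_missing)
```

If the classes are roughly balanced, plain accuracy is a reasonable first metric for the models that follow.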

The following code displays the first 100 images of the dataset.

In [None]:
# Display the first 100 images of the dataset
# A subplot is created to display the 100 images.
fig,ax=plt.subplots(nrows=10,ncols=10,figsize=(10,10))
plt.suptitle('Displaying 100 images',va='bottom',fontweight ="bold")
for index in range(100):
    plt.subplot(10, 10, index+1)
    plt.title("Label: {}".format(digits.target[index]),)
    plt.imshow(digits.images[index], cmap='gray_r')
    plt.axis('off')
plt.subplots_adjust()
plt.show()

##<H4> <B>Train Test Split </B></H4>
The dataset is split into training and test sets that hold 0.8 and 0.2 of the samples respectively.<br>
This is implemented using the train_test_split function from sklearn.model_selection.

In [None]:
from sklearn.model_selection import train_test_split as split
X_train, X_test, y_train, y_test = split(digits.data,digits.target,test_size=0.2,random_state=42)
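
The split in the cell above is purely random. As an optional variation (an assumption, not what this notebook uses), passing `stratify` keeps the class proportions nearly identical in both subsets. The names `X_tr`, `X_te`, `y_tr` and `y_te` are chosen here so that the notebook's own variables are not overwritten.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=42,
    stratify=digits.target,  # preserve per-class proportions
)

print("Train size:", X_tr.shape[0], "Test size:", X_te.shape[0])
# Fraction of each digit in the two subsets; the two rows should be nearly equal.
train_frac = np.bincount(y_tr) / y_tr.size
test_frac = np.bincount(y_te) / y_te.size
print(np.round(train_frac, 3))
print(np.round(test_frac, 3))
```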

In [None]:
config = {
    "NAME" : "SK-LEARN DIGITS DATASET",
    "NUMBER OF IMAGES" : str(digits.images.shape[0]),
    "SIZE OF EACH IMAGE" : str(digits.images.shape[1:]),
    # np.prod replaces the deprecated np.product alias
    "SIZE OF FEATURE VECTOR" : str(np.prod(digits.images.shape[1:])),
    "CLASS NAMES" : " ".join(digits.target_names.astype(str)),
    "TRAIN TEST SPLIT" : 0.8
}
wandb.config.update(config)

##<H4> <B>Feature Selection</B></H4>

In [None]:
#from sklearn.feature_selection import chi2,mutual_info_classif,SelectKBest


In [None]:
# p_values = chi2(X_train,y_train)[1]
# plt.stem(p_values)

In [None]:
# np.where(p_values>0.1)

In [None]:
# info_gain = mutual_info_classif(X_train,y_train)
# plt.stem(info_gain)

In [None]:
# info_gain[np.where(p_values>0.1)]

In [None]:
# np.where(info_gain<0.1)

In [None]:
# X_train = np.delete(X_train,np.where(p_values>0.1),1)
# X_test = np.delete(X_test,np.where(p_values>0.1),1)

##<H4> <B>Feature Scaling </B></H4>
Feature scaling is implemented using the StandardScaler class.<br>
StandardScaler() normalises input x according to the following relationship -<br>
$$z = \frac{x - u}{s}$$
where,<br> $u$ is the mean of the training samples if with_mean=True (default option) or zero if with_mean=False<br> and $s$ is the standard deviation of the training samples if with_std=True (default option) or one if with_std=False.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled=scaler.fit_transform(X_train,y_train)
X_test_scaled=scaler.transform(X_test)
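
To make the relationship z = (x - u) / s concrete, here is a minimal sketch on a toy array (the data and the names `scaler_demo`, `z_manual` and `s_safe` are made up for illustration). It also covers a constant feature, for which the standard deviation is zero; scikit-learn then uses a scale of one, so the feature is only centred. This matters here because some border pixels of the digits images appear to be constant.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 4 samples, 3 features; the last feature is constant.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0],
              [4.0, 40.0, 5.0]])

scaler_demo = StandardScaler().fit(X)

u = X.mean(axis=0)
s = X.std(axis=0)                  # population standard deviation (ddof=0)
s_safe = np.where(s == 0, 1.0, s)  # a constant feature is centred but not rescaled

z_manual = (X - u) / s_safe
z_sklearn = scaler_demo.transform(X)
print(np.allclose(z_manual, z_sklearn))
```

Note that the scaler is fitted on the training data only and then reused on the test data, as in the cell above, so that no test statistics leak into training.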

A very useful tool to check accuracy, precision and recall is the <b>confusion matrix.</B><br>
By definition a confusion matrix $C$ is such that $C_{ij}$ is equal to the number of observations known to be in group $i$ and predicted to be in group $j$

Computing the confusion matrix with the ground-truth labels as both the true and the predicted labels yields a diagonal matrix whose diagonal gives the number of test samples in each class.

In [None]:
from sklearn import metrics
conf_mat = metrics.confusion_matrix(y_test,y_test)
print(conf_mat)

In [None]:
occur_array = np.unique(y_test, return_counts=True)[1]
freq_dict = {idx: occur_array[idx] for idx in range(10)}
print(freq_dict)
(np.diag(conf_mat) == occur_array).all()
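
The definition of $C_{ij}$ can also be used to read accuracy, precision and recall directly off the matrix. The sketch below uses a small made-up label vector (not the digits data) so that the numbers are easy to verify by hand.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

# Row i is the true class, column j is the predicted class.
C = metrics.confusion_matrix(y_true, y_pred)
print(C)

recall = np.diag(C) / C.sum(axis=1)     # correct / all samples truly in the class
precision = np.diag(C) / C.sum(axis=0)  # correct / all samples predicted as the class
accuracy = np.trace(C) / C.sum()        # correct / all samples

print("Recall per class:   ", np.round(recall, 3))
print("Precision per class:", np.round(precision, 3))
print("Accuracy:           ", round(accuracy, 3))
```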

Now that we are comfortable with the dataset, we are ready to train various classification models defined in the sklearn library itself.

#<H1 align=center> __A STUDY ON PERFORMANCE AND ANALYSIS OF__ *SKLEARN* __MODELS ON__ *DIGITS* __DATASET__<br>
<H1 align=center> <b>PART-2 EXPERIMENTATION USING SKLEARN MODELS</b> </H1>

##<H2> K-Nearest Neighbour</H2>

Let us first try a simple algorithm - [Nearest Neighbour Classification](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification) <br>
Advantage -
Because it is non-parametric, it is often successful when the decision boundary is very irregular.

Following the analysis done [here](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py) the decision regions are plotted for three of the classes compared together.
We will consider the following two sets of classes - <br>
<OL>
<LI> (5,6,8)
<LI> (4,9,3)
</OL>  We will also plot the decision regions for all the classes.


In [None]:
from sklearn import neighbors, datasets
from sklearn.inspection import DecisionBoundaryDisplay

In [None]:
from sklearn.decomposition import KernelPCA
pca = KernelPCA(n_components=2,kernel = 'rbf')
x_vis = pca.fit_transform(X_train_scaled)

In [None]:
cmap_light = ListedColormap(["palegreen", "lightcoral","lavender", "wheat","thistle","antiquewhite","aquamarine","ivory","cyan","lightpink"])
cmap_bold = ["chocolate", "darkmagenta", "red","darkgreen","darkgoldenrod","darkblue","darkslategray","saddlebrown","darkorange","deeppink"]

In [None]:
for n_neighbors in [3,10,15,20,30,50]:
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights="uniform")
    clf.fit(x_vis,y_train)

    _, ax = plt.subplots()
    DecisionBoundaryDisplay.from_estimator(clf,x_vis,cmap=cmap_light,ax=ax,response_method="predict",plot_method="pcolormesh",shading="auto")

    # Plot also the training points
    sns.scatterplot(x=x_vis[:, 0],y=x_vis[:, 1],hue=y_train,palette=cmap_bold,alpha=1.0,edgecolor="black")
    plt.title("10-Class classification (k = %i, weights = '%s')" % (n_neighbors,"uniform"))
plt.show()
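
The plots above show decision regions in the 2-D Kernel PCA space, which is only a projection of the data. As a complementary sketch, and not part of the original analysis, the code below measures test accuracy in the full 64-dimensional scaled feature space for the same values of k. It reloads and re-splits the data with the same settings so that it can run on its own; `knn_scores` and the `X_tr`/`X_te` names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler_knn = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_knn.transform(X_tr), scaler_knn.transform(X_te)

knn_scores = {}
for k in [3, 10, 15, 20, 30, 50]:
    knn = KNeighborsClassifier(n_neighbors=k, weights="uniform")
    knn.fit(X_tr_s, y_tr)
    knn_scores[k] = knn.score(X_te_s, y_te)
    print(f"k = {k:2d}: test accuracy = {knn_scores[k]:.4f}")
```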

##<H2> PCA</H2>

Let us also implement [Principal Component Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn-decomposition-pca) on the dataset.<br>
We will use [K-Means clustering algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn-cluster-kmeans) after reducing dimension of the data.
This is to mitigate the [effect of high dimensionality](https://towardsdatascience.com/the-curse-of-dimensionality-50dc6e49aa1e) on the performance of K-Means model.

In [None]:
from sklearn.decomposition import PCA
pca2 = PCA(n_components=X_train_scaled.shape[1])
x_vis2 = pca2.fit_transform(X_train_scaled)
x_vis2.shape

In [None]:
plt.plot(pca2.explained_variance_ratio_)
plt.xticks(list(range(0,X_train_scaled.shape[1],10)) + [X_train_scaled.shape[1] - 1])
plt.show()

In [None]:
var_thresh = 0.01
np.argmin(pca2.explained_variance_ratio_[pca2.explained_variance_ratio_>var_thresh])
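
Besides thresholding the variance ratio of individual components, a common alternative is to keep the smallest number of components whose cumulative explained variance reaches a target. The sketch below assumes a target of 0.95, chosen only for illustration, and fits PCA on the whole scaled dataset rather than on the training split used in this notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)
pca_full = PCA().fit(X)  # keep all components

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
target = 0.95  # assumed threshold, for illustration only

# Index of the first running total that reaches the target, converted to a count.
n_keep = int(np.searchsorted(cumulative, target) + 1)
print(f"{n_keep} components explain at least {target:.0%} of the variance")
```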

In [None]:
first_2 = x_vis2[:,:2]  # first two components of the full linear PCA fitted above
first_2_var = pca2.explained_variance_ratio_[:2]
pca2 = PCA(n_components=2)
x_vis2 = pca2.fit_transform(X_train_scaled)
np.allclose(first_2,x_vis2)

In [None]:
print(first_2_var)
print(pca2.explained_variance_ratio_)

##<H2>KMeans</H2>

In [None]:
from sklearn.cluster import KMeans

In [None]:
import io
import sys
# Credits : https://stackoverflow.com/questions/65683128/how-to-plot-the-cost-inertia-values-in-sklearn-kmeans
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout

cls = KMeans(n_clusters = 10,verbose = 3,copy_x = True)
out = cls.fit_transform(x_vis2,y_train)
printed = new_stdout.getvalue()  #<- store printed output
sys.stdout = old_stdout

#Extract inertia values
inertia_list = []
stop = False
for i in printed.split('\n'):
  if('inertia' in i):
    inertia_list.append(float(i.split('inertia ')[1][:-1]))
    stop = True
  if(('Initialization' in i) and stop):
    break
#Plot
fig = plt.figure()
plt.plot(inertia_list)
plt.title("Inertia per iteration")
plt.xlabel("Iteration")
plt.ylabel("Inertia")

In [None]:
wandb.log({"KMeans Inertia":fig})

In [None]:
out.shape

In [None]:
h = 0.02 # Step size of the mesh.

# Plot the decision boundary.
x_min, x_max = x_vis2[:, 0].min() - 1, x_vis2[:, 0].max() + 1
y_min, y_max = x_vis2[:, 1].min() - 1, x_vis2[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = cls.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
fig = plt.figure()
plt.clf()

cmap_light = ListedColormap(["palegreen", "lightcoral","lavender", "wheat","thistle","antiquewhite","aquamarine","ivory","cyan","lightpink"])
cmap_bold = ["chocolate", "darkmagenta", "red","darkgreen","darkgoldenrod","darkblue","darkslategray","saddlebrown","darkorange","deeppink"]
plt.imshow(
    Z,
    interpolation="nearest",
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    cmap=cmap_light,
    aspect="auto",
    origin="lower",
)

sns.scatterplot(x = x_vis2[:, 0], y = x_vis2[:, 1],hue = y_train,palette = cmap_bold, size=2)
# Plot the centroids as a black X
centroids = cls.cluster_centers_
plt.scatter(
    centroids[:, 0],
    centroids[:, 1],
    marker="x",
    linewidths=3,
    color="black",
    zorder=10,
)
plt.title(
    "K-means clustering on the digits dataset (Linear PCA-reduced data)\n"
    "Centroids are marked with black cross"
)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

In [None]:
wandb.log({"KMeans Decision Boundary":fig})

In [None]:
wandb.sklearn.plot_elbow_curve(cls,x_vis)

[Good Reference for Silhouette plot](https://towardsdatascience.com/elbow-method-is-not-sufficient-to-find-best-k-in-k-means-clustering-fc820da0631d)

See Elbow plot [here](https://wandb.ai//kaushal-jadhav/MNIST/reports/undefined-23-06-30-19-18-45---Vmlldzo0Nzc0MDYy?accessToken=dy7inl74y4xypqo5gpe6olvobrd10q25ijodpyqvasfnid47gzswtnir8zg235cj)
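
The silhouette score mentioned in the reference above can be computed alongside the inertia used by the elbow method. The following is a minimal sketch under stated assumptions: it uses linear PCA with two components on the whole scaled dataset, and the range of k values is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)

sil_scores = {}
for k in range(2, 13, 2):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d)
    # The silhouette lies in [-1, 1]; higher means better separated clusters.
    sil_scores[k] = silhouette_score(X_2d, km.labels_)
    print(f"k = {k:2d}: inertia = {km.inertia_:10.1f}, silhouette = {sil_scores[k]:.3f}")
```

Inertia always decreases as k grows, whereas the silhouette score can peak, which is why the two are often read together.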

In [None]:
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout

out = cls.fit_transform(x_vis,y_train)
printed = new_stdout.getvalue()  #<- store printed output
sys.stdout = old_stdout

#Extract inertia values
inertia_list = []
stop = False
for i in printed.split('\n'):
  if('inertia' in i):
    inertia_list.append(float(i.split('inertia ')[1][:-1]))
    stop = True
  if(('Initialization' in i) and stop):
    break
#Plot
fig = plt.figure()
plt.plot(inertia_list)
plt.title("Inertia per iteration")
plt.xlabel("Iteration")
plt.ylabel("Inertia")

In [None]:
h = 0.02 # Step size of the mesh.

# Plot the decision boundary.
x_min, x_max = x_vis[:, 0].min() - 1, x_vis[:, 0].max() + 1
y_min, y_max = x_vis[:, 1].min() - 1, x_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = cls.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
fig = plt.figure()
plt.clf()

cmap_light = ListedColormap(["palegreen", "lightcoral","lavender", "wheat","thistle","antiquewhite","aquamarine","ivory","cyan","lightpink"])
cmap_bold = ["chocolate", "darkmagenta", "red","darkgreen","darkgoldenrod","darkblue","darkslategray","saddlebrown","darkorange","deeppink"]
plt.imshow(
    Z,
    interpolation="nearest",
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    cmap=cmap_light,
    aspect="auto",
    origin="lower",
)

sns.scatterplot(x = x_vis[:, 0], y = x_vis[:, 1],hue = y_train,palette = cmap_bold, size=2)
# Plot the centroids as a black X
centroids = cls.cluster_centers_
plt.scatter(
    centroids[:, 0],
    centroids[:, 1],
    marker="x",
    linewidths=3,
    color="black",
    zorder=10,
)
plt.title(
    "K-means clustering on the digits dataset (Kernel PCA-reduced data)\n"
    "Centroids are marked with black cross"
)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

##<H2> Multi Layer Perceptron </H2>

First try:\
a Multi-Layer Perceptron (MLP) classifier.\
It is a supervised learning algorithm that uses an underlying neural network to perform classification.\
It supports multi-class classification by applying Softmax as the output function. It also supports multi-label classification, in which a sample can belong to more than one class: for each class, the raw output is passed through the logistic function, values greater than or equal to 0.5 are rounded to 1 and all others to 0, and the indices where the value is 1 give the classes assigned to that sample.\
The log-loss function is minimised by one of the solvers offered by `MLPClassifier`; the grid search below compares SGD and Adam.\
(LBFGS, a limited-memory approximation to [BFGS](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm), is a third option that is not searched here.)
The hidden-layer configuration is tuned as well; the candidates are (20,), (20, 20) and (10, 30, 10) units.

Good Resource for K-Fold CV - [Here](https://machinelearningmastery.com/k-fold-cross-validation/)
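
As a minimal illustration of K-fold cross-validation, the sketch below scores a single MLP configuration over five stratified folds. The hidden-layer size and `max_iter` are hypothetical choices for this example, not the values selected by the grid search in this notebook. The scaler sits inside the pipeline so that it is re-fitted on each training fold.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scaling is part of the pipeline, so each fold is scaled with its own training statistics.
mlp_pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=300, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(mlp_pipe, X, y, cv=cv)

print("Per-fold accuracy:", fold_scores.round(4))
print("Mean accuracy:", fold_scores.mean().round(4))
```

`GridSearchCV` with `cv = 5` repeats this procedure for every parameter combination and keeps the one with the best mean score.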

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning

t_mlp = MLPClassifier(max_iter = 1,random_state = 42)
params = {
    "hidden_layer_sizes":[(20,20),(20,),(10,30,10)],
    "activation": ["tanh","relu"],
    "batch_size": [20,40],
    "solver":["sgd","adam"],
    "alpha": [0.0001,0.01],
    "learning_rate": ["constant","adaptive"],
    "nesterovs_momentum": [True,False],
    "learning_rate_init": [0.001,0.01],
    "momentum": [0.9,0.99]
}

clf = GridSearchCV(t_mlp,params,cv = 5)

# To ignore convergence warnings
# Ref - https://stackoverflow.com/questions/53784971/how-to-disable-convergencewarning-using-sklearn

@ignore_warnings(category=ConvergenceWarning)
def fit(X_train,y_train):
  clf.fit(X_train,y_train)

fit(X_train_scaled,y_train)


In [None]:
print(clf.best_params_)
print(clf.cv_results_['mean_test_score'].max())

In [None]:
#from sklearn.neural_network import MLPClassifier

In [None]:
classifier=MLPClassifier(max_iter = 100,random_state = 42,**clf.best_params_,early_stopping=True)
model=classifier.fit(X_train_scaled,y_train)

In [None]:
fig = plt.figure()
print("Final Loss = ",model.loss_curve_[-1])
plt.plot(range(len(model.loss_curve_)),model.loss_curve_)
plt.title("Train Loss Curve")
plt.xlabel("Number of Iterations")
plt.ylabel("Train Loss")

In [None]:
wandb.log({"MLP Train Loss Curve":fig})

In [None]:
fig = plt.figure()
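# validation_scores_ holds accuracy on the held-out validation split (available because early_stopping=True)
plt.plot(range(len(model.validation_scores_)),model.validation_scores_)
plt.title("Validation Score Curve")
plt.xlabel("Number of Iterations")
plt.ylabel("Validation Accuracy")

In [None]:
wandb.log({"MLP Validation Score Curve":fig})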

In [None]:
print("Accuracy over test dataset = ",model.score(X_test_scaled,y_test))

In [None]:
y_predict = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)
wandb.sklearn.plot_classifier(model,X_train_scaled, X_test_scaled,y_train, y_test,y_predict,y_proba,range(10),model_name="MLP")

In [None]:
from sklearn import metrics
y_predict=model.predict(X_train_scaled)
metrics.accuracy_score(y_train, y_pred=y_predict)

In [None]:
prob = model.predict_proba(X_test_scaled)
max_prob = np.round(prob.max(axis=1),4)
max_prob_idx = prob.argmax(axis=1)

from collections import defaultdict
max_prob_dict = defaultdict(list)
for idx in range(max_prob_idx.shape[0]):
  max_prob_dict[max_prob_idx[idx]].append(max_prob[idx])

max_prob_dict = {key:np.average(val) for (key,val) in max_prob_dict.items()}
max_value = max(list(max_prob_dict.values()))
max_prob_dict = {key:val/max_value for (key,val) in max_prob_dict.items()}
plt.stem(list(max_prob_dict.keys()),list(max_prob_dict.values()))


##<H2> Random Forest Classifier </H2>

In [None]:
from sklearn.ensemble import RandomForestClassifier as rfc

t_rfc = rfc(random_state = 42)
params = {
    "n_estimators":[100,500,1000],
    "criterion" : ["gini","entropy"]
}

clf = GridSearchCV(t_rfc,params,cv = 5)
clf.fit(X_train,y_train)

In [None]:
print(clf.best_params_)
print(clf.cv_results_['mean_test_score'].max())

In [None]:
classifier_2=rfc(random_state=42,**clf.best_params_)
model_2=classifier_2.fit(X_train_scaled,y_train)
print("Accuracy over test dataset = ",model_2.score(X_test_scaled,y_test) )
y_predict_2=model_2.predict(X_test_scaled)
print("Confusion matrix over test dataset:")
conf_mat_pred_2 = metrics.confusion_matrix(y_test,y_predict_2)
print(conf_mat_pred_2)
print("% of Incorrect predictions:")
a = np.diag((conf_mat - conf_mat_pred_2)).astype(np.float64)
b = np.diag(conf_mat.astype(np.float64))
print(np.round((100*(a/b)),2))

In [None]:
fig = plt.figure()
plt.plot(model_2.feature_importances_)
plt.title("Feature Importance")
plt.xlabel("Feature")
plt.ylabel("Importance Score")

In [None]:
wandb.log({"Feature Importance of Random Forest Classifier":fig})
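
Because each feature is a pixel, the importance scores can also be arranged in the 8 x 8 image layout to see which regions of the image the forest relies on. This is a sketch under stated assumptions: it trains a separate forest with 100 trees on the whole unscaled dataset instead of reusing `model_2`, and the names `forest` and `importance_map` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# One importance per pixel; reshape to the 8 x 8 image layout.
importance_map = forest.feature_importances_.reshape(8, 8)
print(np.round(importance_map, 3))

# Row and column of the most informative pixel.
row, col = np.unravel_index(importance_map.argmax(), importance_map.shape)
print(f"Most important pixel: row {row}, column {col}")

# In a notebook, plt.imshow(importance_map) would display this as a heat map.
```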

In [None]:
y_predict = model_2.predict(X_test_scaled)
y_proba = model_2.predict_proba(X_test_scaled)
wandb.sklearn.plot_classifier(model_2,X_train_scaled, X_test_scaled,y_train, y_test,y_predict,y_proba,range(10),model_name="RFC")

##<H2>Extra Trees Classifier</H2>

In [None]:
from sklearn.ensemble import ExtraTreesClassifier as etc

t_etc = etc(random_state = 42)
params = {
    "n_estimators":[100,500,1000],
    "criterion" : ["gini","entropy"]
}

clf = GridSearchCV(t_etc,params,cv = 5)
clf.fit(X_train,y_train)

In [None]:
print(clf.best_params_)
print(clf.cv_results_['mean_test_score'].max())

In [None]:
# from sklearn.ensemble import ExtraTreesClassifier as etc
classifier_3=etc(random_state=42,**clf.best_params_)
model_3=classifier_3.fit(X_train_scaled,y_train)
print("Accuracy over test dataset = ",model_3.score(X_test_scaled,y_test) )
y_predict_3=model_3.predict(X_test_scaled)
print("Confusion matrix over test dataset:")
conf_mat_pred_3 = metrics.confusion_matrix(y_test,y_predict_3)
print(conf_mat_pred_3)
print("% of Incorrect predictions:")
a = np.diag((conf_mat - conf_mat_pred_3)).astype(np.float64)
b = np.diag(conf_mat.astype(np.float64))
print(np.round((100*(a/b)),2))

In [None]:
y_predict = model_3.predict(X_test_scaled)
y_proba = model_3.predict_proba(X_test_scaled)
wandb.sklearn.plot_classifier(model_3,X_train_scaled, X_test_scaled,y_train, y_test,y_predict,y_proba,range(10),model_name="ETC")

##<H2>Support Vector Classifier</H2>

In [None]:
from sklearn.svm import SVC
classifier_4=SVC(C=10,gamma='auto',kernel='rbf',probability=True)  # probability=True enables predict_proba, used in the next cell
model_4=classifier_4.fit(X_train_scaled,y_train)
print(model_4.score(X_test_scaled,y_test))

In [None]:
y_predict = model_4.predict(X_test_scaled)
y_proba = model_4.predict_proba(X_test_scaled)
wandb.sklearn.plot_classifier(model_4,X_train_scaled, X_test_scaled,y_train, y_test,y_predict,y_proba,range(10),model_name="SVM")

##<H2> The Best Approach!</H2>

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, VarianceThreshold
f_reduced=SelectKBest(k=30)
vt=VarianceThreshold(threshold=0.1)
X_short=vt.fit_transform(X=X_train_scaled)
X_test_short=vt.transform(X_test_scaled)
p=make_pipeline(f_reduced,classifier_4)
model_5=p.fit(X_short,y_train)
print(model_5.score(X_test_short,y_test))

In [None]:
y_predict_5=model_5.predict(X_test_short)
print(metrics.confusion_matrix(y_test,y_predict_5))
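
The same steps can be chained into one pipeline that starts from the raw pixels, so that scaling and feature selection are always fitted on training data only. The sketch below is one possible arrangement, not the notebook's own code. It reuses the parameter values from the cells above and states `f_classif` explicitly, which is the default score function of `SelectKBest`.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

full_pipe = make_pipeline(
    StandardScaler(),
    VarianceThreshold(threshold=0.1),         # drops pixels that are constant after scaling
    SelectKBest(score_func=f_classif, k=30),  # keeps the 30 highest-scoring pixels
    SVC(C=10, gamma="auto", kernel="rbf"),
)
full_pipe.fit(X_tr, y_tr)

test_accuracy = full_pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

A single pipeline object can also be passed directly to `GridSearchCV` or `cross_val_score`.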

##<H2>Finally Visualise using T-SNE!</H2>

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def tsne_viz(data, n_components = 2):
    tsne = TSNE(n_components = n_components,perplexity=30.0,verbose=1,random_state = 0)
    return tsne.fit_transform(data)
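def plot_representations(data, labels):
    fig = plt.figure(figsize = (15, 15))
    ax = fig.add_subplot(111)
    # tab10 provides ten clearly distinct colours, one per digit
    scatter = ax.scatter(data[:, 0], data[:, 1], c = labels, cmap = 'tab10')
    # Build one legend entry per digit from the scatter's colour mapping
    ax.legend(*scatter.legend_elements(num=None), title="Digit")
    plt.show()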
tsne_data = tsne_viz(X_train_scaled)
plot_representations(tsne_data,y_train)
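
The perplexity is fixed at 30 in the cell above. As a first step towards tuning it, the sketch below embeds a 500-image subset (an assumption made only to keep the run short) at a few perplexity values and reports the final KL divergence of each run. The values 5, 30 and 50 are arbitrary, and KL divergences obtained at different perplexities are not strictly comparable, so the embeddings should still be inspected visually.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_small = StandardScaler().fit_transform(X[:500])  # subset to keep the run short

tsne_results = {}
for perplexity in [5, 30, 50]:
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embedding = tsne.fit_transform(X_small)
    tsne_results[perplexity] = (embedding, tsne.kl_divergence_)
    print(f"perplexity = {perplexity:2d}: KL divergence = {tsne.kl_divergence_:.3f}")
```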

# TO DO
Kernel PCA <br>
T-SNE with hyperparameter tuning