# Noise cancellation

Let's see another use case of PCA: noise reduction. In this exercise, we will need to classify handwritten digits. Unfortunately, the dataset is extremely noisy. Let's see how PCA can help us!
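Before touching the real data, here is a minimal sketch of why PCA can remove noise. It uses synthetic data (the sizes, the noise level and every variable name are assumptions made for illustration, not part of the exercise): the signal lives on a few directions, the noise is spread over all of them, so keeping only the top components discards most of the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples that truly live on 3 directions of a 50-D space
n_samples, n_features, true_rank = 200, 50, 3
latent = rng.normal(size=(n_samples, true_rank))
mixing = rng.normal(size=(true_rank, n_features))
clean = latent @ mixing
noisy = clean + rng.normal(scale=1.0, size=clean.shape)

# PCA through the SVD of the centered data: rows of Vt are the principal directions
mean = noisy.mean(axis=0)
U, S, Vt = np.linalg.svd(noisy - mean, full_matrices=False)

# Project on the top-k directions, then map back to the original space
k = true_rank
scores = (noisy - mean) @ Vt[:k].T
denoised = scores @ Vt[:k] + mean

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE noisy: {err_noisy:.3f} | MSE denoised: {err_denoised:.3f}")
```

On real images the clean signal is unknown, so the number of components has to be chosen another way, for example from the explained variance.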

1. Import `pandas`, then download the data from these two URLs:
    * images 👉👉 <a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/PCA/noisy_digits.csv" target="_blank">Download</a>

    * labels 👉👉 <a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/PCA/noisy_digits_labels.csv" target="_blank">Download</a>

In [1]:
import pandas as pd
import plotly.express as px
# import plotly.graph_objects as go
# import plotly.io as pio
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.decomposition import PCA





In [2]:
url_X = "https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/PCA/noisy_digits.csv"
url_y = "https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/PCA/noisy_digits_labels.csv"

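# Read the CSV files straight from the URLs above
digits = pd.read_csv(url_X)
labels = pd.read_csv(url_y)

# Alternative: read local copies if you have already downloaded the files
# digits = pd.read_csv("../../12_assets/06_unsupervised_ML/noisy_digits.csv")
# labels = pd.read_csv("../../12_assets/06_unsupervised_ML/noisy_digits_labels.csv")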

digits.head()

Unnamed: 0.1,Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,-176.321245,-52.74385,143.604939,-53.567749,80.289381,20.686379,-197.423519,-229.401206,-221.900009,...,34.40707,1.950735,-25.095565,133.684095,-21.664094,-94.305438,-55.987821,-89.929231,40.394774,-214.75448
1,1,-158.421239,16.371695,62.810879,263.533916,-193.92032,-25.366668,107.062706,125.403427,83.536343,...,-5.419961,-123.58403,20.240434,-25.699206,-128.54593,52.525885,-54.214887,-133.842624,-30.141215,210.408665
2,2,-290.34375,81.5865,12.615232,146.567851,111.233602,-188.989259,-101.464605,-107.015195,-13.069827,...,-139.909318,-85.214133,167.495617,62.402411,-144.40297,152.26395,-4.687051,-59.270131,-93.1936,188.229794
3,3,-208.84059,136.190431,38.552191,-67.825346,24.316303,176.103673,31.581298,-163.582673,29.777077,...,-131.385192,40.329733,-10.111639,163.497435,41.010287,-21.408008,328.274235,-15.341672,121.570863,151.757537
4,4,-328.876288,-42.8629,174.651874,-228.833439,71.909654,-97.206392,48.048853,-34.071313,3.820465,...,-135.713199,-71.396796,155.237981,-141.860908,155.657335,166.60976,-52.911774,267.150703,-36.749672,131.913772


In [3]:
digits.describe()

Unnamed: 0.1,Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
count,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,...,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0
mean,20999.5,0.508865,0.161267,-0.136474,0.504606,-0.392227,-1.146996,0.065945,-0.932743,1.202804,...,0.329851,0.121329,-0.655444,-0.100407,0.276097,-0.1404,-0.625908,1.870961,-0.466101,-0.757627
std,12124.49999,125.424203,125.957019,124.81408,124.209803,125.421539,125.156993,124.927893,124.995896,125.198035,...,124.154492,124.397062,124.896809,124.596065,124.836661,125.61645,124.783298,125.710628,125.45683,125.542711
min,0.0,-536.272379,-563.581377,-596.266187,-515.063938,-532.306749,-513.45721,-543.357776,-537.067105,-487.779478,...,-461.635368,-506.789421,-474.601499,-499.90778,-509.357382,-525.725612,-535.190814,-504.797562,-511.439969,-519.477568
25%,10499.75,-83.800118,-84.915914,-84.807013,-82.854817,-85.834707,-84.97027,-83.641462,-85.161672,-82.8879,...,-84.349026,-83.802971,-85.49139,-84.701319,-84.307686,-84.130186,-84.127703,-83.530383,-84.919834,-85.62035
50%,20999.5,0.860217,0.886333,0.763704,1.024988,-0.613646,-0.94217,0.750785,-1.703407,1.237868,...,-0.523211,0.359867,0.253873,-0.603624,0.632699,-0.794485,-0.84812,1.677145,-1.445551,-0.848889
75%,31499.25,84.104507,84.746673,84.124227,84.558812,84.176809,83.098343,84.426144,83.654107,85.635613,...,84.494976,85.120277,83.715392,83.297809,85.881156,84.257741,84.000764,86.770448,83.957442,84.193126
max,41999.0,498.64154,521.528036,505.180363,497.784712,557.447725,496.63909,502.00272,511.211431,501.472247,...,496.220224,506.10578,608.211734,533.828554,585.756919,484.635537,471.13596,581.003313,505.34585,515.791515


In [4]:
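# Keep independent copies of the raw data (plain assignment would only create aliases)
digits_bckup, labels_bckup = digits.copy(), labels.copy()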

2. Remove the first column of both the images and the labels (it only contains the row number)

In [5]:
digits = digits.iloc[:, 1:]
labels = labels.iloc[:, 1:]
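The `Unnamed: 0` column looks like an index that was saved together with the data. A minimal sketch of where such a column comes from, and of an alternative way to get rid of it at read time, using a hypothetical two-row CSV rather than the real files:

```python
import io
import pandas as pd

# Hypothetical miniature CSV written the way `DataFrame.to_csv()` does by default:
# the index is stored as a first column without a name.
csv_text = ",pixel0,pixel1\n0,1.5,-2.0\n1,0.3,4.1\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))  # the saved index comes back as "Unnamed: 0"

# Approach used in this notebook: slice the first column away
dropped = raw.iloc[:, 1:]

# Alternative: tell pandas at read time that the first column is the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```

Both approaches leave only the pixel columns; `index_col=0` simply avoids the extra slicing step.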

3. Visualize images by using `numpy` and `plotly`
* You can use `numpy`'s `.reshape()` 👉 images are 28x28 pixels 😉
* You can also use `px.imshow` from `plotly` to visualize images
* This post may help you: https://stackoverflow.com/questions/64268081/creating-a-subplot-of-images-with-plotly

In [6]:
fig = px.imshow(digits.values[0:10].reshape(10, 28, 28), binary_string=True, facet_col=0, facet_col_wrap=5)
fig.show()
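The reshape step can be checked in isolation. This sketch uses a made-up array instead of the real pixels and assumes the 784 columns are stored row by row (row-major), as is usual for this kind of dataset:

```python
import numpy as np

# Hypothetical stand-in for ten flattened images (10 rows x 784 pixel columns)
flat = np.arange(10 * 784, dtype=float).reshape(10, 784)

# One 28x28 grid per image; -1 lets numpy infer the number of images
images = flat.reshape(-1, 28, 28)
print(images.shape)

# Row-major layout: pixel k of an image lands at row k // 28, column k % 28
k = 300
assert images[0, k // 28, k % 28] == flat[0, k]

# Flattening again gives back the original rows
restored = images.reshape(len(images), -1)
```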

4. Use `train_test_split` from `sklearn` to split your dataset into a train and a test set.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(digits, labels, random_state=0)
X_train.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
5837,24.28918,45.983646,256.352134,-26.261889,11.68343,-205.429795,-137.313894,29.045135,95.337022,-90.620796,...,13.850705,50.352009,-28.988928,57.604085,36.024349,15.966126,9.261994,126.795466,-181.947605,-103.029471
26919,100.230434,44.166899,186.224206,307.291983,-266.608791,174.360175,-134.828176,215.091557,-205.00395,181.104424,...,143.917566,-33.805104,2.545959,-60.373496,65.902161,-185.31767,156.806654,-9.674685,-117.225176,47.191232
15177,-47.699431,-167.740414,238.069692,-98.327298,-224.500853,67.918157,1.317024,52.201427,117.235908,13.082838,...,62.49973,-78.223355,187.178508,-149.544457,186.401464,-55.541073,103.616753,318.404856,-229.352298,-34.503929
14046,-11.944284,-209.514843,113.523511,-5.111774,182.322677,73.826136,-77.099973,102.352925,60.394222,149.039467,...,-174.471094,42.355302,91.010817,-54.704095,-90.580398,81.801558,16.496764,-49.47695,253.968096,61.867205
22315,-85.086368,-133.186743,142.284695,-189.976676,-278.072652,130.695004,91.594462,33.159837,124.973469,-269.550718,...,23.206283,-227.845723,-64.189901,-132.225714,-126.693863,-76.637899,32.98208,-144.533851,116.305279,-75.537392
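When no size is given, `train_test_split` puts 25% of the rows in the test set. The sketch below shows this on hypothetical toy arrays, together with the optional `stratify` argument, which is not used in this notebook but keeps the class proportions identical in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples, 10 perfectly balanced classes
X = np.arange(1000 * 4).reshape(1000, 4)
y = np.repeat(np.arange(10), 100)

# Default behaviour: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(len(X_tr), len(X_te))

# Optional: stratify on the labels so every digit is equally represented
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=0, stratify=y
)
test_counts = np.bincount(y_te_s)
print(test_counts)
```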


5. Standardize your train set, then apply the same transformation (fitted on the train set only) to your test set.

In [9]:
sc = StandardScaler()
X_sc_train = sc.fit_transform(X_train)
X_sc_test = sc.transform(X_test)
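To make explicit what "fit on train, transform on test" means, this sketch compares `StandardScaler` with the manual formula on hypothetical random data (the `_demo` names are illustrative): the test set is rescaled with the mean and standard deviation of the train set, never with its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_test_demo = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train_demo)  # learns mean and std, then scales
test_scaled = scaler.transform(X_test_demo)        # reuses the train statistics

# Manual equivalent, using the population standard deviation (ddof=0)
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)
test_manual = (X_test_demo - mu) / sigma
```

Fitting the scaler on the test set as well would leak information about the test data into the preprocessing.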

6. Import `SVC` from `sklearn.svm`, fit it on your standardized train set and check the score on the test set.

In [10]:
svc = SVC(random_state=0, verbose=True)
# scikit-learn expects a 1-D target, so flatten the single-column labels DataFrame
svc.fit(X_sc_train, y_train.values.ravel())
svc.score(X_sc_test, y_test.values.ravel())

[LibSVM]

0.886
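scikit-learn estimators expect the target as a 1-D array. A one-column DataFrame such as `labels` has shape `(n_samples, 1)`, which makes `fit` emit a `DataConversionWarning` suggesting `ravel()`. A small sketch of the shapes involved, with a hypothetical column name `label`:

```python
import pandas as pd

# Hypothetical one-column labels DataFrame (the real column name may differ)
labels_df = pd.DataFrame({"label": [3, 1, 4, 1, 5]})
print(labels_df.shape)  # two dimensions: rows and a single column

# Option 1: flatten the underlying array
y_1d = labels_df.values.ravel()

# Option 2: select the column, which gives a 1-D Series
y_series = labels_df["label"]

print(y_1d.shape, y_series.shape)
```

Either form can be passed to `fit` and `score` without triggering the warning.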

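7. Import `PCA` from `sklearn.decomposition` and apply it on your train set.
> Keep only 10% of the explained variance

In [11]:
# A float between 0 and 1 keeps just enough components to exceed that share of explained variance
pca = PCA(n_components=0.10)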

X_pca_train = pca.fit_transform(X_sc_train)
X_pca_train[:5]

array([[-5.96208316, -0.42351373,  0.44470052,  0.57060786,  2.95242258,
        -1.17188072,  1.05069839, -0.13964367, -1.40303368, -2.92619999,
         0.38387206,  1.57951042, -0.54456549],
       [ 0.56433535, -1.22875461,  1.80909413, -2.79629372, -4.66531164,
         0.60056317, -1.57775989, -2.22058894, -0.82044612, -0.44829183,
         0.40847823, -1.15397558,  2.00880282],
       [-2.74876075,  0.57114036, -3.51933441, -1.01876267,  0.62326501,
         0.82822044, -0.02896527, -0.4580741 , -0.16667535,  4.56519067,
         0.75325915,  0.89821879, -0.92016151],
       [-2.35736473, -0.12577577, -1.94688851,  1.68370608, -4.65432857,
         5.51621668, -1.15504779,  1.41742162, -3.18671343,  2.80454839,
         0.6841405 ,  0.05205339,  0.63961574],
       [ 0.20698548, -4.81274701,  2.90348343,  4.22852925, -1.93502761,
         0.08684097,  0.63637024,  1.34539204,  1.92582796,  0.39527086,
        -1.84302689, -0.26067534, -3.43039347]])
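Each row above has 13 values: 13 components were enough to pass the variance threshold. The sketch below illustrates, on hypothetical random data, how a float `n_components` is turned into a number of components: scikit-learn keeps the smallest number of components whose cumulative explained variance ratio exceeds the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical toy data: 20 features with decreasing variance
X_demo = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.5, 20)

threshold = 0.90
pca_demo = PCA(n_components=threshold, svd_solver="full").fit(X_demo)

# Cumulative explained variance of a full PCA, for comparison
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)

n_kept = pca_demo.n_components_
print(n_kept, cumulative[n_kept - 1])
```

Plotting `cumulative` against the number of components is a common way to choose the threshold.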

8. Get the number of components

In [12]:
# components_ has shape (n_components, n_features): 13 components over 784 pixels
pca.components_.shape

(13, 784)

9. Apply the PCA fitted on the train set to your test set (use `transform`, not `fit_transform`)

In [13]:
X_pca_test = pca.transform(X_sc_test)
X_pca_test

array([[ 2.6491316 , -0.75974154,  7.93754891, ...,  1.64721785,
         1.8857297 , -1.97130382],
       [-1.55785053,  3.56492638, -1.64061527, ..., -3.02679   ,
        -1.97377365,  1.36121021],
       [-3.06767872,  5.21224128, -1.97658852, ..., -2.18662984,
         1.31071074,  1.54193635],
       ...,
       [-1.71394589,  2.19339732,  3.36030249, ..., -0.34881392,
         0.04000389,  0.32146327],
       [11.29534561, -0.64661267,  2.9625681 , ..., -1.59349459,
         0.74569828,  0.71080007],
       [ 3.58384581,  3.95782724,  6.30025409, ...,  3.87922446,
         0.85645358,  1.47063532]])

10. Visualize your new images after applying PCA
> NB: You will need to apply `.inverse_transform()` method on your PCA

In [14]:
fig = px.imshow(pca.inverse_transform(X_pca_train[0:10]).reshape(10, 28, 28), binary_string=True, facet_col=0, facet_col_wrap=5)
fig.show()
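`inverse_transform` maps the reduced coordinates back to the original feature space using only the components that were kept. A minimal sketch on hypothetical random data, assuming the default `whiten=False`, shows that it amounts to a matrix product plus the mean:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))

pca_demo = PCA(n_components=5).fit(X_demo)
Z = pca_demo.transform(X_demo)          # shape (200, 5): reduced coordinates
X_back = pca_demo.inverse_transform(Z)  # shape (200, 30): back in feature space

# Manual equivalent when whiten=False
X_manual = Z @ pca_demo.components_ + pca_demo.mean_
```

In this notebook the PCA was fitted on standardized data, so the reconstructed images are in standardized units; calling `sc.inverse_transform` on them would bring them back to the original pixel scale.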

11. Train a new SVM on your data after applying PCA.

In [15]:
# Refit the same classifier on the PCA-reduced features (1-D labels avoid the shape warning)
svc.fit(X_pca_train, y_train.values.ravel())
svc.score(X_pca_test, y_test.values.ravel())

[LibSVM]

0.8999047619047619
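The whole workflow (scaling, PCA, SVM) can also be chained in a scikit-learn pipeline. The sketch below is not the notebook's data: it assumes scikit-learn's bundled 8x8 digits with synthetic Gaussian noise as a stand-in, and the noise level and the 50% variance threshold are arbitrary choices, so the scores it prints are not comparable with the ones above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: bundled 8x8 digits plus synthetic noise (assumption for illustration)
X_demo, y_demo = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = X_demo + rng.normal(scale=4.0, size=X_demo.shape)

X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y_demo, random_state=0)

models = {
    "no PCA": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "with PCA": make_pipeline(
        StandardScaler(), PCA(n_components=0.5), SVC(random_state=0)
    ),
}

# Each pipeline fits its preprocessing on the train set only, then scores on the test set
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(name, round(scores[name], 3))
```

A pipeline guarantees that the scaler and the PCA are fitted on the train set only, which is the same discipline applied by hand in steps 5 to 9.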