<a href="https://colab.research.google.com/github/Bryan-Az/Dimensionality_Reduction/blob/main/Dimensionality_Reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import os
import shutil
import sys
import zipfile
import gensim
import re
import warnings
warnings.filterwarnings('ignore')

# Dimensionality Reduction
In this notebook assignment I will apply the following techniques to the same set of scaled numerical features:
1. PCA (Principal Component Analysis) - linear projection onto the directions of greatest variance
2. SVD (Singular Value Decomposition) - linear matrix factorization, closely related to PCA
3. MDS (Multidimensional Scaling) - embedding that preserves pairwise distances
4. ISOMap (Isometric Mapping) - non-linear embedding that preserves geodesic distances on a neighbor graph
5. LLE (Locally Linear Embedding) - non-linear embedding that preserves local linear neighborhoods

and separately,

1. UMAP (Uniform Manifold Approximation and Projection)
2. t-SNE (t-Distributed Stochastic Neighbor Embedding)

UMAP and t-SNE are treated separately because they are more recent non-linear methods designed mainly for visualization. Both model the neighborhood relationships between points and optimize a low-dimensional layout that preserves this local structure, so they can reveal clusters that linear projections such as PCA and SVD may miss.

# Data Loading

## Data from the National Gallery of Art's (NGA) OpenData Project

## Data from the Kaggle Synthetic dataset from Inscribo on Cyber-Security Attacks

In [4]:
# unpacking art_tables.zip using zipfile library and then load the two csv's (latinamerican_art.csv & non_latinamerican_art.csv)
# into a single dataframe
nga_art_sample = None
with zipfile.ZipFile('./data_samples/art_tables.zip', 'r') as zip_ref:
    zip_ref.extractall('./data_samples/')
    for file in zip_ref.namelist():
        if file.endswith('.csv'):
            # sample 628 rows from the non_latinamerican csv and keep every row of the
            # latinamerican csv, which has very few rows (all columns are kept at this stage)
            to_sample = 'non_latinamerican' in file
            if nga_art_sample is None:
                if to_sample:
                    nga_art_sample = pd.read_csv('./data_samples/' + file, on_bad_lines='skip').sample(628)
                else:
                    nga_art_sample = pd.read_csv('./data_samples/' + file, on_bad_lines='skip')
            else:
                if to_sample:
                    nga_art_sample = pd.concat([pd.read_csv('./data_samples/' + file, on_bad_lines='skip').sample(628), nga_art_sample])
                else:
                    nga_art_sample = pd.concat([pd.read_csv('./data_samples/' + file, on_bad_lines='skip'), nga_art_sample])
# delete the unzipped data directory
shutil.rmtree('./data_samples/art_tables')

print(nga_art_sample.shape)
nga_art_sample.head()

(1256, 40)


Unnamed: 0,iiifurl,iiifthumburl,accessioned,title,displayDate_created,roletype,role,forwarddisplayname,birthyear,deathyear,...,uuid,viewtype,sequence,width,height,maxpixels,assistivetext,depictstmsobjectid,objectid,constituentid
118873,https://api.nga.gov/iiif/970c3904-dec8-4808-96...,https://api.nga.gov/iiif/970c3904-dec8-4808-96...,1,Untitled,"American, born Russia (now Latvia), 1903 - 1970",artist,artist,Mark Rothko,1903,1970,...,970c3904-dec8-4808-9660-d1cc70ef32ea,primary,0.0,3675.0,4850.0,640.0,,67637.0,67637.0,1839.0
187164,https://api.nga.gov/iiif/eee3878b-c311-415a-8e...,https://api.nga.gov/iiif/eee3878b-c311-415a-8e...,1,Je ne m'étonne plus si celui-la se tentaint tr...,"French, 1808 - 1879",artist,artist,Honoré Daumier,1808,1879,...,eee3878b-c311-415a-8e78-84d3ccfe0a99,primary,0.0,4000.0,3569.0,,,197495.0,197495.0,1209.0
66848,https://api.nga.gov/iiif/551f4728-cac9-4a7c-b7...,https://api.nga.gov/iiif/551f4728-cac9-4a7c-b7...,1,Untitled,"American, born Canada, 1906 - 1978",artist,artist,Ralston Crawford,1906,1978,...,551f4728-cac9-4a7c-b76b-8a1cf0de50fb,primary,0.0,4000.0,3226.0,640.0,,123231.0,123231.0,1180.0
148940,https://api.nga.gov/iiif/bdacbdcb-b35f-44bb-be...,https://api.nga.gov/iiif/bdacbdcb-b35f-44bb-be...,1,The Apparition in the Forest,"Austrian, 1804 - 1871",artist,artist,Moritz von Schwind,1804,1871,...,bdacbdcb-b35f-44bb-bed6-a9157092337d,primary,0.0,5243.0,3747.0,,,139103.0,139103.0,14998.0
6074,https://api.nga.gov/iiif/07f2272a-7722-4c39-b1...,https://api.nga.gov/iiif/07f2272a-7722-4c39-b1...,1,American Landscape,"American, 1882 - 1967",artist,artist,Edward Hopper,1882,1967,...,07f2272a-7722-4c39-b175-0ffc660c7910,primary,0.0,1433.0,859.0,,,36511.0,36511.0,1404.0


## Data Pre-Processing

In [5]:
nga_art_sample.columns

Index(['iiifurl', 'iiifthumburl', 'accessioned', 'title',
       'displayDate_created', 'roletype', 'role', 'forwarddisplayname',
       'birthyear', 'deathyear', 'ulanid', 'artistofngaobject', 'nationality',
       'constituenttype', 'beginyear_artistAssigned', 'endyear_artistAssigned',
       'country_artistAssigned', 'zipcode_artistAssigned', 'medium',
       'dimensions', 'inscription', 'markings', 'attribution',
       'visualBrowserClassification', 'parentID', 'isVirtual', 'portfolio',
       'series', 'volume', 'watermarks', 'uuid', 'viewtype', 'sequence',
       'width', 'height', 'maxpixels', 'assistivetext', 'depictstmsobjectid',
       'objectid', 'constituentid'],
      dtype='object')

In [6]:
# selecting the numerical columns plus two categorical columns ('medium', 'viewtype') from the dataset
selected_art_sample = nga_art_sample.loc[:, ['accessioned', 'birthyear', 'deathyear', 'width', 'height', 'medium', 'viewtype']]

In [7]:
# checking for null values
selected_art_sample.isna().sum()

accessioned      0
birthyear        5
deathyear      207
width            0
height           2
medium           0
viewtype         0
dtype: int64

In [8]:
selected_art_sample.reset_index(drop=False, inplace=True)

In [9]:
# coercing the numerical columns to a numeric (float) dtype and imputing null values with the column mean
# (assigning the result back avoids chained inplace fillna, which is unreliable in recent pandas)
for col in ['birthyear', 'deathyear', 'width', 'height']:
    selected_art_sample[col] = pd.to_numeric(selected_art_sample[col], errors='coerce')
    selected_art_sample[col] = selected_art_sample[col].fillna(selected_art_sample[col].mean())

# imputing the categorical string columns 'medium' and 'viewtype' with the most frequent value
for col in ['medium', 'viewtype']:
    selected_art_sample[col] = selected_art_sample[col].fillna(selected_art_sample[col].value_counts().index[0])


In [10]:
selected_art_sample.loc[:, ['medium', 'viewtype']].describe()

Unnamed: 0,medium,viewtype
count,1256,1256
unique,371,3
top,gelatin silver print,primary
freq,164,1244


In [11]:
selected_art_sample.describe()

Unnamed: 0,index,birthyear,deathyear,width,height
count,1256.0,1256.0,1256.0,1256.0,1256.0
mean,50797.926752,1867.106315,1937.003813,4072.839172,4266.672249
std,64216.752151,113.221536,111.151954,2210.318703,2149.894017
min,0.0,916.0,1125.0,640.0,584.0
25%,312.75,1855.0,1937.003813,3085.5,3193.25
50%,625.5,1902.0,1971.0,3547.0,4000.0
75%,101264.25,1921.25,1998.0,4000.0,4000.0
max,199460.0,1980.0,2020.0,35896.0,29009.0


## Scaling the selected art sample for use in Dimensionality Reduction

In [12]:
from sklearn.preprocessing import StandardScaler
# applying the standard scaler to the numerical data
scaler = StandardScaler()
selected_numerical_art_sample = selected_art_sample.loc[:, ['birthyear', 'deathyear', 'width', 'height']]
# fit and transform in a single step, keeping the original column names
scaled_numerical_art_sample = pd.DataFrame(scaler.fit_transform(selected_numerical_art_sample), columns=selected_numerical_art_sample.columns)

In [13]:
scaled_numerical_art_sample.describe()

Unnamed: 0,birthyear,deathyear,width,height
count,1256.0,1256.0,1256.0,1256.0
mean,8.14635e-16,-1.923444e-15,-5.6571870000000004e-18,5.0914690000000004e-17
std,1.000398,1.000398,1.000398,1.000398
min,-8.403747,-7.308259,-1.553716,-1.713638
25%,-0.1069685,-2.046426e-15,-0.4468733,-0.4994897
50%,0.3083122,0.3059751,-0.2379967,-0.1240891
75%,0.4784006,0.5489825,-0.03296728,-0.1240891
max,0.9975015,0.7469886,14.40328,11.51321
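
The `std` row above reads 1.000398 rather than exactly 1. This is consistent with `StandardScaler` dividing by the population standard deviation (`ddof=0`) while `describe()` reports the sample standard deviation (`ddof=1`). A minimal sketch of that arithmetic follows; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 1256  # number of rows in the art sample above
# Ratio between the sample std (ddof=1) and the population std (ddof=0)
ratio = np.sqrt(n / (n - 1))
print(round(ratio, 6))  # 1.000398

# Same effect on a small made-up column
toy = pd.DataFrame({'x': [1.0, 2.0, 4.0, 7.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
pop_std = scaled['x'].std(ddof=0)     # the quantity StandardScaler normalizes to 1
sample_std = scaled['x'].std(ddof=1)  # the quantity describe() reports
print(pop_std, sample_std)
```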


In [14]:
scaled_numerical_art_sample.to_csv('./data_samples/scaled_numerical_art_sample.csv', index=False)

# Simpler Linear and Manifold Methods

## PCA (Principal Component Analysis)

In [15]:
# applying PCA to the scaled data
from sklearn.decomposition import PCA
def apply_pca(scaled_df, n_components=2):
    pca = PCA(n_components=n_components)
    pca.fit(scaled_df)
    pca_df = pca.transform(scaled_df)
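    # name the columns from n_components so the function also works when n_components != 2
    pca_df = pd.DataFrame(pca_df, columns=[f'PC{i+1}' for i in range(n_components)])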
    return pca_df

In [16]:
pca_art = apply_pca(scaled_numerical_art_sample)

In [17]:
pca_art.head()

Unnamed: 0,PC1,PC2
0,-0.433374,-0.086205
1,0.782061,0.011783
2,-0.348387,-0.513362
3,0.702261,0.468966
4,0.398361,-1.947959
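
Before relying on a two-component projection it can help to check how much variance the components retain. The sketch below is one way to do that with `explained_variance_ratio_`; the array `X` is random stand-in data, not the art sample, so the printed numbers say nothing about the NGA features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up stand-in for scaled_numerical_art_sample: 200 rows, 4 features
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_  # share of total variance per component
print(ratios, ratios.sum())
```

On the notebook's data the same attribute could be read by fitting `PCA` on `scaled_numerical_art_sample` directly.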


## SVD (Singular Value Decomposition)

In [18]:
from sklearn.decomposition import TruncatedSVD

def apply_svd(scaled_df, n_components=2):
    svd = TruncatedSVD(n_components=n_components)
    svd.fit(scaled_df)
    svd_df = svd.transform(scaled_df)
    svd_df = pd.DataFrame(svd_df, columns=[f'SVD{i+1}' for i in range(n_components)])
    return svd_df


In [19]:
svd_art = apply_svd(scaled_numerical_art_sample)
svd_art.head()

Unnamed: 0,SVD1,SVD2
0,-0.433374,-0.086205
1,0.782061,0.011783
2,-0.348387,-0.513362
3,0.702261,0.468966
4,0.398361,-1.947959
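
The SVD1 and SVD2 values above are identical to PC1 and PC2. That is expected: `PCA` centers the data and then applies an SVD, while `TruncatedSVD` skips the centering, and the input here was already centered by `StandardScaler`. The sketch below illustrates this on made-up data; `algorithm='arpack'` is my choice for an exact comparison, not a setting used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
# Made-up data with clearly different variances per feature
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X_centered = X - X.mean(axis=0)

pca_scores = PCA(n_components=2).fit_transform(X_centered)
svd_scores = TruncatedSVD(n_components=2, algorithm='arpack').fit_transform(X_centered)

# Components are only defined up to sign, so compare absolute values
same_up_to_sign = np.allclose(np.abs(pca_scores), np.abs(svd_scores), atol=1e-6)

# Without centering, TruncatedSVD no longer matches PCA
shifted_scores = TruncatedSVD(n_components=2, algorithm='arpack').fit_transform(X_centered + 5.0)
differs_when_shifted = not np.allclose(np.abs(pca_scores), np.abs(shifted_scores), atol=1e-6)
print(same_up_to_sign, differs_when_shifted)
```

On standardized features, then, SVD adds no information beyond PCA; it becomes a distinct choice mainly for uncentered or sparse inputs.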


## MDS (Multidimensional Scaling)

In [20]:
from sklearn.manifold import MDS

def apply_mds(scaled_df, n_components=2):
    mds = MDS(n_components=n_components)
    mds_df = mds.fit_transform(scaled_df)
    mds_df = pd.DataFrame(mds_df, columns=[f'MDS{i+1}' for i in range(n_components)])
    return mds_df

In [21]:
mds_art = apply_mds(scaled_numerical_art_sample)
mds_art.head()

Unnamed: 0,MDS1,MDS2
0,-0.007395,-0.469333
1,0.017473,0.833203
2,-0.772755,-0.297951
3,0.529492,0.801523
4,-1.974494,0.510621
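
`MDS` starts from a random configuration, so the coordinates above will change between runs unless a seed is fixed. A minimal sketch, assuming a fixed `random_state` is acceptable for this assignment; the data and the seed value 42 are arbitrary illustrations.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))  # made-up stand-in for the scaled art features

mds_a = MDS(n_components=2, random_state=42)
emb_a = mds_a.fit_transform(X)
emb_b = MDS(n_components=2, random_state=42).fit_transform(X)

reproducible = np.allclose(emb_a, emb_b)  # same seed, same embedding
print(reproducible, mds_a.stress_)        # stress_ measures the distance mismatch
```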


## ISOMap (Isometric Mapping)

In [22]:
from sklearn.manifold import Isomap

def apply_isomap(scaled_df, n_components=2, n_neighbors=5):
    isomap = Isomap(n_components=n_components, n_neighbors=n_neighbors)
    isomap_df = isomap.fit_transform(scaled_df)
    isomap_df = pd.DataFrame(isomap_df, columns=[f'ISOMap{i+1}' for i in range(n_components)])
    return isomap_df

In [23]:
isomap_art = apply_isomap(scaled_numerical_art_sample)
isomap_art.head()

Unnamed: 0,ISOMap1,ISOMap2
0,-0.490022,-0.351413
1,1.047635,1.03643
2,-0.14436,-0.012947
3,0.611867,1.576779
4,0.163851,-1.650431
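
The Isomap result depends on `n_neighbors`, which is fixed at 5 above. One way to see how sensitive the embedding is would be to compare `reconstruction_error()` across a few values, as sketched below on made-up data. The candidate values 5, 10 and 20 are arbitrary.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))  # made-up stand-in for the scaled art features

errors = {}
for k in (5, 10, 20):
    iso = Isomap(n_components=2, n_neighbors=k).fit(X)
    # Mismatch between geodesic distances and distances in the embedding
    errors[k] = iso.reconstruction_error()

for k, err in errors.items():
    print(k, err)
```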


## LLE (Locally Linear Embedding)

In [29]:
from sklearn.manifold import LocallyLinearEmbedding

def apply_lle(scaled_df, n_components=2, n_neighbors=5):
    lle = LocallyLinearEmbedding(n_components=n_components, n_neighbors=n_neighbors, eigen_solver = 'dense')
    lle_df = lle.fit_transform(scaled_df)
    lle_df = pd.DataFrame(lle_df, columns=[f'LLE{i+1}' for i in range(n_components)])
    return lle_df


In [30]:
lle_art = apply_lle(scaled_numerical_art_sample)
lle_art.head()

Unnamed: 0,LLE1,LLE2
0,-0.001767,-0.057712
1,0.01645,0.001881
2,-0.004483,0.04987
3,0.01645,0.001881
4,0.008406,-0.002841
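
In the output above, rows 1 and 3 share the same LLE coordinates to the displayed precision although their PCA coordinates differ, which may mean that LLE with `n_neighbors=5` places some distinct artworks on the same point. The sketch below shows one way to check for this by counting distinct embedded points; it uses made-up data and `n_neighbors=10`, both of which are my assumptions.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))  # made-up stand-in for the scaled art features

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, eigen_solver='dense')
emb = lle.fit_transform(X)

# Count embedded points that remain distinct after rounding to 6 decimals
n_distinct = np.unique(np.round(emb, 6), axis=0).shape[0]
print(n_distinct, 'distinct points out of', emb.shape[0])
print(lle.reconstruction_error_)
```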


# Complex Non-linear Methods

## UMAP (Uniform Manifold Approximation and Projection)

In [33]:
%pip install umap-learn

Successfully installed pynndescent-0.5.11 umap-learn-0.5.5


In [34]:
from umap import UMAP

def apply_umap(scaled_df, n_components=2, n_neighbors=5):
    umap = UMAP(n_components=n_components, n_neighbors=n_neighbors)
    umap_df = umap.fit_transform(scaled_df)
    umap_df = pd.DataFrame(umap_df, columns=[f'UMAP{i+1}' for i in range(n_components)])
    return umap_df


In [38]:
umap_art = apply_umap(scaled_numerical_art_sample)
umap_art.head()

Unnamed: 0,UMAP1,UMAP2
0,0.199916,5.620243
1,-4.659243,11.928306
2,0.344523,16.047178
3,-4.662936,11.917882
4,-2.043802,9.956807


In [39]:
umap_art.to_csv('./data_samples/umap_art.csv', index=False)

## t-SNE (t-Distributed Stochastic Neighbor Embedding)

In [35]:
from sklearn.manifold import TSNE

def apply_tsne(scaled_df, n_components=2, perplexity=30, learning_rate=200):
    tsne = TSNE(n_components=n_components, perplexity=perplexity, learning_rate=learning_rate)
    tsne_df = tsne.fit_transform(scaled_df)
    tsne_df = pd.DataFrame(tsne_df, columns=[f't-SNE{i+1}' for i in range(n_components)])
    return tsne_df


In [36]:
tsne_art = apply_tsne(scaled_numerical_art_sample)
tsne_art.head()

Unnamed: 0,t-SNE1,t-SNE2
0,2.089915,-23.602427
1,9.56338,-1.85476
2,-12.46655,6.00473
3,9.95203,-2.543584
4,-5.66789,-9.673102
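
t-SNE is stochastic, so the coordinates above will differ on every run, and `perplexity` has to be smaller than the number of samples. A minimal sketch with a fixed seed follows; the data and the seed value are arbitrary, and `kl_divergence_` is printed only as a rough indicator of how the optimization ended.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # made-up stand-in for the scaled art features

# perplexity=30 is valid here because it is below the 100 samples
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
emb = tsne.fit_transform(X)
print(emb.shape, tsne.kl_divergence_)
```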


In [40]:
tsne_art.to_csv('./data_samples/tsne_art.csv', index=False)
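
The introduction states that UMAP and t-SNE preserve local structure better than the linear methods. One way to check that on a given dataset is scikit-learn's `trustworthiness` score, which lies between 0 and 1 and is higher when neighbors in the embedding were also neighbors in the original space. The sketch below compares PCA and t-SNE on made-up data, so the numbers it prints are not results for the art sample.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))  # made-up stand-in for the scaled art features

pca_emb = PCA(n_components=2).fit_transform(X)
tsne_emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

scores = {
    'PCA': trustworthiness(X, pca_emb, n_neighbors=5),
    't-SNE': trustworthiness(X, tsne_emb, n_neighbors=5),
}
for name, score in scores.items():
    print(name, round(score, 3))
```

The same call could be applied to `pca_art`, `tsne_art` and the other embeddings in this notebook, with `scaled_numerical_art_sample` as the reference.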