<a href="https://colab.research.google.com/github/Bryan-Az/Dimensionality_Reduction/blob/main/Dimensionality_Reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import os
import shutil
import sys
import zipfile
import gensim
import re
import warnings
warnings.filterwarnings('ignore')

# Dimensionality Reduction
In this notebook assignment I will be using the following techniques:
1. PCA (Principal Component Analysis) - univariate on distinct features [cybersecurity data]
2. SVD (Singular Value Decomposition) - univariate on distinct feature [cybersecurity data]
3. MDS (Multidimensional Scaling) - multivariate on distinct feature [art data]
4. ISOMap (Isometric Mapping) - multivariate on distinct feature [art data]
5. LLE (Locally Linear Embedding) - multivariate on distinct feature [art data]

and separately,

1. UMAP (Uniform Manifold Approximation and Projection) [cybersecurity data]
2. t-SNE (t-Distributed Stochastic Neighbor Embedding) [art data]

UMAP and t-SNE are listed separately because both are non-linear, neighbor-based methods designed to preserve the local structure of the data. Unlike the linear projections produced by PCA and SVD, they can capture complex, non-linear relationships in high-dimensional data and often reveal clusters that linear techniques miss.

# Data Loading

## Data from the Kaggle Synthetic dataset from Inscribo on Cyber-Security Attacks

In [4]:
inscribo_cybersecurity_df = pd.read_csv('./data_samples/cybersecurity_attacks.csv')

In [8]:
inscribo_cybersecurity_df.shape

(40000, 25)

In [9]:
# the full dataset has 40,000 rows, so we take a random sample of 1,000 rows
inscribo_cybersecurity_df = inscribo_cybersecurity_df.sample(1000)
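
`DataFrame.sample` draws a different random subset on every run, so the rows shown below will change whenever the notebook is re-executed. A minimal sketch of a repeatable draw, assuming a fixed seed is acceptable for this analysis (the seed value 42 and the toy frame are illustrative and not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the 40,000-row cybersecurity table (hypothetical data).
toy_df = pd.DataFrame({"Packet Length": range(40_000)})

# Passing random_state makes the 1000-row draw repeatable across runs.
sample_a = toy_df.sample(1000, random_state=42)
sample_b = toy_df.sample(1000, random_state=42)

print(sample_a.index.equals(sample_b.index))
```

With a fixed seed, the downstream embeddings are computed on the same rows each time, which makes runs easier to compare.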

In [10]:
inscribo_cybersecurity_df.head()

Unnamed: 0,Timestamp,Source IP Address,Destination IP Address,Source Port,Destination Port,Protocol,Packet Length,Packet Type,Traffic Type,Payload Data,...,Action Taken,Severity Level,User Information,Device Information,Network Segment,Geo-location Data,Proxy Information,Firewall Logs,IDS/IPS Alerts,Log Source
29692,2020-08-16 05:06:13,164.213.89.116,144.27.113.174,17034,65017,ICMP,913,Control,DNS,Dolore iure deleniti exercitationem placeat. D...,...,Logged,Medium,Zoya Dugar,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_8)...,Segment B,"Howrah, Haryana",,Log Data,,Server
2985,2023-09-29 04:25:02,11.211.232.94,29.209.83.95,64406,29928,ICMP,756,Control,FTP,Molestias expedita ratione. Occaecati in numqu...,...,Ignored,Medium,Hansh Dhawan,Mozilla/5.0 (Windows 95) AppleWebKit/532.1 (KH...,Segment C,"Mango, Jharkhand",,,Alert Data,Server
39309,2022-10-28 22:08:17,88.17.184.100,3.11.219.244,55677,21548,TCP,794,Data,FTP,Quod doloribus sit corrupti ea impedit est. Mo...,...,Logged,High,Elakshi Gopal,Mozilla/5.0 (Windows; U; Windows NT 10.0) Appl...,Segment B,"Meerut, Manipur",,Log Data,,Firewall
20168,2020-05-21 11:15:13,130.228.41.83,168.146.63.236,46159,23924,TCP,239,Control,FTP,Deleniti quasi reiciendis libero. Repellendus ...,...,Blocked,High,Vivaan Sule,Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_9...,Segment C,"Durgapur, Gujarat",,Log Data,,Server
8917,2022-04-04 20:44:39,198.99.16.135,173.13.217.227,64175,2809,TCP,373,Data,HTTP,Quasi exercitationem alias eum repudiandae rep...,...,Blocked,Low,Mehul Sood,Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_5_1...,Segment A,"Burhanpur, Gujarat",5.149.128.161,Log Data,Alert Data,Server


## Data from the National Gallery of Art's (NGA) OpenData Project

In [6]:
# unpacking art_tables.zip using the zipfile library, then loading the two csv's
# (latinamerican_art.csv & non_latinamerican_art.csv) into a single dataframe
frames = []
with zipfile.ZipFile('./data_samples/art_tables.zip', 'r') as zip_ref:
    zip_ref.extractall('./data_samples/')
    for file in zip_ref.namelist():
        if file.endswith('.csv'):
            df = pd.read_csv('./data_samples/' + file, on_bad_lines='skip')
            # latinamerican has very few rows, so we keep all of them;
            # from the non_latinamerican file we only sample 628 rows
            if 'non_latinamerican' in file:
                df = df.sample(628)
            # prepend, so that frames read later come first in the result
            frames.insert(0, df)
nga_art_sample = pd.concat(frames)
# delete the unzipped data directory
shutil.rmtree('./data_samples/art_tables')

print(nga_art_sample.shape)
nga_art_sample.head()

(1256, 40)


Unnamed: 0,iiifurl,iiifthumburl,accessioned,title,displayDate_created,roletype,role,forwarddisplayname,birthyear,deathyear,...,uuid,viewtype,sequence,width,height,maxpixels,assistivetext,depictstmsobjectid,objectid,constituentid
50414,https://api.nga.gov/iiif/403a32d8-559d-4d8d-a7...,https://api.nga.gov/iiif/403a32d8-559d-4d8d-a7...,1,Afghan or Circassian Armour,"Scottish, 1802 - 1870",artist,artist,David Octavius Hill,1802,1870,...,403a32d8-559d-4d8d-a790-a191b72d25dd,primary,0.0,2681.0,4000.0,,,137701.0,137701.0,13362.0
140646,https://api.nga.gov/iiif/b2e40894-7e5c-462e-ae...,https://api.nga.gov/iiif/b2e40894-7e5c-462e-ae...,1,Sea Bird Saga VII,"American, born 1899",artist,artist,Wallace Bradstreet Putnam,1899,1999,...,b2e40894-7e5c-462e-ae00-7d56972ad036,primary,0.0,3421.0,2708.0,640.0,,55885.0,55885.0,5286.0
15208,https://api.nga.gov/iiif/1353c8f4-62d4-4c48-b7...,https://api.nga.gov/iiif/1353c8f4-62d4-4c48-b7...,1,"Narrative, of a Five Years' Expedition, agains...","Florentine, 1727 - 1815",artist,artist,Francesco Bartolozzi,1727,1815,...,1353c8f4-62d4-4c48-b7f0-e466eaf43999,alternate,26.0,8095.0,11249.0,,,61222.0,61222.0,2974.0
29212,https://api.nga.gov/iiif/2515e1bd-0e66-4969-9d...,https://api.nga.gov/iiif/2515e1bd-0e66-4969-9d...,1,Man Uncovering a Woman (Homme dévoilant une fe...,"Spanish, 1881 - 1973",artist,artist,Pablo Picasso,1881,1973,...,2515e1bd-0e66-4969-9d0e-dac03bd69821,primary,0.0,3230.0,4000.0,640.0,,39108.0,39108.0,1783.0
6744,https://api.nga.gov/iiif/08cb9ab1-28dd-4935-8a...,https://api.nga.gov/iiif/08cb9ab1-28dd-4935-8a...,1,David and Bathsheba,"German, 1514 - 1562",artist,artist,Virgil Solis,1514,1562,...,08cb9ab1-28dd-4935-8a98-f9f8fbe3c721,primary,0.0,3321.0,3327.0,,,72070.0,72070.0,2753.0


# Data Pre-Processing

## Processing data from Inscribo

In [7]:
inscribo_cybersecurity_df.columns

Index(['Timestamp', 'Source IP Address', 'Destination IP Address',
       'Source Port', 'Destination Port', 'Protocol', 'Packet Length',
       'Packet Type', 'Traffic Type', 'Payload Data', 'Malware Indicators',
       'Action Taken', 'Severity Level', 'User Information',
       'Device Information', 'Network Segment', 'Geo-location Data',
       'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts', 'Log Source'],
      dtype='object')

In [11]:
inscribo_cybersecurity_df.describe() # describe helps identify the columns with numerical values as it calculates the mean, std, min, max, etc.

Unnamed: 0,Source Port,Destination Port,Packet Length,Anomaly Scores
count,1000.0,1000.0,1000.0,1000.0
mean,33123.418,33271.691,786.411,50.93135
std,18026.598371,18712.683922,417.292635,28.615735
min,1109.0,1376.0,64.0,0.15
25%,17845.75,16981.5,403.0,26.685
50%,33409.0,33675.5,795.5,50.675
75%,48474.25,49348.5,1150.25,76.175
max,65433.0,65532.0,1498.0,99.97


In [14]:
selected_cybersecurity_sample = inscribo_cybersecurity_df.loc[:, ['Source Port', 'Destination Port', 'Packet Length', 'Anomaly Scores', 'Source IP Address', 'Destination IP Address']]
selected_cybersecurity_sample.head()

Unnamed: 0,Source Port,Destination Port,Packet Length,Anomaly Scores,Source IP Address,Destination IP Address
29692,17034,65017,913,28.53,164.213.89.116,144.27.113.174
2985,64406,29928,756,66.1,11.211.232.94,29.209.83.95
39309,55677,21548,794,26.66,88.17.184.100,3.11.219.244
20168,46159,23924,239,86.95,130.228.41.83,168.146.63.236
8917,64175,2809,373,32.26,198.99.16.135,173.13.217.227


### Filling in missing values

In [15]:
selected_cybersecurity_sample.isna().sum() # checking for null values; there are none

Source Port               0
Destination Port          0
Packet Length             0
Anomaly Scores            0
Source IP Address         0
Destination IP Address    0
dtype: int64

### Scaling the selected cybersecurity sample for use in Dimensionality Reduction

Scaling the Source IP and Destination IP addresses will not work because, unlike Packet Length and Anomaly Scores, they are not continuous numerical values. For the purposes of data analysis they are categorical, since each address is a string made up of several numerical sub-strings joined by dots. We will therefore only scale Source Port, Destination Port, Packet Length, and Anomaly Scores.
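
To illustrate the point: each address is a single string, which `StandardScaler` cannot treat as a number. One possible alternative, not used in this notebook, is to split an address into its four octets so that each becomes an integer column. The helper name `split_ip_octets` and the output column names are assumptions made for this sketch; the two addresses are taken from the sample rows above.

```python
import pandas as pd

def split_ip_octets(ip_series: pd.Series, prefix: str) -> pd.DataFrame:
    """Split dotted IPv4 strings into four integer columns (hypothetical helper)."""
    octets = ip_series.str.split(".", expand=True).astype(int)
    octets.columns = [f"{prefix}_octet{i + 1}" for i in range(4)]
    return octets

ips = pd.Series(["164.213.89.116", "11.211.232.94"], name="Source IP Address")
octet_df = split_ip_octets(ips, prefix="src")
print(octet_df)
```

Whether the octets carry useful signal is a separate question; the sketch only shows that the string has to be decomposed before any numerical scaling could apply.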

In [16]:
from sklearn.preprocessing import StandardScaler
# saving the unscalable columns
selected_categorical_cybersecurity_sample = selected_cybersecurity_sample.loc[:, ['Source IP Address', 'Destination IP Address']]
# applying the standard scaler to the numerical data
scaler = StandardScaler()
selected_numerical_cybersecurity_sample = selected_cybersecurity_sample.loc[:, ['Source Port', 'Destination Port', 'Packet Length', 'Anomaly Scores']]
# fit the scaler and transform in one step, keeping the original column names
scaled_numerical_cybersecurity_sample = pd.DataFrame(scaler.fit_transform(selected_numerical_cybersecurity_sample), columns=selected_numerical_cybersecurity_sample.columns)

In [17]:
scaled_numerical_cybersecurity_sample.head()

Unnamed: 0,Source Port,Destination Port,Packet Length,Anomaly Scores
0,-0.892984,1.697308,0.30351,-0.783225
1,1.736225,-0.178775,-0.072913,0.530346
2,1.251754,-0.626824,0.018195,-0.848606
3,0.723492,-0.499788,-1.312472,1.259331
4,1.723404,-1.628731,-0.991194,-0.652812


In [18]:
selected_categorical_cybersecurity_sample.head()

Unnamed: 0,Source IP Address,Destination IP Address
29692,164.213.89.116,144.27.113.174
2985,11.211.232.94,29.209.83.95
39309,88.17.184.100,3.11.219.244
20168,130.228.41.83,168.146.63.236
8917,198.99.16.135,173.13.217.227


In [32]:
scaled_numerical_cybersecurity_sample.to_csv('./data_samples/scaled_numerical_cybersecurity_sample.csv', index=False)

## Processing data from the NGA

In [5]:
nga_art_sample.columns

Index(['iiifurl', 'iiifthumburl', 'accessioned', 'title',
       'displayDate_created', 'roletype', 'role', 'forwarddisplayname',
       'birthyear', 'deathyear', 'ulanid', 'artistofngaobject', 'nationality',
       'constituenttype', 'beginyear_artistAssigned', 'endyear_artistAssigned',
       'country_artistAssigned', 'zipcode_artistAssigned', 'medium',
       'dimensions', 'inscription', 'markings', 'attribution',
       'visualBrowserClassification', 'parentID', 'isVirtual', 'portfolio',
       'series', 'volume', 'watermarks', 'uuid', 'viewtype', 'sequence',
       'width', 'height', 'maxpixels', 'assistivetext', 'depictstmsobjectid',
       'objectid', 'constituentid'],
      dtype='object')

In [6]:
# selecting the numerical columns plus two categorical columns ('medium', 'viewtype') from the dataset
selected_art_sample = nga_art_sample.loc[:, ['accessioned', 'birthyear', 'deathyear', 'width', 'height', 'medium', 'viewtype']]

### Filling in missing values

In [7]:
# checking for null values
selected_art_sample.isna().sum()

accessioned      0
birthyear        5
deathyear      207
width            0
height           2
medium           0
viewtype         0
dtype: int64

In [8]:
selected_art_sample.reset_index(drop=False, inplace=True)

In [9]:
# converting the numerical columns to a numeric dtype and imputing null values with the column mean
for col in ['birthyear', 'deathyear', 'width', 'height']:
    selected_art_sample[col] = pd.to_numeric(selected_art_sample[col], errors='coerce')
    selected_art_sample[col] = selected_art_sample[col].fillna(selected_art_sample[col].mean())

# imputing the categorical string columns 'medium' and 'viewtype' with the most frequent value
for col in ['medium', 'viewtype']:
    selected_art_sample[col] = selected_art_sample[col].fillna(selected_art_sample[col].value_counts().index[0])


In [10]:
selected_art_sample.loc[:, ['medium', 'viewtype']].describe()

Unnamed: 0,medium,viewtype
count,1256,1256
unique,371,3
top,gelatin silver print,primary
freq,164,1244


In [11]:
selected_art_sample.describe()

Unnamed: 0,index,birthyear,deathyear,width,height
count,1256.0,1256.0,1256.0,1256.0,1256.0
mean,50797.926752,1867.106315,1937.003813,4072.839172,4266.672249
std,64216.752151,113.221536,111.151954,2210.318703,2149.894017
min,0.0,916.0,1125.0,640.0,584.0
25%,312.75,1855.0,1937.003813,3085.5,3193.25
50%,625.5,1902.0,1971.0,3547.0,4000.0
75%,101264.25,1921.25,1998.0,4000.0,4000.0
max,199460.0,1980.0,2020.0,35896.0,29009.0
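
The `deathyear` quartiles above show a side effect of mean imputation: the 25th percentile equals the column mean (1937.003813), which is consistent with the 207 imputed rows all receiving that same value. A small sketch of the effect on made-up data (the values below are illustrative only, not NGA data):

```python
import numpy as np
import pandas as pd

# Made-up years with missing entries (illustrative, not NGA data).
years = pd.Series([1850.0, 1900.0, np.nan, 1950.0, np.nan, 2000.0])

filled = years.fillna(years.mean())

# Every missing entry now sits exactly at the mean of the observed values.
print(filled.tolist())
print((filled == years.mean()).sum())
```

Mean imputation keeps the column mean unchanged but concentrates mass at one value, which shrinks the variance and can show up as a dense band in later embeddings.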


### Scaling the selected art sample for use in Dimensionality Reduction

In [12]:
# applying the standard scaler to the numerical data
scaler = StandardScaler()
selected_numerical_art_sample = selected_art_sample.loc[:, ['birthyear', 'deathyear', 'width', 'height']]
# fit the scaler and transform in one step, keeping the original column names
scaled_numerical_art_sample = pd.DataFrame(scaler.fit_transform(selected_numerical_art_sample), columns=selected_numerical_art_sample.columns)

In [13]:
scaled_numerical_art_sample.describe()

Unnamed: 0,birthyear,deathyear,width,height
count,1256.0,1256.0,1256.0,1256.0
mean,8.14635e-16,-1.923444e-15,-5.6571870000000004e-18,5.0914690000000004e-17
std,1.000398,1.000398,1.000398,1.000398
min,-8.403747,-7.308259,-1.553716,-1.713638
25%,-0.1069685,-2.046426e-15,-0.4468733,-0.4994897
50%,0.3083122,0.3059751,-0.2379967,-0.1240891
75%,0.4784006,0.5489825,-0.03296728,-0.1240891
max,0.9975015,0.7469886,14.40328,11.51321
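
The standard deviation reads 1.000398 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (ddof=0), while `DataFrame.describe()` reports the sample standard deviation (ddof=1). With 1256 rows the ratio is sqrt(1256/1255), which is approximately 1.000398. A short check on synthetic data (the random values and column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1256, 2)), columns=["a", "b"])

scaled = pd.DataFrame(StandardScaler().fit_transform(data), columns=data.columns)

print(scaled.std(ddof=0).round(6).tolist())  # population std, as used by the scaler
print(scaled.std(ddof=1).round(6).tolist())  # sample std, as reported by describe()
print(round(np.sqrt(1256 / 1255), 6))
```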


In [14]:
scaled_numerical_art_sample.to_csv('./data_samples/scaled_numerical_art_sample.csv', index=False)

# Simpler Linear Methods

## PCA (Principal Component Analysis)

In [20]:
# applying PCA to the scaled data
from sklearn.decomposition import PCA
def apply_pca(scaled_df, n_components=2):
    pca = PCA(n_components=n_components)
    pca.fit(scaled_df)
    pca_df = pca.transform(scaled_df)
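    # build the column names from n_components so values other than 2 also work
    pca_df = pd.DataFrame(pca_df, columns=[f'PC{i+1}' for i in range(n_components)])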
    return pca_df
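
One way to judge how much information the retained components keep is the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below uses synthetic data in place of the scaled cybersecurity sample; the array shape, the correlated fourth feature, and the seed are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 1000 rows, 4 features, the last one correlated with the first.
base = rng.normal(size=(1000, 3))
X = np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=1000)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2).fit(X)

# Fraction of total variance captured by each component, and their sum.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

A low sum suggests that a two-dimensional plot hides much of the variation in the data.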

In [22]:
pca_cyber = apply_pca(scaled_numerical_cybersecurity_sample, n_components=2)

In [23]:
pca_cyber.head()

Unnamed: 0,PC1,PC2
0,1.312423,-0.557411
1,-1.28933,-0.408926
2,-0.653072,0.908675
3,-0.379432,-0.731101
4,-0.807452,1.322671


## SVD (Singular Value Decomposition)

In [25]:
from sklearn.decomposition import TruncatedSVD

def apply_svd(scaled_df, n_components=2):
    svd = TruncatedSVD(n_components=n_components)
    svd.fit(scaled_df)
    svd_df = svd.transform(scaled_df)
    svd_df = pd.DataFrame(svd_df, columns=[f'SVD{i+1}' for i in range(n_components)])
    return svd_df


In [33]:
svd_cyber = apply_svd(scaled_numerical_cybersecurity_sample)
svd_cyber.head()

Unnamed: 0,SVD1,SVD2
0,1.312423,-0.557411
1,-1.28933,-0.408926
2,-0.653072,0.908675
3,-0.379432,-0.731101
4,-0.807452,1.322671
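
The SVD coordinates above match the PCA coordinates because the input was already standardized: PCA centers the data and then applies an SVD, so on zero-mean columns the two projections coincide up to the sign of each component. A minimal check on synthetic data (the data, the seed, and the choice of the `arpack` solver are assumptions of this sketch):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
# Four features with clearly different spreads, then centered to zero mean.
X = rng.normal(size=(200, 4)) * np.array([4.0, 3.0, 2.0, 1.0])
X = X - X.mean(axis=0)

pca_out = PCA(n_components=2, svd_solver="full").fit_transform(X)
svd_out = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0).fit_transform(X)

# Compare absolute values, since the sign of each component is arbitrary.
print(np.allclose(np.abs(pca_out), np.abs(svd_out), atol=1e-6))
```

On data that has not been centered, the two methods would generally give different results, because `TruncatedSVD` does not subtract the column means.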


## MDS (Multidimensional Scaling)

In [20]:
from sklearn.manifold import MDS

def apply_mds(scaled_df, n_components=2):
    mds = MDS(n_components=n_components)
    mds_df = mds.fit_transform(scaled_df)
    mds_df = pd.DataFrame(mds_df, columns=[f'MDS{i+1}' for i in range(n_components)])
    return mds_df
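
MDS tries to place points in the low-dimensional space so that their pairwise distances resemble the original ones. A rough way to check this, sketched here on synthetic data, is to correlate the pairwise distances before and after embedding. The data, the seed, and the use of `scipy.spatial.distance.pdist` are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))

embedding = MDS(n_components=2, random_state=0).fit_transform(X)

# Pearson correlation between original and embedded pairwise distances.
corr = np.corrcoef(pdist(X), pdist(embedding))[0, 1]
print(round(corr, 3))
```

A correlation close to 1 indicates that the embedding preserves the overall distance structure well; the same check can be applied to the output of any of the other methods.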

In [21]:
mds_art = apply_mds(scaled_numerical_art_sample)
mds_art.head()

Unnamed: 0,MDS1,MDS2
0,-0.007395,-0.469333
1,0.017473,0.833203
2,-0.772755,-0.297951
3,0.529492,0.801523
4,-1.974494,0.510621


## ISOMap (Isometric Mapping)

In [22]:
from sklearn.manifold import Isomap

def apply_isomap(scaled_df, n_components=2, n_neighbors=5):
    isomap = Isomap(n_components=n_components, n_neighbors=n_neighbors)
    isomap_df = isomap.fit_transform(scaled_df)
    isomap_df = pd.DataFrame(isomap_df, columns=[f'ISOMap{i+1}' for i in range(n_components)])
    return isomap_df
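
The `n_neighbors` parameter controls how many nearby points are linked in the neighborhood graph that Isomap uses to estimate distances along the manifold. A sketch of how one might compare several values, using scikit-learn's `make_swiss_roll` as a stand-in dataset (the dataset, sample size, seed, and candidate values are assumptions, not choices made in this notebook):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3-D swiss roll; t is each point's position along the rolled-up sheet.
X, t = make_swiss_roll(n_samples=800, random_state=0)

for k in (8, 12, 16):
    embedding = Isomap(n_components=2, n_neighbors=k).fit_transform(X)
    # How closely does the first Isomap axis follow the position along the roll?
    corr = abs(np.corrcoef(embedding[:, 0], t)[0, 1])
    print(k, embedding.shape, round(corr, 3))
```

Too few neighbors can leave the graph fragmented, while too many can connect points across folds of the manifold, so it is worth trying a few values.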

In [23]:
isomap_art = apply_isomap(scaled_numerical_art_sample)
isomap_art.head()

Unnamed: 0,ISOMap1,ISOMap2
0,-0.490022,-0.351413
1,1.047635,1.03643
2,-0.14436,-0.012947
3,0.611867,1.576779
4,0.163851,-1.650431


## LLE (Locally Linear Embedding)

In [29]:
from sklearn.manifold import LocallyLinearEmbedding

def apply_lle(scaled_df, n_components=2, n_neighbors=5):
    lle = LocallyLinearEmbedding(n_components=n_components, n_neighbors=n_neighbors, eigen_solver = 'dense')
    lle_df = lle.fit_transform(scaled_df)
    lle_df = pd.DataFrame(lle_df, columns=[f'LLE{i+1}' for i in range(n_components)])
    return lle_df


In [30]:
lle_art = apply_lle(scaled_numerical_art_sample)
lle_art.head()

Unnamed: 0,LLE1,LLE2
0,-0.001767,-0.057712
1,0.01645,0.001881
2,-0.004483,0.04987
3,0.01645,0.001881
4,0.008406,-0.002841


# Complex Non-linear Methods

## UMAP (Uniform Manifold Approximation and Projection)

In [33]:
%pip install umap-learn

Collecting umap-learn
  Downloading umap-learn-0.5.5.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting pynndescent>=0.5 (from umap-learn)
  Downloading pynndescent-0.5.11-py3-none-any.whl (55 kB)
Building wheels for collected packages: umap-learn
  Building wheel for umap-learn (setup.py) ... done
  Created wheel for umap-learn: filename=umap_learn-0.5.5-py3-none-any.whl size=86831 sha256=858bc870b40da8ec5d1b74ed2e60e108061ed52f58f025b72320921982664a57
  Stored in directory: /root/.cache/pip/wheels/3a/70/07/428d2b58660a1a3b431db59b806a10da736612ebbc66c1bcc5
Successfully built umap-learn
Installing collected packages: pynndescent, umap-learn
Successfully installed pynndescent-0.5.11 umap-learn-0.5.5


In [27]:
from umap import UMAP

def apply_umap(scaled_df, n_components=2, n_neighbors=5):
    umap = UMAP(n_components=n_components, n_neighbors=n_neighbors)
    umap_df = umap.fit_transform(scaled_df)
    umap_df = pd.DataFrame(umap_df, columns=[f'UMAP{i+1}' for i in range(n_components)])
    return umap_df


In [29]:
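# note: this call passes the unscaled frame (selected_..., not scaled_...), so the
# wide-ranging port columns dominate the distances UMAP uses; pass
# scaled_numerical_cybersecurity_sample to put all four features on a common scale
umap_cyber = apply_umap(selected_numerical_cybersecurity_sample)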
umap_cyber.head()

Unnamed: 0,UMAP1,UMAP2
0,6.090585,-7.161148
1,13.751868,10.544777
2,11.418718,11.608393
3,10.542574,9.488722
4,12.466639,15.125901


In [31]:
umap_cyber.to_csv('./data_samples/umap_cybersecurity.csv', index=False)

## t-SNE (t-Distributed Stochastic Neighbor Embedding)

In [35]:
from sklearn.manifold import TSNE

def apply_tsne(scaled_df, n_components=2, perplexity=30, learning_rate=200):
    tsne = TSNE(n_components=n_components, perplexity=perplexity, learning_rate=learning_rate)
    tsne_df = tsne.fit_transform(scaled_df)
    tsne_df = pd.DataFrame(tsne_df, columns=[f't-SNE{i+1}' for i in range(n_components)])
    return tsne_df
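
t-SNE is stochastic, and its result depends on `perplexity`, which must be smaller than the number of samples. A sketch on synthetic clustered data with `random_state` fixed so that repeated runs are easier to compare; the data from `make_blobs`, the seed, and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic data: 150 points in 4 dimensions, drawn around 3 centers.
X, labels = make_blobs(n_samples=150, n_features=4, centers=3, random_state=0)

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)
# Final value of the objective t-SNE minimizes; useful when comparing settings.
print(tsne.kl_divergence_)
```

Running the same data with a few perplexity values and comparing the plots is a common way to check that visible clusters are not an artifact of one setting.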


In [36]:
tsne_art = apply_tsne(scaled_numerical_art_sample)
tsne_art.head()

Unnamed: 0,t-SNE1,t-SNE2
0,2.089915,-23.602427
1,9.56338,-1.85476
2,-12.46655,6.00473
3,9.95203,-2.543584
4,-5.66789,-9.673102


In [40]:
tsne_art.to_csv('./data_samples/tsne_art.csv', index=False)