<a href="https://colab.research.google.com/github/Bryan-Az/Dimensionality_Reduction/blob/main/Dimensionality_Reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import os
import shutil
import sys
import zipfile
import gensim
import re

# Dimensionality Reduction
In this notebook assignment I will be using the following techniques:
1. PCA (Principal Component Analysis) - multivariate on distinct features
2. SVD (Singular Value Decomposition) - univariate on distinct feature
3. MDS (Multidimensional Scaling) - multivariate on distinct feature
4. ISOMap (Isometric Mapping)  - multivariate on distinct feature
5. LLE (Locally Linear Embedding) - multivariate on distinct feature

and separately,

1. UMAP (Uniform Manifold Approximation and Projection)
2. t-SNE (t-Distributed Stochastic Neighbor Embedding)

UMAP and t-SNE are listed separately because they are more recent, non-linear methods designed mainly for visualizing high-dimensional data. Both emphasize preserving local structure, meaning which points are close neighbors of one another, so they can reveal clusters that linear projections such as PCA and SVD may miss.

## Data Loading

In [2]:
# unpacking art_tables.zip using zipfile library and then load the two csv's (latinamerican_art.csv & non_latinamerican_art.csv)
# into a single dataframe
nga_art_sample = None
with zipfile.ZipFile('./data_samples/art_tables.zip', 'r') as zip_ref:
    zip_ref.extractall('./data_samples/')
    for file in zip_ref.namelist():
        if file.endswith('.csv'):
            # only sampling 628 rows from non_latinamerican.csv file & all from latinamerican, and selecting only title and nationality
            # latinamerican has very few rows so we are sampling all of them
            to_sample = 'non_latinamerican' in file
            art_table = pd.read_csv('./data_samples/' + file, on_bad_lines='skip')
            if to_sample:
                art_table = art_table.sample(628)
            # prepend each new table to the ones already loaded
            nga_art_sample = art_table if nga_art_sample is None else pd.concat([art_table, nga_art_sample])
# delete the unzipped data directory
shutil.rmtree('./data_samples/art_tables')

print(nga_art_sample.shape)
nga_art_sample.head()

(1256, 40)




Unnamed: 0,iiifurl,iiifthumburl,accessioned,title,displayDate_created,roletype,role,forwarddisplayname,birthyear,deathyear,...,uuid,viewtype,sequence,width,height,maxpixels,assistivetext,depictstmsobjectid,objectid,constituentid
174,https://api.nga.gov/iiif/0037f7c4-699f-4ac3-b2...,https://api.nga.gov/iiif/0037f7c4-699f-4ac3-b2...,1,"Plate 77: Dotted Bee Fly with a White Flower, ...","Flemish, 1542 - 1600",artist,artist,Joris Hoefnagel,1542,1600.0,...,0037f7c4-699f-4ac3-b25e-f12306ecb735,primary,0.0,4000.0,2825.0,,,69744.0,69744.0,2569.0
2237,https://api.nga.gov/iiif/02fac72f-5170-4f5f-a2...,https://api.nga.gov/iiif/02fac72f-5170-4f5f-a2...,1,Doll's Dress and Shift,"American, active c. 1935",artist,artist,Rosalia Lane,1855,1995.0,...,02fac72f-5170-4f5f-a250-ab231522699f,primary,0.0,3405.0,4368.0,,,27385.0,27385.0,7746.0
20439,https://api.nga.gov/iiif/19eb6be1-4f8f-4098-bd...,https://api.nga.gov/iiif/19eb6be1-4f8f-4098-bd...,1,Ship Builder's Axe Head,"American, active c. 1935",artist,artist,Harvey Beck,1855,1995.0,...,19eb6be1-4f8f-4098-bddf-bbcc94861aa9,primary,0.0,3119.0,3978.0,,,20536.0,20536.0,7247.0
186642,https://api.nga.gov/iiif/ee32f9fa-fbff-4e28-be...,https://api.nga.gov/iiif/ee32f9fa-fbff-4e28-be...,1,Cicada,"American, born 1930",artist,artist,Jasper Johns,1930,,...,ee32f9fa-fbff-4e28-be4c-e93b801b2986,primary,0.0,3002.0,4000.0,640.0,,152020.0,152020.0,2136.0
24792,https://api.nga.gov/iiif/1f72426e-4b94-4a2d-b1...,https://api.nga.gov/iiif/1f72426e-4b94-4a2d-b1...,1,"Angola, Louisiana","American, born 1951",artist,artist,Deborah Luster,1951,,...,1f72426e-4b94-4a2d-b153-f400574031c2,primary,0.0,3821.0,4735.0,640.0,,168403.0,168403.0,38644.0
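
The loading step above can also be written as a small helper that collects the tables in a list and concatenates once. This is a minimal sketch, assuming the same file-name convention as the notebook; the helper name `load_art_tables`, its `n_sample` and `seed` parameters, and reading directly from the archive instead of extracting it first are illustrative choices, not part of the original notebook.

```python
import zipfile

import pandas as pd


def load_art_tables(zip_path, n_sample=628, seed=0):
    """Read every CSV in the archive and return one combined DataFrame.

    Files whose name contains 'non_latinamerican' are down-sampled to
    n_sample rows; all other CSVs are kept in full.
    """
    frames = []
    with zipfile.ZipFile(zip_path) as zip_ref:
        for name in zip_ref.namelist():
            if not name.endswith('.csv'):
                continue
            # read straight from the archive, so there is nothing to delete afterwards
            with zip_ref.open(name) as fh:
                table = pd.read_csv(fh, on_bad_lines='skip')
            if 'non_latinamerican' in name:
                # a fixed seed makes the random sample reproducible across runs
                table = table.sample(min(n_sample, len(table)), random_state=seed)
            frames.append(table)
    return pd.concat(frames, ignore_index=True)
```

With the notebook's archive, the call would look like `load_art_tables('./data_samples/art_tables.zip')`. Fixing the seed matters here because every later statistic depends on which 628 rows were drawn.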


## Data Pre-Processing

In [3]:
nga_art_sample.columns

Index(['iiifurl', 'iiifthumburl', 'accessioned', 'title',
       'displayDate_created', 'roletype', 'role', 'forwarddisplayname',
       'birthyear', 'deathyear', 'ulanid', 'artistofngaobject', 'nationality',
       'constituenttype', 'beginyear_artistAssigned', 'endyear_artistAssigned',
       'country_artistAssigned', 'zipcode_artistAssigned', 'medium',
       'dimensions', 'inscription', 'markings', 'attribution',
       'visualBrowserClassification', 'parentID', 'isVirtual', 'portfolio',
       'series', 'volume', 'watermarks', 'uuid', 'viewtype', 'sequence',
       'width', 'height', 'maxpixels', 'assistivetext', 'depictstmsobjectid',
       'objectid', 'constituentid'],
      dtype='object')

In [4]:
# selecting the numerical columns plus two categorical columns ('medium', 'viewtype') from the dataset
selected_art_sample = nga_art_sample.loc[:, ['accessioned', 'birthyear', 'deathyear', 'width', 'height', 'medium', 'viewtype']]

In [6]:
# checking for null values
selected_art_sample.isna().sum()

accessioned      0
birthyear        8
deathyear      214
width            2
height           4
medium           0
viewtype         2
dtype: int64

In [7]:
selected_art_sample.reset_index(drop=False, inplace=True)

In [8]:
# converting the numerical columns to a numeric dtype (unparseable values become NaN)
# and imputing null values with the column mean
for col in ['birthyear', 'deathyear', 'width', 'height']:
    selected_art_sample[col] = pd.to_numeric(selected_art_sample[col], errors='coerce')
    # assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about
    selected_art_sample[col] = selected_art_sample[col].fillna(selected_art_sample[col].mean())

# imputing the categorical string columns 'medium' and 'viewtype' with the most frequent value
for col in ['medium', 'viewtype']:
    selected_art_sample[col] = selected_art_sample[col].fillna(selected_art_sample[col].value_counts().index[0])
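
The two steps used in the cell above, coercion to numeric and mean imputation, are easiest to see on a toy column. A minimal sketch; the values below are made up for illustration and do not come from the art sample.

```python
import pandas as pd

# toy column mixing a number, a numeric string, a non-numeric string and a missing value
raw = pd.Series([1900, '1950', 'unknown', None])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')

# every NaN is replaced by the mean of the values that could be parsed
filled = numeric.fillna(numeric.mean())
print(filled.tolist())  # [1900.0, 1950.0, 1925.0, 1925.0]
```

Because every missing entry receives the same value, a column with many gaps ends up with a spike at its mean.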


In [15]:
selected_art_sample.loc[:, ['medium', 'viewtype']].describe()

Unnamed: 0,medium,viewtype
count,1256,1256
unique,372,3
top,gelatin silver print,primary
freq,168,1233


In [10]:
selected_art_sample.describe()

Unnamed: 0,index,birthyear,deathyear,width,height
count,1256.0,1256.0,1256.0,1256.0,1256.0
mean,50797.10828,1864.290064,1932.184438,4107.07815,4301.347444
std,64993.086106,115.772634,113.05142,2261.003196,2133.871492
min,0.0,1100.0,1199.0,578.0,976.0
25%,311.75,1855.0,1932.184438,3075.0,3228.5
50%,625.5,1902.0,1969.0,3560.5,4000.0
75%,102048.0,1922.0,1998.0,4000.0,4000.0
max,199929.0,1980.0,2022.0,35432.0,20615.0


## Scaling the selected art sample for use in Dimensionality Reduction

In [16]:
from sklearn.preprocessing import StandardScaler
# applying the standard scaler to the numerical columns only
scaler = StandardScaler()
selected_numerical_art_sample = selected_art_sample.loc[:, ['birthyear', 'deathyear', 'width', 'height']]
# fit on the numerical subset: fitting on selected_art_sample, which still holds string columns,
# raises "ValueError: could not convert string to float"
scaled_numerical_art_sample = pd.DataFrame(
    scaler.fit_transform(selected_numerical_art_sample),
    columns=selected_numerical_art_sample.columns,
)
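
Scaling matters here because the pixel dimensions have standard deviations in the thousands while the year columns vary by roughly a hundred, so unscaled widths and heights would dominate any variance-based method. A minimal sketch of what `StandardScaler` does, using synthetic stand-in columns (the numbers are invented and only loosely mimic the summary statistics above):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the four numerical columns (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'birthyear': rng.normal(1860, 115, 200),
    'deathyear': rng.normal(1930, 113, 200),
    'width': rng.normal(4100, 2260, 200),
    'height': rng.normal(4300, 2130, 200),
})

# fit and transform on numeric columns only, then restore the column names
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# each column now has mean ~0 and population standard deviation ~1
print(scaled.mean().abs().max())
print(scaled.std(ddof=0).tolist())
```

`StandardScaler` divides by the population standard deviation, which is why `ddof=0` is used in the check.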

# Simpler Methods (Linear and Manifold-Based)

## PCA (Principal Component Analysis)

In [None]:
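# applying PCA to the scaled numerical data
# (assumes the scaled frame from the previous section, scaled_numerical_art_sample)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_df = pca.fit_transform(scaled_numerical_art_sample)
pca_df = pd.DataFrame(pca_df, columns=['PC1', 'PC2'])
pca_df.head()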
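
After fitting, it is worth checking how much variance the two retained components explain. A minimal sketch on synthetic data, assuming a four-feature matrix built from two correlated pairs as a stand-in for the scaled art sample:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic 4-feature data made of two strongly correlated pairs (illustrative only)
base = rng.normal(size=(300, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=300),
])
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

When the summed ratio is low, a 2-D PCA plot hides much of the structure, which is one motivation for trying the other methods in this notebook.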

## SVD (Singular Value Decomposition)
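
A minimal sketch of how SVD could be used for reduction, assuming a centered numeric matrix such as the scaled art sample; the random matrix below is only a stand-in. Keeping the top `k` singular directions gives the same kind of projection PCA produces on centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # stand-in for the scaled numerical matrix
Xc = X - X.mean(axis=0)         # center the columns (already true after standard scaling)

# thin SVD: Xc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
svd_2d = U[:, :k] * S[:k]       # coordinates along the top-k singular directions
# equivalently, project the data onto the top-k right singular vectors
assert np.allclose(svd_2d, Xc @ Vt[:k].T)

# rank-k reconstruction and its relative error
X_hat = svd_2d @ Vt[:k]
rel_err = np.linalg.norm(Xc - X_hat) / np.linalg.norm(Xc)
print(svd_2d.shape, rel_err)
```

The relative error depends only on the discarded singular values, so it gives a quick read on how much is lost by keeping `k` dimensions.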

## MDS (Multidimensional Scaling)
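
A minimal sketch of metric MDS with scikit-learn, assuming a scaled numeric matrix as input; the random data is a stand-in and the parameter values are illustrative. MDS looks for a 2-D layout whose pairwise distances match the original ones as closely as possible.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # stand-in for the scaled numerical matrix

mds = MDS(n_components=2, random_state=0)
mds_2d = mds.fit_transform(X)

# how well the 2-D distances track the original distances (1.0 would be a perfect match)
corr = np.corrcoef(pdist(X), pdist(mds_2d))[0, 1]
print(mds_2d.shape, corr)
```

MDS works on the full pairwise distance matrix, so its cost grows quickly with the number of rows.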

## ISOMap (Isometric Mapping) 
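
A minimal sketch of Isomap with scikit-learn on a synthetic Swiss roll, chosen because the curved shape makes the effect easy to check; the dataset and the `n_neighbors` value are assumptions for illustration. Isomap builds a nearest-neighbor graph and embeds the points so that distances along the graph are preserved.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# synthetic 3-D Swiss roll: a 2-D sheet rolled up in space; t is the position along the roll
X, t = make_swiss_roll(n_samples=1000, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)
iso_2d = iso.fit_transform(X)

# if the roll has been flattened, one embedding axis should track t closely
corr = abs(np.corrcoef(iso_2d[:, 0], t)[0, 1])
print(iso_2d.shape, corr)
```

On the notebook's data, the same call would take the scaled numerical frame in place of `X`. The result is sensitive to `n_neighbors`: too few neighbors can disconnect the graph and too many can bridge separate parts of the manifold.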

## LLE (Locally Linear Embedding)
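
A minimal sketch of LLE with scikit-learn, again on a synthetic Swiss roll as a stand-in dataset with illustrative parameters. LLE reconstructs each point as a weighted combination of its nearest neighbors and then finds low-dimensional coordinates that keep those weights valid.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
lle_2d = lle.fit_transform(X)

# reconstruction_error_ reports the cost of the fitted embedding (lower is better)
print(lle_2d.shape, lle.reconstruction_error_)
```

Unlike MDS, LLE only uses local neighborhoods, so it scales better with the number of rows but says little about distances between far-apart points.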

# Complex Non-linear Methods

## UMAP (Uniform Manifold Approximation and Projection)

## t-SNE (t-Distributed Stochastic Neighbor Embedding)
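
A minimal sketch of t-SNE with scikit-learn, assuming three well-separated synthetic clusters in four dimensions as a stand-in for the scaled art sample; the cluster centers and parameter values are invented for illustration. The check at the end measures how often a point's nearest neighbor in the 2-D embedding belongs to the same cluster.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

# three well-separated clusters in 4-D (illustrative centers, not real data)
centers = np.array([[0, 0, 0, 0], [10, 0, 0, 0], [0, 10, 0, 0]], dtype=float)
X, y = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
tsne_2d = tsne.fit_transform(X)

# for each point, find its nearest other point in the embedding
_, idx = NearestNeighbors(n_neighbors=2).fit(tsne_2d).kneighbors(tsne_2d)
agreement = (y[idx[:, 1]] == y).mean()
print(tsne_2d.shape, agreement)
```

`perplexity` must be smaller than the number of samples and roughly sets how many neighbors each point attends to. Distances between separate clusters in a t-SNE plot should not be read as meaningful.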