# Dimensionality Reduction Techniques

Let's use dimensionality reduction techniques to visualize our data and see whether clear groupings of games emerge.

## Dimensionality Reduction
A widely encountered problem in machine learning is that of dimensionality. We typically describe the size or shape of a dataset as an $n \times p$ matrix, where each of the $n$ rows represents an observation (data point) and each of the $p$ columns represents a variable (feature). Each additional feature increases the dimensionality of the dataset by 1.

The problems with increasing or high levels of dimensionality are as follows:

- More storage space required for the data;
- More computation time required to work with the data; and
- More features mean more chance of feature correlation, and hence feature redundancy.
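
As a rough illustration of the last point, the sketch below builds a small $n \times p$ table in which one column is an exact rescaling of another, then flags highly correlated pairs. The column names, the random data, and the 0.95 threshold are assumptions made up for this example; none of them come from the Steam dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200  # number of observations (rows)

height_cm = rng.normal(170, 10, size=n)
toy = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,      # same information in other units: redundant
    "noise": rng.normal(0, 1, size=n),  # unrelated feature
})

n_rows, n_cols = toy.shape  # the (n, p) shape described above

# Absolute pairwise correlations between features
corr = toy.corr().abs()

# Collect each pair of columns whose correlation exceeds the (arbitrary) threshold
threshold = 0.95
redundant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(toy.shape, redundant)
```

A pair flagged this way carries almost the same information twice, which is the kind of redundancy dimensionality reduction tries to remove.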

## Outline

In this notebook we will:
* Explain the following advanced dimensionality reduction techniques:
    * Principal Component Analysis
    * Multi-dimensional Scaling
    * t-SNE
* Implement these techniques on our dataset;
* Implement these techniques on a text dataset.
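
A minimal sketch of Principal Component Analysis on TF-IDF features, assuming four made-up tag strings rather than the real Steam data. The strings and variable names are hypothetical; `PCA` and `TfidfVectorizer` are the standard scikit-learn classes.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tag strings: two shooter-like games, two strategy-like games
toy_tags = [
    "FPS Shooter Multiplayer Action",
    "Shooter Multiplayer Battle Royale Action",
    "Strategy Simulation God Game Indie",
    "Strategy Simulation City Builder Indie",
]

# Dense TF-IDF matrix: one row per game, one column per word
X_toy = TfidfVectorizer().fit_transform(toy_tags).toarray()

# Project onto the two directions of largest variance
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_toy)

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

Unlike MDS and t-SNE, PCA is a linear projection, so it is cheap to compute and is often used as a baseline or as an initialization for the other two.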




In [1]:
# dependencies

# data processing
import pandas as pd
print('Pandas - ' + str(pd.__version__))

from time import time

# feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

# Dimensionality Reduction
from sklearn.manifold import TSNE
from sklearn.manifold import MDS

Pandas - 1.3.3


In [2]:
df = pd.read_csv('steam_game_data.csv')

In [3]:
# keep only games with more than 1,000 reviews, then renumber the rows from 0
df = df[df['number_of_reviews'] > 1000].reset_index(drop=True)
df.shape

In [4]:
# take an explicit copy so that later column assignments do not raise SettingWithCopyWarning
tag_based = df[['title', 'tags']].copy()

In [5]:
# strip the list punctuation ([ ] , ') so each row becomes a space-separated string of tag words
def update_tags(tags):
    return tags.replace('[','').replace(']','').replace(',','').replace("'",'')

In [6]:
tag_based['tags'] = tag_based['tags'].apply(update_tags)


In [7]:
tag_based

Unnamed: 0,title,tags
0,Counter-Strike: Global Offensive,FPS Shooter Multiplayer Competitive Action Tea...
1,Apex Legends™,Free to Play Multiplayer Battle Royale Shooter...
2,Stray,Cats Adventure Cyberpunk Cute Atmospheric Expl...
3,Grand Theft Auto V,Open World Action Multiplayer Automobile Sim C...
4,MultiVersus,Multiplayer Co-op 2D Fighter Action Competitiv...
...,...,...
4624,Godus,God Game Strategy Simulation Indie Kickstarter...
4625,Fallout 4 Season Pass,RPG Open World Post-apocalyptic Singleplayer A...
4626,GASP,Free to Play Space Simulation Action Survival ...
4627,Godus,God Game Strategy Simulation Indie Kickstarter...


In [8]:
tag_based['tags'][0]

'FPS Shooter Multiplayer Competitive Action Team-Based eSports Tactical First-Person PvP Online Co-Op Co-op Strategy Military War Difficult Trading Realistic Fast-Paced Moddable'

In [9]:
tag_based['tags'][1]

'Free to Play Multiplayer Battle Royale Shooter First-Person FPS PvP Action Hero Shooter Team-Based Tactical Survival Character Customization Sci-fi Funny Loot Lore-Rich Cyberpunk Co-op Cinematic'

In [13]:
df['is_duplicated'] = tag_based.duplicated('tags')
df['is_duplicated'].sum()

133
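
`duplicated('tags')` marks every repeat of a tag string after its first occurrence. Rows with identical tag strings produce identical TF-IDF vectors, so they land on practically the same point in any embedding. If that is not wanted, the repeats can be dropped first. A small sketch on a made-up three-row frame (the rows are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Godus', 'Stray', 'Godus'],
    'tags': ['God Game Strategy', 'Cats Adventure', 'God Game Strategy'],
})

# True for every row whose tag string already appeared earlier
is_duplicated = toy.duplicated('tags')

# Keep the first occurrence of each tag string and renumber the rows
deduped = toy[~is_duplicated].reset_index(drop=True)

print(is_duplicated.sum(), deduped.shape)
```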

In [8]:
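# create the TF-IDF vectorizer: keep the 200 most frequent 1- to 3-grams
vectorizer = TfidfVectorizer(max_features=200, ngram_range=(1,3))
# fit on the tag strings and transform them into a sparse TF-IDF matrix
matrix = vectorizer.fit_transform(tag_based.tags)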

In [9]:
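# convert the sparse matrix to a dense DataFrame with one column per n-gram
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
matrix = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())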

In [10]:
matrix

Unnamed: 0,2d,3d,access,action,action adventure,action indie,action rpg,adventure,adventure action,and,...,violent,visual,visual novel,vr,walking,walking simulator,war,world,world war,zombies
0,0.000000,0.0,0.000000,0.084025,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.193904,0.000000,0.0,0.0
1,0.000000,0.0,0.000000,0.093624,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
2,0.000000,0.0,0.000000,0.124429,0.0,0.0,0.0,0.130404,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.183694,0.0,0.0
3,0.000000,0.0,0.000000,0.080707,0.0,0.0,0.0,0.084583,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.119148,0.0,0.0
4,0.430115,0.0,0.000000,0.091707,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4624,0.000000,0.0,0.210293,0.000000,0.0,0.0,0.0,0.111129,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.156542,0.0,0.0
4625,0.000000,0.0,0.000000,0.078438,0.0,0.0,0.0,0.082205,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.115798,0.0,0.0
4626,0.000000,0.0,0.196751,0.099208,0.0,0.0,0.0,0.103973,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.146461,0.0,0.0
4627,0.000000,0.0,0.210293,0.000000,0.0,0.0,0.0,0.111129,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.156542,0.0,0.0


In [20]:
print("Computing MDS embedding")
clf = MDS(n_components=2,
          n_init=4,
          max_iter=200,
          n_jobs=-1,
          random_state=42,
          dissimilarity='euclidean')
t0 = time()
X_mds = clf.fit_transform(matrix)
t1 = time()
print("Done. Stress: %f" % clf.stress_)

Computing MDS embedding
Done. Stress: 2950587.447729
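
A raw stress value like the one above is hard to judge on its own, because it grows with the number of points and the scale of the distances. One common way to make it comparable is Kruskal's Stress-1, which divides by the sum of squared original distances. The sketch below computes it on random toy data. It assumes that `stress_` holds the raw, unnormalized value with each pair counted once, which is our understanding of scikit-learn's default for metric MDS. The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 8))  # hypothetical data: 30 points, 8 features

mds_toy = MDS(n_components=2, n_init=1, max_iter=300, random_state=0)
X_2d = mds_toy.fit_transform(X_toy)

# Sum of squared pairwise distances in the original space, each pair counted once
D = euclidean_distances(X_toy)
scale = (D ** 2).sum() / 2

# Assumption: stress_ is the raw sum of squared distance errors over pairs
stress1 = np.sqrt(mds_toy.stress_ / scale)
print(f"raw stress: {mds_toy.stress_:.2f}, Stress-1: {stress1:.3f}")
```

Values closer to 0 mean the 2-D distances reproduce the original ones more faithfully.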


In [21]:
# 2-D MDS coordinates plus the game title for the hover text
mds_df = pd.DataFrame(X_mds, columns=['D1', 'D2'])
mds_df['text'] = tag_based['title']

In [22]:
mds_df


In [23]:
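import plotly.graph_objs as go
from plotly.offline import iplot

# scatter plot of the MDS embedding; hovering over a point shows the game title
data = [
    go.Scatter(
        x = mds_df['D1'].values,
        y = mds_df['D2'].values,
        text = mds_df['text'].values,
        hoverinfo = 'text',
        marker = dict(
            color = 'lightblue'
        ),
        mode='markers',
        showlegend = False
    )
]

iplot(data, filename = "add-hover-text")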

In [11]:
print("Computing t-SNE embedding")
tsne = TSNE(n_components=2,
            perplexity=40,
            metric='euclidean',
            init='pca',
            verbose=1,
            random_state=42)
t0 = time()
X_tsne = tsne.fit_transform(matrix)
t1 = time()

Computing t-SNE embedding
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 4629 samples in 0.001s...
[t-SNE] Computed neighbors for 4629 samples in 0.487s...
[t-SNE] Computed conditional probabilities for sample 1000 / 4629
[t-SNE] Computed conditional probabilities for sample 2000 / 4629
[t-SNE] Computed conditional probabilities for sample 3000 / 4629
[t-SNE] Computed conditional probabilities for sample 4000 / 4629
[t-SNE] Computed conditional probabilities for sample 4629 / 4629
[t-SNE] Mean sigma: 0.333794
[t-SNE] KL divergence after 250 iterations with early exaggeration: 82.405785
[t-SNE] KL divergence after 1000 iterations: 1.502036
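
The final KL divergence printed in the log is also stored on the fitted estimator as `kl_divergence_`. The sketch below reads it back for a few perplexity values on made-up clustered data. The data, the perplexity grid, and the variable names are assumptions for illustration. Because the target distribution changes with perplexity, the values are only a rough diagnostic and not a strict ranking.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# hypothetical data: three loose clusters of 20 points each in 10 dimensions
X_toy = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(20, 10))
    for center in (0, 5, 10)
])

kl_by_perplexity = {}
for perplexity in (5, 15, 30):  # must stay below the number of samples (60)
    tsne_toy = TSNE(n_components=2,
                    perplexity=perplexity,
                    init='pca',
                    random_state=42)
    tsne_toy.fit_transform(X_toy)
    # KL divergence between the high- and low-dimensional neighbor distributions
    kl_by_perplexity[perplexity] = tsne_toy.kl_divergence_

print(kl_by_perplexity)
```

In practice it is worth plotting the embedding for each perplexity as well, since cluster shapes can change noticeably between settings.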


In [13]:
tsne_df = pd.DataFrame(X_tsne, columns=['D1', 'D2'])
tsne_df['text'] = tag_based['title']

In [14]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

# enable Plotly's offline rendering inside the notebook
init_notebook_mode(connected=True)

In [15]:
tsne_df.head(10)

Unnamed: 0,D1,D2,text
0,42.174809,-1.100116,Counter-Strike: Global Offensive
1,43.278641,4.356322,Apex Legends™
2,22.847174,-35.5196,Stray
3,58.187057,-21.116899,Grand Theft Auto V
4,-7.513382,-12.023914,MultiVersus
5,58.984718,-22.918989,Red Dead Redemption 2
6,28.621799,23.484627,Rust
7,54.889462,-10.653167,PUBG: BATTLEGROUNDS
8,20.436592,7.625875,Destiny 2
9,39.622787,11.33367,MONSTER HUNTER RISE


In [16]:
data = [
    go.Scatter(
        x = tsne_df.iloc[:,0].values,
        y = tsne_df.iloc[:,1].values,
        text = tsne_df.iloc[:,2].values,
        hoverinfo = 'text',
        marker = dict(
            color = 'lightblue'
        ),
        mode='markers',
        showlegend = False
    )
]

iplot(data, filename = "add-hover-text")