# t-SNE Visualization

In this notebook, I check whether the data forms natural clusters or groupings. Because the data is high-dimensional, it cannot be plotted directly, so I first reduce it to two dimensions with t-SNE.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

TRAINSET = '../data/raw/0173eeb640e7-Challenge+Data+Set+-+Campus+Analytics+2020.xlsx'

In [None]:
# Load in data and visually inspect/verify
df = pd.read_excel(TRAINSET)
df

In [None]:
# Split the frame into the target, the char feature, and the numeric features
targets = df['y'].to_numpy()
chars = df['XC'].to_numpy()
feat_df = df.drop(columns=['y', 'XC'])
vecs = feat_df.to_numpy()

In [None]:
# Embed the feature vectors in 2-D (the layout varies between runs unless random_state is set)
vecs_embedded = TSNE(n_components=2).fit_transform(vecs)
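t-SNE embeddings depend on the `perplexity` setting and on the random initialization, so a layout that looks unstructured at one setting is worth rechecking at others. The following is a minimal sketch of such a sweep, not part of the original analysis. The helper `embed_at_perplexities`, the perplexity values, and the synthetic matrix `X_demo` are all assumptions made for illustration; in this notebook `vecs` would take the place of `X_demo`.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_at_perplexities(X, perplexities=(5, 15, 30), seed=0):
    """Return {perplexity: 2-D embedding}; the seed fixes the initialization."""
    embeddings = {}
    for perp in perplexities:
        # perplexity must be smaller than the number of samples
        tsne = TSNE(n_components=2, perplexity=perp, random_state=seed)
        embeddings[perp] = tsne.fit_transform(X)
    return embeddings

# Synthetic stand-in for the feature matrix (assumption: 60 samples, 10 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 10))
demo_embeddings = embed_at_perplexities(X_demo)
```

If clusters appear at some perplexity values but not others, the structure is sensitive to the setting and should be interpreted with care.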

In [None]:
# Organize data for plotting
# data[target][char] holds the embedded points for that (target, char) pair
data = {target: {char: [] for char in 'ABCDE'} for target in (0, 1)}

for vec_embed, char, target in zip(vecs_embedded, chars, targets):
    data[target][char].append(vec_embed)
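The grouping step above relies on knowing in advance that the labels are 0 and 1 and that the chars are A to E. As an alternative sketch, assuming only that the label and char values are hashable, a `defaultdict` builds the same nested structure without listing the keys. The helper name `group_points` and the toy inputs are hypothetical.

```python
from collections import defaultdict

def group_points(points, chars, targets):
    """Group points as grouped[target][char] -> list of points."""
    grouped = defaultdict(lambda: defaultdict(list))
    for point, char, target in zip(points, chars, targets):
        grouped[target][char].append(point)
    return grouped

# Toy inputs, made up for illustration
toy_points = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]
toy_chars = ['A', 'B', 'A']
toy_targets = [0, 1, 0]
toy_grouped = group_points(toy_points, toy_chars, toy_targets)
```

The trade-off is that an unexpected label or char silently creates a new group instead of raising a `KeyError`, so the explicit dictionary is the stricter choice.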

# Plot by Target Label

In [None]:
def plot_point(vec, char, target, include_target=True, include_char=True):
    """Scatter one embedded point, colored by target and drawn as its char."""
    if include_target:
        color = '#3498cd' if target == 0 else '#f89939'
    else:
        color = 'b'

    marker = f'${char}$' if include_char else '.'
    plt.scatter(vec[0], vec[1], c=color, marker=marker)

In [None]:
# Visualize by target label
plt.figure(figsize=(16,16))
for target, char_dict in data.items():
    for char, data_pts_ls in char_dict.items():
        for data_pt in data_pts_ls:
            plot_point(data_pt, char, target)
plt.show()

In this plot, the points show no obvious structure. To judge the grouping by target label (0 or 1), I look only at the color of each point and ignore its char marker.

No groups can be distinguished: the two colors overlap throughout, and the points are spread widely but are denser toward the center, resembling a Gaussian cloud. Note: I do not assume that the data follows a Gaussian distribution. I am only describing the appearance of the visualization.
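The visual impression can be complemented with a simple number: for each embedded point, the fraction of its nearest neighbors that share its label. This is a rough sketch rather than part of the original analysis; the function name `knn_label_agreement`, the choice of `k`, and the toy points are assumptions. For two classes with proportions p and 1 - p, a score close to p² + (1 - p)² is roughly what randomly mixed labels would give, while a score near 1 indicates well-separated groups.

```python
import math

def knn_label_agreement(points, labels, k=5):
    """Mean fraction of each point's k nearest neighbors that share its label."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # Euclidean distances in the 2-D embedding from point i to all others
        dists = sorted(
            (math.hypot(points[i][0] - points[j][0],
                        points[i][1] - points[j][1]), j)
            for j in range(n) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        total += sum(labels[j] == labels[i] for j in neighbors) / len(neighbors)
    return total / n

# Toy example: two well-separated groups of three points each
toy_pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_score = knn_label_agreement(toy_pts, toy_labels, k=2)
```

In this notebook the call would be `knn_label_agreement(vecs_embedded, targets)`. The loop is quadratic in the number of points, so it is only practical for moderately sized datasets.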

# Plot by Char Feature (Column 'XC' in the data)
Although the classification task does not use the char feature column 'XC' as a class, I check whether 'XC' is strongly related to the numerical features. If it is, and the numerical features are predictive of the label, then 'XC' may also carry information about the final binary classification label.
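Independently of the embedding, the relationship between 'XC' and the label can also be checked directly by comparing the share of positive labels within each char value. The sketch below is illustrative only: `positive_rate_by_char` is a hypothetical helper, the toy values are made up, and it assumes the positive class is encoded as 1.

```python
from collections import Counter

def positive_rate_by_char(chars, targets):
    """Return {char: fraction of rows with target == 1}."""
    totals = Counter(chars)
    positives = Counter(c for c, t in zip(chars, targets) if t == 1)
    return {c: positives[c] / totals[c] for c in sorted(totals)}

# Toy values, made up for illustration
demo_chars = ['A', 'A', 'B', 'B', 'B', 'C']
demo_targets = [0, 1, 1, 1, 0, 0]
demo_rates = positive_rate_by_char(demo_chars, demo_targets)
```

In this notebook the call would be `positive_rate_by_char(chars, targets)`. Rates that differ clearly between char values would suggest that 'XC' is informative about the label; similar rates would suggest it is not.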

In [None]:
COLORS = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
def plot_by_char(vec, char, target):
    plt.scatter(vec[0], vec[1], c=COLORS[ord(char) - ord('A')], marker=f'${char}$')

In [None]:
# Visualize by char feature
plt.figure(figsize=(16,16))
for target, char_dict in data.items():
    for char, data_pts_ls in char_dict.items():
        for data_pt in data_pts_ls:
            plot_by_char(data_pt, char, target)
plt.show()

In this plot, I check whether there are perceptible groupings by the char feature (column 'XC'). The result is similar to the earlier plot by target label: the chars are mixed throughout, and the points again resemble a Gaussian cloud. As before, this describes the appearance of the visualization, not an assumption about the distribution of the data.

For both plots, it should be noted that t-SNE measures the similarity of points in the embedded space with a Student's t-distribution (and in the original space with a Gaussian). The denser center may therefore be partly an artifact of the method rather than a property of the data.

Because the points overlap so heavily, I conclude that this visualization does not reveal any obvious relationships between data points, if any exist.
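To make the role of the t-distribution concrete, the sketch below compares the two unnormalized similarity kernels that t-SNE uses: a Gaussian for distances in the original space and a Student's t with one degree of freedom for distances in the embedding. The function names and the fixed `sigma=1.0` are illustrative assumptions; in t-SNE itself the Gaussian bandwidth is set per point from the perplexity.

```python
import math

def gaussian_kernel(d, sigma=1.0):
    """Unnormalized Gaussian similarity at distance d."""
    return math.exp(-d ** 2 / (2 * sigma ** 2))

def student_t_kernel(d):
    """Unnormalized Student's t (one degree of freedom) similarity at distance d."""
    return 1.0 / (1.0 + d ** 2)

# Compare how quickly each similarity decays with distance
for d in (0.5, 1.0, 2.0, 4.0):
    print(f"d={d}: gaussian={gaussian_kernel(d):.4f}, student_t={student_t_kernel(d):.4f}")
```

At larger distances the t kernel decays much more slowly than the Gaussian, which is what lets t-SNE place moderately dissimilar points far apart in the embedding.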