### Visualization Notebook

Summary: \
Quick experiments for viewing the distribution of word vectors in 3D space. It could be worth repeating them with the ada / claude generated vectors as well. For the preliminary vectorizers and cleaning strategies, the reduced-dimensionality plots showed no clear patterns.

PCA Vis for Word Vectors

In [1]:
from Vectorizers import PreW2V, AdaVectorizer
from DataLoaders import AbstractDataLoader, OriginalDataLoader, ProcessedData, NoSWLoader, MatchLoader
from typing import List
import numpy as np
import pandas as pd
from type_utils import MatchedData, Label, CleanedAndLabeledData, UnprocessedData

In [2]:
# dataloader: AbstractDataLoader = NoSWLoader(data_path='../william_data/test_xml/')
dataloader = MatchLoader(data_path='../william_data/test_xml/')
data_o: MatchedData = dataloader.load_and_preprocess_data()

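import re

def quick_clean(x: str) -> str:
    """Flatten newlines and rejoin words hyphenated across a line break."""
    # replace newlines with spaces
    x = x.replace('\n', ' ')
    # rejoin split words: 'oc- cupent' -> 'occupent'
    # [^\W\d_] matches any letter, including accented ones such as 'é'
    x = re.sub(r'(?<=[^\W\d_])-\s+', '', x)

    return x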

data: UnprocessedData = {
    'good': [quick_clean(e['snippet']) for e in data_o['good']],
    'bad': [quick_clean(e['snippet']) for e in data_o['bad']]
}

vectorizer = AdaVectorizer()  # alternative: PreW2V('fr_w2v_web_w5')
X, y = vectorizer.vectorize(data)


# good 328
# bad 2290
loading from memory...
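
The de-hyphenation step in `quick_clean` is easy to check in isolation. Below is a minimal standalone sketch of that step; the helper name `join_hyphen_breaks` and the sample string are hypothetical, and the lookbehind uses `[^\W\d_]` (any Unicode letter) on the assumption that OCR line breaks can also fall after accented letters.

```python
import re

# Hypothetical standalone variant of the cleaning step.
# [^\W\d_] matches any letter (including accented ones), but not digits or '_'.
_HYPHEN_BREAK = re.compile(r'(?<=[^\W\d_])-\s+')

def join_hyphen_breaks(text: str) -> str:
    """Flatten newlines, then rejoin words split as 'oc- cupent'."""
    text = text.replace('\n', ' ')
    return _HYPHEN_BREAK.sub('', text)

# Made-up sample with one break across a newline and one after an accented letter.
sample = "ils oc-\ncupent la façade mé- ridionale"
cleaned = join_hyphen_breaks(sample)
print(cleaned)
```

A hyphen that is not followed by whitespace (as in `peut-être`) is left alone, but a genuine compound that happens to be split with a space after the hyphen would be merged too, so this is a heuristic rather than a safe transformation.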


In [3]:
import pandas as pd

# use PCA to create a 3d representation of our data points in X using plotly. color code them by their label.
# on hover, show the index and label of each data point (the corresponding sentence
# could be added to `text` once the row order of X is confirmed to match `data`)
from sklearn.decomposition import PCA
from plotly import graph_objects as go

pca = PCA(n_components=3)
pca_result = pca.fit_transform(X)

# create a dataframe with the 3 pca dimensions and the label
pca_df = pd.DataFrame(pca_result, columns=['pca1', 'pca2', 'pca3'])
pca_df['label'] = y

# create a figure
fig = go.Figure()

# add a scatter plot with the pca dimensions as x, y, z and the label as the color
fig.add_trace(go.Scatter3d(
    x=pca_df['pca1'],
    y=pca_df['pca2'],
    z=pca_df['pca3'],
    mode='markers',
    # hover text: row index and label of each point
    text=[f'index: {i}<br>label: {label}' for i, label in zip(pca_df.index, pca_df['label'])],
    hoverinfo='text',
    marker=dict(
        size=3,
        color=pca_df['label'],
        colorscale='Viridis',
        opacity=0.8
    )
))

# show the figure
fig.show()
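
A 3D scatter plot is hard to judge by eye, so it may help to put numbers on it. The sketch below is not part of the original experiments: it uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical) and reports how much variance the three plotted components keep, plus a silhouette score of the labels in the reduced space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for (X, y): two overlapping, imbalanced classes in 50 dimensions.
X_demo = rng.normal(size=(300, 50))
y_demo = np.array([0] * 250 + [1] * 50)

pca_demo = PCA(n_components=3)
reduced = pca_demo.fit_transform(X_demo)

# Fraction of the total variance retained by the three plotted components.
kept = pca_demo.explained_variance_ratio_.sum()

# Silhouette close to 0 (or negative) means the labels do not form separate clusters.
score = silhouette_score(reduced, y_demo)

print(f"variance kept: {kept:.2%}, silhouette: {score:.3f}")
```

If the retained variance is small, a lack of visible structure in the plot says little about whether the classes are separable in the full embedding space.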

In [7]:
from sklearn.decomposition import IncrementalPCA    # initial reduction
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling

# reduce_dimensions expects a gensim-style model exposing
# `.vectors` and `.index_to_key` (presumably the one wrapped by PreW2V)
model = vectorizer.model

def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    # extract the words & their vectors, as numpy arrays
    vectors = np.asarray(model.vectors)
    labels = np.asarray(model.index_to_key)  # fixed-width numpy strings

    # reduce using t-SNE; this runs on the full vocabulary and can be very slow
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    # split the embedding into x and y coordinate lists
    x_vals = vectors[:, 0].tolist()
    y_vals = vectors[:, 1].tolist()
    return x_vals, y_vals, labels


x_vals, y_vals, labels = reduce_dimensions(model)


def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt
    import random

    random.seed(0)

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    # Label up to 25 randomly subsampled data points
    # (min() guards against vocabularies with fewer than 25 entries)
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, min(25, len(indices)))
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))

plot_function = plot_with_matplotlib


plot_function(x_vals, y_vals, labels)


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



KeyboardInterrupt: 

Insight into good / bad classes

In [6]:
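from typing import Union

def data_to_sentence(data: Union[str, List[str]]) -> str:
    # snippets may be raw strings or token lists, depending on the loader;
    # joining a plain string would put a space between every character
    return data if isinstance(data, str) else ' '.join(data)

# print a random good and random bad sentence
print(data_to_sentence(data['good'][np.random.randint(0, len(data['good']))]))
print(data_to_sentence(data['bad'][np.random.randint(0, len(data['bad']))]))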

tionale on peut visiter les appartements du château en l absence de l empereur le reste du pa jais renferme de grandes administrations publiques el une partie du musée de peinture qui occupe le premier étage des galeries dont la façade mé ridiouale regarde la seine palais du ce magnifique palais occupe l emplacement où s élevait anciennement un château fort antique demeure des rois de france philippe auguste y avait fait construire en une grosse our connue dans l histoire sous le nom de tour du qui terminait l enceinte de paris sur cette rive de la seine faisant face à la tour de élevée sur la rive opposée et qui seryait alors selon t usage tout à la fois de palais de forteresse et de prison prison terrible et redoutée des grands feudataires de la couronne en celle vieille construction tombant en ruines françois ier ja fit raser et pierre lescot commença la construction du palais actuel c est àlui que l on doit cette partie du actuel qui du pavillon de l horloge situé en face du pavill