# Word embeddings

Whereas pictures and audio are inherently mathematical in nature, sentences have no direct mathematical analog. Text data therefore has to be embedded in a mathematical framework before it can be used in machine learning. One-hot encoding can work, but it retains neither the relations between words nor what the words themselves mean. NLP (natural language processing) aims to provide a possible solution to this.
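
To make the limitation of one-hot encoding concrete, here is a minimal sketch. The three-word vocabulary and the `cosine_similarity` helper are made up for illustration and are not part of the notebook. Every pair of distinct one-hot vectors is orthogonal, so no pair of words looks more related than any other.

```python
import numpy as np

# Hypothetical three-word vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "car"]

# One-hot encoding: each word is a row of the identity matrix.
one_hot = np.eye(len(vocab), dtype=int)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare every pair of distinct words.
for i in range(len(vocab)):
    for j in range(i + 1, len(vocab)):
        sim = cosine_similarity(one_hot[i], one_hot[j])
        print(f"{vocab[i]} vs {vocab[j]}: {sim}")
```

Each pair prints a similarity of 0.0, so "cat" is exactly as far from "dog" as it is from "car".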

In [6]:
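# import the text data to embed

with open('../Data/49960') as file:
    contents = file.read()

# split on single spaces and drop empty strings;
# each remaining chunk is later treated as one document by the vectorizer
contents = [x for x in contents.split(' ') if x != '']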

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # initialize the vectorizer

X = vectorizer.fit_transform(contents)  # fit the vectorizer and build the sparse document-term count matrix
Xc = (X.T @ X).todense()  # the product is still a compressed sparse matrix, so convert it to a dense one
Xc.shape  # Xc[i, j] counts how often word i occurred together with word j in the same document

(852, 852)

In [21]:
# Notice that this can be thought of as 852 instances with 852 features.
# PCA shows how far the dimension can be reduced, which improves computation times.

import numpy as np

from sklearn.decomposition import PCA
pca_ini = PCA()
exp_var = pca_ini.fit(np.asarray(Xc)).explained_variance_ratio_

for n in range(50,800,50):
    print(f'{n} components explain {sum(exp_var[:n])*100} of the variance.')

50 components explain 90.99838080214158 of the variance.
100 components explain 93.96027055575762 of the variance.
150 components explain 95.57991442437799 of the variance.
200 components explain 96.51960897512026 of the variance.
250 components explain 97.45046956936194 of the variance.
300 components explain 97.83409350244322 of the variance.
350 components explain 98.06901714012851 of the variance.
400 components explain 98.3039407778138 of the variance.
450 components explain 98.53886441549909 of the variance.
500 components explain 98.77378805318439 of the variance.
550 components explain 99.00871169086967 of the variance.
600 components explain 99.24363532855497 of the variance.
650 components explain 99.47855896624026 of the variance.
700 components explain 99.71348260392556 of the variance.
750 components explain 99.93584081502573 of the variance.
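
Instead of reading the component count off a printed table, the smallest count that reaches a target can be computed directly. The helper below is a hypothetical sketch, and `example_ratios` is made-up data standing in for `exp_var`.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= target
    n = int(np.searchsorted(cumulative, target)) + 1
    # clamp in case rounding keeps the total just below the target
    return min(n, len(cumulative))

# Hypothetical ratios for illustration; in the notebook this would be exp_var.
example_ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
print(components_for_variance(example_ratios, 0.9))
```

Applied to `exp_var`, a call such as `components_for_variance(exp_var, 0.95)` would return the first component count that explains at least 95% of the variance.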


In [26]:
# the feature names (the word for each row and column) are given by

wordlabels = vectorizer.get_feature_names_out()

# using 100 components, we can reduce the co-occurrence matrix to 100 features per word

pca = PCA(n_components=100)
embeddings = pca.fit_transform(np.asarray(Xc))
embeddings.shape
# The words are now embedded in a mathematical form. From this we can see which words are closely associated.

(852, 100)

In [36]:
# nearest neighbors
from sklearn.neighbors import KDTree
from sklearn.preprocessing import normalize

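# For unit-length vectors, squared Euclidean distance equals 2 * (1 - cosine similarity),
# so Euclidean nearest neighbors are also the nearest neighbors by cosine distance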
embeddings = normalize(embeddings)
tree = KDTree(embeddings)

evalWord = 'short'
k = 5 # five closest words

dist, ind = tree.query([embeddings[list(wordlabels).index(evalWord)]], k=k)

for i in range(k):
    print(wordlabels[ind[0][i]], ":  ", dist[0][i])

# Note that the nearest neighbors are only as good as the data the embedding was built from.
# This illustrates the basic idea of an embedding; modern methods learn the vectors with neural networks instead.

short :   0.0
compromise :   0.19154606252369427
less :   0.9321305669073944
formalistic :   0.9321305669073944
273 :   0.944492651476401


In [37]:
Xc

matrix([[1, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        [0, 0, 5, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 1, 0, 0],
        [0, 0, 0, ..., 0, 1, 0],
        [0, 0, 0, ..., 0, 0, 1]], dtype=int64)
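
The visible corners of the matrix are diagonal, which would be consistent with most documents passed to the vectorizer holding a single word. The sketch below illustrates that effect on a made-up sample text. The helper names are hypothetical, and splitting on periods is just one assumed way to form multi-word documents, not the notebook's method.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def off_diagonal_fraction(matrix):
    """Fraction of the total count mass that lies off the main diagonal."""
    m = np.asarray(matrix)
    total = m.sum()
    return float((total - np.trace(m)) / total) if total else 0.0

def cooccurrence(docs):
    """Dense word-by-word co-occurrence counts for a list of documents."""
    X = CountVectorizer().fit_transform(docs)
    return (X.T @ X).toarray()

text = "the cat sat. the dog sat. the cat ran."  # hypothetical sample text

word_docs = [w for w in text.split(' ') if w != '']        # one word per document
sentence_docs = [s for s in text.split('.') if s.strip()]  # one sentence per document

print(off_diagonal_fraction(cooccurrence(word_docs)))
print(off_diagonal_fraction(cooccurrence(sentence_docs)))
```

With one word per document the fraction is 0.0, meaning no word ever co-occurs with a different word. With one sentence per document, two thirds of the count mass moves off the diagonal in this example.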