# LSA - Code

## Building the Term / Document matrix
LSA (latent semantic analysis) is a natural language processing method that helps analyse a corpus of texts. To follow the rest of this notebook, we need a few terms:
* *Corpus :* the collection of texts to be analysed
* *Document :* a single text in the corpus
* *Term :* an individual word in the corpus

To run LSA we need a term-document matrix: its rows represent the terms present in the corpus, its columns represent the documents, and each value is a metric that quantifies the link between a given term and a given document. For example, this metric may be the number of times the term appears in the document. More commonly, the metric is the term frequency-inverse document frequency (TF-IDF) introduced in the first lecture, because it represents the relevance of each term in each document more accurately. Note that scikit-learn's vectorizers return the transpose of this matrix, with one row per document and one column per term.
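As a minimal sketch of the count-based version of this matrix, the snippet below builds it by hand for a hypothetical two-document corpus (not the one used in this notebook), assuming the text is already lowercased and can be split on whitespace:

```python
from collections import Counter

# Hypothetical corpus, already lowercased and free of punctuation.
corpus = ["dog barks dog", "cat sleeps"]

# Count the terms of each document, then collect the sorted vocabulary.
counts = [Counter(doc.split()) for doc in corpus]
terms = sorted(set().union(*counts))

# Rows are terms, columns are documents, values are raw counts.
term_document = [[c[t] for c in counts] for t in terms]

for term, row in zip(terms, term_document):
    print(f"{term:>6}: {row}")
```

Each row shows how often one term occurs in each document; a missing term simply counts as zero.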

Let's create a toy corpus and walk through the steps of building the term-document matrix:

In [5]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import spacy

from sklearn.decomposition import TruncatedSVD

# List of documents
a1 = "He is a good dog."
a2 = "The dog is too lazy."
a3 = "That is a brown cat."
a4 = "Cats are very active."
a5 = "I have a brown cat and dog."

df = pd.DataFrame()
df["documents"] = [a1,a2,a3,a4,a5]
df

|  | documents |
|---|---|
| 0 | He is a good dog. |
| 1 | The dog is too lazy. |
| 2 | That is a brown cat. |
| 3 | Cats are very active. |
| 4 | I have a brown cat and dog. |


We now run some simple preprocessing on this corpus to make it suitable for our model:
* remove all special characters
* convert all characters to lowercase

In [6]:
# Load spaCy's small English model
import en_core_web_sm
nlp = en_core_web_sm.load()

In [7]:
# Preprocessing
from spacy.lang.en.stop_words import STOP_WORDS
df['clean_documents'] = df['documents'].str.replace(r"[^A-Za-z0-9 ]+", " ", regex=True)
df['clean_documents'] = df['clean_documents'].fillna('').apply(lambda x: x.lower())
tokenized_doc = df['clean_documents'].fillna('').apply(lambda x: nlp(x))
tokenized_doc = tokenized_doc.apply(lambda doc: [token.lemma_ for token in doc if token.text not in STOP_WORDS])

Now that the text has been preprocessed, we proceed to tokenization, the process of isolating each word of each document as a token. This is also a good opportunity to remove stop-word tokens.
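To make these steps concrete without loading a language model, here is a rough pure-Python sketch. The `STOP` set and the `preprocess` helper are hypothetical stand-ins: spaCy's `STOP_WORDS` is far larger, and this version does not lemmatize.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP = {"he", "is", "a", "the", "too", "that", "are", "very", "i", "have", "and"}

def preprocess(text):
    """Replace special characters with spaces, lowercase, split, drop stop words."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", text).lower()
    return [tok for tok in cleaned.split() if tok not in STOP]

print(preprocess("The dog is too lazy."))    # ['dog', 'lazy']
print(preprocess("Cats are very active."))   # ['cats', 'active'] (no lemmatization)
```

The second line shows why lemmatization matters: without it, "cats" and "cat" would be treated as two different terms.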

In [8]:


# tokenization
tokenized_doc = df['clean_documents'].fillna('').apply(lambda x: nlp(x))

In [9]:
# remove stop-words
tokenized_doc = df['clean_documents'].fillna('').apply(lambda x: nlp(x))
tokenized_doc = tokenized_doc.apply(lambda doc: [token.lemma_ for token in doc if token.text not in STOP_WORDS])
tokenized_doc

0          [good, dog]
1          [dog, lazy]
2         [brown, cat]
3        [cat, active]
4    [brown, cat, dog]
Name: clean_documents, dtype: object

Now that all our documents have been tokenized and cleaned, it is time to create our term-document matrix, which gives the TF-IDF of each term in each document. But first we have to put the cleaned documents back into the DataFrame.

In [10]:
# Preprocessing
df['clean_documents'] = df['documents'].str.replace(r"[^A-Za-z0-9 ]+", " ", regex=True)
df['clean_documents'] = df['clean_documents'].fillna('').apply(lambda x: x.lower())
tokenized_doc = df['clean_documents'].fillna('').apply(lambda x: nlp(x))
tokenized_doc = tokenized_doc.apply(lambda doc: [token.lemma_ for token in doc if token.text not in STOP_WORDS])
df["clean_documents"] = tokenized_doc.apply(lambda x: ' '.join(x))


In [11]:
# TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_documents'])
X

<5x6 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [12]:
dense = X.toarray()
dense

array([[0.        , 0.        , 0.        , 0.55645052, 0.83088075,
        0.        ],
       [0.        , 0.        , 0.        , 0.55645052, 0.        ,
        0.83088075],
       [0.        , 0.76944707, 0.63871058, 0.        , 0.        ,
        0.        ],
       [0.83088075, 0.        , 0.55645052, 0.        , 0.        ,
        0.        ],
       [0.        , 0.64846263, 0.53828256, 0.53828256, 0.        ,
        0.        ]])

In [13]:
vectorizer.vocabulary_

{'good': 4, 'dog': 3, 'lazy': 5, 'brown': 1, 'cat': 2, 'active': 0}

Note that the object returned by `fit_transform` is a sparse matrix. This format is often used for matrices in which most elements are zero, because it saves memory and speeds up computation. The drawback is that a sparse matrix does not display its values directly, which is why we converted it with `toarray()` above.
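The values in the dense array can be reproduced by hand. The sketch below assumes scikit-learn's documented `TfidfVectorizer` defaults (raw term counts, smoothed idf, L2 row normalisation); the variable names are illustrative only.

```python
import numpy as np

# Cleaned documents and alphabetical vocabulary, as obtained in the cells above.
docs = ["good dog", "dog lazy", "brown cat", "cat active", "brown cat dog"]
vocab = ["active", "brown", "cat", "dog", "good", "lazy"]

# Raw term counts: one row per document, one column per term.
counts = np.array(
    [[doc.split().count(term) for term in vocab] for doc in docs], dtype=float
)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale every row to unit Euclidean length.
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.round(tfidf[0], 6))  # agrees with the first row of `dense` up to rounding
```

Because "dog" appears in three documents and "good" in only one, "good" receives the larger weight in the first document.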

## Apply truncatedSVD

Remember from theory that truncated SVD keeps only the $k$ largest singular values in order to approximate a matrix $A$.

If we select only the $k$ largest of the $l$ singular values $\sigma_1, ..., \sigma_l$, together with the corresponding vectors from $U$ and $V$, we obtain the best rank-$k$ approximation of $X$ in the least-squares sense. We can then map the term vector $t_i^T$ to its rank-$k$ approximation $\hat{t_i^T}_{(k)}$, and $\hat{d_j}$ to its rank-$k$ approximation $\hat{d_j}_{(k)}$.
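A small NumPy sketch of this truncation follows. The matrix is random and purely illustrative (an assumption, not the notebook's data); the check at the end uses the fact that the Frobenius error of the rank-$k$ approximation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # hypothetical 6-term x 5-document matrix

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and the matching vectors.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The approximation error is fully determined by the discarded singular values.
error = np.linalg.norm(A - A_k, ord="fro")
expected = np.sqrt(np.sum(s[k:] ** 2))

print(np.linalg.matrix_rank(A_k), np.isclose(error, expected))
```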

You can now do the following:

* See how related documents $j$ and $q$ are in the low-dimensional space by comparing the vectors $\Sigma_{k} \hat{d_{j}}$ and $\Sigma_{k}\hat{d_q}$ (typically by [cosine similarity](https://en.wikipedia.org/wiki/Vector_space_model)).
* Compare terms $i$ and $p$ by comparing the vectors $\Sigma_{k}\hat{t_{i}}$ and $\Sigma_{k}\hat{t_{p}}$. Note that $\hat{t}$ is now a column vector.
* Documents and term vector representations can be clustered using traditional clustering algorithms like k-means using similarity measures like cosine.
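As an illustration of the first point, here is a minimal cosine-similarity sketch. The two-dimensional coordinates are the rounded document vectors that this notebook obtains for the toy corpus, taken here as already scaled by $\Sigma_k$; the helper name `cosine_similarity` is ours.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rounded 2-d coordinates of three toy documents in the reduced space.
good_dog = np.array([0.341383, 0.719978])
dog_lazy = np.array([0.341383, 0.719978])
brown_cat = np.array([0.860949, -0.365984])

print(cosine_similarity(good_dog, dog_lazy))   # identical coordinates, so close to 1
print(cosine_similarity(good_dog, brown_cat))  # close to 0: the documents share no term
```

Cosine similarity ignores vector length, so a long and a short document about the same subject can still come out as close.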

The $k$ dimensions retained for the low dimensional space are often referred to as topics even though their interpretation may be difficult.

Let's see how we can use the SVD in Python to understand our corpus:



In [14]:
svd_model = TruncatedSVD(n_components=2)
lsa = svd_model.fit_transform(X)

In [15]:
df

|  | documents | clean_documents |
|---|---|---|
| 0 | He is a good dog. | good dog |
| 1 | The dog is too lazy. | dog lazy |
| 2 | That is a brown cat. | brown cat |
| 3 | Cats are very active. | cat active |
| 4 | I have a brown cat and dog. | brown cat dog |


In [16]:
# Truncated SVD represents each document as a vector in the 2-dimensional topic space
svd_model = TruncatedSVD(n_components=2)
lsa = svd_model.fit_transform(X)

topic_encoded_df = pd.DataFrame(lsa)
topic_encoded_df["documents"] = df['clean_documents']
topic_encoded_df.loc[0]

0            0.341383
1            0.719978
documents    good dog
Name: 0, dtype: object

In [17]:
topic_encoded_df.iloc[0,:-1].sum()

1.0613615258740996

In [18]:
from sklearn.decomposition import PCA

pca = PCA()
pd.DataFrame(pca.fit_transform(X.toarray()))

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.634346 | -0.006953 | -0.5875214 | -0.03495 | 4.6608720000000003e-17 |
| 1 | 0.634346 | -0.006953 | 0.5875214 | -0.03495 | 4.6608720000000003e-17 |
| 2 | -0.596842 | -0.336615 | -3.345062e-16 | -0.209075 | 4.6608720000000003e-17 |
| 3 | -0.4001 | 0.754422 | 1.783933e-16 | 0.036099 | 4.6608720000000003e-17 |
| 4 | -0.271749 | -0.403902 | -3.428177e-16 | 0.242875 | 4.6608720000000003e-17 |


In [19]:
topic_encoded_df

|  | 0 | 1 | documents |
|---|---|---|---|
| 0 | 0.341383 | 0.719978 | good dog |
| 1 | 0.341383 | 0.719978 | dog lazy |
| 2 | 0.860949 | -0.365984 | brown cat |
| 3 | 0.516666 | -0.385005 | cat active |
| 4 | 0.949412 | 0.02363 | brown cat dog |


In [20]:
fig = px.scatter(topic_encoded_df, x=0 ,y=1, color = "documents")
fig.show()

In [21]:
svd_model.components_

array([[ 0.20035413,  0.59651171,  0.6293381 ,  0.4158308 ,  0.1323826 ,
         0.1323826 ],
       [-0.24244085, -0.2018099 , -0.32988591,  0.61690333,  0.45337665,
         0.45337665]])

In [22]:
vectorizer.get_feature_names_out()

array(['active', 'brown', 'cat', 'dog', 'good', 'lazy'], dtype=object)

In [23]:
topic = pd.DataFrame(svd_model.components_, index = ["topic_1","topic_2"])
topic.columns= vectorizer.get_feature_names_out()
topic_t = topic.transpose()
topic_t

|  | topic_1 | topic_2 |
|---|---|---|
| active | 0.200354 | -0.242441 |
| brown | 0.596512 | -0.20181 |
| cat | 0.629338 | -0.329886 |
| dog | 0.415831 | 0.616903 |
| good | 0.132383 | 0.453377 |
| lazy | 0.132383 | 0.453377 |


In [24]:
topic_t['topic_1'].sort_values(ascending=False)

cat       0.629338
brown     0.596512
dog       0.415831
active    0.200354
lazy      0.132383
good      0.132383
Name: topic_1, dtype: float64

In [25]:
topic_t['topic_2'].sort_values(ascending=False)

dog       0.616903
good      0.453377
lazy      0.453377
brown    -0.201810
active   -0.242441
cat      -0.329886
Name: topic_2, dtype: float64

In [26]:
topic = pd.DataFrame(svd_model.components_, index = ["topic_1","topic_2"])
topic.columns= vectorizer.get_feature_names_out()
topic_t = topic.transpose()

fig = px.scatter(topic_t, x = "topic_1", y = "topic_2", color = topic_t.index)
fig.show()

In [27]:
svd_model

In [28]:
svd_model.explained_variance_ratio_

array([0.10893162, 0.40096526])

Here we used the SVD to approximate the term-document matrix with a decomposition of rank 2, and we can see how each document (sentence) is projected onto these two topics.

We notice that topic 1 loads most heavily on cat and brown, while topic 2 loads most heavily on dog, good and lazy and gives negative weights to the cat-related terms.
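One way to read the topics programmatically is to list the highest-weighted terms of each component. The sketch below copies the rounded weights from the table above; `top_terms` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

terms = np.array(["active", "brown", "cat", "dog", "good", "lazy"])

# Rounded component weights copied from the topic table (rows = topics).
components = np.array([
    [0.200354, 0.596512, 0.629338, 0.415831, 0.132383, 0.132383],
    [-0.242441, -0.201810, -0.329886, 0.616903, 0.453377, 0.453377],
])

def top_terms(weights, names, n=3):
    """Return the n terms with the largest weights, highest first."""
    order = np.argsort(weights)[::-1][:n]
    return [str(name) for name in names[order]]

for i, row in enumerate(components, start=1):
    print(f"topic_{i}:", top_terms(row, terms))
```

Terms with equal weights, such as good and lazy in topic 2, may come out in either order.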

In [30]:
from sklearn.decomposition import LatentDirichletAllocation

In [31]:
df = pd.read_csv('../../12_assets/06_unsupervised_ML/twitter_training.csv')
df.columns = ['drop', 'brand', 'sentiment', 'documents']
X = df['documents'].sample(1000)

In [32]:
X = X.str.replace(r"[^A-Za-z0-9 ]+", " ", regex=True)
X= X.fillna('').apply(lambda x: x.lower())
tokenized_doc = X.apply(lambda x: nlp(x))
tokenized_doc = tokenized_doc.apply(lambda doc: [token.lemma_ for token in doc if token.text not in STOP_WORDS])
X = tokenized_doc.apply(lambda x: ' '.join(x))

In [33]:
cv = CountVectorizer()
vals = cv.fit_transform(X.values)

In [34]:
analyse = pd.DataFrame(vals.toarray(), columns=cv.get_feature_names_out())


In [35]:
term_counts = analyse.sum().sort_values(ascending=False)
term_counts[term_counts >= 4]

com        158
game       126
play        83
good        83
like        71
          ... 
term         4
yo           4
lady         4
4k           4
despite      4
Length: 565, dtype: int64

In [36]:
term_counts = analyse.sum().sort_values(ascending=False)
new_stop = term_counts[term_counts < 4].index

In [37]:
new_stop

Index(['hair', 'ray', 'swedish', 'clean', 'system', 'studio', 'half', 'high',
       'child', 'completely',
       ...
       'glove', 'gian', 'giant', 'girlfriend', 'github', 'gjbvddptvm', 'glare',
       'glitch', 'globe', 'zzz'],
      dtype='object', length=3094)

In [38]:
stop = list(STOP_WORDS)
stop.extend(new_stop)

In [39]:
vals

<1000x3659 sparse matrix of type '<class 'numpy.int64'>'
	with 9805 stored elements in Compressed Sparse Row format>

In [40]:
tf = TfidfVectorizer(stop_words=stop)
tf_vec = tf.fit_transform(X)


Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ll', 've'] not in stop_words.



In [41]:
# Sum the TF-IDF weights of each document and count the documents whose sum is zero
doc_sums = pd.DataFrame(tf_vec.toarray(), columns=tf.get_feature_names_out()).T.sum()
doc_sums[doc_sums == 0].shape

(58,)

In [42]:
svd_model = TruncatedSVD(n_components=2)
lsa = svd_model.fit_transform(tf_vec)


In [43]:
topics = pd.DataFrame(svd_model.components_)
topics.columns = tf.get_feature_names_out()

In [44]:
topics.T[0].sort_values(ascending=False)

com        0.416758
game       0.300020
twitter    0.284868
pic        0.273066
good       0.249319
             ...   
assist     0.002516
sync       0.002288
poverty    0.002033
despite    0.002033
messi      0.001200
Name: 0, Length: 548, dtype: float64

In [45]:
topics.T[1].sort_values(ascending=False)

com          0.442755
twitter      0.360501
pic          0.325792
rhandlerr    0.143724
facebook     0.107642
               ...   
bad         -0.112217
like        -0.170053
good        -0.192471
play        -0.206219
game        -0.446182
Name: 1, Length: 548, dtype: float64

In [46]:
vectorizer.get_feature_names_out()

array(['active', 'brown', 'cat', 'dog', 'good', 'lazy'], dtype=object)

In [47]:
topics = pd.DataFrame(svd_model.components_, index=[f'topic{i}' for i in range(len(lsa[0]))]).T

In [48]:
lda = LatentDirichletAllocation(n_components=2)
ldaframe = pd.DataFrame(lda.fit_transform(tf_vec))

In [49]:
components = pd.DataFrame(lda.components_, columns=tf.get_feature_names_out()).T

In [50]:
components[0].sort_values(ascending=False)

good           22.141495
game           18.226379
play           18.203817
look           14.916508
bad            13.555316
                 ...    
talc            0.510514
preparatory     0.510041
line            0.507915
poverty         0.507836
despite         0.507836
Name: 0, Length: 548, dtype: float64

In [51]:
components[1].sort_values(ascending=False)

com         31.838719
twitter     18.074115
love        17.727326
pic         17.114949
shit        16.445703
              ...    
original     0.514226
usually      0.513804
beta         0.513490
assist       0.513164
gen          0.511859
Name: 1, Length: 548, dtype: float64

In [52]:
topics.T[0].sort_values(ascending=False).head(100)

topic1    0.015600
topic0    0.015013
Name: 0, dtype: float64

In [53]:
pd.set_option('display.max_rows', 500)


In [54]:
topics.T[1].sort_values(ascending=False).head(100)

topic0    0.04494
topic1   -0.01942
Name: 1, dtype: float64

## Resources 📚📚

*   <a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html?highlight=truncated%20svd#sklearn.decomposition.TruncatedSVD" target="_blank">Truncated SVD</a>

*   <a href="https://scikit-learn.org/stable/modules/decomposition.html#lsa" target="_blank">Truncated singular value decomposition and latent semantic analysis</a>