# Latent Semantic Analysis (LSA)

This notebook develops the LSA approach as a proof of concept; it is not the final implementation.

Preprocessing produced a TF-IDF representation. Because this representation is still sparse and high-dimensional (it suffers from the curse of dimensionality), we apply LSA, specifically a truncated singular value decomposition (SVD), to reduce the TF-IDF document-term matrix to a document-"topic" (i.e., component) matrix.
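
As a minimal sketch of that reduction, the example below runs the same kind of decomposition on a tiny made-up corpus. The corpus and the use of scikit-learn's `TfidfVectorizer` (instead of this project's custom `cleanup_module` transformers) are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus (hypothetical, for illustration only)
docs = [
    "i love this phone",
    "i hate this phone",
    "great movie loved it",
    "terrible movie hated it",
]

# Document-term matrix: one row per document, one column per term
tfidf = TfidfVectorizer().fit_transform(docs)

# Truncated SVD keeps only the top-k singular directions ("topics")
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topic = svd.fit_transform(tfidf)

print(tfidf.shape)      # (n_documents, n_terms)
print(doc_topic.shape)  # (n_documents, 2)
```

Each document is now described by 2 component scores rather than one weight per vocabulary term.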

---

In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd

import cleanup_module as Cmod

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.decomposition import TruncatedSVD

### Sample 10% of the training data for POC

In [2]:
# load minimally prepared X, y train subsets
raw_path = os.path.join("..","data","1_raw","sentiment140")
X_train = pd.read_csv(os.path.join(raw_path, "X_train.csv"))
y_train = pd.read_csv(os.path.join(raw_path, "y_train.csv"))

# sample for dev
X, X_rest, y, y_rest = train_test_split(X_train, y_train, test_size=0.9, random_state=42)

# convert to 1-D NumPy arrays (column 2 of X, column 0 of y)
X_array = X.iloc[:, 2].to_numpy()
y_array = y.iloc[:, 0].to_numpy()

In [3]:
X_array.shape, y_array.shape

((119747,), (119747,))

In [4]:
pipe = Pipeline([('counter', Cmod.DocumentToNgramCounterTransformer()),
                 ('bow', Cmod.WordCounterToVectorTransformer(vocabulary_size=50000)),
                 ('tfidf', TfidfTransformer())])

In [5]:
X_train_transformed = pipe.fit_transform(X_array)

In [6]:
X_train_transformed

<119747x50001 sparse matrix of type '<class 'numpy.float64'>'
	with 2224008 stored elements in Compressed Sparse Row format>

### SVD

Code from the [Analytics Vidhya Tutorial](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/).

In [12]:
# SVD represents documents and terms as vectors
start_time = time.time()

svd_model = TruncatedSVD(n_components=20, # 20 topics
                         algorithm='randomized', 
                         n_iter=100, 
                         random_state=122)

svd_model.fit(X_train_transformed)

mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed: {mins:0.0f} min {secs:0.0f} sec')

Elapsed: 0 min 56 sec


In [27]:
svd_model.n_features_in_

50001

In [13]:
len(svd_model.components_) # number of components, i.e. output features

20

In [23]:
svd_model.singular_values_

array([79.97041978, 26.43187451, 22.59517475, 18.50297893, 18.3121478 ,
       17.372291  , 16.78251612, 16.09946305, 15.86258119, 15.05910139,
       14.89085909, 14.51385208, 14.23353069, 14.07930775, 13.90572399,
       13.69576191, 13.52584642, 13.48107192, 13.35254595, 13.12763369])
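
The singular values above drop sharply after the first component and then decay slowly. One common way to judge how many components to keep is the cumulative `explained_variance_ratio_` of the fitted model. The sketch below uses a random sparse matrix as a stand-in for the TF-IDF matrix, so the printed number says nothing about the real data.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a TF-IDF matrix (hypothetical data)
X_demo = sparse.random(200, 500, density=0.02, format="csr", random_state=0)

svd_demo = TruncatedSVD(n_components=20, random_state=0).fit(X_demo)

# Fraction of variance captured by each component, and the running total
ratios = svd_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(f"Variance retained by 20 components: {cumulative[-1]:.1%}")
```

On the real data, the same two lines applied to `svd_model` would show how much of the TF-IDF variance the 20 components retain.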

In [33]:
len(svd_model.components_[0]) # first component

50001

In [35]:
svd_model.components_[0][:50] # first 50 items of the first component

array([0.94860277, 0.12504544, 0.11471194, 0.07550355, 0.06407802,
       0.06050408, 0.0490459 , 0.04268702, 0.04106846, 0.03921248,
       0.03968058, 0.03613236, 0.03218715, 0.03267205, 0.02960873,
       0.03013664, 0.02674532, 0.02565623, 0.02446787, 0.02438652,
       0.02285762, 0.02493416, 0.02166465, 0.03104746, 0.02302004,
       0.02319094, 0.02228861, 0.02512052, 0.02007365, 0.0203596 ,
       0.02010738, 0.01985315, 0.02214465, 0.02016377, 0.01679119,
       0.01915747, 0.01981817, 0.01975959, 0.01759534, 0.0198492 ,
       0.01679392, 0.01695766, 0.01705169, 0.01784087, 0.01582815,
       0.01571794, 0.01614131, 0.01638356, 0.01504842, 0.0156314 ])

In [None]:
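# svd_model.components_ is a 2-D NumPy array of shape (20, 50001): one row per component, one column per term.
# Transposing it gives the 50001 x 20 term-topic matrix; the 119747 x 20 document-topic matrix
# we want comes from svd_model.transform(X_train_transformed).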

In [42]:
svd_model.components_.T.shape

(50001, 20)

In [43]:
X_train_transformed.shape

(119747, 50001)
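
To make the shapes concrete: `components_` holds the term loadings, while the document-topic matrix comes from `transform`, which projects the input onto those components. The sketch below checks this on a small random sparse matrix (a stand-in for `X_train_transformed`, not the real data).

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Small random sparse matrix standing in for the TF-IDF matrix (hypothetical data)
X_demo = sparse.random(100, 300, density=0.05, format="csr", random_state=0)

svd_demo = TruncatedSVD(n_components=5, random_state=0).fit(X_demo)

# components_ holds term loadings: (n_components, n_features)
print(svd_demo.components_.shape)   # (5, 300)

# transform() projects each document onto the components
doc_topic = svd_demo.transform(X_demo)
print(doc_topic.shape)              # (100, 5)

# Manual projection for comparison: X @ components_.T
manual = X_demo @ svd_demo.components_.T
print(np.allclose(doc_topic, manual))
```

Applied to this notebook, `svd_model.transform(X_train_transformed)` should therefore yield the 119747 by 20 matrix.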

In [19]:
#terms = pipe['tfidf'].get_feature_names() # fails: TfidfTransformer only reweights counts and does not store the vocabulary

In [20]:
#for i, comp in enumerate(svd_model.components_):
#    terms_comp = zip(terms, comp)
#    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:7]
#    print("Topic "+str(i)+": ")
#    for t in sorted_terms:
#        print(t[0])
#        print(" ")

In [21]:
#import umap
#import matplotlib.pyplot as plt
#
#X_topics = svd_model.transform(X_train_transformed)
#embedding = umap.UMAP(n_neighbors=150, min_dist=0.5, random_state=12).fit_transform(X_topics)
#
#plt.figure(figsize=(7,5))
#plt.scatter(embedding[:, 0], embedding[:, 1],
#            c=y_array,
#            s=10,  # size
#            edgecolor='none')
#plt.show()

Apart from LSA, other topic modeling techniques include Latent Dirichlet Allocation (LDA) and lda2vec. Analytics Vidhya also has an [article on LDA](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/). lda2vec is a more advanced topic model that builds on word2vec word embeddings.

In [36]:
X_train_transformed.shape

(119747, 50001)

In [90]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# baseline: fit on the full TF-IDF matrix (not the SVD-reduced features)
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_array, cv=10, verbose=1, scoring='accuracy')
print('Mean accuracy: ' + str(round(score.mean(),4)))

Mean accuracy: 0.6984


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.0s finished


In [91]:
from sklearn.naive_bayes import MultinomialNB

# baseline: also fit on the full TF-IDF matrix (not the SVD-reduced features)
NB_clf = MultinomialNB()
score = cross_val_score(NB_clf, X_train_transformed, y_array, cv=10, verbose=1, scoring='accuracy')
print('Mean accuracy: ' + str(round(score.mean(),4)))

Mean accuracy: 0.7068


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.0s finished
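
Both classifiers above were scored on the TF-IDF matrix itself. To evaluate the LSA features, one option is to put the SVD step inside a `Pipeline` so it is refit on each training fold. The sketch below runs on synthetic data from `make_classification` (a stand-in, not the Sentiment140 sample), so its accuracy is not comparable to the scores above. `MultinomialNB` is left out because it rejects negative inputs, and SVD outputs can be negative.

```python
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data standing in for the TF-IDF matrix and labels (hypothetical)
X_dense, y_demo = make_classification(n_samples=300, n_features=100,
                                      n_informative=10, random_state=0)
X_demo = sparse.csr_matrix(X_dense)

# SVD lives inside the pipeline so each CV fold fits it on training rows only
lsa_clf = Pipeline([
    ("svd", TruncatedSVD(n_components=20, random_state=0)),
    ("clf", LogisticRegression(solver="liblinear", random_state=42)),
])

scores = cross_val_score(lsa_clf, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f}")
```

On the real data, the same pipeline could be appended after the `tfidf` step and scored with `X_array` and `y_array`.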


---