# Latent Semantic Analysis (LSA)

This notebook explores and prototypes the LSA approach; it is not the final implementation.

Preprocessing produced a TF-IDF representation. Because that representation still suffers from the curse of dimensionality, we apply LSA, specifically matrix decomposition via truncated SVD, to reduce the TF-IDF document-term matrix to a document-"topic" (component) matrix.
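
In matrix terms, the reduction can be written as follows. The symbols are notation introduced here for illustration, not taken from the notebook: $X$ is the $n \times m$ TF-IDF document-term matrix and $k$ is the number of components kept.

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad U_k \in \mathbb{R}^{n \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k \in \mathbb{R}^{m \times k}
$$

$$
X_{\text{topics}} = U_k \Sigma_k = X V_k \in \mathbb{R}^{n \times k}
$$

Each row of $V_k^{\top}$ is one component ("topic") expressed as weights over the $m$ terms, and each row of $X_{\text{topics}}$ is a document expressed in those $k$ components.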

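So far, reducing from 50,000 features to 300 components (topics) has not helped accuracy: 5-fold cross-validated logistic regression scores 0.7513 on the SVD output versus 0.7999 on the full TF-IDF matrix.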

---

In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd

import cleanup_module as Cmod

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.decomposition import TruncatedSVD

### Sample 10% of the training data for POC

In [2]:
# load minimally prepared X, y train subsets
raw_path = os.path.join("..","data","1_raw","sentiment140")
X_train = pd.read_csv(os.path.join(raw_path, "X_train.csv"))
y_train = pd.read_csv(os.path.join(raw_path, "y_train.csv"))

# sample for dev
X, X_rest, y, y_rest = train_test_split(X_train, y_train, test_size=0.9, random_state=42)

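# extract the text column (position 2) and the labels as 1-D NumPy arrays
# (.to_numpy() avoids Series.ravel(), which is deprecated in recent pandas)
X_array = X.iloc[:, 2].to_numpy()
y_array = y.iloc[:, 0].to_numpy()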

In [3]:
X_array.shape, y_array.shape

((119747,), (119747,))

In [4]:
pipe = Pipeline([('counter', Cmod.DocumentToNgramCounterTransformer()),
                 ('bow', Cmod.WordCounterToVectorTransformer(vocabulary_size=50000)),
                 ('tfidf', TfidfTransformer())])

In [5]:
X_train_transformed = pipe.fit_transform(X_array)

In [6]:
X_train_transformed

<119747x50001 sparse matrix of type '<class 'numpy.float64'>'
	with 2224008 stored elements in Compressed Sparse Row format>
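
The `Cmod` transformers are project-specific, so the pipeline above cannot be run outside this repository. As a rough stand-in, and only under the assumption that they behave like an n-gram counter followed by a capped vocabulary, a similar sparse TF-IDF matrix can be sketched with scikit-learn's built-in `CountVectorizer`. The toy corpus and the names `docs`, `toy_pipe`, and `toy_tfidf` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the tweets (illustrative only).
docs = [
    "loving the sunny weather today",
    "this traffic is the worst",
    "great coffee and great company",
    "worst service ever, never again",
]

# CountVectorizer is a stand-in for the two Cmod transformers;
# max_features caps the vocabulary size.
toy_pipe = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 2), max_features=20)),
    ("tfidf", TfidfTransformer()),
])

toy_tfidf = toy_pipe.fit_transform(docs)  # scipy sparse matrix, one row per document
density = toy_tfidf.nnz / (toy_tfidf.shape[0] * toy_tfidf.shape[1])
print(toy_tfidf.shape, f"density={density:.3f}")
```

The same density calculation applied to the matrix above (2,224,008 stored elements in a 119,747 x 50,001 matrix) gives roughly 0.04% nonzero entries, which is why a sparse format and a sparse-aware decomposition are used.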

### SVD

The point of departure was this [Analytics Vidhya Tutorial](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/).

In [52]:
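# randomized solver; n_iter=200 is far above scikit-learn's default of 5
# and likely contributes to the long fit time reported below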
start_time = time.time()

svd_model = TruncatedSVD(n_components=300, # topics
                         algorithm='randomized', 
                         n_iter=200, 
                         random_state=42)

Excerpt from scikit-learn's `TruncatedSVD` [(source)](https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/decomposition/_truncated_svd.py#L24):
```
    def fit_transform(self, X, y=None):
        """Fit LSI model to X and perform dimensionality reduction on X.
        
        [...]
        
        if self.algorithm == "arpack":
            U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)
            # svds doesn't abide by scipy.linalg.svd/randomized_svd
            # conventions, so reverse its outputs.
            Sigma = Sigma[::-1]
            U, VT = svd_flip(U[:, ::-1], VT[::-1])

        elif self.algorithm == "randomized":
            k = self.n_components
            n_features = X.shape[1]
            if k >= n_features:
                raise ValueError("n_components must be < n_features;"
                                 " got %d >= %d" % (k, n_features))
            U, Sigma, VT = randomized_svd(X, self.n_components,
                                          n_iter=self.n_iter,
                                          random_state=random_state)
        else:
            raise ValueError("unknown algorithm %r" % self.algorithm)

        self.components_ = VT

        # Calculate explained variance & explained variance ratio
        X_transformed = U * Sigma
        
        [...]
        
        self.singular_values_ = Sigma  # Store the singular values.

        return X_transformed

    def transform(self, X):
        """Perform dimensionality reduction on X.
        [...]
        X = check_array(X, accept_sparse=['csr', 'csc'])
        check_is_fitted(self)
        return safe_sparse_dot(X, self.components_.T)
```                                          

In [54]:
svd_model.fit(X_train_transformed)

mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed: {mins:0.0f} min {secs:0.0f} sec')

Elapsed: 18 min 25 sec


In [55]:
svd_model.n_features_in_ 

50001

In [56]:
len(svd_model.components_) # rows of VT: one per output component (topic)

300

In [57]:
len(svd_model.singular_values_) 

300

In [58]:
len(svd_model.components_[0]) # first component

50001

In [62]:
from sklearn.utils.extmath import safe_sparse_dot

# same computation as TruncatedSVD.transform in the scikit-learn source excerpt above
svd_model_transformed = safe_sparse_dot(X_train_transformed, svd_model.components_.T)

In [63]:
svd_model_transformed.shape

(119747, 300)

In [70]:
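# fit_transform returns U * Sigma, which should match svd_model_transformed
# up to numerical error; note that this call refits the model from scratch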
start_time = time.time()
X_topics = svd_model.fit_transform(X_train_transformed)

mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed: {mins:0.0f} min {secs:0.0f} sec')

Elapsed: 17 min 38 sec


In [72]:
X_topics.shape

(119747, 300)
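
The agreement between `fit_transform` and the manual projection `X @ components_.T` can be checked on a small matrix. This is a minimal sketch on random sparse data, not the tweet matrix. It uses the ARPACK solver, so the two results should match to near machine precision; with the randomized solver the match is only approximate. The names `X_toy`, `svd_toy`, `Z_fit`, and `Z_proj` are hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for the TF-IDF matrix (illustrative only).
X_toy = sparse.random(200, 50, density=0.1, format="csr", random_state=0)

svd_toy = TruncatedSVD(n_components=5, algorithm="arpack", random_state=42)
Z_fit = svd_toy.fit_transform(X_toy)     # U * Sigma
Z_proj = X_toy @ svd_toy.components_.T   # X @ VT.T, what transform() computes

max_diff = np.abs(Z_fit - Z_proj).max()
print(f"max abs difference: {max_diff:.2e}")
```

In the notebook itself, calling `svd_model.transform(X_train_transformed)` on the already fitted model would give the projection without paying for a second fit.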

In [87]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_topics, y_array, cv=5, verbose=1, scoring='accuracy', n_jobs=-1)
print('Mean accuracy: ' + str(round(score.mean(),4)))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   15.7s remaining:   23.5s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   17.2s finished


Mean accuracy: 0.7513


In [88]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_array, cv=5, verbose=1, scoring='accuracy', n_jobs=-1)
print('Mean accuracy: ' + str(round(score.mean(),4)))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.4s remaining:    3.6s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.5s finished


Mean accuracy: 0.7999
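
One diagnostic that may help interpret the accuracy gap is the share of variance that the retained components capture, which `TruncatedSVD` exposes as `explained_variance_ratio_`. The sketch below uses random sparse toy data, so its numbers say nothing about the tweet matrix; `cumulative_explained_variance` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD


def cumulative_explained_variance(fitted_svd):
    """Running total of explained_variance_ratio_ (hypothetical helper)."""
    return np.cumsum(fitted_svd.explained_variance_ratio_)


# Random sparse matrix standing in for the TF-IDF matrix (illustrative only).
X_toy = sparse.random(300, 80, density=0.05, format="csr", random_state=1)
svd_toy = TruncatedSVD(n_components=20, algorithm="arpack", random_state=42).fit(X_toy)

cum_var = cumulative_explained_variance(svd_toy)
print(f"variance retained by 20 of 80 dimensions: {cum_var[-1]:.1%}")
```

Applied to the fitted `svd_model`, the last entry of the cumulative sum would show how much of the TF-IDF variance the 300 components keep; a low value would be one plausible explanation for the drop from 0.7999 to 0.7513.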


Naive Bayes is not tried on the SVD output, because it assumes strong independence between features.
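
A further practical obstacle is that scikit-learn's `MultinomialNB`, the variant usually paired with text counts, rejects negative feature values, and SVD projections generally contain them. A minimal sketch on random toy data (the names `X_toy`, `y_toy`, and `Z_toy` are hypothetical and the labels are arbitrary):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB

# Random sparse matrix and arbitrary binary labels (illustrative only).
X_toy = sparse.random(200, 50, density=0.1, format="csr", random_state=0)
y_toy = np.arange(200) % 2

Z_toy = TruncatedSVD(n_components=5, algorithm="arpack", random_state=42).fit_transform(X_toy)
has_negatives = bool((Z_toy < 0).any())

try:
    MultinomialNB().fit(Z_toy, y_toy)
    nb_error = None
except ValueError as exc:
    # MultinomialNB validates that its input is non-negative.
    nb_error = str(exc)

print(has_negatives, nb_error)
```

`GaussianNB` accepts negative values, but it still carries the independence assumption.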

Quoting the same [Analytics Vidhya Tutorial](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/):

"*Apart from LSA, there are other advanced and efficient topic modeling techniques such as Latent Dirichlet Allocation (LDA) and lda2Vec. We have a wonderful article on LDA which you can check out [here](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/). lda2vec is a much more advanced topic modeling which is based on word2vec word embeddings.*"

In [99]:
# ARPACK solver (n_iter applies only to the randomized solver and is ignored here)
start_time = time.time()

svd_model = TruncatedSVD(n_components=300, # topics
                         algorithm='arpack', 
                         n_iter=200, 
                         random_state=42)

In [100]:
svd_model.fit(X_train_transformed)

mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed: {mins:0.0f} min {secs:0.0f} sec')

Elapsed: 1 min 25 sec


In [101]:
svd_model_transformed = safe_sparse_dot(X_train_transformed, svd_model.components_.T)

In [103]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, svd_model_transformed, y_array, cv=5, verbose=1, scoring='accuracy', n_jobs=-1)
print('Mean accuracy: ' + str(round(score.mean(),4)))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   14.0s remaining:   21.1s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   16.7s finished


Mean accuracy: 0.7513


Fitting is much faster with ARPACK (1 min 25 sec versus 18 min 25 sec), with the same mean accuracy (0.7513) as the randomized algorithm.
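
Identical accuracy is what one would expect if both solvers recover essentially the same decomposition. The sketch below compares their singular values on synthetic low-rank data, which is an assumption made for illustration and not the tweet matrix. It leaves the randomized solver at scikit-learn's default `n_iter=5`; the names `svd_rand`, `svd_arp`, and `rel_gap` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic rank-5 signal plus small noise (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.random((200, 5)) @ rng.random((5, 50)) + 0.01 * rng.standard_normal((200, 50))

svd_rand = TruncatedSVD(n_components=5, algorithm="randomized", n_iter=5, random_state=42).fit(X_toy)
# n_iter only affects the randomized solver, so it is omitted for ARPACK.
svd_arp = TruncatedSVD(n_components=5, algorithm="arpack", random_state=42).fit(X_toy)

rel_gap = np.abs(svd_rand.singular_values_ - svd_arp.singular_values_) / svd_arp.singular_values_
print(f"largest relative gap between singular values: {rel_gap.max():.2e}")
```

If a similar check on the real data also shows close agreement at a small `n_iter`, the 200 iterations used for the randomized run above could probably be reduced considerably.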

In [104]:
# ARPACK solver with 500 components (n_iter is ignored by this solver)
start_time = time.time()

svd_model = TruncatedSVD(n_components=500, # topics
                         algorithm='arpack', 
                         n_iter=200, 
                         random_state=42)

In [107]:
# fit the 500-component model; components_ is needed by the next cell
svd_model.fit(X_train_transformed)

mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed: {mins:0.0f} min {secs:0.0f} sec')

Elapsed: 69 min 41 sec


In [108]:
svd_model_transformed = safe_sparse_dot(X_train_transformed, svd_model.components_.T)

In [None]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, svd_model_transformed, y_array, cv=5, verbose=1, scoring='accuracy', n_jobs=-1)
print('Mean accuracy: ' + str(round(score.mean(),4)))