# Latent Semantic Analysis (LSA)

---

*Features*

- Use SVD for dimensionality reduction. 

- Point of departure: [Analytics Vidhya Tutorial](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/). 

- Consulted Prof. Steve Brunton's [YouTube lecture series](https://www.youtube.com/playlist?list=PLMrJAkhIeNNSVjnsviglFoY2nXildDCcv) and [Data-Driven Science and Engineering book](https://www.amazon.com/Data-Driven-Science-Engineering-Learning-Dynamical/dp/1108422098); see my notes from the first few lectures [here](Extra_SteveBrunton_SVD_lecture.pdf).

*Results*

- SVD is just predicting the base rate (0.8674, or 1 - spam proportion): something is wrong...


### Setup

In [1]:
import os
import time
import json

import numpy as np
import pandas as pd

from datetime import datetime

start_time = time.time()
dt_object = datetime.fromtimestamp(time.time())
day, T = str(dt_object).split('.')[0].split(' ')
print('Revised on: ' + day)

Revised on: 2020-12-20


### Load Data

In [18]:
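def load_data(data):
    """Read the first column of ../data/1_raw/<data>.csv into a NumPy array."""
    raw_path = os.path.join("..", "data", "1_raw")
    filename = f"{data}.csv"
    out_dfm = pd.read_csv(os.path.join(raw_path, filename))
    # copy=True returns a writable 1-D array, so labels can be recoded in place
    return out_dfm.iloc[:, 0].to_numpy(copy=True)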

X_train = load_data("X_train")
y_train = load_data("y_train")

# transform y_array into int type
y_train[y_train=='ham'] = 0
y_train[y_train=='spam'] = 1
y_train = y_train.astype('int')

### BoW and Tfidf

Here I clean and preprocess the data into two representations: a bag of n-grams (up to bigrams) capped at 2,000 terms, and a Tfidf weighting of the same counts.

In [3]:
import urlextract
from nltk.stem import WordNetLemmatizer

with open("contractions_map.json") as f:
    contractions_map = json.load(f)

url_extractor = urlextract.URLExtract()
lemmatizer = WordNetLemmatizer()

import custom.clean_preprocess as cp
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

pipe = Pipeline([('counter', cp.DocumentToNgramCounterTransformer(n_grams=3)),
                 ('bow', cp.WordCounterToVectorTransformer(vocabulary_size=2000)),
                 ('tfidf', TfidfTransformer(sublinear_tf=True))                  
                ])

In [4]:
# BoW
X_trans_counter = pipe['counter'].fit_transform(X_train)
X_trans_bow = pipe['bow'].fit_transform(X_trans_counter) 

# Tfidf
X_trans_tfidf = pipe.fit_transform(X_train)

# convert to fptype for SVD
X_trans_bow = X_trans_bow.asfptype()
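
The custom `cp` transformers are not shown in this notebook, so here is a rough sketch of the same two representations using sklearn's `CountVectorizer` as a stand-in. The toy messages, the `sketch_pipe` name, and the `ngram_range=(1, 2)` setting are my assumptions, not the actual preprocessing used above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy messages standing in for X_train (hypothetical data).
docs = [
    "free entry to win a prize, text now",
    "are we still meeting for lunch today",
    "win cash now, free prize inside",
    "lunch today sounds good, see you then",
]

# CountVectorizer is a stand-in for the custom counter + vectorizer steps:
# unigrams and bigrams, capped at the 2,000 most frequent terms.
sketch_pipe = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 2), max_features=2000)),
    ("tfidf", TfidfTransformer(sublinear_tf=True)),
])

bow = sketch_pipe["bow"].fit_transform(docs)   # sparse document-term counts
tfidf = sketch_pipe.fit_transform(docs)        # sparse Tfidf weights, same shape

# TfidfTransformer L2-normalizes each row by default.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(bow.shape, tfidf.shape, row_norms)
```

Both matrices have one row per document and one column per term, which is why the SVD cells below transpose them to get a term-document matrix.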

### SVD

Borrowing from the "arpack" branch of sklearn's **TruncatedSVD** class (the "randomized" algorithm takes longer and arrives at the same result), here are the relevant code bits:

[(source)](https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/decomposition/_truncated_svd.py#L24)
```
    def fit_transform(self, X, y=None):
        [...]
        if self.algorithm == "arpack":
            U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)
            # svds doesn't abide by scipy.linalg.svd/randomized_svd
            # conventions, so reverse its outputs.
            Sigma = Sigma[::-1]
            U, VT = svd_flip(U[:, ::-1], VT[::-1])
```
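
To convince myself the reversal does what the comment says, here is a small sketch on a toy matrix (the random data and `k=3` are assumptions for illustration). It compares the reversed `svds` output against the leading singular values from a full dense SVD.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import svd_flip

rng = np.random.default_rng(0)
A = csr_matrix(rng.random((8, 6)))   # small toy matrix (hypothetical data)
k = 3

U, Sigma, VT = svds(A, k=k)          # ARPACK solver by default
# Reverse so the largest singular value comes first, as sklearn does.
Sigma = Sigma[::-1]
U, VT = svd_flip(U[:, ::-1], VT[::-1])   # also makes the signs deterministic

# Reference: a full dense SVD returns singular values in descending order.
s_full = np.linalg.svd(A.toarray(), compute_uv=False)
print(Sigma)
print(s_full[:k])
```

The sign flip is applied to matching columns of `U` and rows of `VT`, so the product `U @ diag(Sigma) @ VT` is unchanged.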

In [6]:
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import svd_flip
from sklearn.preprocessing import MinMaxScaler

U, Sigma, VT = svds(X_trans_tfidf.T, # transposed to a term-document matrix
                    k=300) # k = number of components / "topics"
# reverse outputs
Sigma = Sigma[::-1]
U, VT = svd_flip(U[:, ::-1], VT[::-1])

# scale
scaler = MinMaxScaler()
X_train_svd_scaled = scaler.fit_transform(VT.T)
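
The rows of `VT.T` only cover the documents that went into the factorization. Since $X = U \Sigma V^T$ and the columns of $U$ are orthonormal, $V = X^T U \Sigma^{-1}$, and the same projection can be applied to unseen documents. The sketch below illustrates this on toy data; the `fold_in` helper and the random matrices are hypothetical and not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((10, 7))            # toy term-document matrix: 10 terms, 7 docs
k = 4

U_full, s_full, VT_full = np.linalg.svd(X, full_matrices=False)
U, Sigma, VT = U_full[:, :k], s_full[:k], VT_full[:k]

def fold_in(X_docs, U, Sigma):
    """Project term-document columns onto the k topics: X^T U Sigma^{-1}."""
    return X_docs.T @ U / Sigma

V_train = fold_in(X, U, Sigma)     # recovers the rows of V for training docs
X_new = rng.random((10, 2))        # two unseen documents (hypothetical)
V_new = fold_in(X_new, U, Sigma)

print(np.allclose(V_train, VT.T), V_new.shape)
```

For held-out documents one would presumably call `scaler.transform` (not `fit_transform`) on the folded-in rows, so the scaling learned on the training set is reused.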

In [7]:
U.shape, Sigma.shape, VT.shape

((2001, 300), (300,), (300, 3900))

In [8]:
V = VT.T
V.shape, y_train.shape

((3900, 300), (3900,))

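- The columns of U are eigenvectors of the term-by-term matrix $XX^T$ (term correlations), where $X$ is the term-document matrix
- The columns of V are eigenvectors of the document-by-document matrix $X^TX$ (document correlations)
- $\Sigma$ contains the singular values of the factorization, i.e. the square roots of the eigenvalues shared by $XX^T$ and $X^TX$ (only the 300 largest are kept here)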
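
A quick numerical check of these three statements on a toy matrix (the random data is an assumption for illustration): each singular vector should satisfy the eigenvector equation with eigenvalue $\sigma_i^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((6, 4))             # toy term-document matrix: 6 terms, 4 docs

U, S, VT = np.linalg.svd(X, full_matrices=False)
V = VT.T

term_gram = X @ X.T                # term-by-term, 6 x 6
doc_gram = X.T @ X                 # document-by-document, 4 x 4

# Each column u_i satisfies (X X^T) u_i = sigma_i^2 u_i, likewise for v_i.
lhs_terms, rhs_terms = term_gram @ U, U * S**2
lhs_docs, rhs_docs = doc_gram @ V, V * S**2
print(np.allclose(lhs_terms, rhs_terms), np.allclose(lhs_docs, rhs_docs))
```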

### SVD Tfidf

In [82]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver="liblinear", random_state=42)

acc = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
tpr = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=10, verbose=0, scoring='recall', n_jobs=-1)

print(f'accuracy: {acc.mean():0.4f} (+/- {np.std(acc):0.4f})')
print(f'sensitivity: {tpr.mean():0.4f} (+/- {np.std(tpr):0.4f})')

accuracy: 0.9662 (+/- 0.0073)
sensitivity: 0.7602 (+/- 0.0560)
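
To tell whether a model is doing more than predicting the base rate, it helps to score a majority-class baseline under the same cross-validation. The sketch below does this with `DummyClassifier` on synthetic labels with roughly 13% positives; the data is made up and only the mechanics carry over.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(3)
n = 1000
y = (rng.random(n) < 0.13).astype(int)   # about 13% positives (hypothetical)
X = rng.random((n, 5))                    # features are ignored by the dummy

baseline = DummyClassifier(strategy="most_frequent")
scores = cross_validate(baseline, X, y, cv=10, scoring=("accuracy", "recall"))

base_rate = 1 - y.mean()
print(f"base rate: {base_rate:0.4f}")
print(f"accuracy:  {scores['test_accuracy'].mean():0.4f}")
print(f"recall:    {scores['test_recall'].mean():0.4f}")
```

A model stuck at the base rate would show accuracy close to the baseline and recall near zero, so reporting recall next to accuracy makes that failure easy to spot.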


In [78]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

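def print_scores(X, y, random_state):
    log_clf = LogisticRegression(solver="liblinear", 
                                 random_state=random_state)
    
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, 
                                          random_state=random_state)
    log_clf.fit(Xtr, ytr)
    ypred = log_clf.predict(Xte)

    # sklearn puts true classes on rows and predicted classes on columns,
    # sorted by label, so for labels (0, 1) the matrix is [[TN, FP], [FN, TP]]
    [[TN, FP], 
     [FN, TP]] = confusion_matrix(yte, ypred)
    cond_pos = TP + FN
    cond_neg = FP + TN
    
    accuracy = (TP + TN) / (cond_pos + cond_neg)
    sensitivity = TP / cond_pos
    specificity = TN / cond_neg
    
    print(f'accuracy: {accuracy:0.4f}')
    print(f'sensitivity: {sensitivity:0.4f}')
    print(f'specificity: {specificity:0.4f}')

The scores recorded below come from a run that unpacked the matrix as `[[TP, FP], [FN, TN]]`. With that unpacking, the value printed as "sensitivity" is actually the negative predictive value and the value printed as "specificity" is actually the precision; accuracy is unaffected. This would explain why the sensitivity here is so much higher than the cross-validated recall above.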

In [79]:
print_scores(X_train_svd_scaled, y_train, random_state=13)

accuracy: 0.9479
sensitivity: 0.9461
specificity: 0.9661


In [80]:
print_scores(X_train_svd_scaled, y_train, random_state=42)

accuracy: 0.9510
sensitivity: 0.9475
specificity: 0.9839


In [81]:
print_scores(X_train_svd_scaled, y_train, random_state=25)

accuracy: 0.9503
sensitivity: 0.9463
specificity: 0.9912
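
Since the layout of `confusion_matrix` is easy to get backwards, here is a small sketch on made-up labels that unpacks it with `ravel()` and cross-checks both rates against `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])   # toy labels, 1 = spam
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes, both sorted (0, 1).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)    # recall on the positive (spam) class
specificity = tn / (tn + fp)    # recall on the negative (ham) class

print(f'sensitivity: {sensitivity:0.4f}')
print(f'specificity: {specificity:0.4f}')
```

Sensitivity should match `recall_score(y_true, y_pred)` and specificity should match `recall_score(y_true, y_pred, pos_label=0)`.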


### Original Tfidf

In [9]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_trans_tfidf, y_train, cv=5, verbose=1, scoring='accuracy', n_jobs=-1)
print(f'accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

accuracy: 0.9777 (+/- 0.0037)



### SVD bow

In [10]:
U, Sigma, VT = svds(X_trans_bow.T, # transposed to a term-document matrix
                    k=300) # k = number of components / "topics"
# reverse outputs
Sigma = Sigma[::-1]
U, VT = svd_flip(U[:, ::-1], VT[::-1])

# scale
scaler = MinMaxScaler()
X_train_svd_scaled = scaler.fit_transform(VT.T)
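
As a sanity check on the manual route, the sketch below compares it with `TruncatedSVD` on toy data (the random matrix and `k=3` are assumptions). Applied to a document-term matrix, `TruncatedSVD.fit_transform` returns $V\Sigma$, so it should match `V * Sigma` up to the sign of each column. Min-max scaling is also unaffected by a positive per-column factor, so scaling `V` or `V * Sigma` should give the same features.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
X_doc_term = csr_matrix(rng.random((12, 8)))   # toy data: 12 docs, 8 terms
k = 3

# Manual route from this notebook: SVD of the term-document matrix.
U, Sigma, VT = svds(X_doc_term.T, k=k)
Sigma, VT = Sigma[::-1], VT[::-1]
V = VT.T                                        # one row per document

# Library route: TruncatedSVD on the document-term matrix returns V * Sigma,
# possibly with some columns negated.
Z = TruncatedSVD(n_components=k, algorithm="arpack").fit_transform(X_doc_term)
same_up_to_sign = np.allclose(np.abs(Z), np.abs(V * Sigma))

# Min-max scaling removes any positive per-column factor such as Sigma.
scaled_V = MinMaxScaler().fit_transform(V)
scaled_VS = MinMaxScaler().fit_transform(V * Sigma)
print(same_up_to_sign, np.allclose(scaled_V, scaled_VS))
```

Note that a negated column does change the min-max output (each value x becomes 1 - x), so the sign convention still matters when comparing scaled features across runs.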

In [11]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=5, verbose=1, scoring='accuracy', n_jobs=-1)
print(f'accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

accuracy: 0.9623 (+/- 0.0061)



### Original bow

In [12]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_trans_bow, y_train, cv=5, verbose=1, scoring='accuracy', n_jobs=-1)
print(f'accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

accuracy: 0.9849 (+/- 0.0042)



---

**Final Notes in the [Analytics Vidhya Tutorial](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/)**

Avoid Naive Bayes on SVD features, since Naive Bayes assumes strong independence between variables.

"*Apart from LSA, there are other advanced and efficient topic modeling techniques such as Latent Dirichlet Allocation (LDA) and lda2Vec. We have a wonderful article on LDA which you can check out [here](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/). lda2vec is a much more advanced topic modeling which is based on word2vec word embeddings.*"

In [14]:
mins, secs = divmod(time.time() - start_time, 60)
print(f'Time elapsed: {mins:0.0f} m {secs:0.0f} s')

Time elapsed: 0 m 28 s
