# Singular Value Decomposition

---

*Features*

- Use SVD for dimensionality reduction. 

- Point of departure: [Analytics Vidhya Tutorial](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/). 

- Consulted Prof. Steve Brunton's [YouTube lecture series](https://www.youtube.com/playlist?list=PLMrJAkhIeNNSVjnsviglFoY2nXildDCcv) and [Data-Driven Science and Engineering book](https://www.amazon.com/Data-Driven-Science-Engineering-Learning-Dynamical/dp/1108422098) - see notes from first few lectures [here](Extra_SteveBrunton_SVD_lecture.pdf).

*Results*

TODO



### Setup

In [1]:
import os
import time
import json

import numpy as np
import pandas as pd

from datetime import datetime

start_time = time.time()
dt_object = datetime.fromtimestamp(time.time())
day, T = str(dt_object).split('.')[0].split(' ')
print('Revised on: ' + day)

Revised on: 2020-12-21


### Load Data

In [2]:
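def load_data(data):
    """Read ../data/1_raw/<data>.csv and return its first column as an array."""
    raw_path = os.path.join("..","data","1_raw")
    filename = ''.join([data, ".csv"])
    out_dfm = pd.read_csv(os.path.join(raw_path, filename))
    # copy the first column into a plain, writable NumPy array
    out_arr = np.array(out_dfm.iloc[:,0])
    return out_arr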

X_train = load_data("X_train")
y_train = load_data("y_train")

# transform y_array into int type
y_train[y_train=='ham'] = 0
y_train[y_train=='spam'] = 1
y_train = y_train.astype('int')

### BoW and Tfidf

Here I clean and preprocess the data into two representations: a bag of n-grams (unigrams up to trigrams) limited to 2,000 terms, and a Tfidf weighting of the same counts.
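
The custom transformers used in this notebook are not shown here, so as a rough illustration of the two representations, the sketch below uses sklearn's `CountVectorizer` as a stand-in. The toy corpus and the name `sketch_pipe` are assumptions for illustration only, and the stand-in will not tokenize or clean text the same way as the custom code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the training messages (hypothetical examples).
docs = [
    "free entry in a weekly prize draw",
    "are we still meeting for lunch today",
    "claim your free prize now",
    "lunch today sounds good",
]

# Stand-in for the custom steps: unigrams up to trigrams, vocabulary capped at 2,000.
sketch_pipe = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 3), max_features=2000)),
    ("tfidf", TfidfTransformer(sublinear_tf=True)),
])

X_bow = sketch_pipe["bow"].fit_transform(docs)  # raw n-gram counts
X_tfidf = sketch_pipe.fit_transform(docs)       # same columns, Tfidf-weighted

print(X_bow.shape, X_tfidf.shape)
```

Both matrices share the same columns; only the cell values differ, which is what makes the later BoW versus Tfidf comparison like-for-like.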

In [3]:
import urlextract
from nltk.stem import WordNetLemmatizer

with open("contractions_map.json") as f:
    contractions_map = json.load(f)

url_extractor = urlextract.URLExtract()
lemmatizer = WordNetLemmatizer()

import custom.clean_preprocess as cp
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

pipe = Pipeline([('counter', cp.DocumentToNgramCounterTransformer(n_grams=3)),
                 ('bow', cp.WordCounterToVectorTransformer(vocabulary_size=2000)),
                 ('tfidf', TfidfTransformer(sublinear_tf=True))                  
                ])

In [4]:
# BoW
X_trans_counter = pipe['counter'].fit_transform(X_train)
X_trans_bot = pipe['bow'].fit_transform(X_trans_counter) 
X_trans_bot = X_trans_bot.astype(np.float64) # svds needs a floating-point matrix

# Tfidf
X_trans_tfidf = pipe.fit_transform(X_train)

### SVD

Borrowing from sklearn's **TruncatedSVD** class with the "arpack" algorithm (the "randomized" algorithm takes longer and arrives at the same result), here are the relevant code bits:

[(source)](https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/decomposition/_truncated_svd.py#L24)
```
def fit_transform(self, X, y=None):
    [...]
    if self.algorithm == "arpack":
        U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)
        # svds doesn't abide by scipy.linalg.svd/randomized_svd
        # conventions, so reverse its outputs.
        Sigma = Sigma[::-1]
        U, VT = svd_flip(U[:, ::-1], VT[::-1])
```
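
The reversal in the snippet above can be checked on a small matrix. This is a minimal sketch, assuming the ARPACK solver returns singular values in ascending order (which is what the sklearn code relies on); the random matrix `A` is a made-up stand-in, not the notebook's data.

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import svd_flip

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))  # small dense stand-in for the data matrix

# Raw ARPACK output for the 3 largest singular triplets.
U_raw, Sigma_raw, VT_raw = svds(A, k=3)

# Reverse to the descending order used by a dense SVD, then fix the signs.
Sigma = Sigma_raw[::-1]
U, VT = svd_flip(U_raw[:, ::-1], VT_raw[::-1])

# Reference: leading singular values from a full dense SVD.
Sigma_ref = np.linalg.svd(A, compute_uv=False)[:3]

print("raw order:     ", np.round(Sigma_raw, 4))
print("reversed order:", np.round(Sigma, 4))
print("matches dense SVD:", np.allclose(Sigma, Sigma_ref))
```

`svd_flip` only changes the signs of matching columns of `U` and rows of `VT`, so the product `U @ diag(Sigma) @ VT` is unaffected.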


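With $X$ as the term-document matrix (terms in rows, documents in columns), $X = U \Sigma V^T$ and:

- The columns of $U$ are the eigenvectors of the term-term matrix $XX^T$
- The columns of $V$ are the eigenvectors of the document-document matrix $X^TX$
- $\Sigma$ contains the singular values of the factorization, which are the square roots of the nonzero eigenvalues shared by $XX^T$ and $X^TX$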
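
The three statements above can be verified numerically. The sketch below uses a small random matrix as a hypothetical term-document matrix and NumPy's dense SVD rather than `svds`, purely to keep the check short.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 7))  # rows = "terms", columns = "documents"

U, s, VT = np.linalg.svd(X, full_matrices=False)
V = VT.T

# Columns of U are eigenvectors of X X^T, with eigenvalues s**2.
u_is_eigvec = np.allclose(X @ X.T @ U, U * s**2)

# Columns of V are eigenvectors of X^T X, with the same eigenvalues.
v_is_eigvec = np.allclose(X.T @ X @ V, V * s**2)

print(u_is_eigvec, v_is_eigvec)
```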

### SVD on Bag-of-Trigrams

In [5]:
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import svd_flip
from sklearn.preprocessing import MinMaxScaler

U, Sigma, VT = svds(X_trans_bot.T, # transposed to a term-document matrix
                    k=300) # k = number of components / "topics"
# reverse outputs
Sigma = Sigma[::-1]
U, VT = svd_flip(U[:, ::-1], VT[::-1])

# scale V (transpose of VT)
scaler = MinMaxScaler()
X_train_svd_scaled = scaler.fit_transform(VT.T)

In [6]:
U.shape, Sigma.shape, VT.shape

((2001, 300), (300,), (300, 3900))

In [7]:
VT.T.shape, y_train.shape

((3900, 300), (3900,))
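
The rows of `VT.T` are the unit-norm document coordinates. As a sketch of how they relate to a plain projection of the documents onto the term directions in `U`, the identity $X^T U = V \Sigma$ can be checked on a small random matrix. `X_docs` below is a hypothetical document-term matrix, not the notebook's data.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical document-term matrix: 40 documents x 25 terms.
X_docs = sparse_random(40, 25, density=0.3, format="csr", random_state=0)

# Factorize the term-document matrix, as in the cell above.
U, Sigma, VT = svds(X_docs.T, k=5)
Sigma = Sigma[::-1]
U = U[:, ::-1]
VT = VT[::-1]

V = VT.T            # (40, 5) document coordinates, unit-norm columns
proj = X_docs @ U   # (40, 5) documents projected onto the term directions

# Projection equals V scaled by the singular values ...
proj_matches = np.allclose(proj, V * Sigma, atol=1e-8)
# ... so dividing by Sigma recovers V itself.
V_from_proj = proj / Sigma
print(proj_matches, np.allclose(V_from_proj, V, atol=1e-8))
```

The same expression, `X_new @ U / Sigma`, is one way to place documents that were not part of the factorization into the same space, assuming they are vectorized with the same vocabulary.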

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, accuracy_score, recall_score

log_clf = LogisticRegression(solver="liblinear", random_state=42)

my_scorer = {
    'accuracy': make_scorer(accuracy_score),
    'sensitivity': make_scorer(recall_score, pos_label=1),
    'specificity': make_scorer(recall_score, pos_label=0)
}

### 10-fold CV

In [9]:
acc = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=10, verbose=0, scoring=my_scorer['accuracy'], n_jobs=-1)
tpr = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=10, verbose=0, scoring=my_scorer['sensitivity'], n_jobs=-1)
tnr = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=10, verbose=0, scoring=my_scorer['specificity'], n_jobs=-1)

print(f'accuracy: {acc.mean():0.4f} (+/- {np.std(acc):0.4f})')
print(f'sensitivity: {tpr.mean():0.4f} (+/- {np.std(tpr):0.4f})')
print(f'specificity: {tnr.mean():0.4f} (+/- {np.std(tnr):0.4f})')

accuracy: 0.9662 (+/- 0.0073)
sensitivity: 0.7602 (+/- 0.0560)
specificity: 0.9976 (+/- 0.0029)
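
As an alternative to three separate `cross_val_score` calls, sklearn's `cross_validate` accepts a dictionary of scorers and fits each fold only once. The sketch below runs on synthetic data from `make_classification` (an assumption, standing in for the SVD features), so its numbers are not comparable to the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score, recall_score
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in for the spam data (about 15% positives).
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     weights=[0.85, 0.15], random_state=0)

scoring = {
    'accuracy': make_scorer(accuracy_score),
    'sensitivity': make_scorer(recall_score, pos_label=1),  # recall on class 1
    'specificity': make_scorer(recall_score, pos_label=0),  # recall on class 0
}

clf = LogisticRegression(solver="liblinear", random_state=42)
cv_results = cross_validate(clf, X_demo, y_demo, cv=10, scoring=scoring)

for name in scoring:
    scores = cv_results[f'test_{name}']
    print(f'{name}: {scores.mean():0.4f} (+/- {np.std(scores):0.4f})')
```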


### Hand-rolled CV for SVD on Bag-of-Trigrams

In [10]:
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def handrolled_cv(clf, X, y, seed_, cv=10, test_size=.25):
    """Repeat a random train/test split `cv` times and print mean (+/- std) scores."""
    def get_scores(clf, X, y, random_state, test_size):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, 
                                                            random_state=random_state)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        acc = (tp + tn) / (tp + fn + fp + tn)
        tpr = tp / (tp + fn)
        tnr = tn / (fp + tn)
        return acc, tpr, tnr

    random.seed(seed_)
    random_states = [random.randint(1, 9999) for i in range(0, cv)]

    accs, tprs, tnrs = [], [], []
    for state in random_states:
        acc, tpr, tnr = get_scores(clf, X, y, 
                                   random_state=state, test_size=test_size)
        accs.append(acc)
        tprs.append(tpr)
        tnrs.append(tnr)

    print(f'accuracy: {np.mean(accs):0.4f} (+/- {np.std(accs):0.4f})')
    print(f'sensitivity: {np.mean(tprs):0.4f} (+/- {np.std(tprs):0.4f})')
    print(f'specificity: {np.mean(tnrs):0.4f} (+/- {np.std(tnrs):0.4f})')

In [11]:
handrolled_cv(log_clf, X_train_svd_scaled, y_train, seed_=2340981)

accuracy: 0.9599 (+/- 0.0075)
sensitivity: 0.7111 (+/- 0.0372)
specificity: 0.9978 (+/- 0.0016)
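
The hand-rolled routine repeats a random 75/25 split rather than partitioning the data into folds. A roughly equivalent setup with sklearn's own splitter might look like the sketch below. It uses synthetic data and a different random seed (both assumptions), so it will not reproduce the numbers above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic, imbalanced stand-in for the spam data.
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     weights=[0.85, 0.15], random_state=0)

# Ten independent random splits, each holding out 25% of the rows.
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

scoring = {
    'accuracy': 'accuracy',
    'sensitivity': make_scorer(recall_score, pos_label=1),
    'specificity': make_scorer(recall_score, pos_label=0),
}

clf = LogisticRegression(solver="liblinear", random_state=42)
results = cross_validate(clf, X_demo, y_demo, cv=splitter, scoring=scoring)

for name in scoring:
    scores = results[f'test_{name}']
    print(f'{name}: {scores.mean():0.4f} (+/- {np.std(scores):0.4f})')
```

Swapping in `StratifiedShuffleSplit` would keep the spam ratio the same in every split, which may reduce the spread in sensitivity on imbalanced data.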


### Original Bag-of-Trigrams

In [12]:
acc = cross_val_score(log_clf, X_trans_bot, y_train, cv=10, verbose=0, scoring=my_scorer['accuracy'], n_jobs=-1)
tpr = cross_val_score(log_clf, X_trans_bot, y_train, cv=10, verbose=0, scoring=my_scorer['sensitivity'], n_jobs=-1)
tnr = cross_val_score(log_clf, X_trans_bot, y_train, cv=10, verbose=0, scoring=my_scorer['specificity'], n_jobs=-1)

print(f'accuracy: {acc.mean():0.4f} (+/- {np.std(acc):0.4f})')
print(f'sensitivity: {tpr.mean():0.4f} (+/- {np.std(tpr):0.4f})')
print(f'specificity: {tnr.mean():0.4f} (+/- {np.std(tnr):0.4f})')

accuracy: 0.9859 (+/- 0.0053)
sensitivity: 0.9069 (+/- 0.0412)
specificity: 0.9979 (+/- 0.0019)


### Hand-rolled CV for Original Bag-of-Trigrams

In [13]:
handrolled_cv(log_clf, X_trans_bot, y_train, seed_=2340981)

accuracy: 0.9845 (+/- 0.0045)
sensitivity: 0.9087 (+/- 0.0212)
specificity: 0.9960 (+/- 0.0032)


### SVD on Tfidf

In [14]:
U, Sigma, VT = svds(X_trans_tfidf.T, # transposed to a term-document matrix
                    k=300) # k = number of components / "topics"
# reverse outputs
Sigma = Sigma[::-1]
U, VT = svd_flip(U[:, ::-1], VT[::-1])

# scale
scaler = MinMaxScaler()
X_train_svd_scaled = scaler.fit_transform(VT.T)

In [15]:
acc = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=10, verbose=0, scoring=my_scorer['accuracy'], n_jobs=-1)
tpr = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=10, verbose=0, scoring=my_scorer['sensitivity'], n_jobs=-1)
tnr = cross_val_score(log_clf, X_train_svd_scaled, y_train, cv=10, verbose=0, scoring=my_scorer['specificity'], n_jobs=-1)

print(f'accuracy: {acc.mean():0.4f} (+/- {np.std(acc):0.4f})')
print(f'sensitivity: {tpr.mean():0.4f} (+/- {np.std(tpr):0.4f})')
print(f'specificity: {tnr.mean():0.4f} (+/- {np.std(tnr):0.4f})')

accuracy: 0.9836 (+/- 0.0048)
sensitivity: 0.8857 (+/- 0.0358)
specificity: 0.9985 (+/- 0.0020)


### Hand-rolled CV for SVD on Tfidf

In [16]:
handrolled_cv(log_clf, X_train_svd_scaled, y_train, seed_=2340981)

accuracy: 0.9848 (+/- 0.0031)
sensitivity: 0.8881 (+/- 0.0210)
specificity: 0.9994 (+/- 0.0009)


### Original Tfidf

In [17]:
acc = cross_val_score(log_clf, X_trans_tfidf, y_train, cv=10, verbose=0, scoring=my_scorer['accuracy'], n_jobs=-1)
tpr = cross_val_score(log_clf, X_trans_tfidf, y_train, cv=10, verbose=0, scoring=my_scorer['sensitivity'], n_jobs=-1)
tnr = cross_val_score(log_clf, X_trans_tfidf, y_train, cv=10, verbose=0, scoring=my_scorer['specificity'], n_jobs=-1)

print(f'accuracy: {acc.mean():0.4f} (+/- {np.std(acc):0.4f})')
print(f'sensitivity: {tpr.mean():0.4f} (+/- {np.std(tpr):0.4f})')
print(f'specificity: {tnr.mean():0.4f} (+/- {np.std(tnr):0.4f})')

accuracy: 0.9779 (+/- 0.0088)
sensitivity: 0.8450 (+/- 0.0685)
specificity: 0.9982 (+/- 0.0024)


### Hand-rolled CV for Original Tfidf

In [18]:
handrolled_cv(log_clf, X_trans_tfidf, y_train, seed_=2340981)

accuracy: 0.9782 (+/- 0.0042)
sensitivity: 0.8441 (+/- 0.0280)
specificity: 0.9985 (+/- 0.0015)


---

**Final notes from the [Analytics Vidhya Tutorial](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/)**

Avoid Naive Bayes on the SVD features, since Naive Bayes assumes strong independence between variables.

"*Apart from LSA, there are other advanced and efficient topic modeling techniques such as Latent Dirichlet Allocation (LDA) and lda2Vec. We have a wonderful article on LDA which you can check out [here](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/). lda2vec is a much more advanced topic modeling which is based on word2vec word embeddings.*"

In [19]:
mins, secs = divmod(time.time() - start_time, 60)
print(f'Time elapsed: {mins:0.0f} m {secs:0.0f} s')

Time elapsed: 0 m 17 s
