## Steering with feedback 

All the previous approaches using LSA ignored information about how similar the documents are to one another, so the topics were built from a generic set of rules.
<br>
Because these feature (topic) extraction models were trained without supervision, they had no data about how 'close' the topic vectors should be to each other.
<br>
Nor did we use any 'feedback' about where the topic vectors ended up, or how they were related to each other, during computation.
<br>
<br>
Steering, also known as learned distance metrics, is one of the more recent advances in dimension reduction and feature extraction. By adjusting the distance scores reported to clustering and embedding algorithms, we can **steer** our vectors so that they minimize some cost function. In this way we can force the vectors to focus on some aspect of the information content that we are interested in.

In the previous LSA sections (e.g. the notebooks in the repo), we ignored all of the meta information about the documents. With the SMS messages, for instance, we ignored the sender of each message.
<br>
The sender is a good indication of topic similarity and could be used to inform the topic vector transformation (LSA).

One way to carry out this steering computation is to calculate the mean difference between our two centroids, as is done in linear discriminant analysis (LDA), and add some portion of this 'bias' to all word/topic vectors. The idea is to take out the average topic vector difference between two groups of documents, for example between CVs and job descriptions (each summarised by its keywords/topics).
<br>
<br>
Steering with feedback could, for example, take into account that a topic such as 'beer on tap at lunch' might appear in job descriptions but never in a CV. Similarly, quirky hobbies or leisure activities such as underwater sculpture might appear in some CVs but never in a job description.
<br>
Steering topic vectors can help us focus them on the topics we may be interested in modelling.
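
The centroid-difference idea above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the two groups of 3-D topic vectors are randomly generated stand-ins for CVs and job descriptions, and `steer` and its `portion` parameter are hypothetical names, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-D topic vectors for two groups of documents (made-up data)
cv_vectors = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.1, size=(5, 3))
job_vectors = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.1, size=(5, 3))

def steer(vectors, own_centroid, other_centroid, portion=0.5):
    """Shift vectors toward the other group's centroid by `portion` of the centroid difference."""
    return vectors + portion * (other_centroid - own_centroid)

cv_centroid = cv_vectors.mean(axis=0)
job_centroid = job_vectors.mean(axis=0)

# Move each group halfway toward the other, so both centroids land on the midpoint
steered_cv = steer(cv_vectors, cv_centroid, job_centroid, portion=0.5)
steered_job = steer(job_vectors, job_centroid, cv_centroid, portion=0.5)

gap_before = np.linalg.norm(cv_centroid - job_centroid)
gap_after = np.linalg.norm(steered_cv.mean(axis=0) - steered_job.mean(axis=0))
print(gap_before, gap_after)
```

With `portion=0.5` applied to both groups, the two centroids meet at their midpoint, so the average difference between the groups is removed while the differences within each group are left untouched.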



### Linear discriminant analysis (LDA)

Here we train an LDA model on the labelled SMS dataset.
<br>
LDA works similarly to latent semantic analysis (LSA), except that it requires classification labels or other scores in order to find the best linear combination of the dimensions in high-dimensional space (the terms in a BOW/TF-IDF vector transformation).
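
To make "the best linear combination of the dimensions" concrete, here is a minimal sketch on made-up data rather than the SMS dataset. It assumes two well-separated clusters of 2-D points as stand-ins for labelled document vectors; the variable names are illustrative only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
# Toy stand-in for labelled document vectors: two clusters that differ along the first dimension
class0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))
class1 = rng.normal(loc=[6.0, 0.0], scale=0.5, size=(20, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 20 + [1] * 20)

toy_lda = LinearDiscriminantAnalysis(n_components=1)
toy_lda.fit(X, y)

# With two classes, LDA projects every vector onto a single discriminant axis
projected = toy_lda.transform(X)
print(projected.shape)      # (40, 1)
print(toy_lda.coef_)        # weights of the learned linear combination
print(toy_lda.score(X, y))  # accuracy on the toy training data
```

The labels are what distinguish this from LSA: the projection axis is chosen to separate the two classes, not to capture the most variance.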

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
import pandas as pd 
from nlpia.data.loaders import get_data
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
pd.options.display.width = 120
sms = get_data('sms-spam')
# label each message sms0, sms1, ... and append '!' to the index of spam messages
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index


In [10]:
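tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
# .toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix that .todense() returns
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
tfidf_docs = tfidf_docs - tfidf_docs.mean(axis=0) # center the data: subtract the mean of each term (each column)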

In [11]:
lda = LDA(n_components=1)
lda = lda.fit(tfidf_docs, sms.spam)
sms['lda_spaminess'] = lda.predict(tfidf_docs)

In [12]:
# rough evaluation metrics done manually
print(((sms.spam - sms.lda_spaminess) ** 2.).sum() ** .5)
print((sms.spam == sms.lda_spaminess).sum())
print(len(sms))

0.0
4837
4837


It would be naive to conclude from this that the model is perfect. It got every observation correct, but only on the data it was trained on, so the result tells us nothing about overfitting.
<br>
With around 10k terms in our TF-IDF vectors, it doesn't come as a surprise that the model could just 'memorize' the answer.
<br>
A train-test split evaluation is warranted, so let’s reserve a third of our dataset for testing:

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf_docs, sms.spam, test_size=0.33, random_state=271828)
lda = LDA(n_components=1)
lda.fit(X_train, y_train)
round(lda.score(X_test, y_test), 3)  # built-in round() also works if score() returns a plain Python float

0.764

The test set accuracy is poor, far below the perfect score on the training data. This looks less like an unlucky data sample and more like a poor, overfitting model.
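
A single split cannot by itself rule out an unlucky sample. One common check is k-fold cross-validation, sketched below on synthetic data from `make_classification` rather than on the SMS vectors; with the real data, `X` and `y` would be `tfidf_docs` and `sms.spam` (an assumption here, not something that was run).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (tfidf_docs, sms.spam)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Fit and score the model on 5 different train/test partitions
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
print(scores.round(3))
print(scores.mean().round(3), scores.std().round(3))
```

If the scores are consistently low across the folds, the problem lies with the model rather than with one particular split.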
<br>
Let’s see if LSA combined with LDA will help us create an accurate model that also generalizes well, so that new SMS messages don’t trip it up:

In [27]:
from sklearn.decomposition import PCA
# 16-D topic vectors
pca = PCA(n_components=16)
# fit_transform fits the PCA and projects the documents in one step (here on the full dataset, before the split)
pca_topic_vectors16 = pca.fit_transform(tfidf_docs)
columns = [f'topic{i}' for i in range(pca.n_components_)]
pca_topic_vectors_df = pd.DataFrame(pca_topic_vectors16, columns=columns, index=sms.index)
pca_topic_vectors_df.round(3).head(6)

| | topic0 | topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 | topic11 | topic12 | topic13 | topic14 | topic15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sms0 | 0.201 | 0.003 | 0.037 | 0.011 | -0.019 | -0.053 | 0.039 | -0.065 | 0.011 | -0.082 | 0.008 | 0.0 | 0.003 | -0.031 | -0.004 | 0.027 |
| sms1 | 0.404 | -0.094 | -0.078 | 0.051 | 0.1 | 0.047 | 0.023 | 0.066 | 0.022 | -0.022 | -0.005 | -0.036 | 0.04 | -0.016 | 0.051 | -0.05 |
| sms2! | -0.03 | -0.048 | 0.09 | -0.067 | 0.091 | -0.043 | -0.0 | 0.0 | -0.058 | 0.054 | 0.127 | -0.031 | 0.017 | -0.004 | -0.035 | 0.048 |
| sms3 | 0.329 | -0.033 | -0.035 | -0.016 | 0.052 | 0.056 | -0.165 | -0.073 | 0.062 | -0.106 | 0.022 | -0.021 | 0.077 | -0.043 | 0.012 | -0.063 |
| sms4 | 0.002 | 0.031 | 0.038 | 0.034 | -0.075 | -0.093 | -0.044 | 0.061 | -0.044 | 0.03 | 0.027 | 0.014 | 0.026 | 0.031 | -0.075 | -0.019 |
| sms5! | -0.016 | 0.059 | 0.014 | -0.006 | 0.122 | -0.04 | 0.005 | 0.165 | -0.022 | 0.063 | 0.041 | -0.052 | -0.041 | 0.07 | -0.004 | 0.023 |


In [32]:
X_train, X_test, y_train, y_test = train_test_split(pca_topic_vectors_df.to_numpy(), sms.spam, test_size=0.3, random_state=271828)
lda = LDA(n_components=1)
lda.fit(X_train, y_train)
round(lda.score(X_test, y_test), 3)  # built-in round() also works if score() returns a plain Python float

0.965
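
Note that the PCA above was fitted on all documents before the split. A stricter evaluation fits the dimension reduction on the training portion only, which a scikit-learn pipeline handles automatically. The sketch below assumes synthetic data from `make_classification` in place of the TF-IDF vectors, so its score says nothing about the SMS dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for (tfidf_docs, sms.spam)
X, y = make_classification(n_samples=300, n_features=100, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# PCA is fitted on X_train only; X_test is merely projected before LDA scores it
model = make_pipeline(PCA(n_components=16), LinearDiscriminantAnalysis())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(round(test_accuracy, 3))
```

Keeping both steps in one estimator also means the same object can be passed to `cross_val_score`, which then refits the PCA inside every fold.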

***Summary***

So with LSA, we can characterise properties of an SMS message with only 16 dimensions and still have plenty of information to classify them as spam (or not).
<br>
Also, our low-dimensional model is much less likely to overfit. It should generalise well and be able to classify as-yet-unseen SMS messages or chats.


The simple LDA model scored higher on the data it was trained on, before we tried any of the semantic analysis approaches. But the advantage of this new model (LSA/PCA followed by LDA) is that we can create vectors that represent the semantics of a statement in more than just a single dimension.