# LSA with count vectorization

In this exercise, you will work with a text file (provided as part of the course) that lists book titles, one per line, with titles only and no authors or other metadata.

Latent Semantic Analysis (LSA) represents the text as a *documents*-by-*tokens* matrix and applies Singular Value Decomposition (SVD) to project that matrix onto a small number of latent dimensions, i.e. it reduces dimensionality by combining related words rather than by picking out individual ones.
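To make the decomposition concrete, here is a minimal sketch using NumPy on a hypothetical four-document toy matrix (the tokens and values are invented for illustration and are not taken from the course data). It factorizes the matrix with `np.linalg.svd` and keeps only the two largest singular values.

```python
import numpy as np

# Hypothetical toy data: rows = documents, columns = tokens
tokens = ["biology", "cell", "history", "war"]
X_toy = np.array([
    [1, 1, 0, 0],   # a title mentioning "cell" and "biology"
    [1, 1, 0, 0],
    [0, 0, 1, 1],   # a title mentioning "history" and "war"
    [0, 0, 1, 1],
], dtype=float)

# Thin SVD: X_toy = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(X_toy, full_matrices=False)

# Keep only the k largest singular values (the "truncated" part)
k = 2
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# k-dimensional coordinates for each document
docs_2d = U[:, :k] * s[:k]
print(docs_2d.shape)
```

This toy matrix contains only two distinct row patterns, so two components reproduce it exactly. On real data the truncated reconstruction is only an approximation.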

You can then plot the reduced data as a scatterplot to look for patterns.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

%matplotlib inline

In [2]:
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')

## Tokenize text

While tokenizing the text, you will lemmatize the words and remove any stopwords. Earlier exploratory analysis of the data showed that an additional list of domain-specific words needs to be added to the stopwords.

In [3]:
# Initialize lemmatizer object

wordnet_lemmatizer = WordNetLemmatizer()

In [4]:
# Extract titles from text file into list of strings
# (the with-block closes the file once the titles are read)

with open('data/all_book_titles.txt') as f:
    titles = [line.rstrip() for line in f]

In [6]:
titles[:5]

['Philosophy of Sex and Love A Reader',
 'Readings in Judaism, Christianity, and Islam',
 'Microprocessors Principles and Applications',
 'Bernhard Edouard Fernow: Story of North American Forestry',
 'Encyclopedia of Buddhism']

In [15]:
len(titles)

2373

In [7]:
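# Build the set of English stopwords
# Note: this rebinds the name `stopwords` (the imported NLTK corpus module) to a set,
# so re-running this cell without re-importing raises an AttributeError
stopwords = set(stopwords.words('english'))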

In [8]:
# Add more stopwords - great example of domain-specific stopwords
stopwords = stopwords.union({'introduction', 'edition', 'series', 'application', 'approach', 'card', 'access', 'package', 'plus', 
                     'etext', 'brief', 'vol', 'fundamental', 'guide', 'essential', 'printed', 'third', 'second', 'fourth', 
                     'volume'})

In [9]:
# Function to remove stopwords and lemmatize text when tokenizing

def my_tokenizer(s):
    # Lowercase all text 
    s = s.lower() 
    
    # Split string into words (tokens) 
    tokens = nltk.tokenize.word_tokenize(s) 
    
    # Remove short words (2 characters or fewer), as they are probably not useful 
    tokens = [t for t in tokens if len(t) > 2] 
    
    # Put tokens into root form (lemmatize)
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] 
    
    # Remove stopwords 
    tokens = [t for t in tokens if t not in stopwords] 
    
    # Remove any tokens containing digits, e.g. "3rd" 
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)] 
    
    return tokens

In [10]:
# Pass tokenizing function as custom tokenizer to count vectorizer
# binary=True records only the presence or absence of a word, not its count
# token_pattern=None because the pattern is not used when a custom tokenizer is given

vectorizer = CountVectorizer(binary=True, tokenizer=my_tokenizer, token_pattern=None)
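The effect of `binary=True` can be seen in a minimal sketch on two hypothetical toy titles (invented for illustration). It uses the default tokenizer rather than `my_tokenizer`, so it runs without any NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles; "data" appears twice in the first one
docs = ["data data science", "science of cooking"]

# Default behaviour: each cell holds the number of occurrences
counts = CountVectorizer().fit_transform(docs).toarray()

# binary=True: each cell is 1 if the word occurs at all, else 0
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts)
print(binary)
```

For short texts such as book titles, repeated words are rare, so the two encodings are likely to differ very little. The binary version simply makes the "present or absent" interpretation explicit.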

In [11]:
# Count-vectorize text

X = vectorizer.fit_transform(titles)



In [14]:
# Documents-by-words sparse matrix (2,131 words)

X

<2373x2131 sparse matrix of type '<class 'numpy.int64'>'
	with 10112 stored elements in Compressed Sparse Row format>

## Create index-to-word mapping to plot model results

Conceptually, what we want to do is:

    index_word_map = [None] * len(vectorizer.vocabulary_)

    for word, index in vectorizer.vocabulary_.items():
        index_word_map[index] = word


However, the count vectorizer already provides this mapping through its **`get_feature_names_out()`** method, so there is no need to build it by hand.

In [12]:
index_word_map = vectorizer.get_feature_names_out()

In [13]:
index_word_map

array(["'the", '...', 'a-z', ..., 'zen', 'zionism', 'zurich'],
      dtype=object)

## Singular Value Decomposition

Use scikit-learn's `TruncatedSVD` class to perform the SVD step of LSA. You need to transpose the vectorized matrix to *terms*-by-*documents*, so that each word is treated as a sample (row) and the output columns hold that word's coordinates in the reduced space. This enables plotting later on.

In [16]:
# Transpose count matrix to make rows=words and cols=documents

X = X.T

In [17]:
# Truncated SVD keeps only the top components and discards the rest
# Default number of components (n_components) is 2

svd = TruncatedSVD()

# Transform count matrix to two-dimensional representation per word

Z = svd.fit_transform(X)
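The following minimal sketch shows why the transpose matters. It uses a hypothetical toy corpus (invented for illustration) and fixes `random_state` only to make the run repeatable. Fitting on the transposed matrix yields one row of coordinates per word, while fitting on the original matrix yields one row per document.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles
toy_titles = ["modern biology", "cell biology", "world history", "history of art"]

# Documents-by-terms sparse matrix
X_toy = CountVectorizer(binary=True).fit_transform(toy_titles)

# Terms-by-documents input: one 2-D point per word
Z_terms = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_toy.T)

# Documents-by-terms input: one 2-D point per document
Z_docs = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_toy)

print(X_toy.shape, Z_terms.shape, Z_docs.shape)
```

Since the plot in this exercise labels words, the terms-by-documents orientation is the one needed here.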

In [20]:
# 2 columns in output - note that negative values are allowed

Z

array([[ 6.07572856e-04,  1.78078426e-03],
       [ 1.96674509e-02, -4.56575716e-03],
       [ 1.26927018e-03, -2.44890743e-05],
       ...,
       [ 3.28077342e-03,  6.09458905e-03],
       [ 8.54604655e-03,  6.59404212e-03],
       [ 1.14336371e-03,  3.18587058e-03]])

In [21]:
# 2,131 words (as expected)

len(Z)

2131

## Plot relevant data

In [18]:
# Use plotly to draw an interactive scatterplot of the words in two dimensions

import plotly.express as px

In [22]:
# Make sure to pass the list of words from index-to-word mapping

fig = px.scatter(x=Z[:, 0], y=Z[:, 1], text=index_word_map, size_max=60)
fig.update_traces(textposition='top center')
fig.show()

The data spreads in two directions:

* Horizontally: words like 'biology', 'political', 'earth', which have more to do with science and politics
* Vertically: words like 'america', 'global', 'culture', which have more to do with arts and news

Most words are clustered around the center, so many labels overlap, but when you zoom in you can still see words leaning upwards or across depending on whether they are more artistic or more scientific.
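One possible way to reduce the overlap near the center is to plot only the words that lie farthest from the origin. The sketch below is an assumption about how this could be done, not part of the original exercise: the helper `top_k_far_from_origin` and the demo values are hypothetical.

```python
import numpy as np

def top_k_far_from_origin(Z, words, k=20):
    """Return the k rows of Z with the largest Euclidean norm, and their words."""
    Z = np.asarray(Z)
    words = np.asarray(words)
    # Sort row indices by distance from the origin, largest first
    order = np.argsort(np.linalg.norm(Z, axis=1))[::-1][:k]
    return Z[order], words[order]

# Hypothetical demo coordinates (not the real SVD output)
Z_demo = np.array([[0.01, 0.0], [0.5, 0.1], [0.0, 0.02], [0.1, 0.9]])
words_demo = ["zen", "biology", "zurich", "culture"]

Z_top, words_top = top_k_far_from_origin(Z_demo, words_demo, k=2)
print(list(words_top))
```

With the real data, the returned arrays could be passed to the same `px.scatter` call as before, e.g. `x=Z_top[:, 0], y=Z_top[:, 1], text=words_top`.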

This suggests that the book titles are mostly about science, politics, arts and business, perhaps a reading list for college students.