# Latent Semantic Analysis Lab

Andrew Peabody apeab2@uis.edu

This notebook applies Latent Semantic Analysis (LSA) to a corpus (collection) of 'sci.crypt' newsgroup posts, based on the lecture and provided examples.

This particular LSA applies rather aggressive data cleanup:
 - fetch_20newsgroups strips the headers, footers, and quotes.
 - TfidfVectorizer handles the case conversion.
 - Removal of all digits (including binary) and underscores
 - A comprehensive stopwords list
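
The digit and underscore removal can be sketched on a single string. This is a minimal illustration rather than the notebook's own cell; the helper name `strip_digits_and_underscores` and the sample text are made up for the example.

```python
import re

# Hypothetical helper: drop every ASCII digit and underscore from a string.
_DIGITS_AND_UNDERSCORES = re.compile(r"[0-9_]")

def strip_digits_and_underscores(text):
    return _DIGITS_AND_UNDERSCORES.sub("", text)

# Made-up sample resembling a sci.crypt post fragment.
sample = "DES_key 0x1F uses 56 bits: 01101001"
cleaned = strip_digits_and_underscores(sample)
print(repr(cleaned))
```

Note that the cleanup is character-based, so a token such as `0x1F` leaves the residue `xF` behind instead of disappearing entirely.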

In [444]:
#Required Dependencies
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [445]:
#only required once to download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [446]:
#Download the sci.crypt newsgroup posts
categories = ['sci.crypt']
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories,
                            remove=('headers', 'footers', 'quotes'))
corpus = dataset.data

#Strip all digits and underscores (re.sub works on both Python 2 and Python 3 strings)
import re
corpus = [re.sub(r'[0-9_]', '', x) for x in corpus]

In [447]:
#Additional custom stopwords to clean the data
stopset = set(stopwords.words('english'))
stopset.update(['lt','p','/p','br','amp','quot','field','font','normal','span','0px','rgb','style','51', 
                'spacing','text','helvetica','size','family', 'space', 'arial', 'height', 'indent', 'letter',
                'line','none','sans','serif','transform','line','variant','weight','times', 'new','strong', 'video', 'title',
                'white','word','letter', 'roman','0pt','16','color','12','14','21', 'neue', 'apple', 'class', 'com','edu',
                'gov','org','one','get','nntp','uni','de','david','\t','\n','db','amanda','steve','ellisun','carl','bontchev',
                'fbihh','informatik','aa'])

In [448]:
#These might be useful to troubleshoot the corpus or stopword set
#corpus[0]
#stopset
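
When troubleshooting the stopword set, it can help to check which entries could ever match a token at all. The sketch below assumes the vectorizer keeps its default `token_pattern` (two or more word characters) and that digits and underscores have already been stripped from the corpus; `unreachable_stopwords` and the sample list are hypothetical names used only here.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def unreachable_stopwords(stopwords):
    """Return entries that cannot equal any token of the cleaned corpus."""
    unreachable = []
    for word in stopwords:
        is_single_token = TOKEN_PATTERN.fullmatch(word) is not None
        has_stripped_char = re.search(r"[0-9_]", word) is not None
        if not is_single_token or has_stripped_char:
            unreachable.append(word)
    return sorted(unreachable)

# A few entries taken from the custom stopword list above.
sample_stopwords = ["lt", "p", "/p", "0px", "51", "\t", "font", "edu"]
print(unreachable_stopwords(sample_stopwords))
```

Entries flagged this way are harmless, but they do no work: single characters and punctuation never become tokens, and anything containing a digit was already removed from the text.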

### TF-IDF Vectorizing

Use scikit-learn's TF-IDF vectorizer to convert the corpus into a sparse matrix of TF-IDF features, with one row per document.
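
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It is not the notebook's vectorizer: it assumes whitespace tokenization, unigrams only, and no stopword removal, and it follows the weighting scikit-learn documents as its default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows). The function name `tfidf` and the toy documents are made up for the illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy corpus: "key" appears in every document, "clipper" in only one.
toy_docs = ["clipper chip key escrow", "public key encryption", "key escrow law"]
rows = tfidf(toy_docs)
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

In the first toy document, `clipper` outweighs `escrow`, which in turn outweighs `key`, because rarer terms receive a larger IDF factor.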

In [449]:
#Pass the stop words as a list; recent scikit-learn releases reject a set here
vectorizer = TfidfVectorizer(stop_words=list(stopset), lowercase=True,
                             use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)

In [450]:
#This might be useful to check the scoring of the first document
#X[0]
#This can be used to check the number of documents and/or features.
#X.shape

### Perform Latent Semantic Analysis

Perform the truncated singular value decomposition (SVD). As these are newsgroup threads, I've decided to go with n_components=10 (~1%), which also keeps RAM usage and CPU time reasonable.
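
For reference, the decomposition can be written as follows. The symbols are notation introduced here, not taken from the notebook: $X$ is the TF-IDF matrix with $n$ documents and $m$ terms, and $k$ is `n_components`.

$$
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{n \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k^{\top} \in \mathbb{R}^{k \times m}
$$

Each row of $V_k^{\top}$ assigns one weight to every term; these rows are what scikit-learn exposes as `lsa.components_`, and each one is read as a "concept" in the output further down.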

In [455]:
lsa = TruncatedSVD(n_components=10, n_iter=100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=10, n_iter=100,
       random_state=None, tol=0.0)




### Discovered Concepts

In [456]:
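#get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_):
    #Pair every term with its weight in this component and keep the ten largest
    termsInComp = zip(terms, comp)
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)[:10]
    print("Concept %d:" % i)
    for term in sortedTerms:
        print(term[0])
    print(" ")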

Concept 0:
key
encryption
chip
would
clipper
government
keys
use
law
escrow
 
Concept 1:
encryption
law
americans
technology
industry
devices
encryption technology
law enforcement
key escrow
administration
 
Concept 2:
internet
phone
government
pub
security
email
privacy
anonymous
eff
know
 
Concept 3:
key
ripem
use
encrypted
also
aachen math
security
internet
like
two
 
Concept 4:
key
would
ripem
government
administration
use
public
rsa
pq
law enforcement
 
Concept 5:
encryption
anonymous
anon
posting
anonymity
penet
clipper
two
penet fi
administration
 
Concept 6:
like
use
law
chip
public
escrow
used
algorithm
serial number
need
 
Concept 7:
use
clipper
escrow
government
aachen ftp site
time
information
agencies
many
device
 
Concept 8:
chip
could
escrow
use
would
law
right
also
keys
might
 
Concept 9:
key
clipper
government
chip
number
people
nsa
aachen ftp site
much
like
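
The ranking step behind the concept lists can be tried in isolation. The sketch below uses a hypothetical helper, `top_terms`, with made-up terms and weights standing in for the vectorizer's vocabulary and one row of `lsa.components_`. The `by_magnitude` option is an addition for illustration, not something the notebook does: component weights can be negative, so ranking by absolute value surfaces terms that load strongly in either direction.

```python
def top_terms(terms, weights, n=10, by_magnitude=False):
    """Return the n terms with the largest weights in one component."""
    score = abs if by_magnitude else (lambda w: w)
    ranked = sorted(zip(terms, weights), key=lambda tw: score(tw[1]), reverse=True)
    return [term for term, _ in ranked[:n]]

# Made-up vocabulary and weights for a single component.
toy_terms = ["key", "chip", "anonymous", "escrow", "posting"]
toy_component = [0.40, 0.25, -0.55, 0.30, -0.10]

# Signed ranking, as in the notebook's loop.
print(top_terms(toy_terms, toy_component, n=3))
# Ranking by absolute weight instead.
print(top_terms(toy_terms, toy_component, n=3, by_magnitude=True))
```

With the signed ranking, `anonymous` never appears despite having the largest weight in magnitude, which is worth keeping in mind when reading the lists above.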
 
