In [1]:
#  Jordan Hoover
#  CSC570R Fall 2016
#  Assignment: 
#  LSA lab, using Python 3.5

# LSA lab: 
Latent Semantic Analysis (LSA) on newsgroup posts

Steps:
<li>Import the needed modules and fetch the dataset, which consists of newsgroup posts. I chose sci.electronics, but also experimented with other newsgroups.

<li>Vectorize the posts with the TF-IDF vectorizer.

<li>Run the LSA and look at the resulting concepts.

<li>Repeat: update the stop words and re-examine the resulting concepts for information.
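The loop in the last step can be wrapped in one function so that each change to the stop words is a single call. This is only a sketch: `lsa_concepts`, `toy_docs`, and the toy stop words are hypothetical names and data, not part of the lab, and `random_state=0` is added just to make the toy run repeatable.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def lsa_concepts(docs, stop_words, n_components=2, top_n=3):
    """Vectorize docs with TF-IDF, run truncated SVD, return top terms per concept."""
    vectorizer = TfidfVectorizer(stop_words=list(stop_words))
    X = vectorizer.fit_transform(docs)
    lsa = TruncatedSVD(n_components=n_components, random_state=0)
    lsa.fit(X)
    # Order the terms by column index so positions match the columns of X.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    concepts = []
    for weights in lsa.components_:
        top = weights.argsort()[::-1][:top_n]
        concepts.append([terms[j] for j in top])
    return concepts


# Toy stand-in for the newsgroup posts (hypothetical data).
toy_docs = [
    "writes battery acid discharge",
    "writes battery lead acid",
    "writes wire ground neutral",
    "writes wiring ground outlet",
]

stop_words = {"the"}
before = lsa_concepts(toy_docs, stop_words)
stop_words.add("writes")  # a header word that carries no topic information
after = lsa_concepts(toy_docs, stop_words)
print(before)
print(after)
```

Each pass through the loop then amounts to reading the printed concepts, adding the junk terms to `stop_words`, and calling the function again.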


In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
categories = ['sci.electronics']  # name of newsgroup to use here
dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42,
                             categories=categories)
corpus = dataset.data  # corpus is now a list of documents (strings) from sci.electronics


In [3]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [4]:
# The stop-word list only needs to be downloaded once, so download it
# only when NLTK cannot find it locally
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

In [5]:
stopset = set(stopwords.words('english'))
# Stop words are matched against single tokens, so every entry is one word
stopset.update(['nntp', 'edu', 'com', 'go', 'two', 'get', 'use', 'would',
                'one', 'ca', '00', '1993', 'from', 'subject', '000', 'even', 'also',
                'article', 'posting', 'like', 'good', 'neal', '355o33', 'dave', 'medin',
                'could', 'many', '0000001200', '10', '20', '001', 'cmptrc', 'way', 'something', 'michael', 'need',
                'much', 'know', 'b30', 'josh'
               ])
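A note on how stop words interact with n-grams: as far as I understand scikit-learn's word analyzer, stop words are compared against individual tokens before n-grams are assembled, so a two-word entry such as 'dave medin' cannot match anything. The sketch below checks this on a made-up sentence using `CountVectorizer`, the base class of `TfidfVectorizer`; the variable names are illustrative, and scikit-learn may print a warning about the two-word entry.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "dave medin wrote this"

# Single-token stop words: both tokens are dropped before n-grams are formed.
single = CountVectorizer(stop_words=["dave", "medin"],
                         ngram_range=(1, 2)).build_analyzer()
# A two-word entry never equals any single token, so nothing is removed.
phrase = CountVectorizer(stop_words=["dave medin"],
                         ngram_range=(1, 2)).build_analyzer()

single_terms = single(text)
phrase_terms = phrase(text)
print(single_terms)
print(phrase_terms)
```

With single-word entries the name disappears from both unigrams and bigrams; with the two-word entry the bigram survives untouched.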

Now we have the dataset as a list of documents in 'corpus'.
Next, I use scikit-learn's TF-IDF vectorizer to convert the corpus into a
sparse matrix of TF-IDF features.
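To make the idea of a sparse TF-IDF matrix concrete, here is a minimal sketch on three made-up sentences. The corpus, the stop words, and the `toy_` names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "the battery powers the circuit",
    "the circuit needs a ground wire",
    "copy protection for software",
]

toy_vectorizer = TfidfVectorizer(stop_words=["the", "for"])
toy_X = toy_vectorizer.fit_transform(toy_corpus)

print(toy_X.shape)  # (number of documents, number of distinct terms)
print(toy_X.nnz)    # number of stored non-zero weights
print(toy_X.toarray().round(2))

# Within the first document, the term shared with another document
# ("circuit") gets a lower weight than a term unique to it ("battery").
col = toy_vectorizer.vocabulary_
row0 = toy_X.toarray()[0]
print(row0[col["circuit"]] < row0[col["battery"]])
```

Only the non-zero weights are stored, which is why the full newsgroup matrix stays manageable even with a very large vocabulary.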

In [6]:
# Before running the vectorizer, this is what a document/post in my corpus looks like:
# I looked through some of the documents in the corpus to try to get ideas for better stop words
corpus[5]

"From: sgberg@charon.bloomington.in.us (Stefan G. Berg)\nSubject: Re: Motorola XC68882RC33 and RC50\nReply-To: sgberg@charon.bloomington.in.us (Stefan Berg)\nDistribution: world\nOrganization: Not an Organization\nX-NewsSoftware: GRn 1.16f (10.17.92) by Mike Schwartz & Michael B. Smith\nLines: 25\n\nIn article <16APR199323531467@rosie.uh.edu> st1my@rosie.uh.edu (Stich, Christian E.) writes:\n> I just installed a Motorola XC68882RC50 FPU in an Amiga A2630 board (25 MHz\n> 68030 + 68882 with capability to clock the FPU separately).  Previously\n> a MC68882RC25 was installed and everything was working perfectly.  Now the\n> systems displays a yellow screen (indicating a exception) when it check for\n> the presence/type of FPU.  When I reinstall an MC68882RC25 the system works\n> fine, but with the XC68882 even at 25 MHz it does not work.  The designer\n> of the board mentioned that putting a pullup resistor on data_strobe (470 Ohm)\n> might help, but that didn't change anything.  Does any

In [7]:
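# run vectorizer
# I tried changing ngram_range, but it did not seem to produce better or more useful insights,
# so I left it at (1, 3), i.e. unigrams through trigrams
# stop_words is passed as a list because newer scikit-learn releases reject a set
vectorizer = TfidfVectorizer(stop_words=list(stopset),
                             use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)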
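The choice of `ngram_range` mostly shows up in the size of the vocabulary. The following sketch counts features for three made-up posts as the upper n-gram length grows; `toy_posts` and `vocab_sizes` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_posts = [
    "lead acid battery discharge",
    "copy protection program",
    "ground wire and neutral wire",
]

vocab_sizes = []
for upper in (1, 2, 3):
    v = TfidfVectorizer(ngram_range=(1, upper)).fit(toy_posts)
    vocab_sizes.append(len(v.vocabulary_))
    print("ngram_range=(1, %d): %d features" % (upper, vocab_sizes[-1]))
```

The same effect explains why the real matrix has far more columns than there are distinct words in the posts: every bigram and trigram is its own feature.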


After running the vectorizer, the first document looks like this:

In [8]:
X[0]

<1x166969 sparse matrix of type '<class 'numpy.float64'>'
	with 162 stored elements in Compressed Sparse Row format>
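A row of the matrix is easier to read once the stored column indices are mapped back to terms. The sketch below does this for a toy matrix; it relies only on the fitted `vocabulary_` mapping and the `indices` and `data` attributes of a CSR row, and all `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["battery acid battery", "ground wire", "copy protection"]
toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_docs)

# Map column index -> term using the fitted vocabulary.
index_to_term = {j: t for t, j in toy_vec.vocabulary_.items()}

row = toy_X[0]  # still a 1 x n_features sparse matrix
# row.indices holds the columns with stored values, row.data the matching weights.
weighted = sorted(zip(row.data, row.indices), reverse=True)
row_terms = [(index_to_term[j], round(float(w), 3)) for w, j in weighted]
print(row_terms)
```

Applied to `X[0]`, the same steps would list the 162 stored terms of the first post with their weights, highest first.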

In [9]:
#print(X[0])

In [10]:
type(X)

scipy.sparse.csr.csr_matrix

In [11]:
type(X[0])

scipy.sparse.csr.csr_matrix

Now I can run the LSA. First, a check of the matrix shape (documents x features):

In [12]:
X.shape

(984, 166969)

In [13]:
X[0].shape

(1, 166969)

In [14]:
# I experimented with adjusting number of total concepts by adjusting n_components
lsa = TruncatedSVD(n_components=25, n_iter=100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=25, n_iter=100,
       random_state=None, tol=0.0)
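Besides the term weights per concept, the fitted model can also place each document in concept space. A minimal sketch on toy data follows; the documents and `toy_` names are made up, and `random_state=0` is my addition (the run above leaves it at `None`, so its concepts can vary slightly between runs).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "battery acid discharge",
    "lead acid battery charge",
    "ground wire neutral",
    "wire ground outlet gfci box",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

# random_state fixes the randomized solver so repeated runs give the same result.
toy_lsa = TruncatedSVD(n_components=2, n_iter=10, random_state=0)
doc_concepts = toy_lsa.fit_transform(toy_X)

print(doc_concepts.shape)                 # one row per document, one column per concept
print(toy_lsa.components_.shape)          # one row per concept, one column per term
print(toy_lsa.explained_variance_ratio_)  # share of variance captured by each concept
```

The explained variance ratio is one way to judge whether a given `n_components` is too small or larger than needed.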

In [15]:
# show the first row of V^T (components_), i.e. the term weights for concept 0
lsa.components_[0]

array([ 0.00026131,  0.00026131,  0.00026131, ...,  0.00154365,
        0.00154365,  0.00154365])
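The rows of `components_` correspond to the leading right singular vectors of the matrix. The sketch below checks that on a small random dense matrix by comparing against `numpy.linalg.svd`; the matrix `A` is a hypothetical stand-in for the TF-IDF matrix, which would be far too large to decompose in full this way.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
A = rng.rand(6, 5)  # small dense stand-in for the TF-IDF matrix

svd = TruncatedSVD(n_components=2, n_iter=10, random_state=0).fit(A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular vectors are only defined up to sign, so compare absolute values.
same = np.allclose(np.abs(svd.components_), np.abs(Vt[:2]), atol=1e-6)
print(same)
```

Because the sign of each vector is arbitrary, only the relative sizes of the weights within a concept are meaningful when reading the term lists.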

In [16]:
# Show the Python version I am using, for reference
import sys
print(sys.version)

3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]


In [17]:
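# print the ten highest-weighted terms for each concept
# (get_feature_names() is the API of the scikit-learn release used here;
#  newer releases replace it with get_feature_names_out())
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_):
    terms_in_comp = zip(terms, comp)
    sorted_terms = sorted(terms_in_comp, key=lambda x: x[1], reverse=True)[:10]
    print("Concept %d:" % i)
    for term in sorted_terms:
        print(term[0])
    print("")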
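Sorting every (term, weight) pair for each concept works, but with a vocabulary this large it may be quicker to let NumPy pick the top columns. This is an alternative sketch, not the method used above; `top_terms` and the toy weights are hypothetical.

```python
import numpy as np


def top_terms(components, terms, n=10):
    """Return the n highest-weighted terms for each row of components."""
    terms = np.asarray(terms)
    # argsort is ascending, so take the last n columns and reverse them.
    order = np.argsort(components, axis=1)[:, -n:][:, ::-1]
    return [terms[row].tolist() for row in order]


# Hypothetical weights for two concepts over four terms.
toy_terms = ["battery", "acid", "wire", "ground"]
toy_components = np.array([
    [0.7, 0.6, 0.1, 0.2],
    [0.1, 0.0, 0.8, 0.5],
])
result = top_terms(toy_components, toy_terms, n=2)
print(result)
```

With the real model the call would be `top_terms(lsa.components_, terms)`, assuming `terms` is ordered by column index as it is above.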

Concept 0:
writes
lines
copy
organization
battery
university
host
ground
power
phone

Concept 1:
copy
protection
copy protection
program
sure
software
tih
ketil
reed
copy protected

Concept 2:
battery
acid
concrete
batteries
003132 wcsub
temple
lead acid
discharge
ai
uga

Concept 3:
catbyte
dtmedin
ingr
phone
003800
catbyte ingr
dtmedin catbyte
dtmedin catbyte ingr
reply
batteries

Concept 4:
audio
relays
ground
wire
wagner
nanaimo
0000 ibm tieline
used
led
state

Concept 5:
wire
wiring
may
ground
baden
003800 18288
neutral
battery
bison
bison mb

Concept 6:
wire
wiring
ground
university
neutral
gfci
outlet
outlets
writes
003132

Concept 7:
writes
used
power
host
oversampling
may
electronics
university
john
cd

Concept 8:
radio
plants
organization
ground
mpr
power
please
really
ac
program

Concept 9:
writes
003132 wcsub ctstateu
chip
find
us
output
want
thing
cs
cycle

Concept 10:
0000 ibm tieline
find
wiring
current
ai
power
hp
never
host
ink

Concept 11:
0000 ibm tieline
host
organiz

# Summary of Concepts

Many of my concepts still contained a lot of junk, even after many iterations of changing parameters and adding to the list of stop words, but a few things stood out:

<li> copy protection
<li> voltage detector
<li> circuit wiring
<li> audio signal, tv signal, tv signal voltage
<li> battery, power, computer (possibly about powering a computer)
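Much of the junk (writes, lines, organization, host, university) looks like it comes from the header block at the top of each post rather than from the message body. One possible remedy, sketched here as an assumption and not something this lab tried, is to drop everything before the first blank line prior to vectorizing. `strip_headers` and the sample post are hypothetical; scikit-learn's `fetch_20newsgroups` also accepts a `remove=('headers', 'footers', 'quotes')` argument aimed at the same problem.

```python
def strip_headers(post):
    """Drop the header block, i.e. everything up to the first blank line."""
    head, sep, body = post.partition("\n\n")
    # If there is no blank line, treat the whole post as body.
    return body if sep else post


# Made-up post in the same layout as the newsgroup documents.
sample = (
    "From: someone@example.com\n"
    "Subject: Re: lead acid batteries\n"
    "Organization: Example University\n"
    "Lines: 2\n"
    "\n"
    "Is it safe to store a lead acid battery on concrete?\n"
)
cleaned = strip_headers(sample)
print(cleaned)
```

Running the vectorizer on `[strip_headers(doc) for doc in corpus]` would be one way to test whether the header terms drop out of the concepts without growing the stop-word list further.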
