# Latent Semantic Analysis Lab
### Completed by Jacob Metzger
### Due Feb 15, 2016

## Task Description
Your assignment for this week is to do LSA on a group of posts from the newsgroup 'rec.sport.baseball'. (Feel free to pick another newsgroup if you like; the list is here: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)

1.  To get the newsgroup data, use this code:<br>
from sklearn.datasets import fetch_20newsgroups<br>
categories = ['rec.sport.baseball']<br>
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)<br>
corpus = dataset.data<br>

2.  Next, you'll be adapting my LSA code for your problem.  This shouldn't be too hard, but please spend some time understanding what my code is doing.  

3.  When you print the discovered concepts, you'll probably find they don't make sense.  Consider adding words to the stop word list to remove things like nntp and people's names.

4.  Once you're satisfied with your work, submit the link to your work.

## Task Work
#### (Note: Code in this notebook is adapted from course lecture sample code, some portions directly, found at https://github.com/mbernico/CS570/blob/master/LSA%20Text.ipynb)

In [1]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
categories = ['sci.space'] #Chosen newsgroup
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)
corpus = dataset.data

In [2]:
#Not importing BeautifulSoup as data is not XML-like
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Izzy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
from collections import defaultdict

#Start some ad-hoc parsing to skip header and get body
#Header fields to drop (lowercase, since each document is lowercased first)
headerPrefixes = ("from:", "subject:", "organization:", "distribution:", "lines: ",
                  "nntp-posting-host:", "article-i.d.:", "x-added:", "original-sender:")
corpusSplit = [_.lower().split('\n') for _ in corpus]
dropHeaderCorpus = defaultdict(str)
for idx, document in enumerate(corpusSplit):
    for line in document:
        if line.startswith(headerPrefixes): #str.startswith accepts a tuple of prefixes
            continue
        dropHeaderCorpus[idx]+=' '+line.replace('<','').replace('>','') #also clean up some annoying symbols
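
As a quick check of the header-dropping logic above, here is a minimal sketch on a made-up post. The `strip_headers` helper, the `HEADER_PREFIXES` tuple, and the sample message are illustrative assumptions, not part of the lab code. As a possible alternative, `fetch_20newsgroups` also accepts a `remove=('headers', 'footers', 'quotes')` argument that strips this metadata at load time.

```python
# Hypothetical helper mirroring the ad-hoc header filter used in this notebook.
HEADER_PREFIXES = (
    "from:", "subject:", "organization:", "distribution:", "lines: ",
    "nntp-posting-host:", "article-i.d.:", "x-added:", "original-sender:",
)

def strip_headers(document):
    """Lowercase a post, drop known header lines, and remove '<' and '>'."""
    kept = []
    for line in document.lower().split("\n"):
        if line.startswith(HEADER_PREFIXES):
            continue  # skip header metadata
        kept.append(line.replace("<", "").replace(">", ""))
    return " ".join(kept)

# Made-up sample post (not taken from the dataset).
sample = (
    "From: someone@example.com\n"
    "Subject: Re: Orbit\n"
    "Lines: 2\n"
    "\n"
    "> The shuttle launch\n"
    "was delayed."
)
print(strip_headers(sample))
```

Only the body text survives, which is what the vectorizer should see. Header fields not listed in the tuple would still pass through, so the list may need extending for other newsgroups.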

In [45]:
stopset = set(stopwords.words('english'))
#These are ad-hoc as a result of trial and error
stopset.update(['zoo','\t','henry','com','toronto','re','edu','__','___', '_____',\
                'also','like','gmt', 'gov', 'net','theporch','raider','digex'])

vectorizer = TfidfVectorizer(stop_words=stopset, use_idf=True, ngram_range=(1,3))
X=vectorizer.fit_transform(dropHeaderCorpus.values())
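
To see how extending the stop word list changes the vocabulary, the sketch below fits `TfidfVectorizer` on three made-up sentences. The toy corpus and the `base_stops`, `extra_stops`, and `fit_vocabulary` names are assumptions for illustration. Recent scikit-learn releases may reject a set for `stop_words` and expect a list, so the sketch converts the set before passing it in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus standing in for the newsgroup posts.
toy_corpus = [
    "henry writes the shuttle launch was delayed",
    "the nasa shuttle reached orbit",
    "henry writes nasa plans the moon mission",
]

base_stops = {"the", "was"}           # stand-in for the NLTK English list
extra_stops = {"henry", "writes"}     # ad-hoc additions, as in the cell above

def fit_vocabulary(stop_words):
    """Fit a TF-IDF model and return the sorted vocabulary it learned."""
    vectorizer = TfidfVectorizer(stop_words=sorted(stop_words), use_idf=True)
    vectorizer.fit(toy_corpus)
    return sorted(vectorizer.vocabulary_)

before = fit_vocabulary(base_stops)
after = fit_vocabulary(base_stops | extra_stops)
print(before)
print(after)
```

Comparing the two printed vocabularies shows exactly which features the extra stop words removed, which can make the trial-and-error loop a little less blind.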

In [46]:
X.shape

(987, 244327)
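
The large column count above follows from `ngram_range=(1,3)`: every distinct unigram, bigram, and trigram becomes a feature. Below is a minimal pure-Python sketch of that expansion, assuming whitespace tokenization and a hypothetical `word_ngrams` helper (the vectorizer's own tokenizer is regex-based and differs in detail).

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "solar system probe mission".split()
grams = word_ngrams(tokens)
print(len(grams))  # 4 unigrams + 3 bigrams + 2 trigrams = 9
print(grams)
```

This is also why multi-word terms such as `solar system` can show up in the concept lists.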

In [47]:
#Use Singular Value Decomposition to complete the LSA
lsa = TruncatedSVD(n_components=8, n_iter=100) #n_components was chosen after trial and error based on inspection
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=8, n_iter=100,
       random_state=None, tol=0.0)
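
Because `random_state` is left at `None`, the randomized solver can give slightly different components from run to run. The sketch below, on a made-up corpus, fixes the seed and inspects what the fitted model exposes. The corpus, `n_components=2`, and the seed value are assumptions for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus; the real notebook uses the sci.space posts.
toy_corpus = [
    "shuttle launch delayed by weather",
    "nasa shuttle launch scheduled",
    "venus probe maps planet surface",
    "spacecraft orbits planet venus",
]

X_toy = TfidfVectorizer().fit_transform(toy_corpus)

# A fixed random_state makes the randomized SVD repeatable across runs.
lsa_toy = TruncatedSVD(n_components=2, n_iter=100, random_state=42)
doc_topics = lsa_toy.fit_transform(X_toy)

print(X_toy.shape)                 # (documents, terms)
print(lsa_toy.components_.shape)   # (concepts, terms)
print(doc_topics.shape)            # (documents, concepts)
print(lsa_toy.explained_variance_ratio_)
```

Looking at `explained_variance_ratio_` is one possible way to support the choice of `n_components` alongside manual inspection of the concepts.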

In [48]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print "Concept %d:" % i
    for term in sortedTerms:
        print "\t",term[0] #added indentation to make slightly easier to read
    print " "

Concept 0:
	space
	would
	nasa
	writes
	one
	article
	shuttle
	launch
	orbit
	moon
 
Concept 1:
	space
	venus
	planet
	mission
	surface
	solar
	spacecraft
	kilometers
	solar system
	good
 
Concept 2:
	hst
	mission
	work
	even
	sky
	real
	people
	pluto
	maybe
	something
 
Concept 3:
	nasa
	think
	pat
	even
	space
	software
	may
	shuttle
	launch
	make
 
Concept 4:
	would
	launch
	satellite
	things
	year
	enough
	use
	vehicle
	project
	flight
 
Concept 5:
	people
	launch
	nasa
	solar
	see
	reply
	propulsion
	think
	first
	hst
 
Concept 6:
	would
	mission
	access
	program
	see
	project
	could
	probe
	dc
	things
 
Concept 7:
	space
	see
	alaska
	one
	us
	new
	first
	day
	solar
	launch
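
The listing above ranks terms by raw component weight, so terms with large negative weights never appear even though they can characterize a concept just as strongly. As a hedged alternative, the sketch below ranks by absolute weight and keeps the sign visible. The `top_terms` helper, the term list, and the weights are hypothetical values, not output from this notebook.

```python
def top_terms(terms, weights, k=3, use_abs=False):
    """Return the k (term, weight) pairs with the largest (absolute) weight."""
    if use_abs:
        key = lambda pair: abs(pair[1])
    else:
        key = lambda pair: pair[1]
    return sorted(zip(terms, weights), key=key, reverse=True)[:k]

# Hypothetical terms and component weights for one concept.
terms = ["space", "venus", "shuttle", "writes", "orbit"]
weights = [0.40, -0.55, 0.10, 0.05, -0.30]

print(top_terms(terms, weights))                 # ranked by signed weight
print(top_terms(terms, weights, use_abs=True))   # ranked by magnitude
```

With these made-up weights, the magnitude ranking surfaces `venus` and `orbit`, which the signed ranking hides.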
 
