## Latent Semantic Analysis

**Latent Semantic Analysis** (LSA) is a technique for interpreting and analyzing text data. Before running an LSA, the data needs a few preparation steps.

First, I will use the _BeautifulSoup_ library to parse an XML file containing real student forum posts from my Data Science course's discussion board into a corpus of documents. Then I will use _scikit-learn_ to streamline the TF-IDF process by **vectorizing** each document directly into a sparse matrix of TF-IDF features.

Once the corpus is in the form of a TF-IDF matrix, I will perform an LSA on it, which extracts the significant **concepts** from the text in a form that is easy to interpret in any further study.

In [141]:
# imports
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd


## Parsing & Cleaning with BeautifulSoup

In [142]:
# downloading stopwords should they not be present
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jasonschenck/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

BeautifulSoup is a flexible, easy-to-use tool for parsing and cleaning up text. In this case I need to read in an XML file.

In [143]:
# Read the raw XML file of discussion forum posts into a single string
with open('raw_forum_posts.dat', 'r') as f:
    posts = f.read()

In [144]:
# In XML, the raw text content is organized and separated by tags. Below I extract only the data inside the text tags.
# First, instantiate the 'soup' with params (markup, parser). Here 'lxml' names the parser to use.
soup = BeautifulSoup(posts, 'lxml')

# Filter the soup for only the elements found between the <text> tags.
postTxt = soup.find_all('text')

# Build the corpus with list comprehensions: take the text of every <text> element,
# drop the first entry, and lowercase each document.
postDocs = [x.text for x in postTxt]
postDocs.pop(0)
postDocs = [x.lower() for x in postDocs]
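
To make the parsing step easier to follow, here is a minimal sketch of the same extraction on a tiny inline document. The sample XML and its contents are made up for illustration, and the built-in `html.parser` is swapped in for `lxml` so the snippet runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the forum export: three <text> elements,
# the first of which is not a real post.
sample_xml = """
<posts>
  <post><text>Header Row</text></post>
  <post><text>Data Science is FUN.</text></post>
  <post><text>Big data needs cleaning.</text></post>
</posts>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
sample_soup = BeautifulSoup(sample_xml, "html.parser")

# Collect the text inside every <text> element.
docs = [tag.get_text() for tag in sample_soup.find_all("text")]

# Drop the first entry and lowercase the rest, mirroring the cell above.
docs = [d.lower() for d in docs[1:]]
print(docs)
```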

Cleaning time. Python and scikit-learn make this easy: define a set of **stopwords** (a **stopset**), pass it to scikit-learn as a parameter, and the vectorizer removes those words automatically as it processes the documents.

Stopwords are words that carry no conceptual meaning for the textual analysis. For example, "the", "0px", and "rgb" can be removed, since they only slow the process down and make the results less accurate in the long run.

For this example, I used the set of English stopwords provided by the **nltk** library and manually added a few leftovers from the HTML markup. It is not exhaustive, but for the most part it should do the trick. After the set is defined, it just needs to be passed to the vectorizer.

In [145]:
stopset = set(stopwords.words('english'))
# Extra tokens left over from the HTML/CSS markup embedded in the posts.
# (A comma was missing after 'letter', which silently merged it with 'line' into 'letterline'.)
stopset.update(['lt','p','/p','br','amp','quot','field','font','normal','span','0px','rgb','style','51',
                'spacing','text','helvetica','size','family', 'space', 'arial', 'height', 'indent', 'letter',
                'line','none','sans','serif','transform','variant','weight','times', 'new','strong', 'video',
                'title','white','word', 'roman','0pt','16','color','12','14','21', 'neue', 'apple', 'class'])
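
As a quick illustration of what the stop list does during vectorization, the sketch below fits a vectorizer on two made-up sentences with a small hand-written stop list (both are hypothetical, not taken from the forum data) and inspects the surviving vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents and a toy stop list, for illustration only.
toy_docs = ["the data is large", "the span style is normal"]
toy_stops = ["the", "is", "span", "style", "normal"]

vec = TfidfVectorizer(stop_words=toy_stops)
vec.fit(toy_docs)

# Only terms that are not stopwords end up in the vocabulary.
vocab = sorted(vec.vocabulary_)
print(vocab)

# The default tokenizer keeps only runs of two or more word characters,
# so markup such as '<p>' never becomes a token in the first place.
tokens = TfidfVectorizer().build_analyzer()("<p>data</p>")
print(tokens)
```

With the default tokenization, single-character entries such as 'p' or entries containing a slash such as '/p' can never match a token, so they are harmless but have no effect in the stop list.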


### TF-IDF Vectorizing with Scikit-Learn


In [146]:
# Before...
postDocs[0]

'<p>data science is about analyzing relevant data to obtain patterns of information in order to help achieve a goal. the main focus of the data analysis is the goal rather then the methodology on how it will achieved. this allows for creative thinking and allowing for the optimal solution or model to be found wihtout the constraint of a specific methodology.</p>'

When we vectorize, we configure the text analyzer that is built into scikit-learn, so we must specify some important parameters:  

* **stop_words:** set this param to the stopwords defined in stopset  
<br>
* **use_idf:** true or false [will want this to be set to true in most cases]  
<br>
* **ngram_range:** 'grams' are essentially words, and ngram_range tells the analyzer the minimum (1) and maximum (3) number of consecutive words to treat as a single feature. For example, in this case we are going to use 'ngram_range=(1,3)', which means "count single words, but also sequences of two or three consecutive words that repeat across our corpus." The larger the range, the more candidate phrases are available for concept extraction.
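
To see what `ngram_range=(1, 3)` produces, here is a small sketch on a single made-up three-word document. The document is hypothetical; the vocabulary is read from `vocabulary_` so the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One toy document with three words.
ngram_vec = TfidfVectorizer(ngram_range=(1, 3))
ngram_vec.fit(["large amounts data"])

# Every run of 1, 2, or 3 consecutive words becomes its own feature.
features = sorted(ngram_vec.vocabulary_)
for feature in features:
    print(feature)
```

Three words yield six features (three unigrams, two bigrams, one trigram), which is why the real corpus of 27 posts ends up with several thousand columns.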

In [147]:
# Define the vectorizer model -- TfidfVectorizer(stop_words = ?, use_idf = True, ngram_range = ?)
# Note: recent scikit-learn releases expect stop_words as a list, so the set is converted
vectorizer = TfidfVectorizer(stop_words=list(stopset), use_idf=True, ngram_range=(1, 3))

# Fit the vectorizer to the corpus and transform the corpus into a TF-IDF matrix
X = vectorizer.fit_transform(postDocs)

In [148]:
# Notice the output here..
X[0]

<1x3341 sparse matrix of type '<class 'numpy.float64'>'
	with 89 stored elements in Compressed Sparse Row format>

In [149]:
# Tada! This is now the first document in the corpus, in sparse TF-IDF matrix form: (row, column) weight.
print(X[0])

  (0, 641)	0.0899008417366
  (0, 2471)	0.0310567745199
  (0, 160)	0.0675084290336
  (0, 2400)	0.0882799340268
  (0, 2026)	0.10905143902
  (0, 2140)	0.0969008875155
  (0, 1575)	0.0529592419308
  (0, 2071)	0.0761293825223
  (0, 1459)	0.0761293825223
  (0, 47)	0.0969008875155
  (0, 1376)	0.176559868054
  (0, 1801)	0.10905143902
  (0, 1248)	0.0882799340268
  (0, 143)	0.071509957232
  (0, 2365)	0.10905143902
  (0, 1902)	0.21810287804
  (0, 52)	0.10905143902
  (0, 108)	0.10905143902
  (0, 617)	0.0969008875155
  (0, 2965)	0.0969008875155
  (0, 105)	0.10905143902
  (0, 2065)	0.10905143902
  (0, 2741)	0.0969008875155
  (0, 1930)	0.0761293825223
  (0, 1282)	0.0882799340268
  :	:
  (0, 2028)	0.10905143902
  (0, 2144)	0.10905143902
  (0, 1587)	0.10905143902
  (0, 2077)	0.10905143902
  (0, 1461)	0.10905143902
  (0, 49)	0.10905143902
  (0, 1378)	0.10905143902
  (0, 1803)	0.10905143902
  (0, 1252)	0.10905143902
  (0, 649)	0.10905143902
  (0, 147)	0.10905143902
  (0, 1382)	0.10905143902
  (0, 2367)	0.
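
The printed pairs are (row, column index) followed by a weight, which is hard to read without knowing which term each column stands for. The sketch below, on a made-up three-document corpus, shows one way to map the column indices of a row back to terms; the names `index_to_term` and `weights` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only.
toy_corpus = [
    "data science finds patterns",
    "science needs data",
    "patterns help decisions",
]
toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_corpus)

# vocabulary_ maps term -> column index; invert it to look terms up by index.
index_to_term = {idx: term for term, idx in toy_vec.vocabulary_.items()}

# Convert the first row to COO format to iterate over its stored entries.
row = toy_X[0].tocoo()
weights = {index_to_term[j]: w for j, w in zip(row.col, row.data)}

for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(weight, 3))

# With the default settings each row is scaled to unit length.
row_norm = float(np.sqrt(sum(w ** 2 for w in weights.values())))
```

In this toy corpus "finds" receives the largest weight in the first document because it is the only term that appears in no other document.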

### Latent Semantic Analysis (LSA)


LSA is the process of taking our document-term matrix (X), and performing **matrix decomposition** such that:

<big>$$X \approx USV^{T}$$</big>

where...

* **X** = original TF-IDF document-term matrix (m x n)
* **m** = # of rows, or documents, contained in X
* **n** = # of terms  
<br>

>**The Process:**  
>- X is decomposed into three matrices called U, S, and V with k-value such that...  

<br>

* **k** = # of concepts we want to keep during analysis


and...

* **U** will be a **m x k** matrix.  
 * _Rows_ --> Documents
 * _Columns_ --> Concepts
* **S** will be a **k x k** diagonal matrix. 
 * _Elements_ --> the singular values, i.e. the amount of _variation_ captured by each concept.
* **V** will be a **n x k** matrix.
 * _Rows_ --> Terms
 * _Columns_ --> Concepts
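
The shapes listed above can be checked numerically. The following is a minimal sketch using NumPy's dense SVD on a small random matrix rather than the forum data; the sizes (5 documents, 8 terms, k = 2) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.random((5, 8))  # m = 5 documents, n = 8 terms

# Full (thin) SVD: toy_matrix == U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(toy_matrix, full_matrices=False)

# Keep only the k strongest concepts.
k = 2
U_k = U[:, :k]          # m x k  (documents x concepts)
S_k = np.diag(s[:k])    # k x k  (singular values on the diagonal)
V_k = Vt[:k, :].T       # n x k  (terms x concepts)

# Rank-k approximation of the original matrix.
approx = U_k @ S_k @ V_k.T
error = np.linalg.norm(toy_matrix - approx)

print(U_k.shape, S_k.shape, V_k.shape)
print("approximation error:", error)
```

The approximation error equals the combined strength of the discarded singular values, so keeping more concepts always brings the product closer to X.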

In [150]:
# The current shape is (documents, terms)
X.shape

(27, 3341)

## Truncated Singular Value Decomposition

This is a linear algebra procedure that decomposes our matrix X into three matrices: U, S, and V. The entire process is built into scikit-learn as an estimator; all we must do is define the model specifications and let it do the work for us.

[**scikit-learn**](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) provides the following documentation on this function:  
> "Dimensionality reduction using truncated SVD (aka LSA).
This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently.
In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
This estimator supports two algorithms: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient."

In [151]:
# Begin by defining the TruncatedSVD model (n_components = number of concepts k to keep, n_iter = number of iterations of the randomized solver)
# Note: n_iter defaults to 5 if not passed
lsa = TruncatedSVD(n_components=27, n_iter=5)

# Fit the model
lsa.fit(X)



TruncatedSVD(algorithm='randomized', n_components=27, n_iter=5,
       random_state=None, tol=0.0)
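
For reference, here is a small sketch of what a fitted `TruncatedSVD` exposes, run on a made-up four-document corpus rather than the forum posts. The corpus, the choice of two components, and the fixed `random_state` are assumptions made for the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only.
svd_docs = [
    "data science finds patterns",
    "science needs data",
    "business goals need decisions",
    "better decisions help business",
]
svd_X = TfidfVectorizer().fit_transform(svd_docs)

# Keep two concepts; random_state makes the randomized solver repeatable.
svd = TruncatedSVD(n_components=2, n_iter=5, random_state=0)

# fit_transform returns the documents expressed in concept space (U * S).
doc_concepts = svd.fit_transform(svd_X)

print(doc_concepts.shape)        # (documents, concepts)
print(svd.components_.shape)     # (concepts, terms), i.e. V transposed
print(svd.singular_values_)      # strength of each concept
```

`fit` alone, as used in the cell above, learns the same components; `fit_transform` additionally returns each document's coordinates in concept space.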

## Interpretation Post-SVD

After fitting, 'lsa' holds the result of decomposing X. The matrix V is available in transposed form as 'lsa.components_', with shape (Extracted Concepts x Number of Terms), so each row lists the weight of every term in one concept.  

**Concepts** are the reason we performed this LSA process.

In [152]:
# After decomposition, 'lsa.components_' holds matrix V transposed; row 0 holds the term weights for concept 0
lsa.components_[0]

array([ 0.00568167,  0.00568167,  0.00568167, ...,  0.00438096,
        0.00438096,  0.00438096])

In [153]:
import sys
print (sys.version)

3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]


In [154]:
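# Print the ten highest-weighted terms for each concept (each row of components_)
# Note: get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    termsInComp = zip(terms, comp)
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)[:10]
    print("Concept %d:" % i)
    for term in sortedTerms:
        print(term[0])
    print(" ")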

Concept 0:
data
procedures
large amounts
large amounts data
science
amounts
amounts data
different
could
large
 
Concept 1:
procedures
large amounts
large amounts data
could
amounts
amounts data
large
used
according data
according data science
 
Concept 2:
make
decisions
make better
problem
better
data science analyzing
science analyzing
better decisions
make better decisions
hi
 
Concept 3:
goal
data science analyzing
science analyzing
achieve
solution
methodology
relevant
relevant data
answer
finding
 
Concept 4:
information
converted
big
big data
big data converted
data converted
useful
goal
came
came ways
 
Concept 5:
business
methods
competitive edge
edge
especially
perspective
goal
analyzing
competitive
achieve
 
Concept 6:
converted
hello
big data converted
data converted
resource
relevant
relevant data
art
scientific
big
 
Concept 7:
converted
dig
users
competitive
data scientist
scientist
building
amounts
amounts data
find
 
Concept 8:
may
data scientist
scientist
provide
chil

In [155]:
lsa.components_

array([[  5.68167255e-03,   5.68167255e-03,   5.68167255e-03, ...,
          4.38095782e-03,   4.38095782e-03,   4.38095782e-03],
       [ -1.01002149e-02,  -1.01002149e-02,  -1.01002149e-02, ...,
         -7.49848859e-03,  -7.49848859e-03,  -7.49848859e-03],
       [ -2.37990997e-03,  -2.37990997e-03,  -2.37990997e-03, ...,
          1.86152381e-03,   1.86152381e-03,   1.86152381e-03],
       ..., 
       [ -1.88577217e-02,  -1.88577217e-02,  -1.88577217e-02, ...,
         -3.18150966e-03,  -3.18150966e-03,  -3.18150966e-03],
       [ -4.78584275e-05,  -4.78584275e-05,  -4.78584275e-05, ...,
         -1.21575209e-02,  -1.21575209e-02,  -1.21575209e-02],
       [ -6.91126091e-01,   4.24611441e-01,   8.54192316e-03, ...,
          7.58532068e-04,   7.58532068e-04,  -5.36711691e-04]])