# Latent Semantic Analysis:
## _Data Mining for Meaningful Concepts In Christianity Newsgroups_
---

Prepared By: Jason Schenck  
Date: February 6th 2017  
CSC-570 Data Science Essentials


<br>
<big>Table Of Contents</big>

---
* **[1 Introduction][Introduction]**
   * [1.1][1.1] _Purpose & Data Source_
   * [1.2][1.2] _What is a "Latent Semantic Analysis"(LSA)?_
   * [1.3][1.3] _Terminology Defined_
   * [1.4][1.4] _Process/Procedure & Methodology_


* **[2 Data Preparation][Data Preparation]**
   * [2.1][2.1] _Data Retrieval_
   * [2.2][2.2] _Data Inspection_
   * [2.3][2.3] _Defining 'stopwords'_


* **[3 Latent Semantic Analysis (LSA)][Latent Semantic Analysis (LSA)]**
   * [3.1][3.1] _TF-IDF Vectorization_
   * [3.2][3.2] _3.2 SVD Modeling with Scikit-Learn_


* **[4 Results: Interpration Of Extracted Concepts][Results: Interpration Of Extracted Concepts]**



     
[Introduction]: #1-Introduction
[1.1]: #1.1-Purpose-&-Data-Source
[1.2]: #1.2-What-is-a-"Latent-Semantic-Analysis"(LSA)?
[1.3]: #1.3-Terminology-Defined
[1.4]: #1.4-Process/Procedure-&-Methodology
[Data Preparation]: #2-Data-Preparation
[2.1]: #2.1-Data-Retrieval
[2.2]: #2.2-Data-Inspection
[2.3]: #2.3-Defining-'stopwords'
[Latent Semantic Analysis (LSA)]: #3-Latent-Semantic-Analysis-(LSA)
[3.1]: #3.1-TF-IDF-Vectorization
[3.2]: #3.2-SVD-Modeling-with-Scikit-Learn
[Results: Interpration Of Extracted Concepts]: #4-Results:-Interpration-Of-Extracted-Concepts


<br>

<html>
<div class="alert alert-success">
<b>Data Source</b> ["Twenty Newsgroups", Provided By: Scikit-Learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#)
</div>
</html>
---

### 1 Introduction

#### 1.1 Purpose & Data Source

In this analysis I will be performing data mining in an effort to extract a series of meaningful and significant concepts from a public dataset of newsgroup postings on the topic of Christianity.

The dataset, titled "Twenty Newsgroups" and is officially described as follows:
>"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."

A newsgroup is an online public forum for discussion on a particular topic. The topic that I will be extracting data from will be "Christianity" (soc.religion.christian). I'm very curious to see what the results of this analysis will be, and in concluding intend to share my opinion on them. 

#### 1.2 What is a "Latent Semantic Analysis"(LSA)?

Latent Semantic Analysis (LSA) is a technique commonly used in the field of Natural Language Processing (NLP). As a computer scientist, when performing NLP we are concerned with studying the interactions and between computers and human language. A great portion of this field focuses on the analysis of the relationship between multiple words in a document of text containing in a collection of documents. This is known as the subfield of Natural Language Understanding and can be thought of more simply as teaching computers how to read. 

LSA is more formally defined by ["An Introduction to Latent Semantic Analysis" by Landauer, Foltz, & Laham](http://lsa.colorado.edu/papers/dp1.LSAintro.pdf)
>"Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the
contextual-usage meaning of words by statistical computations applied to a large corpus of
text (Landauer and Dumais, 1997). The underlying idea is that the aggregate of all the word
contexts in which a given word does and does not appear provides a set of mutual
constraints that largely determines the similarity of meaning of words and sets of words to
each other."

#### 1.3 Terminology Defined

There is a vast list of new terminoloy defined by the field of NLP. Below I will briefly define those of significance to LSA that I will be using regularly throughout this analysis.

* **Word** - A single English word in text.
* **Bag Of Words (BOW)** - An abstraction model in NLP where we consider each document of text to simply be a "bag of words" in the literal sense, such that grammar and conceptual meaning is ignored.
* **Term Frequency–Inverse Document Frequency (TF-IDF)** - A mathematical calculation for scoring the importance of a word in a document or a collection. 
* **Term** - A single word found in a document of text.
* **Document** - A single collection of terms.
* **Corpus** - A single collection documents.
* **Concept** - The final output of and LSA is a list of concepts. These are words, or multiple words together, which were found to have the highest significance across our corpus.

#### 1.4 Process/Procedure & Methodology

In brief, I will summarize a list of 7 steps representing the overall process required to perform an LSA:

1. Collect/Retrieve a dataset containing text of interest. 
2. Define which text in the dataset will be represented as documents (sentences, discussion board poasts, news articles, ?)
3. Using the BOW model, parse by document and store words in a bag of words where each bag is a document. Ending result should be a collection of documents of tokenized words.
4. Clean the data. Remove as many unneccessary words and characters as possible.
5. Perform TF-IDF Vectorization. This scores the words as terms for each document and across the document collection as a whole. 
6. Matrix decomposition using the Singular Value Decomposition algorithm.
7. Output a list of concepts extracted. 

Now we can begin our prepartions for LSA, starting with step 1, importing the dataset.

### 2 Data Preparation

#### 2.1 Data Retrieval

In [1]:
# Imports, and dataset download via sk-learn
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import math
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import re

categories = ['soc.religion.christian']
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)
corpus = dataset.data

#### 2.2 Data Inspection

In [2]:
# Check how many documents (forum posts) are in the dataset
len(corpus)

997

In [3]:
# Check the first document
corpus[0]

'From: sciysg@nusunix1.nus.sg (Yung Shing Gene)\nSubject: Mission Aviation Fellowship\nOrganization: National University of Singapore\nLines: 3\n\nHi,\n\tDoes anyone know anything about this group and what they\ndo? Any info would be appreciated. Thanks!\n'

In [4]:
# * Uncomment to inspect the raw data *
"""
# Print the first 10 documents to inspect the data
for x in range(0,12):
    print(corpus[x])
"""

'\n# Print the first 10 documents to inspect the data\nfor x in range(0,12):\n    print(corpus[x])\n'

It appears our data is in plain text with no tagging. However, each post starts with a heading which I've noticed is also variable across the corpus. For example some posts start with a header containing "From", "Subject", and "Organization" while others do not. The following headers are present across the corpus document:  
* **From:** [ _email@emailaddress.com_ ]
* **Subject:** [ _topic_ ]
* **Reply-To:** [ _email@emailaddress.com_ ]
* **Organization:** [ _Organization Name_ ]
* **Lines:** [ _# Lines of post_ ]

Also, it appears that a post can be from either a public individual or a member of an organization. In either case, posts can be new posts or replies to other's posts. Every header ends with **"Lines:"** which tells us the number of lines of text contained in the post message itself.


Post content looks like it could be problemsome for LSA if I don't include plenty of words and characters to exclude with the "stopset". For example, below is a post that demonstrates what we are dealing with in regard to post content.

**Observations:**

We are going to want to exclude e-mail addresses and poster's names because these will hinder the performance of our analysis. In the above post, I see that quoted articles may be included as well. Quote blocks appear to contain a brief header, and then are marked with '>' before each line contained in the quote. I plan to leave the articles in the dataset for my first LSA attempt, but I will definitely need to exclude the '>' characters. Also punctuation and capitalization should be taken care of as well. The hardest part will be excluding e-mail addresses and names since they are so variable across the corpus. 

In [5]:
# Convert all text to lower-case
postDocs = [x.lower() for x in corpus]

# Check it
postDocs[0]

'from: sciysg@nusunix1.nus.sg (yung shing gene)\nsubject: mission aviation fellowship\norganization: national university of singapore\nlines: 3\n\nhi,\n\tdoes anyone know anything about this group and what they\ndo? any info would be appreciated. thanks!\n'

In [6]:
# Using regex, find and remove all e-mail addresses
corpus = [re.sub(r'(\s)(\S+\@\S+)(\s)', r'\1\3', corpus[x]) for x in range(len(corpus))]

# Ensure successful
corpus[6]

'From: \nSubject: tongues (read me!)\nLines: 8\n\nPersons interested in the tongues question are are invited to\nperuse an essay of mine, obtainable by sending the message\n GET TONGUES NOTRANS\n to  or to\n    \n\n Yours,\n James Kiefer\n'

#### 2.3 Defining 'stopwords'

Now that I have all of the e-mail addresses stored, as well as a fairly decent idea of what needs to be included, I will build the stopset. 

A stopset is a list of **'stopwords'** which will be excluded from analysis automatically by scikit-learn's vectorization algorithm. For this LSA, I'm going to use a combination of three lists for the first attempt: a stopset provided by my professor, one that I found online called the _terrier stop-set_, and all of the e-mail addresses of the corpus.

In order to do this, I will store Professor Bernico's set manually, then I will import the terrier set, and then finally I will union all three into one set with no duplicates.

In [7]:
# This set was provided to me, and updated by my professor
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jasonschenck/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:

# ORIGINAL
stopset = set(stopwords.words('english'))



stopset.update(['nova','gmi','khan0095','budd','28','30602','bud','nj','44','31','10','11','31','wkuvx1',
                'bitnet','easteee','holt','gatech','carol','howard','len','hampton','va','cs','terrance',
                'watt','apointee','acad1','sahs','uth','randerso','larc','gov','whitesbsd','nextwork','trol',
                'eeap','apr','r2d2','vbv','3062','7415','n4tmi','wbt','wycliffe','david','paul','ata','hfsi','uk',
                'fidonet','jeff','fenholt','indiana','fisher','microsystems','creps','alvin','netcom','andrew','fil',
                'revdak','jr','velasco','virgilio','ac','za','hayesstw','risc1','ucs','lee','nicholas',
                'mandock','randal','overacker','larry','bernard','elizabeth','dean','seanna','unisa','rose','bryan',
                'bnr','jayne','heath','scott','michael','llo','acs','vela','atterlep','lines',
                'petch','carlson','caralv','0358','706','542','subject','university','georgia','aisun3','reply-to',
                'organization','hulman','hayes','steve','mcovingt','ai','ca','covington','bigelow','eugene','tek',
                'gvg47','chuck','gvg','com','uga','bernadette','rutgers','edu','lt','p','/p','br','amp','quot',
                'font','span','0px','rgb','51','spacing','text','helvetica', 'arial','indent','line','none','sans',
                'serif','line','title','word','0pt','16','12','14','21', 'neue','johnsd2','rpi','mls','panix','ebay',
                'group','aa888','freenet','mark','carleton','ncr','cso','uxa','uiuc','bjorn','elsegundoca',
                'mit','koberg','gt7122b','oo','la','microsoft','kuhub','cc','ukans','codex','fnal','marka','csd',
                'sapienza','lady', ])



In [9]:
# Let's take a look
#stopset

In [10]:
# TERRIER
terrierstopset = open('terrierstopset.txt', 'r').read()
stopset = set(stopset).union(set(terrierstopset))

In [11]:
# * Uncomment to review the stopset wordlist *
"""

print(stopset)

"""

'\n\nprint(stopset)\n\n'

### 3 Latent Semantic Analysis (LSA)

#### 3.1 TF-IDF Vectorization

In [12]:
# Define the vectorizer model -- TfidfVectorizer(set stopwords = ?, use idf = true, num grams range = ?)
vectorizer = TfidfVectorizer(stop_words=stopset,use_idf=True, ngram_range=(2, 5))

# Fit the corpus data to the vectorizer model
X = vectorizer.fit_transform(postDocs)

In [13]:
# Tada! This is now the output of the first document in the corpus, in sparse idf matrix form.
print(X[0])

  (0, 463220)	0.122190056616
  (0, 365021)	0.106667010589
  (0, 364984)	0.100966773934
  (0, 475177)	0.122190056616
  (0, 591646)	0.122190056616
  (0, 476784)	0.122190056616
  (0, 215969)	0.122190056616
  (0, 342444)	0.122190056616
  (0, 58524)	0.122190056616
  (0, 200883)	0.122190056616
  (0, 352929)	0.106667010589
  (0, 483406)	0.122190056616
  (0, 248261)	0.11044732561
  (0, 41364)	0.0892240429277
  (0, 288806)	0.103578268315
  (0, 41981)	0.122190056616
  (0, 267272)	0.122190056616
  (0, 580547)	0.106667010589
  (0, 45007)	0.106667010589
  (0, 463221)	0.122190056616
  (0, 365022)	0.106667010589
  (0, 365006)	0.122190056616
  (0, 475178)	0.122190056616
  (0, 591647)	0.122190056616
  (0, 476785)	0.122190056616
  :	:
  (0, 58526)	0.122190056616
  (0, 200885)	0.122190056616
  (0, 352943)	0.122190056616
  (0, 483408)	0.122190056616
  (0, 248263)	0.115320999322
  (0, 41369)	0.122190056616
  (0, 288813)	0.122190056616
  (0, 41983)	0.122190056616
  (0, 267274)	0.122190056616
  (0, 463223)	0

In [14]:
# The current shape is (documents, terms)
X.shape

(997, 592004)

#### 3.2 SVD Modeling with Scikit-Learn

In [15]:
#Begin by defining the TruncatedSVD model (num rows/docs?, how many passes over the data (epochs)? )
#Note: n_iter defaults to 5 if not passed, and 1 if using partial_fit
lsa = TruncatedSVD(n_components=100, n_iter=5)

# Fit the model
lsa.fit(X)


TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
       random_state=None, tol=0.0)

In [16]:
# After decomposition, 'lsa.components_[]' represents matrix V'
lsa.components_[0]

array([  2.32819642e-04,   1.40661489e-04,   1.40661489e-04, ...,
         1.92725738e-05,   1.92725738e-05,   1.92725738e-05])

In [17]:
import sys
print (sys.version)

3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]


In [18]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print("Concept %d:" % i )
    for term in sortedTerms:
        print(term[0])
    print (" ")

Concept 0:
priest priest
four years
immaculate conception
answer priest
told priest
years old
case doctrine
apparition deemed
apparition deemed true
apparition deemed true sealed
 
Concept 1:
secretary interior
married god
god eyes
married god eyes
appointee james
appointee james pentacostal
appointee james pentacostal christian
appointee james pentacostal christian think
christian think
christian think secretary
 
Concept 2:
grass valley
daily verse grass
daily verse grass valley
daily verse grass valley grass
verse grass
verse grass valley
verse grass valley grass
verse grass valley grass valley
grass valley grass
grass valley grass valley
 
Concept 3:
secretary interior
appointee james
appointee james pentacostal
appointee james pentacostal christian
appointee james pentacostal christian think
christian think
christian think secretary
christian think secretary interior
christian think secretary interior saw
days would last
 
Concept 4:
married god
god eyes
married god eyes
two peopl

### 4 Results: Interpration Of Extracted Concepts

Work in progress...