# Latent Semantic Analysis:
## _Data Mining for Meaningful Concepts In Christianity Newsgroups_
---

Prepared By: Jason Schenck  
Date: February 6th 2017  
CSC-570 Data Science Essentials


<br>
<big>Table Of Contents</big>

---
* **[1 Introduction][Introduction]**
   * [1.1][1.1] _Purpose & Data Source_
   * [1.2][1.2] _What is a "Latent Semantic Analysis"(LSA)?_
   * [1.3][1.3] _Terminology Defined_
   * [1.4][1.4] _Process/Procedure & Methodology_


* **[2 Data: Retrieval, Parsing, & Cleansing][Data: Retrieval, Parsing, & Cleansing]**
   * [2.1][2.1] _Retrieval_
     * [2.1.1][2.1.1] Retrieving the dataset
     * [2.1.2][2.1.2] Parsing with BeautifulSoup
   * [2.2][2.2] _Inspection_
     * [2.2.1][2.2.1] Overview
     * [2.2.2][2.2.2] Word Analysis
     * [2.2.3][2.2.3] Cleansing via "stopset" definition


* **[3 TF-IDF Vectorization][TF-IDF Vectorization]**
   * [3.1][3.1] _Overview: Vectorizing_
   * [3.2][3.2] _TF-IDF Vectorization with Scikit-Learn_
     * [3.2.1][3.2.1] Function & Syntax Documentation
     * [3.2.2][3.2.2] Parameters


* **[4 Latent Semantic Analysis (LSA)][Lexical Semantic Analysis (LSA)]**
   * [4.1][4.1] _Overview: Theory & Practice_
   * [4.2][4.2] _Mathematics: Singular Value Decomposition (SVD)_
   * [4.3][4.3] _SVD Modeling with Scikit-Learn_
     * [4.3.1][4.3.1] Function & Syntax Documentation
     * [4.3.2][4.3.2] Parameters
   * [4.4][4.4] _Producing A Meaningful Output Of Concepts_
     * [4.4.1][4.4.1] TruncatedSVD() Output
     * [4.4.2][4.4.2] Converting Document Matrices to Concepts


* **[5 Results: Interpretation Of Extracted Concepts][Results: Interpration Of Extracted Concepts]**
    * [5.1][5.1] _Output_
    * [5.2][5.2] _Observations & Opinions_


     
[Introduction]: #1-Introduction
[1.1]: #1.1-Purpose-&-Data-Source
[1.2]: #1.2-What-is-a-"Latent-Semantic-Analysis"(LSA)?
[1.3]: #1.3-Terminology-Defined
[1.4]: #1.4-Process/Procedure-&-Methodology
[Data: Retrieval, Parsing, & Cleansing]: #2-Data:-Retrieval,-Parsing,-&-Cleansing
[2.1]: #2.1-Retrieval
[2.1.1]: #2.1.1-Retrieving-the-dataset
[2.1.2]: #2.1.2-Parsing-with-BeautifulSoup
[2.2]: #2.2-Inspection
[2.2.1]: #2.2.1-Overview
[2.2.2]: #2.2.2-Word-Analysis
[2.2.3]: #2.2.3-Cleansing-via-"stopset"-definition
[TF-IDF Vectorization]: #3-TF-IDF-Vectorization
[3.1]: #3.1-Overiview:-Vectorizing
[3.2]: #3.2-TF-IDF-Vectorization-with-Scikit-Learn
[3.2.1]: #3.2.1-Function-&-Syntax-Documentation
[3.2.2]: #3.2.2-Parameters
[Lexical Semantic Analysis (LSA)]: #4-Latent-Semantic-Analysis-(LSA)
[4.1]: #4.1-Overview:-Theory-&-Practice
[4.2]: #4.2-Mathematics:-Singular-Value-Decomposition-(SVD)
[4.3]: #4.3-SVD-Modeling-with-Scikit-Learn

[4.3.1]: #4.3.1-Function-&-Syntax-Documentation
[4.3.2]: #4.3.2-Parameters

[4.4]: #4.4-Producing-A-Meaningful-Output-Of-Concepts

[4.4.1]: #4.4.1-TruncatedSVD()-Output
[4.4.2]: #4.4.2-Converting-Document-Matrices-to-Concepts

[Results: Interpration Of Extracted Concepts]: #5-Results:-Interpration-Of-Extracted-Concepts
[5.1]: #5.1-Output
[5.2]: #5.2-Observations-&-Opinions

<br>


<div class="alert alert-success">
<b>Data Source</b> ["Twenty Newsgroups", Provided By: Scikit-Learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#)
</div>

---

### 1 Introduction

#### 1.1 Purpose & Data Source

In this analysis I will be performing data mining in an effort to extract a series of meaningful and significant concepts from a public dataset of newsgroup postings on the topic of Christianity.

The dataset, titled "Twenty Newsgroups", is officially described as follows:
>"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."

A newsgroup is an online public forum for discussion on a particular topic. The topic that I will be extracting data from is "Christianity" (soc.religion.christian). I'm very curious to see what the results of this analysis will be, and in the conclusion I intend to share my opinion on them.

#### 1.2 What is a "Latent Semantic Analysis"(LSA)?

Latent Semantic Analysis (LSA) is a technique commonly used in the field of Natural Language Processing (NLP). NLP is the area of computer science concerned with the interactions between computers and human language. A large portion of the field focuses on analyzing the relationships between words in a document that belongs to a larger collection of documents. This subfield is known as Natural Language Understanding and can be thought of more simply as teaching computers how to read.

LSA is more formally defined in ["An Introduction to Latent Semantic Analysis" by Landauer, Foltz, & Laham](http://lsa.colorado.edu/papers/dp1.LSAintro.pdf):
>"Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the
contextual-usage meaning of words by statistical computations applied to a large corpus of
text (Landauer and Dumais, 1997). The underlying idea is that the aggregate of all the word
contexts in which a given word does and does not appear provides a set of mutual
constraints that largely determines the similarity of meaning of words and sets of words to
each other."

#### 1.3 Terminology Defined

The field of NLP defines a vast amount of new terminology. Below I will briefly define the terms of significance to LSA that I will be using regularly throughout this analysis.

* **Word** - A single English word in text.
* **Bag Of Words (BOW)** - An abstraction model in NLP where we consider each document of text to simply be a "bag of words" in the literal sense, such that grammar, word order, and conceptual meaning are ignored.
* **Term Frequency–Inverse Document Frequency (TF-IDF)** - A numerical score for how important a word is to a document within a collection. It rises with the word's frequency in that document and falls as the word appears in more documents of the collection.
* **Term** - A single word found in a document of text.
* **Document** - A single collection of terms.
* **Corpus** - A single collection of documents.
* **Concept** - The final output of an LSA is a list of concepts. These are words, or multiple words together, which were found to have the highest significance across our corpus.
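To make the BOW and TF-IDF definitions above concrete, here is a minimal sketch in plain Python. The three-document corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration, and the formula is the basic unsmoothed textbook variant. Library implementations such as Scikit-Learn's apply smoothing and normalization, so their scores will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each string is one document.
corpus = [
    "god loves the church",
    "the church teaches faith",
    "faith and prayer",
]

# Bag of words: split each document into terms and count them, ignoring order.
bags = [Counter(doc.split()) for doc in corpus]


def tf(term, bag):
    """Term frequency: share of the document's terms that are `term`."""
    return bag[term] / sum(bag.values())


def idf(term, bags):
    """Inverse document frequency: log(N / number of documents containing term)."""
    containing = sum(1 for bag in bags if term in bag)
    if containing == 0:
        return 0.0  # term never occurs in the corpus
    return math.log(len(bags) / containing)


def tf_idf(term, bag, bags):
    """Basic TF-IDF score of `term` for one document of the corpus."""
    return tf(term, bag) * idf(term, bags)


# "god" occurs in one document, "the" in two, so "god" scores higher in document 0.
print(tf_idf("god", bags[0], bags))
print(tf_idf("the", bags[0], bags))
```

A term that is frequent in one document but rare across the corpus receives the highest score, which is the behavior the definition describes.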

#### 1.4 Process/Procedure & Methodology

In brief, the overall process required to perform an LSA can be summarized in 7 steps:

1. Collect/Retrieve a dataset containing text of interest. 
2. Define which text in the dataset will be represented as documents (sentences, discussion board posts, news articles, etc.).
3. Using the BOW model, parse the dataset by document and store the words of each document in its own bag. The end result should be a collection of documents of tokenized words.
4. Clean the data. Remove as many unnecessary words and characters as possible.
5. Perform TF-IDF Vectorization. This scores the words as terms for each document and across the document collection as a whole.
6. Perform matrix decomposition using the Singular Value Decomposition (SVD) algorithm.
7. Output a list of the concepts extracted.
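
The steps above can be sketched end to end on a toy corpus. This is only an illustration under stated assumptions, not the analysis itself: the four documents are made up, Scikit-Learn's built-in English stop-word list stands in for a custom stopset, and the choice of 2 components and 3 terms per concept is arbitrary.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Steps 1-2: a tiny hypothetical corpus; each string is one document.
documents = [
    "God loves the church and the people of the church",
    "The church teaches faith and prayer",
    "Faith in God is strengthened by prayer",
    "People ask questions about scripture and faith",
]

# Steps 3-5: tokenize, drop English stop words, and compute TF-IDF scores.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)  # sparse matrix: documents x terms

# Recover the term behind each column index from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Step 6: truncated SVD keeps only the strongest latent dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(tfidf)

# Step 7: describe each component by its highest-weighted terms.
concepts = []
for component in svd.components_:
    ranked = sorted(zip(terms, component), key=lambda pair: abs(pair[1]), reverse=True)
    concepts.append([term for term, _ in ranked[:3]])

for index, top_terms in enumerate(concepts):
    print(f"Concept {index}: {top_terms}")
```

Each row of `svd.components_` holds one weight per term, so sorting a row by absolute weight gives the terms that contribute most to that concept.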

Now we can begin our preparations for LSA, starting with step 1, importing the dataset.

### 2 Data: Retrieval, Parsing, & Cleansing

#### 2.1 Retrieval

##### 2.1.1 Retrieving the dataset

##### 2.1.2 Parsing with BeautifulSoup

#### 2.2 Inspection

##### 2.2.1 Overview

##### 2.2.2 Word Analysis

##### 2.2.3 Cleansing via "stopset" definition

### 3 TF-IDF Vectorization

#### 3.1 Overiview: Vectorizing

#### 3.2 TF-IDF Vectorization with Scikit-Learn

##### 3.2.1 Function & Syntax Documentation

##### 3.2.2 Parameters

### 4 Latent Semantic Analysis (LSA)

#### 4.1 Overview: Theory & Practice

#### 4.2 Mathematics: Singular Value Decomposition (SVD)

#### 4.3 SVD Modeling with Scikit-Learn

##### 4.3.1 Function & Syntax Documentation

##### 4.3.2 Parameters

#### 4.4 Producing A Meaningful Output Of Concepts

##### 4.4.1 TruncatedSVD() Output

##### 4.4.2 Converting Document Matrices to Concepts

### 5 Results: Interpration Of Extracted Concepts

#### 5.1 Output

#### 5.2 Observations & Opinions