**Best viewed via Jupyter nbviewer**: https://nbviewer.jupyter.org/github/Josh-Been/LSI-Topic-Model/blob/master/Baylor-Libraries-LSI-Topic-Model.ipynb
![alt text](https://github.com/Josh-Been/Sentiment-Per-Line/blob/master/Capture.PNG?raw=true "Baylor University Libraries")

## LSI-Topic-Model

Generates related keywords from a corpus, bundled together into topic areas. Ten keywords are generated per topic. A single text file is uploaded, and each line is treated as a separate document.

Baylor University Libraries: LSI Topic Model

Implements Latent Semantic Indexing (LSI).

From Wikipedia (https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing): "Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings."
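
To make the SVD step in the quote concrete, here is the standard textbook formulation. It is not taken from this notebook's code, and the symbols $X$, $U$, $\Sigma$, $V$, $m$, $n$, and $k$ are notation introduced here for illustration. Let $X$ be the $m \times n$ term-document matrix, with one row per term and one column per document. The SVD factors $X$, and LSI keeps only the $k$ largest singular values:

$$
X = U \Sigma V^{T}, \qquad X \approx X_k = U_k \Sigma_k V_k^{T}
$$

Here $U_k$ is $m \times k$, $\Sigma_k$ is a $k \times k$ diagonal matrix, and $V_k$ is $n \times k$. Each column of $U_k$ assigns a weight to every term and can be read as one topic. In this notebook, $k$ would correspond to `number_of_topics`, and the keywords reported for a topic are presumably the terms with the largest-magnitude weights in that column.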

This Python application was built by the Baylor University Libraries to help researchers run unsupervised topic modelling on one-line documents, such as tweets.

<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/anaconda.png?raw=true" width=200>
<p>**First**, ensure the Python 2.7 version of Anaconda is installed on your system. If it is not, download and install it from https://www.anaconda.com/download/ and then continue with the next step.</p>

<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/jupyter.png?raw=true" width=200>
**Second**, launch the Jupyter Notebook application. If you do not see a launcher for Jupyter Notebook in the Anaconda application directory, launch Anaconda Navigator which will have a link to Jupyter Notebook.

<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/rightclicksave.png?raw=true" hspace="20" width=100>

**Third**, download the Jupyter Notebook file https://raw.githubusercontent.com/Josh-Been/LSI-Topic-Model/master/Baylor-Libraries-LSI-Topic-Model.ipynb to your computer. In the Jupyter browser tab that opened in the previous step, click the Upload button and browse for the saved Jupyter Notebook file.

Up to this point you have been reading an HTML version of this Notebook.

Now switch to the interactive version in Jupyter.


<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/gensim.png?raw=true" hspace="20" width=200>
**Fourth**, ensure you have the Gensim library ("Topic Modelling for Humans") installed. If you are confident you have already installed Gensim, skip this step. Otherwise, put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.


For background on Gensim - https://radimrehurek.com/gensim/

#### NOTE: The command below may take a minute or two before any response is given, depending on the speed of the computer and the network connection. Please be patient before moving on to the next step. Now is a good time to go and grab that coffee.

In [None]:
!conda install -c anaconda gensim

<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/pyenchant.png?raw=true" hspace="20" width=200>
**Fifth**, ensure you have PyEnchant, a spellchecking library for Python, installed. If you are confident you have already installed PyEnchant, skip this step. Otherwise, put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.


For background on PyEnchant - http://pythonhosted.org/pyenchant/

In [None]:
!pip install pyenchant
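
If you are not sure whether the two libraries are already installed, a quick check like the following may help before running the install cells. This is a minimal sketch, not part of the original notebook: `is_installed` is a hypothetical helper name, and it simply tries to import each module (`gensim`, and `enchant`, which is the module name PyEnchant is imported under in the cells below).

```python
import importlib


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Report the status of the two libraries this notebook depends on.
for name in ("gensim", "enchant"):
    status = "installed" if is_installed(name) else "missing"
    print("%s: %s" % (name, status))
```

A module reported as missing is the cue to run the corresponding install cell above.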

<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/txt.png?raw=true" hspace="20" width=100>

**Sixth**, browse for the text file containing lines of documents. Put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.

In [None]:
import warnings, enchant
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim import corpora, models
from Tkinter import *
from tkFileDialog import askopenfilename

# Open a file picker and remember the chosen path
root_stop = Tk()
txt_file = askopenfilename()
print(txt_file)
root_stop.update()
root_stop.destroy()

# Read the corpus: each line of the file becomes one document
documents = []
with open(txt_file, 'r') as f:
    for line in f:
        documents.append(line)
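
The "one line, one document" convention used above can be sketched without the file dialog. The snippet below is an illustration under stated assumptions rather than the notebook's own code: `read_documents` is a hypothetical helper, the file is assumed to be UTF-8, and, unlike the cell above, it strips line endings and skips blank lines.

```python
import io
import os
import tempfile


def read_documents(path, encoding="utf-8"):
    """Return one document per non-empty line of the text file at `path`."""
    documents = []
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines so they do not become empty documents
                documents.append(line)
    return documents


# Demonstration with a throwaway file holding two documents and a blank line.
descriptor, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(descriptor)
with io.open(demo_path, "w", encoding="utf-8") as out:
    out.write(u"first tweet\n\nsecond tweet\n")
demo_documents = read_documents(demo_path)
os.remove(demo_path)
print(demo_documents)
```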

<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/stopwords_list.gif?raw=true" hspace="20" width=200>

**Seventh**, browse for a stop word list. This list must be a text file with one word per line. Put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.

There are numerous stop word lists on the internet. One example, with lists in many languages, is https://github.com/Alir3z4/stop-words. It is advisable to tailor the list to your corpus.

In [None]:
# Open a file picker and remember the chosen stop word file
root_lines = Tk()
stoptxt = askopenfilename()
print(stoptxt)
root_lines.update()
root_lines.destroy()

# Read one stop word per line, dropping line endings
stoplist = []
with open(stoptxt, 'r') as f1:
    for line in f1:
        line = line.replace('\n','')
        line = line.replace('\r','')
        stoplist.append(line)
# Add extra tokens to ignore (retweet markers, HTML entities, and similar)
stoplist.append('rt')
stoplist.append('&gt;')
stoplist.append('sho')
stoplist.append('&amp;:)')
stopset = set(stoplist)
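
Tailoring the stop word list to a corpus usually means adding a few corpus-specific tokens to a downloaded list, as the cell above does for Twitter text. A minimal sketch of that idea follows. `build_stopset` and its `extra` parameter are hypothetical names, and lowercasing the entries is an assumption made here because the documents are lowercased before comparison in the final step.

```python
def build_stopset(lines, extra=()):
    """Build a set of stop words from raw file lines plus extra terms."""
    stopset = set()
    for line in lines:
        word = line.strip().lower()  # drop \r, \n and surrounding spaces
        if word:
            stopset.add(word)
    # Corpus-specific additions, e.g. retweet markers or HTML entities.
    stopset.update(term.lower() for term in extra)
    return stopset


# Lines as they might come from a file with Windows or Unix line endings.
demo_stopset = build_stopset(["The\r\n", "and\n", "\n"], extra=["rt", "&amp;"])
print(sorted(demo_stopset))
```

Using a set rather than a list keeps each `word not in stopset` check fast, which matters when every token of every document is tested.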

<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/options.png?raw=true" hspace="20" width=100>
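**Eighth**, specify the following options. `number_of_topics` sets how many topics LSI extracts. When `limit_proper_english_words` is set to 'true', only words found in PyEnchant's en_US dictionary are kept. Then, put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.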

In [None]:
number_of_topics = 5
limit_proper_english_words = 'true'
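
To see what the `limit_proper_english_words` option changes, the sketch below mimics the token filter used in the final step. It is an approximation for illustration only: `keep_token` is a hypothetical helper, and a small hand-made word set (`demo_dictionary`) stands in for the PyEnchant dictionary so that the example runs without extra libraries.

```python
def keep_token(word, stopset, english_only, is_english):
    """Decide whether a lowercased token should stay in the corpus."""
    if word in stopset or "http" in word or word.isdigit():
        return False
    if english_only:
        # Stricter mode: the token must also pass the dictionary check.
        return word.islower() and is_english(word)
    return True


# Stand-in for a real spellchecking dictionary (an assumption of this sketch).
demo_dictionary = set(["storm", "warning"])
demo_tokens = "rt storm warning 2017 http://t.co/x zzzq".split()

strict = [w for w in demo_tokens
          if keep_token(w, set(["rt"]), True, lambda t: t in demo_dictionary)]
loose = [w for w in demo_tokens
         if keep_token(w, set(["rt"]), False, lambda t: t in demo_dictionary)]
print(strict)
print(loose)
```

With the option on, unknown strings such as `zzzq` are dropped. With it off, they are kept, which may be preferable for corpora full of hashtags, names, or non-English text.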

<img style="float: left;" src="https://github.com/Josh-Been/LSI-Topic-Model/blob/master/lsi.png?raw=true" hspace="20" width=200>
**Ninth**, calculate the LSI topics for the corpus. Put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.

In [None]:
# Assign default values in case the options cell was not run
try:
    number_of_topics
except NameError:
    number_of_topics = 5
    limit_proper_english_words = 'true'

d = enchant.Dict('en_US')
if limit_proper_english_words == 'true':
    texts = [[word for word in document.lower().split() if (word not in stopset and 'http' not in word and not word.isdigit() and word.islower() and d.check(word))]
         for document in documents]
else:
    texts = [[word for word in document.lower().split() if (word not in stopset and 'http' not in word and not word.isdigit())]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)
# dictionary.save('/tmp/twitter.dict')

corpus = [dictionary.doc2bow(text) for text in texts]
# corpora.MmCorpus.serialize('/tmp/twitter.mm', corpus)

tfidf = models.TfidfModel(corpus)

corpus_tfidf = tfidf[corpus]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=number_of_topics)
corpus_lsi = lsi[corpus_tfidf]

lsi.print_topics(number_of_topics)
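
The preprocessing in the cell above (dropping words that occur only once, then turning each document into a bag of words) can be sketched with the standard library alone. This is not how Gensim implements `Dictionary` or `doc2bow` internally. `prune_singletons` and `to_bag_of_words` are hypothetical helpers, and the integer ids they assign may differ from the ones Gensim would choose.

```python
from collections import Counter


def prune_singletons(texts):
    """Drop tokens that occur only once across the whole corpus."""
    frequency = Counter(token for text in texts for token in text)
    return [[token for token in text if frequency[token] > 1]
            for text in texts]


def to_bag_of_words(texts):
    """Map tokens to integer ids and count them per document."""
    token_ids = {}
    corpus = []
    for text in texts:
        counts = Counter()
        for token in text:
            # Assign the next free id the first time a token is seen.
            token_id = token_ids.setdefault(token, len(token_ids))
            counts[token_id] += 1
        corpus.append(sorted(counts.items()))
    return token_ids, corpus


demo_texts = [["storm", "warning", "storm"], ["flood", "warning"], ["sunny"]]
demo_pruned = prune_singletons(demo_texts)
demo_ids, demo_corpus = to_bag_of_words(demo_pruned)
print(demo_pruned)
print(demo_corpus)
```

Note that a document can end up empty after pruning, as the third one does here. Very short documents such as tweets are especially prone to this, so a small corpus may yield sparse topics.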