## Text analysis with gensim

[Gensim](https://radimrehurek.com/gensim/intro.html) is a Python library designed to extract semantic topics from plain-text documents.

Gensim lets you examine statistical co-occurrence patterns of the words within a corpus in order to understand the semantic structure of documents. It uses algorithms such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections to look for topical similarities among documents. These algorithms are unsupervised, meaning they need no hand-labeled training data.

If you have not already done so, install gensim.

In [62]:
#!conda install -y gensim

### Goals

- Vectorize a corpus
- Get tf-idf scores for a corpus
- Compare an individual text to a corpus using LSI

### Important Terms, as defined in gensim

**Corpus**: A collection of digital documents. This is sometimes called a "**training corpus**" because it is used to infer the structure of the documents. This inferred latent structure can be used later to examine new documents, which did not appear in the original corpus. 

**Vector**: Gensim relies on the Vector Space Model (VSM), in which each document is represented by an array of features. Each feature can be thought of as a question-answer pair. For example:

 - How many paragraphs does the document consist of? Seven.
 - How many times does the word "cat" appear in the document? Zero.
 - How many commas are there in the document? Fourteen.

The question is usually represented by an integer ID, so the document is represented as a series of pairs like (1, 7.0), (2, 0.0), (3, 14.0). However, if the questions are known in advance, they can be left implicit, and the document can be written simply as (7.0, 0.0, 14.0). Only questions whose answer can be expressed as a single real number will work. This sequence of answers can be thought of as a multi-dimensional vector (here, three-dimensional). The same questions are applied to each document, so that conclusions about similarity can be drawn (as long as the questions are well chosen).

**Sparse Vector**: A vector in which most of the answers are 0.0. Usually the answer to most questions will be 0.0, so to save space gensim omits them from the document's representation. In the example above, this means the document would be represented as (1, 7.0), (3, 14.0), with (2, 0.0) left out.
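
To make the dense/sparse distinction concrete, here is a minimal plain-Python sketch using the three-question example above. The helper names `to_sparse` and `to_dense` are hypothetical and are not part of gensim; the sketch only illustrates the idea of omitting zero-valued answers.

```python
def to_sparse(pairs):
    """Keep only the (feature_id, value) pairs whose value is non-zero."""
    return [(fid, value) for fid, value in pairs if value != 0.0]


def to_dense(sparse_pairs, feature_ids):
    """Rebuild the full list of pairs, filling omitted ids with 0.0."""
    lookup = dict(sparse_pairs)
    return [(fid, lookup.get(fid, 0.0)) for fid in feature_ids]


# The example document: 7 paragraphs, 0 occurrences of "cat", 14 commas
document = [(1, 7.0), (2, 0.0), (3, 14.0)]

sparse = to_sparse(document)
print(sparse)                       # [(1, 7.0), (3, 14.0)]
print(to_dense(sparse, [1, 2, 3]))  # [(1, 7.0), (2, 0.0), (3, 14.0)]
```

Nothing is lost by dropping the zeros: as long as the full list of question IDs is known, the dense form can always be rebuilt.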

**Model**: An abstract term referring to a transformation from one document representation to another. Since documents are represented as vectors, in gensim a model can be thought of as a transformation between two vector spaces.

### Transformations available in gensim

[Term Frequency * Inverse Document Frequency (TF-IDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

[Latent Semantic Indexing (LSI or sometimes LSA)](https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing)

[Random Projection (RP)](http://users.ics.aalto.fi/ella/publications/randproj_kdd.pdf)

[Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

[Hierarchical Dirichlet Process (HDP)](http://proceedings.mlr.press/v15/wang11a/wang11a.pdf)

Gensim's explanation of each of these transformations can be found [here](https://radimrehurek.com/gensim/tut2.html).
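
As a rough illustration of what a transformation does, the following plain-Python sketch (written for Python 3) applies a basic TF-IDF weighting to a toy bag-of-words corpus. It is not gensim's implementation: the function name `tfidf_weights` and the toy data are made up for this example, the weight is assumed to be `count * log2(N / document_frequency)`, and refinements such as length normalization are left out.

```python
import math


def tfidf_weights(bow_corpus):
    """Re-weight each (token_id, count) pair as count * log2(N / df), dropping zero weights."""
    num_docs = len(bow_corpus)

    # Document frequency: in how many documents does each token ID occur?
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _count in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weighted = []
        for token_id, count in doc:
            weight = count * math.log2(num_docs / doc_freq[token_id])
            if weight > 0:
                weighted.append((token_id, weight))
        weighted_corpus.append(weighted)
    return weighted_corpus


# Toy corpus (made up for illustration): token 0 occurs in every document
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (1, 2)], [(0, 3), (2, 1)], [(0, 1)]]
result = tfidf_weights(toy_corpus)
print(result)
# [[(1, 1.0)], [(1, 2.0)], [(2, 2.0)], []]
```

Token 0 appears in every document, so its weight drops to zero and it disappears from the output, while the rarer tokens 1 and 2 are kept. The input and output are both lists of sparse vectors, which is what makes this a transformation from one vector space to another.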


### Testing gensim on a small corpus

Import corpora from gensim and create a corpus.

In [1]:
from gensim import corpora
raw_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Lowercase each document, split it on whitespace, and filter out stopwords.

In [2]:
stoplist = set('for a of the and to in'.split(' '))
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]

Count word frequencies and discard words that appear only once.

**Note**: Approaches to this initial text processing vary, and gensim is designed to accommodate that.

In [3]:
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
processed_corpus

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

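Assign a unique integer ID to every word appearing in the corpus. Note that the dictionary here is built from `texts` (stopwords removed, but before the frequency filter), so it contains all 35 tokens, including words that appear only once.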

In [4]:
dictionary = corpora.Dictionary(texts)
#dictionary.save('/Users/annapreus/practice.dict')  # store the dictionary, for future reference
print(dictionary)
print("")
print(dictionary.token2id)

Dictionary(35 unique tokens: [u'minors', u'generation', u'testing', u'iv', u'engineering']...)

{u'minors': 30, u'generation': 22, u'testing': 16, u'iv': 29, u'engineering': 15, u'computer': 2, u'relation': 20, u'human': 3, u'measurement': 18, u'unordered': 25, u'binary': 21, u'abc': 0, u'ordering': 31, u'graph': 26, u'system': 10, u'machine': 6, u'quasi': 32, u'random': 23, u'paths': 28, u'error': 17, u'trees': 24, u'lab': 5, u'applications': 1, u'management': 14, u'user': 12, u'interface': 4, u'intersection': 27, u'response': 8, u'perceived': 19, u'widths': 34, u'well': 33, u'eps': 13, u'survey': 9, u'time': 11, u'opinion': 7}
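
The dictionary above was built from `texts`, which is why it reports 35 unique tokens. For comparison, here is a plain-Python sketch of the same idea applied to the frequency-filtered `processed_corpus`. The helper `build_token2id` is hypothetical and simply numbers tokens in order of first appearance; the IDs gensim assigns will generally differ.

```python
processed_corpus = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]


def build_token2id(tokenized_docs):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


token2id = build_token2id(processed_corpus)
print(len(token2id))        # 12
print(token2id['human'])    # 0
print(token2id['minors'])   # 11
```

After dropping words that occur only once, just 12 distinct tokens remain, so vectors built on this vocabulary would be much shorter than those built on the 35-token dictionary.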


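Convert tokenized documents to vectors. `doc2bow` returns a sparse bag-of-words vector: a list of (token ID, count) pairs for the words found in the dictionary.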

In [5]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(2, 1), (3, 1)]
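
The output contains only two pairs because "interaction" does not appear in the dictionary, so it is left out. A plain-Python sketch of this bag-of-words conversion might look like the following. The function `doc_to_bow` is a hypothetical stand-in, not gensim's code, and the small `token2id` mapping is a subset copied from the dictionary output above.

```python
from collections import Counter


def doc_to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Subset of the token -> id mapping printed above
token2id = {'computer': 2, 'human': 3, 'interface': 4, 'system': 10}

print(doc_to_bow("Human computer interaction".lower().split(), token2id))
# [(2, 1), (3, 1)]
print(doc_to_bow("system human system".split(), token2id))
# [(3, 1), (10, 2)]
```

The second call shows how repeated words are handled: "system" occurs twice, so its pair carries a count of 2.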


In [6]:
from pprint import pprint
corpus = [dictionary.doc2bow(text) for text in texts]
#corpora.MmCorpus.serialize('corpus.mm', corpus)  # store to disk, for later use
pprint(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)],
 [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)],
 [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
 [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)],
 [(24, 1), (26, 1), (27, 1), (28, 1)],
 [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)],
 [(9, 1), (26, 1), (30, 1)]]


### Working with a larger corpus

Read in a larger corpus and preprocess the texts.

In [8]:
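import glob, re, string
from collections import defaultdict

# Show the characters in string.punctuation, followed by common whitespace characters
print(string.punctuation + ' \n\t')

path = "../corpora/shakespeare_plaintext/*.txt"
filepaths = glob.glob(path)

documents = []
punctuations = "-,.?!;: \n\t"

# Read each file into memory as a single string
for fn in filepaths:
    with open(fn, 'r') as f:
        documents.append(f.read())

# Lowercase, split on whitespace, drop stopwords, then strip punctuation from each token.
# The stopword check runs before stripping, so a token such as "the," is not removed.
stoplist = set('for a of the and to in'.split())
texts = [[word.strip(punctuations) for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Count how often each token occurs across the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep only tokens that occur more than once
texts = [[token for token in text if frequency[token] > 1] for text in texts]

print(len(texts))
print(texts[:1])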

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 
	
36
[['if', 'music', 'be', 'food', 'love', 'play', 'on', 'give', 'me', 'excess', 'it', 'that', 'surfeiting', 'appetite', 'may', 'sicken', 'so', 'die', 'that', 'strain', 'again', 'it', 'had', 'dying', 'fall', 'o', 'it', 'came', "o'er", 'my', 'ear', 'like', 'sweet', 'sound', 'that', 'breathes', 'upon', 'bank', 'violets', 'stealing', 'giving', 'enough', 'no', 'more', "'tis", 'not', 'so', 'sweet', 'now', 'as', 'it', 'was', 'before', 'o', 'spirit', 'love', 'how', 'quick', 'fresh', 'art', 'thou', 'that', 'notwithstanding', 'thy', 'capacity', 'as', 'sea', 'nought', 'enters', 'there', 'what', 'validity', 'pitch', "soe'er", 'but', 'falls', 'into', 'abatement', 'low', 'price', 'even', 'minute', 'so', 'full', 'shapes', 'is', 'fancy', 'that', 'it', 'alone', 'is', 'high', 'fantastical', 'will', 'you', 'go', 'hunt', 'my', 'lord', 'what', 'hart', 'why', 'so', 'i', 'do', 'noblest', 'that', 'i', 'have', 'o', 'when', 'mine', 'eyes', 'did', 'see', 'olivia', 'first', 'me