# Basic Example LDA Topic Extraction - Text Mining

- English for simplicity

In [9]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

### Import/Create documents

In [10]:
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health." 

## Cleaning your documents

- Tokenizing: converting a document to its atomic elements.

    
- Stopping: removing meaningless words.

    
- Stemming: merging words that are equivalent in meaning.

### Tokenization
- Tokenization segments a document into its atomic elements. 


- In this case, we are interested in tokenizing to words. 


- Tokenization can be performed in many ways; we are using NLTK’s tokenize.regexp module:

In [11]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

A ``RegexpTokenizer`` splits a string into substrings using a regular expression.

In [12]:
raw = doc_a.lower()
tokens = tokenizer.tokenize(raw)

In [13]:
print(tokens)

['brocolli', 'is', 'good', 'to', 'eat', 'my', 'brother', 'likes', 'to', 'eat', 'good', 'brocolli', 'but', 'not', 'my', 'mother']
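
As a rough cross-check, the same tokens can be obtained with the standard library alone. This is only a sketch: it assumes that `RegexpTokenizer(r'\w+')`, with its default settings, returns every non-overlapping match of the pattern, which is what `re.findall` does. `simple_tokenize` is a hypothetical helper name, not part of NLTK.

```python
import re

def simple_tokenize(text):
    # Lowercase the text, then return every run of word characters.
    return re.findall(r"\w+", text.lower())

doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
tokens = simple_tokenize(doc_a)
print(tokens)
```

Punctuation disappears because `.` and `,` are not word characters, so they never become part of a match.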


### Stop words
- Conjunctions (“for”, “or”) and articles such as “the” carry little meaning for a topic model.


- These terms are called stop words and need to be removed from our token list.


- Example pitfall: when topic modeling a collection of music reviews, terms like “The Who” will be damaged by stop word removal.


- You are free to construct your own stop word list (as we do for Spanish).


- Here we use the stop_words package from PyPI, which provides a relatively conservative list.


- We can call get_stop_words() to create a list of stop words:

In [15]:
# create English stop words list
en_stop = get_stop_words('en')
en_stop

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 "can't",
 'cannot',
 'could',
 "couldn't",
 'did',
 "didn't",
 'do',
 'does',
 "doesn't",
 'doing',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 "hadn't",
 'has',
 "hasn't",
 'have',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 "here's",
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 "how's",
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 "let's",
 'me',
 'more',
 'most',
 "mustn't",
 'my',
 'myself',
 'no',
 'nor',
 'not',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'ought',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'same',
 "shan't",
 'she',
 "she'd",
 "she'll",
 "she's",
 'should',
 "s

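Filtering with a stop word list is a simple membership test. The sketch below uses a tiny hand-made list, `my_stop`, which is an assumption made for illustration and is far shorter than the list returned by `get_stop_words('en')`.

```python
# A tiny hand-made stop word list (illustrative only; not the stop_words list).
my_stop = {"is", "to", "my", "but", "not"}

tokens = ['brocolli', 'is', 'good', 'to', 'eat', 'my', 'brother', 'likes',
          'to', 'eat', 'good', 'brocolli', 'but', 'not', 'my', 'mother']

# Keep only the tokens that are not in the stop list, preserving their order.
stopped_tokens = [t for t in tokens if t not in my_stop]
print(stopped_tokens)
```

Storing the stop words in a set keeps each lookup fast. A custom list also lets you keep words that matter in your own corpus.
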
### Stemming
- Stemming is an NLP technique that reduces topically similar words to their root. 


- For example, “stemming,” “stemmer,” and “stemmed” all have similar meanings; stemming reduces those terms to “stem.” 


- This is important for topic modeling, which would otherwise view those terms as separate entities and 
  reduce their importance in the model.

    
- One option: the Porter stemming algorithm, the most widely used method.

In [17]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [18]:
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]


In [19]:
# list for tokenized documents in loop
texts = []

In [20]:
# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

### Constructing a document-term matrix using Gensim

- Gensim
- Author: Radim Řehůřek
- Initial release: 2009
- Gensim is a robust open-source topic modeling toolkit implemented in Python.
- It uses NumPy, SciPy and optionally Cython for performance.
- Gensim is specifically designed to handle large text collections.
- It processes data as a stream, which differentiates it from most other scientific software packages, which only target batch and in-memory processing.

- To generate an LDA model, we need to understand how **frequently** each term occurs within each document. 
- We need to construct a document-term matrix with gensim.

- The Dictionary() function traverses texts, assigning a unique integer id to each unique token while also collecting word counts and relevant statistics.

In [21]:
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)

To see each token’s unique integer id, try

In [22]:
print(dictionary.token2id)

{'blood': 13, 'expert': 15, 'like': 4, 'suggest': 20, 'around': 6, 'drive': 8, 'well': 29, 'good': 3, 'often': 25, 'pressur': 19, 'brocolli': 0, 'may': 18, 'lot': 9, 'school': 27, 'better': 22, 'increas': 17, 'caus': 14, 'brother': 1, 'mother': 5, 'seem': 28, 'basebal': 7, 'never': 24, 'feel': 23, 'health': 16, 'say': 31, 'eat': 2, 'practic': 10, 'spend': 11, 'time': 12, 'perform': 26, 'tension': 21, 'profession': 30}
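
The idea of a token-to-id mapping can be sketched in plain Python. This is a simplified stand-in, not gensim's implementation: `build_token2id` is a hypothetical helper, the two short documents are made up for illustration, and the ids gensim actually assigns may follow a different order.

```python
def build_token2id(texts):
    # Assign consecutive integer ids to tokens in order of first appearance.
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Two small tokenized documents (made up for this sketch).
texts = [["brocolli", "good", "eat", "brother"],
         ["mother", "brother", "drive"]]
token2id = build_token2id(texts)
print(token2id)
```

Each distinct token gets exactly one id, no matter how many documents it appears in.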


Next, our tokenized documents must be converted into a bag-of-words representation:
    
- The bag-of-words model is a simplifying representation used in natural language processing 


- In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words


- **Disregarding grammar and even word order but keeping multiplicity**


- The bag-of-words model has also been used for computer vision.[1]


- In document classification, the (frequency of) occurrence of each word is used as a feature.

- First reference in the literature: Zellig Harris, 1954.

In [23]:
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]


- The doc2bow() (document to bag-of-words) method of the dictionary converts each tokenized document into a bag-of-words. 


- The result, corpus, is a list of vectors, one per document. 


- Each document vector is a series of tuples. 


- As an example, print(corpus[0]) results in the following:

In [24]:
print(corpus[0])

[(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)]
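
The vector printed above can be reproduced by hand with a counter. This is a minimal sketch, not gensim's implementation: `simple_doc2bow` is a hypothetical helper, the ids are the six relevant entries copied from the `token2id` output shown earlier, and the token list is doc_a after tokenizing, stop word removal and stemming.

```python
from collections import Counter

# Ids for the six stemmed tokens of doc_a, copied from the token2id output.
token2id = {"brocolli": 0, "brother": 1, "eat": 2, "good": 3, "like": 4, "mother": 5}

# doc_a after tokenizing, stop word removal and stemming.
doc_a_tokens = ["brocolli", "good", "eat", "brother", "like",
                "eat", "good", "brocolli", "mother"]

def simple_doc2bow(tokens, token2id):
    # Count how often each known token id occurs, then sort by id.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = simple_doc2bow(doc_a_tokens, token2id)
print(bow)
```

Word order is lost in this representation, but the number of occurrences of each term is kept.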


- This list of tuples represents our first document, doc_a. 


- The tuples are (term ID, term frequency) pairs.


- Since print(dictionary.token2id) shows that brocolli’s id is 0, the first tuple indicates that brocolli appeared twice in doc_a. 

- **corpus** is a document-term matrix and now we’re ready to generate an LDA model.


- The LdaModel class is described in detail in the gensim documentation. 


- Parameters:
  - **num_topics**: The user determines how many topics should be generated. Our document set is small, so we’re only asking for four topics.
    
  - **id2word**: The LdaModel class uses our previous **dictionary to map ids to strings.**

  - **passes:** optional. The number of **passes** (laps) the model will take through the corpus. 

        
        
- More passes generally make the model more **accurate**, at the cost of longer training. 



- A lot of passes can be **slow** on a very large corpus.

In [29]:
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)

### We specify the number of topics (clusters) and arrive at a not very elegant result:

In [33]:
ldamodel.print_topics(num_topics=4, num_words=4)

[(0, '0.074*"drive" + 0.074*"brother" + 0.074*"mother" + 0.074*"spend"'),
 (1, '0.135*"health" + 0.052*"pressur" + 0.052*"suggest" + 0.052*"expert"'),
 (2, '0.031*"drive" + 0.031*"mother" + 0.031*"brother" + 0.031*"brocolli"'),
 (3, '0.078*"good" + 0.078*"brocolli" + 0.078*"eat" + 0.078*"mother"')]

Of course, this is a very small dataset, so the topics overlap heavily.
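
Each topic is printed as a weighted sum of terms of the form weight*"word". To work with those numbers, a topic string can be split into (word, weight) pairs. The sketch below does this with a regular expression on one string copied from the output above; it only illustrates the string format and is not a gensim function.

```python
import re

# One topic string copied from the print_topics output.
topic = '0.135*"health" + 0.052*"pressur" + 0.052*"suggest" + 0.052*"expert"'

# Each term has the form weight*"word"; capture both parts.
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic)]
print(pairs)
```

The weights show how strongly each stemmed term is associated with the topic; here "health" clearly dominates the other three terms.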