# Topic Modeling using LDA
We will use the **tweetscrape** module here to fetch tweets and model their topics with LDA. I used the `tweet_driver.py` script to extract my tweets and store them in an SQLite database. Let us install some dependencies:

```
pip install tweetscrape
pip install gensim
```

Reference: https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html

In [1]:
import re
from gensim.parsing.preprocessing import remove_stopwords, strip_multiple_whitespaces, stem_text, strip_punctuation
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

from tweetscrape.coolstuff.db_helper import SQLiteHelper

In [11]:
sqlite = SQLiteHelper()
fetched_tweets = sqlite.get_all_tweets()
print("Extracted {0} tweets".format(len(fetched_tweets)))
print(fetched_tweets[0:3])

Extracted 788 tweets
[(808418796058869761, 'tweet', 'photomatt', 13479, 1481577035000, '  State of the Word,\xa02016https://ma.tt/2016/12/state-of-the-word-2016/\xa0…', '["https://t.co/hcUtziNYKQ"]', '[]', '[]', 5, 137, 49), (811950723055357952, 'tweet', 'droidconIN', 337801138, 1482419112000, '  Considering doing a rewrite of your Android app?@AdnanM0123 from @bookmyshow shares their experience and insights.http://hsgk.in/2hYoAlR\xa0pic.twitter.com/4IulVFOrNM', '["https://t.co/st5xbQd8sO", "https://t.co/4IulVFOrNM"]', '[]', '["@AdnanM0123", "@bookmyshow"]', 0, 9, 6), (812662401480830976, 'tweet', '5hirish', 428808036, 1482588789000, '  Dependency #Parsing Tutorial in #NLP using @spacy_io #spacy #nltk @honnibal #python @nlp_storieshttps://shirishkadam.com/2016/12/23/dependency-parsing-in-nlp/\xa0…', '["https://t.co/muuDf53uCg"]', '["#Parsing", "#NLP", "#spacy", "#nltk", "#python"]', '["@spacy_io", "@honnibal", "@nlp_stories"]', 0, 32, 18)]


In [12]:
# The tweet text is the sixth field (index 5) of each row.
tweets_doc = [tweet[5] for tweet in fetched_tweets]
print(tweets_doc[0:3])

['  State of the Word,\xa02016https://ma.tt/2016/12/state-of-the-word-2016/\xa0…', '  Considering doing a rewrite of your Android app?@AdnanM0123 from @bookmyshow shares their experience and insights.http://hsgk.in/2hYoAlR\xa0pic.twitter.com/4IulVFOrNM', '  Dependency #Parsing Tutorial in #NLP using @spacy_io #spacy #nltk @honnibal #python @nlp_storieshttps://shirishkadam.com/2016/12/23/dependency-parsing-in-nlp/\xa0…']


Looking at the tweets, we can tell that there is a lot of noise and unwanted information. We won't need external links, Twitter mentions, or hashtags to determine the topic of a tweet, so let's get rid of them using regular expressions.

We will also have to remove stop words, repeated whitespace, and punctuation.

In [13]:
# Raw strings keep the backslashes in the patterns from being read as string escapes.
links_pattern = re.compile(r'(http[^\s]+)')
pics_pattern = re.compile(r'(pic\.twitter\.com/[^\s]+)')
mention_pattern = re.compile(r'@([a-zA-Z0-9_]+)')
hashtag_pattern = re.compile(r'#([a-zA-Z0-9_]+)')

In [14]:
def regex_clean(pattern, text):
    return pattern.sub(' ', text)

In [15]:
for pos, tweets_text in enumerate(tweets_doc):
    tweets_text = regex_clean(links_pattern, tweets_text)
    tweets_text = regex_clean(pics_pattern, tweets_text)
    tweets_text = regex_clean(mention_pattern, tweets_text)
    tweets_text = regex_clean(hashtag_pattern, tweets_text)

    tweets_text = strip_punctuation(tweets_text)
    tweets_text = strip_multiple_whitespaces(tweets_text)
    # TODO: improve stopword removal ('we' still appears in the topics at the end)
    tweets_text = remove_stopwords(tweets_text)
    tweets_text = stem_text(tweets_text)
    tweets_tokens = simple_preprocess(tweets_text)
    tweets_doc[pos] = tweets_tokens
print(tweets_doc[0:3])

[['state', 'word'], ['consid', 'rewrit', 'android', 'app', 'share', 'experi', 'insight'], ['depend', 'tutori']]
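
Note that removing whole hashtags also removes topical words: the third tweet above is reduced to `['depend', 'tutori']`. As a possible variant (not what this notebook does), the sketch below strips only the `#` sign and keeps the word. The helper `clean_keep_hashtags`, the pattern name `hashtag_keep_pattern`, and the sample string with its `example.com` link are made up for illustration.

```python
import re

# Same patterns as in the notebook, written as raw strings.
links_pattern = re.compile(r'(http[^\s]+)')
pics_pattern = re.compile(r'(pic\.twitter\.com/[^\s]+)')
mention_pattern = re.compile(r'@([a-zA-Z0-9_]+)')
# Hypothetical variant: the group captures the hashtag word so it can be kept.
hashtag_keep_pattern = re.compile(r'#([a-zA-Z0-9_]+)')


def clean_keep_hashtags(text):
    """Remove links, picture links and mentions, but keep hashtag words."""
    for pattern in (links_pattern, pics_pattern, mention_pattern):
        text = pattern.sub(' ', text)
    # Replace '#word' with ' word ' so the word survives as a token.
    text = hashtag_keep_pattern.sub(r' \1 ', text)
    # Collapse the runs of spaces left behind by the substitutions.
    return ' '.join(text.split())


sample = 'Dependency #Parsing Tutorial in #NLP using @spacy_io #spacy https://example.com/post'
cleaned = clean_keep_hashtags(sample)
print(cleaned)
# Dependency Parsing Tutorial in NLP using spacy
```

Whether hashtag words help or hurt depends on the data: they often name the topic directly, but they can also be noisy or promotional.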


Using `simple_preprocess()` we have also converted each document into a list of tokens. Next we need to create a Dictionary from these tokens, which is simply a mapping between the tokens and their integer ids.

In [16]:
id2word_tweet = Dictionary(tweets_doc)
print(id2word_tweet)

Dictionary(2914 unique tokens: ['turn', 'newslett', 'auto', 'sylvia', 'bit']...)
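
To make the token-to-id mapping concrete, here is a minimal plain-Python sketch of the idea. It is not gensim's implementation: the helper `build_token_ids` and the toy documents are hypothetical, and the ids gensim assigns to the same tokens may differ.

```python
def build_token_ids(docs):
    """Give each previously unseen token the next free integer id."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# Toy tokenised documents (illustrative only).
toy_docs = [['state', 'word'], ['depend', 'tutori'], ['state', 'tutori', 'app']]
token2id = build_token_ids(toy_docs)
print(token2id)
# {'state': 0, 'word': 1, 'depend': 2, 'tutori': 3, 'app': 4}
```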


The dictionary object now contains every word that appeared in the corpus, along with how many times it appeared. Let's filter out both very infrequent and very frequent words (stopwords) to free up resources and remove noise.

Here we drop the words that appear in fewer than 3 documents or in more than 5% of the documents.

In [17]:
id2word_tweet.filter_extremes(no_below=3, no_above=0.05)
print(id2word_tweet)

Dictionary(763 unique tokens: ['tesla', 'problem', 'just', 'turn', 'head']...)
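
The effect of the two thresholds can be sketched in plain Python. This is a simplification under stated assumptions, not gensim's code: the helper `filter_by_doc_frequency` and the toy documents are hypothetical, and gensim's `filter_extremes()` additionally caps the vocabulary size through its `keep_n` parameter, which is ignored here.

```python
from collections import Counter


def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy corpus: 'ai' is in every document, 'talk' in three, the rest in one.
toy_docs = [
    ['ai', 'talk', 'year'],
    ['ai', 'talk'],
    ['ai', 'trump'],
    ['ai', 'talk', 'spaci'],
]
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
# ['talk']
```

Here `ai` is dropped for being too common (4 of 4 documents), while `year`, `trump` and `spaci` are dropped for being too rare (1 document each).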


Now we transform each document into a bag-of-words vector using the dictionary.

In [20]:
tweet_vec = [id2word_tweet.doc2bow(td) for td in tweets_doc]
print(tweet_vec[0:5])

[[(0, 1), (1, 1)], [(2, 1), (3, 1), (4, 1), (5, 1)], [(6, 1), (7, 1)], [(8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)], [(15, 1), (16, 1)]]
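
Each pair in the output is `(token_id, count)`. A minimal plain-Python sketch of the conversion, assuming a ready-made token-to-id mapping, is shown below. The helper `to_bow` and the toy mapping are hypothetical and only mimic what `doc2bow()` returns by default.

```python
from collections import Counter


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Toy mapping (illustrative only); 'tesla' is deliberately left out.
token2id = {'state': 0, 'word': 1, 'app': 2}
bow = to_bow(['state', 'word', 'state', 'tesla'], token2id)
print(bow)
# [(0, 2), (1, 1)]
```

Tokens missing from the mapping are skipped, which is why words removed by the frequency filter no longer appear in the vectors.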


In [21]:
# Train LDA with 3 topics; passes=50 makes 50 full passes over the corpus.
tweet_lda = LdaModel(tweet_vec, num_topics=3, id2word=id2word_tweet, passes=50)

In [22]:
tweet_lda.print_topics(num_topics=3, num_words=3)

[(0, '0.013*"ai" + 0.011*"us" + 0.011*"spaci"'),
 (1, '0.016*"talk" + 0.014*"know" + 0.014*"we"'),
 (2, '0.016*"like" + 0.012*"year" + 0.011*"trump"')]