# WHAT IS TOPIC MODELLING?

Topic modelling is a type of statistical modelling for discovering the abstract "topics" that occur in a collection of documents.
A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. The "topics" produced by topic modelling techniques are clusters of similar words.
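
The cats-and-dogs intuition can be sketched as a tiny generative process. This is only an illustration under stated assumptions: the two topics, their word probabilities and the helper `generate_document` are all made up for this example and are not part of the notebook's pipeline.

```python
import random
from collections import Counter

# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "cats": {"cat": 0.5, "meow": 0.3, "whisker": 0.2},
    "dogs": {"dog": 0.5, "bark": 0.3, "leash": 0.2},
}
# Topic proportions for one document: 10% cats, 90% dogs.
doc_topic_mix = {"cats": 0.1, "dogs": 0.9}

rng = random.Random(0)  # fixed seed so the run is repeatable


def generate_document(n_words):
    """Draw a topic for every word, then draw the word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic_mix), weights=list(doc_topic_mix.values()))[0]
        vocab = topics[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words


doc = generate_document(10_000)
counts = Counter(doc)
cat_words = sum(counts[w] for w in topics["cats"])
dog_words = sum(counts[w] for w in topics["dogs"])
print(cat_words, dog_words, round(dog_words / cat_words, 1))
```

The printed ratio should land close to 9. A topic model works in the opposite direction: it sees only the words and tries to recover both the topics and each document's mix.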

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/trump-tweets/trumptweets.csv
/kaggle/input/trump-tweets/realdonaldtrump.csv


In [2]:
df=pd.read_csv("/kaggle/input/trump-tweets/trumptweets.csv")
df

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags,geo
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 20:54:25,500,868,,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-05 03:00:10,33,273,,,
2,1737479987,https://twitter.com/realDonaldTrump/status/173...,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 15:38:08,12,18,,,
3,1741160716,https://twitter.com/realDonaldTrump/status/174...,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 22:40:15,11,24,,,
4,1773561338,https://twitter.com/realDonaldTrump/status/177...,"""My persona will never be that of a wallflower...",2009-05-12 16:07:28,1399,1965,,,
...,...,...,...,...,...,...,...,...,...
41117,1218962544372670467,https://twitter.com/realDonaldTrump/status/121...,I have never seen the Republican Party as Stro...,2020-01-19 19:24:52,32620,213817,,,
41118,1219004689716412416,https://twitter.com/realDonaldTrump/status/121...,Now Mini Mike Bloomberg is critical of Jack Wi...,2020-01-19 22:12:20,36239,149571,,,
41119,1219053709428248576,https://twitter.com/realDonaldTrump/status/121...,I was thrilled to be back in the Great State o...,2020-01-20 01:27:07,16588,66944,,#,
41120,1219066007731310593,https://twitter.com/realDonaldTrump/status/121...,"“In the House, the President got less due proc...",2020-01-20 02:16:00,20599,81921,@ @ @,,


In [3]:
import nltk

In [4]:
from nltk.corpus import stopwords

In [5]:
import re

In [6]:
from nltk.stem import WordNetLemmatizer 

In [7]:
clean=[]

In [8]:
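# Requires the NLTK 'stopwords' and 'wordnet' corpora (nltk.download(...) if they are missing)
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per word
lm = WordNetLemmatizer()
# Replace @mentions, non-alphanumeric characters and URLs with spaces
pattern = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"'
for i in range(0, 15000):
    review = re.sub(pattern, ' ', df['content'][i])
    review = review.lower()
    review = review.split()
    review = [lm.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    clean.append(review)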

In [9]:
df['content'][0]

'Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!'

In [10]:
clean[0]

'sure tune watch donald trump late night david letterman present top ten list tonight'
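
To see what the regular expression alone contributes, here is a minimal sketch that applies the same pattern without stop-word removal or lemmatization. The helper name `strip_noise` and both sample strings are made up for illustration.

```python
import re

# Same pattern as the cleaning cell, written as a raw string.
PATTERN = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"'


def strip_noise(text):
    """Replace mentions, URLs and punctuation with spaces, then lowercase and split."""
    return re.sub(PATTERN, ' ', text).lower().split()


sample = 'Thanks @JohnDoe! Read more: https://example.com/post'  # made-up tweet
print(strip_noise(sample))    # ['thanks', 'read', 'more']

# When the handle is separated from the "@" by a space, only the "@" is removed.
spaced = '@ TrumpDoral is great'  # made-up tweet
print(strip_noise(spaced))    # ['trumpdoral', 'is', 'great']
```

The second case probably explains why handles survive cleaning in this dataset, where mentions appear to be written as `@ name`.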

# The first 15,000 tweets are now cleaned

In [11]:
df_new=pd.DataFrame(df['content'][0:15000])
df_new

Unnamed: 0,content
0,Be sure to tune in and watch Donald Trump on L...
1,Donald Trump will be appearing on The View tom...
2,Donald Trump reads Top Ten Financial Tips on L...
3,New Blog Post: Celebrity Apprentice Finale and...
4,"""My persona will never be that of a wallflower..."
...,...
14995,"W/a newly expanded 27 holes of golfing, Trump ..."
14996,Thank you @ HauteLivingMag for naming @ TrumpD...
14997,"As I predicted, Obama already caught lying on ..."
14998,“Get to know yourself.You can’t improve upon s...


In [12]:
df_new['tweets']=clean

In [13]:
df_new

Unnamed: 0,content,tweets
0,Be sure to tune in and watch Donald Trump on L...,sure tune watch donald trump late night david ...
1,Donald Trump will be appearing on The View tom...,donald trump appearing view tomorrow morning d...
2,Donald Trump reads Top Ten Financial Tips on L...,donald trump read top ten financial tip late s...
3,New Blog Post: Celebrity Apprentice Finale and...,new blog post celebrity apprentice finale less...
4,"""My persona will never be that of a wallflower...",persona never wallflower rather build wall cli...
...,...,...
14995,"W/a newly expanded 27 holes of golfing, Trump ...",w newly expanded 27 hole golfing trump intl pa...
14996,Thank you @ HauteLivingMag for naming @ TrumpD...,thank hautelivingmag naming trumpdoral 1 golf ...
14997,"As I predicted, Obama already caught lying on ...",predicted obama already caught lying ocare enr...
14998,“Get to know yourself.You can’t improve upon s...,get know improve upon something understand ask...


In [14]:
from nltk.tokenize import word_tokenize

In [15]:
df_new['tokens']=df_new['tweets'].apply(word_tokenize)

In [16]:
df_new

Unnamed: 0,content,tweets,tokens
0,Be sure to tune in and watch Donald Trump on L...,sure tune watch donald trump late night david ...,"[sure, tune, watch, donald, trump, late, night..."
1,Donald Trump will be appearing on The View tom...,donald trump appearing view tomorrow morning d...,"[donald, trump, appearing, view, tomorrow, mor..."
2,Donald Trump reads Top Ten Financial Tips on L...,donald trump read top ten financial tip late s...,"[donald, trump, read, top, ten, financial, tip..."
3,New Blog Post: Celebrity Apprentice Finale and...,new blog post celebrity apprentice finale less...,"[new, blog, post, celebrity, apprentice, final..."
4,"""My persona will never be that of a wallflower...",persona never wallflower rather build wall cli...,"[persona, never, wallflower, rather, build, wa..."
...,...,...,...
14995,"W/a newly expanded 27 holes of golfing, Trump ...",w newly expanded 27 hole golfing trump intl pa...,"[w, newly, expanded, 27, hole, golfing, trump,..."
14996,Thank you @ HauteLivingMag for naming @ TrumpD...,thank hautelivingmag naming trumpdoral 1 golf ...,"[thank, hautelivingmag, naming, trumpdoral, 1,..."
14997,"As I predicted, Obama already caught lying on ...",predicted obama already caught lying ocare enr...,"[predicted, obama, already, caught, lying, oca..."
14998,“Get to know yourself.You can’t improve upon s...,get know improve upon something understand ask...,"[get, know, improve, upon, something, understa..."


In [17]:
df_new['tokens'][90]

['melania',
 'qvc',
 'tomorrow',
 'night',
 '9',
 'p',
 'et',
 'introduce',
 'beautiful',
 'inspiring',
 'melania',
 'timepiece',
 'fashion',
 'jewelry',
 'collection']

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

In [20]:
vect = CountVectorizer().fit(df_new['tokens'][90])
bag_of_words = vect.transform(df_new['tokens'][90])
sum_words = bag_of_words.sum(axis=0) 

In [21]:
sum_words

matrix([[1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]], dtype=int64)
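
The tweet has 15 tokens, yet the matrix above has only 12 columns. A likely reason is that `CountVectorizer` treats every list element as its own document and, with its default `token_pattern`, keeps only tokens of two or more characters, so `'9'` and `'p'` are dropped. The sketch below mimics that default with the standard library rather than calling scikit-learn itself.

```python
import re
from collections import Counter

tokens = ['melania', 'qvc', 'tomorrow', 'night', '9', 'p', 'et', 'introduce',
          'beautiful', 'inspiring', 'melania', 'timepiece', 'fashion',
          'jewelry', 'collection']

# CountVectorizer's default token_pattern: words of two or more characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

counts = Counter()
for doc in tokens:  # every list element is treated as a separate "document"
    counts.update(re.findall(DEFAULT_TOKEN_PATTERN, doc.lower()))

vocabulary = sorted(counts)  # the vectorizer also orders its columns alphabetically
print(len(tokens), len(vocabulary))         # 15 12
print([counts[w] for w in vocabulary])      # [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]
```

The single `2` belongs to `melania`, the only word that occurs twice.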

# Most frequent words

In [22]:
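def most_freq_words(s, n=None):
    """Return the n most frequent words in s (an iterable of strings) as (word, count) pairs."""
    vect = CountVectorizer().fit(s)
    bag_of_words = vect.transform(s)
    sum_words = bag_of_words.sum(axis=0)  # total count per vocabulary entry
    freq = [(word, sum_words[0, idx]) for word, idx in vect.vocabulary_.items()]
    freq = sorted(freq, key=lambda x: x[1], reverse=True)  # highest count first
    return freq[:n]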

In [23]:
most_freq_words([ word for tweet in df_new.tokens for word in tweet],20)

[('realdonaldtrump', 2933),
 ('great', 1760),
 ('thanks', 1611),
 ('trump', 1552),
 ('obama', 926),
 ('barackobama', 798),
 ('thank', 667),
 ('good', 640),
 ('get', 610),
 ('like', 607),
 ('time', 603),
 ('people', 577),
 ('president', 568),
 ('donald', 555),
 ('would', 555),
 ('cont', 547),
 ('new', 540),
 ('think', 536),
 ('one', 511),
 ('job', 468)]
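
Because the tweets are already tokenized, the same kind of ranking can be obtained with `collections.Counter`, which avoids fitting a vectorizer over one-word "documents". This is an alternative sketch, not the notebook's method. `token_lists` is a made-up stand-in for `df_new['tokens']`.

```python
from collections import Counter

# Hypothetical stand-in for df_new['tokens']: one token list per tweet.
token_lists = [
    ['great', 'job', 'trump'],
    ['great', 'time', 'p'],
    ['thanks', 'great', 'job'],
]

word_counts = Counter(word for tweet in token_lists for word in tweet)

most_common = word_counts.most_common(2)
least_common = sorted(word_counts.items(), key=lambda kv: kv[1])[:2]
print(most_common)    # [('great', 3), ('job', 2)]
print(least_common)   # [('trump', 1), ('time', 1)]
```

One difference to keep in mind: `Counter` keeps single-character tokens such as `'p'`, whereas `CountVectorizer` with default settings discards them.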

# Least frequent words

In [24]:
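def least_freq_words(s, n=None):
    """Return the n least frequent words in s (an iterable of strings) as (word, count) pairs."""
    vect = CountVectorizer().fit(s)
    bag_of_words = vect.transform(s)
    sum_words = bag_of_words.sum(axis=0)  # total count per vocabulary entry
    freq = [(word, sum_words[0, idx]) for word, idx in vect.vocabulary_.items()]
    freq = sorted(freq, key=lambda x: x[1], reverse=False)  # lowest count first
    return freq[:n]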

In [25]:
least_freq_words([ word for tweet in df_new.tokens for word in tweet],20)

[('wallflower', 1),
 ('cling', 1),
 ('tara', 1),
 ('conner', 1),
 ('achieves', 1),
 ('trumpative', 1),
 ('barnesandnoble', 1),
 ('precipice', 1),
 ('igoogle', 1),
 ('thoughtful', 1),
 ('fb', 1),
 ('url', 1),
 ('sf', 1),
 ('chronicle', 1),
 ('beckham', 1),
 ('britney', 1),
 ('inexplicable', 1),
 ('randal', 1),
 ('pinkett', 1),
 ('lieutenant', 1)]

In [26]:
df_new['tokens'][1]

['donald',
 'trump',
 'appearing',
 'view',
 'tomorrow',
 'morning',
 'discus',
 'celebrity',
 'apprentice',
 'new',
 'book',
 'think',
 'like',
 'champion']

In [27]:
df_new

Unnamed: 0,content,tweets,tokens
0,Be sure to tune in and watch Donald Trump on L...,sure tune watch donald trump late night david ...,"[sure, tune, watch, donald, trump, late, night..."
1,Donald Trump will be appearing on The View tom...,donald trump appearing view tomorrow morning d...,"[donald, trump, appearing, view, tomorrow, mor..."
2,Donald Trump reads Top Ten Financial Tips on L...,donald trump read top ten financial tip late s...,"[donald, trump, read, top, ten, financial, tip..."
3,New Blog Post: Celebrity Apprentice Finale and...,new blog post celebrity apprentice finale less...,"[new, blog, post, celebrity, apprentice, final..."
4,"""My persona will never be that of a wallflower...",persona never wallflower rather build wall cli...,"[persona, never, wallflower, rather, build, wa..."
...,...,...,...
14995,"W/a newly expanded 27 holes of golfing, Trump ...",w newly expanded 27 hole golfing trump intl pa...,"[w, newly, expanded, 27, hole, golfing, trump,..."
14996,Thank you @ HauteLivingMag for naming @ TrumpD...,thank hautelivingmag naming trumpdoral 1 golf ...,"[thank, hautelivingmag, naming, trumpdoral, 1,..."
14997,"As I predicted, Obama already caught lying on ...",predicted obama already caught lying ocare enr...,"[predicted, obama, already, caught, lying, oca..."
14998,“Get to know yourself.You can’t improve upon s...,get know improve upon something understand ask...,"[get, know, improve, upon, something, understa..."


In [28]:
# min_df is the minimum document frequency: words that appear in fewer documents than this
# value (an integer count or a fraction of the corpus) are dropped. min_df=1 keeps every word;
# newer scikit-learn versions validate this parameter and may reject an integer 0.
vectorizer = CountVectorizer(min_df=1)
sentence_transform = vectorizer.fit_transform(df_new['tweets'])

In [29]:
sentence_transform

<15000x16145 sparse matrix of type '<class 'numpy.int64'>'
	with 132065 stored elements in Compressed Sparse Row format>

In [30]:
sentence_transform.shape

(15000, 16145)
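
The effect of `min_df` is easiest to see on a tiny corpus. The three documents below are made up for illustration; raising `min_df` shrinks the vocabulary, which is one way to reduce the 16145 columns above if rare words turn out to be noise.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus: "great" is in 3 documents, "job" in 2, the rest in 1.
docs = ["great job", "great time", "great job everyone"]

vocab_by_min_df = {}
for min_df in (1, 2, 3):
    vec = CountVectorizer(min_df=min_df).fit(docs)
    vocab_by_min_df[min_df] = sorted(vec.vocabulary_)
    print(min_df, vocab_by_min_df[min_df])
```

With `min_df=1` all four words are kept, with `min_df=2` only `great` and `job` remain, and with `min_df=3` only `great` survives.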

In [31]:
print("\nThe vectorized array looks like:\n {}".format(sentence_transform.toarray()))


The vectorized array looks like:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [36]:
from sklearn.decomposition import LatentDirichletAllocation

In [37]:
lda = LatentDirichletAllocation(n_components=11, max_iter=5,
                                learning_method = 'online',
                                learning_offset = 50.,
                                random_state = 0)

In [38]:
lda.fit(sentence_transform)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=50.0,
                          max_doc_update_iter=100, max_iter=5,
                          mean_change_tol=0.001, n_components=11, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [41]:
def print_top_words(model, feature_names, n_top_words):
    for index, topic in enumerate(model.components_):
        message = "\nTopic #{}:".format(index)
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1 :-1]])
        print(message)
        print("="*70)
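
The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but easy to misread. The sketch below walks through it on a made-up five-word vocabulary; the words and weights are illustrative only.

```python
import numpy as np

# Made-up weights for one topic over a five-word vocabulary.
feature_names = np.array(['golf', 'vote', 'thanks', 'wind', 'job'])
topic = np.array([0.2, 3.1, 0.7, 2.4, 1.5])
n_top_words = 3

ascending = topic.argsort()               # indices from smallest to largest weight
top = ascending[:-n_top_words - 1:-1]     # walk backwards over the last n_top_words entries
print(top, feature_names[top])            # [1 3 4] ['vote' 'wind' 'job']

# An equivalent and arguably more readable form: sort the negated weights.
same = np.argsort(-topic)[:n_top_words]
```

Both forms return the indices of the heaviest words in descending order of weight.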

In [42]:
n_top_words = 40
print("\nTopics in LDA model: ")
tf_feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model: 

Topic #0:vote wind alexsalmond turbine scotland inspiration bird role killing north vince definitely convention gary pick beauty ireland key water caught rnc windmill danscavino lombardi begin kill dan scottish aberdeencc achievement 19 ship doonbeg monstrosity hitting beating pressjournal nowhere im dt

Topic #1:thanks happy 2016 luck nice birthday wow trump2016 enjoy yankee pres mean honor left model billmaher experience wish garbage david mike private staff day message bahia6085 seeing christian compliment anniversary financial bos style weapon lincoln gold shark planet including quickly

Topic #2:well loser entrepreneur terrible touch fun oscar midas dannyzuker hater interesting business advice post order young robert trumptowerny bought together brilliant matter danny gut fighting cancer spot often judge suck changed light trial happened highly personally retweet pls atrium learned

Topic #3:team hear already speak food word agschneiderman general ugly bost

In [43]:
first_topic = lda.components_[0]
second_topic = lda.components_[1]
third_topic = lda.components_[2]

In [None]:
first_topic.shape

In [46]:
second_topic.shape

(16145,)
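
The shape `(16145,)` matches the vocabulary size: each row of `lda.components_` holds one weight per word. As a hedged sketch of how these rows relate to the "balance of topics" described in the introduction, the example below fits a small model on a made-up corpus. The names `toy_lda`, `topic_word` and `doc_topic` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus standing in for df_new['tweets'].
docs = ["wind turbine scotland", "happy birthday thanks",
        "wind turbine bird", "thanks happy luck"]

X = CountVectorizer().fit_transform(docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# components_ has one row per topic and one column per vocabulary word.
print(toy_lda.components_.shape)

# Normalising each row gives a word distribution for that topic.
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

# transform() returns the topic mixture of each document (each row sums to 1).
doc_topic = toy_lda.transform(X)
print(doc_topic.round(2))
```

Applied to the fitted model above, `lda.transform(sentence_transform)` would give an array with one row per tweet and 11 columns, one per topic.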