<a href="https://colab.research.google.com/github/ShwetaBaranwal/Topic_Modelling/blob/master/LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing packages

In [0]:
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import nltk
import re


n_features is the vocabulary size (V): the maximum number of distinct terms the vectorizer keeps.

n_components is the number of topics (k).


In [0]:
n_features = 1000
n_components = 10
# number of top words to print for each topic
n_top_words = 20
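
To make the roles of V and k concrete, here is a minimal sketch on a made-up random count matrix. The sizes and the names `toy_counts` and `toy_lda` are illustrative assumptions, not part of this notebook's pipeline. It shows that a fitted model holds a k x V topic-word matrix and maps each document to a length-k topic mixture.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy sizes: 6 documents, vocabulary of 12 terms, 3 topics
n_docs, V, k = 6, 12, 3

rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(n_docs, V))  # fake term counts

toy_lda = LatentDirichletAllocation(n_components=k, random_state=0)
toy_lda.fit(toy_counts)

# One row per topic, one column per vocabulary term
print(toy_lda.components_.shape)

# One row per document, one column per topic; each row sums to 1
doc_topic = toy_lda.transform(toy_counts)
print(doc_topic.shape)
print(doc_topic.sum(axis=1))
```

In the notebook itself the same shapes become (10, 1000) for the topic-word matrix and (number of documents, 10) for the document-topic matrix.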


##Loading the 20 Newsgroups dataset

In [0]:
print("Loading dataset...")
t0 = time()

categories = ['rec.sport.baseball', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

print("total #documents in training = {}".format(len(newsgroups_train.filenames)))
print("total #documents in test = {}".format(len(newsgroups_test.filenames)))

print("done in %0.3fs." % (time() - t0))


Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Loading dataset...
total #documents in training = 2151
total #documents in test = 1431
done in 11.672s.


In [0]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [0]:
newsgroups_train.data[:4]

[': >Based on the amount of E-Mail from fellow Christians who have read the\n: >posts and told me I was wasting my time with Butler and Joslin, I told\n: >them I wasn\'t doing it for DB or  DJ but for other Christians.  They\n: >have told me that DB\'s and DJ\'s arguments won\'t convince most Bible\n: >studying Christians.  So I have reevaluated my purpose here and it\'s\n: >also contributed to my decision.\n\n: So most Bible-studying Christians won\'t be convinced by my arguments? \n: And this is supposed to be a Good Thing, I presume?\n\nWhere does this "Most Bible studying Christians think as Frank\ndoes" come from.  And what implied "good" are you doing for other\nChristians?\n\nAt least some of what you are teaching has been demonstrated as\nwrong.  Has it ever occured to you that you may be doing more harm\nthan good to your fellow Christians?\n\nBTW, I used to think like Frank does.  I went to a fundamentalist\nchurch for a while.  I didn\'t start to really think about what\nthe

In [0]:
newsgroups_train.target[:4]

array([3, 3, 1, 2])

In [0]:
[newsgroups_train.target_names[i] for i in newsgroups_train.target[:4]]


['talk.religion.misc', 'talk.religion.misc', 'rec.sport.baseball', 'sci.space']

##Cleaning data

####Removing punctuation and converting to lowercase

In [0]:
def data_clean(data):
  # Remove digits and punctuation (including '*', ',' and '-').
  # Note: newlines and tabs are deleted, not replaced by a space, so the
  # words on either side of a line break are joined ("Frank\ndoes" -> "Frankdoes").
  pattern = re.compile(r'[\d+{}.!?>"/#$&%;=:<()*,\-\n@\t]')
  data_samples_processed = [pattern.sub('', x) for x in data]
  print("after removing punctuation\n")
  print("\n".join(data_samples_processed[:3]))

  # Convert the documents to lowercase
  data_samples_processed_lower = [x.lower() for x in data_samples_processed]
  print("\nafter converting to lowercase\n")
  print("\n".join(data_samples_processed_lower[:3]))

  return data_samples_processed_lower

In [0]:
training_data = data_clean(newsgroups_train.data)

after removing punctuation

 Based on the amount of EMail from fellow Christians who have read the posts and told me I was wasting my time with Butler and Joslin I told them I wasn't doing it for DB or  DJ but for other Christians  They have told me that DB's and DJ's arguments won't convince most Bible studying Christians  So I have reevaluated my purpose here and it's also contributed to my decision So most Biblestudying Christians won't be convinced by my arguments  And this is supposed to be a Good Thing I presumeWhere does this Most Bible studying Christians think as Frankdoes come from  And what implied good are you doing for otherChristiansAt least some of what you are teaching has been demonstrated aswrong  Has it ever occured to you that you may be doing more harmthan good to your fellow ChristiansBTW I used to think like Frank does  I went to a fundamentalistchurch for a while  I didn't start to really think about whatthey were saying until I noticed a God's Science phamphlet
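
The output above shows words fused across line breaks ("Frankdoes", "otherChristiansAt") because newlines are deleted outright. A possible alternative, sketched below, removes the same punctuation but collapses whitespace to a single space. The function name `data_clean_keep_spaces` and the sample strings are hypothetical, and this is not the cleaning used for the results in this notebook.

```python
import re

# Same digits/punctuation as data_clean, but without newline and tab
_PUNCT = re.compile(r'[\d+{}.!?>"/#$&%;=:<()*,\-@]')
# Any run of whitespace (spaces, newlines, tabs)
_WHITESPACE = re.compile(r'\s+')

def data_clean_keep_spaces(data):
    """Strip punctuation, collapse whitespace to one space, lowercase."""
    cleaned = []
    for doc in data:
        doc = _PUNCT.sub('', doc)
        doc = _WHITESPACE.sub(' ', doc).strip()
        cleaned.append(doc.lower())
    return cleaned

sample = ["think as Frank\ndoes come from.", "E-Mail:\t42 posts!"]
print(data_clean_keep_spaces(sample))
# ['think as frank does come from', 'email posts']
```

Switching to this variant would change the vocabulary and therefore the topics reported further down.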

In [0]:
test_data = data_clean(newsgroups_test.data)

after removing punctuation

I understand the when one is in orbit the inward force of gravity atone's center of mass is exactly balanced by the outward centrifugalforce from the orbiting motion resulting in weightlessness I want to know what weightlessness actually FEELS like For example isthere a constant sensation of falling And what is the motion sicknessthat some astronauts occasionally experience  Please reply only if you are either a former or current astronaut or someone who has had this discussion firsthand with an astronaut Thanks
T H E  G R A P H I C S  B B S                                                                                                                                                                                                                                                                                                                                                                                        rentcom         It's better than a sharp stick in

In [0]:
[newsgroups_test.target_names[i] for i in newsgroups_test.target[:3]]

['sci.space', 'comp.graphics', 'rec.sport.baseball']

In [0]:
# from nltk import stem
# l = stem.WordNetLemmatizer()
# word_list = ['advance', 'advanced']
# [l.lemmatize(i) for i in word_list]

['advance', 'advanced']

In [0]:
# from nltk import word_tokenize
# from nltk import stem
# nltk.download('punkt')
# nltk.download('wordnet')

# class LemmaTokenizer(object):
#     def __init__(self):
#         self.wnl = stem.WordNetLemmatizer()
#     def __call__(self, doc):
#         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]


# tf_vectorizer = CountVectorizer( max_df=0.95, 
#                               min_df=2,
#                               max_features=n_features,
#                               stop_words='english', 
#                               tokenizer=LemmaTokenizer())


##Converting the document corpus into a term-count matrix with CountVectorizer

In [0]:

print("Use tf (raw term count) features for LDA.")
    
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(training_data)
print("done in %0.3fs." % (time() - t0))


Use tf (raw term count) features for LDA.
done in 0.308s.


In [0]:
tf.shape

(2151, 1000)

In [0]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_vectorizer.get_feature_names_out()[-20:].tolist()

['words',
 'work',
 'working',
 'works',
 'world',
 'worth',
 'wouldn',
 'write',
 'writing',
 'written',
 'wrong',
 'wrote',
 'xv',
 'yankees',
 'year',
 'years',
 'yes',
 'yesterday',
 'york',
 'young']

In [0]:
tf_vectorizer.get_feature_names_out()[:20].tolist()

['ability',
 'able',
 'accept',
 'access',
 'according',
 'act',
 'activities',
 'acts',
 'actual',
 'actually',
 'add',
 'addition',
 'address',
 'advance',
 'advanced',
 'aerospace',
 'age',
 'agency',
 'ago',
 'agree']

##Implementing LDA

In [0]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (len(training_data), n_features))
# lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
#                                 learning_method='online',
#                                 learning_offset=50.,
#                                 random_state=0)

lda = LatentDirichletAllocation(n_components=n_components, max_iter=10,
                                learning_method='batch',
                                random_state=0)

t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))


Fitting LDA models with tf features, n_samples=2151 and n_features=1000...
done in 6.460s.


In [0]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
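
The slice `topic.argsort()[:-n_top_words - 1:-1]` in print_top_words is compact but easy to misread. The sketch below, using made-up weights and term names, shows that it returns the indices of the largest weights in descending order, and that it matches the more explicit "sort descending, take the first n" form.

```python
import numpy as np

# Hypothetical weights of one topic over a 5-term vocabulary
weights = np.array([0.1, 2.5, 0.7, 1.9, 0.3])
names = ["a", "b", "c", "d", "e"]
n_top = 3

# argsort() is ascending; stepping backwards from the end yields the
# n_top largest entries, largest first
top_idx = weights.argsort()[:-n_top - 1:-1]
print(top_idx)                      # [1 3 2]
print([names[i] for i in top_idx])  # ['b', 'd', 'c']

# Equivalent, more explicit form
top_idx_alt = np.argsort(-weights)[:n_top]
print(top_idx_alt)                  # [1 3 2]
```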


Printing top words from each topic

In [0]:
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)



Topics in LDA model:
Topic #0: know does mr said article list jim just judas post new tyre day don read context think point greek email
Topic #1: space launch nasa shuttle satellite lunar mission program new data commercial flight moon earth satellites technology station mars spacecraft solar
Topic #2: like just don think know time ll did good really ve people didn say got sure make right lot going
Topic #3: data image available information ftp email images software send graphics contact package processing use address mail user systems free program
Topic #4: won lost game th hit new cubs york home fan san st games second phillies play ball reds run louis
Topic #5: image jpeg file gif images color files bit format use version quality does programs mode don free display tiff convert
Topic #6: year team good runs win better pitching points game games baseball average league best don years think braves season teams
Topic #7: graphics program help like know need looking thanks use files hi

Validating on training set

In [0]:
res_train = lda.transform(tf)
res_train.shape

(2151, 10)

In [0]:
res_train[0]

array([0.06856659, 0.0013335 , 0.00133379, 0.00133359, 0.03080147,
       0.00133347, 0.00133359, 0.00133359, 0.89129694, 0.00133346])

In [0]:
training_data[0]

" based on the amount of email from fellow christians who have read the posts and told me i was wasting my time with butler and joslin i told them i wasn't doing it for db or  dj but for other christians  they have told me that db's and dj's arguments won't convince most bible studying christians  so i have reevaluated my purpose here and it's also contributed to my decision so most biblestudying christians won't be convinced by my arguments  and this is supposed to be a good thing i presumewhere does this most bible studying christians think as frankdoes come from  and what implied good are you doing for otherchristiansat least some of what you are teaching has been demonstrated aswrong  has it ever occured to you that you may be doing more harmthan good to your fellow christiansbtw i used to think like frank does  i went to a fundamentalistchurch for a while  i didn't start to really think about whatthey were saying until i noticed a god's science phamphletthere  i read it and notice

In [0]:
newsgroups_train.target_names[newsgroups_train.target[0]]

'talk.religion.misc'

Validating on test set

In [0]:
tf_test = tf_vectorizer.transform(test_data)
res_test = lda.transform(tf_test)
res_test.shape

(1431, 10)

In [0]:
np.argmax(res_test[3])

1

In [0]:
test_data[3]

"i like this statement though for my own reasons  cost comparisons dependa lot on whether the two options are similar and then it becomes veryrevealing to consider what their differences are  can soyuz launch thelong exposure facility  course not  will the shuttle take my television relay to leo by year's end  almost certainly not but the russians arepretty good about making space accessible on a tight schedulecomparing s and ss points up that there are two active spacelauncherandworkplatform resources with similarities and differenceswhere they are in direct competition we may get to see some marketeconomics come into playtombaker"

In [0]:
newsgroups_test.target_names[newsgroups_test.target[3]]

'sci.space'
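
Checking one document at a time only goes so far. One way to look at all documents at once is to take each document's dominant topic and cross-tabulate it against the true newsgroup label. The sketch below uses small made-up arrays in place of `res_test` and `newsgroups_test.target`; the variable names and values are assumptions for illustration. Because LDA is unsupervised, topic indices have no fixed relation to label indices, so the table is read by looking for which topic column dominates each label row.

```python
import numpy as np

# Hypothetical document-topic matrix (4 documents, 3 topics)
doc_topic = np.array([
    [0.07, 0.01, 0.92],
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
])
# Hypothetical true labels (2 classes)
labels = np.array([1, 0, 1, 0])

# Dominant topic per document
dominant = doc_topic.argmax(axis=1)
print(dominant)  # [2 0 2 0]

# Rows: true label, columns: dominant topic
table = np.zeros((labels.max() + 1, doc_topic.shape[1]), dtype=int)
np.add.at(table, (labels, dominant), 1)
print(table)
# [[2 0 0]
#  [0 0 2]]
```

With the real data, `doc_topic` would be `res_test` and `labels` would be `newsgroups_test.target`.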

Reference: https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0