# Topic Modelling

Here, we'll use Gensim's LDA (Latent Dirichlet Allocation) model to find topics in `newsgroup_data`.
First, we need to finish the code in the cells below by using the `gensim.models.ldamodel.LdaModel` constructor to estimate LDA model parameters on the corpus, saving the result to the variable `trained_ldamodel`.

We'll extract 10 topics using `corpus` and `id_map`, with `passes=25` and `random_state=34`.

In [9]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('data/newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

print("A sample row from our input data is as follows:\n===================\n\n", newsgroup_data[5])

char_count = 0
for item in newsgroup_data:
    char_count += len(item)

# Use CountVectorizer to find tokens of three or more characters, remove stop_words,
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')

# Fit and transform
X = vect.fit_transform(newsgroup_data)
print("\n\n===================\nWe've passed our data through a CountVectorizer, and then fitted and transformed it as our training data.")
print("\n\nIn the training data:\nNumber of corpora = {}. Total number of characters = {}".format(len(newsgroup_data), char_count))

A sample row from our input data is as follows:

 
There would be no problems as long as the OS didn't set up a DMA transfer
to an area above the 16 mb area (the DMA controller probably can't be
programmed that way anyways, so there probably isin't a problem with this)


We've passed our data through a CountVectorizer, and then fitted and transformed it as our training data.


In the training data:
Number of corpora = 2000. Total number of characters = 1667950
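
To see what the `token_pattern` above actually keeps, here is a minimal sketch that applies the same regular expression with Python's `re` module. It only imitates the tokenization step (lowercasing plus the pattern); stop-word removal and the `min_df`/`max_df` filters are not reproduced, and the helper name `tokenize` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as in the CountVectorizer call above:
# runs of three or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w\w+\b")

def tokenize(text):
    """Lowercase the text and return every token of three or more characters."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical sample sentence, loosely based on the sample row above.
sample = "The OS didn't set up a DMA transfer above 16 mb"
tokens = tokenize(sample)
print(tokens)
# ['the', 'didn', 'set', 'dma', 'transfer', 'above']
```

Note how "didn't" is cut at the apostrophe, leaving "didn". This is why fragments such as "don" can show up among the topic keywords later on.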


In [10]:
# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
print("Now, we'll convert our training data X to gensim corpus\n")

Now, we'll convert our training data X to gensim corpus
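
A gensim corpus is, roughly, a sequence of documents in which each document is a list of `(term_id, count)` pairs. The sketch below builds that shape by hand from a tiny dense count matrix, just to show what the conversion produces conceptually. The matrix and the helper `rows_to_bow` are hypothetical; the real `Sparse2Corpus` works on the sparse matrix `X` directly.

```python
# Hypothetical 2-document, 4-term count matrix
# (rows = documents, columns = term IDs), standing in for X.
counts = [
    [2, 0, 1, 0],
    [0, 3, 0, 1],
]

def rows_to_bow(matrix):
    """Turn each row into a list of (term_id, count) pairs, skipping zero counts."""
    return [
        [(term_id, count) for term_id, count in enumerate(row) if count > 0]
        for row in matrix
    ]

bow_corpus = rows_to_bow(counts)
print(bow_corpus)
# [[(0, 2), (2, 1)], [(1, 3), (3, 1)]]
```

Because our rows are documents, we pass `documents_columns=False` in the cell above.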



In [11]:
# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())


####
print("Creating a mapping from word IDs to words for loading into the LdaModel.")
print("The ID map is as follows:\n")
for items in list(id_map.items())[:10]:
    print(items) 

Creating a mapping from word IDs to words for loading into the LdaModel.
The ID map is as follows:

(76, 'best')
(335, 'group')
(33, 'america')
(409, 'know')
(726, 'similar')
(544, 'organization')
(23, 'address')
(514, 'new')
(899, 'york')
(842, 'usa')


In [12]:
%%time
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `trained_ldamodel`

# Your code here:
trained_ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=id_map, passes=25, random_state=34)
print('''Training Successful!
Params:
Number of topics: 10,
passes: 25,
random state: 34''')

Training Successful!
Params:
Number of topics: 10,
passes: 25,
random state: 34
CPU times: total: 1min 32s
Wall time: 1min 38s


### lda_topics

Our LDA model returns a list of topic numbers and their probabilities.
However, these are only TOPIC NUMBERS; the topics themselves aren't named.

Using `trained_ldamodel`, we can list the 10 topics and the 10 most significant words in each topic.
We can then use this information to manually name our TOPIC NUMBERS.

First, we'll look at each topic number and its respective keywords.
The result is a list of 10 tuples, where each tuple takes the form:

`(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')`

Here we can easily see that the top keywords are "space", "nasa", "science", and so on.
Since "space" carries the highest weight, "Space" would be a natural name for TOPIC NUMBER 9 (the mapping further down files it under "Science").
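
When naming topics, it can help to have the keywords as `(word, weight)` pairs rather than one long string. Below is a small sketch that parses the string format shown above with a regular expression. The helper `parse_topic` is our own illustration, not part of gensim; as far as we know, gensim's `show_topic(topicid, topn)` returns such pairs directly, so this is mainly useful when only the printed string is at hand.

```python
import re

# The example topic string from above.
topic_string = (
    '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + '
    '0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + '
    '0.014*"center" + 0.014*"sci"'
)

def parse_topic(description):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', description)
    return [(word, float(weight)) for weight, word in pairs]

terms = parse_topic(topic_string)
print(terms[:3])
# [('space', 0.068), ('nasa', 0.036), ('science', 0.021)]
```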

In [16]:
def lda_topics():
    return list(trained_ldamodel.show_topics(num_topics=10, num_words=10))

lda_topics()

[(0,
  '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'),
 (1,
  '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'),
 (2,
  '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'),
 (3,
  '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'),
 (4,
  '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'),
 (5,
  '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'),
 (6,
  '0.0

### topic_names

From the following list of topics, assign a name to each topic you found. If none of these names matches a topic well, create a new 1-3 word "title" for it.

Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.

*This function should return a list of 10 strings.*

In [33]:
# Manually assigning topic names to each topic number.

topic_names = ['Education','Science','Computers & IT','Religion','Automobiles','Sports','Science','Religion','Computers & IT','Science']
topic_numbers = list(range(10))

# Map each topic number to its manually chosen name.
name_mapping = dict(zip(topic_numbers, topic_names))
name_mapping

{0: 'Education',
 1: 'Science',
 2: 'Computers & IT',
 3: 'Religion',
 4: 'Automobiles',
 5: 'Sports',
 6: 'Science',
 7: 'Religion',
 8: 'Computers & IT',
 9: 'Science'}

### topic_distribution

For the new document `new_doc`, let's find the topic distribution. 

As with all input text, we'll use `vect.transform` on the new doc, and `Sparse2Corpus` to convert the resulting sparse matrix to a gensim corpus.

*This function will return a list of tuples, where each tuple is `(topic name, probability)`, sorted by descending probability.*

In [34]:
# This function takes a list of strings and vectorizes it with the fitted CountVectorizer.
# It then returns a list of (topic name, probability) tuples for the first document in the list.

def topic_distribution(doc_to_test):
    
    sparse_doc = vect.transform(doc_to_test)
    gen_corpus = gensim.matutils.Sparse2Corpus(sparse_doc, documents_columns=False)
    
    topics = sorted(list(trained_ldamodel[gen_corpus])[0], key=lambda x: -x[1]) # It's a list of lists! We just want the first one.
    return [(name_mapping[x[0]], x[1]) for x in topics]

In [35]:
test1 = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

topic_distribution(test1)

[('Religion', 0.4964944),
 ('Science', 0.34348097),
 ('Sports', 0.020004135),
 ('Automobiles', 0.020004045),
 ('Science', 0.02000333),
 ('Computers & IT', 0.020003129),
 ('Education', 0.020003106),
 ('Science', 0.020002974),
 ('Religion', 0.020002646),
 ('Computers & IT', 0.020001281)]
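
Because several topic numbers share the same name in `name_mapping`, names such as "Science" appear more than once in this output. One way to get a single probability per name is to sum the entries that share a name. The sketch below does this for the output above; the helper `merge_by_name` is our own addition, and the values are copied from the result of `topic_distribution(test1)`.

```python
from collections import defaultdict

# Output of topic_distribution(test1), copied from above.
distribution = [
    ('Religion', 0.4964944), ('Science', 0.34348097),
    ('Sports', 0.020004135), ('Automobiles', 0.020004045),
    ('Science', 0.02000333), ('Computers & IT', 0.020003129),
    ('Education', 0.020003106), ('Science', 0.020002974),
    ('Religion', 0.020002646), ('Computers & IT', 0.020001281),
]

def merge_by_name(pairs):
    """Sum probabilities of topics that share a name, highest total first."""
    totals = defaultdict(float)
    for name, prob in pairs:
        totals[name] += prob
    return sorted(totals.items(), key=lambda item: -item[1])

merged = merge_by_name(distribution)
for name, prob in merged:
    print(f"{name}: {prob:.3f}")
```

For this document, the merged view gives roughly 0.516 for "Religion" and 0.383 for "Science".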

In [37]:
test2 = ['It is considered to be the oldest living religion in the world. Hinduism is often called a "way of life", and anyone sincerely following that way of life can consider themselves to be a Hindu.']

topic_distribution(test2)

[('Religion', 0.9181615)]
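
Only one tuple comes back here. A likely reason is that gensim's `LdaModel` drops topics whose probability falls below its `minimum_probability` setting, which we believe defaults to 0.01. The sketch below imitates that filter in plain Python. It assumes, purely for illustration, that the remaining probability mass is spread evenly over the other nine topics; the helper `apply_threshold` and the placeholder topic labels are hypothetical.

```python
def apply_threshold(pairs, minimum_probability=0.01):
    """Keep only (topic, probability) pairs at or above the threshold."""
    return [(topic, prob) for topic, prob in pairs if prob >= minimum_probability]

top = 0.9181615            # probability reported above for 'Religion'
rest = (1.0 - top) / 9     # assumption: the remainder is shared evenly

# Hypothetical full distribution before filtering.
full = [('Religion', top)] + [(f'other topic {i}', rest) for i in range(9)]

kept = apply_threshold(full)
print(round(rest, 4))  # 0.0091
print(kept)            # [('Religion', 0.9181615)]
```

Under that assumption each remaining topic sits just under 0.01, so all nine are filtered out. If the full distribution is needed, calling `get_document_topics` on the model with `minimum_probability=0.0` should return every topic.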

In [43]:
test3 = ['''Linux is a family of open-source Unix-like operating systems based on the Linux kernel,
an operating system kernel first released on September 17, 1991, by Linus Torvalds.
Linux is typically packaged in a Linux distribution.''']
topic_distribution(test3)

[('Science', 0.49019933),
 ('Computers & IT', 0.37642974),
 ('Computers & IT', 0.016674902),
 ('Science', 0.01667401),
 ('Religion', 0.016671907),
 ('Education', 0.01667108),
 ('Religion', 0.01667102),
 ('Science', 0.016670303),
 ('Sports', 0.016668897),
 ('Automobiles', 0.016668787)]

In [45]:
test4 = ['''With 15 matches left to play in the league stage of IPL 2022, as of Monday May 9,
there remain as many as 32,768 possible combinations of results. 
Sunday's two games have brought that figure down from a staggering 1,31,072. ''']

topic_distribution(test4)

[('Sports', 0.8874408),
 ('Religion', 0.012510252),
 ('Science', 0.0125086205),
 ('Religion', 0.012507285),
 ('Computers & IT', 0.012507194),
 ('Computers & IT', 0.012506805),
 ('Science', 0.012505957),
 ('Science', 0.012505735),
 ('Automobiles', 0.012504364),
 ('Education', 0.012502977)]