
In [1]:
import pandas as pd

import spacy
nlp = spacy.load('en_core_web_sm')

import gensim
from gensim.models import LdaModel
from gensim.corpora import Dictionary

import re

#### Here, we load the packages used in this project.
 - pandas, for working with tabular data structures in Python
 - spaCy, for natural language processing
 - the en_core_web_sm model, spaCy's small English pipeline
 - <b>gensim</b>, a topic-modelling and text-processing library, from which we import <b>LdaModel</b> (under gensim.models) and <b>Dictionary</b> (under gensim.corpora)
 - re, Python's regular expression module, for finding specific patterns in text

In [2]:
data = pd.read_excel('topic_modelling.xlsx')

We then load the Excel file <b>topic_modelling.xlsx</b> using the pandas read_excel function.

In [3]:
data.head()

Unnamed: 0,Startup News,Summary,Posted By,Description
0,How PUBG has redefined the Indian gaming ecosy...,Indians are well and truly addicted to PUBG th...,Puneet Kumar,"Earlier this month, I went to my home town in ..."
1,Dream11 closes $100 M funding led by Tencent; ...,,Vishal Krishna,Sports fantasy gaming company Dream11 has rais...
2,How enterprise gaming has grown from a ridicul...,,Kamalika Bhattacharya,Having spent a large part of my career in the ...
3,These MBA grads are changing the way India loo...,"From online tournaments to events, online poke...",Sindhu Kashyap,"A deal, bated breath, a flush, a straight hand..."
4,Facebook is celebrating two years of encouragi...,,Team YS,"Two years ago, Facebook India launched a uniqu..."


Looking at the first rows, we find several NaN values in the Summary column.

In [4]:
data.isnull().sum()

Startup News    0
Summary         6
Posted By       0
Description     0
dtype: int64

To check whether any other column contains null (NaN) values, we used <b>.isnull().sum()</b>, which shows that only the Summary column has missing values, 6 in total.

In [5]:
data.fillna(' ',inplace=True)

Since we are working with text, we filled the null values with a single space, denoted by <b>' '</b>.

In [6]:
data.head()

Unnamed: 0,Startup News,Summary,Posted By,Description
0,How PUBG has redefined the Indian gaming ecosy...,Indians are well and truly addicted to PUBG th...,Puneet Kumar,"Earlier this month, I went to my home town in ..."
1,Dream11 closes $100 M funding led by Tencent; ...,,Vishal Krishna,Sports fantasy gaming company Dream11 has rais...
2,How enterprise gaming has grown from a ridicul...,,Kamalika Bhattacharya,Having spent a large part of my career in the ...
3,These MBA grads are changing the way India loo...,"From online tournaments to events, online poke...",Sindhu Kashyap,"A deal, bated breath, a flush, a straight hand..."
4,Facebook is celebrating two years of encouragi...,,Team YS,"Two years ago, Facebook India launched a uniqu..."


Viewing the data again, a space is now displayed in place of <b>NaN</b>.

In [7]:
data['cumulative data'] = data['Startup News'] + data['Summary'] + data['Description']

Our aim is to find attributes of the data such as person names, locations and organizations, so we combined the <b>Startup News</b>, <b>Summary</b> and <b>Description</b> columns, as any one of them can contain relevant details.
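A minimal sketch of why the missing values are filled before the columns are joined, using a small hypothetical frame (the toy rows and the names `demo`, `raw`, `glued` and `spaced` are illustrative, not part of the dataset). It also shows the effect of an explicit separator, which the cell above does not add, so the last word of one column can end up glued to the first word of the next.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real spreadsheet
demo = pd.DataFrame({
    "Startup News": ["Title one", "Title two"],
    "Summary": ["Short summary", None],
    "Description": ["Body one", "Body two"],
})

# Concatenating with a missing value propagates NaN to the whole row
raw = demo["Startup News"] + demo["Summary"] + demo["Description"]
print(raw.isnull().sum())

# After filling, every row yields a string
filled = demo.fillna(" ")
glued = filled["Startup News"] + filled["Summary"] + filled["Description"]

# An explicit separator keeps adjacent columns from running together
spaced = filled["Startup News"] + " " + filled["Summary"] + " " + filled["Description"]
print(glued.iloc[0])
print(spaced.iloc[0])
```

Whether to add the separator is a design choice. It changes the tokens the later steps see, so the generated tags may differ from the ones shown in this notebook.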

In [8]:
data.head()

Unnamed: 0,Startup News,Summary,Posted By,Description,cumulative data
0,How PUBG has redefined the Indian gaming ecosy...,Indians are well and truly addicted to PUBG th...,Puneet Kumar,"Earlier this month, I went to my home town in ...",How PUBG has redefined the Indian gaming ecosy...
1,Dream11 closes $100 M funding led by Tencent; ...,,Vishal Krishna,Sports fantasy gaming company Dream11 has rais...,Dream11 closes $100 M funding led by Tencent; ...
2,How enterprise gaming has grown from a ridicul...,,Kamalika Bhattacharya,Having spent a large part of my career in the ...,How enterprise gaming has grown from a ridicul...
3,These MBA grads are changing the way India loo...,"From online tournaments to events, online poke...",Sindhu Kashyap,"A deal, bated breath, a flush, a straight hand...",These MBA grads are changing the way India loo...
4,Facebook is celebrating two years of encouragi...,,Team YS,"Two years ago, Facebook India launched a uniqu...",Facebook is celebrating two years of encouragi...


We then inspect the dataset after combining the columns.

In [9]:
text = data['cumulative data'].values.tolist()
my_stop_words = [u'say',u'\s',u'Mr',u'Mrs',u'said',u'says',u'saying']

We extracted the <b>cumulative data</b> column from the data dataframe and converted it to a plain Python list so the stories can be iterated over directly.
<br>Then, we defined our custom stop words in a list.</br>

In [10]:
for stopword in my_stop_words:
    nlp.vocab[stopword].is_stop = True

We iterated over the <b>my_stop_words</b> list and, through the pipeline's vocabulary (nlp.vocab), set the <b>is_stop</b> flag of each entry to True, i.e. each word in my_stop_words is now treated as a stop word by spaCy.
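The effect of the flag can be sketched on a single sentence. This example assumes a blank English pipeline (`spacy.blank`) so that no model download is needed, and uses "startup" as a hypothetical custom stop word; `nlp_demo` and `kept` are illustrative names.

```python
import spacy

# Blank English pipeline: tokenizer and default stop-word list, no trained model
nlp_demo = spacy.blank("en")

# Mark a hypothetical custom stop word on the vocabulary entry
nlp_demo.vocab["startup"].is_stop = True

doc = nlp_demo("The startup raised funding")

# Keep only tokens that are not flagged as stop words
kept = [tok.text for tok in doc if not tok.is_stop]
print(kept)
```

The flag is set on the exact string form given, so a capitalised variant may need its own entry, which is presumably why the list above contains both 'say' and 'Mr'-style spellings.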

## Applying topic modelling using the LDA algorithm

In [11]:
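def topic_modeller(texting):
    # Tokenize the story with spaCy
    doc = nlp(texting)
    article = []
    for w in doc:
        # Keep lemmas of tokens that are not stop words, punctuation or number-like
        if not w.is_stop and not w.is_punct and not w.like_num:
            article.append(w.lemma_)
    # Build a bigram phrase model; note that `texts` is not used further below
    bigram = gensim.models.Phrases(article)
    texts = [bigram[line] for line in article]
    # Treat each lemma as its own small document (split on whitespace)
    article = [d.split() for d in article]
    dictionary = Dictionary(article)
    corpus = [dictionary.doc2bow(text) for text in article]
    # Fit LDA with 8 topics; the fixed seed makes runs repeatable
    ldamodel = LdaModel(corpus=corpus,id2word=dictionary,num_topics=8,random_state=42)
    # Take 2 of the topics, formatted as weighted word strings
    ab = ldamodel.show_topics(num_topics=2)
    # Replace runs of 4+ non-letter characters (weights, quotes, operators) with a space
    b = re.sub("[^a-zA-Z]{4,}"," ",str(ab))
    b = b.split()
    # Deduplicate while preserving order
    unique_list = []
    for x in b:
        if x not in unique_list:
            unique_list.append(x)
    # Drop single-character leftovers
    line = [i for i in unique_list if len(i) > 1]
    return line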

Then, we defined a function called <b>topic_modeller</b> which accepts a string as input and tokenizes it with spaCy.
<br>We iterated over each token and checked whether it is a stop word, a punctuation mark or number-like.
If it is none of these, its lemma is appended to the article list.</br>
<br>Then, we built a bigram model with gensim's Phrases to detect words that frequently occur together, such as New York or Big Data. Note that the result, <b>texts</b>, is not used in the later steps.</br>
<br>Afterwards, we prepared the dictionary and the bag-of-words corpus required by LdaModel, treating each lemma as its own document. We then called LdaModel with the corpus and dictionary, and set the number of topics to 8.</br>
<br>From the 8 fitted topics, show_topics(num_topics=2) returns 2, each described by its highest-weighted words (10 per topic by default).</br>
<br>Then, we applied a regular expression that replaces every run of four or more non-letter characters (weights, quotes and operators) with a space, split the result into words, kept each word only once and dropped single-character leftovers.
Thereafter, we returned the list <b>line</b>, which contains the final tags for the input story.</br>
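The clean-up step at the end of the function can be sketched in isolation. The input below is a hypothetical string shaped like the stringified output of show_topics, shortened to three words per topic; the words and weights are made up for illustration. `dict.fromkeys` is used here as a compact alternative to the explicit loop for order-preserving deduplication.

```python
import re

# Hypothetical stand-in for str(ldamodel.show_topics(num_topics=2))
raw = str([
    (0, '0.050*"player" + 0.030*"game" + 0.020*"India"'),
    (5, '0.040*"game" + 0.030*"mobile" + 0.010*"x"'),
])

# Runs of 4+ non-letter characters (ids, weights, quotes, operators) become one space
cleaned = re.sub("[^a-zA-Z]{4,}", " ", raw)
words = cleaned.split()

# Keep the first occurrence of each word, preserving order
unique_words = list(dict.fromkeys(words))

# Drop single-character leftovers such as "x"
tags = [w for w in unique_words if len(w) > 1]
print(tags)  # ['player', 'game', 'India', 'mobile']
```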

In [12]:
result = []
for j in text:
    result.append(topic_modeller(j))
data['Tags'] = result
data['Tags'] = data['Tags'].str.join(', ')

export = data[['Startup News','Summary','Posted By','Description','cumulative data','Tags']]

We passed each story, one by one, to the topic_modeller function and stored the results in the <b>result</b> list.

We then stored the tags in the <b>Tags</b> column of the data dataframe, joined into a single comma-separated string per story, and selected the columns to keep into <b>export</b>.
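As a small illustration of the joining step, the sketch below applies `Series.str.join` to a hypothetical Series of tag lists (the tags and the names `tags` and `joined` are made up for the example).

```python
import pandas as pd

# Hypothetical tag lists, one per story
tags = pd.Series([["player", "game"], ["funding", "round", "million"]])

# Each list is collapsed into one comma-separated string
joined = tags.str.join(", ")
print(joined.tolist())  # ['player, game', 'funding, round, million']
```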

In [13]:
export.head()

Unnamed: 0,Startup News,Summary,Posted By,Description,cumulative data,Tags
0,How PUBG has redefined the Indian gaming ecosy...,Indians are well and truly addicted to PUBG th...,Puneet Kumar,"Earlier this month, I went to my home town in ...",How PUBG has redefined the Indian gaming ecosy...,"player, work, download, India, day, long, mobi..."
1,Dream11 closes $100 M funding led by Tencent; ...,,Vishal Krishna,Sports fantasy gaming company Dream11 has rais...,Dream11 closes $100 M funding led by Tencent; ...,"Dream, percent, gamer, million, round, Nazara,..."
2,How enterprise gaming has grown from a ridicul...,,Kamalika Bhattacharya,Having spent a large part of my career in the ...,How enterprise gaming has grown from a ridicul...,"mobile, opportunity, widespread, player, junk,..."
3,These MBA grads are changing the way India loo...,"From online tournaments to events, online poke...",Sindhu Kashyap,"A deal, bated breath, a flush, a straight hand...",These MBA grads are changing the way India loo...,"MadOverPoker, face, Abhishek, dramatic, change..."
4,Facebook is celebrating two years of encouragi...,,Team YS,"Two years ago, Facebook India launched a uniqu...",Facebook is celebrating two years of encouragi...,"encourage, sheleadstech, entrepreneur, resourc..."


Finally, we inspected the new dataframe <b>export</b>, which contains all the columns along with the generated tags.