Topic modeling is a type of statistical modeling used to uncover abstract "topics" that are present in a collection of documents. This process involves creating one topic per document template and words per topic template, modeled as Dirichlet distributions. Through this method, it is possible to gain a better understanding of the underlying themes and topics present in a set of documents.The field of Topic modeling has become increasingly important in recent years. Subject modeling is an unsupervised machine learning way to organize text (or image or DNA, etc.) information so that associated pieces of text can be identified.

## What is Topic Modelling?
In machine learning and natural language processing, topic modeling is a powerful statistical model used to uncover abstract topics that appear in a collection of documents. This text mining tool is often used to detect latent semantic structures in text. 

Intuitively, since a document is about a particular topic, one would expect that certain words would appear more or less frequently in the document: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in cat documents, and “the” and “is” will appear roughly equally in both. Generally, a document covers several topics in varying proportions; for example, in a document that is 10% cat and 90% dog, there would likely be about nine times more dog words than cat words. The “topics” generated by topic modeling techniques are clusters of similar words. 

A topic modeling machine learning model captures this concept mathematically, allowing us to analyze a set of documents and, based on the statistics of each word, determine the topics and the proportions of each topic in the document.

## Data
Let's begin our journey into topic modeling with Python by importing the necessary libraries we need for this task. To illustrate the process, we'll use a real-life example of research articles. The dataset we'll use is available for download from Kaggle.com here:https://github.com/amankharwal/Website-data/blob/master/topic%20modeling.zip

## Importing the necessary libraries
With the libraries imported and the dataset ready, we can now start our topic modeling journey.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Now, the next step is to read all the datasets that I am using in this task:

In [2]:
train = pd.read_csv("/content/drive/MyDrive/filename/Train.csv")
test = pd.read_csv("/content/drive/MyDrive/filename/Test.csv")
tags = pd.read_csv("/content/drive/MyDrive/filename/Tags.csv")
sample_sub = pd.read_csv("/content/drive/MyDrive/filename/SampleSubmission.csv")

## Exploratory Data Analysis
Exploratory Data Analysis explores the data to find the relationship between measures that tell us they exist, without the cause. They can be used to formulate hypotheses. EDA helps you discover relationships between measures in your data, which do not prove the existence of correlation, as indicated by the expression.

Now I will perform some EDA to find some patterns and relationships in the data before getting into topic modeling:

In [3]:
print(train.isna().sum)

<bound method NDFrame._add_numeric_operations.<locals>.sum of           id  ABSTRACT  Computer Science  Mathematics  Physics  Statistics  \
0      False     False             False        False    False       False   
1      False     False             False        False    False       False   
2      False     False             False        False    False       False   
3      False     False             False        False    False       False   
4      False     False             False        False    False       False   
...      ...       ...               ...          ...      ...         ...   
13999  False     False             False        False    False       False   
14000  False     False             False        False    False       False   
14001  False     False             False        False    False       False   
14002  False     False             False        False    False       False   
14003  False     False             False        False    False       False   

 

In [4]:
print(test.isna().sum)

<bound method NDFrame._add_numeric_operations.<locals>.sum of          id  ABSTRACT  Computer Science  Mathematics  Physics  Statistics
0     False     False             False        False    False       False
1     False     False             False        False    False       False
2     False     False             False        False    False       False
3     False     False             False        False    False       False
4     False     False             False        False    False       False
...     ...       ...               ...          ...      ...         ...
5997  False     False             False        False    False       False
5998  False     False             False        False    False       False
5999  False     False             False        False    False       False
6000  False     False             False        False    False       False
6001  False     False             False        False    False       False

[6002 rows x 6 columns]>


In [6]:
train["Number of Characters"] = train["ABSTRACT"].apply(lambda x: len(str(x)))
test["Number of Characters"] = test["ABSTRACT"].apply(lambda x: len(str(x)))
fig = make_subplots(rows=1, cols=2)
trace1 = go.Histogram(x = train["Number of Characters"])
fig.add_trace(trace1, row=1, col=1)

trace2 = go.Box(y = train["Number of Characters"])
fig.add_trace(trace2, row=1, col=2)
fig.update_layout(showlegend=False)
fig.show()

There is great variability in the number of characters in the Abstracts of the Train set. We have a minimum of 54 to a maximum of 4551 characters on the train. The median number of characters is 1065.

In [7]:

fig = make_subplots(rows=1, cols=2)
trace1 = go.Histogram(x = test["Number of Characters"])
fig.add_trace(trace1, row=1, col=1)

trace2 = go.Box(y = test["Number of Characters"])
fig.add_trace(trace2, row=1, col=2)
fig.update_layout(showlegend=False)
fig.show()

The test set looks better than the training set as the minimum number of characters in the test set is 46, while the maximum is 2841. So the median number of characters in the test set is 1058, which is very similar to the training set.

In [8]:
train['Number of Words'] = train['ABSTRACT'].apply(lambda x: len(str(x).split()))
test['Number of Words'] = test['ABSTRACT'].apply(lambda x: len(str(x).split()))
fig = make_subplots(rows = 1, cols = 2)
trace1 = go.Histogram(x = train['Number of Words'])
fig.add_trace(trace1, row = 1, col = 1)

trace2 = go.Box(y = train['Number of Words'])
fig.add_trace(trace2, row = 1, col = 2)

fig.update_layout(showlegend = False)
fig.show()

The learning set has a similar trend in the number of words as we have seen in the number of characters. Minimum of 8 words and maximum of 665 words. So the median word count is 153.

In [9]:
fig = make_subplots(rows = 1, cols = 2)
trace1 = go.Histogram(x = test['Number of Words'])
fig.add_trace(trace1, row = 1, col = 1)

trace2 = go.Box(y = test['Number of Words'])
fig.add_trace(trace2, row = 1, col = 2)

fig.update_layout(showlegend = False)
fig.show()

Minimum of 7 words in an abstract and maximum of 452 words in the test set. The median here is exactly the same as that observed in the training set and is equal to 153.

## Topic Modeling Using Tags


In [10]:
main_tags = ['Computer Science',
 'Mathematics',
 'Physics',
 'Statistics']

countTagsTrain = pd.DataFrame(train[main_tags].sum(axis = 0) / len(train))
countTagsTest = pd.DataFrame(test[main_tags].sum(axis = 0) / len(test))

trace0 = go.Bar(x = countTagsTrain.index, y = countTagsTrain[0],name = 'Train Set')
trace1 = go.Bar(x = countTagsTest.index, y = countTagsTest[0],name = 'Test Set')

fig = go.Figure([trace0,trace1])
fig.show()