# Topic Modelling

Topic Modelling is a Natural Language Processing technique to uncover hidden topics from text documents. It helps identify topics of the text documents to find relationships between the content of a text document and the topic.

In [27]:
!pip install nltk



In [28]:
#Importing libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from  nltk.stem.wordnet import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [29]:
data = pd.read_csv('/kaggle/input/articles/articles.csv',encoding = 'latin1')

In [30]:
print(data.head(10))

                                             Article  \
0  Data analysis is the process of inspecting and...   
1  The performance of a machine learning algorith...   
2  You must have seen the news divided into categ...   
3  When there are only two classes in a classific...   
4  The Multinomial Naive Bayes is one of the vari...   
5  You must have seen the news divided into categ...   
6  Natural language processing or NLP is a subfie...   
7  By using a third-party application or API to m...   
8  Twitter is one of the most popular social medi...   
9  The squid game is currently one of the most tr...   

                                               Title  
0                  Best Books to Learn Data Analysis  
1         Assumptions of Machine Learning Algorithms  
2          News Classification with Machine Learning  
3  Multiclass Classification Algorithms in Machin...  
4        Multinomial Naive Bayes in Machine Learning  
5          News Classification with Machine Learning 

As we are working on a Natural Language Processing problem, we need to clean the textual content by removing punctuation and stopwords. Here’s how we can clean the textual data:

In [31]:
def preprocess_text(text):
    #to lower case
    text = text.lower()
    #remove punctuations
    text = text.translate(str.maketrans('','',string.punctuation))
    #Tokenize the text
    tokens = nltk.word_tokenize(text)
    #Remove Stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    #Lemmatize tokes
    lemma = WordNetLemmatizer()
    tokens = [lemma.lemmatize(word) for word in tokens]
    #Join tokens to form preprocessed text
    preprocessed_tokens = ''.join(tokens)

    return preprocessed_tokens
data['Cleaned_Article'] = data['Article'].apply(preprocess_text)

Now we need to convert the textual data into a numerical representation. We can use text vectorization here:

In [32]:
#vectorizer = TfidfVectorizer()
vectorizer = CountVectorizer(max_df= 0.95,min_df=2,stop_words = 'english')
x= vectorizer.fit_transform(data['Article'].values)

Now we will use an algorithm to identify relationships between the textual data to assign topic labels. We can use the Latent Dirichlet Allocation algorithm for this task. Latent Dirichlet Allocation (LDA) is a generative probabilistic algorithm used to uncover the underlying topics in a corpus of textual data. Let’s use the LDA algorithm to assign topic labels:

In [33]:
lda = LatentDirichletAllocation(n_components=5,random_state=42)
lda.fit(x)

In [34]:
topic_modelling = lda.transform(x)

In [35]:
topic_labels = np.argmax(topic_modelling,axis =1)
data['topic_labels'] = topic_labels

Now here’s the final data with topic labels:

In [36]:
print(data.head())

                                             Article  \
0  Data analysis is the process of inspecting and...   
1  The performance of a machine learning algorith...   
2  You must have seen the news divided into categ...   
3  When there are only two classes in a classific...   
4  The Multinomial Naive Bayes is one of the vari...   

                                               Title  \
0                  Best Books to Learn Data Analysis   
1         Assumptions of Machine Learning Algorithms   
2          News Classification with Machine Learning   
3  Multiclass Classification Algorithms in Machin...   
4        Multinomial Naive Bayes in Machine Learning   

                                     Cleaned_Article  topic_labels  
0  dataanalysisprocessinspectingexploringdatagene...             1  
1  performancemachinelearningalgorithmparticulard...             0  
2  mustseennewsdividedcategorygonewswebsitepopula...             1  
3  twoclassclassificationproblemproblembinaryclas.

In [37]:
data['topic_labels'].unique()

array([1, 0, 3, 4, 2])

So this is how you can assign topic labels with Machine Learning using the Python programming language

In [38]:
feature_names =vectorizer.get_feature_names_out()

In [39]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}: ", end='')
        print(", ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

print_top_words(lda, feature_names, n_top_words=10)

Topic 1: learning, machine, algorithms, deep, books, best, algorithm, use, applications, introduce
Topic 2: python, data, learning, using, learn, machine, want, news, language, stock
Topic 3: insurance, people, want, learn, using, python, analysis, task, sentiment, analyze
Topic 4: clustering, machine, learning, algorithm, using, classification, python, algorithms, implementation, clusters
Topic 5: algorithm, bayes, learning, naive, based, clustering, machine, classification, introduction, similar


In [40]:
topic_names = {0:'Algorithms', 1:'Machine Learning',2:'Python',3:'Classification',4:'Clustering'}

In [41]:
data['topic_names'] = data['topic_labels'].map(topic_names)

In [42]:
panel = pyLDAvis.prepare(
    topic_term_dists=lda.components_ / lda.components_.sum(axis=1)[:, None],
    doc_topic_dists=lda.transform(x),
    doc_lengths=x.sum(axis=1).A1,
    vocab=vectorizer.get_feature_names_out(),
    term_frequency=x.sum(axis=0).A1
)

pyLDAvis.enable_notebook()
pyLDAvis.display(panel)