https://thecleverprogrammer.com/2023/02/13/topic-modelling-using-python/

# 1. Importing the libraries:

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\karki\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\karki\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\karki\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\karki\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# 2. Loading the dataset:

In [3]:
data = pd.read_csv("data/articles.csv", encoding='latin1')
data.head()

Unnamed: 0,Article,Title
0,Data analysis is the process of inspecting and...,Best Books to Learn Data Analysis
1,The performance of a machine learning algorith...,Assumptions of Machine Learning Algorithms
2,You must have seen the news divided into categ...,News Classification with Machine Learning
3,When there are only two classes in a classific...,Multiclass Classification Algorithms in Machin...
4,The Multinomial Naive Bayes is one of the vari...,Multinomial Naive Bayes in Machine Learning


In [4]:
data['Title'].value_counts()

Title
News Classification with Machine Learning                    2
Best Books to Learn Data Analysis                            1
Tata Motors Stock Price Prediction with Machine Learning     1
Mean Shift Clustering in Machine Learning                    1
BIRCH Clustering in Machine Learning                         1
Agglomerative Clustering in Machine Learning                 1
DBSCAN Clustering in Machine Learning                        1
K-Means Clustering in Machine Learning                       1
Animated Scatter Plot using Python                           1
Apple Stock Price Prediction with Machine Learning           1
For Loop Over Keys and Values in a Python Dictionary         1
Best Books to Learn Deep Learning                            1
Applications of Deep Learning                                1
Introduction to Recommendation Systems                       1
Use Cases of Different Machine Learning Algorithms           1
Naive Bayes Algorithm in Machine Learning        

In [5]:
data.shape

(34, 2)

# 3. Data Cleaning:

Because we are working with text, we first clean it by removing punctuation and stop words: they carry little topical information and would otherwise inflate the vocabulary. We also lemmatize each token so that inflected forms (for example, "books" and "book") share one vocabulary entry.

To do that, we create a function called `preprocess_text()` and then apply it to the `Article` column.

In [6]:
def preprocess_text(text):
    """
    Clean a text: lower-case it, remove punctuation and stop words, then lemmatize.
    """

    # 1. First convert the text to lower case:
    text = text.lower()

    # 2. Remove the punctuation:
    text = text.translate(str.maketrans('', '', string.punctuation))

    # 3. Tokenize the text:
    tokens = nltk.word_tokenize(text)

    # 4. Remove stop words:
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # 5. Lemmatization:
    lemma = WordNetLemmatizer()
    clean_tokens = [lemma.lemmatize(token) for token in tokens]

    # 6. Finally join the tokens to form the cleaned text:
    pre_processed_text = ' '.join(clean_tokens)

    return pre_processed_text

In [7]:
data['Article'][0]

'Data analysis is the process of inspecting and exploring data generated by a particular population to find the information needed to make decisions and draw conclusions. With the use of data in decision making, most businesses today need data analysts. So, if you want to know about the best books to learn data analysis, this article is for you. In this article, I will introduce you to some of the best books to learn data analysis.'

In [8]:
data['Article'] = data['Article'].apply(preprocess_text)

In [9]:
data['Article'][0]

'data analysis process inspecting exploring data generated particular population find information needed make decision draw conclusion use data decision making business today need data analyst want know best book learn data analysis article article introduce best book learn data analysis'
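
One side effect of step 2 is worth knowing about: because `str.translate` deletes punctuation outright, hyphenated words are fused, which is presumably why later outputs contain tokens such as `thirdparty` and `kmeans`. The sketch below compares that behavior with one possible alternative that maps punctuation to spaces instead. The helper names `strip_punctuation` and `space_punctuation` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import string


def strip_punctuation(text):
    """Delete punctuation outright, as preprocess_text() does in step 2."""
    return text.translate(str.maketrans('', '', string.punctuation))


def space_punctuation(text):
    """Hypothetical alternative: replace punctuation with spaces, then collapse whitespace."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())


# Made-up sample sentence, not taken from the dataset.
sample = "third-party apps, k-means clustering."

print(strip_punctuation(sample))  # hyphenated words are fused
print(space_punctuation(sample))  # hyphenated words are split
```

Neither choice is strictly better: fusing keeps `kmeans` as one token, while splitting leaves fragments such as `k` and `means` in the vocabulary.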

# 4. Creating the vector space for the text:

In [10]:
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(data['Article'].values)

In [11]:
x

<34x413 sparse matrix of type '<class 'numpy.float64'>'
	with 908 stored elements in Compressed Sparse Row format>

The result is a sparse matrix with one row per article and one column per vocabulary term. If we convert the first row to a dense array, it looks like this:

In [12]:
x[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.36646841,
        0.16015137, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.0829327 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.2053336 , 0.        , 0.        , 0.        ,
        0.24431227, 0.        , 0.        , 0.        , 0.16015137,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.16015137, 0.        , 0.        , 0.  

# 5. Using LDA for topic assignment:

In [13]:
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(x)
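
The model above is fitted on TF-IDF weights. LDA is formulated as a model of word counts, so a commonly used alternative is to feed it the output of `CountVectorizer`, which this notebook imports but does not use. The following is only a sketch of that variant on a made-up mini corpus; `docs_demo`, `count_vectorizer`, and `lda_counts` are hypothetical names, and whether counts give better topics on the articles dataset has not been checked here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-cleaned mini corpus (not the articles dataset).
docs_demo = [
    "data analysis book learn data analysis",
    "best book learn deep learning",
    "kmeans clustering algorithm machine learning",
    "naive bayes algorithm machine learning",
]

# Raw term counts instead of TF-IDF weights.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs_demo)

# Two topics are enough for four tiny documents.
lda_counts = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda_counts.fit_transform(counts)

print(counts.shape)        # (documents, vocabulary size)
print(doc_topic.round(2))  # one row of topic weights per document
```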

In [14]:
topic_modelling = lda.transform(x)

In [15]:
topic_labels = np.argmax(topic_modelling, axis=1)

In [16]:
topic_labels

array([2, 3, 1, 3, 1, 1, 2, 4, 0, 2, 0, 1, 2, 3, 0, 1, 1, 3, 0, 2, 3, 1,
       1, 3, 4, 3, 3, 2, 1, 3, 1, 0, 4, 3], dtype=int64)
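
`lda.transform` returns one row of topic weights per document, and `np.argmax(..., axis=1)` keeps only the index of the largest weight. The sketch below uses made-up weights (the array `doc_topic` is an assumption, not output from the model above) to show that the label alone hides how decisive the assignment was.

```python
import numpy as np

# Hypothetical document-topic weights: three documents, five topics.
doc_topic = np.array([
    [0.05, 0.05, 0.80, 0.05, 0.05],  # clearly topic 2
    [0.10, 0.10, 0.10, 0.60, 0.10],  # mostly topic 3
    [0.22, 0.20, 0.20, 0.19, 0.19],  # nearly uniform
])

# Index of the largest weight in each row.
labels = np.argmax(doc_topic, axis=1)

# The largest weight itself, as a rough measure of how dominant the topic is.
confidence = doc_topic.max(axis=1)

for label, conf in zip(labels, confidence):
    print(f"topic {label} (weight {conf:.2f})")
```

The third document still receives a label even though no topic stands out, so keeping the maximum weight next to the label can help flag weak assignments.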

In [17]:
data['topic_labels'] = topic_labels

In [18]:
data.head(20)

Unnamed: 0,Article,Title,topic_labels
0,data analysis process inspecting exploring dat...,Best Books to Learn Data Analysis,2
1,performance machine learning algorithm particu...,Assumptions of Machine Learning Algorithms,3
2,must seen news divided category go news websit...,News Classification with Machine Learning,1
3,two class classification problem problem binar...,Multiclass Classification Algorithms in Machin...,3
4,multinomial naive bayes one variant naive baye...,Multinomial Naive Bayes in Machine Learning,1
5,must seen news divided category go news websit...,News Classification with Machine Learning,1
6,natural language processing nlp subfield artif...,Best Books to Learn NLP,2
7,using thirdparty application api manage functi...,Send Instagram Messages using Python,4
8,twitter one popular social medium apps people ...,Pfizer Vaccine Sentiment Analysis using Python,0
9,squid game currently one trending show netflix...,Squid Game Sentiment Analysis using Python,2


In [19]:
data[data['topic_labels'] == 4]['Article'][7]

'using thirdparty application api manage functionality application automating application send message post photo video follow someone without opening instagram directly mean automating instagram want learn send instagram message automatically using python article article present tutorial send instagram message using python'

In [20]:
data[data['topic_labels'] == 4]['Article'][24]

'kmeans clustering clustering algorithm capable clustering unlabeled dataset quickly efficiently iteration article take kmeans clustering machine learning using python'

In [21]:
data[data['topic_labels'] == 4]['Article'][32]

'machine learning naive bayes algorithm based bayes theorem na\x8bve assumption make easier train model assuming feature independent article give introduction naive bayes algorithm machine learning implementation using python'

Let's define five topic names for the labels 0 to 4:

In [22]:
topics = {
    '0': 'Introduction to NLP, Sentiment Analysis, and Computer Vision',
    '1': 'News Classification and Naive Bayes in Machine Learning',
    '2': 'Best Resources for Learning Data Science and NLP',
    '3': 'Machine Learning Assumptions and Practical Implementations',
    '4': 'Automating Tasks with APIs and Clustering Techniques in Machine Learning'
}

Next, we map each integer in the `topic_labels` column to its topic name and store the result in a new column.

The data already has a `Title` column, which we never used because LDA determined the 5 topics from the article text alone. We therefore drop it first and reuse the name `Title` for the new topic names:

In [23]:
data.columns

Index(['Article', 'Title', 'topic_labels'], dtype='object')

In [24]:
# .copy() avoids pandas' SettingWithCopyWarning when a column is added to this subset later
data = data[['Article', 'topic_labels']].copy()

In [25]:
data.head()

Unnamed: 0,Article,topic_labels
0,data analysis process inspecting exploring dat...,2
1,performance machine learning algorithm particu...,3
2,must seen news divided category go news websit...,1
3,two class classification problem problem binar...,3
4,multinomial naive bayes one variant naive baye...,1


In [26]:
data['Title'] = data['topic_labels'].astype(str).map(topics)

In [27]:
data.head()

Unnamed: 0,Article,topic_labels,Title
0,data analysis process inspecting exploring dat...,2,Best Resources for Learning Data Science and NLP
1,performance machine learning algorithm particu...,3,Machine Learning Assumptions and Practical Imp...
2,must seen news divided category go news websit...,1,News Classification and Naive Bayes in Machine...
3,two class classification problem problem binar...,3,Machine Learning Assumptions and Practical Imp...
4,multinomial naive bayes one variant naive baye...,1,News Classification and Naive Bayes in Machine...


Finally, with the topics defined, a new piece of text can be assigned a topic by running it through the same preprocessing, the fitted vectorizer, and the fitted LDA model:

In [28]:
def topic_modeling():
    # input() already returns a string, so no str() conversion is needed
    text = input("Enter the text to identify its title: ")
    pre_processed_text = preprocess_text(text)
    # Reuse the fitted vectorizer and LDA model (transform only, no refitting)
    transformed_text = vectorizer.transform([pre_processed_text])
    modeling = lda.transform(transformed_text)
    # Pick the topic with the largest weight
    topic = np.argmax(modeling, axis=1)

    print(f"The topic for this text is: \nTopic Number: {topic[0]} \nTitle: {topics[str(topic[0])]}")

In [29]:
topic_modeling()

Enter the text to identify its title: Now, we will apply a function to the dataframe on topic_labels column and add a new column where the integers will be mapped to their respective title names:  Since, this data already had the Title column which we never needed and instead used LDA to determine 5 topics based on the text values, we will drop it and use these new titles:
The topic for this text is: 
Topic Number: 0 
Title: Introduction to NLP, Sentiment Analysis, and Computer Vision
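
Because `topic_modeling()` reads from `input()` and prints its result, it is awkward to reuse or test. A possible non-interactive variant is sketched below: it takes the text and the fitted objects as arguments and returns the result. The function name `predict_topic`, its parameters, and all `*_demo` stand-ins are assumptions made so the sketch runs on its own; in the notebook one would pass `preprocess_text`, `vectorizer`, `lda`, and `topics` instead.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer


def predict_topic(text, preprocess, vectorizer, lda, topics):
    """Return (topic_number, title) for one piece of raw text."""
    cleaned = preprocess(text)
    # transform (not fit_transform): the fitted vocabulary must stay unchanged.
    weights = lda.transform(vectorizer.transform([cleaned]))
    topic = int(np.argmax(weights, axis=1)[0])
    return topic, topics[str(topic)]


# Hypothetical stand-ins for the notebook's fitted objects.
docs_demo = [
    "data analysis book learn data analysis",
    "best book learn deep learning",
    "kmeans clustering algorithm machine learning",
    "naive bayes algorithm machine learning",
]
vectorizer_demo = TfidfVectorizer().fit(docs_demo)
lda_demo = LatentDirichletAllocation(n_components=2, random_state=42)
lda_demo.fit(vectorizer_demo.transform(docs_demo))
topics_demo = {'0': 'Demo topic A', '1': 'Demo topic B'}

# str.lower is a simplified placeholder for preprocess_text().
topic, title = predict_topic(
    "Clustering algorithm for machine learning",
    str.lower,
    vectorizer_demo,
    lda_demo,
    topics_demo,
)
print(topic, title)
```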
