https://thecleverprogrammer.com/2023/02/13/topic-modelling-using-python/

# 1. Importing the libraries:

In [8]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [9]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\karki\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\karki\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\karki\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\karki\AppData\Roaming\nltk_data...


True

# 2. Loading the dataset:

In [7]:
data = pd.read_csv("data/articles.csv", encoding='latin1')
data.head()

Unnamed: 0,Article,Title
0,Data analysis is the process of inspecting and...,Best Books to Learn Data Analysis
1,The performance of a machine learning algorith...,Assumptions of Machine Learning Algorithms
2,You must have seen the news divided into categ...,News Classification with Machine Learning
3,When there are only two classes in a classific...,Multiclass Classification Algorithms in Machin...
4,The Multinomial Naive Bayes is one of the vari...,Multinomial Naive Bayes in Machine Learning


# 3. Data Cleaning:

Because we are working with text, we first need to clean it. Punctuation and stop words carry little information about what an article is about and only add noise to the vocabulary, so we remove them and then lemmatize the remaining words.

To do that, we define a function called `preprocess_text()` and apply it to the `Article` column.

In [15]:
def preprocess_text(text):
    """
    Clean the text: lowercase it, strip punctuation, remove stop words, and lemmatize the remaining tokens.
    """

    # 1. First convert the text to lowercase:
    text = text.lower()

    # 2. Remove punctuation:
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # 3. Tokenize the text:
    tokens = nltk.word_tokenize(text)
    
    # 4. Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    # 5. Lemmatization:
    lemma = WordNetLemmatizer()
    clean_tokens = [lemma.lemmatize(token) for token in tokens]
    
    # 6. Finally join the tokens to form a cleaned text:
    pre_processed_text = ' '.join(clean_tokens)
    
    return pre_processed_text
                          

In [17]:
data['Article'][0]

'Data analysis is the process of inspecting and exploring data generated by a particular population to find the information needed to make decisions and draw conclusions. With the use of data in decision making, most businesses today need data analysts. So, if you want to know about the best books to learn data analysis, this article is for you. In this article, I will introduce you to some of the best books to learn data analysis.'

In [18]:
data['Article'] = data['Article'].apply(preprocess_text)

In [19]:
data['Article'][0]

'data analysis process inspecting exploring data generated particular population find information needed make decision draw conclusion use data decision making business today need data analyst want know best book learn data analysis article article introduce best book learn data analysis'
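
One side effect of the punctuation step is worth keeping in mind: because `str.translate` deletes the characters outright, hyphenated words fuse into a single token and contractions lose their apostrophe before the stop-word filter sees them. The sketch below uses only the standard library and a made-up `sample` sentence to show this. The `space_table` variant is a hypothetical alternative, not something the notebook uses.

```python
import string

# Step 2 of preprocess_text: delete every ASCII punctuation character.
strip_table = str.maketrans('', '', string.punctuation)

# Hypothetical alternative (not in the notebook): map each punctuation
# character to a space, so hyphenated words split instead of fusing.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

sample = "state-of-the-art models don't overfit."  # made-up example sentence

stripped = sample.translate(strip_table)
# Collapse the runs of spaces left behind by the replacement.
spaced = ' '.join(sample.translate(space_table).split())

print(stripped)  # stateoftheart models dont overfit
print(spaced)    # state of the art models don t overfit
```

Which behaviour is preferable depends on the corpus. Deleting keeps the vocabulary small, while replacing with spaces keeps the parts of compound words as separate tokens.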

# 4. Creating the vector space for the text:

In [21]:
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(data['Article'].values)

In [28]:
x

<34x413 sparse matrix of type '<class 'numpy.float64'>'
	with 908 stored elements in Compressed Sparse Row format>

The result is a sparse matrix with one row per article (34) and one column per vocabulary term (413). If we convert the first row to a dense array, it looks like this:

In [27]:
x[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.36646841,
        0.16015137, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.0829327 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.2053336 , 0.        , 0.        , 0.        ,
        0.24431227, 0.        , 0.        , 0.        , 0.16015137,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.16015137, 0.        , 0.        , 0.  

# 5. Using LDA for topic assignment: