## Topic Modeling Using Latent Dirichlet Allocation (LDA) on NPR Dataset.
<b> We will be using articles from NPR (National Public Radio), obtained from their website www.npr.org.
    
<b> The objective of this project is to produce a visual representation of the dominant topics.
    
<b> A second goal is to understand the topic modeling pipeline and be able to implement it.
    
<b> Topic modeling tries to group objects (documents, in this case) on the basis of the words they share.
If two documents contain similar words, there is a good chance that they belong to the same category.

### <b> Importing all the necessary libraries.

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.simplefilter("ignore")

### Loading the Dataset

<b> Load the dataset with read_csv(), store it in the 'df' variable, and look at the first 5 rows using the head() method.

In [2]:
# Load the dataset 
df = pd.read_csv("npr.csv")

# Display the first 5 rows using the head() method.
df.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


### Checking info of our data by using info() method.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Article  1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB


<b> We can see that our data has just one column, named Article, with 1000 entries and no missing values.

### Text Preprocessing (Text Cleaning)
<b> Text preprocessing involves transforming text into a clean and consistent format that can then be fed into a model for further analysis and learning.
- <b> Remove punctuation
- <b> Remove Stop words
- <b> Stemming/Lemmatization
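The cleaning steps listed above can be illustrated on a single sentence. This is a minimal sketch, not the notebook's own code: `STOP_WORDS` is a tiny hypothetical stop word set and `clean_text` is a hypothetical helper, both introduced only for illustration. Lemmatization is left out so that the example runs without downloading any NLTK data.

```python
import re

# Hypothetical, tiny stop word set for illustration only;
# the notebook itself uses NLTK's English stop word list.
STOP_WORDS = {"the", "is", "a", "on", "and", "of"}

def clean_text(text):
    """Replace non-letters with spaces, lowercase, tokenize, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean_text("The cat, surprisingly, is on the mat!")
print(cleaned)  # cat surprisingly mat
```

Punctuation disappears because every non-letter character is replaced by a space before the text is split into words.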

<b> Importing all the necessary libraries for Text Cleaning.

In [4]:
import nltk                                    # Natural Language Toolkit
import re                                      # regular expression library
from nltk.corpus import stopwords              # stop word lists from nltk.corpus
from nltk.stem import WordNetLemmatizer        # WordNetLemmatizer class from nltk.stem
wnl = WordNetLemmatizer()                      # initialize the WordNetLemmatizer class as "wnl"
# On first use, download the required data:
# nltk.download('stopwords'); nltk.download('wordnet')

In [5]:
# initialize the empty "corpus" list
corpus = []

# build the stop word set once instead of once per word
stop_words = set(stopwords.words('english'))

# iterate over every article
for i in range(len(df)):
    # replace every non-letter character (punctuation, digits) with a space
    rp = re.sub('[^a-zA-Z]', " ", df['Article'][i])
    # lowercase the text
    rp = rp.lower()
    # split the text into a list of words
    rp = rp.split()
    # drop stop words and lemmatize the remaining words
    rp = [wnl.lemmatize(word) for word in rp if word not in stop_words]
    # join the words back into a single string
    rp = " ".join(rp)
    # append the cleaned article to the corpus list
    corpus.append(rp)

# print the corpus
print(corpus)



<b> Text cleaning is now done: we removed punctuation, removed stop words, and applied lemmatization.
    
<b> Before running LDA, we need one more preprocessing step: a feature extraction method that converts the tokens (words) of each article into a matrix of occurrence counts. This can be done with CountVectorizer.
    
<b> Creating a Document Term Matrix (DTM) of our data.
    
<b> Vectorization is the process of converting textual data into numerical vectors. It is usually applied once the text has been cleaned.

<b> Here we apply CountVectorizer(), one of the simplest techniques for converting text into vectors.

In [6]:
# import the CountVectorizer class from the sklearn.feature_extraction.text module
from sklearn.feature_extraction.text import CountVectorizer

# create the object "cv" of class CountVectorizer
cv = CountVectorizer()

# fit and transform the corpus and store the resulting document-term matrix as "X"
X = cv.fit_transform(corpus)


<b> Since the articles are not labeled, this is unsupervised learning: we fit and transform the CountVectorizer on the whole dataset without creating a train/test split.
    
### Modelling the Latent Dirichlet Allocation object (LDA):
    
<b> Now, our data is ready to be subjected to LDA.
    
<b> Let’s import LatentDirichletAllocation from sklearn and create an instance of it.

<b> The most important hyperparameter to set is n_components, the number of topics.
There is no right or wrong value for n_components; it depends on domain knowledge and requirements. It must be an integer N, and the model returns N generalized topics. If you want finer detail, such as subcategories of politics like international versus national politics, use a higher value. Here we stick to 7, so that the articles are clustered into 7 generalized topics. Because LDA involves randomness, it is better to fix random_state so that results are reproducible; we use 42.
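One rough way to compare candidate values of n_components is to fit a model for each and look at its perplexity (lower is generally better). The sketch below is an illustration under stated assumptions, not the method used in this notebook: `toy_docs` is a made-up corpus, the candidate values 2, 3 and 4 are arbitrary, and perplexity measured on the training data is only a coarse guide. Held-out data and human judgment of the topics matter more.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus for illustration only
toy_docs = [
    "president senate election vote",
    "senate vote law president",
    "school student teacher class",
    "student class exam school",
    "health drug doctor patient",
    "patient doctor hospital health",
]
toy_X = CountVectorizer().fit_transform(toy_docs)

perplexities = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=42)
    lda.fit(toy_X)
    # perplexity() scores how well the fitted model explains the given matrix
    perplexities[k] = lda.perplexity(toy_X)

for k, p in perplexities.items():
    print(f"n_components={k}: perplexity={p:.1f}")
```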

In [8]:
# let’s import the Latent Dirichlet Allocation from sklearn
from sklearn.decomposition import LatentDirichletAllocation

# create an instance of the same as "model" and pass the n_components=7 and random_state=42
model = LatentDirichletAllocation(n_components=7, random_state=42)

# fit the model and store the result in 'topic_results' object
topic_results = model.fit_transform(X)

### Printing a list of top features/words on which clustering will be done.
<b> We are going to find the most probable words for each of the 7 topics. Here we display the top 15 words of each topic.

In [9]:
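# get the vocabulary once; get_feature_names() was removed in scikit-learn 1.2,
# so get_feature_names_out() is used here (available from scikit-learn 1.0)
feature_names = cv.get_feature_names_out()

# iterate over the topics; each row of model.components_ holds one topic's word weights
for i, arr in enumerate(model.components_):
    print(f'TOP 15 WORDS FOR TOPIC #{i}')
    # argsort() is ascending, so the last 15 indices are the 15 highest-weighted words
    print([feature_names[j] for j in arr.argsort()[-15:]])
    print('\n\n')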

TOP 15 WORDS FOR TOPIC #0
['department', 'time', 'administration', 'one', 'white', 'senate', 'people', 'state', 'obama', 'say', 'also', 'would', 'president', 'said', 'trump']



TOP 15 WORDS FOR TOPIC #1
['first', 'home', 'many', 'make', 'orphan', 'company', 'would', 'like', 'new', 'people', 'said', 'drug', 'year', 'one', 'say']



TOP 15 WORDS FOR TOPIC #2
['state', 'also', 'plan', 'year', 'american', 'house', 'say', 'president', 'russia', 'republican', 'obama', 'health', 'would', 'trump', 'said']



TOP 15 WORDS FOR TOPIC #3
['life', 'animal', 'way', 'human', 'facebook', 'get', 'also', 'said', 'time', 'people', 'new', 'one', 'year', 'like', 'say']



TOP 15 WORDS FOR TOPIC #4
['get', 'show', 'day', 'first', 'world', 'make', 'life', 'way', 'new', 'time', 'people', 'year', 'like', 'say', 'one']



TOP 15 WORDS FOR TOPIC #5
['percent', 'time', 'think', 'would', 'state', 'child', 'one', 'like', 'year', 'care', 'student', 'health', 'people', 'school', 'say']



TOP 15 WORDS FOR TOPIC #6
[

<b> arr.argsort() returns the indices that sort the words of a topic by weight (how strongly each word is associated with that topic) in ascending order. We take the last 15 indices, which correspond to the 15 most probable words for that topic.
    
<b> cv.get_feature_names() (get_feature_names_out() in scikit-learn 1.0 and later) returns all the words in our vocabulary, in the same order as the columns of the document-term matrix.
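The argsort-and-slice pattern described above can be checked on a small example. The vocabulary and weights here are made up for illustration; in the notebook they come from the CountVectorizer and from one row of `model.components_`.

```python
import numpy as np

# Hypothetical vocabulary and per-word weights for a single topic
vocab = np.array(["apple", "bank", "court", "drug", "election"])
weights = np.array([0.2, 3.1, 0.7, 1.5, 2.4])

order = weights.argsort()   # indices from lowest to highest weight
top3 = vocab[order[-3:]]    # last 3 indices = the 3 highest-weighted words

print(order)   # [0 2 3 4 1]
print(top3)    # ['drug' 'election' 'bank']
```

Because the sort is ascending, the most probable word of the topic is the last one in the printed list.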
    
<b> The output above lists the top 15 words for topics #0 through #6.
    
<b> Now let us assign a topic number to each article and add it to the dataframe alongside the article text. To do that, we pick the most likely topic for each row of the document-topic matrix.

In [10]:
# print the shape of the topic_results object.
print(topic_results.shape)

# print the first record/row of the topic_results
print(topic_results[0])

(1000, 7)
[1.91999774e-01 2.08497436e-04 7.73637326e-01 2.08620531e-04
 2.08693809e-04 3.35284409e-02 2.08647367e-04]


<b> From the result above, the shape of topic_results shows that every article has an array of 7 values. These values are the probabilities of the article belonging to each of the 7 topics. Let us round the values for the article at index 0 to make them easier to read.

In [11]:
# rounding the first record/row of the topic_results with 2 decimals.
topic_results[0].round(2)

array([0.19, 0.  , 0.77, 0.  , 0.  , 0.03, 0.  ])

<b> The result above shows the probability of the article belonging to each topic, where the position of a value is the topic number. In this case, the article at index 0 has the following probabilities:
    
**Topic #0**: probability = 0.19  
**Topic #1**: probability = 0  
**Topic #2**: probability = 0.77  
**Topic #3**: probability = 0  
**Topic #4**: probability = 0  
**Topic #5**: probability = 0.03  
**Topic #6**: probability = 0  

<b> From this we can conclude that the article most probably belongs to Topic #2, so we assign Topic #2 to this article.

<b> We can do the same for every article with the simple code below, which creates an additional column carrying the topic number of each article. Before going forward, let us familiarize ourselves with the argmax() function.
    
<b> numpy.argmax(array, axis=None, out=None): returns the indices of the maximum elements of the array along the given axis.
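As a quick illustration of argmax along axis 1, consider a small document-topic matrix. The first row is the rounded output shown above for the article at index 0; the second row is made up for illustration.

```python
import numpy as np

doc_topic = np.array([
    [0.19, 0.00, 0.77, 0.00, 0.00, 0.03, 0.00],  # article 0 (rounded values from above)
    [0.60, 0.10, 0.05, 0.05, 0.10, 0.05, 0.05],  # hypothetical second article
])

# axis=1 looks along each row, returning the column (topic) with the largest value
assigned = doc_topic.argmax(axis=1)
print(assigned)  # [2 0]
```

With axis=1, one topic index is returned per article, which is the shape needed for a new dataframe column.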

In [12]:
# giving topic numbers to documents/dataset.
df['Topic'] = topic_results.argmax(axis=1)

# print the data.
df

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",2
1,Donald Trump has used Twitter — his prefe...,0
2,Donald Trump is unabashedly praising Russian...,0
3,"Updated at 2:50 p. m. ET, Russian President Vl...",2
4,"From photography, illustration and video, to d...",4
...,...,...
995,"When last spotted in his indigenous habitat, J...",5
996,"On Wednesday morning, a Red Cross staffer in A...",4
997,There’s a vibrance to the current music of Esm...,4
998,"Like many awards shows, the Grammys are about ...",4


<b> We have now assigned a topic number to every article.
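With the topic numbers in place, a natural next step toward a visual summary is to count how many articles fall under each topic. The sketch below uses a hypothetical `toy_df` holding only the nine topic labels visible in the table above, so the counts are not those of the full dataset.

```python
import pandas as pd

# Topic labels of the nine rows visible in the table above (not the full dataset)
toy_df = pd.DataFrame({"Topic": [2, 0, 0, 2, 4, 5, 4, 4, 4]})

# Number of articles per topic, ordered by topic number
topic_counts = toy_df["Topic"].value_counts().sort_index()
print(topic_counts)
```

If matplotlib is installed, calling `topic_counts.plot(kind="bar")` would be one simple way to turn these counts into a bar chart.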
    
<b> To repeat, there is no single right or wrong way to do topic modeling. It depends entirely on the user's needs, based on the domain or the customer's requirements.