<a href="https://colab.research.google.com/github/AkshayAI007/Topic-modelling-of-news-article-/blob/main/Topic_Modelling_on_News_Articles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Topic modeling of news article



##### **Project Type**    - Unsupervised ML
##### **Contribution**    - Individual


# **Project Summary -**

The exponential growth of digital content has led to an overwhelming influx of news articles across various domains. Manual categorization and analysis of this vast amount of information are labor-intensive and time-consuming. Unsupervised topic modeling offers a solution by automating the process of identifying key themes within the articles, enabling efficient content organization and information retrieval.**Unsupervised machine learning** techniques have gained significant traction in the field of natural language processing (NLP) due to their ability to extract valuable insights from unstructured text data. This project focuses on applying clustering algorithms, particularly **Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA)**, to perform topic modeling on a diverse collection of news articles. The goal is to uncover hidden thematic structures within the articles and categorize them into coherent topics.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective of this project is to leverage unsupervised machine learning techniques to perform topic modeling on a collection of news articles from BBC using clustering algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). The goal is to extract meaningful topics from the articles, allowing for efficient content organization, information retrieval, and trend analysis.The dataset Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
Natural Classes: 5 (business, entertainment, politics, sport, tech)

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [12]:

import os

# importing CountVectorizer for feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# Importing data manipulation libraries
import numpy as np
import pandas as pd

# importing tqdm and display modules for progress meters/bars
from IPython.display import display
from tqdm import tqdm

# importing wordcloud to represent topics wordcloud
from wordcloud import WordCloud

# Model selection modules
from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

from collections import Counter

import ast

# importing data visualization modules
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

# importing mlab for implementing MATLAB functions
import matplotlib.mlab as mlab

# importing statistics module
import scipy.stats as stats

# importing decomposition modules
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation

# importing Natural Language Toolkit and other NLP modules
import nltk
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob, Word

# Importing warnings library. The warnings module handles warnings in Python.
import warnings
warnings.filterwarnings('ignore')

  and should_run_async(code)


In [13]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


 1.   For loading the dataset, the **glob and re** libraries were used.While reading the data, each paragraph was read into a different row with the column index being the title.

 2. This data was organised into three features - the first column named "Title", the rest of the article in "Description" and the category assigned to the article in "Category" column
 1.   Articles from each category were stored in separate DataFrames (for any furture usage), and all were concatenated to form a single final DataFrame






In [14]:

# The variable "directory" holds the address of text files stored in drive
directory = '/content/drive/MyDrive/Projects/Topic_Modeling_on_news_articles/bbc/bbc'

# All 5 sub-categories provided
subdirs = ['business', 'entertainment', 'politics', 'sport', 'tech']

# Create dataframe for gathering the articles
bbc = pd.DataFrame()

# Iterate over sub-directories to access the text files
for subdir in subdirs:

  # address to the subdirectory
  dir = directory + '/' + subdir

  # Iterate over all the text files present in a sub-directory
  for filename in os.listdir(dir):

    # Get file address
    filepath = os.path.join(dir, filename)

    # Traversing over text files and storing the articles into the dataframe
    try:
      data = open(filepath,'r').read()

      # escape characters to be ignored in the text
      escape = ['\n']

      # removing escape characters from text
      for elem in escape:
        data = data.replace(elem, ' ')

      # Storing article to the dataframe
      dict1 = {'Filename': filename.split('.')[0], 'Contents': data.lower(), 'Category':subdir}
      bbc = bbc.append(dict1, ignore_index=True, verify_integrity = True)

    # Ignore exception, if any
    except:
      pass

### Dataset Loading

In [15]:
bbc.shape

(0, 0)

**The dataset contains a set of news articles for each major segment consisting of business, entertainment, politics, sports and technology. There are over 2000 news article available in these categories.**

In [16]:
bbc.sample(10)


ValueError: ignored

**The dataset consists of 4 columns:**


1.   **Index** : Entry number

1.   **Filname** : Destination File Name/ Number
2.   **Contents** : Complete transcript of the article, the complete textual data

2.   **Category** : Article topic







### Dataset First View

In [None]:
bbc.head()

### Dataset Rows & Columns count

In [None]:
bbc.info()

### Dataset Information

In [None]:
# Dataset Info
df = bbc.copy()                                                               ## First creating a deep copy


In [None]:
df.info()


#### Duplicate Values

In [None]:
#Let's check for duplicates cause having duplicates will result in inconsistencies
df.duplicated(subset = ['Contents']).sum()

In [None]:
#Dropping the duplicate values
df.drop_duplicates(subset = ['Contents'],inplace = True)
df

**The dataset contains total 99 duplicate rows and does not contain any missing values**

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


### What did you know about your dataset?

**Dataset does not contain any missing values**

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

## 3. ***Data Wrangling***

### Data Wrangling Code

**Let's add a column which shows us the number of words used in each article**

In [None]:
df['Contents_len'] = df['Contents'].str.len()


In [None]:
plt.figure(figsize=(15, 5))
sns.distplot(df['Contents_len']).set_title('News length distribution');

**We can see here that most of the contents length lie between 0-5000 but a few goes as high as 25000.**

### What all manipulations have you done and insights you found?

None

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Dist plot for each category
sns.displot(df, x="Contents_len", hue="Category", kind="kde",height=7,aspect =1 )

**Here we can see the content length distribution for each categories.**

In [None]:
df

In [None]:
#Checking the number of Article per categories:
cat_count =df.groupby(['Category'],)['Category'].count()


In [None]:
plt.figure(figsize=(10,5))
cat_count.plot(kind ='bar', grid =True)
plt.ylabel("Count")
plt.title("Count per Category")

##### 1. Why did you pick the specific chart?

**Bar chart gives us an apt idea about the number of articles in each category**

##### 2. What is/are the insight(s) found from the chart?

**We can see here the highest number of content is for the business and the spors category.**

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

#### What all missing value imputation techniques have you used and why did you use those techniques?

**No missing values hence not needed**

### 2. Handling Outliers

##### What all outlier treatment techniques have you used and why did you use those techniques?

**Not needed**

### 3. Categorical Encoding

#### What all categorical encoding techniques have you used & why did you use those techniques?

**Not needed**

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Let's Check the textual content of the Data:

In [None]:
content = df.reset_index()
content = content['Contents']
content

#### 2. Lower Casing

In [None]:
# Lower Casing
df = df.applymap(lambda x: x.lower() if isinstance(x, str) else x)

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

In [None]:
df['Contents'] = df['Contents'].apply(remove_punctuation)
df.head(10)

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits
import re

def remove_urls(text):
    # Regular expression pattern to match URLs
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    # Replace URLs with an empty string
    cleaned_text = re.sub(url_pattern, '', text)

    return cleaned_text

df['Contents']=df['Contents'].apply(remove_urls)



#### 5. Tokenization

**CountVectorizer(Top Words):**

In order to use textual data for predictive modeling, the text must be parsed to remove certain words – this process is called **Tokenization.**

These words then need to be encoded as integers, or floating-point values, such that they can be used as inputs in machine learning algorithms. This process is called **Feature Extraction (or Vectorization)**.

In [None]:
# Tokenization
# Function to extract top n words with highest frequency
def get_top_n_words(n_top_words, count_vectorizer, text_data):
    '''
    The function returns a tuple of the top n words in a sample and their
    accompanying counts, given a CountVectorizer object and text sample as inputs
    '''
    # encoding the document using countvectorizer object
    vectorized_content = count_vectorizer.fit_transform(text_data.values)
    vectorized_total = np.sum(vectorized_content, axis=0)

    # extracting specifics for words
    word_indices = np.flip(np.argsort(vectorized_total)[0,:], 1)
    word_values = np.flip(np.sort(vectorized_total)[0,:],1)

    # creating a vector matrix for words
    word_vectors = np.zeros((n_top_words, vectorized_content.shape[1]))
    for i in range(n_top_words):
        word_vectors[i,word_indices[0,i]] = 1

    # display Vector matrix
    print(word_vectors)

    # collect the words
    words = [word[0].encode('ascii').decode('utf-8') for
             word in count_vectorizer.inverse_transform(word_vectors)]

    return (words, word_values[0,:n_top_words].tolist()[0])


# **6.Stopword removal**

In [None]:
# Remove Stopwords
# dowloading nltk stopwords module
nltk.download('stopwords')
# extracting all stopwords for english language
stop = nltk.corpus.stopwords.words('english')
stop[0:10]

## **7.Vectorization**

In [None]:
# creating vectorizer object
count_vectorizer = CountVectorizer(stop_words=stop)

# calling the function to get words and their counts
words, word_values = get_top_n_words(n_top_words=25,
                                     count_vectorizer=count_vectorizer,
                                     text_data=content)

# display top 25 words using bar plot
fig, ax = plt.subplots(figsize=(16,8))
ax.bar(range(len(words)), word_values, edgecolor='red')
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words, rotation='vertical')
ax.set_title('Top words in new articles dataset (excluding stop words)')
ax.set_xlabel('Words')
ax.set_ylabel('Number of occurences')
plt.show()

## **8. Text Normalization**

**STEMMING AND LEMMATIZING THE DATA**

**Stemming**: is the process of reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

**Lemmatization**: This algorithm collects all inflected forms of a word in order to break them down to their root dictionary form or lemma. Words are broken down into a part of speech (the categories of word types) by way of the rules of grammar.

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
# making a lemmatizer object
sno = nltk.stem.SnowballStemmer('english')


# lemmatizing an article to see what snowball lemmatizer returns
for rows in content:
 print(rows)
 test = [sno.stem(words) for words in rows.split(' ')]
 print(test)
 break


**As we use a Lemmatization technique we can see that it has worked but some words dont make any sense now such as 'associated' is 'associ' and 'aggregate' to 'aggreg'.Hence the meaning of the words are lost in the process so we wont be using this method.**

#### 10. Text Vectorization

**Vectorization is a technique that converts the text content to numerical feature vectors. Bag of Words takes a document from a corpus and converts it into a numeric vector by mapping each document word to a feature vector for the machine learning model.**

In [None]:
# Vectorizing Text
# creating a countvectorizer object
count_vectorizer = CountVectorizer(stop_words = stop, max_features = 4000)

# text before vectorization
text_sample = content
print('Content after removing Stopwords and Punctuations: \n{}'.format(text_sample[23]))

# encode the textual content
document_term_matrix = count_vectorizer.fit_transform(text_sample)

# text after vectorization
print('Vectorization: \n{}'.format(document_term_matrix[23]))



##### Which text vectorization technique have you used and why?

Answer Here.

## ***7. ML Model Implementation***

# **Model - Latent Semantic Analysis (LSA):**

**Latent Semantic Analysis (LSA)** is a method that allows us to extract topics from documents by converting their text into word-topic and document-topic matrices. The procedure for LSA is relatively straightforward: Convert the text corpus into a document-term matrix. Implement truncated singular value decomposition.

In [None]:
from sklearn.decomposition import TruncatedSVD


In [None]:
# ML Model - 1 Implementation

# SVD represent documents and terms in vectors
svd_model = TruncatedSVD(n_components=5, algorithm='randomized', n_iter=100, random_state=23)
# Fit the Algorithm
svd_model.fit(document_term_matrix)
# Predict on the model

In [None]:

terms = count_vectorizer.get_feature_names_out()

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:40]
    print(" ")
    print("Topic "+str(i)+": ")
    print(" ")
    for t in sorted_terms:
        print(t[0])

As we can see from above the LSA model does group the words into topics but the words are heavily mismatched for just 40 words,for example:



1.   Topic 0 has definded 'music' , 'government' and 'game' in the same cluster

1.  Topic 1 has definded 'songs' and 'mobile' in the same cluster

2.  Topic 2 has definded 'labour' and 'party' in the same cluster

2.   Topic 3 has definded 'nadal' and 'election' in the same cluster

1.  Topic 4 is comparatively clustered better because most of the words belongs to technology but even so there are words like 'government' , 'wage' , 'urban' which are not accurate.


# ***Since the LSA model performs very poorly we won't be using this as our clustering algorithm.***






#### 2. Cross- Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

# **Model -Latent Dirichlet Allocation (LDA):**

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

**Latent Dirichlet Allocation (LDA)** algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus.

An advantage of the LDA technique is that one does not have to know in advance what the topics will look like. By tuning the LDA parameters to fit different dataset shapes, one can explore topic formation and resulting document clusters.

The goal of LDA is to map all the documents to the topics in a way, such that the words in each document are mostly captured by those imaginary topics.


> We will be using pyLDAvis which allows a better visualization.

>We will be using T-Sne for lowering down the dimensions of our feature-space

***Model Evaluation:***

For Information retreival task like Topic modelling, there two metrics to judge the model performance.

>**Perplexity:** is a mesure of model complication by trained model when exposed to unseen documents. For good model perplexity score should be less. Higher the score model not generlise well.

>**Coherence score:** is measure of sum of sementic similarity between most occurence words for every topic.

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_20newsgroups



# Create an LDA model
lda = LatentDirichletAllocation()

# Define hyperparameter grid for tuning
param_grid = {
    'n_components': [5, 10, 15],          # Number of topics
    'learning_method': ['online', 'batch'],  # Learning method
    'learning_decay': [0.5, 0.7, 0.9]        # Learning rate decay
}

# Perform grid search using 5-fold cross-validation
grid_search = GridSearchCV(lda, param_grid, cv=5, verbose=2)

# Fit the grid search to the data
grid_search.fit(document_term_matrix)

# Print the best hyperparameters and corresponding score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Log Likelihood Score:", grid_search.best_score_)



In [None]:

# installing pyLDAvis to visualize the results of LDA model
!pip install pyLDAvis

In [None]:
# importing pyLDAvis module
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()



In [None]:
# generate and display the graph
lda_viz = gensimvis.prepare(lda, corpus, dictionary)
lda_panel

In [None]:

# creating docterms dataframe
docterms = lda_panel.token_table.sort_values(by = ['Freq'], ascending=False)

# display docterms df
docterms

In [None]:
# create topics dataframe
topicsdf = pd.DataFrame()


# adding top 50 most relevant terms for each topic to the dataframe
for i in range(1,6):
  Topicdict ={ "Topic":i, "Terms":list(docterms[docterms['Topic']==i]['Term'].head(50))  }
  topicsdf=topicsdf.append(Topicdict,ignore_index=True)
topicsdf

**TOPIC 1 : POLITICS**

In [None]:
# creating term freq dict for topic 1
t1dict = {}
for vals in docterms[docterms['Topic']==1].head(40).values:
  t1dict[vals[2]] =vals[1]
t1dict

In [None]:
# generating the wordcloud for topic 1
wordcloud = WordCloud(width = 1200, height = 700,
                min_font_size = 10).generate(' '.join(list(t1dict.keys())))
wordcloud = wordcloud.generate_from_frequencies(frequencies=t1dict)



# plotting the WordCloud image
plt.figure(figsize = (12,7), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

**TOPIC 2: TECH**


In [None]:
# creating term freq dict for topic 2
t2dict = {}
for vals in docterms[docterms['Topic']==2].head(40).values:
  t2dict[vals[2]] =vals[1]
t2dict

In [None]:
# generating the wordcloud for topic 1
wordcloud = WordCloud(width = 1200, height = 700,
                min_font_size = 10).generate(' '.join(list(t2dict.keys())))
wordcloud = wordcloud.generate_from_frequencies(frequencies=t2dict)



# plotting the WordCloud image
plt.figure(figsize = (12,7), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

**TOPIC 3: SPORTS**

In [None]:
# creating term freq dict for topic 3
t3dict = {}
for vals in docterms[docterms['Topic']==3].head(40).values:
  t3dict[vals[2]] =vals[1]
t3dict

In [None]:
# generating the wordcloud for topic 1
wordcloud = WordCloud(width = 1200, height = 700,
                min_font_size = 10).generate(' '.join(list(t3dict.keys())))
wordcloud = wordcloud.generate_from_frequencies(frequencies=t3dict)



# plotting the WordCloud image
plt.figure(figsize = (12,7), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

**TOPIC 4: ENTERTAINMENT**

In [None]:
# creating term freq dict for topic 4
t4dict = {}
for vals in docterms[docterms['Topic']==4].head(40).values:
  t4dict[vals[2]] =vals[1]
t4dict

In [None]:
# generating the wordcloud for topic 1
wordcloud = WordCloud(width = 1200, height = 700,
                min_font_size = 10).generate(' '.join(list(t4dict.keys())))
wordcloud = wordcloud.generate_from_frequencies(frequencies=t4dict)



# plotting the WordCloud image
plt.figure(figsize = (12,7), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

**TOPIC 5: BUSINESS**

In [None]:
# creating term freq dict for topic 5
t5dict = {}
for vals in docterms[docterms['Topic']==5].head(40).values:
  t5dict[vals[2]] =vals[1]
t5dict

In [None]:
# generating the wordcloud for topic 1
wordcloud = WordCloud(width = 1200, height = 700,
                min_font_size = 10).generate(' '.join(list(t5dict.keys())))
wordcloud = wordcloud.generate_from_frequencies(frequencies=t5dict)



# plotting the WordCloud image
plt.figure(figsize = (12,7), facecolor = None)
plt.imshow(wordcloud)
plt.tight_layout(pad = 0)

plt.show()

**CONCLUSION:**

**EDA Conclusion :**
1. Most of the content length lies between 0-5000 but some cases the length extends upto 25,000.
2. The most articles are in the business and sports categories, followed by politics, entertainment, and technology.
3. The term "said" appears the most in all article.



**Model Conclusion :**
* Even though the LSA model did segregate the contents into five topic but the result was very bad, there were mixed up words in each topic.
* So we won't be considering the LSA, and start of with LDA
* Best log likelihood Score for the LDA model is -645320.7760682276
* LDA model Perplexity on train data is 1601.6094311751326

We successfully categorised the contents using the LDA method by analysing the common words related for each category.



## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***