# Analysis of speeches given by P.M Mr. Modi

In this notebook, we're going to analyse the speeches given by P.M Mr Modi. We will perform topic modelling and text summarization. As always, I have tried to keep the notebook well commented and organized for easy reading. I hope you find it helpful.


# Setup

In [None]:
'''--- Libraries ---'''

# Generic Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os, gc, warnings, datetime, pickle, re
warnings.filterwarnings("ignore")


# Plotting
import plotly.express as px
import plotly.graph_objects as go
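# Note: recent pyLDAvis releases renamed these modules
# (pyLDAvis.gensim -> pyLDAvis.gensim_models, pyLDAvis.sklearn -> pyLDAvis.lda_model)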
import pyLDAvis.gensim 
import pyLDAvis.sklearn

#Gensim Library for Text Processing
import gensim.parsing.preprocessing as gsp
from gensim import utils, corpora, models

# SK Learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Transformer
from transformers import TFAutoModelWithLMHead, AutoTokenizer, pipeline

# Spacy
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm',disable=['ner','textcat'])

### Data

In this notebook, we'll be focussing on the English speeches made by P.M Mr Modi in 2020.

In [None]:
'''--- Data ---'''

# Load
url = '../input/speeches-modi/PM_Modi_speeches.csv'
data = pd.read_csv(url, header='infer')

# Convert date column datatype to datetime
data['date'] = pd.to_datetime(data['date'], errors='coerce')  # unparseable dates become NaT

# Extract Year
data['year'] = pd.DatetimeIndex(data['date']).year

# Selecting only English speeches delivered in 2020 (copy to avoid SettingWithCopyWarning)
df = data[(data['year'] == 2020) & (data['lang'] == 'en')].copy()

# Dropping Unwanted Columns
df.drop(['lang','year', 'url'], axis=1, inplace=True)

# Total Speeches
print("Total Speeches Made in 2020: ", df.shape[0])

# Inspect
df.head()

In [None]:
# Visualize
fig = px.line(df, x='date', y='words', title="Speeches made by P.M Mr Modi in 2020")
fig.show()

# Topic Modelling

Topic modelling is the process of identifying the topics discussed in a collection of documents. Latent Dirichlet Allocation (LDA) is a widely used topic modelling technique, and we will apply it to map a set of speeches to a set of topics. It is an unsupervised machine learning method that discovers hidden semantic structure in the speeches and lets us learn a topic representation of each one.

**Trivia:** At a typical speaking rate of roughly 130 to 150 words per minute, 20,000 words corresponds to more than two hours of continuous speech.

Based on this little trivia, for **Topic Modelling**, we will be focussing on the **English speeches with >= 20,000 words made by P.M Mr Modi in 2020**.



### Selecting Records & Text Cleaning

In [None]:
'''Selecting Records''' 
word_count = 20000
# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
df_20k = df[df['words'] >= word_count].copy().reset_index(drop=True)



'''Text Cleaning Utility Function'''

processes = [
               gsp.strip_tags, 
               gsp.strip_punctuation,
               gsp.strip_multiple_whitespaces,
               gsp.strip_numeric,
               gsp.remove_stopwords, 
               gsp.strip_short #, 
               #gsp.stem_text,
               #utils.tokenize
            ]

# Utility Function
def proc_txt(txt):
    text = txt.lower()
    text = utils.to_unicode(text)
    for p in processes:
        text = p(text)
    return text

# Applying the function to text column
df_20k['cleaned_txt'] = df_20k['text'].apply(lambda x: proc_txt(x))


# Inspect
df_20k.head()

### Modelling using SKLearn Library

We will now transform the text into a format that can serve as input for training the LDA model. First, we convert each document into a simple bag-of-words (BoW) vector representation. The result is one count vector per speech, each with length equal to the size of the vocabulary.
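
To make the bag-of-words idea concrete, here is a minimal sketch on a made-up three-document corpus. The `toy_` names and the sentences are illustrative assumptions and are not part of the speech data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus, for illustration only
toy_docs = [
    "india fights corona",
    "farmers india village",
    "corona corona lockdown",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```

Each row is one document and each column is one vocabulary word, so every document vector has the same length regardless of how long the document is.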

In [None]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the cleaned speeches
count_data = count_vectorizer.fit_transform(df_20k['cleaned_txt'])


# Parameters
num_topics = 5    # can be changed
num_words = 20    # can be changed


# Utility Function
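def topics(model, vectorizer, num_top_wrds):
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    words = vectorizer.get_feature_names_out()
    
    print(f"*** {model.n_components} TOPICS DISPLAYED WITH {num_top_wrds} WORDS ***\n")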
    
    for topic_idx, topic in enumerate(model.components_):
        print("Topic Index: %d" %topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-num_top_wrds - 1:-1]]), "\n")

# LDA Model
lda = LDA(n_components=num_topics, n_jobs=-1, random_state=42, verbose=0)
lda.fit(count_data)
              
# Topics Detected by LDA Model
topics(lda, count_vectorizer, num_words)

To visualize the topics we will use **pyLDAvis**, which is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to drive an interactive web-based visualization.

**Saliency**: a measure of how much a term tells you about the topics overall.

**Relevance**: a weighted average of the log probability of the word given the topic and the log of that same probability normalized by the overall probability of the word.

The size of each bubble measures the importance of the topic relative to the data.

First, we see the most salient terms, that is, the terms that tell us the most about the topics overall. We can also look at an individual topic.
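
The relevance score can be sketched in a few lines of NumPy. This assumes the LDAvis definition, relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)), where λ is a weight between 0 and 1. The function name `term_relevance` and the probabilities below are made up for illustration and do not come from the fitted model.

```python
import numpy as np

def term_relevance(p_w_given_t, p_w, lam=0.6):
    """Relevance of each term to one topic:
    lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    p_w_given_t = np.asarray(p_w_given_t, dtype=float)
    p_w = np.asarray(p_w, dtype=float)
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)

# Made-up numbers: probability of three terms within one topic and in the whole corpus
terms = ["india", "corona", "yoga"]
p_w_given_t = [0.30, 0.20, 0.05]
p_w = [0.30, 0.05, 0.01]

for lam in (1.0, 0.0):
    scores = term_relevance(p_w_given_t, p_w, lam)
    ranking = [terms[i] for i in np.argsort(scores)[::-1]]
    print(f"lambda={lam}: {ranking}")
```

With λ = 1 the ranking follows raw within-topic probability, so common words dominate. With λ = 0 it follows the ratio p(w|t) / p(w), which favours words that are rare overall but concentrated in the topic.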

In [None]:
# Visualize Topics
LDAvis_data_filepath = os.path.join('./ldavis_prepared_'+str(num_topics))
LDAvis_prepared = pyLDAvis.sklearn.prepare(lda, count_data, count_vectorizer)
pyLDAvis.display(LDAvis_prepared)

### Observations:

* The most salient terms used by P.M Mr. Modi are INDIA, COUNTRY, FRIENDS, PEOPLE, CORONA, VILLAGE.
* The term CORONA was used in 3 out of 5 topics.


**Note**: When there are >= 10 topics, certain topics will cluster together, which indicates similarity between them.

### Modelling using Gensim

First, we will create a dictionary from the data, then convert the data to a bag-of-words corpus, and save the dictionary and corpus for future use.
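
As a rough picture of what the dictionary and the bag-of-words corpus contain, the sketch below mimics the idea in plain Python rather than with Gensim itself. The `toy_` names and the `to_bow` helper are hypothetical, and the integer ids Gensim actually assigns may differ.

```python
from collections import Counter

# Made-up tokenized documents, for illustration only
toy_tokens = [
    ["india", "corona", "india"],
    ["farmers", "village", "india"],
]

# "Dictionary": give each unique token an integer id, in order of first appearance
toy_token2id = {}
for doc in toy_tokens:
    for tok in doc:
        toy_token2id.setdefault(tok, len(toy_token2id))

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

# "Corpus": one sparse list of (token_id, count) pairs per document
toy_corpus = [to_bow(doc, toy_token2id) for doc in toy_tokens]
print(toy_token2id)
print(toy_corpus)
```

Unlike the dense count matrix used with scikit-learn, each document here stores only the tokens it actually contains.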

In [None]:
'''Text Cleaning Utility Function'''

processes = [
               gsp.strip_tags, 
               gsp.strip_punctuation,
               gsp.strip_multiple_whitespaces,
               gsp.strip_numeric,
               gsp.remove_stopwords, 
               gsp.strip_short, 
               #gsp.stem_text,
               utils.tokenize
            ]

# Utility Function
def proc_txt(txt):
    text = txt.lower()
    text = utils.to_unicode(text)
    for p in processes:
        text = p(text)
    return list(text)

# Applying the function to text column
df_20k['cleaned_txt'] = df_20k['text'].apply(lambda x: proc_txt(x))

In [None]:
# Dictionary & Corpus
dictionary = corpora.Dictionary(df_20k['cleaned_txt'])
corpus = [dictionary.doc2bow(txt) for txt in df_20k['cleaned_txt']]

# Saving the corpus and the dictionary
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')

In [None]:
# Gensim LDA Model
lda_model = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=20)
lda_model.save("ldamodel.gensim")

# Display Topics
print(f"*** {num_topics} TOPICS DISPLAYED WITH {num_words} WORDS ***\n")
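# Use a separate name so the topics() helper defined earlier is not overwritten
gensim_topics = lda_model.print_topics(num_words=num_words)
for topic in gensim_topics:
    print(topic,"\n")
    
# Reload the saved dictionary, corpus and model
dictionary = corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
# Separate name so the scikit-learn model stored in `lda` is kept
lda_gensim = models.ldamodel.LdaModel.load('ldamodel.gensim')

# Visualize Topics
lda_display = pyLDAvis.gensim.prepare(lda_gensim, corpus, dictionary, sort_topics=False)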
pyLDAvis.display(lda_display)

### Observations:

* The term CORONA was used in all the 5 topics.
* Terms like VILLAGE, FARMERS, PANCHAYAT also make it into the top 30 salient terms.
* We can observe that topic#1 & topic#3 are clustered, which means they are similar.
* Health and well-being related terms such as DOCTOR, FAMILY, YOGA, AYUSHMAN were used frequently.
* Epidemic related terms such as CORONA, QUARANTINE, LOCKDOWN, VIRUS were used frequently.

In [None]:
# Garbage Collection
gc.collect()

# Summarization

Summarization is the task of producing a concise and fluent summary while preserving key information and overall meaning. There are two types of summarization: abstractive and extractive.

* Abstractive Summarization (Input document > understand context > semantics > create own summary)
* Extractive Summarization (Input document > sentences similarity > weight sentences > select sentences with higher rank)
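
The extractive pipeline above (score sentences, then keep the highest ranked ones) can be sketched with the standard library alone. This is a simple word-frequency scorer written for illustration, not the method used in the cells that follow. The function name `extractive_summary` and the sample text are assumptions, and no stop-word removal is applied.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Keep the sentences with the highest average word frequency, in original order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, then restore document order for readability
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)

# Made-up sample text
toy_text = (
    "India will become self reliant. "
    "The weather was pleasant today. "
    "A self reliant India will help the world."
)
print(extractive_summary(toy_text, num_sentences=2))
```

Because the summary is built only from sentences that already exist in the input, nothing is rephrased, which is the defining property of the extractive approach.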

For the scope of this notebook, we're going to focus on the **Extractive Summarization** technique and the speech made by P.M Mr Modi on **15 Aug 2020**.

In [None]:
'''Select Text'''
article = df_20k['text'].iloc[1]


'''Text Cleaning Utility Function'''

processes = [
               gsp.strip_tags, 
               gsp.strip_punctuation,
               gsp.strip_multiple_whitespaces,
               gsp.strip_numeric,
               gsp.remove_stopwords, 
               gsp.strip_short
            ]

# Utility Function
def proc_txt(txt):
    text = txt.lower()
    text = utils.to_unicode(text)
    for p in processes:
        text = p(text)
    return text

# Cleaning the article
article = proc_txt(article)

### Using Pre-Trained Transformer Model

In [None]:
# Instantiate Model
model = TFAutoModelWithLMHead.from_pretrained("bert-large-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")

In [None]:
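# Define Input
# Note: bert-large-cased is an encoder-only masked language model, so generate() is unlikely
# to yield a real summary here; a sequence-to-sequence checkpoint is the usual choice for this step.
# `input_ids` avoids shadowing the built-in input(); truncation=True enforces max_length.
input_ids = tokenizer.encode(article, return_tensors="tf", max_length=256, truncation=True)
output = model.generate(input_ids, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)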
print("Summarized Text: \n", tokenizer.decode(output[0], skip_special_tokens=True))

In [None]:
# Garbage Collection
gc.collect()

## I hope that was enlightening and helpful. Please do consider upvoting :)