 <h1><center>Topic Modeling of BBC News Articles </center></h1>

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#Section_1"> Importing Required Libraries and Data</a></li>
        <li><a href="#Section_2"> Preprocessing Data Frame and creating TF-IDF Matrix</a></li>
        <li><a href="#Section_3"> LDA and LSA by Cleaning Method 1 </a> </li>
        <li><a href="#Section_4"> LDA and LSA by Cleaning Method 2</a></li>
        <li><a href="#Section_5"> LDA and LSA by Clenaing Method 3</a></li>
        <li><a href="#Section_6"> Five most common keywords across these six groups of keywords</a></li>
        <li><a href="#Section_6"> Observations</a></li>
    </ol>
</div>

<h1 id="#Section_1"> 1. Importing Required Libraries and Data</h1>

In [21]:
import nltk
from nltk.corpus import stopwords
import pandas as pd
from string import punctuation
import re
from gensim.models import TfidfModel, LsiModel, CoherenceModel, LdaModel
import numpy as np
from gensim.corpora import Dictionary
from textblob import TextBlob
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('averaged_perceptron_tagger')

In [22]:
articles = pd.read_csv("BBC-articles.csv") # importing csv into a dataframe

<h2 id="#Section_2"> 2. Preprocessing Data Frame and creating TF-IDF Matrix</h2>

<h3 id="#Section_2"> Basic Cleaning</h3>

In [23]:
# clean the articles by removing the punctuation, fullstops, stopwords, 'words_len>2' 
def cleantext(text):
    text = text.strip(punctuation).lower()
    text = re.sub(r'[!?,.\:;\n\t]+', '', text)
    word= nltk.tokenize.word_tokenize(text)#tokenization
    word = [w for w in word if w.isalpha()]# selecting only words
    word = [w for w in word if w not in stopwords.words('english') and len(w) > 2]#removing stopwords 
    return word

<h3 id="#Section_2"> TF-IDF Matrix by different matrix</h3>

In [31]:
def tfidf_maker(articles,clean_method):
    # creating a list of token of all the articles(documents)
    token = []    
    if clean_method==1:
        #More cleaning with the help of lemmatizing words 
        for i in articles.index:
            words = cleantext(articles.loc[i, 'text']) #calling basic function
            wordnet = nltk.stem.WordNetLemmatizer() #Normalization using Lemmatization technique
            lemmatized_words = [wordnet.lemmatize(w) for w in words] # keeping lemmatized words
            token.append(lemmatized_words)             #appending to empty token list        
        my_dict = Dictionary(token)  #Converting words into a dictonary Tokenization 
        return my_dict,token 
    elif clean_method==2:
        #to exclude the top 10% of the most frequent words and words that appear less than 5 times in the documents
        for i in articles.index:
            words = cleantext(articles.loc[i, 'text'])
            token.append(words) #appending to a empty token list
        my_dict = Dictionary(token)  #Converting words into a dictonary Tokenization
        #exclude the top 10% and words that appear less than 5 times
        my_dict.filter_extremes(no_below=5, no_above=0.90)
        return my_dict,token
    elif clean_method==3:
        #Limiting the word list with nouns
        for i in articles.index:
            words = cleantext(articles.loc[i, 'text'])
            modified_text=' '.join([w for w in words])
            blob_object = TextBlob(modified_text)
            #Limiting the word list with nouns
            word_list_nouns = [word for word,pos in blob_object.tags if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]
            token.append(word_list_nouns) #apending a empty token list
        my_dict = Dictionary(token)   #Converting words into a dictonary Tokenization
        return my_dict,token                                       

<h3 id="#Section_2"> Determining Max Coherence/topics</h3>

In [25]:
# Determining optimum number of topics using coherence values 
def maxCoherence(corpus, isLsi,my_dict,token):
    coherence_values = []
    model_list = []
    min_topics, max_topics, step = 1, 10, 1
    for i in range(min_topics, max_topics, step):
        if (isLsi) :
            model = LsiModel(corpus, id2word=my_dict, num_topics=i)
        else:
            model = LdaModel(corpus, id2word=my_dict, num_topics=i)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=token, dictionary=my_dict, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return coherence_values.index(max(coherence_values))

<h3 id="#Section_2"> Dominant topic and Keyword for each article/topics</h3>

In [26]:
# Get dominant topic and corresponding keywords for each article
def getkeywords(model, corpus): 
    # Init output
    topickeyword_df = pd.DataFrame()

    # Get main topic in each document
    for i, row in enumerate(model[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = model.show_topic(topic_num, topn=5)
                #topn = 5 gives top 5 kwywords 
                topic_keywords = ", ".join([word for word, prop in wp])
                topickeyword_df = topickeyword_df.append(pd.Series([topic_keywords]), ignore_index=True)
            else:
                break
    return(topickeyword_df)

<h3 id="#Section_2"> Modelling Function</h3>

In [27]:
def models_method(clean_method):
    #convert a list of words to bag of words
    my_dict,token=tfidf_maker(articles,clean_method)
    dtm = [my_dict.doc2bow(doc) for doc in token] #convert a list of words to bag of words
    tfidf = TfidfModel(dtm) # TF-IDF Vectorization for the document term matrix
    tfidf = tfidf[dtm]

    # Gensim: LSI
    lsi_model = LsiModel(corpus=tfidf, id2word=my_dict, num_topics=maxCoherence(tfidf,isLsi=True,my_dict = my_dict,token = token))

    # Gensim: LDA
    lda_model = LdaModel(corpus=tfidf, id2word=my_dict, num_topics=maxCoherence(tfidf,isLsi=False,my_dict = my_dict,token = token))
    return lsi_model,lda_model,tfidf

<h2 id="#Section_2"> 3.LSA and LDA by cleaning method 1</h2>

In [28]:
lsi_model_1,lda_model_1,tfidf = models_method(1)
# add top 5 keywords for each model into the dataframe after vectorization 
articles['LSI Clean Keywords'] = getkeywords(model=lsi_model_1, corpus=tfidf)
articles['LDA Clean Keywords'] = getkeywords(model=lda_model_1, corpus=tfidf)

In [29]:
articles.head(3)

Unnamed: 0,category,text,LSI Clean Keywords,LDA Clean Keywords
0,tech,tv future in the hands of viewers with home th...,"labour, election, blair, tax, brown","search, blair, holmes, blog, mobile"
1,business,worldcom boss left books alone former worldc...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy"
2,sport,tigers wary of farrell gamble leicester say ...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy"


<h2 id="#Section_2"> 4.LSA and LDA by cleaning method 2</h2>

In [32]:
lsi_model_2,lda_model_2,tfidf = models_method(2)
# add top 5 keywords for each model into the dataframe after vectorization 
articles['LSI Clean Keywords 2'] = getkeywords(model=lsi_model_2, corpus=tfidf)
articles['LDA Clean Keywords 2'] = getkeywords(model=lda_model_2, corpus=tfidf)

In [33]:
articles.head(3)

Unnamed: 0,category,text,LSI Clean Keywords,LDA Clean Keywords,LSI Clean Keywords 2,LDA Clean Keywords 2
0,tech,tv future in the hands of viewers with home th...,"labour, election, blair, tax, brown","search, blair, holmes, blog, mobile","labour, blair, election, people, brown","mobile, games, music, blair, players"
1,business,worldcom boss left books alone former worldc...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy","labour, blair, election, people, brown","mobile, games, music, blair, players"
2,sport,tigers wary of farrell gamble leicester say ...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy","labour, blair, election, people, brown","mobile, games, music, blair, players"


<h2 id="#Section_3"> 5.LSA and LDA by cleaning method 3</h2>

In [34]:
lsi_model_3,lda_model_3,tfidf = models_method(3)
# add top 5 keywords for each model into the dataframe after vectorization 
articles['LSI Clean Keywords 3'] = getkeywords(model=lsi_model_3, corpus=tfidf)
articles['LDA Clean Keywords 3'] = getkeywords(model=lda_model_3, corpus=tfidf)

In [35]:
articles.head(3)

Unnamed: 0,category,text,LSI Clean Keywords,LDA Clean Keywords,LSI Clean Keywords 2,LDA Clean Keywords 2,LSI Clean Keywords 3,LDA Clean Keywords 3
0,tech,tv future in the hands of viewers with home th...,"labour, election, blair, tax, brown","search, blair, holmes, blog, mobile","labour, blair, election, people, brown","mobile, games, music, blair, players","election, blair, government, party, people","dollar, bank, prices, growth, oil"
1,business,worldcom boss left books alone former worldc...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy","labour, blair, election, people, brown","mobile, games, music, blair, players","election, blair, government, party, people","search, google, party, users, people"
2,sport,tigers wary of farrell gamble leicester say ...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy","labour, blair, election, people, brown","mobile, games, music, blair, players","election, blair, government, party, people","blair, party, election, tax, kennedy"


<h2 id="#Section_3"> 6.Five most common keywords across these six groups of keywords</h2>

In [36]:
#combining keywords from LSA , LDA after 3 ceaing methods into a new keyword column
articles['keyword'] = articles[articles.columns[2:]].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1)

In [37]:
#  To Get 5 most common keywords from all the LSI and LDA Keywords
from collections import Counter 
for i in articles.index:
    key_word = articles.loc[i, 'keyword']
    key_word = key_word.split(',')
    most_occur = Counter(key_word).most_common(5) 
    articles.loc[i, 'Top 5 Words'] = ','.join([word[0] for word in most_occur])

<h3 id="#Section_3"> To CSV </h3>

In [38]:
articles = articles.drop(columns=['keyword']) #every keyword
articles.to_csv('BBC_Keywords.csv',index=False,encoding='utf-8') #write to csv

In [39]:
articles.head(5)

Unnamed: 0,category,text,LSI Clean Keywords,LDA Clean Keywords,LSI Clean Keywords 2,LDA Clean Keywords 2,LSI Clean Keywords 3,LDA Clean Keywords 3,Top 5 Words
0,tech,tv future in the hands of viewers with home th...,"labour, election, blair, tax, brown","search, blair, holmes, blog, mobile","labour, blair, election, people, brown","mobile, games, music, blair, players","election, blair, government, party, people","dollar, bank, prices, growth, oil","blair,labour, election, brown, people"
1,business,worldcom boss left books alone former worldc...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy","labour, blair, election, people, brown","mobile, games, music, blair, players","election, blair, government, party, people","search, google, party, users, people","blair, people,labour, election, brown"
2,sport,tigers wary of farrell gamble leicester say ...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy","labour, blair, election, people, brown","mobile, games, music, blair, players","election, blair, government, party, people","blair, party, election, tax, kennedy","blair, election,labour, tax, brown"
3,sport,yeading face newcastle in fa cup premiership s...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy","labour, blair, election, people, brown","mobile, games, music, blair, players","growth, economy, film, bank, oil","dollar, bank, prices, growth, oil","blair,labour, election, brown, economy"
4,entertainment,ocean s twelve raids box office ocean s twelve...,"labour, election, blair, tax, brown","bank, dollar, player, sale, economy","labour, blair, election, people, brown","film, box, oscar, office, mercedes","election, blair, government, party, people","film, attacks, turkey, glasgow, morrison","blair,labour, election, brown, people"


<h2 id="#Section_3"> 7. Observatons </h2>

After vectorizing the text using TF-IDF vector in three different ways:
1. normal cleaning
2. using term frequncy
3. part of speech as noun
and using LSI/LSA and LDA algorithms for topic modeling.

From the results - LDA model using normal cleaning has better keywords and relevant to each article.

