<a href="https://colab.research.google.com/github/Rahul711sharma/Topic-Modeling/blob/main/Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Extraction/identification of major topics & themes discussed in news articles. </u></b>

## <b> Problem Description </b>

### In this project, your task is to identify the major themes/topics across a collection of BBC news articles. You can use topic-modeling techniques such as Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA).

## <b> Data Description </b>

### The dataset contains news articles from five major segments: business, entertainment, politics, sports and technology. Create an aggregate dataset of all the news articles, perform topic modeling on it, and verify whether the resulting topics correspond to the available tags.

## **Libraries**

In [10]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *
import glob

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') 
nltk.download('punkt')

from sklearn.feature_extraction.text import CountVectorizer

from textblob import TextBlob
import scipy.stats as stats

from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [14]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Import Data**




In [15]:
path ="/content/drive/MyDrive/Capstone Projects/Unsupervised Learning/bbc/"

In [16]:
#Importing text file paths
business = glob.glob(path+'/business/*')
entertainment = glob.glob(path+'/entertainment/*')
politics = glob.glob(path+'/politics/*')
sports = glob.glob(path+'/sport/*')
tech = glob.glob(path+'/tech/*')

In [17]:
business[0:5]

['/content/drive/MyDrive/Capstone Projects/Unsupervised Learning/bbc//business/349.txt',
 '/content/drive/MyDrive/Capstone Projects/Unsupervised Learning/bbc//business/321.txt',
 '/content/drive/MyDrive/Capstone Projects/Unsupervised Learning/bbc//business/311.txt',
 '/content/drive/MyDrive/Capstone Projects/Unsupervised Learning/bbc//business/301.txt',
 '/content/drive/MyDrive/Capstone Projects/Unsupervised Learning/bbc//business/341.txt']

In [20]:
len(business)

197

In [18]:
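def make_list(data):
    # Read every file path in `data` and return the contents as a list of strings
    texts = []
    for file_path in data:
        # The context manager closes each file as soon as it has been read
        with open(file_path, 'r') as file:
            texts.append(file.read())
    return texts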

In [19]:
business_texts= make_list(business)
entertainment_text = make_list(entertainment)
politics_texts= make_list(politics)
sport_texts= make_list(sports)
tech_text = make_list(tech)

In [21]:
# Number of documents in each category
print(len(business_texts),len(entertainment_text),len(politics_texts),len(sport_texts),len(tech_text))

197 196 50 74 110


In [22]:
complete_text = business_texts + entertainment_text + politics_texts + sport_texts + tech_text

In [23]:
len(complete_text)

627
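Because the category lists are concatenated in a fixed order, a parallel list of tags can be rebuilt from the counts printed above, which is useful later for checking topics against tags. This is only a sketch: `category_sizes` and `labels` are hypothetical names that do not appear elsewhere in this notebook, and the sizes are copied from the output above.

```python
# Category sizes copied from the printed counts, in concatenation order
category_sizes = {
    'business': 197,
    'entertainment': 196,
    'politics': 50,
    'sport': 74,
    'tech': 110,
}

# Repeat each tag once per document so that labels[i] matches complete_text[i]
labels = [name for name, size in category_sizes.items() for _ in range(size)]

print(len(labels))
```

The list could then be stored next to the texts, for example as an extra DataFrame column.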

In [24]:
data = pd.DataFrame({'Texts': complete_text})
data.head()

Unnamed: 0,Texts
0,S Korean lender faces liquidation\n\nCreditors...
1,"Diageo to buy US wine firm\n\nDiageo, the worl..."
2,Stormy year for property insurers\n\nA string ...
3,Libya takes $1bn in unfrozen funds\n\nLibya ha...
4,Jarvis sells Tube stake to Spain\n\nShares in ...


In [25]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [31]:
# Data cleaning
def text_processing(data):
  # Build the stop-word set once instead of reloading the list for every word
  stop_words = set(stopwords.words('english'))
  data = data.map(lambda x: x.replace('\n', ' '))
  data = data.map(lambda x: x.lower())
  # data = data.map(lambda x: ''.join([i for i in x if i not in string.punctuation]))
  data = data.map(lambda x: ' '.join([i for i in x.split(' ') if i not in stop_words]))
  return data
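The punctuation filter in the cell above is commented out. If punctuation removal is wanted, one option is `str.translate`, which has the same effect as the commented-out line but works in a single pass. The helper name `strip_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    # Remove punctuation characters and leave everything else untouched
    return text.translate(PUNCTUATION_TABLE)

print(strip_punctuation("diageo, world's biggest"))
```

Applied to the column, this would look like `data['Texts'].map(strip_punctuation)`.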

In [41]:
data['Texts']= text_processing(data['Texts'])


In [42]:
data.head()

Unnamed: 0,Texts
0,korean lender faces liquidation creditors sou...
1,"diageo buy us wine firm diageo, world's bigge..."
2,"stormy year property insurers string storms, ..."
3,libya takes $1bn unfrozen funds libya withdra...
4,jarvis sells tube stake spain shares engineer...


In [57]:
# Number of sentences in each document (stored under the column name 'Sentence lengths')
data['Sentence lengths'] = [len(i) for i in data['Texts'].apply(nltk.sent_tokenize)]

In [58]:
data.head()

Unnamed: 0,Texts,Sentence lengths
0,korean lender faces liquidation creditors sou...,17
1,"diageo buy us wine firm diageo, world's bigge...",7
2,"stormy year property insurers string storms, ...",9
3,libya takes $1bn unfrozen funds libya withdra...,8
4,jarvis sells tube stake spain shares engineer...,9


In [60]:
data['Sentence lengths'].nlargest(15)

249    229
406    148
547    147
278    137
387    134
578    107
487     79
550     58
354     51
602     51
366     48
521     46
618     46
268     45
520     45
Name: Sentence lengths, dtype: int64

In [70]:
def number_of_words(data):
  words_count = [len(i.split()) for i in data['Texts']]
  data['Number of words'] = words_count
  return data.head() 

In [71]:
number_of_words(data)

Unnamed: 0,Texts,Sentence lengths,Number of words
0,korean lender faces liquidation creditors sou...,17,228
1,"diageo buy us wine firm diageo, world's bigge...",7,104
2,"stormy year property insurers string storms, ...",9,131
3,libya takes $1bn unfrozen funds libya withdra...,8,112
4,jarvis sells tube stake spain shares engineer...,9,143


In [75]:
def count_complex_words(data):
  # A word counts as "complex" here when it is longer than 4 characters
  complex_counts = [sum(len(word) > 4 for word in text.split()) for text in data['Texts']]
  data['Number of Complex words'] = complex_counts
  return data.head()

In [76]:
count_complex_words(data)

Unnamed: 0,Texts,Sentence lengths,Number of words,Number of Complex words
0,korean lender faces liquidation creditors sou...,17,228,167
1,"diageo buy us wine firm diageo, world's bigge...",7,104,79
2,"stormy year property insurers string storms, ...",9,131,107
3,libya takes $1bn unfrozen funds libya withdra...,8,112,82
4,jarvis sells tube stake spain shares engineer...,9,143,113


In [None]:
# Lemmatization of the words in each document
lemmatizer = WordNetLemmatizer()

data['Texts'] = data['Texts'].map(lambda x: ' '.join([lemmatizer.lemmatize(i) for i in x.split()]))

In [None]:
data['Texts'][626]

'game maker get xbox 2 sneak peek microsoft given game maker glimpse new xbox 2 console detail xboxs performance gaming like device given annual game developer conference u xbox frontman j allard said console looked set capable one trillion calculation per second also title new xbox interface make easy play online buy extra character addons game microsoft saving official unveiling xbox 2 codenamed xenon e3 show may device could shop shelf november however keynote speech gdc mr allard head development gamemaking tool console gave glimpse core software work said gaming entering highdefinition era demanded detailed convincing graphic could adequately compete hdtv people starting watch well hd dvd soon start appear industry watcher took mean xbox 2 push hdtv quality graphic standard well multichannel audio give gamers authentic experience mr allard said microsoft work hard ensure easy game maker produce title xbox 2 player get playing end microsoft building xbox hardware system support hea

## **Vectorization**

In [None]:
vectors = CountVectorizer()
document_term_matrix = vectors.fit_transform(data['Texts'])

In [None]:
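# Parameter tuning using grid search (GridSearchCV ranks candidates by the estimator's
# own score method, which for LDA is an approximate log-likelihood on held-out folds)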
from sklearn.model_selection import GridSearchCV
grid_params = {'n_components' : list(range(5,10))}

# LDA model
lda = LatentDirichletAllocation()
lda_model = GridSearchCV(lda,param_grid=grid_params)
lda_model.fit(document_term_matrix)

# Best LDA model
best_lda_model = lda_model.best_estimator_

print("Best LDA model's params" , lda_model.best_params_)
print("Best log likelihood Score for the LDA model",lda_model.best_score_)

Best LDA model's params {'n_components': 5}
Best log likelihood Score for the LDA model -305700.43238564546


In [None]:
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

In [None]:
lda_panel = pyLDAvis.sklearn.prepare(best_lda_model, document_term_matrix,vectors)

lda_panel

TypeError: ignored

In [None]:
from pandas.compat._optional import import_optional_dependency
ne = import_optional_dependency("numexpr")