<a href="https://colab.research.google.com/github/Maystrix/Topic-Modelling-on-News-Articles/blob/main/Capstone_IV_Topic_Modelling_on_BBC_News_Articles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **Topic Modelling on News Articles**



##### **Project Type**    - Unsupervised 
##### **Contribution**    - Individual


# **Project Summary -**

*The dataset contains 2,225 documents covering several hidden topics. The goal of this project was to pre-process the text and apply ML algorithms to identify those topics. After reading every file and appending its text and topic label to the respective lists, the data was cleaned and visualized with respect to document length and word count. Text pre-processing then removed irrelevant content (punctuation, non-words, stopwords, numbers, etc.), and the cleaned news was vectorized with TF-IDF for model implementation. LDA was implemented first and gave reasonable results, while LSA gave less satisfactory results with both CountVectorizer and TF-IDF. The best topic predictions came from LDA using the gensim library, which identified the topics based on the importance of words in each document.*

# **GitHub Link -**

https://github.com/Maystrix/Topic-Modelling-on-News-Articles

# **Problem Statement**


**Business Context**

In this project, your task is to identify the major themes/topics across a collection of BBC news articles. You can use clustering algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).

**Data Description**

The dataset contains a set of news articles for each major segment: business, entertainment, politics, sport and technology. You need to create an aggregate dataset of all the news articles and perform topic modelling on it. Verify whether the resulting topics correspond to the different tags available.

# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from IPython.display import display  # interactive display
from tqdm import tqdm   #progress bar of execution
from collections import Counter  # counts occurrences of items while iterating
import ast

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns
import os                                                                        # for listing files in given directory

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

import spacy
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize, RegexpParser

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer     # to create a dict --> how many times each word has occurred in the document
from textblob import TextBlob
import scipy.stats as stats

from wordcloud import WordCloud

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import TruncatedSVD              # Singular value decomposition
from sklearn.decomposition import LatentDirichletAllocation  
from sklearn.manifold import TSNE    # non-linear dimensionality reduction, used here for 2-D visualisation

import gensim
from gensim import corpora

import warnings                                                                  ## Ignore warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#output_notebook()

%matplotlib inline

#### Installing packages

In [None]:
!pip install contractions                                                        # for removing contractions
!pip install pyLDAvis  

### Dataset Loading

In [None]:
# Load Dataset
path = '/content/drive/MyDrive/Almabetter/Capstone Projects/Unsupervised Learning/Topic Modelling/bbc/'
folders = ["business", "entertainment", "politics", "sport", "tech"]

In [None]:
news = []                                                                        
topics = []

for i in folders:
  lst_file_names = os.listdir(path+i)                                            # list of file names present in directory bbc
  for txt_files in lst_file_names:
    txt_path = path + i+ '/'+ txt_files                                          # exact path of all text files
    with open(txt_path, 'rb') as f:                                              # open a binary file
      text = f.read()                                                            # read the whole file as bytes
      news.append(text)                                                          # append text files 
      topics.append(i)                                                           # append topics 


In [None]:
# creating a dataframe of news & type
df = pd.DataFrame()
df['News_text'] = news
df['type'] = topics

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

### What did you know about your dataset?

*The dataset contains 2,225 documents, with 98 duplicates and zero null values. The text column needs pre-processing for better model interpretability*

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# finding length of each news
df['length'] = df['News_text'].apply(len)

In [None]:
# finding word count of each news
df['word_count'] = df['News_text'].apply(lambda x:len(str(x).split(" ")))

In [None]:
df.head()

In [None]:
bbc_df = df.copy()

### What all manipulations have you done and insights you found?

*The length and word_count of each document were calculated*
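
A note on the word count: `split(" ")` breaks on single spaces only, so words separated by a newline are counted as one. The sketch below uses a made-up sample string (not taken from the dataset) to compare it with `split()`, which breaks on any run of whitespace.

```python
# Hypothetical sample text, only for illustration
sample = "Shares rise\nafter strong\nresults today"

# split(" ") breaks on single spaces only, so the newlines keep words glued together
tokens_single_space = sample.split(" ")

# split() with no argument breaks on any run of whitespace, including newlines
tokens_whitespace = sample.split()

print(len(tokens_single_space), tokens_single_space)
print(len(tokens_whitespace), tokens_whitespace)
```

If exact word counts matter, `split()` is probably the safer choice; for the relative comparison between topics made here, either version should show a similar picture.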

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
topic_count = bbc_df['type'].value_counts()
topic_count

In [None]:
plt.figure(figsize = (10, 8))
plt.pie(topic_count , labels = topic_count.index, autopct = '%0.2f%%' )
plt.title("Topic Distribution", size=15)
plt.show()

##### 1. Why did you pick the specific chart?

*A pie chart shows each category's share of the whole*

##### 2. What is/are the insight(s) found from the chart?

*Business and Sports are the most common topics present in the documents from the dataset*

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*No*

#### Chart - 2

In [None]:
# length of news in each type
plt.figure(figsize = (10, 8))
sns.barplot(x= bbc_df['type'], y= bbc_df['length'])
plt.title('Length of News in each type', size=15)
plt.xlabel('Type')
plt.ylabel("Length of news")
plt.show()

##### 1. Why did you pick the specific chart?

*To check the length of news of each topic*

##### 2. What is/are the insight(s) found from the chart?

*Tech and politics articles are the longest on average of all the topics*

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*Length of news is not related to topic distribution*

#### Chart - 3

In [None]:
# word count of news in each type
plt.figure(figsize = (10, 8))
sns.barplot(x= bbc_df['type'], y= bbc_df['word_count'])
plt.title('Word count of News in each type', size=15)
plt.xlabel('Type')
plt.ylabel("Word count of news")
plt.show()

##### 1. Why did you pick the specific chart?

*To check the number of words in each topic*

##### 2. What is/are the insight(s) found from the chart?

*Same as for length, as expected*

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


#### Chart-4

In [None]:
bbc1_business = df[df['type'] == 'business']
bbc1_entertainment = df[df['type'] == 'entertainment']
bbc1_politics = df[df['type'] == 'politics']
bbc1_sport = df[df['type'] == 'sport']
bbc1_tech = df[df['type'] == 'tech']

In [None]:
# topic distribution based on length
import matplotlib
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
bins = 100
plt.hist(bbc1_business['word_count'], alpha = 0.6, bins=bins, label='business')
plt.hist(bbc1_entertainment['word_count'], alpha = 0.6, bins=bins, label='entertainment')
plt.hist(bbc1_politics['word_count'], alpha = 0.6, bins=bins, label='politics')
plt.hist(bbc1_sport['word_count'], alpha = 0.6, bins=bins, label='sport')
plt.hist(bbc1_tech['word_count'], alpha = 0.6, bins=bins, label='tech')
plt.xlabel('word_count')
plt.ylabel('numbers')
plt.legend(loc='upper right')
plt.xlim(0,500)
plt.grid()
plt.show()

##### 1. Why did you pick the specific chart?

*To check the relation between topic and word count*

##### 2. What is/are the insight(s) found from the chart?

*Word counts are too densely concentrated between 200 and 400 to distinguish topics by word count*

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# dropping duplicate values
bbc_df.drop_duplicates(inplace=True)

#### What all missing value imputation techniques have you used and why did you use those techniques?

*No missing values were found, so no imputation was needed. Duplicate rows were dropped*

### 4. Textual Data Preprocessing 


#### 1. Expand Contraction
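
As a minimal sketch of what contraction expansion could look like, the snippet below uses a plain regular expression and a deliberately tiny mapping. Both the `CONTRACTIONS` dictionary and the `expand_contractions` helper are hypothetical names introduced for illustration; they are not part of this notebook's pipeline.

```python
import re

# Hypothetical, deliberately tiny mapping; a real project would need a fuller list
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

# One alternation over all keys, matched case-insensitively on word boundaries
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded (lower-case) form."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_contractions("It's clear the deal won't close, but don't panic.")
print(expanded)
```

A step like this would have to run before punctuation removal, since it relies on the apostrophes still being present.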

In [None]:
# Inspect a sample document (no contraction expansion is applied in this cell)
bbc_df['News_text'][2]

#### 2. Lower Casing

In [None]:
# Lower Casing
bbc_df['News_text']= bbc_df['News_text'].apply(lambda x: x.lower())

In [None]:
# Converting column into a string (the raw text is bytes, so this yields the b'...' representation that later cells strip)
bbc_df['News_text'] = bbc_df['News_text'].astype('str') 

#### 3. Removing URLs & Removing non-words

* HTML tags

In [None]:
# removing HTML tags
def remove_html_tags(text):
    """Remove html tags from a string"""
    import re                                                                    # regular expression module
    clean = re.compile('<.*?>')                                                  # removes anything in < >
    return re.sub(clean, '', text)

In [None]:
bbc_df['News_text'] = bbc_df['News_text'].apply(remove_html_tags)

* URLs

In [None]:
# removing url's
def remove_urls(text):
  """ Remove url tags from a string """
  url_pattern = r"https?://\S+|www\.\S+"                  # https? --> matches http or https (? = 0 or 1 occurrences of the preceding character) ; \S+ --> matches one or more non-whitespace characters
  without_urls = re.sub(pattern=url_pattern, repl=' ', string=text)
  return without_urls

In [None]:
bbc_df['News_text'] = bbc_df['News_text'].apply(remove_urls)

In [None]:
bbc_df['News_text'][2]

* non-words

In [None]:
# removing non-word 
def rem_non_word(text):
  """ Remove non-word characters from the string """
  non_words = r"\s+[a-zA-Z]'\s+"                                               # \s+ --> matches whitespace with repetition (+) ; [a-zA-Z]' --> a single stray letter followed by an apostrophe
  without_nw = re.sub(pattern = non_words , repl = ' ', string =text)           # replace with a space so neighbouring words do not merge
  return without_nw

In [None]:
bbc_df['News_text'] = bbc_df['News_text'].apply(rem_non_word)

In [None]:
bbc_df['News_text'] = [re.sub(r"(?:\\n)+", " ", i) for i in bbc_df['News_text']]   # replace runs of literal "\n" sequences with a space, without eating a following letter n

In [None]:
bbc_df['News_text'][2]

In [None]:
bbc_df['News_text'] = [re.sub(r"^b['\"]", "", i) for i in bbc_df['News_text']]     # strip only the leading b' or b" left by the bytes representation

In [None]:
bbc_df['News_text'] = [re.sub(r"\\\'s+", "", i) for i in bbc_df['News_text']]

In [None]:
bbc_df['News_text'][2]

#### 4. Removing Punctuations & digits

* Punctuations

In [None]:
#library that contains punctuation
import string
string.punctuation

In [None]:
def remove_punctuation(text):
  """ Removes punctuations mentioned in the library """
  punctuation_free = "".join([i for i in text if i not in string.punctuation])    # joins everything except punctuations
  return punctuation_free

In [None]:
bbc_df['News_text'] = bbc_df['News_text'].apply(lambda x:remove_punctuation(x))

In [None]:
bbc_df['News_text'][2]

* Digits [Numbers]

In [None]:
def remove_numbers(text):
  num_to_remove = r'\d+'                                                               # \d+ --> matches one or more digits 0-9
  without_num = re.sub(pattern = num_to_remove , repl ="", string=text)
  return without_num

In [None]:
bbc_df['News_text'] = bbc_df['News_text'].apply(remove_numbers)

In [None]:
bbc_df['News_text'][2]

#### 5. Removing Stopwords 

In [None]:
# Remove Stopwords

nltk.download('stopwords')

In [None]:
# extracting the stopwords from nltk library
sw = stopwords.words('english')
# displaying the stopwords
np.array(sw)

In [None]:
def remove_stopwords(text):                                                      # named remove_stopwords to avoid shadowing the imported nltk `stopwords` module
    '''a function for removing the stopwords'''
    # keeping only the words that are not in the stopword list
    text = [word for word in text.split() if word not in sw]
    # joining the list of words with space separator
    return " ".join(text)

In [None]:
bbc_df['News_wo_stopwords'] = bbc_df['News_text'].apply(remove_stopwords)

In [None]:
bbc_df.head()

#### 6. Tokenization

In [None]:
#Word tokenization function
def tokenization(text):
    tokens = re.split(r'\W+',text)                          # raw string pattern; splits on runs of non-word characters
    return tokens

In [None]:
bbc_df['tokens'] = bbc_df['News_text'].apply(tokenization)

In [None]:
bbc_df.head()

#### 7. Text Normalization

In [None]:
#defining the object for Lemmatization
lmt = WordNetLemmatizer()

In [None]:
#defining the function for lemmatization
def lemmatizer(text):
  """ This function lemmatizes each word in the text """
  lemat_text = [lmt.lemmatize(word) for word in text.split()]
  return lemat_text

In [None]:
bbc_df['lematized_news'] = bbc_df['News_text'].apply(lambda x:lemmatizer(x))

In [None]:
bbc_df.head()

* Clean News

In [None]:
bbc_df['clean_news'] = [' '.join(text) for text in bbc_df['lematized_news']] 

In [None]:
bbc_df['clean_news'][2]

##### Which text normalization technique have you used and why?

*WordNet lemmatization was used for normalization because it maps each word to a valid dictionary root (lemma) rather than a truncated stem. Note that `lemmatize` is called here without a part-of-speech tag, so every word is treated as a noun*

#### 9. Part of speech tagging

In [None]:
# using spacy library

def pos_tagging(text):
  """ Keep only the nouns and adjectives in the text """
  allowed_postags = ['NOUN', 'ADJ']
  doc = nlp(text)                                                                # nlp is the spaCy pipeline loaded in the next cell
  return ' '.join(token.text for token in doc if token.pos_ in allowed_postags)


In [None]:
nlp = spacy.load("en_core_web_sm" ,disable=['parser', 'ner'])
bbc_df['news'] = bbc_df['clean_news'].apply(lambda x:pos_tagging(x))

In [None]:
bbc_df.head()

In [None]:
bbc_df['news'][0]

### 4. Feature Manipulation & Selection

* Top Words

In [None]:
# Define helper functions
def get_top_n_words(n_top_words, count_vectorizer, text_data):
    '''
    returns a tuple of the top n words in a sample and their 
    accompanying counts, given a CountVectorizer object and text sample
    '''
    vectorized_headlines = count_vectorizer.fit_transform(text_data.values)        # .values --> creates a numpy array
    vectorized_total = np.sum(vectorized_headlines, axis=0)
    word_indices = np.flip(np.argsort(vectorized_total)[0,:], 1)           # index / position of each word in all documents
    word_values = np.flip(np.sort(vectorized_total)[0,:],1)                # values of words at that position
    
    word_vectors = np.zeros((n_top_words, vectorized_headlines.shape[1]))         # n top words binary matrix for all the headlines
    for i in range(n_top_words):
        word_vectors[i,word_indices[0,i]] = 1

    words = [word[0].encode('ascii').decode('utf-8') for                           # n top words 
             word in count_vectorizer.inverse_transform(word_vectors)]

    return (words, word_values[0,:n_top_words].tolist()[0])

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')
words, word_values = get_top_n_words(n_top_words=15,
                                     count_vectorizer=count_vectorizer, 
                                     text_data=bbc_df['news'])

In [None]:
words

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')
words, word_values = get_top_n_words(n_top_words=15,
                                     count_vectorizer=count_vectorizer, 
                                     text_data=bbc_df['news'])

fig, ax = plt.subplots(figsize=(16,8))
ax.bar(range(len(words)), word_values)
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words, rotation='vertical')
ax.set_title('Top words in the news dataset (excluding stop words)', size = 15)
ax.set_xlabel('Word', size =12 )
ax.set_ylabel('Number of occurrences', size=12)
plt.show()

*  Word-Cloud

In [None]:
plt.figure(figsize=(15,10))
cl_words = ' '.join(bbc_df['news'])
wordCloud = WordCloud(width=800, height=500, background_color="black", max_font_size=100).generate(cl_words)
plt.imshow(wordCloud, interpolation="bilinear", cmap = 'BuPu')
plt.axis('off')
plt.show()

In [None]:
plt.figure(figsize=(15,10))
allWords = ' '.join([topic for topic in bbc_df[bbc_df['type']=='business']['news']])
wordCloud = WordCloud(width=500, height=300, background_color="black", random_state=21, max_font_size=100).generate(allWords)
plt.imshow(wordCloud, interpolation="bilinear", cmap = 'Greys')
plt.axis('off')
plt.show()

#### 10. Text Vectorization

In [None]:
vectorizer = TfidfVectorizer(max_df=0.8, max_features=8000,min_df=0.05)
tfidf_matrix = vectorizer.fit_transform(bbc_df['news'])
feature_names = vectorizer.get_feature_names_out()                               # get_feature_names() was removed in scikit-learn 1.2

In [None]:
print('Feature names after vectorization : \n{}'.format(feature_names))

##### Which text vectorization technique have you used and why?

*TF-IDF is used because CountVectorizer only counts how many times a word appears in a document, whereas TfidfVectorizer also takes into account how important the word is across the whole corpus*
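
To make the weighting concrete, here is a small worked sketch on a made-up, already tokenised corpus. It assumes the smoothed IDF that scikit-learn documents as the `TfidfVectorizer` default, `ln((1 + n) / (1 + df)) + 1`, and leaves out the final L2 normalisation; the corpus and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenised
docs = [
    ["market", "share", "growth", "market"],
    ["match", "goal", "team"],
    ["market", "team", "deal"],
]

n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(doc)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Per occurrence, "share" (found in one document) gets a larger weight than "market" (found in two), which is the corpus-level importance that plain counts miss.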

## ***7. ML Model Implementation***

 ### ML Model -1. Latent Dirichlet Allocation (LDA)  



In [None]:
# Parameter tuning 
from sklearn.model_selection import GridSearchCV 
grid_params = {'n_components':range(5, 10)}

In [None]:
# LDA model
lda = LatentDirichletAllocation()
lda_model = GridSearchCV(lda, param_grid = grid_params)
lda_model.fit(tfidf_matrix)

In [None]:
# Best LDA model
best_lda_model = lda_model.best_estimator_

print("Best LDA model's params" , lda_model.best_params_)
print("Best log likelihood Score for the LDA model",lda_model.best_score_)
print("LDA model Perplexity on train data", best_lda_model.perplexity(tfidf_matrix))

In [None]:
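import pyLDAvis
try:
    import pyLDAvis.lda_model as pyldavis_sklearn                                # module name in newer pyLDAvis releases
except ImportError:
    import pyLDAvis.sklearn as pyldavis_sklearn                                  # module name in older pyLDAvis releases

pyLDAvis.enable_notebook()

In [None]:
lda_panel = pyldavis_sklearn.prepare(best_lda_model, tfidf_matrix ,vectorizer,mds='tsne')
lda_panel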

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

*LDA recovers most of the expected topics: 1. Sports, 2. Politics, 3. Business, 4. Tech, 5. Business. Business appears twice, and the entertainment topic is not identified correctly*
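
For reading the perplexity printed by `best_lda_model.perplexity(...)`: the scikit-learn documentation defines it as the exponential of the negative per-word log-likelihood, so lower values are better. As a reference formula, where $M$, $\mathbf{w}_d$ and $N_d$ are notation introduced here (number of documents, the words of document $d$, and its length), and keeping in mind that scikit-learn evaluates an approximate variational bound rather than the exact likelihood:

$$
\text{perplexity}(D) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)
$$

Because the model here is fitted on TF-IDF weights rather than raw counts, $N_d$ is best read as the sum of the weights in document $d$, so the value is mainly useful for comparing models fitted on the same matrix.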

### ML Model - 2. Latent Semantic Analysis (LSA)

*  Using Count vectorization

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')                         ## Common for all models
count_data = count_vectorizer.fit_transform(bbc_df['news'])
feature_names = count_vectorizer.get_feature_names_out()                         # get_feature_names() was removed in scikit-learn 1.2
number_topics = 5
top_words = 20


In [None]:
# document clustering for LSA
tsvd = TruncatedSVD(n_components = 5)
tsvd.fit(count_data)
tsvd_mat = tsvd.transform(count_data)

In [None]:
s_list =[]

for clus in tqdm(range(2, 21)):
  km = KMeans(n_clusters=clus, n_init=50, max_iter=1000)                         # Instantiate KMeans clustering
  km.fit(tsvd_mat)                                                               # Run KMeans clustering
  s = silhouette_score(tsvd_mat, km.labels_)
  s_list.append(s)

In [None]:
plt.plot(range(2,21), s_list)
plt.show()

In [None]:
# TSNE plot
tsne = TSNE(n_components=2)
tsne_mat = tsne.fit_transform(tsvd_mat)

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x=tsne_mat[:,0], y=tsne_mat[:,1], hue=bbc_df['type'].values)     # keyword x/y; .values avoids index misalignment after dropping duplicates

In [None]:
def create_top_words_dict(model, features , n_top_words):                        # Defining function for top words
    """ This function gives top words."""
    top_words_dict = {}
    for topic_id, topic in enumerate(model.components_):
        top_words_dict[topic_id] = [features[i] for i in topic.argsort()[:-n_top_words - 1:-1]]        
    return top_words_dict

In [None]:
print("Top 20 words bagged by SVD model using CountVectorizer")

svd_top_words = create_top_words_dict(tsvd, feature_names , top_words)

print(svd_top_words)

In [None]:
pd.set_option('display.max_colwidth', None)    

In [None]:
terms = count_vectorizer.get_feature_names_out()

for i, comp in enumerate(tsvd.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:15]
    print("\nTopic " + str(i) + ":")
    for t in sorted_terms:
        print(t[0],end=" ")

* Using TF-IDF 

In [None]:
vectorizer = TfidfVectorizer(max_df=0.8, max_features=8000,min_df=0.05)
tfidf_matrix = vectorizer.fit_transform(bbc_df['news'])

In [None]:
tfidf_matrix.shape

In [None]:
# svd model

svd_model = TruncatedSVD(n_components = 5, algorithm ='randomized', n_iter =100, random_state = 0)
svd_model.fit(tfidf_matrix)

In [None]:
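print("Top 20 words bagged by SVD model using TF-IDF")

tfidf_feature_names = vectorizer.get_feature_names_out()                         # vocabulary of the TF-IDF vectorizer the SVD model was fitted on
svd_top_words = create_top_words_dict(svd_model , tfidf_feature_names , top_words)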

print(svd_top_words)

In [None]:
terms = vectorizer.get_feature_names_out()

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:15]
    print("\nTopic " + str(i) + ":")
    for t in sorted_terms:
        print(t[0],end=" ")


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

*LSA doesn't provide good results with CountVectorizer and gives only average results with TF-IDF vectorization, possibly because LSA focuses more on dimensionality reduction than on the importance of words*
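
One way to see why LSA components can be harder to read as topics: LSA is a truncated singular value decomposition, and its term weights may be negative as well as positive. The sketch below runs a plain NumPy SVD on a made-up document-term matrix and keeps the two largest components. The matrix and term list are hypothetical, and `np.linalg.svd` stands in for `TruncatedSVD` purely for illustration.

```python
import numpy as np

# Hypothetical document-term count matrix: 4 documents x 5 terms
terms = ["market", "profit", "match", "goal", "team"]
X = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 0, 0, 1],
    [0, 0, 3, 2, 2],
    [0, 0, 2, 3, 2],
], dtype=float)

# Full SVD, then keep the k largest singular values ("truncated" SVD)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topic = U[:, :k] * s[:k]   # documents projected onto the k latent components
topic_term = Vt[:k, :]         # each row weights every term; signs can be mixed

for i, row in enumerate(topic_term):
    # rank terms by absolute weight, since large negative weights matter too
    top = [terms[j] for j in np.argsort(-np.abs(row))[:3]]
    print(f"component {i}: {top}")
```

LDA, by contrast, yields non-negative word probabilities per topic, which tends to make its top-word lists easier to interpret.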

### ML Model - 3. LDA using Gensim library

In [None]:
# creating word vector for gensim 
dtm_g = bbc_df['news'].str.split().tolist()

In [None]:
dtm_g[:1]

In [None]:
id2word = corpora.Dictionary(dtm_g)                                                # mapping words to tokens
# counting the number of occurrences of each distinct word,--> converting to its integer word id and return the result as a sparse vector.
corpus = [id2word.doc2bow(text) for text in dtm_g]                                

In [None]:
print(corpus[:120])

In [None]:
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

In [None]:
lda_g = gensim.models.ldamodel.LdaModel                                           # alias for the gensim LDA model class
lda_model = lda_g(corpus = corpus, num_topics = 5, id2word=id2word, random_state= 101, chunksize = 500 , passes=10 , eval_every =None )

In [None]:
lda_model.print_topics()

*The above output means that each unique word is given a weight within each topic, which shows which words dominate each topic.*
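
To work with these weights programmatically, the topic strings can be parsed into (word, weight) pairs. The sketch below assumes the `weight*"word" + ...` layout that `print_topics()` returns; the sample string is made up and is not actual model output.

```python
import re

# Hypothetical topic string in the weight*"word" layout
topic_string = '0.015*"game" + 0.012*"player" + 0.009*"team"'

# Capture each weight and its quoted word, then flip to (word, weight)
pairs = [
    (word, float(weight))
    for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
]
print(pairs)
```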

In [None]:
print(lda_model.print_topics(num_topics=5, num_words=5))

In [None]:
import pyLDAvis.gensim_models

In [None]:
pyLDAvis.enable_notebook()
vis=pyLDAvis.gensim_models.prepare(lda_model,corpus,id2word)
vis

In [None]:
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=dtm_g, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

*LDA with gensim provides the best topic predictions for the documents --> 1. Politics, 2. Tech, 3. Sports, 4. Entertainment, 5. Business*
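
One possible way to check this mapping against the known tags, rather than by reading top words alone, is to cross-tabulate each document's dominant topic with its tag. The sketch below uses made-up toy lists: `tags` and `dominant_topic` are hypothetical stand-ins for the `type` column and for the highest-probability topic of each document.

```python
from collections import Counter, defaultdict

# Hypothetical toy data, not real model output
tags = ["sport", "sport", "tech", "tech", "business", "business", "sport"]
dominant_topic = [2, 2, 1, 1, 4, 1, 2]

# Count how often each tag occurs inside each topic
tags_per_topic = defaultdict(Counter)
for tag, topic in zip(tags, dominant_topic):
    tags_per_topic[topic][tag] += 1

# Label each topic with its most frequent tag, then compute overall purity
majority = {topic: counts.most_common(1)[0] for topic, counts in tags_per_topic.items()}
purity = sum(count for _, count in majority.values()) / len(tags)

for topic in sorted(majority):
    print(topic, majority[topic])
print(f"purity = {purity:.2f}")
```

A purity close to 1 would suggest that each topic lines up with a single tag; a topic whose majority tag covers only a small share of its documents would point to mixed or split topics.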

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

*LDA using the gensim library, because it correctly classifies topics based on the importance of words for each document*

# **Conclusion**



* Overall, three different algorithms were used for topic modelling on the news articles.
* LDA provided reasonable results, with distinct topics as expected.
* LSA with CountVectorizer gave somewhat reasonable results, but with TF-IDF the results were not as expected.
* LDA using the gensim library provided the best result, with a coherence score of 0.52.
* The topics represented in LDA were adjacent, and the hidden topics and the relationships between words and documents were found through multiple probability distributions.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***