# **Algorithm for Topic Extraction Using LDA**

This document presents the development of an unsupervised algorithm for topic extraction. The technique used here is LDA (Latent Dirichlet Allocation), a widely used topic model that is based on the Dirichlet distribution.
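
To make the role of the Dirichlet distribution concrete, the following is a minimal sketch of the generative story that LDA assumes: each document draws a topic mixture from a Dirichlet distribution, and each word is drawn from the word distribution of a topic picked from that mixture. The function names, the toy vocabulary size, and the hyperparameter values are illustrative assumptions and are not taken from this project.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate word ids for one document under the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to the mixture.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a word id from that topic's word distribution.
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]
        words.append(w)
    return theta, words


rng = random.Random(0)
# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
topic_word = [sample_dirichlet([0.5] * 4, rng) for _ in range(2)]
theta, doc = generate_document(topic_word, alpha=[0.5, 0.5], length=10, rng=rng)
print(theta, doc)
```

Training an LDA model amounts to inverting this process: given only the observed words, it estimates the topic mixtures and the per-topic word distributions.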

*This notebook records the important notes and project decisions. A ".py" file containing the same code is also available in this folder.*

## **Approach**

## Input the Data

#### The first step is to load the data and filter it by language to avoid inconsistent results

In [1]:
from Algorithms import preProcessing_BBC



In [2]:
data = preProcessing_BBC.import_files("articles_bbc_2018_01_30.csv",
                                   preCleaning = True,
                                   dropna = 'index',
                                   verbose = True)

Input Format:
Rows: 309, Columns: 2

Pre cleaning format:
Rows: 308, Columns: 2


In [3]:
preProcessing_BBC.language_detection(data,
                                 verbose = True)


en    257
fa      9
fr      7
id      5
vi      4
ru      4
hi      4
uk      4
ar      4
sw      3
tr      2
pt      2
es      2
de      1
Name: lang, dtype: int64

Most Frequent language: en    257
Name: lang, dtype: int64


en    257
Name: lang, dtype: int64

In [4]:
data = preProcessing_BBC.language_cleaning(dataFile = data,
                                        language = 'en',
                                        verbose = True)

Cleaning data using 'en' language.
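
The language filter can be pictured with a small dependency-free sketch. The internals of `preProcessing_BBC.language_detection` and `language_cleaning` are not shown in this notebook, so the helper names and the sample records below are hypothetical; the sketch also assumes that each document already carries a detected language label.

```python
from collections import Counter


def most_frequent_language(records):
    """Return the language label that occurs most often and its count."""
    counts = Counter(lang for _, lang in records)
    return counts.most_common(1)[0]


def filter_by_language(records, language):
    """Keep only the records labelled with the requested language."""
    return [(text, lang) for text, lang in records if lang == language]


# Hypothetical (text, detected_language) pairs, not taken from the BBC file.
records = [
    ("Hello world", "en"),
    ("Bonjour le monde", "fr"),
    ("Good morning", "en"),
    ("Hola mundo", "es"),
]

language, count = most_frequent_language(records)
english_only = filter_by_language(records, language)
print(language, count, len(english_only))
```

Keeping a single language matters because the later steps (stop-word removal, lemmatizing) rely on language-specific resources.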


## Preprocessing the Data

#### **Tokenizing** the documents to the word level.

In [5]:
data = preProcessing_BBC.tokenization(data, level = 'word', verbose = True)

Tokenizing data to word
.
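
As an illustration of word-level tokenization, here is a minimal regex-based sketch. It is a simplified stand-in, not the tokenizer used by `preProcessing_BBC.tokenization` (whose implementation is not shown here); the pattern and the function name are assumptions chosen for the example.

```python
import re

# Assumed pattern: runs of letters, optionally followed by an apostrophe part.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize_words(text):
    """Split text into lowercase word tokens, dropping digits and punctuation."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


tokens = tokenize_words("The BBC's report wasn't published in 2018.")
print(tokens)
```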






#### Preparing for **Lemmatizing** using POS tagging.

In [6]:
data = preProcessing_BBC.POS_tagging(data)





In [7]:
data = preProcessing_BBC.lemmatizing(data)





#### Removing **StopWords** using the English stop-word list from the Natural Language Toolkit (NLTK) and removing any token shorter than 2 characters.

In [8]:
data = preProcessing_BBC.removeStopWords(data, minSize = 2)
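
The filtering rule can be sketched without any dependencies. The stop-word set below is a tiny hypothetical sample rather than the NLTK list, and the sketch assumes that `minSize = 2` means "keep tokens with at least 2 characters"; the actual behaviour of `removeStopWords` is not shown in this notebook.

```python
# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's).
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "to", "and", "is"})


def remove_stop_words(tokens, stop_words=STOP_WORDS, min_size=2):
    """Drop stop words and tokens shorter than min_size characters."""
    return [
        token
        for token in tokens
        if token.lower() not in stop_words and len(token) >= min_size
    ]


cleaned = remove_stop_words(["the", "giant", "predator", "x", "of", "africa"])
print(cleaned)
```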

### **Training the LDA Model**

### Preparing the Data

#### Generating the tokens using the Bigram and the Trigram Model

In [9]:
tokens = preProcessing_BBC.Bi_n_TrigramModel(data, min_cnt = 1, verbose = True)

Getting tokens From data file and converting into a list of tokens.
Building the Bigram Model
Building the Trigram Model
Importing the Trigram Model and converting into list
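
The idea behind the bigram step can be shown with a simplified, count-based sketch: adjacent token pairs that occur often enough are merged into a single token joined by an underscore, which is how tokens such as `saudi_arabia` arise. This is not the scoring used by the phrase model inside `Bi_n_TrigramModel`; the function name, the threshold, and the sample documents are assumptions for illustration.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Merge adjacent token pairs seen at least min_count times in the corpus."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Hypothetical tokenized documents.
docs = [
    ["saudi", "arabia", "government"],
    ["visit", "saudi", "arabia"],
    ["government", "report"],
]
merged = merge_frequent_bigrams(docs)
print(merged)
```

Running the same function again on its own output would merge a bigram with a neighbouring token, which is the intuition behind building the trigram model on top of the bigram model.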


#### Creating the dictionary using the tokens

In [10]:
dictionary = preProcessing_BBC.generateDictionary(tokens, min_thld = 3, verbose = True)

Generating the Ditionary.
Filtering dictionary using the minimun threshold: 3
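
A dictionary in this context maps each retained token to an integer id. The sketch below assumes that the minimum threshold is a document-frequency cut-off, meaning tokens that appear in fewer than `min_doc_freq` documents are dropped; that interpretation of `min_thld`, the function name, and the sample documents are assumptions, since the implementation of `generateDictionary` is not shown.

```python
from collections import Counter


def build_dictionary(docs, min_doc_freq=3):
    """Map tokens to ids, keeping tokens found in at least min_doc_freq docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    kept = sorted(tok for tok, n in doc_freq.items() if n >= min_doc_freq)
    return {tok: idx for idx, tok in enumerate(kept)}


# Hypothetical tokenized documents.
docs = [
    ["water", "ocean", "wave"],
    ["water", "animal"],
    ["water", "ocean", "light"],
    ["ocean", "seal"],
]
token2id = build_dictionary(docs, min_doc_freq=3)
print(token2id)
```

Dropping rare tokens shrinks the vocabulary and removes words that carry too little evidence to shape a topic.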


#### Generating the bag-of-words (BOW) corpus for the LDA model

In [11]:
bow = preProcessing_BBC.generateBOW(dictionary, tokens, verbose = True)

Generating Bag Of Words.
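
In a bag-of-words representation each document becomes a list of `(token_id, count)` pairs, and tokens missing from the dictionary are ignored. The following dependency-free sketch shows that conversion; the function name and the sample mapping are hypothetical, and the exact output format of `generateBOW` is assumed rather than confirmed by this notebook.

```python
from collections import Counter


def doc_to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Hypothetical dictionary and document.
token2id = {"ocean": 0, "water": 1}
bow_example = doc_to_bow(["water", "ocean", "water", "seal"], token2id)
print(bow_example)
```

Word order is discarded at this point; LDA only uses how often each token occurs in each document.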


#### Generating the LDA-based model

In [16]:
%time LDAModel = preProcessing_BBC.trainModel(bow, dictionary, numTopics = 20, numPasses = 4, verbose = True)

Trainning LDA model using the inputed BOW and Dictionary.
Parameters: Topics: 20, Passes: 4
CPU times: user 1.75 s, sys: 0 ns, total: 1.75 s
Wall time: 1.75 s


In [17]:
# Print the 20 highest-weighted words for each of the 20 topics.
for i, topic in LDAModel.show_topics(formatted=True, num_topics=20, num_words=20):
    print(f"{i}: {topic}")
    print()

0: 0.016*"africa" + 0.014*"find" + 0.012*"animal" + 0.012*"light" + 0.009*"news" + 0.009*"specie" + 0.008*"continent" + 0.008*"long" + 0.008*"much" + 0.008*"help" + 0.008*"transparent" + 0.008*"twitter" + 0.007*"rather" + 0.007*"time" + 0.007*"giant" + 0.006*"predator" + 0.006*"plate" + 0.005*"rock" + 0.005*"study" + 0.005*"live"

1: 0.008*"water" + 0.008*"wave" + 0.008*"ocean" + 0.008*"find" + 0.006*"animal" + 0.006*"night" + 0.006*"world" + 0.005*"northern_mali" + 0.005*"light" + 0.005*"group" + 0.005*"area" + 0.005*"logo" + 0.004*"call" + 0.004*"create" + 0.004*"day" + 0.004*"country" + 0.004*"seal" + 0.004*"include" + 0.004*"healthcare" + 0.004*"design"

2: 0.011*"radio" + 0.011*"case" + 0.010*"ansar_dine" + 0.009*"people" + 0.009*"review" + 0.008*"subject" + 0.008*"find" + 0.008*"tuareg" + 0.007*"report" + 0.007*"university" + 0.006*"think" + 0.006*"digital" + 0.006*"tell" + 0.006*"add" + 0.006*"head" + 0.006*"saudi_arabia" + 0.006*"government" + 0.006*"call" + 0.005*"last_year" +