# **Algorithm for Topic Extraction Using LDA**

This document presents the development of an unsupervised algorithm for topic extraction. It uses LDA (Latent Dirichlet Allocation), one of the most widely used techniques for this task, which is based on the Dirichlet distribution.
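
For reference, the standard LDA generative process can be written as below. The notation ($K$ topics, $D$ documents, priors $\alpha$ and $\beta$) is the usual textbook convention and is an assumption here; none of these symbols come from the notebook's code.

```latex
\begin{aligned}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),  & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Here $\phi_k$ is the word distribution of topic $k$, $\theta_d$ is the topic mixture of document $d$, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is the word itself.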

*This notebook records the important notes and project decisions. A ".py" file containing the same code is also available in this folder.*

## **Approach**

## Input the Data

#### The first step is to load the data and filter it by language to avoid inconsistent results

In [1]:
from Algorithms import preProcessing

In [2]:
paths = ['datasets/articles_bbc_2018_01_30.csv', 'datasets/transcripts.csv']
targets = ['articles','transcript']

In [3]:
data = preProcessing.import_files(paths, targets,
                                   preCleaning = True,
                                   dropna = 'index',
                                   verbose = True)

Loading the dataset 0.

Input Format:
Rows: 309, Columns: 2.

Pre cleaning format:
Rows: 308, Columns: 2
Loading the dataset 1.

Input Format:
Rows: 2467, Columns: 2.

Pre cleaning format:
Rows: 2467, Columns: 2
Removing unwanted information using targets.


In [5]:
preProcessing.language_detection(data,
                                 verbose = True)

en    2723
fa       9
fr       8
id       5
vi       4
uk       4
hi       4
ar       4
ru       4
sw       3
pt       2
es       2
tr       2
de       1
Name: lang, dtype: int64

Most Frequent language: en    2723
Name: lang, dtype: int64


en    2723
Name: lang, dtype: int64

In [6]:
data = preProcessing.language_cleaning(dataFile = data,
                                        language = 'en',
                                        verbose = True)

Cleaning data using 'en' language.
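
The filtering step can be sketched in plain Python as below. This is a minimal illustration, not the code behind `preProcessing.language_cleaning`: the record layout and the `keep_language` helper are hypothetical, and the `lang` values are assumed to have been filled in by an earlier detection step.

```python
from collections import Counter

def keep_language(records, language):
    """Keep only the records whose detected language matches `language`."""
    return [rec for rec in records if rec["lang"] == language]

# Toy records; in the notebook the "lang" value comes from language detection.
records = [
    {"text": "The economy grew last year.", "lang": "en"},
    {"text": "L'économie a progressé.", "lang": "fr"},
    {"text": "Sleep affects memory.", "lang": "en"},
]

# Find the most frequent language, then drop everything else.
counts = Counter(rec["lang"] for rec in records)
most_common_lang, _ = counts.most_common(1)[0]
english_only = keep_language(records, most_common_lang)
print(most_common_lang, len(english_only))
```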


## Preprocessing the Data

#### **Tokenizing** the documents at the word level.

In [7]:
data = preProcessing.tokenization(data, level = 'word', verbose = True)

Tokenizing data to word
.
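
Word-level tokenization can be illustrated with a small regular-expression sketch. The tokenizer used inside `preProcessing.tokenization` is not shown in this notebook, so the `tokenize_words` helper below is a hypothetical stand-in that only handles lowercase ASCII words and simple apostrophes.

```python
import re

def tokenize_words(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

print(tokenize_words("Don't panic: the brain's story is simple."))
```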



#### Preparing for **Lemmatizing** using POS tagging.

In [8]:
data = preProcessing.POS_tagging(data)


In [9]:
data = preProcessing.lemmatizing(data)
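
POS tags matter for lemmatization because a WordNet-style lemmatizer needs to know whether a word is a noun, verb, adjective or adverb. The sketch below shows the common mapping from Penn Treebank tags to WordNet POS letters. It assumes that tagset and that style of lemmatizer, neither of which is confirmed by the notebook, and `penn_to_wordnet` is a hypothetical helper name.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also the fallback for every other tag

# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("running", "VBG"), ("quickly", "RB"), ("brains", "NNS"), ("happy", "JJ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```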


#### Removing **StopWords** using the English stop words from the Natural Language Toolkit (NLTK) and removing any token shorter than 2 characters.

In [10]:
data = preProcessing.removeStopWords(data, minSize = 2)
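
The filter can be sketched as follows. The stop list here is a tiny hand-written set used only for illustration (the notebook relies on NLTK's English list), and `remove_stop_words` is a hypothetical helper, not the function from `preProcessing`.

```python
# Tiny illustrative stop list; the notebook uses NLTK's English stop words instead.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def remove_stop_words(tokens, min_size=2):
    """Drop stop words and tokens shorter than `min_size` characters."""
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_size]

print(remove_stop_words(["the", "brain", "is", "x", "a", "story", "of", "us"]))
```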

## Training the LDA Model

### Preparing the Data

#### Generating the tokens using the Bigram and the Trigram Model

In [11]:
tokens = preProcessing.Bi_n_TrigramModel(data, min_cnt = 1, verbose = True)

Getting tokens From data file and converting into a list of tokens.
Building the Bigram Model
Building the Trigram Model
Importing the Trigram Model and converting into list
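
The idea behind a bigram model can be sketched as below: adjacent token pairs that occur often enough are joined with an underscore, which is how tokens such as `fall_love` arise. This is a deliberately simplified version that uses raw pair counts. Phrase-detection libraries typically apply a scoring formula on top of the counts, and the internals of `Bi_n_TrigramModel` are not shown here, so `merge_frequent_bigrams` and its `min_count` parameter are assumptions.

```python
from collections import Counter

def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs that occur at least `min_count` times in the corpus."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                merged.append("_".join(pair))
                i += 2  # skip both tokens of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

docs = [
    ["people", "fall", "love", "quickly"],
    ["teenager", "fall", "love", "song"],
    ["love", "song"],
]
print(merge_frequent_bigrams(docs))
```

Running the same function a second time on its own output would join a merged bigram with a neighbouring token, which gives trigram candidates.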


#### Creating the dictionary using the tokens

In [12]:
dictionary = preProcessing.generateDictionary(tokens, min_thld = 3, verbose = True)

Generating the Ditionary.
Filtering dictionary using the minimun threshold: 3
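
A dictionary in this setting maps each token to an integer id. The sketch below assumes that the minimum threshold is a document frequency, so tokens appearing in fewer than `min_doc_freq` documents are dropped. That reading of `min_thld` is an assumption, and `build_dictionary` is a hypothetical helper.

```python
from collections import Counter

def build_dictionary(docs, min_doc_freq=3):
    """Map each token to an integer id, keeping tokens found in >= min_doc_freq documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    kept = sorted(tok for tok, df in doc_freq.items() if df >= min_doc_freq)
    return {tok: idx for idx, tok in enumerate(kept)}

docs = [
    ["love", "sleep", "love"],
    ["love", "brain"],
    ["love", "sleep", "story"],
    ["sleep", "brain"],
]
print(build_dictionary(docs))
```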


#### Generating the bag of words (BOW) for the LDA model

In [13]:
bow = preProcessing.generateBOW(dictionary, tokens, verbose = True)

Generating Bag Of Words.
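
A bag-of-words representation replaces each document with a list of `(token_id, count)` pairs and discards word order. The sketch below shows the conversion for one document. `doc_to_bow` and the toy `token2id` mapping are hypothetical, and tokens missing from the dictionary are assumed to be ignored.

```python
from collections import Counter

def doc_to_bow(token2id, doc):
    """Convert a token list to sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

token2id = {"brain": 0, "love": 1, "sleep": 2}
print(doc_to_bow(token2id, ["love", "sleep", "love", "golf"]))
```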


#### Training the LDA-based model

In [15]:
numberOfTopics = 20
%time LDAModel = preProcessing.trainModel(bow, dictionary, numTopics = numberOfTopics, numPasses = 4, verbose = True)

Trainning LDA model using the inputed BOW and Dictionary.
Parameters: Topics: 20, Passes: 4
CPU times: user 2min 35s, sys: 1min 37s, total: 4min 12s
Wall time: 1min 8s


In [16]:
for i,topic in LDAModel.show_topics(formatted=True, num_topics = numberOfTopics, num_words=20):
    print(str(i)+": "+ topic)
    print()

0: 0.062*"love" + 0.021*"sleep" + 0.012*"question" + 0.011*"romantic_love" + 0.008*"relationship" + 0.008*"study" + 0.007*"metaphor" + 0.006*"fall_love" + 0.006*"golf" + 0.005*"teen" + 0.005*"choice" + 0.005*"hummus" + 0.004*"feel" + 0.004*"teenager" + 0.004*"understand" + 0.004*"animal" + 0.004*"environmental_protection" + 0.003*"someone" + 0.003*"song" + 0.003*"article"

1: 0.008*"brain" + 0.006*"story" + 0.005*"face" + 0.004*"understand" + 0.004*"help" + 0.004*"feel" + 0.004*"study" + 0.004*"case" + 0.004*"human" + 0.004*"reason" + 0.003*"might" + 0.003*"fear" + 0.003*"percent" + 0.003*"decision" + 0.003*"example" + 0.003*"learn" + 0.003*"behavior" + 0.003*"maybe" + 0.003*"system" + 0.003*"bad"

2: 0.009*"country" + 0.004*"government" + 0.004*"problem" + 0.004*"human" + 0.004*"percent" + 0.003*"create" + 0.003*"question" + 0.003*"believe" + 0.003*"help" + 0.003*"learn" + 0.003*"system" + 0.003*"important" + 0.003*"example" + 0.003*"number" + 0.003*"job" + 0.002*"society" + 0.002*"mo
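
The formatted topic strings above are convenient to read but awkward to process. A small sketch for turning one into `(word, weight)` pairs is shown below. `parse_topic` is a hypothetical helper, and it assumes the `weight*"word"` layout seen in the output.

```python
import re

def parse_topic(topic):
    """Turn a formatted topic string into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic)]

# First four terms of topic 0 from the output above.
topic = '0.062*"love" + 0.021*"sleep" + 0.012*"question" + 0.011*"romantic_love"'
print(parse_topic(topic))
```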

In [18]:
import pickle as pkl

# Persist the trained model; the context manager closes the file even if dump fails
with open('models/LDAmodelExtended.pkl', 'wb') as file:
    pkl.dump(LDAModel, file, protocol = pkl.DEFAULT_PROTOCOL)
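
Loading the model back mirrors the save step. The sketch below shows a full round trip with a plain dictionary standing in for the trained model and a temporary file standing in for the real path; both are placeholders. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle as pkl
import tempfile

# Stand-in for the trained model; any picklable object round-trips the same way.
model = {"num_topics": 20, "passes": 4}

# Hypothetical location; the notebook writes to its own models/ folder instead.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pkl.dump(model, f, protocol=pkl.DEFAULT_PROTOCOL)

# Read the object back in binary mode.
with open(path, "rb") as f:
    restored = pkl.load(f)

print(restored == model)
```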



## Testing the Model

In [20]:
LDAModel[bow[0]]

[(2, 0.58108824), (5, 0.06877321), (17, 0.04227818), (18, 0.30683613)]
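
Each pair in this output is a topic id and the probability the model assigns to it for the document. A short sketch for ranking the topics and picking the dominant one, using the values printed above:

```python
# Topic distribution reported for the first document in the notebook.
doc_topics = [(2, 0.58108824), (5, 0.06877321), (17, 0.04227818), (18, 0.30683613)]

# Sort by probability, highest first, and take the top entry.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, dominant_prob = ranked[0]
total = sum(prob for _, prob in doc_topics)

print(dominant_topic, round(dominant_prob, 3))
print(round(total, 3))
```

The listed probabilities sum to slightly less than 1, most likely because topics with very small probability are left out of the output.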