# **Algorithm for Topic Extraction Using LDA**

This document presents the development of an unsupervised algorithm for topic extraction. It relies on LDA (Latent Dirichlet Allocation), one of the most widely used techniques for this task, which is based on the Dirichlet distribution.

*This notebook records the important notes and project decisions. A ".py" file containing the same code is also available in this folder.*
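To make the role of the Dirichlet distribution concrete, the sketch below draws one per-document topic mixture from a symmetric Dirichlet prior using only the standard library. The function name `sample_dirichlet` and the value of `alpha` are illustrative assumptions and are not taken from the project code.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`.

    Standard construction: normalise independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)   # fixed seed so the run is repeatable
num_topics = 16          # same topic count used later in this notebook
theta = sample_dirichlet(alpha=0.5, size=num_topics, rng=rng)  # alpha is an assumed value

# theta is a probability vector: one non-negative weight per topic, summing to 1.
print([round(p, 3) for p in theta])
```

In LDA, each document gets its own mixture of this kind, and a small `alpha` tends to concentrate the weight on a few topics.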

## **Approach**

## Input the Data

#### The first step is to load the data and filter it by language to avoid inconsistent results.

In [1]:
from Algorithms import preProcessing, modelUsageAPI

In [2]:
paths = ['datasets/articles_bbc_2018_01_30.csv', 'datasets/transcripts.csv', 'datasets/topics.csv']
targets = ['articles','transcript', 'question_text']

In [3]:
data = preProcessing.import_files(paths, targets,
                                   preCleaning = True,
                                   dropna = 'index',
                                   verbose = True)

Loading the dataset 0.

Input Format:
Rows: 309, Columns: 2.

Pre cleaning format:
Rows: 308, Columns: 2
Loading the dataset 1.

Input Format:
Rows: 2467, Columns: 2.

Pre cleaning format:
Rows: 2467, Columns: 2
Loading the dataset 2.

Input Format:
Rows: 5000, Columns: 3.

Pre cleaning format:
Rows: 5000, Columns: 3

Removing unwanted information using targets.


In [4]:
data.shape

(7775, 1)

In [5]:
preProcessing.language_detection(data,
                                 verbose = True)

en    7723
fa       9
fr       8
id       5
uk       4
vi       4
ar       4
ru       4
hi       4
sw       3
es       2
pt       2
tr       2
de       1
Name: lang, dtype: int64

Most Frequent language: en    7723
Name: lang, dtype: int64


en    7723
Name: lang, dtype: int64

In [6]:
data = preProcessing.language_cleaning(dataFile = data,
                                        language = 'en',
                                        verbose = True)

Cleaning data using 'en' language.
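The language filter can be pictured as "count the detected languages, then keep only the dominant one". The sketch below is a simplified stand-in and does not reproduce `preProcessing.language_cleaning`; the `records` list and its language codes are hypothetical and hard-coded for illustration.

```python
from collections import Counter

# Hypothetical records: (text, detected language code). In the notebook the
# codes come from the language detection step; here they are hard-coded.
records = [
    ("Russian President Vladimir Putin says ...", "en"),
    ("Le président a déclaré ...", "fr"),
    ("The list names 210 top Russians ...", "en"),
]

# Count documents per language and pick the most frequent one.
counts = Counter(lang for _, lang in records)
most_common_lang, _ = counts.most_common(1)[0]

# Keep only the documents written in that language.
kept = [text for text, lang in records if lang == most_common_lang]
print(most_common_lang, len(kept))
```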


## Pre Processing the data

#### **Tokenizing** the documents to the word level.

In [7]:
data = preProcessing.tokenization(data, level = 'word', verbose = True)

Tokenizing data to word.
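For readers unfamiliar with word-level tokenization, here is a minimal regex-based sketch. It is an assumption-laden stand-in: the tokenizer behind `preProcessing.tokenization` is not shown in this notebook, and the helper name `tokenize_words` is hypothetical.

```python
import re

def tokenize_words(text):
    """Lowercase the text and return its alphabetic word tokens.

    An apostrophe inside a word (as in "wasn't") is kept as part of the token.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

print(tokenize_words("Mr Putin said the list wasn't new."))
```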


#### Preparing for **Lemmatizing** using POS tagging.

In [8]:
data = preProcessing.POS_tagging(data)

In [9]:
data = preProcessing.lemmatizing(data)
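POS tagging matters for lemmatization because a WordNet-style lemmatizer needs to know whether a word is a noun, verb, adjective or adverb. The sketch below shows the usual glue step, mapping Penn Treebank tags to WordNet part-of-speech letters. The helper name `penn_to_wordnet` and the tagged sample are hypothetical, and it is an assumption that the project works this way internally.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback

# Hypothetical tagger output: (token, Penn Treebank tag) pairs.
tagged = [("lists", "NNS"), ("targeted", "VBN"), ("russian", "JJ"), ("quickly", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

The resulting letters are the values that a lemmatizer such as NLTK's `WordNetLemmatizer.lemmatize(word, pos)` accepts for its `pos` argument.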

#### Removing **StopWords** using the English stop-word list from the Natural Language Toolkit (NLTK) and dropping any token shorter than 2 characters.

In [10]:
data = preProcessing.removeStopWords(data, minSize = 2)
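The filtering rule can be sketched in a few lines. The tiny `STOP_WORDS` set below is a hand-written stand-in for the NLTK list, and the assumption that `minSize = 2` means "keep tokens with at least 2 characters" is inferred from the description above rather than from the project code.

```python
# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def remove_stop_words(tokens, min_size=2):
    """Drop stop words and tokens shorter than `min_size` characters."""
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_size]

print(remove_stop_words(["the", "list", "of", "x", "top", "russians"]))
```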

## Training the LDA Model

### Preparing the Data

#### Generating the tokens using the Bigram and the Trigram Model

In [11]:
tokens = preProcessing.Bi_n_TrigramModel(data, min_cnt = 1, verbose = True)

Getting tokens From data file and converting into a list of tokens.
Building the Bigram Model
Building the Trigram Model
Importing the Trigram Model and converting into list
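The idea behind a bigram model is to merge word pairs that occur together often into a single token; applying the same step again to the merged output yields trigrams. The sketch below uses raw pair counts, which is a deliberate simplification: phrase detectors in libraries such as gensim rank pairs with a score rather than a plain count, and the internals of `Bi_n_TrigramModel` are not shown here. The function name and the sample documents are hypothetical.

```python
from collections import Counter

def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times with '_'."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            is_frequent_pair = (
                i + 1 < len(doc)
                and pair_counts[(doc[i], doc[i + 1])] >= min_count
            )
            if is_frequent_pair:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2   # skip the second word of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

docs = [["vladimir", "putin", "say"], ["vladimir", "putin", "list"], ["list", "name"]]
print(merge_frequent_bigrams(docs))
```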


#### Creating the dictionary from the tokens

In [12]:
dictionary = preProcessing.generateDictionary(tokens, min_thld = 3, verbose = True)

Generating the Ditionary.
Filtering dictionary using the minimun threshold: 3
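A dictionary in this context maps each retained token to an integer id. The sketch below assumes that the minimum threshold refers to document frequency (the number of documents containing the token); the notebook does not confirm this, and the name `build_dictionary` is hypothetical.

```python
from collections import Counter

def build_dictionary(docs, min_doc_freq=3):
    """Map tokens found in at least `min_doc_freq` documents to integer ids."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))   # count each token once per document
    kept = sorted(t for t, n in doc_freq.items() if n >= min_doc_freq)
    return {token: idx for idx, token in enumerate(kept)}

docs = [["list", "putin"], ["list", "sanction"], ["list", "putin"], ["putin", "kremlin"]]
print(build_dictionary(docs, min_doc_freq=3))
```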


#### Generating the bag-of-words (BOW) representation for the LDA model

In [13]:
bow = preProcessing.generateBOW(dictionary, tokens, verbose = True)

Generating Bag Of Words.
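A bag-of-words vector stores, for each document, how many times each dictionary id occurs and ignores word order. The following sketch illustrates the representation with a hypothetical `doc_to_bow` helper and a two-word dictionary; it is not the project's `generateBOW` implementation.

```python
from collections import Counter

def doc_to_bow(token_to_id, doc):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

token_to_id = {"list": 0, "putin": 1}   # illustrative dictionary
print(doc_to_bow(token_to_id, ["putin", "list", "putin", "kremlin"]))
```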


#### Generating the LDA-based model

In [14]:
numberOfTopics = 16
%time LDAModel = preProcessing.trainModel(bow, dictionary, numTopics = numberOfTopics, numPasses = 4, verbose = True)

Trainning LDA model using the inputed BOW and Dictionary.
Parameters: Topics: 16, Passes: 4
CPU times: user 1min 39s, sys: 55.5 s, total: 2min 35s
Wall time: 56.1 s
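To give an intuition for what fitting an LDA model involves, here is a toy collapsed Gibbs sampler in pure Python. It is a teaching sketch under stated assumptions: it is not the inference method behind `preProcessing.trainModel` (which is not shown in this notebook), the hyperparameters `alpha` and `beta` are arbitrary, and the four tiny documents are made up.

```python
import random

def toy_lda_gibbs(docs, num_topics, vocab_size, iterations=50,
                  alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; documents are lists of word ids.

    Returns per-document topic counts and per-topic word counts.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Assign a random initial topic to every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Made-up corpus: words 0 and 1 co-occur, as do words 2 and 3.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 3, 3]]
doc_topic, topic_word = toy_lda_gibbs(docs, num_topics=2, vocab_size=4)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of that document are currently assigned to each topic; normalising a row gives an estimate of the document's topic mixture.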


#### A quick look at the topics

In [15]:
for i,topic in LDAModel.show_topics(formatted=True, num_topics = numberOfTopics, num_words=20):
    print(str(i)+": "+ topic)
    print()

0: 0.020*"think" + 0.016*"actually" + 0.014*"thing" + 0.012*"look" + 0.009*"right" + 0.008*"way" + 0.008*"laughter" + 0.008*"kind" + 0.006*"work" + 0.006*"brain" + 0.006*"lot" + 0.005*"people" + 0.005*"two" + 0.005*"good" + 0.005*"show" + 0.005*"find" + 0.005*"time" + 0.005*"happen" + 0.005*"sort" + 0.005*"tell"

1: 0.016*"people" + 0.009*"think" + 0.008*"work" + 0.007*"look" + 0.007*"time" + 0.007*"thing" + 0.006*"good" + 0.006*"way" + 0.006*"percent" + 0.005*"country" + 0.005*"woman" + 0.005*"start" + 0.005*"many" + 0.004*"year" + 0.004*"give" + 0.004*"world" + 0.004*"could" + 0.004*"find" + 0.004*"first" + 0.004*"actually"

2: 0.010*"laughter" + 0.009*"think" + 0.009*"look" + 0.008*"life" + 0.008*"time" + 0.007*"could" + 0.007*"thing" + 0.006*"back" + 0.006*"way" + 0.005*"find" + 0.005*"day" + 0.005*"woman" + 0.005*"call" + 0.005*"start" + 0.005*"people" + 0.005*"work" + 0.005*"tell" + 0.004*"right" + 0.004*"talk" + 0.004*"first"

3: 0.015*"world" + 0.013*"country" + 0.010*"people" 
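Each topic is printed as a weighted sum of words, `weight*"word" + ...`. When the weights are needed as numbers, the formatted string can be parsed with a regular expression, as in this sketch. The helper name `parse_topic` is hypothetical, and the sample string is the start of topic 0 from the output above.

```python
import re

def parse_topic(topic_string):
    """Turn '0.020*"think" + 0.016*"actually"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic = '0.020*"think" + 0.016*"actually" + 0.014*"thing"'
print(parse_topic(topic))
```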

In [16]:
modelUsageAPI.save(LDAModel,'models/LDAmodelExtended.pkl')
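The `.pkl` extension suggests that the model is serialized with `pickle`, although how `modelUsageAPI.save` works internally is not shown here, so treat that as an assumption. The sketch below shows a pickle round trip with a stand-in object written to a temporary directory.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; the real object would be the LDA model.
model_stub = {"num_topics": 16, "passes": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "LDAmodelExtended.pkl")
    # Serialize the object to disk.
    with open(path, "wb") as handle:
        pickle.dump(model_stub, handle)
    # Load it back to check that the round trip preserves the content.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == model_stub)
```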

## Validating the Model

#### Validating the model on the document in position 0.

In [17]:
nbDoc = 0
valDoc = data.articles.loc[nbDoc]

In [18]:
print(valDoc[:500])

Image copyright PA/EPA Image caption Oligarch Roman Abramovich (l) and PM Dmitry Medvedev are on the list

Russian President Vladimir Putin says a list of officials and businessmen close to the Kremlin published by the US has in effect targeted all Russian people.

The list names 210 top Russians as part of a sanctions law aimed at punishing Moscow for meddling in the US election.

However, the US stressed those named were not subject to new sanctions.

Mr Putin said the list was an unfr


In [19]:
LDAModel[bow[nbDoc]]

[(2, 0.14982004), (3, 0.30157804), (8, 0.37452465), (12, 0.17343038)]
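The result is a list of `(topic_id, probability)` pairs for the document. A small sketch for reading it: rank the pairs by probability and take the first one as the dominant topic. The variable names are illustrative; the numbers are copied from the output above.

```python
# Topic distribution of document 0, copied from the cell output.
doc_topics = [(2, 0.14982004), (3, 0.30157804), (8, 0.37452465), (12, 0.17343038)]

# Rank topics from most to least probable.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, dominant_weight = ranked[0]
print(dominant_topic, round(dominant_weight, 3))

# Total probability mass covered by the listed topics.
covered = sum(weight for _, weight in doc_topics)
print(round(covered, 4))
```

The listed probabilities sum to slightly less than 1; the remainder presumably belongs to topics whose weight was too small to be reported.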