## Latent Dirichlet Allocation ##

LDA is used to assign the text in a document to one or more topics. It builds a topic-per-document model and a words-per-topic model, each with a Dirichlet prior.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related, so choosing the right corpus of data is crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. 
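
The generative story in the bullets above can be sketched with NumPy. This is a toy illustration, not how gensim fits a model: the vocabulary, the number of topics, and the `alpha` and `eta` values are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["game", "team", "space", "orbit", "file", "window"]  # hypothetical toy vocabulary
num_topics = 3
alpha = 0.5  # assumed symmetric prior on each document's topic mixture
eta = 0.5    # assumed symmetric prior on each topic's word distribution

# One word distribution per topic: shape (num_topics, len(vocab)).
topic_word = rng.dirichlet([eta] * len(vocab), size=num_topics)

def generate_document(num_words):
    # Draw the document's topic mixture, then a topic and a word for every position.
    doc_topics = rng.dirichlet([alpha] * num_topics)
    words = []
    for _ in range(num_words):
        topic = rng.choice(num_topics, p=doc_topics)
        word_id = rng.choice(len(vocab), p=topic_word[topic])
        words.append(vocab[word_id])
    return doc_topics, words

doc_topics, words = generate_document(8)
print(doc_topics.round(2), words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.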

## Step 1: Load the dataset

The dataset we'll use is the 20 Newsgroups dataset, which is available from sklearn. It contains newsgroup posts grouped into 20 categories.

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

In [2]:
print(list(newsgroups_train.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


### As you can see, the news categories contain some distinct themes, such as sports, religion, science, technology, and politics.

In [3]:
# Let's look at some sample news posts
newsgroups_train.data[:2]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [4]:
print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)

(11314,) (11314,)


## Step 2: Data Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.


In [5]:
'''
Loading Gensim and nltk libraries
'''
%pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

Note: you may need to restart the kernel to use updated packages.


In [6]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sheen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Lemmatizer Example
Before preprocessing our dataset, let's first look at a lemmatization example. What would be the output if we lemmatized the word 'went'?

In [7]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to present tense

go


Test in Spanish

In [8]:
# Array of Spanish words to lemmatize
palabras_conjugadas = [
    "correr", "corría", "corrí", "corrido", "corriendo",
    "comer", "comí", "comía", "comido", "comiendo",
    "feliz", "felices",
    "triste", "tristes",
    "bueno", "buena", "buenos", "buenas",
    "malo", "mala", "malos", "malas",
    "amar", "amaba", "amé", "amado", "amando",
    "estudiar", "estudié", "estudiaba", "estudiado", "estudiando",
    "comprar", "compré", "compraba", "comprado", "comprando",
    "vender", "vendí", "vendía", "vendido", "vendiendo",
    "leer", "leí", "leía", "leído", "leyendo",
    "construir", "construí", "construía", "construido", "construyendo",
    "ver", "vi", "veía", "visto", "viendo",
    "oír", "oí", "oía", "oído", "oyendo",
    "decir", "dije", "decía", "dicho", "diciendo",
    "hacer", "hice", "hacía", "hecho", "haciendo",
    "ir", "fui", "iba", "ido", "yendo",
    "tener", "tuve", "tenía", "tenido", "teniendo",
    "trabajar", "trabajé", "trabajaba", "trabajado", "trabajando",
    "vivir", "viví", "vivía", "vivido", "viviendo",
    "aprender", "aprendí", "aprendía", "aprendido", "aprendiendo",
    "enseñar", "enseñé", "enseñaba", "enseñado", "enseñando",
    "viajar", "viajé", "viajaba", "viajado", "viajando",
    "creer", "creí", "creía", "creído", "creyendo",
    "esperar", "esperé", "esperaba", "esperado", "esperando",
    "empezar", "empecé", "empezaba", "empezado", "empezando",
    "terminar", "terminé", "terminaba", "terminado", "terminando",
    "jugar", "jugué", "jugaba", "jugado", "jugando",
    "ganar", "gané", "ganaba", "ganado", "ganando",
    "perder", "perdí", "perdía", "perdido", "perdiendo",
    "sentir", "sentí", "sentía", "sentido", "sintiendo",
    "pensar", "pensé", "pensaba", "pensado", "pensando",
    "caminar", "caminé", "caminaba", "caminado", "caminando",
    "saber", "supe", "sabía", "sabido", "sabiendo",
    "dar", "di", "daba", "dado", "dando",
    "decidir", "decidí", "decidía", "decidido", "decidiendo",
    "visitar", "visité", "visitaba", "visitado", "visitando",
    "entender", "entendí", "entendía", "entendido", "entendiendo",
    "conocer", "conocí", "conocía", "conocido", "conociendo",
    "descubrir", "descubrí", "descubría", "descubierto", "descubriendo",
    "intentar", "intenté", "intentaba", "intentado", "intentando",
    "recordar", "recordé", "recordaba", "recordado", "recordando",
    "escuchar", "escuché", "escuchaba", "escuchado", "escuchando",
    "buscar", "busqué", "buscaba", "buscado", "buscando",
    "abrir", "abrí", "abría", "abierto", "abriendo",
    "cerrar", "cerré", "cerraba", "cerrado", "cerrando",
    "cambiar", "cambié", "cambiaba", "cambiado", "cambiando",
    "esperar", "esperé", "esperaba", "esperado", "esperando",
    "crear", "creé", "creaba", "creado", "creando",
    "trabajar", "trabajé", "trabajaba", "trabajado", "trabajando",
    "aparecer", "aparecí", "aparecía", "aparecido", "apareciendo",
    "ganar", "gané", "ganaba", "ganado", "ganando",
    "perder", "perdí", "perdía", "perdido", "perdiendo",
    "gustar", "gusté", "gustaba", "gustado", "gustando",
    "disfrutar", "disfruté", "disfrutaba", "disfrutado", "disfrutando",
    "odiar", "odié", "odiaba", "odiado", "odiando",
    "empezar", "empecé", "empezaba", "empezado", "empezando",
    "terminar", "terminé", "terminaba", "terminado", "terminando",
    "viajar", "viajé", "viajaba", "viajado", "viajando",
    "vivir", "viví", "vivía", "vivido", "viviendo",
    "morir", "morí", "moría", "muerto", "muriendo",
    "nacer", "nací", "nacía", "nacido", "naciendo",
    "llorar", "lloré", "lloraba", "llorado", "llorando",
    "reír", "reí", "reía", "reído", "riendo",
    "gritar", "grité", "gritaba", "gritado", "gritando",
    "callar", "callé", "callaba", "callado", "callando",
    "mirar", "miré", "miraba", "mirado", "mirando",
    "ver", "vi", "veía", "visto", "viendo",
    "hablar", "hablé", "hablaba", "hablado", "hablando",
    "decir", "dije", "decía", "dicho", "diciendo",
    "contar", "conté", "contaba", "contado", "contando",
    "preguntar", "pregunté", "preguntaba", "preguntado", "preguntando",
    "responder", "respondí", "respondía", "respondido", "respondiendo",
    "esperar", "esperé", "esperaba", "esperado", "esperando",
    "sentir", "sentí", "sentía", "sentido", "sintiendo",
    "pensar", "pensé", "pensaba", "pensado", "pensando",
    "saber", "supe", "sabía", "sabido", "sabiendo",
    "conocer", "conocí", "conocía", "conocido", "conociendo",
    "amar", "amé", "amaba", "amado", "amando",
    "odiar", "odié", "odiaba", "odiado", "odiando",
    "girar", "giré", "giraba", "girado", "girando",
    "correr", "corrí", "corría", "corrido", "corriendo",
    "caminar", "caminé", "caminaba", "caminado", "caminando",
    "nadar", "nadé", "nadaba", "nadado", "nadando",
    "volar", "volé", "volaba", "volado", "volando",
    "pensar", "pensé", "pensaba", "pensado", "pensando",
    "creer", "creí", "creía", "creído", "creyendo",
    "esperar", "esperé", "esperaba", "esperado", "esperando",
    "decidir", "decidí", "decidía", "decidido", "decidiendo",
    "poder", "pude", "podía", "podido", "pudiendo",
    "querer", "quise", "quería", "querido", "queriendo",
    "necesitar", "necesité", "necesitaba", "necesitado", "necesitando",
    "tener", "tuve", "tenía", "tenido", "teniendo",
    "hacer", "hice", "hacía", "hecho", "haciendo",
    "ir", "fui", "iba", "ido", "yendo",
    "venir", "vine", "venía", "venido", "viniendo",
    "traer", "traje", "traía", "traído", "trayendo",
    "llevar", "llevé", "llevaba", "llevado", "llevando",
    "ver", "vi", "veía", "visto", "viendo",
    "oír", "oí", "oía", "oído", "oyendo",
    "decir", "dije", "decía", "dicho", "diciendo",
    "hablar", "hablé", "hablaba", "hablado", "hablando",
    "dar", "di", "daba", "dado", "dando",
    "tomar", "tomé", "tomaba", "tomado", "tomando",
    "llevar", "llevé", "llevaba", "llevado", "llevando",
    "dejar", "dejé", "dejaba", "dejado", "dejando",
    "empezar", "empecé", "empezaba", "empezado", "empezando",
    "terminar", "terminé", "terminaba", "terminado", "terminando",
    "trabajar", "trabajé", "trabajaba", "trabajado", "trabajando",
    "jugar", "jugué", "jugaba", "jugado", "jugando",
    "cantar", "canté", "cantaba", "cantado", "cantando",
    "bailar", "bailé", "bailaba", "bailado", "bailando",
    "saltar", "salté", "saltaba", "saltado", "saltando",
    "caminar", "caminé", "caminaba", "caminado", "caminando",
    "viajar", "viajé", "viajaba", "viajado", "viajando",
    "vivir", "viví", "vivía", "vivido", "viviendo",
    "morir", "morí", "moría", "muerto", "muriendo",
    "nacer", "nací", "nacía", "nacido", "naciendo",
    "llorar", "lloré", "lloraba", "llorado", "llorando",
    "reír", "reí", "reía", "reído", "riendo",
    "gritar", "grité", "gritaba", "gritado", "gritando",
    "callar", "callé", "callaba", "callado", "callando",
    "mirar", "miré", "miraba", "mirado", "mirando",
    "ver", "vi", "veía", "visto", "viendo",
    "oír", "oí", "oía", "oído", "oyendo",
    "decir", "dije", "decía", "dicho", "diciendo",
    "hablar", "hablé", "hablaba", "hablado", "hablando",
    "preguntar", "pregunté", "preguntaba", "preguntado", "preguntando",
    "responder", "respondí", "respondía", "respondido", "respondiendo",
    "esperar", "esperé", "esperaba", "esperado", "esperando",
    "pensar", "pensé", "pensaba", "pensado", "pensando",
    "saber", "supe", "sabía", "sabido", "sabiendo",
    "conocer", "conocí", "conocía", "conocido", "conociendo",
    "creer", "creí", "creía", "creído", "creyendo",
    "esperar", "esperé", "esperaba", "esperado", "esperando",
    "decidir", "decidí", "decidía", "decidido", "decidiendo",
    "necesitar", "necesité", "necesitaba", "necesitado", "necesitando",
    "querer", "quise", "quería", "querido", "queriendo",
    "poder", "pude", "podía", "podido", "pudiendo",
    "haber", "hube", "había", "habido",
    "ser", "fui", "era", "sido",
    "estar", "estuve", "estaba", "estado",
    "ir", "fui", "iba", "ido"
]


In [9]:
import pandas as pd
spanish_stemmer = SnowballStemmer('spanish')

# palabras_conjugadas = ['acaricia', 'vuela', 'muere', 'mula', 'negado', 'murió', 'acordó', 'poseyó', 
#            'humilde', 'dimensionado', 'reunión', 'declarando', 'alistamiento', 'detallación', 'sensacional', 
#            'tradicional', 'referencia', 'colonizador', 'trazado']

# strip the accents from each word
# palabras_conjugadas = [palabra.replace('á', 'a').replace('é', 'e').replace('í', 'i').replace('ó', 'o').replace('ú', 'u') for palabra in palabras_conjugadas]
singles = [spanish_stemmer.stem(plural) for plural in palabras_conjugadas]

pd.DataFrame(data={'original word':palabras_conjugadas, 'stemmed':singles })

Unnamed: 0,original word,stemmed
0,correr,corr
1,corría,corr
2,corrí,corr
3,corrido,corr
4,corriendo,corr
...,...,...
698,estado,estad
699,ir,ir
700,fui,fui
701,iba,iba


In [10]:
import spacy

# Load the Spanish model
nlp = spacy.load("es_core_news_sm")

# Lemmatize a word or phrase
doc = nlp(' '.join(palabras_conjugadas))
lemmas = [token.lemma_ for token in doc]

for i in range(len(palabras_conjugadas)):
    print(f'palabra original: {palabras_conjugadas[i]} - palabra lematizada: {lemmas[i]}')




palabra original: correr - palabra lematizada: correr
palabra original: corría - palabra lematizada: correr
palabra original: corrí - palabra lematizada: corrí
palabra original: corrido - palabra lematizada: corrido
palabra original: corriendo - palabra lematizada: correr
palabra original: comer - palabra lematizada: comer
palabra original: comí - palabra lematizada: comí
palabra original: comía - palabra lematizada: comía
palabra original: comido - palabra lematizada: comido
palabra original: comiendo - palabra lematizada: comer
palabra original: feliz - palabra lematizada: feliz
palabra original: felices - palabra lematizada: feliz
palabra original: triste - palabra lematizada: triste
palabra original: tristes - palabra lematizada: trist
palabra original: bueno - palabra lematizada: bueno
palabra original: buena - palabra lematizada: buen
palabra original: buenos - palabra lematizada: buen
palabra original: buenas - palabra lematizada: buena
palabra original: malo - palabra lematizad

In [11]:
import stanza

# Load the default Spanish pipeline (all processors);
# the commented-out alternatives restrict it to tokenization and lemmatization
nlp = stanza.Pipeline('es')
# nlp = stanza.Pipeline('es', processors='tokenize,lemma', use_gpu=False)
# nlp = stanza.Pipeline('es', processors='lemma', use_gpu=False)


# Function to lemmatize a list of words
def lematizar_palabras(palabras):
    # Process the text as a single sentence
    doc = nlp(' '.join(palabras))

    # Extract the lemmas
    lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
    return lemmas

# Call the function
cosas = lematizar_palabras(palabras_conjugadas)

# print(lemmas)

pd.DataFrame(data={'original word':palabras_conjugadas, 'stemmed':cosas })



2024-07-02 01:17:51 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 384kB [00:00, 6.29MB/s]                    
2024-07-02 01:17:51 INFO: Downloaded file to C:\Users\Sheen\stanza_resources\resources.json
2024-07-02 01:17:54 INFO: Loading these models for language: es (Spanish):
| Processor    | Package           |
------------------------------------
| tokenize     | combined          |
| mwt          | combined          |
| pos          | combined_charlm   |
| lemma        | combined_nocharlm |
| constituency | combined_charlm   |
| depparse     | combined_charlm   |
| sentiment    | tass2020_charlm   |
| ner          | conll02           |

2024-07-02 01:17:54 INFO: Using device: cpu
2024-07-02 01:17:54 INFO: Loading: tokenize
2024-07-02 01:

Unnamed: 0,original word,stemmed
0,correr,correr
1,corría,correr
2,corrí,corrí
3,corrido,correr
4,corriendo,correr
...,...,...
698,estado,estado
699,ir,ir
700,fui,ser
701,iba,ir


In [12]:
for i in range(len(palabras_conjugadas)):
    print(f'palabra original: {palabras_conjugadas[i]} - palabra lematizada: {cosas[i]}')

palabra original: correr - palabra lematizada: correr
palabra original: corría - palabra lematizada: correr
palabra original: corrí - palabra lematizada: corrí
palabra original: corrido - palabra lematizada: correr
palabra original: corriendo - palabra lematizada: correr
palabra original: comer - palabra lematizada: comer
palabra original: comí - palabra lematizada: comir
palabra original: comía - palabra lematizada: comer
palabra original: comido - palabra lematizada: comer
palabra original: comiendo - palabra lematizada: comer
palabra original: feliz - palabra lematizada: feliz
palabra original: felices - palabra lematizada: feliz
palabra original: triste - palabra lematizada: triste
palabra original: tristes - palabra lematizada: triste
palabra original: bueno - palabra lematizada: buen
palabra original: buena - palabra lematizada: buen
palabra original: buenos - palabra lematizada: buen
palabra original: buenas - palabra lematizada: buen
palabra original: malo - palabra lematizada:

In [13]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('omw-1.4')  # Needed for multilingual support

lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("mori", pos='v')  # 'v' for verbs

print(lemma)


mori


[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Sheen\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Stemmer Example
Let's also look at a stemming example. Let's throw a number of words at the stemmer and see how it deals with each one:

In [14]:
import pandas as pd
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

Unnamed: 0,original word,stemmed
0,caresses,caress
1,flies,fli
2,dies,die
3,mules,mule
4,denied,deni
5,died,die
6,agreed,agre
7,owned,own
8,humbled,humbl
9,sized,size


In [15]:
'''
Write a function to perform the preprocessing steps on the entire dataset
'''
def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then stem the resulting lemma
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stopwords and tokens of 3 or fewer characters, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))

    return result



In [16]:
'''
Preview a document after preprocessing
'''
document_num = 50
doc_sample = 'This disk has failed many times. I would like to get it replaced.'

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['This', 'disk', 'has', 'failed', 'many', 'times.', 'I', 'would', 'like', 'to', 'get', 'it', 'replaced.']


Tokenized and lemmatized document: 
['disk', 'fail', 'time', 'like', 'replac']


Let's now preprocess all the news posts we have. To do that, we iterate over the list of documents in our training sample.

In [17]:
processed_docs = []

for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))

In [18]:
'''
Preview 'processed_docs'
'''
print(processed_docs[:2])

[['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst'], ['guykuo', 'carson', 'washington', 'subject', 'clock', 'poll', 'final', 'summari', 'final', 'clock', 'report', 'keyword', 'acceler', 'clock', 'upgrad', 'articl', 'shelley', 'qvfo', 'innc', 'organ', 'univers', 'washington', 'line', 'nntp', 'post', 'host', 'carson', 'washington', 'fair', 'number', 'brave', 'soul', 'upgrad', 'clock', 'oscil', 'share', 'experi', 'poll', 'send', 'brief', 'messag', 'detail', 'experi', 'procedur', 'speed', 'attain', 'rat', 'speed', 'card', 'adapt', 'heat', 'sink', 'hour', 'usag', 'floppi', 'disk', 'function', 'floppi', 'especi', 'request', 'summar', 'day',

## Step 3: Bag of words on the dataset

Now let's create a dictionary from `processed_docs` that maps each unique word in the training set to an integer id and keeps track of how often it appears. To do that, let's pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call it `dictionary`.

In [19]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [20]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 addit
1 bodi
2 bricklin
3 bring
4 bumper
5 call
6 colleg
7 door
8 earli
9 engin
10 enlighten


**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* fewer than `no_below` documents (absolute number) or
* more than `no_above` documents (fraction of total corpus size, not absolute number).
* After applying both rules, keep only the first `keep_n` most frequent tokens (or keep all if `None`).
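
The filtering rule above can be sketched in plain Python. This is an illustration of the rule as described, not gensim's implementation; `filter_extremes_sketch` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    # Document frequency: the number of documents each token occurs in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # turn the fraction into a document count
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

toy_docs = [
    ["disk", "fail", "replac"],
    ["disk", "drive", "window"],
    ["disk", "window", "font"],
    ["disk", "game", "team"],
]
# "disk" is in all 4 documents (more than 75%), "window" is in 2, the rest in 1.
print(sorted(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75)))  # ['window']
```

Note that both thresholds count documents, not total occurrences: a word repeated ten times in a single document still has a document frequency of 1.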

In [21]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)

**Gensim doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
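
As a rough picture of what this conversion produces, here is a plain-Python sketch. It is not gensim's code; `doc2bow_sketch` and the three-word vocabulary are hypothetical, and it assumes that tokens missing from the vocabulary are simply dropped.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    # Count only tokens that are in the vocabulary; unknown tokens are ignored.
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    # Return (token_id, token_count) pairs ordered by token id.
    return sorted(counts.items())

token2id = {"clock": 0, "disk": 1, "upgrad": 2}  # hypothetical toy vocabulary
tokens = ["clock", "upgrad", "clock", "poll", "disk", "clock"]
print(doc2bow_sketch(tokens, token2id))  # [(0, 3), (1, 1), (2, 1)]
```

Word order is discarded: only the ids and their counts survive, which is exactly what the "bag" in bag-of-words refers to.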

In [22]:
'''
Create the bag-of-words model for each document, i.e. for each document we create a list of
(word id, count) pairs reporting which words appear and how many times. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
print(bow_corpus[:2])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)], [(24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 2), (33, 5), (34, 1), (35, 1), (36, 1), (37, 1), (38, 2), (39, 1), (40, 2), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 3), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 2), (62, 1), (63, 1), (64, 3), (65, 1), (66, 4)]]


In [23]:
'''
Preview BOW for our sample preprocessed document
'''
document_num = 20
bow_doc_x = bow_corpus[document_num]

for i in range(len(bow_doc_x)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], 
                                                     dictionary[bow_doc_x[i][0]], 
                                                     bow_doc_x[i][1]))

Word 18 ("rest") appears 1 time.
Word 166 ("clear") appears 1 time.
Word 336 ("refer") appears 1 time.
Word 350 ("true") appears 1 time.
Word 391 ("technolog") appears 1 time.
Word 437 ("christian") appears 1 time.
Word 453 ("exampl") appears 1 time.
Word 476 ("jew") appears 1 time.
Word 480 ("lead") appears 1 time.
Word 482 ("littl") appears 3 time.
Word 520 ("wors") appears 2 time.
Word 721 ("keith") appears 3 time.
Word 732 ("punish") appears 1 time.
Word 803 ("california") appears 1 time.
Word 859 ("institut") appears 1 time.
Word 917 ("similar") appears 1 time.
Word 990 ("allan") appears 1 time.
Word 991 ("anti") appears 1 time.
Word 992 ("arriv") appears 1 time.
Word 993 ("austria") appears 1 time.
Word 994 ("caltech") appears 2 time.
Word 995 ("distinguish") appears 1 time.
Word 996 ("german") appears 1 time.
Word 997 ("germani") appears 3 time.
Word 998 ("hitler") appears 1 time.
Word 999 ("livesey") appears 2 time.
Word 1000 ("motto") appears 2 time.
Word 1001 ("order") appear

## Step 4: Running LDA using Bag of Words ##

We are going for 8 topics in the document corpus.

**We will run LDA with `LdaMulticore`, which parallelizes model training across multiple CPU cores to speed it up.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of extra processes to use for parallelization. By default, gensim uses all available cores minus one.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave both at their defaults for now (the default value is `1/num_topics`).
    - Alpha is the prior on the per-document topic distribution.
        * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document has a mixture of very few topics.

    - Eta is the prior on the per-topic word distribution.
        * High eta: Each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic has a mixture of few words.

* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, `chunksize` is 10,000, and `passes` is 2, then online training is done in 10 updates:
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999` 
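
To make the alpha description in the parameter list above more concrete, the sketch below samples topic mixtures from a symmetric Dirichlet distribution and measures how much of each document its dominant topic takes up. The values 0.1 and 10.0 are arbitrary choices for illustration, not gensim defaults, and `mean_largest_share` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 8

def mean_largest_share(alpha, num_docs=2000):
    # Sample a topic mixture per document and average the weight of the dominant topic.
    mixtures = rng.dirichlet([alpha] * num_topics, size=num_docs)
    return mixtures.max(axis=1).mean()

low = mean_largest_share(0.1)    # low alpha: a few topics dominate each document
high = mean_largest_share(10.0)  # high alpha: weight is spread across all topics
print(f"alpha=0.1  -> dominant topic share: {low:.2f}")
print(f"alpha=10.0 -> dominant topic share: {high:.2f}")
```

The same reasoning applies to eta, with words per topic in place of topics per document.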

In [24]:
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics = 8,
                                       id2word = dictionary,
                                       passes = 10,
                                       workers = 2)

In [25]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.007*"bike" + 0.006*"game" + 0.005*"team" + 0.004*"run" + 0.004*"player" + 0.004*"play" + 0.003*"homosexu" + 0.003*"pitch" + 0.003*"virginia" + 0.003*"defens"


Topic: 1 
Words: 0.009*"govern" + 0.007*"armenian" + 0.006*"israel" + 0.005*"kill" + 0.005*"isra" + 0.004*"american" + 0.004*"turkish" + 0.004*"countri" + 0.004*"weapon" + 0.004*"live"


Topic: 2 
Words: 0.015*"game" + 0.013*"team" + 0.010*"play" + 0.008*"hockey" + 0.008*"player" + 0.005*"canada" + 0.005*"season" + 0.004*"leagu" + 0.004*"score" + 0.004*"toronto"


Topic: 3 
Words: 0.010*"card" + 0.010*"window" + 0.007*"driver" + 0.006*"sale" + 0.005*"price" + 0.005*"appl" + 0.005*"speed" + 0.005*"engin" + 0.005*"monitor" + 0.005*"video"


Topic: 4 
Words: 0.014*"file" + 0.010*"program" + 0.009*"window" + 0.006*"encrypt" + 0.006*"chip" + 0.006*"data" + 0.006*"imag" + 0.006*"avail" + 0.005*"version" + 0.004*"code"


Topic: 5 
Words: 0.013*"space" + 0.009*"nasa" + 0.006*"scienc" + 0.005*"orbit" + 0.005*"research"

### Classification of the topics ###

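Using the words in each topic and their corresponding weights, what categories were you able to infer? Topic numbering changes from one training run to the next, so the labels below come from a different run and do not line up index-for-index with the topics printed above.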

* 0: Graphics Cards
* 1: Religion
* 2: Space
* 3: Politics
* 4: Gun Violence
* 5: Technology
* 6: Sports
* 7: Encryption 

## Step 6: Testing model on unseen document ##

In [26]:
num = 100
unseen_document = newsgroups_test.data[num]
print(unseen_document)

Subject: help
From: C..Doelle@p26.f3333.n106.z1.fidonet.org (C. Doelle)
Lines: 13

Hello All!

    It is my understanding that all True-Type fonts in Windows are loaded in
prior to starting Windows - this makes getting into Windows quite slow if you
have hundreds of them as I do.  First off, am I correct in this thinking -
secondly, if that is the case - can you get Windows to ignore them on boot and
maybe make something like a PIF file to load them only when you enter the
applications that need fonts?  Any ideas?


Chris

 * Origin: chris.doelle.@f3333.n106.z1.fidonet.org (1:106/3333.26)



In [27]:
# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.5851475596427917	 Topic: 0.010*"card" + 0.010*"window" + 0.007*"driver" + 0.006*"sale" + 0.005*"price"
Score: 0.3906135857105255	 Topic: 0.014*"file" + 0.010*"program" + 0.009*"window" + 0.006*"encrypt" + 0.006*"chip"


In [28]:
print(newsgroups_test.target[num])

2


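The model assigns the unseen document mainly to the topic led by "card", "window", and "driver", with a probability of about 58.5%. Its true label is 2, which is `comp.os.ms-windows.misc` in `target_names`, so the assignment is consistent with the actual category.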
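
To reduce the scores to a single predicted topic, one option is to take the pair with the highest probability. The sketch below hard-codes pairs shaped like the result of `lda_model[bow_vector]`. The scores are rounded from the output printed above, and the topic ids 3 and 4 are read off the topic listing shown earlier, so treat them as illustrative.

```python
# (topic_id, probability) pairs, in the shape returned by lda_model[bow_vector].
# Values are rounded from the printed scores; the ids are illustrative.
doc_topics = [(4, 0.3906), (3, 0.5851)]

# Pick the pair with the largest probability.
best_topic, best_score = max(doc_topics, key=lambda pair: pair[1])
print(f"Most likely topic: {best_topic} ({best_score:.1%})")  # Most likely topic: 3 (58.5%)
```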