In [2]:
import pandas as pd
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

from preprocess import *
from train_model import *

# Load the dataset into a pandas dataframe:

The dataset is available on Kaggle: https://www.kaggle.com/nzalake52/new-york-times-articles

To rerun the code in this project, save the dataset to a "data/" folder in the same directory as this notebook.

In [3]:
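# Read the raw text file line by line.
with open('./data/nytimes_news_articles.txt', encoding='utf-8') as f:
    document = f.readlines()

articles = []
current_doc = []

for line in document:
    # Skip the "URL: http..." header lines; keep every other line.
    if "URL: htt" not in line:
        current_doc.append(line)
    # A blank line closes the current block of text.
    if line == "\n":
        # A block that holds only the blank line itself has no text to keep.
        if len(current_doc) != 1:
            articles.append(" ".join(current_doc))
        current_doc = []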
        
# Let's create a pandas dataframe        
dataset = pd.DataFrame({'text': articles})

# Print top 5
dataset.head(5)

,text
0,WASHINGTON — Stellar pitching kept the Mets af...
1,Mayor Bill de Blasio’s counsel and chief legal...
2,In the early morning hours of Labor Day last y...
3,It was the Apple Store in New York City before...
4,OMAHA — The United States Olympic swimming tri...


# Running our initial preprocessing

This step runs preprocess on every article, so it may take a while.

In [4]:
dataset['processed'] = [preprocess(x) for x in dataset['text'].values]
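
The preprocess function is imported from preprocess.py, which is not shown in this notebook. Judging by the processed column displayed later (lowercase tokens with stop words removed and stems such as "earli" and "appl"), it probably does something along these lines. The sketch below is only an illustration under that assumption: the name preprocess_sketch and the tiny stop-word list are hypothetical, and stemming is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a
# much fuller list and would usually apply a stemmer as well.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "it", "was", "to", "is"}


def preprocess_sketch(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess_sketch("It was the Apple Store in New York City"))
# ['apple', 'store', 'new', 'york', 'city']
```

With a stemmer added, "apple" and "city" would become "appl" and "citi", matching the style of the output shown in this notebook.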

# Selecting vocabulary size
Remember that larger vocabularies result in a longer training process.

In [5]:
# Let's see our current processed texts.
dataset.head(5)

,text,processed
0,WASHINGTON — Stellar pitching kept the Mets af...,"[washington, stellar, pitch, kept, met, afloat..."
1,Mayor Bill de Blasio’s counsel and chief legal...,"[mayor, bill, de, blasio, counsel, chief, lega..."
2,In the early morning hours of Labor Day last y...,"[earli, morn, hour, labor, day, last, year, gr..."
3,It was the Apple Store in New York City before...,"[appl, store, new, york, citi, thing, appl, st..."
4,OMAHA — The United States Olympic swimming tri...,"[omaha, unit, state, olymp, swim, trial, spect..."


The model needs a vocabulary small enough to train efficiently, so let's look at word frequencies and pick a sensible vocabulary size.

In [6]:
# Choose k, the number of most frequent words to keep:
k = 30000
top_k_words = select_top_k(dataset, k)

Least common words of top 30000
[('trough', 4), ('coomb', 4), ('edmundscom', 4), ('cancercaus', 4), ('bivalv', 4), ('115th', 4), ('shaqiri', 4), ('poprock', 4), ('astley', 4), ('dorki', 4)]
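
The select_top_k helper is defined outside this notebook, so its exact implementation is not shown. A minimal sketch of the idea, assuming it simply counts token frequencies across all documents and reports the rarest words that still make the cut, could look like this. The name select_top_k_sketch and its arguments are hypothetical; unlike the notebook's function, it takes a list of token lists (for example dataset['processed'].tolist()) rather than the dataframe.

```python
from collections import Counter


def select_top_k_sketch(token_lists, k, show=10):
    """Return the k most frequent tokens and print the rarest of them."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    top_k = counts.most_common(k)  # list of (word, count), most frequent first
    print(f"Least common words of top {k}")
    print(top_k[-show:])
    return top_k


docs = [["mayor", "city", "mayor"], ["city", "swim"], ["mayor"]]
top = select_top_k_sketch(docs, k=2)
```

Looking at the tail of the list is a quick way to judge k: if the rarest retained words are typos or one-off names, k is probably larger than it needs to be.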


The least common words at k = 30000 still appear 4 times each. Once you settle on a value of k, apply the final filtering:

In [7]:
dataset['processed'] = keep_top_k_words(dataset, k)
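
For readers without the helper module, the filtering step can be sketched as follows. This assumes keep_top_k_words keeps only tokens that belong to the k most frequent words overall; the name keep_top_k_words_sketch is hypothetical, and it expects a list of token lists (not a generator, since the data is traversed twice).

```python
from collections import Counter


def keep_top_k_words_sketch(token_lists, k):
    """Drop every token that is not among the k most frequent overall."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocabulary = {word for word, _ in counts.most_common(k)}
    return [[t for t in tokens if t in vocabulary] for tokens in token_lists]


docs = [["mayor", "city", "mayor"], ["city", "swim"], ["mayor"]]
filtered = keep_top_k_words_sketch(docs, k=2)
print(filtered)
```

Each document keeps its original token order; only out-of-vocabulary tokens are removed.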

# Training our Model

Our train_lda function takes a pandas dataframe with the processed texts in a 'processed' column, the number of topics to train, and a model name under which the model files are saved in the models folder for later use.

In [9]:
# import warnings; warnings.simplefilter('ignore') # Ignoring divide by 0 in log runtime warning during training.

In [10]:
dictionary,corpus,lda = train_lda(dataset, 50, '50topics')

Training model...
Time to train LDA model on  8897 articles:  0.6633241494496663 min
Saving model files...
Model saved in models folder.


# Now let's visualise!

In [11]:
pyLDAvis.gensim.prepare(lda, corpus, dictionary)

The plot shows several overlapping topics, which suggests that a smaller number of topics may work better for a general topic model. Let's try this:

In [12]:
dictionary,corpus,lda = train_lda(dataset, 30, '30topics')

Training model...
Time to train LDA model on  8897 articles:  0.5853211323420207 min
Saving model files...
Model saved in models folder.


In [13]:
pyLDAvis.gensim.prepare(lda, corpus, dictionary)

Let's decrease the number of topics one more time.

In [14]:
dictionary,corpus,lda = train_lda(dataset, 15, '15topics')

Training model...
Time to train LDA model on  8897 articles:  0.3784134825070699 min
Saving model files...
Model saved in models folder.


In [15]:
pyLDAvis.gensim.prepare(lda, corpus, dictionary)

# Comments:

As this simple example shows, visualising the topic distribution makes the iterative search for a good general topic model more intuitive. Studying each topic and its top words can also suggest how to adjust the vocabulary, and how to tune the parameters of our train_lda function that control how topics are represented.

The LDA model is a useful tool for topic modeling, and it supports related tasks such as measuring similarity between documents (for this, add "from gensim import similarities" to the imports). Document similarity can in turn serve as the basis of a simple recommendation system for documents.
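
To make the similarity idea concrete, here is a small dependency-free sketch. It assumes each document is represented by its topic distribution as a list of (topic_id, probability) pairs, which is the format gensim returns for lda[bow], and ranks documents by cosine similarity. The function names and the example distributions are made up for illustration; gensim's similarities module offers indexed versions of the same idea for larger collections.

```python
import math


def cosine_similarity(topics_a, topics_b):
    """Cosine similarity between two sparse (topic_id, weight) vectors."""
    a, b = dict(topics_a), dict(topics_b)
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, candidates, top_n=3):
    """Rank candidate documents by similarity to the query document."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up topic distributions, for illustration only.
doc_topics = [
    [(0, 0.9), (3, 0.1)],
    [(0, 0.8), (3, 0.2)],
    [(7, 1.0)],
]
ranking = most_similar(doc_topics[0], doc_topics[1:])
print(ranking)
```

A recommender built this way would return the top-ranked candidates for whichever article the reader is currently viewing.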