We choose the number of topics K based on 10-fold cross-validation.

We proceed as follows:

1. Split the train portion of the dataset, specifically the articles published before 2010, randomly into 10 folds.

2.1. Estimate the LDA model on the first 9 folds, which serve as the training set. Keep the last 10 samples of the chain, drawn after it has converged.

2.2. Estimate perplexity on the 10th fold, which serves as the test set. To do that, calculate perplexity for each of the 10 samples, using the formula:

$ \text{Perplexity} = \exp\left(- \frac{\sum_{d=1}^{D}\sum_{v=1}^{V} n_{d,v} \log\left(\sum_{k=1}^{K} \hat{\theta}_{d}^{k} \hat{\beta}_{k}^{v}\right)}{\sum_{d=1}^{D} N_d}\right)$,

where $n_{d,v}$ is the count of word $v$ in document $d$, $N_d$ is the number of words in document $d$, and $D$, $V$ and $K$ are the numbers of test documents, vocabulary words and topics. In this formula, $\hat{\beta}_{k}^{v}$ is the probability of word $v$ under topic $k$, taken from the LDA estimation on the training set. For each of the 10 samples, we re-sample the document-topic proportions $\hat{\theta}_{d}^{k}$ for all documents in the test set using 100 iterations of Gibbs sampling. The perplexity for the 10th fold is the average over the 10 samples.

3. Repeat step 2 nine more times, each time changing the fold that serves as the test set.

4. Calculate the final perplexity as the average over 10 folds.
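
The calculation in step 2.2 can be sketched in NumPy as follows. This is a minimal illustration, not the code used to produce the results in this notebook: the function names `resample_theta` and `perplexity`, the symmetric prior `alpha=0.1` and the toy data are all assumptions made for the example. `resample_theta` uses a standard fold-in Gibbs sampler that keeps $\hat{\beta}$ fixed, and `perplexity` evaluates the formula above.

```python
import numpy as np

def resample_theta(doc_word_ids, beta, alpha=0.1, n_iter=100, seed=0):
    """Fold-in Gibbs sampling for one held-out document, with beta held fixed.

    doc_word_ids: 1-D int array of vocabulary indices, one entry per token.
    beta: (K, V) topic-word probabilities from the training run
          (assumed strictly positive for every word in the document).
    alpha: symmetric Dirichlet prior on theta (hypothetical value).
    """
    rng = np.random.RandomState(seed)
    K = beta.shape[0]
    z = rng.randint(K, size=len(doc_word_ids))         # random initial topics
    n_dk = np.bincount(z, minlength=K).astype(float)   # topic counts in the doc
    for _ in range(n_iter):
        for i, w in enumerate(doc_word_ids):
            n_dk[z[i]] -= 1                            # remove token i
            p = (n_dk + alpha) * beta[:, w]            # conditional of z_i
            z[i] = rng.choice(K, p=p / p.sum())        # draw a new topic
            n_dk[z[i]] += 1                            # add token i back
    return (n_dk + alpha) / (len(doc_word_ids) + K * alpha)

def perplexity(counts, theta, beta):
    """counts: (D, V) word counts n_{d,v}; theta: (D, K); beta: (K, V)."""
    word_probs = theta.dot(beta)       # (D, V): sum_k theta_d^k * beta_k^v
    mask = counts > 0                  # words with zero count contribute nothing
    log_lik = np.sum(counts[mask] * np.log(word_probs[mask]))
    return np.exp(-log_lik / counts.sum())

# Toy check: with a uniform beta every word has probability 1/V,
# so the perplexity equals V = 4 whatever theta is.
toy_counts = np.array([[2, 0, 1, 1], [0, 3, 0, 1]])
toy_beta = np.full((2, 4), 0.25)
toy_theta = np.vstack([
    resample_theta(np.repeat(np.arange(4), row), toy_beta)
    for row in toy_counts
])
toy_ppl = perplexity(toy_counts, toy_theta, toy_beta)
print(toy_ppl)
```

In the procedure described above, `perplexity` would be evaluated once per retained sample, and the 10 values averaged to give the perplexity of the fold.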

We start by loading the pre-processed articles from the train set, which were prepared in a prior step using the `Pre-processing (train data)` notebook.

In [1]:
import pickle

with open('stems_for_lda_train.pkl', 'rb') as f:
    stems = pickle.load(f)
    
print(stems[0])

[u'schalck', u'milliardenkredit', u'sichert', u'zahlungsfah', u'ex', u'ddr', u'darstell', u'devisenbeschaff', u'ehemaligen_ddr', u'schalck', u'golodkowski', u'josef_strau\xdf', u'eingefadelt', u'milliardenkredit', u'erstmal', u'zahlungsfah', u'ddr', u'aufrechterhalt', u'interview', u'ard', u'abend', u'ausgestrahlt', u'schalck', u'angaben_des_senders', u'freies', u'sfb', u'damal', u'nichtsein', u'ddr', u'gegang', u'geword', u'damal', u'ddr', u'kooperation', u'bundesrepubl', u'uberleb', u'parteichef', u'erich_honecker', u'berat', u'konterrevolutionar', u'zuruckgewies', u'ablehn', u'sowjet', u'staatsprasident', u'michail_gorbatschows', u'beigetrag', u'lebenswicht', u'adern', u'sowjetunion', u'durchschnitt', u'angeregt', u'schalck', u'golodkowski', u'sfb', u'bedingungslos', u'syst', u'gedient', u'schalck', u'wies', u'interview', u'vorwurf', u'rahm', u'jahrzehntelang', u'leiter_der_abteilung', u'kommerziell', u'koordinier', u'koko', u'kriminell', u'tatig', u'begang', u'schalck', u'sold', u'

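Randomly split the data set into 10 train/test partitions, each holding out 10% of the documents. Note that `ShuffleSplit` draws every partition independently, so the held-out sets can overlap; they are not the disjoint folds that `KFold` would produce.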

In [2]:
from sklearn.model_selection import ShuffleSplit

# Instantiate a ShuffleSplit object.
ss = ShuffleSplit(n_splits=10, test_size=0.10, random_state=0)

# Apply the split method to the stemmed tokens and save the resulting indices. 
# These indices will be used to split the stemmed tokens into training and test datasets.
indices = list(ss.split(stems))
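
The difference between `ShuffleSplit` and a strict 10-fold partition can be checked on toy data. The sketch below is only an illustration: `docs` is a hypothetical stand-in for 100 documents, and the `KFold` object is shown for comparison, not because it is used in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

docs = np.arange(100)  # hypothetical stand-in for 100 documents

ss = ShuffleSplit(n_splits=10, test_size=0.10, random_state=0)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Collect the held-out indices of all 10 splits for each splitter.
ss_test = np.concatenate([test_idx for _, test_idx in ss.split(docs)])
kf_test = np.concatenate([test_idx for _, test_idx in kf.split(docs)])

# KFold holds out every document exactly once.
print(len(np.unique(kf_test)))
# ShuffleSplit draws each split independently, so some documents can be
# held out more than once and others never.
print(len(np.unique(ss_test)))
```

Both splitters hold out 10 documents per split here; only `KFold` guarantees that the 10 held-out sets are disjoint.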

Create two lists, `train` and `test`. Each of the 10 elements of `train` is an array of pre-processed documents that serves as the training set in the corresponding iteration of the cross-validation algorithm; the matching element of `test` holds the held-out documents.

In [3]:
import numpy as np

# Convert once to an object array so that documents of unequal length can be indexed by position
stems_arr = np.array(stems, dtype=object)

# Initialize empty lists for the training and test sets
train = []
test = []

# Loop over all splits (10, as defined in ShuffleSplit)
for train_idx, test_idx in indices:
    # Documents in the training part of this split
    train.append(stems_arr[train_idx])
    # Documents in the held-out (test) part of this split
    test.append(stems_arr[test_idx])

Run cross-validation algorithm for K in $\{10,50,100\}$.

To stay within memory limits, we split the cross-validation across two notebooks, which can run simultaneously on different computers. This notebook covers K in the set $\{10,50,100\}$; the other notebook covers K in the set $\{150, 200, 250\}$.

In [4]:
from datetime import datetime
import multiprocessing as mp

# Define the number of cores to use for parallel processing
NUM_CORE = 30

startTime = datetime.now()

# Import the module 'perplexity_fold'
import perplexity_fold

# Define the range of topic numbers to try in cross-validation
k_values = [10, 50, 100]  

# Create a list of tuples, each containing a train set, a test set, and a number of topics (k)
train_test_data = [(train_subset, test_subset, k) for k in k_values for train_subset, test_subset in zip(train, test)]

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    
    # Use the pool to apply the perplexity_fold function to every item in train_test_data, in parallel
    perplexity_test = pool.map(perplexity_fold.perplexity_fold, train_test_data) 
    
    # pool.map preserves input order, so the results come in blocks of 10 splits per k.
    # For each k value, compute the mean perplexity over its 10 splits
    perplexity_k_10_100 = [np.mean(perplexity_test[i:i+10]) for i in range(0, len(perplexity_test), 10)]
    
    pool.close()
    pool.join()

print(datetime.now()-startTime)

5 days, 11:47:45.370000
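
The reduction from per-split results to one value per K can be illustrated on placeholder numbers. The values below are made up for the example and are not results from this notebook; `per_k`, `sliced` and `best_k` are names introduced here. The sketch assumes the goal is the K with the lowest average perplexity.

```python
import numpy as np

k_values = [10, 50, 100]
n_splits = 10

# Placeholder perplexities, ordered like train_test_data:
# all 10 splits for k=10, then all 10 for k=50, then all 10 for k=100.
perplexity_test = np.arange(len(k_values) * n_splits, dtype=float)

# One row per k, one column per split; average across the splits.
per_k = perplexity_test.reshape(len(k_values), n_splits).mean(axis=1)

# The slicing approach used in the cell above gives the same numbers.
sliced = [np.mean(perplexity_test[i:i + n_splits])
          for i in range(0, len(perplexity_test), n_splits)]

# Lower perplexity indicates a better fit on held-out documents.
best_k = k_values[int(np.argmin(per_k))]
print(per_k, best_k)
```

The reshape only works because `k` is the outer loop when `train_test_data` is built; swapping the loop order would interleave the K values and invalidate both reductions.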


In [5]:
# Save the mean perplexity for each number of topics k in {10, 50, 100} to a CSV file.
# The input is 1-D, so savetxt writes one value per line and the delimiter is never emitted.
np.savetxt("perplexity_k_10_100.csv", perplexity_k_10_100, delimiter=",", fmt='%10.5f')

In [6]:
from IPython.display import display, Javascript
display(Javascript('IPython.notebook.save_checkpoint();'))

<IPython.core.display.Javascript object>