# Topic Modeling of the New York Times

We will use a subset of articles from the New York Times dataset (downloaded from https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).

## Load the Data

We will load the data into the variable `nytimes` using the following piece of code:

In [None]:
import gensim
import numpy as np
import collections
import matplotlib.pyplot as plt
%matplotlib notebook

nytimes = []
with open('nytimes_30000docs.txt') as inputfile:
    for line in inputfile:
        nytimes.append(line.lower().split())

## Preprocess the Data

The dataset has already been preprocessed to remove punctuation and stop words. It contains one document per line, and words within documents have been ordered alphabetically (this doesn't affect the bag-of-words representation).

### Remove high- and low-frequency words

**[Task]** Use the cell below to remove the words that appear fewer than 5 times in the corpus, and also remove the 25 most common words in the corpus.

In [None]:
# obtain the frequency of each word
frequency = collections.defaultdict(int)
for doc in nytimes:
    for token in doc:
        frequency[token] += 1

# obtain the frequency of the words as a numpy array
n_most_common = 25
np_freq = np.array(list(frequency.values()), dtype=float)
# sort the frequencies in ascending order
np_freq_sorted = np.sort(np_freq)
# obtain the maximum allowed frequency (the frequency of the 25th most common word)
max_freq = np_freq_sorted[-n_most_common]

# remove words that appear fewer than 5 times, and the 25 most common words
# (those whose frequency is at least max_freq)
doc_noLowFreq = [[token for token in text if frequency[token] >= 5 and frequency[token] < max_freq]
                  for text in nytimes]
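
As an aside, the same filtering can be written with `collections.Counter`, whose `most_common` method returns the most frequent tokens directly. The sketch below is only an illustration on a made-up toy corpus; the helper name `filter_by_frequency` and the toy documents are assumptions, not part of the notebook.

```python
import collections

def filter_by_frequency(docs, min_count=5, n_most_common=25):
    """Drop tokens seen fewer than min_count times and the n_most_common top tokens."""
    # Count every token across the whole corpus
    counts = collections.Counter(token for doc in docs for token in doc)
    # Tokens that are too frequent to be informative
    too_common = {token for token, _ in counts.most_common(n_most_common)}
    return [[token for token in doc
             if counts[token] >= min_count and token not in too_common]
            for doc in docs]

# Toy corpus (hypothetical): "a" appears 5 times, "b" 3 times, "c" and "d" once
toy_docs = [["a", "a", "b", "c"], ["a", "b", "d"], ["a", "a", "b"]]
filtered = filter_by_frequency(toy_docs, min_count=2, n_most_common=1)
print(filtered)
```

With these toy settings only `"b"` survives: `"a"` is the single most common token, and `"c"` and `"d"` fall below the minimum count.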

### Create the dictionary

**[Task]** Create the dictionary in the cell below.

In [None]:
# Create the dictionary
dictionary = gensim.corpora.Dictionary(doc_noLowFreq)

### Create the bag-of-words representation

**[Task]** Use the cell below to obtain the bag-of-words representation.

In [None]:
corpus = [dictionary.doc2bow(doc) for doc in doc_noLowFreq]
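
To see what the bag-of-words representation looks like, the following sketch mimics the shape of the output of `doc2bow` in plain Python: a sorted list of `(token_id, count)` pairs, with tokens outside the vocabulary ignored. The function `toy_doc2bow` and the tiny `token2id` mapping are hypothetical and only illustrate the idea; they are not gensim's implementation.

```python
import collections

def toy_doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in doc."""
    # Map each known token to its id and count how often each id occurs
    counts = collections.Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Hypothetical three-word vocabulary
token2id = {"bank": 0, "loan": 1, "river": 2}
bow = toy_doc2bow(["river", "bank", "river", "unknown"], token2id)
print(bow)  # [(0, 1), (2, 2)]
```

Word order is discarded: only the identity of each token and its count remain, which is why the alphabetical ordering of the dataset does not matter.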

## Train LDA

**[Tasks]**

1. Fit an LDA model with `num_topics=100` to the NYT dataset.

2. Print the dominant topics using the method `print_topics(num_topics=20, num_words=10)`. Read the words in some of the topics. Does the result make sense? Can you think of a "title" that summarizes each topic?

3. Go to the New York Times website. Choose a section and find an article in that section. Copy the article (document) into a single line and remove punctuation marks. In the notebook, obtain the topic proportions that this article (document) exhibits, and plot a bar chart with these proportions. Show the top 10 words of the top 2 topics of that article using the method `print_topic(topicno)`.
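
For the cleaning step in task 3, one possible approach is sketched below. It replaces each punctuation mark with a space (rather than deleting it), so that a hyphenated word such as "drug-resistant" becomes two tokens instead of one fused token. Whether this matches the corpus vocabulary better than deletion is an assumption worth checking; the helper name `tokenize_article` is hypothetical.

```python
import string

# Translation table mapping every ASCII punctuation mark to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def tokenize_article(text):
    """Lower-case the text, replace punctuation by spaces, and split on whitespace."""
    return text.lower().translate(PUNCT_TO_SPACE).split()

tokens = tokenize_article("Drug-resistant bacteria, said Dr. Arnold")
print(tokens)  # ['drug', 'resistant', 'bacteria', 'said', 'dr', 'arnold']
```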

**[Warning]** Fitting LDA to this dataset on a laptop may take 2-3 minutes, so be patient. If it takes longer than 5 minutes, you may use the dataset with 5000 documents instead of the one with 30000 documents: replace the file name in the first code cell of the notebook.

In [None]:
model = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=100)

In [None]:
model.print_topics(num_topics=20, num_words=10)

In [None]:
new_doc = 'when babies have an acute ear infection they tug at their ears get cranky and struggle to sleep through the night ear infections are the most common reason doctors prescribe antibiotics to children because of the growing threat of drug-resistant bacteria many physicians had hoped that a shorter course of antibiotics would be as effective as the standard 10 days of treatment for babies If five days did the trick parents would benefit too It would cost less and the inconsolable infant would be back to her old self and to day care more quickly with possibly fewer days of nasty side effects, like diarrhea But a trial published in The New England Journal of Medicine on Wednesday dashed those hopes Our intuition was that shorter courses would likely be as effective and lead to less antibiotic resistance but neither of those proved to be the case said Dr Donald H Arnold an associate professor of pediatrics and emergency medicine at Vanderbilt University School of Medicine Ear infections are often caused by bacteria but some are caused by viruses and should not be treated with antibiotics Babies with a middle-ear infection known as acute otitis media have pain an eardrum that is at least moderately protruding and other symptoms The new study included 520 babies 6 to 23 months old the age group most prone to middle-ear infections By their first birthday almost half of infants will have had one'
# Lower-case the article
new_doc = new_doc.lower()
# Remove punctuation marks (if needed)
import string
exclude = set(string.punctuation)
# Join the kept characters back into a single string
new_doc_noPunct = ''.join(ch for ch in new_doc if ch not in exclude)

# Find the topic proportions for this article
new_doc_topics = model[dictionary.doc2bow(new_doc_noPunct.split())]
print(new_doc_topics)
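
The topic proportions are a list of `(topic_id, proportion)` pairs, so the two dominant topics can be picked out by sorting on the proportion. The sketch below uses made-up numbers in place of `new_doc_topics`; the helper name `top_topics` and the example values are assumptions for illustration only.

```python
def top_topics(topic_proportions, n=2):
    """Return the n (topic_id, proportion) pairs with the largest proportion."""
    return sorted(topic_proportions, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical proportions, in the same format as new_doc_topics
example_topics = [(3, 0.12), (70, 0.25), (94, 0.41), (12, 0.05)]
best = top_topics(example_topics)
print(best)  # [(94, 0.41), (70, 0.25)]
```

In the notebook, calling `top_topics(new_doc_topics)` would give the two topic numbers to pass to `print_topic`.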

In [None]:
# Plot the bar chart
aux = np.zeros(100)
for k, val in new_doc_topics:
    aux[k] = val

plt.bar(np.arange(100), aux)

In [None]:
model.print_topic(94)  # <-- replace 94 with the topic of highest proportion for your article

In [None]:
model.print_topic(70)  # <-- replace 70 with the topic of second-highest proportion for your article