# Data imports

We import pandas, NumPy and SciPy for data structures. We use gensim for LDA and scikit-learn for NMF.

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn
import sys
from nltk.corpus import stopwords
import nltk
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
import pickle

# Loading the data

We are using the ABC News headlines dataset. Some lines are badly formatted (very few), so we are skipping those.

In [2]:
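# error_bad_lines=False skips malformed rows; pandas 1.3+ deprecates it in favour of on_bad_lines='skip'.
data = pd.read_csv('../Data/abcnews-date-text.csv', error_bad_lines=False)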



b'Skipping line 637987: expected 2 fields, saw 3\nSkipping line 672395: expected 2 fields, saw 3\nSkipping line 697406: expected 2 fields, saw 3\nSkipping line 724169: expected 2 fields, saw 3\nSkipping line 738471: expected 2 fields, saw 3\nSkipping line 753796: expected 2 fields, saw 3\nSkipping line 759008: expected 2 fields, saw 3\nSkipping line 761317: expected 2 fields, saw 3\nSkipping line 761491: expected 2 fields, saw 3\nSkipping line 761778: expected 2 fields, saw 3\nSkipping line 763261: expected 2 fields, saw 3\nSkipping line 766836: expected 2 fields, saw 3\nSkipping line 767743: expected 2 fields, saw 3\nSkipping line 768084: expected 2 fields, saw 3\nSkipping line 770979: expected 2 fields, saw 3\nSkipping line 778212: expected 2 fields, saw 3\nSkipping line 781216: expected 2 fields, saw 3\nSkipping line 782529: expected 2 fields, saw 3\nSkipping line 784936: expected 2 fields, saw 3\nSkipping line 785692: expected 2
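
On pandas 1.3 and later, the same skipping behaviour is available through `on_bad_lines`. A minimal sketch, using a small made-up in-memory CSV rather than the real file (the sample rows are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical sample; the second data row has an extra field.
raw = (
    "publish_date,headline_text\n"
    "20030219,first headline\n"
    "20030219,broken,row\n"
    "20030220,second headline\n"
)

# on_bad_lines='skip' drops rows with too many fields (pandas >= 1.3).
sample = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(sample["headline_text"].tolist())
```

Only the two well-formed rows are kept, which mirrors the "Skipping line" messages shown for the real dataset.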

In [5]:
# We only need the headline_text column from the data.
data_text = data[['headline_text']]

We need to remove stopwords first. Casting all values to strings will make them easier to iterate over.

In [6]:
data_text = data_text.astype('str')

In [None]:
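# Build the stop word set once; calling stopwords.words() for every word is very slow.
stop_words = set(stopwords.words())

for idx in range(len(data_text)):

    # Go through each word in the row, drop stop words, and store the token list back in place.
    # .at avoids chained indexing, which may write to a temporary copy instead of data_text.
    data_text.at[idx, 'headline_text'] = [word for word in data_text.at[idx, 'headline_text'].split(' ') if word not in stop_words]

    # Print progress to monitor the run.
    if idx % 1000 == 0:
        sys.stdout.write('\rc = ' + str(idx) + ' / ' + str(len(data_text)))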
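
Membership tests against a `set` that is built once are much cheaper than calling `stopwords.words()` for every token. A vectorised variant of the same filtering is sketched below, with a tiny hand-written stop word set standing in for NLTK's list (an assumption, so the snippet runs without downloading the corpus) and two made-up headlines:

```python
import pandas as pd

# Stand-in for set(stopwords.words()); the real list is much longer.
stop_words = {"the", "a", "of", "to", "in", "for"}

headlines = pd.Series([
    "police probe the crash",
    "council to vote in favour of a plan",
])

def remove_stopwords(text):
    """Split on single spaces and keep tokens that are not stop words."""
    return [word for word in text.split(" ") if word not in stop_words]

# apply() runs the function once per headline and returns a Series of token lists.
tokens = headlines.apply(remove_stopwords)
print(tokens.tolist())
```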

In [105]:
# Save the data because removing stop words takes a very long time.
with open('data_text.dat', 'wb') as f:
    pickle.dump(data_text, f)

In [71]:
# Get the tokenized headlines as a list of word lists for LDA input.
train_headlines = [value[0] for value in data_text.values]

In [132]:
# Number of topics to extract: 10
num_topics = 10

# LDA

We will use the gensim library for LDA. First, we build an id-to-word dictionary. Then, for each headline, we use the dictionary to map each word id to its count in that headline (a bag-of-words representation). The LDA model takes both the dictionary and this corpus as input.

In [72]:
id2word = gensim.corpora.Dictionary(train_headlines)

In [73]:
corpus = [id2word.doc2bow(text) for text in train_headlines]
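
To make the two mappings concrete, here is a plain-Python illustration of what an id-to-word dictionary and a bag-of-words corpus look like. It mimics the shape of gensim's output (lists of `(word_id, count)` pairs) but is not gensim's implementation; the documents and the id assignment order are assumptions made for this sketch:

```python
from collections import Counter

docs = [["police", "probe", "crash"], ["police", "police", "search"]]

# Assign an integer id to each distinct word, in order of first appearance.
word2id = {}
for doc in docs:
    for word in doc:
        word2id.setdefault(word, len(word2id))

id2word_demo = {idx: word for word, idx in word2id.items()}

# Each document becomes a sorted list of (word_id, count) pairs.
corpus_demo = [sorted(Counter(word2id[w] for w in doc).items()) for doc in docs]

print(id2word_demo)
print(corpus_demo)
```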

In [74]:
lda = ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics)

# Generating LDA topics

We will iterate over the number of topics, get the top words in each cluster and add them to a dataframe.

In [129]:
def get_lda_topics(model, num_topics):
    word_dict = {}
    for i in range(num_topics):
        # show_topic returns (word, probability) pairs for topic i.
        words = model.show_topic(i, topn=20)
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

In [135]:
get_lda_topics(lda, num_topics)

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 | Topic # 06 | Topic # 07 | Topic # 08 | Topic # 09 | Topic # 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | queensland | rural | sydney | canberra | wa | police | australian | government | south | trump |
| 1 | court | two | world | north | calls | death | australia | says | election | adelaide |
| 2 | woman | 90 | china | perth | new | say | day | found | news | seconds |
| 3 | nsw | attack | win | coast | call | missing | country | donald | tasmania | years |
| 4 | indigenous | crash | cup | 2015 | labor | children | one | accused | says | national |
| 5 | charged | car | business | afl | funding | nrl | hour | us | water | family |
| 6 | child | police | market | gold | nt | darwin | tasmanian | people | victoria | drug |
| 7 | power | women | australia | record | council | john | test | trial | top | near |
| 8 | farmers | dead | final | four | hobart | west | year | new | take | hospital |
| 9 | murder | health | home | interview | qld | cattle | budget | guilty | life | park |


# NMF

For NMF, we need to obtain a design matrix. To improve results, I am going to apply a TfIdf transformation to the counts.

In [79]:
# The count vectorizer needs string inputs, not lists of tokens, so I join the tokens with a space.
train_headlines_sentences = [' '.join(text) for text in train_headlines]

Now, we obtain a counts design matrix, for which we use scikit-learn’s CountVectorizer class. The transformation returns a matrix of size (Documents x Features), where the value of each cell is the number of times the feature (word) appears in that document.

To reduce the size of the matrix and speed up computation, we set the maximum number of features to 5000, which keeps only the 5000 most frequent terms across the corpus.

In [80]:
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
x_counts = vectorizer.fit_transform(train_headlines_sentences)
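
As a small check of how `max_features` behaves, the sketch below fits a `CountVectorizer` on three made-up headlines. scikit-learn keeps the terms with the highest total count across the corpus, so only the two most frequent words survive here:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["police probe crash", "police search", "police crash report"]

# Keep only the 2 most frequent terms across all documents.
demo_vectorizer = CountVectorizer(analyzer="word", max_features=2)
demo_counts = demo_vectorizer.fit_transform(docs)

print(sorted(demo_vectorizer.vocabulary_))  # surviving terms
print(demo_counts.toarray())                # shape: (documents, features)
```

The result has one row per document and one column per retained term, the same layout as `x_counts` on the full dataset.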

Next, we set up a TfIdf transformer and use it to transform the counts.

In [81]:
transformer = TfidfTransformer(smooth_idf=False)
x_tfidf = transformer.fit_transform(x_counts)
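
With `smooth_idf=False`, scikit-learn documents the inverse document frequency as ln(n / df(t)) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A quick sanity check on a toy count matrix (values made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Rows are documents, columns are terms; term 0 appears in both documents.
toy_counts = np.array([[1, 0],
                       [1, 1]])

toy_transformer = TfidfTransformer(smooth_idf=False)
toy_transformer.fit(toy_counts)

# Recompute the idf by hand from the document frequencies.
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)
expected_idf = np.log(n_docs / doc_freq) + 1

print(toy_transformer.idf_)
print(expected_idf)
```

A term that occurs in every document gets an idf of exactly 1, so it is down-weighted relative to rarer terms but never zeroed out.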

And now we normalize the TfIdf values so that each row sums to 1 (L1 norm).

In [82]:
xtfidf_norm = normalize(x_tfidf, norm='l1', axis=1)
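
Note that `norm='l1'` scales each row so that its absolute values sum to 1, rather than giving it unit Euclidean length. A short illustration on a made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import normalize

toy = np.array([[1.0, 3.0],
                [2.0, 2.0]])

# Divide each row by the sum of its absolute values.
toy_l1 = normalize(toy, norm="l1", axis=1)

print(toy_l1)
print(toy_l1.sum(axis=1))
```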

And finally, we create an NMF model and fit it to the normalized TfIdf matrix.

In [84]:
# Create an NMF model.
model = NMF(n_components=num_topics, init='nndsvd')

In [85]:
#fit the model
model.fit(xtfidf_norm)

NMF(alpha=0.0, beta=1, eta=0.1, init='nndsvd', l1_ratio=0.0, max_iter=200,
  n_components=10, nls_max_iter=2000, random_state=None, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)
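
NMF approximates the document-term matrix as the product of two non-negative matrices: a document-topic matrix W and a topic-term matrix H (exposed as `components_`). A minimal sketch on a made-up 4x3 matrix, with variable names chosen here for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix: 4 documents x 3 terms.
X_toy = np.array([[1.0, 0.5, 0.0],
                  [2.0, 1.0, 0.0],
                  [0.0, 0.2, 1.0],
                  [0.0, 0.4, 2.0]])

toy_nmf = NMF(n_components=2, init="nndsvd", max_iter=500)
W = toy_nmf.fit_transform(X_toy)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_            # topic-term weights, shape (2, 3)

print(W.shape, H.shape)
```

Each row of H plays the role of one topic: the columns with the largest weights are the words that characterise it.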

In [136]:
def get_nmf_topics(model, n_top_words):

    # The word ids need to be reverse-mapped to words so we can print the top words per topic.
    # Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
    feat_names = vectorizer.get_feature_names()

    word_dict = {}
    for i in range(num_topics):

        # For each topic, take the n_top_words largest weights and map their ids to words.
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words

    return pd.DataFrame(word_dict)
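
The reversed `argsort()` slice used in the function is a compact way to get the indices of the n largest values in descending order. A small example with made-up weights:

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.3, 0.9])
n = 2

# argsort() sorts ascending; the reversed slice walks back from the end
# and stops after n items.
top_ids = weights.argsort()[:-n - 1:-1]
print(top_ids)
```

Here the two largest weights are at positions 3 and 1, in that order.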

In [139]:
get_nmf_topics(model, 20)

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 | Topic # 06 | Topic # 07 | Topic # 08 | Topic # 09 | Topic # 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | interview | seconds | police | new | fire | abc | rural | charged | council | court |
| 1 | michael | 90 | missing | zealand | house | weather | news | murder | says | accused |
| 2 | extended | business | probe | laws | crews | sport | nsw | crash | water | murder |
| 3 | david | sport | search | document | threat | news | national | woman | govt | faces |
| 4 | james | weather | investigate | hospital | destroys | entertainment | qld | death | us | front |
| 5 | john | news | hunt | year | school | business | podcast | car | plan | told |
| 6 | nrl | closer | death | home | home | market | reporter | stabbing | australia | charges |
| 7 | ivan | confidence | car | deal | blaze | analysis | country | two | report | case |
| 8 | matt | exchange | shooting | centre | suspicious | talks | nrn | assault | back | hears |
| 9 | nathan | analysis | officer | york | warning | speaks | hour | sydney | closer | drug |


The two tables above, one in each section, show the topics found by LDA and NMF on the headlines dataset. There is some coherence between the words in each cluster. For example, Topic #02 in LDA shows words associated with shootings and violent incidents, as is evident from words such as “attack”, “killed”, “shooting”, “crash”, and “police”. Other topics show different patterns.

Comparing the results of LDA to NMF also suggests that NMF produces more coherent topics on this dataset. Looking at Topic #01 in NMF, we can see many first names clustered into the same category, along with the word “interview”. This type of headline is very common in news articles, with wording similar to “Interview with John Smith”, or “Interview with James C. on …”.

We also see two topics related to violence. First, Topic #03 focuses on police-related terms, such as “probe”, “missing”, “investigate”, “arrest”, and “body”. Second, Topic #08 focuses on assault terms, such as “murder”, “stabbing”, “guilty”, and “killed”. This is an interesting split because, although the terms in each topic are very closely related, one focuses more on police activity and the other more on criminal activity. Along with the first cluster, which captures first names, these results suggest that NMF (using TfIdf) performs much better than LDA on this dataset.