# Topic Model Parts of Speech

This notebook explores whether topic models can classify texts that differ more in syntax than in topic. Each document is represented by its part-of-speech (POS) tags, and the task is to distinguish the discussion sections of scientific papers from the conclusion sections.

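The cell below loads the dataset. It unpickles a random sample of the files in the BioMedPOS and PubMedPOS directories (16 of 31 file indices and 5 of 10 file indices, respectively) and merges them into one dictionary.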

In [5]:
from __future__ import print_function
from time import time
import os
import random

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.model_selection import train_test_split  # replaces the deprecated sklearn.cross_validation

import numpy as np

import pickle

# Randomly pick 16 of the 31 file indices in BioMedPOS.
my_randoms1 = random.sample(range(31), 16)

# Merge the sampled pickles into one dict that maps a section key to its text.
validDocsDict = dict()
fileList1 = os.listdir("BioMedPOS")
for index, files1 in enumerate(fileList1):
    if index in my_randoms1:
        with open("BioMedPOS/" + files1, "rb") as f:
            validDocsDict.update(pickle.load(f))

# Randomly pick 5 of the 10 file indices in PubMedPOS.
my_randoms2 = random.sample(range(10), 5)

fileList2 = os.listdir("PubMedPOS")
for index, files2 in enumerate(fileList2):
    if index in my_randoms2:
        with open("PubMedPOS/" + files2, "rb") as f:
            validDocsDict.update(pickle.load(f))
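Because the sample is drawn without a seed, each run loads a different subset of files. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for the experiment. The helper name `sample_indices` and the seed value are illustrative and do not come from the notebook.

```python
import random


def sample_indices(n_files, n_keep, seed):
    """Return a reproducible, sorted sample of file indices."""
    # A private generator leaves the global random state untouched.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_files), n_keep))


# Same sample sizes as the notebook: 16 of 31 and 5 of 10.
biomed_indices = sample_indices(31, 16, seed=0)
pubmed_indices = sample_indices(10, 5, seed=0)
print(biomed_indices)
print(pubmed_indices)
```

Calling the helper again with the same seed returns the same indices, so the recorded results can be tied to a known subset of files.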

Here we set some variables used below and define a function that prints the top words of each topic in a fitted topic model.

In [6]:
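# Number of loaded sections of every type, not only the discussion
# and conclusion sections used below.
n_samples = len(validDocsDict.keys())
n_features = 200  # vocabulary size kept by the vectorizer
n_topics = 2      # one topic per section type
n_top_words = 10  # entries printed per topic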


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
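The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but easy to misread. The sketch below uses an invented weight vector and invented feature names to show that it returns the indices of the largest weights, in descending order of weight.

```python
import numpy as np

# Hypothetical topic weights and feature names, for illustration only.
topic = np.array([0.1, 0.7, 0.05, 0.4, 0.3])
feature_names = ["nn", "dt nn", "cd", "in dt", "jj nn"]
n_top_words = 3

# argsort() orders indices from the smallest to the largest weight.
# The slice walks backwards from the end and stops after n_top_words items.
top_indices = topic.argsort()[:-n_top_words - 1:-1]
top_features = [feature_names[i] for i in top_indices]
print(top_features)
```

Here the three largest weights are 0.7, 0.4 and 0.3, so the selected indices are 1, 3 and 4.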

# Pre-process data

Here we preprocess the data. The code keeps only the discussion and conclusion sections, labels each document accordingly, and splits the documents into training and test sets (90% and 10%).

In [7]:
print("Loading dataset...")
t0 = time()
documents = []

labels = []
concLengthTotal = 0
discLengthTotal = 0
concCount = 0
discCount = 0

# Keep only conclusion and discussion sections; the key prefix gives the section type.
for k, text in validDocsDict.items():
    if k.startswith("conclusion"):
        labels.append("conclusion")
        documents.append(text)
        concCount += 1
        concLengthTotal += len(text.split(' '))
    elif k.startswith("discussion"):
        labels.append("discussion")
        documents.append(text)
        discCount += 1
        discLengthTotal += len(text.split(' '))

print(len(documents))                     # number of documents kept
print(concLengthTotal * 1.0/ concCount)   # mean conclusion length in space-separated tokens
print(discLengthTotal * 1.0/ discCount)   # mean discussion length in space-separated tokens

train, test, labelsTrain, labelsTest = train_test_split(documents, labels, test_size = 0.1)

Loading dataset...
37462
663.641129678
1340.7323688
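The filtering and length statistics can be checked on a toy input. The sketch below assumes the same layout as the loaded dictionary (keys that start with the section name, values that are space-separated POS tags). The key suffixes and tag strings are invented.

```python
# Hypothetical miniature of validDocsDict; the key suffixes are invented.
docs = {
    "conclusion_001": "dt nn vbz jj",
    "discussion_001": "in dt nn nn vbd dt jj nn",
    "methods_001": "nn vbd vbn",
    "discussion_002": "prp vbp dt nn",
}

documents, labels = [], []
for key, text in sorted(docs.items()):
    # Sections of any other type (here "methods") are skipped.
    for section in ("conclusion", "discussion"):
        if key.startswith(section):
            documents.append(text)
            labels.append(section)


def mean_length(section):
    """Average number of space-separated tokens for one section type."""
    lengths = [len(d.split(" ")) for d, l in zip(documents, labels) if l == section]
    return sum(lengths) / float(len(lengths))


print(len(documents), mean_length("conclusion"), mean_length("discussion"))
```

In the real data the discussion sections are about twice as long as the conclusion sections on average, which is why the features are length-normalized later.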


Here we split the training data further so that separate models can be trained. Conclusion and discussion sections each go into their own training set. A TF-IDF vectorizer is fitted on the full training set, which contains both conclusion AND discussion sections. Every set is then transformed with this vectorizer. Because the vectorizer uses norm = 'l1', the entries of each document vector sum to 1, which compensates for the different lengths of conclusion and discussion sections.

In [8]:
trainSetOne = []  # conclusion sections
trainSetTwo = []  # discussion sections

for doc, label in zip(train, labelsTrain):
    if label == "conclusion":
        trainSetOne.append(doc)
    else:
        trainSetTwo.append(doc)

# Despite the "tf" names, these are L1-normalized tf-idf features
# over 1- to 4-grams of POS tags, not raw term counts.
print("Extracting tf features for LDA...")
#tf_vectorizer = TfidfVectorizer(max_df=0.95, norm = 'l1', min_df=2, max_features=n_features)
tf_vectorizer = TfidfVectorizer(max_df=0.95, norm = 'l1', min_df=2, max_features=n_features, ngram_range = (1,4))
t0 = time()
tf = tf_vectorizer.fit_transform(train)

tfSetOne = tf_vectorizer.transform(trainSetOne)
tfSetTwo = tf_vectorizer.transform(trainSetTwo)
tfTest = tf_vectorizer.transform(test)
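# From here on, train, test and the two per-class sets hold
# sparse feature matrices instead of raw text.
test = tfTest
train = tf
trainSetOne = tfSetOne
trainSetTwo = tfSetTwo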

print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 186.513s.
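The effect of the L1 norm can be seen on a tiny corpus. The sketch below uses invented POS-tag strings of very different lengths and keeps only the `norm` and `ngram_range` settings from the notebook. `max_df`, `min_df` and `max_features` are left at their defaults because the toy corpus is too small for them.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical POS-tag documents of very different lengths.
docs = [
    "dt nn vbz jj",
    "in dt nn vbd dt jj nn in dt nn cc dt jj nn vbz rb jj",
    "dt jj nn vbd vbn in dt nn",
]

# Same norm and n-gram range as the notebook, defaults elsewhere.
vectorizer = TfidfVectorizer(norm="l1", ngram_range=(1, 4))
X = vectorizer.fit_transform(docs)

# With the L1 norm, the entries of every non-empty row sum to 1.
row_sums = np.asarray(X.sum(axis=1)).ravel()
print(X.shape[0], row_sums)
```

The vocabulary contains multi-tag entries such as "dt nn" next to single tags, which is why the top entries of a topic later print as runs of several tags.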


# LDA With Two Topics

Fit an LDA topic model with two topics on the whole training set to see whether the topics separate the two section types without supervision, then print the top entries of each topic. Because the features are 1- to 4-grams of POS tags, each of the ten entries per topic can contain several tags.

In [13]:
print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Fitting LDA models with tf features, n_samples=109995 and n_features=200...
done in 260.769s.

Topics in LDA model:
Topic #0:
cd nnp nnp nnp cd nn nn nn nn nn cd cd cd nnp nnp nnp jj nn nn nn nn in
Topic #1:
rb to vbz in dt nn vbd dt jj nn dt nn in vbp nn in dt vbn in


Transform the held-out test set with the topic model and assign each document to the topic with the larger weight. The two weights are rescaled to sum to 1 before printing. Then count how many conclusion and discussion sections fall into each topic. Topic One and Topic Two below correspond to Topic #0 and Topic #1 above.

In [14]:
results = lda.transform(test)
totalConTop1 = 0
totalConTop2 = 0
totalDisTop1 = 0
totalDisTop2 = 0
for x in range(len(results)):
    val1 = results[x][0]
    val2 = results[x][1]
    total = val1 + val2
    print(str(labelsTest[x]) + " " + str(val1/total) + " " + str(val2/total))
    if val1 > val2:
        if labelsTest[x] == "conclusion":
            totalConTop1 += 1
        else:
            totalDisTop1 += 1
    else:
        if labelsTest[x] == "conclusion":
            totalConTop2 += 1
        else:
            totalDisTop2 += 1

discussion 0.405850513316 0.594149486684
conclusion 0.690283668996 0.309716331004
conclusion 0.351694612017 0.648305387983
discussion 0.373264365562 0.626735634438
conclusion 0.290594785527 0.709405214473
conclusion 0.561680263148 0.438319736852
conclusion 0.588349236974 0.411650763026
discussion 0.474189147454 0.525810852546
discussion 0.336308625202 0.663691374798
discussion 0.439182701209 0.560817298791
conclusion 0.616775337748 0.383224662252
discussion 0.595250461698 0.404749538302
discussion 0.392501930878 0.607498069122
conclusion 0.65623780817 0.34376219183
discussion 0.385852746071 0.614147253929
conclusion 0.370599476658 0.629400523342
conclusion 0.625020651789 0.374979348211
conclusion 0.658711921712 0.341288078288
conclusion 0.375942101739 0.624057898261
discussion 0.424599299074 0.575400700926
discussion 0.373340618627 0.626659381373
discussion 0.348906902102 0.651093097898
discussion 0.452892931469 0.547107068531
discussion 0.430377118701 0.569622881299
conclusion 0.62576

Print how many conclusion and discussion sections were assigned to each topic.

In [15]:
print("Total Conclusion Topic One: " + str(totalConTop1))
print("Total Conclusion Topic Two: " + str(totalConTop2))
print("Total Discussion Topic One: " + str(totalDisTop1))
print("Total Discussion Topic Two: " + str(totalDisTop2))

Total Conclusion Topic One: 1093
Total Conclusion Topic Two: 712
Total Discussion Topic One: 234
Total Discussion Topic Two: 1708
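The four counts above form a 2x2 contingency table. LDA topics carry no labels, so any accuracy figure depends on how the topics are read. Assuming Topic One is read as "conclusion" and Topic Two as "discussion", the implied accuracy can be worked out from the printed numbers:

```python
# Counts copied from the output above.
counts = {
    ("conclusion", 1): 1093,
    ("conclusion", 2): 712,
    ("discussion", 1): 234,
    ("discussion", 2): 1708,
}

total = sum(counts.values())

# Assumed mapping: Topic One -> conclusion, Topic Two -> discussion.
correct = counts[("conclusion", 1)] + counts[("discussion", 2)]
accuracy = correct / float(total)

# Share of each section type that lands in its assumed topic.
conclusion_recall = counts[("conclusion", 1)] / float(1093 + 712)
discussion_recall = counts[("discussion", 2)] / float(234 + 1708)

print(total, correct, accuracy)
print(conclusion_recall, discussion_recall)
```

Under this reading, 2801 of the 3747 test documents fall into the matching topic, an accuracy of about 0.75. Discussion sections are separated more cleanly than conclusion sections.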


Get the parameters for the LDA.

In [16]:
lda.get_params()

{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 50.0,
 'max_doc_update_iter': 100,
 'max_iter': 100,
 'mean_change_tol': 0.001,
 'n_jobs': 1,
 'n_topics': 2,
 'perp_tol': 0.1,
 'random_state': 0,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}

# Basic Classifiers

Train four baseline classifiers directly on the TF-IDF features and report the accuracy of each on the test set: Gaussian naive Bayes, Bernoulli naive Bayes, k-nearest neighbors, and a decision tree.

In [9]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(test.toarray())

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.934614358153
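The counting loop used for accuracy is repeated in every classifier cell. A small helper along the following lines would compute the same quantity. The function name and the example labels are illustrative and not part of the notebook.

```python
import numpy as np


def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    predicted = np.asarray(predicted)
    expected = np.asarray(expected)
    return float(np.mean(predicted == expected))


# Hypothetical labels, for illustration only.
expected = ["conclusion", "discussion", "discussion", "conclusion"]
predicted = ["conclusion", "discussion", "conclusion", "conclusion"]
print(accuracy(predicted, expected))
```

scikit-learn classifiers also provide a `score(X, y)` method that returns the mean accuracy on the given data, so `classifier.score(test.toarray(), labelsTest)` should report the same figure as the loop.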


In [10]:
from sklearn.naive_bayes import BernoulliNB

classifier = BernoulliNB()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(test.toarray())

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.802775553776
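One plausible reason BernoulliNB scores lower than GaussianNB here is that it models binary features. With its default `binarize=0.0`, every value above 0 becomes 1, so the classifier only sees whether an n-gram occurs and not how heavily it is weighted. The NumPy sketch below reproduces that thresholding on invented feature rows. It is an illustration of the preprocessing step only, not of the classifier itself.

```python
import numpy as np

# Hypothetical L1-normalized feature rows (each sums to 1).
X = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.10, 0.20, 0.30, 0.40],
    [0.90, 0.10, 0.00, 0.00],
])

# Thresholding at 0.0 mirrors BernoulliNB's default binarize=0.0:
# values strictly greater than the threshold map to 1, the rest to 0.
X_binary = (X > 0.0).astype(int)
print(X_binary)
```

The first and third rows have quite different weights but become identical after binarization, which shows the information that is lost.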


In [11]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(train, labelsTrain)

classResults = classifier.predict(test)
numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.832666132906


In [12]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(test.toarray())
numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.934347477982


# Two Topic Models

Fit two topic models with 20 topics each, one on the conclusion sections (trainSetOne) and one on the discussion sections (trainSetTwo). Then transform both the training and test sets with both models. This gives 40 features per sample: the document's topic distribution under each of the two LDA models.

In [17]:
ldaSet1 = LatentDirichletAllocation(n_topics=20, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
ldaSet2 = LatentDirichletAllocation(n_topics=20, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

In [18]:
ldaSet1.fit(trainSetOne)
print_top_words(ldaSet1, tf_feature_names, n_top_words)

Topic #0:
cd cd cd nn cd cd nn cd cd cd cd cd cd nn cd cd nns nn nn dt nn cd cd cd nn nn cd cd nn dt jj jj nn nns
Topic #1:
cd cd cd nn cd cd nn cd cd cd cd cd cd nn cd cd nns nn nn dt nn cd cd cd nn nn cd cd nn dt jj jj nn nns
Topic #2:
cd cd cd nn cd cd nn cd cd cd cd cd cd nn cd cd nns nn nn dt nn cd cd cd nn nn cd cd nn dt jj jj nn nns
Topic #3:
cd cd cd nn cd cd nn cd cd cd cd cd cd nn cd cd nns nn nn dt nn cd cd cd nn nn cd cd nn dt jj jj nn nns
Topic #4:
cd cd cd nn cd cd nn cd cd cd cd cd cd nn cd cd nns nn nn dt nn cd cd cd nn nn cd cd nn dt jj jj nn nns
Topic #5:
cd cd cd nn cd cd nn cd cd cd cd cd cd nn cd cd nns nn nn dt nn cd cd cd nn nn cd cd nn dt jj jj nn nns
Topic #6:
rb vbz to in dt nn dt jj nn vbp nn nns nnp nn in dt jj nn in
Topic #7:
cd cd cd nn cd cd nn cd cd cd cd cd cd nn cd cd nns nn nn dt nn cd cd cd nn nn cd cd nn dt jj jj nn nns
Topic #8:
nnp nnp nnp nnp nnp nnp nnp nnp nnp nnp cd nn nn nn in nnp nn cc nnp nn nn cd
Topic #9:
cd cd cd nn cd cd nn cd cd cd cd 

In [19]:
ldaSet2.fit(trainSetTwo)
print_top_words(ldaSet2, tf_feature_names, n_top_words)

Topic #0:
cd cd cd cd nn nnp nnp nn nn nnp nnp nnp nnp nnp nnp nnp nn cd nn cd cd cd cd nn cd nn cd nn cd cd cd cd nn nn nn jj
Topic #1:
cd cd cd cd nn nnp nnp nn nn nnp nnp nnp nnp nnp nnp nnp nn cd nn cd cd cd cd nn cd nn cd nn cd cd cd cd nn nn nn jj
Topic #2:
cd cd cd cd nn nnp nnp nn nn nnp nnp nnp nnp nnp nnp nnp nn cd nn cd cd cd cd nn cd nn cd nn cd cd cd cd nn nn nn jj
Topic #3:
cd cd cd cd nn nnp nnp nn nn nnp nnp nnp nnp nnp nnp nnp nn cd nn cd cd cd cd nn cd nn cd nn cd cd cd cd nn nn nn jj
Topic #4:
cd rb vbd to in dt nn vbz cd nn dt jj nn dt nn in nn in dt
Topic #5:
cd cd cd cd nn nnp nnp nn nn nnp nnp nnp nnp nnp nnp nnp nn cd nn cd cd cd cd nn cd nn cd nn cd cd cd cd nn nn nn jj
Topic #6:
cd cd cd cd nn nnp nnp nn nn nnp nnp nnp nnp nnp nnp nnp nn cd nn cd cd cd cd nn cd nn cd nn cd cd cd cd nn nn nn jj
Topic #7:
cd cd cd cd nn nnp nnp nn nn nnp nnp nnp nnp nnp nnp nnp nn cd nn cd cd cd cd nn cd nn cd nn cd cd cd cd nn nn nn jj
Topic #8:
cd cd cd cd nn nnp nnp nn nn nnp

In [20]:
results1 = ldaSet1.transform(train)
results2 = ldaSet2.transform(train)

resultsTest1 = ldaSet1.transform(test)
resultsTest2 = ldaSet2.transform(test)

In [21]:
results = np.hstack((results1, results2))
resultsTest = np.hstack((resultsTest1, resultsTest2))
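The stacking step can be illustrated with random stand-ins for the two transform outputs. The sketch below assumes each model returns one row per document with 20 topic weights that sum to 1. The helper name and the data are invented.

```python
import numpy as np

n_docs, n_topics_per_model = 5, 20
rng = np.random.RandomState(0)


def fake_topic_distribution(n_rows, n_cols):
    """Random rows that sum to 1, standing in for an LDA transform output."""
    raw = rng.rand(n_rows, n_cols)
    return raw / raw.sum(axis=1, keepdims=True)


from_conclusion_model = fake_topic_distribution(n_docs, n_topics_per_model)
from_discussion_model = fake_topic_distribution(n_docs, n_topics_per_model)

# Columns 0-19 come from the first model, columns 20-39 from the second.
features = np.hstack((from_conclusion_model, from_discussion_model))
print(features.shape)
```

Each document ends up with 40 features, and the column order records which model produced each weight.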

Train three classifiers (Gaussian naive Bayes, k-nearest neighbors, and a decision tree) on the 40 topic features from the two topic models, and print the test accuracy of each.

In [22]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.722711502535


In [23]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.844675740592


In [24]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.694689084601


Normalize the 40 features of each sample so they sum to 1. Then retrain the same three classifiers on the normalized data and print the accuracy of each.

In [25]:
# Divide every row by its own sum so the 40 features of each sample sum to 1.
results /= results.sum(axis=1, keepdims=True)
resultsTest /= resultsTest.sum(axis=1, keepdims=True)

In [26]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.722711502535


In [27]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.844675740592


In [28]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.693354683747
