# Topic Model Parts of Speech

This notebook tries to use topic models to classify sets of text that differ more in syntax than in topic. Specifically, it attempts to distinguish between the discussion and conclusion sections of scientific papers.

The cell below loads the dataset.

In [1]:
from __future__ import print_function
from time import time
import os

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
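# train_test_split moved to sklearn.model_selection in scikit-learn 0.18;
# sklearn.cross_validation was removed in 0.20.
from sklearn.model_selection import train_test_split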

import numpy as np

import pickle

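# Merge every pickled {section key: POS-tagged text} dictionary into one.
validDocsDict = dict()
fileList = os.listdir("PubMedPOS")
for f in fileList:
    # The with-block closes each file as soon as it has been read.
    with open(os.path.join("PubMedPOS", f), "rb") as handle:
        validDocsDict.update(pickle.load(handle))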

Here we set some variables used below and define a function that prints the top words of each topic in a topic model.

In [2]:
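# Counts every section in the dictionary, not only the conclusion and
# discussion sections selected later.
n_samples = len(validDocsDict)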
n_features = 1000
n_topics = 2
n_top_words = 10


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
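
The slice `topic.argsort()[:-n_top_words - 1:-1]` can be hard to read at a glance. The following is a minimal, dependency-free sketch of the same idea: sort the indices by ascending weight, then step backwards from the end to take the `n` largest. The helper name `top_indices`, the toy weights and feature names, and the ` | ` separator are illustrative assumptions, not part of the notebook.

```python
def top_indices(weights, n):
    """Return the indices of the n largest weights, largest first."""
    # Indices in ascending order of weight, analogous to numpy's argsort().
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in print_top_words: walk backwards from the end, n items.
    return ascending[:-n - 1:-1]


# Hypothetical feature names; with n-gram features one name can span several tags.
feature_names = ["dt nn", "jj", "vbn rb", "in dt nn"]
topic_weights = [0.1, 0.7, 0.2, 0.5]

best = top_indices(topic_weights, 2)
# Joining with " | " keeps multi-tag n-grams visually separate.
top_line = " | ".join(feature_names[i] for i in best)
print(top_line)
```

Because the features used later are n-grams of up to four tags, a separator other than a single space may make the printed topics easier to read.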

# Pre-process data

Here we preprocess the data for later use. The code keeps only the discussion and conclusion sections, creates a label for each document, and splits the documents into train and test sets (60% of the documents go to the test set).

In [3]:
print("Loading dataset...")
t0 = time()
documents = []

labels = []
concLengthTotal = 0
discLengthTotal = 0
concCount = 0
discCount = 0

for k in validDocsDict.keys():
    if k.startswith("conclusion"):
        labels.append("conclusion")
        documents.append(validDocsDict[k])
        concCount += 1
        concLengthTotal += len(validDocsDict[k].split(' '))
    elif k.startswith("discussion"):
        labels.append("discussion")
        documents.append(validDocsDict[k])
        discCount += 1
        discLengthTotal += len(validDocsDict[k].split(' '))

print(len(documents))
print(concLengthTotal * 1.0/ concCount)
print(discLengthTotal * 1.0/ discCount)

train, test, labelsTrain, labelsTest = train_test_split(documents, labels, test_size = 0.6)

Loading dataset...
47990
536.107605751
1219.03392373
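
The selection step can be illustrated on a toy dictionary. This is only a sketch: the key names and tag strings below are hypothetical (the notebook only relies on keys starting with the section name), and `str.split()` is used without an argument, which collapses runs of whitespace and so avoids counting the empty tokens that `split(' ')` would produce.

```python
from collections import defaultdict

# Hypothetical keys and POS-tag strings, for illustration only.
sections = {
    "conclusion_001": "dt nn vbz jj",
    "discussion_001": "in dt jj nn vbd vbn in dt nn",
    "methods_001": "nn cd nn",
    "discussion_002": "dt nns vbp rb jj",
}

documents, labels = [], []
token_totals = defaultdict(int)
for key in sorted(sections):
    for section in ("conclusion", "discussion"):
        if key.startswith(section):
            documents.append(sections[key])
            labels.append(section)
            token_totals[section] += len(sections[key].split())

# Mean number of tokens per section type.
mean_tokens = {s: token_totals[s] / labels.count(s) for s in token_totals}
print(labels, mean_tokens)
```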


Here we split the training data further to train different models: discussion and conclusion sections each go into their own training set. A TF-IDF vectorizer is fit on the full training set (conclusion and discussion sections together). Each training subset and the test set are then transformed with this vectorizer, giving vector encodings that are L1-normalized to sum to 1, which accounts for the differing lengths of conclusion and discussion sections.

In [4]:
trainSetOne = []
trainSetTwo = []

for x in range(len(train)):
    if labelsTrain[x] == "conclusion":
        trainSetOne.append(train[x])
    else:
        trainSetTwo.append(train[x])

# Use L1-normalized tf-idf features over POS-tag n-grams (1- to 4-grams) for LDA.
print("Extracting tf features for LDA...")
#tf_vectorizer = TfidfVectorizer(max_df=0.95, norm = 'l1', min_df=2, max_features=n_features)
tf_vectorizer = TfidfVectorizer(max_df=0.95, norm = 'l1', min_df=2, max_features=n_features, ngram_range = (1,4))
t0 = time()
tf = tf_vectorizer.fit_transform(train)

tfSetOne = tf_vectorizer.transform(trainSetOne)
tfSetTwo = tf_vectorizer.transform(trainSetTwo)
tfTest = tf_vectorizer.transform(test)
test = tfTest
train = tf
trainSetOne = tfSetOne
trainSetTwo = tfSetTwo

print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 128.298s.
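
To see why L1 normalization removes the effect of document length, consider the sketch below. It ignores the idf weighting and works on plain lists; the function name `l1_normalize` and the toy vectors are assumptions made for illustration.

```python
def l1_normalize(row):
    """Scale a row so its absolute values sum to 1 (all-zero rows are returned unchanged)."""
    total = sum(abs(v) for v in row)
    return [v / total for v in row] if total else list(row)


short_doc = [2, 1, 1]  # hypothetical n-gram weights for a short section
long_doc = [6, 3, 3]   # same proportions in a section three times as long

short_norm = l1_normalize(short_doc)
long_norm = l1_normalize(long_doc)
print(short_norm, long_norm)
```

Both vectors end up with the same proportions, so a long discussion and a short conclusion with the same mix of tag patterns become indistinguishable by length alone.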




# LDA With Two Topics

Fit an LDA topic model with two topics on the whole training set to see whether it separates the two groups automatically, then print the top words for each topic.

In [5]:
print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Fitting LDA models with tf features, n_samples=137601 and n_features=1000...
done in 218.108s.

Topics in LDA model:
Topic #0:
vbn rb dt jj vb jj nns vbz to vbd in dt nn dt jj nn
Topic #1:
cd nnp nnp nnp cd nn nn nn nn nn cd cd cd nnp nnp nnp jj nn nn in jj


Transform the held-out test data with the topic model and assign each document to whichever topic has the larger share. Then count how many documents of each type (conclusion and discussion) fall into each topic; in the counters below, "Top1" and "Top2" refer to Topic #0 and Topic #1 above.

In [6]:
results = lda.transform(test)
totalConTop1 = 0
totalConTop2 = 0
totalDisTop1 = 0
totalDisTop2 = 0
for x in range(len(results)):
    val1 = results[x][0]
    val2 = results[x][1]
    total = val1 + val2
    print(str(labelsTest[x]) + " " + str(val1/total) + " " + str(val2/total))
    if val1 > val2:
        if labelsTest[x] == "conclusion":
            totalConTop1 += 1
        else:
            totalDisTop1 += 1
    else:
        if labelsTest[x] == "conclusion":
            totalConTop2 += 1
        else:
            totalDisTop2 += 1

conclusion 0.496547482801 0.503452517199
discussion 0.465047982416 0.534952017584
discussion 0.589135905893 0.410864094107
discussion 0.637578286463 0.362421713537
conclusion 0.31577460502 0.68422539498
discussion 0.434690805673 0.565309194327
conclusion 0.497171889758 0.502828110242
discussion 0.644654901423 0.355345098577
discussion 0.647164204977 0.352835795023
discussion 0.558988709715 0.441011290285
conclusion 0.632566094947 0.367433905053
conclusion 0.533429820465 0.466570179535
discussion 0.430902077567 0.569097922433
conclusion 0.581751562512 0.418248437488
conclusion 0.409946943439 0.590053056561
conclusion 0.612582589727 0.387417410273
discussion 0.63703123912 0.36296876088
conclusion 0.420404039212 0.579595960788
discussion 0.571805658123 0.428194341877
conclusion 0.616507528174 0.383492471826
discussion 0.61973366393 0.38026633607
conclusion 0.341798107416 0.658201892584
conclusion 0.391021426953 0.608978573047
conclusion 0.287661367671 0.712338632329
conclusion 0.390807096

Print out the results from the topic transforms.

In [7]:
print("Total Conclusion Topic One: " + str(totalConTop1))
print("Total Conclusion Topic Two: " + str(totalConTop2))
print("Total Discussion Topic One: " + str(totalDisTop1))
print("Total Discussion Topic Two: " + str(totalDisTop2))

Total Conclusion Topic One: 6787
Total Conclusion Topic Two: 7651
Total Discussion Topic One: 12935
Total Discussion Topic Two: 1421
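
These four tallies can be turned into an accuracy once each topic is mapped to a label. The sketch below assumes the mapping Topic #0 to discussion and Topic #1 to conclusion, which is the assignment that agrees with the labels most often; the mapping is an assumption, since LDA itself attaches no labels to topics.

```python
# Counts copied from the output above, keyed by (true label, dominant topic index).
counts = {
    ("conclusion", 0): 6787,
    ("conclusion", 1): 7651,
    ("discussion", 0): 12935,
    ("discussion", 1): 1421,
}

total = sum(counts.values())

# Assumed mapping from topic index to section label.
topic_to_label = {0: "discussion", 1: "conclusion"}
correct = sum(n for (label, topic), n in counts.items()
              if topic_to_label[topic] == label)
accuracy = correct / total

# Share of each class that lands in "its" topic.
discussion_recall = counts[("discussion", 0)] / (
    counts[("discussion", 0)] + counts[("discussion", 1)])
conclusion_recall = counts[("conclusion", 1)] / (
    counts[("conclusion", 0)] + counts[("conclusion", 1)])
print(total, round(accuracy, 4), round(discussion_recall, 4), round(conclusion_recall, 4))
```

Under this mapping the two-topic model agrees with the labels for roughly 71% of the test documents. Discussion sections are separated far more cleanly than conclusion sections, which split almost evenly between the two topics.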


Get the parameters for the LDA.

In [8]:
lda.get_params()

{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 50.0,
 'max_doc_update_iter': 100,
 'max_iter': 100,
 'mean_change_tol': 0.001,
 'n_jobs': 1,
 'n_topics': 2,
 'perp_tol': 0.1,
 'random_state': 0,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}

# Basic Classifiers

Train four basic classifiers on the tf-idf features to solve the problem: Gaussian naive Bayes, Bernoulli naive Bayes, k-nearest neighbors, and a decision tree. Calculate the accuracy of each.

In [9]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(test.toarray())

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.892963811905


In [10]:
from sklearn.naive_bayes import BernoulliNB

classifier = BernoulliNB()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(test.toarray())

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.866673612558


In [11]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(train, labelsTrain)

classResults = classifier.predict(test)
numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.736924359242


In [12]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(test.toarray())
numRight = 0
numWrongDisc = 0
numWrongConc = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1
    else:
        if classResults[item] == "discussion":
            numWrongDisc += 1
        else:
            numWrongConc += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))
print("Incorrectly classified as discussion: " + str(numWrongDisc))
print("Incorrectly classified as conclusion: " + str(numWrongConc))
print(len(classResults))

0.92307425158
Incorrectly classified as discussion: 1163
Incorrectly classified as conclusion: 1052
28794
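
The counting loops in the cells above all follow the same pattern, which could be factored into small helpers. The sketch below is one possible way to do this with the standard library; the function names and toy labels are assumptions. The last line checks that the decision tree's printed accuracy matches its printed error counts.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; pairs with differing labels are the errors."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy labels for illustration only.
y_true = ["conclusion", "discussion", "discussion", "conclusion", "discussion"]
y_pred = ["conclusion", "discussion", "conclusion", "discussion", "discussion"]

table = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(table, acc)

# Consistency check on the figures printed above: accuracy = 1 - errors / total.
reported = 1 - (1163 + 1052) / 28794
print(reported)
```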


# Two Topic Models

Define two topic models with 20 topics each: `ldaSet1` is fit on the conclusion sections and `ldaSet2` on the discussion sections. Then transform both the train and test sets with both topic models to get 40 features per sample, based on the topic distribution each LDA assigns to it.

In [13]:
ldaSet1 = LatentDirichletAllocation(n_topics=20, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
ldaSet2 = LatentDirichletAllocation(n_topics=20, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

In [14]:
ldaSet1.fit(trainSetOne)
print_top_words(ldaSet1, tf_feature_names, n_top_words)

Topic #0:
jj nn nn nns fw fw fw cd nn jj nns nnp jj nn nn nn nn jj fw fw cd nn fw cd nn jj nns vbp cd nnp jj dt cd nns
Topic #1:
jj nn nn nns fw fw fw cd nn jj nns nnp jj nn nn nn nn jj fw fw cd nn fw cd nn jj nns vbp cd nnp jj dt cd nns
Topic #2:
jj nn nn nns fw fw fw cd nn jj nns nnp jj nn nn nn nn jj fw fw cd nn fw cd nn jj nns vbp cd nnp jj dt cd nns
Topic #3:
jj nn nn nns fw fw fw cd nn jj nns nnp jj nn nn nn nn jj fw fw cd nn fw cd nn jj nns vbp cd nnp jj dt cd nns
Topic #4:
jj nn nn nns fw fw fw cd nn jj nns nnp jj nn nn nn nn jj fw fw cd nn fw cd nn jj nns vbp cd nnp jj dt cd nns
Topic #5:
jj nn nn nns fw fw fw cd nn jj nns nnp jj nn nn nn nn jj fw fw cd nn fw cd nn jj nns vbp cd nnp jj dt cd nns
Topic #6:
jj nn nn nns fw fw fw cd nn jj nns nnp jj nn nn nn nn jj fw fw cd nn fw cd nn jj nns vbp cd nnp jj dt cd nns
Topic #7:
jj nn nn nns fw fw fw cd nn jj nns nnp jj nn nn nn nn jj fw fw cd nn fw cd nn jj nns vbp cd nnp jj dt cd nns
Topic #8:
jj nn nn nns fw fw fw cd nn jj nns nnp

In [15]:
ldaSet2.fit(trainSetTwo)
print_top_words(ldaSet2, tf_feature_names, n_top_words)

Topic #0:
nn nnp nnp nn nnp nnp nn cd nnp nn nn nn nnp nnp cd cd jj nn nnp nnp jj nn nn nnp nnp nnp nnp cc nnp nnp nn in nnp nnp jj nnp cd cd cd
Topic #1:
nnp nnp nn cd nnp nn nn nn nn nnp nnp nn jj nnp nnp nnp nnp jj nnp nnp cd cd nnp nnp nn in nn nnp nn nnp nnp nnp cc nnp cd cd cd
Topic #2:
nnp nnp nn cd nnp nn nn nn nn nnp nnp nn nnp nnp cd cd nnp cd cd cd jj nn nn nnp nn nnp nn jj nnp nnp jj nn nnp nnp nnp nnp nnp cc
Topic #3:
nn nnp nnp nn nnp nn nn nn nnp nnp nn cd nnp nnp jj nnp nnp nnp cc nn nnp nn nnp nnp cd cd nnp nnp nn in nnp cd cd cd jj nnp nnp
Topic #4:
nnp nn nn nn nn nnp nnp nn nnp nnp nn cd nnp nnp jj nnp nnp nnp cc jj nn nn nnp nnp nnp nn in nn nnp nn jj nn nnp nnp nnp nnp cd cd
Topic #5:
nnp nn nn nn nn nnp nnp nn nnp nnp nn cd nnp nnp jj nnp nnp nnp cc jj nnp nnp jj nn nnp nnp nn nnp nn nnp nnp cd cd jj nn nn nnp
Topic #6:
nnp nn nn nn nnp nnp nn cd nn nnp nnp nn nnp nnp cd cd jj nnp nnp nnp nnp nnp cc jj nn nnp nnp nnp nnp nn in nnp cd cd cd nnp nnp jj
Topic #7:
nn

In [16]:
results1 = ldaSet1.transform(train)
results2 = ldaSet2.transform(train)

resultsTest1 = ldaSet1.transform(test)
resultsTest2 = ldaSet2.transform(test)

In [17]:
results = np.hstack((results1, results2))
resultsTest = np.hstack((resultsTest1, resultsTest2))
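
The stacking step places the topic weights from the two models side by side, one row per document. A dependency-free sketch of what `np.hstack` does for two 2-D inputs is shown below; the helper name `stack_columns` and the toy values (3 topics per model instead of 20) are assumptions.

```python
def stack_columns(left_rows, right_rows):
    """Concatenate two feature matrices row by row (like np.hstack for 2-D inputs)."""
    if len(left_rows) != len(right_rows):
        raise ValueError("both matrices need one row per document")
    return [list(a) + list(b) for a, b in zip(left_rows, right_rows)]


# Toy document-topic weights: 2 documents, 3 topics per model.
from_model_one = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
from_model_two = [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]]

combined = stack_columns(from_model_one, from_model_two)
print(len(combined), len(combined[0]))
```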

Train two classifiers (Gaussian naive Bayes and k-nearest neighbors) on the transformed train set from the topic models and print the accuracy of each on the transformed test set.

In [18]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.499027575189
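
An accuracy close to 0.5 is easier to interpret next to a baseline. The sketch below computes the accuracy of always predicting the most frequent label, using the test-set class sizes implied by the topic tallies printed earlier; the helper name `majority_baseline` is an assumption.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Class sizes in the test set, from the earlier tallies:
# 6787 + 7651 conclusions and 12935 + 1421 discussions.
test_labels = ["conclusion"] * (6787 + 7651) + ["discussion"] * (12935 + 1421)

baseline = majority_baseline(test_labels)
print(len(test_labels), round(baseline, 4))
```

Since the two classes are almost evenly balanced in the test set, the baseline is about 0.50, so the Gaussian naive Bayes result on the topic features is at chance level.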


In [19]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.749392234493


Normalize the results of each sample of 40 features so they sum to 1. Then train two more classifiers using the data and print out the accuracy of each.

In [20]:
# Divide each row by its own sum so that the 40 features of every sample sum to 1.
results /= results.sum(axis=1, keepdims=True)
resultsTest /= resultsTest.sum(axis=1, keepdims=True)
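
What this normalization does depends on what the rows look like beforehand. The sketch below uses toy values and assumes that each model's block of topic weights already sums to 1 per document. Whether that holds for the notebook's data depends on the scikit-learn version and is not verified here.

```python
def normalize_rows(rows):
    """Divide every row by its own sum so that each row sums to 1."""
    return [[v / sum(row) for v in row] for row in rows]


# Toy stacked features: two blocks of topic weights, each assumed to sum to 1.
stacked = [[0.75, 0.25, 0.5, 0.5],
           [0.25, 0.75, 1.0, 0.0]]

normalized = normalize_rows(stacked)
print(normalized)
```

Under that assumption every stacked row sums to 2, so the step only halves each feature and leaves the relative values within and across documents unchanged.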

In [21]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.499027575189


In [22]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.749913176356
