# Topic Model Test

This notebook tests whether topic models can classify sets of text that are more similar syntactically than topically. Specifically, it attempts to distinguish between the discussion and conclusion sections of scientific papers.

Below we are loading the dataset for use.

In [1]:
from __future__ import print_function
from time import time
import os

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
# Note: train_test_split lives in sklearn.model_selection from scikit-learn 0.18 on.
from sklearn.cross_validation import train_test_split

import numpy as np

import pickle

validDocsDict = dict()
fileList = os.listdir("BioMedProcessed")
for f in fileList:
    # Each file holds a pickled dict; merge them all into one dictionary.
    with open(os.path.join("BioMedProcessed", f), "rb") as fh:
        validDocsDict.update(pickle.load(fh))

Here we set some variables used below and define a function that prints the top words of each topic in a fitted topic model.

In [2]:
# Counts every section in the dictionary, not only the discussion/conclusion subset used below.
n_samples = len(validDocsDict.keys())
n_features = 1000
n_topics = 2
n_top_words = 30


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
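
The slice used in print_top_words can be hard to read at a glance. Below is a minimal sketch of what it does, using a made-up six-word vocabulary and made-up topic weights (both are illustrative assumptions, not values from this dataset).

```python
import numpy as np

# Hypothetical topic-word weights for a six-word vocabulary (illustration only).
feature_names = ["gene", "patient", "cell", "care", "risk", "dna"]
topic = np.array([0.30, 0.05, 0.25, 0.02, 0.08, 0.40])
n_top_words = 3

# argsort() returns indices in ascending order of weight; the slice
# [:-n_top_words - 1:-1] walks backwards from the end, so it yields the
# indices of the n_top_words largest weights in descending order.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)  # ['dna', 'gene', 'cell']
```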

# Pre-process data

Here we preprocess the data for later use. This code keeps only the discussion and conclusion sections. We also create a label for each document and split the documents into train and test sets.

In [3]:
print("Loading dataset...")
t0 = time()
documents = []

labels = []
concLengthTotal = 0
discLengthTotal = 0
concCount = 0
discCount = 0

for k in validDocsDict.keys():
    if k.startswith("conclusion"):
        labels.append("conclusion")
        documents.append(validDocsDict[k])
        concCount += 1
        concLengthTotal += len(validDocsDict[k].split(' '))
    elif k.startswith("discussion"):
        labels.append("discussion")
        documents.append(validDocsDict[k])
        discCount += 1
        discLengthTotal += len(validDocsDict[k].split(' '))

print(len(documents))
print(concLengthTotal * 1.0/ concCount)
print(discLengthTotal * 1.0/ discCount)

train, test, labelsTrain, labelsTest = train_test_split(documents, labels, test_size = 0.1)

Loading dataset...
53034
621.583361617
1197.39683976
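
As a side note on the average lengths above: the cell counts words with split(' '), which produces an empty string wherever two spaces are adjacent. The sketch below, on a made-up toy dictionary (the keys and texts are assumptions for illustration), shows how that differs from split() with no argument. Whether the difference matters for this corpus depends on how the text was cleaned, which is not shown here.

```python
# Toy stand-in for validDocsDict; keys and texts are made up for illustration.
toy_docs = {
    "conclusion_1": "we conclude  that x",  # note the double space
    "discussion_1": "these results suggest that y holds",
    "methods_1": "ignored section",
}


def mean_length(docs, prefix, splitter):
    """Average token count over entries whose key starts with prefix."""
    lengths = [len(splitter(text)) for key, text in docs.items()
               if key.startswith(prefix)]
    return sum(lengths) / float(len(lengths))


# split(' ') yields an empty string between consecutive spaces...
by_space = mean_length(toy_docs, "conclusion", lambda s: s.split(' '))
# ...whereas split() collapses runs of whitespace.
by_whitespace = mean_length(toy_docs, "conclusion", lambda s: s.split())
print(by_space, by_whitespace)  # 5.0 4.0
```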


Here we split the training data further so that separate models can be trained. Conclusion sections and discussion sections are each put into their own training set. A TF-IDF vectorizer is fit on the full training set, which contains both conclusion AND discussion sections. The separate training sets and the test set are then transformed with this vectorizer. Because the vectorizer uses norm = 'l1', each document vector is scaled so that its entries sum to 1, which compensates for the differing lengths of conclusion and discussion sections.

In [4]:
trainSetOne = []
trainSetTwo = []

for doc, label in zip(train, labelsTrain):
    if label == "conclusion":
        trainSetOne.append(doc)
    else:
        trainSetTwo.append(doc)

# Use l1-normalized TF-IDF features for LDA (the tf_ names are kept from the raw-count version).
print("Extracting tf features for LDA...")
tf_vectorizer = TfidfVectorizer(max_df=0.95, norm = 'l1', min_df=2, max_features=n_features, stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(train)

tfSetOne = tf_vectorizer.transform(trainSetOne)
tfSetTwo = tf_vectorizer.transform(trainSetTwo)
tfTest = tf_vectorizer.transform(test)
test = tfTest
train = tf
trainSetOne = tfSetOne
trainSetTwo = tfSetTwo

print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 67.817s.
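
To illustrate the l1 normalization described above, here is a small sketch with two made-up documents of very different lengths. It assumes scikit-learn and NumPy are installed and uses default vectorizer settings apart from norm, so it does not reproduce the max_df, min_df, or stop-word settings of the cell above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents of very different lengths (illustration only).
docs = [
    "gene expression in cells",
    "patients received treatment and patients reported pain after treatment "
    "while clinical risk and patient care were assessed",
]

vectorizer = TfidfVectorizer(norm='l1')
X = vectorizer.fit_transform(docs)

# With norm='l1', the non-negative TF-IDF weights of each row sum to 1,
# regardless of how many words the document contains.
row_sums = np.asarray(X.sum(axis=1)).ravel()
print(row_sums)
```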




# LDA With Two Topics

Fit an LDA topic model with two topics on the whole training set. The aim is to see whether the topic model separates the two groups on its own. The top words of each topic are printed.

In [5]:
print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Fitting LDA models with tf features, n_samples=157526 and n_features=1000...
done in 369.030s.

Topics in LDA model:
Topic #0:
patients health study care 1016 authors risk manuscript treatment clinical data disease use research women patient medical cancer hiv children competing history pre interests analysis publication design population quality pain
Topic #1:
background expression gene cells genes cell protein results different human cancer studies activity used species model levels specific proteins present genetic method using genome dna role data number function observed


Transform the held-out test data with the topic model and assign each document to the topic with the larger normalized weight. Then count how many documents of each type (conclusion and discussion) fall into each topic. Topic one in the counts corresponds to Topic #0 above, and topic two to Topic #1.

In [6]:
results = lda.transform(test)
totalConTop1 = 0
totalConTop2 = 0
totalDisTop1 = 0
totalDisTop2 = 0
for x in range(len(results)):
    val1 = results[x][0]
    val2 = results[x][1]
    total = val1 + val2
    print(str(labelsTest[x]) + " " + str(val1/total) + " " + str(val2/total))
    if val1 > val2:
        if labelsTest[x] == "conclusion":
            totalConTop1 += 1
        else:
            totalDisTop1 += 1
    else:
        if labelsTest[x] == "conclusion":
            totalConTop2 += 1
        else:
            totalDisTop2 += 1

discussion 0.446217510904 0.553782489096
conclusion 0.647675593456 0.352324406544
conclusion 0.668534588545 0.331465411455
discussion 0.585758179684 0.414241820316
discussion 0.440545328711 0.559454671289
discussion 0.429123928039 0.570876071961
conclusion 0.508802892723 0.491197107277
conclusion 0.413093530206 0.586906469794
conclusion 0.533463022544 0.466536977456
discussion 0.617091098719 0.382908901281
discussion 0.466637971991 0.533362028009
discussion 0.591712993523 0.408287006477
conclusion 0.634231511646 0.365768488354
conclusion 0.47866952831 0.52133047169
discussion 0.570943417914 0.429056582086
conclusion 0.666855611942 0.333144388058
conclusion 0.711947853956 0.288052146044
conclusion 0.586395575591 0.413604424409
discussion 0.331408398056 0.668591601944
conclusion 0.673876581354 0.326123418646
conclusion 0.541880273551 0.458119726449
conclusion 0.713677248786 0.286322751214
conclusion 0.290702028924 0.709297971076
conclusion 0.494461926556 0.505538073444
conclusion 0.71497

Print out the results from the topic transforms.

In [7]:
print("Total Conclusion Topic One: " + str(totalConTop1))
print("Total Conclusion Topic Two: " + str(totalConTop2))
print("Total Discussion Topic One: " + str(totalDisTop1))
print("Total Discussion Topic Two: " + str(totalDisTop2))

Total Conclusion Topic One: 1605
Total Conclusion Topic Two: 1101
Total Discussion Topic One: 1056
Total Discussion Topic Two: 1542
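
One way to read these four counts is as the accuracy of a classifier that maps each topic to a label. The sketch below is plain arithmetic on the numbers printed above; treating the better of the two possible topic-to-label mappings as "the" accuracy is an interpretation added here, not something the notebook computes.

```python
# Counts copied from the output above.
con_top1, con_top2 = 1605, 1101
dis_top1, dis_top2 = 1056, 1542

total = con_top1 + con_top2 + dis_top1 + dis_top2

# Mapping A: topic one -> conclusion, topic two -> discussion.
acc_a = (con_top1 + dis_top2) / float(total)
# Mapping B: the opposite assignment.
acc_b = (con_top2 + dis_top1) / float(total)

best = max(acc_a, acc_b)
print(total, round(best, 4))  # 5304 0.5933
```

Under this reading, the two-topic model agrees with the section labels on roughly 59% of the test documents, which is modestly better than chance.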


Get the parameters for the LDA.

In [8]:
lda.get_params()

{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 50.0,
 'max_doc_update_iter': 100,
 'max_iter': 100,
 'mean_change_tol': 0.001,
 'n_jobs': 1,
 'n_topics': 2,
 'perp_tol': 0.1,
 'random_state': 0,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}

# Basic Classifiers

Train three basic classifiers on the TF-IDF features to solve the problem: Gaussian naive Bayes, Bernoulli naive Bayes, and k-nearest neighbors. Calculate the accuracy of each on the test set.

In [9]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(test.toarray())

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.931184012066


In [10]:
from sklearn.naive_bayes import BernoulliNB

classifier = BernoulliNB()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(test.toarray())

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.956259426848


In [11]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(train, labelsTrain)

classResults = classifier.predict(test)
numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.739064856712
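
The counting loop repeated in the three cells above computes plain accuracy. A compact equivalent is sketched below with NumPy on made-up labels; the helper name accuracy is hypothetical. scikit-learn's accuracy_score, or a fitted classifier's score method, would give the same figure.

```python
import numpy as np


def accuracy(predicted, actual):
    """Fraction of positions where predicted and actual labels agree."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))


# Made-up labels for illustration.
pred = ["conclusion", "discussion", "discussion", "conclusion"]
true = ["conclusion", "discussion", "conclusion", "conclusion"]
print(accuracy(pred, true))  # 0.75
```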


# Two Topic Models

Define two topic models with 20 topics each, one fit on the conclusion sections (ldaSet1) and one fit on the discussion sections (ldaSet2). Then transform both the train and test sets with both topic models. This gives 40 features per sample: the 20 topic weights from each of the two LDA models.

In [12]:
ldaSet1 = LatentDirichletAllocation(n_topics=20, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
ldaSet2 = LatentDirichletAllocation(n_topics=20, max_iter=100,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

In [13]:
ldaSet1.fit(trainSetOne)
print_top_words(ldaSet1, tf_feature_names, n_top_words)

Topic #0:
income perceived intern healthcare hospitals emergency adolescents gps questionnaire surgery nursing nurses physician physicians prepub jama students practices conserved rural s0140 policy 6736 reason 1001 reasons urban country transcriptional illness
Topic #1:
income perceived intern healthcare hospitals emergency adolescents gps questionnaire surgery nursing nurses physician physicians prepub jama students practices conserved rural s0140 policy 6736 reason 1001 reasons urban country transcriptional illness
Topic #2:
income perceived intern healthcare hospitals emergency adolescents gps questionnaire surgery nursing nurses physician physicians prepub jama students practices conserved rural s0140 policy 6736 reason 1001 reasons urban country transcriptional illness
Topic #3:
income perceived intern healthcare hospitals emergency adolescents gps questionnaire surgery nursing nurses physician physicians prepub jama students practices conserved rural s0140 policy 6736 reason 100

In [14]:
ldaSet2.fit(trainSetTwo)
print_top_words(ldaSet2, tf_feature_names, n_top_words)

Topic #0:
physician physicians surgery nursing students transcriptional income mrna emergency pain policy questionnaire gps nurses intern conserved urban jama rural staff services hospitals healthcare microarray loci adolescents country service school economic
Topic #1:
patients study treatment patient et al reported cases disease surgery clinical renal studies cancer therapy case pain bone diagnosis risk group tumor chemotherapy ct surgical associated rate dose survival blood
Topic #2:
cells expression cell protein genes cancer activity gene tumor proteins induced levels study activation mice growth shown observed il increased binding apoptosis role studies signaling human effect response pathway regulation
Topic #3:
physician physicians surgery nursing students transcriptional income mrna emergency pain policy questionnaire gps nurses intern conserved urban jama rural staff services hospitals healthcare microarray loci adolescents country service school economic
Topic #4:
physician p

In [15]:
results1 = ldaSet1.transform(train)
results2 = ldaSet2.transform(train)

resultsTest1 = ldaSet1.transform(test)
resultsTest2 = ldaSet2.transform(test)

In [16]:
results = np.hstack((results1, results2))
resultsTest = np.hstack((resultsTest1, resultsTest2))
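
The stacking step can be pictured with random stand-in data. In the sketch below, the two arrays drawn from a Dirichlet distribution are assumptions standing in for the outputs of ldaSet1.transform and ldaSet2.transform; only the shapes are meant to match.

```python
import numpy as np

rng = np.random.RandomState(0)
n_docs, n_topics_each = 5, 20

# Stand-ins for the per-document topic weights from the two LDA models.
feats_model1 = rng.dirichlet(np.ones(n_topics_each), size=n_docs)
feats_model2 = rng.dirichlet(np.ones(n_topics_each), size=n_docs)

# hstack places the two 20-column blocks side by side: columns 0-19 come
# from the first model, columns 20-39 from the second.
combined = np.hstack((feats_model1, feats_model2))
print(combined.shape)  # (5, 40)
```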

Train two classifiers, Gaussian naive Bayes and k-nearest neighbors, on the topic-model features of the train set and evaluate them on the topic-model features of the test set. Print the accuracy of each one.

In [17]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.608597285068


In [18]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.776583710407


Normalize the 40 features of each sample so that they sum to 1. Then train two more classifiers on the normalized data and print the accuracy of each.

In [19]:
for x in range(len(results)):
    total = 0
    for y in range(len(results[x])):
        total += results[x][y]
    for y in range(len(results[x])):
        results[x][y] = results[x][y]/total
        
for x in range(len(resultsTest)):
    total = 0
    for y in range(len(resultsTest[x])):
        total += resultsTest[x][y]
    for y in range(len(resultsTest[x])):
        resultsTest[x][y] = resultsTest[x][y]/total
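
The nested loops above divide every row by its own sum. A vectorized NumPy equivalent is sketched below on a small made-up array; both function names are hypothetical, and the loop version is a copy of the cell's logic kept only for comparison.

```python
import numpy as np


def normalize_rows_loop(a):
    """Row normalization written as in the cell above, on a copy."""
    out = a.copy()
    for x in range(len(out)):
        total = 0
        for y in range(len(out[x])):
            total += out[x][y]
        for y in range(len(out[x])):
            out[x][y] = out[x][y] / total
    return out


def normalize_rows_vectorized(a):
    """Same result using broadcasting: divide each row by its sum."""
    return a / a.sum(axis=1, keepdims=True)


# Made-up feature rows for illustration.
toy = np.array([[1.0, 3.0], [2.0, 2.0], [0.5, 1.5]])
loop_result = normalize_rows_loop(toy)
vec_result = normalize_rows_vectorized(toy)
print(vec_result)
```

On the full train matrix the vectorized form avoids a Python-level loop over every element, which is likely to be noticeably faster.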

In [20]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.489819004525


In [21]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

classifier.fit(results, labelsTrain)

classResults = classifier.predict(resultsTest)

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTest[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.776583710407
