# Topic Model Two Datasets Memory Efficient Fuzzy

This notebook tries to use topic models to classify sets of text that are more syntactically similar than topically similar. It attempts to distinguish between the discussion and conclusion sections of scientific papers. To simulate noise, the sections are padded with random words from introduction sections. The second dataset is also read in a more memory-efficient way.

Below we import the libraries used and load the first dataset.

In [1]:
from __future__ import print_function
from time import time
from random import randint

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
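try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import train_test_split
except ImportError:
    # older releases, where the function lived in cross_validation
    from sklearn.cross_validation import train_test_split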

import numpy as np
import os
import pickle

validDocsDict = dict()
fileList = os.listdir("BioMedProcessed")
for f in fileList:
    # Close each pickle file as soon as it has been read.
    with open(os.path.join("BioMedProcessed", f), "rb") as fh:
        validDocsDict.update(pickle.load(fh))

Here we set some variables used below and define a function that prints the top words of each topic for the topic modeling.

In [2]:
n_samples = len(validDocsDict.keys())
n_features = 10000
n_topics = 2
n_top_words = 30
lengthOfIntroToAdd = 700

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
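
The slice `topic.argsort()[:-n_top_words - 1:-1]` can be hard to read at first. As a minimal sketch, assuming a small made-up weight vector and vocabulary (neither comes from the notebook's data), it returns the indices of the largest weights in descending order:

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only.
topic = np.array([0.1, 0.7, 0.05, 0.4, 0.2])
feature_names = ["cell", "gene", "study", "protein", "result"]
n_top = 3

# argsort() orders indices from smallest to largest weight; the reversed
# slice walks backwards from the end and stops after n_top entries.
top_indices = topic.argsort()[:-n_top - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(" ".join(top_words))  # prints: gene protein result
```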

# Preprocess Data

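Here we preprocess the data for later use. This code keeps only the discussion and conclusion sections as documents and creates a matching label for each one. It also collects long introduction sections (more than 10,000 characters) as a source of noise words, and reports the average length in words of the conclusion and discussion sections. The same filtering is applied to the second dataset in the next cell, and the documents are split into training and test sets further below.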

In [3]:
print("Loading dataset...")
t0 = time()
documents = []
introductionSections = []

labels = []
concLengthTotal = 0
discLengthTotal = 0
concCount = 0
discCount = 0
introCount = 0

for k in validDocsDict.keys():
    if k.startswith("conclusion"):
        labels.append("conclusion")
        documents.append(validDocsDict[k])
        concCount += 1
        concLengthTotal += len(validDocsDict[k].split(' '))
    elif k.startswith("discussion"):
        labels.append("discussion")
        documents.append(validDocsDict[k])
        discCount += 1
        discLengthTotal += len(validDocsDict[k].split(' '))
    elif k.startswith("introduction") and len(validDocsDict[k]) > 10000:
        introCount += 1
        introductionSections.append(validDocsDict[k])

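print(len(documents))                      # number of conclusion and discussion documents
print(concLengthTotal * 1.0 / concCount)   # mean conclusion length in words
print(discLengthTotal * 1.0 / discCount)   # mean discussion length in words
print(introCount)                          # number of long introductions kept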

Loading dataset...
53034
621.583361617
1197.39683976
1213


Here we read in the files of the second dataset, keeping only the conclusion, discussion, and long introduction sections. The files are read one at a time so that only one unpickled dictionary is held in memory at once. Also, note that because the PubMed dataset is much larger, we read in only a third of the files.

In [4]:
validDocs2 = []
labels2 = []
fileList = os.listdir("PubMedProcessed")
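for f in fileList[0:len(fileList) // 3]:
    # Integer division keeps the slice index an int; each file is closed once read.
    with open(os.path.join("PubMedProcessed", f), "rb") as fh:
        tempDict = pickle.load(fh)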
    for item in tempDict.keys():
        if item.startswith("conclusion"):
            labels2.append("conclusion")
            validDocs2.append(tempDict[item])
        elif item.startswith("discussion"):
            labels2.append("discussion")
            validDocs2.append(tempDict[item])
        elif item.startswith("introduction") and len(tempDict[item]) > 10000:
            introCount += 1
            introductionSections.append(tempDict[item])

print(len(validDocs2))
print(introCount)

27688
3392
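
The one-file-at-a-time pattern above can also be written as a generator, so that callers never hold more than one unpickled dictionary at once. This is only a sketch: `iter_sections` and its `prefixes` parameter are hypothetical names, and it assumes each file is a pickled dictionary whose keys start with the section name, as in the loop above.

```python
import os
import pickle


def iter_sections(directory, prefixes=("conclusion", "discussion")):
    """Yield (label, text) pairs, loading one pickle file at a time.

    Hypothetical helper; assumes every file in `directory` holds a pickled
    dict that maps section keys to section text.
    """
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "rb") as fh:
            sections = pickle.load(fh)
        for key, text in sections.items():
            for prefix in prefixes:
                if key.startswith(prefix):
                    yield prefix, text
                    break
        # `sections` is rebound on the next iteration, so the previous
        # dictionary becomes eligible for garbage collection.
```

Iterating with `for label, text in iter_sections("PubMedProcessed"):` would fill `labels2` and `validDocs2` in the same way as the loop above, without the introduction handling or the one-third file limit.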


Here we add random introduction words to the conclusion and discussion sections to simulate noise. Because the sections are TF-IDF vectorized as bags of words, it does not matter where in the section the words are inserted.

In [5]:
for item in range(len(documents)):
    intro = introductionSections[randint(0, len(introductionSections) - 1)].split(" ")
    # Pick a random window of words; max() guards against short introductions.
    randNum = randint(0, max(0, len(intro) - lengthOfIntroToAdd))
    introWords = intro[randNum:randNum + lengthOfIntroToAdd]
    # Join with a space so the last section word and first intro word stay separate tokens.
    documents[item] = documents[item] + " " + " ".join(introWords)

for item in range(len(validDocs2)):
    intro = introductionSections[randint(0, len(introductionSections) - 1)].split(" ")
    randNum = randint(0, max(0, len(intro) - lengthOfIntroToAdd))
    introWords = intro[randNum:randNum + lengthOfIntroToAdd]
    validDocs2[item] = validDocs2[item] + " " + " ".join(introWords)
    
train, test, labelsTrain, labelsTest = train_test_split(documents, labels, test_size = 0.1)
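
The noise-injection step can be expressed as a small helper. The sketch below is one possible formulation, not the notebook's own code: `append_intro_noise` is a hypothetical name, and it assumes introductions are plain space-separated strings. It draws a contiguous window of words from a randomly chosen introduction and appends it to the section.

```python
import random


def append_intro_noise(section, introductions, n_words, rng=random):
    """Return `section` followed by up to `n_words` consecutive words taken
    from a randomly chosen introduction (hypothetical helper)."""
    intro = rng.choice(introductions).split(" ")
    # Clamp the start so the window stays inside short introductions.
    start = rng.randint(0, max(0, len(intro) - n_words))
    window = intro[start:start + n_words]
    return section + " " + " ".join(window)


rng = random.Random(0)  # seeded so the example is repeatable
# Made-up text for illustration only.
intros = ["one two three four five six", "alpha beta gamma delta"]
noisy = append_intro_noise("our results suggest", intros, 3, rng)
print(noisy)
```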

Here we vectorize the data. A TF-IDF vectorizer is fit on the training split of the first dataset, which contains both conclusion and discussion sections. The training split, the test split, and the second dataset are then transformed with this vectorizer. Each resulting vector is L1-normalized so that its entries sum to 1, which accounts for the differing lengths of conclusion and discussion sections and for differences between the datasets.

In [6]:
# Extract L1-normalized TF-IDF features (despite the tf_ variable names, these are not raw term counts).
print("Extracting tf features for LDA...")
tf_vectorizer = TfidfVectorizer(max_df=0.95, norm = 'l1', min_df=2, max_features=n_features, stop_words='english')
t0 = time()
tf_vectorizer.fit(train)
tf = tf_vectorizer.transform(train)

tfTest = tf_vectorizer.transform(test)
test = tfTest
train = tf

pubTest = tf_vectorizer.transform(validDocs2)

print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 132.906s.
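
To see why L1 normalization compensates for section length, consider a toy case. The sketch below uses plain term counts rather than the full TF-IDF weighting, and `l1_normalize` is a hypothetical helper, but the effect is the same: a text and a copy of it repeated twice map to the same vector once the entries are scaled to sum to 1.

```python
from collections import Counter


def l1_normalize(text):
    """Map a whitespace-tokenized text to term weights that sum to 1."""
    counts = Counter(text.split())
    total = float(sum(counts.values()))
    return {term: count / total for term, count in counts.items()}


# Made-up text for illustration only.
short = "gene expression gene"
long_ = " ".join([short, short])  # same content, twice the length

print(l1_normalize(short) == l1_normalize(long_))
```

Because only counts enter the vector, word order is ignored as well.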




# Basic Classifiers Between Two Datasets

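Train and test two Bernoulli naive Bayes classifiers, one trained on dataset 1 and tested on dataset 2 and the other trained on dataset 2 and tested on dataset 1, and print the accuracy of each. Note that `BernoulliNB` binarizes its input by default, so each TF-IDF feature is reduced to whether the term is present.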

In [7]:
from sklearn.naive_bayes import BernoulliNB

classifier = BernoulliNB()

classifier.fit(train.toarray(), labelsTrain)

classResults = classifier.predict(pubTest.toarray())

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labels2[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.886629586825
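
The accuracy loop above counts matching predictions and divides by the total. An equivalent, more compact formulation is sketched below with made-up labels; `accuracy` is a hypothetical helper name. scikit-learn's `sklearn.metrics.accuracy_score` computes the same quantity.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / float(len(actual))


# Made-up labels for illustration only.
predicted = ["conclusion", "discussion", "discussion", "conclusion"]
actual = ["conclusion", "discussion", "conclusion", "conclusion"]
print(accuracy(predicted, actual))  # 3 of 4 match, so this prints 0.75
```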


In [8]:
from sklearn.naive_bayes import BernoulliNB

classifier = BernoulliNB()

classifier.fit(pubTest.toarray(), labels2)

classResults = classifier.predict(train.toarray())

numRight = 0

for item in range(len(classResults)):
    if classResults[item] == labelsTrain[item]:
        numRight += 1

print(str(numRight * 1.0 / len(classResults) * 1.0))

0.934045673581


In [9]:
probas = classifier.predict_log_proba(train.toarray())

In [10]:
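# For each prediction, sum the log-probabilities of the two classes. A confident
# prediction has one log-probability near 0 and one that is very negative, so a
# more negative sum indicates a more confident prediction.
TotalRight = 0
TotalWrong = 0
numRight = 0
numWrong = 0
RightNumbers = []
WrongNumbers = []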
for item in range(len(classResults)):
    if classResults[item] == labelsTrain[item]:
        TotalRight += probas[item][0] + probas[item][1]
        numRight += 1
        RightNumbers.append(probas[item][0] + probas[item][1])
    else:
        TotalWrong += probas[item][0] + probas[item][1]
        numWrong += 1
        WrongNumbers.append(probas[item][0] + probas[item][1])

In [11]:
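print(str(TotalRight * 1.0 / numRight))  # mean log-probability sum over correct predictions
print(str(TotalWrong * 1.0 / numWrong))  # mean log-probability sum over incorrect predictions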

-62.4955817234
-21.1156799732
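
The quantity averaged above is the sum of the two class log-probabilities, log p + log(1 - p). A small worked example, using made-up probabilities rather than classifier output, shows how this sum behaves: it is largest (about -1.39) when the classifier is maximally unsure and becomes more negative as one class dominates.

```python
import math


def log_prob_sum(p):
    """Sum of the log-probabilities of a two-class prediction (p, 1 - p)."""
    return math.log(p) + math.log(1.0 - p)


# Made-up probabilities for illustration only.
for p in (0.5, 0.9, 0.999):
    print("p = %.3f  sum = %.3f" % (p, log_prob_sum(p)))
```

Under this reading, the more negative average for correct predictions suggests that the classifier tends to be more confident when it is right.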
