# Topic Model & Subject Headings


Sources:

https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/

Kapadia, Shashank, "[Topic Modeling in Python: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0)," <i>towards data science</i>, Accessed 10/09/2020.

ALollz, "[How to calculate p-values for pairwise correlation of columns in Pandas?](https://stackoverflow.com/questions/52741236/how-to-calculate-p-values-for-pairwise-correlation-of-columns-in-pandas)," <i>StackOverflow</i>, Accessed 10/13/2020.

In [1]:
# Import necessary libraries.
import re, nltk, warnings, csv, sys, os, pickle, glob
import pandas as pd
import numpy as np
import seaborn as sns
from itertools import chain
from scipy import stats

# Import NLTK packages.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

# Import sklearn packages.
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Import LDA visualizer.
import pyLDAvis, pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# Import and append stopwords.
stop_words = stopwords.words("english")
stop_words.append('mr')


# Import project-specific functions.
# The parser module (JQA_XML_parser.py) must be in ../Scripts relative to the working directory.
lib_path = os.path.abspath(os.path.join(os.path.dirname('JQA_XML_parser.py'), '../Scripts'))
sys.path.append(lib_path)
from JQA_XML_parser import *

# Read in config.py (git ignored file) for API username and pw.
config_path = os.path.abspath(os.path.join(os.path.dirname('config.py'), '../Scripts'))
sys.path.append(config_path)
import config

# url = 'https://dsg.xmldb-dev.northeastern.edu/basex/psc/' :: old
url = 'https://dsg.xmldb-dev.northeastern.edu/BaseX964/rest/psc/'
user = config.username
pw = config.password

# Ignore warnings related to deprecated functions.
warnings.simplefilter("ignore", DeprecationWarning)

# Get the correct file path to navigate to the github repository.
abs_dir = os.getcwd() + '/../../'

## Gather XML Files

In [2]:
%%time

# Remove this cell when files are in BaseX.
# Declare directory location to shorten filepaths later.
files = glob.glob(abs_dir + "../../Data/PSC/JQA/*/*.xml")

len(files)

CPU times: user 3.31 ms, sys: 3.97 ms, total: 7.28 ms
Wall time: 7.77 ms


762

In [3]:
# %%time

# # Must be connected to Northeastern's VPN.
# r = requests.get(url, 
#                  auth = (user, pw), 
#                  headers = {'Content-Type': 'application/xml'}
#                 )

# # Check status of URL
# print (r.status_code)

# # Read in contents of pipeline.
# soup = BeautifulSoup(r.content, 'html.parser')

# # Split soup's content by \n (each line is a file path to an XML doc).
# # Use filter() to remove empty strings ('').
# # Convert back to list using list().
# files = list(filter(None, soup.text.split('\n')))

# # Filter list and retrieve only jqa/ files.
# files = [i for i in files if 'jqa/' in i]

# # len(files)
# files
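
The listing-parsing steps in the commented cell above can be tried offline. This is a minimal sketch, assuming the endpoint returns one file path per line; the `listing` string, the `parse_listing` helper, and the paths in it are hypothetical stand-ins, not real server output.

```python
# Hypothetical response body: one file path per line, with blank lines mixed in.
listing = "jqa/1821/entry-001.xml\n\ncms/1830/entry-009.xml\njqa/1822/entry-002.xml\n"

def parse_listing(text, prefix="jqa/"):
    """Split a newline-delimited listing, drop empty lines, keep paths containing `prefix`."""
    paths = [line.strip() for line in text.split("\n") if line.strip()]
    return [p for p in paths if prefix in p]

files_demo = parse_listing(listing)
print(files_demo)
```

The list comprehension does the same job as `list(filter(None, ...))` in the cell, with the added step of stripping stray whitespace.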

## Build DataFrame

In [4]:
%%time

# Build dataframe from XML files.
# build_dataframe() is imported from JQA_XML_parser; besides the file list,
# it requires the BaseX URL and credentials defined in the first cell.
df = build_dataframe(files, url = url, user = user, pw = pw)

df.head(3)

## Clean Data & Prepare for Topic Modeling

In [5]:
%%time

# Drop duplicate texts (created from unnested subject headings) & count words.
doc_len = df['text'].str.split(' ').str.len() \
    .reset_index() \
    .drop_duplicates()

# Round word count.
doc_len = np.around(doc_len['text'], decimals = -1)

doc_len = pd.DataFrame(doc_len)

# Plot graph.
sns.set(rc = {"figure.figsize": (12, 6)})
sns.set_style("dark")
ax = sns.histplot(doc_len['text'])


#### Average Document Length

Topic modeling is sensitive to document length. Long documents tend to discuss several topics, which can dilute the results, so it may help to split entries into shorter chunks of roughly equal length.

Open question: how do multiple subject headings interact with splitting entries? If every chunk inherits all of its entry's headings, will the split dilute or skew the topic correlation results?
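
To make the length normalization concrete, here is a minimal sketch of fixed-size chunking on a toy document. The `chunk_words` helper and its `min_size` parameter are illustrative assumptions; unlike the notebook's own cell, it does no stopword removal.

```python
def chunk_words(text, chunk_size=200, min_size=150):
    """Split text into chunks of `chunk_size` words and drop chunks shorter than `min_size`."""
    words = text.split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [" ".join(chunk) for chunk in chunks if len(chunk) >= min_size]

# A toy 450-word document splits into chunks of 200, 200, and 50 words.
demo_text = " ".join(f"w{i}" for i in range(450))
demo_chunks = chunk_words(demo_text)
print([len(chunk.split()) for chunk in demo_chunks])
```

The trailing 50-word chunk falls under the minimum and is discarded, so short leftovers do not enter the model as unusually sparse documents.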

## Chunk texts into equal lengths.

In [6]:
%%time

chunk_size = 200

def splitText(string):
    words = string.split(' ')
    removed_stopwords = [w for w in words if w not in stop_words]
    grouped_words = [removed_stopwords[i: i + chunk_size] for i in range(0, len(removed_stopwords), chunk_size)]
    return grouped_words

df['text'] = df['text'].apply(splitText)

df = df.explode('text')

# Remove rows without text first (explode() yields NaN for empty lists,
# and len() or join() would fail on NaN).
df = df.dropna(subset = ['text'])

# Add word count field.
df['wordCount'] = df['text'].apply(len)

# Join list of words into single string.
df['text'] = df['text'].apply(' '.join)

# Remove texts with too few words (chunk_size - 50).
df = df.query('wordCount >= (@chunk_size - 50)')

df.head(3)


## Train Topic Model

> CountVectorizer or TfidfVectorizer? Create two models and compare?
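
One way to approach this comparison is to fit the same LDA configuration on both feature types and look at the top words side by side. The sketch below assumes scikit-learn is installed; `toy_docs` and the `top_words` helper are hypothetical and stand in for the real chunked entries.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; the notebook would pass topics['text'] instead.
toy_docs = [
    "treaty negotiations with the british minister",
    "the minister discussed the treaty and commerce",
    "walked before breakfast and read the bible",
    "read cicero after my morning walk and breakfast",
]

def top_words(vectorizer, docs, n_topics=2, n_words=3, seed=0):
    """Fit LDA on the vectorizer's features and return the top words per topic."""
    features = vectorizer.fit_transform(docs)
    model = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    model.fit(features)
    # vocabulary_ maps term -> column index; sorting by index recovers column order.
    vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return [[vocab[i] for i in row.argsort()[::-1][:n_words]]
            for row in model.components_]

count_topics = top_words(CountVectorizer(stop_words="english"), toy_docs)
tfidf_topics = top_words(TfidfVectorizer(stop_words="english"), toy_docs)

for name, result in [("count", count_topics), ("tfidf", tfidf_topics)]:
    print(name, result)
```

LDA's generative model is defined over word counts, so the tf-idf variant is best read as an exploratory comparison rather than an equivalent model. Fixing `random_state` keeps the two runs comparable.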

In [7]:
%%time

# Remove duplicate text rows (caused by unnesting headings) by subsetting & de-duplicating.
# De-duplicate on both entry and text so that every chunk of an entry is kept.
topics = df[['entry', 'text']].drop_duplicates(subset = ['entry', 'text'])

# Initialise the vectorizer with English stop words.
vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the processed texts.
features = vectorizer.fit_transform(topics['text'])

# Helper function (from Kapadia).
def print_topics(model, vectorizer, n_top_words):
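    # get_feature_names() was removed in scikit-learn 1.2; use the newer name when available.
    if hasattr(vectorizer, 'get_feature_names_out'):
        words = vectorizer.get_feature_names_out()
    else:
        words = vectorizer.get_feature_names()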
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

# Set parameters (topics set to number of unique subject headings found).
number_topics = 40
number_words = 10

# Create and fit the LDA model
lda = LDA(n_components = number_topics, n_jobs=-1)
lda.fit(features)

# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, vectorizer, number_words)
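
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_topics` is compact but easy to misread. A minimal sketch with made-up words and weights (both hypothetical) shows how it selects the highest-weighted terms in descending order:

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights for a single topic.
demo_words = ["bank", "treaty", "congress", "weather"]
demo_weights = np.array([0.1, 0.7, 0.2, 0.5])
n_top = 2

# argsort() orders indices by ascending weight; the reversed slice walks
# backward from the end and stops after n_top items.
top_idx = demo_weights.argsort()[:-n_top - 1:-1]
top = [demo_words[i] for i in top_idx]
print(top)  # ['treaty', 'weather']
```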


## Explore Topics

#### Topics

In [8]:
%%time

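'''
pyLDAvis numbers topics from 1, while the model numbers them from 0.
By default pyLDAvis also reorders topics by prevalence (sort_topics=True),
so topic 1 in the visualization is topic 0 (zero) in the model only when
sort_topics=False is passed to prepare().
'''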

pyLDAvis.sklearn.prepare(lda, features, vectorizer, mds='mmds')


## Save pyLDAvis

In [9]:
%%time

p = pyLDAvis.sklearn.prepare(lda, features, vectorizer, mds='mmds')

pyLDAvis.save_html(p, abs_dir + "lab_space/projects/jqa/topics/jqa_topics-40_pyLDAvis.html")