# Capstone Project - Predicting Tags on a Corpus of Research Papers

The Mercatus Center at George Mason University is a university research center that provides economic research and education. Mercatus employs a fairly large staff of media and policy outreach professionals who must digest and communicate economic findings to media outlets and political staffers in Washington, DC. 

Part of this outreach is the maintenance of a website that provides full access to all research papers, testimonials, opinion editorials, summaries, and other outputs. These publications are given tags that allow users to easily find related publications. So far, Mercatus has assigned tags somewhat haphazardly, entrusting junior staffers to choose tags with a preference for preexisting labels. 

The goal of this project is to clean the current corpus of tagged documents and then use the scikit-learn Python library to train a model that predicts the most relevant tags for a new document. 

I have written a scraper that has crawled the Mercatus website for documents and saved each one, along with some useful metadata. The following notebook preprocesses this data and then fits a machine learning model that can be pickled and used later to produce actionable tag suggestions for Mercatus outreach staff.

In [1]:
import pandas as pd
import datetime
import numpy as np
import nltk
import string
import pickle
import scipy

%matplotlib inline

# should we generate a fresh model, or should we use pickles?
# This variable sets a master value for all models in this
# notebook; individual models can be set differently within
# their cells.

use_pickles_master = True

# initialize array for capturing all models

models = []

## Cleaning the data

First I use the `load_files()` function from `sklearn.datasets` to load a labeled dataset: a directory whose subdirectory names are the label names and whose subdirectories contain the text files for each label. 

I then initialize and fill a dataframe from attributes of the load_files object, and drop rows which contain missing data.

In [2]:
from sklearn.datasets import load_files

trainer = load_files('new_data')

df = pd.DataFrame()
df['filename'] = trainer.filenames
df['text'] = [' '.join(open(f, 'r').read().split()) for f in trainer.filenames]
df['label'] = [trainer.target_names[x] for x in trainer.target]
df['author'] = [x.split('__')[0].split('--')[0].split('/')[-1] for x in df['filename']]
df['date'] = pd.to_datetime([x.split('__')[-1].split('-')[-1].split('.txt')[0].strip() for x in df['filename']], 
                            errors='coerce')

# Remove documents of 250 characters or fewer
df = df[df['text'].str.len() > 250]

# Remove rows with NA values - sometimes dates got cut off,
# which causes problems in the next cell
df = df.dropna()

print(f"Our dataset contains {len(df.groupby('text').count())} unique documents, with",
     f"{len(df)/len(df.groupby('text').count())} labels per document, for a total of",
     f"{len(df)} documents spread over {len(df.groupby('label').count())} labels.")

Our dataset contains 1901 unique documents, with 2.850078905839032 labels per document, for a total of 5418 documents spread over 136 labels.
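
The directory layout that `load_files` expects can be sketched on a throwaway corpus. This is a minimal illustration, not the real `new_data` directory: the label names, file names, and file contents are invented. A document carrying several tags would appear once under each tag's subdirectory, which is why the dataset averages more than one label per unique document.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a throwaway corpus: one subdirectory per label, one text file per document.
# Label names, file names, and contents are made up for illustration.
with tempfile.TemporaryDirectory() as root:
    layout = {
        "Health Care": {"doc_a.txt": "Medicaid spending rose."},
        "Tax Policy": {"doc_b.txt": "The tax base narrowed.", "doc_c.txt": "Rates fell."},
    }
    for label, files in layout.items():
        label_dir = Path(root) / label
        label_dir.mkdir()
        for name, body in files.items():
            (label_dir / name).write_text(body, encoding="utf-8")

    # File contents are read here, while the temporary directory still exists.
    toy = load_files(root, encoding="utf-8")

label_names = list(toy.target_names)                    # subdirectory names
doc_labels = [toy.target_names[i] for i in toy.target]  # one label per loaded file
print(label_names, len(toy.data), sorted(doc_labels))
```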


Our dataset contains a large number of labels, and some have not been used in many years, while others have only been used a few times. 

The following code converts the `date` column to datetime objects, then groups by `label` and `date` to find only labels that have been active in the past four years (1460 days). 

We then remove labels that contain fewer than six documents.

In [3]:
df['date'] = df['date'].astype(np.int64) // 10 ** 9
df['date'] = pd.to_datetime(df['date'], unit='s')

df_active_labels = df[['label','date']].groupby('label').max().sort_values('date',ascending=True).reset_index()
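# compare against a pandas Timestamp; recent pandas versions do not allow ordering
# comparisons between datetime64 values and datetime.date objects
df_active_labels = df_active_labels[df_active_labels.date > pd.Timestamp.today() - pd.Timedelta(days=1460)]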
df = df[df['label'].isin(df_active_labels.label)]

df_large_categories = df.groupby('label').count()[df.groupby('label').count()['filename'] > 5].reset_index()[['label','filename']]
df = df[df['label'].isin(df_large_categories.label)]

print(f"Our refined dataset contains {len(df.groupby('text').count())} unique documents,",
     f"for a total of",
     f"{len(df)} documents spread over {len(df.groupby('label').count())} labels.")

Our refined dataset contains 1870 unique documents, for a total of 4946 documents spread over 90 labels.
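
The two filters (recently used labels, then labels with more than five documents) can be sketched on a toy frame. The labels and dates below are invented, and a fixed reference date replaces today's date so the example is reproducible.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe; labels and dates are invented.
toy = pd.DataFrame({
    "label": ["A"] * 6 + ["B"] * 6 + ["C"] * 2,
    "date": pd.to_datetime(
        ["2017-01-01"] * 6      # A: recent and large enough -> kept
        + ["2010-01-01"] * 6    # B: large enough but stale -> dropped
        + ["2017-06-01"] * 2    # C: recent but too small -> dropped
    ),
})

# A fixed reference date keeps the example reproducible; the notebook uses today's date.
reference = pd.Timestamp("2018-01-01")
cutoff = reference - pd.Timedelta(days=1460)

# Keep labels whose most recent document is newer than the cutoff.
last_used = toy.groupby("label")["date"].max()
active = last_used[last_used > cutoff].index
toy = toy[toy["label"].isin(active)]

# Keep labels with more than five documents.
sizes = toy.groupby("label").size()
large = sizes[sizes > 5].index
toy = toy[toy["label"].isin(large)]

kept_labels = sorted(toy["label"].unique())
print(kept_labels)
```

Because the cutoff in the notebook depends on the day the cell is run, rerunning it later can yield a different set of labels and rows.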


Now we can start working on our data. First we encode our labels and split our training and validation documents.

In [4]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

le = preprocessing.LabelEncoder()
y = le.fit_transform(df.label)

xtrain, xvalid, ytrain, yvalid = train_test_split(df.text.tolist(), y, 
                                                  stratify=y, 
                                                  random_state=42, 
                                                  test_size=0.1)
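
As a small illustration of what label encoding and `stratify` do, the sketch below uses invented labels in a 2:1 ratio and checks that the validation split keeps that ratio. The variable names are hypothetical and separate from the notebook's.

```python
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Invented labels: 8 documents tagged "Tax Policy", 4 tagged "Health Care".
labels = ["Tax Policy"] * 8 + ["Health Care"] * 4
texts = [f"document {i}" for i in range(len(labels))]

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels)   # classes are sorted: Health Care -> 0, Tax Policy -> 1

x_tr, x_va, y_tr, y_va = train_test_split(
    texts, y_toy, stratify=y_toy, random_state=42, test_size=0.25
)

# With stratification, the 2:1 label ratio carries over to the 3-document validation split.
valid_counts = Counter(encoder.inverse_transform(y_va).tolist())
print(valid_counts)
```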

Next we define two functions: one that stems tokens using the `SnowballStemmer`, and one that tokenizes and filters the text.

In [5]:
from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

stemmer = SnowballStemmer('english')

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    tokens = [i for i in tokens if all(j.isalpha() or j in string.punctuation for j in i)]
    tokens = [i for i in tokens if '/' not in i]
    stems = stem_tokens(tokens, stemmer)
    return stems
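
To see what the three filters in `tokenize` keep and drop, the sketch below applies the same list comprehensions to a hand-written token list. It leaves out NLTK tokenizing and stemming, so `filter_tokens` is a hypothetical helper and the sample tokens are invented stand-ins for `word_tokenize` output.

```python
import string

def filter_tokens(tokens):
    """Apply the same three filters as `tokenize`, minus NLTK tokenizing and stemming."""
    # 1. Drop tokens that are a substring of string.punctuation (in practice, single marks).
    tokens = [t for t in tokens if t not in string.punctuation]
    # 2. Keep tokens made only of letters and punctuation (drops "2017").
    tokens = [t for t in tokens if all(c.isalpha() or c in string.punctuation for c in t)]
    # 3. Drop anything containing a slash.
    return [t for t in tokens if "/" not in t]

# Hand-written tokens standing in for the output of nltk.word_tokenize.
sample = ["Taxes", ",", "2017", "state-level", "and/or", "U.S.", "...", "rose"]
kept = filter_tokens(sample)
print(kept)
```

Note that a multi-character punctuation token such as `...` passes all three filters, because the first check is a substring test against `string.punctuation` and the second accepts punctuation characters. If such tokens are unwanted, an extra check that a token contains at least one letter would remove them.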

Now I initialize two vectorizers, a `CountVectorizer` and a `TfidfVectorizer`. Both extract 1-grams and 2-grams, cap the vocabulary at 200,000 features, and keep only terms that appear in at least three documents and in no more than 70% of the corpus.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    cv = pickle.load(open("countvectorizer.pkl", "rb"))
    xtrain_cv = pickle.load(open("xtrain_cv.pkl", "rb"))
    xvalid_cv = pickle.load(open("xvalid_cv.pkl", "rb"))

else:
    cv = CountVectorizer(min_df=3, max_df=0.7, max_features=200000, tokenizer=tokenize,
                strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                ngram_range=(1, 2), stop_words='english')

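    # Duplicate texts (one copy per tag) are removed within each split before fitting.
    # Note that the vocabulary is learned from training and validation text together.
    cv.fit(list(set(xtrain)) + list(set(xvalid)))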
    xtrain_cv = cv.transform(xtrain)
    xvalid_cv = cv.transform(xvalid)

    pickle.dump(xtrain_cv, open("xtrain_cv.pkl","wb"))
    pickle.dump(xvalid_cv, open("xvalid_cv.pkl","wb"))
    pickle.dump(cv, open("countvectorizer.pkl","wb"))
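
The effect of `ngram_range`, `min_df`, and `max_df` is easier to see on a tiny corpus. The sketch below uses four invented documents and much looser thresholds than the notebook (`min_df=2`, `max_df=0.75`), with the default tokenizer rather than the stemming one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four invented documents.
docs = [
    "tax policy reform",
    "tax policy debate",
    "tax reform debate",
    "tax health care",
]

# Unigrams and bigrams; keep terms found in at least 2 documents
# and in no more than 75% of them.
toy_cv = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.75)
toy_cv.fit(docs)

kept_features = sorted(toy_cv.vocabulary_)
print(kept_features)
```

Here `tax` is dropped because it appears in every document, `health` and `care` are dropped because they appear only once, and the bigram `tax policy` survives even though the unigram `tax` does not, since the document-frequency limits are applied to each feature separately.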

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    tfv = pickle.load(open("tfidfvectorizer.pkl", "rb"))
    xtrain_tfv = pickle.load(open("xtrain_tfv.pkl", "rb"))
    xvalid_tfv = pickle.load(open("xvalid_tfv.pkl", "rb"))
    
else:
    tfv = TfidfVectorizer(min_df=3, max_df=0.7, max_features=200000, tokenizer=tokenize,
                strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True,
                stop_words = 'english')

    tfv.fit(list(xtrain) + list(xvalid))
    xtrain_tfv =  tfv.transform(xtrain) 
    xvalid_tfv = tfv.transform(xvalid)

    pickle.dump(xtrain_tfv, open("xtrain_tfv.pkl","wb"))
    pickle.dump(xvalid_tfv, open("xvalid_tfv.pkl","wb"))
    pickle.dump(tfv, open("tfidfvectorizer.pkl","wb"))
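
The weighting selected by `use_idf`, `smooth_idf`, and `sublinear_tf` can be written out by hand. The sketch below follows the formulas as described in scikit-learn's documentation (smoothed idf `ln((1 + n) / (1 + df)) + 1`, sublinear term frequency `1 + ln(tf)`, then L2 row normalization). `tfidf_sublinear` is a hypothetical helper on a dense toy matrix, not the library's implementation.

```python
import numpy as np

def tfidf_sublinear(counts):
    """Smoothed idf, sublinear tf, and L2 row normalization on a dense count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0      # smooth_idf=True
    tf = np.zeros_like(counts)
    nonzero = counts > 0
    tf[nonzero] = 1.0 + np.log(counts[nonzero])            # sublinear_tf=True
    weights = tf * idf
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / np.where(norms == 0, 1.0, norms)      # avoid dividing empty rows by 0

# Invented counts: 2 documents, 3 terms. Term 0 occurs in both documents.
toy_counts = [[3, 0, 1],
              [1, 1, 0]]
weights = tfidf_sublinear(toy_counts)
print(weights.round(3))
```

In the second row both terms occur once, but the term that appears in only one document receives the larger weight, which is the point of the idf factor.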

## Model Tests: First Pass

The goal is to provide users with the most relevant tags for a particular document. To accomplish this, I use `GridSearchCV` to tune hyperparameters for a variety of models, scoring each by negative log-loss, and then choose the best one. The models tested are `LogisticRegression`, `DecisionTreeClassifier`, `RandomForestClassifier`, and `MultinomialNB`. Each model is tried on both the tf-idf and count vectorizations. 

#### LogisticRegression

Tests for tf-idf and count vectorizations:

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Testing LogisticRegression on tf-idf vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_lr_tf = pickle.load(open("gscv_lr_tf.pkl", "rb"))

else:
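    # liblinear was the default solver when this notebook was written; it is set
    # explicitly here because dual=True is only supported by liblinear.
    lr_tf = LogisticRegression(solver='liblinear')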
    params = {'C':[0.5, 1, 5], 'dual':[False, True], 'n_jobs':[-1], 'random_state':[42]}

    # Initialize and fit.

    gscv_lr_tf = GridSearchCV(lr_tf, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_lr_tf.fit(xtrain_tfv, ytrain)

    pickle.dump(gscv_lr_tf, open("gscv_lr_tf.pkl","wb"))


# add model to the models array
models.extend([gscv_lr_tf])
    
# Display LogisticRegression/tf-idf results.

print(f'Best score of {gscv_lr_tf.best_score_} found with the following parameters:',
      gscv_lr_tf.best_params_)

Best score of -3.3094250726587253 found with the following parameters: {'C': 5, 'dual': False, 'n_jobs': -1, 'random_state': 42}
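
To put these scores in context, it helps to know what an uninformed model would score. `neg_log_loss` is the negated mean of `-ln(p)`, where `p` is the probability assigned to the true label, so values closer to zero are better. The sketch below computes the log-loss of a uniform guess over the 90 labels that remain after filtering. `mean_log_loss` is a hypothetical helper, and the three true-label indices are arbitrary.

```python
import math

def mean_log_loss(true_indices, prob_rows):
    """Average negative log of the probability assigned to each true class."""
    return -sum(math.log(row[i]) for i, row in zip(true_indices, prob_rows)) / len(true_indices)

n_labels = 90                                  # labels left after filtering
uniform_row = [1.0 / n_labels] * n_labels      # a model that knows nothing
baseline = mean_log_loss([0, 17, 89], [uniform_row] * 3)
print(round(baseline, 4))                      # ln(90), roughly 4.5
```

A score of -3.31 is therefore better than uniform guessing, but not by a wide margin. One structural reason is that a document carrying `k` tags appears `k` times with different targets, so no model can score below `ln(k)` on its copies.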


In [9]:
# Testing LogisticRegression on count vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_lr_cv = pickle.load(open("gscv_lr_cv.pkl", "rb"))

else:
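    # liblinear was the default solver when this notebook was written; it is set
    # explicitly here because dual=True is only supported by liblinear.
    lr_cv = LogisticRegression(solver='liblinear')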
    params = {'C':[0.5, 1, 5], 'dual':[False, True], 'n_jobs':[-1], 'random_state':[42]}

    # Initialize and fit.

    gscv_lr_cv = GridSearchCV(lr_cv, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_lr_cv.fit(xtrain_cv, ytrain)
    
    pickle.dump(gscv_lr_cv, open("gscv_lr_cv.pkl","wb"))


# add model to the models array
models.extend([gscv_lr_cv])
       
# Display LogisticRegression/countvectorizer results.

print(f'Best score of {gscv_lr_cv.best_score_} found with the following parameters:',
      gscv_lr_cv.best_params_)

Best score of -9.599575059251444 found with the following parameters: {'C': 0.5, 'dual': False, 'n_jobs': -1, 'random_state': 42}


#### DecisionTreeClassifier

Tests for tf-idf and count vectorizations:

In [10]:
from sklearn.tree import DecisionTreeClassifier

# Testing DecisionTreeClassifier on tf-idf vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_dtc_tf = pickle.load(open("gscv_dtc_tf.pkl", "rb"))

else:
    dtc_tf = DecisionTreeClassifier()
    params = {'criterion':['gini','entropy'], 'min_samples_split':[2,4,6], 'random_state':[42]}

    # Initialize and fit.

    gscv_dtc_tf = GridSearchCV(dtc_tf, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_dtc_tf.fit(xtrain_tfv, ytrain)
    
    pickle.dump(gscv_dtc_tf, open("gscv_dtc_tf.pkl","wb"))


# add model to the models array
models.extend([gscv_dtc_tf])
       
# Display DecisionTreeClassifier/tf-idf results.

print(f'Best score of {gscv_dtc_tf.best_score_} found with the following parameters:',
      gscv_dtc_tf.best_params_)

Best score of -30.866439479022038 found with the following parameters: {'criterion': 'gini', 'min_samples_split': 6, 'random_state': 42}


In [11]:
# Testing DecisionTreeClassifier on count vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_dtc_cv = pickle.load(open("gscv_dtc_cv.pkl", "rb"))

else:
    dtc_cv = DecisionTreeClassifier()
    params = {'criterion':['gini','entropy'], 'min_samples_split':[2,4,6], 'random_state':[42]}

    # Initialize and fit.

    gscv_dtc_cv = GridSearchCV(dtc_cv, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_dtc_cv.fit(xtrain_cv, ytrain)
    
    pickle.dump(gscv_dtc_cv, open("gscv_dtc_cv.pkl","wb"))


# add model to the models array
models.extend([gscv_dtc_cv])
       
# Display DecisionTreeClassifier/count results.

print(f'Best score of {gscv_dtc_cv.best_score_} found with the following parameters:',
      gscv_dtc_cv.best_params_)

Best score of -30.50875466565311 found with the following parameters: {'criterion': 'entropy', 'min_samples_split': 6, 'random_state': 42}
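
One plausible reading of tree scores near -30 is that decision trees often emit hard probabilities of 0 or 1, and log-loss punishes a probability of 0 on the true label very heavily. The sketch below works through that arithmetic. The clipping value `eps` is an assumption (scikit-learn's default has varied between versions), and the miss rates are invented.

```python
import math

eps = 1e-15                         # assumed clipping value; the library default varies by version
penalty_per_miss = -math.log(eps)   # cost of putting probability 0 on the true label

# If a fraction `miss_rate` of documents get probability 0 on their true label and
# the rest get probability 1, the mean log-loss is miss_rate * penalty_per_miss.
for miss_rate in (0.25, 0.5, 0.9):
    print(miss_rate, round(miss_rate * penalty_per_miss, 2))
```

Under these assumptions each miss costs about 34.5, so a log-loss around 30 says more about overconfident probabilities than about how often the top prediction is right.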


#### RandomForestClassifier

Tests for tf-idf and count vectorizations:

In [12]:
from sklearn.ensemble import RandomForestClassifier

# Testing RandomForestClassifier on tf-idf vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_rfc_tf = pickle.load(open("gscv_rfc_tf.pkl", "rb"))
    
else:
    rfc_tf = RandomForestClassifier()
    params = {'criterion':['gini','entropy'], 'min_samples_split':[2,4,6], 'random_state':[42]}

    # Initialize and fit.

    gscv_rfc_tf = GridSearchCV(rfc_tf, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_rfc_tf.fit(xtrain_tfv, ytrain)
    
    pickle.dump(gscv_rfc_tf, open("gscv_rfc_tf.pkl","wb"))


# add model to the models array
models.extend([gscv_rfc_tf])
       
# Display RandomForestClassifier/tf-idf results.

print(f'Best score of {gscv_rfc_tf.best_score_} found with the following parameters:',
      gscv_rfc_tf.best_params_)

Best score of -16.008716214283595 found with the following parameters: {'criterion': 'gini', 'min_samples_split': 6, 'random_state': 42}


In [13]:
# Testing RandomForestClassifier on count vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_rfc_cv = pickle.load(open("gscv_rfc_cv.pkl", "rb"))

else:
    rfc_cv = RandomForestClassifier()
    params = {'criterion':['gini','entropy'], 'min_samples_split':[2,4,6], 'random_state':[42]}

    # Initialize and fit.

    gscv_rfc_cv = GridSearchCV(rfc_cv, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_rfc_cv.fit(xtrain_cv, ytrain)
    
    pickle.dump(gscv_rfc_cv, open("gscv_rfc_cv.pkl","wb"))


# add model to the models array
models.extend([gscv_rfc_cv])
       
# Display RandomForestClassifier/count results.

print(f'Best score of {gscv_rfc_cv.best_score_} found with the following parameters:',
      gscv_rfc_cv.best_params_)

Best score of -15.776711349094573 found with the following parameters: {'criterion': 'entropy', 'min_samples_split': 6, 'random_state': 42}


#### Naive Bayes

Tests for tf-idf and count vectorizations:

In [14]:
from sklearn.naive_bayes import MultinomialNB

# Testing NaiveBayes on tf-idf vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_mnb_tf = pickle.load(open("gscv_mnb_tf.pkl", "rb"))
    
else:
    mnb_tf = MultinomialNB()
    params = {'alpha':[0.01, 1, 3]}

    # Initialize and fit.

    gscv_mnb_tf = GridSearchCV(mnb_tf, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_mnb_tf.fit(xtrain_tfv, ytrain)
    
    pickle.dump(gscv_mnb_tf, open("gscv_mnb_tf.pkl","wb"))


# add model to the models array
models.extend([gscv_mnb_tf])
       
# Display NaiveBayes/tf-idf results.

print(f'Best score of {gscv_mnb_tf.best_score_} found with the following parameters:',
      gscv_mnb_tf.best_params_)

Best score of -4.02127379671145 found with the following parameters: {'alpha': 3}


In [15]:
# Testing NaiveBayes on count vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_mnb_cv = pickle.load(open("gscv_mnb_cv.pkl", "rb"))

else:
    mnb_cv = MultinomialNB()
    params = {'alpha':[0.01, 1, 3]}

    # Initialize and fit.

    gscv_mnb_cv = GridSearchCV(mnb_cv, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_mnb_cv.fit(xtrain_cv, ytrain)
    
    pickle.dump(gscv_mnb_cv, open("gscv_mnb_cv.pkl","wb"))


# add model to the models array
models.extend([gscv_mnb_cv])
       
# Display NaiveBayes/count results.

print(f'Best score of {gscv_mnb_cv.best_score_} found with the following parameters:',
      gscv_mnb_cv.best_params_)

Best score of -28.09564229668772 found with the following parameters: {'alpha': 3}


## Model Tests: A Second Pass

As the test results show, `LogisticRegression` on the tf-idf vectorized data returns the best score, a mean log-loss of about 3.31. A closer look at the `GridSearchCV` results shows scores improving as `C` increases:

In [16]:
for count, c in enumerate([0.5, 0.5, 1.0, 1.0, 5.0, 5.0]):
    print(f"For C={c}, mean_test_score={list(gscv_lr_tf.cv_results_['mean_test_score'])[count]}")

For C=0.5, mean_test_score=-3.608504471830679
For C=0.5, mean_test_score=-3.608512043830933
For C=1.0, mean_test_score=-3.420553423569543
For C=1.0, mean_test_score=-3.42056045461182
For C=5.0, mean_test_score=-3.3094250726587253
For C=5.0, mean_test_score=-3.309442896912445
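
The loop above relies on a hard-coded list of `C` values matching the order of the grid. A less fragile option is to read each parameter combination from `cv_results_['params']`, which is stored in the same order as `mean_test_score`. The sketch below shows this on synthetic data with a small invented grid. It does not reproduce the notebook's models or scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the vectorized documents.
X_toy, y_toy = make_classification(n_samples=150, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.5, 1, 5], "tol": [1e-3, 1e-4]},
    scoring="neg_log_loss",
    cv=3,
)
search.fit(X_toy, y_toy)

# Pair every parameter combination with its mean cross-validated score.
rows = list(zip(search.cv_results_["params"], search.cv_results_["mean_test_score"]))
for params, score in rows:
    print(f"C={params['C']}, tol={params['tol']}, mean_test_score={score:.4f}")
```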


Exploring higher C-values makes sense, given the trend we are seeing here. Each `C` value appears twice as each value was tested along with a `dual` value of `True` and `False`; the `dual` value does not seem to make much of a difference, so we'll drop it in our next test and only test further values of `C`. We will also test a few other parameters, namely `penalty` and `tol`:

In [17]:
# Second testing of LogisticRegression on tf-idf vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_lr_tf2 = pickle.load(open("gscv_lr_tf2.pkl", "rb"))
    
else:
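    # liblinear was the default solver when this notebook was written; it is set
    # explicitly here because the 'l1' penalty needs a solver that supports it.
    lr_tf2 = LogisticRegression(solver='liblinear')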
    params = {'penalty':['l1','l2'], 'C':[1, 5, 10], 'tol':[0.001, 0.0001, 0.00001], 'n_jobs':[-1], 'random_state':[42]}

    # Initialize and fit.

    gscv_lr_tf2 = GridSearchCV(lr_tf2, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_lr_tf2.fit(xtrain_tfv, ytrain)
    
    pickle.dump(gscv_lr_tf2, open("gscv_lr_tf2.pkl","wb"))


# add model to the models array
models.extend([gscv_lr_tf2])
       
# Display LogisticRegression/tf-idf results.

print(f'Best score of {gscv_lr_tf2.best_score_} found with the following parameters:',
      gscv_lr_tf2.best_params_)

Best score of -2.8691197965860096 found with the following parameters: {'C': 5, 'n_jobs': -1, 'penalty': 'l1', 'random_state': 42, 'tol': 1e-05}


Reducing the cross-validated log-loss from 3.31 to 2.87 (a `neg_log_loss` score of -2.87) is a clear improvement.

Similarly, our tf-idf-vectorized Multinomial Naive Bayes returned respectable results, and its best score came at the largest `alpha` tested (3), so it makes sense to test higher values of `alpha`:

In [18]:
from sklearn.naive_bayes import MultinomialNB

# Testing NaiveBayes on tf-idf vectorized features.

# should we generate a fresh model, or should we use pickles?
# by default this is the master value set in cell 1; overwrite
# it for a specific cell by specifying True or False.

use_pickles = use_pickles_master

if use_pickles:
    gscv_mnb_tf2 = pickle.load(open("gscv_mnb_tf2.pkl", "rb"))
    
else:
    mnb_tf2 = MultinomialNB()
    params = {'alpha':[4, 5, 6, 7, 8, 9]}

    # Initialize and fit.

    gscv_mnb_tf2 = GridSearchCV(mnb_tf2, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
    gscv_mnb_tf2.fit(xtrain_tfv, ytrain)
    
    pickle.dump(gscv_mnb_tf2, open("gscv_mnb_tf2.pkl","wb"))


# add model to the models array
models.extend([gscv_mnb_tf2])
       
# Display NaiveBayes/tf-idf results.

print(f'Best score of {gscv_mnb_tf2.best_score_} found with the following parameters:',
      gscv_mnb_tf2.best_params_)

Best score of -3.8707800902353537 found with the following parameters: {'alpha': 8}


This is an improvement over the score at `alpha = 3`, but it's still not competitive with LogisticRegression.
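
For intuition about why a larger `alpha` can help log-loss, the sketch below writes out the smoothed likelihood estimate used by multinomial naive Bayes, `(count + alpha) / (class_total + alpha * vocab_size)`. The counts and vocabulary size are invented, and `smoothed_likelihood` is a hypothetical helper.

```python
def smoothed_likelihood(term_count, class_total, vocab_size, alpha):
    """Additively smoothed estimate of P(term | class) for multinomial naive Bayes."""
    return (term_count + alpha) / (class_total + alpha * vocab_size)

# Invented counts: a term seen 0 or 30 times in a class with 1,000 total term occurrences.
vocab_size = 5000
for alpha in (0.01, 1, 3, 8):
    unseen = smoothed_likelihood(0, 1000, vocab_size, alpha)
    seen = smoothed_likelihood(30, 1000, vocab_size, alpha)
    # How much more likely the seen term is than an unseen one under this alpha.
    print(alpha, round(seen / unseen, 2))
```

As `alpha` grows, the gap between seen and unseen terms shrinks, so the predicted probabilities become less extreme. That plausibly explains the better log-loss at `alpha = 8`, since log-loss penalizes confident mistakes heavily.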

### Conclusion

Just to sate my curiosity, I used the `gscv_lr_tf2` model to run a test on a document that Mercatus has recently published that is not in the corpus, [this "Economic Situation" quarterly report](https://www.mercatus.org/publications/economic-situation-december-2017). It's useful to note that this document has only one tag on the website, "Economics and Public Policy":

In [19]:
with open('test_doc.txt','r') as f:
    text = f.read()

tf_test = tfv.transform([text])
text_pred = gscv_lr_tf2.predict_proba(tf_test)

# predict_proba columns follow the order of classes_, so map those
# encoded classes back to tag names.
test_df = pd.DataFrame({
    'label': le.inverse_transform(gscv_lr_tf2.classes_),
    'probability': text_pred[0],
})
test_df.sort_values('probability', ascending=False).reset_index(drop=True)[:5]

| | label | probability |
|---|---|---|
| 0 | Economics and Public Policy | 0.12186 |
| 1 | State and Local Regulations | 0.075437 |
| 2 | Federal Fiscal Policy | 0.064857 |
| 3 | Regulatory Report Card | 0.064778 |
| 4 | Regulatory Accumulation | 0.051593 |


The model ranked the document's actual tag first, though with a probability of only about 0.12. The runners-up are plausible candidates that could help somebody trying to tag this document more thoroughly.
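
Turning a probability vector into a short list of suggestions is a small ranking step. The sketch below is a minimal version with invented tag names and probabilities. `top_tags` is a hypothetical helper, and in the notebook the names would come from mapping the fitted model's `classes_` back through the label encoder.

```python
import numpy as np

def top_tags(probabilities, class_names, k=5):
    """Return the k (tag, probability) pairs with the highest probability."""
    probabilities = np.asarray(probabilities)
    order = np.argsort(probabilities)[::-1][:k]   # indices sorted by descending probability
    return [(class_names[i], float(probabilities[i])) for i in order]

# Invented probabilities for four hypothetical tags.
names = ["Tag A", "Tag B", "Tag C", "Tag D"]
probs = [0.10, 0.55, 0.05, 0.30]
suggestions = top_tags(probs, names, k=2)
print(suggestions)
```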

Of course, performance on one document is not a very good test of a model. For that we look to our validation data that was set aside in our fourth cell:

In [20]:
from sklearn.metrics import log_loss

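# generate predicted probabilities for the validation set and score them with log_loss;
# `labels` is passed so the score is defined even if a rare tag is absent from yvalid

ypred = gscv_lr_tf2.predict_proba(xvalid_tfv)
print('Log-loss on validation set is', log_loss(yvalid, ypred, labels=gscv_lr_tf2.classes_))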

ValueError: Found input variables with inconsistent numbers of samples: [499, 495]

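This cell did not produce a score. `log_loss` raised a `ValueError` because its two inputs have different lengths (499 and 495). A 10% split of the 4,946 rows in the refined dataset yields 495 validation labels, so the 499-row side is most likely the `xvalid_tfv` matrix loaded from a pickle created on an earlier run, when the date-based label filter in cell 3 kept a slightly larger dataset. Until the vectorized features are regenerated from the current split (set `use_pickles_master = False` and rerun), the held-out log-loss is unknown, and the model should not be considered ready for production on the strength of its cross-validation score alone.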
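
A mismatch like the `[499, 495]` reported in the traceback can be caught earlier, and with a clearer message, by comparing row counts right after cached features are loaded. The sketch below is illustrative only: `check_same_rows` is a hypothetical helper, and the arrays are zero-filled stand-ins whose sizes are taken from the traceback (which side had 499 rows is an assumption).

```python
import numpy as np

def check_same_rows(features, labels, name="validation"):
    """Raise a descriptive error if a feature matrix and label array are misaligned."""
    n_feature_rows = features.shape[0]
    n_labels = len(labels)
    if n_feature_rows != n_labels:
        raise ValueError(
            f"{name}: {n_feature_rows} feature rows but {n_labels} labels; "
            "the cached features were probably built from a different split"
        )

# Stand-ins with the two sizes from the traceback; the assignment of sizes is assumed.
stale_features = np.zeros((499, 3))
current_labels = np.zeros(495, dtype=int)

try:
    check_same_rows(stale_features, current_labels)
    message = ""
except ValueError as err:
    message = str(err)
    print(message)
```

Calling such a check right after each `pickle.load` would flag stale caches before any model is scored against them.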