# Capstone Project - Predicting Tags on a Corpus of Research Papers

The Mercatus Center at George Mason University is a university research center that provides economic research and education. Mercatus employs a fairly large staff of media and policy outreach professionals who must digest and communicate economic findings to media outlets and political staffers in Washington, DC. 

Part of this outreach is the maintenance of a website that provides full access to all research papers, testimonials, opinion editorials, summaries, and other outputs. These publications are given tags that allow users to easily find related publications. So far, Mercatus has assigned tags somewhat haphazardly, entrusting junior staffers to choose tags with a preference for preexisting labels. 

The goal of this project is to clean the current corpus of tagged documents and then, using machine learning techniques available from the scikit-learn Python library, produce a machine learning model that can predict the most relevant tags for a new document. 

I have written a scraper that has crawled the Mercatus website for documents and saved each one, along with some useful metadata. The following notebook preprocesses this data and then fits a machine learning model that can be pickled and used later to produce actionable tag suggestions for Mercatus outreach staff.

In [2]:
import pandas as pd
import datetime
import numpy as np
import nltk
import string

%matplotlib inline

## Cleaning the data

First I use the `load_files()` function from `sklearn.datasets` to load a classified dataset: a directory whose subdirectories are named after labels and contain the text files for that label. 

I then build a dataframe from attributes of the object returned by `load_files()` and from fields parsed out of each filename, and drop rows that contain missing data.
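
As a small illustration of the directory layout that `load_files()` expects, the sketch below builds a throwaway tree with two label folders and loads it. The folder names, file names, and texts are made up for the example and are not part of the Mercatus corpus.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a temporary tree: one subdirectory per label (hypothetical names).
with tempfile.TemporaryDirectory() as root:
    for label, text in [("healthcare", "hospital pricing"), ("trade", "tariff schedules")]:
        label_dir = Path(root) / label
        label_dir.mkdir()
        (label_dir / "doc1.txt").write_text(text, encoding="utf-8")

    # encoding="utf-8" makes load_files return str instead of bytes.
    toy = load_files(root, encoding="utf-8", shuffle=False)
    # Pair each document with the name of the folder it was read from.
    toy_pairs = sorted(zip((toy.target_names[t] for t in toy.target), toy.data))

print(toy_pairs)
```

Each document is read once per folder it appears in, which is why a paper carrying several tags shows up as several rows in the dataframe.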

In [3]:
from sklearn.datasets import load_files

trainer = load_files('data')

df = pd.DataFrame()
df['filename'] = trainer.filenames
df['text'] = [open(f, 'r').read() for f in trainer.filenames]
df['label'] = [trainer.target_names[x] for x in trainer.target]
df['author'] = [x.split('__')[0].split('--')[0].split('/')[-1] for x in df['filename']]
df['date'] = pd.to_datetime([x.split('__')[-1].split('-')[-1].split('.txt')[0].strip() for x in df['filename']], 
                            errors='coerce')

df.dropna(inplace=True)

print(f"Our dataset contains {len(df.groupby('text').count())} unique documents, with",
     f"{len(df)/len(df.groupby('text').count())} labels per document, for a total of",
     f"{len(df)} documents spread over {len(df.groupby('label').count())} labels.")

Our dataset contains 1630 unique documents, with 3.198159509202454 labels per document, for a total of 5213 documents spread over 130 labels.


Our dataset contains a large number of labels, and some have not been used in many years, while others have only been used a few times. 

The following code normalizes the `date` column to second-precision datetime objects, then takes the most recent date for each `label` to keep only labels that have been active in the past four years (1460 days). 

We then remove labels that contain fewer than six documents.

In [4]:
df['date'] = df['date'].astype(np.int64) // 10 ** 9
df['date'] = pd.to_datetime(df['date'], unit='s')

df_active_labels = df[['label','date']].groupby('label').max().sort_values('date',ascending=True).reset_index()
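# Compare against a pandas Timestamp; a datetime64 column cannot be ordered against datetime.date.
cutoff = pd.Timestamp.today().normalize() - pd.Timedelta(days=1460)
df_active_labels = df_active_labels[df_active_labels.date > cutoff]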
df = df[df['label'].isin(df_active_labels.label)]

df_large_categories = df.groupby('label').count()[df.groupby('label').count()['filename'] > 5].reset_index()[['label','filename']]
df = df[df['label'].isin(df_large_categories.label)]

print(f"Our refined dataset contains {len(df.groupby('text').count())} unique documents,",
     f"for a total of",
     f"{len(df)} documents spread over {len(df.groupby('label').count())} labels.")

Our refined dataset contains 1615 unique documents, for a total of 4894 documents spread over 91 labels.
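
To make the two filters concrete, here is a minimal sketch on a toy frame. The labels, dates, fixed cutoff, and the threshold of two documents are all stand-ins chosen for the example; the notebook itself uses today minus 1460 days and a threshold of six.

```python
import pandas as pd

toy = pd.DataFrame({
    "label": ["a", "a", "a", "b", "c", "c"],
    "date": pd.to_datetime(["2020-01-01", "2021-06-01", "2021-07-01",
                            "2021-05-01", "2010-01-01", "2011-01-01"]),
})
cutoff = pd.Timestamp("2019-01-01")  # stand-in for "today minus 1460 days"
min_docs = 2                         # stand-in for the notebook's threshold of six

last_used = toy.groupby("label")["date"].max()  # most recent use of each label
counts = toy.groupby("label").size()            # documents per label

# Keep labels that are both recently used and large enough.
recent = last_used[last_used > cutoff].index
large = counts[counts >= min_docs].index
filtered = toy[toy["label"].isin(recent.intersection(large))]

print(sorted(filtered["label"].unique()))
```

Label `b` is recent but too small, and label `c` is large enough but stale, so only `a` survives. Because the recency filter removes whole labels, applying the two filters one after the other gives the same result as intersecting them.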


Now we can start working on our data. First we encode our labels and split our training and validation documents.

In [5]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

le = preprocessing.LabelEncoder()
y = le.fit_transform(df.label)

xtrain, xvalid, ytrain, yvalid = train_test_split(df.text.tolist(), y, 
                                                  stratify=y, 
                                                  random_state=42, 
                                                  test_size=0.1)
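
The sketch below shows, on made-up labels, what the encoder and the `stratify` argument do: labels become integers in sorted order, and the validation split keeps the label proportions of the full set. Note that stratified splitting needs at least two documents per label, which the size filter above guarantees.

```python
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data: two balanced, hypothetical labels.
labels = ["trade"] * 10 + ["healthcare"] * 10
texts = [f"document {i}" for i in range(len(labels))]

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels)  # classes are sorted: healthcare -> 0, trade -> 1

x_tr, x_va, y_tr, y_va = train_test_split(
    texts, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# inverse_transform maps the integer codes back to the original tag names.
print(Counter(encoder.inverse_transform(y_va)))
```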

Next we define two functions: one that stems tokens with the `SnowballStemmer`, and one that tokenizes and filters the text before stemming.

In [6]:
from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

stemmer = SnowballStemmer('english')

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    tokens = [i for i in tokens if all(j.isalpha() or j in string.punctuation for j in i)]
    tokens = [i for i in tokens if '/' not in i]
    stems = stem_tokens(tokens, stemmer)
    return stems
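
To see what the three filters in `tokenize` keep and drop, the sketch below applies them to a hand-written token list, so it runs without NLTK or its tokenizer data. The tokens are made up and stand in for the output of `nltk.word_tokenize`; stemming is left out.

```python
import string

# Pre-split tokens standing in for nltk.word_tokenize output (hypothetical values).
tokens = ["Taxes", ",", "2019", "e-commerce", "and/or", "growth", "..."]

# 1. Drop tokens found inside string.punctuation. This is a substring test,
#    so "," is dropped but "..." is not (the punctuation string has no "..").
step1 = [t for t in tokens if t not in string.punctuation]

# 2. Keep tokens made only of letters and punctuation characters (drops "2019").
step2 = [t for t in step1 if all(c.isalpha() or c in string.punctuation for c in t)]

# 3. Drop tokens containing a slash (drops "and/or").
step3 = [t for t in step2 if "/" not in t]

print(step3)
```

Multi-character punctuation such as an ellipsis passes all three filters, which may or may not be intended; it is later subject to the vectorizers' document-frequency limits like any other token.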

Now I initialize two vectorizers, a `CountVectorizer` and a `TfidfVectorizer`. Both use 1-gram and 2-gram features capped at 200,000, and both impose a minimum document frequency of three documents and a maximum document frequency of 70% of the corpus.

In [None]:
%%time
# %%time (rather than %%timeit) runs the cell once and keeps the variables it defines.

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=3, max_df=0.7, max_features=200000, tokenizer=tokenize,
            strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
            ngram_range=(1, 2), stop_words='english')

vect.fit(list(xtrain) + list(xvalid))
# These names match the ones used by the model cells below.
xtrain_cv = vect.transform(xtrain)
xvalid_cv = vect.transform(xvalid)
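
The effect of `ngram_range`, `min_df`, and `max_df` is easier to see on a tiny corpus. The sketch below uses three made-up documents and a `min_df` of two (instead of three) so that something survives; it relies on the default tokenizer rather than the custom one above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["tax policy reform", "tax policy analysis", "health policy"]

# Unigrams and bigrams; keep terms found in at least 2 documents
# and in at most 70% of the corpus.
toy_vect = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.7)
toy_vect.fit(docs)

vocabulary = sorted(toy_vect.vocabulary_)
print(vocabulary)
```

Here "policy" appears in every document and is removed by `max_df`, terms that occur only once are removed by `min_df`, and the bigram "tax policy" is kept alongside the unigram "tax".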

In [None]:
%%time
# %%time (rather than %%timeit) runs the cell once and keeps the variables it defines.

from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3, max_df=0.7, max_features=200000, tokenizer=tokenize,
            strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
            ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True,
            stop_words='english')

tfv.fit(list(xtrain) + list(xvalid))
# These names match the ones used by the model cells below.
xtrain_tfv = tfv.transform(xtrain)
xvalid_tfv = tfv.transform(xvalid)
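
For reference, the weighting selected by `use_idf`, `smooth_idf`, and `sublinear_tf` can be worked through by hand. The sketch below follows the formulas given in the scikit-learn documentation (smoothed idf, logarithmic term frequency, then L2 normalization); the corpus size and counts are invented for the example.

```python
import math

n_docs = 4                            # hypothetical corpus size
doc_freq = {"tax": 4, "tariff": 1}    # number of documents containing each term
term_count = {"tax": 5, "tariff": 1}  # raw counts inside one document

def smooth_idf(df, n):
    # smooth_idf=True: idf = ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

def sublinear_tf(tf):
    # sublinear_tf=True: tf is replaced by 1 + ln(tf)
    return 1 + math.log(tf)

raw = {t: sublinear_tf(term_count[t]) * smooth_idf(doc_freq[t], n_docs) for t in term_count}

# The default norm='l2' rescales each document vector to unit length.
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {t: w / norm for t, w in raw.items()}
print(weights)
```

A term that occurs in every document still gets an idf of 1 rather than 0, and the logarithm keeps a term repeated five times from counting five times as much as a term used once.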

## Model tests

The goal is to provide users with the most relevant tags for a particular document. To accomplish this, I use `GridSearchCV` to tune hyperparameters for a variety of models, and then choose the best one. These models include `LogisticRegression`, `DecisionTreeClassifier`, `RandomForestClassifier`, `KNeighborsClassifier`, and `LinearSVC`. These models will be tried using both tf-idf and count vectorizations. 
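
As a minimal sketch of the search pattern used in the cells below, here is `GridSearchCV` with `scoring='neg_log_loss'` on scikit-learn's bundled iris data, which stands in for the vectorized documents. The `max_iter` and `cv` values are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.5, 1, 5]},
    scoring="neg_log_loss",  # negated log loss, so higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Because the scorer is negated, `best_score_` is always zero or below, and the candidate with the score closest to zero wins. Log loss also requires `predict_proba`, so it only applies to models that output class probabilities.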

#### LogisticRegression

Tests for tf-idf and count vectorizations:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Testing LogisticRegression on tf-idf vectorized features.

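# dual=True is only implemented for the liblinear solver with an l2 penalty;
# under other solvers those candidates fail to fit.
lr_tf = LogisticRegression()
params = {'C':[0.5, 1, 5], 'dual':[False, True], 'n_jobs':[-1], 'random_state':[42]}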

# Initialize and fit.

gscv_lr_tf = GridSearchCV(lr_tf, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
gscv_lr_tf.fit(xtrain_tfv, ytrain)

# Display LogisticRegression/tf-idf results.

print(f'Best score of {gscv_lr_tf.best_score_} found with the following parameters:',
      gscv_lr_tf.best_params_)
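
To help read the score printed above, the sketch below computes mean log loss by hand on two made-up predictions and then for a model that spreads its probability evenly over 91 labels, the number of labels left after cleaning. The helper `mean_log_loss` is written for this example and is not a scikit-learn function.

```python
import math

def mean_log_loss(true_indices, probabilities):
    """Average of -ln(probability assigned to the true class)."""
    total = 0.0
    for idx, row in zip(true_indices, probabilities):
        total += -math.log(row[idx])
    return total / len(true_indices)

# Two hypothetical predictions over three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
truth = [0, 1]
loss = mean_log_loss(truth, probs)

# Baseline: uniform probability over 91 labels gives a loss of ln(91).
uniform_loss = mean_log_loss([0], [[1 / 91] * 91])

print(-loss, -uniform_loss)  # negated, matching the neg_log_loss convention
```

A `best_score_` that is not clearly above the uniform baseline would suggest the model has learned little about the tags.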

In [None]:
# Testing LogisticRegression on count vectorized features.

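# dual=True is only implemented for the liblinear solver with an l2 penalty;
# under other solvers those candidates fail to fit.
lr_cv = LogisticRegression()
params = {'C':[0.5, 1, 5], 'dual':[False, True], 'n_jobs':[-1], 'random_state':[42]}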

# Initialize and fit.

gscv_lr_cv = GridSearchCV(lr_cv, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
gscv_lr_cv.fit(xtrain_cv, ytrain)

# Display LogisticRegression/countvectorizer results.

print(f'Best score of {gscv_lr_cv.best_score_} found with the following parameters:',
      gscv_lr_cv.best_params_)

#### DecisionTreeClassifier

Tests for tf-idf and count vectorizations:

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Testing DecisionTreeClassifier on tf-idf vectorized features.

dtc_tf = DecisionTreeClassifier()
params = {'criterion':['gini','entropy'], 'min_samples_split':[2,4,6], 'random_state':[42]}

# Initialize and fit.

gscv_dtc_tf = GridSearchCV(dtc_tf, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
gscv_dtc_tf.fit(xtrain_tfv, ytrain)

# Display DecisionTreeClassifier/tf-idf results.

print(f'Best score of {gscv_dtc_tf.best_score_} found with the following parameters:',
      gscv_dtc_tf.best_params_)

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Testing DecisionTreeClassifier on count vectorized features.

dtc_cv = DecisionTreeClassifier()
params = {'criterion':['gini','entropy'], 'min_samples_split':[2,4,6], 'random_state':[42]}

# Initialize and fit.

gscv_dtc_cv = GridSearchCV(dtc_cv, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
gscv_dtc_cv.fit(xtrain_cv, ytrain)

# Display DecisionTreeClassifier/count results.

print(f'Best score of {gscv_dtc_cv.best_score_} found with the following parameters:',
      gscv_dtc_cv.best_params_)

#### RandomForestClassifier

Tests for tf-idf and count vectorizations:

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Testing RandomForestClassifier on tf-idf vectorized features.

rfc_tf = RandomForestClassifier()
params = {'criterion':['gini','entropy'], 'min_samples_split':[2,4,6], 'random_state':[42]}

# Initialize and fit.

gscv_rfc_tf = GridSearchCV(rfc_tf, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
gscv_rfc_tf.fit(xtrain_tfv, ytrain)

# Display RandomForestClassifier/tf-idf results.

print(f'Best score of {gscv_rfc_tf.best_score_} found with the following parameters:',
      gscv_rfc_tf.best_params_)

In [None]:
# Testing RandomForestClassifier on count vectorized features.

rfc_cv = RandomForestClassifier()
params = {'criterion':['gini','entropy'], 'min_samples_split':[2,4,6], 'random_state':[42]}

# Initialize and fit.

gscv_rfc_cv = GridSearchCV(rfc_cv, param_grid=params, scoring='neg_log_loss', n_jobs=-1)
gscv_rfc_cv.fit(xtrain_cv, ytrain)

# Display RandomForestClassifier/count results.

print(f'Best score of {gscv_rfc_cv.best_score_} found with the following parameters:',
      gscv_rfc_cv.best_params_)

### Conclusion

As the test results show, LogisticRegression on the tf-idf vectorized data returns the most reliable results. *write more when more results are available*
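
Since the stated aim is a model that can be pickled and reused, one possible way to persist it is sketched below: the vectorizer and classifier are wrapped in a single pipeline so they are saved and restored together. The documents, tags, and default settings here are made up for the example and are not the tuned model from this notebook.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus with hypothetical tags.
docs = ["tariff trade import", "hospital insurance premium",
        "trade export tariff", "insurance hospital care"]
tags = ["trade", "healthcare", "trade", "healthcare"]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(docs, tags)

# Round-trip through pickle; in practice the bytes would be written to a file.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

print(restored.predict(["tariff on import"]))
```

One caveat: pickle stores functions by reference, so a vectorizer built with a custom `tokenizer` function can only be unpickled where that function is importable under the same name.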

*Figure out how to work this example in* - Just to satisfy my curiosity, I ran a test on a document that Mercatus recently published and that is not in the corpus:

In [None]:
with open('test_doc.txt', 'r') as f:
    text = f.read()

# Assumption: `rfc` is the tuned tf-idf random forest from the grid search above.
rfc = gscv_rfc_tf.best_estimator_

tf_test = tfv.transform([text])
text_pred = rfc.predict_proba(tf_test)

# predict_proba columns follow rfc.classes_, so decode those to tag names.
test_df = pd.DataFrame({'label': le.inverse_transform(rfc.classes_),
                        'probability': text_pred[0]})
test_df.sort_values('probability', ascending=False).reset_index(drop=True)
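
For outreach staff, a short list of suggestions is likely more useful than the full table. The sketch below shows one way to keep only the top few tags; `top_k_tags` is a hypothetical helper, and the labels and probabilities are invented for the example.

```python
def top_k_tags(labels, probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical output of predict_proba for one document.
labels = ["healthcare", "trade", "regulation", "monetary policy"]
probabilities = [0.10, 0.55, 0.30, 0.05]

suggestions = top_k_tags(labels, probabilities, k=2)
print(suggestions)
```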