## Predict on Test Set 

<b>*Note*</b>: I made two versions of the voting ensemble chunking classifier: one trained on 50,000 data points (~25% of the training dataset) and one trained on the full training dataset. The full-dataset version is more than twice the size of the 50,000-point version and takes far longer to run. Judging from the F1 scores of the other four classifiers after training on 90% of the training dataset, the performance gain from the extra data is marginal (~1%). Each of those four classifiers, however, is smaller than the voting ensemble trained on 50,000 data points. As a point of comparison, the logistic regression model trained on 90% of the dataset is also provided, because the boxplot of accuracies from 10-fold cross-validation indicates that its accuracy is similar to that of the voting ensemble.

This script assumes that the test dataset is named `chunking_test.txt`, that it sits in a folder called `Datasets` (the folder in which the training data was placed), and that the script is in the parent of that folder. If any of these assumptions does not hold, change the file path passed to `predict_on_test_set`.
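
If the layout differs, the path can be assembled from its parts rather than hard-coded. The following is a minimal sketch, assuming a hypothetical helper named `resolve_test_file` (not part of the original script) whose defaults mirror the assumptions above:

```python
from pathlib import Path

def resolve_test_file(folder='Datasets', filename='chunking_test.txt', base='.'):
    """Build the path to the test file; every argument is an adjustable assumption."""
    # base is the folder that contains the script (the parent of `folder`)
    return Path(base) / folder / filename

print(resolve_test_file().as_posix())  # Datasets/chunking_test.txt
```

The result can then be handed to the prediction function as a string, for example `predict_on_test_set(str(resolve_test_file()))`.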

In [None]:
import pickle
import pandas as pd
from sklearn import metrics
from nltk import word_tokenize, pos_tag

In [None]:
# Load the voting ensemble trained on 50,000 data points.
# To use the logistic regression model instead, swap in 'clf_lr_190554.pickle'.
with open('clf_ve_50000.pickle', 'rb') as saved_classifier:
    chunking_clf = pickle.load(saved_classifier)

In [None]:
def predict_on_test_set(filepath):
    """Print a classification report for chunking_clf on a space-separated test file."""

    # One token per line: token, POS tag and chunk tag, separated by single spaces
    df = pd.read_csv(filepath, sep=' ', header=None)
    df.columns = ['token', 'pos_tag', 'chunk_tag']
    y = df['chunk_tag']
    last_index = len(df) - 1

    def features(token, index, pos_tag, chunk_tag):
        # prev_chunk is the labelled tag read from the file, not a predicted tag
        return {'token': token[index],
                'pos': pos_tag[index],
                'prev_token': '' if index == 0 else token[index-1],
                'prev_pos': '' if index == 0 else pos_tag[index-1],
                'prev_chunk': '' if index == 0 else chunk_tag[index-1],
                'next_token': '' if index == last_index else token[index+1],
                'next_pos': '' if index == last_index else pos_tag[index+1]}

    # One feature dictionary per token, in file order
    X = [features(df.token, index, df.pos_tag, df.chunk_tag)
         for index in range(len(df))]

    predicted = chunking_clf.predict(X)
    print(metrics.classification_report(y, predicted))

In [None]:
file = 'Datasets/chunking_test.txt'
predict_on_test_set(file)

## Predict on Texts

As POS tags are an important input feature for the model, the NLTK library was used to label them automatically. Other options are Stanford's CoreNLP and spaCy, but the former was decided against because it is more computationally intensive, and the latter because of issues downloading the `en_core_web_sm` model on Windows in an Anaconda environment.

As the chunk tag of the previous token is also an input feature, *sequence classification* was used. More specifically, the *consecutive classification* approach was adopted: the chunk tag predicted for the first token is used as an input when choosing the best tag for the second token, the tag predicted for the second token is used for the third, and so on.
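
The left-to-right loop behind consecutive classification can be sketched in a few lines. This is an illustration only: `greedy_tag` and `toy_predict` are hypothetical names, and `toy_predict` is a hand-written rule standing in for the trained classifier so that the example runs on its own.

```python
def greedy_tag(tokens, predict_one):
    """Tag tokens left to right, feeding each predicted tag into the next step."""
    history = []  # tags predicted so far
    for index, token in enumerate(tokens):
        prev_chunk = history[index - 1] if index > 0 else ''
        history.append(predict_one(token, prev_chunk))
    return history

def toy_predict(token, prev_chunk):
    # Hypothetical stand-in for the classifier: a capitalised token
    # continues an open noun phrase, or starts one if none is open.
    if token[0].isupper():
        return 'I-NP' if prev_chunk in ('B-NP', 'I-NP') else 'B-NP'
    return 'O'

tags = greedy_tag(['Brexit', 'Secretary', 'Dominic', 'Raab', 'quit'], toy_predict)
print(tags)  # ['B-NP', 'I-NP', 'I-NP', 'I-NP', 'O']
```

Because each decision is made once and never revisited, an early mistake can propagate to later tokens. That is the trade-off of this greedy approach.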

I compared the prediction results for two different text datasets, forum posts about cars from [scikit-learn's 20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) and three random paragraphs from [a recent BBC article](https://www.bbc.co.uk/news/uk-politics-46227046), to determine the classifier's *domain versatility*. The forum posts are a stylistic contrast to the edited text on which the classifier was trained: they are messier, containing typos, special characters, and more slang. Thus, we would expect the classifier to perform better on the news article.

In [None]:
from sklearn.datasets import fetch_20newsgroups
cars = fetch_20newsgroups(categories=['rec.autos'], remove=('headers', 'footers', 'quotes'))
forum_posts = cars.data[70:73]

In [None]:
news_paragraphs = ['She vowed to get the deal signed off in Brussels and put it to a vote of MPs.',
'It follows a string of ministerial resignations and talk of a no-confidence vote from Tory MPs.',
'Brexit Secretary Dominic Raab and Work and Pensions Secretary Esther McVey both quit earlier in protest at the withdrawal agreement, along with two junior ministers.']

In [None]:
def predict_on_texts(texts):
    
    texts = ' '.join(texts) # merge list of texts into one 'text'
    texts = texts.replace('\n',' ') # remove new lines
    tokens = word_tokenize(texts) # list of tokens
    pos_tags = [tag for _, tag in pos_tag(tokens)] # POS tags only, computed once
    chunk_tag_hist = [] # chunk tags predicted so far, one per processed token
    
    def features(token, index, pos_tags, chunk_tag_hist):
        return {'token': token[index],
                'pos': pos_tags[index],
                'prev_token': '' if index == 0 else token[index-1],
                'prev_pos': '' if index == 0 else pos_tags[index-1],
                'prev_chunk': '' if index == 0 else chunk_tag_hist[index-1],
                'next_token': '' if index == len(token)-1 else token[index+1],
                'next_pos': '' if index == len(token)-1 else pos_tags[index+1]}
    
    for index in range(len(tokens)):
        X = features(tokens, index, pos_tags, chunk_tag_hist)
        predicted = chunking_clf.predict([X]) # one-item list: predict takes a batch of samples
        chunk_tag_hist.append(predicted[0]) # scikit-learn classifiers output 1D NumPy arrays

    for token, predicted_tag in zip(tokens, chunk_tag_hist):
        print('%s => %s' % (token, predicted_tag))

In [None]:
predict_on_texts(forum_posts)

In [None]:
predict_on_texts(news_paragraphs)

## Appendix

A version of the `predict_on_texts` function without sequential (consecutive) classification was also produced to show how much difference the approach makes. As expected, its results are not as good: the version with sequential classification is better at recognising named entity chunks. One example of the difference:

With previous chunk tag:
Brexit => B-NP
Secretary => I-NP
Dominic => I-NP
Raab => I-NP

Without previous chunk tag:
Brexit => B-NP
Secretary => B-NP
Dominic => B-NP
Raab => B-NP

'And' is often misclassified as 'O', a problem to address in future iterations of the classifier.
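
To make the practical effect of the two outputs above concrete, the tags can be grouped into chunks. The sketch below assumes the usual B-/I-/O reading of the tags. `bio_to_chunks` is a hypothetical helper, not part of the original notebook.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into (chunk_type, text) pairs from B-/I-/O tags."""
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, chunk_type = tag.partition('-')
        if prefix == 'I' and chunk_type == current_type:
            current_tokens.append(token)  # continue the open chunk
            continue
        if current_tokens:  # close the open chunk, if any
            chunks.append((current_type, ' '.join(current_tokens)))
        current_type, current_tokens = None, []
        if prefix in ('B', 'I'):  # a stray I- tag is treated as a chunk start
            current_type, current_tokens = chunk_type, [token]
    if current_tokens:
        chunks.append((current_type, ' '.join(current_tokens)))
    return chunks

tokens = ['Brexit', 'Secretary', 'Dominic', 'Raab']
with_prev = bio_to_chunks(tokens, ['B-NP', 'I-NP', 'I-NP', 'I-NP'])
without_prev = bio_to_chunks(tokens, ['B-NP', 'B-NP', 'B-NP', 'B-NP'])
print(with_prev)     # [('NP', 'Brexit Secretary Dominic Raab')]
print(without_prev)  # [('NP', 'Brexit'), ('NP', 'Secretary'), ('NP', 'Dominic'), ('NP', 'Raab')]
```

With the previous chunk tag, the four tokens form a single noun phrase. Without it, they are split into four separate one-word noun phrases.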

In [None]:
def predict_on_texts2(texts):
    
    texts = ' '.join(texts) # merge list of texts into one 'text'
    texts = texts.replace('\n',' ') # remove new lines
    tokens = word_tokenize(texts) # list of tokens
    pos_tags = [tag for _, tag in pos_tag(tokens)] # POS tags only, computed once
    
    def features(token, index, pos_tags): # same features as before, minus 'prev_chunk'
        return {'token': token[index],
                'pos': pos_tags[index],
                'prev_token': '' if index == 0 else token[index-1],
                'prev_pos': '' if index == 0 else pos_tags[index-1],
                'next_token': '' if index == len(token)-1 else token[index+1],
                'next_pos': '' if index == len(token)-1 else pos_tags[index+1]}
    
    # No tag history is needed, so all tokens can be classified in one batch
    X = [features(tokens, index, pos_tags) for index in range(len(tokens))]
    
    predicted = chunking_clf.predict(X) 
    
    for token, predicted_tag in zip(tokens, predicted):
        print('%s => %s' % (token, predicted_tag))

In [None]:
predict_on_texts2(forum_posts)

In [None]:
predict_on_texts2(news_paragraphs)