# RegGenome challenge

RegGenome have trained a regulatory / non-regulatory (RNR) classifier based on the language used in the text of documents published by legislative bodies.

The classifier has performed very well in development, scoring over 99% accuracy.

However, after the model was deployed to production, many non-regulatory documents started being misclassified as regulatory. The new publisher web crawlers have been optimised for high recall, and the resulting large number of spurious documents is placing a heavy load on downstream systems and analysts.

Below is the notebook which builds and tests the model. Please take an hour to consider the following questions:

1. Why is the model performing worse in a production setting?
2. How could we have predicted this?
3. What strategies could we employ to improve the performance? Please consider:
    a) Technical / DS strategies
    b) Non-technical / organisational strategies

Please edit or add to the notebook to demonstrate technical strategies for questions 2 and 3a. Given the time constraint, they do not need to be fully formed, but they should demonstrate your programming ability and a grasp of the issues involved.

**Please spend no more than an hour on this challenge.**

We will spend 20-30 minutes of our interview discussing your proposals. Good luck!


In [None]:
import sys
import nltk
import sklearn
import pandas as pd
import numpy as np

nltk.download('stopwords')
nltk.download('punkt')

print('Python: {}'.format(sys.version))
print('NLTK: {}'.format(nltk.__version__))
print('Scikit-learn: {}'.format(sklearn.__version__))
print('Pandas: {}'.format(pd.__version__))
print('Numpy: {}'.format(np.__version__))

In [None]:
df = pd.read_csv('rnr-examples.csv', header=0, encoding='utf-8')

# df.info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()
print(df.head())

In [None]:
texts = df['text']
labels = df['label']
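
Before building features, it may be worth checking how the two classes are distributed, since a headline accuracy figure says little when one class dominates. The following is a minimal sketch using `collections.Counter`; the helper `class_balance` and the label values `'regulatory'` and `'non-regulatory'` are assumptions for illustration, as the actual values in `rnr-examples.csv` are not shown here.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: fraction of examples} for a non-empty iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels standing in for df['label']
example_labels = ['regulatory'] * 9 + ['non-regulatory'] * 1
balance = class_balance(example_labels)
print(balance)
```

In the notebook itself, `class_balance(labels)` or the pandas equivalent `labels.value_counts(normalize=True)` would give the same picture for the real data.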

In [None]:
from nltk.tokenize import word_tokenize

# Count how often each token occurs across all documents
all_words = nltk.FreqDist(
    word for text in texts for word in word_tokenize(text)
)

# Inspect the total number of words and the 15 most common words

print('Number of words: {}'.format(len(all_words)))
print('Most common words: {}'.format(all_words.most_common(15)))

In [None]:
# Use the 1,000 most common words as features

word_features = [word for word, _ in all_words.most_common(1000)]

def find_features(text):
    # Mark each feature word as present or absent in the text;
    # a set makes the membership test cheap
    words = set(word_tokenize(text))
    return {word: (word in words) for word in word_features}
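
`nltk.FreqDist` subclasses `collections.Counter`, so its keys come back in first-seen order rather than frequency order. The sketch below uses a plain `Counter` and a made-up token list to show how slicing the keys differs from calling `most_common`.

```python
from collections import Counter

tokens = ['rare', 'the', 'the', 'act', 'the', 'act']  # made-up tokens
freq = Counter(tokens)

first_seen = list(freq.keys())[:2]                    # insertion order
most_frequent = [w for w, _ in freq.most_common(2)]   # frequency order

print(first_seen)     # ['rare', 'the']
print(most_frequent)  # ['the', 'act']
```

On a real corpus, the first-seen slice depends on document order, so the feature vocabulary would change with the order of rows in the input file.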

In [None]:
# Extract all the features for all the texts

texts = list(zip(texts, labels))

# define a seed for reproducibility
seed = 1
np.random.seed(seed)
np.random.shuffle(texts)

# call find_features function for each document
feature_sets = [(find_features(text), label) for (text, label) in texts]
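
Once the examples are shuffled, one option is to hold some of them back so the classifier can later be scored on documents it has not seen. The following is a minimal sketch in plain Python; `split_feature_sets`, the 25% test fraction, and the toy data are hypothetical choices, not part of the original notebook.

```python
import random

def split_feature_sets(feature_sets, test_fraction=0.25, seed=1):
    """Shuffle a copy of feature_sets and split it into (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(feature_sets)
    random.Random(seed).shuffle(shuffled)   # local RNG, so the split is reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the notebook's feature_sets: (features, label) pairs
toy = [({'word': i % 2 == 0}, 'regulatory' if i % 2 == 0 else 'non-regulatory')
       for i in range(8)]
train_sets, test_sets = split_feature_sets(toy)
print(len(train_sets), len(test_sets))  # 6 2
```

A random split only tests generalisation to documents drawn from the same sources; splitting by publisher or by publication date might sit closer to production conditions, though that depends on what metadata is available.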

In [None]:
# Train Random Forest classifier

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.ensemble import RandomForestClassifier

model = SklearnClassifier(RandomForestClassifier())
model.train(feature_sets)
accuracy = nltk.classify.accuracy(model, feature_sets)*100

print("Classifier Accuracy: {}".format(accuracy))
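
The accuracy above is computed on the same `feature_sets` the model was trained on, and a single accuracy figure does not show which class the errors fall in. Since the production issue is non-regulatory documents being labelled regulatory, per-class precision and recall on held-out examples may be more informative. The following is a minimal sketch in plain Python; `precision_recall` and the label strings `'reg'` and `'non'` are hypothetical.

```python
def precision_recall(gold, predicted, positive):
    """Precision and recall for one class, from parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: two non-regulatory documents are predicted as regulatory
gold = ['reg', 'reg', 'non', 'non', 'non']
pred = ['reg', 'reg', 'reg', 'reg', 'non']
precision, recall = precision_recall(gold, pred, positive='reg')
print(precision, recall)  # 0.5 1.0
```

With the NLTK wrapper, predictions for a held-out list of `(features, label)` pairs (called `held_out` here, a hypothetical name) could be obtained with `model.classify_many([features for features, _ in held_out])`.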