## Introduction

In Part 1 of the project, we will develop and deploy a text classification ML model that predicts news categories from the text of a news article:

1. Train a text classification model. 
2. Evaluate the trained model's performance on the validation dataset. Explore tools and techniques such as behavioral/scenario-based testing to augment commonly used performance evaluation setups.
3. Build a simple web application in Python that uses the trained model to do online inference, and test this application locally.
4. [advanced] Deploy the application as a microservice in the cloud using AWS Lambda. 

Throughout the project we provide default model and system architectures that are "good defaults" for building ML systems. That said, there are many different ways to configure and scale this setup. Suggestions are always welcome.

In [None]:
# Package imports that will be needed for this exercise

import json
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

import numpy as np

In [None]:
# Global Constants

LABEL_SET = [
    'Business',
    'Sci/Tech',
    'Software and Developement',
    'Entertainment',
    'Sports',
    'Health',
    'Toons',
    'Music Feeds'
]

WORD_VECTOR_MODEL = 'glove-wiki-gigaword-100'
EPS = 0.001
SEED = 42

DIRECTORY_NAME = "data"
DOWNLOAD_URL = 'https://corise-mlops.s3.us-west-2.amazonaws.com/qconsf/dataset_part1.zip'

np.random.seed(SEED)


## Download & Load Datasets

[AG News](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) is a collection of more than 1 million news articles gathered from more than 2000 news sources by an academic news search engine. The news topic classification dataset & benchmark was first used in [Character-level Convolutional Networks for Text Classification (NIPS 2015)](https://arxiv.org/abs/1509.01626). 

The dataset has the text description (summary) of each news article along with some metadata. **For this exercise, we will use a cleaned-up subset of this dataset.**

Schema:
* Source - News publication source
* URL - URL of the news article
* Title - Title of the news article
* Description - Summary description of the news article
* Category (Label) - News category

Sample row in this dataset:
```
{
    'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
    'id': 86273,
    'label': 'Sci/Tech',
    'source': 'Voice of America',
    'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
    'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'
 }
```



In [None]:
# Download dataset from S3 bucket
# Note: you can skip this if you've already downloaded the data locally

http_response = urlopen(DOWNLOAD_URL)
zipfile = ZipFile(BytesIO(http_response.read()))
zipfile.extractall(path=DIRECTORY_NAME)

In [None]:
# Load dataset into memory
Datasets = {}
for ds in ['training', 'validation']:
    with open('{}/{}.json'.format(DIRECTORY_NAME, ds), 'r') as f:
        Datasets[ds] = json.load(f)
    print("Loaded Dataset {0} with {1} rows".format(ds, len(Datasets[ds])))

print("\nExample train row:\n")
display(Datasets['training'][0])


## Model Training

Take a look at `NewsCategoryClassifier` in `model/classifier.py` and familiarize yourself with the model architecture.

In the interest of time, we provide this default implementation of the classification model.
It is a good starting point, but feel free to extend or modify any part of the pipeline, or to explore new model architectures.

In [None]:
# Prepare data for model training & evaluation

X_train, Y_train = [], []
X_test, Y_true = [], []

for row in Datasets['training']:
    X_train.append(row['description'])
    Y_train.append(row['label'])

for row in Datasets['validation']:
    X_test.append(row['description'])
    Y_true.append(row['label'])


In [None]:
from model.classifier import NewsCategoryClassifier, WordVectorFeaturizer

classifier = NewsCategoryClassifier(
    config={
        'word_vector_model': WORD_VECTOR_MODEL,
        'word_vector_dim': 100
    },
)

classifier.fit(X_train, Y_train, verbose=True)

In [None]:
# sanity check with an example prediction

print(f"input:\n{X_test[0]}\n\n")
print(f"predictions:\n{classifier.predict_proba(X_test[0])}")


## Model Evaluation

Evaluating model performance for real-world ML applications is often complex, since different stakeholders (ML engineers, product managers, sales teams) care about the impact on different dimensions. In this exercise, we will evaluate the news category classifier in the following ways:


* **Aggregate performance metrics**: Using standard metrics to evaluate ML model performance, such as accuracy, mean squared error, or BLEU scores (depending on the ML task), is a necessary first step but far from sufficient. It serves to filter out models that are clearly suboptimal and reduces the risk of launching bad models.

* **Cohort/Slice-based performance metrics**: It is important to track model performance not just in aggregate but also for important cohorts/slices of your traffic. For example, you should track the performance of a hate speech detection model not just in aggregate but also for traffic from each country, language, etc., to understand the gaps in performance.

* **Qualitative evaluations**: Behavioral tests can help root-cause individual instances of model failure and yield insights for improving models.
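
As a rough illustration of why slice-based metrics matter, the sketch below compares aggregate accuracy with accuracy per slice. It uses only the standard library. The helper `accuracy_by_slice` and the toy labels and sources are hypothetical and are not taken from the real dataset or from `model/classifier.py`.

```python
from collections import defaultdict


def accuracy_by_slice(y_true, y_pred, slice_keys):
    """Return {slice_key: accuracy} for parallel lists of labels and slice keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, key in zip(y_true, y_pred, slice_keys):
        total[key] += 1
        correct[key] += int(truth == pred)
    return {key: correct[key] / total[key] for key in total}


# Hypothetical toy data (assumption: not drawn from the real dataset)
toy_true = ['Health', 'Health', 'Sports', 'Sports', 'Business', 'Business']
toy_pred = ['Health', 'Sports', 'Sports', 'Sports', 'Business', 'Health']
toy_source = ['Reuters', 'BBC', 'Reuters', 'BBC', 'Reuters', 'BBC']

# Aggregate accuracy hides the gap between the two sources
overall = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
per_source = accuracy_by_slice(toy_true, toy_pred, toy_source)

print(f"overall accuracy: {overall:.2f}")
for source, acc in sorted(per_source.items()):
    print(f"  {source}: {acc:.2f}")
```

The same helper could be applied to the validation rows by passing, for example, each row's `source` field as the slice key.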

In [None]:
# Run model predictions on the validation set

Y_pred = [classifier.predict_label(x) for x in X_test]

In [None]:
from sklearn.metrics import accuracy_score

print(f" Overall accuracy: {accuracy_score(Y_true, Y_pred)}")

In [None]:
""" 
[TO BE IMPLEMENTED]

Compute the precision, recall and F1 score per-class for the validation set

Hint: check out sklearn.metrics 

"""


print("NOT IMPLEMENTED YET")


In [None]:
""" 
[TO BE IMPLEMENTED]

Plot a confusion matrix for the news category classification model trained above. 
You should have a NxN matrix, where: 
matrix(i, j) = number of instances in the validation set where true_label = LABEL_SET[i] and pred_label = LABEL_SET[j]

"""
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# dummy confusion matrix, replace with the actual one
cm = np.zeros((len(LABEL_SET), len(LABEL_SET)))

# display confusion matrix -- no code change needed here
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=LABEL_SET)
disp.plot(xticks_rotation='vertical')
plt.show()


## Estimating confidence intervals of performance metrics with Bootstrap sampling

When computing model performance metrics, the margin of error can vary a lot depending on the size of the test dataset. We estimate the model’s performance on unseen data by computing it empirically on the held-out validation set, and we want to quantify the uncertainty of that estimate with a confidence interval.

If we were to measure the model’s performance on another independently collected validation dataset from the same underlying distribution, the result would be unlikely to match exactly. How different might the two plausibly be? In this exercise, we will implement bootstrap sampling to estimate the 95% confidence interval for the model's performance metrics:

1. Generate N ‘bootstrap sample’ datasets, each the same size as the original validation set. Each bootstrap sample dataset is obtained by sampling instances uniformly at random from the original validation set (with replacement).
2. On each of the bootstrap sample datasets, calculate the performance metric of choice.
3. From Step (2), you will end up with N different values. Sort them.
4. The 95% confidence interval is given by the 2.5th to the 97.5th percentile of the N sorted metric values.
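
The steps above can be sketched on toy data as follows. This is a minimal standard-library illustration and not the reference solution for the exercise below: the function name `bootstrap_accuracy_interval`, its parameters, and the toy labels are all assumptions made for the example, and accuracy is used as the metric of choice.

```python
import random


def bootstrap_accuracy_interval(y_true, y_pred, n_trials=1000, alpha=0.05, seed=0):
    """Estimate a (1 - alpha) confidence interval for accuracy via bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    # Resampling per-example hits is equivalent to resampling (true, pred) pairs
    hits = [int(t == p) for t, p in zip(y_true, y_pred)]

    scores = []
    for _ in range(n_trials):
        sample = rng.choices(hits, k=n)  # step 1: sample n items with replacement
        scores.append(sum(sample) / n)   # step 2: metric on the bootstrap sample

    scores.sort()                        # step 3: sort the n_trials values
    # Step 4: read off the lower and upper percentiles
    lo_idx = round((alpha / 2) * n_trials)
    hi_idx = min(round((1 - alpha / 2) * n_trials), n_trials - 1)
    return scores[lo_idx], scores[hi_idx]


# Hypothetical toy predictions: 80% accuracy on 100 examples and on 10 examples
large_interval = bootstrap_accuracy_interval(['A'] * 100, ['A'] * 80 + ['B'] * 20)
small_interval = bootstrap_accuracy_interval(['A'] * 10, ['A'] * 8 + ['B'] * 2)

print("n=100:", large_interval)
print("n=10: ", small_interval)
```

Running the same estimate on two toy sets with identical accuracy but different sizes makes it easy to see how the interval depends on the number of examples.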


In [None]:
from sklearn.metrics import accuracy_score


def bootstrap_distribution(Y_true, Y_pred, N):
    """ 
    [TO BE IMPLEMENTED]
    
    Implement this function that takes the following inputs:
    1. Y_true, Y_pred (list of true and predicted labels)
    2. number of bootstrap trials (N). 
    
    It should return a list (ret) of length = N, where ret[i] = accuracy metric from i-th bootstrap sampling run
    """
    bootstrap_vals = [0]*N
    return bootstrap_vals


In [None]:
import matplotlib.pyplot as plt
NUM_BOOTSTRAP = 1000

bs_vals = bootstrap_distribution(Y_true, Y_pred, NUM_BOOTSTRAP)
bs_vals = sorted(bs_vals)

print("95 percent confidence interval: [{0}, {1}]".format(
    bs_vals[25],    # 2.5th percentile
    bs_vals[975]    # 97.5th percentile
))

_ = plt.hist(bs_vals, bins='auto')
plt.show()


In [None]:
""" 
[TO BE IMPLEMENTED]

Repeat the exact same run, but now only for data points with Y_true = 'Health'.
Are the confidence intervals narrower or wider?

"""

Y_true_health = []
Y_pred_health = []

bs_vals = bootstrap_distribution(Y_true_health, Y_pred_health, NUM_BOOTSTRAP)
bs_vals = sorted(bs_vals)

print("95 percent confidence interval: [{0}, {1}]".format(
    bs_vals[25],    # 2.5th percentile
    bs_vals[975]    # 97.5th percentile
))

_ = plt.hist(bs_vals, bins='auto')
plt.show()


## Behavioral/Scenario based Tests

Unit tests play an important role in testing software for bugs, inefficiencies, and potential vulnerabilities. Can we employ a similar approach to testing ML models? This exercise introduces the concept of “behavioral testing” for machine learning models.

Minimum Functionality Tests are a class of behavioral tests and the equivalent of unit tests in software engineering: a collection of simple examples (and labels) that check a specific behavior. A recommended practice is to write minimum functionality tests for highly visible or high-cost potential failure modes, and for failure modes that you uncover during error analysis, to guard against such failures in the future.
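
To make the idea concrete before introducing any library, here is a minimal plain-Python sketch of a minimum functionality test. Everything in it is hypothetical: `toy_classifier` is a keyword-based stand-in for a trained model, and `run_minimum_functionality_test` is an illustrative helper, not part of any testing library.

```python
def toy_classifier(text):
    """Hypothetical keyword-based stand-in for a trained classifier."""
    lowered = text.lower()
    if 'meteor' in lowered or 'astronomers' in lowered:
        return 'Sci/Tech'
    if 'profits' in lowered or 'stores' in lowered:
        return 'Business'
    return 'Health'


def run_minimum_functionality_test(predict, template, fillers, expected_label):
    """Fill the template's {slot} with each filler; return the cases that fail."""
    failures = []
    for filler in fillers:
        text = template.format(slot=filler)
        predicted = predict(text)
        if predicted != expected_label:
            failures.append((text, predicted))
    return failures


# The label should not change when only the news source prefix varies
template = '{slot}: Astronomers expect the Perseid meteor shower to be one of the best in several years'
sources = ['New York Times', 'Reuters', 'AP', 'BBC']

failures = run_minimum_functionality_test(toy_classifier, template, sources, 'Sci/Tech')
print(f"{len(sources) - len(failures)}/{len(sources)} cases passed")
for text, predicted in failures:
    print(f"FAILED: predicted {predicted!r} for {text!r}")
```

A failing case is returned together with the model's prediction, which is the information needed to start root-causing the failure.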

For this exercise, we will use a popular open source library called [Checklist](https://github.com/marcotcr/checklist) to configure and run behavioral tests for our model. The goal of this exercise is to get you acquainted with the library and with the practice of testing for minimum functionality.

Useful references:
1. Getting started with Checklist: https://github.com/marcotcr/checklist
2. Creating & Running tests with Checklist: https://github.com/marcotcr/checklist/blob/9baab717e44e216697f7ef0730ee269db9ef7d5b/notebooks/tutorials/3.%20Test%20types,%20expectation%20functions,%20running%20tests.ipynb 

In [None]:
import checklist
from checklist.editor import Editor
from checklist.test_types import MFT
from checklist.pred_wrapper import PredictorWrapper

# Run some warmup code to get familiar with templates in Checklist
editor = Editor()
ret = editor.template('{first_name} is {a:profession} from {country}.',
                      profession=['lawyer', 'doctor', 'accountant'])
np.random.choice(ret.data, 3)


In [None]:
""" 
[TO BE IMPLEMENTED]

1. News Source variation:
{source}: Astronomers expect the Perseid meteor shower to be one of the best versions of the shooting star events in several years
source = ['New York Times', 'Reuters', 'AP', 'Wall Street Journal', 'Quanta', 'BBC', 'BBC UK', 'Yahoo News']
label = "Sci/Tech"

2. Company Name variation:
{company} revealed Thursday that its old recipe of adding stores is no longer a source of new profits for the company.
company = ['McDonalds', 'Starbucks', 'Chipotle', 'Krispy Creme', 'Unknown Company']
label = "Business"

3. Disease terms for healthcare news
{disease} will come under the {mask} during the charity gala event being held on Monday at 7pm
disease = ['Breast cancer', 'cancer', 'heart disease']
label = "Health"
"""

editor = Editor()
# [TO BE IMPLEMENTED]
# ret = editor.template(...)
# ret += editor.template(...)
# ret += editor.template(...)


In [None]:
# Configure and run tests

def encode_and_predict(inputs):
    preds = [classifier.predict_label(x) for x in inputs]
    print(preds)
    return preds


test = MFT(**ret, name='News classification behavioral tests')

wrapped_pp = PredictorWrapper.wrap_predict(encode_and_predict)
test.run(wrapped_pp, overwrite=True)
test.summary()


## Serialize model to prepare for deployment

In [None]:
classifier.dump('deploy/news_classifier.joblib')
