# **Introduction**

### **Problem**
Project: build a machine learning text classifier that predicts the news category from the article text.
1. Iterate on classification models of increasing complexity and performance.
2. Analyze the impact of training data size on model performance.

### **Deliverables**
1. Train an average word vector classifier and report model performance for training size = [500, 1000, 2000, 5000, 10000, 25000]
2. Train a transformer encoder classifier and report model performance for training size = [500, 1000, 2000, 5000, 10000, 25000]
3. Report the performance improvement on the test dataset from the naive dataset augmentation outlined in this notebook
4. [stretch] Experiment with advanced data augmentation techniques (a few ideas & pointers are given in the notebook below)


## **Step 1: Prereqs & Installation**

Install and import the libraries used throughout the project.

In [None]:
#Install all the required dependencies for the project
!pip install numpy
!pip install scikit-learn
!pip install gensim
!pip install sentence-transformers
!pip install matplotlib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
#Package imports that will be needed for this project
import numpy as np
import json
from collections import Counter
from sklearn.metrics import accuracy_score, f1_score
from sentence_transformers import SentenceTransformer
from gensim.utils import tokenize as gensim_tokenizer
import gensim.downloader as gensim_downloader
from sklearn.base import BaseEstimator, TransformerMixin
from pprint import pprint
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer

In [None]:
# Global Constants
LABEL_SET = [
    'Business',
    'Sci/Tech',
    'Software and Developement',
    'Entertainment',
    'Sports',
    'Health',
    'Toons',
    'Music Feeds'
]

WORD_VECTOR_MODEL = 'glove-wiki-gigaword-100'
SENTENCE_TRANSFORMER_MODEL = 'all-mpnet-base-v2'

TRAIN_SIZE_EVALS = [500, 1000, 2000, 5000, 10000, 25000]
EPS = 0.001
SEED = 0

np.random.seed(SEED)

## **Step 2: Download & Load Datasets** 

[AG News](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) is a collection of more than 1 million news articles gathered from more than 2000 news sources by an academic news search engine. The news topic classification dataset & benchmark was first used in [Character-level Convolutional Networks for Text Classification (NIPS 2015)](https://arxiv.org/abs/1509.01626). The dataset has the text description (summary) of the news article along with some metadata. **For this project, a slightly modified (cleaned up) version of this dataset was used.** 

Schema:
* Source - News publication source
* URL - URL of the news article
* Title - Title of the news article
* Description - Summary description of the news article
* Category (Label) - News category

Sample row in this dataset:
```
{
    'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
    'id': 86273,
    'label': 'Entertainment',
    'source': 'Voice of America',
    'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
    'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'
 }
```




In [None]:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

DIRECTORY_NAME = "data"
DOWNLOAD_URL = 'https://corise-mlops.s3.us-west-2.amazonaws.com/project1/agnews.zip'

def download_dataset():
    """
    Download the dataset. The zip contains three files: train.json, test.json and augment.json
    """
    http_response = urlopen(DOWNLOAD_URL)
    zipfile = ZipFile(BytesIO(http_response.read()))
    zipfile.extractall(path=DIRECTORY_NAME)

#Expensive operation so we should just do this once
download_dataset()

In [None]:
Datasets = {}

for ds in ['train', 'test', 'augment']:
    with open('data/{}.json'.format(ds), 'r') as f:
        Datasets[ds] = json.load(f)
    print("Loaded Dataset {0} with {1} rows".format(ds, len(Datasets[ds])))

print("\nExample train row:\n")
pprint(Datasets['train'][0])

print("\nExample test row:\n")
pprint(Datasets['test'][0])

Loaded Dataset train with 25000 rows
Loaded Dataset test with 50000 rows
Loaded Dataset augment with 150000 rows

Example train row:

{'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
 'id': 86273,
 'label': 'Entertainment',
 'source': 'Voice of America',
 'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
 'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'}

Example test row:

{'description': 'AP - Ellis L. Marsalis Sr., the patriarch of a family of '
                'world famous jazz musicians, including grandson Wynton '
                'Marsalis, has died. He was 96.',
 'id': 143852,
 'label': 'Entertainment',
 'source': 'Yahoo Entertainment',
 'title': 'Music Patriarch Marsalis Sr. Dies (AP)',
 'url': 'http://us.rd.yahoo.com/dailynews/rss/entertainment/*http://story.news.yahoo.com/news?tmpl

In [None]:
X_train, Y_train = [], []
X_test, Y_true = [], []
X_augment, Y_augment = [], []

for row in Datasets['train']:
    X_train.append(row['description'])
    Y_train.append(row['label'])

for row in Datasets['test']:
    X_test.append(row['description'])
    Y_true.append(row['label'])

for row in Datasets['augment']:
    X_augment.append(row['description'])
    Y_augment.append(row['label'])
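
Before training, it can help to check how the labels are distributed, since a skewed split determines what a majority-class baseline would score. The sketch below uses `Counter` (already imported above) on a small made-up label list. `label_distribution` and `toy_labels` are hypothetical names introduced for illustration; in the notebook one would pass `Y_train` or `Y_true` instead.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, fraction)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, count / total)
        for label, count in counts.most_common()
    }

# Toy labels for illustration only; not taken from the dataset.
toy_labels = ['Sports', 'Business', 'Sports', 'Health', 'Sports', 'Business']
dist = label_distribution(toy_labels)
for label, (count, fraction) in dist.items():
    print("{0:<10} {1:>3} {2:.2f}".format(label, count, fraction))
```

The fraction of the most frequent label is the accuracy a constant predictor would reach on that set, which makes it a useful reference point for the models trained later.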

## **Step 3: [Modeling part 1] Word vectors**

In [None]:
# Initialize the word vector model
word_vector_model = gensim_downloader.load(WORD_VECTOR_MODEL)

# Sanity check
print(word_vector_model.most_similar("cat"))
print(word_vector_model['cat'])

[('dog', 0.8798074722290039), ('rabbit', 0.7424426674842834), ('cats', 0.7323004007339478), ('monkey', 0.7288709878921509), ('pet', 0.7190139889717102), ('dogs', 0.7163872718811035), ('mouse', 0.6915250420570374), ('puppy', 0.6800068020820618), ('rat', 0.6641027331352234), ('spider', 0.6501135230064392)]
[ 0.23088    0.28283    0.6318    -0.59411   -0.58599    0.63255
  0.24402   -0.14108    0.060815  -0.7898    -0.29102    0.14287
  0.72274    0.20428    0.1407     0.98757    0.52533    0.097456
  0.8822     0.51221    0.40204    0.21169   -0.013109  -0.71616
  0.55387    1.1452    -0.88044   -0.50216   -0.22814    0.023885
  0.1072     0.083739   0.55015    0.58479    0.75816    0.45706
 -0.28001    0.25225    0.68965   -0.60972    0.19578    0.044209
 -0.31136   -0.68826   -0.22721    0.46185   -0.77162    0.10208
  0.55636    0.067417  -0.57207    0.23735    0.4717     0.82765
 -0.29263   -1.3422    -0.099277   0.28139    0.41604    0.10583
  0.62203    0.89496   -0.23446    0.5134

In [None]:
class WordVectorFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, dim, word_vector_model):
        self.dim = dim
        self.word_vector_model = word_vector_model
        # you can add any other params to be passed to the constructor here
    
    #estimator. Since we don't have to learn anything in the featurizer, this is a no-op
    def fit(self, X, y=None):
        return self
    
    #transformation: return the average of the word vectors of the tokens in each document
    def transform(self, X, y=None):
        """
        Goal: WordVectorFeaturizer's transform() method converts each raw text document
        into a feature vector to be passed as input to the classifier.
        """
        # Placeholder: every document is mapped to a zero vector, so the
        # classifier receives no signal from the text.
        X_t = []
        for doc in X:
            X_t.append(np.zeros(self.dim))
        return X_t
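
One possible way to compute the average word vector described in the comment above is sketched here. It is not the notebook's reference solution. The function name `average_word_vector`, the regex tokenizer, and the toy 2-dimensional vectors are all assumptions made so the example runs without downloading GloVe; the only requirement on `word_vectors` is that it supports `token in word_vectors` and `word_vectors[token]`, which a plain dict and a gensim `KeyedVectors` object both do.

```python
import re
import numpy as np

def average_word_vector(doc, word_vectors, dim):
    """Average the vectors of the in-vocabulary tokens of `doc`."""
    # Simple lowercase alphabetic tokenizer (an assumption; the notebook
    # imports gensim's tokenize, which could be used instead).
    tokens = re.findall(r"[a-z]+", doc.lower())
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        # Empty or fully out-of-vocabulary document: fall back to zeros.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional "embeddings", made up for illustration.
toy_vectors = {
    'oil': np.array([1.0, 0.0]),
    'prices': np.array([0.0, 1.0]),
}
features = np.vstack([
    average_word_vector(doc, toy_vectors, dim=2)
    for doc in ["Oil prices soar", ""]
])
print(features)
```

The zero-vector fallback matters for this dataset because some descriptions are empty strings, and averaging an empty list of vectors would otherwise produce NaNs.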

In [None]:
models = {}

for n in TRAIN_SIZE_EVALS:
    print("Evaluating for training data size = {}".format(n))
    X_train_i = X_train[:n]
    Y_train_i = Y_train[:n]

    pipeline = Pipeline([
        ('featurizer', WordVectorFeaturizer(dim=100,word_vector_model=word_vector_model)),
        ('normalization',Normalizer()),
        ('classifier', LogisticRegression(max_iter=10000))
    ])
    
    # train
    pipeline.fit(X_train_i, Y_train_i)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models[n] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))

Evaluating for training data size = 500
Accuracy on test set: 0.27502
Evaluating for training data size = 1000
Accuracy on test set: 0.27502
Evaluating for training data size = 2000
Accuracy on test set: 0.27502
Evaluating for training data size = 5000
Accuracy on test set: 0.27502
Evaluating for training data size = 10000
Accuracy on test set: 0.27502
Evaluating for training data size = 25000
Accuracy on test set: 0.27502


## **Step 4: [Modeling part 2] Pretrained Transformers**

In [None]:
# Initialize the pretrained transformer model
sentence_transformer_model = SentenceTransformer(
    'sentence-transformers/{model}'.format(model=SENTENCE_TRANSFORMER_MODEL))

# Sanity check
example_encoding = sentence_transformer_model.encode(
    "This is an example sentence",
    normalize_embeddings=True
)

print(example_encoding)



[ 2.25025937e-02 -7.82918185e-02 -2.30307318e-02 -5.10000251e-03
 -8.03404152e-02  3.91321704e-02  1.13428403e-02  3.46478494e-03
 -2.94573940e-02 -1.88930426e-02  9.47433859e-02  2.92748231e-02
  3.94859128e-02 -4.63165455e-02  2.54245717e-02 -3.21999975e-02
  6.21928461e-02  1.55592030e-02 -4.67795767e-02  5.03901429e-02
  1.46113718e-02  2.31413618e-02  1.22066764e-02  2.50695944e-02
  2.93652620e-03 -4.19822112e-02 -4.01031598e-03 -2.27843672e-02
 -7.68594909e-03 -3.31090726e-02  3.22118960e-02 -2.09992640e-02
  1.16730984e-02 -9.85074118e-02  1.77932645e-06 -2.29931492e-02
 -1.31140817e-02 -2.80222651e-02 -6.99970126e-02  2.59314626e-02
 -2.89501771e-02  8.76336247e-02 -1.20919431e-02  3.98605168e-02
 -3.31381820e-02  3.59108150e-02  3.46099250e-02  6.49783835e-02
 -3.00817546e-02  6.98188543e-02 -3.99514660e-03 -1.01598888e-03
 -3.50184701e-02 -4.36567143e-02  5.08025661e-02  4.68758158e-02
  5.39663658e-02 -4.03008647e-02  3.20136547e-03  1.36618437e-02
  3.82188335e-02 -3.23844

In [None]:
class TransformerFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, dim, sentence_transformer_model):
        self.dim = dim
        self.sentence_transformer_model = sentence_transformer_model
        # you can add any other params to be passed to the constructor here

    #estimator. Since we don't have to learn anything in the featurizer, this is a no-op
    def fit(self, X, y=None):
        return self

    #transformation: return the encoding of the document as returned by the transformer model
    def transform(self, X, y=None):
        """
        Goal: TransformerFeaturizer's transform() method converts each raw text document
        into a feature vector to be passed as input to the classifier.
        """
        # Placeholder: every document is mapped to a zero vector, so the
        # classifier receives no signal from the text.
        X_t = []
        for doc in X:
            X_t.append(np.zeros(self.dim))
        return X_t
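
A rough sketch of how the transform step could call the sentence encoder follows. It relies only on the `encode(..., normalize_embeddings=True)` call already used in the sanity check above, passing a list of documents instead of a single string. `SketchTransformerFeaturizer` and `FakeEncoder` are hypothetical names; `FakeEncoder` is a stand-in that produces made-up features so the example runs without downloading a model, and in the notebook the class would subclass `BaseEstimator` and `TransformerMixin` as the original does.

```python
import numpy as np

class SketchTransformerFeaturizer:
    """Encode each document with a sentence encoder (sketch only)."""

    def __init__(self, encoder):
        self.encoder = encoder

    def fit(self, X, y=None):
        # Nothing to learn: the encoder is pretrained.
        return self

    def transform(self, X, y=None):
        # Encode the whole batch in one call rather than one document at a time.
        return np.asarray(self.encoder.encode(list(X), normalize_embeddings=True))


class FakeEncoder:
    """Hypothetical stand-in so the sketch runs without a model download."""

    def encode(self, sentences, normalize_embeddings=False):
        # Two made-up features per sentence: character count and word count.
        vecs = np.array([[len(s), len(s.split())] for s in sentences], dtype=float)
        if normalize_embeddings:
            norms = np.linalg.norm(vecs, axis=1, keepdims=True)
            vecs = vecs / np.where(norms == 0, 1.0, norms)
        return vecs


featurizer = SketchTransformerFeaturizer(FakeEncoder())
X_t = featurizer.fit(["a b", "hello"]).transform(["a b", "hello"])
print(X_t.shape)
```

Encoding the full list at once lets the encoder batch its work, which is usually much faster than calling it per document on tens of thousands of rows.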

In [None]:
models_v2 = {}
for n in TRAIN_SIZE_EVALS:
    print("Evaluating for training data size = {}".format(n))
    X_train_i = X_train[:n]
    Y_train_i = Y_train[:n]

    pipeline = Pipeline([
        ('featurizer', TransformerFeaturizer(dim=768,sentence_transformer_model=sentence_transformer_model)),
        ('classifier', LogisticRegression(max_iter=10000))
    ])

    # train
    pipeline.fit(X_train_i, Y_train_i)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models_v2[n] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))


Evaluating for training data size = 500
Accuracy on test set: 0.27502
Evaluating for training data size = 1000
Accuracy on test set: 0.27502
Evaluating for training data size = 2000
Accuracy on test set: 0.27502
Evaluating for training data size = 5000
Accuracy on test set: 0.27502
Evaluating for training data size = 10000
Accuracy on test set: 0.27502
Evaluating for training data size = 25000
Accuracy on test set: 0.27502


## **Step 5: Report Results from previous two steps**

In [None]:
# Report results

print("Word Vector Models: ")
for train_size, result in models.items():
    print("Train size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        train_size,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))

print("Pretrained Transformer Models: ")
for train_size, result in models_v2.items():
    print("Train size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        train_size,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))

Word Vector Models: 
Train size: 500  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train size: 1000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train size: 2000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train size: 5000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train size: 10000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train size: 25000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Pretrained Transformer Models: 
Train size: 500  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train size: 1000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train size: 2000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train size: 5000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249
Train s

## **Step 6: Data Augmentation**

In this section, we explore how to augment the existing training data efficiently. This is an empirical exercise without a well-defined playbook, so this part of the project is open-ended. Let us first define what efficiency means here, and why it matters:

### Performance Gain (G):
We will measure performance gain from data augmentation as the improvement in model accuracy (reduction in the number of errors) on the test dataset defined above.

### Budget (K):
We will measure "budget" as the number of additional rows added to the original training dataset. In this project, the pool of data from which you select rows to add to your training set is `Datasets['augment']` (and downstream `X_augment`, `Y_augment`).

This data is already labeled, but in most real-world scenarios the additional data is unlabeled. To add it to your training data, you have to get it annotated, which costs time and money. This is the motivation for treating budget as a metric.

### Efficiency (E = G / K): 
Efficiency = Performance Gain (reduction in the number of errors on the test set) / Budget (number of additional rows added to the training dataset)

We want to get the maximum gain in performance, while incurring minimum annotation cost.
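
As a small worked example of the definition E = G / K, the sketch below computes efficiency from error counts. The function name `augmentation_efficiency` and the numbers are made up for illustration; they are not results from this notebook.

```python
def augmentation_efficiency(baseline_errors, augmented_errors, budget):
    """E = G / K: test errors removed per additional labeled row."""
    if budget <= 0:
        raise ValueError("budget must be a positive number of rows")
    gain = baseline_errors - augmented_errors
    return gain / budget

# Made-up numbers: 200 fewer test errors after adding 1000 rows.
print(augmentation_efficiency(baseline_errors=5000, augmented_errors=4800, budget=1000))
```

A value of 0.2 here would mean that, on average, every five annotated rows removed one test error. A negative value would indicate that the added rows hurt performance.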

In [None]:
# Naive augmentation: add the first K examples of the augment set, with no
# selection strategy, and incur annotation cost for each of them.

# In the code snippet below, we show the gain in performance from augmenting data naively
# at a few different budget values (K = 1000, 5000, 10000, 50000)

models_aug = {}

for K in [1000, 5000, 10000, 50000]:
    X_train_aug = X_train + X_augment[:K]
    Y_train_aug = Y_train + Y_augment[:K]

    pipeline = Pipeline([
        ('featurizer', WordVectorFeaturizer(dim=100,word_vector_model=word_vector_model)),
        ('normalization',Normalizer()),
        ('classifier', LogisticRegression(max_iter=10000))
    ])

    # train
    pipeline.fit(X_train_aug, Y_train_aug)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models_aug[K] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))

Accuracy on test set: 0.27502
Accuracy on test set: 0.27502
Accuracy on test set: 0.27502
Accuracy on test set: 0.27502


## **Further Data Augmentation**

In [None]:
# Examine current test errors
test_errors = []
Y_pred_i = models[25000]['test_predictions']

for idx, label in enumerate(Y_true):
    if label != Y_pred_i[idx]:
        test_errors.append((X_test[idx], label,  Y_pred_i[idx]))

print("Number of errors in the test set: {}".format(len(test_errors)))
print("Example errors: [example, true label, predicted label]")
for i in range(10):
    print(test_errors[i])

Number of errors in the test set: 36249
Example errors: [example, true label, predicted label]
("LONDON (Reuters) - The yen sank to a four-month low against  the euro and a six-week low versus the dollar on Tuesday as  investors fretted that soaring oil prices could jeopardize  Japan's economic recovery.", 'Business', 'Entertainment')
('Patent portfolio at the ready', 'Sci/Tech', 'Entertainment')
('AP - In every other city, Barry Bonds is greeted with boos and cheers, a mixture of respect, fear and derision for the best slugger of his generation.', 'Sports', 'Entertainment')
('3M Co. (MMM.N: Quote, Profile, Research) on Monday said third-quarter earnings rose 17 percent due in part to the weak dollar, but the diversified manufacturer #39;s results', 'Business', 'Entertainment')
('', 'Business', 'Entertainment')
("If you can't answer that question, please read on.", 'Business', 'Entertainment')
('Schedule: Semifinals (today), Washington vs. Stanford, 5:30 pm (tape delay on ESPN2 at 8); 

In [None]:
'''
Augmented = {}
For e in test_errors:
   1. X_nn, y_nn = k nearest neighbors to (e) from X_augment, y_augment
   2. Add each (x, y) from (X_nn, y_nn) to Augmented

Add the Augmented examples to the training set
Train the new model and record performance improvements

'''

errors_all_train = models[25000]["errors"]
for augmented_size, result_augment in models_aug.items():
    print("Augmented size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3} | Efficiency {4}".format(
        augmented_size,
        result_augment['accuracy'],
        result_augment['f1'],
        result_augment['errors'],
        # errors removed per added row, scaled by 100
        (errors_all_train - result_augment['errors']) / augmented_size * 100
    ))

Augmented size: 1000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249 | Efficiency 0.0
Augmented size: 5000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249 | Efficiency 0.0
Augmented size: 10000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249 | Efficiency 0.0
Augmented size: 50000  |  Accuracy: 0.27502  |  F1 score: 0.11864284544556164 |  Num errors: 36249 | Efficiency 0.0
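
The nearest-neighbor idea in the pseudocode above could be sketched roughly as follows. This is one possible reading under stated assumptions: documents are already embedded as row vectors, similarity is cosine similarity, and `select_neighbors` plus the toy 2-dimensional arrays are hypothetical and only for illustration.

```python
import numpy as np

def select_neighbors(error_vecs, pool_vecs, k):
    """Return sorted indices of pool rows that are among the k nearest
    (by cosine similarity) to at least one error vector."""
    def unit(m):
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        return m / np.where(norms == 0, 1.0, norms)

    sims = unit(error_vecs) @ unit(pool_vecs).T   # shape (n_errors, n_pool)
    top_k = np.argsort(-sims, axis=1)[:, :k]      # k best pool rows per error
    return sorted(set(top_k.ravel().tolist()))    # de-duplicate across errors

# Toy 2-d embeddings, made up for illustration.
error_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
pool_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0], [0.7, 0.7]])
selected = select_neighbors(error_vecs, pool_vecs, k=1)
print(selected)
```

The returned indices would then pick rows from `X_augment` and `Y_augment` to append to the training set, and the number of selected rows is the budget K. Because duplicates are removed, K can be smaller than `k` times the number of errors. In practice it is common to mine errors from a held-out validation split rather than the test set, so that the reported test accuracy stays unbiased.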
