# Text classification exercise

This tutorial segment closely follows the text classification example on the [official Ludwig webpage](https://uber.github.io/ludwig/examples/#text-classification).

# Dataset

We will be using the [Reuters dataset](https://martin-thoma.com/nlp-reuters/), a benchmark dataset for document classification. It is a multi-class, multi-label dataset (i.e. each document can belong to several classes) with 90 classes, 7769 training documents and 3019 test documents.

The training set has a vocabulary size of 35247. Even if you restrict it to words which appear at least 5 times and at most 12672 times in the training set, there are still 12017 words.
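
As a rough illustration of how such vocabulary figures can be obtained, the sketch below counts word frequencies over a list of documents and keeps only the words inside a frequency band. It assumes lowercase whitespace tokenization, which may differ from the tokenization behind the numbers above; the helper name `build_vocabulary` and the toy documents are made up for this example.

```python
from collections import Counter


def build_vocabulary(documents, min_count=1, max_count=None):
    """Return the set of words whose total frequency lies in [min_count, max_count]."""
    counts = Counter()
    for doc in documents:
        # Assumption: lowercase + whitespace tokenization.
        counts.update(doc.lower().split())
    return {
        word
        for word, n in counts.items()
        if n >= min_count and (max_count is None or n <= max_count)
    }


# Toy documents, not taken from the Reuters data.
toy_docs = [
    "the cocoa review",
    "the debt was downgraded",
    "the cocoa price rose",
]
full_vocab = build_vocabulary(toy_docs)
band_vocab = build_vocabulary(toy_docs, min_count=2, max_count=2)
print(len(full_vocab), sorted(band_vocab))  # prints: 8 ['cocoa']
```

Dropping very rare and very frequent words shrinks the vocabulary, but as the figures above show, a realistic corpus still leaves thousands of distinct words.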

## Classes and labels

```text
                          nr of documents    mean number of
       class name             train   test   words in train set
     1: earn                : 2877    1087    104.4
     2: acq                 : 1650     719    150.1
     3: money-fx            :  538     179    219.0
     4: grain               :  433     149    223.6
     5: crude               :  389     189    247.3
     6: trade               :  368     117    294.3
     7: interest            :  347     131    198.0
     8: wheat               :  212      71    225.6
```     

## Getting the Reuters dataset

In [1]:
!curl -O http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW2/reuters-allcats-6.zip
#wget http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW2/reuters-allcats-6.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1188k  100 1188k    0     0   467k      0  0:00:02  0:00:02 --:--:--  468k


In [2]:
!unzip reuters-allcats-6.zip

Archive:  reuters-allcats-6.zip
replace reuters-allcats.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
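
The run above stopped at an overwrite prompt because the CSV file already existed. A non-interactive alternative, sketched here with Python's standard `zipfile` module, overwrites existing files without asking. The helper name `extract_archive` is hypothetical.

```python
import zipfile


def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir and return the member names.

    ZipFile.extractall overwrites existing files without prompting.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Calling `extract_archive("reuters-allcats-6.zip")` in the notebook should then return a list containing `reuters-allcats.csv`, assuming the archive layout shown above.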


Check that *reuters-allcats.csv* is present in the working directory:

In [4]:
import os
os.listdir(".")

['ludwig',
 'model_definition.yaml',
 'reuters-allcats.csv',
 'reuters-allcats.json',
 'model_definition_time.yaml',
 'Untitled.ipynb',
 'reuters-allcats.hdf5',
 'results',
 '.ipynb_checkpoints',
 'environment.yaml',
 'reuters-allcats-6.zip']

## Familiarizing ourselves with the data

Before proceeding with the machine learning task, let us familiarize ourselves with the dataset.

Print the column names in the CSV (header):

In [5]:
!head -n 1 reuters-allcats.csv

class,text


The column names (`class` and `text`) are used as configuration parameters when building the model: each feature in the model definition refers to its CSV column by name.
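
Since a mismatch between the CSV header and the configured feature names would only surface once training starts, a quick check up front can save time. The following is a minimal sketch using the standard `csv` module; `missing_columns` is a hypothetical helper, and the in-memory sample merely stands in for the real file.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file))
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory stand-in for the first lines of reuters-allcats.csv.
sample = io.StringIO("class,text\nNeg-,some document text\n")
print(missing_columns(sample, ["text", "class"]))  # prints: []
```

For the real file, open it with `open("reuters-allcats.csv", newline="")` and pass the file object in place of `sample`.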

In [13]:
import pandas as pd
reuters_raw = pd.read_csv("reuters-allcats.csv")
reuters_raw.head()

Unnamed: 0,class,text
0,Neg-,2 BAHIA COCOA REVIEW SALVADOR Feb 26 - Sh...
1,Neg-,2 USX ltX DEBT DOWGRADED BY MOODYS NEW YOR...
2,Pos-earn,2 COBANCO INC ltCBCO YEAR NET SANTA CRUZ ...
3,Pos-earn,2 BROWN-FORMAN INC ltBFD 4TH QTR NET LOUIS...
4,Neg-,2 HUGHES CAPITAL UNIT SIGNS PACT WITH BEAR STE...


In [14]:
reuters_raw['class'].describe()

count     4079
unique       7
top       Neg-
freq      1929
Name: class, dtype: object

In [25]:
reuters_raw['class'].value_counts()

Neg-           1929
Pos-earn       1280
Pos-acq         790
Pos-coffee       35
Pos-gold         34
Pos-housing       7
Pos-heat          4
Name: class, dtype: int64
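
The class distribution is strongly skewed, so a useful reference point is the accuracy of a trivial model that always predicts the most frequent class. The sketch below computes it from the counts printed above; it covers the full file, so the figure for an individual train or test split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "Neg-": 1929,
    "Pos-earn": 1280,
    "Pos-acq": 790,
    "Pos-coffee": 35,
    "Pos-gold": 34,
    "Pos-housing": 7,
    "Pos-heat": 4,
}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(total, majority_class, round(baseline_accuracy, 4))  # prints: 4079 Neg- 0.4729
```

Any trained classifier should be judged against this baseline of roughly 47% rather than against zero.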

# Preparing Ludwig model definition

One can use Ludwig via CLI or Python API. Let us stick to the CLI for now.

In [30]:
%%writefile model-definition.yaml
input_features:    # Described here https://uber.github.io/ludwig/user_guide/#input-features
    -
        name: text #name of the CSV column for feature
        type: text
        level: word #word vs character level granularity
        encoder: parallel_cnn #type of NN

output_features: #https://uber.github.io/ludwig/user_guide/#output-features
    -
        name: class #name of the CSV column for label
        type: category #categorical

Writing model-definition.yaml


# Training the model

The entry point to the tool is the *ludwig* executable. We can check which arguments it takes:

In [28]:
!ludwig --help


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

usage: ludwig <command> [<args>]

Available sub-commands:
   experiment            Runs a full experiment training a model and testing it
   train                 Trains a model
   predict               Predicts using a pretrained model
   visualize             Visualizes experimental results
   collect_weights       Collects tensors containing a pretrained model weights
   collect_activations   Collects tensors for each datapoint using a pretrained model

ludwig cli runner

positional arguments:
  command     Subcommand to run

optional arguments:
  -h, --help  show this help message and exit


In [29]:
!ludwig experiment --help



usage: ludwig experiment [options]

This script trains and tests a model.

optional arguments:
  -h, --help            show this help message and exit
  --output_directory OUTPUT_DIRECTORY
                        directory that contains the results
  --experiment_name EXPERIMENT_NAME
                        experiment name
  --model_name MODEL_NAME
                        name for the model
  --data_csv DATA_CSV   input data CSV file. If it has a split column, it will
                        be used for splitting (0: train, 1: validation, 2:
                        test), otherwise the dataset will be randomly split
  --data_train_csv DATA_TRAIN_CSV
                        input train data CSV file
  --data_validation_csv DATA_VALIDATION_CSV
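
According to the help text above, a `split` column in the CSV controls which rows go to training (0), validation (1) and test (2). The sketch below adds such a column with a seeded random assignment. The 70/10/20 proportions, the seed and the helper name `add_split_column` are assumptions chosen for illustration, and the in-memory CSV stands in for the real file.

```python
import csv
import io
import random


def add_split_column(rows, proportions=(0.7, 0.1, 0.2), seed=42):
    """Return copies of the row dicts with a 'split' key: 0 train, 1 validation, 2 test."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    labelled = []
    for row in rows:
        split = rng.choices([0, 1, 2], weights=proportions)[0]
        labelled.append({**row, "split": split})
    return labelled


# In-memory stand-in for reuters-allcats.csv.
source = io.StringIO("class,text\nNeg-,doc one\nPos-earn,doc two\nPos-acq,doc three\n")
rows = add_split_column(list(csv.DictReader(source)))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["class", "text", "split"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Fixing the split in the file keeps the train, validation and test sets identical across experiments, which makes results from different model definitions easier to compare.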
            

In [1]:
!ludwig experiment --data_csv=reuters-allcats.csv --model_definition_file=model-definition.yaml 



 _         _        _      
| |_  _ __| |_ __ _(_)__ _ 
| | || / _` \ V  V / / _` |
|_|\_,_\__,_|\_/\_/|_\__, |
                     |___/ 
ludwig v0.1.1 - Experiment

  model_definition = merge_with_defaults(yaml.load(def_file))
Experiment name: experiment
Model name: run
Output path: results/experiment_run_13

ludwig_version: '0.1.1'
command: ('/anaconda2/envs/hello-ludwig/bin/ludwig experiment '
 '--data_csv=reuters-allcats.csv --model_definition_file=model-definition.yaml')
dataset_type: 'generic'
random_seed: 42
input_data: 'reuters-allcats.csv'
model_definition: {   'combiner': {'type': 'concat'},
    'input_features': [   {   'encoder': 'parallel_cnn',
                              'level': 'word',
                              'name': 'text',
   

From /anaconda2/envs/myenv/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
From /anaconda2/envs/myenv/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.

╒══════════╕
│ TRAINING │
╘══════════╛

2019-04-18 21:55:56.685862: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

Epoch   1
Training: 100%|█████████████████████████████████| 23/23 [01:34<00:00,  3.51s/it]
Evaluation train: 100%|███████████

Evaluation vali : 100%|███████████████████████████| 4/4 [00:02<00:00,  1.43it/s]
Evaluation test : 100%|███████████████████████████| 7/7 [00:06<00:00,  1.12it/s]
Took 2m 7.3636s
╒═════════╤════════╤════════════╤═════════════╕
│ class   │   loss │   accuracy │   hits_at_k │
╞═════════╪════════╪════════════╪═════════════╡
│ train   │ 0.0042 │     1.0000 │      1.0000 │
├─────────┼────────┼────────────┼─────────────┤
│ vali    │ 0.3252 │     0.9075 │      0.9871 │
├─────────┼────────┼────────────┼─────────────┤
│ test    │ 0.2386 │     0.9270 │      0.9915 │
╘═════════╧════════╧════════════╧═════════════╛
Last improvement of loss on combined happened 2 epochs ago


Epoch  10
Training: 100%|█████████████████████████████████| 23/23 [01:39<00:00,  3.61s/it]
Evaluation train: 100%|█████████████████████████| 23/23 [00:24<00:00,  1.23it/s]
Evaluation vali : 100%|███████████████████████████| 4/4 [00:03<00:00,  1.25it/s]
Evaluation test : 100%|███████████████████████████| 7/7 [00:06<00:00,  1.16i
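
The tables above report `accuracy` and `hits_at_k` side by side. Accuracy counts the samples whose highest-scoring class is the true one, while hits at k counts the samples whose true class is among the k highest-scoring classes, so it can never be lower than accuracy. The sketch below shows both in plain Python on made-up probabilities; the value k=3 is an assumption, not something read from this run.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def hits_at_k(probabilities, labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = sum(
        label in top_k_indices(row, k) for row, label in zip(probabilities, labels)
    )
    return hits / len(labels)


# Made-up predictions for 4 samples over 4 classes.
probabilities = [
    [0.70, 0.20, 0.05, 0.05],  # true class 0: top choice
    [0.10, 0.30, 0.50, 0.10],  # true class 1: second choice
    [0.40, 0.30, 0.20, 0.10],  # true class 3: ranked last
    [0.05, 0.05, 0.10, 0.80],  # true class 3: top choice
]
labels = [0, 1, 3, 3]

accuracy = hits_at_k(probabilities, labels, k=1)  # accuracy is hits at k with k=1
print(accuracy, hits_at_k(probabilities, labels, k=3))  # prints: 0.5 0.75
```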

In [6]:
import os
os.listdir("./results/experiment_run_13")

['class_probabilities.csv',
 'class_probability.csv',
 'prediction_statistics.json',
 'class_probabilities.npy',
 'class_probability.npy',
 'description.json',
 'model',
 'class_predictions.csv',
 'training_statistics.json',
 'class_predictions.npy']
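
To get a quick textual overview of *training_statistics.json* before plotting anything, one option is to walk the parsed JSON and report every list of numbers it contains. The sketch below deliberately makes no assumption about the exact layout of the file: `summarize_curves` is a hypothetical helper, and the nested example data is invented for illustration.

```python
import json


def summarize_curves(node, path=()):
    """Yield (name, length, first, last) for every non-empty list of numbers in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from summarize_curves(value, path + (key,))
    elif isinstance(node, list) and node and all(
        isinstance(v, (int, float)) for v in node
    ):
        yield "/".join(path), len(node), node[0], node[-1]


# Invented stand-in for the parsed file; the real layout may differ.
example = json.loads(
    '{"train": {"class": {"loss": [0.9, 0.4, 0.1], "accuracy": [0.6, 0.8, 0.95]}}}'
)
for name, length, first, last in summarize_curves(example):
    print(f"{name}: {length} values, {first} -> {last}")
```

For the real run, load the file with `json.load(open("results/experiment_run_13/training_statistics.json"))` and pass the result to `summarize_curves`.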

In [10]:
!ludwig visualize --visualization learning_curves --training_statistics results/experiment_run_13/training_statistics.json





## Running more experiments

Change the `encoder` in the model definition from `parallel_cnn` to a recurrent network (an RNN or an LSTM), repeat the experiment and compare the results.
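
One way to organize such a comparison is to generate one model definition per encoder and run `ludwig experiment` on each. The sketch below only renders the YAML text. The helper `model_definition_yaml` is hypothetical, and the encoder name `rnn` with the option `cell_type: lstm` is an assumption that should be checked against the user guide for your Ludwig version.

```python
def model_definition_yaml(encoder, extra=None):
    """Render a model definition for the text/class columns with the given text encoder."""
    lines = [
        "input_features:",
        "    -",
        "        name: text",
        "        type: text",
        "        level: word",
        f"        encoder: {encoder}",
    ]
    # Optional encoder-specific settings, written below the encoder line.
    for key, value in (extra or {}).items():
        lines.append(f"        {key}: {value}")
    lines += [
        "",
        "output_features:",
        "    -",
        "        name: class",
        "        type: category",
    ]
    return "\n".join(lines) + "\n"


# Assumption: `rnn` and `cell_type: lstm` are valid in your Ludwig version.
variants = {
    "parallel_cnn": model_definition_yaml("parallel_cnn"),
    "rnn_lstm": model_definition_yaml("rnn", {"cell_type": "lstm"}),
}
for name, text in variants.items():
    print(f"# model-definition-{name}.yaml")
    print(text)
```

Each rendered text can be saved to its own file and passed to `ludwig experiment` through `--model_definition_file`, as in the training command used earlier.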