<a href="https://practicalai.me"><img src="https://raw.githubusercontent.com/practicalAI/images/master/images/rounded_logo.png" width="100" align="left" hspace="20px" vspace="20px"></a>

<img src="https://cdn4.iconfinder.com/data/icons/data-analysis-flat-big-data/512/data_cleaning-512.png" width="90px" vspace="10px" align="right">

<div align="left">
<h1>Preprocessing</h1>
In this lesson, we will explore preprocessing and data loading utilities in Tensorflow + Keras, mainly focused on text-based data.

<table align="center">
  <td>
<img src="https://raw.githubusercontent.com/practicalAI/images/master/images/rounded_logo.png" width="25"><a target="_blank" href="https://practicalai.me"> View on practicalAI</a>
  </td>
  <td>
<img src="https://raw.githubusercontent.com/practicalAI/images/master/images/colab_logo.png" width="25"><a target="_blank" href="https://colab.research.google.com/github/practicalAI/practicalAI/blob/master/notebooks/09_Preprocessing.ipynb"> Run in Google Colab</a>
  </td>
  <td>
<img src="https://raw.githubusercontent.com/practicalAI/images/master/images/github_logo.png" width="22"><a target="_blank" href="https://github.com/practicalAI/practicalAI/blob/master/notebooks/basic_ml/09_Preprocessing.ipynb"> View code on GitHub</a>
  </td>
</table>

# Overview

* **Tokenizer**: data processing unit to convert text data to tokens
* **LabelEncoder**: convert text labels to tokens

# Set up

In [1]:
# Use TensorFlow 2.x
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [0]:
import os
import numpy as np
import tensorflow as tf

In [0]:
# Arguments
SEED = 1234
DATA_FILE = 'news.csv'
SHUFFLE = True
INPUT_FEATURE = 'title'
OUTPUT_FEATURE = 'category'
LOWER = True
CHAR_LEVEL = False
TRAIN_SIZE = 0.7
VAL_SIZE = 0.15
TEST_SIZE = 0.15
NUM_EPOCHS = 10
BATCH_SIZE = 32

In [0]:
# Set seed for reproducibility
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Load data

We will download the [AG News dataset](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), which consists of 120000 text samples from 4 unique classes ('Business', 'Sci/Tech', 'Sports', 'World')

In [0]:
import pandas as pd
import re
import urllib

In [0]:
# Upload data from GitHub to notebook's local drive
url = "https://raw.githubusercontent.com/practicalAI/practicalAI/master/data/news.csv"
response = urllib.request.urlopen(url)
html = response.read()
with open(DATA_FILE, 'wb') as fp:
    fp.write(html)

In [7]:
# Load data
df = pd.read_csv(DATA_FILE, header=0)
X = df[INPUT_FEATURE].values
y = df[OUTPUT_FEATURE].values
df.head(5)

Unnamed: 0,title,category
0,Wall St. Bears Claw Back Into the Black (Reuters),Business
1,Carlyle Looks Toward Commercial Aerospace (Reu...,Business
2,Oil and Economy Cloud Stocks' Outlook (Reuters),Business
3,Iraq Halts Oil Exports from Main Southern Pipe...,Business
4,"Oil prices soar to all-time record, posing new...",Business


# Preprocess data

In [0]:
def preprocess_text(text):
    """Common text preprocessing steps."""
    # Remove unwanted characters
    text = re.sub(r"[^0-9a-zA-Z?.!,¿]+", " ", text)

    # Add space between words and punctuations
    text = re.sub(r"([?.!,¿])", r" \1 ", text)
    text = re.sub(r'[" "]+', " ", text)

    # Remove whitespaces
    text = text.rstrip().strip()

    return text

In [9]:
# Preprocess the titles
df.title = df.title.apply(preprocess_text)
df.head(5)

Unnamed: 0,title,category
0,Wall St . Bears Claw Back Into the Black Reuters,Business
1,Carlyle Looks Toward Commercial Aerospace Reuters,Business
2,Oil and Economy Cloud Stocks Outlook Reuters,Business
3,Iraq Halts Oil Exports from Main Southern Pipe...,Business
4,"Oil prices soar to all time record , posing ne...",Business


<img height="45" src="http://bestanimations.com/HomeOffice/Lights/Bulbs/animated-light-bulb-gif-29.gif" align="left" vspace="5px" hspace="10px">

If you have preprocessing steps like standardization, etc. that are calculated, you need to separate the training and test set first before spplying those operations. This is because we cannot apply any knowledge gained from the test set accidentally during preprocessing/training. However for preprocessing steps like the function above where we aren't learning anything from the data itself, we can perform before splitting the data.

# Split data

In [0]:
import collections
from sklearn.model_selection import train_test_split

### Components

In [0]:
def train_val_test_split(X, y, val_size, test_size, shuffle):
    """Split data into train/val/test datasets."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, shuffle=shuffle)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=val_size, stratify=y_train, shuffle=shuffle)
    return X_train, X_val, X_test, y_train, y_val, y_test

### Operations

In [12]:
# Create data splits
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X=X, y=y, val_size=VAL_SIZE, test_size=TEST_SIZE, shuffle=SHUFFLE)
class_counts = dict(collections.Counter(y))
print (f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print (f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print (f"X_test: {X_test.shape}, y_test: {y_test.shape}")
print (f"X_train[0]: {X_train[0]}")
print (f"y_train[0]: {y_train[0]}")
print (f"Classes: {class_counts}")

X_train: (86700,), y_train: (86700,)
X_val: (15300,), y_val: (15300,)
X_test: (18000,), y_test: (18000,)
X_train[0]: PGA overhauls system for Ryder Cup points
y_train[0]: Sports
Classes: {'Business': 30000, 'Sci/Tech': 30000, 'Sports': 30000, 'World': 30000}


# Tokenizer

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer

### Operations

In [0]:
# Input vectorizer
X_tokenizer = Tokenizer(lower=LOWER,
                        char_level=CHAR_LEVEL,
                        oov_token='<UNK>')

In [15]:
# Fit only on train data
X_tokenizer.fit_on_texts(X_train)
vocab_size = len(X_tokenizer.word_index) + 1
print (f"# tokens: {vocab_size}")

# tokens: 29782


In [16]:
# Convert text to sequence of tokens
print (f"X_train[0]: {X_train[0]}")
X_train = np.array(X_tokenizer.texts_to_sequences(X_train))
X_val = np.array(X_tokenizer.texts_to_sequences(X_val))
X_test = np.array(X_tokenizer.texts_to_sequences(X_test))
print (f"X_train[0]: {X_train[0]}")
print (f"len(X_train[0]): {len(X_train[0])} characters")

X_train[0]: PGA overhauls system for Ryder Cup points
X_train[0]: [2013, 7327, 467, 5, 702, 118, 1137]
len(X_train[0]): 7 characters


<img height="45" src="http://bestanimations.com/HomeOffice/Lights/Bulbs/animated-light-bulb-gif-29.gif" align="left" vspace="5px" hspace="10px">

Checkout other preprocessing functions in the [official documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/preprocessing).

# LabelEncoder

In [0]:
import json
from sklearn.preprocessing import LabelEncoder

### Operations

In [0]:
# Output vectorizer
y_tokenizer = LabelEncoder()

In [19]:
# Fit on train data
y_tokenizer = y_tokenizer.fit(y_train)
num_classes = len(y_tokenizer.classes_)
print (f"# classes: {num_classes}")

# classes: 4


In [20]:
# Convert labels to tokens
print (f"y_train[0]: {y_train[0]}")
y_train = y_tokenizer.transform(y_train)
y_val = y_tokenizer.transform(y_val)
y_test = y_tokenizer.transform(y_test)
print (f"y_train[0]: {y_train[0]}")

y_train[0]: Sports
y_train[0]: 2


In [21]:
# Class weights
counts = collections.Counter(y_train)
class_weights = {_class: 1.0/count for _class, count in counts.items()}
print (f"class counts: {counts},\nclass weights: {class_weights}")

class counts: Counter({2: 21675, 1: 21675, 3: 21675, 0: 21675}),
class weights: {2: 4.61361014994233e-05, 1: 4.61361014994233e-05, 3: 4.61361014994233e-05, 0: 4.61361014994233e-05}


<img height="45" src="http://bestanimations.com/HomeOffice/Lights/Bulbs/animated-light-bulb-gif-29.gif" align="left" vspace="5px" hspace="10px">

Checkout the complete list of sklearn preprocessing functions in the [official documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).

# Padding

Our inputs are all of varying length but we need each batch to be uniformly shaped. Therefore, we will use padding to make all the inputs in the batch the same length. Our padding index will be 0 (note that X_tokenizer starts at index 1).

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [23]:
sample_X = np.array([[3, 89, 45]])
max_seq_len = 10
padded_sample_X = pad_sequences(sample_X, padding="post", maxlen=max_seq_len)
print (f"{sample_X} → {padded_sample_X}")

[[ 3 89 45]] → [[ 3 89 45  0  0  0  0  0  0  0]]


We will put all of these preprocessing utilities to use in the subsequent lessons.

---
<div align="center">

Subscribe to our <a href="https://practicalai.me/#newsletter">newsletter</a> and follow us on social media to get the latest updates!

<a class="ai-header-badge" target="_blank" href="https://github.com/practicalAI/practicalAI">
              <img src="https://img.shields.io/github/stars/practicalAI/practicalAI.svg?style=social&label=Star"></a>&nbsp;
            <a class="ai-header-badge" target="_blank" href="https://www.linkedin.com/company/practicalai-me">
              <img src="https://img.shields.io/badge/style--5eba00.svg?label=LinkedIn&logo=linkedin&style=social"></a>&nbsp;
            <a class="ai-header-badge" target="_blank" href="https://twitter.com/practicalAIme">
              <img src="https://img.shields.io/twitter/follow/practicalAIme.svg?label=Follow&style=social">
            </a>
              </div>

</div>