# Preprocessing
In this lesson, we will explore the preprocessing and data loading utilities in TensorFlow + Keras, focusing mainly on text data.

<div align="left">
<a href="https://github.com/madewithml/basics/blob/master/notebooks/12_Preprocessing/12_PT_Preprocessing.ipynb" role="button"><img class="notebook-badge-image" src="https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d"></a>&nbsp;
<a href="https://colab.research.google.com/github/madewithml/basics/blob/master/notebooks/12_Preprocessing/12_PT_Preprocessing.ipynb"><img class="notebook-badge-image" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</div>

# Load data

We will download the [AG News dataset](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), which consists of 120,000 text samples from 4 unique classes ('Business', 'Sci/Tech', 'Sports', 'World').

In [0]:
import numpy as np
import pandas as pd
import re
import urllib.request

In [0]:
SEED = 1234
DATA_FILE = 'news.csv'
INPUT_FEATURE = 'title'
OUTPUT_FEATURE = 'category'

In [2]:
# Set seed for reproducibility
np.random.seed(SEED)

In [0]:
# Load data from GitHub to this notebook's local drive
url = "https://raw.githubusercontent.com/madewithml/basics/master/data/news.csv"
response = urllib.request.urlopen(url)
html = response.read()
with open(DATA_FILE, 'wb') as fp:
    fp.write(html)

In [0]:
# Load data
df = pd.read_csv(DATA_FILE, header=0)
X = df[INPUT_FEATURE].values
y = df[OUTPUT_FEATURE].values
df.head(5)

# Preprocess data

In [0]:
def preprocess_text(text):
    """Common text preprocessing steps."""
    # Remove unwanted characters
    text = re.sub(r"[^0-9a-zA-Z?.!,¿]+", " ", text)

    # Add space between words and punctuations
    text = re.sub(r"([?.!,¿])", r" \1 ", text)
    text = re.sub(r" +", " ", text)

    # Strip leading and trailing whitespace
    text = text.strip()

    return text

In [0]:
# Preprocess the titles and refresh X so the splits use the cleaned text
df.title = df.title.apply(preprocess_text)
X = df[INPUT_FEATURE].values
df.head(5)

**NOTE**: If you have preprocessing steps that are calculated from the data (standardization, etc.), you need to separate the training and test sets first before applying those operations. Otherwise, knowledge from the test set can leak into preprocessing/training. However, for preprocessing steps like the function above, where we aren't learning anything from the data itself, we can apply them before splitting the data.
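
As a minimal sketch of that rule, assume a small made-up numeric feature (the arrays and variable names below are illustrative and not part of the news dataset). The mean and standard deviation are computed on the training split only and then reused for the test split:

```python
import numpy as np

# Hypothetical numeric feature, already split into train and test
feature_train = np.array([1.0, 2.0, 3.0, 4.0])
feature_test = np.array([5.0, 6.0])

# Learn the statistics from the training split only
mean = feature_train.mean()
std = feature_train.std()

# Apply the same training statistics to both splits
feature_train_std = (feature_train - mean) / std
feature_test_std = (feature_test - mean) / std

print(f"train mean: {mean}, train std: {std:.3f}")
print(f"standardized test: {feature_test_std}")
```

The test values never influence `mean` or `std`, so nothing about the test set is baked into the transformation.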

# Split data

In [0]:
import collections
from sklearn.model_selection import train_test_split

In [0]:
TRAIN_SIZE = 0.7
VAL_SIZE = 0.15
TEST_SIZE = 0.15
SHUFFLE = True

In [0]:
def train_val_test_split(X, y, val_size, test_size, shuffle):
    """Split data into train/val/test datasets."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, shuffle=shuffle)
    # Rescale val_size so it is a fraction of the full dataset,
    # not of what remains after the test split
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=val_size / (1 - test_size),
        stratify=y_train, shuffle=shuffle)
    return X_train, X_val, X_test, y_train, y_val, y_test

In [0]:
# Create data splits
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X=X, y=y, val_size=VAL_SIZE, test_size=TEST_SIZE, shuffle=SHUFFLE)
class_counts = dict(collections.Counter(y))
print (f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print (f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print (f"X_test: {X_test.shape}, y_test: {y_test.shape}")
print (f"Sample point: {X_train[0]} → {y_train[0]}")
print (f"Classes: {class_counts}")
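
When two `train_test_split` calls are chained, the second call's `test_size` is a fraction of what remains after the first call, not of the full dataset. A quick arithmetic check, using the same 0.15 values as the constants above (the other variable names are only for illustration), shows the difference:

```python
# Target fractions of the full dataset
val_size = 0.15
test_size = 0.15

# Fraction of the data left after the test split is removed
remaining = 1.0 - test_size

# Passing val_size directly takes 15% of the remainder, not of the whole
naive_val_fraction = val_size * remaining

# Rescaling by the remainder recovers the intended share of the whole
second_split_size = val_size / remaining
exact_val_fraction = second_split_size * remaining

print(f"naive: {naive_val_fraction:.4f}, rescaled: {exact_val_fraction:.4f}")
```

With these numbers the naive approach yields a validation set of 12.75% of the data, while the rescaled fraction yields 15%.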

# Tokenizer

* **Tokenizer**: data processing unit to convert text data to tokens

We could use something like torchtext, spaCy or AllenNLP, but the best tokenizer I've found so far is from Keras. In fact, the entire Keras preprocessing suite is worth checking out for text preprocessing.
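
To make the idea concrete, here is a simplified pure-Python sketch of what a word-level tokenizer does. It is an illustration only, not Keras's implementation: the function names are made up for this example, and details such as Keras's punctuation filters are ignored. It mirrors two conventions used in this lesson: index 0 is left free for padding and index 1 is reserved for the out-of-vocabulary token.

```python
import collections

def build_word_index(texts, oov_token="<UNK>"):
    """Map each word to an integer index, most frequent words first."""
    counts = collections.Counter(
        word for text in texts for word in text.lower().split())
    # Index 0 is left free for padding and index 1 goes to the OOV token
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<UNK>"):
    """Replace each word with its index, falling back to the OOV index."""
    oov_index = word_index[oov_token]
    return [[word_index.get(word, oov_index) for word in text.lower().split()]
            for text in texts]

# Fit on (hypothetical) training text only, then encode unseen text
train_texts = ["stocks rally on earnings", "stocks fall"]
word_index = build_word_index(train_texts)
sequences = texts_to_sequences(["stocks soar"], word_index)
print(word_index)
print(sequences)
```

Because "soar" never appears in the training text, it maps to the `<UNK>` index, which is why the vocabulary is fit on the training split only.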

In [0]:
# Use TensorFlow 2.x (this magic command only works in Google Colab)
%tensorflow_version 2.x
import tensorflow as tf

In [0]:
# Set seed for reproducibility
tf.random.set_seed(SEED)

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [0]:
LOWER = True
CHAR_LEVEL = False

In [0]:
# Input vectorizer
X_tokenizer = Tokenizer(lower=LOWER, char_level=CHAR_LEVEL, oov_token='<UNK>')

In [0]:
# Fit only on train data
X_tokenizer.fit_on_texts(X_train)
vocab_size = len(X_tokenizer.word_index) + 1
print (f"# tokens: {vocab_size}")

In [0]:
# Convert text to sequence of tokens
print (f"X_train[0]: {X_train[0]}")
# Sequences differ in length, so store them as object arrays
X_train = np.array(X_tokenizer.texts_to_sequences(X_train), dtype=object)
X_val = np.array(X_tokenizer.texts_to_sequences(X_val), dtype=object)
X_test = np.array(X_tokenizer.texts_to_sequences(X_test), dtype=object)
print (f"X_train[0]: {X_train[0]}")
print (f"len(X_train[0]): {len(X_train[0])} tokens")

**NOTE**: Check out the other preprocessing functions in the [official documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/preprocessing).

# LabelEncoder

* **LabelEncoder**: converts text labels to tokens (integer class indices)
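
The mapping itself is simple. As a rough pure-Python sketch (the helper name `fit_label_index` and the sample labels are made up for illustration), each distinct label gets an integer according to its sorted position:

```python
def fit_label_index(labels):
    """Assign an integer to each distinct label, in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

# Hypothetical training labels
train_labels = ["Sports", "World", "Business", "Sports", "Sci/Tech"]
label_index = fit_label_index(train_labels)
encoded = [label_index[label] for label in train_labels]
print(label_index)
print(encoded)
```

In the cells that follow, scikit-learn's `LabelEncoder` does this work for us and also keeps the sorted labels in `classes_`.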

In [0]:
from sklearn.preprocessing import LabelEncoder

In [0]:
# Output vectorizer
y_tokenizer = LabelEncoder()

In [0]:
# Fit on train data
y_tokenizer = y_tokenizer.fit(y_train)
classes = y_tokenizer.classes_
print (f"classes: {classes}")

In [0]:
# Convert labels to tokens
print (f"y_train[0]: {y_train[0]}")
y_train = y_tokenizer.transform(y_train)
y_val = y_tokenizer.transform(y_val)
y_test = y_tokenizer.transform(y_test)
print (f"y_train[0]: {y_train[0]}")

In [0]:
# Class weights
counts = collections.Counter(y_train)
class_weights = {_class: 1.0/count for _class, count in counts.items()}
print (f"class counts: {counts},\nclass weights: {class_weights}")

**NOTE**: Check out the complete list of sklearn preprocessing functions in the [official documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).

# Padding

Our inputs vary in length, but each batch needs a uniform shape. Therefore, we will use padding to make all the inputs in a batch the same length. Our padding index will be 0 (note that `X_tokenizer` starts indexing at 1, so 0 never collides with a real token).
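
The idea can be sketched in a few lines of NumPy. This is a simplified stand-in, not the Keras implementation: the helper `pad_post` is hypothetical, and it always cuts overly long sequences at the end, whereas Keras's `pad_sequences` controls that behavior through its own `truncating` argument.

```python
import numpy as np

def pad_post(sequences, max_len, pad_index=0):
    """Right-pad each sequence with pad_index up to max_len."""
    padded = np.full((len(sequences), max_len), pad_index, dtype=np.int64)
    for row, seq in enumerate(sequences):
        # Sequences longer than max_len are cut at the end in this sketch
        trimmed = seq[:max_len]
        padded[row, :len(trimmed)] = trimmed
    return padded

# Two token sequences of different lengths
batch = [[3, 89, 45], [7, 2]]
padded_batch = pad_post(batch, max_len=4)
print(padded_batch)
```

Every row now has the same length, so the batch can be stored as a single rectangular array.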

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
sample_X = np.array([[3, 89, 45]])
max_seq_len = 10
padded_sample_X = pad_sequences(sample_X, padding="post", maxlen=max_seq_len)
print (f"{sample_X} → {padded_sample_X}")

In subsequent lessons we'll work with 2D and even 3D inputs, and the same powerful `pad_sequences` function can be used for those too!

In [0]:
# 2D inputs
x = [[1, 2, 3], [1, 2, 3, 4]]
max_seq_len = max([len(seq) for seq in x])
x = pad_sequences(x, padding="post", maxlen=max_seq_len)
print (x)
print (f"shape: {x.shape}")

In [0]:
# 3D inputs
x = [ [[0, 1, 0], [1, 0, 0]],  [[1, 0, 0], [1, 0, 0], [0, 0, 1]]]
max_seq_len = max([len(seq) for seq in x])
x = pad_sequences(x, padding="post", maxlen=max_seq_len)
print (x)
print (f"shape: {x.shape}")

We will put all of these preprocessing utilities to use in the subsequent lessons.
