# SCE NLP Workshop

Hey there! Thanks for checking out my workshop. This notebook has code snippets to help you implement your own solution to the [NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started) contest on Kaggle.

[Click here]() to open this notebook in Google Colab (free GPUs!).

In [None]:
import numpy as np        # linear algebra
import pandas as pd       # data processing
from tqdm import tqdm     # progress bars
import matplotlib.pyplot as plt  # plot graphs


np.random.seed(123) # seed NumPy's global RNG for reproducibility

## data exploration

Load the training data and check out what we're working with.

In [None]:
# load training data
df_ori = pd.read_csv('../input/nlp-getting-started/train.csv')

df_ori # data frame original: contains unmodified training data

Look at samples from the training data

In [None]:
# look at 10 keywords
'''TODO'''

In [None]:
# look at 10 locations
'''TODO'''

In [None]:
# look at 10 posts
for index, text in df_ori['text'].items():  # Series.iteritems() was removed in pandas 2.0
    # repr() will print special characters as escaped
    # e.g. '\n' for newline
    
    '''TODO'''

We're going to drop 'id' and 'location': 'id' is just a row identifier, and 'location' is noisy, free-form user input. 'keyword' might be helpful, but we'll ignore that too to keep things simple.

In [None]:
# drop columns (axis=1)
df = '''TODO: drop the id and location columns'''

df.shape # returns the dimensions of the data frame

## BERT

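BERT is a transformer-based language model. It maps each token in a sentence to a contextual embedding, and the embedding of the special `[CLS]` token is commonly used to represent the whole sentence. You can think of it as transforming sentences into vectors that represent the sentence.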

We're going to use a pretrained model from [Hugging Face](https://huggingface.co/transformers/model_doc/distilbert.html) called DistilBERT. It's a smaller, distilled version of BERT that runs faster while keeping most of BERT's performance.

In [None]:
# install the Hugging Face transformers library
!pip install transformers -q

import torch # our BERT model from Hugging Face uses PyTorch

'done'

### preprocessing

BERT doesn't require much preprocessing. We just need to do the following:

1. lowercase
2. handle special characters (e.g. ü)
3. remove punctuation (we're considering each tweet as one sentence)
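
As a rough illustration of the three steps above, here is a minimal sketch that uses only the standard library. `normalize_sketch` is a hypothetical helper, not the `clean()` function used in this notebook, and it strips accents with `unicodedata` where the notebook uses `unidecode`.

```python
import re
import unicodedata


def normalize_sketch(text: str) -> str:
    '''Hypothetical helper: strip accents, drop punctuation, lowercase.'''
    # split accented characters into a base letter plus combining marks,
    # then drop the combining marks (e.g. 'ü' becomes 'u')
    decomposed = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

    # replace anything that is not an ASCII letter or whitespace with a space
    text = re.sub(r'[^a-zA-Z\s]+', ' ', text)

    # collapse runs of whitespace, trim the ends, and lowercase
    return re.sub(r'\s+', ' ', text).strip().lower()


print(normalize_sketch('Über 100 homes BURNED!!!'))  # uber homes burned
```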

In [None]:
from unidecode import unidecode # remove accents from characters
import html    # for html encoded characters (e.g. &amp;)
import re      # regular expressions

In [None]:
def clean(text: str) -> str:
    '''Normalize a text sample'''
    
    # unescape html
    text = html.unescape(text)
    
    # remove mentions
    text = re.sub(r'@\S*', r'', text)  # don't also delete the character before the '@'
    
    # remove urls
    text = re.sub(r'https?:\/\/[^\s]*', r'', text)
    
    # transliterate accented characters to ASCII (e.g. ü -> u)
    text = unidecode(text)
    
    # remove unwanted characters
    text = re.sub(r"[^a-zA-Z\s']+", r' ', text)
    
    # remove repeated apostrophes
    text = re.sub(r"(['])[']+", r'\1', text)
    
    # remove whitespace from the sides
    text = text.strip()
    
    # turn whitespace into a space
    text = re.sub(r'\s+', r' ', text)
    
    # lowercase
    text = text.lower()
    
    return text

In [None]:
# clean all our text
df['text'] = '''TODO'''

df['text'].iat[0]

### encode

1. Tokenize words into IDs in BERT's vocabulary
2. Add a `[CLS]` token at the start of each sample (its embedding is what we'll use to classify the text)
3. Add a `[SEP]` token at the end of each sentence (BERT uses it to mark sentence boundaries)
4. Pad tokens to the same length
5. Create an attention mask to ignore padding

If our samples were too long (> 512 tokens), we would have to truncate them or create a list of sentences. Luckily, tweets are quite short.
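
To make the steps above concrete, here is a toy sketch in plain Python. It is not what the Hugging Face tokenizer does internally: it splits on whitespace instead of using WordPiece, and `encode_sketch` and the tiny vocabulary are made up for illustration. The special token IDs (`[PAD]` = 0, `[UNK]` = 100, `[CLS]` = 101, `[SEP]` = 102) follow BERT's uncased vocabulary.

```python
PAD, UNK, CLS, SEP = 0, 100, 101, 102  # special token IDs in BERT's uncased vocabulary


def encode_sketch(sentences, vocab):
    '''Toy encoder: returns (input_ids, attention_mask) as lists of lists.'''
    # steps 1-3: map words to IDs and wrap each sample in [CLS] ... [SEP]
    ids = [[CLS] + [vocab.get(word, UNK) for word in s.split()] + [SEP]
           for s in sentences]

    # step 4: pad every sample to the length of the longest one
    longest = max(len(sample) for sample in ids)
    input_ids = [sample + [PAD] * (longest - len(sample)) for sample in ids]

    # step 5: 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(sample) + [0] * (longest - len(sample))
                      for sample in ids]

    return input_ids, attention_mask


toy_vocab = {'forest': 1, 'fire': 2, 'near': 3}  # hypothetical IDs
input_ids, attention_mask = encode_sketch(
    ['forest fire', 'fire near forest now'], toy_vocab)

print(input_ids[0])       # [101, 1, 2, 102, 0, 0]
print(attention_mask[0])  # [1, 1, 1, 1, 0, 0]
```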

In [None]:
from transformers import AutoTokenizer

# download a pretrained tokenizer (needs internet)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

Hugging Face has a `BertNormalizer` class that can do some of what our `clean()` function does. With a fast tokenizer (the default for `AutoTokenizer`), it is set on the underlying `tokenizers` object:

```py
from tokenizers.normalizers import BertNormalizer

tokenizer.backend_tokenizer.normalizer = BertNormalizer(
    clean_text=True, handle_chinese_chars=True,
    strip_accents=True, lowercase=True
)
```

In [None]:
def encode_text(text: list):
    '''Encodes text
    
    Arguments:
        text (list): Array of strings.
        
    Returns:
        np.ndarray: 3D array of encodings; (sample, [tokens, mask], value)
    '''
    
    encodings = tokenizer('''TODO''')

    # convert encodings into a 3D numpy array
    encodings = np.stack(
        [encodings.input_ids, encodings.attention_mask], axis=1)
    
    return encodings

In [None]:
# encode all our text
encodings = '''TODO'''
encodings.shape

Tokens are encoded IDs for words in BERT's vocabulary.

Example from [Jay Alammar](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/#how-a-single-prediction-is-calculated)

Input sentence  
`a visually stunning rumination on love`

Break words into tokens in BERT's vocabulary  
`a` `visually` `stunning` `rum` `##ination` `on` `love`

Add special tokens  
`[CLS]` `a` `visually` `stunning` `rum` `##ination` `on` `love` `[SEP]`

Encode tokens into IDs  
`101` `1037` `17453` `14726` `19379` `12758` `2006` `2293` `102`
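
The "break words into tokens" step uses WordPiece, which BERT's reference tokenizer applies greedily, longest match first. Here is a minimal sketch of that idea; `wordpiece_sketch` and the toy vocabulary are illustrative, and the real tokenizer handles more cases (punctuation, a maximum word length, and so on).

```python
def wordpiece_sketch(word, vocab):
    '''Greedy longest-match-first split of one word into vocabulary pieces.'''
    pieces = []
    start = 0

    while start < len(word):
        end = len(word)
        piece = None

        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1

        if piece is None:
            return ['[UNK]']  # no piece matched, so the whole word is unknown

        pieces.append(piece)
        start = end

    return pieces


toy_vocab = {'a', 'visually', 'stunning', 'rum', '##ination', 'on', 'love'}
print(wordpiece_sketch('rumination', toy_vocab))  # ['rum', '##ination']
```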

In [None]:
#  tokens
encodings[0][0]

The attention mask tells BERT which tokens are real (ones) and which are padded (zeros)

In [None]:
# attention mask
encodings[0][1]

### embed

Use BERT to transform the text into embeddings (vectors of numbers to represent the sentence).

In [None]:
from transformers import AutoModel

# download a pretrained model (needs internet)
bert_model = AutoModel.from_pretrained('distilbert-base-uncased')

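Get class embeddings. Each token in a sample gets a vector of 768 numbers (768 is DistilBERT's hidden size, not the number of layers). We only keep the vector for the `[CLS]` token, which stands in for the whole sample.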

Source: [Jay Alammar](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/#processing-with-distilbert)
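
To picture the shapes involved, here is a small NumPy sketch. The random array is only a stand-in for the model's output (no BERT model is run), and the batch size and sequence length are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the model output: (batch size, sequence length, hidden size)
last_hidden_state = rng.normal(size=(4, 12, 768))

# the [CLS] token is always at position 0, so keep that slice for every sample
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)  # (4, 768)
```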

In [None]:
def embed_encodings(encodings):
    '''Transform encoded samples.
    
    Arguments:
        encodings (np.ndarray): 3D array of encodings;
            (sample, [tokens, mask], value)
    
    Returns:
        numpy.ndarray: 2D array of 768 hidden states for the '[CLS]' token
            for each sample; (sample, embeddings)
    '''

    X = []

    for sample in tqdm(encodings):
        # we need our tokens and mask as a pytorch tensor
        tokens = '''TODO'''
        mask = '''TODO'''

        with torch.no_grad():
            last_hidden_states = bert_model('''TODO''')

        # we only care about class embeddings
        # .cpu() is a no-op on the CPU and is needed if the model runs on a GPU
        embeddings = last_hidden_states[0][:,0,:][0].cpu().numpy()
        
        X.append(embeddings)

    # convert X into a 2D numpy array
    X = np.stack(X, axis=0)

    return X

This step takes a while, so we'll reduce the size of our training data. Once everything is working, switch to a GPU to transform all the data.

In [None]:
# reduced data size for convenience
encodings = encodings[:500]

# match input samples
y = df[df.index < 500]
y = y['target'].to_numpy()

X = '''TODO: 2D numpy array of embeddings'''

print(X.shape, y.shape)

## split train and validation data

We need to split our training data into training and validation data. We use training data to train our model, then validation data to check our performance (and tweak our model if necessary).
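
For intuition, a random split boils down to shuffling the row indices and cutting them in two. The sketch below does that with NumPy; `split_sketch`, the 20% validation fraction, and the toy arrays are assumptions for illustration, and it is not scikit-learn's implementation (which also supports options such as stratification).

```python
import numpy as np


def split_sketch(X, y, val_fraction=0.2, seed=123):
    '''Shuffle the row indices, then cut them into train and validation.'''
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))

    n_val = int(round(len(X) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]

    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]


# toy data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

X_tr, X_va, y_tr, y_va = split_sketch(X_demo, y_demo)
print(X_tr.shape, X_va.shape, y_tr.shape, y_va.shape)  # (8, 2) (2, 2) (8,) (2,)
```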

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = map(
    lambda x: np.stack(x, axis=0),
    train_test_split('''TODO''')
)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

## ANN

We'll use an artificial neural network on the sentence embeddings to perform classification.
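
To show what such a network computes at prediction time, here is a NumPy sketch of a forward pass. The architecture is an assumption (one hidden layer of 16 ReLU units and a single sigmoid output), since the layer sizes are left for you to choose; the weights are random, and dropout is omitted because it is only active during training.

```python
import numpy as np


def forward_sketch(X, W1, b1, W2, b2):
    '''Forward pass: dense + ReLU, then dense + sigmoid.'''
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU activation
    logits = hidden @ W2 + b2              # raw scores
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid turns scores into probabilities


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 768))          # 5 fake sentence embeddings

W1 = rng.normal(scale=0.01, size=(768, 16)) # hidden layer weights (assumed size)
b1 = np.zeros(16)
W2 = rng.normal(scale=0.01, size=(16, 1))   # output layer weights
b2 = np.zeros(1)

probs_demo = forward_sketch(X_demo, W1, b1, W2, b2)
print(probs_demo.shape)  # (5, 1)
```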

In [None]:
import tensorflow as tf    # deep learning library, like PyTorch

In [None]:
def build_ann(input_shape: tuple):
    '''Builds an artificial neural network.'''
    
    model = tf.keras.models.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        
        tf.keras.layers.Dense('''TODO'''),
        tf.keras.layers.Dropout('''TODO'''), # regularization
        
        tf.keras.layers.Dense('''TODO''')
    ])
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
# build our model
ann_model = '''TODO'''

ann_model.summary()

In [None]:
# show the model as a flowchart
tf.keras.utils.plot_model(ann_model)

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# prevent overfitting
es_callback = EarlyStopping(monitor='val_loss', patience=3)
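
The idea behind early stopping with `patience=3` can be sketched in plain Python: stop once the validation loss has failed to improve for three epochs in a row. `early_stop_epoch` and the loss values are made up for illustration; this is a simplified model of the behaviour, not Keras's implementation.

```python
def early_stop_epoch(val_losses, patience=3):
    '''Return the (0-indexed) epoch at which training would stop.'''
    best = float('inf')
    waited = 0  # epochs since the last improvement

    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch

    return len(val_losses) - 1  # never triggered: ran all epochs


# hypothetical validation losses, one per epoch
losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]
print(early_stop_epoch(losses))  # 5
```

Note that training stops at epoch 5 even though epoch 6 would have been better; a larger `patience` trades longer training for a lower chance of stopping too early.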

In [None]:
history = ann_model.fit('''TODO''')

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'])
plt.show()

In [None]:
ann_model.evaluate('''TODO''')

## Submission

Enable a GPU hardware accelerator at this point (this will take a while). You'll have to rerun the cells where you imported packages, declared functions, and created the BERT models we used.

1. Clean training data set
2. Use BERT to embed text
3. Build ANN
4. Train on all samples
5. Predict labels on the test data set

In [None]:
import tensorflow as tf # check devices for TensorFlow

# use CUDA with PyTorch if available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f'using {torch.cuda.get_device_name(0)}')
else:
    device = torch.device('cpu')
    print('no GPU available')

# TensorFlow automatically uses a GPU if it can see one
print(tf.config.list_physical_devices('GPU'))

In [None]:
def preprocess_pipeline(df):
    '''Clean and encode a data set.
    
    Arguments:
        df (pandas.DataFrame): The data set, with samples in the 'text' column.
        
    Returns:
        np.ndarray: 3D array of encodings; (index, [tokens, mask], value)
    '''
    
    # clean text
    text = df['text'].apply(clean).to_list()

    # encode text
    encodings = encode_text(text)

    return encodings

In [None]:
df = '''TODO: training data'''

y = '''TODO: numpy array of labels'''

X = '''TODO: 2D numpy array of input features'''

print(X.shape, y.shape)

In [None]:
# embed text
X = '''TODO'''
X.shape

In [None]:
# build ANN
ann_model = '''TODO'''

ann_model.summary()

In [None]:
ann_model.fit('''TODO''')

Time to predict disaster tweets!

In [None]:
# load test set
df_test = pd.read_csv('../input/nlp-getting-started/test.csv')

X_test = '''TODO'''
X_test.shape

In [None]:
X_test = '''TODO'''
X_test.shape

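Our model outputs a probability for each tweet rather than a 0/1 label (assuming the output layer uses a sigmoid activation; logits are the raw scores before the sigmoid). We have to snap probabilities of 50% or less to 0 and greater than 50% to 1.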
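
The thresholding step can be sketched with the standard library alone. It assumes the model's last layer applies a sigmoid; `sigmoid` and `to_label` are illustrative names, and the raw scores are made up.

```python
import math


def sigmoid(z: float) -> float:
    '''Squash a raw score (logit) into a probability between 0 and 1.'''
    return 1.0 / (1.0 + math.exp(-z))


def to_label(prob: float, threshold: float = 0.5) -> int:
    '''Snap a probability to a 0/1 class label.'''
    return 1 if prob > threshold else 0


# hypothetical raw scores: negative scores land below 0.5, positive ones above
for z in (-2.0, 0.3, 4.0):
    p = sigmoid(z)
    print(z, round(p, 3), to_label(p))
```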

In [None]:
probs = ann_model.predict(X_test)
probs.shape

In [None]:
# threshold the probabilities at 0.5 to get 0/1 labels
pred = (probs[:, 0] > 0.5).astype(int)
pred.shape

In [None]:
df_test['target'] = pred
df_test.head()

Save your predictions to a CSV file. We only want the `id` and `target` columns. Then go back to Kaggle and submit!

In [None]:
df_test = df_test.drop(labels=['text', 'keyword', 'location'], axis=1)
df_test.to_csv('submission.csv', index=False)