[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut9_CNN_NLP_student.ipynb)

# Tutorial 9: Convolutional Neural Nets for Text Data
In this tutorial, we first explain what the `Conv2D` layer (rank-3 inputs) and the `Conv1D` layer (rank-2 inputs) do. We then use `Conv1D` to classify tweets as positive, neutral, or negative. The tweets were written by customers of several airlines.

For further examples, please visit [demos/cnn](https://github.com/Humboldt-WI/adams/tree/master/demos/cnn).

In [51]:
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
import string
import re
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

## ConvNets
Convnets are widely used in computer vision. The most common layer is `Conv2D`, which takes input tensors of shape `(height, width, channels)` plus a leading batch dimension. Let's look at a simple example.

In [3]:
# Create a sample input (batch, height, width, channels)
tf.random.set_seed(1234) # for reproducibility
ex_input = tf.concat([tf.ones((1,3,3,1)), 2*tf.ones((1,3,3,1))], axis=3 ) # (1,3,3,2)
ex_input

<tf.Tensor: shape=(1, 3, 3, 2), dtype=float32, numpy=
array([[[[1., 2.],
         [1., 2.],
         [1., 2.]],

        [[1., 2.],
         [1., 2.],
         [1., 2.]],

        [[1., 2.],
         [1., 2.],
         [1., 2.]]]], dtype=float32)>

In [4]:
# Apply a convnet with 1 filter and a kernel of size 2
cnn2D = layers.Conv2D(filters=1,kernel_size=2, input_shape=ex_input.shape[1:])
cnn2D(ex_input)

<tf.Tensor: shape=(1, 2, 2, 1), dtype=float32, numpy=
array([[[[-0.27699053],
         [-0.27699053]],

        [[-0.27699053],
         [-0.27699053]]]], dtype=float32)>

In [5]:
# Let's understand the matrix operations
kernel = cnn2D.get_weights()[0] # random initialization weights
np.sum(1*kernel[:,:,0,:])+np.sum(2*kernel[:,:,1,:]) # replicate the first output

-0.27699053
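The same operation can be written out with plain NumPy. The sketch below is an illustration rather than the Keras implementation: `conv2d_valid` is a hypothetical helper, and the kernel values are random numbers chosen for the example, not the weights of `cnn2D`. It assumes a single filter, stride 1, no padding, and a zero bias.

```python
import numpy as np

def conv2d_valid(x, kernel, bias=0.0):
    """Naive single-filter 2D convolution with no padding and stride 1.

    x:      (height, width, channels)
    kernel: (k_h, k_w, channels)
    """
    k_h, k_w, _ = kernel.shape
    out_h = x.shape[0] - k_h + 1
    out_w = x.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the patch and the kernel, summed over all axes
            patch = x[i:i + k_h, j:j + k_w, :]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

# Same input as above: channel 0 is all ones, channel 1 is all twos
x = np.stack([np.ones((3, 3)), 2 * np.ones((3, 3))], axis=-1)  # (3, 3, 2)

# Hypothetical kernel values, used only for illustration
rng = np.random.default_rng(0)
kernel = rng.normal(size=(2, 2, 2))

out = conv2d_valid(x, kernel)
print(out.shape)  # (2, 2)
print(out)
```

Because every 2x2 patch of this input is identical, all four outputs are equal, which matches the pattern in the Keras output above.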

Convnets are not restricted to rank-3 tensors of shape `(height, width, channels)`. Keras also implements `Conv3D` and `Conv1D`. Let's look at `Conv1D`, which requires a rank-2 tensor as input, such as sequence data.

In [52]:
# Input for cnn1D (batch, seq_length, emb_dim)
tf.random.set_seed(1234)
ex_input = tf.concat([tf.ones((1,1,2)), 2*tf.ones((1,1,2)), 3*tf.ones((1,1,2))], axis = 1) # (1, 3, 2)
ex_input

<tf.Tensor: shape=(1, 3, 2), dtype=float32, numpy=
array([[[1., 1.],
        [2., 2.],
        [3., 3.]]], dtype=float32)>

In [53]:
# Apply a convnet with 1 filter and a kernel of size 2
cnn1D = layers.Conv1D(filters=1,kernel_size=2, input_shape=ex_input.shape[1:])
cnn1D(ex_input)

<tf.Tensor: shape=(1, 2, 1), dtype=float32, numpy=
array([[[-0.8928499],
        [-1.4366169]]], dtype=float32)>

In [54]:
kernel = cnn1D.get_weights()[0]
kernel

array([[[ 0.07607865],
        [-0.27076268]],

       [[ 0.16326022],
        [-0.51234317]]], dtype=float32)

In [55]:
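# Replicate both outputs: each window spans two consecutive time steps
print(np.sum(1*kernel[0,:,:] + 2*kernel[1,:,:])) # window over steps 1 and 2
print(np.sum(2*kernel[0,:,:] + 3*kernel[1,:,:])) # window over steps 2 and 3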

-0.8928499
-1.4366169
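The sliding-window view can be made explicit with a short NumPy loop. This sketch copies the kernel values printed above and assumes the bias is zero, which is the Keras default at initialization. It is not how Keras computes the layer internally.

```python
import numpy as np

# Kernel values copied from the output above: (kernel_size, emb_dim, filters)
kernel = np.array([[[0.07607865], [-0.27076268]],
                   [[0.16326022], [-0.51234317]]], dtype=np.float32)

# Same input sequence as above, without the batch axis: (seq_length, emb_dim)
x = np.array([[1., 1.], [2., 2.], [3., 3.]], dtype=np.float32)

kernel_size = kernel.shape[0]
n_windows = x.shape[0] - kernel_size + 1  # 3 - 2 + 1 = 2

# Slide the kernel along the sequence axis only; it always covers all emb_dim columns
outputs = np.array([
    np.sum(x[t:t + kernel_size, :, None] * kernel)
    for t in range(n_windows)
])
print(outputs)
```

The two values agree with the `Conv1D` output up to floating-point rounding.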


# Tweet classification
The goal is to put `Conv1D` into practice. We have Twitter data from airline customers, with each tweet labelled as positive, neutral, or negative, and we want to build a model that classifies tweets. In the first part, we only consider the positive and negative tweets. In the second part, we add the neutral ones.

In [56]:
# Load data
tot_tweets = pd.read_csv("Tweets.csv.zip")
tot_tweets = tot_tweets[['airline_sentiment', 'text']]

## Positive and Negative Tweets

### Exercise 1: 
Remove the samples with the label `neutral`, create train and validation sets, and then transform them to NumPy arrays.
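
As a starting point, here is a minimal sketch on a made-up toy DataFrame that has the same two columns as `tot_tweets`. The 0/1 label encoding, `test_size`, and `random_state` are assumptions made for the example, not settings prescribed by the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the tweets DataFrame (made-up rows, same column names)
toy = pd.DataFrame({
    "airline_sentiment": ["positive", "negative", "neutral", "negative",
                          "positive", "neutral", "negative", "positive"],
    "text": ["great flight", "lost my bag", "what time is boarding",
             "delayed again", "friendly crew", "flight number please",
             "rude staff", "smooth landing"],
})

# Keep only the positive and negative tweets
binary = toy[toy["airline_sentiment"] != "neutral"]

# Encode the label as 1 (positive) / 0 (negative) and convert to NumPy arrays
y = (binary["airline_sentiment"] == "positive").astype("int32").to_numpy()
X = binary["text"].to_numpy()

# test_size and random_state are arbitrary choices for this sketch
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print(X_train.shape, X_val.shape)
```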

### Exercise 2:
Create a function to standardize the text. In particular, convert the text to lowercase, replace every character that is not a letter (a-z or A-Z) with a space, and remove punctuation and repeated spaces.
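
One possible shape for such a function is sketched below with `tf.strings` operations, so that it can operate on string tensors. The name `custom_standardization` and the sample tweet are hypothetical. Since every non-letter is replaced by a space, punctuation is removed in the same step.

```python
import tensorflow as tf

def custom_standardization(text):
    """Lowercase, keep letters only, and collapse repeated spaces."""
    text = tf.strings.lower(text)
    # Anything that is not a-z (digits, punctuation, symbols) becomes a space
    text = tf.strings.regex_replace(text, r"[^a-z]", " ")
    # Collapse runs of spaces into a single space, then trim both ends
    text = tf.strings.regex_replace(text, r" +", " ")
    return tf.strings.strip(text)

sample = tf.constant(["@VirginAmerica  It's GREAT!!"])
print(custom_standardization(sample).numpy())
# [b'virginamerica it s great']
```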

### Exercise 3:
Create a vectorization layer and apply it to the text data. Use a vocabulary of 10000 tokens and a maximum length of 50 tokens per tweet.
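
The sketch below shows how `TextVectorization` could be configured with these two settings. It uses a made-up mini corpus and the layer's default standardization. A custom function can be passed through the `standardize` argument instead.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

max_tokens = 10000   # vocabulary size
max_length = 50      # every tweet is padded or truncated to this length

# Made-up mini corpus standing in for the training tweets
toy_texts = tf.constant(["great flight", "lost my bag", "great crew great flight"])

vectorize_layer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
# Build the vocabulary from the training text only
vectorize_layer.adapt(toy_texts)

encoded = vectorize_layer(toy_texts)
print(encoded.shape)  # (3, 50)
print(vectorize_layer.get_vocabulary()[:4])
```

Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, so real words start at index 2.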

## Model `Embedding` + `Conv1D` + `MaxPooling1D` + `Flatten` + `Dense`
### Exercise 4:
Create a model with one `Embedding` layer of dimension 16, followed by a `Conv1D` layer with 32 filters, a kernel size of 8, and ReLU activation. Then apply `MaxPooling1D` with a pool size of 2, `Flatten` the output, and finish with a `Dense` layer. Can you explain the number of parameters?
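
The parameter count can be worked out by hand before calling `model.summary()`. The arithmetic below assumes a vocabulary of 10000 tokens, sequences of length 50, default `Conv1D` settings (no padding, stride 1), and a `Dense` layer with a single output unit. These are assumptions based on the earlier exercises rather than a fixed specification.

```python
# Assumed settings: vocabulary of 10000 tokens, tweets padded to length 50,
# and a single output unit for the positive/negative task.
vocab_size, seq_len = 10000, 50
emb_dim, filters, kernel_size, pool_size = 16, 32, 8, 2

embedding_params = vocab_size * emb_dim                  # one vector per token
conv_params = kernel_size * emb_dim * filters + filters  # weights + biases

conv_len = seq_len - kernel_size + 1   # no padding, stride 1
pool_len = conv_len // pool_size       # MaxPooling1D drops the remainder
flat_dim = pool_len * filters          # size of the flattened vector

dense_params = flat_dim * 1 + 1        # weights + bias

total = embedding_params + conv_params + dense_params
print(embedding_params, conv_params, dense_params, total)
# 160000 4128 673 164801
```

Pooling and flattening add no parameters. They only change the shape that reaches the `Dense` layer.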

### Exercise 5: 
Train the model using a batch size of 128 for 20 epochs and an `EarlyStopping` callback with a patience of 3. Restore the best weights and evaluate the model on the validation set.
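
The training loop could look like the sketch below. To stay self-contained it uses random stand-in data and a deliberately tiny placeholder model, both of which are assumptions for the example. In the exercise, the `Conv1D` model and the vectorized tweets take their place.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Random stand-in data (token ids and binary labels), for illustration only
rng = np.random.default_rng(0)
X_train = rng.integers(0, 100, size=(256, 50))
y_train = rng.integers(0, 2, size=(256,))
X_val = rng.integers(0, 100, size=(64, 50))
y_val = rng.integers(0, 2, size=(64,))

# Tiny placeholder model; the exercise asks for the Conv1D model instead
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,)),
    layers.Embedding(input_dim=100, output_dim=4),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once val_loss has not improved for 3 epochs and roll back to the best weights
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=128, epochs=20,
    callbacks=[early_stopping], verbose=0,
)

val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)
print(len(history.history["val_loss"]), val_loss, val_acc)
```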

## Model `Embedding` + `Conv1D` + `GlobalAveragePooling1D` + `Dense`
### Exercise 6:
Create a new model similar to the previous one but replace the `MaxPooling1D` and `Flatten` layers with `GlobalAveragePooling1D`. Can you explain what we are doing? Next, train the model using the previous settings. Is it better?
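
To see what `GlobalAveragePooling1D` does, the sketch below applies it to a small hand-written tensor that stands in for a `Conv1D` output. The values are made up for the example.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for a Conv1D output: (batch, steps, filters) = (1, 3, 2)
feature_maps = tf.constant([[[1., 10.],
                             [2., 20.],
                             [6., 30.]]])

# Average each filter over the steps axis: (1, 3, 2) -> (1, 2)
pooled = layers.GlobalAveragePooling1D()(feature_maps)
print(pooled.numpy())  # [[ 3. 20.]]
```

Each filter is reduced to a single number, so the following `Dense` layer receives `filters` inputs instead of the `steps * filters` inputs that `Flatten` would produce.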

## Model `Embedding` + `Conv1D` + `MaxPooling1D`+ `Conv1D` + `MaxPooling1D` + `Flatten` + `Dense`
### Exercise 7:
Let's now try a deeper network by adding a second `Conv1D` + `MaxPooling1D` block to the first configuration.
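
Before stacking blocks, it helps to check how quickly the sequence length shrinks. The helper `conv_pool_length` below is hypothetical, and the sketch assumes that the second block reuses a kernel size of 8 and a pool size of 2 with the default padding and strides.

```python
def conv_pool_length(seq_len, kernel_size, pool_size):
    """Length after Conv1D (no padding, stride 1) followed by MaxPooling1D."""
    return (seq_len - kernel_size + 1) // pool_size

# Assumption: both blocks use kernel_size=8 and pool_size=2
length = 50
for block in (1, 2):
    length = conv_pool_length(length, kernel_size=8, pool_size=2)
    print(f"after block {block}: {length} time steps")
# after block 1: 21 time steps
# after block 2: 7 time steps
```

With these settings a third block would fail, because 7 time steps are fewer than the kernel size of 8.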

## Positive, Negative and Neutral Tweets
### Exercise 8:
Now we use all three labels to build the model. First, encode the labels, split the data, and convert it to NumPy arrays.
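
One simple way to encode the three classes is an explicit mapping to integers, sketched here on made-up labels. Which integer goes with which class is an arbitrary choice for the example.

```python
import pandas as pd

# Made-up labels using the three sentiment classes
labels = pd.Series(["negative", "neutral", "positive", "negative", "neutral"])

# The integer assigned to each class is an arbitrary choice for this sketch
label_to_id = {"negative": 0, "neutral": 1, "positive": 2}
y = labels.map(label_to_id).to_numpy()
print(y)  # [0 1 2 0 1]
```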

### Exercise 9:
Repeat the previous procedure to create the vectorization layer.

## Model `Embedding`+`Conv1D`+`MaxPooling1D`+`Flatten`+`Dense`
### Exercise 10:
Adapt the first model, i.e. `Embedding`+`Conv1D`+`MaxPooling1D`+`Flatten`+`Dense`, to this problem (pay attention to the expected output dimension and the loss function).
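
A minimal sketch of the three-class variant follows. It assumes integer-encoded labels (0, 1, 2), a vocabulary of 10000 tokens, and sequences of length 50. With one-hot labels, `categorical_crossentropy` would be the matching loss instead.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,)),                       # token ids, length 50
    layers.Embedding(input_dim=10000, output_dim=16),
    layers.Conv1D(filters=32, kernel_size=8, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(3, activation="softmax"),             # one probability per class
])

# sparse_categorical_crossentropy expects integer class labels
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
print(model.output_shape)  # (None, 3)
```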