<a href="https://colab.research.google.com/github/Arsilla/TextMining_Tias/blob/main/NLP_Class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run these cells first!

In [59]:
!pip install transformers

In [3]:
from transformers import DistilBertConfig, TFDistilBertModel
import tensorflow as tf
from tqdm import tqdm
import numpy as np
import pandas as pd

import os


def get_bert_model(n_labels):

    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
    loss = tf.keras.losses.SparseCategoricalCrossentropy()
    metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

    distil_bert = 'distilbert-base-uncased'


    config = DistilBertConfig(dropout=0.2, attention_dropout=0.2)
    config.output_hidden_states = False
    transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)

    input_ids_in = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
    input_masks_in = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32') 

    embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
    X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
    X = tf.keras.layers.GlobalMaxPool1D()(X)
    X = tf.keras.layers.Dense(50, activation='relu')(X)
    X = tf.keras.layers.Dropout(0.2)(X)
    # Softmax turns the outputs into class probabilities that sum to 1
    X = tf.keras.layers.Dense(int(n_labels), activation='softmax')(X)
    model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)

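    # Freeze the first three layers (expected to be the two inputs and DistilBERT),
    # so that only the newly added layers on top are trained
    for layer in model.layers[:3]:
        layer.trainable = False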

    model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

    return model

def bert_tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=128, padding='max_length', 
                                             return_attention_mask=True, return_token_type_ids=True, truncation=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])        
        
    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')


# Text Mining exercises

For this exercise, you will try to create a result similar to the one from the Sustainability Case. In that case we created:

- a topic model
- a sentiment classifier using BERT

However, although all the data is publicly available, it would require a lot of downloading, extracting and pre-processing before it could be used. In addition, the sentiment labels are not publicly available.

So we will use two different datasets instead:

- **Sentiment140 dataset**

    A Twitter dataset with 1.6 million tweets containing sentiment labels. Run the download cell in Exercise 0 to fetch the data, then read it from the `data` folder using `pandas`.

- **Sklearn Newsgroups dataset**

    Load it with:

``` python
from sklearn import datasets
newsgroups = datasets.fetch_20newsgroups()
```

We'll run our analysis on the newsgroups data, but use the Twitter dataset to train a sentiment classifier, which we will then apply to the newsgroups data.
In addition, we will use DistilBERT (a 'distilled' version of BERT) for the sentiment part, as it is faster to train.






# Exercise 0: Retrieve both datasets and explore their contents

In [58]:
# Run this cell to download and unzip the sentiment data
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/sentiment-analysis-is-bad/data/training.1600000.processed.noemoticon.csv.zip -P data
!unzip -n -d data data/training.1600000.processed.noemoticon.csv.zip

In [17]:
from sklearn import datasets
N = 5000 # We're limiting the dataset, because training on 1.6 million tweets takes too long
newsgroups = datasets.fetch_20newsgroups()
df_sentiment = pd.read_csv('data/training.1600000.processed.noemoticon.csv', 
                           names=['sentiment', 'id', 'date', 'query', 'user', 'text'],
                           encoding='latin-1')

# The file is sorted by label, so the first N rows are negative and the last N rows are positive
df_sent1 = df_sentiment.iloc[:N]
df_sent2 = df_sentiment.iloc[-N:]
df_sentiment = pd.concat([df_sent1, df_sent2])

# Exercise 1: Clean newsgroups data

If you have looked at the newsgroups data, you might have noticed that it contains a lot of unnecessary extra information, such as the header of each message.

We're only interested in the message itself. Write a regular expression that removes the header information from the newsgroups posts, and apply it to the data.

Header example:

``` 
From: ravin@eecg.toronto.edu (Govindan Ravindran)
Subject: decoupling caps - onboard
Organization: Department of Electrical and Computer Engineering, University of Toronto
Lines: 10
```

HINT: Examine a few examples and see what they have in common at the start and the end of each header.
Also look at a RegEx cheatsheet (like <a href='https://i.stack.imgur.com/KiaKd.png'>this one</a>).

HINT 2: If you're having trouble with newlines (`\n`), try removing them before applying the RegEx pattern, or using the group `[\W\w]` instead of `.`.
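
The newline issue from HINT 2 can be illustrated on a toy string. This is only a sketch: the text and the `Alpha:`/`Beta:` keys are made up for illustration and are not the header pattern the exercise asks for.

```python
import re

# Toy text invented for illustration: two "Key: value" lines, then a body
text = "Alpha: first line\nBeta: second line\n\nbody text"

# `.` does not match "\n", so this pattern cannot reach "Beta:" and nothing is removed
dot_result = re.sub(r"Alpha:.*Beta: second line", "", text)

# `[\W\w]` matches any character, including "\n", so the match can span both lines
any_char_result = re.sub(r"Alpha:[\W\w]*?Beta: second line", "", text)

# re.DOTALL is an alternative: it makes `.` match newlines as well
dotall_result = re.sub(r"Alpha:.*?Beta: second line", "", text, flags=re.DOTALL)

print(repr(dot_result))
print(repr(any_char_result))
print(repr(dotall_result))
```

The `*?` quantifier is non-greedy, which keeps the match from running on to a later occurrence of the closing text.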


In [None]:
import re
pattern = r""

In [None]:
# re.IGNORECASE will ignore capital letters in the text when applying the pattern
# You can also use .lower on text data, but be aware that this changes all letters to lower case
newsgroup_data = []
for data in newsgroups.data:
    new_text = re.sub(pattern,repl="", string=data, flags=re.IGNORECASE)
    newsgroup_data.append(new_text)

In [None]:
newsgroup_data[106]

'\n\n(posted for a friend)\nhello there,\n        I would like to know if any one had any experience with having\non-board decoupling capacitors (inside a cmos chip) for the power\nlines. Say I have a lot of space left im my pad limited design.\nany data on the effect of oxide breakdown? any info or pointers\nare appreciated.\n\nrs\n\n'

## Exercise 1b: Tokenize the newsgroup_data

Using a tokenizer from `nltk`, tokenize `newsgroup_data` into a new object called `newsgroup_tokens`. We want to end up with a list of lists, where each sublist contains the tokens of one document.
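
As a rough illustration of the target shape, here is a sketch on two made-up sentences. It uses `wordpunct_tokenize`, which is only one of several tokenizers in `nltk.tokenize`; it was picked here because it is regex-based and needs no extra NLTK data downloads. The names `toy_docs` and `toy_tokens` are placeholders.

```python
from nltk.tokenize import wordpunct_tokenize

# Made-up documents for illustration; the exercise uses newsgroup_data instead
toy_docs = ["Hello there, world!", "Any info is appreciated."]

# One list of tokens per document, giving a list of lists
toy_tokens = [wordpunct_tokenize(doc.lower()) for doc in toy_docs]
print(toy_tokens)
```

Note that this tokenizer keeps punctuation as separate tokens, which you may want to filter out before topic modelling.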

In [57]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups()

In [None]:
from nltk.tokenize import 

In [None]:
# Tokenize your data here
# You can use a for loop similar to Exercise 1
for 

# Exercise 2: Create an LDA model

a) Load the `LdaModel` from `gensim.models` and `Dictionary` from `gensim.corpora`.

b) Create a dictionary instance using the `Dictionary` class from gensim, with your tokens as input (the list of lists)

c) Build a corpus by calling the `doc2bow` function on the dictionary you created, with each list of tokens in your `newsgroup_tokens` list as input, and put the results in a list.

d) Create and train an LDA model by calling `LdaModel` with your corpus as input. Set `id2word` to your dictionary.
Think about what settings you want to use for *alpha* and *eta* to control the number of topics per document and words per topic, respectively.

e) *(Optional):* Look at your results and see if you're satisfied. If not, think about other ways to improve the model, such as filtering out certain words (look at the `filter_extremes` function on the dictionary, for example) or removing stopwords from your tokens.

In [None]:
## Load stopwords if you want them
## Uncomment to use

# import nltk
# from nltk.corpus import stopwords
# try:
#     input_stopwords = stopwords.words('english')
# except LookupError:
#     nltk.download('stopwords')
#     input_stopwords = stopwords.words('english')

In [None]:
lda = 

When you're satisfied with your LDA model, keep it for later (just don't overwrite the variable name with something else).

We're now going to train a BERT model.


# BERT Sentiment analysis
For convenience, we'll be using the Hugging Face `transformers` implementation of DistilBERT, which is a lighter version of the full BERT model.

A pre-trained version (for English) is available that can be used for classification.

Before we start training the model, we're going to clean up the sentiment dataset.

## Exercise 3: Clean up sentiment dataset

a) The sentiment labels are currently 0 for negative and 4 for positive. Change the values to 0 (negative) and 1 (positive) for easier input to the model.
    Hint: you can do this in different ways, but using a dictionary is one way:

``` python
>>> df['column_to_change']
 0    a
 1    b
 2    a
 Name: column_to_change, dtype: object
>>> df['column_to_change'] = df['column_to_change'].replace({'a': 1})
>>> df['column_to_change']
 0    1
 1    b
 2    1
 Name: column_to_change, dtype: object
```

In [None]:
## Your answer here


b) Remove any columns that are not necessary, and only keep the 'sentiment' and 'text' columns.

In [None]:
# your answer here
df = 

c) (Optional) Remove any unwanted characters from the text column. Note that this is not *necessary* for the BERT tokenizer to work (it can handle basically any input). However, you might consider removing Twitter handles or URLs, simply because they are not relevant for the analysis.
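
One possible way to strip handles and URLs is sketched below. The helper name `clean_tweet` and its patterns are assumptions for illustration; they are deliberately simple and will not catch every URL format.

```python
import re

def clean_tweet(text):
    """Remove URLs and @handles, then collapse leftover whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # drop http/https URLs
    text = re.sub(r"@\w+", "", text)           # drop Twitter handles
    return re.sub(r"\s+", " ", text).strip()   # tidy up the gaps left behind

print(clean_tweet("@someone check http://example.com/page this out"))
```

A function like this could be applied to the whole column with `df['text'].apply(clean_tweet)`, assuming your DataFrame is called `df`.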

In [None]:
# Your answer here

## Exercise 4: Initialize DistilBERT and prepare train and test set

For the purpose of this exercise, I've prepared a version of the DistilBERT model (based on <a href='https://github.com/huggingface/transformers'>one of these</a>) that you can use for fine-tuning (transfer learning) a model for classification.

You can get this model by calling the `get_bert_model` function below.

## NOTE: Make double sure your runtime is set to GPU, otherwise one epoch might take 30 minutes!

In [56]:
from transformers import DistilBertTokenizer, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = get_bert_model(2)

model.summary()

The model expects two inputs: the token IDs (integer indices into the vocabulary, padded or truncated to length 128) and the attention mask (1 for real tokens, 0 for padding).
The tokenizer produces both, and you can use the `bert_tokenize` function to transform text into input for the model.

In [55]:
sentences = ['testing this thing', 
             'short sentences will be padded with zeros, and long ones are cut off to make sure everything is the same size']
tokenized = bert_tokenize(sentences, tokenizer)
tokenized

Now we need to prepare the data for training.

a) Make two arrays of text (tweets) and labels (sentiment) called `X` and `Y`, then use `train_test_split` from sklearn to split the data twice: once into `train` and `test`, and then the `test` set again into `test` and `val` (validation). Think about what percentage of the data you want to assign to each set, and use the `test_size` parameter for that, as a float between 0 and 1, where `1=100%`.
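
The double split can be sketched on toy data as follows. The 80/10/10 proportions, the `stratify` argument and the `random_state` value are choices made for this illustration, and all variable names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tweets and labels: 100 samples with balanced classes
X_toy = np.array([f"tweet {i}" for i in range(100)])
y_toy = np.array([0, 1] * 50)

# First split: 80% train, 20% held out; stratify keeps the label balance
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# Second split: halve the held-out part into test and validation (10% each)
X_te, X_va, y_te, y_va = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42)

print(len(X_tr), len(X_te), len(X_va))  # 80 10 10
```

Fixing `random_state` makes the split reproducible between runs.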

In [52]:
from sklearn.model_selection import train_test_split
## FILL THIS IN
X = 

Y = 

X_train, X_test, y_train,  y_test = train_test_split()
X_test, X_val, y_test, y_val = train_test_split()

print(f"Train set size     : {len(X_train)} tweets, {(len(X_train)/len(X))*100}% of total size")
print(f"Test set size      : {len(X_test)} tweets, {(len(X_test)/len(X))*100}% of total size")
print(f"Validation set size: {len(X_val)} tweets, {(len(X_val)/len(X))*100}% of total size")

In [53]:
# Checking we have good even distributions of labels for all sets
print(np.unique(y_train, return_counts=True))
print(np.unique(y_val, return_counts=True))
print(np.unique(y_test, return_counts=True))

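Make sure the input texts (`X_train`, `X_val` and `X_test`) have been tokenized for input into the model, using the `bert_tokenize` function! Note that `bert_tokenize` returns three arrays (token IDs, attention masks and segment IDs), while the model only takes the first two as input.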

b) Now train the model on the train set. You can do this by calling the `fit()` method on the model object.
Use a batch size of 32 and start with a few epochs. See how the model is performing.

Running the training may take a few minutes per epoch, so perhaps this is a good moment for some coffee.

In [54]:
model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=32, epochs=10)

c) Make predictions on the test set (`X_test`) and compare them to the true labels (`y_test`). What is the accuracy? Get the confusion matrix from `sklearn.metrics` to see where the model is making mistakes.


d) Also try to find a few examples of wrong labels. Where is it going wrong, and do you know why?
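
The evaluation in c) and d) might look roughly like this on toy arrays. The texts, labels and `_toy` names are invented for illustration; in the exercise, the predicted labels would come from your model's output on the test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy stand-ins for the test texts, true labels and predicted labels
texts_toy = np.array(["great day", "so sad", "love it", "awful", "not bad", "meh"])
y_true_toy = np.array([1, 0, 1, 0, 1, 0])
y_pred_toy = np.array([1, 0, 1, 1, 0, 0])

acc = accuracy_score(y_true_toy, y_pred_toy)
cm = confusion_matrix(y_true_toy, y_pred_toy)  # rows: true labels, columns: predicted labels
print(acc)
print(cm)

# Indices where the prediction and the true label disagree
wrong_idx = np.flatnonzero(y_true_toy != y_pred_toy)
for i in wrong_idx:
    print(texts_toy[i], "| true:", y_true_toy[i], "| predicted:", y_pred_toy[i])
```

Reading the misclassified texts side by side with their labels is often the quickest way to spot patterns such as negation or sarcasm.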

## Exercise 5: Combine results

We're now going to apply both models (LDA and Sentiment) to the newsgroup data.

The sentiment model might not perform very well on the newsgroup data, since it's a different dataset. However, we hope that the language is similar enough that the model can generalize to the newsgroups data as well.

a-1) Tokenize the newsgroups data sentences (called `newsgroup_data`) using the `bert_tokenize` function. Create predictions from the tokenized input and save as `y_sent_proba`.

a-2) `y_sent_proba` has shape `(N, 2)`, where N is the number of samples in `newsgroup_data`.
The two numbers given to each sample are the model's confidence that the newsgroup post belongs to class 0 (negative) or class 1 (positive), by index.

So if `newsgroup_data[1]` has prediction `[0.3, 0.7]` the model is 70% confident that the sentiment is positive.

To get just a label back, we can take the index of the maximum value for each sample by using `np.argmax(y_sent_proba, 1)`. 

Convert `y_sent_proba` to an array of integers (labels) using `np.argmax` and save it as `y_sentiment`.
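
As a small illustration of what `np.argmax` does along axis 1, here is a sketch on a made-up probability array (`proba_toy` and `labels_toy` are placeholder names):

```python
import numpy as np

# Toy probabilities for three samples: column 0 is negative, column 1 is positive
proba_toy = np.array([[0.3, 0.7],
                      [0.9, 0.1],
                      [0.4, 0.6]])

# For each row, return the index of the largest value, which is the predicted label
labels_toy = np.argmax(proba_toy, axis=1)
print(labels_toy)  # [1 0 1]

# Count how often each label occurs
print(np.unique(labels_toy, return_counts=True))
```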

In [34]:
y_sentiment = 

(array([0, 1]), array([  930, 10384]))

(11314, 2)

b) Now have the LDA model predict topics for each text in the newsgroup data. You can do this by running `lda.get_document_topics(doc)` for each `doc` in our corpus. (`corpus` should still exist in our namespace; if not, rerun the cells above where we built and trained the LDA model.)

Save the results as `y_topic`.

In [None]:
y_topic = []
for

Now we have both sentiment and topic predictions, saved as `y_sentiment` and `y_topic`. 

c) Create a visualisation using both of these values. For example, show a bar chart of the number of positive and negative predictions per topic.
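
One way to combine the two predictions is a cross-tabulation followed by a grouped bar chart. The sketch below uses made-up labels and assumes each document has been reduced to a single topic id; the `_toy` names are placeholders, and plotting through pandas requires `matplotlib` to be installed.

```python
import pandas as pd

# Toy stand-ins: one sentiment label and one topic id per document
sentiment_toy = [1, 1, 0, 1, 0, 1]
topic_toy = [0, 0, 1, 1, 1, 0]

df_toy = pd.DataFrame({"topic": topic_toy, "sentiment": sentiment_toy})
df_toy["sentiment"] = df_toy["sentiment"].map({0: "negative", 1: "positive"})

# Rows are topics, columns are sentiment labels, cells are document counts
counts = pd.crosstab(df_toy["topic"], df_toy["sentiment"])
print(counts)

# Grouped bar chart: one group per topic, one bar per sentiment
ax = counts.plot(kind="bar")
ax.set_xlabel("topic")
ax.set_ylabel("number of documents")
```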

# Congrats! You're done!

You have now successfully created an LDA model and fine-tuned DistilBERT for sentiment analysis. 

Obviously you can experiment a lot more with the output of both models, or experiment with the vectorisations. Have fun!