# Info and Instructions

## 1 Your Objective for 894

The prime minister needs three results from your model:
1. Flag false posts ("pants-fire" or "false") with a recall of at least 70% (these will be sent to professional fact checkers)
2. Flag "true" posts with a precision of at least 95% (these will be used in real time to verify facts during presentations)
3. Flag "pants-fire" posts with a precision of at least 95% (these will be used in real time to contradict statements during presentations)
(See dataset information for more clarification around labels)
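
To make the three targets concrete, here is a minimal sketch of how precision and recall are computed from binary ground truth and binary flags. The helper name `precision_recall` and the toy lists are illustrative assumptions, not part of the assignment code.

```python
def precision_recall(y_true, y_flag):
    """Return (precision, recall) for binary ground truth and binary flags."""
    tp = sum(1 for t, f in zip(y_true, y_flag) if t == 1 and f == 1)
    fp = sum(1 for t, f in zip(y_true, y_flag) if t == 0 and f == 1)
    fn = sum(1 for t, f in zip(y_true, y_flag) if t == 1 and f == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: 4 truly false posts; the model flags 4 posts and gets 3 right.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_flag = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_flag)
print(precision, recall)  # 0.75 0.75

# Objective 1 asks for recall >= 0.70 on the "pants-fire or false" target.
meets_objective_1 = recall >= 0.70
```

Recall rewards catching most of the positive posts, while precision rewards being right whenever a post is flagged, which is why the real-time objectives (2 and 3) are stated in terms of precision.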

## 2 Dataset Information:
"We consider six fine-grained labels for the truthfulness ratings: pants-fire, false, barely-true, half-true, mostly-true, and true. The distribution of labels in the LIAR dataset is relatively well-balanced: except for 1,050 pants-fire cases, the instances for all other labels range from 2,063 to 2,638." - https://arxiv.org/pdf/1705.00648.pdf

## 3 Submission Instructions (**Read Carefully**)
- To submit:
  1. You cannot edit this notebook directly. **Save a copy** to your drive, and make sure to identify yourself in the title using your name and student number
  2. **Ensure** you have implemented all the necessary functions
  3. **Provide answers** to the questions in the conclusion cell
  4. Unlike previous assignments, please **submit all three formats: .py, .ipynb, and .html** (see https://torbjornzetterlund.com/how-to-save-a-google-colab-notebook-as-html/)
    - The notebook and HTML submissions should show the completion of your best-performing run
  5. **Ensure** your notebook can _restart and run all_
  6. The mark will be assessed on the implementation of the functions marked with #TODO
  7. **Do not change anything outside the marked functions** unless in the further exploration section
  8. Do not use any libraries other than the ones listed below (you may import additional modules from those libraries if needed)
  9. The mark is primarily based on correctness. However, since you are responsible for tuning this model, strong performance is also required: you should be able to at least match the results given in the paper.

Changing your runtime in Colab to GPU will speed up training drastically.


In [None]:
!pip install datasets
!pip install transformers
!pip install pandas

from datasets import load_dataset
import matplotlib.pyplot as plt
import tensorflow.keras as keras
import pandas as pd

try:  # in Colab the first import attempt can fail, so retry once
  from transformers import DistilBertTokenizer, TFDistilBertModel
except Exception:
  from transformers import DistilBertTokenizer, TFDistilBertModel

import numpy as np
from tensorflow.keras.layers import Dense, Input, Dropout
from pandas_profiling import ProfileReport

dbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')






# Data Preparation

## Clean the text and your targets
Hints:
1. Use the exploration cell to explore the data and identify cleaning steps
2. Inspect the tokenized sentences and ensure they make sense and can leverage pretrained word embeddings
3. These resources will help you understand what type of cleaning is required and how to encode your text for the network:
    - a) Preprocessing: https://huggingface.co/transformers/preprocessing.html
    - b) Summary of tokenizers (DistilBERT uses WordPiece): https://huggingface.co/transformers/tokenizer_summary.html#wordpiece
4. Consider the text length: is it too long or too short for DistilBERT? What impact would padding/truncation have?
5. In load data you generated a profiling report of this dataset; it may be helpful to review that as well
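
As a rough illustration of hint 4, the sketch below mimics what fixed-length encoding does to a list of token IDs: long sequences are cut, short ones are padded, and the attention mask marks which positions hold real tokens. The helper `pad_or_truncate` and the toy IDs are assumptions made for illustration. The real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Mimic fixed-length encoding: cut long sequences, pad short ones."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# A short statement gets padded up to max_length ...
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]

# ... while a long statement loses everything past max_length.
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(long_ids, long_mask)  # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

A larger `max_length` wastes compute on padding, and a smaller one discards the end of long statements. `distilbert-base-uncased` accepts at most 512 positions, so comparing the tokenized length distribution against a candidate `max_length` is a reasonable first check.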

In [None]:
def prepare_raw_data(df):
  raw_data = df.loc[:, ["id", "statement", "label"]]
  raw_data["label"] = raw_data["label"].astype('category')
  return raw_data

def load_data(save_dir="./"):
  dataset = load_dataset("liar")
  train = prepare_raw_data(pd.DataFrame(dataset["train"]))
  val = prepare_raw_data(pd.DataFrame(dataset["validation"]))
  test = prepare_raw_data(pd.DataFrame(dataset["test"]))
  return train, val, test
         
def clean_data(raw_data):
  # TODO: What data cleaning/filtering should you consider?
  # Hint: check for duplicates or contradictions
  # Hint: What is the minimum and maximum lengths of the statements?
  # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
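  # drop exact repeats of the same statement with the same label
  cleaned = raw_data.drop_duplicates(subset=["statement", "label"])
  # statements that still repeat carry conflicting labels, so drop every copy
  cleaned = cleaned.drop_duplicates(subset=["statement"], keep=False).copy()
  # whitespace word count (not the WordPiece token count)
  cleaned['token_count'] = [len(x.split()) for x in cleaned.statement]
  cleaned = cleaned[cleaned['token_count'] >= 10]
  return cleaned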

def extract_raw_text_and_y(clean_data):
  raw_text, raw_y = clean_data["statement"].values, clean_data["label"].values
  return raw_text, raw_y

def encode_text(text):
    # TODO: encode text using dbert_tokenizer
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    model_inputs_and_masks = dbert_tokenizer(
        text.tolist(), 
        return_tensors="tf",
        padding='max_length',
        truncation=True,
        max_length=100
    )
    input_ids = model_inputs_and_masks['input_ids']
    attention_mask = model_inputs_and_masks['attention_mask']

    return input_ids, attention_mask

def prepare_target(raw_y):
    # TODO: convert labels to 0/1
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # NOTE: labels map as follows: ['false', 'half-true', 'mostly-true', 'true', 'barely-true', 'pants-fire']
    # y should have:
    # column 0 = "pants-fire" or "false" posts
    # column 1 = "true" posts
    # column 2 = "pants-fire"
    # fix num_classes so a split that lacks a label still yields 6 columns
    y = keras.utils.to_categorical(raw_y, num_classes=6)
    y = [y[:, 0] + y[:, 5], y[:, 3], y[:, 5]]
    y = np.array(y).T

    return y
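
The target layout described in `prepare_target` can be checked without Keras. The sketch below is a plain-Python illustration that maps each of the six label IDs to the three binary columns; `LABEL_NAMES` copies the mapping from the note in the function, and `to_targets` is a hypothetical helper.

```python
# Label order as stated in the prepare_target note.
LABEL_NAMES = ["false", "half-true", "mostly-true", "true", "barely-true", "pants-fire"]

def to_targets(label_id):
    """Map one label ID to [false-or-pants-fire, true, pants-fire]."""
    name = LABEL_NAMES[label_id]
    return [
        int(name in ("pants-fire", "false")),
        int(name == "true"),
        int(name == "pants-fire"),
    ]

rows = [to_targets(i) for i in range(len(LABEL_NAMES))]
for name, row in zip(LABEL_NAMES, rows):
    print(f"{name:12s} -> {row}")
```

A "pants-fire" post sets columns 0 and 2 at the same time, and three of the six labels set no column at all, so the targets are multi-label rather than mutually exclusive classes.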


# Modelling

## Build and Train Model

Resources:
- DistilBERT paper: https://arxiv.org/abs/1910.01108
- DistilBERT Tensorflow Documentation: https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertmodel

In [None]:
def build_model(base_model, trainable=False, params={}):
    # TODO: build the model, with the option to freeze the parameters in distilBERT
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # Hint 1: the cls token (token for classification in bert / distilBERT)  corresponds to the first element in the sequence in DistilBERT
    # Hint 2: this guide may be helpful for parameter freezing: https://keras.io/guides/transfer_learning/
    # Hint 3: double check your number of parameters make sense
    # Hint 4: carefully consider your final layer activation and loss function

    # Refer to https://keras.io/api/layers/core_layers/input/
    max_seq_len = params["max_seq_len"]
    inputs = Input(shape = (max_seq_len,), dtype='int64', name='inputs')
    masks  = Input(shape = (max_seq_len,), dtype='int64', name='masks')

    base_model.trainable = trainable

    dbert_output = base_model(inputs, attention_mask=masks)
    dbert_last_hidden_state = dbert_output.last_hidden_state

    # Any additional layers should go here
    # use the 'params' as a dictionary for hyper parameter to facilitate experimentation
    dbert_cls_output = dbert_last_hidden_state[:,0,:]
    # two fully connected layers with dropout. This can be tweaked
    x = Dense(params["layer_width1"], activation='relu')(dbert_cls_output)
    x = Dropout(params["dropout1"])(x)
    x = Dense(params["layer_width2"], activation='relu')(x)
    x = Dropout(params["dropout2"])(x)

    probs = Dense(3, activation='sigmoid')(x)

    model = keras.Model(inputs=[inputs, masks], outputs=probs)
    model.summary()
    return model
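
For hint 3, one way to sanity-check the parameter count is to work out the size of the classification head by hand and compare it with `model.summary()`. This sketch assumes the layer widths used in the execution cell (128 and 64) and DistilBERT's hidden size of 768; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

hidden_size = 768  # width of the DistilBERT [CLS] vector
head = [
    dense_params(hidden_size, 128),  # Dense(layer_width1)
    dense_params(128, 64),           # Dense(layer_width2)
    dense_params(64, 3),             # Dense(3) output layer
]
total_head_params = sum(head)
print(head, total_head_params)  # [98432, 8256, 195] 106883
```

With `trainable=False`, the trainable parameter count reported by the summary should match this head total, and the frozen DistilBERT weights should appear as non-trainable. Dropout layers add no parameters.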



In [None]:
def compile_model(model):
    # TODO: compile the model, include relevant auc metrics when training
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # Hint: you may want to read up on the "multi_label" parameter in the keras AUC metrics
    model.compile(
        loss=keras.losses.BinaryCrossentropy(),
        optimizer=keras.optimizers.Adam(learning_rate=1e-5),
        metrics=[
            'accuracy', 
            keras.metrics.AUC(curve="ROC", multi_label=True), 
            keras.metrics.AUC(curve="PR", multi_label=True), 
            keras.metrics.Precision(),
            keras.metrics.Recall()
        ]
    )
    
    return model
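
To illustrate what the `multi_label` hint refers to, the sketch below computes ROC AUC separately for each label column and then averages the results. It uses an exact pairwise comparison in plain Python for clarity. Keras approximates AUC from a fixed set of thresholds, so its numbers can differ slightly. The helper names and toy data are assumptions.

```python
def roc_auc(y_true, scores):
    """Exact ROC AUC: chance that a positive outscores a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def multilabel_roc_auc(y_true_rows, score_rows):
    """Per-label AUC and its unweighted mean across labels."""
    n_labels = len(y_true_rows[0])
    per_label = [
        roc_auc([r[j] for r in y_true_rows], [r[j] for r in score_rows])
        for j in range(n_labels)
    ]
    return per_label, sum(per_label) / n_labels

# Toy data: 4 samples, 2 labels.
y_true_rows = [[1, 0], [0, 1], [1, 0], [0, 1]]
score_rows = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.7], [0.6, 0.75]]
per_label, macro = multilabel_roc_auc(y_true_rows, score_rows)
print(per_label, macro)  # [0.75, 1.0] 0.875
```

Without per-label treatment, all label columns are flattened into a single pool of predictions before the curve is computed, which can hide a weak label behind a strong one.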

In [None]:
def train_model(model, model_inputs_and_masks_train, model_inputs_and_masks_val,
    y_train, y_val, batch_size, num_epochs):
    # TODO: train the model
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    es = keras.callbacks.EarlyStopping(
        monitor="val_loss", 
        mode='min', 
        verbose=1,
        patience=1
    )
    history = model.fit(
            model_inputs_and_masks_train, 
            y_train,
            batch_size=batch_size,
            epochs=num_epochs,
            verbose=1,
            validation_data=(
                model_inputs_and_masks_val, 
                y_val
            ),
            callbacks=[es]
        )
    return model, history
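
The effect of the `patience` setting can be reasoned about with a small simulation. The sketch below is a simplified plain-Python model of patience-based early stopping on validation loss (it ignores options such as `min_delta`); `early_stop_epoch` is a hypothetical helper, not a Keras function.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)  # ran through every epoch

losses = [0.60, 0.55, 0.56, 0.50]
stop_p1 = early_stop_epoch(losses, patience=1)
stop_p2 = early_stop_epoch(losses, patience=2)
print(stop_p1, stop_p2)  # 3 4
```

With `patience=1` a single noisy epoch ends training before the better epoch 4 is reached, so a slightly larger patience may be worth trying when the validation loss fluctuates.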

In [None]:
from sklearn.metrics import roc_auc_score, average_precision_score, precision_score, recall_score

def evaluate_model(model, model_inputs_and_masks_test, y_test):
    # TODO: evaluate the model
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # HINT: for pr_auc: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html 

    # predict in batches and get a NumPy array, which the sklearn metrics expect
    probs = model.predict(model_inputs_and_masks_test)

    eval_dict = {
        "false": {
            "pr_auc": average_precision_score(y_test[:, 0], probs[:, 0]), "pr_auc_random_guess": sum(y_test[:, 0])/(1.0*y_test.shape[0]), 
            "roc_auc": roc_auc_score(y_test[:, 0], probs[:, 0]), "roc_auc_random_guess": 0.5, 
            "precision": precision_score(y_test[:, 0], probs[:, 0] > 0.2),
            "recall": recall_score(y_test[:, 0], probs[:, 0] > 0.2)
        }, 
        "true": {
            "pr_auc": average_precision_score(y_test[:, 1], probs[:, 1]), "pr_auc_random_guess": sum(y_test[:, 1])/(1.0*y_test.shape[0]), 
            "roc_auc": roc_auc_score(y_test[:, 1], probs[:, 1]), "roc_auc_random_guess": 0.5, 
            "precision": precision_score(y_test[:, 1], probs[:, 1] > 0.2),
            "recall": recall_score(y_test[:, 1], probs[:, 1] > 0.2)
        }, 
        "pants": {
            "pr_auc": average_precision_score(y_test[:, 2], probs[:, 2]), "pr_auc_random_guess": sum(y_test[:, 2])/(1.0*y_test.shape[0]), 
            "roc_auc": roc_auc_score(y_test[:, 2], probs[:,2]), "roc_auc_random_guess": 0.5, 
            "precision": precision_score(y_test[:, 2], probs[:, 2] > 0.2),
            "recall": recall_score(y_test[:, 2], probs[:, 2] > 0.2)
        }
    }
    return eval_dict
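
The evaluation above applies one fixed cutoff of 0.2 to every target, while the objectives ask for different operating points. As a sketch of one possible approach, the helper below scans the observed scores and returns the lowest cutoff whose precision meets a target, which also gives the highest recall among qualifying cutoffs. `lowest_threshold_for_precision` and the toy data are assumptions. In practice the cutoff would be chosen on the validation split rather than the test split.

```python
def lowest_threshold_for_precision(y_true, scores, min_precision):
    """Return (threshold, precision, recall) for the smallest cutoff whose
    precision meets the target, or None if no cutoff does."""
    total_pos = sum(y_true)
    for thr in sorted(set(scores)):
        # labels of every sample flagged at this cutoff
        flagged = [t for t, s in zip(y_true, scores) if s >= thr]
        tp = sum(flagged)
        precision = tp / len(flagged)
        if precision >= min_precision:
            return thr, precision, tp / total_pos
    return None

y_true = [0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
result = lowest_threshold_for_precision(y_true, scores, min_precision=0.95)
print(result)  # (0.5, 1.0, 0.8)
```

The same scan with the roles swapped (highest cutoff whose recall is still at least 70%) would serve the first objective.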

# Execution



In [None]:
## DO NOT Change
train, val, test = load_data()
train_raw_x, train_raw_y = extract_raw_text_and_y(clean_data(train))
val_raw_x, val_raw_y = extract_raw_text_and_y(clean_data(val))
test_raw_x, test_raw_y = extract_raw_text_and_y(clean_data(test))

train_input, train_mask = encode_text(train_raw_x)
train_y = prepare_target(train_raw_y)

val_input, val_mask = encode_text(val_raw_x)
val_y = prepare_target(val_raw_y)

test_input, test_mask = encode_text(test_raw_x)
test_y = prepare_target(test_raw_y)

train_model_inputs_and_masks = {
    'inputs' : train_input,
    'masks' : train_mask
}

val_model_inputs_and_masks = {
    'inputs' : val_input,
    'masks' : val_mask
}

test_model_inputs_and_masks = {
    'inputs' : test_input,
    'masks' : test_mask
}





Use the cell below to execute and experiment with your model

In [None]:
dbert_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

params={"max_seq_len" : train_input.shape[1],
        "layer_width1" : 128,
        "dropout1" : 0.5,
        "layer_width2" : 64,
        "dropout2" : 0.5}

model = build_model(dbert_model, params=params)
model = compile_model(model)
# keeping num_epochs small here for demonstration purposes. Should train for longer
model, history = train_model(model, train_model_inputs_and_masks, val_model_inputs_and_masks, train_y, val_y, batch_size=128, num_epochs=10)
eval_dict = evaluate_model(model, test_model_inputs_and_masks, test_y)
print(eval_dict)
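
Since `params` is a plain dictionary, candidate configurations for experimentation can be generated programmatically. The sketch below builds a small grid with `itertools.product`. The search values are arbitrary examples, and `max_seq_len` is set to 100 to match the `max_length` used in `encode_text`.

```python
from itertools import product

# Hyperparameters to vary (example values only).
search_space = {
    "layer_width1": [128, 256],
    "dropout1": [0.3, 0.5],
}
# Hyperparameters held constant across runs.
fixed = {"max_seq_len": 100, "layer_width2": 64, "dropout2": 0.5}

candidates = [
    {**fixed, **dict(zip(search_space, values))}
    for values in product(*search_space.values())
]
for params in candidates:
    print(params)
```

Each candidate can then be passed to `build_model`. Reloading the pretrained base model for every run is likely the safer choice when `trainable=True`, so that runs do not share fine-tuned weights.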


## Conclusions (TODO)
TODO: Make your final conclusions about your model (answer the questions below in this cell)
- a) What is driving your model's decisions?
- b) Is your model biased in some ways? If so, how?
- c) Does your model accomplish the objectives? If not, is your model useful and how can you justify this?

# Further exploration (REMOVE ALL CODE AFTER THIS CELL BEFORE SUBMISSION)
Any code after this is not evaluated, and must be removed before submission.
Leaving code below will result in losing marks.