# Contents

1. [Training](#training)
2. [Using model](#use_model)

---
 
# Train BERT model directly <a id='training'></a>
Earlier I tried a lot of models to predict whether something is condescending, but they all used the reply to the main post. However, I can also use transfer learning with BERT to see whether I can predict it from the post alone.

Pros:
- BERT is a very powerful model
- Can look at context of words, which we determined earlier was important
- Is NOT limited to the sentence embeddings, which were all we used in the last notebook.

Cons:
- Slow to train (it has many parameters)
- Can't see what it's using to predict.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
import torch
# the PyTorch DistilBERT classes used in this notebook are imported further down

Read data in

In [3]:
# load data
cond_df = pd.read_csv("./cond_data/added_features/balanced_train_more_features.csv")
cond_df.drop("Unnamed: 0", axis = 1, inplace= True)
cond_df.head(3)

Unnamed: 0,quotedpost,quotedreply,label,post,reply,post_user,reply_user,start_offset,end_offset,reddit_post_id,reddit_reply_id,has_cond,post_len,reply_len,cleaned_post,cleaned_reply
0,Please educate yoyrself before you bring your ...,"Not condescending at all, jeez.",True,"Well a guy is saying Barra, who has those grea...",> Please educate yoyrself before you bring you...,StalinHimself,Kel_Casus,135,208,dbl4vl9,dblfraj,1,37,17,"Well a guy is saying Barra, who has those grea...","Not condescending at all, jeez."
1,There might be some small piece that's incorrect,You said that. Not me. Not James-Cizuz. You sa...,True,> I think you're the one who has a reading com...,> theories are constantly growing and evolving...,kishi,jids,365,413,c2dtpq9,c2dtywp,1,314,230,"Well you're a stupid poopy-head.\n\nSee, I don...",Why would theories self-correct if they were a...
2,If I try and force down a breakfast I start ga...,Yes!\n\nPeople were so condescending about it ...,False,For me it's like temporarily having the flu. T...,> If I try and force down a breakfast I start ...,amphetaminesfailure,CowGiraffe,331,383,cuv97mf,cuvnb27,1,107,179,For me it's like temporarily having the flu. T...,Yes!\n\nPeople were so condescending about it ...


Things to note:
- The text has to be put into a specific format, but the tokenizer handles this
- Input is restricted to 512 tokens (tokens, not words).
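
To see why the limit is counted in tokens rather than words, here is a minimal sketch of greedy longest-match subword splitting, which is the general idea behind WordPiece tokenizers. The vocabulary `toy_vocab` and the helper `wordpiece` are made up for illustration; the real DistilBERT vocabulary is far larger and may split words differently.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# hypothetical vocabulary, NOT the real distilbert-base-uncased one
toy_vocab = {"you", "are", "so", "con", "##des", "##cend", "##ing"}

text = "you are so condescending"
tokens = ["[CLS]"] + [p for w in text.split() for p in wordpiece(w, toy_vocab)] + ["[SEP]"]

print(len(text.split()), "words ->", len(tokens), "tokens")
print(tokens)
```

A post can therefore be under 512 words and still exceed 512 tokens, which is why `truncation = True` is still passed to the tokenizer later on.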

In [4]:
# get only short posts just to be safe
short_posts = cond_df[cond_df["post_len"] < 512]
short_posts.shape

(4961, 16)

In [5]:
# train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    short_posts["cleaned_post"],
    short_posts["label"])

# the labels have to be 1 and 0, not True and False
y_train = [1 if s else 0 for s in list(y_train)]
y_val = [1 if s else 0 for s in list(y_val)]

Again, use the BERT tokenizer (the DistilBERT version, to match the model).

In [6]:
# This just tokenizes it for BERT
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

In [7]:
train_encoding = tokenizer(list(X_train), truncation = True, padding = True)
val_encoding = tokenizer(list(X_val), truncation = True, padding = True)
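
As a rough picture of what `padding = True` does, the sketch below pads a batch of token-id lists to the length of the longest one and builds the matching attention mask. The helper `pad_batch` and the token ids are hypothetical stand-ins, not output from the real tokenizer.

```python
def pad_batch(batch, pad_id=0):
    """Pad every id list to the longest one and mark real tokens with 1, padding with 0."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# made-up token ids for two posts of different length
batch = [[101, 2017, 102], [101, 2017, 2024, 2061, 102]]
encoded = pad_batch(batch)

print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because the tokenizer is called on the whole list at once, every post is padded to the longest post in that list, so one very long post makes all the encodings long.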

Follow the tutorial from https://huggingface.co/transformers/custom_datasets.html

In [9]:
# import transformer things
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

In [8]:
class CondDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
    
train_dataset = CondDataset(train_encoding, y_train)
val_dataset = CondDataset(val_encoding, y_val)


In [11]:
# Continued

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

# uncomment to train
# trainer.train()

# save the model, uncomment to save
# model.save_pretrained("models/basic_bert_model_final/")
print("Finally done")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Step,Training Loss
10,0.702091
20,0.694434
30,0.696628
40,0.697723
50,0.69546
60,0.697719
70,0.693809
80,0.686077
90,0.68737
100,0.681903


Finally, done
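
To put the first 100 logged steps in context, here is a back-of-the-envelope sketch of how long training runs and where the learning rate is at step 100. It assumes a single device, the default 25% hold-out of `train_test_split`, the default base learning rate of `5e-5`, and a linear warmup followed by linear decay; none of these were set explicitly above, so treat them as assumptions.

```python
import math

n_posts = 4961                                # rows in short_posts
n_val = math.ceil(0.25 * n_posts)             # assumed: default 25% hold-out
n_train = n_posts - n_val

steps_per_epoch = math.ceil(n_train / 16)     # per_device_train_batch_size=16, one device assumed
total_steps = 3 * steps_per_epoch             # num_train_epochs=3
warmup_steps = 500
base_lr = 5e-5                                # assumed: default learning rate

def lr_at(step):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0, total_steps - step) / (total_steps - warmup_steps)

print("training examples:", n_train)
print("total optimizer steps:", total_steps)
print("learning rate at step 100:", lr_at(100))
```

Under these assumptions the warmup covers most of the run, and at step 100 the learning rate is still only a fifth of its peak.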


The loss is hovering around 0.69, close to the ln 2 ≈ 0.693 that a coin-flip guess gives on two classes, and has barely improved over these first 100 steps.
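
For reference, a binary classifier that always outputs 50/50 has a cross-entropy loss of ln 2. The short check below compares that value with the training losses logged above; it is only a sanity check on the numbers, not part of the training code.

```python
import math

# cross-entropy of predicting probability 0.5 for the true class
chance_loss = -math.log(0.5)

# training losses copied from the log above (steps 10 to 100)
logged = [0.702091, 0.694434, 0.696628, 0.697723, 0.69546,
          0.697719, 0.693809, 0.686077, 0.68737, 0.681903]

print("chance-level loss:", round(chance_loss, 4))
for step, loss in zip(range(10, 101, 10), logged):
    print(step, round(loss - chance_loss, 4))
```

All ten logged values are within about 0.01 of the chance level, so at this point the model has learned very little.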

# See how well model works <a id='use_model'></a>
Load the model again.

In [10]:
model = DistilBertForSequenceClassification.from_pretrained("./models/basic_bert_model_final/")

Make a function that returns the prediction.

Also, the model returns one value per class, but these are raw scores (logits), not probabilities. Therefore, just take the class with the larger value as the prediction.
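
If probabilities are wanted, the two raw values can be passed through a softmax. The sketch below uses made-up values for one post to show that this does not change which class wins, so comparing the raw values is enough for a hard prediction.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-0.4, 1.1]                         # hypothetical model output for one post
probs = softmax(logits)

predicted_from_logits = max(range(len(logits)), key=lambda i: logits[i])
predicted_from_probs = max(range(len(probs)), key=lambda i: probs[i])

print(probs)
print(predicted_from_logits, predicted_from_probs)
```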

In [30]:
def predict(sentences):
    results = []
    
    for sentence in sentences:
        inputs_encoding = tokenizer.encode(sentence)
        
        # the model accepts at most 512 tokens
        if len(inputs_encoding) > 512:
            inputs_encoding = inputs_encoding[:512]
        
        # run the model once per sentence, without tracking gradients
        with torch.no_grad():
            logits = model(torch.tensor(inputs_encoding).unsqueeze(0))[0][0]

        # True if the value for class 1 is larger than the value for class 0
        results.append((logits[0] < logits[1]).item())
        
    return results

In [35]:
%%time
# get predictions for validation data
val_predictions = predict(list(X_val))

Wall time: 3min 40s


In [37]:
# convert it into a number
val_predictions_number = [1 if i else 0 for i in val_predictions]

In [41]:
# plot_confusion_matrix is not used below and is no longer available in recent scikit-learn versions
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

In [40]:
print("Accuracy:")
print(accuracy_score(y_val, val_predictions))
print("AUC ROC:")
print(roc_auc_score(y_val, val_predictions))

Accuracy:
0.8477034649476228
AUC ROC:
0.8471577650855685


The accuracy is surprisingly good considering I only used the post and not the reply. So this works quite well for predicting condescending posts. However, let's look at the confusion matrix too.

In [48]:
conf_matrix = confusion_matrix(y_val, val_predictions, normalize="true")
conf_matrix = pd.DataFrame(conf_matrix, columns = ["Predicted 0", "Predicted 1"], index=["Actual 0", "Actual 1"])
conf_matrix

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,0.87224,0.12776
Actual 1,0.177924,0.822076


There are somewhat more false negatives (about 18% of the actual positives) than false positives (about 13% of the actual negatives), but the gap is small, so it seems OK. Overall, this model shows that it's possible to predict condescending posts by just looking at the post itself (not the reply).

However, I don't really know what the model is looking at.
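
One detail about the AUC ROC reported above: `predict` returns hard True/False labels rather than scores, so the ROC curve has a single operating point, and the area under it reduces to the average of the two diagonal entries of the normalized confusion matrix (the balanced accuracy). A quick check with the validation numbers from this notebook:

```python
# diagonal of the normalized validation confusion matrix above
true_negative_rate = 0.87224     # Actual 0, Predicted 0
true_positive_rate = 0.822076    # Actual 1, Predicted 1

balanced_accuracy = (true_negative_rate + true_positive_rate) / 2

reported_auc = 0.8471577650855685  # AUC ROC printed earlier
print(balanced_accuracy, reported_auc)
```

The two values agree up to the rounding of the displayed matrix. Passing scores (for example the difference between the two class values) instead of hard labels would give a proper threshold-free AUC.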

Try it on the provided test dataset (a completely separate set of data). I wasn't going to do this, but since my validation data was passed to the trainer, I wasn't 100% sure whether the model had fit to it.

In [49]:
test_data = pd.read_json('./cond_data/imbalanced_test.jsonl', lines=True)
test_data.head()

Unnamed: 0,quotedpost,quotedreply,label,post,reply,post_user,reply_user,start_offset,end_offset,reddit_post_id,reddit_reply_id
0,I have seen more biased and tunnel visioned vi...,Challenge them when you see them. Challenge th...,False,I have seen more biased and tunnel visioned vi...,>I have seen more biased and tunnel visioned v...,ykci,kencabbit,0,103,c0jxu8p,c0jxv1l
1,Poor Trump supporters see people like my famil...,"All true, and most of these Trump supporters a...",False,Exactly. Poor Trump supporters see people like...,> Poor Trump supporters see people like my fam...,imrollinv2,michaelochurch,9,230,e3cxi4j,e3d6v8o
2,"I love that you blame the casual fans,",Did I ever say that? I actually feel that the ...,True,"I love your logic: ""why won't the fans keep co...",">I love your logic: ""why won't the fans keep c...",Skeennn,[deleted],957,995,ck7zo1m,ck802cb
3,What exactly are you talking about? I suspect ...,Hilarious. You can't imagine Black voters wou...,True,The point is that you are speaking in vague hy...,> What exactly are you talking about? I suspec...,YabuSama2k,Darrkman,183,301,d5p2erl,d5pd789
4,And yet.. you don't see a difference between t...,"No, man. I get to respond to your bullshit con...",True,"> Bullshit, there's nothing civil in how you s...",">Are you Martin Shkreli? If so.. okay, but if ...",congelar,GroundhogExpert,558,835,dl8afrk,dl8bibd


First we need to remove the quotes like we did before.

In [51]:
# Add in the post/reply columns but without quotes
# this is the same code from the other notebook
import re  # the standard library module is enough for this pattern

def remove_reddit_quotes(text):
    quote_regex = r">.*\n\n"    
    return re.sub(quote_regex, "", text)
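
A small usage sketch of the function above on two made-up strings (not rows from the dataset). It uses the standard `re` module, which handles this pattern the same way, and shows one caveat: the pattern is not anchored to the start of a line, so a `>` in the middle of a sentence is treated like a quote marker.

```python
import re

def remove_reddit_quotes(text):
    quote_regex = r">.*\n\n"
    return re.sub(quote_regex, "", text)

# made-up examples for illustration
quoted = "> You said that.\n\nNo, I did not."
inline = "I think 3 > 2\n\nSo there."

cleaned_quoted = remove_reddit_quotes(quoted)
cleaned_inline = remove_reddit_quotes(inline)

print(repr(cleaned_quoted))   # the quoted line is removed
print(repr(cleaned_inline))   # everything from the '>' to the blank line is removed too
```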

In [52]:
%%time
test_data["cleaned_post"] = test_data["post"].map(remove_reddit_quotes)
test_data["cleaned_reply"] = test_data["reply"].map(remove_reddit_quotes)

Wall time: 65.8 ms


Then, use the model to predict condescension using only the post.

In [57]:
%%time
# predict for the test data
test_predictions = predict(list(test_data["cleaned_post"]))

Wall time: 18min 16s


In [59]:
# imbalanced classes so just use auc roc
print("AUC ROC:")
print(roc_auc_score(test_data["label"], test_predictions))

AUC ROC:
0.7025306748466258
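
The reason for skipping plain accuracy on imbalanced data can be seen with a toy example. The 90/10 class split below is hypothetical (the real ratio of the test set is not shown here): a classifier that always predicts the majority class looks accurate but is no better than chance once the two classes are weighted equally.

```python
# hypothetical imbalanced labels: 90 negatives, 10 positives
labels = [0] * 90 + [1] * 10
preds = [0] * len(labels)            # always predict the majority class

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# recall on each class separately
tnr = sum(p == 0 for p, y in zip(preds, labels) if y == 0) / labels.count(0)
tpr = sum(p == 1 for p, y in zip(preds, labels) if y == 1) / labels.count(1)
balanced = (tnr + tpr) / 2           # equals the AUC ROC for hard 0/1 predictions

print("accuracy:", accuracy)
print("class-balanced score:", balanced)
```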


In [62]:
conf_matrix = confusion_matrix(test_data["label"], test_predictions, normalize="true")
conf_matrix = pd.DataFrame(conf_matrix, columns = ["Predicted 0", "Predicted 1"], index=["Actual 0", "Actual 1"])
conf_matrix

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,0.770092,0.229908
Actual 1,0.365031,0.634969


We can see that the model has overfit to the validation data from earlier: the AUC ROC drops from about 0.85 on the validation set to about 0.70 here. There are also quite a lot of false negatives (about 37% of the actual positives).

Since the aim is to figure out what being condescending is, as well as to make a model that can predict people who are condescending, this model isn't that helpful:

- The AUC ROC is OK, but there are still a lot of false negatives.
- The model is really complicated, so it's hard to explain what is going on.