**Discussion Question:**

In a 2017 debate, Gary Marcus argued that most of the accomplishments of neural networks are on perception problems, with a few exceptions like game-playing and media generation.  "Sheer bottom-up statistics hasn't gotten us very far on an important set of problems -- language, reasoning, planning, and common sense..." (quoted in Genius Makers, Metz 2021 p. 269).  Share an anecdote about ChatGPT that either demonstrates facility or lack of facility with one of these traits.  What are its weak points?


In this discussion section, you'll have a chance to play with BERT, the Google-developed large language model that is often used in NLP projects because it's freely available and a manageable size. (The code below loads DistilBERT, a smaller distilled version of BERT, via the 'distilbert-base-uncased' weights.)

The task is the same as in the "Bayesian Tomatoes" Naive Bayes task: sentiment analysis. We're going to change the corpus slightly; we used SST1 there, a sentiment dataset with multiple classification categories and a tree structure to the sentiment classifications that we ignored (hence all the skipped lines of the dataset). This time, we'll use SST2, a sentiment dataset collected later that has just binary sentiment classifications. The solution code for assignment 3 gets 71.8% accuracy on this dataset; use that as a benchmark for your other results, which should all be better.

The goal of this exercise will be to have a function that takes a sentence and outputs
whether the sentiment is negative (0) or positive (1), using a light classifier trained on top of BERT's output.

For the first part of the exercise, you'll need to install the "transformers" module, and possibly "torch" as well.  The examples here happen to use PyTorch instead of TensorFlow/Keras, so if you've never worked with PyTorch, it's recommended you work in Google Colab where PyTorch is already installed instead of taking the time now to install PyTorch on your local machine.

In [None]:
!pip install transformers
!pip install torch

Next, use the functions provided below to follow these steps (based on jalammar's introduction to BERT on GitHub, https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb). You shouldn't need to copy any code from there - just call the provided functions.

a) Download the 2000 sentiment-labeled sentences to a Pandas dataframe.

b) Create a BERT tokenizer and turn the dataset's sentences into BERT tokens.

c) Pad the tokens with zeros, so that the sentences are all the same length.

d) Call get_bert_sentence_vectors() to extract the vector corresponding to the first token, [CLS], for each sentence. This vector is trained to represent the overall meaning of the sentence as well as possible for classification tasks.  Name your array of vectors "vecs".
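
Steps (c) and (d) rely on zero-padding plus an attention mask that tells the model which positions are padding. The following is a minimal sketch of that idea on toy data, assuming made-up token ids (toy_tokens is a hypothetical stand-in, not real tokenizer output); it mirrors what pad_tokens() and the mask line in get_bert_sentence_vectors() do, without downloading anything.

```python
import numpy as np

# Toy token-id lists of unequal length (made-up ids, not real tokenizer output)
toy_tokens = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]]

# Pad every list with 0's up to the length of the longest one
max_len = max(len(t) for t in toy_tokens)
padded = np.array([t + [0] * (max_len - len(t)) for t in toy_tokens])

# 1 marks a real token, 0 marks padding that attention should ignore
mask = np.where(padded != 0, 1, 0)

print(padded.shape)
print(mask)
```

Every row of padded now has the same length, and each row of mask sums to the original length of that sentence.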

In [1]:
import numpy as np
import pandas as pd
import nltk
import torch
import transformers as ppb
from matplotlib import pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [2]:
# Location of SST2 sentiment dataset
SST2_LOC = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'
WEIGHTS = 'distilbert-base-uncased'
# Performance on the whole 6920-sentence set is very similar, but takes considerably longer
SET_SIZE = 2000

In [3]:
# Download the dataset from its Github location, return as a Pandas dataframe
def get_dataframe():
    df = pd.read_csv(SST2_LOC, delimiter='\t', header=None)
    return df[:SET_SIZE]

# Extract just the labels from the dataframe
def get_labels(df):
    return df[1]

# Get a trained tokenizer for use with BERT
def get_tokenizer():
    return ppb.DistilBertTokenizer.from_pretrained(WEIGHTS)

# Convert the sentences into lists of tokens
def get_tokens(dataframe, tokenizer):
    return dataframe[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

# We want the sentences to all be the same length; pad with 0's to make it so
def pad_tokens(tokenized):
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
    return padded

# Grab a trained DistilBERT model
def get_model():
    return ppb.DistilBertModel.from_pretrained(WEIGHTS)

# This step takes a little while, since it actually runs the model on all sentences.
# Get model with get_model(), 0-padded token lists with pad_tokens() on get_tokens().
# Only returns the [CLS] vectors representing the whole sentence, corresponding to first token.
def get_bert_sentence_vectors(model, padded_tokens):
    # Mask the 0's padding from attention - it's meaningless
    mask = torch.tensor(np.where(padded_tokens != 0, 1, 0))
    with torch.no_grad():
        word_vecs = model(torch.tensor(padded_tokens).to(torch.int64), attention_mask=mask)
    # First vector is for [CLS] token, represents the whole sentence
    return word_vecs[0][:,0,:].numpy()

In [None]:
# TODO use the above functions to get an array of sentence vectors

If you called the above functions in the right order and named your output "vecs", you should now have 2000 vectors representing the 2000 sentences, each with 768 elements in the vector.  You can check this below.

In [None]:
vecs.shape
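
To see where that shape comes from, here is a small sketch of the slicing that get_bert_sentence_vectors() applies to the model output. It assumes a random array (toy_hidden, a hypothetical stand-in) in place of real BERT activations, with 4 sentences and 10 token positions instead of the real data.

```python
import numpy as np

# Stand-in for the model's output: 4 sentences, 10 token positions, 768 values per token
# (random numbers, not real BERT activations)
rng = np.random.default_rng(0)
toy_hidden = rng.normal(size=(4, 10, 768))

# Keep every sentence, only token position 0 (the [CLS] slot), and all 768 values
toy_cls = toy_hidden[:, 0, :]
print(toy_cls.shape)
```

The token dimension disappears, leaving one 768-element vector per sentence.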

Now let's check that sentences with similar meanings have vectors that are close in the space.  Write a function find_closest_sentences(vecs, sentences) that finds the two different sentences in the data that have the most similar meaning (the distance between the vectors is smallest).

(The sentences are in column 0 of the SST2 dataframe, df[0]. The search may take a little while. For speed, try using numpy.linalg.norm instead of your own implementation of distance.)
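
As a hint on numpy.linalg.norm, the sketch below computes Euclidean distances on three toy 2-D vectors (toy_vecs is a hypothetical example, not BERT output). It shows only the distance building block, not the full search.

```python
import numpy as np

toy_vecs = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])

# Euclidean distance between two vectors
d01 = np.linalg.norm(toy_vecs[0] - toy_vecs[1])

# Distances from vector 0 to every row at once (axis=1 takes the norm of each row)
dists_from_0 = np.linalg.norm(toy_vecs - toy_vecs[0], axis=1)

print(d01)
print(dists_from_0)
```

Note that the distance from a vector to itself is 0, so a search for the closest pair has to skip that case.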

In [None]:
# Find closest sentences:  Find the two sentences closest in the space
# (vecs from BERT, sentences in same order from df[0])
def find_closest_sentences(vecs, sentences):
    # TODO
    pass  # placeholder so the cell runs before you fill in the body

In [None]:
find_closest_sentences(vecs, df[0])

Now use scikit-learn's train_test_split() to separate these 2000 sentences and labels into a train and test set (80%/20%). Then pick your favorite scikit-learn classifier, train it on the training set's sentence vectors to predict the sentiment, and test it on your test data.  (Some options include k-nearest-neighbors, AdaBoost, random forests, or a simple neural network, which scikit-learn calls a "multilayer perceptron".)  Recall that the function names you want are yourmodel.fit(train_features, train_labels) and yourmodel.score(test_features, test_labels).  Compare your results with other students who tried different classifiers.
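
The split, fit, and score pattern can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the features are two made-up clusters of 5-dimensional points rather than BERT vectors, the names toy_features, toy_labels, and toy_model are hypothetical, and k-nearest-neighbors with n_neighbors=5 is an arbitrary choice among the options listed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two well-separated clusters of 5-dimensional points, 100 per class
toy_features = np.vstack([rng.normal(-2.0, 1.0, size=(100, 5)),
                          rng.normal(2.0, 1.0, size=(100, 5))])
toy_labels = np.array([0] * 100 + [1] * 100)

# Hold out 20% of the points for testing
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=0)

# Train on the training split, then report accuracy on the held-out split
toy_model = KNeighborsClassifier(n_neighbors=5)
toy_model.fit(X_train, y_train)
accuracy = toy_model.score(X_test, y_test)
print(accuracy)
```

Swapping in a different classifier only changes the line that constructs toy_model; fit() and score() are called the same way.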

In [None]:
# labels should hold the sentiment labels in the same order as vecs, e.g. from get_labels(df)
train_features, test_features, train_labels, test_labels = train_test_split(vecs, labels, test_size=0.2)

In [None]:
# TODO:  train and test a classifier on the features here