# Working with BERT

BERT is a trained neural network model that produces vectors representing word
meanings.  Another machine learning algorithm can rest on top of these vectors and use them to classify text.

Here, we're working with some code that gets BERT running with the worst of our classifiers for high-dimensional data, k-nearest neighbors. 
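
One way to see why nearest-neighbor methods tend to struggle in high dimensions is to look at how distances behave there. The sketch below is only an illustration under a loud assumption: it uses uniformly random points, not real BERT vectors, and the helper name `distance_contrast` is made up for this example. It compares the nearest and farthest distance from a query point in 2 dimensions and in 768 dimensions (the size of a DistilBERT vector).

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Return nearest/farthest distance from one random query to random points.

    A ratio close to 1 means the "nearest" neighbor is barely nearer than
    the farthest one, so neighbor identity carries little information.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


# Assumption: uniform random points stand in for real sentence vectors.
for dim in (2, 768):
    print(dim, round(distance_contrast(dim), 3))
```

The ratio should come out much closer to 1 in the 768-dimensional case. Real BERT vectors have far more structure than random points, so this shows the general tendency rather than predicting the k-NN score in this notebook.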

**(1, 10 points)** Try training a random forest classifier (sklearn.ensemble.RandomForestClassifier) and an AdaBoost classifier (sklearn.ensemble.AdaBoostClassifier) and see whether their performance is any better.
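
As a rough sketch of what such a comparison could look like, the block below fits all three classifiers on the same train/test split and reports test accuracy. It is not the assignment solution: the helper name `compare_classifiers` is hypothetical, and synthetic data from `make_classification` stands in for the BERT sentence vectors so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def compare_classifiers(features, labels, seed=0):
    """Fit each classifier on one shared split and return test accuracies."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, random_state=seed
    )
    classifiers = {
        "knn": KNeighborsClassifier(),
        "random_forest": RandomForestClassifier(random_state=seed),
        "adaboost": AdaBoostClassifier(random_state=seed),
    }
    return {
        name: clf.fit(x_train, y_train).score(x_test, y_test)
        for name, clf in classifiers.items()
    }


# Assumption: synthetic stand-in data, not the notebook's BERT vectors.
demo_x, demo_y = make_classification(n_samples=300, n_features=20, random_state=0)
scores = compare_classifiers(demo_x, demo_y)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In the notebook itself, `vecs` and `labels` would take the place of the synthetic arrays. Scoring every classifier on the same split keeps the comparison fair.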

In [2]:
# Based on tutorial at
# http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
# and including some code from there

!pip install transformers

import numpy as np
import pandas as pd
import torch
import nltk
import transformers as ppb
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Location of SST2 sentiment dataset
SST2_LOC = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'
WEIGHTS = 'distilbert-base-uncased'
# Performance on whole 6920 sentence set is very similar, but takes rather longer
SET_SIZE = 2000


# Download the dataset from its Github location, return as a Pandas dataframe
def get_dataframe():
    df = pd.read_csv(SST2_LOC, delimiter='\t', header=None)
    return df[:SET_SIZE]

# Extract just the labels from the dataframe
def get_labels(df):
    return df[1]

# Get a trained tokenizer for use with BERT
def get_tokenizer():
    return ppb.DistilBertTokenizer.from_pretrained(WEIGHTS)

# Convert the sentences into lists of tokens
def get_tokens(dataframe, tokenizer):
    return dataframe[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

# We want the sentences to all be the same length; pad with 0's to make it so
def pad_tokens(tokenized):
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
    return padded

# Grab a trained DistilBERT model
def get_model():
    return ppb.DistilBertModel.from_pretrained(WEIGHTS)

# This step takes a little while, since it actually runs the model on all sentences.
# Get model with get_model(), 0-padded token lists with pad_tokens() on get_tokens().
# Only returns the [CLS] vectors representing the whole sentence, corresponding to first token.
def get_bert_sentence_vectors(model, padded_tokens):
    # Mask the 0's padding from attention - it's meaningless
    mask = torch.tensor(np.where(padded_tokens != 0, 1, 0))
    with torch.no_grad():
        word_vecs = model(torch.tensor(padded_tokens).to(torch.int64), attention_mask=mask)
    # First vector is for [CLS] token, represents the whole sentence
    return word_vecs[0][:,0,:].numpy()


# To separate into train and test:
# train_features, test_features, train_labels, test_labels = train_test_split(vecs, labels)
def train_knn(train_features, train_labels):
    knc = KNeighborsClassifier()
    knc.fit(train_features, train_labels)
    return knc

# General purpose scikit-learn classifier evaluator.  The classifier is trained with .fit()
def evaluate(classifier, test_features, test_labels):
    return classifier.score(test_features, test_labels)



In [3]:
df = get_dataframe()
df.head()

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1


In [4]:
labels = get_labels(df)
tokenizer = get_tokenizer()
tokens = get_tokens(df, tokenizer)
padded = pad_tokens(tokens)
model = get_model()
vecs = get_bert_sentence_vectors(model, padded)

train_features, test_features, train_labels, test_labels = train_test_split(vecs, labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
knn = train_knn(train_features, train_labels)
print(evaluate(knn, test_features, test_labels))

0.736


In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

# (1) TODO:  Train random forest classifier (default settings okay)

forest = RandomForestClassifier(n_estimators=200).fit(train_features, train_labels)
forest.score(test_features,test_labels)

# (2) TODO:  Train boosted classifier (default settings okay)


0.816

**(2, 6 points)** Now, try improving the performance of both methods by using five times as many trees as the default in both cases.  Regardless of whether there's improvement, this is worthwhile to check.  Use the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html).
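
One way to get "five times the default" without hard-coding a number is to read the default `n_estimators` from a freshly constructed estimator. The sketch below does that for both classes. The variable names are made up for illustration, and synthetic data again stands in for the BERT vectors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Read the library defaults instead of assuming them; they can change
# between scikit-learn releases.
rf_default = RandomForestClassifier().n_estimators
ada_default = AdaBoostClassifier().n_estimators

big_forest = RandomForestClassifier(n_estimators=5 * rf_default, random_state=0)
big_boost = AdaBoostClassifier(n_estimators=5 * ada_default, random_state=0)

# Assumption: synthetic stand-in data, not the notebook's BERT vectors.
demo_x, demo_y = make_classification(n_samples=200, n_features=20, random_state=0)
big_forest.fit(demo_x, demo_y)
big_boost.fit(demo_x, demo_y)

print("forest trees requested:", big_forest.n_estimators)
print("boosting rounds requested:", big_boost.n_estimators)
```

For AdaBoost, `n_estimators` is an upper bound: boosting can stop early if the training data is fit perfectly, so the fitted ensemble may contain fewer estimators than requested.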

In [None]:
# TODO:  Train random forest classifier with 5x trees

# TODO:  Train boosted classifier with 5x trees

**(3, 6 points)** What's one reason increasing the number of trees in the random forest could result in better performance?

**TODO**
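
One hedged way to build intuition for this question is to treat each tree as an independent voter that is right with some fixed probability and to compute how often the majority is right. Both the independence and the fixed 0.6 accuracy are simplifying assumptions made only for this sketch, and `majority_vote_accuracy` is a hypothetical helper name.

```python
from math import comb


def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right.

    Uses the binomial distribution; n_trees should be odd to avoid ties.
    """
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Assumption: every tree is right 60% of the time, independently of the others.
for n in (1, 11, 101, 501):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Real trees in a forest are trained on overlapping data and are correlated, so the actual gain from adding trees is smaller than this idealized calculation suggests, and it levels off.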

Removing stopwords and lemmatizing are two steps that aren't supposed to be necessary with a word vector creator like BERT, which trains on the full sentences.  But we can still experiment.

**(4, 10 points)** Fill in the code for lemma_and_stop(), which should take a string and return a list or WordList that has the stop words removed and the other words lemmatized.  Then run df_lemmatize() on the original dataframe, and carry out the full experiment with the larger random forest doing the learning and classification.
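
The shape of such a function can be sketched without any downloads by passing in the stop list and the lemmatizer. Everything here is a stand-in and is labeled as such: `DEMO_STOPS` is a tiny made-up stop list, `demo_lemmatize` is a toy rule rather than a real lemmatizer, and the two extra parameters are not part of the signature the assignment asks for.

```python
# Assumption: tiny stand-in stop list; NLTK's English list is much longer.
DEMO_STOPS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}


def demo_lemmatize(word):
    """Toy stand-in lemmatizer: strips a plural 's'. Not a real lemmatizer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def lemma_and_stop(my_str, stop_words=DEMO_STOPS, lemmatize=demo_lemmatize):
    """Drop stop words from my_str, then lemmatize the remaining words."""
    words = my_str.lower().split()
    return [lemmatize(w) for w in words if w not in stop_words]


print(lemma_and_stop("the movies are a triumph of style"))
# → ['movie', 'triumph', 'style']
```

In the notebook, the NLTK `stops` list and TextBlob's `Word(w).lemmatize()` could be supplied in place of the stand-ins. Splitting on whitespace is adequate here only because the SST2 sentences are already lowercased and space-tokenized.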

In [None]:
from textblob import TextBlob

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
stops = stopwords.words('english')

def df_lemmatize(df):
  for index in df.index:
    df.loc[index, 0] = ' '.join(lemma_and_stop(df.loc[index,0]))
  return df

# TODO
def lemma_and_stop(my_str):
  # TODO
  raise NotImplementedError("lemma_and_stop() is left as an exercise")

In [None]:
df = get_dataframe()
df = df_lemmatize(df)

In [None]:
labels = get_labels(df)
tokenizer = get_tokenizer()
tokens = get_tokens(df, tokenizer)
padded = pad_tokens(tokens)
model = get_model()
vecs = get_bert_sentence_vectors(model, padded)
train_features, test_features, train_labels, test_labels = train_test_split(vecs, labels)

In [None]:
# TODO: Random forest classifier working with the train & test features

**(5, 16 points)** Now we'll try doing some sentence classification over the web, although we won't use the BERT classifier just yet, since it is a little fiddly to use.  Use the requests module and Beautiful Soup to download the page https://www.rottentomatoes.com/m/shang_chi_and_the_legend_of_the_ten_rings/reviews?intcmp=rt-scorecard_tomatometer-reviews which contains many review snippets for the Marvel movie Shang-Chi and the Legend of the Ten Rings.  (This will classify some HTML junk as sentences, too.)  Use TextBlob to break the page text into sentences, and for each sentence, print the sentence along with its classification from the .sentiment attribute of the sentence TextBlob (*not* BERT).

After you get the initial classified Beautiful Soup results, clean the results by requiring your sentences to contain the words "Original Score" in order to print them.  (This is admittedly a hack, and one that doesn't work perfectly.)  Just turn in this version of the code that produces cleaner results.
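
The steps above could be sketched roughly as follows. The function names are hypothetical, the third-party imports sit inside the function so the small filter helper can be used on its own, and nothing here is verified against the live page, whose layout may have changed since the assignment was written.

```python
REVIEWS_URL = (
    "https://www.rottentomatoes.com/m/"
    "shang_chi_and_the_legend_of_the_ten_rings/reviews"
    "?intcmp=rt-scorecard_tomatometer-reviews"
)


def has_original_score(sentence_text):
    """The cleaning hack: keep only sentences that mention 'Original Score'."""
    return "Original Score" in sentence_text


def page_sentences(url):
    """Download a page, strip the HTML, and split the text into sentences."""
    # Imported here so has_original_score() works without these packages.
    import requests
    from bs4 import BeautifulSoup
    from textblob import TextBlob

    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text()
    return TextBlob(text).sentences


def print_review_sentiments(url=REVIEWS_URL):
    """Print TextBlob's sentiment next to each sentence that passes the filter."""
    for sentence in page_sentences(url):
        if has_original_score(str(sentence)):
            print(sentence.sentiment, str(sentence))
```

Calling `print_review_sentiments()` performs the download, so it needs network access and the NLTK `punkt` data that TextBlob uses for sentence splitting. If the site no longer includes the phrase "Original Score", the filter will print nothing.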

In [None]:
# TODO get Shang Chi reviews site, use Beautiful Soup get_text, and
# print a sentiment for each sentence

import requests
from bs4 import BeautifulSoup

#TODO


**(6, 12 points)** Now we want to bridge the BERT code and the Beautiful Soup code, running our own sentiment classifier on the sentences.  To do that, finish the predict_from_sentence() function below, which should take a trained scikit-learn classifier and a sentence, and output a sentiment prediction.  (You may need to rename the random_forest classifier in the tests.)  Two functions to help bridge from BERT have been provided, so the work left is only a few lines; recall that the scikit-learn classifiers all have a predict() method that can make a prediction for a particular example.  When you pass the tests, copy your Beautiful Soup code from the previous code box and put it in the last code box, using predict_from_sentence() to make predictions instead of the TextBlob built-in functionality.  (Note that there you may need to use str() to cast your Sentences to strings.)  The result will still be a little messy, but you should be able to find review sentences that your sentiment analyzer has rated.
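
The bridging step itself is small: turn the sentence into a one-row feature array, then hand that row to predict(). The sketch below shows that shape with stand-ins so it runs without downloading BERT. `fake_vectorize` and `FakeClassifier` are invented for illustration, and the extra `vectorize` parameter is an assumption that is not in the assignment's signature.

```python
def predict_from_sentence(clf, sentence, vectorize):
    """Turn one sentence into a feature row and ask the classifier for a label."""
    vecs = vectorize(sentence)  # expected shape: (1, n_features)
    return clf.predict(vecs)


# --- Hypothetical stand-ins so the sketch runs without BERT or scikit-learn ---
def fake_vectorize(sentence):
    """One made-up feature: 1.0 if the word 'like' appears, else 0.0."""
    return [[1.0 if "like" in sentence.lower().split() else 0.0]]


class FakeClassifier:
    """Mimics the scikit-learn predict() interface on the single fake feature."""

    def predict(self, rows):
        return [1 if row[0] > 0.5 else 0 for row in rows]


print(predict_from_sentence(FakeClassifier(), "I like this movie", fake_vectorize))
# → [1]
```

In the notebook, the provided get_bert_vecs_from_sentence() would play the role of `vectorize` and the trained random forest the role of `clf`. That helper reloads the tokenizer and model on every call, so classifying many scraped sentences this way will be slow.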

In [None]:
def get_tokens_from_sentence(sentence):
  df = pd.DataFrame([[sentence]])
  return get_tokens(df,get_tokenizer())

def get_bert_vecs_from_sentence(sentence):
  tokens = get_tokens_from_sentence(sentence)
  model = get_model()
  vecs =  get_bert_sentence_vectors(model, pad_tokens(tokens))
  return vecs

# TODO: Take a trained classifier and a sentence, turn the sentence into BERT
# vectors, and run the classifier on those vectors
def predict_from_sentence(clf, sentence):
  # TODO
  raise NotImplementedError("predict_from_sentence() is left as an exercise")

print(predict_from_sentence(random_forest, "I like this movie"))  # Expect [1]
print(predict_from_sentence(random_forest, "This terrible movie is terrible"))  # Expect [0]

In [None]:
# TODO:  final scraping code combined with BERT/random forest classification
# TODO get Shang Chi reviews site, use Beautiful Soup get_text, and
# print a sentiment for each sentence
