

The goals of this notebook are threefold:


1.   Explore word embedding
2.   Understand contextual word embedding using BERT
3.   Text classification with both traditional machine learning methods and deep learning methods

**A note about GPU**: Run this notebook on a GPU if possible; otherwise, training the deep learning models will be quite slow.

First, import the packages or modules required for this homework.

In [None]:
pip install tensorflow


In [None]:

import tensorflow as tf
import numpy as np
from tensorflow import keras

In [None]:
pip install tf_keras


## Part I: Explore Word Embedding (15%)

Word embeddings are useful representations of words that capture information about word meaning as well as location. They are a fundamental component of downstream NLP tasks, e.g., text classification. In this part, we will explore the embeddings produced by [GloVe (global vectors for word representation)](https://nlp.stanford.edu/projects/glove/). GloVe is similar to Word2Vec but differs in its underlying methodology: in GloVe, word embeddings are learned from global word-word co-occurrence statistics. Word2Vec and GloVe tend to produce vector-space embeddings that perform similarly in downstream NLP tasks.
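To make "global word-word co-occurrence statistics" concrete, here is a minimal sketch of how such counts can be collected from a toy corpus. The corpus, the window size, and the helper name `cooccurrence_counts` are illustrative assumptions; actual GloVe training additionally weights the counts and fits vectors to them, which is not shown here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look at neighbours up to `window` positions to the left and right
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (hypothetical, for illustration only)
toy_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("sat", "on")])   # 'on' is next to 'sat' in both sentences
print(counts[("cat", "dog")])  # these two never share a window
```

Words that occur in similar contexts (here 'cat' and 'dog') end up with similar rows of counts, which is the signal an embedding method can exploit.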




We first load the GloVe vectors.

In [None]:
import gensim.downloader as api
# download the model and return as object ready for use
glove_word_vectors = api.load('glove-wiki-gigaword-100')


Take a look at the vocabulary size and the dimensionality of the embedding space.

In [None]:
print('vocabulary size = ', len(glove_word_vectors.index_to_key))
print('embedding dimensionality = ', glove_word_vectors['happy'].shape)

vocabulary size =  400000
embedding dimensionality =  (100,)



What is an embedding exactly?

  Ans. **_In the context of natural language processing (NLP), word embeddings are numerical representations of words where words with similar meanings have similar vector representations. Word embeddings are widely used in NLP tasks such as text classification, sentiment analysis, machine translation, and more, as they provide a dense representation of words that can capture semantic relationships and similarities between words._**

In [None]:
# Check word embedding for 'happy'
# You can access the embedding of a word with glove_word_vectors[word] if word
# is in the vocabulary
glove_word_vectors['happy']

array([-0.090436 ,  0.19636  ,  0.29474  , -0.47706  , -0.80436  ,
        0.3078   , -0.55205  ,  0.58453  , -0.17056  , -0.84846  ,
        0.19528  ,  0.23671  ,  0.46827  , -0.58977  , -0.12163  ,
       -0.24697  , -0.072944 ,  0.17259  , -0.0485   ,  0.9527   ,
        0.50629  ,  0.58497  , -0.19367  , -0.45459  , -0.031095 ,
        0.51633  , -0.24052  , -0.1007   ,  0.53627  ,  0.024225 ,
       -0.50162  ,  0.73692  ,  0.49468  , -0.34744  ,  0.89337  ,
        0.057439 , -0.19127  ,  0.39333  ,  0.21182  , -0.89837  ,
        0.078704 , -0.16344  ,  0.45261  , -0.41096  , -0.19499  ,
       -0.13489  , -0.016313 , -0.021849 ,  0.17136  , -1.2413   ,
        0.079503 , -0.91144  ,  0.35699  ,  0.36289  , -0.24934  ,
       -2.1196   ,  0.14534  ,  0.52964  ,  0.90134  ,  0.033603 ,
        0.022809 ,  0.70625  , -1.0362   , -0.59809  ,  0.70592  ,
       -0.072793 ,  0.67033  ,  0.52763  , -0.47807  , -0.67374  ,
        0.36632  , -0.38284  , -0.10349  , -0.6402   ,  0.1810

With word embeddings learned from GloVe or Word2Vec, words with similar semantic meanings tend to have vectors that are close together. Please code and calculate the **cosine similarities** between words based on their embeddings (i.e., word vectors).

For each of the following occupation words, compute its cosine similarity to 'woman' and to 'man', and check which of the two it is more similar to.

*occupation = {homemaker, nurse, receptionist, librarian, socialite, hairdresser, nanny, bookkeeper, stylist, housekeeper, maestro, skipper, protege, philosopher, captain, architect, financier, warrior, broadcaster, magician}*
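
For reference, the cosine similarity between two word vectors is the standard definition below. The symbols $x$, $y$, and $d$ (the embedding dimensionality, 100 for the vectors loaded above) are notation introduced for this note.

```latex
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
           = \frac{\sum_{i=1}^{d} x_i y_i}
                  {\sqrt{\sum_{i=1}^{d} x_i^2} \, \sqrt{\sum_{i=1}^{d} y_i^2}}
```

The value ranges from -1 to 1; larger values mean the two vectors point in more similar directions.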

**Inline Question #1:**
- Fill in the table below with cosine similarities between words in occupation list and {woman, man}. Please show only two digits after decimal.
- Which words are more similar to 'woman' than to 'man'?

  **_['homemaker', 'nurse', 'receptionist', 'librarian', 'socialite', 'hairdresser', 'nanny', 'bookkeeper', 'stylist', 'housekeeper']_**
- Which words are more similar to 'man' than to 'woman'?

  **_['maestro', 'skipper', 'protege', 'philosopher', 'captain', 'architect', 'financier', 'warrior', 'broadcaster', 'magician']_**
- Do you see any issue here? What do you think might cause these issues?

    **_The issue here is that the results seem to reinforce gender stereotypes, where traditionally female-associated occupations like "homemaker," "nurse," and "hairdresser" are more similar to the word "woman," while traditionally male-associated occupations like "maestro," "captain," and "warrior" are more similar to the word "man."_**
    
    **_These issues likely arise from biases present in the text used to train the GloVe word embeddings. If the training text contains biased or stereotypical associations, they will be reflected in the resulting embeddings. The effect is likely stronger when the training data lacks diverse representations of different genders in various occupations._**

**Your Answer:**

| `similarity`|    woman  |      man     |
|-------------|-----------|--------------|
| homemaker   |   0.43    |    0.24      |
| nurse       |   0.61    |    0.46      |
| receptionist|   0.34    |    0.19      |
| librarian   |   0.34    |    0.23      |
| socialite   |   0.42    |    0.27      |
| hairdresser |   0.39    |    0.26      |
| nanny       |   0.36    |    0.29      |
| bookkeeper  |   0.21    |    0.14      |
| stylist     |   0.31    |    0.25      |
| housekeeper |   0.46    |    0.31      |
| maestro     |  -0.02    |    0.14      |
| skipper     |   0.15    |    0.34      |
| protege     |   0.12    |    0.20      |
| philosopher |   0.23    |    0.28      |
| captain     |   0.31    |    0.53      |
| architect   |   0.22    |    0.30      |
| financier   |   0.14    |    0.26      |
| warrior     |   0.39    |    0.51      |
| broadcaster |   0.23    |    0.25      |
| magician    |   0.27    |    0.38      |
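
One way to summarize the table is a per-word gap, the similarity to 'woman' minus the similarity to 'man': positive values lean toward 'woman', negative values toward 'man'. The sketch below uses a few rounded rows copied from the table above. The helper name `gender_gap` is an assumption made for illustration, and the gap is a simple descriptive statistic rather than an established bias metric.

```python
# (similarity to 'woman', similarity to 'man'), copied from the table above
table = {
    "homemaker": (0.43, 0.24),
    "nurse": (0.61, 0.46),
    "captain": (0.31, 0.53),
    "broadcaster": (0.23, 0.25),
}

def gender_gap(sim_woman, sim_man):
    """Positive: closer to 'woman'; negative: closer to 'man'."""
    return round(sim_woman - sim_man, 2)

gaps = {word: gender_gap(*sims) for word, sims in table.items()}

# Sort from most 'woman'-leaning to most 'man'-leaning
for word, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:12s} {gap:+.2f}")
```

Words with a gap near zero (such as 'broadcaster') are close to neutral under this measure, while larger magnitudes indicate a stronger association with one of the two words.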


In [None]:
################################################################################
# TODO: Fill in your code                                                      #
################################################################################
import numpy as np

def cosine_similarity(x, y):
    dot = np.dot(x, y)
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)

    cos_sim = dot / (norm_x * norm_y)

    return cos_sim

occupation = ['homemaker', 'nurse', 'receptionist', 'librarian', 'socialite', 'hairdresser', 'nanny', 'bookkeeper', 'stylist', 'housekeeper', 'maestro', 'skipper', 'protege', 'philosopher', 'captain', 'architect', 'financier', 'warrior', 'broadcaster', 'magician']

similarity_to_woman = {}
similarity_to_man = {}

for i in occupation:
    x = glove_word_vectors[i]
    y_woman = glove_word_vectors['woman']
    y_man = glove_word_vectors['man']

    similarity_to_woman[i] = cosine_similarity(x, y_woman)
    similarity_to_man[i] = cosine_similarity(x, y_man)

for i in occupation:
    print(f"Occupation: {i}")
    print(f"Similarity to 'woman': {similarity_to_woman[i]:.2f}")
    print(f"Similarity to 'man': {similarity_to_man[i]:.2f}")
    print()


# Print the occupations that are more similar to 'woman' than to 'man',
# reusing the similarities computed above
for i in occupation:
    if similarity_to_woman[i] > similarity_to_man[i]:
        print(i)



Occupation: homemaker
Similarity to 'woman': 0.43
Similarity to 'man': 0.24

Occupation: nurse
Similarity to 'woman': 0.61
Similarity to 'man': 0.46

Occupation: receptionist
Similarity to 'woman': 0.34
Similarity to 'man': 0.19

Occupation: librarian
Similarity to 'woman': 0.34
Similarity to 'man': 0.23

Occupation: socialite
Similarity to 'woman': 0.42
Similarity to 'man': 0.27

Occupation: hairdresser
Similarity to 'woman': 0.39
Similarity to 'man': 0.26

Occupation: nanny
Similarity to 'woman': 0.36
Similarity to 'man': 0.29

Occupation: bookkeeper
Similarity to 'woman': 0.21
Similarity to 'man': 0.14

Occupation: stylist
Similarity to 'woman': 0.31
Similarity to 'man': 0.25

Occupation: housekeeper
Similarity to 'woman': 0.46
Similarity to 'man': 0.31

Occupation: maestro
Similarity to 'woman': -0.02
Similarity to 'man': 0.14

Occupation: skipper
Similarity to 'woman': 0.15
Similarity to 'man': 0.34

Occupation: protege
Similarity to 'woman': 0.12
Similarity to 'man': 0.20

Occupa

In [None]:

similar_to_woman = []
similar_to_man = []

for i in occupation:
    # Scalar names that do not shadow the dictionaries from the previous cell
    sim_woman = cosine_similarity(glove_word_vectors[i], glove_word_vectors['woman'])
    sim_man = cosine_similarity(glove_word_vectors[i], glove_word_vectors['man'])
    if sim_woman > sim_man:
        similar_to_woman.append(i)
    elif sim_woman < sim_man:
        similar_to_man.append(i)

print(similar_to_woman)
print(similar_to_man)

['homemaker', 'nurse', 'receptionist', 'librarian', 'socialite', 'hairdresser', 'nanny', 'bookkeeper', 'stylist', 'housekeeper']
['maestro', 'skipper', 'protege', 'philosopher', 'captain', 'architect', 'financier', 'warrior', 'broadcaster', 'magician']


## Part II: Understand contextual word embedding using BERT

A big difference between Word2Vec and BERT is that Word2Vec learns context-free word representations, i.e., the embedding for 'orange' is the same in "I love eating oranges" and in "The sky turned orange". BERT, on the other hand, produces contextual word representations, i.e., the embeddings for the same word differ across contexts.

For example, let us compare the context-based embedding vectors for 'orange' in the following three sentences using BERT:
* "I love eating oranges"
* "My favorite fruits are oranges and apples"
* "The sky turned orange"

As in "Lab 5 BERT", we use the BERT model and tokenizer from the Hugging Face Transformers library ([1](https://huggingface.co/course/chapter1/1), [2](https://huggingface.co/docs/transformers/quicktour)).

In [None]:
# Note that we need to install the latest version of transformers
# Due to problems we encountered in class and reported here
# https://discuss.huggingface.co/t/pretrain-model-not-accepting-optimizer/76209
# https://github.com/huggingface/transformers/issues/29470

!pip install --upgrade transformers
import transformers
print(transformers.__version__)



4.39.3


In [None]:
from transformers import BertTokenizer, TFBertModel




We use 'bert-base-cased' from Hugging Face as the underlying BERT model, together with its associated tokenizer.

In [None]:
bert_model = TFBertModel.from_pretrained('bert-base-cased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')





Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w


In [None]:
example_sentences = ["I love eating oranges",
                     "My favorite fruits are oranges and apples",
                     "The sky turned orange"]

Let us start by tokenizing the example sentences.

In [None]:
# Check how BERT tokenizes each sentence
# This helps us identify the location of 'orange' in the token list
for sen in example_sentences:
  print(bert_tokenizer.tokenize(sen))

['I', 'love', 'eating', 'orange', '##s']
['My', 'favorite', 'fruits', 'are', 'orange', '##s', 'and', 'apples']
['The', 'sky', 'turned', 'orange']


Notice that the prefix '##' indicates that the token is a continuation of the previous one. The token lists also help us identify the location of 'orange', e.g., 'orange' is the 4th token in the first sentence. Note that the tokenize() function only splits a text into WordPiece tokens; it does not add the '[CLS]' (classification) or '[SEP]' (separator) tokens.
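
To illustrate the '##' convention, the following sketch glues continuation pieces back onto the preceding token. `merge_wordpieces` is a hypothetical helper written for this note, not part of the tokenizer API, and it assumes the plain '##' prefix scheme seen in the output above.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens back into whole words using the '##' prefix."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word
            words[-1] += token[2:]
        else:
            words.append(token)
    return words

print(merge_wordpieces(['I', 'love', 'eating', 'orange', '##s']))
print(merge_wordpieces(['My', 'favorite', 'fruits', 'are', 'orange', '##s', 'and', 'apples']))
```

Merging is only for inspection; the model itself operates on the individual pieces, which is why 'orange' and '##s' receive separate embeddings.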

Next, we use the tokenizer to convert the example sentences into the input that the BERT model expects.

In [None]:
bert_inputs = bert_tokenizer(example_sentences,
                             padding=True,
                             return_tensors='tf')

bert_inputs

{'input_ids': <tf.Tensor: shape=(3, 10), dtype=int32, numpy=
array([[  101,   146,  1567,  5497,  5925,  1116,   102,     0,     0,
            0],
       [  101,  1422,  5095, 11669,  1132,  5925,  1116,  1105, 22888,
          102],
       [  101,  1109,  3901,  1454,  5925,   102,     0,     0,     0,
            0]])>, 'token_type_ids': <tf.Tensor: shape=(3, 10), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(3, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])>}

So there are actually three outputs: the input_ids (each sequence starts with 101, the id of the '[CLS]' token, and ends with 102, the id of '[SEP]', before any padding), the token_type_ids, which are useful when the input has distinct segments, and the attention_mask, which is used to mask out padding tokens.
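
To show how padding and the attention mask relate, the following sketch re-creates them by hand from the unpadded ids in the output above. `pad_batch` is a hypothetical helper written for this note, not part of the tokenizer API; the tokenizer does the equivalent internally when called with `padding=True`.

```python
def pad_batch(id_lists, pad_id=0):
    """Pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(ids) for ids in id_lists)
    input_ids, attention_mask = [], []
    for ids in id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Unpadded ids copied from the output above (101 = [CLS], 102 = [SEP])
unpadded = [
    [101, 146, 1567, 5497, 5925, 1116, 102],
    [101, 1422, 5095, 11669, 1132, 5925, 1116, 1105, 22888, 102],
    [101, 1109, 3901, 1454, 5925, 102],
]
input_ids, attention_mask = pad_batch(unpadded)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Every sequence is padded to the length of the longest one in the batch (10 here), and the mask tells the model which positions to ignore.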

Please refer to our Lab 4 for more details about input_ids, token_type_ids, and attention_masks.

More resources:
*    https://huggingface.co/docs/transformers/preprocessing
*    https://huggingface.co/docs/transformers/tokenizer_summary

Now, let us get the BERT encoding of our example sentences.

In [None]:
bert_outputs = bert_model(bert_inputs)

print('shape of first output: \t\t', bert_outputs[0].shape)
print('shape of second output: \t', bert_outputs[1].shape)

shape of first output: 		 (3, 10, 768)
shape of second output: 	 (3, 768)


There are two outputs here: one with dimensions [3, 10, 768] and one with [3, 768]. The first one, [batch_size, sequence_length, embedding_size], is the output of the last layer of the BERT model and contains the contextual embeddings of the tokens in the input sequence. The second one, [batch_size, embedding_size], is the pooled output: the last-layer embedding of the first token of each sequence (i.e., the classification token), further processed by a dense layer with a tanh activation.
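
The relationship between the two output shapes can be sketched with NumPy. In the Hugging Face implementation, the second (pooled) output is the first token's last-layer vector passed through a dense layer with a tanh activation. The arrays below are random stand-ins, not real BERT weights or outputs, and the names `W` and `b` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, seq_len, hidden = 3, 10, 768
# Stand-in for the last hidden state; random values, not real BERT outputs
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden))

# Hypothetical pooler parameters (in BERT these are learned weights)
W = rng.normal(scale=0.02, size=(hidden, hidden))
b = np.zeros(hidden)

cls_embedding = last_hidden_state[:, 0, :]   # first token of every sequence
pooled = np.tanh(cls_embedding @ W + b)      # dense layer followed by tanh

print(last_hidden_state.shape)  # (3, 10, 768)
print(pooled.shape)             # (3, 768)
```

Slicing position 0 along the sequence axis is what removes the `sequence_length` dimension.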

Note that you can also get the first output through bert_outputs.last_hidden_state (see below; also check https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bert#transformers.TFBertModel).

We need the first output to get contextualized embeddings for 'orange' in each sentence.

In [None]:
bert_outputs[0]

<tf.Tensor: shape=(3, 10, 768), dtype=float32, numpy=
array([[[ 0.72526133,  0.10203083, -0.24569108, ..., -0.04899119,
          0.39186496,  0.02921054],
        [ 0.22793137,  0.12613058,  0.12215026, ...,  0.31073734,
         -0.13684127,  0.33928937],
        [ 0.11651945,  0.20991722, -0.62462795, ...,  0.7515276 ,
         -0.7327964 ,  0.05906113],
        ...,
        [-0.10604811, -0.1910808 , -0.11248238, ...,  0.13676174,
          0.0700286 ,  0.19438452],
        [ 0.05706045, -0.29356378, -0.03861157, ...,  0.01285999,
          0.27537084,  0.15846887],
        [ 0.24304445, -0.06842123,  0.09176344, ..., -0.161979  ,
          0.24152774,  0.00447912]],

       [[ 0.54033554, -0.11092855, -0.12229536, ..., -0.16148904,
          0.23800467, -0.03805562],
        [-0.08707761, -0.13345687,  0.35856643, ..., -0.06155135,
          0.13525176,  0.38933346],
        [-0.01516274, -0.41845623, -0.13030319, ...,  0.12552255,
         -0.4938446 ,  0.521323  ],
        ...,


In [None]:
bert_outputs.last_hidden_state

<tf.Tensor: shape=(3, 10, 768), dtype=float32, numpy=
array([[[ 0.72526133,  0.10203083, -0.24569108, ..., -0.04899119,
          0.39186496,  0.02921054],
        [ 0.22793137,  0.12613058,  0.12215026, ...,  0.31073734,
         -0.13684127,  0.33928937],
        [ 0.11651945,  0.20991722, -0.62462795, ...,  0.7515276 ,
         -0.7327964 ,  0.05906113],
        ...,
        [-0.10604811, -0.1910808 , -0.11248238, ...,  0.13676174,
          0.0700286 ,  0.19438452],
        [ 0.05706045, -0.29356378, -0.03861157, ...,  0.01285999,
          0.27537084,  0.15846887],
        [ 0.24304445, -0.06842123,  0.09176344, ..., -0.161979  ,
          0.24152774,  0.00447912]],

       [[ 0.54033554, -0.11092855, -0.12229536, ..., -0.16148904,
          0.23800467, -0.03805562],
        [-0.08707761, -0.13345687,  0.35856643, ..., -0.06155135,
          0.13525176,  0.38933346],
        [-0.01516274, -0.41845623, -0.13030319, ...,  0.12552255,
         -0.4938446 ,  0.521323  ],
        ...,


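Now, we get the embeddings of 'orange' in each sentence by finding the position of the 'orange' token in the embedding output and extracting the corresponding vectors. The positions are shifted by one relative to the tokenize() output because of the leading '[CLS]' token: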

In [None]:
orange_1 = bert_outputs[0][0, 4]
orange_2 = bert_outputs[0][1, 5]
orange_3 = bert_outputs[0][2, 4]

oranges = [orange_1, orange_2, orange_3]
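
The hard-coded positions above can also be derived from the tokenize() output. The sketch below assumes exactly one '[CLS]' token is prepended to each sequence; `find_token_position` is a hypothetical helper, and it only finds the first occurrence of the target token.

```python
def find_token_position(tokens, target):
    """Index of `target` in the model input, assuming one leading [CLS] token."""
    # tokens.index raises ValueError if `target` is absent
    return tokens.index(target) + 1

sentences_tokens = [
    ['I', 'love', 'eating', 'orange', '##s'],
    ['My', 'favorite', 'fruits', 'are', 'orange', '##s', 'and', 'apples'],
    ['The', 'sky', 'turned', 'orange'],
]
positions = [find_token_position(t, 'orange') for t in sentences_tokens]
print(positions)  # [4, 5, 4]
```

These match the indices used to extract `orange_1`, `orange_2`, and `orange_3`.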

We calculate pair-wise cosine similarities:

In [None]:
def cosine_similarities(vecs):
    for v_1 in vecs:
        similarities = ''
        for v_2 in vecs:
            similarities += ('\t' + str(np.dot(v_1, v_2)/
                np.sqrt(np.dot(v_1, v_1) * np.dot(v_2, v_2)))[:4])
        print(similarities)

In [None]:
cosine_similarities(oranges)

	1.0	0.91	0.69
	0.91	1.0	0.66
	0.69	0.66	1.0


The similarity metrics make sense: the 'orange' in "The sky turned orange" refers to a color rather than a fruit, so its embedding differs from the other two (0.69 and 0.66 versus 0.91).
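
As an alternative to the nested loops, the whole similarity table can be computed in one step by normalizing the vectors and taking a matrix product. This is a sketch with toy vectors; `cosine_similarity_matrix` is a name introduced here, not a function used elsewhere in the notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    X = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; dot products then equal cosines
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_unit @ X_unit.T

# Toy vectors (hypothetical, not BERT embeddings)
toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
print(np.round(cosine_similarity_matrix(toy), 2))
```

Applied to the three 'orange' embeddings (converted to NumPy arrays), this should reproduce the table above, up to small differences because the loop version truncates the printed string rather than rounding.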

Next, please compare the contextual embedding vectors of 'bank' in the following four sentences:


*   "I need to bring my money to the bank today"
*   "I will need to bring my money to the bank tomorrow"
*   "I had to bank into a turn"
*   "The bank teller was very nice"


**Inline Question #1:**

- Please calculate the pair-wise cosine similarities between 'bank' in the four sentences and fill in the table below. (Note: bank_i represents 'bank' in the i-th sentence.)
- Please explain the results. Do they make sense?
  
  **_Overall, the results seem reasonable. For "bank" in the third sentence, the cosine similarities to the other three occurrences are lower (0.59, 0.59, 0.62), indicating that it is less similar to them. This makes sense because "bank" is used in a different sense here (a maneuver: to tilt sideways in making a turn)._**
  
  **_The occurrences of "bank" in the first, second, and fourth sentences all refer to a financial institution. The cosine similarities among them are relatively high (0.86 to 0.99), indicating that the contextual embeddings of "bank" in these sentences are close to each other._**

**Your Answer:**

| `similarity`|  bank_1  |  bank_2  |  bank_3  |  bank_4  |
|-------------|----------|----------|----------|----------|
| bank_1      |   1.0    |   0.99   |   0.59   |   0.86   |
| bank_2      |   0.99   |   1.0    |   0.59   |   0.87   |
| bank_3      |   0.59   |   0.59   |   1.0    |   0.62   |
| bank_4      |   0.86   |   0.87   |   0.62   |   1.0    |

In [None]:
################################################################################
# TODO: Fill in your code                                                      #
################################################################################
bank_sentences = ["I need to bring my money to the bank today",
                  "I will need to bring my money to the bank tomorrow",
                  "I had to bank into a turn",
                  "The bank teller was very nice"]

for sen in bank_sentences:
  print(bert_tokenizer.tokenize(sen))



['I', 'need', 'to', 'bring', 'my', 'money', 'to', 'the', 'bank', 'today']
['I', 'will', 'need', 'to', 'bring', 'my', 'money', 'to', 'the', 'bank', 'tomorrow']
['I', 'had', 'to', 'bank', 'into', 'a', 'turn']
['The', 'bank', 'tell', '##er', 'was', 'very', 'nice']


In [None]:
bert_inputs = bert_tokenizer(bank_sentences,
                             padding=True,
                             return_tensors='tf')

bert_outputs = bert_model(bert_inputs)

bank_1 = bert_outputs[0][0, 9]
bank_2 = bert_outputs[0][1, 10]
bank_3 = bert_outputs[0][2, 4]
bank_4 = bert_outputs[0][3, 2]

bank = [bank_1, bank_2, bank_3, bank_4]

In [None]:
cosine_similarities(bank)

	1.0	0.99	0.59	0.86
	0.99	1.0	0.59	0.87
	0.59	0.59	1.0	0.62
	0.86	0.87	0.62	1.0


## Part III: Text classification

In this part, you will build text classifiers that try to infer whether tweets from [@realDonaldTrump](https://twitter.com/realDonaldTrump) were written by Trump himself or by a staff person.
This is an example of binary classification on a text dataset.

It is known that Donald Trump uses an Android phone, and it has been observed that some of his tweets come from Android while others come from other devices (most commonly iPhone). It is widely believed that Android tweets are written by Trump himself, while iPhone tweets are written by other staff. For more information, you can read this [blog post by David Robinson](http://varianceexplained.org/r/trump-tweets/), written prior to the 2016 election, which finds a number of differences in the style and timing of tweets published under these two devices. (Some tweets are written from other devices, but for simplicity the dataset for this assignment is restricted to these two.)

This is a classification task known as "authorship attribution", which is the task of inferring the author of a document when the authorship is unknown. We will see how accurately this can be done with linear classifiers using word features.

You might find it familiar: Yes! We are using the same data set as your homework 2 from MSBC 5180.

### Tasks

In this section, you will build two text classifiers: one with a traditional machine learning method that you studied in MSBC.5190 and one with a deep learning method.


*   For the first classifier, you can use any non-deep learning based methods. You can use your solution to Homework 2 of MSBC 5180 here.
*   For the second classifier, you may try the following methods
    *    Fine-tune BERT (similar to our Lab 5 Fine-tune BERT for Sentiment Analysis)
    *    Use pre-trained word embedding (useful to check: https://keras.io/examples/nlp/pretrained_word_embeddings/)
    *    Train a deep neural network (e.g., CNN, RNN, Bi-LSTM) from scratch, similar to notebooks from our textbook:
        *    https://github.com/the-deep-learners/deep-learning-illustrated/blob/master/notebooks/dense_sentiment_classifier.ipynb
        *    https://github.com/the-deep-learners/deep-learning-illustrated/blob/master/notebooks/convolutional_sentiment_classifier.ipynb
        *    https://github.com/the-deep-learners/deep-learning-illustrated/blob/master/notebooks/rnn_sentiment_classifier.ipynb
        *    https://github.com/the-deep-learners/deep-learning-illustrated/blob/master/notebooks/lstm_sentiment_classifier.ipynb
        *    https://github.com/the-deep-learners/deep-learning-illustrated/blob/master/notebooks/bi_lstm_sentiment_classifier.ipynb
    *   There are also lots of useful resources on Keras website: https://keras.io/examples/nlp/

You may want to split the current training data into training and validation sets to help with model selection. Please do not use the test data for model selection.
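
A minimal sketch of such a split is shown below, using a shuffled hold-out set. The helper name `holdout_split`, the 20% validation fraction, and the toy data are assumptions for illustration; in the notebook the inputs would be the training texts and labels.

```python
import numpy as np

def holdout_split(texts, labels, val_fraction=0.2, seed=0):
    """Shuffle the examples and split them into training and validation parts."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(texts))
    n_val = int(round(val_fraction * len(texts)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return ([texts[i] for i in train_idx], labels[train_idx],
            [texts[i] for i in val_idx], labels[val_idx])

# Toy data (hypothetical stand-ins for the tweets and their labels)
toy_texts = [f"tweet {i}" for i in range(10)]
toy_labels = [0, 1] * 5
tr_x, tr_y, va_x, va_y = holdout_split(toy_texts, toy_labels, val_fraction=0.2)
print(len(tr_x), len(va_x))
```

scikit-learn's `train_test_split` offers the same functionality, including a `stratify` option that keeps the class balance similar in both parts.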



### Load the Data Set

#### Sample code to load raw text ####

Please download `tweets.train.tsv` and `tweets.test.tsv` from Canvas (Module Assignment) and upload them to Google Colab. Here we load raw text data to text_train and text_test.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

#training set
df_train = pd.read_csv('tweets.train.tsv', sep='\t', header=None)

text_train = df_train.iloc[0:, 1].values.tolist()
Y_train = df_train.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])

df_test = pd.read_csv('tweets.test.tsv', sep='\t', header=None)
text_test = df_test.iloc[0:, 1].values.tolist()
Y_test = df_test.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])

In [None]:
type(text_train)

list

Let us take a quick look at some training examples.

In [None]:
text_train[:5]

["My statement as to what's happening in Sweden was in reference to a story that was broadcast on _USERNAME_ concerning immigrants & Sweden.",
 'Will be having many meetings this weekend at The Southern White House. Big 5:00 P.M. speech in Melbourne, Florida. A lot to talk about!',
 "Don't believe the main stream (fake news) media.The White House is running VERY WELL. I inherited a MESS and am in the process of fixing it.",
 'Looking forward to the Florida rally tomorrow. Big crowd expected!',
 "'One of the most effective press conferences I've ever seen!' says Rush Limbaugh. Many agree.Yet FAKE MEDIA calls it differently! Dishonest"]

In [None]:
y_train[:5]

array([1, 1, 1, 1, 1])

#### Sample code to preprocess data for BERT (only needed if you decide to fine-tune BERT) ####

The pre-processing step is similar to Lab 5.

Feel free to discard it if you want to preprocess the data differently and use methods other than BERT.

In [None]:
# The longest text in the data has length 75, so we use it as max_length
max_length = 75
x_train = bert_tokenizer(text_train,
                         max_length=max_length,
                         truncation=True,
                         padding='max_length',
                         return_tensors='tf')

y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])

x_test = bert_tokenizer(text_test,
                        max_length=max_length,
                        truncation=True,
                        padding='max_length',
                        return_tensors='tf')

y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])

### 1: A traditional machine learning approach

Please implement your text classifier using a traditional machine learning method.

**Inline Question #1:**
- What machine learning model did you use?  **_Logistic Regression Classifier_**
- What are the features used in this model?  **_TF-IDF (Term Frequency-Inverse Document Frequency) vectors_**
- What is the model's performance on the test data?  **_Test accuracy: 0.89_**



In [None]:
################################################################################
# TODO: Fill in your codes                                                     #
################################################################################
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

pipeline.fit(text_train, y_train)



# Predict on training data
y_train_pred = pipeline.predict(text_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {train_accuracy:.2f}")
# testing data
y_pred = pipeline.predict(text_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.2f}")

Training Accuracy: 0.92
Test Accuracy: 0.89
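
The pipeline above relies on TF-IDF features. As a rough illustration of the idea, the sketch below computes the textbook variant (term frequency times the log of the inverse document frequency) on toy documents. It is not a reimplementation of `TfidfVectorizer`, whose defaults differ (for example, it smooths the IDF and L2-normalizes each document vector), so the numbers will not match scikit-learn's. The function name `tfidf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy documents (hypothetical)
docs = ["great rally great crowd", "great news today", "fake news media"]
weights = tfidf(docs)
print(weights[0])
```

In the first document, 'rally' receives a higher weight than 'great' even though 'great' occurs twice, because 'great' also appears in another document and is therefore less distinctive.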


### 2: A deep learning approach

Please implement your text classifier using a deep learning method.

**Inline Question #1:**
- What deep learning model did you use?
  
  **_BERT (Bidirectional Encoder Representations from Transformers), which is a pre-trained language model._**
- Please briefly explain the input, output, and layers (e.g., what does each layer do) of your model.

  * **_Input: The input to the BERT model is a sequence of tokens generated by a tokenizer. Each token represents a word or a subword (depending on the tokenization strategy used). These tokens are converted into embeddings, which are numerical representations of the words or subwords._**
  * **_Output: The output of the BERT model for this classification task is a single scalar logit per input sequence (num_labels=1 in the code). Applying a sigmoid to the logit gives the predicted probability that the input sequence belongs to the positive class (label 1, i.e., 'Android')._**
  * **_Layers: The BERT model is a transformer-based architecture with several layers:_**
      1. **_Embedding Layer: converts the input tokens into dense vector representations (embeddings)._**
      2. **_Encoder Layers: The BERT model has multiple encoder layers, each consisting of a multi-head self-attention mechanism and a feed-forward neural network. They capture the contextual relationships between tokens in the input sequence._**
      3. **_Pooler Layer: takes the final hidden state of the '[CLS]' token and passes it through a dense layer with a tanh activation, providing a single vector representation for the entire input sequence._**
      4. **_Classification Layer: A fully connected layer is added on top of the pooler layer to produce the final output for the sequence classification task._**


- What is the model's performance on the test data?
   **_Test accuracy: 0.92432_**
- Is it better or worse than Solution 1? What might be the cause?
  
  **_BERT performs better than Solution 1 (test accuracy of about 0.92 versus 0.89). BERT likely outperforms logistic regression because of its ability to capture complex patterns and relationships in the data, leveraging the transformer architecture and pre-training on large text corpora. With a significantly larger number of parameters, BERT can learn intricate features. Unlike logistic regression on TF-IDF features, BERT also considers contextual information from the input sequence, interpreting each word in the context of the entire sentence, which is crucial for tasks like sentiment analysis and text classification._**
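
Because the model is built with `num_labels=1` and trained with `BinaryCrossentropy(from_logits=True)`, it emits one raw logit per tweet rather than a probability. The sketch below shows how logits map to probabilities and class labels; the logit values are made up for illustration and are not model outputs.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

# Hypothetical logits for four tweets (not real model outputs)
logits = np.array([-2.0, 0.0, 0.3, 3.0])
probs = sigmoid(logits)
# A probability of at least 0.5 is equivalent to a logit of at least 0
pred = (probs >= 0.5).astype(int)

print(np.round(probs, 2))
print(pred)
```

With the labeling used in this notebook, a prediction of 1 corresponds to 'Android' and 0 to the other device.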



In [None]:
# The longest text in the data has length 75, so we use it as max_length
max_length = 75
x_train = bert_tokenizer(text_train,
                         max_length=max_length,
                         truncation=True,
                         padding='max_length',
                         return_tensors='tf')

y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])

x_test = bert_tokenizer(text_test,
                        max_length=max_length,
                        truncation=True,
                        padding='max_length',
                        return_tensors='tf')

y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])

In [None]:
################################################################################
# TODO: Fill in your code                                                      #
################################################################################

from transformers import TFBertForSequenceClassification

# Load pre-trained BERT with a classification head; num_labels=1 yields a
# single logit per input, which matches the binary cross-entropy loss below
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=1)

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = ['accuracy']
bert_model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

# Train; the test set is passed as validation_data only to monitor progress,
# not to select a model
history = bert_model.fit(x=dict(x_train), y=y_train, validation_data=(dict(x_test), y_test), epochs=2, batch_size=32)

# Evaluate on the training and test sets
train_loss, train_accuracy = bert_model.evaluate(dict(x_train), y_train)
print("Training Loss:", train_loss)
print("Training Accuracy:", train_accuracy)

test_loss, test_accuracy = bert_model.evaluate(dict(x_test), y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/2



Epoch 2/2
Training Loss: 0.1074032410979271
Training Accuracy: 0.9722329378128052
Test Loss: 0.21019971370697021
Test Accuracy: 0.9243243336677551
