# Classifying Text Similarity
So... I came across a job posting on Handshake by Berkeley Haas that required the use of Google's Bidirectional Encoder Representations from Transformers (BERT). BERT is designed to understand the context and meaning of words in a sentence by considering the words that come before and after them. We'll dive deeper into what transformers are and how exactly you train the model for text classification.

In this case, we're looking at how we can use BERT to detect if two texts are similar to one another.

## Load Data

In [1]:
import pandas as pd 
import numpy as np
import tensorflow as tf




In [3]:
import bz2
train_file = bz2.BZ2File("./data/train.ft.txt.bz2")
test_file = bz2.BZ2File("./data/test.ft.txt.bz2")

In [4]:
# Store sentences and labels as lists
train_lines = train_file.readlines()
test_lines = test_file.readlines()

When we read a file through bz2.BZ2File, the readlines() method returns the content as a list of bytes objects (raw binary strings). To make the data easier to process, we decode the bytes objects into strings.

In [5]:
train_lines = [x.decode('utf-8') for x in train_lines]
test_lines = [x.decode('utf-8') for x in test_lines]
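As a side note, the read-then-decode steps above can also be done in one go. Here's a minimal sketch, assuming you're fine with opening the archive in text mode via bz2.open; the sample file and its two lines are made up purely for illustration and are not part of the real dataset:

```python
import bz2
import os
import tempfile

# Write a tiny made-up sample so the sketch is self-contained
sample = "__label__2 Great CD: loved it\n__label__1 Not for me: too long\n"
path = os.path.join(tempfile.mkdtemp(), "sample.ft.txt.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write(sample)

# Text mode ("rt") decodes the bytes while reading, so no separate decode step
with bz2.open(path, "rt", encoding="utf-8") as f:
    lines = f.readlines()

print(lines)
```

Either approach should give you a list of Python strings; the explicit decode just makes the bytes-to-string step more visible.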

## Preprocess Data

Now that we've loaded our data in string format, we can start preprocessing the data. Here's what we will do:

1. Split label from sentence
2. Tokenization
3. Punctuation removal
4. Removing stop words
5. Lowercasing
6. Lemmatization

In [6]:
# Get labels: 0 if __label__1, 1 if __label__2
train_labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in train_lines]
train_sentences = [x.split(' ', 1)[1][:-1].lower() for x in train_lines]

To break the code down:
- The first line checks whether each training line starts with '__label__1' and, if so, assigns a target value of 0. Any other label (here '__label__2') gets a target value of 1.

- In the 2nd line, we remove the label using x.split(' ', 1)[1]. The split creates a list with 2 elements (looks something like ['__label__1', 'sentence']), and selecting the element at index 1 effectively removes the label at the beginning of the sentence.

- The [:-1] removes "\n", the newline character at the end of each line.

Finally, we call .lower() on those sentences to standardize the data and avoid duplication of words due to case differences.
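To see the breakdown above on a single line, here's a small sketch. The helper name parse_line and the sample review are my own assumptions for illustration; it uses str.partition and rstrip instead of split and [:-1], which should behave the same on well-formed lines:

```python
def parse_line(line):
    # Hypothetical helper: split "__label__X sentence\n" into (label, sentence)
    label_token, _, text = line.partition(" ")
    # 0 for __label__1, 1 for anything else (here __label__2)
    label = 0 if label_token == "__label__1" else 1
    # Drop the trailing newline and lowercase the sentence
    return label, text.rstrip("\n").lower()

label, sentence = parse_line("__label__2 Great CD: One of the GREAT voices\n")
print(label)     # 1
print(sentence)  # great cd: one of the great voices
```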

In [8]:
# import the regular expressions module for pattern matching and string manipulation
import re
for i in range(len(train_sentences)):
    # replace every digit with 0 so that all numbers share the same form
    train_sentences[i] = re.sub(r'\d', '0', train_sentences[i])

What we're doing here is iterating through the indices of train_sentences. For each sentence, we substitute every digit with the character 0. To break down the re.sub() function:

- Arg 1 (the pattern \d): regular expression that matches any digit
- Arg 2 ('0'): replacement string, so each matched digit becomes 0
- Arg 3 (train_sentences[i]): the input string to process

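The main reasons for this step are:
1. Improve model generalization: the model learns that a number appears here rather than memorizing specific values
2. Noise reduction: exact figures such as years or prices usually carry little signal for the label
3. Dimensionality reduction: many distinct number tokens collapse into a few, which shrinks the vocabulary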
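Here's a quick sketch of the digit substitution on a single string. The helper name normalize_digits and the sample sentence are assumptions for illustration only:

```python
import re

def normalize_digits(text):
    # Hypothetical helper: replace every digit with the character 0
    return re.sub(r"\d", "0", text)

print(normalize_digits("released in 1999 for $25.50"))
# released in 0000 for $00.00
```

Notice that the length and shape of each number survive (a four-digit year stays four characters), but the specific value is gone.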

In [9]:
# Now we do the same thing for the test data
test_labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in test_lines]
test_sentences = [x.split(' ', 1)[1][:-1].lower() for x in test_lines]

for i in range(len(test_sentences)):
    test_sentences[i] = re.sub(r'\d', '0', test_sentences[i])

The next type of data cleaning we should do is to remove URLs:

In [None]:
for i in range(len(train_sentences)):
    if 'www.' in train_sentences[i] or 'http:' in train_sentences[i] or 'https:' in train_sentences[i] or '.com' in train_sentences[i]:
        train_sentences[i] = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", train_sentences[i])

The code above loops through the training sentences and checks each one for any of 'www.', 'http:', 'https:', or '.com'. If any of these is present, it looks for every sequence of characters that:

1. Does not contain a space
2. Ends with a period followed by 3 lowercase letters

Which is essentially a URL! We then replace each match with <url>. Again, we do the same thing for the test data!
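To get a feel for what this pattern does and doesn't catch, here's a small sketch that wraps the same check and regex in a function. The names URL_HINTS, URL_PATTERN, and mask_urls, as well as the sample sentences, are hypothetical and only there for illustration:

```python
import re

# Same trigger substrings and pattern as in the loop above
URL_HINTS = ("www.", "http:", "https:", ".com")
URL_PATTERN = re.compile(r"([^ ]+(?<=\.[a-z]{3}))")

def mask_urls(sentence):
    # Hypothetical helper: only substitute when the sentence contains a URL hint
    if any(hint in sentence for hint in URL_HINTS):
        return URL_PATTERN.sub("<url>", sentence)
    return sentence

print(mask_urls("visit www.example.com for details"))  # visit <url> for details
print(mask_urls("see http://example.com/page"))        # see <url>/page
print(mask_urls("no links in this sentence"))          # unchanged
```

The second case shows a limitation worth knowing about: the match stops at the period-plus-3-letters ending, so anything after it (like a path) is left behind. For a quick cleanup pass that's probably acceptable, but a stricter URL pattern may be worth considering if it matters for your data.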

In [10]:
for i in range(len(test_sentences)):
    if 'www.' in test_sentences[i] or 'http:' in test_sentences[i] or 'https:' in test_sentences[i] or '.com' in test_sentences[i]:
        test_sentences[i] = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", test_sentences[i])

In [13]:
train_sentences[1]

"the best soundtrack ever to anything.: i'm reading a lot of reviews saying that this is the best 'game soundtrack' and i figured that i'd write a review to disagree a bit. this in my opinino is yasunori mitsuda's ultimate masterpiece. the music is timeless and i'm been listening to it for years now and its beauty simply refuses to fade.the price tag on this is pretty staggering i must say, but if you are going to buy any cd for this much money, this is the only one that i feel would be worth every penny."

Alright, now that we've cleaned the data, we can proceed with tokenization.

Tokenization is the process of breaking down text into smaller pieces called tokens. For example, if you want to tokenize a sentence, the individual tokens could be the words that make up the sentence. This converts the sentence into a format that is easier for a machine to process.

In fact, this is a necessary step before we can apply other preprocessing steps like lemmatization or removing stop words.
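As a rough illustration of word-level tokenization, here's a minimal pure-Python sketch. The helper simple_tokenize is hypothetical and much simpler than a real tokenizer; it just lowercases, swaps punctuation for spaces, and splits on whitespace:

```python
import string

def simple_tokenize(text):
    # Hypothetical helper: lowercase, turn punctuation into spaces, split on whitespace
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

tokens = simple_tokenize("The music is timeless, and its beauty refuses to fade.")
print(tokens)
# ['the', 'music', 'is', 'timeless', 'and', 'its', 'beauty', 'refuses', 'to', 'fade']
```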

In [14]:
# import the tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer

# set the maximum number of words to 20000
max_words = 20000

# create a tokenizer object
tokenizer = Tokenizer(num_words=max_words)

In [17]:
# fit the tokenizer to the training data
tokenizer.fit_on_texts(train_sentences)

Now that the tokenizer has learned a vocabulary from the training data, the next step is to convert the text into sequences. The main reasons for this are as follows:

1. Input Formatting: Models don't understand text, but they do understand numbers. This conversion turns the text into a sequence of numbers where each number represents a specific word. We can then feed the converted data into the model.

2. Maintaining Context: This conversion preserves the order of words, which is important for the model to understand context and the meaning of sentences. It matters especially for models like Recurrent Neural Networks and Transformers (like BERT), which consider the order of words.

3. Embedding: Once we have text data in sequence format, we can use word embeddings to convert the sequences into dense vectors of fixed size that capture the semantic properties of the words. I'll explain this in further detail later!
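To make the text-to-numbers idea concrete, here's a simplified pure-Python sketch of a frequency-ranked word index. This is not the Keras implementation, just an illustration of the concept; build_word_index, to_sequences, and the two sample sentences are all hypothetical:

```python
from collections import Counter

def build_word_index(texts):
    # Hypothetical helper: rank words by frequency, most frequent word gets index 1
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    # Replace each known word with its integer index, keeping word order
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

texts = ["the music is timeless", "the price is staggering"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)

print(word_index)  # {'the': 1, 'is': 2, 'music': 3, 'timeless': 4, 'price': 5, 'staggering': 6}
print(sequences)   # [[1, 3, 2, 4], [1, 5, 2, 6]]
```

Each sentence is now a list of integers in the same order as the original words, which is the property points 1 and 2 above rely on.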

In [None]:
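# convert the training data to sequences (train_sequences is an assumed variable name)
train_sequences = tokenizer.texts_to_sequences(train_sentences)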

## Build The Model

## Model Training

## Model Evaluation

## Model Fine-Tuning

## Making Predictions