Name : Devdeep Shetranjiwala <br>
Email ID : devdeep0702@gmail.com 

## Screening exercise
> When submitting your application please also complete the following exercise.
Write a Jupyter Notebook to conduct a small task with a transformer and explain what you are trying to solve.

(Please check the installation, examples, and tutorial if needed: https://huggingface.co/docs/transformers/index)

The goal of this task is to classify English sentences as grammatically correct or incorrect by fine-tuning a pre-trained transformer model from the Hugging Face Transformers library on the CoLA dataset. CoLA (the Corpus of Linguistic Acceptability) is a corpus of English sentences, each labeled with a binary acceptability judgment: 1 for acceptable (grammatically correct) and 0 for unacceptable (incorrect). The task is therefore a binary text classification problem in natural language processing (NLP).
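
To make the dataset layout concrete, here is a minimal sketch of how CoLA-style rows can be parsed with the standard library alone. It assumes the four tab-separated columns used later in this notebook (source code, label, original annotation, sentence). The two rows and the helper name `read_cola_rows` are made up for illustration and are not taken from the corpus.

```python
import csv
import io

# Two invented rows in the four-column layout assumed by this notebook:
# source code, label (1 = acceptable, 0 = unacceptable), original annotation, sentence.
SAMPLE_TSV = (
    "ex01\t1\t\tThe cat is sleeping on the mat.\n"
    "ex01\t0\t*\tMe is Devdeep.\n"
)

COLUMNS = ["sentence_source", "label", "label_notes", "sentence"]


def read_cola_rows(text):
    """Parse CoLA-style TSV text into a list of dicts with integer labels."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        row["label"] = int(row["label"])
        rows.append(row)
    return rows


rows = read_cola_rows(SAMPLE_TSV)
for row in rows:
    print(row["label"], row["sentence"])
```

The notebook itself loads the real files with `pd.read_csv`; this sketch only shows what each row is expected to contain.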

We used BERT (Bidirectional Encoder Representations from Transformers), a pre-trained transformer model that achieved state-of-the-art results on a wide range of NLP tasks when it was introduced. Specifically, we fine-tuned the `bert-base-uncased` checkpoint on the CoLA training set using TensorFlow, an open-source machine learning platform whose Keras API provides high-level tools for building and training models.

By fine-tuning the pre-trained BERT model on the CoLA dataset, we were able to leverage the pre-trained model's knowledge of natural language to improve the accuracy of our classification task. We evaluated the performance of the model on the validation dataset and used it to make predictions on new sentences.

The ability to accurately classify sentences as grammatically correct or incorrect has many practical applications in NLP, such as in automated essay grading, grammar checking, and language translation.

In [None]:
!pip install tensorflow
!pip install transformers

import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer

# Download the dataset
!wget https://nyu-mll.github.io/CoLA/cola_public_1.1.zip
!unzip cola_public_1.1.zip
# Next, we will load the dataset and preprocess it:

import pandas as pd

# Load the dataset
train_df = pd.read_csv("cola_public/tokenized/in_domain_train.tsv", delimiter="\t", header=None, names=["sentence_source", "label", "label_notes", "sentence"])

# Preprocess the dataset
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(data):
    sentences = data["sentence"].tolist()
    labels = data["label"].tolist()
    labels = [0 if label == 0 else 1 for label in labels]  # CoLA labels are already 0/1; this only coerces any stray nonzero value to 1
    encodings = tokenizer(sentences, truncation=True, padding=True)
    return tf.data.Dataset.from_tensor_slices((dict(encodings), labels))

train_data = preprocess_data(train_df)
# Now, we will fine-tune the pre-trained BERT model on the CoLA dataset:

# Create the model
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Define the optimizer and loss function
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

# Train the model
model.fit(train_data.shuffle(1000).batch(16), epochs=3)

# At this point, the pre-trained BERT model has been fine-tuned on the CoLA dataset using TensorFlow and Hugging Face Transformers.
# The trained model can now be used to classify new sentences as either grammatically correct or incorrect.

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3



In [None]:
# Load the validation dataset
val_df = pd.read_csv("cola_public/tokenized/in_domain_dev.tsv", delimiter="\t", header=None, names=["sentence_source", "label", "label_notes", "sentence"])

# Preprocess the validation dataset
val_data = preprocess_data(val_df)

# Evaluate the model on the validation dataset
model.evaluate(val_data.batch(16))




[0.5098571181297302, 0.8254269361495972]

> We can test the model by evaluating it on the CoLA validation dataset, as in the cell above. The call to `model.evaluate` outputs the model's loss and accuracy on the validation dataset, here about 0.51 and 0.825. We can also use the model to make predictions on new sentences:
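
Accuracy alone can be hard to interpret on CoLA, because acceptable sentences outnumber unacceptable ones. The GLUE benchmark therefore reports the Matthews correlation coefficient (MCC) for this dataset. The following is a minimal pure-Python sketch of that metric, not something computed in this notebook. The function name `mcc` and the toy label lists are hypothetical; in practice a library routine such as `sklearn.metrics.matthews_corrcoef` could be used instead.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical toy labels: four acceptable sentences, two unacceptable ones.
y_true = [1, 1, 1, 1, 0, 0]
always_acceptable = [1, 1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, always_acceptable)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, MCC = {mcc(y_true, always_acceptable):.3f}")
```

In this toy case, a classifier that always answers "acceptable" is right two thirds of the time but gets an MCC of 0, which is why MCC is a useful companion to the accuracy reported above.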

In [None]:
# Example sentence
sentence = "The cat is sleeping on the mat."

# Preprocess the sentence
input_ids = tokenizer.encode(sentence, return_tensors="tf")
input_dict = {"input_ids": input_ids, "attention_mask": tf.ones_like(input_ids)}

# Make a prediction
prediction = tf.nn.softmax(model(input_dict)[0], axis=1)

# Print the predicted label and probability distribution
labels = ["grammatically incorrect", "grammatically correct"]
print(f"Sentence: {sentence}")
print(f"Predicted label: {labels[prediction.numpy().argmax()]}")
print(f"Probability distribution: {prediction.numpy()[0]}")


Sentence: The cat is sleeping on the mat.
Predicted label: grammatically correct
Probability distribution: [0.00481604 0.995184  ]


In [None]:
# Example sentence
sentence = "Me is Devdeep."

# Preprocess the sentence
input_ids = tokenizer.encode(sentence, return_tensors="tf")
input_dict = {"input_ids": input_ids, "attention_mask": tf.ones_like(input_ids)}

# Make a prediction
prediction = tf.nn.softmax(model(input_dict)[0], axis=1)

# Print the predicted label and probability distribution
labels = ["grammatically incorrect", "grammatically correct"]
print(f"Sentence: {sentence}")
print(f"Predicted label: {labels[prediction.numpy().argmax()]}")
print(f"Probability distribution: {prediction.numpy()[0]}")

Sentence: Me is Devdeep.
Predicted label: grammatically incorrect
Probability distribution: [0.9743079  0.02569204]
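
The prediction cells above turn the model's two output logits into a probability distribution with `tf.nn.softmax` and then pick the larger entry. As a rough illustration of that step only, here is a pure-Python sketch. The logit values and the helper names `softmax` and `classify` are assumptions made for the example and do not come from the trained model.

```python
import math

LABELS = ["grammatically incorrect", "grammatically correct"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the most probable label together with the full distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Hypothetical logits for a single sentence: index 0 = incorrect, index 1 = correct.
label, probs = classify([-2.0, 3.0])
print(label, [round(p, 4) for p in probs])
```

The larger the gap between the two logits, the closer the winning probability gets to 1, which matches the confident distributions printed for the two example sentences.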
