### Question 2

### 1. Create your own dataset for text classification. It should contain at least 2000 words in total and at least three categories with at least 100 examples per category (an example can be a poem or a paragraph from a book). You can create it by scraping the web, by using some of the documents you have on your computer (do not use anything confidential), or with ChatGPT.

In [2]:
import pandas as pd
from datasets import load_dataset
import string
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# Load the Yelp review dataset
yelp_dataset = load_dataset("yelp_review_full")

# Extract reviews and corresponding ratings
reviews = yelp_dataset["train"]["text"]
ratings = yelp_dataset["train"]["label"]

df = pd.DataFrame({"review": reviews, "rating": ratings})

# Sample the DataFrame to get examples from each category
df_4 = df[df["rating"] == 4].sample(n=100, random_state=42)
df_3 = df[df["rating"] == 3].sample(n=100, random_state=42)
df_2 = df[df["rating"] == 2].sample(n=100, random_state=42)
df_1 = df[df["rating"] == 1].sample(n=100, random_state=42)
df_0 = df[df["rating"] == 0].sample(n=100, random_state=42)

final_df = pd.concat([df_0, df_1, df_2, df_3, df_4])

final_df.reset_index(drop=True, inplace=True)

# Lowercase, remove punctuation, remove stopwords
stop_words = set(stopwords.words('english'))
translator = str.maketrans('', '', string.punctuation)

def clean_review(text):
    # Strip punctuation, lowercase, then drop English stopwords
    tokens = text.translate(translator).lower().split()
    return ' '.join(word for word in tokens if word not in stop_words)

final_df['review'] = final_df['review'].apply(clean_review)

# Save the final dataset to a CSV file
final_df.to_csv("yelp_review_classification_dataset.csv", index=False)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
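
The task sets three size requirements (total words, number of categories, examples per category). A minimal sketch of how they could be checked is shown below; the helper name `check_dataset_requirements` and the toy data are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def check_dataset_requirements(examples, min_words=2000, min_classes=3, min_per_class=100):
    """Return (ok, stats) for a list of (text, label) pairs."""
    total_words = sum(len(text.split()) for text, _ in examples)
    per_class = Counter(label for _, label in examples)
    ok = (
        total_words >= min_words
        and len(per_class) >= min_classes
        and all(count >= min_per_class for count in per_class.values())
    )
    return ok, {"total_words": total_words, "per_class": dict(per_class)}

# Toy stand-in for the real reviews: 5 ratings x 100 examples of 5 words each
toy = [("great food and friendly staff", rating) for rating in range(5) for _ in range(100)]
ok, stats = check_dataset_requirements(toy)
print(ok, stats["total_words"])
```

With the notebook's DataFrame, the same check could be run on `list(zip(final_df["review"], final_df["rating"]))`.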


### 2. Split the dataset into training (at least 240 examples) and test (at least 60 examples) sets.

In [3]:
from sklearn.model_selection import train_test_split

# Load the preprocessed dataset
final_df = pd.read_csv("yelp_review_classification_dataset.csv")

# Splitting the dataset into training and test sets
train_df, test_df = train_test_split(final_df, test_size=0.2, random_state=42, stratify=final_df['rating'])

# Save the training and test sets to CSV files
train_df.to_csv("yelp_review_train.csv", index=False)
test_df.to_csv("yelp_review_test.csv", index=False)


In [15]:
print(train_df.shape)
print(test_df.shape)

(400, 2)
(100, 2)
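
The `stratify` argument is what keeps the five ratings equally represented in both splits. The sketch below illustrates the idea in plain Python. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper written only for this illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each label split in the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Same shape as the dataset above: 5 ratings with 100 examples each
labels = [rating for rating in range(5) for _ in range(100)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))         # 400 100
print(Counter(labels[i] for i in test_idx))  # 20 test examples per rating
```

The 400/100 sizes match the shapes printed above, and each rating contributes 80 training and 20 test examples.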


### 3. Fine-tune a pre-trained language model capable of generating text (e.g., GPT) that you can take, e.g., from the Hugging Face Transformers library with the dataset you created (this tutorial could be very helpful: https://huggingface.co/docs/transformers/training). [20 points] Report the test accuracy [5 points]. Discuss what could be done to improve accuracy.

In [5]:
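from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")


def tokenize_function(examples):
    # Pad or truncate every review to the model's maximum input length
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# Note: this tokenizes the full Yelp dataset loaded in the first cell,
# not the 500-example CSV splits saved in step 2.
tokenized_datasets = yelp_dataset.map(tokenize_function, batched=True)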

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
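
The tokenization cell above maps over `yelp_dataset`, the full Hugging Face dataset, so the CSV splits from step 2 are not the data being tokenized. A hedged sketch of how those splits could be used instead follows. The helper names `rename_for_trainer` and `build_tokenized_splits` are assumptions for illustration, and the function was not run as part of this notebook.

```python
def rename_for_trainer(rows):
    """Map the CSV columns (review, rating) to the names the tokenizer and Trainer expect."""
    return [{"text": str(row["review"]), "label": int(row["rating"])} for row in rows]

def build_tokenized_splits(train_csv, test_csv, model_name="google-bert/bert-base-cased"):
    """Load the saved CSV splits and tokenize them (needs pandas, datasets, transformers)."""
    # Imported here so rename_for_trainer can be used without these packages installed
    import pandas as pd
    from datasets import Dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    splits = {}
    for name, path in (("train", train_csv), ("test", test_csv)):
        # fillna("") guards against reviews that became empty after stopword removal
        rows = pd.read_csv(path).fillna("").to_dict("records")
        splits[name] = Dataset.from_list(rename_for_trainer(rows)).map(tokenize_function, batched=True)
    return splits

print(rename_for_trainer([{"review": "good food", "rating": 4}]))
```

A call such as `build_tokenized_splits("yelp_review_train.csv", "yelp_review_test.csv")` would return datasets that can be passed to `Trainer` as `train_dataset` and `eval_dataset`. Because the saved reviews are lowercased, an uncased checkpoint might be a better match than `bert-base-cased`; that is a suggestion, not something this notebook tested.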

In [6]:
!pip3 install transformers[torch]
!pip3 install accelerate -U



In [7]:
# Use 1,000 shuffled examples from each Yelp split to keep training time short
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

from transformers import AutoModelForSequenceClassification

# Five output labels, one per star rating (0-4)
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

from transformers import TrainingArguments, Trainer

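# Evaluate after every epoch; learning_rate=5e-5 is also the Trainer default.
# Newer transformers releases rename evaluation_strategy to eval_strategy.
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", learning_rate=5e-5)

def compute_metrics(eval_pred):
    # Pick the highest-scoring class for each example and compare with the labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)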

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]
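
To make `compute_metrics` concrete, here is a small worked example in plain Python. It mirrors what `np.argmax(logits, axis=-1)` followed by an accuracy computation does; the logits are made-up numbers, and `argmax_accuracy` is a hypothetical helper.

```python
def argmax_accuracy(logits, labels):
    """Return the predicted class per row and the fraction that match the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return predictions, correct / len(labels)

# Four examples, five classes (one score per star rating)
logits = [
    [0.1, 0.2, 2.5, 0.3, 0.0],  # highest score at index 2
    [1.9, 0.4, 0.1, 0.0, 0.2],  # highest score at index 0
    [0.0, 0.1, 0.2, 0.9, 1.4],  # highest score at index 4
    [0.3, 1.1, 0.9, 0.2, 0.1],  # highest score at index 1
]
labels = [2, 0, 3, 1]
predictions, accuracy = argmax_accuracy(logits, labels)
print(predictions, accuracy)  # [2, 0, 4, 1] 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.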

In [11]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.079505 | 0.545 |
| 2 | No log | 1.053312 | 0.551 |
| 3 | No log | 1.040243 | 0.586 |


TrainOutput(global_step=375, training_loss=0.9760638020833333, metrics={'train_runtime': 369.6659, 'train_samples_per_second': 8.115, 'train_steps_per_second': 1.014, 'total_flos': 789354427392000.0, 'train_loss': 0.9760638020833333, 'epoch': 3.0})
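
The numbers in `TrainOutput` can be reproduced from the `TrainingArguments` defaults. The sketch below assumes a single device and no gradient accumulation. It also shows why the training-loss column reads "No log": the run finishes before the default logging interval is reached.

```python
import math

num_examples = 1000       # size of small_train_dataset
batch_size = 8            # default per_device_train_batch_size
num_epochs = 3            # default num_train_epochs
logging_steps = 500       # default logging_steps
train_runtime = 369.6659  # seconds, taken from the TrainOutput above

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)  # 125 375
print(round(samples_per_second, 3))  # 8.115
print(total_steps < logging_steps)   # True: training loss is never logged
```

Setting a smaller `logging_steps` value in `TrainingArguments` would make the training loss appear in the per-epoch table.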

In [12]:
test_results = trainer.evaluate(small_eval_dataset)
print("Test Accuracy:", test_results["eval_accuracy"])

Test Accuracy: 0.586
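
To put 0.586 in context, it helps to compare it with chance level and to estimate its sampling uncertainty. The sketch below assumes the five classes are roughly balanced in the 1,000-example evaluation sample and uses a normal approximation for the interval. Both are assumptions, not results from the notebook.

```python
import math

accuracy = 0.586  # reported above
n_eval = 1000     # size of small_eval_dataset
num_classes = 5

chance = 1 / num_classes                                 # accuracy of random guessing
std_err = math.sqrt(accuracy * (1 - accuracy) / n_eval)  # binomial standard error
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(f"chance={chance:.3f} std_err={std_err:.4f} interval=({low:.3f}, {high:.3f})")
```

The model is well above the 0.2 chance level, but the interval is a few percentage points wide, so small differences between runs should be interpreted with care.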


#### To improve accuracy:
* Data Quality: Ensure the reviews are clean, correctly labeled, and representative of the problem being solved.
* Increase Training Data: The run above trains on only 1,000 reviews; using more of the Yelp training split should help the model generalize.
* Data Augmentation: Augment the dataset with synthetic examples generated by modifying existing reviews or by techniques such as back-translation.
* Learning Rate Scheduling: The `Trainer` already decays the learning rate linearly by default; adding warmup steps or trying a peak rate lower than 5e-5 may stabilize fine-tuning.
* Extended Training: Validation accuracy was still rising after three epochs (0.545 to 0.586), so training for more epochs may help, as long as overfitting is monitored on the validation set.
* Fine-tuning on Specific Data: If the dataset is niche, further fine-tuning the pre-trained model on data closer to that domain can be beneficial.
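
As an illustration of the learning-rate and training-length suggestions, the sketch below computes a linear warmup followed by linear decay, then lists matching `TrainingArguments` keyword arguments. The specific values (5 epochs, 10% warmup, a 2e-5 peak rate, 0.01 weight decay) are hypothetical starting points and were not evaluated in this notebook.

```python
def linear_schedule_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Learning rate at an optimizer step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)

# Hypothetical run: 5 epochs of 125 steps each, with 10% of the steps used for warmup
total_steps = 5 * 125
warmup_steps = int(0.1 * total_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule_with_warmup(step, total_steps, warmup_steps, 2e-5))

# The same ideas as TrainingArguments keyword arguments (assumed values, untested here)
improved_kwargs = dict(
    num_train_epochs=5,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.01,
)
```

These keyword arguments could be passed as `TrainingArguments(output_dir="test_trainer", **improved_kwargs)`; argument names can change between `transformers` releases, so they should be checked against the installed version.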

### References
* Medium article: https://medium.com/@amanatulla1606/fine-tuning-the-model-what-why-and-how-e7fa52bc8ddf
* Hugging Face article: https://huggingface.co/docs/transformers/training#train-a-tensorflow-model-with-keras
* OpenAI's ChatGPT model was used for certain conversational AI tasks.