# Pre-trained model

#### To use the DistilBERT model fine-tuned on the SST-2 dataset for sentiment classification, load it with the `pipeline` API from Hugging Face's Transformers library. The checkpoint `distilbert-base-uncased-finetuned-sst-2-english` is designed for binary sentiment classification (positive or negative). The cell below tests this fine-tuned model on a few example sentences.

In [None]:
# Step 1: Import necessary libraries
from transformers import pipeline

# Step 2: Load the pre-trained DistilBERT model fine-tuned on SST-2
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Step 3: Define some example text to classify
texts = [
    "I love this! It's absolutely amazing.",
    "This was the worst experience of my life. I hated it.",
    "The product was a bit dull, but the price was great.",
    "I enjoyed personal care, but the quality was lacking.",

]

# Step 4: Classify the text
results = classifier(texts)

# Step 5: Display the results
for text, result in zip(texts, results):
    print(f"Text: {text}\nSentiment: {result['label']}, Confidence: {result['score']:.2f}\n")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Text: I love this! It's absolutely amazing.
Sentiment: POSITIVE, Confidence: 1.00

Text: This was the worst experience of my life. I hated it.
Sentiment: NEGATIVE, Confidence: 1.00

Text: The product was a bit dull, but the price was great.
Sentiment: POSITIVE, Confidence: 1.00

Text: I enjoyed personal care, but the quality was lacking.
Sentiment: NEGATIVE, Confidence: 1.00
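
All four scores above print as 1.00, including the two mixed reviews. The sketch below illustrates why, assuming the pipeline's score is the softmax probability of the winning class. The logit values are made up for illustration, and `logits_to_prediction` is a hypothetical helper, not part of Transformers.

```python
import math

# Hypothetical label mapping, mirroring the POSITIVE/NEGATIVE labels printed above.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def logits_to_prediction(logits, id2label=ID2LABEL):
    """Turn raw class logits into a {'label', 'score'} dict via softmax."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Made-up logits: a gap of 7 between the two classes already saturates the score.
pred = logits_to_prediction([-3.0, 4.0])
print(f"{pred['label']} {pred['score']:.2f}")  # two decimals hide the remainder
print(f"{pred['label']} {pred['score']:.4f}")  # four decimals show it
```

A score that rounds to 1.00 therefore says the logits are far apart, not that the text is unambiguous. Printing more decimals, or inspecting both class probabilities, makes mixed reviews easier to spot.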



#### Apply the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) to the Amazon Reviews 2023 dataset (Beauty and Personal Care reviews), loaded with Hugging Face's `datasets` library

In [None]:
# Step 1: Import necessary libraries
from transformers import pipeline
from datasets import load_dataset

# Step 2: Load dataset from Hugging Face
ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Beauty_and_Personal_Care", split="full")

# Step 3: Load the pre-trained DistilBERT model fine-tuned on SST-2
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

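# Step 4: Extract a sample of reviews to classify (for demo purposes, let's take 5 reviews)
reviews = ds[:5]['text']  # Slice the rows first so only 5 reviews are read, not the whole column

# Step 5: Use the DistilBERT model to classify the sentiment of the reviews
# truncation=True cuts reviews that exceed the model's 512-token limit instead of raising an error
results = classifier(reviews, truncation=True)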

# Step 6: Display the results
for review, result in zip(reviews, results):
    print(f"Review: {review}\nSentiment: {result['label']}, Confidence: {result['score']:.2f}\n")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Review: Opened the package & instant migraine. I cannot believe the stench.  I have purchased other packages that did not smell at all so I do not know if these were a damaged shipment or damaged during packaging or what, but the minute I opened the Amazon package I smelled it before I even opened the Terra Tattoos package. I couldn’t believe it. Then I find that the pink inks from the back have smeared all over the fronts of the tattoos. Yes, you eventually take the clear part off to apply the tattoo, but I always lay it down with the clear covering first to line it up & I didn’t want to risk the pink ink transferring to my art projects so it’s going back. I’ll update the review when Amazon sends my replacement. I’m upset because now my resin is here & I don’t have the tattoos to do my project, but I’m more mad about the fact that I have a massive migraine because of the gasoline fumes that lingered.  I believe they arrived a week ago, but I didn’t check the tattoos then as I didn’t h
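
Classifying more than a handful of reviews is usually done in batches rather than in one call. The following is a minimal sketch of that pattern, assuming the classifier is any callable that takes a list of strings and returns one dict with a `label` key per string, as the pipeline above does. `classify_in_batches` and `fake_classifier` are hypothetical names, and the sample reviews are made up.

```python
from collections import Counter
from itertools import islice

def classify_in_batches(texts, classifier, batch_size=32):
    """Yield (text, result) pairs, sending batch_size texts to the classifier at a time."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for text, result in zip(batch, classifier(batch)):
            yield text, result

# Stand-in for the real pipeline so the sketch runs without downloading a model.
def fake_classifier(batch):
    return [{"label": "POSITIVE" if "great" in t else "NEGATIVE", "score": 1.0} for t in batch]

# Made-up reviews, not taken from the dataset.
reviews = ["great scent", "broke in a day", "great value", "never arrived", "great"]
counts = Counter(
    result["label"]
    for _, result in classify_in_batches(reviews, fake_classifier, batch_size=2)
)
print(counts)
```

With the real pipeline, `fake_classifier` would be replaced by the `classifier` object. Long reviews may also need truncation, since DistilBERT accepts at most 512 tokens per input.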

### Training Process

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
import evaluate


In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import torch
import evaluate

# Step 1: Load the sampled dataset
ds = load_dataset("json", data_files="sampled_data_0_4_percent.jsonl", split="train")

# Step 2: Tokenizer setup
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize the dataset
def tokenize_function(examples):
    tokens = tokenizer(examples['text'], padding="max_length", truncation=True)
    tokens['labels'] = [int(rating >= 4) for rating in examples['rating']]  # Binary labels
    return tokens

# Step 3: Split the dataset into train and test sets
split_ds = ds.train_test_split(test_size=0.2, seed=42)
train_ds = split_ds['train']
eval_ds = split_ds['test']

# Tokenize the train and test datasets
train_ds = train_ds.map(tokenize_function, batched=True)
eval_ds = eval_ds.map(tokenize_function, batched=True)

# Remove unnecessary columns
train_ds = train_ds.remove_columns(["text", "rating"])
eval_ds = eval_ds.remove_columns(["text", "rating"])

# Set the format for PyTorch
train_ds.set_format("torch")
eval_ds.set_format("torch")

# Step 4: Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2  # Binary classification
)

# Step 5: Load the accuracy metric
accuracy = evaluate.load("accuracy")

# Define compute_metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# Step 6: Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",  # Evaluate once per epoch (older transformers releases call this evaluation_strategy)
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    push_to_hub=False,
    report_to='none'
)

# Step 7: Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics
)

# Step 8: Train the model
trainer.train()




Map:   0%|          | 0/76516 [00:00<?, ? examples/s]

Map:   0%|          | 0/19130 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.1741        | 0.184569        | 0.933926 |


TrainOutput(global_step=4783, training_loss=0.2024452549230347, metrics={'train_runtime': 820.5741, 'train_samples_per_second': 93.247, 'train_steps_per_second': 5.829, 'total_flos': 1.0135875475562496e+16, 'train_loss': 0.2024452549230347, 'epoch': 1.0})
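
The numbers in `TrainOutput` can be cross-checked against the dataset sizes in the progress bars and the batch size in `TrainingArguments`. This is a small sanity check, assuming a single device and no gradient accumulation. The variable names are illustrative only.

```python
import math

train_examples = 76516   # from the first "Map" progress bar
eval_examples = 19130    # from the second "Map" progress bar
batch_size = 16          # per_device_train_batch_size
epochs = 1               # num_train_epochs
runtime_s = 820.5741     # train_runtime

# Assumption: one device and no gradient accumulation, so one step = one batch.
steps = math.ceil(train_examples / batch_size) * epochs
samples_per_second = train_examples * epochs / runtime_s
steps_per_second = steps / runtime_s
eval_fraction = eval_examples / (train_examples + eval_examples)

print(steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(round(eval_fraction, 3))
```

Under these assumptions the computed values agree with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`, and the evaluation share matches `test_size=0.2`.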

In [None]:
# Save the model
trainer.save_model('./my_finetuned_model')
tokenizer.save_pretrained('./my_finetuned_model')

('./my_finetuned_model/tokenizer_config.json',
 './my_finetuned_model/special_tokens_map.json',
 './my_finetuned_model/vocab.txt',
 './my_finetuned_model/added_tokens.json',
 './my_finetuned_model/tokenizer.json')

In [None]:
from transformers import pipeline

# Load the trained model
classifier = pipeline("sentiment-analysis", model='./my_finetuned_model', tokenizer=tokenizer)

# Test with some new text
test_texts = ["This product was great!", "The worse experience in my life"]
predictions = classifier(test_texts)

# Print predictions
for text, prediction in zip(test_texts, predictions):
    print(f"Text: {text}\nPrediction: {prediction['label']} with confidence {prediction['score']:.2f}\n")


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Text: This product was great!
Prediction: LABEL_1 with confidence 1.00

Text: The worse experience in my life
Prediction: LABEL_0 with confidence 0.98
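
The fine-tuned model reports generic `LABEL_0` and `LABEL_1` names because no label names were stored in its config. In the training cell, label 1 means a rating of 4 or higher. A minimal sketch of mapping the generic names back to readable ones follows. The names `NEGATIVE` and `POSITIVE` are an assumption chosen to match the SST-2 model, and `rating_to_label` and `rename_labels` are hypothetical helpers.

```python
# Label names are an assumption: the training cell only defines 1 as rating >= 4.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def rating_to_label(rating):
    """Same rule as tokenize_function: 4 or 5 stars -> 1, otherwise 0."""
    return int(rating >= 4)

def rename_labels(predictions, id2label=ID2LABEL):
    """Replace generic 'LABEL_<id>' strings with readable names."""
    renamed = []
    for pred in predictions:
        class_id = int(pred["label"].rsplit("_", 1)[-1])
        renamed.append({**pred, "label": id2label[class_id]})
    return renamed

# Predictions copied from the output above.
predictions = [
    {"label": "LABEL_1", "score": 1.00},
    {"label": "LABEL_0", "score": 0.98},
]
for pred in rename_labels(predictions):
    print(pred["label"], f"{pred['score']:.2f}")
```

An alternative is to pass `id2label` and `label2id` to `from_pretrained` when the model is loaded for training, so the names are saved with the model and the pipeline prints them directly.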



### Training Summary

- Metrics:
  - Training loss: 0.2024 averaged over the epoch (the epoch table shows 0.1741, the most recently logged value).
  - Validation loss: 0.1846, close to the training loss, so there is no sign of overfitting after one epoch.
  - Validation accuracy: 0.9339 (about 93.4%) on the 19,130 held-out reviews.
- Efficiency:
  - Runtime: 820.57 seconds (about 13.7 minutes) for 4,783 steps.
  - Samples per second: 93.25.
  - Steps per second: 5.83.
  - Floating-point operations (FLOPs): 1.01e+16 in total. DistilBERT is a compact model, so this figure mainly reflects one pass over 76,516 examples padded to the maximum sequence length.
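
Accuracy is easier to judge next to a majority-class baseline and per-class metrics, because ratings of 4 or higher may dominate a review dataset. The sketch below computes these in plain Python. The toy labels are made up and are not drawn from the review data, and `binary_report` is a hypothetical helper.

```python
def binary_report(labels, predictions):
    """Accuracy, majority-class baseline, and precision/recall/F1 for class 1."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)
    positives = sum(labels)
    n = len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / n,
        "majority_baseline": max(positives, n - positives) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up toy data (not from the review dataset): 8 positives, 2 negatives.
labels      = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
report = binary_report(labels, predictions)
print(report)
```

In the notebook, the same comparison could be made by passing the evaluation labels and the model's argmax predictions to a function like this inside `compute_metrics`.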