<a href="https://colab.research.google.com/github/A790227/data-project-llm/blob/main/Untitled0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-trained model

#### To use the DistilBERT model fine-tuned on the SST-2 dataset for sentiment classification, load the pre-trained checkpoint with the `pipeline` API from Hugging Face's Transformers library. The checkpoint `distilbert-base-uncased-finetuned-sst-2-english` is designed for binary sentiment classification (positive or negative). The cell below tests this fine-tuned model on a few example sentences.

In [22]:
# Step 1: Import necessary libraries
from transformers import pipeline

# Step 2: Load the pre-trained DistilBERT model fine-tuned on SST-2
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Step 3: Define some example text to classify
texts = [
    "I love this movie! It's absolutely amazing.",
    "This was the worst experience of my life. I hated it.",
    "The plot was a bit dull, but the acting was great.",
    "I enjoyed the cinematography, but the story was lacking.",
    "This film is a masterpiece. Highly recommend it!"
]

# Step 4: Classify the text
results = classifier(texts)

# Step 5: Display the results
for text, result in zip(texts, results):
    print(f"Text: {text}\nSentiment: {result['label']}, Confidence: {result['score']:.2f}\n")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Text: I love this movie! It's absolutely amazing.
Sentiment: POSITIVE, Confidence: 1.00

Text: This was the worst experience of my life. I hated it.
Sentiment: NEGATIVE, Confidence: 1.00

Text: The plot was a bit dull, but the acting was great.
Sentiment: POSITIVE, Confidence: 1.00

Text: I enjoyed the cinematography, but the story was lacking.
Sentiment: NEGATIVE, Confidence: 0.99

Text: This film is a masterpiece. Highly recommend it!
Sentiment: POSITIVE, Confidence: 1.00
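
The warning printed before these results says a GPU was available but the model ran on CPU, because no `device` argument was passed to the pipeline. The sketch below shows one way to choose a device. The helper name `pick_device` is hypothetical, and the sketch assumes the integer convention accepted by `pipeline` (`-1` for CPU, `0` for the first CUDA GPU).

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # PyTorch is not installed: fall back to CPU
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"Using device index: {device}")

# The index could then be passed when building the pipeline, for example:
# classifier = pipeline('sentiment-analysis',
#                       model='distilbert-base-uncased-finetuned-sst-2-english',
#                       device=device)
```

Passing the device explicitly should also silence the warning, since the pipeline no longer has to fall back to its default.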



#### Applying the DistilBERT model (distilbert-base-uncased-finetuned-sst-2-english) to the IMDb dataset, loaded with Hugging Face's datasets library

In [23]:
# Step 1: Import necessary libraries
from transformers import pipeline
from datasets import load_dataset

# Step 2: Load the IMDb dataset from Hugging Face
dataset = load_dataset('imdb')

# Step 3: Load the pre-trained DistilBERT model fine-tuned on SST-2
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Step 4: Extract a sample of reviews to classify (let's take 5 reviews)
reviews = dataset['test']['text'][:5]

# Step 5: Use the DistilBERT model to classify the sentiment of the reviews
results = classifier(reviews)

# Step 6: Display the results
for review, result in zip(reviews, results):
    print(f"Review: {review}\nSentiment: {result['label']}, Confidence: {result['score']:.2f}\n")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Review: I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have 
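
Each IMDb review also has a gold label (0 for negative, 1 for positive), so the predictions can be scored rather than only printed. The sketch below is a minimal illustration. The helper `pipeline_accuracy` and the sample values are hypothetical, not output from the run above, and the code assumes the `POSITIVE`/`NEGATIVE` label strings produced by the SST-2 checkpoint.

```python
# Map the SST-2 checkpoint's label strings onto IMDb's integer labels.
LABEL_TO_ID = {"NEGATIVE": 0, "POSITIVE": 1}


def pipeline_accuracy(results, gold_labels):
    """Fraction of pipeline predictions that match the gold integer labels."""
    if len(results) != len(gold_labels):
        raise ValueError("results and gold_labels must have the same length")
    if not results:
        raise ValueError("nothing to score")
    correct = sum(
        LABEL_TO_ID[r["label"]] == gold
        for r, gold in zip(results, gold_labels)
    )
    return correct / len(results)


# Hypothetical predictions in the pipeline's output format (illustrative only).
sample_results = [
    {"label": "NEGATIVE", "score": 0.99},
    {"label": "POSITIVE", "score": 0.87},
    {"label": "NEGATIVE", "score": 0.95},
    {"label": "NEGATIVE", "score": 0.91},
]
sample_gold = [0, 0, 0, 1]  # hypothetical gold labels

print(pipeline_accuracy(sample_results, sample_gold))  # 0.5
```

With the real data, the gold labels for the same five reviews would come from `dataset['test']['label'][:5]`. If the split is ordered by label, as the IMDb splits on the Hub appear to be, a shuffled sample gives a more meaningful score than the first few rows.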

### Training Process

In [25]:
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer


In [35]:
# Import necessary libraries
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import torch
import evaluate

# Load the IMDb dataset
dataset = load_dataset('imdb')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

train_ds = dataset['train'].map(tokenize_function, batched=True)
eval_ds = dataset['test'].map(tokenize_function, batched=True)

# Remove unnecessary columns
train_ds = train_ds.remove_columns(["text"])
eval_ds = eval_ds.remove_columns(["text"])

# Set the format for PyTorch
train_ds.set_format("torch")
eval_ds.set_format("torch")

# Load the model
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# Load the accuracy metric
accuracy = evaluate.load("accuracy")

# Compute accuracy; logits are converted to a PyTorch tensor so torch.argmax() can be used
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    logits = torch.tensor(logits)  # Convert logits to PyTorch tensor
    predictions = torch.argmax(logits, dim=-1)
    return accuracy.compute(predictions=predictions, references=labels)

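# Define the training arguments.
# NOTE: `training_args` was not defined in the original cell. The values below
# are assumptions inferred from the logged output: 1 epoch and 1563 steps over
# 25,000 examples imply a batch size of 16 on a single device, with evaluation
# once per epoch. The output directory name is hypothetical.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=16,
    eval_strategy='epoch',  # named `evaluation_strategy` in older transformers releases
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)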

# Train the model
trainer.train()




Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2137,0.195239,0.92708


TrainOutput(global_step=1563, training_loss=0.25599718551489303, metrics={'train_runtime': 401.9891, 'train_samples_per_second': 62.191, 'train_steps_per_second': 3.888, 'total_flos': 3311684966400000.0, 'train_loss': 0.25599718551489303, 'epoch': 1.0})
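
The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a training batch size of 16 on a single device. That value is inferred from the step count and is not stated anywhere in the notebook.

```python
import math

num_train_examples = 25_000   # size of the IMDb training split
global_step = 1563            # from TrainOutput
train_runtime = 401.9891      # seconds, from TrainOutput

batch_size = 16               # assumption: inferred, not stated in the notebook
steps_per_epoch = math.ceil(num_train_examples / batch_size)
print(steps_per_epoch)        # 1563, matching global_step for one epoch

# Throughput implied by the runtime, for comparison with the logged values.
samples_per_second = round(num_train_examples / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 3)
print(samples_per_second)     # 62.191
print(steps_per_second)       # 3.888
```

The last batch of the epoch holds only 8 examples (25,000 is not a multiple of 16), which is why the step count rounds up.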

In [36]:
# Save the model
trainer.save_model('./my_finetuned_model')
tokenizer.save_pretrained('./my_finetuned_model')

('./my_finetuned_model/tokenizer_config.json',
 './my_finetuned_model/special_tokens_map.json',
 './my_finetuned_model/vocab.txt',
 './my_finetuned_model/added_tokens.json',
 './my_finetuned_model/tokenizer.json')

In [37]:
from transformers import pipeline

# Load the trained model
classifier = pipeline("sentiment-analysis", model='./my_finetuned_model', tokenizer=tokenizer)

# Test with some new text
test_texts = ["This movie was great!", "I didn't enjoy the film at all."]
predictions = classifier(test_texts)

# Print predictions
for text, prediction in zip(test_texts, predictions):
    print(f"Text: {text}\nPrediction: {prediction['label']} with confidence {prediction['score']:.2f}\n")


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Text: This movie was great!
Prediction: LABEL_1 with confidence 0.99

Text: I didn't enjoy the film at all.
Prediction: LABEL_0 with confidence 0.94
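
The fine-tuned model reports generic names (`LABEL_0`, `LABEL_1`) because no label names were given when the classification head was created. A minimal sketch of renaming them after the fact follows. The helper `with_readable_labels` is hypothetical, and the mapping assumes IMDb's convention of 0 for negative and 1 for positive.

```python
# Assumption: IMDb encodes negative reviews as 0 and positive reviews as 1.
READABLE_LABELS = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}


def with_readable_labels(predictions):
    """Return copies of the pipeline predictions with human-readable labels."""
    return [
        {**p, "label": READABLE_LABELS.get(p["label"], p["label"])}
        for p in predictions
    ]


# Predictions copied from the output above.
predictions = [
    {"label": "LABEL_1", "score": 0.99},
    {"label": "LABEL_0", "score": 0.94},
]

readable = with_readable_labels(predictions)
for p in readable:
    print(f"{p['label']} with confidence {p['score']:.2f}")
```

An alternative is to pass `id2label` and `label2id` mappings to `from_pretrained` when loading the model for training, so that the names are stored in the saved config and the pipeline returns them directly.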



Model training completed successfully. The key results are:

Key Metrics from Training:
- Training Loss: 0.255997 (the average over the epoch reported in `TrainOutput`; the epoch table shows 0.2137)
- Validation Loss: 0.195239
- Accuracy: 92.71% (measured on the IMDb test split, which was used as the evaluation set)


Training Summary:
- The model achieved a strong accuracy of 92.71% after one epoch.
- The validation loss (0.195) is below the average training loss (0.256), so these numbers give no indication of overfitting after one epoch.
