<a href="https://colab.research.google.com/github/Jayabaskar-R/Finetuned_Sentiment_Analysis/blob/Finetuned_Trained_Model/Finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pandas transformers datasets torch boto3 nltk scikit-learn

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting boto3
  Downloading boto3-1.37.4-py3-none-any.whl.metadata (6.6 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)


In [None]:
import pandas as pd
import re
import nltk
import torch
from datasets import Dataset
from transformers import DistilBertTokenizer

# Download stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Load dataset
url = "https://raw.githubusercontent.com/GuviMentor88/Training-Datasets/refs/heads/main/twitter_training.csv"
df = pd.read_csv(url, header=None)
df.columns = ["Tweet ID", "Entity", "Sentiment", "Tweet Content"]

# Drop unnecessary columns (copy so later column assignments do not
# trigger pandas' SettingWithCopyWarning)
df = df[["Sentiment", "Tweet Content"]].copy()

# Handle missing values
df["Tweet Content"] = df["Tweet Content"].fillna("")

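# Encode labels. Any label not listed here (the dataset appears to include
# "Irrelevant" as well) maps to NaN and is defaulted to Neutral in format_labels.
label_mapping = {"Positive": 2, "Negative": 0, "Neutral": 1}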
df["Sentiment"] = df["Sentiment"].map(label_mapping)

# Text preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

df["Tweet Content"] = df["Tweet Content"].apply(preprocess_text)

# Convert to Hugging Face Dataset
dataset = Dataset.from_pandas(df)
train_test_split = dataset.train_test_split(test_size=0.2)
train_dataset, test_dataset = train_test_split["train"], train_test_split["test"]

# Tokenization
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

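def tokenize_function(example):
    # padding="max_length" pads every tweet to the tokenizer's maximum length
    # (512 tokens for this model); longer inputs are truncated.
    return tokenizer(example["Tweet Content"], padding="max_length", truncation=True)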

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Rename 'Sentiment' to 'labels' and ensure it is an integer
def format_labels(example):
    if example["Sentiment"] is None:
        example["labels"] = 1  # Assign "Neutral" as default if missing
    else:
        example["labels"] = int(example["Sentiment"])
    return example

train_dataset = train_dataset.map(format_labels)
test_dataset = test_dataset.map(format_labels)

# Set correct format for PyTorch training
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

print("Data Preprocessing and Tokenization Completed Successfully!")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Map:   0%|          | 0/59745 [00:00<?, ? examples/s]

Map:   0%|          | 0/14937 [00:00<?, ? examples/s]

Map:   0%|          | 0/59745 [00:00<?, ? examples/s]

Map:   0%|          | 0/14937 [00:00<?, ? examples/s]

Data Preprocessing and Tokenization Completed Successfully!
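
To make the effect of `preprocess_text` concrete, here is a minimal sketch that applies the same three steps to one made-up tweet. The small `demo_stop_words` set is an assumption standing in for NLTK's English stopword list so that the snippet runs without a download; the real list is much larger.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list.
demo_stop_words = {"i", "am", "so", "for", "the", "is", "a", "to", "and"}

def preprocess_text_demo(text, stop_words=demo_stop_words):
    text = text.lower()                       # lowercase
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # remove digits, punctuation, emoji
    return " ".join(w for w in text.split() if w not in stop_words)

# Made-up example tweet
sample = "I am SO hyped for Borderlands 3!!! https://example.com/trailer :)"
cleaned = preprocess_text_demo(sample)
print(cleaned)  # hyped borderlands
```

Note that NLTK's English list also contains negations such as "not", so a phrase like "not good" would be reduced to "good". For sentiment data it may be worth checking whether stopword removal helps or hurts.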


In [None]:
import os
os.environ["WANDB_MODE"] = "disabled"

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

# Load Pretrained Model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

# Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train Model
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,0.5139,0.442421
2,0.2634,0.384375
3,0.1139,0.39372
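
The table hints at overfitting after the second epoch: training loss keeps falling while validation loss rises slightly. A small sketch of that check, using only the numbers logged above:

```python
# (epoch, training loss, validation loss) as logged in the table above
history = [
    (1, 0.5139, 0.442421),
    (2, 0.2634, 0.384375),
    (3, 0.1139, 0.39372),
]

# Epoch with the lowest validation loss
best_epoch, _, best_val = min(history, key=lambda row: row[2])
print(best_epoch, best_val)  # 2 0.384375

# Did validation loss go up in the last epoch while training loss went down?
prev, last = history[-2], history[-1]
val_increased = last[2] > prev[2] and last[1] < prev[1]
print(val_increased)  # True
```

`TrainingArguments` also accepts `load_best_model_at_end=True` (together with `metric_for_best_model`), which would restore the best checkpoint after training. That option is not used in this notebook, so the model saved later is the epoch-3 model.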


TrainOutput(global_step=22407, training_loss=0.3699818472096776, metrics={'train_runtime': 8940.3514, 'train_samples_per_second': 20.048, 'train_steps_per_second': 2.506, 'total_flos': 2.374321761713664e+16, 'train_loss': 0.3699818472096776, 'epoch': 3.0})
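
The figures in `TrainOutput` can be checked by hand. The sketch below assumes a single device and no gradient accumulation (the default), and takes the training split size of 59,745 examples from the 80/20 split.

```python
import math

train_examples = 59745    # training split size (80% of the dataset)
batch_size = 8            # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 8940.3514 # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 7469 22407

# Throughput, rounded as in the reported metrics
samples_per_second = round(train_examples * epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)
print(samples_per_second, steps_per_second)  # 20.048 2.506
```

Both values agree with `global_step=22407`, `train_samples_per_second` and `train_steps_per_second` above. The run took roughly two and a half hours.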


In [None]:
model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:
import torch
import os
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Define model path
model_path = "fine_tuned_sentiment_model"

# Save the trained model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

# Zip the model for easy upload
!zip -r fine_tuned_sentiment_model.zip fine_tuned_sentiment_model

print("Model saved & zipped successfully!")


  adding: fine_tuned_sentiment_model/ (stored 0%)
  adding: fine_tuned_sentiment_model/special_tokens_map.json (deflated 42%)
  adding: fine_tuned_sentiment_model/config.json (deflated 49%)
  adding: fine_tuned_sentiment_model/tokenizer_config.json (deflated 75%)
  adding: fine_tuned_sentiment_model/model.safetensors (deflated 8%)
  adding: fine_tuned_sentiment_model/vocab.txt (deflated 53%)
Model saved & zipped successfully!
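
The saved directory can be reloaded with `DistilBertForSequenceClassification.from_pretrained(model_path)`, and the model then returns one logit per class. The sketch below shows one way to turn those logits back into a sentiment label by inverting the `label_mapping` used during preprocessing. The helper name `decode_logits` and the example logits are hypothetical.

```python
import math

# Inverse of the notebook's label_mapping
id2label = {0: "Negative", 1: "Neutral", 2: "Positive"}

def decode_logits(logits):
    """Return (label, probability) for one example's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits, e.g. what model(**inputs).logits[0].tolist() might return
label, prob = decode_logits([-1.2, 0.3, 2.5])
print(label, round(prob, 3))
```

At inference time, new text should go through the same `preprocess_text` and tokenizer as the training data, since the model never saw raw tweets.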


## Move to Hugging Face


In [None]:
!pip install huggingface_hub



In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
!ls /content/

sample_data
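
The listing shows only `sample_data`, so the `fine_tuned_sentiment_model` directory saved earlier is not present in this session. One possible explanation is that the runtime was restarted. Before uploading, it may help to confirm that the model files exist. The sketch below assumes the model was saved under `/content/` (Colab's default working directory) and checks for the files that appeared in the zip listing. The helper name is hypothetical.

```python
from pathlib import Path

# Files that appeared in the zip listing for the saved model
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "vocab.txt",
    "tokenizer_config.json",
    "special_tokens_map.json",
]

def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# Assumption: the model directory lives under /content/
missing = missing_model_files("/content/fine_tuned_sentiment_model")
if missing:
    print("Missing, re-save or re-upload before pushing:", missing)
else:
    print("Model directory looks complete.")
```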



In [None]:
!huggingface-cli login



    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
The token `JabasR2001` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-a