
## 1. Setup and Installation

## Install Core Libraries
 We install the core Python libraries required for interacting with Hugging Face models, PyTorch, and associated tools for efficient training and inference.

In [None]:
pip install --upgrade "fsspec==2025.3.2" accelerate bitsandbytes datasets evaluate peft transformers torch trl

Collecting accelerate
  Downloading accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting peft
  Downloading peft-0.15.2-py3-none-any.whl.metadata (13 kB)
Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting trl
  Downloading trl-0.16.1-py3-none-any.whl.metadata (12 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multipl

In [None]:
!pip show fsspec gcsfs

Name: fsspec
Version: 2025.3.2
Summary: File-system specification
Home-page: https://github.com/fsspec/filesystem_spec
Author: 
Author-email: 
License: BSD 3-Clause License


# Import Core Libraries

 Next, we import the necessary Python modules, including PyTorch, Hugging Face datasets, and key components from the transformers library for model handling and training setup.

In [None]:
# Import libraries
import os
import torch
from datasets import load_dataset, DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

# Import PEFT and Evaluation Libraries
 We import further modules specifically for Parameter-Efficient Fine-Tuning (PEFT) using LoRA and for model evaluation.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
import evaluate
import numpy as np

# Hugging Face Hub Login
Authentication with the Hugging Face Hub is required for accessing certain models or features. This cell initiates the interactive login prompt within the notebook.

In [None]:
from huggingface_hub import notebook_login
notebook_login()


# Mount Google Drive and Define Output Directory
 To potentially save outputs like model checkpoints persistently, we mount Google Drive and define an output directory. (Note: The code currently defaults to local Colab storage).

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Persistent option: uncomment to write outputs to Google Drive instead
# output_base_dir = "/content/drive/MyDrive/llama3_tamil_sentiment"
output_base_dir = "./llama3_tamil_sentiment" # Local Colab storage (cleared when the runtime resets)
os.makedirs(output_base_dir, exist_ok=True)

Mounted at /content/drive


## Check GPU Availability
 We check for the availability of a GPU, which is crucial for running large language models efficiently, and set the appropriate device for PyTorch operations.

In [None]:
# Check GPU availability and type
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / (1024**3):.2f} GB")
    device = torch.device("cuda")
else:
    print("GPU not available, using CPU. This will be very slow and likely fail for Llama 3.")
    device = torch.device("cpu")

GPU not available, using CPU. This will be very slow and likely fail for Llama 3.


# Define Configuration Variables
 We define the core configuration parameters for our model, dataset, and potential training process.

In [None]:
# --- Configuration ---
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct" # Using 8B due to availability
DATASET_ID = "Tngarg/Codemix_tamil_english"
NUM_EPOCHS = 1 # Start with 1 epoch due to Colab limits, can increase if stable
LEARNING_RATE = 2e-4 # Common learning rate for LoRA
BATCH_SIZE = 2 # Keep batch size VERY small
GRADIENT_ACCUMULATION_STEPS = 8 # Effective batch size = BATCH_SIZE * GRAD_ACCUM_STEPS = 16
MAX_SEQ_LENGTH = 256 # Adjust based on dataset analysis and VRAM
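
The effective batch size mentioned in the comment above is plain arithmetic, and the same numbers give a rough count of optimizer updates. A minimal sketch, assuming a single device and a hypothetical training-set size (`optimizer_steps` and the 10,000-example figure are illustrative and not taken from the dataset; the exact rounding used by `Trainer` may differ slightly):

```python
import math

BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 8
NUM_EPOCHS = 1

# Samples that contribute to each optimizer update
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

def optimizer_steps(num_train_examples: int) -> int:
    """Approximate number of optimizer updates over the whole run."""
    steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
    return steps_per_epoch * NUM_EPOCHS

print(effective_batch_size)     # 16
print(optimizer_steps(10_000))  # 625 for a hypothetical 10,000 examples
```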

# Define Label Mappings
 For the sentiment classification task, we define the number of labels and the mappings between label names and their integer representations.

In [None]:
num_labels = 3 # Replace with actual number of classes
# Create dummy mappings if not derived from data (replace if needed)
id2label = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"} # Adjust based on dataset
label2id = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2} # Adjust based on dataset
print(f"Number of labels: {num_labels}")
print(f"id2label mapping: {id2label}")
print(f"label2id mapping: {label2id}")

Number of labels: 3
id2label mapping: {0: 'NEGATIVE', 1: 'NEUTRAL', 2: 'POSITIVE'}
label2id mapping: {'NEGATIVE': 0, 'NEUTRAL': 1, 'POSITIVE': 2}
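
Writing both mappings by hand makes it easy for them to drift apart. One option, sketched here with the same three labels, is to derive `label2id` and `num_labels` from `id2label` so there is a single source of truth:

```python
id2label = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}

# Invert the mapping instead of typing it a second time
label2id = {label: idx for idx, label in id2label.items()}
num_labels = len(id2label)

# Sanity check: a round trip through both mappings returns the original name
assert all(id2label[label2id[name]] == name for name in label2id)
```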


## Re-install Core Libraries
 We ensure specific core LLM and acceleration libraries are installed or updated.

In [None]:
!pip install -q -U transformers accelerate bitsandbytes torch

# Hugging Face Login (Colab Secrets/Widget)
 We attempt a more secure method of Hugging Face authentication using Colab secrets, falling back to manual login via a widget if necessary.

In [None]:
from huggingface_hub import login
import os

# Try to get token from Colab secrets first
try:
  from google.colab import userdata
  hf_token = userdata.get('HF_TOKEN')
  if hf_token:
    print("Using Hugging Face token from Colab secrets.")
    login(token=hf_token, add_to_git_credential=True)
  else:
    print("HF_TOKEN secret not found. Please login manually.")
    # Fallback to manual login if the secret is not set
    from google.colab import output
    output.enable_custom_widget_manager()
    login(add_to_git_credential=True)  # Opens the Hugging Face login widget
except ImportError:
  # Manual login if not in Colab or secrets unavailable
  print("Not in Colab or secrets unavailable. Please login manually.")
  login(add_to_git_credential=True)

print("-" * 20)
print("Setup Complete!")
print("-" * 20)

Using Hugging Face token from Colab secrets.
--------------------
Setup Complete!
--------------------
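
Outside Colab, the same idea (keeping the token out of the notebook source) can be approximated with an environment variable. The sketch below assumes the token is exported as `HF_TOKEN`; `resolve_hf_token` and `mask_token` are hypothetical helper names introduced here for illustration.

```python
import os

def resolve_hf_token(env=os.environ):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None

def mask_token(token):
    """Keep only the first three characters so logs never contain the full secret."""
    if not token:
        return "<none>"
    return token[:3] + "*" * 8

token = resolve_hf_token()
print(f"Token found: {mask_token(token)}")
```

The returned value can then be passed to `login(token=...)`; when it is `None`, `login` prompts interactively.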


## Install llama-stack
 To facilitate downloading Llama models directly from Meta, we install the llama-stack library.

In [None]:
pip install llama-stack

Collecting llama-stack
  Downloading llama_stack-0.2.2-py3-none-any.whl.metadata (18 kB)
Collecting blobfile (from llama-stack)
  Downloading blobfile-3.0.0-py3-none-any.whl.metadata (15 kB)
Collecting fire (from llama-stack)
  Downloading fire-0.7.0.tar.gz (87 kB)
  Preparing metadata (setup.py) ... done
Collecting llama-stack-client>=0.2.2 (from llama-stack)
  Downloading llama_stack_client-0.2.2-py3-none-any.whl.metadata (15 kB)
Collecting python-dotenv (from llama-stack)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Collecting tiktoken (from llama-stack)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting pyaml (from llama-stack-client>=0.2.2->llama-stack)
  Downloading pyaml-25.1.0-py3-none-any.whl.metadata (12 kB)
Collecting pycryptodomex>=3.8 (from blobfile-

In [None]:
# !llama model list

## Download Llama 3.2 1B via llama-stack
 We use the llama-stack tool to download the Llama 3.2 1B Instruct model directly from Meta. (Execution requires pasting a signed URL when prompted).

In [None]:
!llama model download --source meta --model-id Llama3.2-1B-Instruct

Streaming output truncated to the last 5000 lines.
Downloading checklist.chk      100.0%  156/156 bytes
Downloading tokenizer.model    100.0%  2.2/2.2 MB
Downloading params.json        100.0%  220/220 bytes


## Define Model ID for Transformers
 We confirm the model identifier we intend to load using the transformers library and re-import necessary modules.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# --- CHOOSE YOUR MODEL ID ---
# Use the exact Hub ID of a model you have been granted access to.
# Here we use Llama 3.2 1B Instruct, which is small enough for a Colab runtime.
model_id = "meta-llama/Llama-3.2-1B-Instruct"

print(f"Attempting to load model: {model_id}")

Attempting to load model: meta-llama/Llama-3.2-1B-Instruct


# Define Quantization Configuration

To load the model efficiently with reduced memory usage, we define the 4-bit quantization configuration using BitsAndBytesConfig.

In [None]:
# --- Configuration for loading in 4-bit (to save memory) ---
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 # Use bfloat16 for modern GPUs
)
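
To get a feel for what 4-bit loading saves, the weight memory can be estimated from the checkpoint size. This is a rough sketch under stated assumptions: the 2.47 GB figure is the checkpoint size reported in this notebook's download logs, it is assumed to hold 16-bit weights, and the estimate ignores quantization constants, layers that stay unquantized, activations, and the KV cache, so real usage will be higher.

```python
CHECKPOINT_BYTES = 2.47e9   # checkpoint size from the download logs (assumed 16-bit weights)
BYTES_PER_PARAM_16BIT = 2

# Approximate parameter count implied by the checkpoint size
approx_params = CHECKPOINT_BYTES / BYTES_PER_PARAM_16BIT

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, bits in [("float32", 32), ("bfloat16", 16), ("nf4", 4)]:
    print(f"{name:>8}: {weight_memory_gb(approx_params, bits):.2f} GB")
```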

In [None]:
# import os

# # Set your Hugging Face token as an environment variable
# os.environ["HF_TOKEN"] = ""

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineG

In [None]:
!huggingface-cli whoami  # Should return your username

Rishpraveen


# Upgrade Core Libraries
We upgrade the transformers and huggingface_hub libraries to ensure we have the latest versions.

In [None]:
pip install --upgrade transformers huggingface_hub



# Download Model via huggingface-cli
 Alternatively, we can download the model files from the Hugging Face Hub using the huggingface-cli tool, potentially targeting specific subdirectories.


In [None]:
!huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --include "original/*" --local-dir Llama-3.2-1B-Instruct

Fetching 3 files:   0% 0/3 [00:00<?, ?it/s]Downloading 'original/params.json' to 'Llama-3.2-1B-Instruct/.cache/huggingface/download/original/jqHB00sRqBVJXCrFOHz5gDS2Bg8=.9cd8dbdf2dc6f4d8abb60bdb5ce64f4bec2fdfd9.incomplete'

params.json: 100% 220/220 [00:00<00:00, 1.00MB/s]
Download complete. Moving file to Llama-3.2-1B-Instruct/original/params.json
Downloading 'original/consolidated.00.pth' to 'Llama-3.2-1B-Instruct/.cache/huggingface/download/original/_dLw4ih-O1I9AkO57vYC89Z48Os=.fc17d497df5e4175b3a8acb4f5865b26f7fc1b009b25bef814b95fde10e8a1f3.incomplete'
Downloading 'original/tokenizer.model' to 'Llama-3.2-1B-Instruct/.cache/huggingface/download/original/7iVfz3cUOMr-hyjiqqRDHEwVBAM=.82e9d31979e92ab929cd544440f129d9ecd797b69e327f80f17e1c50d5551b55.incomplete'

consolidated.00.pth:   0% 0.00/2.47G [00:00<?, ?B/s][A

tokenizer.model:   0% 0.00/2.18M [00:00<?, ?B/s][A[A

tokenizer.model: 100% 2.18M/2.18M [00:00<00:00, 18.0MB/s]
Download complete. Moving file to Llama-3.2-1B-Instruct/o

In [None]:
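import os
from huggingface_hub import login

# Read the access token from the environment instead of hard-coding it in the notebook.
# If HF_TOKEN is unset, login() falls back to an interactive prompt.
login(token=os.environ.get("HF_TOKEN"))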

# Create Text Generation Pipeline
 We create a Hugging Face pipeline object configured for text generation using the Llama 3.2 model.


In [None]:
import os
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    token=os.environ.get("HF_TOKEN"),  # Read from the environment; None falls back to the cached login
    device_map="auto",  # Uses a GPU when one is available, otherwise the CPU
    torch_dtype=torch.float16  # Use float16 for reduced memory footprint
)


config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

Device set to use cpu


# Test Text Generation Pipeline
 We test text generation with a basic question. Note that this cell builds a fresh pipeline with default settings rather than reusing `pipe` from the previous step.

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
pipe(messages)

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': [{'role': 'user', 'content': 'Who are you?'},
   {'role': 'assistant',
    'content': 'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta'}]}]
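
When the pipeline is called with a list of chat messages, the output above shows that `generated_text` holds the whole conversation, with the model's reply as the last assistant turn. A small sketch of pulling out just that reply; `last_assistant_reply` is a hypothetical helper, and the sample data is a shortened stand-in for the output shown above.

```python
def last_assistant_reply(outputs):
    """Return the content of the final assistant turn, or '' if there is none."""
    turns = outputs[0]["generated_text"]
    for turn in reversed(turns):
        if turn.get("role") == "assistant":
            return turn.get("content", "").strip()
    return ""

# Shortened stand-in with the same structure as the pipeline output
sample_outputs = [{
    "generated_text": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I'm an artificial intelligence model known as Llama."},
    ]
}]

print(last_assistant_reply(sample_outputs))
```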

# Direct Model and Tokenizer Loading
Alternatively, for more granular control over the process, we can load the model and tokenizer directly using the Auto* classes from transformers.


In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Load Quantized Model
 Now, we load the Llama 3.2 model again, this time explicitly applying the 4-bit quantization settings defined earlier for significant memory savings.


In [None]:
# --- Load Model ---
# 4-bit loading through bitsandbytes needs a CUDA GPU; on a CPU-only runtime this cell fails.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16, # Match the 4-bit compute dtype
    device_map="auto", # Place the model on the available GPU(s)
)

print("-" * 20)
print(f"Model '{model_id}' and Tokenizer loaded successfully!")
print(f"Model loaded on device: {model.device}") # Check where the model is loaded
print("-" * 20)

RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend
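
The error above occurs because the runtime has no CUDA device, which the 4-bit bitsandbytes path requires. One way to avoid the crash is to decide up front whether to quantize. The sketch below only makes that decision; `cuda_is_available` and `loading_plan` are hypothetical helpers, and the returned dictionary is a plain description rather than something `from_pretrained` accepts directly.

```python
def cuda_is_available() -> bool:
    """True only if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()

def loading_plan(cuda_available: bool) -> dict:
    """Describe how the model should be loaded on this runtime."""
    if cuda_available:
        # GPU present: 4-bit quantization is an option
        return {"quantize_4bit": True, "device_map": "auto"}
    # CPU only: skip quantization and load the model normally
    return {"quantize_4bit": False, "device_map": None}

plan = loading_plan(cuda_is_available())
print(plan)
```

With such a plan in hand, `quantization_config` would be passed to `from_pretrained` only when `plan["quantize_4bit"]` is true.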

# Import Sentiment Function Libraries
 Before defining our sentiment analysis function, we import the libraries required for text generation and output parsing.

In [None]:
# @title Define Sentiment Analysis Function (Zero-Shot Prompting)
import torch
from transformers import pipeline
import re # For parsing the output

In [None]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

In [None]:
# Using a pipeline for easier text generation with instruct models
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16, # Should match model loading dtype
    device_map="auto" # Ensure pipeline uses the mapped device
)

Device set to use cuda:0


In [None]:
def classify_sentiment_llama3(text_tamil):
    """
    Classifies the sentiment of Tamil text using Llama 3 via prompting.
    """
    # --- Craft the Prompt ---
    # This prompt clearly defines the task, input, expected output format, and language.
    # Using the chat template structure is important for Instruct models.
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant trained to analyze sentiment in Tamil text. Classify the sentiment of the following Tamil text as only one of these options: Positive, Negative, or Neutral. Provide only the sentiment label as your answer (e.g., 'Sentiment: Positive').",
        },
        {
            "role": "user",
            "content": f"Tamil Text: \"{text_tamil}\"\n\nSentiment:"
        },
        # The model should complete this after "Sentiment:"
    ]

    # --- Prepare prompt using tokenizer's chat template ---
    # The pipeline handles this internally, but showing it for clarity if you were doing manual generation:
    # prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # print(f"DEBUG: Generated prompt:\n{prompt}\n") # Uncomment to see the final prompt structure

    # --- Parameters for Generation ---
    terminators = [
        tokenizer.eos_token_id,
        # You might add other terminators if the model tends to run on
        # tokenizer.convert_tokens_to_ids("<|eot_id|>") # Example specific to Llama 3 terminators
    ]

    # --- Run Inference using the pipeline ---
    try:
        outputs = text_generator(
            messages, # Pass the structured messages directly
            max_new_tokens=10,  # Limit output length (just need the label)
            eos_token_id=terminators,
            do_sample=False, # Greedy decoding; temperature and top_p only apply when sampling
            pad_token_id=tokenizer.eos_token_id # Use EOS token for padding
        )

        # --- Extract the generated text ---
        # The pipeline usually returns a list containing a dictionary
        generated_text = outputs[0]['generated_text']

        # --- Parse the assistant's response ---
        # The response often includes the original prompt; we need the part added by the assistant.
        # The pipeline's output structure might vary slightly, inspect `outputs` if needed.
        # Method 1: Find the last message content (usually the assistant's)
        assistant_response = ""
        if isinstance(generated_text, list) and len(generated_text) > 0: # Check if output is a list of chat turns
            # Find the last message with role 'assistant'
            for msg in reversed(generated_text):
                if msg.get("role") == "assistant":
                    assistant_response = msg.get("content", "").strip()
                    break
        elif isinstance(generated_text, str): # Sometimes it might return the full string
            # Try splitting based on a known part of the prompt or user message
            parts = generated_text.split("Sentiment:")
            if len(parts) > 1:
                assistant_response = parts[-1].strip()
            else: # Fallback if structure is unexpected
                assistant_response = generated_text # Take the whole thing and hope for the best

        print(f"DEBUG: Raw assistant response: '{assistant_response}'") # Debugging line

        # --- Extract the specific label ---
        # Look for the keywords "Positive", "Negative", or "Neutral" case-insensitively
        if re.search(r'Positive', assistant_response, re.IGNORECASE):
            return "Positive"
        elif re.search(r'Negative', assistant_response, re.IGNORECASE):
            return "Negative"
        elif re.search(r'Neutral', assistant_response, re.IGNORECASE):
            return "Neutral"
        else:
            # Fallback if the model didn't follow instructions exactly
            print(f"Warning: Could not parse sentiment from response: '{assistant_response}'. Returning 'Unknown'.")
            return "Unknown"

    except Exception as e:
        print(f"An error occurred during text generation: {e}")
        # print("DEBUG: Full pipeline output:", outputs) # Uncomment for detailed error debugging
        return f"Error: {e}"
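
The label extraction in the function above checks for "Positive" first, so a reply such as "Neutral, not positive" would be reported as Positive. A possible alternative, sketched here as a standalone helper (`parse_label` is a hypothetical name), returns whichever label appears first in the reply and matches whole words only:

```python
import re

# Whole-word, case-insensitive match on any of the three labels
LABEL_PATTERN = re.compile(r"\b(positive|negative|neutral)\b", re.IGNORECASE)

def parse_label(response: str) -> str:
    """Return the first sentiment label found in the response, or 'Unknown'."""
    match = LABEL_PATTERN.search(response)
    return match.group(1).capitalize() if match else "Unknown"

print(parse_label("Sentiment: Positive"))
print(parse_label("Neutral, not positive"))
```

Taking the earliest match is itself a heuristic; it assumes the model states its answer before any qualification.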

In [None]:
# @title Test with Tamil Examples

# --- Sample Tamil Texts ---
# (You can replace these with your own data)
tamil_texts = [
    "இந்த படம் மிகவும் அருமையாக இருந்தது.",
    "சேவை மிகவும் மோசம், நான் திருப்தி அடையவில்லை.",
    "இந்த செய்தி நேற்று வெளியிடப்பட்டது.",
    "வானிலை இன்று சாதாரணமாக உள்ளது.",
    "அதமான் சுவை! நான் மீண்டும் வருவேன்.",
    "பயணம் மிகவும் சோர்வாக இருந்தது.",
    "Super padam! Vera level acting.", #
    "Worst experience ever, don't recommend.",
    "waste of time"
]

In [None]:
print("\n--- Starting Sentiment Analysis ---")
for i, text in enumerate(tamil_texts):
    print(f"\nText {i+1}: \"{text}\"")
    sentiment = classify_sentiment_llama3(text)
    print(f"Predicted Sentiment: {sentiment}")
    print("-" * 15)

print("\n--- Sentiment Analysis Complete ---")


--- Starting Sentiment Analysis ---

Text 1: "இந்த படம் மிகவும் அருமையாக இருந்தது."


NameError: name 'classify_sentiment_llama3' is not defined

In [None]:
import torch
from transformers import pipeline
import re

# Assuming 'model' and 'tokenizer' are already loaded from the previous steps
# And 'text_generator' pipeline is initialized

def classify_sentiment_llama3_enhanced_prompt(text_input):
    """
    Classifies sentiment using Llama 3 with a more detailed system prompt.
    """
    messages = [
        {
            "role": "system",
            "content": """You are an expert sentiment analysis assistant specializing in Tamil and Tamil-English code-mixed text.
Your task is to classify the sentiment of the user's text as strictly one of: Positive, Negative, or Neutral.
Pay close attention to the overall meaning. Be very sensitive to negative language, complaints, dissatisfaction, and critical words, even if mixed with English or colloquial Tamil terms. Do not misclassify negative statements as positive.
Provide only the single-word sentiment label (Positive, Negative, or Neutral) as your answer. For example: 'Negative'."""
        },
        {
            "role": "user",
            "content": f"Analyze the sentiment of this text: \"{text_input}\""
        },
        # Llama 3 should generate the label here
    ]

    terminators = [
        tokenizer.eos_token_id,
        # tokenizer.convert_tokens_to_ids("<|eot_id|>") # Optional, specific to Llama 3 terminators if needed
    ]

    try:
        outputs = text_generator(
            messages,
            max_new_tokens=5, # Just need one word label
            eos_token_id=terminators,
            do_sample=False, # Greedy decoding; temperature and top_p only apply when sampling
            pad_token_id=tokenizer.eos_token_id
        )

        # --- Extract the generated text ---
        # The pipeline output structure might vary. Let's refine extraction.
        assistant_response = ""
        generated_output = outputs[0]['generated_text']

        # Check if the output is the full chat history (list of dicts)
        if isinstance(generated_output, list):
            # Get the last message, assuming it's the assistant's reply
            if generated_output and generated_output[-1].get("role") == "assistant":
                assistant_response = generated_output[-1].get("content", "").strip()
        # Check if the output is just the completion string
        elif isinstance(generated_output, str):
            # Find the response part after the user's prompt (less reliable)
            # This might need adjustment based on how your pipeline returns completions
            user_prompt_end = f"Analyze the sentiment of this text: \"{text_input}\""
            if user_prompt_end in generated_output:
                # Find the start of the assistant's actual response
                # It might be complex if the model repeats parts of the prompt.
                # Simplest approach: Split and take the last part, hoping it's the answer.
                parts = generated_output.split(user_prompt_end)
                if len(parts) > 1:
                    # Further split by potential delimiters if the model adds extra text
                    response_part = parts[-1].strip()
                    # Try to find the label directly if it adds fluff like "Sentiment:"
                    label_match = re.search(r'\b(Positive|Negative|Neutral)\b', response_part, re.IGNORECASE)
                    if label_match:
                        assistant_response = label_match.group(1)
                    else:
                        assistant_response = response_part.split('\n')[0].strip() # Take first line
                else:
                    assistant_response = generated_output # Fallback
            else:
                # If prompt structure isn't found, maybe it just returned the answer
                assistant_response = generated_output

        print(f"DEBUG: Raw assistant response: '{assistant_response}'")

        # --- Extract the specific label ---
        # Using regex for robustness against variations like "Sentiment: Negative" or just "Negative"
        match = re.search(r'\b(Positive|Negative|Neutral)\b', assistant_response, re.IGNORECASE)
        if match:
            # Capitalize consistently
            return match.group(1).capitalize()
        else:
            print(f"Warning: Could not parse sentiment from response: '{assistant_response}'. Returning 'Unknown'.")
            return "Unknown"

    except Exception as e:
        print(f"An error occurred during text generation: {e}")
        # print("DEBUG: Full pipeline output:", outputs)
        return f"Error: {e}"

# @title Test with Enhanced Prompt (Add tricky examples)

tricky_texts = [
    "இந்த படம் மிகவும் அருமையாக இருந்தது.", # Positive
    "சேவை மிகவும் மோசம், நான் திருப்தி அடையவில்லை.", # Negative
    "enna service idhu? very bad experience da.", # Code-mixed Negative
    "food was okay, nothing special.", # Code-mixed Neutral/Negative (test ambiguity)
    "Super padam! Vera level acting.", # Code-mixed Positive
    "Waste of time and money. Highly disappointing.", # Negative
    "அவன் ஒரு சரியான தண்டம்.", # He is completely useless. (Colloquial Negative)
    "சும்மா சொல்லக்கூடாது, server response ரொம்ப slow.", # Frankly speaking, server response very slow. (Code-mixed Negative)
    "Decent attempt, but could be better.", # Neutral/Slightly Negative
    "This product is absolute rubbish.", # Negative
    "ada pongada dei",
    "yara neengalam",
    "kollathingada",
    "mairu"
]

print("\n--- Starting Sentiment Analysis with Enhanced Prompt ---")
for i, text in enumerate(tricky_texts):
    print(f"\nText {i+1}: \"{text}\"")
    # *** Use the new function ***
    sentiment = classify_sentiment_llama3_enhanced_prompt(text)
    print(f"Predicted Sentiment: {sentiment}")
    print("-" * 15)

print("\n--- Sentiment Analysis Complete ---")


--- Starting Sentiment Analysis with Enhanced Prompt ---

Text 1: "இந்த படம் மிகவும் அருமையாக இருந்தது."
DEBUG: Raw assistant response: 'Negative'
Predicted Sentiment: Negative
---------------

Text 2: "சேவை மிகவும் மோசம், நான் திருப்தி அடையவில்லை."
DEBUG: Raw assistant response: 'Negative'
Predicted Sentiment: Negative
---------------

Text 3: "enna service idhu? very bad experience da."
DEBUG: Raw assistant response: 'Negative'
Predicted Sentiment: Negative
---------------

Text 4: "food was okay, nothing special."
DEBUG: Raw assistant response: 'Negative'
Predicted Sentiment: Negative
---------------

Text 5: "Super padam! Vera level acting."
DEBUG: Raw assistant response: 'Negative'
Predicted Sentiment: Negative
---------------

Text 6: "Waste of time and money. Highly disappointing."
DEBUG: Raw assistant response: 'Negative'
Predicted Sentiment: Negative
---------------

Text 7: "அவன் ஒரு சரியான தண்டம்."
DEBUG: Raw assistant response: 'Negative'
Predicted Sentiment: Negative
----
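
Every prediction shown above is "Negative", including the two texts annotated as positive. A quick tally makes this kind of collapse visible. The sketch below uses the expected labels from the inline comments in `tricky_texts` for texts 1, 2, 3, 5, 6 and 7 (text 4 is left out because its annotation is ambiguous) and the predictions printed above.

```python
from collections import Counter

# Expected labels from the inline comments (texts 1, 2, 3, 5, 6, 7)
expected = ["Positive", "Negative", "Negative", "Positive", "Negative", "Negative"]
# Predictions printed in the output above for the same texts
predicted = ["Negative"] * 6

distribution = Counter(predicted)
correct = sum(e == p for e, p in zip(expected, predicted))
accuracy = correct / len(expected)

print(distribution)                 # Counter({'Negative': 6})
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.67
```

A single predicted class across varied inputs suggests the system prompt's emphasis on negative language may be biasing the model, which is worth checking before scaling up to real comments.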

In [None]:
!pip install google-api-python-client --quiet

In [None]:
# @title Fetch YouTube Comments and Analyze Sentiment (Using Simpler Prompt Logic)

import os
import pandas as pd
import re
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google.colab import userdata
from urllib.parse import urlparse, parse_qs
import time # To avoid hitting quota limits too fast
import torch # Ensure torch is imported if not already
from transformers import pipeline # Ensure pipeline is imported

# --- Configuration ---
YOUTUBE_VIDEO_URL = "https://youtu.be/mTHXBofIc14?si=ONSBrajTQSE2Ul1R" # @param {type:"string"}
MAX_COMMENTS_TO_FETCH = 50 # @param {type:"integer"} # Reduced for faster testing
COMMENTS_PER_PAGE = 50 # Max allowed by API is 100

# --- Prerequisite Check: Ensure Model and Tokenizer are loaded ---
try:
    # 'model' and 'tokenizer' are normally loaded in an earlier cell;
    # reload them here only if they are missing from the session.
    if 'model' not in globals() or 'tokenizer' not in globals():
         print("Reloading model and tokenizer...")
         tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
         model = AutoModelForCausalLM.from_pretrained(
             "meta-llama/Llama-3.2-1B-Instruct",
             torch_dtype=torch.bfloat16, # Match model loading dtype
             device_map="auto"
         )
         print("Model and tokenizer reloaded.")

    # Create the pipeline (Ensure it uses the loaded model/tokenizer)
    text_generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16, # Match model loading dtype
        device_map="auto" # Ensure pipeline uses the mapped device
    )
    print("Text generation pipeline ready.")

except NameError as e:
     raise NameError(f"ERROR: Model, Tokenizer, or Pipeline not found. Please run the model loading cells first. Details: {e}")
except Exception as e:
     raise RuntimeError(f"An error occurred setting up the model pipeline: {e}")


# --- Helper Functions ---

def get_youtube_api_key():
    """Gets the YouTube API key from Colab secrets; returns None if it is unavailable."""
    try:
        api_key = userdata.get('YOUTUBE_API_KEY')
    except Exception as e: # e.g. the secret is missing or notebook access was not granted
        print(f"Error retrieving API key: {e}")
        return None
    if not api_key:
        print("YouTube API Key not found in Colab secrets. Please add it with the name 'YOUTUBE_API_KEY'.")
        return None
    return api_key

def extract_video_id(url):
    """Extracts the YouTube video ID from various URL formats."""
    parsed_url = urlparse(url)
    if parsed_url.hostname == 'youtu.be':
        return parsed_url.path[1:]
    if parsed_url.hostname in ('www.youtube.com', 'youtube.com'):
        if parsed_url.path == '/watch':
            p = parse_qs(parsed_url.query)
            return p.get('v', [None])[0]
        if parsed_url.path.startswith('/embed/'):
            return parsed_url.path.split('/')[2]
        if parsed_url.path.startswith('/v/'):
            return parsed_url.path.split('/')[2]
    print(f"Warning: Could not extract video ID from URL: {url}")
    return None

# --- Llama 3 Sentiment Function (Simpler Prompt Version) ---
# Re-implementing the logic from the original function here for clarity
def classify_sentiment_simple_prompt(text_input):
    """Classifies sentiment using Llama 3 with the simpler prompt."""
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant trained to analyze sentiment in text. Classify the sentiment of the following text as Positive, Negative, or Neutral."
        },
        {
            "role": "user",
            "content": f"Text: \"{text_input}\"\n\nSentiment:" # Simple prompt structure
        },
        # Model completes after "Sentiment:"
    ]

    terminators = [
        tokenizer.eos_token_id,
        # Add specific Llama 3 terminators if needed, e.g.:
        # tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    try:
        outputs = text_generator(
            messages,
            max_new_tokens=10, # Just need the label
            eos_token_id=terminators,
            do_sample=False, # Greedy decoding; temperature and top_p only apply when sampling
            pad_token_id=tokenizer.eos_token_id
        )

        # --- Parsing Logic (adapted from original function) ---
        generated_text_obj = outputs[0]['generated_text']
        assistant_response = ""

        # Handle if output is list of turns or just the completion string
        if isinstance(generated_text_obj, list):
            # Get the last message (assistant's response)
            if generated_text_obj and generated_text_obj[-1].get("role") == "assistant":
                 assistant_response = generated_text_obj[-1].get("content", "").strip()
            else: # Fallback if structure isn't as expected
                 assistant_response = str(generated_text_obj) # Log the unexpected
        elif isinstance(generated_text_obj, str):
             # Try splitting if it includes the prompt
             parts = generated_text_obj.split("Sentiment:")
             if len(parts) > 1:
                 assistant_response = parts[-1].strip()
             else: # Assume it's just the completion
                 assistant_response = generated_text_obj.strip()

        print(f"DEBUG (Simple Prompt): Raw assistant response: '{assistant_response}'") # Debugging

        # Extract the specific label (case-insensitive search)
        if re.search(r'\bPositive\b', assistant_response, re.IGNORECASE):
            return "Positive"
        elif re.search(r'\bNegative\b', assistant_response, re.IGNORECASE):
            return "Negative"
        elif re.search(r'\bNeutral\b', assistant_response, re.IGNORECASE):
            return "Neutral"
        else:
            print(f"Warning (Simple Prompt): Could not parse sentiment from response: '{assistant_response}'. Returning 'Unknown'.")
            return "Unknown"

    except Exception as e:
        print(f"An error occurred during text generation: {e}")
        # print("DEBUG: Full pipeline output:", outputs) # Uncomment for detailed error debugging
        return f"Error: {e}"


# --- Main Logic ---
api_key = get_youtube_api_key()
video_id = extract_video_id(YOUTUBE_VIDEO_URL)
comments_data = []

if api_key and video_id:
    try:
        youtube = build('youtube', 'v3', developerKey=api_key)
        print(f"Fetching comments for video ID: {video_id}")
        next_page_token = None
        comments_fetched = 0

        while comments_fetched < MAX_COMMENTS_TO_FETCH:
            request = youtube.commentThreads().list(
                part="snippet",
                videoId=video_id,
                maxResults=min(COMMENTS_PER_PAGE, MAX_COMMENTS_TO_FETCH - comments_fetched),
                textFormat="plainText",
                pageToken=next_page_token
            )
            response = request.execute()

            for item in response.get("items", []):
                if comments_fetched >= MAX_COMMENTS_TO_FETCH: break
                comment = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                # Basic cleaning: remove potential excessive newlines/spaces for the model
                comment_cleaned = re.sub(r'\s+', ' ', comment).strip()
                if not comment_cleaned: # Skip empty comments after cleaning
                    continue

                author = item["snippet"]["topLevelComment"]["snippet"]["authorDisplayName"]
                published_at = item["snippet"]["topLevelComment"]["snippet"]["publishedAt"]
                like_count = item["snippet"]["topLevelComment"]["snippet"]["likeCount"]

                print(f"\nAnalyzing comment {comments_fetched + 1}/{MAX_COMMENTS_TO_FETCH}: \"{comment_cleaned[:100]}...\"")
                # --- Use the simpler prompt sentiment function ---
                sentiment = classify_sentiment_simple_prompt(comment_cleaned)
                print(f"Predicted Sentiment: {sentiment}")
                # ---

                comments_data.append({
                    "Author": author,
                    "Published At": published_at,
                    "Comment": comment, # Store original comment
                    "Likes": like_count,
                    "Predicted Sentiment": sentiment
                })
                comments_fetched += 1

            next_page_token = response.get("nextPageToken")
            if not next_page_token or comments_fetched >= MAX_COMMENTS_TO_FETCH: break
            time.sleep(0.5)

        print(f"\nFetched and analyzed {len(comments_data)} comments.")

    except HttpError as e:
        print(f"\nAn HTTP error {e.resp.status} occurred:")
        error_content = e.content.decode('utf-8')
        print(error_content)
        # Handle specific errors...
    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")

else:
    # Nothing to fetch without both an API key and a video ID
    print("\nSkipping fetch: missing YouTube API key or video ID.")

# --- Display Results ---
if comments_data:
    df_comments = pd.DataFrame(comments_data)
    print("\n--- Comment Sentiment Analysis Results (Using Simpler Prompt) ---")
    pd.set_option('display.max_colwidth', 200)
    pd.set_option('display.max_rows', 100)
    display(df_comments)
else:
    print("\nNo comments were fetched or analyzed.")

Device set to use cuda:0
Text generation pipeline ready.
Fetching comments for video ID: mTHXBofIc14

Analyzing comment 1/50: "Dudes a whole clown another opinion rejected 🤡🤡🤡..."
DEBUG (Simple Prompt): Raw assistant response: 'The sentiment of the text is Negative. The use'
Predicted Sentiment: Negative

Analyzing comment 2/50: "They should cast a guy in wig to play Abby. Like Terry Crews!..."
DEBUG (Simple Prompt): Raw assistant response: 'The sentiment of the text is Positive. The user'
Predicted Sentiment: Positive

Analyzing comment 3/50: "Facially, she really does remind me of Abby. I don't know how they'd find a female actress of Abby's..."
DEBUG (Simple Prompt): Raw assistant response: 'The sentiment of the given text is Neutral. The'
Predicted Sentiment: Neutral

Analyzing comment 4/50: "I just had a thought that the scene where Ellie was training against a guy that was almost twice her..."
DEBUG (Simple Prompt): Raw assistant response: 'The sentiment of the given text is Neutral. The'
Predicted Sentiment: Neutral

Analyzing comment 5/50: "But like, Abby isn’t supposed to be buff at this point in the story right? I didn’t pay too much att..."
DEBUG (Sim

Unnamed: 0,Author,Published At,Comment,Likes,Predicted Sentiment
0,@JusticeGamingNY,2025-04-16T14:10:56Z,Dudes a whole clown another opinion rejected 🤡🤡🤡,0,Negative
1,@RibbonPL,2025-04-16T14:07:59Z,They should cast a guy in wig to play Abby. Like Terry Crews!,0,Positive
2,@OKA4LIVE,2025-04-16T14:07:41Z,"Facially, she really does remind me of Abby. I don't know how they'd find a female actress of Abby's size who looks even remotely similar to her facially. Or captures the essence of her personalit...",0,Neutral
3,@EmuleelArts,2025-04-16T14:05:26Z,I just had a thought that the scene where Ellie was training against a guy that was almost twice her size could’ve been really good subtle foreshadowing for her conflict against a much bigger Abby...,1,Neutral
4,@EmuleelArts,2025-04-16T13:59:42Z,"But like, Abby isn’t supposed to be buff at this point in the story right?\nI didn’t pay too much attention to all the trailers and pre-release material so if Abby is shown to be just as skinny th...",1,Neutral
5,@shannarong3475,2025-04-16T13:57:02Z,I remember seeing an interview where Neil Druckmann said they didn’t bulk up Kaitlyn because the live-action Abby doesn’t need to look super muscular. They’re focusing more on showing her strength...,0,Neutral
6,@Zuernaxashyr,2025-04-16T13:54:45Z,Dunno about that.. Abby was getting pretty muscular before her father died.,0,Neutral
7,@LarryHayward-v7p,2025-04-16T13:47:38Z,The Movies and TV shows need to thrive without reliance on the games.\n\nRather than a strict road map for an adaptation. As it allows for the best of both worlds.💯💯💯 it’s a unique exploration of ...,0,Positive
8,@looniemoonie5955,2025-04-16T13:47:35Z,"I mean that's why she's such a threat to Ellie, no?\nBecause she's buff and strong. Imagine Batman going against Bane except he is 70 yo shortie. Not much impact here (unless he still got the stre...",0,Negative
9,@adib1081,2025-04-16T13:47:22Z,Can't defend the season 2 cast anymore bro,0,Negative
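
The DEBUG lines above show that the model answers in sentence form ('The sentiment of the text is Negative. The use') rather than with a bare label. The parser in the cell checks for Positive, then Negative, then Neutral in a fixed order, so a response such as 'Negative, not Positive' would be read as Positive. One alternative, sketched here on the assumption that the first label mentioned is the model's answer, is to take the earliest match. `first_label` is a hypothetical helper, not the notebook's function.

```python
import re

# Matches any of the three labels as whole words, ignoring case.
LABEL_PATTERN = re.compile(r"\b(positive|negative|neutral)\b", re.IGNORECASE)

def first_label(response):
    """Return the first sentiment label mentioned in the response, or 'Unknown'."""
    match = LABEL_PATTERN.search(response)
    return match.group(1).capitalize() if match else "Unknown"

print(first_label("The sentiment of the text is Negative. The use"))
print(first_label("Negative, not Positive"))
print(first_label("I cannot tell."))
```

This still relies on the model naming its answer before any other label; constraining the output format in the prompt would be a more robust fix.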



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

