![image](car.jpeg)

**Car-ing is sharing**, an auto dealership company for car sales and rental, is taking their services to the next level thanks to **Large Language Models (LLMs)**.

As their newly recruited AI and NLP developer, you've been asked to prototype a chatbot app with multiple functionalities that not only assist customers but also provide support to human agents in the company.

The solution should receive textual prompts and use a variety of pre-trained Hugging Face LLMs to respond to a series of tasks, e.g. classifying the sentiment in a car’s text review, answering a customer question, summarizing or translating text, etc.


## Before you start

In order to complete the project you may wish to install some Hugging Face libraries such as `transformers` and `evaluate`.

In [None]:
# !pip install -q transformers
# !pip install -q evaluate==0.4.0
# !pip install -q datasets==2.10.0
# !pip install -q sentencepiece==0.1.97
# !pip install -q openai
# !pip install -q tenacity


#from transformers import logging

In [1]:
import os
import sys
import importlib.metadata
import numpy as np
import pandas as pd
import torch
from openai import OpenAI
import logging
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import openai
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
import evaluate



In [16]:
# Mostly for testing if packages are working okay & huggingface utilities
#Load the car reviews dataset
file_path = "data/car_reviews.csv"
df = pd.read_csv(file_path, delimiter=";")

# Put the car reviews and their associated sentiment labels in two lists
reviews = df['Review'].tolist()
real_labels = df['Class'].tolist()


# Instruction 1: sentiment classification

# Load a sentiment analysis LLM into a pipeline
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Perform inference on the car reviews and display prediction results
predicted_labels = classifier(reviews)
for review, prediction, label in zip(reviews, predicted_labels, real_labels):
    print(f"Review: {review}\nActual Sentiment: {label}\nPredicted Sentiment: {prediction['label']} (Confidence: {prediction['score']:.4f})\n")

# Load accuracy and F1 score metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# Map categorical sentiment labels into integer labels
references = [1 if label == "POSITIVE" else 0 for label in real_labels]
predictions = [1 if label['label'] == "POSITIVE" else 0 for label in predicted_labels]

# Calculate accuracy and F1 score
accuracy_result_dict = accuracy.compute(references=references, predictions=predictions)
accuracy_result = accuracy_result_dict['accuracy']

# Fix for f1 calculation - handle the result safely whether it's a float or array
from sklearn.metrics import f1_score
f1_result = f1_score(references, predictions)
print(f"Accuracy: {accuracy_result}")
print(f"F1 result: {f1_result}")


# Instruction 2: Translation

# Load translation LLM into a pipeline and translate car review
first_review = reviews[0]
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
translated_review = translator(first_review, max_length=27)[0]['translation_text']
print(f"Model translation:\n{translated_review}")

# Load reference translations from file
with open("data/reference_translations.txt", 'r') as file:
    lines = file.readlines()
references = [line.strip() for line in lines]
print(f"Spanish translation references:\n{references}")

# Load and calculate BLEU score metric
bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=[translated_review], references=[references])
print(bleu_score['bleu'])


# Instruction 3: extractive QA
# Instantiate model and tokenizer
model_ckp = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckp)
model = AutoModelForQuestionAnswering.from_pretrained(model_ckp)

# Define context and question, and tokenize them
context = reviews[1]
print(f"Context:\n{context}")
question = "What did he like about the brand?"
inputs = tokenizer(question, context, return_tensors="pt")

# Perform inference and extract answer from raw outputs
with torch.no_grad():
  outputs = model(**inputs)
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits) + 1
answer_span = inputs["input_ids"][0][start_idx:end_idx]

# Decode and show answer
answer = tokenizer.decode(answer_span)
print("Answer: ", answer)


# Instruction 4: summarization

# Get the car review to summarize
text_to_summarize = reviews[-1]
print(f"Original text:\n{text_to_summarize}")

# Load summarization pipeline and perform inference
model_name = "cnicu/t5-small-booksum"
summarizer = pipeline("summarization", model=model_name)
outputs = summarizer(text_to_summarize, max_length=53)
summarized_text = outputs[0]['summary_text']
print(f"Summarized text:\n{summarized_text}")


Device set to use cuda:0


Review: I am very satisfied with my 2014 Nissan NV SL. I use this van for my business deliveries and personal use. Camping, road trips, etc. We dont have any children so I store most of the seats in my warehouse. I wanted the passenger van for the rear air conditioning. We drove our van from Florida to California for a Cross Country trip in 2014. We averaged about 18 mpg. We drove thru a lot of rain and It was a very comfortable and stable vehicle. The V8 Nissan Titan engine is a 500k mile engine. It has been tested many times by delivery and trucking companies. This is why Nissan gives you a 5 year or 100k mile bumper to bumper warranty. Many people are scared about driving this van because of its size. But with front and rear sonar sensors, large mirrors and the back up camera. It is easy to drive. The front and rear sensors also monitor the front and rear sides of the bumpers making it easier to park close to objects. Our Nissan NV is a Tow Monster. It pulls our 5000 pound travel tr

Device set to use cuda:0
Your input_length: 365 is bigger than 0.9 * max_length: 27. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


Model translation:
Estoy muy satisfecho con mi 2014 Nissan NV SL. Uso esta furgoneta para mis entregas de negocios y uso personal.
Spanish translation references:
['Estoy muy satisfecho con mi Nissan NV SL 2014. Utilizo esta camioneta para mis entregas comerciales y uso personal.', 'Estoy muy satisfecho con mi Nissan NV SL 2014. Uso esta furgoneta para mis entregas comerciales y uso personal.']




0.6022774485691839


Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Context:
The car is fine. It's a bit loud and not very powerful. On one hand, compared to its peers, the interior is well-built. The transmission failed a few years ago, and the dealer replaced it under warranty with no issues. Now, about 60k miles later, the transmission is failing again. It sounds like a truck, and the issues are well-documented. The dealer tells me it is normal, refusing to do anything to resolve the issue. After owning the car for 4 years, there are many other vehicles I would purchase over this one. Initially, I really liked what the brand is about: ride quality, reliability, etc. But I will not purchase another one. Despite these concerns, I must say, the level of comfort in the car has always been satisfactory, but not worth the rest of issues found.
Answer:  ride quality, reliability
Original text:
I've been dreaming of owning an SUV for quite a while, but I've been driving cars that were already paid for during an extended period. I ultimately made the decisio

Device set to use cuda:0


Summarized text:
the Nissan Rogue provides me with the desired SUV experience without burdening me with an exorbitant payment; the financial arrangement is quite reasonable. I have hauled 12 bags of mulch in the back with the seats down and could have held more.


In [3]:
# Check Python version
print(f"Python version: {sys.version}")

# List of packages to check (use distribution names)
packages = [
    "numpy",
    "pandas",
    "openai",
    "tenacity",
    "transformers"
]

# Check package versions
print("\nPackage versions:")
for package_name in packages:
    try:
        version = importlib.metadata.version(package_name)
        print(f"- {package_name}: {version}")
    except importlib.metadata.PackageNotFoundError:
        print(f"- {package_name}: Not found (or version not available via importlib.metadata)")

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]

Package versions:
- numpy: 2.2.4
- pandas: 2.2.3
- openai: 1.72.0
- tenacity: 9.1.2
- transformers: 4.51.2


In [4]:
# load env variable
token = os.getenv("HF_TOKEN")

In [None]:
client = OpenAI(
	base_url="https://router.huggingface.co/hf-inference/models/Qwen/Qwen2.5-1.5B-Instruct/v1",
	api_key=token
)


def run_chatgpt(prompt, client, model="Qwen/Qwen2.5-1.5B-Instruct"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        seed=123,
    )
    return response.choices[0].message.content


prompt = "Respond with 'hello world' if you got this message."
run_chatgpt(prompt, client)

'hello world'

In [6]:
# Set up logging (the retry decorator below calls logger.warning before each retry)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


@retry(
    retry=retry_if_exception_type((openai.RateLimitError, openai.APIConnectionError, openai.APITimeoutError)),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    before_sleep=lambda retry_state: logger.warning(
        f"Retry attempt {retry_state.attempt_number} after error: {retry_state.outcome.exception()}"
    )
)
def run_hfgpt(system_prompt, user_prompt, client, model="Qwen/Qwen2.5-1.5B-Instruct"):
    """
    Run a query against a Hugging Face model with retry logic and error handling.

    Args:
        system_prompt: The system instruction for the model
        user_prompt: The user query to process
        client: The OpenAI-compatible client
        model: Model identifier to use

    Returns:
        The model's response text
    """
    # Fall back to a default client only when the caller did not pass one in
    if client is None:
        client = OpenAI(
            base_url="https://router.huggingface.co/hf-inference/models/Qwen/Qwen2.5-1.5B-Instruct/v1",
            api_key=token
        )
    try:
        print(f"Sending request to model: {model}")
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.0,
            seed=123,
        )
        return response.choices[0].message.content
    except openai.RateLimitError as e:
        print(f"Rate limit exceeded: {e}")
        raise
    except openai.APIConnectionError as e:
        print(f"API connection error: {e}")
        raise
    except openai.APITimeoutError as e:
        print(f"API timeout error: {e}")
        raise
    except openai.BadRequestError as e:
        print(f"Bad request error: {e}")
        # Don't retry bad requests as they're likely client errors
        raise
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise

system_prompt = "You are a helpful, harmless and honest assistant" # constitutional ai prompt
user_prompt = "Respond with 'hello world' if you got this message."

result = run_hfgpt(system_prompt, user_prompt, client)
print(result)

Sending request to model: Qwen/Qwen2.5-1.5B-Instruct
hello world


The model API is working, and we now have a function that gathers the API best practices (retries, backoff, and error handling) in one place so we can do the subsequent tasks.
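To make the retry idea concrete, here is a minimal standard-library sketch of retrying with exponential backoff. It is not how `tenacity` is implemented, and the exact wait schedule `tenacity` produces may differ. The names `retry_with_backoff` and `flaky` are hypothetical, and `ConnectionError`/`TimeoutError` stand in for the `openai` exception types used above.

```python
import time


def retry_with_backoff(func, max_attempts=5, min_wait=2.0, max_wait=60.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on transient errors with a doubling, clamped delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            # Give up and re-raise once the attempt budget is spent
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt and is clamped to [min_wait, max_wait]
            delay = max(min_wait, min(max_wait, 2.0 ** (attempt - 1)))
            sleep(delay)


# Hypothetical flaky call: fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "hello world"


waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep=waits.append` keeps the example instant while still showing which delays would have been used. Errors that are not listed in `retry_on` (the equivalent of a bad request) propagate immediately without a retry.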

## Load the Data

We will load the data with pandas and craft a prompt to analyze the car reviews.

In [None]:
!head -n 30 data/car_reviews.csv

In [None]:
!head data/reference_translations.txt

## Loading data up

In [7]:
car_reviews = pd.read_csv("data/car_reviews.csv", delimiter=";")

In [8]:
def read_file(filepath, encoding='utf-8', chunk_size=None):
    """
    Efficiently read file content with proper resource management and optional chunking.

    Args:
        filepath: Path to the file
        encoding: File encoding (default: utf-8)
        chunk_size: Size of chunks to read (None reads entire file at once)

    Returns:
        File content as string
    """
    try:
        with open(filepath, 'r', encoding=encoding) as file:
            if chunk_size:
                # Process large files in chunks to reduce memory usage
                content = []
                while chunk := file.read(chunk_size):
                    content.append(chunk)
                return ''.join(content)
            else:
                # Read entire file at once for small files
                return file.read()
    except FileNotFoundError:
        print(f"Error: File '{filepath}' not found")
        return None
    except PermissionError:
        print(f"Error: No permission to read '{filepath}'")
        return None
    except UnicodeDecodeError:
        print(f"Error: File '{filepath}' has encoding issues. Try different encoding.")
        return None
    except Exception as e:
        print(f"Error reading file '{filepath}': {e}")
        return None

# Example usage
filepath = 'data/reference_translations.txt'
file_content = read_file(filepath)

if file_content is not None:
    print(file_content)
    print(f"The data type of this is: {type(file_content)}")

Estoy muy satisfecho con mi Nissan NV SL 2014. Utilizo esta camioneta para mis entregas comerciales y uso personal.
Estoy muy satisfecho con mi Nissan NV SL 2014. Uso esta furgoneta para mis entregas comerciales y uso personal.
The data type of this is: <class 'str'>


In [9]:
car_reviews.head()

Unnamed: 0,Review,Class
0,I am very satisfied with my 2014 Nissan NV SL....,POSITIVE
1,The car is fine. It's a bit loud and not very ...,NEGATIVE
2,"My first foreign car. Love it, I would buy ano...",POSITIVE
3,I've come across numerous reviews praising the...,NEGATIVE
4,I've been dreaming of owning an SUV for quite ...,POSITIVE


In [10]:
car_reviews.describe()

Unnamed: 0,Review,Class
count,5,5
unique,5,2
top,I am very satisfied with my 2014 Nissan NV SL....,POSITIVE
freq,1,3


In [11]:
# Select the review column and the real labels
reviews = car_reviews['Review'].tolist()
real_labels = car_reviews['Class'].tolist()

In [12]:
print(reviews[:2],"#"* 100, real_labels[0:2])

['I am very satisfied with my 2014 Nissan NV SL. I use this van for my business deliveries and personal use. Camping, road trips, etc. We dont have any children so I store most of the seats in my warehouse. I wanted the passenger van for the rear air conditioning. We drove our van from Florida to California for a Cross Country trip in 2014. We averaged about 18 mpg. We drove thru a lot of rain and It was a very comfortable and stable vehicle. The V8 Nissan Titan engine is a 500k mile engine. It has been tested many times by delivery and trucking companies. This is why Nissan gives you a 5 year or 100k mile bumper to bumper warranty. Many people are scared about driving this van because of its size. But with front and rear sonar sensors, large mirrors and the back up camera. It is easy to drive. The front and rear sensors also monitor the front and rear sides of the bumpers making it easier to park close to objects. Our Nissan NV is a Tow Monster. It pulls our 5000 pound travel trailer 

## Sentiment classification with an off-the-shelf LLM and BERT

One issue you may encounter is using up all your API credits on this task. The second method is more reliable, since you bring the model into your own workspace. It also helps to read the model page, for example https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct, and then check which inference providers serve it. Ollama could be another good alternative. I recommend running models between 0.5B and 7B parameters locally if you have at least 8 GB of VRAM. Alternatively, you could adapt the API code above to use OpenAI models instead. Do what you are comfortable with.
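As a rough guide to which model sizes fit in local memory, the weights alone take about (number of parameters) x (bytes per parameter). The sketch below is back-of-the-envelope arithmetic only: it ignores activations, the KV cache, and framework overhead, and `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in [("0.5B", 0.5e9), ("1.5B", 1.5e9), ("7B", 7e9)]:
    fp16 = weight_memory_gb(params, 2)    # 16-bit floats: 2 bytes per parameter
    int4 = weight_memory_gb(params, 0.5)  # 4-bit quantization: half a byte per parameter
    print(f"{name}: ~{fp16:.1f} GB in fp16, ~{int4:.2f} GB at 4-bit")
```

By this estimate a 7B model needs about 14 GB for its weights in 16-bit precision, so on an 8 GB card it would most likely have to be quantized, while a 1.5B model fits comfortably.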

In [13]:
def format_input(text):
    """Format the input text for sentiment analysis"""
    return f"Analyze the sentiment of this text: '{text}'"

# Define system prompt for sentiment analysis
system_prompt = """You are a sentiment analysis model. Follow these rules strictly:
1. If the review expresses positive feelings, satisfaction, praise, or approval, respond with: POSITIVE
2. If the review expresses negative feelings, complaints, criticism, or disapproval, respond with: NEGATIVE
3. Include a confidence score between 0 and 1 in parentheses after your sentiment label, e.g. "POSITIVE (0.95)"
4. Base your analysis on the emotional tone and content of the text"""

# Iterate through reviews and get sentiment scores
if "Review" in car_reviews.columns:
    # Create the sentiment columns if they don't exist
    if 'predicted_labels' not in car_reviews.columns:
        car_reviews['predicted_labels'] = None
    if 'confidence' not in car_reviews.columns:
        car_reviews['confidence'] = None

    for index, review_text in car_reviews["Review"].items():
        user_prompt = format_input(review_text)
        print("#" * 100)
        print(f"Review: {review_text}")
        try:
            result = run_hfgpt(system_prompt, user_prompt, client)

            # Parse the result to extract sentiment and confidence
            # Expected format: "POSITIVE (0.9294)" or "NEGATIVE (0.8654)"
            parts = result.strip().split('(')
            sentiment = parts[0].strip()
            confidence = float(parts[1].strip(')')) if len(parts) > 1 else None

            print(f"Review {index}: {sentiment} (Confidence: {confidence})")

            # Store sentiment and confidence separately
            car_reviews.at[index, 'predicted_labels'] = sentiment
            car_reviews.at[index, 'confidence'] = confidence

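            # Print formatted output (the ground-truth label lives in the 'Class' column)
            if 'Class' in car_reviews.columns:
                actual = car_reviews.at[index, 'Class']
                print(f"Actual Sentiment: {actual}")
            # Avoid a formatting error when the model returned no confidence score
            confidence_text = f"{confidence:.4f}" if confidence is not None else "n/a"
            print(f"Predicted Sentiment: {sentiment} (Confidence: {confidence_text})")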
            print("#" * 100)

        except Exception as e:
            print(f"Error processing review {index}: {e}")
            car_reviews.at[index, 'predicted_labels'] = None
            car_reviews.at[index, 'confidence'] = None
else:
    print(f"Column 'Review' not found")

# Function to display formatted results (optional)
def display_sentiment_analysis(df):
    for index, row in df.iterrows():
        if 'Review' in row and 'predicted_labels' in row and 'confidence' in row:
            print(f"Review: {row['Review']}")
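            if 'Class' in row:
                print(f"Actual Sentiment: {row['Class']}")
            # Missing confidence values cannot be formatted as floats
            confidence_text = f"{row['confidence']:.4f}" if pd.notna(row['confidence']) else "n/a"
            print(f"Predicted Sentiment: {row['predicted_labels']} (Confidence: {confidence_text})")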
            print()

# To display the results in the desired format
display_sentiment_analysis(car_reviews)

####################################################################################################
Review: I am very satisfied with my 2014 Nissan NV SL. I use this van for my business deliveries and personal use. Camping, road trips, etc. We dont have any children so I store most of the seats in my warehouse. I wanted the passenger van for the rear air conditioning. We drove our van from Florida to California for a Cross Country trip in 2014. We averaged about 18 mpg. We drove thru a lot of rain and It was a very comfortable and stable vehicle. The V8 Nissan Titan engine is a 500k mile engine. It has been tested many times by delivery and trucking companies. This is why Nissan gives you a 5 year or 100k mile bumper to bumper warranty. Many people are scared about driving this van because of its size. But with front and rear sonar sensors, large mirrors and the back up camera. It is easy to drive. The front and rear sensors also monitor the front and rear sides of the bumpers making 
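
Chat models do not always stick to the requested `LABEL (score)` format, so splitting on `(` can fail or pick up extra text. Below is a minimal sketch of a more forgiving parser. The function name `parse_sentiment` and the regular expression are assumptions for illustration, not part of the code above, and the parser does not check that the score lies between 0 and 1.

```python
import re

# Label, optionally followed by a number in parentheses, e.g. "POSITIVE (0.95)"
LABEL_PATTERN = re.compile(
    r"\b(POSITIVE|NEGATIVE)\b(?:\s*\(\s*(\d+(?:\.\d+)?)\s*\))?",
    re.IGNORECASE,
)


def parse_sentiment(reply):
    """Return (label, confidence); either value is None when it cannot be found."""
    match = LABEL_PATTERN.search(reply)
    if match is None:
        return None, None
    label = match.group(1).upper()
    confidence = float(match.group(2)) if match.group(2) else None
    return label, confidence


print(parse_sentiment("POSITIVE (0.95)"))
print(parse_sentiment("NEGATIVE (0.8654) because the transmission failed."))
print(parse_sentiment("The sentiment is negative."))
print(parse_sentiment("I cannot tell."))
```

Returning `None` for a missing confidence, rather than raising, lets the calling loop keep the label even when the model omits the score.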

In [14]:
# Load a sentiment analysis LLM into a pipeline
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Perform inference on the car reviews and display prediction results
predicted_labels = classifier(reviews)
for review, prediction, label in zip(reviews, predicted_labels, real_labels):
    print(f"Review: {review}\nActual Sentiment: {label}\nPredicted Sentiment: {prediction['label']} (Confidence: {prediction['score']:.4f})\n")


Device set to use cuda:0


Review: I am very satisfied with my 2014 Nissan NV SL. I use this van for my business deliveries and personal use. Camping, road trips, etc. We dont have any children so I store most of the seats in my warehouse. I wanted the passenger van for the rear air conditioning. We drove our van from Florida to California for a Cross Country trip in 2014. We averaged about 18 mpg. We drove thru a lot of rain and It was a very comfortable and stable vehicle. The V8 Nissan Titan engine is a 500k mile engine. It has been tested many times by delivery and trucking companies. This is why Nissan gives you a 5 year or 100k mile bumper to bumper warranty. Many people are scared about driving this van because of its size. But with front and rear sonar sensors, large mirrors and the back up camera. It is easy to drive. The front and rear sensors also monitor the front and rear sides of the bumpers making it easier to park close to objects. Our Nissan NV is a Tow Monster. It pulls our 5000 pound travel tr

It seems that the now-popular causal language models give the same kind of sentiment scores as the BERT model, if somewhat more conservative ones. BERT is an encoder model and is good at natural language understanding: it compresses the input into a numerical representation, from which a score is produced. The decoder, which excels at natural language generation, seems to get there by being aligned to a system prompt.

This matters when doing LLM evaluations, because we can then track the results in a database or a CSV file. A CSV file is probably the better choice to keep things lightweight, but for school or work I think a database would be better. Let's see if the LLM did the right thing.
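
For the lightweight CSV option, a minimal sketch with the standard `csv` module could look like this. The column names and the `write_eval_log` helper are hypothetical, and the predicted labels and confidence values in `rows` are made-up placeholders rather than results from this notebook.

```python
import csv
import io


def write_eval_log(rows, handle):
    """Write one row per (review, model) prediction to an open text handle."""
    fieldnames = ["review_id", "model", "actual", "predicted", "confidence"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder rows: predictions and confidences are illustrative only
rows = [
    {"review_id": 0, "model": "Qwen/Qwen2.5-1.5B-Instruct",
     "actual": "POSITIVE", "predicted": "POSITIVE", "confidence": 0.9},
    {"review_id": 1, "model": "Qwen/Qwen2.5-1.5B-Instruct",
     "actual": "NEGATIVE", "predicted": "NEGATIVE", "confidence": 0.8},
]

buffer = io.StringIO()  # swap for open("eval_log.csv", "w", newline="") to write a file
write_eval_log(rows, buffer)
print(buffer.getvalue())
```

Keeping the model name in every row makes it easy to compare several models on the same reviews later.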

In [17]:
real_labels

['POSITIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE', 'POSITIVE']

In [18]:
# Load accuracy and F1 score metrics
import evaluate
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# Map categorical sentiment labels into integer labels
references = [1 if label == "POSITIVE" else 0 for label in real_labels]
predictions = [1 if label['label'] == "POSITIVE" else 0 for label in predicted_labels]

# Calculate accuracy and F1 score
accuracy_result_dict = accuracy.compute(references=references, predictions=predictions)
accuracy_result = accuracy_result_dict['accuracy']

# Fix for f1 calculation - handle the result safely whether it's a float or array
from sklearn.metrics import f1_score
f1_result = f1_score(references, predictions)
print(f"Accuracy: {accuracy_result}")
print(f"F1 result: {f1_result}")


Accuracy: 0.8
F1 result: 0.8571428571428571


In [None]:
car_reviews

In [19]:
def calculate_metrics(df):
    """
    Calculate F1-score and accuracy using NumPy operations.

    Args:
        df: DataFrame with 'Class' and 'predicted_labels' columns

    Returns:
        DataFrame with added metric columns
    """
    # Convert the labels to binary (0/1) using the DataFrame passed in, not the global
    true_labels = np.where(df['Class'] == 'POSITIVE', 1, 0)
    predictions = np.where(df['predicted_labels'] == "POSITIVE", 1, 0)

    # Calculate True Positives, False Positives, False Negatives
    tp = np.sum((true_labels == 1) & (predictions == 1))
    fp = np.sum((true_labels == 0) & (predictions == 1))
    fn = np.sum((true_labels == 1) & (predictions == 0))

    # Calculate precision and recall
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0

    # Calculate F1-score
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    # Calculate accuracy
    accuracy = np.mean(true_labels == predictions)

    # Add metrics as new columns
    df['accuracy_result'] = accuracy
    df['f1_result'] = f1

    return df

# Apply the metrics calculation
car_reviews = calculate_metrics(car_reviews)

# Print metrics to verify
print(f"F1-Score: {car_reviews['f1_result'].iloc[0]:.3f}")
print(f"Accuracy: {car_reviews['accuracy_result'].iloc[0]:.3f}")

F1-Score: 1.000
Accuracy: 1.000


I would still trust the output of the previous model more. Here the predicted labels happen to match the true labels exactly, but with only five reviews a perfect score is weak evidence. Comparing predicted labels against true labels in this way is a classic classification evaluation.
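
To see where the DistilBERT numbers reported earlier (accuracy 0.8, F1 about 0.857) come from, here is a small worked example. The reference labels are the five true labels shown above. The prediction vector is an assumption: it is one vector consistent with those scores (a single false positive), and which negative review was actually misclassified is not shown in the output.

```python
def binary_metrics(references, predictions):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = sum(1 for r, p in pairs if r == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


references = [1, 0, 1, 0, 1]   # POSITIVE, NEGATIVE, POSITIVE, NEGATIVE, POSITIVE
predictions = [1, 1, 1, 0, 1]  # assumed: one negative review predicted as positive

metrics = binary_metrics(references, predictions)
print(metrics)
```

With three true positives, one false positive and no false negatives, precision is 3/4 and recall is 1, so F1 = 2 * 0.75 * 1 / 1.75, which is roughly 0.857.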

# Task 2: Translation

Looking good so far: the predicted labels match the original classes. But it is never this easy with real-world data. Now, let's look at a translation task.

In [20]:
# System prompt for the translation task
system_prompt = """You are an expert Spanish-English translator. Provide accurate translations while preserving complete meaning.

Examples:

Input: "El tiempo es oro."
Translation: "Time is gold."
Context: Direct equivalent of "time is money" saying

Input: "Estar entre la espada y la pared"
Translation: "To be between the sword and the wall"
Context: Equivalent to "between a rock and a hard place"

Input: "Cada loco con su tema"
Translation: "Each crazy person with their theme"
Context: Similar to "to each their own"

Instructions:
- Maintain original meaning
- Preserve tone and register
- Keep cultural context
- Include all source information
"""

# Simplified, unigram-only stand-in for BLEU (not the standard n-gram metric)
def calculate_bleu(reference, candidate):
    """
    Calculate a simplified BLEU score between reference and candidate translations.
    Uses unigram precision (word-level matching).

    Args:
        reference: Reference (ground truth) translation string
        candidate: Candidate (model) translation string
    Returns:
        Float between 0-1 representing translation quality
    """
    # Tokenize into words (simple split on spaces)
    ref_words = reference.lower().split()
    cand_words = candidate.lower().split()

    # Count matching words
    matches = sum(1 for word in cand_words if word in ref_words)

    # Calculate precision
    precision = matches / len(cand_words) if len(cand_words) > 0 else 0

    # Length penalty if candidate is too short
    brevity_penalty = min(1.0, len(cand_words) / len(ref_words)) if len(ref_words) > 0 else 0

    # Final BLEU score
    bleu = precision * brevity_penalty
    return bleu

# Use in translation loop
translated_review = []
seen_sentences = set()
bleu_scores = []

with open('data/reference_translations.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    sentences = content.split('.')

for sentence in sentences:
    sentence = sentence.strip()
    if sentence and sentence not in seen_sentences:
        seen_sentences.add(sentence)
        user_prompt = f"Translate this Spanish text to English: {sentence}"
        try:
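            translation = run_hfgpt(system_prompt, user_prompt, client)
            # Caveat: this scores the English output against the Spanish source, so only
            # tokens shared by both languages (names, numbers, cognates) can match
            score = calculate_bleu(sentence, translation)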
            bleu_scores.append(score)
            translated_review.append(translation)

            print(f"Original: {sentence}")
            print(f"Translation: {translation}")
            print(f"BLEU Score: {score:.4f}\n")

        except Exception as e:
            print(f"Error: {e}")
            continue

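# Average the per-sentence scores; guard against an empty list if every request failed
average_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0.0
print(f"Average BLEU score: {average_bleu:.4f}")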

Sending request to model: Qwen/Qwen2.5-1.5B-Instruct
Original: Estoy muy satisfecho con mi Nissan NV SL 2014
Translation: I am very satisfied with my 2014 Nissan NV SL.
BLEU Score: 0.3000

Sending request to model: Qwen/Qwen2.5-1.5B-Instruct
Original: Estoy muy satisfecho con mi Nissan NV SL 2014
Translation: I am very satisfied with my 2014 Nissan NV SL.
BLEU Score: 0.3000

Sending request to model: Qwen/Qwen2.5-1.5B-Instruct
Original: Utilizo esta camioneta para mis entregas comerciales y uso personal
Translation: I use this truck for both my commercial deliveries and personal use.
BLEU Score: 0.0833

Sending request to model: Qwen/Qwen2.5-1.5B-Instruct
Original: Utilizo esta camioneta para mis entregas comerciales y uso personal
Translation: I use this truck for both my commercial deliveries and personal use.
BLEU Score: 0.0833

Sending request to model: Qwen/Qwen2.5-1.5B-Instruct
Original: Uso esta furgoneta para mis entregas comerciales y uso personal
Translation: I use this truck

In the second reference sentence, the model translated "furgoneta" as "truck" instead of "van". When translating with general-purpose models like this one, it helps to read the model card closely and learn how well the language of interest is supported. Let's see what variables we have so far! You can use `%whos` for that, but don't share its output publicly. It is a good way to see what variables you have in your workspace.

Let's try translating with a different model more attuned to translation from the `Helsinki-NLP` group. 

In [22]:
from transformers import pipeline
import gc

# 1. Load the model with memory optimizations for CPU
translator = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-en-es",
    device=-1,  # Explicitly specify CPU
    model_kwargs={"low_cpu_mem_usage": True}  # Reduces memory footprint during loading
)

# 2. Process just the first review with CPU-friendly parameters
first_review = reviews[0]
translated_review = translator(
    first_review,
    max_length=27,
    batch_size=1  # Process one sample at a time
)[0]['translation_text']

print(f"Model translation:\n{translated_review}")

# 3. Clean up
del translator
gc.collect()

Device set to use cpu
Your input_length: 365 is bigger than 0.9 * max_length: 27. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


Model translation:
Estoy muy satisfecho con mi 2014 Nissan NV SL. Uso esta furgoneta para mis entregas de negocios y uso personal.


1820

In [24]:
# Load reference translations from file
with open("data/reference_translations.txt", 'r') as file:
    lines = file.readlines()
references = [line.strip() for line in lines]
print(f"Spanish translation references:\n{references}")

Spanish translation references:
['Estoy muy satisfecho con mi Nissan NV SL 2014. Utilizo esta camioneta para mis entregas comerciales y uso personal.', 'Estoy muy satisfecho con mi Nissan NV SL 2014. Uso esta furgoneta para mis entregas comerciales y uso personal.']
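
With references like these, a translation can be scored with BLEU, which combines clipped n-gram precision with a brevity penalty. The following is a minimal sentence-level sketch using only the standard library. It assumes plain whitespace tokenization and no smoothing, so its scores will not match `evaluate.load("bleu")` exactly; `sentence_bleu` and `ngram_counts` are hypothetical helper names.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with clipped n-gram precision and brevity penalty."""
    cand = candidate.lower().split()
    refs = [ref.lower().split() for ref in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the candidate length
    ref_len = min((len(ref) for ref in refs), key=lambda length: (abs(length - len(cand)), length))
    brevity = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


print(sentence_bleu("uso esta furgoneta para mis entregas",
                    ["uso esta furgoneta para mis entregas"]))
print(sentence_bleu("the the the the", ["the cat sat down"]))
```

Clipping is what stops a candidate from earning credit for repeating a reference word many times, which the unclipped word-matching shortcut would reward.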


The translations were not so bad, even though the scores were under 0.5. Now let's see what we can learn from an extractive QA LLM about the second review.

In [28]:
translated_review

'Estoy muy satisfecho con mi 2014 Nissan NV SL. Uso esta furgoneta para mis entregas de negocios y uso personal.'

In [None]:
# Model card: https://huggingface.co/deepset/minilm-uncased-squad2
# thanks for the documentation let's go
model_name = "deepset/minilm-uncased-squad2"

# Question and context: the second review (index 1 in the reviews list)
question = "What did he like about the brand?"
context = reviews[1]

# a) Get predictions with the high-level pipeline
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': question,
    'context': context
}
res = nlp(QA_input)

# b) Load model & tokenizer for manual inference
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the question/context pair into model inputs
inputs = tokenizer(question, context, return_tensors="pt")

# Perform inference and extract answer from raw outputs
with torch.no_grad():
    outputs = model(**inputs)
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits) + 1  # +1 so the slice includes the end token
answer_span = inputs["input_ids"][0][start_idx:end_idx]

# Decode and show answer
answer = tokenizer.decode(answer_span)
print("Answer: ", answer)

Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

Answer:  ride quality, reliability
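The manual extraction above takes the argmax of the start and end logits independently, which can in principle select an end position that comes before the start. A minimal sketch of a joint search over valid spans follows, using toy logits in plain Python lists. The function name `best_span`, its `max_answer_len` parameter, and the logit values are all assumptions made for this illustration, not the library's internals.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return ((start, end), score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    max_answer_len tokens. The score is start logit + end logit.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best, best_score


# Toy logits: the independent argmaxes disagree (start=2, end=1)
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1]
end_logits = [0.2, 2.5, 0.1, 2.0, 0.3]

naive_start = start_logits.index(max(start_logits))
naive_end = end_logits.index(max(end_logits))
print("Independent argmax:", (naive_start, naive_end))

span, score = best_span(start_logits, end_logits)
print("Joint search:", span, "score:", score)
```

With real model outputs you would convert `outputs.start_logits[0]` and `outputs.end_logits[0]` to lists first, then slice `input_ids` with `span[0]` and `span[1] + 1`.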


## Task: Summarize text

In [31]:
# Use the last car review as the text to summarize
text_to_summarize = reviews[-1]
print(f"Original text:\n{text_to_summarize}")

# Load summarization pipeline and perform inference
model_name = "cnicu/t5-small-booksum"
summarizer = pipeline("summarization", model=model_name)
outputs = summarizer(text_to_summarize, max_length=53)
summarized_text = outputs[0]['summary_text']
print(f"Summarized text:\n{summarized_text}")


Original text:
I've been dreaming of owning an SUV for quite a while, but I've been driving cars that were already paid for during an extended period. I ultimately made the decision to transition to a brand-new car, which, of course, involved taking on new payments. However, given that I don't drive extensively, I was inclined to avoid a substantial financial commitment. The Nissan Rogue provides me with the desired SUV experience without burdening me with an exorbitant payment; the financial arrangement is quite reasonable. Handling and styling are great; I have hauled 12 bags of mulch in the back with the seats down and could have held more. I am VERY satisfied overall. I find myself needing to exercise extra caution when making lane changes, particularly owing to the blind spots resulting from the small side windows situated towards the rear of the vehicle. To address this concern, I am actively engaged in making adjustments to my mirrors and consciously reducing the frequency of la

Device set to use cuda:0


Summarized text:
the Nissan Rogue provides me with the desired SUV experience without burdening me with an exorbitant payment; the financial arrangement is quite reasonable. I have hauled 12 bags of mulch in the back with the seats down and could have held more.


In [34]:
# Using a model I found on the Hugging Face Hub,
# ranked the highest for summarization
# Use the last car review as the text to summarize
text_to_summarize = reviews[-1]
print(f"Original text:\n{text_to_summarize}")

# Load summarization pipeline and perform inference
model_name = "facebook/bart-large-cnn"
summarizer = pipeline("summarization", model=model_name)
outputs = summarizer(text_to_summarize, max_length=65)
summarized_text = outputs[0]['summary_text']
print(f"Summarized text:\n{summarized_text}")

Original text:
I've been dreaming of owning an SUV for quite a while, but I've been driving cars that were already paid for during an extended period. I ultimately made the decision to transition to a brand-new car, which, of course, involved taking on new payments. However, given that I don't drive extensively, I was inclined to avoid a substantial financial commitment. The Nissan Rogue provides me with the desired SUV experience without burdening me with an exorbitant payment; the financial arrangement is quite reasonable. Handling and styling are great; I have hauled 12 bags of mulch in the back with the seats down and could have held more. I am VERY satisfied overall. I find myself needing to exercise extra caution when making lane changes, particularly owing to the blind spots resulting from the small side windows situated towards the rear of the vehicle. To address this concern, I am actively engaged in making adjustments to my mirrors and consciously reducing the frequency of la

Device set to use cuda:0


Summarized text:
The Nissan Rogue provides me with the desired SUV experience without burdening me with an exorbitant payment. Handling and styling are great; I have hauled 12 bags of mulch in the back with the seats down and could have held more. The engine delivers strong performance, and the ride is really smooth.
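Both summaries above reuse phrases from the review largely verbatim. As a quick, rough check of how much a summary copies from its source, the sketch below counts the share of summary sentences that appear word for word in the original text. The helper `extractive_fraction` is a hypothetical name, and the strings are shortened toy examples rather than the full review and model outputs.

```python
import re


def extractive_fraction(summary, source):
    """Fraction of summary sentences found verbatim in the source.

    Matching is case-insensitive and ignores trailing sentence punctuation.
    Sentence splitting is a naive regex on '.', '!' and '?'.
    """
    source_lower = source.lower()
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()
    ]
    if not sentences:
        return 0.0
    copied = sum(
        1 for s in sentences if s.lower().rstrip(".!?") in source_lower
    )
    return copied / len(sentences)


# Shortened toy strings for illustration only
source = (
    "The Nissan Rogue provides me with the desired SUV experience. "
    "Handling and styling are great. I am VERY satisfied overall."
)
summary = (
    "the Nissan Rogue provides me with the desired SUV experience. "
    "The ride is smooth."
)

print(f"Copied sentences: {extractive_fraction(summary, source):.0%}")
```

To apply it to the notebook's own data, pass `summarized_text` and `text_to_summarize`; a value near 1.0 suggests the model is mostly selecting sentences rather than rephrasing them.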
