# 🚀 AI Travel Planner - LLM Inference Notebook

This notebook demonstrates how to load our fine-tuned Llama-based LLM from Hugging Face and use it to generate responses to travel-related queries. Gradio provides the UI, so the model can be tried through a user-friendly interface, and launching the app with `share=True` creates a temporary public URL.

**Note**: This approach is implemented as a workaround based on the TA's suggestion to address the issue of `AssertionError: Torch not compiled with CUDA enabled` encountered in HuggingFace Spaces due to the lack of GPU support. For more details, refer to the [discussion on Canvas](https://canvas.kth.se/courses/50172/discussion_topics/432284).
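
The error quoted in the note is what PyTorch raises when code requests a CUDA device on a CPU-only build. As a minimal sketch, a guard like the one below picks the device at runtime instead of hard-coding `"cuda"`. The helper name `pick_device` is hypothetical and not part of the original notebook.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch is usable, else "cpu"."""
    # If PyTorch is not installed at all, fall back to CPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

With such a helper, cells that call `.to("cuda")` could call `.to(device)` instead. This only avoids the assertion; 4-bit loading may still require a GPU, which is why the notebook runs on a GPU runtime.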

In [3]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
# !pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [4]:
!pip install -U bitsandbytes



In [1]:
!pip install gradio



In [12]:
import torch
from unsloth import FastLanguageModel

max_seq_length = 2048  # maximum context length in tokens
dtype = None           # None lets Unsloth auto-detect the dtype

model_name_or_path = "Eugenius0/lora_model_tuned"

# Load the fine-tuned model and its tokenizer in 4-bit to reduce GPU memory use
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name_or_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = True,
)

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [9]:
from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
#       model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
#       max_seq_length = max_seq_length,
#       dtype = dtype,
#       load_in_4bit = load_in_4bit,
#   )
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "In which state is Freiburg im Breisgau and name its most famous sight?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Freiburg im Breisgau is in Germany, and its most famous sight is the Freiburger Münster, a medieval Gothic-style cathedral with a distinctive south tower.<|eot_id|>
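
The `generate` calls in this notebook pass `temperature = 1.5` and `min_p = 0.1`. The sketch below illustrates, in plain Python, what these two settings do to a next-token distribution: temperature rescales the scores, and min-p drops tokens whose probability is below `min_p` times the probability of the most likely token. The function name `min_p_filter` and the toy logits are made up for illustration; this is not the `transformers` implementation.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    # Temperature > 1 flattens the distribution; temperature < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep tokens whose probability is at least min_p times the top probability.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


toy_logits = [5.0, 4.0, 1.0, -2.0]  # made-up scores for four candidate tokens
print(min_p_filter(toy_logits))
```

With these toy values the two least likely tokens are removed and sampling happens between the remaining two. A high temperature combined with min-p therefore adds variety while still cutting off very unlikely tokens.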


In [16]:
# Measure Performance based on Human Judgement
from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
#       model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
#       max_seq_length = max_seq_length,
#       dtype = dtype,
#       load_in_4bit = load_in_4bit,
#   )
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "How high is the Feldberg in the Black Forest?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The Feldberg, which is a mountain in the Black Forest (Schwarzwald) in Germany, has an elevation of 1,493 meters (4,893 feet).<|eot_id|>


In [19]:
# Measure Performance
import time
from transformers import TextIteratorStreamer

# Example test queries
test_queries = [
    {
        "query": "In which state is Freiburg im Breisgau and name its most famous sight?",
    },
    {
        "query": "What are the main attractions in Paris?",
    },
]

performance_results = []

for test in test_queries:
    query = test["query"]

    messages = [{"role": "user", "content": query}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    # Measure inference time
    start_time = time.time()
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10.0
    )

    # Start generation in a separate thread
    import threading
    generation_thread = threading.Thread(
        target=model.generate,
        kwargs={
            "input_ids": inputs,
            "streamer": streamer,
            "max_new_tokens": 128,
            "use_cache": True,
            "temperature": 1.5,
            "min_p": 0.1,
        },
    )
    generation_thread.start()

    # Collect response as it streams
    response = ""
    for token in streamer:
        response += token

    # Make sure the generation thread has finished before stopping the clock
    generation_thread.join()
    end_time = time.time()

    # Log results
    performance_results.append({
        "Query": query,
        "Generated Output": response,
        "Inference Time (s)": round(end_time - start_time, 4),
    })

# Display results
import pandas as pd
pd.set_option("display.max_colwidth", None)
results_df = pd.DataFrame(performance_results)
print(results_df)

                                                                    Query  \
0  In which state is Freiburg im Breisgau and name its most famous sight?   
1                                 What are the main attractions in Paris?   

   Generated Output  \
0  [output truncated in the notebook export]
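
The performance cell above records one wall-clock number per query. As a rough sketch, the same streaming loop can also report the time until the first chunk arrives, which is often what a user perceives as responsiveness. The names `time_stream` and `fake_streamer` are hypothetical; `fake_streamer` is a stand-in for `TextIteratorStreamer` so that the example runs without a model.

```python
import time


def time_stream(chunks):
    """Consume text chunks; return (text, seconds to first chunk, total seconds)."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), first_chunk_at, total


def fake_streamer():
    """Stand-in for TextIteratorStreamer: yields chunks with a small delay."""
    for chunk in ["Freiburg ", "is ", "in ", "Baden-Württemberg."]:
        time.sleep(0.01)
        yield chunk


text, first_latency, total_latency = time_stream(fake_streamer())
print(text, round(first_latency, 4), round(total_latency, 4))
```

In the notebook, the `streamer` object created in the loop could be passed to `time_stream` in place of `fake_streamer()`, after the generation thread has been started.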

## BLEU Scores
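
BLEU compares a generated sentence with a reference by counting overlapping n-grams. The sketch below is a simplified, unsmoothed, single-reference version written for illustration. It is not the NLTK implementation used in the cells that follow, and the name `simple_bleu` is hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "freiburg liegt in badenwürttemberg".split()
print(simple_bleu(reference, reference))
print(simple_bleu(reference, "freiburg ist in badenwürttemberg".split()))
```

The second call scores zero because a single changed word in a four-word sentence leaves no matching trigram. Short sentences hit this case often, which is one reason to apply a smoothing function when scoring individual sentences.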

In [10]:
!pip install nltk



In [11]:
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from transformers import TextIteratorStreamer
import pandas as pd
import time
import threading

# Function to normalize text
def normalize_text(text):
    """Lowercase and remove punctuation from a given text."""
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

# Example test queries and references
test_queries = [
    {"query": "Translate this from English into German: Freiburg is in Baden-Württemberg and its most famous sight is the Freiburger Münster.",
     "reference": "Freiburg liegt in Baden-Württemberg und seine berühmteste Sehenswürdigkeit ist das Freiburger Münster."},
    {"query": "Translate this from English into French: The main attractions in Paris include the Eiffel Tower, the Louvre, and Notre-Dame Cathedral.",
     "reference": "Les principales attractions de Paris incluent la Tour Eiffel, le Louvre et la Cathédrale Notre-Dame."},
]

performance_results = []

# Smoothing function for BLEU score
smoothing_function = SmoothingFunction().method1

for test in test_queries:
    query = test["query"]
    reference = test["reference"]

    # Prepare input
    messages = [{"role": "user", "content": query}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    # Generate response
    response = ""
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10.0
    )

    generation_thread = threading.Thread(
        target=model.generate,
        kwargs={
            "input_ids": inputs,
            "streamer": streamer,
            "max_new_tokens": 128,
            "use_cache": True,
            "temperature": 1.5,
            "min_p": 0.1,
        },
    )
    generation_thread.start()

    for token in streamer:
        response += token
    generation_thread.join()  # wait for the generation thread to finish cleanly

    # Normalize the generated output and reference
    normalized_response = normalize_text(response)
    normalized_reference = normalize_text(reference)

    # Compute BLEU score
    bleu_score = sentence_bleu([normalized_reference.split()], normalized_response.split(), smoothing_function=smoothing_function)

    # Log results
    performance_results.append({
        "Query": query,
        "Generated Output": response.strip(),
        "BLEU Score": round(bleu_score, 4),
    })

# Display results
pd.set_option("display.max_colwidth", None)
results_df = pd.DataFrame(performance_results)
print(results_df)

                                                                                                                                    Query  \
0          Translate this from English into German: Freiburg is in Baden-Württemberg and its most famous sight is the Freiburger Münster.   
1  Translate this from English into French: The main attractions in Paris include the Eiffel Tower, the Louvre, and Notre-Dame Cathedral.   

   Generated Output  \
0  [output truncated in the notebook export]

In [13]:
import gradio as gr
from unsloth import FastLanguageModel
from transformers import TextIteratorStreamer
import torch
import threading

# Load the fine-tuned model and tokenizer
model_name_or_path = "Eugenius0/lora_model_tuned"
max_seq_length = 2048
dtype = None

# Detect and set the appropriate device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name_or_path,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Define the travel planner response generation logic
def generate_travel_plan(city, preferences, nb_days):
    try:
        # gr.Number may pass a float, so cast to int to avoid "2.0 days" in the prompt
        prompt = (
            f"Create a travel plan to visit {city} during {int(nb_days)} days, focusing on {preferences}. Include suggested activities, "
            f"landmarks to visit, and any local tips."
        )
        messages = [{"role": "user", "content": prompt}]
        inputs = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to(device)  # Use the detected device

        # Generate the response in a single step
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=1024,
            use_cache=True,
            temperature=1.2,
            repetition_penalty=1.1,  # Avoid repetitive loops
            min_p=0.1,
        )
        # Decode only the newly generated tokens so the prompt is not echoed back
        response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
        return response

    except Exception as e:
        error_message = f"Error during response generation: {e}"
        print(error_message)
        return error_message

# Simplified Gradio UI for Travel Planner
interface = gr.Interface(
    fn=generate_travel_plan,
    inputs=[
        gr.Textbox(label="City", placeholder="Enter the city you want to visit"),
        gr.Textbox(label="Preferences", placeholder="E.g., historical sites, food, nightlife"),
        gr.Number(label="Trip Duration (Days)", value=1, interactive=True, minimum=1, maximum=7),
    ],
    outputs=gr.Textbox(label="Generated Travel Plan"),
    title="AI Travel Planner",
    description=(
        "Plan your trips with the help of an AI Travel Planner! "
        "Enter the city you want to visit, your preferences, and the duration of your trip, "
        "and get a personalized itinerary tailored to your interests."
    ),
    examples=[
        ["Paris", "art museums, romantic spots", 2],
        ["Tokyo", "anime culture, food, nightlife", 1],
        ["New York", "Broadway, Central Park, landmarks", 3],
    ],
)

# Launch Gradio app
interface.launch(share=True)

Using device: cuda
==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://888fbed536d4fc2252.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
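
The Gradio callback builds its prompt directly from the raw form values. A small sketch of a prompt builder with basic input checks is shown below. The helper name `build_travel_prompt` and the error messages are hypothetical; the 7-day limit mirrors the `maximum=7` set on the duration field.

```python
def build_travel_prompt(city, preferences, nb_days, max_days=7):
    """Validate the form values and return the prompt text for the model."""
    city = (city or "").strip()
    preferences = (preferences or "").strip()
    if not city:
        raise ValueError("Please enter a city.")
    days = int(nb_days)  # the number field may deliver a float such as 2.0
    if not 1 <= days <= max_days:
        raise ValueError(f"Trip duration must be between 1 and {max_days} days.")
    # Only mention preferences when the user actually provided some.
    focus = f", focusing on {preferences}" if preferences else ""
    unit = "day" if days == 1 else "days"
    return (
        f"Create a travel plan to visit {city} during {days} {unit}{focus}. "
        "Include suggested activities, landmarks to visit, and any local tips."
    )


print(build_travel_prompt("Paris", "art museums, romantic spots", 2))
```

If `generate_travel_plan` called such a helper inside its existing `try` block, a `ValueError` would be caught by the `except` clause and its message returned to the output textbox.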


