**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [1]:
pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

# Get the Watsonx API key
WX_API_KEY = os.getenv("WX_API_KEY")

# Confirm the key was loaded without printing the secret itself
print("Watsonx API key loaded:", WX_API_KEY is not None)

Watsonx API key loaded: True


In [3]:
# imports for the project

import pandas as pd
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

In [4]:
WX_API_KEY = config('WX_API_KEY')

credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="75de42c2-d0c2-4185-b732-8bf50d64f880"
)

Connection test

In [5]:

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-3-8b-instruct",
)

Prompt test

In [6]:
prompt = "How do I ride a bicycle?"
generated_response = model.generate(prompt)

generated_response

{'model_id': 'ibm/granite-3-8b-instruct',
 'model_version': '1.1.0',
 'created_at': '2025-03-30T12:56:03.569Z',
 'results': [{'generated_text': '\n\n1. Find a safe, open area to practice.\n2. Adjust the seat to',
   'generated_token_count': 20,
   'input_token_count': 10,
   'stop_reason': 'max_tokens'}],
 'system': {'warnings': [{'message': ...,
    'id': 'unspecified_max_new_tokens',
    'additional_properties': {'limit': 0,
     'new_value': 20,
     'parameter': 'parameters.max_new_tokens',
     'value': 0}}]}}
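
The response above reports `'stop_reason': 'max_tokens'` after 20 generated tokens, which means the answer was cut off by the token limit rather than finished by the model. A small helper can flag such truncated responses. This is only a sketch: `was_truncated` is a hypothetical name, and it assumes nothing beyond the dictionary layout visible in the output above.

```python
def was_truncated(response: dict) -> bool:
    """Return True if any result stopped because it hit the token limit."""
    return any(
        result.get("stop_reason") == "max_tokens"
        for result in response.get("results", [])
    )


# Minimal stand-in for the response shown above (only the fields used here).
example_response = {
    "results": [
        {
            "generated_text": "\n\n1. Find a safe, open area to practice.\n2. Adjust the seat to",
            "generated_token_count": 20,
            "input_token_count": 10,
            "stop_reason": "max_tokens",
        }
    ]
}

print(was_truncated(example_response))  # True
```

Raising `max_new_tokens`, as done in the next cells, is the usual way to avoid the cut-off.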

In [7]:
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

TextGenParameters.show()

+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| PARAMETER             | TYPE                                   | EXAMPLE VALUE                                                                                                                             |
| decoding_method       | str, TextGenDecodingMethod, NoneType   | sample                                                                                                                                    |
+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| length_penalty        | dict, TextGenLengthPenalty, NoneType   | {'decay_factor': 2.5, 'start_index': 5}                                                                  

In [8]:
PARAMS = TextGenParameters(
    temperature=0.8,      # Higher temperature means more randomness
    max_new_tokens=500, # Maximum number of tokens to generate
    min_new_tokens=200, # Minimum number of tokens to generate
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-3-8b-instruct",
    params=PARAMS
)
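
The comment "higher temperature means more randomness" can be illustrated with the textbook softmax-with-temperature formula. This is a sketch of the general idea, not of how watsonx.ai implements sampling internally; the function name and the logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 corresponds to greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the top-scoring token, while high temperatures flatten the distribution, so sampled outputs vary more.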

### 1. Load the data

We can load this data directly from the Hugging Face Hub into a pandas DataFrame (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)). Pretty neat!

**Note**: This cell downloads the dataset and keeps it in memory. Every time you run the cell, the dataset is downloaded again.

You are welcome to increase the `frac` parameter to load more data.
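
To avoid downloading the data on every run, the result can be cached on disk. The helper below is a minimal sketch using only the standard library; `load_with_cache` and the file names are hypothetical, and the fake loader stands in for the real `pd.read_parquet(...)` call.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_with_cache(cache_path: str, loader: Callable[[], Any]) -> Any:
    """Return the cached object if it exists, otherwise call loader() and cache it."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    data = loader()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data


# Demonstration with a fake loader that records how often it is called.
calls = []


def fake_loader():
    calls.append(1)
    return {"rows": "stand-in for the downloaded DataFrame"}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = str(Path(tmp) / "ag_news_test.pkl")
    first = load_with_cache(cache_file, fake_loader)
    second = load_with_cache(cache_file, fake_loader)  # served from disk

print(len(calls))  # 1
```

In the notebook, the loader would be something like `lambda: pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])`.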

In [12]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [13]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac = 1e-2, label_map = label_map, seed=42) -> pd.DataFrame:
    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)

In [14]:
response = model.generate(prompt)
response

{'model_id': 'ibm/granite-3-8b-instruct',
 'model_version': '1.1.0',
 'created_at': '2025-03-30T12:56:45.204Z',
 'results': [{'generated_text': "\n\nRiding a bicycle involves several steps. Here's a simple guide to help you get started:\n\n1. **Find a suitable bicycle**: Ensure the bike fits you properly. Your knees should have a slight bend when the pedal is at its lowest point.\n\n2. **Adjust the seat and handlebars**: Lower the seat so your leg is extended slightly when the pedal is at its lowest point. Adjust the handlebars to a comfortable height.\n\n3. **Put on safety gear**: Wear a helmet, gloves, and comfortable shoes.\n\n4. **Start in a safe area**: Find a flat, open space like a park or parking lot.\n\n5. **Learn to balance**: Sit on the bike, place your feet on the ground, and walk the bike forward a few steps.\n\n6. **Practice pedaling**: Lift your feet off the ground, while keeping your balance. Start pedaling slowly.\n\n   - **For beginners, it's easier to use training wh

In [15]:
print(response["results"][0]["generated_text"])



Riding a bicycle involves several steps. Here's a simple guide to help you get started:

1. **Find a suitable bicycle**: Ensure the bike fits you properly. Your knees should have a slight bend when the pedal is at its lowest point.

2. **Adjust the seat and handlebars**: Lower the seat so your leg is extended slightly when the pedal is at its lowest point. Adjust the handlebars to a comfortable height.

3. **Put on safety gear**: Wear a helmet, gloves, and comfortable shoes.

4. **Start in a safe area**: Find a flat, open space like a park or parking lot.

5. **Learn to balance**: Sit on the bike, place your feet on the ground, and walk the bike forward a few steps.

6. **Practice pedaling**: Lift your feet off the ground, while keeping your balance. Start pedaling slowly.

   - **For beginners, it's easier to use training wheels or a balance bike initially.**

7. **Steering**: To turn, lean your body and look in the direction you want to go.

8. **Braking**: Practice using the brake

In [16]:
import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm

In [20]:
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [21]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac : float = 1e-2, label_map : dict[int, str] = label_map, seed : int = 42) -> pd.DataFrame:

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape # , train_df.shape

(760, 2)
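
`preprocess` samples the same fraction from every class, so the sample keeps the class balance of the test split: 760 rows, or 190 per category. The sketch below shows the same idea with the standard library only. `stratified_sample` is a hypothetical helper, and the toy rows assume 1,900 articles per class, as in the balanced AG News test split.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, frac, seed=42):
    """Sample the same fraction of rows from each group defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        k = round(len(members) * frac)  # number of rows to keep for this class
        sample.extend(rng.sample(members, k))
    return sample


# Toy data: 1,900 rows per class.
labels = ["World", "Sports", "Business", "Sci/Tech"]
rows = [{"text": f"story {i}", "label": label} for label in labels for i in range(1900)]

sample = stratified_sample(rows, key=lambda r: r["label"], frac=0.1)
counts = Counter(r["label"] for r in sample)
print(len(sample))  # 760
```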

In [22]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-3-8b-instruct",  # We could also try a larger model!
    params=PARAMS
)

In [25]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

In [27]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [03:38<00:00,  3.47it/s]
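
It can help to look at one fully rendered prompt before sending hundreds of them. The sketch below renders a template of the same shape as the one used in the loop; `PROMPT_TEMPLATE`, `label_names`, and the example text are made up for illustration.

```python
PROMPT_TEMPLATE = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

label_names = ["World", "Sports", "Business", "Sci/Tech"]
categories = "- " + "\n- ".join(label_names)  # one bullet per category

rendered = PROMPT_TEMPLATE.format(
    categories=categories,
    text="Stocks rallied on Friday after strong earnings reports.",  # made-up example
)
print(rendered)
```

Because the categories are listed with a `- ` prefix, the model may echo that prefix in its answer, so the raw predictions should be cleaned before scoring.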


In [28]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

                   0.00      0.00      0.00         0
  - Business       0.00      0.00      0.00         0
    Business       0.65      0.94      0.77       190
    Sci/Tech       0.90      0.55      0.68       190
      Sports       0.96      0.92      0.94       190
       World       0.88      0.82      0.84       190

    accuracy                           0.80       760
   macro avg       0.57      0.54      0.54       760
weighted avg       0.85      0.80      0.81       760
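
The macro average (0.57 / 0.54 / 0.54) is much lower than the weighted average because it is an unweighted mean over all six labels in the report, including the empty label and `- Business`, which only occur in the predictions. The arithmetic can be checked with the rounded values from the report; the variable names are chosen for illustration.

```python
# Per-label (precision, recall, f1) copied from the report above, rounded to two decimals.
report_rows = {
    "": (0.00, 0.00, 0.00),
    "- Business": (0.00, 0.00, 0.00),
    "Business": (0.65, 0.94, 0.77),
    "Sci/Tech": (0.90, 0.55, 0.68),
    "Sports": (0.96, 0.92, 0.94),
    "World": (0.88, 0.82, 0.84),
}
valid = ["Business", "Sci/Tech", "Sports", "World"]


def macro_average(rows, labels):
    """Unweighted mean of (precision, recall, f1) over the given labels."""
    n = len(labels)
    return tuple(sum(rows[label][i] for label in labels) / n for i in range(3))


all_labels = macro_average(report_rows, list(report_rows))
valid_only = macro_average(report_rows, valid)
print("all six labels: ", [f"{x:.3f}" for x in all_labels])
print("four real labels:", [f"{x:.3f}" for x in valid_only])
```

Averaged over the four real categories only, the macro scores are roughly 0.85, 0.81, and 0.81. `classification_report` also accepts a `labels` argument to restrict the report to chosen labels and a `zero_division` argument to control the undefined-metric warning.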





In [30]:
# Define a list of hyperparameter configurations to test
hyperparameter_configs = [
    {"temperature": 0.0, "max_new_tokens": 10},
    {"temperature": 0.5, "max_new_tokens": 20},
    {"temperature": 1.0, "max_new_tokens": 50},
]

# Define valid categories based on your label_map
valid_categories = set(label_map.values())

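# Normalize the raw model output so it can be compared with valid_categories
def normalize_prediction(pred):
    pred = pred.strip().lower()
    # remove list bullets, colons, and brackets the model sometimes echoes
    pred = pred.replace("-", "").replace(":", "").replace("[", "").replace("]", "")
    # keep only the text before a "###" marker, if the model emits one
    pred = pred.split("###")[0].strip()
    # note: capitalize() lowercases everything after the first character,
    # so "sci/tech" becomes "Sci/tech", which does not match "Sci/Tech"
    return pred.capitalize()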

# Store results for each configuration
results = []

for config in hyperparameter_configs:
    # Update the parameters
    PARAMS = TextGenParameters(
        temperature=config["temperature"],
        max_new_tokens=config["max_new_tokens"],
        stop_sequences=["\n", ".", "\nCategory:"]
    )
    
    # Update the model with new parameters
    model = ModelInference(
        api_client=client,
        model_id="ibm/granite-3-8b-instruct",
        params=PARAMS
    )
    
    predictions = []
    true_labels = []

    # Generate predictions for the test set
    for text, label in tqdm(zip(test_df["text"], test_df["label"]), total=len(test_df)):
        prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
        response = model.generate(prompt)
        raw_pred = response["results"][0]["generated_text"].strip()
        clean_pred = normalize_prediction(raw_pred)

        if clean_pred in valid_categories:
            predictions.append(clean_pred)
        else:
            predictions.append("Unknown")  # You could also skip it instead

        true_labels.append(label)

    # Evaluate predictions — exclude Unknown for clean metric
    filtered_true = [t for t, p in zip(true_labels, predictions) if p != "Unknown"]
    filtered_pred = [p for p in predictions if p != "Unknown"]

    report = classification_report(filtered_true, filtered_pred, output_dict=True)
    results.append({
        "config": config,
        "report": report,
        "predictions": predictions
    })

# Print results per configuration
for result in results:
    print(f"\n🔧 Config: {result['config']}")
    print(classification_report(
        [t for t, p in zip(test_df.label, result["predictions"]) if p != "Unknown"],
        [p for p in result["predictions"] if p != "Unknown"]
    ))


100%|██████████| 760/760 [03:31<00:00,  3.59it/s]
100%|██████████| 760/760 [03:35<00:00,  3.53it/s]
100%|██████████| 760/760 [03:39<00:00,  3.46it/s]


🔧 Config: {'temperature': 0.0, 'max_new_tokens': 10}
              precision    recall  f1-score   support

    Business       0.64      0.97      0.77       183
    Sci/Tech       0.00      0.00      0.00        81
      Sports       0.96      0.93      0.94       188
       World       0.88      0.85      0.86       183

    accuracy                           0.80       635
   macro avg       0.62      0.69      0.64       635
weighted avg       0.72      0.80      0.75       635


🔧 Config: {'temperature': 0.5, 'max_new_tokens': 20}
              precision    recall  f1-score   support

    Business       0.60      0.97      0.74       179
    Sci/Tech       0.00      0.00      0.00        85
      Sports       0.95      0.90      0.92       174
       World       0.88      0.79      0.83       180

    accuracy                           0.77       618
   macro avg       0.61      0.67      0.62       618
weighted avg       0.70      0.77      0.72       618


🔧 Config: {'temperatu
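
In the reports above, Sci/Tech has a precision and recall of 0.00, although the run without normalization reached a precision of 0.90 for that class. A likely cause is `normalize_prediction`: `str.capitalize()` lowercases everything after the first character, so `Sci/Tech` becomes `Sci/tech`, is not found in `valid_categories`, and ends up as `Unknown`. One possible fix, sketched here with the hypothetical helper `normalize_to_category`, is a case-insensitive lookup that maps back to the canonical label.

```python
VALID_CATEGORIES = ["World", "Sports", "Business", "Sci/Tech"]
# Map lowercase spellings back to the canonical label.
CANONICAL = {name.lower(): name for name in VALID_CATEGORIES}


def normalize_to_category(raw: str, unknown: str = "Unknown") -> str:
    """Map a raw model answer to a canonical category, ignoring case and bullets."""
    cleaned = raw.split("###")[0]  # drop anything after a "###" marker
    cleaned = cleaned.strip().lstrip("-").strip(" :[].").lower()
    return CANONICAL.get(cleaned, unknown)


print("Sci/Tech".capitalize())              # Sci/tech
print(normalize_to_category("- Sci/Tech"))  # Sci/Tech
print(normalize_to_category("business."))   # Business
print(normalize_to_category(""))            # Unknown
```

With a lookup like this, Sci/Tech answers would be kept instead of filtered out, so the per-class scores would need to be recomputed.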




### LLM
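The LLM performed well in zero-shot classification with minimal setup: with temperature 0 it reached an accuracy of 0.80 on the 760 sampled test articles, with the highest F1-scores for Sports (0.94) and World (0.84). Prompt engineering (for example clean category lists, stop sequences, and prediction normalization) was key to improving accuracy, while raising the temperature did not help: accuracy on the predictions that mapped to a valid category fell from 0.80 at temperature 0 to 0.77 at temperature 0.5. While not as precise as BERT, the LLM offered flexibility and required no training, making it ideal for rapid experimentation.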
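
The per-configuration reports exclude predictions that were mapped to `Unknown`, so their accuracy is computed over 635 (temperature 0) and 618 (temperature 0.5) of the 760 test articles. The rough calculation below converts these figures to accuracy over all 760 articles, counting every `Unknown` as an error. It works from the rounded values in the reports, so the results are approximate.

```python
TOTAL = 760  # size of the sampled test set

# (accuracy on kept predictions, number of kept predictions), read off the reports
filtered = {
    "temperature=0.0": (0.80, 635),
    "temperature=0.5": (0.77, 618),
}

summary = {}
for name, (accuracy, kept) in filtered.items():
    correct = accuracy * kept  # approximate, because accuracy is rounded
    coverage = kept / TOTAL    # share of articles with a valid label
    overall = correct / TOTAL  # treats every "Unknown" as wrong
    summary[name] = (coverage, overall)
    print(f"{name}: coverage={coverage:.2f}, overall accuracy ~ {overall:.2f}")
```

On this reading, overall accuracy is roughly 0.67 at temperature 0 and 0.63 at temperature 0.5, which is the fairer number to compare with the BoW and BERT results.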

### BoW
The BoW model was fast and simple but lacked contextual understanding. It struggled with nuanced or varied phrasing, resulting in lower accuracy. It served well as a baseline but was outperformed by the two other models.

### BERT
BERT achieved the best overall performance, thanks to its deep contextual understanding. It handled complex language and subtle distinctions effectively. However, it required more resources and training time. It’s the most reliable choice for high-stakes classification tasks.


