**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [24]:
# imports for the project
import pandas as pd
from decouple import config 
from sklearn.metrics import classification_report 
from tqdm import tqdm

from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

## Connecting to WatsonX.AI

In [58]:
# Loading api key from .env file
api_key = config('wx_api_key')

# Accessing the API
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com/",
    api_key = api_key
)

client = APIClient(
    credentials,
    project_id="ce1ea911-fc95-453c-9e8b-02ff019d04e8"    
)

### Testing the connection

In [61]:
model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
)

In [62]:
prompt = "How do I make a cake?"
generated_response = model.generate(prompt)

generated_response

{'model_id': 'ibm/granite-13b-instruct-v2',
 'created_at': '2025-03-24T08:15:05.052Z',
 'results': [{'generated_text': 'Mix the ingredients together in a bowl. Pour the batter into a cake pan. Bake for 30 minutes',
   'generated_token_count': 20,
   'input_token_count': 7,
   'stop_reason': 'max_tokens'}],
    'id': 'unspecified_max_new_tokens',
    'additional_properties': {'limit': 0,
     'new_value': 20,
     'parameter': 'parameters.max_new_tokens',
     'value': 0}}]}}

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.
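One way to avoid repeated downloads is to cache the data on disk after the first fetch. The sketch below is a minimal illustration and not part of the guide: `load_cached` and the cache file name are hypothetical, and a plain dict stands in for the DataFrame so that the example runs without network access.

```python
import pickle
import tempfile
from pathlib import Path


def load_cached(cache_path, fetch):
    """Return the object stored at cache_path, calling fetch() only on a cache miss."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: read the stored object instead of downloading again
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: fetch once, then store the result for later runs
    data = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data


# Demo with a stand-in for the download step (hypothetical data)
calls = []


def fake_download():
    calls.append(1)
    return {"text": ["example story"], "label": [2]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "ag_news_test.pkl"
    first = load_cached(cache_file, fake_download)
    second = load_cached(cache_file, fake_download)

print(len(calls))  # the download function ran only once
```

In the notebook, `fetch` could be a small function that wraps the `pd.read_parquet(...)` call from the loading cell.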

You are welcome to increase the `frac` parameter to sample a larger share of the data.

In [88]:
# Loading the dataset
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(test.head())

                                                text  label
0  Fears for T N pension after talks Unions repre...      2
1  The Race is On: Second Private Team Sets Launc...      3
2  Ky. Company Wins Grant to Study Peptides (AP) ...      3
3  Prediction Unit Helps Forecast Wildfires (AP) ...      3
4  Calif. Aims to Limit Farm-Related Smog (AP) AP...      3


In [89]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac = 1e-2, label_map = label_map, seed=42) -> pd.DataFrame:
    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
# del test

test_df.shape, # train_df.shape, 

((760, 2),)

### Setting parameters for the WatsonX.AI model

In [90]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

# Models to evaluate (no training takes place, the models are only prompted)
model_ids = [
    "ibm/granite-13b-instruct-v2",
    "meta-llama/llama-3-405b-instruct"
]

### Creating a system prompt

In [119]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of four predefined categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

EXAMPLES:
TEXT: "Apple just launched a new AI-powered MacBook."  
Category: Sci/Tech  

TEXT: "The stock market crashed after the latest interest rate hike."  
Category: Business  

TEXT: "A major earthquake has struck Japan, causing widespread damage."  
Category: World  

TEXT: "Manchester United won their latest match with a last-minute goal."  
Category: Sports  

Category:
"""

### Making predictions

In [124]:
# Print categories
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())
print(CATEGORIES)

- Business
- Sci/Tech
- Sports
- World


In [120]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

# List to store the predictions of the IBM model
predictions_ibm = []

model = ModelInference(
    api_client=client,
    model_id=model_ids[0],  # IBM Model id
    params=PARAMS
)

# Classify every text in the test sample
for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)

    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the correct list of predictions
    predictions_ibm.append(prediction)
 
    



100%|██████████| 760/760 [04:10<00:00,  3.04it/s]


In [121]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

# List to store the predictions of the Llama model
predictions_llama = []


model = ModelInference(
    api_client=client,
    model_id=model_ids[1],  # Llama model id
    params=PARAMS
)

# Classify every text in the test sample
for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)

    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the correct list of predictions
    predictions_llama.append(prediction)
 
    

100%|██████████| 760/760 [05:38<00:00,  2.25it/s]
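The two prediction cells above differ only in the model ID and the output list, so the loop could be factored into one function. The following is a sketch, assuming the model object exposes a `generate(prompt)` method that returns the response structure shown in the connection test. `classify_texts` and `StubModel` are hypothetical names; the stub stands in for `ModelInference` so that the example runs offline.

```python
def classify_texts(model, texts, prompt_template, categories):
    """Prompt the model once per text and return the stripped generated strings."""
    predictions = []
    for text in texts:
        # Fill the template with the category list and the text to classify
        prompt = prompt_template.format(categories=categories, text=text)
        response = model.generate(prompt)
        # Same extraction as in the cells above
        predictions.append(response["results"][0]["generated_text"].strip())
    return predictions


class StubModel:
    """Offline stand-in for ModelInference (hypothetical); always answers ' Sports\\n'."""

    def generate(self, prompt):
        return {"results": [{"generated_text": " Sports\n"}]}


template = "CATEGORIES:\n{categories}\n\nTEXT:\n{text}\n\nCategory:"
preds = classify_texts(
    StubModel(),
    ["Team wins final", "Striker injured"],
    template,
    "- Sports\n- World",
)
print(preds)
```

With the real client, the same function could be called once per entry in `model_ids`, passing a `ModelInference` object built as in the cells above together with `SYSTEM_PROMPT` and `CATEGORIES`.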


### Evaluating Performance

In [122]:
print("IBM Prediction Evaluation: ")
print(classification_report(test_df.label, predictions_ibm))
print("Llama Prediction Evaluation: ")
print(classification_report(test_df.label, predictions_llama))

IBM Prediction Evaluation: 
              precision    recall  f1-score   support

    Business       0.54      0.71      0.61       190
    Sci/Tech       1.00      0.02      0.03       190
      Sports       0.39      0.94      0.55       190
       World       0.75      0.23      0.35       190

    accuracy                           0.47       760
   macro avg       0.67      0.47      0.39       760
weighted avg       0.67      0.47      0.39       760

Llama Prediction Evaluation: 
              precision    recall  f1-score   support

    Business       0.77      0.93      0.84       190
    Sci/Tech       0.91      0.72      0.81       190
      Sports       0.97      0.97      0.97       190
       World       0.95      0.85      0.90       190
         ```       0.00      0.00      0.00         0

    accuracy                           0.87       760
   macro avg       0.72      0.69      0.70       760
weighted avg       0.90      0.87      0.88       760
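The extra row in the Llama report, labelled with three backticks, shows that some generated strings are not among the four categories. The sketch below shows one way to audit and normalise such outputs before scoring. `VALID_LABELS`, `normalize`, the `"Other"` fallback and the sample predictions are all hypothetical and were not used for the results above.

```python
from collections import Counter

VALID_LABELS = ["Business", "Sci/Tech", "Sports", "World"]
FENCE = "`" * 3  # three backticks, as in the extra report row


def normalize(prediction, valid_labels=VALID_LABELS, fallback="Other"):
    """Map a raw model output to a valid label, or to a fallback bucket."""
    # Drop surrounding whitespace, quotes, backticks and full stops, ignore case
    cleaned = prediction.strip().strip("`\"'. ").lower()
    for label in valid_labels:
        if cleaned == label.lower():
            return label
    return fallback


# Hypothetical raw outputs, including one off-label answer
raw = ["Sports", " world.", FENCE, "Sci/Tech", "business"]
normalized = [normalize(p) for p in raw]

# Count which raw strings could not be mapped to a category
off_label = Counter(p for p, n in zip(raw, normalized) if n == "Other")

print(normalized)
print(off_label)
```

Separately, `classification_report` accepts a `labels` argument to restrict the report to the listed classes and a `zero_division` argument to control the undefined-metric warning; whether to use them here is a judgment call, since hiding off-label outputs also hides a real failure mode.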





### Reflections

#### Model Reflections
The Llama model performs significantly better than the baseline IBM Granite model: overall accuracy rises from 0.47 to 0.87. Granite's weakness is concentrated in Sci/Tech (recall 0.02) and World (recall 0.23), while it over-predicts Sports (recall 0.94 at a precision of 0.39).
During prompt development, the Llama model was significantly more sensitive to the prompting technique. Numerous iterations were needed to find a prompt that kept the model from inventing new categories that it judged a better fit for the text. Long, descriptive prompts performed worse than the final prompt, which is clear and contains few instructions.

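Still, the classification report shows that the Llama model did not fit every text into one of the four predefined categories: the report contains an extra row, labelled with three backticks and with a support of 0, which means that at least one generated answer was not a valid category.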
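The gap between the Llama macro average (0.72 precision) and weighted average (0.90 precision) follows from that extra zero-support row: macro averaging gives it the same weight as each real class. A quick check, using the rounded per-class precision values from the report (the key `"extra"` is a placeholder name for the off-label row):

```python
# Per-class precision for Llama, copied from the classification report above
precision = {
    "Business": 0.77,
    "Sci/Tech": 0.91,
    "Sports": 0.97,
    "World": 0.95,
    "extra": 0.00,  # the off-label row with support 0
}

# Macro average over all five rows, as in the report
macro_with_extra = sum(precision.values()) / len(precision)

# Macro average over the four real categories only
real = [v for k, v in precision.items() if k != "extra"]
macro_real_only = sum(real) / len(real)

print(round(macro_with_extra, 2))
print(round(macro_real_only, 2))
```

Because all four real classes have the same support (190), the second value also matches the weighted average in the report.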

#### Prompt Technique Reflections
Few-shot prompting was added to the prompt to improve performance, and it significantly improved the models' ability to use only the existing categories for classification.

- Adding an example for each category gave the models clear guidelines for the task: they appeared to classify the texts better and were less likely to create new categories.
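The few-shot layout can also be assembled programmatically. The prompt used in this notebook places the examples after the text to classify; the sketch below shows an alternative ordering in which the examples come first and the prompt ends with the target text and an open `Category:` line. `FEW_SHOT_EXAMPLES` and `build_prompt` are hypothetical names, and this variant was not used for the reported results, so whether it changes the scores is untested.

```python
# Few-shot examples copied from the prompt used in this notebook
FEW_SHOT_EXAMPLES = [
    ("Apple just launched a new AI-powered MacBook.", "Sci/Tech"),
    ("The stock market crashed after the latest interest rate hike.", "Business"),
    ("A major earthquake has struck Japan, causing widespread damage.", "World"),
    ("Manchester United won their latest match with a last-minute goal.", "Sports"),
]


def build_prompt(text, categories, examples=FEW_SHOT_EXAMPLES):
    """Assemble instruction, categories and examples, then the text to classify."""
    category_block = "\n".join(f"- {c}" for c in categories)
    example_block = "\n\n".join(
        f'TEXT: "{ex_text}"\nCategory: {ex_label}' for ex_text, ex_label in examples
    )
    return (
        "Your task is to classify news stories into one of the categories below.\n\n"
        f"CATEGORIES:\n{category_block}\n\n"
        "Answer with the correct category and nothing else.\n\n"
        f"EXAMPLES:\n{example_block}\n\n"
        f'TEXT: "{text}"\nCategory:'
    )


prompt = build_prompt(
    "Oil prices rise as supply tightens.",
    ["Business", "Sci/Tech", "Sports", "World"],
)
print(prompt)
```

Keeping the examples in a list makes it easy to vary their number or order when testing how sensitive a model is to the prompt.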