**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [1]:
# imports for the project

import pandas as pd

### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/) (the Hugging Face Hub) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [2]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [3]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac = 1e-2, label_map = label_map, seed=42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """
    
    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)
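
As a quick sanity check on the stratified sampling, the sketch below applies the same idea to a toy DataFrame (hypothetical data, not AG News). It uses `GroupBy.sample` rather than the `apply`-based call in `preprocess`; that swap is an assumption made for brevity, and the names `toy` and `toy_label_map` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the test split: 4 labels with 20 rows each (hypothetical data)
toy = pd.DataFrame({
    "text": [f"story {i}" for i in range(80)],
    "label": [i % 4 for i in range(80)],
})

toy_label_map = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

sampled = (
    toy
    .assign(label=lambda x: x["label"].map(toy_label_map))
    .groupby("label", group_keys=False)
    .sample(frac=0.25, random_state=42)  # draws the fraction within each label
    .reset_index(drop=True)
)

# Every label should contribute the same number of rows
counts = sampled["label"].value_counts()
print(counts)
```

Because the fraction is applied per label, each category keeps the same share of its rows, which is consistent with the balanced sample of 760 rows shown above.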

### 2. Connecting to WatsonX and testing the connection


In [4]:
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

In [5]:
WX_API_KEY = config('WX_API_KEY')

In [6]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="41a10143-6d5a-4d10-adfb-44ef608e0012"
)

In [7]:
model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
)

In [8]:
prompt = "How do I make a pizza?"
generated_response = model.generate(prompt)

generated_response

{'model_id': 'ibm/granite-13b-instruct-v2',
 'created_at': '2025-03-19T13:41:44.525Z',
 'results': [{'generated_text': 'To make a pizza, you will need flour, water, yeast, salt, olive oil, and',
   'generated_token_count': 20,
   'input_token_count': 7,
   'stop_reason': 'max_tokens'}],
    'id': 'unspecified_max_new_tokens',
    'additional_properties': {'limit': 0,
     'new_value': 20,
     'parameter': 'parameters.max_new_tokens',
     'value': 0}}]}}
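
The response is a plain dictionary, so the useful fields can be pulled out directly. The sketch below uses a mock response that copies only the `results` fields visible in the output above; `summarize_response` is a hypothetical helper, not part of the `ibm_watsonx_ai` SDK.

```python
# Mock response mirroring the "results" fields of the output above
mock_response = {
    "model_id": "ibm/granite-13b-instruct-v2",
    "results": [{
        "generated_text": "To make a pizza, you will need flour, water, yeast, salt, olive oil, and",
        "generated_token_count": 20,
        "input_token_count": 7,
        "stop_reason": "max_tokens",
    }],
}


def summarize_response(response: dict) -> dict:
    """Hypothetical helper: extract the text and flag truncated generations."""
    result = response["results"][0]
    return {
        "text": result["generated_text"].strip(),
        "tokens": result["generated_token_count"],
        # "max_tokens" means the model was cut off by the token limit
        "truncated": result["stop_reason"] == "max_tokens",
    }


summary = summarize_response(mock_response)
print(summary)
```

Here the generation stopped because it reached the token limit, which is why the sentence ends mid-list.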

### 3. Setting the parameters

In [9]:
import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm

In [10]:
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

PARAMS = TextGenParameters(
    temperature=0.4,
    max_new_tokens=10,
    stop_sequences=[".", "\n"],             
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  
    params=PARAMS
)

In [11]:
response = model.generate(prompt)

print(response["results"][0]["generated_text"])

To make a pizza, you will need flour,


### 4. Create system prompt

In [12]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of the following four categories:

CATEGORIES:
{categories}

TEXT:
{text}

Read the text carefully and assign the correct category to the text. Answer with the correct category name only, and nothing else.

EXAMPLES:
1. If the text is about a recent sports event, the category should be "Sports".
2. If the text is about a new scientific discovery, the category should be "Sci/Tech".

Please provide the category in the following format:

Category: <category_name>
"""

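To see what the model actually receives, it can help to fill the template once and print it. The sketch below uses a shortened stand-in template (`TEMPLATE`, an illustrative name) with the same two placeholders as `SYSTEM_PROMPT`, and a made-up example text.

```python
# Shortened stand-in template with the same placeholders as SYSTEM_PROMPT
TEMPLATE = """CATEGORIES:
{categories}

TEXT:
{text}
"""

labels = ["World", "Sports", "Business", "Sci/Tech"]
categories = "- " + "\n- ".join(labels)  # one bullet per category

# Braces inside the inserted text are left untouched by str.format
example_text = "Stocks rallied on {strong} earnings."
filled = TEMPLATE.format(categories=categories, text=example_text)
print(filled)
```

Note that only braces in the template itself are interpreted by `str.format`; a literal brace in the template would need to be doubled (`{{` or `}}`).
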
### 5. Generating predictions

In [13]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [04:12<00:00,  3.01it/s]
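
Since `classification_report` compares strings exactly, raw generations that carry a `Category:` prefix, different casing, or trailing punctuation would count as wrong. A minimal sketch of a normalization step is shown below; `normalize_prediction`, `VALID_LABELS`, and the `"Unknown"` fallback are hypothetical and not part of the guide notebook.

```python
VALID_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw: str, valid_labels=VALID_LABELS, fallback="Unknown") -> str:
    """Map a raw model output to one of the valid labels, or to a fallback."""
    cleaned = raw.strip()

    # Drop a leading "Category:" prefix if the model followed the output format
    if cleaned.lower().startswith("category:"):
        cleaned = cleaned[len("category:"):].strip()

    # Remove surrounding spaces, periods, and quotes
    cleaned = cleaned.strip(" .\"'")

    # Case-insensitive exact match against the known labels
    for label in valid_labels:
        if cleaned.lower() == label.lower():
            return label
    return fallback


examples = ["Category: Sports", " business.", "sci/tech", "", "Politics"]
print([normalize_prediction(e) for e in examples])
```

Applied to the loop above, this would be `predictions = [normalize_prediction(p) for p in predictions]` before computing the report.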


### 6. Evaluating performance

In [14]:
print(classification_report(test_df.label, predictions, zero_division=0))

              precision    recall  f1-score   support

                   0.00      0.00      0.00         0
    Business       0.27      0.96      0.43       190
    Sci/Tech       0.93      0.21      0.34       190
      Sports       0.67      0.01      0.02       190
       World       0.70      0.16      0.26       190

    accuracy                           0.34       760
   macro avg       0.51      0.27      0.21       760
weighted avg       0.64      0.34      0.26       760



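The report shows a strong imbalance between precision and recall. For Sci/Tech, Sports and World, precision is fairly high (0.93, 0.67 and 0.70), so when the model predicts one of these categories it is usually correct, but recall is very low (0.21, 0.01 and 0.16), so it rarely predicts them. Business shows the opposite pattern: recall is 0.96, meaning almost all Business articles are identified, but precision is only 0.27, which indicates that the model assigns the Business label to most articles regardless of their true category. The unlabeled row with support 0 appears to correspond to empty predictions that match none of the four categories.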
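
How strongly the Business label dominates can be estimated from the reported figures alone. The sketch below is back-of-the-envelope arithmetic on the rounded values from the classification report, so the result is approximate.

```python
# Values read from the classification report (rounded to two decimals)
support = 190      # true Business articles in the sample
recall = 0.96      # share of true Business articles predicted as Business
precision = 0.27   # share of Business predictions that are correct
total = 760        # all articles in the sample

# recall = TP / support  ->  TP = recall * support
true_positives = recall * support

# precision = TP / predicted  ->  predicted = TP / precision
predicted_business = true_positives / precision

share = predicted_business / total
print(f"Estimated Business predictions: {predicted_business:.0f} of {total} ({share:.0%})")
```

Under these rounded inputs, close to nine in ten articles receive the Business label, which explains both its high recall and the low recall of the other three categories.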


Compared with the other models, this one performs significantly worse, with an overall accuracy of 34% against 88% for BERT and 72% for BoW. This suggests that, in this setup, the other models are better suited to the task. Still, further prompt improvements and fine-tuning of the LLM could help enhance its performance.