**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [21]:
# imports for the project

import pandas as pd

from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

In [22]:
import os
from decouple import config, Config, RepositoryEnv

# Get the absolute path to the .env file
env_path = os.path.join(os.getcwd(), '.env')
print("Looking for .env file at:", env_path)

# Create a Config object with the explicit path
config = Config(RepositoryEnv(env_path))

# Try to load the API key
try:
    WX_API_KEY = config('WX_API_KEY')
    print("API key loaded successfully!")
except Exception as e:
    print("Error loading API key:", str(e))

Looking for .env file at: c:\Users\Frede\Code\AIML25\mas\ma2\ma2\.env
API key loaded successfully!


In [23]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="292b3d04-e9e8-4874-a7f6-a1de426bde30"
)

In [24]:

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
)

In [25]:
prompt = "How do I make a cake?"
generated_response = model.generate(prompt)

generated_response

{'model_id': 'ibm/granite-13b-instruct-v2',
 'created_at': '2025-03-23T20:32:02.914Z',
 'results': [{'generated_text': 'Mix the ingredients together in a bowl. Pour the batter into a cake pan. Bake for 30 minutes',
   'generated_token_count': 20,
   'input_token_count': 7,
   'stop_reason': 'max_tokens'}],
    'id': 'unspecified_max_new_tokens',
    'additional_properties': {'limit': 0,
     'new_value': 20,
     'parameter': 'parameters.max_new_tokens',
     'value': 0}}]}}

In [26]:
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

TextGenParameters.show()

+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| PARAMETER             | TYPE                                   | EXAMPLE VALUE                                                                                                                             |
| decoding_method       | str, TextGenDecodingMethod, NoneType   | sample                                                                                                                                    |
+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| length_penalty        | dict, TextGenLengthPenalty, NoneType   | {'decay_factor': 2.5, 'start_index': 5}                                                                  

# Set parameters

In [27]:
PARAMS = TextGenParameters(
    temperature=0.8,      # Higher temperature means more randomness
    max_new_tokens=500, # Maximum number of tokens to generate
    min_new_tokens=200, # Minimum number of tokens to generate
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
    params=PARAMS
)

In [28]:
response = model.generate(prompt)
response

{'model_id': 'ibm/granite-13b-instruct-v2',
 'created_at': '2025-03-23T20:32:08.874Z',
 'results': [{'generated_text': 'To make a cake, you will need flour, baking powder, salt, eggs, milk, oil, and sugar. Preheat your oven to 350 degrees. Mix together the flour, baking powder, and salt in a large bowl. In a separate bowl, mix together the eggs, milk, oil, and sugar. Add the wet ingredients to the dry ingredients and mix together until well combined. Pour the batter into a greased and floured cake pan. Bake the cake for 30 to 40 minutes, or until a toothpick inserted in the center comes out clean. Let the cake cool before frosting. icing icing ideas : Vanilla Buttercream Icing icing ideas : Chocolate Buttercream Icing icing ideas : Cream Cheese Icing icing ideas : Strawberry Buttercream Icing icing ideas : Lemon Buttercream Icing icing ideas : Carrot Cake Icing icing ideas : Red Velvet Icing icing ideas : Banana Caramel Icing icing ideas : Blueberry Icing icing ideas : Cream Cheese Fro

In [29]:
print(response["results"][0]["generated_text"])

To make a cake, you will need flour, baking powder, salt, eggs, milk, oil, and sugar. Preheat your oven to 350 degrees. Mix together the flour, baking powder, and salt in a large bowl. In a separate bowl, mix together the eggs, milk, oil, and sugar. Add the wet ingredients to the dry ingredients and mix together until well combined. Pour the batter into a greased and floured cake pan. Bake the cake for 30 to 40 minutes, or until a toothpick inserted in the center comes out clean. Let the cake cool before frosting. icing icing ideas : Vanilla Buttercream Icing icing ideas : Chocolate Buttercream Icing icing ideas : Cream Cheese Icing icing ideas : Strawberry Buttercream Icing icing ideas : Lemon Buttercream Icing icing ideas : Carrot Cake Icing icing ideas : Red Velvet Icing icing ideas : Banana Caramel Icing icing ideas : Blueberry Icing icing ideas : Cream Cheese Frosting icing ideas : Chocolate Frosting icing ideas : Peanut Butter Frosting icing ideas : White Chocolate Buttercream 


In [30]:
import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm

### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/) (the Hugging Face Hub) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.
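
One way to avoid repeated downloads is to cache the result on disk the first time it is fetched. The sketch below is a minimal illustration and not part of the guide: `load_cached` and the `fetch` callable are hypothetical names, and it uses `pickle` so that it works for any Python object, including a DataFrame.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_cached(cache_path: str, fetch: Callable[[], Any]) -> Any:
    """Return the cached object if it exists, otherwise fetch it and cache it."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: read the object back from disk, no download needed
        with path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: run the (possibly slow) fetch once and store the result
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data
```

In this notebook, `fetch` could be a lambda wrapping the `pd.read_parquet(...)` call used for the test split, with a cache path such as `data/ag_news_test.pkl` (a hypothetical location). Deleting that file forces a fresh download.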

You are welcome to increase the `frac` parameter to load more data.

In [31]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [32]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

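def preprocess(df: pd.DataFrame, frac = 1e-2, label_map = label_map, seed=42) -> pd.DataFrame:
    """Map integer labels to category names and sample a fraction `frac` of each class."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))        # integer label -> category name
        [lambda df: df['label'].isin(label_map.values())]         # keep only rows with a known label
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # same fraction from every class
        .reset_index(drop=True)
    )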

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 


((760, 2),)
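
The shape above follows from the per-class sampling in `preprocess`: the AG News test split has 7,600 rows, 1,900 per class, and `frac=0.1` is applied within each class. A small sketch of the arithmetic, assuming each group's sample size is `frac` times the group size rounded to the nearest integer (`expected_rows` is a hypothetical helper, not part of the notebook):

```python
def expected_rows(class_counts: dict[str, int], frac: float) -> dict[str, int]:
    """Estimate how many rows a per-class sample with `frac` keeps."""
    # Assumption: each class is sampled independently and the size is rounded
    return {label: round(count * frac) for label, count in class_counts.items()}


# AG News test split: 1,900 articles per class
ag_news_test = {"World": 1900, "Sports": 1900, "Business": 1900, "Sci/Tech": 1900}

per_class = expected_rows(ag_news_test, frac=0.1)
print(per_class["World"], sum(per_class.values()))  # 190 per class, 760 in total
```

Increasing `frac` scales both numbers linearly, and with them the number of calls made to the model in the classification loop.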

In [55]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

System prompt

In [56]:
SYSTEM_PROMPT = """You are a news classification expert. Let's classify this article with detailed context for each category:

CATEGORY DEFINITIONS AND EXAMPLES:

1. World News:
   - International politics and diplomacy
   - Global conflicts and peace
   - International organizations (UN, WHO, etc.)
   - Cross-border issues (climate change, migration)
   Example: "Global leaders meet at UN summit to discuss climate change policies"
   Example: "International trade agreement signed between major economies"

2. Sports:
   - Professional and amateur athletics
   - Team and individual sports
   - Tournaments and championships
   - Sports organizations and leagues
   Example: "Local team wins championship after dramatic final match"
   Example: "Olympic athlete breaks world record in swimming"

3. Business:
   - Companies and corporations
   - Financial markets and trading
   - Economic indicators and trends
   - Industry developments
   Example: "Tech company announces record quarterly earnings"
   Example: "Stock market reaches new heights as investors show confidence"

4. Sci/Tech:
   - Scientific discoveries and research
   - Technological innovations
   - Space exploration
   - Medical breakthroughs
   Example: "Scientists discover new species in deep ocean"
   Example: "Breakthrough in quantum computing research"

CATEGORIES:
{categories}

TEXT:
{text}

IMPORTANT: After analyzing the article against these detailed categories, respond with ONLY ONE WORD - the category name.
Example valid responses: "Business", "Sports", "Sci/Tech", "World"

Category:"""

In [57]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [04:18<00:00,  2.94it/s]
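
The loop stores the raw generation, and `classification_report` compares it with the labels as an exact string. If the model were to answer with something like `World News` or `"Sports"` (with quotes), that prediction would count as wrong. A minimal sketch of a normalization step, assuming simple case-insensitive matching is enough (`normalize_prediction` and the `Unknown` fallback are hypothetical, not part of the guide):

```python
LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw: str, labels: list[str] = LABELS, fallback: str = "Unknown") -> str:
    """Map a raw model generation onto one of the known labels."""
    # Strip whitespace, quotes and trailing punctuation, then lowercase
    cleaned = raw.strip().strip("\"'.:,").strip().lower()
    for label in labels:
        # Exact match after cleaning, e.g. '"Sports"' -> 'Sports'
        if cleaned == label.lower():
            return label
    for label in labels:
        # Otherwise accept the first label (in list order) contained in the answer
        if label.lower() in cleaned:
            return label
    return fallback


print(normalize_prediction(' "Sports" '))      # Sports
print(normalize_prediction("World News"))      # World
print(normalize_prediction("I am not sure"))   # Unknown
```

Applied inside the loop, this would be `predictions.append(normalize_prediction(prediction))`. Counting how often the fallback occurs is also a quick way to see whether the prompt's output format is being followed.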


In [58]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.63      0.90      0.74       190
    Sci/Tech       0.69      0.64      0.66       190
      Sports       0.82      0.94      0.88       190
       World       0.91      0.46      0.61       190

    accuracy                           0.73       760
   macro avg       0.76      0.73      0.72       760
weighted avg       0.76      0.73      0.72       760
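
As a sanity check on how to read the report: each F1 score is the harmonic mean of that row's precision and recall, and the macro average is the unweighted mean over the four classes. A short sketch that recomputes both from the rounded values above, so small rounding differences are possible:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the report above
report = {
    "Business": (0.63, 0.90),
    "Sci/Tech": (0.69, 0.64),
    "Sports": (0.82, 0.94),
    "World": (0.91, 0.46),
}

scores = {label: f1(p, r) for label, (p, r) in report.items()}
for label, score in scores.items():
    print(f"{label:<10} {score:.2f}")

macro_f1 = sum(scores.values()) / len(scores)
print(f"macro F1   {macro_f1:.2f}")
```

The World row shows why a high precision alone is not enough: precision is 0.91 but recall only 0.46, and the harmonic mean is pulled toward the smaller value.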



# Reflections
I was unable to push the performance of the LLM beyond that of the BERT and BoW models. I tried different system prompts (giving the LLM more context and encouraging chain-of-thought reasoning) as well as different base models, but the performance was never quite impressive enough. The run reported here reaches 0.73 accuracy and 0.72 macro F1. The weak spots are World, where recall is only 0.46 (more than half of the World articles are assigned to other categories), and Business, which is over-predicted (precision 0.63 against recall 0.90).

I would imagine that this is an ill-suited task for an LLM, since compressing an entire article down to a single token (or at least a single word) requires a lot of implicit reasoning in one step. It should still be feasible somehow. Perhaps the BERT and BoW models perform better because they are better suited to this kind of computation, and because they are more specialized for such a task than a general-purpose LLM. Prompt engineering was, unsurprisingly, very important for performance, and I ended up using an LLM to generate a high-context prompt for this purpose (something I have found extremely useful for improving LLM outputs, for example in deep research).
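
On the chain-of-thought attempts: with the generation parameters shown in this notebook (`max_new_tokens=10` and the stop sequences `"."` and `"\n"`), any reasoning text would be cut off at the first sentence, so a chain-of-thought prompt needs a larger token budget and a fixed place for the final answer. A minimal sketch of the parsing side, assuming the prompt asks the model to finish with a line of the form `Category: <label>` (this output format and `extract_category` are hypothetical, not what was run here):

```python
import re
from typing import Optional

LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def extract_category(generated_text: str, labels: list[str] = LABELS) -> Optional[str]:
    """Return the label from the last 'Category: <label>' line, or None."""
    # Find every 'Category: ...' line; the last one is taken as the final answer
    matches = re.findall(
        r"^\s*Category:\s*(.+?)\s*$",
        generated_text,
        flags=re.MULTILINE | re.IGNORECASE,
    )
    if not matches:
        return None
    answer = matches[-1].strip("\"'. ").lower()
    for label in labels:
        if answer == label.lower():
            return label
    return None


example = (
    "The article discusses quarterly earnings and share prices.\n"
    "Category: Business"
)
print(extract_category(example))  # Business
```

Returning `None` for unparseable answers keeps failures visible instead of silently counting them as one of the four classes.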
