**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [15]:
# imports for the project

import pandas as pd
from tqdm import tqdm
from sklearn.metrics import classification_report 
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

In [5]:
# Load the watsonx.ai API key from a local JSON file
# The file is expected to contain an "apikey" field
# The key file should NOT be committed to the repository
import json

json_file_path = "/Users/madswolff/Desktop/CBS/Master/AIML25/apikey.json"

with open(json_file_path, "r") as file:
    data = json.load(file)

WX_API_KEY = data.get("apikey")

if WX_API_KEY:
    print("API Key loaded successfully!")
else:
    print("Error: API Key not found in JSON file.")

API Key loaded successfully!


In [6]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com/",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="aabcb37b-71f8-4a4c-b5c7-ca8e3c1fc62c"
)

### 1. Load the data

We can load the AG News dataset directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter of `preprocess` to sample more data from each category.

In [7]:
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [8]:
train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])

In [9]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac=1e-2, label_map=label_map, seed=42) -> pd.DataFrame:
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))        # integer label -> category name
        [lambda df: df['label'].isin(label_map.values())]         # drop rows with unmapped labels
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # same fraction from every category
        .reset_index(drop=True)
    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

test_df.shape, train_df.shape

((760, 2), (1200, 2))

In [11]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

In [47]:
ZERO_SHOT_PROMPT = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

In [57]:
FEW_SHOT_PROMPT = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

TEXT:
{text}

Here are some examples of the categories:

EXAMPLE1: 
TEXT: "Serena Williams wins the US Open"
CATEGORY: Sports

EXAMPLE2:
TEXT: "Apple announces record profits"
CATEGORY: Business

EXAMPLE3:
TEXT: "New COVID variant discovered"
CATEGORY: Sci/Tech

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

In [58]:
CHAIN_OF_THOUGHT_PROMPT = """You are an AI assistant tasked with classifying news articles into one of four categories:

CATEGORIES:
{categories}

TEXT:
{text}

Think step by step:
1. Identify the main topic of the article 
2. Determine what it is discussing (e.g. sports, business, etc.)
3. Find the most relevant category from the {categories} and match the article to it

Through applying these steps, assign the correct category to the text, answering with the correct category and nothing else. 
DO NOT HALLUCINATE CATEGORIES THAT DO NOT EXIST IN THE LIST OF PROVIDED CATEGORIES EVEN IF YOU FEEL THAT A NOVEL CATEGORY WOULD BE MORE WELLSUITED.

Category:
"""

In [59]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions_zeroshot = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = ZERO_SHOT_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions_zeroshot.append(prediction)

100%|██████████| 760/760 [03:59<00:00,  3.18it/s]


In [60]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions_fewshot = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = FEW_SHOT_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions_fewshot.append(prediction)

100%|██████████| 760/760 [03:46<00:00,  3.35it/s]


In [61]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions_chainofthough = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = CHAIN_OF_THOUGHT_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions_chainofthough.append(prediction)

100%|██████████| 760/760 [03:56<00:00,  3.21it/s]
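The three generation loops above differ only in the prompt template and the list that collects the outputs. A minimal sketch of a shared helper follows. `run_prompt` and `fake_generate` are hypothetical names that are not part of the notebook or of `ibm_watsonx_ai`; `fake_generate` stands in for `model.generate` so the snippet runs without credentials, and it assumes the same `results[0]["generated_text"]` response shape used in the loops above.

```python
from typing import Callable, Iterable


def run_prompt(
    template: str,
    texts: Iterable[str],
    categories: str,
    generate: Callable[[str], dict],
) -> list[str]:
    """Format `template` for each text and collect the stripped model outputs."""
    predictions = []
    for text in texts:
        prompt = template.format(categories=categories, text=text)
        response = generate(prompt)
        predictions.append(response["results"][0]["generated_text"].strip())
    return predictions


def fake_generate(prompt: str) -> dict:
    """Hypothetical stand-in for `model.generate`: always answers 'Sports'."""
    return {"results": [{"generated_text": " Sports\n"}]}


# Toy template and inputs, used only to show the call shape
demo_template = "CATEGORIES:\n{categories}\n\nTEXT:\n{text}\n\nCategory:\n"
demo_predictions = run_prompt(
    demo_template,
    ["Team wins the final", "Shares fall sharply"],
    "- Sports\n- Business",
    fake_generate,
)
print(demo_predictions)
```

With the notebook's own objects, the call would look like `run_prompt(ZERO_SHOT_PROMPT, test_df["text"], CATEGORIES, model.generate)`, and likewise for the other two templates.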


In [62]:
print(classification_report(test_df.label, predictions_zeroshot))

              precision    recall  f1-score   support

    Business       0.54      0.91      0.68       190
    Sci/Tech       0.89      0.35      0.50       190
      Sports       0.96      0.91      0.94       190
       World       0.80      0.78      0.79       190

    accuracy                           0.74       760
   macro avg       0.80      0.74      0.73       760
weighted avg       0.80      0.74      0.73       760



In [64]:
print(classification_report(test_df.label, predictions_fewshot))

              precision    recall  f1-score   support

    Business       0.52      0.08      0.14       190
    Sci/Tech       0.74      0.19      0.31       190
      Sports       1.00      0.10      0.18       190
       World       0.28      0.97      0.44       190

    accuracy                           0.34       760
   macro avg       0.63      0.34      0.27       760
weighted avg       0.63      0.34      0.27       760
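
A recall of 0.97 together with a precision of 0.28 for World suggests that the model answered World for most inputs. One way to check this is to count the raw predictions. The lists in this sketch are small made-up placeholders, not the notebook's actual outputs.

```python
from collections import Counter

# Hypothetical placeholder data, NOT taken from the actual run
true_labels = ["World", "Sports", "Business", "Sci/Tech", "Sports", "World"]
raw_predictions = ["World", "World", "World", "Sci/Tech", "World", "World"]

# How often each string was predicted, most common first
prediction_counts = Counter(raw_predictions)
print(prediction_counts.most_common())

# Any output that is not one of the true labels counts as off-label
known_labels = set(true_labels)
off_label = [p for p in raw_predictions if p not in known_labels]
print(f"off-label predictions: {len(off_label)}")
```

In the notebook, the same check would be `Counter(predictions_fewshot).most_common()`.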



In [63]:
print(classification_report(test_df.label, predictions_chainofthough))

              precision    recall  f1-score   support

    Business       0.54      0.92      0.68       190
    Sci/Tech       0.93      0.29      0.45       190
      Sports       0.96      0.92      0.94       190
       World       0.78      0.79      0.78       190

    accuracy                           0.73       760
   macro avg       0.80      0.73      0.71       760
weighted avg       0.80      0.73      0.71       760
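
All three reports contain only the four expected rows, so every output in these runs matched a label exactly. That may not hold for other prompts or models, because `classification_report` treats each distinct string as its own class. A minimal sketch of a normalisation step is shown here; `normalize_prediction` and the `"Unknown"` fallback are hypothetical choices, not part of the notebook.

```python
VALID_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw: str, valid_labels=VALID_LABELS, fallback="Unknown") -> str:
    """Map a raw model output to one of the valid labels, or to `fallback`."""
    # Remove surrounding whitespace, list bullets, quotes and trailing periods
    cleaned = raw.strip().strip("-*\"'. ").lower()

    # Exact match, ignoring case
    for label in valid_labels:
        if cleaned == label.lower():
            return label

    # Otherwise accept the output only if it mentions exactly one label
    mentioned = [label for label in valid_labels if label.lower() in cleaned]
    if len(mentioned) == 1:
        return mentioned[0]
    return fallback


for raw in [" sports ", "- Sci/Tech", "Category: Business", "World or Sports", "Politics"]:
    print(repr(raw), "->", normalize_prediction(raw))
```

Applied to a list of predictions before calling `classification_report`, this would group unparseable outputs into a single extra class instead of one class per distinct string.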



<h3> REFLECTIONS </h3>

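Overall, the LLM performed worst compared with BoW and BERT, reaching a maximum accuracy of 74% with the zero-shot prompt. The chain-of-thought prompt was close behind at 73%, while the few-shot prompt only reached 34%. In the few-shot run the model assigned World to almost every story (World recall 0.97 at precision 0.28). In the zero-shot and chain-of-thought runs the weak spot is Sci/Tech (recall 0.35 and 0.29), while Business has high recall but a precision of only 0.54, which suggests that many Sci/Tech stories were labelled as Business. I suspect that poor structuring of the prompts on my end is the cause of the performance deficit, and the few-shot result in particular almost certainly stems from it: the examples are placed after the text to classify, and none of them covers World. The chain-of-thought prompt also had little room to reason, since generation was capped at 10 new tokens and stopped at the first period or newline, which may explain why it performed about the same as zero-shot.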
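
One restructuring that could be tested is to place the worked examples before the text to classify, cover all four categories, and end the prompt with an open `CATEGORY:` slot. The sketch below only builds the prompt string. `build_few_shot_prompt` and the World example are hypothetical additions, and nothing here has been measured, so it is not evidence that this layout improves accuracy.

```python
EXAMPLES = [
    ("Serena Williams wins the US Open", "Sports"),
    ("Apple announces record profits", "Business"),
    ("New COVID variant discovered", "Sci/Tech"),
    ("UN Security Council meets over border conflict", "World"),  # hypothetical example
]


def build_few_shot_prompt(text: str, categories: list[str], examples=EXAMPLES) -> str:
    """Build a prompt with the examples first and the target text last."""
    category_block = "\n".join(f"- {c}" for c in categories)
    example_block = "\n\n".join(
        f'TEXT: "{ex_text}"\nCATEGORY: {ex_label}' for ex_text, ex_label in examples
    )
    return (
        "Your task is to classify news stories into one of the categories below.\n\n"
        f"CATEGORIES:\n{category_block}\n\n"
        f"EXAMPLES:\n{example_block}\n\n"
        "Now classify the following text. Answer with the category and nothing else.\n\n"
        f'TEXT: "{text}"\nCATEGORY:'
    )


prompt = build_few_shot_prompt(
    "Stocks rally after earnings reports",
    ["World", "Sports", "Business", "Sci/Tech"],
)
print(prompt)
```

Because the prompt ends with the same `CATEGORY:` pattern the examples use, the model's next tokens line up with the format it has just seen, which is the property this layout is meant to test.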