**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [4]:
# imports for the project

import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

### 1. Connect to WatsonX

In [5]:
import json

json_file_path = "/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/apikey.json"

with open(json_file_path, "r") as file:
    data = json.load(file)

WX_API_KEY = data.get("apikey")

if WX_API_KEY:
    print("API Key loaded successfully!")
else:
    print("Error: API Key not found in JSON file.")

API Key loaded successfully!
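
The absolute path above only works on one machine. A minimal sketch of a more portable loader, assuming the key may alternatively be supplied through an environment variable; the variable name `WX_API_KEY` and the helper `load_api_key` are illustrative and not part of the guide notebook:

```python
import json
import os
from pathlib import Path


def load_api_key(json_path=None, env_var="WX_API_KEY"):
    """Return the API key from the environment, else from a JSON file.

    Both `env_var` and this helper are assumptions made for the sketch.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    # Fall back to a JSON file with an "apikey" field, as in the cell above.
    if json_path is not None and Path(json_path).is_file():
        with open(json_path, "r") as file:
            key = json.load(file).get("apikey")
    if not key:
        raise RuntimeError(f"No API key found in ${env_var} or {json_path!r}")
    return key
```

Raising an error instead of printing a message stops the notebook early, before a missing key causes a less obvious failure when connecting.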


In [6]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="a8a394fd-a1b7-4dbe-b947-3a0684bbd947"
)

### 2. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` argument of `preprocess` (defined below) to sample more data.

In [7]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [8]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

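def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict = label_map, seed: int = 42) -> pd.DataFrame:
    """Map integer labels to category names and sample `frac` of the rows from each class."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))            # integer id -> category name
        [lambda df: df['label'].isin(label_map.values())]             # drop rows with unmapped labels
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))      # same fraction from every class
        .reset_index(drop=True)
    )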

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)
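
The shape `(760, 2)` follows from sampling 10% of each class: the test split holds 1,900 articles per category, so `frac=0.1` keeps 190 of each. A pure-Python sketch of the same per-class sampling idea is shown here; it is an analogue of the pandas pattern above and will not pick the same rows, and `stratified_sample` is a hypothetical helper:

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, frac, seed=42):
    """Sample the same fraction of rows from every label group."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        rng = random.Random(seed)        # fresh generator per group, same seed
        k = round(len(group) * frac)     # rows to keep from this class
        sample.extend(rng.sample(group, k))
    return sample


# Synthetic stand-in for the test split: 4 classes x 1,900 rows.
labels = ["World", "Sports", "Business", "Sci/Tech"]
rows = [{"label": label, "text": f"{label} article {i}"} for label in labels for i in range(1900)]

sample = stratified_sample(rows, frac=0.1)
counts = Counter(row["label"] for row in sample)
print(len(sample), dict(counts))
```

Because every class contributes the same fraction, the sample stays balanced, which makes accuracy and macro-averaged scores easier to compare.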

### 3. Set Parameters

In [9]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

### 4. Prompts to Test

In [10]:
ZERO_SHOT_PROMPT = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

In [11]:
ONE_SHOT_PROMPT = """You are an AI assistant tasked with classifying news articles into one of four categories:

CATEGORIES:
- Business
- Sports
- Sci/Tech
- World

Here is an example:

EXAMPLE:
TEXT: "Tesla announces record quarterly earnings as electric vehicle sales surge worldwide."
Category: Business

Now classify the following article:

TEXT:
{text}

Please assign the correct category to the text. Answer with only the category name.

Category:
"""

In [12]:
FEW_SHOT_PROMPT = """You are an AI assistant trained to classify news articles into one of the four categories: 

CATEGORIES:
- Business
- Sports
- Sci/Tech
- World

Below are some examples:

EXAMPLE 1:
TEXT: "Apple announces a new MacBook with an advanced M2 chip."
Category: Sci/Tech

EXAMPLE 2:
TEXT: "The stock market saw a major drop today as investors reacted to global economic uncertainty."
Category: Business

EXAMPLE 3:
TEXT: "The FIFA World Cup quarter-finals ended in a dramatic penalty shootout."
Category: Sports

EXAMPLE 4:
TEXT: "World leaders meet at the UN summit to discuss climate change policies."
Category: World

Now classify the following article:

TEXT:
{text}

Please assign the correct category to the text. Answer with only the category name.

Category:
"""

In [13]:
CHAIN_OF_THOUGHT_PROMPT = """You are an AI assistant tasked with classifying news articles into one of four categories:

CATEGORIES:
- Business
- Sports
- Sci/Tech
- World

TEXT:
{text}

Let's think step by step:
1. Identify the key subject of the article.
2. Determine what the article is primarily discussing (e.g., economy, technology, politics, or sports).
3. Match it with the most relevant category from the list above.

Explanation:
"""

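# Follow-up text intended to be appended after the generated explanation.
# Note: the loop in section 5 does not use it, so the chain-of-thought
# predictions are whatever the model generates directly after "Explanation:".
CHAIN_OF_THOUGHT_SUFFIX = """
Based on the reasoning above, the correct category is:

Category:
"""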
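
As written, the follow-up string above is never sent to the model. One way a two-stage chain-of-thought call could look is sketched below. This is an assumption about the intended design, not the guide's method: `generate_text` stands in for a wrapper around `model.generate(...)` that returns the generated string, and the stub only exists so the example runs offline. In practice the first stage would likely also need a larger `max_new_tokens` and no `"."` stop sequence, otherwise the explanation is cut off after a few words.

```python
from typing import Callable

# Same wording as the follow-up string above.
COT_SUFFIX = "\nBased on the reasoning above, the correct category is:\n\nCategory:\n"


def classify_with_reasoning(
    cot_prompt: str,
    generate_text: Callable[[str], str],
) -> tuple[str, str]:
    """Run two generations: first the explanation, then the final label."""
    # Stage 1: let the model write its reasoning after "Explanation:".
    explanation = generate_text(cot_prompt).strip()
    # Stage 2: feed prompt + reasoning back and ask only for the category.
    followup_prompt = cot_prompt + explanation + "\n" + COT_SUFFIX
    category = generate_text(followup_prompt).strip()
    return explanation, category


def fake_generate(prompt: str) -> str:
    """Offline stub standing in for the real model (hypothetical behaviour)."""
    if prompt.rstrip().endswith("Category:"):
        return " Sports"
    return " The article reports on a football match"


explanation, category = classify_with_reasoning(
    "TEXT:\nA late goal decided the final\n\nExplanation:\n", fake_generate
)
print(category)  # prints "Sports" with the stub above
```

The two-stage design doubles the number of API calls per article, so it trades latency for a prediction that is actually conditioned on the written reasoning.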

### 5. Generate Predictions

In [14]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

PROMPT_TEMPLATES = {
    "zero_shot": ZERO_SHOT_PROMPT,
    "one_shot": ONE_SHOT_PROMPT,
    "few_shot": FEW_SHOT_PROMPT,
    "chain_of_thought": CHAIN_OF_THOUGHT_PROMPT
}

all_predictions = {key: [] for key in PROMPT_TEMPLATES.keys()}

for prompt_type, prompt_template in PROMPT_TEMPLATES.items():
    predictions = []
    for text in tqdm(test_df["text"]):
        # format the prompt with the categories and the text
        prompt = prompt_template.format(categories=CATEGORIES, text=text)
        
        # generate the response from the model
        response = model.generate(prompt)
        
        # extract the generated text from the response
        prediction = response["results"][0]["generated_text"].strip()
        
        # append the prediction to the list of predictions
        predictions.append(prediction)
    
    all_predictions[prompt_type] = predictions

100%|██████████| 760/760 [04:00<00:00,  3.16it/s]
100%|██████████| 760/760 [04:01<00:00,  3.15it/s]
100%|██████████| 760/760 [04:03<00:00,  3.12it/s]
100%|██████████| 760/760 [04:41<00:00,  2.70it/s]
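
The loop above fills each template with `str.format`, which substitutes the `{categories}` and `{text}` placeholders and ignores keyword arguments a template does not use. That is why the same call works for the templates that list the categories inline. A small sketch with a shortened stand-in template and a made-up headline (both illustrative):

```python
# Shortened stand-in for the zero-shot template from section 4 (illustrative only).
TEMPLATE = """CATEGORIES:
{categories}

TEXT:
{text}

Category:
"""

labels = ["Business", "Sci/Tech", "Sports", "World"]
categories = "- " + "\n- ".join(labels)  # one bullet per category

# Made-up example headline, not taken from the dataset.
prompt = TEMPLATE.format(categories=categories, text="Oil prices climb after supply cut")
print(prompt)

# A template without {categories} silently ignores the extra keyword argument.
no_categories = "TEXT:\n{text}\n\nCategory:\n".format(categories=categories, text="Some text")
```

Only the template is parsed for placeholders, so curly braces inside an article's text are inserted verbatim; literal braces in the template itself would have to be doubled (`{{` and `}}`).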


### 6. Evaluate the Performance

In [15]:
for prompt_type, predictions in all_predictions.items():
    print(f"Classification Report for {prompt_type} prompt:")
    print(classification_report(test_df["label"], predictions))

Classification Report for zero_shot prompt:
                precision    recall  f1-score   support

      Business       0.53      0.92      0.67       190
      Sci/Tech       0.87      0.36      0.51       190
Space Sci/Tech       0.00      0.00      0.00         0
        Sports       0.96      0.91      0.93       190
         World       0.84      0.76      0.80       190

      accuracy                           0.74       760
     macro avg       0.64      0.59      0.58       760
  weighted avg       0.80      0.74      0.73       760

Classification Report for one_shot prompt:
              precision    recall  f1-score   support

    Business       0.64      0.89      0.74       190
         Law       0.00      0.00      0.00         0
    Lawsuits       0.00      0.00      0.00         0
  Motorsport       0.00      0.00      0.00         0
    Sci/Tech       0.89      0.53      0.66       190
    Software       0.00      0.00      0.00         0
      Sports       0.95    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
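
The reports above contain rows such as `Space Sci/Tech`, `Law`, and `Software` with support 0 because the model sometimes answers with a label outside the four categories, which is the likely source of the repeated warnings. A hedged sketch of one possible post-processing step: map each raw answer onto a known label when it matches or contains one, and onto a catch-all otherwise. The helper `normalize_prediction` and the `"Unknown"` bucket are illustrative choices, not part of the guide.

```python
LABELS = ["Business", "Sci/Tech", "Sports", "World"]


def normalize_prediction(raw: str, labels=LABELS, fallback: str = "Unknown") -> str:
    """Map a raw model answer onto one of the known labels."""
    cleaned = raw.strip().strip(".").lower()
    # 1) Exact match, ignoring case and a trailing period.
    for label in labels:
        if cleaned == label.lower():
            return label
    # 2) Answer contains a known label, e.g. "Space Sci/Tech".
    for label in labels:
        if label.lower() in cleaned:
            return label
    # 3) Anything else (e.g. "Law") is counted as a miss.
    return fallback


raw_predictions = ["Business", "Space Sci/Tech", "sports.", "Law"]
normalized = [normalize_prediction(p) for p in raw_predictions]
print(normalized)
# ['Business', 'Sci/Tech', 'Sports', 'Unknown']
```

Passing the normalized list to `classification_report` with `zero_division=0` would keep the table to the known labels plus the catch-all and silence the warnings. Whether mapping `Space Sci/Tech` to `Sci/Tech` is a fair correction is a judgment call that should be reported alongside the results.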


### 7. Evaluation of the LLM

**Performance Comparison**

- Zero-Shot Prompting: Achieved 74% accuracy without any examples in the prompt. Sports (F1 0.93) and World (F1 0.80) were classified well, but Sci/Tech recall was only 0.36 while Business was over-predicted (precision 0.53, recall 0.92).
- One-Shot Prompting: Achieved 79% accuracy. A single example clarified the expected output, although the model still produced some labels outside the four categories (e.g. Law, Lawsuits, Motorsport, Software).
- Few-Shot Prompting: Achieved 82% accuracy. One example per category gave the model a reference for each class and reduced misclassification errors.
- Chain-of-Thought Prompting: Achieved 85% accuracy. Prompting the model to reason through its classification step by step gave the best result among the four prompts.

**Comparison to BoW and BERT**

- BoW (LogReg/SVM) – 79% Accuracy  
  - Strengths: Simple, interpretable, fast
  - Weaknesses: Lacks deep semantic understanding, relies on word frequency

- Pre-trained BERT – 74% Accuracy  
  - Strengths: Captures deeper language meaning compared to BoW  
  - Weaknesses: Requires more computation, limited adaptation without fine-tuning

- Fine-tuned BERT – 91% Accuracy  
  - Strengths: Highly adaptable to the dataset, best classification performance 
  - Weaknesses: Computationally expensive, requires substantial training 

- LLM (Zero-Shot) – 74% Accuracy  
  - Strengths: No training required, generalizes well across different topics
  - Weaknesses: Struggles with nuanced or ambiguous categories

- LLM (One-Shot) – 79% Accuracy  
  - Strengths: Improves performance with a single example, fast inference
  - Weaknesses: Limited contextual understanding, not as robust as few-shot

- LLM (Few-Shot) – 82% Accuracy  
  - Strengths: More context-aware, reduces misclassification errors
  - Weaknesses: Slightly higher latency from the longer prompt, still less accurate than fine-tuned BERT

- LLM (Chain-of-Thought) – 85% Accuracy  
  - Strengths: Encourages logical reasoning, improves accuracy on complex cases
  - Weaknesses: Slower inference, requires more processing steps

For real-world applications where training a model isn’t feasible, few-shot and chain-of-thought prompting offer the best balance of accuracy and efficiency among the approaches tested here, making them the preferred choices.
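
The efficiency side of this trade-off can be read off the progress bars in section 5. A small worked calculation, using only the wall-clock times printed there (760 articles per run, runs in the order of `PROMPT_TEMPLATES`):

```python
N_ARTICLES = 760

# Wall-clock time per run, taken from the tqdm output (minutes, seconds).
elapsed = {
    "zero_shot": (4, 0),
    "one_shot": (4, 1),
    "few_shot": (4, 3),
    "chain_of_thought": (4, 41),
}

seconds = {name: 60 * m + s for name, (m, s) in elapsed.items()}
baseline = seconds["zero_shot"]

for name, secs in seconds.items():
    per_article = secs / N_ARTICLES      # average seconds per article
    slowdown = secs / baseline - 1       # relative to the zero-shot run
    print(f"{name:<17} {per_article:.3f} s/article  {slowdown:+.1%} vs zero-shot")
```

Under these timings the chain-of-thought run takes about 17% longer than zero-shot (281 s versus 240 s), while one-shot and few-shot add under 2%. These are single runs that include network latency, so the figures are indicative only.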
