<a href="https://colab.research.google.com/github/sbuergers/llm-hackathon/blob/add_gpt4_api/llm_challenge_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the Large Language Models Challenge!

This is a template notebook. __Copy it__ and start hacking away if you like. We suggest removing the `_template` suffix and replacing it with your team name. If you do not have a team name yet, have a look below. There are already a few useful snippets of code that might help you in your quest, including a random team name generator!

But first, some prerequisites. Download the [cpv-master.zip](https://stllmchallenge2024.blob.core.windows.net/data/cpv-master.zip?sp=r&st=2024-09-09T06:03:17Z&se=2024-09-11T14:03:17Z&spr=https&sv=2022-11-02&sr=b&sig=uI7Y00VjgiqcfX96imRFootgC5J2SYDKoDkd%2BVLAGJs%3D) file and install it with the following cell. You will also need to download these three files and put them in the `./data/enriched` folder in your Google Colab files tab (on the left): [file 1](https://stllmchallenge2024.blob.core.windows.net/data/data_tenderned.csv?sp=r&st=2024-09-09T06:57:54Z&se=2024-09-11T14:57:54Z&spr=https&sv=2022-11-02&sr=b&sig=YMNvh9YrqFctLj7TYQNu%2FTutJ%2FxzzkFI%2FAVFgivlsRg%3D), [file 2](https://stllmchallenge2024.blob.core.windows.net/data/test_data_tenderned_clean.csv?sp=r&st=2024-09-09T06:58:54Z&se=2024-09-11T14:58:54Z&spr=https&sv=2022-11-02&sr=b&sig=W0yBbrllJHGa5etPsiRCFV%2Fz16Khq4pwmRANCZLltrs%3D), [file 3](https://stllmchallenge2024.blob.core.windows.net/data/train_data_tenderned_and_ted_clean.csv?sp=r&st=2024-09-09T06:59:30Z&se=2024-09-12T14:59:30Z&spr=https&sv=2022-11-02&sr=b&sig=cE6WBsHxwjij9%2FM%2FTOYfZAAoFncECUkXJr70qka9wvo%3D). Finally, create an empty `./output/models` folder and then run the following two cells.

The files tab is on the left in Google Colab. For the code here to work, all of the files above (the zip file and the three CSV files) need to be uploaded there.
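
The folders described above can also be created from a code cell. The following is a minimal sketch, assuming the notebook's working directory is the root of the Colab files tab. The `missing` check is only an illustrative helper and not part of the original template.

```python
from pathlib import Path

# Folders the rest of the notebook reads from and writes to.
data_dir = Path("./data/enriched")
model_dir = Path("./output/models")
for folder in (data_dir, model_dir):
    folder.mkdir(parents=True, exist_ok=True)

# Files that have to be uploaded manually (names taken from the download links above).
expected_files = [
    Path("cpv-master.zip"),
    data_dir / "data_tenderned.csv",
    data_dir / "test_data_tenderned_clean.csv",
    data_dir / "train_data_tenderned_and_ted_clean.csv",
]
missing = [str(path) for path in expected_files if not path.exists()]
print("Missing files:", missing if missing else "none")
```

If the printed list is not empty, upload the listed files before running the cells below.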

In [None]:
!pip install cpv-master.zip

In [None]:
import os
from pathlib import Path
import logging

import numpy as np
import pandas as pd
import dill as pickle

from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

import spacy


def assess_model_performance(y_hat, y_test):
    acc = metrics.accuracy_score(y_test, y_hat)
    prec = metrics.precision_score(y_test, y_hat, average="weighted")
    rec = metrics.recall_score(y_test, y_hat, average="weighted")
    f1 = metrics.f1_score(y_test, y_hat, average="weighted")
    print("accuracy: {0:.2g}".format(acc))
    print("precision: {0:.2g}".format(prec))
    print("recall: {0:.2g}".format(rec))
    print("f1-score: {0:.2g}".format(f1))


class LemmaTokenizerNoStopWords:
    """Performs lemmatization and excludes stop words using spaCy.

    Requires the Dutch spaCy model `nl_core_news_sm` to be installed.
    """

    def __init__(self):
        self.nlp = spacy.load("nl_core_news_sm")
        self.stopwords = self.nlp.Defaults.stop_words

    def __call__(self, doc):
        tokens = [token.lemma_.lower().strip() for token in self.nlp(doc)]
        return [t for t in tokens if t not in self.stopwords]


def remove_categories_not_in_tenderned(
    data: pd.DataFrame,
    data_path: str,
    y_label: str = 'afdeling',
) -> pd.DataFrame:
    """Using the Tenderned dataset as ground truth, remove all categories
    at the y_label level that are not in it (e.g. when using TED data as well).

    Parameters
    ----------
    data : pd.DataFrame
        Input data (e.g. from Tenderned + TED)
    data_path : str
    y_label : str

    Returns
    -------
    data_pruned : pd.DataFrame
        Output data containing only labels from tenderned in y_label column
    """
    data_tenderned = pd.read_csv(data_path)
    tenderned_labels = data_tenderned.loc[:, y_label].unique()
    data = data.loc[data.loc[:, y_label].isin(tenderned_labels), :]

    return data


def remove_classes_with_fewer_than_n_observations(
    df: pd.DataFrame, y_label: str, n: int = 49,
) -> pd.DataFrame:
    """If we perform classification at the groep, class, category or
    description level, we often do not have sufficient observations to
    properly train or even split into train and test set. Remove those entries.

    Parameters
    ----------
    df : pd.DataFrame
        Input data
    y_label : str
        Name of the target column
    n : int (default=49)
        Minimum number of class observations to keep class in dataset

    Returns
    -------
    df : pd.DataFrame
        Data without observations from classes with less than n entries
    """
    nrows = df.shape[0]
    cpv_freq_table = df[y_label].value_counts()
    cpv_codes_to_drop = (
        pd.DataFrame(cpv_freq_table).index[cpv_freq_table < n].values
    )
    df = df.loc[~df[y_label].isin(cpv_codes_to_drop), :]
    df = df.dropna(subset=[y_label], axis=0)

    logging.info(
        f"Removed {nrows - df.shape[0]} entries, because their classes "
        f'have fewer than {n} observations in target column "{y_label}"'
    )
    return df


def remove_classes_with_nan(df: pd.DataFrame, y_label: str) -> pd.DataFrame:
    """Remove all entries with nan in target variable

    Parameters
    ----------
    df : pd.DataFrame
        Input data
    y_label : str

    Returns
    -------
    df : pd.DataFrame
        Data without nans in target variable
    """
    return df.dropna(axis=0, subset=y_label)


def fit_and_cache_simple_pipeline(
    filename: str, y_label: str, n_min_observations: int=49, use_zwolle_codes: bool=False
):
    vectorizer = TfidfVectorizer(
        sublinear_tf=True,
        max_df=0.5,
        min_df=5,
        tokenizer=LemmaTokenizerNoStopWords(),
        token_pattern=None,
        max_features=None,
    )
    estimator = RidgeClassifier(alpha=0.75, solver="auto")

    pipeline = Pipeline(
        [
            ("vectorizer", vectorizer),
            ("estimator", estimator),
        ]
    )

    model_path = Path("./output/models")
    model_name = f"small_model_{filename}_lvl_{y_label}_minobs_{n_min_observations}.pkl"
    
    # Fit model on train set
    train_data_clean = pd.read_csv(f"./data/enriched/train_{filename}_clean.csv")
    X_train = remove_categories_not_in_tenderned(
        train_data_clean.copy(),
        data_path="./data/enriched/data_tenderned.csv",
        y_label=y_label,
    )
    X_train = remove_classes_with_fewer_than_n_observations(
        X_train,
        y_label,
        round(n_min_observations * 0.6),
    )
    X_train = remove_classes_with_nan(X_train, y_label)
    y_train = X_train.loc[:, y_label].copy()
    
    pipeline.fit(X_train.loc[:, 'Korte beschrijving aanbesteding'].values, y_train.values)

    # Evaluate on test set
    test_data_clean = pd.read_csv("./data/enriched/test_data_tenderned_clean.csv")

    X_test = test_data_clean.copy().loc[:, 'Korte beschrijving aanbesteding'].tolist()
    y_test = test_data_clean.loc[:, y_label].copy()
    
    y_hat = pipeline.predict(X_test)
    
    try:
        assess_model_performance(y_hat, y_test)
    except Exception as exc:
        # Report the reason, but keep going so the model is still cached below.
        print(f'Cannot perform model assessment: {exc}')

    # Cache model
    with open(model_path / model_name, "wb") as file:
        pickle.dump(pipeline, file)

# Running the classical model training

Now we are ready to train the classical ML models at different CPV code levels. This code takes a long time to run, so consider starting with only the `Omschrijving` level. I am currently fitting all models on my laptop. Once that is done, I will upload the pickle files to the blob store step by step and provide you with the download links. You can then add them directly to the `./output/models` folder.
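
Once a pickle file is in `./output/models`, loading it could look like the sketch below. It assumes the file name follows the pattern used in `fit_and_cache_simple_pipeline` above and that `dill` is installed. `cached_model_path` and `load_cached_pipeline` are hypothetical helper names, not part of the template. Because the pipeline contains a `LemmaTokenizerNoStopWords` instance, the spaCy model presumably has to be available in the session before unpickling. As with any pickle, only load files from a source you trust.

```python
from pathlib import Path


def cached_model_path(
    filename: str, y_label: str, n_min_observations: int = 49
) -> Path:
    """Rebuild the path that fit_and_cache_simple_pipeline writes to."""
    model_name = (
        f"small_model_{filename}_lvl_{y_label}_minobs_{n_min_observations}.pkl"
    )
    return Path("./output/models") / model_name


def load_cached_pipeline(filename: str, y_label: str, n_min_observations: int = 49):
    """Load a previously cached pipeline from disk (hypothetical helper)."""
    import dill as pickle  # imported lazily; same pickling library as used above

    path = cached_model_path(filename, y_label, n_min_observations)
    with open(path, "rb") as file:
        return pickle.load(file)


print(cached_model_path("data_tenderned_and_ted", "afdeling"))
```

A loaded pipeline can then be used like the fitted one, for example `load_cached_pipeline("data_tenderned_and_ted", "afdeling").predict(["some tender description"])`.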

In [None]:
filename = 'data_tenderned_and_ted'
for y_label in ['afdeling', 'groep', 'klasse', 'categorie', 'Omschrijving']:
    print(f'Fit model at the {y_label} level.')
    fit_and_cache_simple_pipeline(filename, y_label)

In [None]:
import random

# Function to generate a random team name
def generate_team_name():
    adjectives = ["Agile", "Brave", "Clever", "Daring", "Energetic", "Fearless", "Gallant", "Heroic", "Innovative", "Jovial", "Keen", "Loyal", "Mighty", "Noble", "Optimistic", "Persistent", "Quick", "Resilient", "Strong", "Tenacious", "Unyielding", "Valiant", "Wise", "Xenial", "Youthful", "Zealous"]
    nouns = ["Antelopes", "Bears", "Cheetahs", "Dolphins", "Elephants", "Foxes", "Giraffes", "Hawks", "Iguanas", "Jaguars", "Kangaroos", "Lions", "Monkeys", "Nightingales", "Owls", "Penguins", "Quails", "Rabbits", "Snakes", "Tigers", "Unicorns", "Vultures", "Wolves", "Xiphosuran", "Yaks", "Zebras"]
    adjective = random.choice(adjectives)
    noun = random.choice(nouns)
    return adjective.lower() + "_" + noun.lower()

# Generate a random team name
team_name = generate_team_name()
print("Generated team name: ", team_name)


Generated team name:  noble_monkeys


## Why not scrape some additional info from the web?

In [None]:
from bs4 import BeautifulSoup
import requests

# Function to extract first text paragraph after first h2 tag from website.
# (If you use https://cpvcodes.eu/en/{cpv_code}-cpv/, you get a detailed description for that code)
def extract_text(url):
    # A timeout keeps the cell from hanging if the site does not respond.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the first <h2> tag
    first_h2 = soup.find('h2')

    if first_h2:
        # Find the next <p> tag after the first <h2> tag
        first_p_after_h2 = first_h2.find_next('p')
        if first_p_after_h2:
            return first_p_after_h2.text
        else:
            return "No <p> tag found after the first <h2> tag."
    else:
        return "No <h2> tag found in the webpage."


# Test the function
url = "https://cpvcodes.eu/en/03211000-cpv/"
print(extract_text(url))

The Cereals category includes various types of grains that are commonly used as food sources. These grains are essential ingredients in many food products and are widely consumed worldwide. The subcategories within this category consist of Wheat, Maize (corn), Rice, Barley, Rye, Oats, Malt, and Grain products. Wheat is a versatile grain used in bread, pasta, and pastries. Maize is primarily used for animal feed and as a raw material for various food products. Rice is a staple food in many cultures and is consumed in various forms. Barley is often used in brewing and as a nutritious grain. Rye and oats are commonly used in bread, cereals, and other baked goods. Malt is a key ingredient in beer production. Grain products encompass a wide range of processed goods derived from grains, such as flour, cereals, and snacks.
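
To look up descriptions for several codes, the URL can be built from the code itself. This is a small sketch, assuming the site keeps the `https://cpvcodes.eu/en/{cpv_code}-cpv/` pattern mentioned in the comment above and expects the 8-digit code without the check digit, as in the test URL. `cpv_description_url` is a hypothetical helper.

```python
def cpv_description_url(cpv_code: str) -> str:
    """Build the cpvcodes.eu URL for a CPV code, with or without check digit."""
    # Drop an optional check digit (the part after the hyphen) and keep the 8 digits.
    digits = cpv_code.strip().split("-")[0]
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"Not a valid 8-digit CPV code: {cpv_code!r}")
    return f"https://cpvcodes.eu/en/{digits}-cpv/"


print(cpv_description_url("03211000"))
```

The result can be passed straight to `extract_text`, for example `extract_text(cpv_description_url("03211000"))`.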


In [2]:
!pip install openai


## We use a GPT-4o mini API from OpenAI in Azure AI services

Be aware that we pay for every 1000 tokens sent (€0.00014) and received (€0.0006)!
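
To get a feeling for what a batch of calls costs, the prices above can be turned into a rough estimate. The sketch below assumes the quoted prices apply per 1000 tokens exactly as stated. `estimate_cost_eur` is a hypothetical helper. For a real call, the token counts should be available on the response as `completion.usage.prompt_tokens` and `completion.usage.completion_tokens`.

```python
PRICE_PER_1K_SENT_EUR = 0.00014     # prompt tokens, from the note above
PRICE_PER_1K_RECEIVED_EUR = 0.0006  # completion tokens, from the note above


def estimate_cost_eur(prompt_tokens: int, completion_tokens: int) -> float:
    """Rough cost of a single request in euros."""
    return (
        prompt_tokens / 1000 * PRICE_PER_1K_SENT_EUR
        + completion_tokens / 1000 * PRICE_PER_1K_RECEIVED_EUR
    )


# Example: 1000 requests with 500 prompt tokens and 200 completion tokens each.
total = 1000 * estimate_cost_eur(500, 200)
print(f"Estimated cost: €{total:.2f}")  # prints "Estimated cost: €0.19"
```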

Feel free to adjust the below code for your hackathon project!

In [9]:
from google.colab import userdata

from openai import AzureOpenAI


# may change in the future
# https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
api_version = "2024-06-01"

# gets the API key from the Colab secret AZURE_OPENAI_API_KEY (via google.colab.userdata)
client = AzureOpenAI(
    api_version=api_version,
    api_key=userdata.get('AZURE_OPENAI_API_KEY'),
    # https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
    azure_endpoint=userdata.get('OPENAI_ENDPOINT'),
)

completion = client.chat.completions.create(
    model="hackathonllms2024-gpt4o",
    messages=[
        {
            "role": "user",
            "content": "Wat is de CPV code om een brug te bouwen uit beton?",
        },
    ],
)
print(completion.to_json())

{
  "id": "chatcmpl-9qcw4TRasAVkDHQoDL80WmGs9GVwV",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "De Common Procurement Vocabulary (CPV) code voor het bouwen van een brug uit beton valt onder de categorieën voor bouw- en constructiewerkzaamheden. De specifieke CPV-code voor bruggenbouw is:\n\n- **45221100-3** - Constructiewerk voor bruggen\n\nDeze code is niet gespecificeerd voor het materiaal beton, maar beschrijft in brede termen het constructiewerk voor bruggen. Voor nog meer specificiteit kan het nuttig zijn om aanvullende codes te gebruiken die relevante aspecten van betonconstructies aanduiden:\n\n- **45223200-8** - Bouwwerkzaamheden voor constructies van beton\n\nHet combineren van deze codes in uw aanbestedingsdocumenten kan helpen om duidelijkheid te verschaffen over de specifieke aard van de werkzaamheden die moeten worden uitgevoerd.\n\nHet is altijd goed om de specifieke context van uw pr

# Scoring the solution

Please make sure that your solution supports the following: it should take a DataFrame with the columns shown in the example below and return CPV codes as predictions, one for each input row. This is necessary to run the scoring algorithm. You can get the scoring data [here](https://stllmchallenge2024.blob.core.windows.net/data/scoring_dataset.csv?sp=r&st=2024-09-08T13:38:32Z&se=2024-09-11T21:38:32Z&spr=https&sv=2022-11-02&sr=b&sig=m19SyuKrSEl1nYqaV1z00Cg9MYmSwlBfX2kSuCFd%2FHY%3D).
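
A chat model answers in free text, as in the example response earlier in this notebook, so the codes have to be pulled out of the reply before they can be returned as predictions. Here is a minimal sketch, assuming codes appear in the usual form of eight digits, a hyphen and a check digit. `extract_cpv_codes` is a hypothetical helper, the `reply` string is made up for illustration, and the function does not check that a code actually exists in the CPV list.

```python
import re

# Eight digits, a hyphen and one check digit, e.g. 45221100-3.
CPV_PATTERN = re.compile(r"\b\d{8}-\d\b")


def extract_cpv_codes(text: str) -> list:
    """Return all CPV-like codes in the text, in order of appearance, without duplicates."""
    seen = []
    for code in CPV_PATTERN.findall(text):
        if code not in seen:
            seen.append(code)
    return seen


reply = "The CPV code is **45221100-3**, possibly combined with 45223200-8."
print(extract_cpv_codes(reply))  # ['45221100-3', '45223200-8']
```

Taking the first element of the returned list is one simple way to end up with a single prediction per row, with a fallback value for replies that contain no code at all.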

In [None]:
import pandas as pd

# pd.read_csv("scoring_dataset.csv")
X_test = pd.DataFrame(
    {
        "description": ["tender description 1", "tender description 2", "tender description 3"],
    }
)

def predict(X: pd.DataFrame) -> pd.DataFrame:
    # Take a DataFrame with a "description" column as input, use it to predict CPV codes
    # and put them in a "prediction" column of an output DataFrame!
    # Placeholder: returns the same dummy code for every input row.
    return pd.DataFrame({"prediction": ["12345678-0"] * len(X)})

predict(X_test)
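
One way to keep the one-prediction-per-row contract is to separate the per-description logic from the DataFrame handling. The sketch below assumes the input has a `description` column as in the example above. `predict_with` and `dummy_classifier` are hypothetical names, and the classifier is only a stand-in for a real model or LLM call.

```python
from typing import Callable

import pandas as pd


def predict_with(X: pd.DataFrame, classify: Callable[[str], str]) -> pd.DataFrame:
    """Apply `classify` to every description and return one prediction per row."""
    predictions = [classify(description) for description in X["description"]]
    return pd.DataFrame({"prediction": predictions}, index=X.index)


def dummy_classifier(description: str) -> str:
    # Stand-in for a real model or LLM call; always returns the same dummy code.
    return "12345678-0"


X_example = pd.DataFrame(
    {"description": ["tender description 1", "tender description 2"]}
)
result = predict_with(X_example, dummy_classifier)
print(result)
```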

In [None]:
!pip install git+https://github.com/RoyalHaskoningDHV/llm-hackathon2024@master

In [None]:
from llm_hackathon.scoring import score_solution

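# `predictions` must be the output of your `predict` function for the scoring dataset, e.g.:
# predictions = predict(pd.read_csv("scoring_dataset.csv"))
# The first returned value is the F1 score, the second is accuracy weighted by detail and importance to Zwolle.
score_solution(predictions)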