# Class and genre classification


This notebook focuses on classifying the corrected documents by genre and IPTC topic class.

The approach taken in this notebook assumes that classifying the text can be performed by a much simpler model than an LLM. The train set will therefore be used to generate a silver-label dataset, on which a BERT model will be trained. There will be two models:

- BERT G: Classifies genres
- BERT T: Classifies Topics

In [1]:
#import config  # Import your config.py file; this contains your OpenAI API key
import pandas as pd
import numpy as np
import os
from llm_comparison_toolkit import (RateLimiter, get_response_openai, get_response_anthropic,
                                    create_config_dict_func, use_df_to_call_llm_api,
                                    compare_request_configurations)
from evaluation_funcs import evaluate_correction_performance, evaluate_correction_performance_folders, get_metric_error_reduction
import seaborn as sns
import matplotlib.pyplot as plt

from helper_functions import files_to_df_func

dev_transcripts = 'data/dev_data_transcript'

#load the dev and test sets for prompt development and selection
dev_data_df = pd.read_csv('data/dev_data_raw.csv')
test_data_df = pd.read_csv('data/test_data_raw.csv')

## Genre classes

This paper uses the classes below to categorise the article types. As the prompt shows, no definitions are given for the classes; their interpretation is left up to the LLM.


In [2]:
genre_prompt = """
Read following article.
:::
{truncated_text}
:::

You are a machine that classifies newspaper articles. Your response  is limited to choices from the following json
        {0: 'news report',
        1: 'editorial',
        2: 'letter',
        3: 'advert',
        4: 'review',
        5: 'poem/song/story',
        6: 'other'}
        you will respond using a single digit.

    For example given the text "Mr Bronson died today, he was a kind man" your answer would be
    6
    
    Alternatively given the text "The prime minster spoke at parliament today" your answer would be
    0
    """

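The prompt asks for a single digit, but a model may still wrap the answer in extra text. A minimal sketch of how such a response could be turned into a label is shown below. The helper `parse_genre_response` and the `GENRE_LABELS` dictionary are hypothetical names introduced for illustration and are not part of `llm_comparison_toolkit`; returning `None` for unparseable responses is an assumption, chosen so that bad responses can be dropped from the silver-label set rather than silently mapped to a class.

```python
import re
from typing import Optional

# Hypothetical lookup mirroring the classes listed in genre_prompt
GENRE_LABELS = {
    0: 'news report',
    1: 'editorial',
    2: 'letter',
    3: 'advert',
    4: 'review',
    5: 'poem/song/story',
    6: 'other',
}


def parse_genre_response(response: str) -> Optional[int]:
    """Return the first standalone digit 0-6 in the response, or None if there is none."""
    match = re.search(r'\b[0-6]\b', response)
    return int(match.group()) if match else None
```

For example, `GENRE_LABELS[parse_genre_response("0")]` gives `'news report'`, while a response with no valid digit gives `None` and can be filtered out before training.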


## IPTC classes

The prompt below is for the IPTC topic classes, using the level-one classification nomenclature.

In [3]:
IPTC_prompt = """
Read following article.
:::
{truncated_text}
:::

You are a machine that classifies newspaper articles. Your response  is limited to choices from the following json
    {0: 'arts, culture, entertainment and media',
    1: 'crime, law and justice',
    2: 'disaster, accident and emergency incident',
    3: 'economy, business and finance',
    4: 'education',
    5: 'environment',
    6: 'health',
    7: 'human interest',
    8: 'labour',
    9: 'lifestyle and leisure',
    10: 'politics',
    11: 'religion',
    12: 'science and technology',
    13: 'society',
    14: 'sport',
    15: 'conflict, war and peace',
    16: 'weather'}
    you will respond using a numeric python list.

For example given the text "The War with spain has forced schools to close" your answer would be
[15,4]

Alternatively given the text "The prime minster spoke at parliament today" your answer would be
[10]
"""

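Because an article can carry several topics, the response is a list, which maps naturally onto a multi-hot target for a multi-label classifier. The sketch below shows one way the response could be parsed and encoded. The helpers `parse_iptc_response` and `to_multi_hot` are hypothetical names introduced for illustration, and discarding out-of-range or non-integer entries is an assumption rather than the behaviour of the toolkit used in this notebook.

```python
import ast

N_IPTC_CLASSES = 17  # classes 0-16 in IPTC_prompt


def parse_iptc_response(response: str) -> list:
    """Extract the valid class ids from a response such as '[15,4]'."""
    start = response.find('[')
    end = response.find(']', start) if start != -1 else -1
    if end == -1:
        return []
    try:
        # literal_eval only evaluates Python literals, so it is safe on model output
        parsed = ast.literal_eval(response[start:end + 1])
    except (ValueError, SyntaxError):
        return []
    # Keep genuine integers in range; bool is excluded because it subclasses int
    return sorted({
        x for x in parsed
        if isinstance(x, int) and not isinstance(x, bool) and 0 <= x < N_IPTC_CLASSES
    })


def to_multi_hot(class_ids, n_classes=N_IPTC_CLASSES):
    """Encode class ids as a 0/1 list of length n_classes."""
    ids = set(class_ids)
    return [1 if i in ids else 0 for i in range(n_classes)]
```

An empty list signals that nothing usable was found, so the caller can decide whether to drop the article or retry the request.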
# Creating model configs

For simplicity, the same functions used for cleaning the OCR will be used here. This means that each text will create its own file containing at most a few numbers. There are obviously more efficient ways of doing this.

In [4]:
folder_path = dev_transcripts

df = files_to_df_func(folder_path)

In [5]:
groq_alt_endpoint = {'alt_endpoint':{'base_url':'https://api.groq.com/openai/v1',
                     'api_key':os.getenv("GROQ_API_KEY")}}

basic_model_configs = pd.DataFrame({
    'get_response_func': [get_response_openai, get_response_openai, get_response_anthropic, get_response_anthropic, 
                          get_response_openai, get_response_openai, get_response_openai], 
    'engine': ['gpt-3.5-turbo', 'gpt-4-turbo-preview', "claude-3-haiku-20240307", "claude-3-opus-20240229", 
               'mixtral-8x7b-32768', 'llama2-70b-4096', 'gemma-7b-it'],
    'additional_args': [
        {}, {}, {}, {}, 
        groq_alt_endpoint, 
        groq_alt_endpoint, 
        groq_alt_endpoint
    ]
})

genre_model_configs= []

for index, row in basic_model_configs.iterrows():
    # Copy additional_args before setting the response name: the Groq rows
    # share one dict, so modifying it in place would affect other configs.
    additional_args = {**row['additional_args'], 'response_name': 'genre'}
    genre_model_configs.append(
        create_config_dict_func(
            get_response_func=row['get_response_func'],
            rate_limiter=RateLimiter(40000),
            engine=row['engine'],
            system_message_template="",
            prompt_template=genre_prompt,
            additional_args=additional_args,
        )
    )

compare_request_configurations(df, genre_model_configs, folder_path='./data/dev_genre')
   

KeyError: 'id'