# Text Classification Exercise - Topic Detection
This exercise demonstrates how to use a large language model (LLM) to classify text into predefined topics and how to measure how well it performs. We'll explore topic detection using a small dataset and use a confusion matrix to evaluate the model's performance.

## Step 1: Environment Setup
We begin by loading the required environment variables using the `dotenv` package. This allows us to securely load API keys or other configuration settings from a `.env` file.

In [None]:
%load_ext dotenv
%dotenv ../../.env

## Step 2: Preparing the Dataset
The first step when evaluating the performance of a model is to ensure sufficient ground truth exists. Ground truth refers to the actual, correct labels or values in a dataset that serve as the standard for evaluating the model's performance. It is crucial because it provides a baseline for comparison, allowing us to assess how accurately a model's predictions align with real-world outcomes. Without ground truth, it would be impossible to measure the model's correctness or identify areas for improvement.

Ground truth data is often produced through manual labelling tasks in which human experts annotate data based on established criteria. While this process generally produces high-quality, accurate labels, it is both time-consuming and expensive, particularly for large datasets, because significant effort and attention are required.

For the purpose of this exercise, we will be assessing the performance of a simple, LLM-powered classification model. To generate the ground truth required to assess its performance, you will perform some manual labelling in the cell below. 

A dataset of fifteen sentences has been defined which we will attempt to classify into one of four categories based on the primary topic of discussion. For each sentence, fill in the empty string with the label you feel best represents the topic being discussed.

In [None]:
labels = ['Technology','Sports','Language','Food']

# From the given labels, add a ground truth topic label to each statement below. 
# The first statement has been labelled for you.
dataset = [ # [Topic, Text to classify]
    ['Technology', "Technology shapes our lives, from smartphones to algorithms. It drives innovation and connects us in ways we couldn't imagine"],
    ['', "AI ethics is a critical consideration in developing responsible algorithms."],
    ['', "Language is the expression of ideas through speech-sounds and words."], 
    ['', "Words are combined into sentences, answering to ideas into thoughts."], 
    ['', "Content moderation on social media platforms detect and filter out inappropriate language and harmful content to maintain a respectful and safe online environment"],
    ['', "Speech-to-text software has become crucial for accessibility, allowing users to transcribe spoken language into written text efficiently"],
    ['', "Language is a dynamic system of communication that evolves over time, reflecting cultural, social, and historical changes in society."],
    ['', "Golden State Warriors seek a second star alongside Stephen Curry."], 
    ['', "San Francisco 49ers maintain a successful offensive strategy."], 
    ['', "In the case of food establishments, like most sports, the first line of defense are the players in the game, which are the industry that produces the products."],
    ['', "After a thrilling soccer match, fans celebrate with stadium hot dogs and cold beverages."],
    ['', "Athletes know that proper nutrition is as crucial as their training regimen."],
    ['', "In the culinary Olympics, the gold medal goes to the chef who masters flavor balance."],
    ['', "Basketball players fuel up with protein-packed meals before hitting the court."],
    ['', "The marathon of cooking competitions leaves chefs both exhausted and exhilarated."],
]

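# Every sentence must carry one of the allowed labels (this also catches empty strings and typos).
for topic, text in dataset:
    if topic not in labels:
        topic_sentence: str = ', '.join(labels)
        raise ValueError(f'Ensure that each sentence is labelled with one of the following topics: {topic_sentence}')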

import pandas as pd
df = pd.DataFrame(dataset, columns=['Topic', 'Text'])

display(df)
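
Because accuracy can be misleading when classes are imbalanced, it can help to check how many sentences fall under each label before evaluating. The sketch below is illustrative only: `example_topics` is a hypothetical stand-in for the labels you assigned, and it uses the standard library's `Counter`. In the notebook itself, `df['Topic'].value_counts()` should give the same information.

```python
from collections import Counter

allowed_labels = ['Technology', 'Sports', 'Language', 'Food']

# Hypothetical ground-truth labels, used only to illustrate the check.
example_topics = ['Technology', 'Sports', 'Food', 'Sports', 'Language', 'Food', 'Sports']

# Count how often each label occurs; labels that never occur are reported as 0.
counts = Counter(example_topics)
label_counts = {label: counts.get(label, 0) for label in allowed_labels}

# Anything outside the allowed list is most likely a typo made while labelling.
unexpected = sorted(set(example_topics) - set(allowed_labels))

print(label_counts)
print(unexpected)
```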

## Step 3: Running the Topic Detection Task
Next, we define the classification model that we'll be evaluating. We're using LangChain, a widely used framework for building LLM-powered applications, with OpenAI's GPT-3.5 Turbo.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

tagging_prompt = ChatPromptTemplate.from_template(
    """
    Extract the desired information from the following passage.

    Only extract the properties mentioned in the 'Classification' function.

    Passage:
    {input}
    """
)

class Classification(BaseModel):
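    # Join the labels with ', ' so the model sees "Technology, Sports, Language, Food" rather than one run-together word
    Topic: str = Field(description="Choose at most one topic from this list: " + ', '.join(labels) + " that is most related to the content")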


llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo").with_structured_output(
    Classification
)

tagging_chain = tagging_prompt | llm
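
One way to reduce the risk of the model returning a topic outside the list is to constrain the field's type rather than relying on the description alone. The sketch below is an assumption, not part of the original exercise: `ConstrainedClassification` is a hypothetical variant of `Classification` that uses `typing.Literal`, so Pydantic rejects any other value. It makes no API call, and whether the model honours the constraint still depends on the provider's structured-output support.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class ConstrainedClassification(BaseModel):
    # Hypothetical variant: the type itself lists the only accepted topics.
    Topic: Literal['Technology', 'Sports', 'Language', 'Food'] = Field(
        description="The single topic that is most related to the content"
    )


# A value from the list validates normally.
accepted = ConstrainedClassification(Topic='Sports')

# A value outside the list raises a ValidationError instead of passing through silently.
rejected = False
try:
    ConstrainedClassification(Topic='Politics')
except ValidationError:
    rejected = True

print(accepted.Topic, rejected)
```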

### Labelling our Dataset
Now that we've built our model, we can use it to predict labels for our dataset. These labels can then be compared against the ground truth labels we defined earlier to assess the performance of the model.

In [None]:
results = []
expected_result = []
for idx, topic, text in df.itertuples():
    expected_result.append(topic)
    result = tagging_chain.invoke({"input": text})
    results.append(result.Topic)
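
The model is not guaranteed to return a label exactly as it appears in `labels`; it may vary the casing, add whitespace, or return something else entirely. Below is a minimal sketch of one way to handle this, assuming a hypothetical helper `normalise_label` and a hypothetical fallback value `'Unknown'` (neither is part of the original notebook):

```python
allowed_labels = ['Technology', 'Sports', 'Language', 'Food']


def normalise_label(raw: str, allowed: list[str], fallback: str = 'Unknown') -> str:
    """Map a raw model output onto an allowed label, ignoring case and surrounding whitespace."""
    lookup = {label.lower(): label for label in allowed}
    return lookup.get(raw.strip().lower(), fallback)


# Hypothetical raw outputs, including one that matches no allowed label.
raw_predictions = ['sports', ' Food ', 'Technology', 'Politics']
cleaned = [normalise_label(p, allowed_labels) for p in raw_predictions]
print(cleaned)
```

If a fallback label like this is used, it would also need to be included in the label list passed to the evaluation functions; as far as I know, scikit-learn's `confusion_matrix` leaves out samples whose label is not in `labels`.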

### Viewing the Results
Here, we display the ground truth and the predicted topic for each text entry. This will help us visually assess how well the model performs.

In [None]:
combined_data = list(zip(expected_result, results, df['Text']))
df2 = pd.DataFrame(combined_data, columns=['ground truth', 'predicted','text'])
pd.set_option('display.max_colwidth', None)
display(df2)

### Confusion Matrix and Performance Evaluation
We now evaluate the model's performance in greater detail using a confusion matrix.

A confusion matrix is a table used to evaluate the performance of a classification model. It compares the actual labels (ground truth) with the predicted labels from the model, providing a detailed breakdown of the model's performance across different classes.

For a multiclass (non-binary) classifier, the confusion matrix extends the binary case to more than two labels, showing how well the model performs on each class. The matrix is laid out as follows:

* The rows represent the actual (true) classes.
* The columns represent the predicted classes.
* Each cell in the matrix represents the number of instances where a specific true class was predicted as a specific predicted class.

For a non-binary classifier with n classes, the confusion matrix will be an n×n grid. 

A confusion matrix helps identify where a model makes errors, such as mistaking one class for another. This analysis is crucial for understanding the strengths and weaknesses of a classifier beyond simple accuracy, enabling more refined metrics like precision, recall, and F1-score.
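
To illustrate the layout described above, the following sketch builds a confusion matrix by hand for a small, made-up example. The data (`toy_true`, `toy_pred`) and the three-label list are hypothetical and unrelated to the exercise dataset; the point is only to show how rows (actual) and columns (predicted) are filled.

```python
toy_labels = ['Technology', 'Sports', 'Food']

# Hypothetical ground truth and predictions for six sentences.
toy_true = ['Technology', 'Sports', 'Sports', 'Food', 'Food', 'Food']
toy_pred = ['Technology', 'Sports', 'Food', 'Food', 'Food', 'Sports']

# Map each label to its row/column position.
index = {label: i for i, label in enumerate(toy_labels)}
n = len(toy_labels)

# matrix[row][col] counts sentences whose actual class is `row` and predicted class is `col`.
matrix = [[0] * n for _ in range(n)]
for actual, predicted in zip(toy_true, toy_pred):
    matrix[index[actual]][index[predicted]] += 1

for label, row in zip(toy_labels, matrix):
    print(f'{label:<12}{row}')
```

The diagonal holds the correct predictions (4 of 6 here). The off-diagonal cell in the Sports row and Food column records the one Sports sentence that was predicted as Food.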

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report

# Ground truth labels and model predictions
y_true = np.array(expected_result)
y_pred = np.array(results)

# Compute the confusion matrix
cm = confusion_matrix(y_true, y_pred, labels=labels)

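# Per-class precision, recall and F1; zero_division=0 reports 0 when a metric's denominator is zero
report = classification_report(y_true, y_pred, labels=labels, target_names=labels, zero_division=0)
print(report)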

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()

plt.show()

### Detailed Analysis Using TP, FP, TN, and FN
We can dig deeper into the model's performance by evaluating each class separately. We do this by treating each class as its own binary classification task: a sentence either belongs to the class (positive) or it does not (negative). We then produce a distinct confusion matrix for each class and calculate metrics such as accuracy, precision, recall and F1 score. To understand these metrics, we must first understand how binary classification tasks are assessed.

* True Positives (TP): The model correctly predicted the positive class.
* False Positives (FP): The model incorrectly predicted the positive class when it was actually negative.
* True Negatives (TN): The model correctly predicted the negative class.
* False Negatives (FN): The model incorrectly predicted the negative class when it was actually positive.
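
To make the one-vs-rest idea concrete, the sketch below counts the four outcomes for a single class. The helper `one_vs_rest_counts` and the toy data are hypothetical and are not part of the original notebook.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Return (TN, FP, FN, TP) when `positive` is treated as the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # correctly predicted the positive class
        elif actual != positive and predicted == positive:
            fp += 1  # predicted positive, but it was actually negative
        elif actual == positive and predicted != positive:
            fn += 1  # predicted negative, but it was actually positive
        else:
            tn += 1  # correctly predicted the negative class
    return tn, fp, fn, tp


# Hypothetical ground truth and predictions for six sentences.
toy_true = ['Technology', 'Sports', 'Sports', 'Food', 'Food', 'Food']
toy_pred = ['Technology', 'Sports', 'Food', 'Food', 'Food', 'Sports']

food_counts = one_vs_rest_counts(toy_true, toy_pred, 'Food')
sports_counts = one_vs_rest_counts(toy_true, toy_pred, 'Sports')
print(food_counts, sports_counts)
```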

These data points can be summarised by the following metrics.

* Accuracy measures the overall correctness of a model by calculating the proportion of correctly predicted instances (both true positives and true negatives) out of all predictions. It is useful when classes are balanced but can be misleading if they are imbalanced.

* Precision (or Positive Predictive Value) focuses on the quality of positive predictions. It calculates the proportion of correctly predicted positive instances (true positives) out of all instances predicted as positive (true positives + false positives). High precision means fewer false positives.

* Recall (or Sensitivity) measures the model's ability to correctly identify all positive instances. It calculates the proportion of actual positives that were correctly predicted (true positives) out of all actual positive cases (true positives + false negatives). High recall means fewer false negatives.

* F1 score combines precision and recall into a single number, defined as their harmonic mean. Because the harmonic mean is pulled towards the lower of the two values, it penalizes models that perform well on precision but poorly on recall, or vice versa.
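
As a worked example of these definitions, the sketch below computes all four metrics from a set of made-up counts. The function name `summarise` and the counts are hypothetical, and reporting 0.0 for a zero denominator is a convention chosen for this sketch rather than the only option.

```python
def summarise(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute accuracy, precision, recall and F1 from the four outcome counts."""

    def ratio(numerator, denominator):
        # Report 0.0 instead of dividing by zero (a convention for this sketch).
        return numerator / denominator if denominator else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)   # correct predictions / all predictions
    precision = ratio(tp, tp + fp)                 # correct positives / predicted positives
    recall = ratio(tp, tp + fn)                    # correct positives / actual positives
    f1 = ratio(2 * precision * recall, precision + recall)  # harmonic mean of the two
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# Hypothetical counts for one class out of fifteen sentences.
scores = summarise(tn=8, fp=2, fn=1, tp=4)
for name, value in scores.items():
    print(f'{name}: {value:.3f}')
```

With these counts, precision (4/6) is lower than recall (4/5), and the F1 score (8/11) falls between the two, closer to the lower value.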

In [None]:
from sklearn.metrics import multilabel_confusion_matrix

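def safe_divide(numerator, denominator):
    # Return 0.0 when the denominator is zero, matching zero_division=0 used above
    return numerator / denominator if denominator else 0.0

def calculate_score(TN, FP, FN, TP):
    # Sensitivity, hit rate, recall, or true positive rate
    TPR = safe_divide(TP, TP + FN)
    # Specificity or true negative rate
    TNR = safe_divide(TN, TN + FP)
    # Precision or positive predictive value
    PPV = safe_divide(TP, TP + FP)
    # Negative predictive value
    NPV = safe_divide(TN, TN + FN)
    # Fall out or false positive rate
    FPR = safe_divide(FP, FP + TN)
    # False negative rate
    FNR = safe_divide(FN, TP + FN)
    # False discovery rate
    FDR = safe_divide(FP, TP + FP)

    # Overall accuracy
    ACC = safe_divide(TP + TN, TP + FP + FN + TN)
    # F1 score: harmonic mean of precision and recall
    F1 = safe_divide(2 * PPV * TPR, PPV + TPR)

    # Only the metrics discussed above are returned; the others are shown for reference
    return ACC, PPV, TPR, F1

# One 2x2 matrix per class, each laid out as [[TN, FP], [FN, TP]]
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=labels)
calculation_result = []
for label, cm_i in zip(labels, mcm):
    # Display the one-vs-rest confusion matrix for this class
    disp2 = ConfusionMatrixDisplay(confusion_matrix=cm_i, display_labels=['Not ' + label, label])
    disp2.plot()

    tn, fp, fn, tp = cm_i.ravel()
    acc, prec, rec, f1 = calculate_score(tn, fp, fn, tp)
    calculation_result.append([label, tn, fp, fn, tp, acc, prec, rec, f1])

df2 = pd.DataFrame(calculation_result, columns=['Topic', 'TN', 'FP', 'FN', 'TP', 'Accuracy', 'Precision', 'Recall', 'F1'])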
pd.set_option('display.max_colwidth', None)
display(df2)

plt.show()