# Comparative Analysis of Zero-Shot Learning and Supervised Learning for Spam Detection: A Study Using GPT-4 and LSTM Models

## Introduction

In the realm of Machine Learning and Natural Language Processing (NLP),
various approaches exist to solve classification problems. Three
commonly used methods are zero-shot learning, few-shot learning, and
traditional supervised learning. In this analysis, we aim to compare
these approaches, particularly in the context of a text classification
problem.

### Dataset and Classification Problem

Our task revolves around the SMS Spam Collection Dataset, a public set
of labeled SMS messages collected for mobile phone spam research. It
includes a collection of more than 5,000 SMS messages, which have been
manually labeled as either 'spam' or 'ham' (a term used to denote a
non-spam message). The aim is to create models that can accurately
classify new, unseen messages into these categories.

> **Spam Classification Dataset:** The UCI Machine Learning Repository
contains an SMS Spam Collection dataset that can be used for binary text
classification: spam or ham (non-spam). The dataset is available
[here](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).


### Issue with Imbalanced Data

One significant aspect of this dataset, and many real-world datasets, is
class imbalance. That is, the number of 'ham' messages significantly
outweighs the number of 'spam' messages. This imbalance can present
challenges in model training, as models may become biased towards
predicting the majority class. In our case, a model might lean towards
predicting 'ham' more often as it could still achieve a seemingly high
accuracy that way. We will address this issue and present strategies for
handling such class imbalances in our analysis.
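
To make the risk concrete, here is a minimal sketch that scores a trivial classifier which always predicts 'ham'. The labels are synthetic: they only assume a split of 87 'ham' to 13 'spam' messages, roughly the ratio in this dataset, and nothing here is part of the actual analysis.

```python
# Synthetic labels: 87 'ham' and 13 'spam' (an assumed, illustrative split)
labels = ["ham"] * 87 + ["spam"] * 13

# A trivial "model" that ignores its input and always predicts the majority class
predictions = ["ham"] * len(labels)

# Accuracy: fraction of all messages labeled correctly
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Spam recall: fraction of actual spam messages that were flagged as spam
spam_total = labels.count("spam")
spam_caught = sum(p == "spam" and t == "spam" for p, t in zip(predictions, labels))
spam_recall = spam_caught / spam_total

print(f"Accuracy: {accuracy:.2f}")
print(f"Spam recall: {spam_recall:.2f}")
```

The accuracy looks respectable even though not a single spam message is caught, which is why accuracy alone is a poor guide on imbalanced data.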

### Model Assessment Metrics and Evaluation

To evaluate the performance of our models, we use three key metrics:
Precision, Recall, and F1 Score. Precision measures the proportion of
true positive predictions among all positive predictions, while Recall
(also known as Sensitivity) measures the proportion of actual positives
that were correctly identified. The F1 Score is the harmonic mean of
Precision and Recall and gives us a single metric that takes both false
positives and false negatives into account.
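
Written in terms of confusion-matrix counts, with 'spam' taken as the positive class and $TP$, $FP$, and $FN$ denoting true positives, false positives, and false negatives, the three metrics are:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$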

The Area Under the ROC Curve (AUC) is another common evaluation metric;
it summarizes a model's performance across all classification
thresholds. However, since we are using the OpenAI API for zero-shot and
few-shot learning, and the text responses we parse do not include the
probability scores needed to compute the AUC, we will not use the AUC in
this analysis. Precision, Recall, and F1 Score can be computed for every
model from its predicted labels alone, so they remain a workable basis
for comparing our models.

With the dataset, problem, challenges, and evaluation metrics defined,
our aim is to compare the efficacy of zero-shot learning, few-shot
learning, and traditional supervised learning techniques in solving this
text classification problem. This comparison will provide insights into
the strengths and weaknesses of each approach and guide us towards the
most suitable method for this particular task.

## Zero-Shot In-Context Learning Using GPT-4 Model

In-context learning refers to a method where a model performs a task
using only the instructions and examples supplied in its prompt, without
any update to its weights. GPT-4, a transformer-based language model
developed by OpenAI, conditions its output on everything placed in its
context window.

In our scenario, we employ what is known as zero-shot learning in
conjunction with in-context learning. Zero-shot learning is a type of
machine learning where the model is asked to make predictions on data
categories it has not explicitly seen during training. In other words,
the model uses the generalized understanding it has developed during its
pre-training phase to make predictions on unseen data. It is an
appealing approach, especially in situations where labelled training
data for specific tasks might be scarce or unavailable.

In the context of our task, a text classification problem involving
categorizing SMS messages as 'spam' or 'ham', the prompt for zero-shot
learning using GPT-4 takes the following general structure:

```text
    prompt = (
        "Question: Given the following content, is it of type A or type B?\n"
        f"Content: {content}\n"
        "Answer Choices: (A) Type A, (B) Type B."
    )
```

In our case, 'Type A' corresponds to 'spam' and 'Type B' corresponds to
'ham', and `{content}` represents the SMS message we want the model to
classify.
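
As an illustration, the template can be wrapped in a small function that fills in one message at a time. This is only a sketch: `build_zero_shot_prompt` is a hypothetical helper name, and the wording substitutes 'spam' and 'ham' for the generic types as described above.

```python
def build_zero_shot_prompt(content: str) -> str:
    """Fill the zero-shot template with one SMS message (hypothetical helper)."""
    return (
        "Question: Given the following content, is it spam or ham?\n"
        f"Content: {content}\n"
        "Answer Choices: (A) spam, (B) ham."
    )


example = build_zero_shot_prompt("WINNER!! Claim your prize now.")
print(example)
```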

One challenge we faced when employing this approach was the class
imbalance in our dataset: 'ham' messages significantly outnumber 'spam'
messages. Approximately 87% of the messages are 'ham', leaving only
about 13% as 'spam'. Such an imbalance can skew evaluation, because a
model that favors the majority class, in this case 'ham', still scores
well on most messages.

To address this issue, we utilized a balanced sample for model
assessment. This was achieved by randomly sampling an equal number of
'spam' and 'ham' messages, ensuring equal representation of both classes
in our test set.

The performance of zero-shot in-context learning with GPT-4 was
evaluated using Precision, Recall, and F1 Score. Together, these metrics
account for true positives, false positives, and false negatives.
Although we cannot calculate the AUC from the API's text responses,
these metrics still give a useful picture of the effectiveness of the
zero-shot in-context learning approach.

In conclusion, zero-shot in-context learning with GPT-4 offers an
appealing approach to our text classification task, generalizing
without explicit task-related training. Note that the class imbalance
is handled by our balanced evaluation sample, not by the model itself.
As we will explore in the following sections, other learning methods can
also be applied and compared on this task.

In [1]:
import pandas as pd
import requests
import zipfile
from io import BytesIO

In [2]:
# URL of the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"

# Send an HTTP request to the URL of the zipfile and fail early on an HTTP error
r = requests.get(url)
r.raise_for_status()

# Create a zipfile object from the content of the HTTP response
z = zipfile.ZipFile(BytesIO(r.content))

# Extract the content of the zipfile
z.extractall()

# Read the dataset
df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "message"])

# Print the first few rows to check if the data has been read correctly
print(df.head(10))

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5  spam  FreeMsg Hey there darling it's been 3 week's n...
6   ham  Even my brother is not like to speak with me. ...
7   ham  As per your request 'Melle Melle (Oru Minnamin...
8  spam  WINNER!! As a valued network customer you have...
9  spam  Had your mobile 11 months or more? U R entitle...


In [3]:
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

In [None]:
# Smoke test: send a single hand-written example to the chat endpoint.
# Each entry in `messages` must be a dict with "role" and "content" keys.
completion = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": (
                "Question: Given the following email content, is the email spam or not spam?\n"
                "Email Content: Dear user, you have won a million dollars! Click here to claim your prize.\n"
                "Answer Choices: (A) Spam, (B) Not Spam."
            ),
        },
    ],
)

print(completion.choices[0].message)

In [18]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Build a class-balanced evaluation sample of 200 messages (100 per class)

# Separate the dataset into spam and ham
spam_df = df[df["label"] == "spam"]
ham_df = df[df["label"] == "ham"]

# Sample 100 messages from each
spam_sample = spam_df.sample(n=100, random_state=1)
ham_sample = ham_df.sample(n=100, random_state=1)

# Concatenate the samples to create a balanced sample of 200
balanced_sample = pd.concat([spam_sample, ham_sample])

# Shuffle the sample to ensure randomness
balanced_sample = balanced_sample.sample(frac=1, random_state=1)

# Balanced messages and labels
messages = balanced_sample["message"]
labels = balanced_sample["label"]


openai.api_key = os.getenv("OPENAI_API_KEY")

predictions = []

for i, (msg, true_label) in enumerate(zip(messages, labels)):
    prompt = (
        "Question: Given the following email content, is the email spam or ham?\n"
        f"Email Content: {msg}\n"
        "Answer Choices: (A) spam, (B) ham."
    )

    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant capable of identifying spam emails.",
            },
            {"role": "user", "content": prompt},
        ],
    )

    print(f"Message {i+1}: {msg}")
    response = completion.choices[0].message["content"].lower()
    print("Response:", response)

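    # Extract the prediction: any reply mentioning "spam" counts as spam;
    # everything else, including unexpected replies, falls back to ham
    predicted_label = "spam" if "spam" in response else "ham"
    predictions.append(predicted_label)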

# Calculate precision, recall, and F1 score
precision = precision_score(labels, predictions, pos_label="spam")
recall = recall_score(labels, predictions, pos_label="spam")
f1 = f1_score(labels, predictions, pos_label="spam")

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Message 1: HMV BONUS SPECIAL 500 pounds of genuine HMV vouchers to be won. Just answer 4 easy questions. Play Now! Send HMV to 86688 More info:www.100percent-real.com
Response: (a) spam
Message 2: EASTENDERS TV Quiz. What FLOWER does DOT compare herself to? D= VIOLET E= TULIP F= LILY txt D E or F to 84025 NOW 4 chance 2 WIN £100 Cash WKENT/150P16+
Response: (a) spam
Message 3: Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged
Response: (a) spam
Message 4: Shall i come to get pickle
Response: (b) ham
Message 5: That's the trouble with classes that go well - you're due a dodgey one … Expecting mine tomo! See you for recovery, same time, same place 
Response: (b) ham
Message 6: Going thru a very different feeling.wavering decisions and coping up with the same is the same individual.time will heal everything i believe.
Response: (b) ham
Message 7: Hello darling how are you today? I 

### Performance of Zero-Shot In-Context Learning with GPT-4

The performance of the GPT-4 model for our SMS spam classification
problem was evaluated using three metrics: precision, recall, and F1
score. Here is a table summarizing the obtained values:

|       | Precision | Recall | F1 Score |
|-------|-----------|--------|----------|
| GPT-4 | 0.92      | 0.97   | 0.95     |

Let's interpret these results:

1. **Precision:** Precision is a measure of the model's accuracy considering only the predicted positive instances. In the context of our problem, a precision of 0.92 means that when GPT-4 predicts a message to be 'spam', it is correct 92% of the time. This level of precision indicates a relatively low rate of false positive classifications (i.e., 'ham' messages incorrectly classified as 'spam').

2. **Recall:** Recall, also known as sensitivity or true positive rate, measures the proportion of actual positives that are correctly identified. In our scenario, a recall of 0.97 means that the model identifies 97% of all actual 'spam' messages correctly. However, it also indicates a 3% rate of false negatives (i.e., 'spam' messages incorrectly classified as 'ham').

3. **F1 Score:** The F1 score is the harmonic mean of precision and recall, and it gives a balanced measure of these two metrics. An F1 score of 0.95 indicates a balance between precision and recall, demonstrating the model's robustness in handling both false positives and false negatives.

The performance of the GPT-4 model on this task is quite remarkable
given it is a zero-shot learning scenario. However, we can further
improve this performance, or at least attempt to, by employing other
learning techniques, such as few-shot learning and supervised learning.
The following sections will delve into these other approaches.

## Comparing In-Context Learning with Supervised Learning using LSTM

In our analysis, we have employed two distinct methodologies for text
classification, namely in-context learning using GPT-4 and a supervised
learning approach using an LSTM model. Both of these methods have shown
promising results, and now we aim to compare them and understand their
differences and respective advantages.

### In-Context Learning with GPT-4

In the in-context learning scenario, we used OpenAI's GPT-4, a
Transformer-based language model. For this task, we did not train or
fine-tune GPT-4; we gave it a prompt containing the message and the two
answer choices and asked it to predict whether the message is spam or
ham. We used the responses from GPT-4 to generate classification labels
and evaluated the performance using precision, recall, and F1 score.

The use of GPT-4 simplifies the process since it does not require
explicit feature engineering or manual training of a machine learning
model. The downside, however, is that GPT-4 is a black-box model with
limited interpretability and could be sensitive to the prompt design and
wording, which could potentially affect the model's predictions.
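
Because the label is recovered from free-form text, the parsing step is one place where prompt wording matters. The following is a minimal sketch of a stricter parser. `parse_label` is a hypothetical helper, not the code used to produce the results in this notebook, and it assumes replies resemble the "(a) spam" and "(b) ham" patterns seen in the outputs above.

```python
import re
from typing import Optional


def parse_label(response: str) -> Optional[str]:
    """Map a free-text model reply to 'spam', 'ham', or None if unclear."""
    text = response.lower()
    # Whole-word matches only, so substrings of other words are ignored
    has_spam = re.search(r"\bspam\b", text) is not None
    has_ham = re.search(r"\bham\b", text) is not None
    if has_spam and not has_ham:
        return "spam"
    if has_ham and not has_spam:
        return "ham"
    # Both labels or neither label mentioned: leave the decision to the caller
    return None


print(parse_label("(A) Spam"))
print(parse_label("Answer: ham"))
print(parse_label("Not spam, so ham."))
```

Returning `None` for ambiguous replies makes it possible to count or re-query them instead of silently assigning them to one class.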

### Supervised Learning using LSTM

On the other hand, we employed a Long Short-Term Memory (LSTM) model, a
type of recurrent neural network, in a supervised learning context.
LSTMs are particularly useful for sequence prediction problems as they
can store past information, making them ideal for text classification
tasks like ours.

Our LSTM model was trained on preprocessed sequences of text from the
SMS messages and their corresponding labels (spam or ham). We used an
Embedding layer for converting words to vectors, an LSTM layer for
learning the sequences within the text, and a Dense layer for output.
Because the labels are one-hot encoded into two columns, the output
layer is a two-unit softmax, and we trained the model using the Adam
optimizer and the categorical cross-entropy loss function.

The LSTM model has shown high precision, recall, and F1 Score,
demonstrating its effectiveness in text classification tasks. However,
it's worth noting that building and training the LSTM model required
more effort compared to using GPT-4, including preprocessing the text
data, designing the model architecture, and training and tuning the
model.

### Addressing Imbalanced Data

We noticed that the dataset is imbalanced, with 'ham' messages
significantly outnumbering 'spam' messages. This can cause the model to
be biased towards the majority class, potentially leading to poor
performance for the minority class. However, despite this potential
issue, our LSTM model achieved high precision, recall, and F1 Score,
indicating effective performance.

For models that are affected by class imbalance, strategies such as
resampling the data, assigning class weights, or using different
evaluation metrics can be applied. However, these strategies were not
required in our case as the LSTM model demonstrated high performance
despite the class imbalance.
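
For reference, here is a minimal sketch of the class-weight strategy. It uses the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is also what scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the synthetic 87/13 label list are assumptions for illustration; this step was not part of our training run.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Synthetic labels with the approximate 87/13 split quoted for this dataset
labels = ["ham"] * 87 + ["spam"] * 13
weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives the larger weight, so each misclassified 'spam' message contributes more to the loss. In Keras, weights of this kind can be passed to `model.fit` through its `class_weight` argument, keyed by class index rather than by label string.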

In conclusion, both in-context learning with GPT-4 and supervised
learning with LSTM have shown to be effective for our text
classification task. The choice between these methods will depend on the
specific requirements and constraints of your task, such as the
availability of labeled data, computational resources, and the
importance of model interpretability.

In [20]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
)
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

# Maximum vocabulary size (keep only the most frequent words).
MAX_NB_WORDS = 50000
# Maximum number of tokens kept per message; shorter messages are padded.
MAX_SEQUENCE_LENGTH = 250
# Dimensionality of the learned word embeddings.
EMBEDDING_DIM = 100

tokenizer = Tokenizer(
    num_words=MAX_NB_WORDS,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~',
    lower=True,
)
tokenizer.fit_on_texts(df["message"].values)
word_index = tokenizer.word_index
print("Found %s unique tokens." % len(word_index))

X = tokenizer.texts_to_sequences(df["message"].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print("Shape of data tensor:", X.shape)

# One-hot encode the labels; columns are in sorted order: 0 = ham, 1 = spam
Y = pd.get_dummies(df["label"]).values
print("Shape of label tensor:", Y.shape)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.10, random_state=42
)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

Found 9011 unique tokens.
Shape of data tensor: (5572, 250)
Shape of label tensor: (5572, 2)
(5014, 250) (5014, 2)
(558, 250) (558, 2)


In [21]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation="softmax"))
model.compile(
    loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)

epochs = 5
batch_size = 64

history = model.fit(
    X_train,
    Y_train,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.1,
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [24]:
import numpy as np

Y_pred = model.predict(X_test)

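# Convert one-hot rows and softmax outputs to class indices (0 = ham, 1 = spam)
Y_test_class = np.argmax(Y_test, axis=1)
Y_pred_class = np.argmax(Y_pred, axis=1)

# pos_label defaults to 1, so the scores below are for the 'spam' class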
precision = precision_score(Y_test_class, Y_pred_class)
recall = recall_score(Y_test_class, Y_pred_class)
f1 = f1_score(Y_test_class, Y_pred_class)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Precision: 0.97
Recall: 0.96
F1 Score: 0.97


### Performance of Supervised Learning with LSTM

The performance of the Long Short-Term Memory (LSTM) model, a type of
recurrent neural network, for our SMS spam classification problem was
evaluated using three key metrics: precision, recall, and the F1 score.
The following table summarizes the obtained results:

|      | Precision | Recall | F1 Score |
|------|-----------|--------|----------|
| LSTM | 0.97      | 0.96   | 0.97     |

Here is an interpretation of these results:

1. **Precision:** A precision of 0.97 for the LSTM model indicates that when the model predicts a message to be 'spam', it is correct about 97% of the time. This represents a high degree of accuracy in the prediction of spam messages, indicating a minimal rate of 'ham' messages incorrectly classified as 'spam' (false positives).

2. **Recall:** A recall value of 0.96 implies that the LSTM model correctly identifies 96% of all actual 'spam' messages. However, it also suggests a 4% rate of 'spam' messages incorrectly classified as 'ham' (false negatives).

3. **F1 Score:** The F1 score, which is the harmonic mean of precision and recall, stands at 0.97. This score illustrates a solid balance between precision and recall, showing the model's effectiveness in managing both false positives and false negatives.

The LSTM model demonstrates a strong performance for this text
classification problem. Despite being a more complex and time-consuming
approach compared to in-context learning with GPT-4, the LSTM model
presents its own advantages. Particularly, it tends to perform better
with larger datasets and when more control over the model's parameters
is required. The LSTM model's performance is especially remarkable
considering the imbalanced nature of the dataset, emphasizing the
robustness of this supervised learning technique.

## Summary and Comparison of Zero-Shot Learning and Supervised Learning 

In this study, we utilized the SMS Spam Collection dataset to compare
two distinctive approaches for classifying messages as either "spam" or
"ham" (non-spam): zero-shot learning using the GPT-4 model, and
supervised learning using an LSTM-based deep learning model.

The zero-shot learning approach involves utilizing an AI model,
specifically the GPT-4 model, which has not been explicitly trained on
the task at hand (spam detection), but leverages its general language
understanding abilities to make predictions. On the other hand, the LSTM
model is a deep learning model that is trained on a labeled dataset to
predict whether a message is spam or not, thus representing a
traditional supervised learning approach.

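We faced a notable challenge in this experiment: the unbalanced
distribution of spam and ham messages in the dataset. To address this,
we evaluated GPT-4 on a sample with equal numbers of each class (100
'spam' and 100 'ham'). The LSTM, by contrast, was evaluated on a random
10% hold-out of the full dataset, which keeps the original imbalance, so
the two sets of scores come from different test sets.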

We evaluated the performance of the models using three metrics:
precision, recall, and F1 score. Each of these metrics provides a
different perspective on the model's performance. Precision measures how
many of the predicted spam messages were actually spam. Recall, also
known as sensitivity, measures how many of the actual spam messages were
correctly identified by the model. The F1 score combines precision and
recall to provide a single metric that balances both considerations.

Here are the final results:

| Model Type        | Precision | Recall | F1 Score |
|-------------------|-----------|--------|----------|
| GPT-4 (Zero-Shot) | 0.92      | 0.97   | 0.95     |
| LSTM              | 0.97      | 0.96   | 0.97     |

While both models performed quite well, the LSTM model demonstrated a
slightly superior performance in terms of precision and F1 score. This
result highlights the value of targeted, supervised learning for tasks
such as spam detection. However, it is noteworthy that the GPT-4 model
performed remarkably well even without any explicit training on this
specific task, underscoring the power of zero-shot learning and the
versatility of advanced language models like GPT-4.

This comparison sheds light on the potential of both traditional
supervised learning models and more recent zero-shot learning techniques
in text classification tasks. The choice between these approaches
depends on several factors, including the availability of labeled data,
computational resources, and the specific requirements of the task at
hand.


## Few-Shot In-Context Learning with GPT-4

In the previous sections, we explored both zero-shot in-context learning
with GPT-4 and a supervised learning approach using an LSTM model for
our SMS spam classification problem. However, another learning paradigm
exists that combines elements from both: few-shot learning. This method
aims to make predictions after seeing only a handful of examples, hence
its name. 

In the case of the GPT-4 model, it refers to in-context learning, where
the model makes predictions based on the conversation history provided,
with a small set of examples included as part of the context. The
structure of these examples, or 'shots', plays a crucial role in guiding
the model to produce the desired output. These 'shots' could be seen as
miniature lessons that teach the model how to perform a task without
explicitly programming it.

In the context of our SMS spam classification problem, the general
structure of a prompt for few-shot learning would look like this:

```text
    prompt = (
        "I am a model trained to identify spam and non-spam emails. Here are some examples of my training:\n"
        "Example 1:\n"
        "Question: Given the following email content, is the email spam or ham?\n"
        "Email Content: 'You have won a lottery! Claim your prize now.'\n"
        "Answer: Spam\n"
        "...\n"
        "Now, a new example to classify:\n"
        "Question: Given the following email content, is the email spam or ham?\n"
        f"Email Content: {msg}\n"
    )
```

Similar to zero-shot learning, the model is not trained or fine-tuned on
this specific task. However, unlike zero-shot learning, it is provided
with a few examples of the task in the conversation history. The model
conditions on these examples when classifying new instances; its
weights are not updated.
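
One way to lay out such a conversation history is to present each example as a user turn followed by an assistant turn carrying the label, with the message to classify as the final user turn. The sketch below only builds the message list and makes no API call. `build_few_shot_messages` is a hypothetical helper, and this layout is an assumption that differs from the cells below, which place each question and its answer in a single user message.

```python
def build_few_shot_messages(examples, new_message):
    """Build a chat message list: system prompt, labeled examples, then the query.

    `examples` is a list of (message_text, label) pairs.
    """
    question = "Question: Given the following email content, is the email spam or ham?\n"
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant capable of identifying spam emails.",
        }
    ]
    for text, label in examples:
        # Each example is a question from the user and the expected answer
        messages.append({"role": "user", "content": f"{question}Email Content: {text}"})
        messages.append({"role": "assistant", "content": f"Answer: {label}"})
    # The message to classify goes last, with no answer attached
    messages.append({"role": "user", "content": f"{question}Email Content: {new_message}"})
    return messages


shots = [
    ("You have won a lottery! Claim your prize now.", "spam"),
    ("Ok lar... Joking wif u oni...", "ham"),
]
for m in build_few_shot_messages(shots, "Shall i come to get pickle"):
    print(m["role"], "|", m["content"].replace("\n", " "))
```

Because the function returns a new list on every call, the examples stay fixed while the final query changes from one message to the next.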

It's important to note that our dataset is unbalanced, with the 'ham'
messages outnumbering the 'spam' messages. The proportion of each class,
as mentioned before, is approximately 87% 'ham' and 13% 'spam'. When
creating our random sample for few-shot learning, we'll need to keep
this class imbalance in mind. While our previous approach addressed this
issue by creating a balanced sample, few-shot learning might behave
differently, and we may need to adjust our approach accordingly. We will
explore the results and address any imbalance issues as needed.

As in the previous sections, we will be using precision, recall, and F1
score to assess the performance of few-shot in-context learning. As
with zero-shot learning, we do not report the AUC, because the text
responses we parse from the API do not come with probability scores. The
following sections will dive deeper into the implementation and results
of few-shot in-context learning with GPT-4.

In [4]:
import os
import openai
import pandas as pd
import random
import time

# Load the dataset
df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "message"])

# Draw 3 random labeled messages to use as in-context examples
examples = df.sample(3)

# Set up examples
example_prompts = [
    {
        "role": "system",
        "content": "You are a helpful assistant capable of identifying spam emails.",
    },
]

for _, row in examples.iterrows():
    example_prompt = {
        "role": "user",
        "content": (
            "Question: Given the following email content, is the email spam or ham?\n"
            f"Email Content: {row['message']}\n"
            f"Answer: {row['label']}"
        ),
    }
    example_prompts.append(example_prompt)

# Draw 100 evaluation messages, 50 per class
ham_samples = df[df["label"] == "ham"].sample(50)
spam_samples = df[df["label"] == "spam"].sample(50)
samples = pd.concat([ham_samples, spam_samples])

openai.api_key = os.getenv("OPENAI_API_KEY")

correct_predictions = 0

for _, row in samples.iterrows():
    prompt = (
        "Question: Given the following email content, is the email spam or ham?\n"
        f"Email Content: {row['message']}"
    )
    # Build a fresh message list per query so earlier test messages
    # do not accumulate in the context
    messages = example_prompts + [{"role": "user", "content": prompt}]

    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
    )

    print(f"Message: {row['message']}")
    response = completion.choices[0].message["content"]
    print("Response:", response)
    # Match on the label text; testing for "A" would also match "Answer"
    predicted_label = "spam" if "spam" in response.lower() else "ham"
    correct_predictions += int(predicted_label == row["label"])
    time.sleep(30)  # sleep 30 seconds between API requests

accuracy = correct_predictions / len(samples)
print(f"Accuracy: {accuracy}")

Message: In which place do you want da.
Response: Answer: ham
Message: K ill drink.pa then what doing. I need srs model pls send it to my mail id pa.
Response: Answer: ham
Message: Good! No, don‘t need any receipts—well done! (…) Yes, please tell . What‘s her number, i could ring her
Response: Answer: ham
Message: I am back. Bit long cos of accident on a30. Had to divert via wadebridge.I had a brilliant weekend thanks. Speak soon. Lots of love
Response: Answer: ham
Message: Ok ok take care. I can understand.
Response: Answer: ham
Message: The greatest test of courage on earth is to bear defeat without losing heart....gn tc
Response: Answer: ham
Message: K so am I, how much for an 8th? Fifty?
Response: Answer: ham
Message: Can you tell Shola to please go to college of medicine and visit the academic department, tell the academic secretary what the current situation is and ask if she can transfer there. She should ask someone to check Sagamu for the same thing and lautech. Its vital she 

ServiceUnavailableError: The server is overloaded or not ready yet.

https://help.openai.com/en/articles/6897202-ratelimiterror
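
Transient failures such as the `ServiceUnavailableError` above are commonly handled by retrying with an increasing delay. The sketch below is a generic wrapper written for illustration: `call_with_retries` and its parameters are hypothetical and not part of the `openai` library, and in practice `retry_on` would be narrowed to the specific exception types worth retrying.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in function that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("server overloaded")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

In the classification loop, the API request could be passed in as a zero-argument function, for example a `lambda` that wraps the `openai.ChatCompletion.create` call.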

In [29]:
import os
import openai
import pandas as pd
import random
import time

# Load the dataset
df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "message"])

# Draw 3 random labeled messages to use as in-context examples
examples = df.sample(3)

# Set up examples
example_prompts = [
    {
        "role": "system",
        "content": "You are a helpful assistant capable of identifying spam emails.",
    },
]

for _, row in examples.iterrows():
    example_prompt = {
        "role": "user",
        "content": (
            "Question: Given the following email content, is the email spam or ham?\n"
            f"Email Content: {row['message']}\n"
            f"Answer: {row['label']}"
        ),
    }
    example_prompts.append(example_prompt)

# Draw 20 evaluation messages, 10 per class
ham_samples = df[df["label"] == "ham"].sample(10)
spam_samples = df[df["label"] == "spam"].sample(10)
samples = pd.concat([ham_samples, spam_samples])

openai.api_key = os.getenv("OPENAI_API_KEY")

correct_predictions = 0

# Create a dataframe to store predictions and true labels
prediction_df = pd.DataFrame(columns=['true_label', 'predicted_label'])

for index, row in samples.iterrows():
    prompt = (
        "Question: Given the following email content, is the email spam or ham?\n"
        f"Email Content: {row['message']}"
    )
    # Build a fresh message list per query so earlier test messages
    # do not accumulate in the context
    messages = example_prompts + [{"role": "user", "content": prompt}]

    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
    )

    print(f"Message: {row['message']}")
    response = completion.choices[0].message["content"]
    print("Response:", response)
    predicted_label = "spam" if "spam" in response.lower() else "ham"
    correct_predictions += int(predicted_label == row["label"])

    # Add the result to the dataframe
    prediction_df.loc[index] = [row["label"], predicted_label]

    # Write the dataframe to CSV after every request so partial results
    # survive an API failure
    prediction_df.to_csv('predictions.csv', index=False)

    time.sleep(1)  # sleep 1 second between API requests

accuracy = correct_predictions / len(samples)
print(f"Accuracy: {accuracy}")


Message: Another month. I need chocolate weed and alcohol.
Response: Answer: ham
Message: All boys made fun of me today. Ok i have no problem. I just sent one message just for fun
Response: Answer: ham
Message: Sorry da. I gone mad so many pending works what to do.
Response: Answer: ham
Message: Pls i wont belive god.not only jesus.
Response: Answer: ham
Message: You're gonna have to be way more specific than that
Response: Answer: ham
Message: Gain the rights of a wife.dont demand it.i am trying as husband too.Lets see
Response: Answer: ham
Message: Have a safe trip to Nigeria. Wish you happiness and very soon company to share moments with
Response: Answer: ham
Message: Ok darlin i supose it was ok i just worry too much.i have to do some film stuff my mate and then have to babysit again! But you can call me there.xx
Response: Answer: ham
Message: Ok enjoy . R u there in home.
Response: Answer: ham
Message: Storming msg: Wen u lift d phne, u say "HELLO" Do u knw wt is d real meaning of

### One-shot learning assessment metrics

Calculate precision, recall, and F1 score from the saved one-shot predictions. Note that this evaluation sample is not perfectly balanced: the confusion matrix below contains 71 'spam' and 133 'ham' messages.

In [21]:
from sklearn.metrics import precision_score, recall_score, f1_score
import pandas as pd

# Load the predictions from CSV
df = pd.read_csv('predictions_text_one_shot.csv')

# Get the true and predicted labels
true_labels = df['true_label']
predicted_labels = df['predicted_label']

# Calculate scores
precision = precision_score(true_labels, predicted_labels, pos_label='spam')
recall = recall_score(true_labels, predicted_labels, pos_label='spam')
f1 = f1_score(true_labels, predicted_labels, pos_label='spam')

print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')


Precision: 0.971830985915493
Recall: 0.971830985915493
F1 Score: 0.971830985915493


Calculate the confusion matrix:

In [22]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(true_labels, predicted_labels, labels=['spam', 'ham'])
print(conf_mat)


[[ 69   2]
 [  2 131]]


Reading the confusion matrix (rows are true labels, columns are predicted labels, both in the order 'spam', 'ham'):

- True Positives (TP): 69 - The model predicted 'spam' and the true label was also 'spam'.
- False Positives (FP): 2 - The model predicted 'spam' but the true label was 'ham'.
- False Negatives (FN): 2 - The model predicted 'ham' but the true label was 'spam'.
- True Negatives (TN): 131 - The model predicted 'ham' and the true label was also 'ham'.

Given these values, precision, recall, and the F1 score can be calculated manually as follows:

- Precision (for 'spam') = TP / (TP + FP) = 69 / (69 + 2) = 0.9718
- Recall (for 'spam') = TP / (TP + FN) = 69 / (69 + 2) = 0.9718
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.9718 * 0.9718) / (0.9718 + 0.9718) = 0.9718

So, in this case, precision, recall, and F1 score are identical. This happens because the number of false positives equals the number of false negatives, which makes the denominators TP + FP and TP + FN the same.
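
The same arithmetic can be checked with a few lines of code. This is a minimal sketch: `metrics_from_confusion` is a hypothetical helper that assumes the matrix layout printed above (rows are true labels, columns are predicted labels, ordered 'spam' then 'ham').

```python
def metrics_from_confusion(matrix):
    """Return (precision, recall, f1) for the class in row/column 0.

    `matrix` is [[TP, FN], [FP, TN]] when the positive class is listed first.
    """
    tp, fn = matrix[0]
    fp, _tn = matrix[1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = metrics_from_confusion([[69, 2], [2, 131]])
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

True negatives do not enter any of the three formulas, which is why these metrics are unaffected by how many 'ham' messages are classified correctly.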


### Few-shot learning assessment metrics

Calculate precision, recall, and F1 score from the saved few-shot predictions. Note that this evaluation sample is not perfectly balanced: the confusion matrix below contains 70 'spam' and 144 'ham' messages.

In [31]:
from sklearn.metrics import precision_score, recall_score, f1_score
import pandas as pd

# Load the predictions from CSV
df = pd.read_csv('predictions_text_few_shot.csv')

# Get the true and predicted labels
true_labels = df['true_label']
predicted_labels = df['predicted_label']

# Calculate scores
precision = precision_score(true_labels, predicted_labels, pos_label='spam')
recall = recall_score(true_labels, predicted_labels, pos_label='spam')
f1 = f1_score(true_labels, predicted_labels, pos_label='spam')

print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Precision: 0.9852941176470589
Recall: 0.9571428571428572
F1 Score: 0.9710144927536232


In [32]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(true_labels, predicted_labels, labels=['spam', 'ham'])
print(conf_mat)

[[ 67   3]
 [  1 143]]
