<a href="https://colab.research.google.com/github/Joshuajee/AI-ML-PROJECTS/blob/master/Catch%20the%20LLM%20Text%20Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification: Articles Written by LLMs

## Setup

In [None]:
!pip install tensorflow==2.15.0 tensorflow-hub keras==2.15.0

In [None]:
import numpy as np
import pandas as pd
import requests
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
from tensorflow.keras.utils import to_categorical

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

## Download the data from GitHub

The dataset is available on https://github.com/Joshuajee/AI-ML-PROJECTS/master/. The following code downloads the CSV files of LLM responses to your machine (or the Colab runtime):

In [None]:
def get_model_data_from_github(model):
  # Base URL of the raw CSV files in the repository
  file_path = "https://raw.githubusercontent.com/Joshuajee/AI-ML-PROJECTS/master/data/llms/"
  response = requests.get(file_path + model)
  if response.status_code == 200:
    # Save the CSV in the working directory under the same file name
    with open(model, "wb") as file:
      file.write(response.content)
  else:
    raise Exception(f"Error downloading {model}: HTTP {response.status_code}")

In [None]:
# Download the model data from GitHub; this is faster
get_model_data_from_github("Meta-Llama-3-8B-Instruct.Q4_0.gguf.csv")
get_model_data_from_github("Phi-3-mini-4k-instruct.Q4_0.gguf.csv")
get_model_data_from_github("Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf.csv")
get_model_data_from_github("orca-mini-3b-gguf2-q4_0.gguf.csv")

## Explore the data

Explore the data to better understand the responses obtained from the LLMs.

The exploration shows that Meta-Llama-3-8B-Instruct tends to use the "\*\*" characters to mark headings. This makes its responses easy to recognise, but because "\*\*" is not a word, it will be removed from our data.

In [None]:
# Read and explore Meta-Llama-3-8B-Instruct data
meta = pd.read_csv("Meta-Llama-3-8B-Instruct.Q4_0.gguf.csv")
meta.head(10)

In [None]:
# Read and explore Phi-3-mini-4k-instruct data
phi = pd.read_csv("Phi-3-mini-4k-instruct.Q4_0.gguf.csv")
phi.head(10)

In [None]:
# Read and explore Nous-Hermes-2-Mistral-7B-DPO.Q4_0 data
nous = pd.read_csv("Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf.csv")
nous.head(10)

In [None]:
# Read and explore orca-mini-3b-gguf2-q4_0.gguf data
orca = pd.read_csv("orca-mini-3b-gguf2-q4_0.gguf.csv")
orca.head(10)

Get the row counts of the DataFrames in order to check whether the classes are balanced.

In [None]:
print(phi.count())
print(meta.count())
print(nous.count())
print(orca.count())
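
If the counts differ noticeably, one simple way to balance the classes is to downsample every DataFrame to the size of the smallest one. The sketch below is an assumption about how that could be done, not a step this notebook performs; `balance_by_downsampling` is a hypothetical helper and the toy frames stand in for `meta`, `phi`, `nous`, and `orca`.

```python
import pandas as pd

def balance_by_downsampling(frames, random_state=42):
    """Downsample every DataFrame to the row count of the smallest one."""
    smallest = min(len(frame) for frame in frames)
    return [
        frame.sample(n=smallest, random_state=random_state).reset_index(drop=True)
        for frame in frames
    ]

# Toy stand-ins for the per-model DataFrames (hypothetical data).
toy_a = pd.DataFrame({" Response": [f"a{i}" for i in range(10)]})
toy_b = pd.DataFrame({" Response": [f"b{i}" for i in range(7)]})

balanced = balance_by_downsampling([toy_a, toy_b])
print([len(frame) for frame in balanced])
```

Downsampling discards data, so it is only worth doing when the imbalance is large enough to bias the classifier.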

## Preprocessing

Very little text cleaning is needed here because of the text embedding, but I will remove the "\*\*" characters that appear in most of the Meta-Llama-3-8B-Instruct responses.


In [None]:
# regex=False treats '**' as a literal string rather than a regular expression
meta[' Response'] = meta[' Response'].str.replace('**', ' ', regex=False).str.strip()
meta
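
The pattern `**` has to be treated as a literal string. Interpreted as a regular expression it is invalid, because `*` is a quantifier with nothing in front of it to repeat. A small standard-library sketch of the difference, using a made-up response string:

```python
import re

text = "**Introduction** Large language models ..."

# As a regular expression, "**" is invalid: "*" has nothing to repeat.
try:
    re.compile("**")
    regex_is_valid = True
except re.error:
    regex_is_valid = False

# Literal replacement works, either with str.replace or an escaped pattern.
cleaned_literal = text.replace("**", " ").strip()
cleaned_escaped = re.sub(re.escape("**"), " ", text).strip()

print(regex_is_valid)
print(cleaned_literal)
```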

Splitting into training, validation, and testing sets.

In [None]:
# Generate the train, validation, and test splits
# Train     : 60%
# Validation: 20%
# Test      : 20% (the remaining rows)

def split_data(data, category):
  # Use the raw response text as the model input
  data['Features'] = data[' Response']

  # Assign the integer class label to every row
  data['Labels'] = category

  data_shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)

  # Define split sizes
  train_size = int(0.6 * len(data_shuffled))   # 60%
  val_size = int(0.2 * len(data_shuffled))     # 20%

  train_data = data_shuffled[:train_size][["Features", "Labels"]]
  val_data   = data_shuffled[train_size:train_size+val_size][["Features", "Labels"]]
  test_data  = data_shuffled[train_size+val_size:][["Features", "Labels"]]

  return train_data, val_data, test_data
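
Because the train and validation sizes are truncated with `int()`, the test set receives whatever rows remain, so it can be slightly larger than 20%. A quick check of the arithmetic, using a hypothetical helper `split_sizes` and a made-up row count:

```python
def split_sizes(n_rows, train_frac=0.6, val_frac=0.2):
    """Return (train, val, test) sizes using the same slicing rule as split_data."""
    train_size = int(train_frac * n_rows)
    val_size = int(val_frac * n_rows)
    test_size = n_rows - train_size - val_size  # the remainder goes to test
    return train_size, val_size, test_size

sizes = split_sizes(1003)  # 1003 is an arbitrary example count
print(sizes)
```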

In [None]:
(meta_train, meta_val, meta_test) = split_data(meta, 0)
(phi_train, phi_val, phi_test) = split_data(phi, 1)
(nous_train, nous_val, nous_test) = split_data(nous, 2)
(orca_train, orca_val, orca_test) = split_data(orca, 3)

print(len(meta_train), len(meta_val), len(meta_test))
print(len(phi_train), len(phi_val), len(phi_test))
print(len(nous_train), len(nous_val), len(nous_test))
print(len(orca_train), len(orca_val), len(orca_test))

meta_train

Joining the per-model DataFrames together into single training, validation, and test sets.

In [None]:
train_data = pd.concat([meta_train, phi_train, nous_train, orca_train], ignore_index=True)
val_data = pd.concat([meta_val, phi_val, nous_val, orca_val], ignore_index=True)
test_data = pd.concat([meta_test, phi_test, nous_test, orca_test], ignore_index=True)

train_data

In [None]:
# Converting to Tensors
train_example = tf.convert_to_tensor( train_data['Features'].values, dtype=tf.string) # Convert 'Features' to tf.string tensor
train_labels = tf.convert_to_tensor(to_categorical(train_data['Labels'].values, num_classes=4), dtype=tf.int64)   # Convert 'Labels' to tf.int64 tensor

val_example = tf.convert_to_tensor(val_data['Features'].values, dtype=tf.string) # Convert 'Features' to tf.string tensor
val_labels = tf.convert_to_tensor(to_categorical(val_data['Labels'].values, num_classes=4), dtype=tf.int64)   # Convert 'Labels' to tf.int64 tensor

test_example = tf.convert_to_tensor(test_data['Features'].values, dtype=tf.string) # Convert 'Features' to tf.string tensor
test_labels = tf.convert_to_tensor(to_categorical(test_data['Labels'].values, num_classes=4), dtype=tf.int64)   # Convert 'Labels' to tf.int64 tensor
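
`to_categorical` turns each integer label into a one-hot row. The NumPy sketch below mimics that behaviour on a few made-up labels so the shape of `train_labels` is easier to picture; `one_hot` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.eye(num_classes, dtype=np.int64)[labels]

example_labels = [0, 3, 1, 2]  # made-up labels for illustration
encoded = one_hot(example_labels, num_classes=4)
print(encoded)
```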

## Build the model

The neural network is created by stacking layers—this requires three main architectural decisions:

* How to represent the text?
* How many layers to use in the model?
* How many *hidden units* to use for each layer?

In this project, the input data consists of sentences. The label to predict is one of 0, 1, 2, or 3.

To represent the text, the sentences will be converted into embedding vectors.

Pre-trained text embedding will be used as the first layer, which will have two advantages:
*   we don't have to worry about text preprocessing,
*   we can benefit from transfer learning.

For this example I will compare the following three pre-trained text embedding models from [TensorFlow Hub](https://www.tensorflow.org/hub):

1. [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2) - the base model, with an embedding dimension of 50.

2. [google/nnlm-en-dim50-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2) - same as [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), but with additional text normalization to remove punctuation. This can help to get better coverage of in-vocabulary embeddings for tokens in your input text.

3. [google/nnlm-en-dim128-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) - a larger model with an embedding dimension of 128 instead of the smaller 50.

The layers are stacked sequentially to build the classifier:

1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. Each of the three models above splits the sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are: `(num_examples, embedding_dimension)`.
2. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.
3. The last layer is densely connected with four output nodes, one per class. It outputs logits: unnormalized scores that a softmax converts into class probabilities.
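
To make the first step concrete, here is a toy NumPy sketch of turning a sentence into one fixed-length vector. The vocabulary, the random 50-dimensional vectors, and the mean combiner are all assumptions for illustration; the real NNLM modules use their own vocabulary, trained weights, and combination rule.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 50

# Hypothetical tiny vocabulary; index 0 is reserved for unknown tokens.
vocab = {"<unk>": 0, "large": 1, "language": 2, "models": 3}
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed_sentence(sentence):
    """Split on whitespace, look up each token, and average the vectors."""
    tokens = sentence.lower().split()
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    return embedding_table[ids].mean(axis=0)

sentences = ["Large language models", "Unknown words here"]
batch = np.stack([embed_sentence(s) for s in sentences])
print(batch.shape)  # (num_examples, embedding_dimension)
```

Whatever the sentence length, the output has the same width, which is what lets a `Dense` layer consume it directly.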

Let's now build the full model:

In [None]:
def build_model(embedding_model):
    model = tf.keras.Sequential()
    model.add(hub.KerasLayer(embedding_model, input_shape=[], dtype=tf.string, trainable=True))
    model.add(tf.keras.layers.Dense(16, activation='relu'))
    model.add(tf.keras.layers.Dense(4))
    model.compile(optimizer='adam', loss=tf.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
    model.summary()
    return model

In [None]:
nnlm_50_dim = build_model("https://tfhub.dev/google/nnlm-en-dim50/2")

In [None]:
nnlm_50_dim_norm = build_model("https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2")

In [None]:
nnlm_128_dim_norm = build_model("https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2")

### Hidden units

The above model has two intermediate or "hidden" layers between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space for the layer. In other words, it is the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, it makes the network more computationally expensive and may lead to learning unwanted patterns—patterns that improve performance on training data but not on the test data. This is called *overfitting*, and we'll explore it later.
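
As a rough illustration of how size grows with the number of units, the trainable parameters of the two `Dense` layers can be counted by hand (weights plus biases). The helper name `dense_params` is hypothetical, and the embedding layer's own weights are not included.

```python
def dense_params(inputs, units):
    """Parameters of a fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

for embedding_dim in (50, 128):
    hidden = dense_params(embedding_dim, 16)  # embedding -> Dense(16)
    output = dense_params(16, 4)              # Dense(16) -> Dense(4)
    print(embedding_dim, hidden, output, hidden + output)
```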

### Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a multi-class classification problem and the model outputs logits (a four-unit layer with no activation), we'll use the `CategoricalCrossentropy` loss with `from_logits=True`, together with the `adam` optimizer.
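
For intuition, categorical cross-entropy on logits can be written out in NumPy: apply a softmax to get probabilities, then take the negative log of the probability assigned to the true class. This is a sketch of the formula on made-up numbers, not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities along the last axis (numerically stable)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy_from_logits(one_hot_labels, logits):
    """Mean over examples of -log(probability of the true class)."""
    probs = softmax(logits)
    per_example = -(one_hot_labels * np.log(probs)).sum(axis=-1)
    return per_example.mean()

# Made-up logits for two examples and four classes.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 0.0, 0.0, 0.0]])
labels = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0]])

probs = softmax(logits)
loss = categorical_crossentropy_from_logits(labels, logits)
print(probs.sum(axis=-1), loss)
```

The second example has identical logits for every class, so each class gets probability 0.25 and that example contributes `log(4)` to the loss.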


## Train the models

Train each model for 40 epochs in mini-batches of 100 samples. This is 40 iterations over all samples in the `train_example` and `train_labels` tensors. The validation set is used to monitor loss and accuracy after every epoch.
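
With a batch size of 100, the number of weight updates per epoch is the training-set size divided by 100, rounded up. A small sketch with a made-up training-set size (the real value is `len(train_data)`); `training_steps` is a hypothetical helper:

```python
import math

def training_steps(num_examples, batch_size=100, epochs=40):
    """Return (steps per epoch, total steps) for mini-batch training."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

steps = training_steps(2410)  # 2410 is an arbitrary example size
print(steps)
```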

In [None]:
def train_model(model, train_example, train_labels, val_example, val_labels):
  return model.fit(train_example, train_labels, epochs=40, batch_size=100, validation_data=(val_example, val_labels), verbose=1)

In [None]:
nnlm_50_dim_history = train_model(nnlm_50_dim, train_example, train_labels, val_example, val_labels)

In [None]:
nnlm_50_dim_norm_history = train_model(nnlm_50_dim_norm, train_example, train_labels, val_example, val_labels)

In [None]:
nnlm_128_dim_norm_history = train_model(nnlm_128_dim_norm, train_example, train_labels, val_example, val_labels)

## Evaluate the models

Now let's see how the models perform. Two values are returned for each model: the loss (a number that represents the error; lower values are better) and the accuracy.

In [None]:
def evaluate_model(model, test_example, test_labels):
  results = model.evaluate(test_example, test_labels)
  print(f"test loss: {results[0]}, test acc: {results[1]}")
  return results

In [None]:
nnlm_50_dim_results = evaluate_model(nnlm_50_dim, test_example, test_labels)

In [None]:
nnlm_50_dim_norm_results = evaluate_model(nnlm_50_dim_norm, test_example, test_labels)

In [None]:
nnlm_128_dim_norm_results = evaluate_model(nnlm_128_dim_norm, test_example, test_labels)

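Model prediction: this assumes that we feed brand-new data to our trained system (here the test set stands in for it).

In [None]:
# Predict logits with one of the trained models, then pick the most likely class per row
results_pred = nnlm_50_dim.predict(test_example)
classes_x = np.argmax(results_pred, axis=1)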
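
The sketch below uses made-up logits in place of real `predict` output to show how the `argmax` indices map back to model names, following the label order passed to `split_data` (0 = Meta-Llama, 1 = Phi-3, 2 = Nous-Hermes, 3 = orca-mini). The `CLASS_NAMES` list and the `confusion_matrix` helper are hypothetical additions.

```python
import numpy as np

CLASS_NAMES = ["Meta-Llama-3-8B-Instruct", "Phi-3-mini-4k-instruct",
               "Nous-Hermes-2-Mistral-7B-DPO", "orca-mini-3b"]

def confusion_matrix(true_classes, predicted_classes, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    for true, pred in zip(true_classes, predicted_classes):
        matrix[true, pred] += 1
    return matrix

# Made-up logits standing in for model.predict(test_example).
fake_logits = np.array([[3.0, 0.1, 0.2, 0.0],
                        [0.1, 2.5, 0.3, 0.2],
                        [0.2, 1.9, 1.5, 0.1],
                        [0.0, 0.1, 0.2, 4.0]])
true_classes = np.array([0, 1, 2, 3])

predicted = np.argmax(fake_logits, axis=1)
predicted_names = [CLASS_NAMES[i] for i in predicted]
matrix = confusion_matrix(true_classes, predicted, num_classes=4)
print(predicted_names)
print(matrix)
```

Off-diagonal entries show which models get confused with each other, which a single accuracy number hides.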

This fairly naive approach achieves an accuracy of about 87%. With more advanced approaches, the model should get closer to 95%.

## Create a graph of accuracy and loss over time

`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training:

In [None]:
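def plot_history(history):
  # history.history maps each metric name to a list with one value per epoch
  history_dict = history.history
  acc = history_dict['accuracy']
  val_acc = history_dict['val_accuracy']
  loss = history_dict['loss']
  val_loss = history_dict['val_loss']

  epochs = range(1, len(acc) + 1)

  # "bo" is for "blue dot"
  plt.plot(epochs, loss, 'bo', label='Training loss')
  # b is for "solid blue line"
  plt.plot(epochs, val_loss, 'b', label='Validation loss')
  plt.title('Training and validation loss')
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.legend()
  plt.show()

  # Same style for accuracy: dots for training, solid line for validation
  plt.plot(epochs, acc, 'bo', label='Training acc')
  plt.plot(epochs, val_acc, 'b', label='Validation acc')
  plt.title('Training and validation accuracy')
  plt.xlabel('Epochs')
  plt.ylabel('Accuracy')
  plt.legend()
  plt.show()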

In [None]:
plot_history(nnlm_50_dim_history)

In [None]:
plot_history(nnlm_50_dim_norm_history)

In [None]:
plot_history(nnlm_128_dim_norm_history)

There are four entries: one for each monitored metric during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy:

In [None]:
# acc = history_dict['accuracy']
# val_acc = history_dict['val_accuracy']
# loss = history_dict['loss']
# val_loss = history_dict['val_loss']

# epochs = range(1, len(acc) + 1)

# # "bo" is for "blue dot"
# plt.plot(epochs, loss, 'bo', label='Training loss')
# # b is for "solid blue line"
# plt.plot(epochs, val_loss, 'b', label='Validation loss')
# plt.title('Training and validation loss')
# plt.xlabel('Epochs')
# plt.ylabel('Loss')
# plt.legend()

# plt.show()

In [None]:
# plt.clf()   # clear figure

# plt.plot(epochs, acc, 'bo', label='Training acc')
# plt.plot(epochs, val_acc, 'b', label='Validation acc')
# plt.title('Training and validation accuracy')
# plt.xlabel('Epochs')
# plt.ylabel('Accuracy')
# plt.legend()

# plt.show()

In these plots, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.

Notice the training loss *decreases* with each epoch and the training accuracy *increases* with each epoch. This is expected when using gradient descent optimization: it should minimize the desired quantity on every iteration.

This isn't the case for the validation loss and accuracy. They seem to peak after about twenty epochs. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations *specific* to the training data that do not *generalize* to test data.

For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. This can also be done automatically with a Keras callback.
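
One such callback is `tf.keras.callbacks.EarlyStopping`, which can be passed to `model.fit` through its `callbacks` argument. The plain-Python sketch below only imitates the stopping rule on a made-up validation-loss curve, to show what a `patience` setting means. It simplifies the rule to stopping once the loss has not improved for `patience` consecutive epochs, and it is not the Keras implementation; `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch training stops at, best epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training runs for all epochs.
    return len(val_losses), best_epoch

# Made-up validation losses: improving at first, then rising again.
fake_val_loss = [1.20, 0.90, 0.70, 0.62, 0.60, 0.63, 0.66, 0.70, 0.75]
stopped_at, best = early_stopping_epoch(fake_val_loss, patience=3)
print(stopped_at, best)
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra epochs of training.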