<a href="https://colab.research.google.com/github/Joshuajee/AI-ML-PROJECTS/blob/master/Catch%20the%20LLM%20Text%20Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification of Articles Written by LLMs

## Setup

In [None]:
!pip install tensorflow==2.15 tensorflow-hub==0.15

In [None]:
import numpy as np
import pandas as pd
import requests
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder


print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

## Download the data from GitHub

The dataset is available at https://github.com/Joshuajee/AI-ML-PROJECTS/master/. The following code downloads the LLM response CSV files to your machine (or the Colab runtime):

In [None]:
def get_model_data_from_github(model):
  # Base URL of the raw CSV files in the repository
  file_path = "https://raw.githubusercontent.com/Joshuajee/AI-ML-PROJECTS/master/data/llms/"
  response = requests.get(file_path + model)
  if response.status_code == 200:
    # Save the CSV in the working directory under the same file name
    with open(model, "wb") as file:
      file.write(response.content)
  else:
    raise Exception(f"Error downloading {model}: HTTP {response.status_code}")

In [None]:
# Download the model data from GitHub; this is faster
get_model_data_from_github("Meta-Llama-3-8B-Instruct.Q4_0.gguf.csv")
get_model_data_from_github("Phi-3-mini-4k-instruct.Q4_0.gguf.csv")
get_model_data_from_github("Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf.csv")
get_model_data_from_github("orca-mini-3b-gguf2-q4_0.gguf.csv")

## Explore the data

Explore the data to better understand the responses obtained from the LLMs.

The exploration shows that Meta-Llama-3-8B-Instruct tends to use the "\*\*" marker to denote headings. This makes its responses easy to recognise, but because "\*\*" is not a word, it will be removed from the data.
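
To check how distinctive the marker really is, one could measure the share of responses per model that contain it. The sketch below uses a few invented rows in place of the real CSVs; the `rows` list and the `marker_share` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical (model, response) pairs standing in for the real CSV rows.
rows = [
    ("Meta-Llama-3-8B-Instruct", "**Introduction** Cats are mammals."),
    ("Meta-Llama-3-8B-Instruct", "**Summary** Dogs bark."),
    ("Phi-3-mini-4k-instruct", "Cats are small carnivorous mammals."),
    ("orca-mini-3b-gguf2-q4_0", "Dogs are loyal **companions**."),
    ("orca-mini-3b-gguf2-q4_0", "Birds can fly."),
]

def marker_share(rows, marker="**"):
    """Return, per model, the fraction of responses containing `marker`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, response in rows:
        totals[model] += 1
        if marker in response:
            hits[model] += 1
    return {model: hits[model] / totals[model] for model in totals}

shares = marker_share(rows)
for model, share in shares.items():
    print(f"{model}: {share:.0%} of responses contain '**'")
```

A large gap between models would suggest the classifier could latch onto the marker alone, which is one reason to strip it before training.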

In [None]:
# Read and explore Meta-Llama-3-8B-Instruct data
meta = pd.read_csv("Meta-Llama-3-8B-Instruct.Q4_0.gguf.csv")
meta.head(10)

In [None]:
# Read and explore Phi-3-mini-4k-instruct data
phi = pd.read_csv("Phi-3-mini-4k-instruct.Q4_0.gguf.csv")
phi.head(10)

In [None]:
# Read and explore Nous-Hermes-2-Mistral-7B-DPO.Q4_0 data
nous = pd.read_csv("Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf.csv")
nous.head(10)

In [None]:
# Read and explore orca-mini-3b-gguf2-q4_0.gguf data
orca = pd.read_csv("orca-mini-3b-gguf2-q4_0.gguf.csv")
orca.head(10)

Get the total count of each DataFrame in order to balance the classes:

In [None]:
print(phi.count())
print(meta.count())
print(nous.count())
print(orca.count())
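
The notebook only prints the counts. If they turned out to differ noticeably, one option (an assumption, not something done in this notebook) would be to downsample every model to the size of the smallest class. A minimal sketch on invented data, where `downsample_to_smallest` is a hypothetical helper:

```python
import random
from collections import defaultdict

def downsample_to_smallest(rows, seed=42):
    """Randomly keep as many rows per model as the smallest class has."""
    by_model = defaultdict(list)
    for model, response in rows:
        by_model[model].append((model, response))
    smallest = min(len(group) for group in by_model.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_model.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

# Hypothetical, deliberately unbalanced data: 5, 3 and 2 rows per model.
rows = (
    [("model_a", f"text {i}") for i in range(5)]
    + [("model_b", f"text {i}") for i in range(3)]
    + [("model_c", f"text {i}") for i in range(2)]
)
balanced = downsample_to_smallest(rows)
counts = {
    name: sum(1 for model, _ in balanced if model == name)
    for name in ("model_a", "model_b", "model_c")
}
print(counts)
```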

In [None]:
complete_df = pd.concat([meta, phi, nous, orca], ignore_index=True)
complete_df

In [None]:
# Remove trailing and leading whitespaces on the dataframe columns
complete_df.columns = complete_df.columns.str.strip()

### Calculating and plotting the average words per model

In [None]:
# Count the words in each response
complete_df['word_count'] = complete_df['Response'].apply(lambda x: len(x.split()))

# Calculate average words per Model
avg_words_per_model = complete_df.groupby('Model', as_index=False)['word_count'].mean()

avg_words_per_model

In [None]:
plt.figure(figsize=(20, 8))
sns.barplot(x='Model', y='word_count', data=avg_words_per_model, palette='viridis')

# Customization
plt.xlabel('Model')
plt.ylabel('Average Word Count')
plt.title('Average Word Count per Model')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()

## Preprocessing



Very little text cleaning is needed here because of the text embedding, but I will remove the "\*\*" marker that appears in most of the Meta-Llama-3-8B-Instruct responses.

In [None]:
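# Replace the literal "**" marker with a space (regex=False so "*" is not parsed as a regex quantifier)
complete_df['Response'] = complete_df['Response'].str.replace('**', ' ', regex=False).str.strip()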
complete_df

### Encoding
Computers don’t understand text, so we need to convert our labels (the model names) into numbers. scikit-learn's `LabelEncoder` is a good tool for this task.


In [None]:
complete_df['Features'] = complete_df['Response']
encoder = LabelEncoder()
complete_df['Labels'] = encoder.fit_transform(complete_df['Model'])
complete_df
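
`LabelEncoder` assigns integer codes in the sorted order of the class names, which matters later when the integers are mapped back to model names. The sketch below reproduces that ordering with `np.unique` as a stand-in for the encoder; the strings are assumed values of the `Model` column and may differ from the real ones.

```python
import numpy as np

# Hypothetical values of the 'Model' column; the real strings may differ.
models = np.array([
    "orca-mini-3b-gguf2-q4_0",
    "Meta-Llama-3-8B-Instruct",
    "Phi-3-mini-4k-instruct",
    "Nous-Hermes-2-Mistral-7B-DPO",
    "Meta-Llama-3-8B-Instruct",
])

# np.unique returns the sorted distinct values plus each row's index into them,
# which mirrors how the encoder derives its classes and the encoded labels.
classes, labels = np.unique(models, return_inverse=True)

print(classes.tolist())
print(labels.tolist())
```

Uppercase letters sort before lowercase ones, so with these names the order is Meta, Nous, Phi, orca rather than the order in which the files were loaded.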

In [None]:
# Create a new DataFrame with only the Features and Labels columns
feature_df = complete_df[['Features', 'Labels']]
feature_df

## Splitting Data
Split the data into training (60%), validation (20%), and test (20%) sets.

In [None]:
train_data, temp_data = train_test_split(feature_df, test_size=0.4, stratify=feature_df['Labels'], random_state=42)
train_data

In [None]:
# Further split the temp_data into validation and test sets
val_data, test_data = train_test_split(temp_data, test_size=0.5, stratify=temp_data['Labels'], random_state=42)
val_data

In [None]:
test_data
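
The two calls above hold out 40% of the rows and then halve that held-out part, which gives roughly a 60/20/20 split. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper, the 1,000-row dataset size is invented, and the exact rounding used by scikit-learn may differ by a row.

```python
def split_sizes(n_rows, first_test_size=0.4, second_test_size=0.5):
    """Approximate row counts for a two-stage train/validation/test split."""
    n_temp = round(n_rows * first_test_size)    # held out by the first split
    n_train = n_rows - n_temp
    n_test = round(n_temp * second_test_size)   # second split halves the held-out part
    n_val = n_temp - n_test
    return n_train, n_val, n_test

# 1000 rows is a hypothetical dataset size used only for illustration.
n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)
```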

### Converting to Tensors
To work with TensorFlow, we have to convert the training, validation, and test datasets into tensors. The integer labels are one-hot encoded along the way.

In [None]:
# Converting to Tensors
train_example = tf.convert_to_tensor(train_data['Features'].values, dtype=tf.string) # Convert 'Features' to tf.string tensor
train_labels = tf.convert_to_tensor(to_categorical(train_data['Labels'].values, num_classes=4), dtype=tf.int64)   # One-hot encode 'Labels', then convert to tf.int64 tensor

val_example = tf.convert_to_tensor(val_data['Features'].values, dtype=tf.string) # Convert 'Features' to tf.string tensor
val_labels = tf.convert_to_tensor(to_categorical(val_data['Labels'].values, num_classes=4), dtype=tf.int64)   # One-hot encode 'Labels', then convert to tf.int64 tensor

test_example = tf.convert_to_tensor(test_data['Features'].values, dtype=tf.string) # Convert 'Features' to tf.string tensor
test_labels = tf.convert_to_tensor(to_categorical(test_data['Labels'].values, num_classes=4), dtype=tf.int64)   # One-hot encode 'Labels', then convert to tf.int64 tensor
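
`to_categorical` turns each integer label into a one-hot row of length `num_classes`. A minimal NumPy sketch of the same idea, with invented labels (the `one_hot` helper is a hand-written stand-in, not the Keras function):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.int64)
    encoded[np.arange(labels.size), labels] = 1
    return encoded

# Hypothetical integer labels for four responses.
encoded = one_hot([0, 3, 1, 2], num_classes=4)
print(encoded)

# argmax recovers the original integer labels.
recovered = encoded.argmax(axis=1)
print(recovered)
```

The `argmax` step is the same one used later to turn one-hot labels and predicted probabilities back into class indices.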

## Build the model

The neural network is created by stacking layers—this requires three main architectural decisions:

* How to represent the text?
* How many layers to use in the model?
* How many *hidden units* to use for each layer?

In this project, the input data consists of free-text responses. The label to predict is one of 0, 1, 2, or 3, one value per LLM.

To represent the text, each response is converted into an embedding vector.

A pre-trained text embedding is used as the first layer, which has two advantages:
*   we don't have to worry about text preprocessing,
*   we can benefit from transfer learning.

For this example I will compare the following pre-trained models from [TensorFlow Hub](https://www.tensorflow.org/hub):

1. [google/universal-sentence-encoder/4](https://tfhub.dev/google/universal-sentence-encoder/4) - a sentence encoder that outputs a 512-dimensional embedding.

2. [google/nnlm-en-dim50-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2) - same as [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), but with additional text normalization to remove punctuation. This can help to get better coverage of in-vocabulary embeddings for the tokens in your input text.

3. [google/nnlm-en-dim128-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) - a larger model with an embedding dimension of 128 instead of 50.

The layers are stacked sequentially to build the classifier:

1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map each input text to a single fixed-length embedding vector; the NNLM models do so by splitting the text into tokens, embedding each token and then combining the embeddings. The resulting dimensions are: `(num_examples, embedding_dimension)`.
2. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.
3. The last layer is a `Dense` layer with four output nodes and a `softmax` activation. It outputs one probability per class, and the four probabilities sum to 1.
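
The `build_model` function uses a `softmax` activation on the last layer, which turns four raw scores into class probabilities. A minimal NumPy sketch of that transformation, using made-up scores (the `softmax` function here is hand-written for illustration):

```python
import numpy as np

def softmax(logits):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores for the four classes.
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)
print(probs.round(3))
print("predicted class:", int(np.argmax(probs)))
```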

Let's now build the full model:

In [None]:
def build_model(embedding_model):
    model = tf.keras.Sequential()
    model.add(hub.KerasLayer(embedding_model, input_shape=[], dtype=tf.string, trainable=True))
    model.add(tf.keras.layers.Dense(16, activation='relu'))
    model.add(tf.keras.layers.Dense(4, activation='softmax'))
    model.compile(optimizer='adam', loss=tf.losses.CategoricalCrossentropy(), metrics=['accuracy'])
    model.summary()
    return model

In [None]:
universal_sen_encoder = build_model("https://tfhub.dev/google/universal-sentence-encoder/4")

In [None]:
nnlm_50_dim_norm = build_model("https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2")

In [None]:
nnlm_128_dim_norm = build_model("https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2")

### Hidden units

The above model has two intermediate or "hidden" layers, between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space for the layer. In other words, the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, it makes the network more computationally expensive and may lead to learning unwanted patterns—patterns that improve performance on training data but not on the test data. This is called *overfitting*, and we'll explore it later.
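
One common way to spot overfitting is to find the epoch where the validation loss stops improving while the training loss keeps falling. The sketch below uses invented loss curves purely for illustration; `best_epoch` is a hypothetical helper, not part of this notebook.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Invented loss curves, for illustration only: training loss keeps falling
# while validation loss bottoms out and then rises again.
train_losses = [1.20, 0.80, 0.55, 0.40, 0.30, 0.22]
val_losses = [1.25, 0.90, 0.70, 0.68, 0.74, 0.85]

epoch = best_epoch(val_losses)
print("validation loss is lowest at epoch", epoch)
```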

### Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a multi-class classification problem with one-hot encoded labels, and the model outputs a probability for each class (a four-unit layer with a softmax activation), we'll use the `categorical_crossentropy` loss function together with the `adam` optimizer.
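
For one-hot targets, categorical cross-entropy is the negative log of the probability assigned to the true class, averaged over the examples. A minimal NumPy sketch with invented predictions (the function below mirrors the name of the Keras loss but is a hand-written stand-in):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    per_example = -np.sum(y_true * np.log(y_pred), axis=1)
    return per_example.mean()

# Hypothetical one-hot targets and predicted probabilities for two responses.
y_true = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0]])
confident = np.array([[0.90, 0.05, 0.03, 0.02],
                      [0.10, 0.10, 0.70, 0.10]])
unsure = np.array([[0.25, 0.25, 0.25, 0.25],
                   [0.25, 0.25, 0.25, 0.25]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Uniform predictions over four classes give a loss of ln(4), about 1.386, which is a useful baseline when reading the training logs.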


## Train the models

Train each model for 15 epochs in mini-batches of 100 samples. This is 15 iterations over all samples in the `train_example` and `train_labels` tensors.
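
With `batch_size=100`, every epoch is split into a number of gradient updates that depends on the size of the training set. A small sketch of that arithmetic, where the 2,450-row training set is a hypothetical figure and `training_steps` is an illustrative helper:

```python
import math

def training_steps(n_samples, batch_size, epochs):
    """Return (batches per epoch, total gradient updates)."""
    steps_per_epoch = math.ceil(n_samples / batch_size)  # last batch may be smaller
    return steps_per_epoch, steps_per_epoch * epochs

# 2,450 training rows is a hypothetical figure; batch size and epochs follow train_model.
steps_per_epoch, total_steps = training_steps(n_samples=2450, batch_size=100, epochs=15)
print(steps_per_epoch, total_steps)
```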

In [None]:
def train_model(model, train_example, train_labels, val_example, val_labels):
  return model.fit(train_example, train_labels, epochs=15, batch_size=100, validation_data=(val_example, val_labels), verbose=1)

In [None]:
universal_sen_encoder_history = train_model(universal_sen_encoder, train_example, train_labels, val_example, val_labels)

In [None]:
nnlm_50_dim_norm_history = train_model(nnlm_50_dim_norm, train_example, train_labels, val_example, val_labels)

In [None]:
nnlm_128_dim_norm_history = train_model(nnlm_128_dim_norm, train_example, train_labels, val_example, val_labels)

## Evaluate the models

Let's see how each model performs on the test set. Two values are returned: the loss (a number that represents the error; lower values are better) and the accuracy.

In [None]:
def evaluate_model(model, test_example, test_labels):
  results = model.evaluate(test_example, test_labels)
  print(f"test loss: {results[0]}, test acc: {results[1]}")
  return results

In [None]:
universal_sen_encoder_results = evaluate_model(universal_sen_encoder, test_example, test_labels)

In [None]:
nnlm_50_dim_norm_results = evaluate_model(nnlm_50_dim_norm, test_example, test_labels)

In [None]:
nnlm_128_dim_norm_results = evaluate_model(nnlm_128_dim_norm, test_example, test_labels)

Next, run predictions on the test set, which stands in for brand-new data fed to the trained system, and plot a confusion matrix for each model.

In [None]:
def plot_confusion_matrix(model):
    # Take the class order from the encoder: LabelEncoder sorts class names, so a hand-written list can mislabel the axes
    class_names = list(encoder.classes_)
    y_pred = model.predict(test_example)
    y_pred_classes = np.argmax(y_pred, axis=1)
    y_true_classes = np.argmax(test_labels, axis=1)
    cm = confusion_matrix(y_true_classes, y_pred_classes)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title('Confusion Matrix')
    plt.show()
    return cm

In [None]:
universal_sen_encoder_cm = plot_confusion_matrix(universal_sen_encoder)

In [None]:
nnlm_50_dim_norm_cm = plot_confusion_matrix(nnlm_50_dim_norm)

In [None]:
nnlm_128_dim_norm_cm = plot_confusion_matrix(nnlm_128_dim_norm)
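
A confusion matrix can also be summarised as per-class precision and recall. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes) and uses an invented matrix, not results from this notebook; `per_class_precision_recall` is a hypothetical helper.

```python
import numpy as np

def per_class_precision_recall(cm):
    """Rows of `cm` are true classes, columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    true_positives = np.diag(cm)
    precision = true_positives / cm.sum(axis=0)  # column sums: predicted as the class
    recall = true_positives / cm.sum(axis=1)     # row sums: truly in the class
    return precision, recall

# Invented 4x4 confusion matrix, for illustration only.
cm = [[45, 3, 1, 1],
      [4, 40, 5, 1],
      [2, 6, 41, 1],
      [0, 1, 2, 47]]
precision, recall = per_class_precision_recall(cm)
print(precision.round(2))
print(recall.round(2))
```

The same function could be applied to the matrices returned by `plot_confusion_matrix` to compare the three models class by class.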

## Create a graph of accuracy and loss over time

`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training:

In [None]:
def plot_history(history):
  # history.history is a dict with the keys 'loss', 'accuracy', 'val_loss' and 'val_accuracy'
  history_dict = history.history
  acc = history_dict['accuracy']
  val_acc = history_dict['val_accuracy']
  loss = history_dict['loss']
  val_loss = history_dict['val_loss']

  epochs = range(1, len(acc) + 1)

  # "bo" is for "blue dot"
  plt.plot(epochs, loss, 'bo', label='Training loss')
  # b is for "solid blue line"
  plt.plot(epochs, val_loss, 'b', label='Validation loss')
  plt.title('Training and validation loss')
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.legend()
  plt.show()
  return epochs, acc, val_acc, loss, val_loss

In [None]:
(universal_sen_epoch, universal_sen_acc, universal_sen_val_acc, _, _) = plot_history(universal_sen_encoder_history)

In [None]:
(nnlm_50_dim_norm_epoch, nnlm_50_dim_norm_acc, nnlm_50_dim_norm_val_acc, _, _) = plot_history(nnlm_50_dim_norm_history)

In [None]:
(nnlm_128_dim_norm_epoch, nnlm_128_dim_norm_acc, nnlm_128_dim_norm_val_acc, _, _) = plot_history(nnlm_128_dim_norm_history)

There are four entries: one for each monitored metric during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy:

In [None]:
def plot_accuracy(epochs, acc, val_acc):
  plt.clf()   # clear figure
  plt.plot(epochs, acc, 'bo', label='Training acc')
  plt.plot(epochs, val_acc, 'b', label='Validation acc')
  plt.title('Training and validation accuracy')
  plt.xlabel('Epochs')
  plt.ylabel('Accuracy')
  plt.legend()

  plt.show()

In [None]:
plot_accuracy(universal_sen_epoch, universal_sen_acc, universal_sen_val_acc)

In [None]:
plot_accuracy(nnlm_50_dim_norm_epoch, nnlm_50_dim_norm_acc, nnlm_50_dim_norm_val_acc)

In [None]:
plot_accuracy(nnlm_128_dim_norm_epoch, nnlm_128_dim_norm_acc, nnlm_128_dim_norm_val_acc)

In these plots, the dots represent the training loss and accuracy, and the solid lines represent the validation loss and accuracy.

Notice that the training loss *decreases* with each epoch and the training accuracy *increases* with each epoch. This is expected when using gradient descent optimization: it should minimize the desired quantity on every iteration.
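
The steady decrease is easiest to see on a toy problem. The sketch below runs plain gradient descent on a one-parameter quadratic, which is an illustrative assumption and far simpler than the models above; with mini-batches and the Adam optimizer, the real training loss may be noisier.

```python
def gradient_descent(start, learning_rate, steps):
    """Minimize f(w) = (w - 3)^2 and record the loss after every update."""
    w = start
    losses = []
    for _ in range(steps):
        grad = 2 * (w - 3)            # derivative of (w - 3)^2
        w -= learning_rate * grad
        losses.append((w - 3) ** 2)
    return w, losses

w, losses = gradient_descent(start=0.0, learning_rate=0.1, steps=10)
print([round(loss, 4) for loss in losses])
```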