# Neural Network SMS Text Classifier
### Nhlapo Nkululeko

<hr>

### Instructions

Welcome to the SMS Spam Classification project, designed to detect and classify spam messages using machine learning techniques. This project is a culmination of efforts to enhance communication security and efficiency through advanced data analysis and modeling techniques.




*   The primary objective of this project was to develop a robust machine learning model capable of accurately distinguishing between "ham" (legitimate, normal) and "spam" messages in SMS communication.
*   It uses the SMS Spam Collection dataset provided by FreeCodeCamp and is part of the FreeCodeCamp Machine Learning certification.

**Scroll down and Explore, it is worth the Time.**


Okay, let us begin. We will start by importing all the necessary libraries.

In [None]:
# Install the dependencies first (only needed if they are not already available)
!pip install tensorflow
!pip install tensorflow-datasets

import pandas as pd
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Embedding, Dense
from tensorflow.keras.callbacks import EarlyStopping

Load the presplit data set

In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

Next we load the data into a DataFrame and label our columns Class and Message. Then we inspect the data with the .head() and .info() methods.

In [None]:
# Import the training data from a tab-separated file
# The file is read into a pandas DataFrame with no header row
# The columns are named 'Class' (label) and 'Message' (text)
dfTrain = pd.read_csv(train_file_path, sep="\t", header=None, names=['Class', 'Message'])

# Display the first 5 rows of the DataFrame to inspect its structure and content
dfTrain.head()

In [None]:
dfTrain.info()

The same thing is done with the Test Data

In [None]:
# Import the test data from a tab-separated file
# The file is read into a pandas DataFrame with no header row
# The columns are named 'Class' (label) and 'Message' (text)
dfTest = pd.read_csv(test_file_path, sep="\t", header=None, names=['Class', 'Message'])

# Display the first 5 rows of the DataFrame to inspect its structure and content
dfTest.head()

Knowing the sizes of these datasets tells us how much data we have for training the model and how much will be used to evaluate it. So next we print the number of rows in each DataFrame, to see just how much we are dealing with.

In [None]:
# Print the number of rows (examples) in the training dataset
print(len(dfTrain))

# Print the number of rows (examples) in the testing dataset
print(len(dfTest))

>Remember **Ham = Normal Message**, **Spam is a SCAM MESSAGE**

Next we move on to prepare the **training and testing data for a classification task where the goal is to classify SMS messages into "ham" (0) or "spam" (1)**. This is how we want our data to be transformed:

**train_message will contain:** ['Hello, how are you?', 'Win money now!', 'Are you coming to the party?', 'You've won a free gift card!', 'Let's catch up soon.']
   
**train_label will be:** **[0, 1, 0, 1, 0]** (converted from "ham" and "spam" to 0 and 1 respectively)

In [None]:
# Remember, dfTrain and dfTest are DataFrames with columns 'Class' and 'Message',
# and both have already been loaded from the TSV files.

# Extract messages and labels from the training data
train_message = dfTrain["Message"].values.tolist()  # Convert the 'message' column to a list
train_label = np.array([0 if x == "ham" else 1 for x in dfTrain['Class'].values.tolist()])  # Create labels as binary array

# Extract messages and labels from the testing data
test_message = dfTest["Message"].values.tolist()  # Convert the 'message' column to a list
test_label = np.array([0 if x == "ham" else 1 for x in dfTest['Class'].values.tolist()])  # Create labels as binary array
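
As a quick sanity check, the same mapping can be tried on the toy messages from the example above. This is only a minimal sketch: `toy_classes` and `toy_messages` are made-up names and values for illustration, not rows from the dataset.

```python
import numpy as np

# Hypothetical toy data mirroring the example in the text (not from the dataset)
toy_classes = ["ham", "spam", "ham", "spam", "ham"]
toy_messages = [
    "Hello, how are you?",
    "Win money now!",
    "Are you coming to the party?",
    "You've won a free gift card!",
    "Let's catch up soon.",
]

# Same rule as the cell above: "ham" becomes 0, anything else becomes 1
toy_labels = np.array([0 if c == "ham" else 1 for c in toy_classes])

print(toy_labels.tolist())  # [0, 1, 0, 1, 0]
```

Note that the rule treats every value other than "ham" as spam, so a misspelled or missing class would silently end up as 1.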

Our labels are now binary: each message is either ham (0) or spam (1). Next, I want to add the words in our training data to a vocabulary dict.

We are going to build a **vocabulary dictionary** from the words present in the train_message list.
The vocabulary dict counts each word's frequency, which is useful for natural language processing tasks such as text classification or sentiment analysis. Here we mainly need it to find out how many unique words there are.

In [None]:
# Initialize an empty dictionary to store the vocabulary and its frequencies
vocabulary_dict = {}

# Iterate through each message in the training data
for message in train_message:
    # Split the message into words and iterate through each word
    for word in message.split():
        # Check if the word is already in the vocabulary dictionary
        if word not in vocabulary_dict:
            # If the word is not in the dictionary, add it with a frequency of 1
            vocabulary_dict[word] = 1
        else:
            # If the word is already in the dictionary, increment its frequency by 1
            vocabulary_dict[word] += 1
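
The same counting can be sketched more compactly with `collections.Counter` from the standard library. The `toy_messages` list below is a made-up example, not dataset content.

```python
from collections import Counter

# Hypothetical toy messages (not from the dataset)
toy_messages = ["win money now", "call me now", "win a prize now"]

# Count every whitespace-separated token across all messages
toy_vocabulary = Counter(
    word for message in toy_messages for word in message.split()
)

print(toy_vocabulary["now"])  # 3
print(toy_vocabulary["win"])  # 2
print(len(toy_vocabulary))    # 7
```

Because `split()` only splits on whitespace, it is case-sensitive and keeps punctuation, so "Win" and "win!" would be counted as separate words.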

Good. Next I want us to define two variables, vocab size and max length. These variables are fundamental for preparing text data: the first sets how many distinct word indices the model has to handle, and the second sets the fixed length every message is padded to.

In [None]:
# Calculate the vocabulary size by determining the number of unique words in the training data vocabulary_dict
VOCAB_SIZE = len(vocabulary_dict)

# Determine the maximum length of messages in terms of word count from the training data train_message
MAX_LENGTH = max(len(message.split()) for message in train_message)
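
To make the two quantities concrete, here is a small sketch on made-up messages. The `toy_` names are hypothetical and only for illustration.

```python
# Hypothetical toy messages (not from the dataset)
toy_messages = ["win money now", "call me", "are you coming to the party"]

# Vocabulary size: number of unique whitespace-separated words
toy_vocab = {word for message in toy_messages for word in message.split()}
toy_vocab_size = len(toy_vocab)

# Max length: word count of the longest message
toy_max_length = max(len(message.split()) for message in toy_messages)

print(toy_vocab_size)  # 11
print(toy_max_length)  # 6
```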

The next stop is to encode both our training messages and test messages into integers. Remember, models do not handle words; they work with numbers. So we are going to **encode** and **convert each message in train_message into a sequence of integers** using the **one_hot function** based on VOCAB_SIZE. Despite its name, this function does not build one-hot vectors: it hashes each word to an integer index below VOCAB_SIZE, so each message becomes a sequence of word indices.

Then we will **pad** each encoded message sequence (encoded_train_message) to a length of MAX_LENGTH. **Padding** ensures all sequences have the same length, which batch processing requires, whether the model is a simple dense network like ours or a sequence model such as an RNN or CNN.

In [None]:
# Encode each message in train_message into a sequence of integers based on VOCAB_SIZE
encoded_train_message = [one_hot(d, VOCAB_SIZE) for d in train_message]

# Pad each encoded message sequence to MAX_LENGTH to ensure uniform input size
padded_train_message = pad_sequences(encoded_train_message, maxlen=MAX_LENGTH, padding='post')

# Encode each message in test_message similarly to train_message
encoded_test_message = [one_hot(d, VOCAB_SIZE) for d in test_message]

# Pad each encoded test message sequence to MAX_LENGTH for consistency in input size
padded_test_message = pad_sequences(encoded_test_message, maxlen=MAX_LENGTH, padding='post')
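
To show what encoding and padding do without needing TensorFlow, here is a pure-Python sketch of the idea. It is not the Keras implementation: `toy_one_hot` and `toy_pad_post` are hypothetical helpers, and MD5 is used only to get a hash that is stable between runs. As far as I understand, the real `one_hot` also strips punctuation and relies on Python's built-in `hash` by default, which can give different indices in different sessions.

```python
import hashlib

def toy_one_hot(text, n):
    """Map each word to an integer in [1, n - 1] by hashing (illustrative only)."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

def toy_pad_post(sequence, maxlen):
    """Keep at most the last `maxlen` values, then right-pad with zeros."""
    trimmed = sequence[-maxlen:]
    return trimmed + [0] * (maxlen - len(trimmed))

# Hypothetical sizes, for illustration only
encoded = toy_one_hot("win money now", 50)
padded = toy_pad_post(encoded, 6)

print(len(encoded))  # 3
print(len(padded))   # 6
print(padded[3:])    # [0, 0, 0]
```

Two things are worth noticing. Index 0 is never produced by the hashing step, so it is free to mean "padding". And because hashing maps many words onto a limited range, two different words can end up sharing the same index.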

And now we have reached my favourite part, where we **create and build** the model. We are going to build a neural network model. We first define it as **Sequential**, where layers are added one after another, then:

*   We define and add the Embedding layer, which converts the input sequences into dense vectors of fixed size.

*   Next we define and add the Flatten layer. This layer flattens the 2D output from the Embedding layer into a 1D vector, making it suitable for the Dense layer.

*   We then add the FINAL LAYER, a Dense layer that outputs a single value with a sigmoid activation function for binary classification. This is where our model decides whether the message is ham or spam.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Define the model as a Sequential model
model = Sequential()

# Add an Embedding layer to the model
embedding_layer = Embedding(VOCAB_SIZE, 100, input_length=MAX_LENGTH)
model.add(embedding_layer)

# Add a Flatten layer to flatten the input from the embedding layer
model.add(Flatten())

# Add a Dense layer with a single neuron and sigmoid activation for binary classification
model.add(Dense(1, activation='sigmoid'))
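
To get a feel for how big this model is, the number of trainable parameters can be worked out by hand. The sketch below uses made-up sizes (`vocab_size=10_000`, `max_length=50`), not the values computed from the dataset, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, embedding_dim, max_length):
    """Parameter counts for Embedding -> Flatten -> Dense(1)."""
    # One vector of size embedding_dim per word index
    embedding_params = vocab_size * embedding_dim
    # Flatten has no parameters; it only reshapes to max_length * embedding_dim values
    flattened_size = max_length * embedding_dim
    # One weight per flattened value, plus one bias
    dense_params = flattened_size + 1
    return embedding_params, dense_params

# Hypothetical sizes, for illustration only
emb, dense = count_parameters(vocab_size=10_000, embedding_dim=100, max_length=50)

print(emb)    # 1000000
print(dense)  # 5001
```

Almost all of the parameters sit in the Embedding layer, so the vocabulary size has the biggest effect on model size.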

We now have our neural network built and stacked with layers.
The **next step** is to **compile** the model with the **Adam optimizer** for training, the **binary cross-entropy loss function** for binary classification, and the **accuracy metric** to evaluate the model. The purpose of compilation is to **configure the model** for training. I also thought it wise to add an **EarlyStopping callback**, so that training **stops** when the **validation accuracy stops improving, which helps prevent overfitting.**

In [None]:
# Compile the model with Adam optimizer, binary cross-entropy loss, and accuracy metric
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# Define EarlyStopping to monitor validation accuracy and stop training when it stops improving
monitor = EarlyStopping(monitor='val_acc', min_delta=1e-4, patience=25, verbose=1, mode='max', restore_best_weights=True)
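
The patience logic can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras source: `early_stopping_epoch` is a hypothetical helper and the scores are made up.

```python
def early_stopping_epoch(val_scores, min_delta=1e-4, patience=25):
    """Return (stop_epoch, best_epoch), 0-indexed, where higher scores are better."""
    best_score = float("-inf")
    best_epoch = -1
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score + min_delta:
            # Real improvement: remember it and reset the counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            # No improvement: count this epoch against the patience budget
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without exhausting patience
    return len(val_scores) - 1, best_epoch

# Made-up validation accuracies, with a small patience for readability
stop_epoch, best_epoch = early_stopping_epoch(
    [0.90, 0.95, 0.94, 0.95, 0.93], patience=2
)

print(stop_epoch)  # 3
print(best_epoch)  # 1
```

With `restore_best_weights=True`, the model goes back to the weights from the best epoch rather than keeping those from the epoch where training stopped.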

Yay! Our model is now built and compiled, **so we are ready to fit (train) it.**

In [None]:
# Fit the model on the training data, with validation on the test data
# Use EarlyStopping to prevent overfitting
model.fit(padded_train_message, train_label, validation_data=(padded_test_message, test_label), callbacks=[monitor], epochs=1000, verbose=2)
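
The `loss` value printed for each epoch is the binary cross-entropy. A minimal sketch of that formula in plain Python follows, assuming labels of 0 or 1 and predicted spam probabilities. `binary_cross_entropy` is a hypothetical helper and the numbers are made up.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip to avoid log(0)
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident and correct predictions give a small loss
confident = binary_cross_entropy([0, 1], [0.1, 0.9])
# Confident but wrong predictions give a large loss
wrong = binary_cross_entropy([0, 1], [0.9, 0.1])

print(round(confident, 4))  # 0.1054
print(round(wrong, 4))      # 2.3026
```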

Nice, we now have a fully trained model. The only thing left before testing is to DEFINE a function, predict_message, that is **designed to predict whether a given SMS message is "ham" or "spam"** using the trained model. The function encodes and pads the input message, **uses the model to make a prediction**, and then maps the prediction to the corresponding class label (ham/spam).

*This is why we build models, right? To make USEFUL PREDICTIONS AND MORE!!*

In [None]:
import numpy as np
# Function to predict messages based on model
def predict_message(pred_text):
    class_dict = {
        0: "ham",  # Map 0 to 'ham'
        1: "spam"  # Map 1 to 'spam'
    }

    # Encode the input message into a sequence of integers based on VOCAB_SIZE
    encoded_message = [one_hot(pred_text, VOCAB_SIZE)]

    # Pad the encoded message sequence to MAX_LENGTH to ensure uniform input size
    padded_message = pad_sequences(encoded_message, maxlen=MAX_LENGTH, padding='post')

    # Predict the probability of the message being 'spam'
    prediction_prob = model.predict(padded_message)[0][0]

    # Convert the probability to a percentage
    prediction_percentage = prediction_prob * 100

    # Determine the predicted class ('ham' or 'spam') based on the probability
    # Cast to int so the dictionary lookup uses a plain 0 or 1 key rather than a NumPy float
    predicted_class = class_dict[int(np.round(prediction_prob))]

    # Create a nice message
    if predicted_class == "ham":
        message = f"The message is likely 'ham' with a probability of {100 - prediction_percentage:.2f}%."
    else:
        message = f"The message is likely 'spam' with a probability of {prediction_percentage:.2f}%."

    # Return the probability percentage, predicted class, and the message
    return [prediction_percentage, predicted_class, message]
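
The decision step inside the function can be tried on its own, without the model. The sketch below takes a made-up spam probability and applies the same idea: above 0.5 means spam, otherwise ham, and the reported confidence is for the chosen class. `describe_prediction` is a hypothetical helper, not part of the notebook.

```python
def describe_prediction(spam_probability):
    """Turn a spam probability in [0, 1] into (label, confidence in percent)."""
    label = "spam" if spam_probability > 0.5 else "ham"
    # For ham, the confidence is the complement of the spam probability
    confidence = spam_probability if label == "spam" else 1 - spam_probability
    return label, round(confidence * 100, 2)

print(describe_prediction(0.03))  # ('ham', 97.0)
print(describe_prediction(0.92))  # ('spam', 92.0)
```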

Our function is defined and ready to take in messages, just as we are ready to TEST, TEST AND TEST :)

In [None]:
# Example input message to predict
pred_text = "wow, is your arm alright. that happened to me one time too"

# Get the prediction for the input message
prediction = predict_message(pred_text)

# Print the prediction result
print(prediction)

We have tested our model and it works. It gives us the probability as a percentage and tells us whether the message is ham (not spam) or spam.

Let's run more tests and check whether we are indeed successful.

In [None]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
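  failures = []

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      # Keep track of what went wrong so the report matches the actual result
      failures.append((msg, ans, prediction[1]))

  total = len(test_messages)
  if not failures:
    print(f"You passed the challenge, {total}/{total} predictions. Great job!")
  else:
    # In the original run this was 6/7: the "sale today!" spam message was classified as ham
    print(f"The model got {total - len(failures)}/{total} predictions correct.")
    for msg, ans, got in failures:
      print(f"  Expected '{ans}' but got '{got}' for: {msg}")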

test_predictions()