# **Introduction**

Welcome to the mathematical ballet of Sentiment Analysis and Binary Text Classification, where algorithms tango with text to quantify the nuances of human emotions.

Sentiment Analysis, with its roots in natural language processing and machine learning, employs mathematical models to analyze and classify the sentiment of text. Imagine a function, f(text) = sentiment, where the input text is transformed into a numerical representation of emotion. This sentiment score, often ranging from 0 to 1, becomes the quantitative measure of the textual emotional landscape.

Binary Text Classification, being the binary virtuoso in this symphony, utilizes mathematical thresholds to categorize text into discrete sentiments. Let's introduce a decision boundary: if f(text) > 0.5, it's positive sentiment (1); if f(text) ≤ 0.5, it's negative sentiment (0). In this binary arithmetic, emotions become bits, and algorithms become discerning mathematicians.
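
The decision rule above can be sketched in a few lines of plain Python. This is only an illustration: the logits are made-up values, and `sigmoid` and `classify` are hypothetical helper names, not part of the notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(score: float, threshold: float = 0.5) -> int:
    """Return 1 (positive) if score > threshold, else 0 (negative)."""
    return 1 if score > threshold else 0

# Hypothetical raw model outputs (logits) for three texts.
for logit in (-2.0, 0.0, 3.0):
    p = sigmoid(logit)
    print(f"logit={logit:+.1f}  f(text)={p:.3f}  label={classify(p)}")
```

Note that a score of exactly 0.5 falls on the negative side, matching the `≤ 0.5` branch of the rule.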

But let's not forget the behind-the-scenes magicians – neural networks. Picture sentiment analysis as the neural network wizardry where layers of mathematical transformations weigh the significance of each word, adjusting numerical weights to fine-tune the sentiment prediction.

In this numerical ballroom, Sentiment Analysis and Binary Text Classification waltz through vast datasets, performing a mathematically precise dance to unveil the emotional arithmetic encoded in every sentence. So, buckle up for a mathematical journey where the language of emotions meets the precision of algorithms, turning sentiments into elegant equations, one computational step at a time.

# **Imports & Set Up**

Let's import the libraries the notebook needs to run smoothly, and set up a few constants that will be required later.

In [1]:
# Main Libraries
import numpy as np
import tensorflow as tf

# Data set
import string
import pandas as pd
import plotly.express as px
import tensorflow.data as tfd
from collections import Counter
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

# NN
from tensorflow.keras import Sequential
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.layers import TextVectorization, Dropout, Activation
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

# Metrics
from sklearn.metrics import accuracy_score

# Constants 
MAX_SEQ_LEN = 200
MAX_TOKENS = 10000
EMBEDDING_DIMS = 16

stop_words = set(stopwords.words('english'))


# **Spam Classification Dataset**

It's time to load and incorporate our dataset. This section will also cover the Data Analysis step via Data Visualization.

In [2]:
# Set the file path
file_path = "/kaggle/input/spam-text-message-classification/SPAM text message 20170820 - Data.csv"

# Load the file
df = pd.read_csv(file_path)

# Quick Look
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# Let's get the text length (in characters) to estimate MAX_SEQ_LEN
df["text_lens"] = df.Message.map(len)

# Histogram
hist = px.histogram(df, x="text_lens", title="Text Lengths - Histogram", color="Category", barmode="group")
hist.update_xaxes(range=[0, 200], title="Text Lengths")
hist.update_yaxes(title="Frequency Counts")
hist.show()

# Box Plot
box = px.box(df, x="text_lens", title="Text Lengths - Box Plot", color="Category")
box.update_xaxes(range=[0, 200], title="Text Lengths")
box.show()

The box plot and histogram show a clear relationship between message length and class. Ham messages are concentrated at the short end of the range, mostly at or below roughly 100 characters, and because ham dominates the dataset, this pattern also dominates the overall distribution.

Conversely, messages longer than about 100 characters tend to be spam. Both plots also show numerous outliers with very long texts; these are mostly long but legitimate messages, with the occasional spam message among them.

In [4]:
# Pie Plot
pie = px.pie(df, "Category", hole=0.4, title="Class Distribution")
pie.show()

The dataset is heavily imbalanced: only about 13% of the samples are "spam," while about 86% are "ham." This makes accuracy a misleading metric on its own, since a classifier that always predicts "ham" would already be right about 86% of the time. Combined with the length pattern seen earlier, it also suggests that a simple classifier relying solely on text length might look satisfactory. A prudent first step is therefore to establish a baseline classifier, here a linear one trained exclusively on the text length feature, as a benchmark for evaluating the later model. The imbalance also underscores the value of rebalancing strategies, such as resampling, class weighting, or data augmentation, so that the final model is not only accurate overall but also fair to the minority "spam" class.
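
One common rebalancing strategy is class weighting. The sketch below computes "balanced" weights with the formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn documents for `class_weight="balanced"`. The label counts are hypothetical and only mirror the rough proportions above; they are not the dataset's real counts.

```python
from collections import Counter

# Hypothetical label list that roughly mirrors the split described above.
labels = ["ham"] * 866 + ["spam"] * 134

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Share of each class in the data.
proportions = {c: k / n_samples for c, k in counts.items()}

# "Balanced" weights: rarer classes get proportionally larger weights.
class_weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}

print(f"Proportions:   {proportions}")
print(f"Class weights: {class_weights}")
```

Weights of this kind could, for instance, be passed to a model's training routine so that mistakes on spam cost more than mistakes on ham. Whether that helps here would need to be checked on the validation set.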

In [5]:
# Count the rows and the fully duplicated rows
# (len(df) counts rows; df.size would count rows x columns)
n_rows, dups = len(df), df.duplicated().sum()

# Pie plot
pie = px.pie(values=[n_rows - dups, dups], names=["Unique Rows", "Duplicate Rows"], hole=0.4, title="Duplicate vs Unique Rows")
pie.show()

Another noteworthy observation is that a small share of the rows, on the order of a few percent, are exact duplicates. While this share may seem modest, it matters in an already imbalanced dataset: redundant rows can skew the class proportions further, and a duplicate that lands in both the training and the test split inflates the measured accuracy.
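
A minimal sketch of how exact duplicates can be counted and dropped, written in plain Python so that it runs without the dataset. The rows are hypothetical; on the real DataFrame, `df.drop_duplicates()` does the equivalent and keeps the first occurrence of each row.

```python
# Hypothetical (category, message) rows; the third row repeats the first.
rows = [
    ("ham", "Ok lar... Joking wif u oni..."),
    ("spam", "Free entry in 2 a wkly comp"),
    ("ham", "Ok lar... Joking wif u oni..."),
    ("ham", "U dun say so early hor..."),
]

seen = set()
unique_rows = []
for row in rows:
    if row not in seen:  # keep only the first occurrence of each row
        seen.add(row)
        unique_rows.append(row)

n_duplicates = len(rows) - len(unique_rows)
print(f"{n_duplicates} of {len(rows)} rows are duplicates "
      f"({n_duplicates / len(rows):.1%})")
```

If duplicates are removed at all, doing so before the train/test split avoids the same message appearing on both sides of the split.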

In [6]:
# Extract words
words = []
for message in df.Message:
   
    message = message.translate(str.maketrans('', '', string.punctuation))
    words += [word for word in message.lower().split(" ") if word not in stop_words and len(word) > 1]

# Counter for words
counts = Counter(words)
top_100 = dict(counts.most_common(100))

# Histogram
hist = px.bar(x=top_100.keys(), y=top_100.values(), title="Word Count")
hist.update_yaxes(title="Counts")
hist.show()

Given the known bias in the dataset, it is anticipated that the most common 100 words present in the corpus predominantly belong to the "ham" category.
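
One way to check this expectation is to count words separately for each class. The sketch below does so on a hypothetical four-message corpus. It skips the stop-word filter used in the cell above so that it stays free of external dependencies.

```python
import string
from collections import Counter

# Hypothetical mini-corpus of (category, message) pairs.
corpus = [
    ("ham", "Ok, see you later then."),
    ("ham", "See you at home later?"),
    ("spam", "FREE entry! Call now to win a prize."),
    ("spam", "Win a FREE prize, call now!"),
]

strip_punct = str.maketrans("", "", string.punctuation)
counts = {"ham": Counter(), "spam": Counter()}

for category, message in corpus:
    tokens = message.translate(strip_punct).lower().split()
    # Drop single-character tokens, as the notebook cell does.
    counts[category].update(t for t in tokens if len(t) > 1)

for category, counter in counts.items():
    print(category, counter.most_common(3))
```

Comparing the two counters shows which frequent words are shared and which are specific to one class.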

# **Baseline Linear Classifier**

As stated before, let's create a basic **Linear Classifier**: a logistic regression trained on message length alone.

In [7]:
# Splitting into Feature and Target
classes = ["ham", "spam"]
X_lens, y = df.text_lens.to_numpy().reshape(-1, 1), df.Category.map(lambda x: classes.index(x))

# Splitting into training and testing sets
X_train_lens, X_test_lens, y_train, y_test = train_test_split(X_lens, y, stratify=y, shuffle=True, random_state=42)

# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train_lens, y_train)

# Model Evaluation
preds = lr.predict(X_test_lens)
print(f"Accuracy: {accuracy_score(y_test, preds)}")

Accuracy: 0.842067480258435


Even a basic linear classifier, with no feature engineering beyond message length, attains an accuracy of about 84%. Given the class imbalance, though, this number is less impressive than it looks: always predicting "ham" would score about 86%, so the baseline has not yet beaten the trivial majority-class rule. It still serves as a benchmark, and it makes clear that a more sophisticated model is needed, one that surpasses this figure despite the dataset's skewed distribution.
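
Because accuracy hides how the minority class is handled, it can help to also look at precision and recall for "spam". The sketch below is a plain-Python illustration on hypothetical labels, with a "model" that predicts ham every time. `precision_recall_f1` is a helper written for this example, not a library function.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = spam, 0 = ham. Every prediction is "ham".
y_true = [0] * 86 + [1] * 14
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is high while recall on spam is zero, which is exactly the failure mode an imbalanced dataset can mask.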

# **Text Standardization**

Text standardization, a pivotal step in natural language processing, covers preprocessing, tokenization, and vectorization. Keras's `TextVectorization` layer handles all three in one place, converting raw strings into integer sequences suitable for a machine learning model.

By default the layer lowercases the text, strips punctuation, and splits on whitespace, but it also accepts custom standardization and splitting functions, so the process can be tailored to the data at hand.
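
To make the three steps concrete, here is a rough plain-Python imitation of what such a layer does in integer output mode: index 0 is reserved for padding, index 1 for out-of-vocabulary words, and the remaining indices go to the most frequent words. It is a sketch for intuition only, not the layer's actual implementation; the function names, corpus, and sizes are all hypothetical.

```python
import string
from collections import Counter

def standardize(text: str) -> str:
    """Lowercase the text and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Index 0 is padding, 1 is out-of-vocabulary; then words by frequency."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i for i, w in enumerate(words, start=2)}

def vectorize(text, vocab, seq_len):
    """Map tokens to ids, then truncate or zero-pad to `seq_len`."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

# Hypothetical toy corpus and sizes, far smaller than the notebook's.
corpus = ["Free entry, free prize!", "Ok see you later"]
vocab = build_vocab(corpus, max_tokens=10)
print(vectorize("Free prize for you?", vocab, seq_len=6))
```

The word "for" never appears in the toy corpus, so it maps to the out-of-vocabulary id 1, and the short message is padded with zeros up to `seq_len`.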

In [8]:
# Initialize the layer
vectorize_layer = TextVectorization(
    max_tokens=MAX_TOKENS,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    output_mode='int',
    output_sequence_length=MAX_SEQ_LEN
)

# Adapt the Vectorization Layer
vectorize_layer.adapt(df.Message)

In [9]:
# Let's see this in action
texts = df.Message.sample(3)

for text in texts:
    print(f"Text: {text}")
    print(f"Vectorized: {vectorize_layer(text).numpy()}\n")

Text: No need to say anything to me. I know i am an outsider
Vectorized: [  40   78    2  142  173    2   11    3   56    3   64  117 6270    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    

# **TF Data**

Let's convert the data into TensorFlow `Dataset` objects for efficiency.

In [10]:
# Extracting texts and labels
texts, labels = vectorize_layer(df.Message).numpy(), df.Category.map(lambda x: classes.index(x))
X_train, X_test, y_train, y_test = train_test_split(texts, labels, train_size=0.9, test_size=0.1, random_state=42, stratify=labels, shuffle=True)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, train_size=0.9, test_size=0.1, random_state=42, shuffle=True)

# Data sets (cache before shuffling so the training order is reshuffled every epoch)
train_ds = tfd.Dataset.from_tensor_slices((X_train, y_train)).cache().shuffle(1000).batch(32).prefetch(tfd.AUTOTUNE)
valid_ds = tfd.Dataset.from_tensor_slices((X_valid, y_valid)).batch(32).cache().prefetch(tfd.AUTOTUNE)
test_ds = tfd.Dataset.from_tensor_slices((X_test, y_test)).batch(32).cache().prefetch(tfd.AUTOTUNE)

# **Neural Network**

Let's build a Neural Network based on a Word Embedding and a Dense Network.

In [11]:
# Initialize Model
model = Sequential([
    Embedding(MAX_TOKENS, EMBEDDING_DIMS),
    GlobalAveragePooling1D(),
    Dropout(0.4),
    Dense(128, activation='relu'),
    Dense(1)
])

# Compile Model
model.compile(
    loss=BinaryCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=['accuracy']
)

# Model Training
history = model.fit(
    train_ds,
    validation_data=valid_ds,
    epochs=50,
    callbacks=[
        EarlyStopping(patience=5, restore_best_weights=True)
    ]
)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50


In [12]:
# Model Evaluation
test_loss, test_acc = model.evaluate(test_ds)

print(f"\nTesting Loss: {test_loss}\nTesting Accuracy: {test_acc}")


Testing Loss: 0.08099470287561417
Testing Accuracy: 0.9749103784561157


In [13]:
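def predict(texts):
    """Return "ham" or "spam" for a single message."""
    # Convert the raw text to integer token ids
    texts = vectorize_layer(texts)
    # The model outputs a logit; sigmoid turns it into a spam probability
    prob = tf.squeeze(tf.nn.sigmoid(model(texts))).numpy()
    return classes[int(prob > 0.5)]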
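
The model ends in `Dense(1)` with no activation and is compiled with `BinaryCrossentropy(from_logits=True)`, so its raw output is a logit, not a probability. That is why `predict` applies a sigmoid before thresholding. The sketch below shows, in plain Python with hypothetical logits, that a numerically stable cross-entropy computed directly from the logit agrees with the textbook formula applied to the sigmoid probability. Whether Keras uses exactly this expression internally is an assumption, not something the notebook shows.

```python
import math

def sigmoid(z: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z: float, y: int) -> float:
    """Numerically stable binary cross-entropy for logit z and label y."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_from_prob(p: float, y: int) -> float:
    """Textbook binary cross-entropy for probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical (logit, label) pairs; label 1 = spam, 0 = ham.
for z, y in [(2.5, 1), (-1.0, 0), (0.3, 1)]:
    p = sigmoid(z)
    print(f"logit={z:+.1f} p(spam)={p:.3f} "
          f"loss={bce_from_logit(z, y):.4f} (check: {bce_from_prob(p, y):.4f})")
```

Working from logits avoids taking the logarithm of a probability that has rounded to 0 or 1, which is the usual reason for choosing `from_logits=True`.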

In [14]:
for i in range(20):
    sample = df.sample()
    pred, label = predict(sample.Message), sample.Category.to_numpy()[0]
    print(f"{i+1:2} -> Pred: {pred.title()} Label: {label.title()}")

 1 -> Pred: Ham Label: Ham
 2 -> Pred: Ham Label: Ham
 3 -> Pred: Ham Label: Ham
 4 -> Pred: Ham Label: Ham
 5 -> Pred: Ham Label: Ham
 6 -> Pred: Ham Label: Ham
 7 -> Pred: Ham Label: Ham
 8 -> Pred: Ham Label: Ham
 9 -> Pred: Ham Label: Ham
10 -> Pred: Spam Label: Spam
11 -> Pred: Ham Label: Ham
12 -> Pred: Ham Label: Ham
13 -> Pred: Ham Label: Ham
14 -> Pred: Ham Label: Ham
15 -> Pred: Ham Label: Ham
16 -> Pred: Ham Label: Ham
17 -> Pred: Ham Label: Ham
18 -> Pred: Ham Label: Ham
19 -> Pred: Ham Label: Ham
20 -> Pred: Ham Label: Ham


The model performs remarkably well, reaching a test accuracy of about 97%, well above both the 84% length-only baseline and the roughly 86% that always predicting "ham" would achieve. What makes this result more striking is the simplicity of the model: an embedding layer, average pooling, and two dense layers. Effectiveness need not always be synonymous with complexity.

----
**DeepNets**