# **Business problem**

My company is a shipping company that ships goods for different companies and customers. A lot of email is exchanged internally and with customers: the tracking department handles more than 1,000 emails a day. Some of these emails are unimportant, yet they still take time to read. The time spent on them delays responses to customers, who are complaining about it. As the data scientist in the company, I will design a system that makes it easier to categorize emails and separate the important ones from the unimportant ones.

**Benefits to the company**

*   Faster responses to customers, which will improve the image of the company.
*   Increased productivity.
*   Flagging spam can help block fraud attempts.
*   Improved customer satisfaction.


**Formulate as NLP**

Statistics and Python skills will be used. The problem is approached as a text classification task: the text is preprocessed, then a traditional model and a deep learning model are trained, tested and evaluated. The same approach can be used in messaging platforms.


The data is taken from Kaggle (see Reference).

# **Components of NLP system**

The main components of this task are the following:

*   Text preprocessing, which involves tokenization, converting letters to lower case, stopword removal and lemmatization. This step is essential because it cleans the raw data for analysis.
*   Text representation, which transforms the cleaned data into numerical features for the traditional and deep learning models.
*   A machine learning model (logistic regression) and a deep learning model (LSTM), which are trained to predict whether an email is spam or not spam.





# **Loading libraries**

In [None]:
!pip install nltk

import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dropout, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Download the NLTK resources used by the tokenizer, stopword list and lemmatizer
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

The libraries are loaded and the required NLTK resources are downloaded.

# **Load dataset**

In [None]:
data = pd.read_csv("/content/Spam Email raw text for NLP.csv")

In [None]:
print(data.columns)

Index(['CATEGORY', 'MESSAGE', 'FILE_NAME'], dtype='object')


In [None]:
print(data.shape)

(5796, 3)


The data is loaded. It contains 5,796 rows and 3 columns: `CATEGORY` (the label), `MESSAGE` (the email text) and `FILE_NAME`.

# **Text preprocessing**

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

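def preprocess_text(text):
    """Lowercase, strip non-letters, drop stopwords and lemmatize one message."""
    try:
        # Lowercase, then delete everything except letters and whitespace
        text = re.sub(r'[^a-z\s]', '', text.lower())
        tokens = word_tokenize(text)
        # Remove common English stopwords
        tokens = [word for word in tokens if word not in stop_words]
        # Reduce each remaining word to its dictionary form
        tokens = [lemmatizer.lemmatize(word) for word in tokens]

        return ' '.join(tokens)
    except AttributeError:
        # Non-string values (e.g. NaN) have no .lower(); treat them as empty text
        print(f"Skipping non-string value: {text}")
        return ""

data['cleaned_data_text'] = data['MESSAGE'].apply(preprocess_text)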


This converts all uppercase letters to lowercase and removes non-alphabetic characters. The text is then split into tokens and stopwords are removed. Finally, lemmatization reduces each word to its root form.

This process is important because it cleans the data, removes irrelevant words and puts the text in a standard form for better model training.

In [None]:
print(data[['MESSAGE', 'cleaned_data_text']].head())

                                             MESSAGE  \
0  Dear Homeowner,\n\n \n\nInterest Rates are at ...   
1  ATTENTION: This is a MUST for ALL Computer Use...   
2  This is a multi-part message in MIME format.\n...   
3  IMPORTANT INFORMATION:\n\n\n\nThe new domain n...   
4  This is the bottom line.  If you can GIVE AWAY...   

                                   cleaned_data_text  
0  dear homeowner interest rate lowest point year...  
1  attention must computer user newspecial packag...  
2  multipart message mime format nextpartcdccbfa ...  
3  important information new domain name finally ...  
4  bottom line give away cd free people like one ...  
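
One detail visible in the output above is that tokens such as `newspecial` and `multipart` appear as single words. A likely cause is that the regular expression deletes punctuation outright, so hyphenated words are fused together. The sketch below reproduces this with the same pattern and, as an assumption rather than something this notebook does, shows the effect of substituting a space instead. The helper name `strip_non_letters` and the sample sentence are made up for illustration.

```python
import re

def strip_non_letters(text: str, replacement: str = "") -> str:
    """Lowercase the text and replace every character that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", replacement, text.lower())

sample = "NEW-Special package: 50% off!"  # hypothetical example sentence

# Same pattern as preprocess_text: punctuation is deleted, so words fuse together
joined = strip_non_letters(sample)

# Assumed alternative: replace with a space, then collapse repeated whitespace
spaced = " ".join(strip_non_letters(sample, " ").split())

print(joined.split())
print(spaced)
```

Whether the fused tokens hurt the classifier is not tested here; the sketch only shows where they come from.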


In [None]:
X = data['cleaned_data_text']
y = data['CATEGORY']

The data is split into features and labels, where X holds the cleaned text and y holds the labels.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Train size: {len(X_train)}")
print(f"Test  size: {len(X_test)}")


Train size: 4636
Test  size: 1160


The data is split into training (80%) and test (20%) sets.
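
The sizes printed above can be checked by hand. The short calculation below assumes that, for a fractional `test_size`, scikit-learn rounds the test count up and assigns the remaining rows to training; the variable names are only for illustration.

```python
import math

n_samples = 5796   # number of rows reported by data.shape
test_size = 0.2

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Train size: {n_train}")
print(f"Test  size: {n_test}")
```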

In [None]:
vectorizer = CountVectorizer(max_features=5000)

X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

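This converts the data into a numeric bag-of-words representation, which captures how frequently each word appears. The vocabulary is limited to the 5,000 most frequent words and is learned from the training set only. This representation is used by the traditional method.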
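
To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. It is a simplification of what `CountVectorizer` does (plain whitespace splitting, no `max_features` limit), and the two training sentences and the helper `to_counts` are invented for illustration.

```python
from collections import Counter

# Hypothetical, already-cleaned training messages
train_docs = ["free cd give away free", "track package today"]

# The vocabulary is learned from the training documents only
vocab = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def to_counts(doc: str) -> list[int]:
    """Return one count per vocabulary word; unseen words are ignored."""
    counts = Counter(word for word in doc.split() if word in index)
    return [counts.get(word, 0) for word in vocab]

print(vocab)
print(to_counts("free cd give away free"))
print(to_counts("free package unknown"))
```

Words that never occurred in the training data, such as `unknown` here, contribute nothing to the vector, which is also why the vectorizer is only fitted on `X_train`.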

# **Traditional method**

In [None]:
param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # liblinear supports both l1 and l2 penalties
}
log_reg = LogisticRegression()
grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_vectorized, y_train)

# Keep the refitted best model under its own name for the evaluation below
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

Best Parameters: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}


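For the traditional method, logistic regression is used. A grid search with 5-fold cross-validation is performed over the regularization strength C and the penalty type. This is done to find the parameters that give the best cross-validated accuracy before the model is evaluated on the test set.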
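
As a rough guide to the cost of this search, the sketch below enumerates the parameter combinations with the standard library and counts the model fits. It assumes the grid from the cell above and 5 folds, and it does not call scikit-learn itself.

```python
from itertools import product

param_grid = {
    "C": [0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}
cv_folds = 5

# Every combination of one value per parameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each combination is trained once per fold
n_cv_fits = len(combinations) * cv_folds

for combo in combinations:
    print(combo)
print(f"{len(combinations)} combinations x {cv_folds} folds = {n_cv_fits} fits")
```

With the default `refit=True`, one more fit on the whole training set follows to produce `best_estimator_`.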

In [None]:
y_pred = best_model.predict(X_test_vectorized)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

Test Accuracy: 0.9922


The model with the best parameters is then evaluated on the test set, using accuracy as the evaluation metric.

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)


[[756   6]
 [  3 395]]


A confusion matrix is computed to examine the performance in more detail. Rows are the true classes and columns are the predicted classes, and label 1 is treated as spam, as in the prediction cell later in this notebook. On that reading, the model correctly classified 756 emails as not spam and 395 emails as spam. The value 6 counts emails that are not spam but were predicted as spam (false positives), and the value 3 counts spam emails that were predicted as not spam (false negatives).

Since there are few false positives and false negatives, the model separates the two classes well.
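
The confusion matrix also gives precision and recall, which accuracy alone hides. The calculation below is a sketch that takes the four counts printed above and assumes label 1 (the second row and column) is the spam class.

```python
# Counts from the printed confusion matrix, rows = true class, columns = predicted class
cm = [[756, 6],
      [3, 395]]

tn, fp = cm[0]   # true class 0 (assumed: not spam)
fn, tp = cm[1]   # true class 1 (assumed: spam)

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

The accuracy computed this way agrees with the 0.9922 reported by `accuracy_score`.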

# **Deep learning method**

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(data['cleaned_data_text'])

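This builds a vocabulary that maps the words in the data to integer tokens for the deep learning model, limited to the most frequent words (num_words=5000). Note that the tokenizer is fitted on the full dataset, so the vocabulary also includes words from the test emails; fitting it on X_train only would keep the test set fully unseen.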

In [None]:
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

The training and test data are transformed into sequences of integers, one integer per word, which is the input format the LSTM model needs.
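
The mapping from words to integers can be sketched without Keras. The code below is an approximation under stated assumptions: indices start at 1 and follow descending word frequency, and only words whose index is below `num_words` are kept. The helper names and sample texts are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each known word by its index; drop unknown or too-rare words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

train_texts = ["free free free cd", "free cd package"]  # made-up cleaned messages
word_index = build_word_index(train_texts)
sequences = to_sequences(["package free cd unknown"], word_index, num_words=3)

print(word_index)
print(sequences)
```

In this sketch `package` has index 3, which is not below `num_words=3`, so it is dropped together with the unseen word `unknown`.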

In [None]:
max_length = max(len(seq) for seq in X_train_sequences)
X_train_sequences = pad_sequences(X_train_sequences, maxlen=max_length, padding='post')
X_test_sequences = pad_sequences(X_test_sequences, maxlen=max_length, padding='post')

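The training and test sequences are padded with zeros at the end up to the length of the longest training sequence. This gives all sequences the same length, so they can be stacked into arrays of a single shape.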
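
A pure-Python sketch of the padding step is shown below. It assumes the behaviour of `pad_sequences` with `padding='post'` and the default truncation, which drops tokens from the start of any sequence longer than `maxlen`; the function `pad_post` is an illustrative stand-in, not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end; keep only the last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # truncate from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

example = [[5, 2], [7, 1, 4, 9, 3]]
result = pad_post(example, maxlen=4)
print(result)
```

Because `max_length` is taken from the training sequences, a longer test email would be truncated rather than padded.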

In [None]:
y_train = np.array(y_train)
y_test = np.array(y_test)


The labels are converted to numpy arrays.

In [None]:
model = Sequential([
    Input(shape=(None,)),
    Embedding(input_dim=5000, output_dim=16),
    Bidirectional(LSTM(16, return_sequences=False)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()


An LSTM model is used for training. The model has an embedding layer with a vocabulary of 5,000 words, where each word is represented as a vector of size 16. The bidirectional layer runs one LSTM with 16 units in each direction, giving a 32-dimensional output, and returns only the last hidden state. Dropout is applied to reduce overfitting, followed by a dense layer with a single sigmoid unit, since the task is to classify emails as spam or not spam.
The model is then compiled with the binary cross-entropy loss, which is used for binary classification tasks.
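
The layer sizes in the code determine how many trainable parameters the model has. The arithmetic below is a sketch that assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and one bias vector; the printed total is a hand calculation, not output copied from `model.summary()`.

```python
vocab_size = 5000      # Embedding input_dim
embedding_dim = 16     # Embedding output_dim
lstm_units = 16        # units per direction

embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction

# The two directions are concatenated before the dense layer
bilstm_output_dim = 2 * lstm_units
dense_params = bilstm_output_dim * 1 + 1   # weights plus one bias

total_params = embedding_params + bilstm_params + dense_params
print(f"Embedding: {embedding_params}")
print(f"Bidirectional LSTM: {bilstm_params}")
print(f"Dense: {dense_params}")
print(f"Total: {total_params}")
```

Almost all parameters sit in the embedding table; the dropout layer adds none.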

In [None]:
model.fit(X_train_sequences, y_train, epochs=5, batch_size=32, validation_split=0.2)

Epoch 1/5
116/116 ━━━━━━━━━━━━━━━━━━━━ 1591s 13s/step - accuracy: 0.7204 - loss: 0.5801 - val_accuracy: 0.9688 - val_loss: 0.1962
Epoch 2/5
116/116 ━━━━━━━━━━━━━━━━━━━━ 1530s 13s/step - accuracy: 0.9858 - loss: 0.1535 - val_accuracy: 0.9806 - val_loss: 0.0876
Epoch 3/5
116/116 ━━━━━━━━━━━━━━━━━━━━ 1532s 13s/step - accuracy: 0.9921 - loss: 0.0761 - val_accuracy: 0.9720 - val_loss: 0.0887
Epoch 4/5
116/116 ━━━━━━━━━━━━━━━━━━━━ 1573s 13s/step - accuracy: 0.9970 - loss: 0.0465 - val_accuracy: 0.9828 - val_loss: 0.0724
Epoch 5/5
116/116 ━━━━━━━━━━━━━━━━━━━━ 1550s 13s/step - accuracy: 0.9970 - loss: 0.0329 - val_accuracy: 0.9828 - val_loss: 0.0709


<keras.src.callbacks.history.History at 0x7dabf04c0d60>

After 5 epochs the training accuracy is 0.9970 and the validation accuracy is 0.9828.

# **Evaluation**

In [None]:
loss, accuracy = model.evaluate(X_test_sequences, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")


[1m37/37[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 742ms/step - accuracy: 0.9826 - loss: 0.0690
Test Loss: 0.0530
Test Accuracy: 0.9862


The model is evaluated on the test data, giving a test accuracy of 0.9862 and a test loss of 0.0530.
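
The step counts in the progress bars (116 per training epoch, 37 for evaluation) follow from the data sizes and the batch size. The check below assumes that `validation_split=0.2` holds out the last 20% of the training rows with the training count rounded down, and that evaluation uses the default batch size of 32.

```python
import math

def steps(n_samples: int, batch_size: int = 32) -> int:
    """Number of batches needed to cover n_samples."""
    return math.ceil(n_samples / batch_size)

n_train = 4636            # training rows after train_test_split
n_test = 1160             # test rows after train_test_split
validation_split = 0.2

n_fit = int(n_train * (1 - validation_split))   # rows actually used for fitting
n_val = n_train - n_fit                         # rows held out for validation

train_steps = steps(n_fit)
test_steps = steps(n_test)

print(f"Fit on {n_fit} rows, validate on {n_val} rows")
print(f"Training steps per epoch: {train_steps}")
print(f"Evaluation steps: {test_steps}")
```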

# **Model prediction**

In [None]:
message = ["I cannot track my package."]
message_seq = tokenizer.texts_to_sequences(message)
message_padded = pad_sequences(message_seq, maxlen=max_length, padding='post')

# predict returns an array of shape (1, 1); take the single probability
prediction = model.predict(message_padded)[0][0]
print(f"Prediction: {'Spam' if prediction > 0.5 else 'Not Spam'}")


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 390ms/step
Prediction: Not Spam


Using the trained model, the message above is predicted as not spam because the predicted probability is not above 0.5.

# **Conclusion**

**Overall Pipeline**

This is a text classification task. It involves data loading, text preprocessing, model building and training using a traditional method (logistic regression) and a deep learning method (LSTM), followed by testing, evaluating and predicting. This is possible using sklearn, nltk and TensorFlow.

**Limitations**

*   Training the LSTM model is time-consuming: about 13 seconds per step, or roughly 26 minutes per epoch.

**Strengths**

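*   Logistic regression is simple and fast, and it reached the higher test accuracy here (0.9922).
*   The LSTM takes word order into account and also performed well (test accuracy 0.9862).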

**Implication**

*   This shows that emails can be separated into spam and not spam, which will reduce the time wasted on reading every email.
*   Fewer complaints from customers about delayed responses.

**Recommendation**

Gather old company emails and label them as important or not important. Then preprocess them and train a model on this data, following the same pipeline.

**Reference**

https://www.kaggle.com/datasets/chandramoulinaidu/spam-classification-for-basic-nlp/data