
In [1]:
import pandas as pd
import numpy as np
import nltk # natural language toolkit for NLP
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [4]:
df = pd.read_csv("Spam Email raw text for NLP.csv") # large file, so loading can take a while

df.head()
# df["CATEGORY"].value_counts()

Unnamed: 0,CATEGORY,MESSAGE,FILE_NAME
0,1,"Dear Homeowner,\n\n \n\nInterest Rates are at ...",00249.5f45607c1bffe89f60ba1ec9f878039a
1,1,ATTENTION: This is a MUST for ALL Computer Use...,00373.ebe8670ac56b04125c25100a36ab0510
2,1,This is a multi-part message in MIME format.\n...,00214.1367039e50dc6b7adb0f2aa8aba83216
3,1,IMPORTANT INFORMATION:\n\n\n\nThe new domain n...,00210.050ffd105bd4e006771ee63cabc59978
4,1,This is the bottom line. If you can GIVE AWAY...,00033.9babb58d9298daa2963d4f514193d7d6


In the CATEGORY column, 0 means not spam and 1 means spam.

The dataset contains 3900 not-spam emails and 1896 spam emails.
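
These two counts also give a baseline to compare the models against: a classifier that always predicts "not spam" is already right about two thirds of the time. A quick check of that arithmetic, using only the counts quoted above:

```python
# Class counts quoted in the text above (0 = not spam, 1 = spam).
not_spam, spam = 3900, 1896
total = not_spam + spam

# Share of spam in the full dataset.
spam_share = spam / total
# Accuracy of a trivial classifier that always predicts the majority class (0).
majority_baseline = not_spam / total

print(total)                        # 5796
print(round(spam_share, 3))         # 0.327
print(round(majority_baseline, 3))  # 0.673
```

Any model trained on this data should beat roughly 0.67 accuracy to be doing better than that trivial rule.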

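I applied some standard text-preprocessing steps (tokenizing, lowercasing, lemmatizing, and removing stop words) to reduce each email to its informative words, which helps in classifying it as spam or not spam.

I used lemmatization instead of stemming because a lemmatizer maps each word to a dictionary form (for example, "feet" becomes "foot"), whereas a stemmer only strips suffixes. Note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so words like "going" and "badly" are left unchanged here. The cleaner tokens should be more useful in the machine learning part of this project.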

In [5]:
# keep the stop words under their own name so the imported `stopwords` module
# is not overwritten (overwriting it makes this cell fail when run a second time);
# a set also makes the membership test below fast
stop_words = set(stopwords.words('english'))

def tokenizer(s):
  # to tokenize, keep only runs of word characters, dropping brackets, commas, etc
  sentence_tokenizer = nltk.RegexpTokenizer(r"\w+")
  # lemmatizing instead of stemming
  lemmatizer = WordNetLemmatizer()

  # tokenize
  tokens = sentence_tokenizer.tokenize(s)
  # turn each token lowercase
  lowercased_tokens = [token.lower() for token in tokens]
  # lemmatize each lowercase token
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in lowercased_tokens]
  # keep a token only if it is not a stop word
  tokens = [token for token in lemmatized_tokens if token not in stop_words]

  return tokens

# testing out the above function with a random string
test_message = "HeY,, lMnOPq feet it going? <HTML>!bad? bads 'randoms' badly"
tokenizer(test_message)

['hey', 'lmnopq', 'foot', 'going', 'html', 'bad', 'bad', 'randoms', 'badly']
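
The tokenizing and filtering steps can be reproduced with the standard library alone, which makes it easier to see what the `\w+` pattern keeps. This is only a sketch: `TINY_STOP_WORDS` is a hypothetical stand-in for NLTK's English stop-word list, `simple_tokenizer` is an illustrative name, and lemmatization is skipped, so "feet" and "bads" are left unchanged.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stop-word list.
TINY_STOP_WORDS = {"it", "is", "the", "a", "an", "and", "to"}

def simple_tokenizer(text):
    # \w+ keeps runs of letters, digits and underscores; punctuation is dropped.
    tokens = re.findall(r"\w+", text)
    lowercased = [token.lower() for token in tokens]
    return [token for token in lowercased if token not in TINY_STOP_WORDS]

result = simple_tokenizer("HeY,, lMnOPq feet it going? <HTML>!bad? bads 'randoms' badly")
print(result)
# ['hey', 'lmnopq', 'feet', 'going', 'html', 'bad', 'bads', 'randoms', 'badly']
```

Note that HTML tags are not removed as a whole: only the angle brackets are dropped, so the tag name ("html") survives as an ordinary token.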

I partitioned the dataframe into training and test data: 80% of the rows are used for training and the rest for testing. To make the split random, I shuffled the dataset first.

In [6]:
# shuffling the dataframe; sample() shuffles the index labels along with the rows,
# so reset the index and keep the result (reset_index returns a new dataframe)
df = df.sample(frac=1, random_state=1).reset_index(drop=True)

# position of the 80% mark, where we will split for train/test
split_index = int(len(df) * 0.8)

train_df = df[:split_index]
test_df = df[split_index:]

# after slicing, test_df's index starts at split_index, so reset both to start at 0
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# train_df, test_df
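
The same shuffle-then-slice idea can be sketched in plain Python on row positions rather than on a DataFrame. The row count is the sum of the two class counts quoted earlier (3900 + 1896); `random.Random(1)` is just an illustrative seed and does not reproduce the ordering that pandas produces.

```python
import random

n_rows = 5796  # 3900 not-spam + 1896 spam emails
positions = list(range(n_rows))
random.Random(1).shuffle(positions)

# Everything before the 80% mark is training data, the rest is test data.
split_index = int(n_rows * 0.8)
train_positions = positions[:split_index]
test_positions = positions[split_index:]

print(len(train_positions), len(test_positions))        # 4636 1160
# No row ends up in both halves.
print(set(train_positions).isdisjoint(test_positions))  # True
```

A plain shuffle does not guarantee that both halves contain the same proportion of spam; with a dataset of this size the proportions are usually close, but they are not forced to match.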

A good way to predict whether an email is spam is to check which words occur frequently in spam emails. As a first step, I made a dictionary where each key is a token and its value is the number of times that token occurs across all emails in the training dataframe (spam and not spam alike).

In [7]:
token_counter = {} # a dictionary that keeps tokens and the number of times they appear

for message in train_df["MESSAGE"]:
  # each message in our dataframe gets tokenized
  tokenized_message = tokenizer(message)

  for token in tokenized_message:
    # if the token in the tokenized message is not in the dictionary, we add it to it
    if token not in token_counter:
      token_counter[token] = 1
    # if it is in the dictionary, we increment its frequency by 1
    else:
      token_counter[token] += 1


len(token_counter)

86415
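
The same tally can be written more compactly with `collections.Counter` from the standard library. A minimal sketch on two made-up, already-tokenized messages (the token lists and the name `toy_counter` are illustrative, not taken from the dataset):

```python
from collections import Counter

# Two made-up messages, already split into tokens.
tokenized_messages = [
    ["free", "offer", "click", "free"],
    ["meeting", "offer", "tomorrow"],
]

toy_counter = Counter()
for tokens in tokenized_messages:
    # update() adds one to the count of every token in the list.
    toy_counter.update(tokens)

print(toy_counter["free"])     # 2
print(toy_counter["offer"])    # 2
print(toy_counter["missing"])  # 0 (unseen tokens count as zero, no KeyError)
print(len(toy_counter))        # 5
```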

This is too many tokens to use as features. Only a handful of them, the ones that appear most frequently, will actually be useful. To select them, I set an arbitrary threshold value: a token is kept only if it appears more than the threshold number of times, as it may be an indicator for spam.

In [8]:
def pass_threshold(token, threshold):

  # if a token is not even in the dictionary, we don't need it
  if token not in token_counter:
    return False
  else:
    return (token_counter[token] > threshold)

spam_detectors = set()

'''
now we pass each token in the dictionary into the above function to get only the tokens that pass the threshold
we can toy around with the threshold to see which gives good outputs
the model accuracy in the end works better with 8000 than with some other thresholds
'''
for token in token_counter:
  if pass_threshold(token, 8000):
    spam_detectors.add(token)

spam_detectors = list(spam_detectors) # a list gives each token a fixed position (the order is arbitrary, not sorted)
spam_detectors

['br',
 'font',
 'p',
 'size',
 'com',
 'face',
 'td',
 'http',
 'b',
 'tr',
 'width',
 '0',
 'color',
 '1',
 '3d',
 'nbsp']
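
Because string hashing is randomized per Python session, a list built from a set of strings can come out in a different order from one session to the next, which changes the position each token ends up in. One way to pin the order down is to sort the surviving tokens. A sketch with made-up counts and a made-up threshold (all names here are illustrative):

```python
# Made-up counts for illustration; the real ones come from token_counter.
toy_counts = {"br": 12000, "font": 9500, "hello": 40, "http": 8100, "meeting": 7}
toy_threshold = 8000

# Keep tokens above the threshold, sorted so the order is the same on every run.
kept_tokens = sorted(t for t, count in toy_counts.items() if count > toy_threshold)
# Give each kept token a stable position.
token_to_index = {token: i for i, token in enumerate(kept_tokens)}

print(kept_tokens)     # ['br', 'font', 'http']
print(token_to_index)  # {'br': 0, 'font': 1, 'http': 2}
```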

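The machine learning model can be based on the number of times each element in spam_detectors appears in a message. Most of these tokens are HTML tag and attribute names (br, font, td, tr, width), so the features largely measure how much HTML markup an email contains.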

In [9]:
index_detector = {token: index for index, token in enumerate(spam_detectors)}

index_detector

{'br': 0,
 'font': 1,
 'p': 2,
 'size': 3,
 'com': 4,
 'face': 5,
 'td': 6,
 'http': 7,
 'b': 8,
 'tr': 9,
 'width': 10,
 '0': 11,
 'color': 12,
 '1': 13,
 '3d': 14,
 'nbsp': 15}

The idea is that for each tokenized message in our dataframe, we count how many times each of the above words appears. If they appear many times, the message is likely to be spam.

I implemented this as a vector of counts, with one entry per spam detector word.

In [None]:
# The intuition can be seen as follows:
# tokenizer("3d b <br> .com bad font font com randoms")

# The output for it will be:
# ['3d', 'b', 'br', 'com', 'bad', 'font', 'font', 'com', 'randoms']

# ->  br  b  size  3d  com  font  p  http   td   tr      -> spam detector words
# ->  0    1    2    3   4    5    6    7   8   9        -> index of the spam detector words
# -> [1,   1,   0,   1,  2,   2,   0,   0,  0,  0]       -> number of times those words appeared in the text
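
That hand-worked example can be checked with a few lines of plain Python. The ten-word vocabulary below is the illustrative one from the comment, not the actual `spam_detectors` list, and the message is given already tokenized:

```python
# Illustrative vocabulary and order, copied from the comment above.
vocabulary = ["br", "b", "size", "3d", "com", "font", "p", "http", "td", "tr"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

tokens = ["3d", "b", "br", "com", "bad", "font", "font", "com", "randoms"]

counts = [0] * len(vocabulary)
for token in tokens:
    # Tokens outside the vocabulary ("bad", "randoms") are ignored.
    if token in word_to_index:
        counts[word_to_index[token]] += 1

print(counts)  # [1, 1, 0, 1, 2, 2, 0, 0, 0, 0]
```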


We want to build this kind of vector for each message.

In [10]:
def get_vector(message):

  # initializing a numpy vector of 0s as at the beginning we haven't seen any words
  vector = np.zeros(len(spam_detectors))

  tokenized_message = tokenizer(message)

  for token in tokenized_message:
    # a dictionary lookup is faster than scanning the spam_detectors list;
    # tokens that aren't potential spam words are simply skipped
    if token in index_detector:
      # get the index of that word and update the count in our vector
      index = index_detector[token]
      vector[index] += 1

  return vector
  
get_vector("3d b <br> .com bad font font com randoms")

array([1., 2., 0., 0., 2., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0.])

In the above output no spam detector word appears more than twice, so by this heuristic the message doesn't look like spam.

Testing on some of the emails in the dataframe:

In [11]:
get_vector(train_df['MESSAGE'].iloc[0])

array([33.,  4.,  0.,  2.,  9.,  1.,  0.,  6.,  2.,  0.,  0.,  1.,  3.,
        0.,  0.,  1.])

This looks like spam, as several of the spam detector words appear many times ('br' alone appears 33 times). We can check this by looking at the first row and seeing whether it is indeed labelled spam.

In [12]:
train_df.iloc[0]

CATEGORY                                                     1
MESSAGE      \n\n<HTML><FONT  BACK="#ffffff" style="BACKGRO...
FILE_NAME               00118.141d803810acd9d4fc23db103dddfcd9
Name: 0, dtype: object

The category is 1, meaning it is indeed spam.

In [13]:
get_vector(train_df['MESSAGE'].iloc[10])

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0.])

This does not look like spam, as only two of the spam detector words appear, each just once.

In [14]:
train_df.iloc[10]

CATEGORY                                                     0
MESSAGE      \n\n>\n\n>hi i was just wondering if anyone ex...
FILE_NAME               02464.95f59bae730edc01ce4f88d98791ffca
Name: 10, dtype: object

Confirmed: not spam.

Now the machine learning part begins.

We want to get a vector like the ones above for each message in the dataframe and pair each vector with its label, spam or not spam.

X represents a matrix of inputs. Each row is the vector for one message.

y holds the label for each row: 1 for spam, 0 for not spam.

In other words, we now get the X and y values for training and testing from the dataframe.

In [15]:
def df_to_X_y(df):

  # list that will keep vectors of each message
  vectors = []
  # y is spam/not spam, i.e., 1 or 0
  y = df['CATEGORY'].to_numpy().astype(int)
  messages = df["MESSAGE"]

  for message in messages:
    vector = get_vector(message)
    vectors.append(vector)
  
  # X is the numpy array of vectors
  X = np.array(vectors).astype(int)
  return X, y

Now we can build X and y for the training and test data.

In [16]:
X_train, y_train = df_to_X_y(train_df)
X_test, y_test = df_to_X_y(test_df)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((4636, 16), (4636,), (1160, 16), (1160,))

Shapes check out. Training X and y have 4636 rows, the number of emails in the training set, and X has 16 columns, one per element of spam_detectors. Likewise, the test X and y have 1160 rows.

In [17]:
# Scaling each feature to the [0, 1] range; the scaler is fit on the training data only

scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

X_train

array([[0.04064039, 0.00245851, 0.        , ..., 0.        , 0.        ,
        0.00176367],
       [0.0270936 , 0.00860479, 0.04065041, ..., 0.00363636, 0.        ,
        0.        ],
       [0.        , 0.        , 0.00406504, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.00406504, ..., 0.        , 0.        ,
        0.        ]])
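
`MinMaxScaler` rescales each column with x' = (x - min) / (max - min), where min and max are taken from the data passed to `fit`, here the training set only. A small plain-Python sketch of that formula on one made-up column of counts (the helper names and numbers are illustrative):

```python
def fit_min_max(column):
    # Learn the range from the training column only.
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    # Assumes hi > lo; a constant column would need special handling.
    return [(x - lo) / (hi - lo) for x in column]

train_counts = [0, 5, 10, 20]  # made-up training counts for one token
test_counts = [10, 30]         # made-up test counts for the same token

lo, hi = fit_min_max(train_counts)
scaled_train = apply_min_max(train_counts, lo, hi)
scaled_test = apply_min_max(test_counts, lo, hi)

print(scaled_train)  # [0.0, 0.25, 0.5, 1.0]
print(scaled_test)   # [0.5, 1.5]
```

The second test value shows that a count larger than anything seen in training maps to a value above 1, which is expected when the range is learned from the training data alone.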

Random Forest Classifier

In [18]:
random_forest = RandomForestClassifier().fit(X_train, y_train)

print(classification_report(y_test, random_forest.predict(X_test)))

              precision    recall  f1-score   support

           0       0.85      0.96      0.90       788
           1       0.89      0.64      0.75       372

    accuracy                           0.86      1160
   macro avg       0.87      0.80      0.82      1160
weighted avg       0.86      0.86      0.85      1160



Logistic Regression

In [19]:
logistic_regression = LogisticRegression().fit(X_train, y_train)

print(classification_report(y_test, logistic_regression.predict(X_test)))

              precision    recall  f1-score   support

           0       0.76      1.00      0.86       788
           1       0.99      0.34      0.51       372

    accuracy                           0.79      1160
   macro avg       0.88      0.67      0.69      1160
weighted avg       0.84      0.79      0.75      1160
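
The summary rows of the report can be recomputed from the per-class rows. The sketch below uses the rounded precision, recall and support values printed above for logistic regression, so it agrees with the report only up to rounding; the helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Rounded values read from the report above: (precision, recall, support).
not_spam_row = (0.76, 1.00, 788)
spam_row = (0.99, 0.34, 372)

spam_f1 = f1_from(spam_row[0], spam_row[1])
# Macro average: plain mean over the two classes.
macro_recall = (not_spam_row[1] + spam_row[1]) / 2
# Weighted average: mean weighted by the number of emails in each class.
total = not_spam_row[2] + spam_row[2]
weighted_recall = (not_spam_row[1] * not_spam_row[2] + spam_row[1] * spam_row[2]) / total

print(round(spam_f1, 2))          # 0.51
print(round(macro_recall, 2))     # 0.67
print(round(weighted_recall, 2))  # 0.79
```

The support-weighted recall is the same quantity as overall accuracy, which is why both show 0.79 in the report.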



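The Random Forest Classifier gives better overall scores than Logistic Regression: accuracy is 0.86 versus 0.79, and recall on spam is 0.64 versus 0.34. Logistic Regression is more precise when it does flag an email as spam (0.99 versus 0.89), but it misses about two thirds of the spam emails. Therefore Random Forest may be the better model for this feature set.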