In [2]:
! pip install tensorflow-text

Collecting tensorflow-text
  Downloading tensorflow_text-2.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
Collecting tensorflow<2.14,>=2.13.0 (from tensorflow-text)
  Downloading tensorflow-2.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (524.1 MB)
Collecting keras<2.14,>=2.13.1 (from tensorflow<2.14,>=2.13.0->tensorflow-text)
  Downloading keras-2.13.1-py3-none-any.whl (1.7 MB)
Collecting tensorboard<2.14,>=2.13 (from tensorflow<2.14,>=2.13.0->tensorflow-text)
  Downloading tensorboard-2.13.0-py3-none-any.whl (5.6 MB)

In [3]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import pandas as pd

In [6]:
df=pd.read_csv("spam.csv")
df.head(3)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...


In [7]:
df['Category'].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

We have 4825 ham emails and 747 spam emails, so the ham class is significantly larger than the spam class.

The ratio of the two categories is shown below:

In [8]:
747/4825

0.15481865284974095

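This result is the ratio of spam to ham: there are about 0.15 spam emails for every ham email. As a share of the whole dataset (747 + 4825 = 5572 emails), about 13% are spam and about 87% are ham. This indicates a class imbalance. We need to balance the two classes to reduce bias during model training.

**About 13% spam emails, 87% ham emails: This indicates class imbalance**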
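The spam-to-ham ratio and the share of each class in the whole dataset are two different quantities. The short sketch below works both out from the counts printed by `value_counts()`; the variable names are illustrative only.

```python
# Counts taken from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_to_ham = spam_count / ham_count   # what 747/4825 measures
spam_share = spam_count / total        # fraction of all emails that are spam
ham_share = ham_count / total          # fraction of all emails that are ham

print(f"spam-to-ham ratio: {spam_to_ham:.3f}")  # 0.155
print(f"spam share: {spam_share:.1%}")          # 13.4%
print(f"ham share: {ham_share:.1%}")            # 86.6%
```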

In [12]:
df_spam=df[df['Category']=='spam']
df_spam.shape

(747, 2)

In [13]:
df_ham=df[df['Category']=='ham']
df_ham.shape

(4825, 2)

Now that we have created the two data frames, we will downsample the ham class so that it has the same number of rows as the spam class.


In [14]:
df_ham_downsampled = df_ham.sample(df_spam.shape[0])  # randomly keep as many ham rows as there are spam rows
df_ham_downsampled.shape

(747, 2)

In [16]:
df_balanced = pd.concat([df_ham_downsampled, df_spam])
df_balanced.shape

(1494, 2)

In [17]:
df_balanced['Category'].value_counts()

ham     747
spam    747
Name: Category, dtype: int64
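Because `sample()` draws rows at random, the ham rows that are kept change from run to run. A minimal sketch of the same balancing step on a small made-up frame is shown below, assuming we want reproducible results; the toy data, the `random_state=42` seed, and the extra shuffle are assumptions for illustration and are not part of the notebook above.

```python
import pandas as pd

# Hypothetical toy data standing in for spam.csv: 6 ham rows and 2 spam rows.
toy = pd.DataFrame({
    "Category": ["ham"] * 6 + ["spam"] * 2,
    "Message": [f"message {i}" for i in range(8)],
})

toy_spam = toy[toy["Category"] == "spam"]
toy_ham = toy[toy["Category"] == "ham"]

# A fixed random_state returns the same ham rows on every run.
toy_ham_downsampled = toy_ham.sample(n=len(toy_spam), random_state=42)

# Concatenate the two classes, then shuffle so they are interleaved.
toy_balanced = (
    pd.concat([toy_ham_downsampled, toy_spam])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(toy_balanced["Category"].value_counts())
```

Downsampling discards most of the ham rows, so it trades training data for balance; it is one simple option among several.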

In [19]:
df_balanced['spam']=df_balanced['Category'].apply(lambda x: 1 if x=='spam' else 0)
df_balanced.sample(3)

Unnamed: 0,Category,Message,spam
3002,spam,This message is free. Welcome to the new & imp...,1
1892,ham,Probably earlier than that if the station's wh...,0
333,spam,Call Germany for only 1 pence per minute! Call...,1


In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df_balanced['Message'], df_balanced['spam'], stratify=df_balanced['spam'])
# stratify keeps the spam/ham proportions the same in the train and test sets
print(f"Train Data: {X_train.shape},{y_train.shape}")
print(f"Test Data: {X_test.shape},{y_test.shape}")

Train Data: (1120,),(1120,)
Test Data: (374,),(374,)
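The split sizes follow from scikit-learn's default `test_size` of 0.25, which applies when neither `test_size` nor `train_size` is passed. The arithmetic can be sketched as follows; the variable names are illustrative.

```python
import math

n_samples = 1494   # rows in df_balanced
test_size = 0.25   # scikit-learn default when no size is given

# The test set size is rounded up; the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# The data is balanced and the split is stratified, so we would expect
# each class to make up half of the test set.
per_class_test = n_test // 2

print(n_train, n_test, per_class_test)  # 1120 374 187
```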


In [23]:
X_train.head()

1915    New TEXTBUDDY Chat 2 horny guys in ur area 4 j...
2249                      will you like to be spoiled? :)
67      Urgent UR awarded a complimentary trip to Euro...
2053    Call 09094100151 to use ur mins! Calls cast 10...
2954    URGENT! Your mobile was awarded a £1,500 Bonus...
Name: Message, dtype: object

## Getting started with BERT


In [26]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")  # turns raw text into BERT inputs
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")  # encodes the inputs into embeddings

In [30]:
# Bert layers
text_input=tf.keras.layers.Input(shape=(),dtype=tf.string,name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

# final model
model= tf.keras.Model(inputs=[text_input], outputs=[l])

Now we will print the model summary.

In [31]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 text (InputLayer)           [(None,)]                    0         []                            
                                                                                                  
 keras_layer_2 (KerasLayer)  {'input_mask': (None, 128)   0         ['text[0][0]']                
                             , 'input_word_ids': (None,                                           
                              128),                                                               
                              'input_type_ids': (None,                                            
                             128)}                                                                
                                                                                              

The summary lists all the input and output layers we have initialized for our model. It also shows the total params, trainable params, and non-trainable params.

Total params: All the parameters in our model.

Trainable params: The parameters that are updated during training. Here these belong to the final Dense layer.

Non-trainable params: These parameters come from the BERT model. They are already trained, and they stay frozen because `hub.KerasLayer` is not trainable by default.
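As a rough check on the trainable count, the only layer with weights of its own is the final `Dense(1)` layer, which sits on top of the 768-dimensional `pooled_output` (the `H-768` in the encoder URL). The sketch below works out the expected number, assuming the BERT encoder is left frozen as in the code above; the variable names are illustrative.

```python
hidden_size = 768    # size of BERT's pooled_output, from "H-768" in the encoder URL
output_units = 1     # Dense(1, activation='sigmoid')

dense_weights = hidden_size * output_units  # one weight per input feature
dense_biases = output_units                 # one bias per output unit

# Dropout has no parameters, so only the Dense layer contributes.
expected_trainable = dense_weights + dense_biases
print(expected_trainable)  # 769
```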

In [32]:
len(X_train)

1120

In [37]:
METRICS=[
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall')
    ]

model.compile(
optimizer = 'adam' ,
loss='binary_crossentropy',
metrics =METRICS)

In [38]:
model.fit(X_train,y_train,epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x79836b1999f0>

In [39]:
model.evaluate(X_test, y_test)  # returns the loss followed by accuracy, precision and recall on the test set

In [40]:
y_predicted=model.predict(X_test)
y_predicted.flatten()  # flattened to display a 1D array (y_predicted itself is not modified)



array([0.94469887, 0.17659217, 0.06888145, 0.0388144 , 0.9339365 ,
       0.01814251, 0.9603059 , 0.9245909 , 0.94165826, 0.93513775,
       0.0616117 , 0.3238535 , 0.22647534, 0.8225668 , 0.16531307,
       0.605555  , 0.09415074, 0.5505536 , 0.45783496, 0.05685103,
       0.80340207, 0.0805951 , 0.92507803, 0.38398638, 0.13786834,
       0.13105926, 0.86255234, 0.10998893, 0.97920614, 0.9062267 ,
       0.98374116, 0.9299963 , 0.16676484, 0.96931034, 0.27345997,
       0.06706671, 0.6187291 , 0.20365651, 0.89868027, 0.925813  ,
       0.16158167, 0.89868027, 0.96931034, 0.17568043, 0.6939809 ,
       0.8898933 , 0.9706071 , 0.71193993, 0.6701719 , 0.22815391,
       0.7887096 , 0.07414774, 0.19938052, 0.61837476, 0.48453453,
       0.1396253 , 0.10339768, 0.98128575, 0.10170437, 0.0551746 ,
       0.07413637, 0.03118038, 0.12094733, 0.9568313 , 0.96794426,
       0.1209473 , 0.85615647, 0.26542783, 0.04655223, 0.9202087 ,
       0.08633909, 0.27187666, 0.96948856, 0.9596279 , 0.01832

Since we used a sigmoid activation function, the prediction probabilities lie between 0.0 and 1.0. We convert them to class labels with a threshold of 0.5: if the probability is greater than 0.5 the output is 1 (spam), otherwise it is 0 (ham).
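The thresholding step can be tried in isolation on the first five probabilities printed above. This is a small sketch; `probs` and `labels` are illustrative names, and the column shape mirrors what `model.predict` returns.

```python
import numpy as np

# First five predicted probabilities from the output above, one row per message.
probs = np.array([[0.94469887], [0.17659217], [0.06888145], [0.0388144], [0.9339365]])

# 1 where the probability exceeds 0.5, otherwise 0; flatten gives a 1D array.
labels = np.where(probs > 0.5, 1, 0).flatten()

print(labels)  # [1 0 0 0 1]
```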

In [42]:
import numpy as np
y_predicted = np.where(y_predicted > 0.5, 1, 0)
print(y_predicted)

[[1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]


In [43]:
from sklearn.metrics import classification_report, confusion_matrix
cm= confusion_matrix(y_test, y_predicted)
cm

array([[164,  23],
       [ 11, 176]])

In [44]:
print(classification_report(y_test, y_predicted))


              precision    recall  f1-score   support

           0       0.94      0.88      0.91       187
           1       0.88      0.94      0.91       187

    accuracy                           0.91       374
   macro avg       0.91      0.91      0.91       374
weighted avg       0.91      0.91      0.91       374
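The figures in the classification report can be reproduced by hand from the confusion matrix, whose rows are the true labels and whose columns are the predicted labels. The sketch below does this for the spam class (label 1); the variable names are illustrative.

```python
# Confusion matrix from above: rows = true class, columns = predicted class.
cm = [[164, 23],
      [11, 176]]

tn, fp = cm[0]  # ham predicted as ham, ham predicted as spam
fn, tp = cm[1]  # spam predicted as ham, spam predicted as spam

precision = tp / (tp + fp)                  # of predicted spam, how much is spam
recall = tp / (tp + fn)                     # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.88 recall=0.94 f1=0.91 accuracy=0.91
```

Read this way, the model lets 11 of 187 spam messages through and flags 23 of 187 ham messages as spam.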



In [45]:
reviews = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see   if you get your ."
]

model.predict(reviews)



array([[0.78540486],
       [0.8447811 ],
       [0.82256687],
       [0.22490428],
       [0.12603578]], dtype=float32)

From the output above, the first three messages have prediction probabilities greater than 0.5, so they are classified as spam. The last two messages have prediction probabilities less than 0.5, so they are classified as ham. These predictions match what we would expect from reading the messages, which suggests that our text classification model works as intended.
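To make this output easier to read, each probability can be paired with a label. The helper `to_label` below is a hypothetical convenience function, not part of the notebook, and the probabilities are copied from the output above.

```python
# Probabilities copied from the model.predict output above.
probabilities = [0.78540486, 0.8447811, 0.82256687, 0.22490428, 0.12603578]

def to_label(probability, threshold=0.5):
    """Map a sigmoid probability to 'spam' or 'ham' (hypothetical helper)."""
    return "spam" if probability > threshold else "ham"

labels = [to_label(p) for p in probabilities]

for i, (p, label) in enumerate(zip(probabilities, labels), start=1):
    print(f"message {i}: {label} (p={p:.2f})")
```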

The model was able to classify email messages as spam or ham. We started by using pre-trained BERT models to convert each sentence into an embedding vector.

We created our model using TensorFlow and initialized all the input and output layers. We then trained it on a balanced dataset and evaluated it on a held-out test set, where it reached about 91% accuracy. Finally, we used the model to make predictions on new messages, and it classified them correctly.