
# BERT tutorial: Classify spam vs. non-spam emails

In [None]:
!pip install "tensorflow-text==2.11.*"



In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import pandas as pd

Load the dataset from Google Drive

In [None]:
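# Shareable Google Drive link to the dataset file
url = 'https://drive.google.com/file/d/1-1hCBHrF1mUvtk7sMlOvzgSgh3T52IMp/view'
# Extract the file ID and build a direct-download URL that pandas can read
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url, sep='\t', names=["Category", "Message"])
df.head()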

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
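The URL rewrite above works because a Drive share link carries the file ID as a path segment. As a minimal sketch, the same step can be wrapped in a small helper; the function name `drive_share_to_download_url` is hypothetical and not part of the original notebook, and it assumes the link has the `/file/d/<id>/view` form.

```python
def drive_share_to_download_url(share_url: str) -> str:
    """Convert a Drive '/file/d/<id>/view' share link to a direct-download URL."""
    parts = share_url.split('/')
    # In a '/file/d/<id>/view' link, the file ID is the segment right after 'd'.
    file_id = parts[parts.index('d') + 1]
    return 'https://drive.google.com/uc?id=' + file_id


share_url = 'https://drive.google.com/file/d/1-1hCBHrF1mUvtk7sMlOvzgSgh3T52IMp/view'
download_url = drive_share_to_download_url(share_url)
print(download_url)
```

Looking up the segment after `d` rather than counting from the end is a design choice for this sketch; both approaches give the same ID for the link used here.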


Encode the target label as a number: `spam` becomes 1 and `ham` becomes 0

In [None]:
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
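An alternative way to encode the label is an explicit mapping, which also makes it easy to check how balanced the classes are. The sketch below uses a small made-up DataFrame named `toy` (an assumption for illustration, not the real dataset) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame, used only for illustration.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "ham", "spam"],
    "Message": ["hi", "win cash now", "see you", "ok", "free prize"],
})

# Explicit mapping: an unexpected category would become NaN instead of silently 0.
toy["spam"] = toy["Category"].map({"ham": 0, "spam": 1})

# Fraction of rows in each class, useful for spotting class imbalance.
class_share = toy["Category"].value_counts(normalize=True)
print(toy)
print(class_share)
```

On the real data, the same `value_counts(normalize=True)` call on `df['Category']` would show whether ham heavily outnumbers spam, which matters when interpreting accuracy later.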


Split the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

# Stratify on the label so both splits keep the same spam/ham ratio
X_train, X_test, y_train, y_test = train_test_split(df[['Message']], df[['spam']], stratify=df[['spam']])
X_train.head()

Unnamed: 0,Message
4834,"New Mobiles from 2004, MUST GO! Txt: NOKIA to ..."
1079,Convey my regards to him
2707,S now only i took tablets . Reaction morning o...
1255,What your plan for pongal?
1374,"Bears Pic Nick, and Tom, Pete and ... Dick. In..."
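To see what `stratify` does, here is a small sketch on made-up labels (80 ham, 20 spam). The arrays `X` and `y` and the `random_state` value are assumptions for illustration only; the point is that the spam fraction stays the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 ham (0) and 20 spam (1).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Default test_size is 0.25, so 75 rows go to training and 25 to testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Spam fraction overall, in the training split, and in the test split.
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the spam fraction in a small test set can drift away from the overall fraction purely by chance.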


Now let's load the BERT model and get embedding vectors for a few sample sentences

In [None]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [None]:
def get_sentence_embedding(sentences):
    """Return BERT's pooled output: one 768-dimensional vector per sentence."""
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

get_sentence_embedding([
    "500$ discount. hurry up",
    "Are you up for a volleybal game tomorrow?"]
)

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.843517  , -0.51327276, -0.88845724, ..., -0.7474888 ,
        -0.75314736,  0.91964495],
       [-0.91400784, -0.44170597, -0.85099435, ..., -0.7264388 ,
        -0.72576755,  0.93709135]], dtype=float32)>
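The output has shape `(2, 768)`: one 768-dimensional vector per input sentence. One common way to compare two such vectors is cosine similarity. The sketch below is not part of the original notebook; it uses random vectors `emb_a` and `emb_b` as hypothetical stand-ins for real embeddings so that it runs without downloading BERT.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical stand-ins for two rows of the pooled output.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)
emb_b = rng.normal(size=768)

print(cosine_similarity(emb_a, emb_a))  # a vector compared with itself
print(cosine_similarity(emb_a, emb_b))  # two unrelated vectors
```

With real embeddings, each row of the tensor above could be converted with `.numpy()` and passed to this function.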

Build a classification model on top of BERT's pooled output

In [None]:
# BERT layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Classification head: dropout, then a single sigmoid unit that outputs P(spam)
x = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
x = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(x)

# Use inputs and outputs to construct the final model
model = tf.keras.Model(inputs=[text_input], outputs=[x])

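The head added on top of BERT is small: a dense layer with one sigmoid unit turns each 768-dimensional pooled vector into a single number between 0 and 1, read as the probability of spam. The NumPy sketch below mimics that computation at inference time, when dropout is inactive. The arrays `pooled`, `w`, and `b` are random placeholders assumed for illustration, not the trained weights.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
pooled = rng.uniform(-1.0, 1.0, size=(2, 768))  # stand-in for pooled_output
w = rng.normal(scale=0.02, size=(768, 1))       # hypothetical Dense kernel
b = np.zeros(1)                                 # hypothetical Dense bias

# Dense(1, activation='sigmoid') computes sigmoid(x @ w + b) for each row.
probs = sigmoid(pooled @ w + b)
print(probs.shape)
```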


In [None]:
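# Adam optimizer with binary cross-entropy, a standard loss for a sigmoid output
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# No epochs argument is given, so Keras trains for a single epoch (its default)
model.fit(X_train, y_train)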



<keras.callbacks.History at 0x7f0da1913b20>
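The `binary_crossentropy` loss used in training penalizes confident wrong predictions much more than hesitant ones. Here is a minimal NumPy sketch of the formula, averaged over samples. The helper name and the clipping constant `eps` are assumptions for this illustration and may differ from Keras internals.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip probabilities away from 0 and 1 so the logarithm stays finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))


print(binary_crossentropy([1, 0], [0.9, 0.1]))  # confident and correct
print(binary_crossentropy([1, 0], [0.5, 0.5]))  # undecided
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # confident and wrong
```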

In [None]:
model.evaluate(X_test, y_test)



[0.24853314459323883, 0.8715003728866577]
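`model.evaluate` returns the loss followed by the accuracy. Accuracy alone can look good on imbalanced data, so it is often worth checking the confusion matrix, precision, and recall as well. The sketch below uses made-up labels and probabilities (`y_true` and `y_prob` are assumptions for illustration, not results from this model) and a 0.5 decision threshold.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predicted spam probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.6, 0.2, 0.9, 0.3, 0.8, 0.05])

# Turn probabilities into hard 0/1 predictions with a 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

# Rows are true classes (ham, spam); columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(cm)
print(precision, recall)
```

With the model from this notebook, `y_prob` would presumably come from `model.predict(X_test)` flattened to one dimension, and `y_true` from `y_test`.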