<a href="https://colab.research.google.com/github/Lasrixx/SMSTextClassifier/blob/main/fcc_sms_text_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

You should create a function called predict_message that takes a message string as an argument and returns a list. The first element in the list should be a number between zero and one that indicates how likely the message is to be "ham" (0) or "spam" (1). The second element in the list should be the word "ham" or "spam", depending on which is more likely.

For this challenge, you will use the SMS Spam Collection dataset. The dataset has already been grouped into train data and test data.

The first two cells import the libraries and data. The final cell tests your model and function. Add your code in between these cells.

In [None]:
# import libraries
try:
  # Install the nightly TensorFlow build (the ! shell escape only works in IPython/Colab).
  !pip install tf-nightly
except Exception:
  pass
import tensorflow as tf
import pandas as pd
from tensorflow import keras
!pip install tensorflow-datasets
import tensorflow_datasets as tfds
import tensorflow_hub as hub
import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)

In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

In [3]:
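# Put files into dataframes
# Note: header=0 combined with names= treats the first line of each file as a
# header row and replaces it, so that line is not loaded as data. If the TSV
# files have no header row, header=None keeps the first message.
train_df = pd.read_table(train_file_path, header=0, names=['label','text'])
print(train_df)
test_df = pd.read_table(test_file_path, header=0, names=['label','text'])
print(test_df)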

     label                                               text
0      ham                           you can never do nothing
1      ham  now u sound like manky scouse boy steve,like! ...
2      ham  mum say we wan to go then go... then she can s...
3      ham  never y lei... i v lazy... got wat? dat day ü ...
4      ham  in xam hall boy asked girl tell me the startin...
...    ...                                                ...
4173   ham  just woke up. yeesh its late. but i didn't fal...
4174   ham  what do u reckon as need 2 arrange transport i...
4175  spam  free entry into our £250 weekly competition ju...
4176  spam  -pls stop bootydelious (32/f) is inviting you ...
4177   ham  tell my  bad character which u dnt lik in me. ...

[4178 rows x 2 columns]
     label                                               text
0      ham         not much, just some textin'. how bout you?
1      ham  i probably won't eat at all today. i think i'm...
2      ham  don‘t give a flying monkeys wot t

In [4]:
train_df.shape

(4178, 2)
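
The row count above depends on how `read_table` interprets the first line of the file. The sketch below uses a tiny in-memory TSV (the three sample rows are made up for illustration, not taken from the dataset) to show the difference between `header=0` and `header=None` when `names` is also passed.

```python
import io

import pandas as pd

# Hypothetical three-line TSV with no header row (illustrative data only).
sample_tsv = "ham\tfirst message\nspam\tsecond message\nham\tthird message\n"

# header=0 plus names=: the first line is consumed as a header and replaced.
with_header_0 = pd.read_table(
    io.StringIO(sample_tsv), header=0, names=["label", "text"]
)

# header=None plus names=: every line is kept as data.
with_header_none = pd.read_table(
    io.StringIO(sample_tsv), header=None, names=["label", "text"]
)

print(len(with_header_0))     # 2
print(len(with_header_none))  # 3
```

If the downloaded files turn out to have no header line, switching to `header=None` would keep one extra message per file.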

In [5]:
# Convert ham and spam to integers - we have been told ham is 0 and spam is 1
train_df['label'] = train_df['label'].replace("ham", 0)
train_df['label'] = train_df['label'].replace("spam", 1)
test_df['label'] = test_df['label'].replace("ham", 0)
test_df['label'] = test_df['label'].replace("spam", 1)

In [6]:
print(train_df)

      label                                               text
0         0                           you can never do nothing
1         0  now u sound like manky scouse boy steve,like! ...
2         0  mum say we wan to go then go... then she can s...
3         0  never y lei... i v lazy... got wat? dat day ü ...
4         0  in xam hall boy asked girl tell me the startin...
...     ...                                                ...
4173      0  just woke up. yeesh its late. but i didn't fal...
4174      0  what do u reckon as need 2 arrange transport i...
4175      1  free entry into our £250 weekly competition ju...
4176      1  -pls stop bootydelious (32/f) is inviting you ...
4177      0  tell my  bad character which u dnt lik in me. ...

[4178 rows x 2 columns]


In [7]:
# Separate data and labels
train_labels = train_df.pop('label')
test_labels = test_df.pop('label')

In [8]:
print(train_df)
print(train_labels)

                                                   text
0                              you can never do nothing
1     now u sound like manky scouse boy steve,like! ...
2     mum say we wan to go then go... then she can s...
3     never y lei... i v lazy... got wat? dat day ü ...
4     in xam hall boy asked girl tell me the startin...
...                                                 ...
4173  just woke up. yeesh its late. but i didn't fal...
4174  what do u reckon as need 2 arrange transport i...
4175  free entry into our £250 weekly competition ju...
4176  -pls stop bootydelious (32/f) is inviting you ...
4177  tell my  bad character which u dnt lik in me. ...

[4178 rows x 1 columns]
0       0
1       0
2       0
3       0
4       0
       ..
4173    0
4174    0
4175    1
4176    1
4177    0
Name: label, Length: 4178, dtype: int64


In [9]:
# Hold out part of the train dataset for validation
# The first 20% of rows become validation data; the remaining 80% are used for training
val_split = 0.2
split_index = int(len(train_df) * val_split)

print(split_index)

validation_data = train_df.to_numpy()[:split_index]
train_data = train_df.to_numpy()[split_index:]
validation_labels = train_labels.to_numpy()[:split_index]
train_labels = train_labels.to_numpy()[split_index:]

835
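
The split above takes the first 835 rows in file order as validation data. If the file happens to be ordered in some way, a shuffled split may give a more representative validation set. The following is a minimal sketch in plain Python; `shuffled_split` and its `seed` parameter are hypothetical names introduced here, not part of the notebook.

```python
import random


def shuffled_split(texts, labels, val_fraction=0.2, seed=42):
    """Shuffle row indices, then hold out val_fraction of them for validation."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = int(len(texts) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train_texts = [texts[i] for i in train_idx]
    train_targets = [labels[i] for i in train_idx]
    val_texts = [texts[i] for i in val_idx]
    val_targets = [labels[i] for i in val_idx]
    return train_texts, train_targets, val_texts, val_targets


# Dummy data with the same number of rows as the training dataframe.
demo_texts = [f"msg {i}" for i in range(4178)]
demo_labels = [i % 2 for i in range(4178)]
tr_x, tr_y, va_x, va_y = shuffled_split(demo_texts, demo_labels)
print(len(tr_x), len(va_x))  # 3343 835
```

Texts and labels are indexed with the same shuffled indices, so each message stays paired with its label.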


In [None]:
# Use a pre-trained text embedding as the base, then build a classifier on top of it
# Using TensorFlow Hub's google/nnlm-en-dim50-with-normalization/2
model_url = "https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2"
hub_layer = hub.KerasLayer(model_url, input_shape=[], dtype=tf.string, trainable=True)

In [11]:
# Build on the model
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(8, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 8)                 408       
                                                                 
 dense_1 (Dense)             (None, 1)                 9         
                                                                 
Total params: 48,191,017
Trainable params: 48,191,017
Non-trainable params: 0
_________________________________________________________________
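
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic follows; `dense_params` is a helper name introduced here for illustration, and the embedding count is copied from the summary rather than derived.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


embedding_params = 48_190_600      # keras_layer row of the summary
hidden_params = dense_params(50, 8)  # 50-dim embedding into 8 units
output_params = dense_params(8, 1)   # 8 units into a single output
total_params = embedding_params + hidden_params + output_params

print(hidden_params, output_params, total_params)  # 408 9 48191017
```

Almost all of the trainable parameters sit in the embedding layer, which is trained along with the two Dense layers because `trainable=True` was set on the hub layer.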


In [12]:
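# Compile the model
# The final Dense(1) layer has no sigmoid, so the model outputs raw logits.
# BinaryCrossentropy(from_logits=True) applies the sigmoid internally, and accuracy is
# thresholded at logit 0.0, which corresponds to a probability of 0.5.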
model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(from_logits=True),
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])
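
To make the relationship between logits and probabilities concrete, here is a small plain-Python sketch (no TensorFlow needed). It checks that binary cross-entropy computed directly from a logit matches the same loss computed from the sigmoid of that logit. The function names are illustrative, and the stable formula is the standard `max(z, 0) - z*y + log(1 + exp(-|z|))` form, offered as a sketch rather than a copy of the library internals.

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y, z):
    """Binary cross-entropy for label y (0 or 1) computed from a raw logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def bce_from_prob(y, p):
    """Binary cross-entropy for label y (0 or 1) computed from a probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


for z in (-4.26, 0.0, 2.5):
    for y in (0, 1):
        assert abs(bce_from_logit(y, z) - bce_from_prob(y, sigmoid(z))) < 1e-9

print(sigmoid(0.0))  # 0.5, so a logit threshold of 0.0 is a probability threshold of 0.5
```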

In [13]:
# Train the model
history = model.fit(train_data,
                    train_labels,
                    epochs=40,
                    batch_size=512,
                    validation_data=(validation_data, validation_labels),
                    verbose=1)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
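
The per-epoch metrics are not shown in the output above, but `model.fit` returns a `History` object whose `history` attribute is a dict mapping metric names to per-epoch lists. The sketch below shows one way to find the epoch with the lowest validation loss from such a dict. `best_epoch` is a hypothetical helper, and the numbers in `example_history` are invented purely to demonstrate it; they are not results from this run.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return (1-based epoch number, value) where the metric is lowest."""
    values = history_dict[metric]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Made-up values for illustration only.
example_history = {
    "loss": [0.60, 0.40, 0.25, 0.15],
    "val_loss": [0.55, 0.42, 0.38, 0.41],
}

print(best_epoch(example_history))  # (3, 0.38)
```

In the notebook this would be called as `best_epoch(history.history)`. If validation loss bottoms out well before epoch 40, training for fewer epochs may be enough.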


In [14]:
model.evaluate(test_df,test_labels)



[0.053845297545194626, 0.9820272922515869]
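
`model.evaluate` returns the loss followed by the compiled metrics, in order. A small sketch for labelling the two numbers above; the metric names are an assumption based on the `name='accuracy'` argument passed at compile time, and `label_results` is a helper introduced here.

```python
def label_results(names, values):
    """Pair metric names with the values returned by model.evaluate."""
    if len(names) != len(values):
        raise ValueError("expected one name per value")
    return dict(zip(names, values))


# Values copied from the evaluate output above.
results = [0.053845297545194626, 0.9820272922515869]
report = label_results(["loss", "accuracy"], results)

for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In the notebook, `model.metrics_names` could be passed instead of the hard-coded list.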

In [15]:
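# function to predict messages based on model
# (should return list containing prediction and label, ex. [0.008318834938108921, 'ham'])
def predict_message(pred_text):
  # The model is compiled with from_logits=True, so predict() returns a raw logit,
  # not a probability; tf.sigmoid(pred) would map it into the 0-1 range.
  # A cutoff of 0.5 on the logit is stricter than a 0.5 probability cutoff (logit 0).
  pred = model.predict([pred_text])[0]
  output = "ham"
  if pred[0] > 0.5:
    output = "spam"

  prediction = [pred[0], output]

  return prediction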

pred_text = "how are you doing today?"

prediction = predict_message(pred_text)
print(prediction)

[-4.2621765, 'ham']
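
The first element printed above is negative, so it is a raw logit rather than the number between zero and one that the challenge asks for. A minimal sketch of the conversion, in plain Python so it runs without the trained model: `logit_to_prediction` is a hypothetical helper, and inside the notebook the same mapping could be done with `tf.sigmoid` on the model output.

```python
import math


def logit_to_prediction(logit, threshold=0.5):
    """Convert a raw logit into [probability_of_spam, label]."""
    # Numerically stable sigmoid: avoids overflow for large negative logits.
    if logit >= 0:
        probability = 1.0 / (1.0 + math.exp(-logit))
    else:
        exp_logit = math.exp(logit)
        probability = exp_logit / (1.0 + exp_logit)
    label = "spam" if probability > threshold else "ham"
    return [probability, label]


# Logit taken from the printed output above; the probability is roughly 0.014.
print(logit_to_prediction(-4.2621765))
```

With this mapping, a probability threshold of 0.5 corresponds to a logit of 0, which matches the `threshold=0.0` used for the accuracy metric at compile time.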


In [16]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()

You passed the challenge. Great job!
