*Note: You are currently reading this using Google Colaboratory, which is a cloud-hosted version of Jupyter Notebook. This document contains both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

You should create a function called `predict_message` that takes a message string as an argument and returns a list. The first element in the list should be a number between zero and one that indicates the likelihood of "ham" (0) or "spam" (1). The second element in the list should be the word "ham" or "spam", depending on which is more likely.
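
As a minimal sketch of that return format, the hypothetical helper `format_prediction` below (not part of the challenge template) turns a spam probability into the two-element list. The 0.5 cutoff is an assumption; the challenge text only asks for whichever label is more likely.

```python
def format_prediction(spam_probability):
    """Return [probability, label] in the shape predict_message should use.

    format_prediction is a hypothetical helper, for illustration only.
    """
    # Assumption: a probability of 0.5 or more counts as "spam".
    label = "spam" if spam_probability >= 0.5 else "ham"
    return [spam_probability, label]


print(format_prediction(0.008318834938108921))  # [0.008318834938108921, 'ham']
print(format_prediction(0.97))                  # [0.97, 'spam']
```

In the real function the probability would come from the trained model rather than being passed in directly.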

For this challenge, you will use the [SMS Spam Collection dataset](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/). The dataset has already been grouped into train data and test data.

The first two cells import the libraries and data. The final cell tests your model and function. Add your code in between these cells.


In [None]:
# import libraries
try:
  # Install the nightly TensorFlow build (the ! prefix runs a shell command in Colab/Jupyter).
  !pip install tf-nightly
except Exception:
  pass
import tensorflow as tf
import pandas as pd
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
!pip install tensorflow-datasets
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt 

print(tf.__version__)

2.10.0-dev20220427


In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

--2022-05-02 10:12:07--  https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358233 (350K) [text/tab-separated-values]
Saving to: ‘train-data.tsv.1’


2022-05-02 10:12:07 (13.3 MB/s) - ‘train-data.tsv.1’ saved [358233/358233]

--2022-05-02 10:12:08--  https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118774 (116K) [text/tab-separated-values]
Saving to: ‘valid-data.tsv.1’


2022-05-02 10:12:08 (13.6 MB/s) - ‘valid-data.tsv.1’ saved [118774/118774]



In [None]:
#Get as dataframe
df_train = pd.read_csv('train-data.tsv',header=None,sep='\t')
df_test = pd.read_csv('valid-data.tsv',header=None,sep='\t')


In [None]:
#Change the column headings
df_train.columns=['category','text']
df_test.columns = ['category','text']

In [None]:
#convert the category labels to numbers: ham -> 0, spam -> 1
#(assigning the result back avoids relying on an in-place replace on a column selection)
label_map = {'ham': 0, 'spam': 1}
df_train['category'] = df_train['category'].map(label_map)
df_test['category'] = df_test['category'].map(label_map)


In [None]:
#print text to see what we're dealing with
samples = df_train['text'].sample(10)
for sample in samples:
  print(sample)



summers finally here! fancy a chat or flirt with sexy singles in yr area? to get matched up just reply summer now. free 2 join. optout txt stop help08714742804
lol! u drunkard! just doing my hair at d moment. yeah still up 4 tonight. wats the plan?
i can't believe how attached i am to seeing you every day. i know you will do the best you can to get to me babe. i will go to teach my class at your midnight
is that on the telly? no its brdget jones!
indians r poor but india is not a poor country. says one of the swiss bank directors. he says that  &lt;#&gt;  lac crore of indian money is deposited in swiss banks which can be used for 'taxless' budget for  &lt;#&gt;  yrs. can give  &lt;#&gt;  crore jobs to all indians. from any village to delhi 4 lane roads. forever free power suply to more than  &lt;#&gt;  social projects. every citizen can get monthly  &lt;#&gt; /- for  &lt;#&gt;  yrs. no need of world bank &amp; imf loan. think how our money is blocked by rich politicians. we have full r

In [None]:
#Cleaning: remove every character that is neither a word character (letter, digit, underscore) nor whitespace, then lower-case
df_train['text'] = df_train['text'].replace(to_replace=r'[^\w\s]',value='',regex=True).str.lower()
df_test['text'] = df_test['text'].replace(to_replace=r'[^\w\s]',value='',regex=True).str.lower()
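
To see what this cleaning step does to individual messages, here is a small sketch using plain `re` with the same pattern. The helper name `clean_text` is hypothetical and only used for illustration; the notebook itself applies the pattern through pandas.

```python
import re


def clean_text(text):
    # Drop every character that is neither a word character nor whitespace,
    # then lower-case what is left.
    return re.sub(r"[^\w\s]", "", text).lower()


print(clean_text("I'll bring it tomorrow. Don't forget the milk."))
# ill bring it tomorrow dont forget the milk
print(clean_text("&lt;#&gt; crore"))
# ltgt crore
```

This is why the vocabulary built from the cleaned text contains tokens such as `dont`, `ill` and `ltgt`: apostrophes and the punctuation of HTML entities like `&lt;#&gt;` are stripped, and the remaining letters are joined.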


In [None]:
#Separate inputs and outputs
#Inputs:
train_text = df_train.text
test_text = df_test.text

In [None]:
#Outputs
train_out = df_train.category.to_numpy().reshape(-1,1)
test_out = df_test.category.to_numpy().reshape(-1,1)

In [None]:
#Create a tokenizer using 1000 word vocab
num_words = 1000
tokenizer = Tokenizer(num_words=num_words)
#Fit the tokenizer to the training data
tokenizer.fit_on_texts(train_text)
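
The effect of `num_words` can be approximated in pure Python. The sketch below is a simplification, not the Keras implementation: the helper names `build_toy_index` and `to_toy_sequence` are hypothetical, and it assumes the text is already cleaned and split on whitespace (the real `Tokenizer` also lower-cases and filters punctuation by default). Words are ranked by frequency starting at 1, and only words ranked below `num_words` survive when text is converted to a sequence.

```python
from collections import Counter


def build_toy_index(texts):
    # Rank words by frequency; index 1 is the most frequent word.
    counts = Counter(word for text in texts for word in text.split())
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_toy_sequence(text, word_index, num_words):
    # Keep only words whose rank is below num_words; drop all others.
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]


toy_texts = ["call me now", "call now", "call"]
toy_index = build_toy_index(toy_texts)
print(toy_index)                                   # {'call': 1, 'now': 2, 'me': 3}
print(to_toy_sequence("call me now", toy_index, 3))  # [1, 2]
```

As in this sketch, the full index still lists every word seen during fitting; the cap only applies when sequences are generated.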

In [None]:
word_index = tokenizer.word_index

In [None]:
#Print the word index
word_index

{'to': 1,
 'i': 2,
 'you': 3,
 'a': 4,
 'the': 5,
 'u': 6,
 'and': 7,
 'in': 8,
 'is': 9,
 'me': 10,
 'my': 11,
 'for': 12,
 'your': 13,
 'of': 14,
 'it': 15,
 'call': 16,
 'have': 17,
 'on': 18,
 'that': 19,
 'now': 20,
 'are': 21,
 'im': 22,
 '2': 23,
 'but': 24,
 'not': 25,
 'so': 26,
 'at': 27,
 'or': 28,
 'do': 29,
 'can': 30,
 'be': 31,
 'with': 32,
 'will': 33,
 'if': 34,
 'get': 35,
 'ur': 36,
 'just': 37,
 'we': 38,
 'this': 39,
 'no': 40,
 'its': 41,
 'up': 42,
 'dont': 43,
 'go': 44,
 '4': 45,
 'ok': 46,
 'ltgt': 47,
 'free': 48,
 'when': 49,
 'out': 50,
 'how': 51,
 'all': 52,
 'from': 53,
 'what': 54,
 'know': 55,
 'like': 56,
 'got': 57,
 'then': 58,
 'ill': 59,
 'come': 60,
 'good': 61,
 'time': 62,
 'am': 63,
 'was': 64,
 'only': 65,
 'day': 66,
 'he': 67,
 'love': 68,
 'send': 69,
 'there': 70,
 'as': 71,
 'want': 72,
 'text': 73,
 'going': 74,
 'by': 75,
 'ü': 76,
 'about': 77,
 'one': 78,
 'need': 79,
 'txt': 80,
 'lor': 81,
 'still': 82,
 'our': 83,
 'n': 84,
 'see'

In [None]:
#change text to sequences
train_sequences = tokenizer.texts_to_sequences(train_text)
test_sequences = tokenizer.texts_to_sequences(test_text)

max_length = 20  # max number of tokens kept per sequence

In [None]:
#Pad the text data
train_padded = keras.preprocessing.sequence.pad_sequences(train_sequences,maxlen=max_length,padding='post',truncating='post')
test_padded = keras.preprocessing.sequence.pad_sequences(test_sequences,maxlen=max_length,padding='post',truncating='post')
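
The combination `padding='post'` and `truncating='post'` means that short sequences get zeros appended and long sequences lose their tail. A minimal pure-Python sketch of that behaviour for a single sequence follows; `pad_post` is a hypothetical helper for illustration, not the Keras function.

```python
def pad_post(sequence, maxlen, value=0):
    # Cut off anything past maxlen (the truncating='post' behaviour) ...
    trimmed = sequence[:maxlen]
    # ... then fill the tail with the padding value (the padding='post' behaviour).
    return trimmed + [value] * (maxlen - len(trimmed))


print(pad_post([5, 8, 2], maxlen=5))           # [5, 8, 2, 0, 0]
print(pad_post([5, 8, 2, 9, 4, 7], maxlen=5))  # [5, 8, 2, 9, 4]
```

With `max_length = 20`, every message therefore becomes a row of exactly 20 integers, which is the fixed input shape the model expects.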


In [None]:
#Create the model
model = keras.Sequential()
#Use Embedding Layer 
model.add(keras.layers.Embedding(num_words,32,input_length=max_length))
#Add bidirectional LSTM layer
model.add(keras.layers.Bidirectional(keras.layers.LSTM(64)))
#Sigmoid activation (0 to 1) single Dense neuron
model.add(keras.layers.Dense(1,activation='sigmoid'))
#Compile the model
model.compile(loss=keras.losses.BinaryCrossentropy(),optimizer='adam',metrics=['accuracy'])



In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 32)            32000     
                                                                 
 bidirectional (Bidirectiona  (None, 128)              49664     
 l)                                                              
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 81,793
Trainable params: 81,793
Non-trainable params: 0
_________________________________________________________________
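
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the standard LSTM parameter formula; the variable names are chosen for illustration only.

```python
vocab_size = 1000   # num_words passed to the Embedding layer
embed_dim = 32      # embedding output size
lstm_units = 64     # units per LSTM direction

# One embedding vector of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# An LSTM has four weight blocks (input, forget and output gates plus the
# cell candidate), each with input weights, recurrent weights and a bias.
lstm_params_one_direction = 4 * (embed_dim + lstm_units + 1) * lstm_units
bidirectional_params = 2 * lstm_params_one_direction

# The Dense layer sees the concatenated forward and backward outputs
# (2 * 64 = 128 values) and adds one bias.
dense_params = 2 * lstm_units * 1 + 1

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
# 32000 49664 129 81793
```

The result matches the 81,793 trainable parameters reported by `model.summary()`.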


In [None]:
#Fit the model
history = model.fit(train_padded,train_out,epochs=15,validation_data=(test_padded,test_out))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [None]:
# function to predict messages based on model
# (should return list containing prediction and label, ex. [0.008318834938108921, 'ham'])
def predict_message(pred_text):
  pred_sequence = tokenizer.texts_to_sequences([pred_text])
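  # pad_sequences already returns a batch of shape (1, max_length), so no extra list wrapper is needed
  pred_padded = keras.preprocessing.sequence.pad_sequences(pred_sequence,maxlen=max_length,padding='post',truncating='post')
  # the sigmoid output is the predicted probability of "spam"
  prob = model.predict(pred_padded)
  category_list = ['ham','spam']
  # round to 0 or 1 and cast to int so the result is always a valid list index
  prediction = [prob[0][0],category_list[int(round(prob[0][0]))]]
  return prediction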

pred_text = "how are you doing today?"

prediction = predict_message(pred_text)
print(prediction)

[0.00016217022, 'ham']


In [None]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()


You passed the challenge. Great job!
