*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

You should create a function called `predict_message` that takes a message string as an argument and returns a list. The first element in the list should be a number between zero and one that indicates how likely the message is to be "ham" (values near 0) or "spam" (values near 1). The second element in the list should be the word "ham" or "spam", depending on which is more likely.
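
As a minimal sketch of the return format described above, a spam probability can be turned into the expected two-element list as follows. The helper name `format_prediction` and the 0.5 cut-off are assumptions made for illustration; the challenge text does not prescribe either.

```python
def format_prediction(spam_probability, threshold=0.5):
    """Return [probability, label] in the shape the challenge describes.

    `threshold` is an assumed cut-off; the challenge text does not fix one.
    """
    label = "spam" if spam_probability >= threshold else "ham"
    return [spam_probability, label]


example = format_prediction(0.008318834938108921)
print(example)
```

A real `predict_message` would obtain the probability from the trained model before building the list.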

For this challenge, you will use the [SMS Spam Collection dataset](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/). The dataset has already been grouped into train data and test data.

The first two cells import the libraries and data. The final cell tests your model and function. Add your code in between these cells.


# Imports

In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

--2022-01-14 12:03:11--  https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358233 (350K) [text/tab-separated-values]
Saving to: ‘train-data.tsv.2’


2022-01-14 12:03:11 (11.1 MB/s) - ‘train-data.tsv.2’ saved [358233/358233]

--2022-01-14 12:03:11--  https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118774 (116K) [text/tab-separated-values]
Saving to: ‘valid-data.tsv.2’


2022-01-14 12:03:11 (9.74 MB/s) - ‘valid-data.tsv.2’ saved [118774/118774]



# Data Preprocessing

In [None]:
import pandas as pd
train_data = pd.read_csv(train_file_path, sep = '\t', header = None, names = ('type', 'message'))
test_data = pd.read_csv(test_file_path, sep = '\t', header = None, names = ('type', 'message'))

In [None]:
train_data.shape

(4179, 2)

In [None]:
train_data.head()

| | type | message |
|---|---|---|
| 0 | ham | ahhhh...just woken up!had a bad dream about u ... |
| 1 | ham | you can never do nothing |
| 2 | ham | now u sound like manky scouse boy steve,like! ... |
| 3 | ham | mum say we wan to go then go... then she can s... |
| 4 | ham | never y lei... i v lazy... got wat? dat day ü ... |


In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4179 entries, 0 to 4178
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   type     4179 non-null   object
 1   message  4179 non-null   object
dtypes: object(2)
memory usage: 65.4+ KB


In [None]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1392 entries, 0 to 1391
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   type     1392 non-null   object
 1   message  1392 non-null   object
dtypes: object(2)
memory usage: 21.9+ KB


# Stemming

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
port_stem = PorterStemmer()
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def stemming(argument):
    stemmed_argument = re.sub('[^a-zA-Z]', ' ', argument)
    stemmed_argument = stemmed_argument.lower()
    stemmed_argument = stemmed_argument.split()
    stemmed_argument = [port_stem.stem(word) for word in stemmed_argument if word not in stop_words]
    stemmed_argument = ' '.join(stemmed_argument)
    return stemmed_argument

Explanation of the above function (line numbers refer to the statements inside the function body):

* Line 1: Keeps only the letters a-z and A-Z in the message and replaces every other character with a space.
* Line 2: Converts all letters to lower case.
* Line 3: Splits the text into a list of words.
* Line 4: Removes the stop words and stems each remaining word.
* Line 5: Joins the stemmed words back into a single string, separated by spaces.
* Line 6: Returns the result.
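
The cleaning steps listed above can be sketched with the standard library alone. This is an illustration under stated assumptions: `TOY_STOPWORDS` is a tiny hypothetical stop-word set (the notebook uses NLTK's English list), `clean_tokens` is an illustrative name, and the Porter stemming step is deliberately left out.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook itself uses NLTK's English stop-word list.
TOY_STOPWORDS = {"a", "the", "up", "had", "about", "just"}


def clean_tokens(text, stop_words=TOY_STOPWORDS):
    """Keep letters only, lower-case, split into words, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # non-letters become spaces
    tokens = letters_only.lower().split()           # lower-case, then split on whitespace
    return [token for token in tokens if token not in stop_words]


tokens = clean_tokens("Ahhhh...just woken up!had a bad dream")
print(tokens)
```

Replacing punctuation with spaces before splitting is what separates `up!had` into two words instead of leaving them fused.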

In [None]:
train_data['message'] = train_data['message'].apply(stemming)
test_data['message'] = test_data['message'].apply(stemming)

In [None]:
train_data.head()

| | type | message |
|---|---|---|
| 0 | ham | ahhhh woken bad dream u tho dont like u right ... |
| 1 | ham | never noth |
| 2 | ham | u sound like manki scous boy steve like travel... |
| 3 | ham | mum say wan go go shun bian watch da glass exh... |
| 4 | ham | never lei v lazi got wat dat day send da url c... |


In [None]:
train_data['message'][2]

'u sound like manki scous boy steve like travel da bu home wot u inmind recreat di eve'

In [None]:
# Length (in characters) of the longest stemmed message;
# avoids shadowing the built-in max() and hard-coding the row count
max_len = max(len(message) for message in train_data['message'])
max_len

412

# Feature Extraction and Data Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [None]:
x_train = train_data['message'].values
y_train = train_data['type'].values
x_test = test_data['message'].values
y_test = test_data['type'].values

In [None]:
type(x_train), x_train[0]

(numpy.ndarray,
 'ahhhh woken bad dream u tho dont like u right didnt know anyth comedi night guess im')

In [None]:
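# Learn the TF-IDF vocabulary from the training messages only, then apply it to both splits
vectorizer.fit(x_train)
x_train = vectorizer.transform(x_train)
x_test = vectorizer.transform(x_test)
# Each label is a single token ('ham' or 'spam'), so a second TF-IDF vectorizer
# effectively one-hot encodes it: 'ham' -> [1, 0], 'spam' -> [0, 1]
vectorizer_y = TfidfVectorizer()
vectorizer_y.fit(y_train)
y_train = vectorizer_y.transform(y_train)
y_test = vectorizer_y.transform(y_test)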

In [None]:
type(x_train), type(y_train)

(scipy.sparse.csr.csr_matrix, scipy.sparse.csr.csr_matrix)

In [None]:
x_train_np = x_train.toarray()
y_train_np = y_train.toarray()
x_test_np = x_test.toarray()
y_test_np = y_test.toarray()

In [None]:
y_train_np[100]

array([1., 0.])
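
The row above is effectively a one-hot encoding of the label, with `ham` in the first column and `spam` in the second. A minimal sketch of the same encoding written out by hand follows; the names `LABELS` and `one_hot` are illustrative and not part of the notebook, and the column order is assumed to follow the alphabetically sorted vocabulary.

```python
LABELS = ["ham", "spam"]  # column order assumed to match the sorted vocabulary


def one_hot(label):
    """Encode a label as [is_ham, is_spam]."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return [1.0 if label == name else 0.0 for name in LABELS]


encoded = [one_hot(label) for label in ["ham", "spam", "ham"]]
print(encoded)
```

Writing the mapping explicitly makes the column order visible, which matters later when reading the model's two output probabilities.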

In [None]:
x_train[0].shape

(1, 5412)
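
The second dimension (5412) is the size of the vocabulary learned from the training messages. Every message is mapped onto those same columns, and words that were not seen during fitting contribute nothing. The sketch below illustrates that idea with plain term counts rather than TF-IDF weights, using only the standard library; `build_vocabulary` and `to_count_vector` are hypothetical helpers, not scikit-learn functions.

```python
def build_vocabulary(documents):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(document, vocabulary):
    """Count occurrences of known words; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


train_docs = ["never noth", "mum say wan go go"]
vocab = build_vocabulary(train_docs)
vec = to_count_vector("go go home", vocab)  # "home" is not in the vocabulary
print(len(vocab), vec)
```

The vector length depends only on the training vocabulary, which is why the test messages end up with the same number of columns as the training messages.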

# Neural Network Classifier

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

In [None]:
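model_mlp = Sequential()
# Note: input_shape = (None, 5412) declares an extra, unspecified axis, which is why the summary
# reports (None, None, ...) output shapes; input_shape = (5412,) is the usual choice for 2-D input.
model_mlp.add(Dense(5412, input_shape = (None, 5412), activation = 'relu'))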
model_mlp.add(Dropout(0.5))
model_mlp.add(Dense(2048, activation= 'relu'))
model_mlp.add(Dropout(0.5))
model_mlp.add(Dense(512, activation = 'relu'))
model_mlp.add(Dropout(0.5))
model_mlp.add(Dense(128, activation = 'relu'))
model_mlp.add(Dropout(0.5))
model_mlp.add(Dense(32, activation = 'relu'))
model_mlp.add(Dropout(0.5))
model_mlp.add(Dense(8, activation = 'relu'))
model_mlp.add(Dropout(0.5))
model_mlp.add(Dense(2, activation = 'softmax'))
model_mlp.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_22 (Dense)            (None, None, 5412)        29295156  
                                                                 
 dropout_18 (Dropout)        (None, None, 5412)        0         
                                                                 
 dense_23 (Dense)            (None, None, 2048)        11085824  
                                                                 
 dropout_19 (Dropout)        (None, None, 2048)        0         
                                                                 
 dense_24 (Dense)            (None, None, 512)         1049088   
                                                                 
 dropout_20 (Dropout)        (None, None, 512)         0         
                                                                 
 dense_25 (Dense)            (None, None, 128)        
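
The parameter counts in the summary can be checked by hand: a fully connected layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A small sketch of that arithmetic (the helper `dense_params` is illustrative, not a Keras function):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Input width followed by the widths of the first three Dense layers
layer_sizes = [5412, 5412, 2048, 512]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)  # [29295156, 11085824, 1049088]
```

Almost all of the parameters sit in the first two layers, because their width matches the 5412-column TF-IDF input.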

In [None]:
model_mlp.compile(loss = 'binary_crossentropy', optimizer = 'rmsprop', metrics = ['accuracy'])

In [None]:
fit = model_mlp.fit(x_train_np, y_train_np, epochs = 20, validation_data = (x_test_np, y_test_np))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
predict_mlp = model_mlp.predict(x_test_np)

In [None]:
predict_mlp[0][0]

1.0

# Final Evaluation

In [None]:
# function to predict messages based on model
# (the challenge asks for a list containing prediction and label, ex. [0.008318834938108921, 'ham'];
# this version returns only the label, which is what the test cell below compares against)
def predict_message(pred_text):
  # Note: the message is vectorized as-is, without the stemming step applied to the training data
  x_temp = vectorizer.transform([pred_text])
  x_temp_np = x_temp.toarray()
  pred = model_mlp.predict(x_temp_np)
  # pred[0] is [P(ham), P(spam)]; only a fully saturated ham probability (exactly 1.0)
  # is labelled 'ham', anything lower is labelled 'spam'
  if pred[0][0] == 1.0:
    prediction = 'ham'
  else:
    prediction = 'spam'
  return prediction

pred_text = "how are you doing today?"

prediction = predict_message(pred_text)
print(prediction)

ham


In [None]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    print(prediction)
    if prediction != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()


ham
spam
ham
spam
spam
ham
ham
You passed the challenge. Great job!
