# Logistic regression for SMS spam classification


Each line of the data file `sms.txt`
contains a label---either "spam" or "ham" (i.e. non-spam)---followed
by a text message. Here are a few examples (line breaks added for readability):

    ham     Ok lar... Joking wif u oni...
    ham     Nah I don't think he goes to usf, he lives around here though
    spam    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.
            Text FA to 87121 to receive entry question(std txt rate)
            T&C's apply 08452810075over18's
    spam    WINNER!! As a valued network customer you have been
            selected to receivea £900 prize reward! To claim
            call 09061701461. Claim code KL341. Valid 12 hours only.

To create features suitable for logistic regression, use tools from `sklearn.feature_extraction.text` to carry out the following steps:

* Convert words to lowercase.
* Remove punctuation and special characters (but convert the \$ and
  £ symbols to special tokens and keep them, because these are useful for predicting spam).
* Create a dictionary containing the 3000 words that appeared
  most frequently in the entire set of messages.
* Encode each message as a vector $\mathbf{x}^{(i)} \in
  \mathbb{R}^{3000}$. The entry $x^{(i)}_j$ is equal to the
  number of times the $j$th word in the dictionary appears in that
  message.
* Discard some ham messages to have an
  equal number of spam and ham messages.
* Split data into a training set of 1000 messages and a
  test set of 400 messages.
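
The first four steps above can be sketched without any library, as shown below. This is only an illustration under stated assumptions: the placeholder tokens `dollarsign` and `poundsign`, the helper names `tokenize`, `build_dictionary`, and `encode`, and the toy messages are all invented for this sketch and do not come from `sms.txt`. Balancing the classes and splitting the data are left out.

```python
import re
from collections import Counter

def tokenize(message):
    """Lowercase, map currency symbols to tokens, drop other punctuation."""
    text = message.lower()
    # Placeholder token names are an assumption of this sketch.
    # "\u00a3" is the pound sign.
    text = text.replace("$", " dollarsign ").replace("\u00a3", " poundsign ")
    # Replace everything except letters, digits and whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

def build_dictionary(messages, size):
    """Return the `size` most frequent tokens across all messages."""
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    return [word for word, _ in counts.most_common(size)]

def encode(message, dictionary):
    """Count vector: entry j is how often dictionary[j] occurs in the message."""
    counts = Counter(tokenize(message))
    return [counts[word] for word in dictionary]

# Toy messages for illustration only (not taken from sms.txt).
toy_messages = ["Win \u00a3900 now!!", "ok see you now", "WIN a prize, win $$$"]
toy_dictionary = build_dictionary(toy_messages, size=3)
toy_vectors = [encode(m, toy_dictionary) for m in toy_messages]
```

In the assignment the dictionary size would be 3000 rather than 3. `CountVectorizer` from `sklearn.feature_extraction.text` offers comparable behavior through its `lowercase` and `max_features` parameters.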
  
Follow the instructions below to complete the implementation. You will be asked to: 

* Write code that implements the logistic regression algorithm (you may use the sklearn library for this, but doing so affects your score).
* Make predictions and report the accuracy on the test set
* Test out the classifier on a few of your own text messages

# Build logistic regression classifier
For this part, you can refer to week 3 of Andrew Ng's machine learning course on Coursera.
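
For reference, the model and update rule that a gradient descent implementation typically uses are summarized below. This is a sketch in the notation of the feature vectors above, assuming labels $y^{(i)} \in \{0, 1\}$ (1 for spam), $m$ training messages, a learning rate $\alpha$, and a constant bias feature $x^{(i)}_0 = 1$; the assignment text does not fix any of these symbols.

$$
h_\theta(\mathbf{x}) = \sigma(\theta^\top \mathbf{x}), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(\mathbf{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(\mathbf{x}^{(i)})\right) \right]
$$

$$
\theta_j \leftarrow \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right) x^{(i)}_j
$$

The last line is the gradient of $J$ with respect to $\theta_j$, scaled by $\alpha$ and applied to all $j$ simultaneously at each iteration.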

In [None]:
import pandas as pd
import numpy as np
import string
from nltk.tokenize import word_tokenize  
from nltk.corpus import stopwords
from collections import Counter
from sklearn.model_selection import train_test_split

In [8]:
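class LogisticRegression():
    """Binary logistic regression trained with batch gradient descent.

    X has shape (n_features, m), one column per example; Y has shape (m,).
    """
    def __init__(self, X, Y, iter_num = 10000, learning_rate = 0.01):

        m = X.shape[1]

        # Prepend a row of ones so that theta[0, 0] acts as the bias term.
        self.X = np.concatenate((np.ones((1,m)), X), axis=0)
        self.Y = Y

        n_x = self.X.shape[0]

        self.theta = np.zeros((1,n_x))

        # Training runs in the constructor: iter_num gradient descent steps.
        for _ in range(iter_num):
            grads = self.compute_grads()
            self.theta -= learning_rate * grads

    def sigmoid(self, z):
        return 1.0 / (1 + np.exp(-z))

    def compute_grads(self):
        # Gradient of the average cross-entropy loss: (1/m) * (h - Y) X^T.
        m = self.X.shape[1]
        h = self.sigmoid(np.dot(self.theta, self.X))
        return (1/m) * np.dot(h - self.Y, self.X.T)

    def predict(self, X_test):
        # Add the bias row, then threshold the predicted probability at 0.5.
        X_test = np.concatenate((np.ones((1,X_test.shape[1])), X_test), axis=0)
        predicted_Y = self.sigmoid(np.dot(self.theta, X_test))
        return (predicted_Y > 0.5).astype(int)

    def accuracy(self, X_test, Y_test):
        # Fraction of test examples whose predicted label matches Y_test.
        Y_pred = self.predict(X_test)
        corrects = np.sum(Y_pred == Y_test)
        return corrects/Y_test.shape[0]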



# Load and prep data
Using the provided instructions, load and preprocess the data.

In [9]:
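import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')  # fetch the stop-word list; do not bind the return value to `stopwords`
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from collections import Counter  # `Sequence` was unused and cannot be imported from `collections` on Python 3.10+
import string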

ps = PorterStemmer()
# Read the whole file and close it afterwards.
with open('sms.txt') as f:
    raw_data = f.read()
parsed_data = raw_data.replace('\t','\n').split('\n')
label_list = parsed_data[0::2]
msg_list = parsed_data[1::2]
print(label_list[0:4])
print(msg_list[0:5]) 
print(len(label_list))
print(len(msg_list))
print(label_list[-3:])
combined_df = pd.DataFrame({'label' : label_list[:-1], 'msg_list' : msg_list})
print(combined_df.head())

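# Build the stop-word set once; calling stopwords.words() for every word is slow.
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Note: this strips every character in string.punctuation, including "$".
    for punc in string.punctuation:
        text = text.replace(punc, "")
    # split() without arguments also drops the empty tokens left by repeated spaces.
    text = text.split()
    text = [word for word in text if word not in stop_words]
    return text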
 
combined_df['msg_list'] = combined_df['msg_list'].apply(lambda x: clean_text(x.lower()))

def stem_and_tokenize(text):
    text = [ps.stem(word) for word in text]
    return text

combined_df['msg_list'] = combined_df['msg_list'].apply(lambda x: stem_and_tokenize(x))
clean_msg = list(combined_df.msg_list)

dictionary = []
for i in clean_msg:
  dictionary.extend(i)
words_occurences = dict(Counter(dictionary))

n_x = 500
words_occurences = dict(sorted(words_occurences.items(), key=lambda item: item[1], reverse=True)[:n_x])
corpus = list(words_occurences.keys())

X = np.zeros((n_x, len(clean_msg)))
for i in range(len(clean_msg)):
  X[:, i] = [clean_msg[i].count(x) for x in corpus]
X = X.T
print(X.shape)

Y=[1 if i=='spam' else 0 for i in combined_df.label]
Y = np.array(Y)
print(Y.shape)

['ham', 'ham', 'spam', 'ham']
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...', "Nah I don't think he goes to usf, he lives around here though"]
5575
5574
['ham', 'ham', '']
  label                                           msg_list
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...



# Train logistic regression model
Train the model using the logistic regression implementation from above.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
model = LogisticRegression(X_train.T, y_train, 10000)

# Make predictions on test set
Use the model fit in the previous cell to make predictions on the test set and compute the accuracy (percentage of messages in the test set that are classified correctly). You should be able to get accuracy above 95%.
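
As a small illustration of the accuracy measure described above, the sketch below computes the fraction of matching labels in plain Python. The function name `accuracy` and the toy label lists are hypothetical and only serve the example; they are not the notebook's data.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Toy labels for illustration only (1 = spam, 0 = ham).
toy_predicted = [1, 0, 0, 1, 0]
toy_actual = [1, 0, 1, 1, 0]
toy_accuracy = accuracy(toy_predicted, toy_actual)
print(f"{toy_accuracy:.0%}")  # prints "80%"
```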


In [1]:
print(model.accuracy(X_test.T, y_test))


# Inspect model parameters
Find which words are the strongest indicators of spam and of ham, based on the learned weights.
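
One way to read an individual weight, assuming the standard logistic model with label 1 for spam and $\theta_j$ denoting the weight of dictionary word $j$, is through the log-odds:

$$
\log \frac{P(\text{spam} \mid \mathbf{x})}{P(\text{ham} \mid \mathbf{x})} = \theta^\top \mathbf{x}
$$

Under this model, each additional occurrence of word $j$ multiplies the odds of spam by $e^{\theta_j}$. Large positive weights therefore point toward spam and large negative weights toward ham.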

In [2]:
weights = model.theta[0,1:]
spam_words = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:5]
ham_words = sorted(range(len(weights)), key=lambda i: weights[i])[:5]

print("Top spam words:")
for i in spam_words:
  print(corpus[i])
print("\n")
print("Top ham words:")
for i in ham_words:
  print(corpus[i])


# Make predictions on new messages
Type a few of your own messages below and make predictions. Are they ham or spam? Do the predictions make sense?
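
One possible way to turn a new message into model input is sketched below. It assumes the message has already been cleaned and stemmed into a token list in the same way as the training data; `message_to_counts`, `toy_corpus`, and `toy_tokens` are hypothetical names used only for illustration.

```python
def message_to_counts(tokens, corpus):
    """Count how often each dictionary word occurs in a tokenized message."""
    return [tokens.count(word) for word in corpus]

# Toy dictionary and token list for illustration only.
toy_corpus = ["free", "call", "ok"]
toy_tokens = ["free", "entri", "call", "free"]
features = message_to_counts(toy_tokens, toy_corpus)
print(features)  # prints [2, 1, 0]
```

In this notebook, the token list would presumably come from `stem_and_tokenize(clean_text(msg.lower()))` and the dictionary from `corpus`. Reshaping the counts with `np.array(features).reshape(-1, 1)` gives the single-column input that `model.predict` expects, where an output of 1 means spam.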

In [7]:
#code here