<a href="https://colab.research.google.com/github/Fyfy1996/Natural_language_understanding/blob/master/HW_1_Part_1_Spam_Prediction_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1.

The deadline for Part 1 is **1:30 pm Feb 6th, 2020**.   
You should submit a `.ipynb` file with your solutions to NYU Classes.

---


In this part we will preprocess the SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection.

## Data Loading

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and has been uploaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [0]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2020-02-07 00:00:39--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 172.217.204.113, 172.217.204.100, 172.217.204.139, ...
Connecting to docs.google.com (docs.google.com)|172.217.204.113|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/vfmq1p64nlkcafn8o9flb5b6gc4n4dv5/1581033600000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2020-02-07 00:00:39--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/vfmq1p64nlkcafn8o9flb5b6gc4n4dv5/1581033600000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 172.217.204.132, 2607:f8b0:400c:c15::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14

In [0]:
!ls

sample_data  spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [0]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

| | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Your task is to split the data into train/dev/test sets. Make sure that each row appears in exactly one of the splits.

In [0]:
import random

In [0]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

"""
YOUR CODE GOES HERE
"""
ind = list(df.index)
random.shuffle(ind)
val_ind = ind[0:val_size]
test_ind = ind[val_size:(test_size+val_size)]
train_ind = ind[(test_size+val_size):]  # slice to the end; ":-1" would silently drop the last row

train_texts, train_labels = df.loc[train_ind, "v2"].reset_index(drop=True), df.loc[train_ind,"v1"].reset_index(drop=True)
val_texts, val_labels   = df.loc[val_ind, "v2"].reset_index(drop=True), df.loc[val_ind,"v1"].reset_index(drop=True)
test_texts, test_labels  = df.loc[test_ind, "v2"].reset_index(drop=True), df.loc[test_ind,"v1"].reset_index(drop=True)
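
The split above depends on the global random state, so each run produces a different partition. A minimal sketch of a seeded variant with an explicit disjointness check follows; the helper name `split_indices` and the `seed` parameter are illustrative assumptions, not part of the assignment.

```python
import random

def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle row positions 0..n_rows-1 and cut them into train/val/test lists."""
    indices = list(range(n_rows))
    # A local RNG keeps the shuffle reproducible without touching the global random state
    random.Random(seed).shuffle(indices)
    val_size = int(n_rows * val_frac)
    test_size = int(n_rows * test_frac)
    val = indices[:val_size]
    test = indices[val_size:val_size + test_size]
    train = indices[val_size + test_size:]  # slice to the end so no row is dropped
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)

# Every row lands in exactly one split
assert len(train_idx) + len(val_idx) + len(test_idx) == 100
assert not set(train_idx) & set(val_idx)
assert not set(train_idx) & set(test_idx)
assert not set(val_idx) & set(test_idx)
```

With a pandas DataFrame whose index is the default `0..n-1` range, the returned lists could be passed to `df.loc[...]` in the same way as `train_ind`, `val_ind` and `test_ind` above.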

## Data Processing

The task is to create bag-of-words features: tokenize the text, index each token, represent each message as a dictionary of tokens and their counts, and limit the vocabulary to the $n$ most frequent tokens. In the lab we use the built-in `sklearn` class `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

The function `preprocess_data` takes a list of texts and returns a list of token lists, one per message. 
You may use the [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in the `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.
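
To make the steps listed above concrete, here is a small worked example on a toy corpus. It is only a sketch: the three documents are made-up data, and `to_vector` is a hypothetical helper rather than the required `Vectorizer` interface.

```python
from collections import Counter

# Toy corpus, already tokenized (made-up data, not from the SMS dataset)
docs = [["free", "prize", "free"], ["call", "me"], ["free", "call"]]

# 1. Count token frequencies over the whole corpus
counts = Counter(token for doc in docs for token in doc)

# 2. Keep the n most frequent tokens and give each one a column index
n = 2
vocab = [token for token, _ in counts.most_common(n)]
token_to_index = {token: i for i, token in enumerate(vocab)}

# 3. Represent each document as a vector of counts over the vocabulary
def to_vector(doc):
    vec = [0] * len(vocab)
    for token in doc:
        if token in token_to_index:  # out-of-vocabulary tokens are ignored
            vec[token_to_index[token]] += 1
    return vec

vectors = [to_vector(doc) for doc in docs]
print(vocab)    # ['free', 'call']
print(vectors)  # [[2, 0], [0, 1], [1, 1]]
```

Tokens outside the top $n$ (here `prize` and `me`) contribute nothing to the feature vectors, which is the effect of limiting the vocabulary.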


In [0]:
def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    """
    YOUR CODE GOES HERE
    """
    preprocessed_data = data.apply(lambda x: x.split(" "))
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

In [0]:
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will return index of the token in self.vocab
        """
        YOUR CODE GOES HERE
        """
        # Count how often each token occurs in the training data
        vocab_dict = {}
        for tokens in dataset:
            for token in tokens:
                vocab_dict[token] = vocab_dict.get(token, 0) + 1
        # Keep the max_features most frequent tokens
        sorted_vocab = sorted(vocab_dict.items(), key=lambda item: item[1], reverse=True)
        self.vocab_list = [token for token, _ in sorted_vocab[:self.max_features]]
        # Map each token to its column index in the feature matrix
        self.token_to_index = {token: idx for idx, token in enumerate(self.vocab_list)}

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        YOUR CODE GOES HERE
        """
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        for i, tokens in enumerate(dataset):
            for token in tokens:
                j = self.token_to_index.get(token)
                if j is not None:  # skip tokens that are not in the vocabulary
                    data_matrix[i, j] += 1

        return data_matrix

In [0]:
max_features = 888  # number of most frequent tokens kept in the vocabulary
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list


You can add more features to the feature matrix.
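
As one possible illustration, hand-crafted columns can be appended to a bag-of-words matrix with `np.hstack`. The sketch below uses toy stand-ins (`X_bow`, `texts`) and a hypothetical helper `extra_features`; whether these particular features help on this dataset is not something the notebook establishes.

```python
import numpy as np

def extra_features(texts):
    """Return one row per message: [characters, digits, uppercase letters]."""
    rows = []
    for text in texts:
        rows.append([
            len(text),
            sum(ch.isdigit() for ch in text),
            sum(ch.isupper() for ch in text),
        ])
    return np.array(rows, dtype=float)

# Toy stand-ins for the bag-of-words matrix and the raw messages
X_bow = np.array([[1.0, 0.0], [0.0, 2.0]])
texts = ["Call 0800 NOW", "ok see you"]

# Append the new columns to the right of the existing features
X_aug = np.hstack([X_bow, extra_features(texts)])
print(X_aug.shape)  # (2, 5)
```

The same function would have to be applied to the train, val and test texts so that all three matrices keep the same number of columns.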

In [0]:
"""
YOUR CODE GOES HERE
"""

'\nYOUR CODE GOES HERE\n'

## Model

We train a logistic regression model and save its predictions for the train, val, and test sets.


In [0]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

## Performance of the model

Your task is to report train, val, test accuracies and F1 scores.
**You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.**

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [0]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    accuracy = np.sum(y_true == y_pred) / len(y_true)
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    """
    YOUR CODE GOES HERE
    """
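    # True positives: messages that are spam and were predicted as spam
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    if true_positives == 0:
        # Avoid division by zero; F1 is taken to be 0 when there are no true positives
        return 0.0
    recall = true_positives / np.sum(y_true == 1)
    precision = true_positives / np.sum(y_pred == 1)
    f1 = 2 * (recall * precision) / (recall + precision)
    return f1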

In [0]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.990, F1 score: 0.960
Validation accuracy: 0.971, F1 score: 0.899
Test accuracy: 0.975, F1 score: 0.911


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If not, which metric is optimized in logistic regression?

**Your answer:** No. Logistic regression minimizes the binary cross-entropy (log loss) of its predicted probabilities, not accuracy, recall, precision, or F1 score.
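
The difference between the two quantities can be seen on a hand-made example. The sketch below computes the mean binary cross-entropy and the accuracy for two made-up sets of predicted probabilities; `log_loss` and `accuracy` here are small illustrative helpers, not the `sklearn` functions, and the `sklearn` model used above additionally applies L2 regularization by default.

```python
import math

def log_loss(y_true, p_pred):
    """Mean binary cross-entropy for labels in {0, 1} and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

y = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]    # made-up probabilities, far from the threshold
hesitant = [0.6, 0.4, 0.55, 0.45]   # made-up probabilities, close to the threshold

# Both sets of probabilities give the same accuracy...
print(accuracy(y, confident), accuracy(y, hesitant))
# ...but the confident, correct predictions have the lower loss
print(log_loss(y, confident), log_loss(y, hesitant))
```

Because the loss keeps decreasing as correct predictions become more confident, training can continue to improve the objective even when accuracy no longer changes.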

**Question.**
In general, does having 0.99 accuracy on the test set mean that the model is great? If not, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** No. On imbalanced data, high accuracy says little. For example, if 99 out of 100 people screened for a disease do not have it, always predicting 0 achieves 99% accuracy but never detects a single case, so it is clearly not a good model. Recall and precision are computed with respect to the positive class only, so a model that ignores the minority class scores poorly on them. The F1 score, their harmonic mean, therefore gives a more informative assessment of such models.
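
The disease example can be checked numerically. The following sketch uses plain Python helpers (`accuracy` and `f1` are illustrative names) and adopts the convention that F1 is 0 when there are no true positives.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # convention used here: no true positives means F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 99 healthy people (label 0) and 1 sick person (label 1)
y_true = [0] * 99 + [1]
# A "model" that always predicts the majority class
always_negative = [0] * 100

print(accuracy(y_true, always_negative))  # 0.99
print(f1(y_true, always_negative))        # 0.0
```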

### Exploration of predictions

Show a few examples with true+predicted labels on the train and val sets.

In [0]:
"""
YOUR CODE GOES HERE
"""
# 1 - spam, 0 - ham
spam = {1:"spam",0:"ham"}
for i in range(5):
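  j = random.randrange(len(y_val))  # randint's upper bound is inclusive and could index out of range
  print("True label: ", spam[y_val[j]] )
  print("Prediction: ", spam[y_val_pred[j]] )
  print(val_texts[j])  # print the validation message that these labels belong to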
  print("\n")

True label:  spam
Prediction:  spam
:-) yeah! Lol. Luckily i didn't have a starring role like you!


True label:  spam
Prediction:  spam
I.ll give her once i have it. Plus she said grinule greet you whenever we speak


True label:  ham
Prediction:  ham
Probably, want to pick up more?


True label:  ham
Prediction:  ham
Havent planning to buy later. I check already lido only got 530 show in e afternoon. U finish work already?


True label:  ham
Prediction:  ham
Midnight at the earliest




**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** Most of the misclassified examples contain URLs, telephone numbers, or unusual symbols. With whitespace tokenization, such tokens are rare and likely fall outside the vocabulary, which may be why these messages are classified incorrectly.

In [0]:
"""
YOUR CODE GOES HERE
"""

wrong_samples = val_texts.loc[(y_val != y_val_pred)].head(10)
for i in wrong_samples.index:
  print(wrong_samples.loc[i],"\n")

Hi I'm sue. I am 20 years old and work as a lapdancer. I love sex. Text me live - I'm i my bedroom now. text SUE to 89555. By TextOperator G2 1DA 150ppmsg 18+ 

SMS. ac JSco: Energy is high, but u may not know where 2channel it. 2day ur leadership skills r strong. Psychic? Reply ANS w/question. End? Reply END JSCO 

URGENT. Important information for 02 user. Today is your lucky day! 2 find out why , log onto http://www.urawinner.com there is a fantastic surprise awaiting you ! 

FREE2DAY sexy St George's Day pic of Jordan!Txt PIC to 89080 dont miss out, then every wk a saucy celeb!4 more pics c PocketBabe.co.uk 0870241182716 å£3/wk 

Can you call me plz. Your number shows out of coveragd area. I have urgnt call in vasai &amp; have to reach before 4'o clock so call me plz 

Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B 

Hi its LUCY Hubby at meetins all day Fri & I will B alone at hotel U fancy cumin over? Pls leave msg 2day
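
If URLs and numbers are indeed the problem, one option is to map them to shared placeholder tokens before counting, so that every phone number or link contributes to the same feature. The sketch below is an assumption about what might help, not something evaluated in this notebook; `normalize_tokens` and the `<url>` / `<num>` placeholders are hypothetical names.

```python
import re

URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
NUM_RE = re.compile(r"\d+")
# Placeholders first, then words, then any single leftover symbol
TOKEN_RE = re.compile(r"<url>|<num>|[a-z']+|[^\sa-z']")

def normalize_tokens(text):
    """Lowercase the text, replace URLs and digit runs with placeholders, then tokenize."""
    text = text.lower()
    text = URL_RE.sub(" <url> ", text)
    text = NUM_RE.sub(" <num> ", text)
    return TOKEN_RE.findall(text)

print(normalize_tokens("Txt PIC to 89080, see http://www.urawinner.com now!"))
# ['txt', 'pic', 'to', '<num>', ',', 'see', '<url>', 'now', '!']
```

The URL pattern only catches links that start with `http://`, `https://` or `www.`, so bare domains such as `PocketBabe.co.uk` would still be split into ordinary tokens.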

## End of Part 1.
