# Part 2

In this assignment you will learn to build *Bag Of Words* and *N-Gram* models from scratch. You will then implement them using the scikit-learn library. At the end of this assignment you will have your own *Sarcasm Detector*.

In [68]:
# Initialize Otter
import otter
grader = otter.Notebook()

In [69]:
# Add Libraries Here
import pandas as pd
import numpy as np
import re  # the standard-library module is enough for the patterns used here

## Dataset

The dataset is the **News Headline Dataset for Sarcasm Detection**. It contains a binary label for each news headline:

* 0 -> No Sarcasm
* 1 -> Sarcasm

The dataset is provided with an **80-20 train-test** split, in `train.csv` and `test.csv`.

In [70]:
train_data = pd.read_csv("train.csv")
print("Shape:", train_data.shape)
train_data.head()

Shape: (21367, 2)


| | headlines | is_sarcastic |
|---|---|---|
| 0 | shares of hazmat-suit maker spike on nyc ebola... | 0 |
| 1 | miss america called before u.n. council for no... | 1 |
| 2 | tarsiers, the world's smallest primate: animal... | 0 |
| 3 | 'richie rich' comics introduces new, even gaye... | 1 |
| 4 | no one able to tell clam just had stroke | 1 |


Now we will separate the data and the labels.

In [71]:
train_X = train_data["headlines"] # Data that we will pre-process
train_Y = train_data["is_sarcastic"] # True labels

## Feature Engineering + Preprocessing

Your job is to use your **Feature Engineering Skills** to convert the text into a feature vector. For this purpose we will first build our `Bag Of Words` and `N-Gram` models. But first we need to make our text uniform.

### 1) Text Normalization

The first step when dealing with natural language is `Text Normalization`, which is done in the following three steps.

**Question 1:** Write a function `remove_punctuation_marks` that takes a text and replaces all punctuation marks with a single space. (3)

**Example:** Let's say we have a text, `"Are you ready to Co-ordinate with US?"`. The function must return this text `"Are you ready to Co ordinate with US "`

*Hint: You can use a regex to detect anything that is not a word.*

<!--
BEGIN QUESTION
name: q1
points: 3
-->

In [72]:
def remove_punctuation_marks(text):
    mystring = re.sub(r'[^\w\s]', ' ', text)
    return mystring
remove_punctuation_marks("Are you ready to Co-ordinate with US?")

'Are you ready to Co ordinate with US '

In [73]:
grader.check("q1")
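
Because each punctuation mark becomes its own space, a run of punctuation leaves a run of consecutive spaces behind. The short sketch below illustrates this with a made-up sample string (not taken from the dataset) and a stand-in function named `strip_punctuation`, which is an illustrative name only. It also shows why splitting on arbitrary whitespace with `str.split()` tends to be safer in later steps than `str.split(" ")`, which keeps empty tokens.

```python
import re

def strip_punctuation(text):
    # Same idea as remove_punctuation_marks: every character that is
    # neither a word character nor whitespace becomes one space.
    return re.sub(r"[^\w\s]", " ", text)

sample = "wait... what?!"   # hypothetical input, for illustration only
cleaned = strip_punctuation(sample)

print(repr(cleaned))        # shows the repeated and trailing spaces
print(cleaned.split(" "))   # splitting on a single space keeps empty strings
print(cleaned.split())      # splitting on any whitespace yields only the words
```

Note that `\w` also matches the underscore, so this pattern leaves `_` in place.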

**Question 2:** Write a function `to_lower_case` that takes a text and converts it into lower-case. (2)

**Example:** Now the above processed text, `"Are you ready to Co ordinate with US "` must become `"are you ready to co ordinate with us "`.

<!--
BEGIN QUESTION
name: q2
points: 2
-->

In [74]:
def to_lower_case(text):
    text = text.lower()
    return text


In [75]:
grader.check("q2")

**Question 3:** Write a function `remove_stop_words` that takes a text and a list of stop words, and removes every word that appears in the stop-word list. The stop words are provided to you in a file (i.e. `stopwords.txt`). (5)

**Example:** This function will convert the above text, `"are you ready to co ordinate with us "`, into `"ready co ordinate us"`, assuming `are`, `you`, `to`, and `with` are all in the stop-word list. Don't forget to remove spaces at the end.
<!--
BEGIN QUESTION
name: q3
points: 5
-->

In [76]:
# Read the provided stop words (whitespace-separated) into a list
with open("stopwords.txt", "r") as file:
    stop_words = file.read().split()

def remove_stop_words(text, stopwords_list):
    # split() with no argument also discards the empty tokens left by repeated spaces
    kept_words = [w for w in text.split() if w not in stopwords_list]
    # join() leaves no leading or trailing spaces
    return " ".join(kept_words)


In [77]:
grader.check("q3")
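
The three normalization steps are meant to be applied in order: punctuation, then case, then stop words. As a rough sketch of how they chain together, the block below folds them into one hypothetical helper called `normalize`. The name `TOY_STOP_WORDS` and its contents are assumptions made for illustration; the assignment itself reads the real list from `stopwords.txt`.

```python
import re

# Hypothetical stop-word list, for illustration only.
TOY_STOP_WORDS = {"are", "you", "to", "with", "we", "for", "a"}

def normalize(text, stop_words):
    text = re.sub(r"[^\w\s]", " ", text)   # 1) punctuation -> space
    text = text.lower()                     # 2) lower-case
    # 3) drop stop words; split() ignores repeated and trailing spaces
    kept = [w for w in text.split() if w not in stop_words]
    return " ".join(kept)

print(normalize("Are you ready to Co-ordinate with US?", TOY_STOP_WORDS))
print(normalize("We are ready for a race.", TOY_STOP_WORDS))
```

The order matters: lower-casing before the stop-word lookup is what lets `"Are"` match the lower-case entry `"are"`.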

### 2) Bag of Words Representation

**Question 4:** You are given a function `create_BOW_vocab` that takes training data and a list of stop words, and returns a Bag Of Words in the form of a `list`. Your task is to normalize every text using the three functions defined above, in the given order, and use the uniqueness property of a `set` to store all unique words. (3)

**Example:** Let's say this is our training data:

`[ "Are you ready to Co-ordinate with US?",
   "We are ready for a race."              ]`

The function will first normalize the text and then return all unique words as `['co', 'ordinate', 'race', 'ready', 'us']`. The Bag of Words does not contain any duplicate words.

<!--
BEGIN QUESTION
name: q4
points: 3
-->

In [78]:
def create_BOW_vocab(train_X, stopwords_list):
    bow = set()  # a set keeps only unique words
    for text in train_X:
        mystring = remove_punctuation_marks(text)
        mystring = to_lower_case(mystring)
        mystring = remove_stop_words(mystring, stopwords_list)
        # split() without arguments skips empty tokens caused by repeated spaces
        bow.update(mystring.split())
    return sorted(bow)


In [79]:
grader.check("q4")

**Question 5:** You are given a function `BOW_feature_vectors` that takes training data and a Bag of Words vocabulary list, and returns the feature vectors of the training data in the form of `numpy arrays`. Your job is to normalize every `text` using the three functions we built and then convert it into a vector based on its features. You may need to revise the BOW lecture for this. (3)

**Example:** Let's say this is our training data:

`[ "Are you ready to Co-ordinate with US?",
   "We are ready for a race."              ]`

and Bag of Words Vocab is `['co', 'ordinate', 'race', 'ready', 'us']`

The function will first normalize the text and return the following feature vectors:

`[  [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0]  ]`

<!--
BEGIN QUESTION
name: q5
points: 3
-->

In [87]:
def BOW_feature_vectors(train_X, BOW):
    feature_vectors = []
    for text in train_X:
        mystring = remove_punctuation_marks(text)
        mystring = to_lower_case(mystring)
        # Relies on the global stop_words list loaded from stopwords.txt
        mystring = remove_stop_words(mystring, stop_words)
        words = set(mystring.split())
        # 1 if the vocabulary word occurs in the text, 0 otherwise
        feature_vectors.append([1 if w in words else 0 for w in BOW])
    return np.array(feature_vectors)


array([[1, 1, 0, 1, 1],
       [0, 0, 1, 1, 0]])

In [88]:
grader.check("q5")
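
The example vectors above record only whether a vocabulary word is present (0 or 1). A count-based variant gives the same result unless a word repeats within one text. The sketch below contrasts the two on a made-up normalized text; the token string is an assumption chosen to contain a repeated word.

```python
from collections import Counter

vocab = ["co", "ordinate", "race", "ready", "us"]   # vocabulary from the example
tokens = "ready ready race".split()                   # hypothetical normalized text

counts = Counter(tokens)

# Presence encoding: 1 if the word occurs at all, 0 otherwise
presence_vector = [1 if w in counts else 0 for w in vocab]

# Count encoding: how many times each vocabulary word occurs
count_vector = [counts[w] for w in vocab]

print(presence_vector)  # [0, 0, 1, 1, 0]
print(count_vector)     # [0, 0, 1, 2, 0]
```

This distinction is worth keeping in mind when comparing against scikit-learn: `CountVectorizer` stores counts by default and produces presence values only when `binary=True` is passed.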

### 3) N-Gram 

**Question 6:** You are given a function `create_NGram_vocab` that takes training data, a parameter `n`, and a list of stop words, and returns an N-Gram vocab in the form of a `list`. Your task is to normalize every text using the three functions defined above, in the given order, and use the uniqueness property of a `set` to store all unique n-grams. (3)

**Example:** Let's say this is our training data:

`[ "Are you ready to Co-ordinate with US?",
   "We are ready for a race."               ]`

After normalization the text will look like this:

`[ "ready co ordinate us",
   "ready race"            ]`

Now, for `n = 2`, the function will return all unique N-Grams as `['co ordinate', 'ordinate us', 'ready co', 'ready race']`. The N-Gram vocab does not contain any duplicate n-grams.


<!--
BEGIN QUESTION
name: q6
points: 3
-->

In [129]:
def create_NGram_vocab(train_X, n, stopwords_list):
    ngram = set()  # a set keeps only unique n-grams
    for text in train_X:
        mystring = remove_punctuation_marks(text)
        mystring = to_lower_case(mystring)
        # Use the stop-word list passed in, not the global one
        mystring = remove_stop_words(mystring, stopwords_list)
        strarr = mystring.split()
        # Slide a window of n consecutive words over the text
        for x in range(len(strarr) - n + 1):
            ngram.add(" ".join(strarr[x:x + n]))
    return sorted(ngram)


In [130]:
grader.check("q6")
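
The core of the N-Gram vocabulary is a sliding window of `n` consecutive words. The sketch below isolates that step in a hypothetical helper named `ngrams` and runs it on the normalized example text for several values of `n`, including one larger than the text.

```python
def ngrams(tokens, n):
    # A text of k tokens has k - n + 1 windows of width n (none if n > k).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "ready co ordinate us".split()

print(ngrams(tokens, 1))  # unigrams: the same words a Bag of Words would use
print(ngrams(tokens, 2))  # bigrams, as in the example above
print(ngrams(tokens, 3))  # trigrams
print(ngrams(tokens, 5))  # n longer than the text gives an empty list
```

With `n = 1` the N-Gram model reduces to Bag of Words, so the two representations differ only for `n >= 2`.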

**Question 7:** You are given a function `NGram_feature_vectors` that takes training data and an N-Gram vocabulary/features list, and returns the feature vectors of the training data in the form of `numpy arrays`. Your job is to normalize every `text` using the three functions we built and then convert it into a vector based on its features. You may need to revise the N-Gram lecture for this. (3)

**Example:** Let's say this is our training data:

`[ "Are you ready to Co-ordinate with US?",
   "We are ready for a race."              ]`

and the N-Gram vocab is `['co ordinate', 'ordinate us', 'ready co', 'ready race']`

The function will first normalize the text and return the following feature vectors:

`[  [1, 1, 1, 0],
    [0, 0, 0, 1]  ]`

<!--
BEGIN QUESTION
name: q7
points: 3
-->

In [158]:
def NGram_feature_vectors(train_X, n, NGram):
    feature_vectors = []
    for text in train_X:
        mystring = remove_punctuation_marks(text)
        mystring = to_lower_case(mystring)
        # Relies on the global stop_words list loaded from stopwords.txt
        mystring = remove_stop_words(mystring, stop_words)
        strarr = mystring.split()
        # Collect the n-grams that occur in this text
        ngram = {" ".join(strarr[x:x + n]) for x in range(len(strarr) - n + 1)}
        # 1 if the vocabulary n-gram occurs in the text, 0 otherwise
        feature_vectors.append([1 if w in ngram else 0 for w in NGram])
    return np.array(feature_vectors)

In [159]:
grader.check("q7")

***Congratulations!*** You have successfully implemented BOW and N-Gram models by yourself. Now it is time to introduce you to your new best friend, the [Scikit-Learn CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), which you can use to carry out the feature extraction. It has a built-in option for n-gram vectorization as well. Do check out the documentation to figure out how to use simple Bag of Words vectorization.

In [131]:
from sklearn.feature_extraction.text import CountVectorizer

### 4) BOW using Scikit-learn CountVectorizer

<!-- BEGIN QUESTION -->

**Question 8:** Read the documentation for CountVectorizer and extract the BOW features. (4)

<!--
BEGIN QUESTION
name: q8
points: 4
manual: true
-->

In [141]:
BOW_vectorizer = CountVectorizer(stop_words=stop_words, analyzer="word")  # Choose the right arguments for CountVectorizer
BOW_features = BOW_vectorizer.fit_transform(train_X)

<!-- END QUESTION -->



### 5) N-Gram using Scikit-learn CountVectorizer

<!-- BEGIN QUESTION -->

**Question 9:** Read the documentation for CountVectorizer and extract the N-Gram features. (4)

<!--
BEGIN QUESTION
name: q9
points: 4
manual: true
-->

In [142]:
NGram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2), stop_words=stop_words)  # Choose the right arguments for CountVectorizer
NGram_features = NGram_vectorizer.fit_transform(train_X)

<!-- END QUESTION -->



## Logistic Regression

In this section you will use the [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV) class from scikit-learn (this one has built-in cross-validation as well). This part is open-ended and meant for you to explore how changing hyperparameters affects the result. The coding here is simple; your main job is to read the documentation. *Data Science is about finding the right libraries to do the job, and again, the coding is simple, your job is to find the right functions.* The code here is just 2 lines.

In [143]:
from sklearn.linear_model import LogisticRegressionCV

In [144]:
test_data = pd.read_csv("test.csv")
print("Shape:" , test_data.shape)
test_data.head()

Shape: (5342, 2)


| | headlines | is_sarcastic |
|---|---|---|
| 0 | top official resigns from trump epa with scath... | 0 |
| 1 | lea delaria gets candid about her wild tour da... | 0 |
| 2 | who declares sierra leone free of ebola | 0 |
| 3 | top of mt. everest pulling away majority of ho... | 1 |
| 4 | sonoma sheriff battles with ice over misinform... | 0 |


In [145]:
test_X = test_data["headlines"] # Test Data
test_Y = test_data["is_sarcastic"] # True Labels

### 1) Using your BOW Vectorizer

<!-- BEGIN QUESTION -->

**Question 10:** Read the documentation for Logistic Regression and train a BOW classifier. (5)

<!--
BEGIN QUESTION
name: q10
points: 5
manual: true
-->

In [173]:
BOW_classifier = LogisticRegressionCV(max_iter = 500, cv=5, random_state=0).fit(BOW_features, train_Y)
X_features = BOW_vectorizer.transform(test_X)
bow_array = BOW_classifier.predict(X_features)

<!-- END QUESTION -->



### 2) Using your N-Gram Vectorizer

<!-- BEGIN QUESTION -->

**Question 11:** Read the documentation for Logistic Regression and train an N-Gram classifier. (5)

<!--
BEGIN QUESTION
name: q11
points: 5
manual: true
-->

In [172]:
NGram_classifier = LogisticRegressionCV(max_iter = 500, cv=5, random_state=0).fit(NGram_features, train_Y)
X_features = NGram_vectorizer.transform(test_X)
ngram_array = NGram_classifier.predict(X_features)

<!-- END QUESTION -->



## Evaluation

Use scikit-learn's [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) and [confusion_matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py) to get the **accuracies** of your classifiers and to **plot confusion matrices** to see how good your models are. You can use the functions provided in the documentation. I'll repeat: *Data Science is about finding the right libraries to do the job*, and again, the coding is simple; your job is to find the right functions.

In [171]:
# Evaluate both models to see which one performed better
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# Both functions take the true labels first and the predictions second
bow_score = accuracy_score(test_Y, bow_array)
ngram_score = accuracy_score(test_Y, ngram_array)
print("Accuracy Score for BOW is")
print(bow_score)
print("\nAccuracy Score for ngram is")
print(ngram_score)
# Rows are true labels, columns are predicted labels
print("\nconfusion matrix for BOW is")
display(confusion_matrix(test_Y, bow_array))
print("confusion matrix for Ngram is")
display(confusion_matrix(test_Y, ngram_array))

Accuracy Score for BOW is
0.8019468363908648

Accuracy Score for ngram is
0.6664170722575814

confusion matrix for BOW is


array([[2554,  443],
       [ 615, 1730]], dtype=int64)

confusion matrix for Ngram is


array([[2831,  166],
       [1616,  729]], dtype=int64)

<!-- BEGIN QUESTION -->

**Question 12:** Which model was better? Why? Answer in terms of accuracy score and confusion matrix values. (5 (implementation) + 5 (reasoning))

<!--
BEGIN QUESTION
name: q12
points: 10
manual: true
-->

**Answer:** The BOW model performs better. Its accuracy is ~80%, compared with ~67% for the N-Gram model. The confusion matrices show why: BOW misclassifies 1,058 headlines (615 false negatives and 443 false positives), while the N-Gram model misclassifies 1,782 (1,616 false negatives and 166 false positives). The N-Gram model is therefore biased toward the negative class: it labels most sarcastic headlines as not sarcastic. BOW makes fewer wrong predictions overall and its errors are more balanced, so it is the better model here.
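
The figures in this comparison can be checked with a few lines of arithmetic. The sketch below takes sarcastic (label 1) as the positive class and plugs in the four counts from each confusion matrix; the helper name `summarize` is made up for illustration. Precision and recall are not asked for in the question and are included only as an extra view of the same counts.

```python
def summarize(tn, fp, fn, tp):
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all headlines classified correctly
        "precision": tp / (tp + fp),     # share of predicted-sarcastic that are sarcastic
        "recall": tp / (tp + fn),        # share of sarcastic headlines that were found
    }

# Counts taken from the confusion matrices reported above
bow = summarize(tn=2554, fp=443, fn=615, tp=1730)
ngram = summarize(tn=2831, fp=166, fn=1616, tp=729)

for name, metrics in (("BOW", bow), ("N-Gram", ngram)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The accuracies reproduce the reported scores, and the recall values make the N-Gram model's weakness explicit: it finds far fewer of the sarcastic headlines than BOW does.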

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [160]:
grader.check_all()