# Suggestion Detection using Feature Engineering

---

This project involves tasks for feature engineering, training and evaluating a classifier for suggestion detection. We will work with the data from SemEval-2019 Task 9 subtask A to classify whether a piece of text contains a suggestion or not. 


Both `train.csv` and `test_seen.csv` contain a header row and three columns of data; `train.csv` has 5,440 data rows and `test_seen.csv` has 1,360. Each row contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the text contains a suggestion, while a label of $0$ indicates that it does not.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums](https://aclanthology.org/S19-2151/).

We will use `test_seen.csv` to benchmark our model, which is why it includes labels. `test_unseen.csv`, on the other hand, has no labels and is used for the [Kaggle](https://www.kaggle.com/competitions/nlp2022ct5120suggestionmining/overview) competition.


In [1]:
!curl https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/train.csv > train.csv
!curl https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_seen.csv > test.csv
!curl https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_unseen.csv > test_unseen.csv



In [2]:
import numpy as np
import pandas as pd

# Read the CSV file.
train_df = pd.read_csv('train.csv', 
                 names=['id', 'text', 'label'], header=0)

test_df = pd.read_csv('test.csv', 
                 names=['id', 'text', 'label'], header=0)

# Store the texts and the labels as two parallel lists, so that
# train_texts[i] is labelled by train_labels[i].
train_texts, train_labels = train_df["text"].to_list(), train_df["label"].to_list() 
test_texts, test_labels = test_df["text"].to_list(), test_df["label"].to_list() 

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1360
assert len(train_texts) == len(train_labels) == 5440

---

## Task 1: Data Pre-processing


---

> For pre-processing I have used the following methods:
1. Tokenization: Each text is split into tokens, i.e. the words of each sentence are separated for further processing.
2. Stemming: Each token is reduced to its stem with the Porter stemmer.
3. Stop-word removal: Stop words, i.e. very common words that add little meaning to a sentence, are removed from each token list.
4. Punctuation removal: All punctuation characters are removed from the tokens.
5. Lower-casing: All tokens are converted to lower case.
6. Detokenization: Finally, each list of tokens is joined back into a single string, giving one list of processed texts per data set.

---
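
As a rough illustration of the steps listed above, here is a minimal, standard-library-only sketch that runs a single sentence through the pipeline. It is not the notebook's implementation: the regular-expression tokenizer and the tiny `STOP_WORDS` set are hypothetical stand-ins for NLTK's `TreebankWordTokenizer` and English stop-word list, stemming is left out, and the sketch lower-cases before removing stop words so that capitalised stop words also match.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "to", "is", "it", "of", "and"}


def preprocess(text):
    """Tokenize, lower-case, drop stop words and punctuation, then detokenize."""
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lower-case first so that capitalised stop words such as "The" also match.
    tokens = [token.lower() for token in tokens]
    # Stop-word removal.
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Punctuation removal; tokens that become empty are dropped.
    tokens = [re.sub(r"[^\w\s]", "", token) for token in tokens]
    tokens = [token for token in tokens if token]
    # Detokenization: join the remaining tokens back into one string.
    return " ".join(tokens)


example = preprocess("Please add a dark mode to the app!")
print(example)
```

Dropping empty tokens after punctuation removal is a choice made in this sketch; the cells below keep them as empty strings.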

In [3]:
print(len(train_texts))
print(len(test_texts))
test_unseen = pd.read_csv("test_unseen.csv", names=['id', 'text'], header=0)
test_unseen_list = list(test_unseen['text'])
print(len(test_unseen_list))

5440
1360
1700


Tokenization

In [4]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Tokenize every text in the training, seen-test and unseen-test sets.
train_tokens = [tokenizer.tokenize(text) for text in train_texts]
test_tokens = [tokenizer.tokenize(text) for text in test_texts]
testun_tokens = [tokenizer.tokenize(text) for text in test_unseen_list]

Stemming

In [6]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# Reduce every token to its Porter stem.
stemmed_train = [[stemmer.stem(word) for word in tokens] for tokens in train_tokens]
stemmed_test = [[stemmer.stem(word) for word in tokens] for tokens in test_tokens]
stemmed_testun = [[stemmer.stem(word) for word in tokens] for tokens in testun_tokens]

Stopwords Removal

In [9]:
from nltk.corpus import stopwords
import nltk
# nltk.download('stopwords')

stops = set(stopwords.words('english'))

# Drop stop words from each token list. The tokens were stemmed in the
# previous step, so a stop word whose stem differs from its surface form
# (e.g. "this" becomes "thi") is no longer matched by this filter.
no_stops_train = [[word for word in tokens if word not in stops] for tokens in stemmed_train]
no_stops_test = [[word for word in tokens if word not in stops] for tokens in stemmed_test]
no_stops_testun = [[word for word in tokens if word not in stops] for tokens in stemmed_testun]

Punctuation Removal

In [10]:
import re

# Remove every character that is neither a word character nor whitespace.
# A token that consists only of punctuation becomes an empty string.
no_puncs_train = [[re.sub(r'[^\w\s]', '', word) for word in tokens] for tokens in no_stops_train]
no_puncs_test = [[re.sub(r'[^\w\s]', '', word) for word in tokens] for tokens in no_stops_test]
no_puncs_testun = [[re.sub(r'[^\w\s]', '', word) for word in tokens] for tokens in no_stops_testun]

String Lower

In [11]:
# Convert every token to lower case.
lower_train = [[word.lower() for word in tokens] for tokens in no_puncs_train]
print(len(lower_train))

lower_test = [[word.lower() for word in tokens] for tokens in no_puncs_test]
print(len(lower_test))

lower_testun = [[word.lower() for word in tokens] for tokens in no_puncs_testun]
print(len(lower_testun))


5440
1360
1700


Detokenization

In [12]:
from nltk.tokenize import TreebankWordDetokenizer

detokenizer = TreebankWordDetokenizer()

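# NOTE: detokenization is applied to the stemmed tokens, so the outputs of the
# stop-word removal, punctuation removal and lower-casing cells above are not
# used by the later tasks. Switching to lower_train / lower_test / lower_testun
# would use them, but would also change the scores reported below.
train_detokens = [detokenizer.detokenize(tokens) for tokens in stemmed_train]
print(len(train_detokens))

test_detokens = [detokenizer.detokenize(tokens) for tokens in stemmed_test]
print(len(test_detokens))

testun_detokens = [detokenizer.detokenize(tokens) for tokens in stemmed_testun]
print(len(testun_detokens))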

5440
1360
1700


---

## Task 2: Feature Engineering (I) - TF-IDF as features

Raw counts of words and `tf-idf` scores can be useful features for a classification task. We will use `tf-idf` scores as features for a Naïve Bayes classifier.

After applying the preprocessing steps, we use the training data to train the classifier and make predictions on the test set.
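
To make the `tf-idf` weighting concrete, the following standard-library sketch computes it by hand for a toy corpus. It assumes the smoothed formula $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$ followed by L2 normalisation, which as far as I know matches the defaults of scikit-learn's `TfidfTransformer`; treat both the formula and the hypothetical `tfidf_vectors` helper as assumptions of this sketch rather than a description of the library internals.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency (assumed formula, see the text above).
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalise so that every non-empty document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors


# Hypothetical toy corpus of already tokenized texts.
docs = [
    ["add", "dark", "mode"],
    ["dark", "mode", "is", "great"],
    ["add", "an", "export", "button"],
]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the second toy document, terms that occur in only one document ("is", "great") receive a higher weight than terms shared with other documents ("dark", "mode"), which is the effect the idf factor is meant to have.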

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import GaussianNB

# Calculate tf-idf scores for the words in the training set.

cv = CountVectorizer()
tfidf = TfidfTransformer()
text_train_counts = cv.fit_transform(train_detokens)
# print(text_train_counts)
text_train_tfidf = tfidf.fit_transform(text_train_counts)

# Train a Naïve Bayes classifier.
# NOTE: the classifier is fitted on the raw counts (text_train_counts) but
# predicts on tf-idf scores below. Fitting on text_train_tfidf.toarray()
# would make the two consistent, but would change the scores reported here.

naive_bayes = GaussianNB()
naive_bayes.fit(text_train_counts.toarray(), train_labels)

# Predict on the test set.
text_test_counts = cv.transform(test_detokens)
text_test_tfidf = tfidf.transform(text_test_counts)

predictions = naive_bayes.predict(text_test_tfidf.toarray())



def accuracy(labels, predictions):
  '''
  Calculate the accuracy score for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.
    
  '''

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1 
  
  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

0.5272058823529412

---

## Task 3: Evaluation Metrics


---

> Accuracy does not cope well with imbalanced data. For example, if the data contains far fewer positive examples than negative ones, a classifier that favours the negative class can reach a high accuracy while missing most of the positives. I have used the F1 score instead, because it is the harmonic mean of precision and recall and therefore reflects how well the positive class is handled.

---
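
The following sketch illustrates the point on made-up data. The labels, the two sets of predictions and the `accuracy_and_f1` helper are all hypothetical and only serve the illustration; they are not taken from the SemEval data.

```python
def accuracy_and_f1(labels, predictions):
    """Return (accuracy, F1 score for the positive class `1`)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)

    acc = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1


# Hypothetical imbalanced data: 90 non-suggestions followed by 10 suggestions.
labels = [0] * 90 + [1] * 10

# A classifier that always predicts "no suggestion".
always_negative = [0] * 100
# A classifier that raises 8 false alarms and finds 6 of the 10 suggestions.
mixed = [1] * 8 + [0] * 82 + [1] * 6 + [0] * 4

print(accuracy_and_f1(labels, always_negative))
print(accuracy_and_f1(labels, mixed))
```

The always-negative classifier has the higher accuracy (0.90 against 0.88) although it never finds a suggestion, whereas its F1 score is 0 against 0.5 for the second classifier.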

In [14]:
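def evaluate(labels, predictions):
    '''
    Calculate precision, recall and F1 score for the positive class (`1`).

    Args:
      labels (list): A list containing gold standard labels annotated as `0` and `1`.
      predictions (list): A list containing predictions annotated as `0` and `1`.

    Returns:
      str: The F1 score formatted to four decimal places.
    '''

    # check that labels and predictions are of same length
    assert len(labels) == len(predictions)

    # Count true positives, false positives and false negatives.
    tp, fp, fn = 0, 0, 0
    for label, prediction in zip(labels, predictions):
        if label == prediction:
            if label == 1:
                tp += 1
        else:
            if label == 1:
                fn += 1
            else:
                fp += 1

    # Fall back to 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    print('Precision:', format(precision, '.4f'), '\t', 'Recall:', format(recall, '.4f'), '\t', 'Score:', format(score, '.4f'))

    return format(score, '.4f')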

# Calculate evaluation score based on the metric of your choice for the classifier trained in Task 2 using tf-idf features.
evaluate(test_labels, predictions)

Precision: 0.2950 	 Recall: 0.6697 	 Score: 0.4096


'0.4096'

---

## Task 4: Feature Engineering (II) - Other features



---

> To improve on the previous results, I have specified `ngram_range` and `max_features` in the count vectorizer. These features don't require any additional pre-processing steps. `ngram_range` defines the sizes of the word n-grams that are extracted from the text. For example, with an `ngram_range` of (1,1) the phrase 'movie day' is converted into 'movie' and 'day', while with a range of (2,2) the extracted term is 'movie day'. `max_features` keeps only the n terms with the highest frequency across the data. For example, if `max_features` is set to 5, only the 5 most frequent terms are kept.

---
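
The effect of the two settings can be imitated with the standard library. The sketch below is only an approximation of what `CountVectorizer` does: `word_ngrams` and `top_terms` are hypothetical helpers, the toy texts are made up, and ties between equally frequent terms may be resolved differently by scikit-learn.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_terms(token_lists, n, max_features):
    """Keep the `max_features` most frequent n-grams across all texts."""
    counts = Counter(gram for tokens in token_lists for gram in word_ngrams(tokens, n))
    return [term for term, _ in counts.most_common(max_features)]


# The example from the text: unigrams versus bigrams of "movie day".
unigrams = word_ngrams(["movie", "day"], 1)
bigrams = word_ngrams(["movie", "day"], 2)
print(unigrams, bigrams)

# Hypothetical toy corpus: keep only the 2 most frequent bigrams.
texts = [
    ["please", "add", "dark", "mode"],
    ["please", "add", "export"],
    ["dark", "mode", "please"],
]
vocabulary = top_terms(texts, n=2, max_features=2)
print(vocabulary)
```

A small `max_features` keeps the feature matrix compact, at the cost of discarding every n-gram outside the most frequent ones.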

In [64]:
# Create your features.
cv = CountVectorizer(ngram_range=(2,2), max_features=65)
tfidf = TfidfTransformer()
text_train_counts = cv.fit_transform(train_detokens)
text_train_tfidf = tfidf.fit_transform(text_train_counts)


# Train a Naïve Bayes classifier using the features defined.
# NOTE: as in Task 2, the classifier is fitted on raw counts but predicts
# on tf-idf scores; the scores below reflect that setup.
naive_bayes = GaussianNB()
naive_bayes.fit(text_train_counts.toarray(), train_labels)


# Evaluate on the test set.
text_test_counts = cv.transform(test_detokens)
text_test_tfidf = tfidf.transform(text_test_counts)

predictions = naive_bayes.predict(text_test_tfidf.toarray())

print(accuracy(test_labels, predictions))
evaluate(test_labels, predictions)

0.7757352941176471
Precision: 0.5598 	 Recall: 0.3934 	 Score: 0.4621


'0.4621'

---

## Task 5: Kaggle Competition

Head over to https://www.kaggle.com/t/1f90b74da0b7484da9647638e22d1068  
Use the classifier above to predict the labels for test_unseen.csv from the competition page and upload the results to the leaderboard. The current baseline score is 0.36823. Try to improve on the baseline. Please note that the evaluation metric for the competition is the F-score.

Read the competition page for more details.



In [65]:
# Preparing submission for Kaggle
kaggle_test_set = "kaggle_test_set"
test_unseen = pd.read_csv("test_unseen.csv", names=['id', 'text'], header=0)


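# feature extraction
# Reuse the vectorizer and tf-idf transformer fitted on the training data in
# Task 4. Re-fitting them on the unseen set would build a different vocabulary,
# so the feature columns would no longer match the ones the classifier was trained on.
text_testun_counts = cv.transform(testun_detokens)
text_testun_tfidf = tfidf.transform(text_testun_counts)


# predictions
predictions = naive_bayes.predict(text_testun_tfidf.toarray())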


# Here Id is unique identifier assigned to each test sample ranging from test_0 till test_1699
# Expected is a list of prediction made by your classifier
sub = {"Id": [f"test_{i}" for i in range(len(predictions))],
       "Expected": predictions}

sub_df = pd.DataFrame(sub)

# Write the submission file; uploading it to the Kaggle competition gives a score on the leaderboard.
sub_df.to_csv(f"{kaggle_test_set}.csv", sep=",", header=True, index=False)

Mention the approach that you have chosen briefly, and what is the mean average f-score that you have achieved? Did it improve above the chosen baseline model (0.36823)? Why or why not?

Edit this cell to write your answer below the line in no more than 500 words.

---
By performing various pre-processing steps I succeeded in achieving an F1 score of 0.4090 in Task 3. In Task 4, I applied the `ngram_range` and `max_features` attributes to the count vectorizer, which increased the F1 score up to 0.5345. I specified the n-gram range as (1,1) and set `max_features` to 18. With these values I obtained the maximum F1 score, 0.5345, for my train and test data.

---