# [Getting Started with NLP](https://dphi.tech/bootcamps/getting-started-with-natural-language-processing?utm_source=header)
by [CSpanias](https://cspanias.github.io/aboutme/), 28/01 - 06/02/2022 <br>

Bootcamp organized by **[DPhi](https://dphi.tech/community/)**, lectures given by [**Dipanjan (DJ) Sarkar**](https://www.linkedin.com/in/dipanzan/) ([GitHub repo](https://github.com/dipanjanS/nlp_essentials)) <br>

This notebook constitutes my **personal submission** to the final assignment of the Bootcamp.

# CONTENT
1. [Problem Overview](#ProblemOverview)
2. [Import & Check Dataset](#Data)
    1. [Missing Values](#nans)
    1. [Duplicated Rows](#duplicates)
    1. [Balance](#balance)
3. [NLP Pipeline](#Pipeline)
    1. [Text Pre-Processing](#TextPre)
    1. [Splitting Dataset](#SplitData)
    1. [Basic NLP Count-Based Features](#NLPCB)
    1. [Sentiment Analysis](#sentana)
    1. [Bag of Words](#BoW)
    1. [Logistic Regression](#LogReg)
4. [Conclusion](#conclusion)
5. [Submission](#submission)

<a name="ProblemOverview"></a>
# 1. Problem Overview

> In this challenge, you will work on a dataset that contains **news headlines** - which are aimed to be **written in a sarcastic manner** by the news author. Our job here is to build our NLP models and **predict whether the headline is sarcastic or not**.

This is a **binary classification problem**, as each news headline needs to be **classified into one of 2 categories**:
1. Sarcastic (1)
2. Not Sarcastic (0)

_More info about different Classification types [here](https://machinelearningmastery.com/types-of-classification-in-machine-learning/#:~:text=In%20machine%20learning%2C%20classification%20refers,one%20of%20the%20known%20characters.)._

<a name="Data"></a>
# 2. Import & Check Dataset

In [1]:
# import required libraries
import pandas as pd # import dataset, create and manipulate dataframes
import numpy as np # vectorize functions and perform calculations
import contractions # expand contractions
import re # regular expressions
import string # count-based features
import seaborn as sns # visualization
import matplotlib.pyplot as plt # visualization

from nltk.tokenize import word_tokenize # tokenize string or sentences
from nltk.corpus import stopwords # import english stopword list
from nltk.stem import PorterStemmer # stemming
from sklearn.linear_model import LogisticRegression # our algorithm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # count-based language models
from sklearn.metrics import confusion_matrix, classification_report, make_scorer # model evaluation metrics
from sklearn.metrics import accuracy_score, f1_score # model evaluation metrics
from sklearn.model_selection import train_test_split, GridSearchCV # split & evaluate dataset, hyperparameter optimization
from sklearn.model_selection import KFold # cross-validation
from collections import Counter # count-based calculations
from textblob import TextBlob # sentiment analysis
from wordcloud import WordCloud # visualization

pd.options.mode.chained_assignment = None  # suppress pandas SettingWithCopyWarning

In [2]:
# load dataset as dataframe
df_train = pd.read_csv('https://github.com/CSpanias/nlp_resources/blob/main/dphi_nlp_bootcamp/final_assigment/Train_Dataset.csv?raw=true')
df_test = pd.read_csv('https://github.com/CSpanias/nlp_resources/blob/main/dphi_nlp_bootcamp/final_assigment/Test_Dataset.csv?raw=true')

In [3]:
# drop duplicated rows
df_train.drop_duplicates(keep='first', inplace=True)
df_test.drop_duplicates(keep='first', inplace=True)
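
The checks this section is named after (missing values, duplicated rows, class balance) can be sketched as follows. This is a minimal illustration on a made-up `toy` DataFrame, not the bootcamp data; only the column names `headline` and `is_sarcastic` are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for df_train (hypothetical values, real column names)
toy = pd.DataFrame({
    "headline": ["a", "b", "b", None, "c"],
    "is_sarcastic": [1, 0, 0, 1, 0],
})

# Missing values per column
missing = toy.isna().sum()

# Fully duplicated rows (every occurrence after the first)
n_duplicates = int(toy.duplicated(keep="first").sum())

# Class balance as proportions of each label
balance = toy["is_sarcastic"].value_counts(normalize=True)

print(missing, n_duplicates, balance, sep="\n")
```

Running the same three calls on `df_train` before and after `drop_duplicates` shows how many rows were removed and whether the two classes are roughly balanced.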

<a name="Pipeline"></a>
# 3. NLP Pipeline

The steps below will form our **NLP pipeline** for building our NLP models:
1. [Text Pre-Processing](#TextPre)
1. [Train & Test Datasets](#SplitData)
1. [Basic NLP Count-Based Features](#NLPCB)
1. [Sentiment Analysis](#sentana)
1. [Bag of Words](#BoW)
1. [Logistic Regression](#LogReg)

<a name="TextPre"></a>
## 3.1 Text Pre-Processing

Normally, our $1^{st}$ step would be to perform some **basic text pre-processing** like:
* remove stopwords
* remove punctuation
* lower case characters
* strip whitespace
* expand contractions

In this case, **stopwords**, **punctuation**, and **character casing** could provide information regarding the **tone of the headline**, thus we will keep them as they are when computing the count-based features.

In [4]:
# load stopwords default nltk list
stop_words = stopwords.words('english')

def normalize_document(doc):
    """Normalize the document by performing basic text pre-processing tasks."""

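    # remove special characters (flags must be passed by keyword;
    # the 4th positional argument of re.sub is `count`)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # remove leading and trailing whitespace
    nowhite = doc.strip()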
    # expand contractions
    expanded = contractions.fix(nowhite)
    # tokenize document
    tokens = word_tokenize(expanded)
    # remove stopwords
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from tokens
    doc = ' '.join(filtered_tokens)
    
    return doc

# vectorize function so it can be applied to a whole corpus
# (np.vectorize is a convenience loop, not a performance optimization)
normalize_corpus = np.vectorize(normalize_document)

In [5]:
# normalize 'headline'
norm_corpus_train = normalize_corpus(list(df_train['headline']))
norm_corpus_test = normalize_corpus(list(df_test['headline']))
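
A detail worth keeping in mind when calling `re.sub`: its fourth positional parameter is `count`, not `flags`. The sketch below shows what happens when a flag value ends up being used as `count`; the 300-character input is an artificial assumption chosen so that the limit becomes visible.

```python
import re

pattern = r"[^a-zA-Z0-9\s]"
noisy = "!" * 300  # deliberately long run of special characters

flags = re.I | re.A

# What a positional fourth argument amounts to: the flag value used as `count`
limited = re.sub(pattern, "", noisy, count=int(flags))

# Flags passed by keyword: every match is removed
cleaned = re.sub(pattern, "", noisy, flags=flags)

print(int(flags), len(limited), len(cleaned))  # 258 42 0
```

For short headlines the two calls give the same result, because a headline rarely contains more than 258 special characters, but the keyword form is the one that expresses the intent.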

<a name="SplitData"></a>
## 3.2 Splitting Dataset

In [6]:
# assign feature & target variables
X = df_train.drop(['is_sarcastic'], axis = 1)
y = df_train['is_sarcastic']

# split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
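
An optional variation on the split above is to pass `stratify`, so that both parts keep the same share of sarcastic headlines. This is a sketch on a made-up 100-row frame (40 sarcastic, 60 not), not a change to the notebook's own split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40% sarcastic, 60% not sarcastic
toy = pd.DataFrame({
    "headline": [f"headline {i}" for i in range(100)],
    "is_sarcastic": [1] * 40 + [0] * 60,
})
X_toy = toy.drop(columns=["is_sarcastic"])
y_toy = toy["is_sarcastic"]

# stratify=y_toy keeps the class proportions identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```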

<a name="NLPCB"></a>
## 3.3 Basic NLP Count-based Features

In [7]:
# calculate total number of characters
X_train['char_count'] = X_train['headline'].apply(len)
# calculate total number of words
X_train['word_count'] = X_train['headline'].apply(lambda x: len(x.split()))
# calculate average word density
X_train['word_density'] = X_train['char_count'] / (X_train['word_count']+1)
# calculate total number of punctuation marks
X_train['punctuation_count'] = X_train['headline'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation)))
# calculate total number of title-cased words
X_train['title_word_count'] = X_train['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
# calculate total number of upper-cased words
X_train['upper_case_word_count'] = X_train['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

# calculate total number of characters
X_test['char_count'] = X_test['headline'].apply(len)
# calculate total number of words
X_test['word_count'] = X_test['headline'].apply(lambda x: len(x.split()))
# calculate average word density
X_test['word_density'] = X_test['char_count'] / (X_test['word_count']+1)
# calculate total number of punctuation marks
X_test['punctuation_count'] = X_test['headline'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
# calculate total number of title-cased words
X_test['title_word_count'] = X_test['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
# calculate total number of upper-cased words
X_test['upper_case_word_count'] = X_test['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

# calculate total number of characters
df_test['char_count'] = df_test['headline'].apply(len)
# calculate total number of words
df_test['word_count'] = df_test['headline'].apply(lambda x: len(x.split()))
# calculate average word density
df_test['word_density'] = df_test['char_count'] / (df_test['word_count']+1)
# calculate total number of punctuation marks
df_test['punctuation_count'] = df_test['headline'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
# calculate total number of title-cased words
df_test['title_word_count'] = df_test['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
# calculate total number of upper-cased words
df_test['upper_case_word_count'] = df_test['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

In [8]:
# remove the title-case and upper-case word counts from all three frames
X_train.drop(columns=['title_word_count', 'upper_case_word_count'], inplace=True)
X_test.drop(columns=['title_word_count', 'upper_case_word_count'], inplace=True)
df_test.drop(columns=['title_word_count', 'upper_case_word_count'], inplace=True)
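
Since the same features are computed for three different frames, the repeated blocks could be folded into one helper. The function below is a sketch: `add_count_features` is a hypothetical name, and it only builds the four features that survive the column removal above.

```python
import string
import pandas as pd

def add_count_features(frame, text_col="headline"):
    """Return a copy of `frame` with count-based features for `text_col`."""
    out = frame.copy()
    text = out[text_col]
    # total number of characters
    out["char_count"] = text.apply(len)
    # total number of whitespace-separated words
    out["word_count"] = text.apply(lambda x: len(x.split()))
    # average word density, with +1 to avoid division by zero
    out["word_density"] = out["char_count"] / (out["word_count"] + 1)
    # total number of punctuation marks
    out["punctuation_count"] = text.apply(
        lambda x: sum(ch in string.punctuation for ch in x)
    )
    return out

demo = add_count_features(pd.DataFrame({"headline": ["Oh, great. Another Monday!"]}))
print(demo.iloc[0].to_dict())
```

Calling one function for `X_train`, `X_test` and `df_test` also removes the risk of mixing up frames when copying lines between blocks.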

<a name="sentana"></a>
## 3.4 Sentiment Analysis 

In [9]:
# calculate headline's sentiment
x_train_snt_obj = X_train['headline'].apply(lambda row: TextBlob(row).sentiment)
# create a column for polarity scores
X_train['Polarity'] = [obj.polarity for obj in x_train_snt_obj.values]
# create a column for subjectivity scores
X_train['Subjectivity'] = [obj.subjectivity for obj in x_train_snt_obj.values]

# calculate headline's sentiment
x_test_snt_obj = X_test['headline'].apply(lambda row: TextBlob(row).sentiment)
# create a column for polarity scores
X_test['Polarity'] = [obj.polarity for obj in x_test_snt_obj.values]
# create a column for subjectivity scores
X_test['Subjectivity'] = [obj.subjectivity for obj in x_test_snt_obj.values]

# calculate headline's sentiment
df_test_snt_obj = df_test['headline'].apply(lambda row: TextBlob(row).sentiment)
# create a column for polarity scores
df_test['Polarity'] = [obj.polarity for obj in df_test_snt_obj.values]
# create a column for subjectivity scores
df_test['Subjectivity'] = [obj.subjectivity for obj in df_test_snt_obj.values]

<a name="BoW"></a>
## 3.5 Bag of Words

In [10]:
# load stopwords default nltk list
stop_words = stopwords.words('english')

# load up a simple porter stemmer - nothing fancy
ps = PorterStemmer()

def simple_text_preprocessor(document):
    """Perform basic text pre-processing tasks."""
    
    # lower case
    document = str(document).lower()
    
    # expand contractions
    document = contractions.fix(document)
    
    # remove unnecessary characters
    document = re.sub(r'[^a-zA-Z]',r' ', document)
    document = re.sub(r'nbsp', r'', document)
    document = re.sub(' +', ' ', document)
    
    # simple porter stemming
    document = ' '.join([ps.stem(word) for word in document.split()])
    
    # stopwords removal
    document = ' '.join([word for word in document.split() if word not in stop_words])
    
    return document

# vectorize function
stp = np.vectorize(simple_text_preprocessor)

In [11]:
# create a new column with cleaned text
X_train['Clean Headline'] = stp(X_train['headline'].values)
X_test['Clean Headline'] = stp(X_test['headline'].values)
df_test['Clean Headline'] = stp(df_test['headline'].values)

In [12]:
# keep only the numeric metadata features by dropping the 2 text columns
X_train_metadata = X_train.drop(['headline', 'Clean Headline'], axis=1).reset_index(drop=True)
X_test_metadata = X_test.drop(['headline', 'Clean Headline'], axis=1).reset_index(drop=True)
df_test_metadata = df_test.drop(['headline', 'Clean Headline'], axis=1).reset_index(drop=True)

In [13]:
# # instantiate vectorizer
# cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))

# # fit vectorizer to 'Clean Headline' and convert it to numpy array
# X_traincv = cv.fit_transform(X_train['Clean Headline']).toarray()
# # create a pandas DataFrame (get_feature_names() was removed in scikit-learn 1.2)
# X_traincv = pd.DataFrame(X_traincv, columns=cv.get_feature_names_out())

# # use vectorizer to transform 'Clean Headline' and convert it to numpy array
# X_testcv = cv.transform(X_test['Clean Headline']).toarray()
# # create a pandas DataFrame
# X_testcv = pd.DataFrame(X_testcv, columns=cv.get_feature_names_out())

# # check first 5 rows
# X_traincv.head()

In [14]:
# # instantiate vectorizer
# cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))

# # fit vectorizer to 'Clean Headline' and convert it to numpy array
# X_traincv = cv.fit_transform(X_train['Clean Headline']).toarray()
# # create a pandas DataFrame (get_feature_names() was removed in scikit-learn 1.2)
# X_traincv = pd.DataFrame(X_traincv, columns=cv.get_feature_names_out())

# # use vectorizer to transform 'Clean Headline' and convert it to numpy array
# df_testcv = cv.transform(df_test['Clean Headline']).toarray()
# # create a pandas DataFrame
# df_testcv = pd.DataFrame(df_testcv, columns=cv.get_feature_names_out())

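The next step would be to **concatenate the 2 separate DataFrames** (metadata features and Bag of Words counts) **into a single DataFrame**. The Bag of Words cells are commented out in this submission, so the model in the next section is trained on the **metadata features only**.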

In [15]:
# # concatenate the 2 dataframes
# X_train_comb = pd.concat([X_train_metadata, X_traincv], axis=1)
# df_test_comb = pd.concat([df_test_metadata, df_testcv], axis=1)
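
Converting the Bag of Words matrix with `.toarray()` can need a lot of memory on a full corpus. One alternative, sketched here on made-up stemmed headlines and made-up metadata values, is to keep everything sparse and combine the two feature sets with `scipy.sparse.hstack`; this is an assumption about how the step could be done, not the notebook's original approach.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned headlines and two metadata columns per row
train_text = ["area man win lotteri", "scientist discov new planet"]
test_text = ["area man discov planet"]
train_meta = np.array([[20.0, 4.0], [27.0, 4.0]])
test_meta = np.array([[22.0, 4.0]])

# Fit the vocabulary on the training text only, then reuse it for the test text
cv = CountVectorizer(ngram_range=(1, 1))
train_bow = cv.fit_transform(train_text)
test_bow = cv.transform(test_text)

# Stack metadata and word counts column-wise without densifying
train_comb = hstack([csr_matrix(train_meta), train_bow]).tocsr()
test_comb = hstack([csr_matrix(test_meta), test_bow]).tocsr()

print(train_comb.shape, test_comb.shape)
```

`LogisticRegression` with `solver='liblinear'` accepts sparse input, so matrices built this way could be passed to `fit` and `predict` directly.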

<a name="LogReg"></a>
## 3.6 Logistic Regression

In [16]:
# instantiate log reg
lr = LogisticRegression(C=1, random_state=42, solver='liblinear')
# train logreg
lr.fit(X_train_metadata, y_train)
# predict on the held-out split (X_test), which is evaluated below
target = lr.predict(X_test_metadata)

In [17]:
# print classification report
print(classification_report(y_test, target))

              precision    recall  f1-score   support

           0       0.62      0.72      0.67      4711
           1       0.61      0.49      0.54      4142

    accuracy                           0.61      8853
   macro avg       0.61      0.61      0.60      8853
weighted avg       0.61      0.61      0.61      8853
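
To put the 0.61 accuracy in context, it helps to compare it with a majority-class baseline. The short calculation below uses only the `support` counts printed in the report above.

```python
# Support counts copied from the classification report
support = {0: 4711, 1: 4142}

total = sum(support.values())
# The label a "always predict the most frequent class" model would output
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting "not sarcastic" would already reach about 0.53 accuracy, so the metadata-only model improves on the baseline by roughly 8 percentage points.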



<a name="submission"></a>
# 5. Submission

You can read [this](https://discuss.dphi.tech/t/how-to-submit-predictions-in-datathons-data-sprints-on-dphi/548) post which includes **details regarding the submission process**. 

In [18]:
# predict on the unseen test data (df_test), not on the validation split;
# this assumes df_test_metadata has the same feature columns as X_train_metadata
predictions = lr.predict(df_test_metadata)
res = pd.DataFrame(predictions)

# align the index with df_test; it's important for comparison
res.index = df_test.index
res.columns = ["prediction"]

# the csv file will be saved locally in the same location as this notebook
res.to_csv("prediction_results.csv", index = False)
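
Because the submission frame borrows its index from `df_test`, the predictions must come from the unseen test features rather than from the validation split. A small guard can make a mismatch obvious before anything is written to disk; `build_submission` below is a hypothetical helper, shown as a sketch.

```python
import pandas as pd

def build_submission(predictions, index):
    """Return a one-column DataFrame of predictions aligned to `index`."""
    if len(predictions) != len(index):
        raise ValueError(
            f"got {len(predictions)} predictions for {len(index)} test rows"
        )
    return pd.DataFrame({"prediction": list(predictions)}, index=index)

# Toy usage with three made-up predictions
demo = build_submission([1, 0, 1], pd.RangeIndex(3))
print(demo)
```

In the notebook this could be called with the model's predictions for `df_test` and `df_test.index`, followed by `to_csv("prediction_results.csv", index=False)`.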