<font size="6">**Headlines dataset - sarcasm detection**</font>

<div style="text-align: right">
<b>Maciej Kleyny</b><br>
m.kleyny@gazeta.pl<br>
</div>

## Data preparation

In [1]:
import json
import nltk
import string
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [2]:
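headers_raw = []
# the file holds one JSON object per line; the context manager closes it after reading
with open('Data/Graduate - HEADLINES dataset (2019-06).json', 'r') as f:
    for line in f:
        headers_raw.append(json.loads(line))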

In [3]:
headers_raw[:5]

[{'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
  'is_sarcastic': 0},
 {'headline': "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
  'is_sarcastic': 0},
 {'headline': "mom starting to fear son's web series closest thing she will have to grandchild",
  'is_sarcastic': 1},
 {'headline': 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
  'is_sarcastic': 1},
 {'headline': 'j.k. rowling wishes snape happy birthday in the most magical way',
  'is_sarcastic': 0}]

In [4]:
# checking that every record has the expected keys (in the expected order)
# and that both counts equal the number of records

i = 0
j = 0

for line in headers_raw:
    
    if list(line.keys())[0] == 'headline':
        i += 1
        
    if list(line.keys())[1] == 'is_sarcastic':
        j += 1
        
print(len(headers_raw), i, j)

if len(headers_raw) == i == j:
    
    print('All keys are proper')
    
else:
    
    print('Keys need to be investigated')

26709 26709 26709
All keys are proper
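
The check above compares keys by position, so it relies on every record listing `headline` before `is_sarcastic`. A minimal order-independent sketch is shown below; `find_bad_records` and `EXPECTED_KEYS` are hypothetical names introduced only for this illustration, and the sample records are made up.

```python
# Hypothetical helper, not part of the original notebook
EXPECTED_KEYS = {"headline", "is_sarcastic"}


def find_bad_records(records, expected_keys=EXPECTED_KEYS):
    """Return the indices of records whose key set differs from expected_keys."""
    return [i for i, record in enumerate(records) if set(record) != set(expected_keys)]


# Made-up sample: the second record has the same keys in a different order,
# the third one is missing its label
sample = [
    {"headline": "a", "is_sarcastic": 0},
    {"is_sarcastic": 1, "headline": "b"},
    {"headline": "c"},
]
bad = find_bad_records(sample)
print(bad)
```

Returning the offending indices, rather than just a count, makes it easier to inspect the records that fail the check.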


In [5]:
def Preprocessing_f(raw_records, test_size=0.15, val_size=0.15):
    # this function takes the raw data as input, transforms it and splits it into
    # train, validation and test data (stored in global variables)
    # note: the parameter is not called `list`, to avoid shadowing the built-in
    
    # extracting headlines and labels from dictionaries
    header = [] 
    y = []

    for line in raw_records:
    
        header.append(line['headline'])
        y.append(line['is_sarcastic'])
        
    # removing redundant symbols (digits, punctuation, leading and trailing spaces) and lowercasing all letters
    header_trimmed = []

    for line in header:
    
        header_trimmed.append(line.translate(str.maketrans('', '', string.digits))\
        .translate(str.maketrans('', '', string.punctuation))\
        .strip()\
        .lower())
    
    # tokenizing every headline, i.e. dividing it into separate words
    header_trimmed = [nltk.word_tokenize(line) for line in header_trimmed]
    
    # creating a set of english stop-words that will be removed from every headline
    # as they're too common to add any value
    stop_words = set(nltk.corpus.stopwords.words("english"))
    
    header_trimmed = [[word for word in line if word not in stop_words] for line in header_trimmed]
    
    # stemming: stripping word endings so that inflected forms map to a common stem
    stemmer = nltk.PorterStemmer()
    header_trimmed = [[stemmer.stem(word) for word in line] for line in header_trimmed]
    
    # putting words back together into a headline
    header_trimmed = [' '.join(line) for line in header_trimmed]
    
#     del list
    
#     list = header_trimmed
    
    # creating train, validation and test subsets
    global X_train
    global y_train
    global X_val
    global y_val
    global X_test
    global y_test
    
    X_train, X_test, y_train, y_test \
    = train_test_split(header_trimmed, y, test_size=test_size, random_state=1)
    
    X_train, X_val, y_train, y_val \
    = train_test_split(X_train, y_train, test_size=val_size/(1-test_size), random_state=2)

    print('This function preprocesses the raw data and splits it into train, validation and test data')

In [6]:
Preprocessing_f(headers_raw)

This function preprocesses the raw data and splits it into train, validation and test data
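
The function above publishes its results through global variables. As an alternative, here is a minimal sketch of the same two-step split that returns the subsets instead; `three_way_split` and its `seed` parameter are hypothetical names, and the toy data is made up. The second split rescales `val_size` because it only sees the `1 - test_size` share of the data that remains after the first split.

```python
from sklearn.model_selection import train_test_split


def three_way_split(X, y, test_size=0.15, val_size=0.15, seed=1):
    """Split X, y into train/validation/test parts and return them.

    test_size and val_size are fractions of the whole data set.
    """
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    # rescale val_size so that it stays a fraction of the whole data set
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_size / (1 - test_size), random_state=seed + 1)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Made-up data, used only to show the resulting subset sizes
texts = [f"headline {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]
(train_X, train_y), (val_X, val_y), (test_X, test_y) = three_way_split(texts, labels)
print(len(train_X), len(val_X), len(test_X))
```

Returning the subsets keeps the function free of side effects, so it can be called again with other split sizes without overwriting earlier results.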


## Model training

In this part I grid-search for the best parameter combination for three models: Logistic Regression, Random Forest and Support Vector Machine. I then compare them by their accuracy on the validation set, choose one and run it on the test set. The final step is to compare it with a 'dummy' model that predicts the most common class for every unseen observation.

### Logistic Regression

In [7]:
pipeline = Pipeline(
                    [('count_vect', CountVectorizer()),
                    ('LogReg', LogisticRegression())
])

params = {}
params['count_vect__ngram_range'] = [(1,1), (1,2), (1,3)]
params['count_vect__max_df'] = [0.5, 0.6, 0.7, 0.8]
params['count_vect__min_df'] = [0.005, 0.01, 0.02, 0.04]
params['LogReg__C'] = [0.1, 0.5, 1]

CV = GridSearchCV(pipeline, params, n_jobs=-1, verbose=1)

CV.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 144 candidates, totalling 432 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   13.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   49.5s
[Parallel(n_jobs=-1)]: Done 432 out of 432 | elapsed:  2.0min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('count_vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                 

In [8]:
CV.best_params_

{'LogReg__C': 0.5,
 'count_vect__max_df': 0.5,
 'count_vect__min_df': 0.005,
 'count_vect__ngram_range': (1, 2)}

In [9]:
# Logistic Regression accuracy on the train set
y_pred_train = CV.predict(X_train)
accuracy_score(y_train, y_pred_train)

0.6831773201390746

In [10]:
# Logistic Regression accuracy on the validation set
y_pred_val = CV.predict(X_val)
accuracy_score(y_val, y_pred_val)

0.6790616421262791

### Random Forest

In [11]:
pipeline2 = Pipeline(
                    [('count_vect', CountVectorizer()),
                    ('Forest', RandomForestClassifier())
])

params2 = {}
params2['count_vect__ngram_range'] = [(1,1), (1,2), (1,3)]
params2['count_vect__max_df'] = [0.5, 0.7]
params2['count_vect__min_df'] = [0.005, 0.02]
params2['Forest__n_estimators'] = [10, 20, 30]
params2['Forest__max_depth'] = [5, 10, 15]

CV2 = GridSearchCV(pipeline2, params2, n_jobs=-1, verbose=1)

CV2.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    9.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   57.4s
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:  1.7min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('count_vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                 

In [12]:
CV2.best_params_

{'Forest__max_depth': 15,
 'Forest__n_estimators': 30,
 'count_vect__max_df': 0.7,
 'count_vect__min_df': 0.005,
 'count_vect__ngram_range': (1, 2)}

In [13]:
# Random Forest accuracy on the train set
y_pred_train2 = CV2.predict(X_train)
accuracy_score(y_train, y_pred_train2)

0.6737095480074886

In [14]:
# Random Forest accuracy on the validation set
y_pred_val2 = CV2.predict(X_val)
accuracy_score(y_val, y_pred_val2)

0.6643374095333167

### Support Vector Machine

In [15]:
pipeline3 = Pipeline(
                    [('count_vect', CountVectorizer()),
                    ('SVM', SVC())
])

params3 = {}
params3['count_vect__ngram_range'] = [(1,1), (1,2), (1,3)]
params3['count_vect__max_df'] = [0.5, 0.7]
params3['count_vect__min_df'] = [0.005, 0.02]
params3['SVM__kernel'] = ['poly', 'rbf']
params3['SVM__degree'] = [2, 3]

CV3 = GridSearchCV(pipeline3, params3, n_jobs=-1, verbose=1)

CV3.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:  5.5min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('count_vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                 

In [16]:
CV3.best_params_

{'SVM__degree': 2,
 'SVM__kernel': 'rbf',
 'count_vect__max_df': 0.5,
 'count_vect__min_df': 0.005,
 'count_vect__ngram_range': (1, 1)}

In [17]:
# Support Vector Machine accuracy on the train set
y_pred_train3 = CV3.predict(X_train)
accuracy_score(y_train, y_pred_train3)

0.6420433270928055

In [18]:
# Support Vector Machine accuracy on the validation set
y_pred_val3 = CV3.predict(X_val)
accuracy_score(y_val, y_pred_val3)

0.6456201647117544

## Final evaluation

In [19]:
print(f'''Accuracy on the validation set:

LogisticRegression accuracy: {accuracy_score(y_val, y_pred_val):.2f}
RandomForestClassifier accuracy: {accuracy_score(y_val, y_pred_val2):.2f}
SupportVectorClassifier accuracy: {accuracy_score(y_val, y_pred_val3):.2f}''')

Accuracy on the validation set:

LogisticRegression accuracy: 0.68
RandomForestClassifier accuracy: 0.66
SupportVectorClassifier accuracy: 0.65


Based on these scores I recommend the Logistic Regression model with the best parameter found by the grid search:
- C (inverse of regularization strength): 0.5

and Count Vectorizer with parameters:
- max_df: 0.5,
- min_df: 0.005,
- ngram_range: (1,2).
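
For reference, the recommended configuration can be written out explicitly rather than read from `CV.best_params_`. This is a sketch only: `build_recommended_pipeline` is a hypothetical helper, and the four toy headlines are made up to show that the pipeline fits and predicts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_recommended_pipeline():
    """Pipeline using the hyperparameters reported by the grid search."""
    return Pipeline([
        ("count_vect", CountVectorizer(ngram_range=(1, 2), max_df=0.5, min_df=0.005)),
        ("LogReg", LogisticRegression(C=0.5)),
    ])


# Tiny made-up corpus; the real notebook would use X_train and y_train
toy_X = ["area man wins", "senate passes bill", "area dog sighs", "court hears case"]
toy_y = [1, 0, 1, 0]
model = build_recommended_pipeline().fit(toy_X, toy_y)
print(model.predict(["area cat naps"]))
```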


Now I run the final check on the test subset, using only the selected model.

In [20]:
# Model accuracy on the test set is...
y_pred_test = CV.predict(X_test)
accuracy_score(y_test, y_pred_test)

0.6823059645620164

In [21]:
# while the train set accuracy was...
accuracy_score(y_train, y_pred_train)

0.6831773201390746

In [23]:
# and the 'dummy' model (assigning the most common training label to every prediction) would give...
# np.array makes the element-wise comparison with the majority label explicit
np.mean(np.array(y_test) == np.mean(y_train).round(0))

0.5672572997254804
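
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels in place of `y_train` and `y_test`; the `toy_` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels; in the notebook these would be y_train and y_test
toy_y_train = [0, 0, 0, 1, 1]
toy_y_test = [0, 1, 0, 0]

# The most_frequent strategy ignores the features, so placeholder columns are enough
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(toy_y_train), 1)), toy_y_train)
baseline_pred = baseline.predict(np.zeros((len(toy_y_test), 1)))
baseline_accuracy = accuracy_score(toy_y_test, baseline_pred)
print(baseline_accuracy)
```

Because the baseline is an estimator, it can be scored with the same `accuracy_score` call as the real models.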

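The selected model beats the majority-class baseline (test accuracy of about 0.68 against 0.57) and doesn't appear to be overfitted, since the train and test accuracies are nearly identical (0.683 and 0.682). The accuracy isn't that great, but a broader grid search might help: the best min_df (0.005) is the smallest value in the grid for all three models, which suggests trying lower thresholds. Dimensionality reduction would also be worth trying.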
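
One way the dimensionality reduction idea could be tried is to place `TruncatedSVD`, which accepts the sparse count matrix directly, between the vectorizer and the classifier. This is an untested suggestion rather than something evaluated in this notebook; the toy corpus is made up, and `n_components=2` only suits that toy data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

svd_pipeline = Pipeline([
    ("count_vect", CountVectorizer(ngram_range=(1, 2))),
    # on real data n_components would be another grid-search parameter
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ("LogReg", LogisticRegression()),
])

# Tiny made-up corpus, used only to show the shapes involved
toy_X = [
    "area man wins lottery",
    "senate passes budget bill",
    "area dog sighs loudly",
    "court hears appeal case",
    "area woman finds sock",
    "city council approves plan",
]
toy_y = [1, 0, 1, 0, 1, 0]
svd_pipeline.fit(toy_X, toy_y)

# Apply the fitted vectorizer and SVD steps to inspect the reduced representation
counts = svd_pipeline.named_steps["count_vect"].transform(toy_X)
reduced = svd_pipeline.named_steps["svd"].transform(counts)
print(counts.shape[1], "->", reduced.shape[1])
```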