# Classification model for detecting sarcasm

**Problem Overview**

In this challenge, we work with a dataset of news headlines, some of which were written in a deliberately sarcastic manner by their authors. Our job is to build NLP models that predict whether a headline is sarcastic.


**About the Data**

Each record of the dataset consists of two attributes:

is_sarcastic: 1 if the record is sarcastic otherwise 0. This is the target variable.  

headline: this is the headline of the news article

In [1]:
#loading libraries
import pandas as pd
import numpy as np

In [2]:
# Loading dataset
train = pd.read_csv('Train_Data.csv')
validation = pd.read_csv('Test_Data.csv')

In [3]:
# getting some basic information about the dataset
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44262 entries, 0 to 44261
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   headline      44262 non-null  object
 1   is_sarcastic  44262 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 691.7+ KB


In [4]:
# Checking how imbalanced the dataset is
train['is_sarcastic'].value_counts()

0    23958
1    20304
Name: is_sarcastic, dtype: int64

# Preprocessing

In [5]:
# Remove rows whose value in `column` is a single blank space (modifies the DataFrame in place)
def remove_blanks(dataset, column):
    dataset.drop(dataset.loc[dataset[column] == ' ', :].index, axis=0, inplace=True)
    return dataset

In [6]:
remove_blanks(train,'headline')
remove_blanks(validation,'headline')

Unnamed: 0,headline
0,area stand-up comedian questions the deal with...
1,dozens of glowing exit signs mercilessly taunt...
2,perfect response to heckler somewhere in prop ...
3,gop prays for ossoff lossoff
4,trevor noah says the scary truth about trump's...
...,...
11061,house conservatives claim democrats have faile...
11062,area man having one of his little bursts of en...
11063,there is nothing libertarian about conservatives
11064,mike pompeo startled after seeing 'beware of h...


In [7]:
# Splitting our dataset into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train[['headline']], train['is_sarcastic'], random_state=42)
X_train.shape, X_test.shape

((33196, 1), (11066, 1))

# Features extraction

In [8]:
import textblob
import string

In [9]:
# Extracting polarity and subjectivity
def textblob_features(X):
    snt_obj = X['headline'].apply(lambda row: textblob.TextBlob(row).sentiment)
    X['Polarity'] = [obj.polarity for obj in snt_obj.values]
    X['Subjectivity'] = [obj.subjectivity for obj in snt_obj.values]
    return X

In [10]:
textblob_features(X_train)
textblob_features(X_test)
textblob_features(validation)

Unnamed: 0,headline,Polarity,Subjectivity
0,area stand-up comedian questions the deal with...,0.000000,0.000000
1,dozens of glowing exit signs mercilessly taunt...,-0.700000,1.000000
2,perfect response to heckler somewhere in prop ...,1.000000,1.000000
3,gop prays for ossoff lossoff,0.000000,0.000000
4,trevor noah says the scary truth about trump's...,0.000000,0.800000
...,...,...,...
11061,house conservatives claim democrats have faile...,-0.333333,0.366667
11062,area man having one of his little bursts of en...,-0.143750,0.450000
11063,there is nothing libertarian about conservatives,0.000000,0.000000
11064,mike pompeo startled after seeing 'beware of h...,0.000000,0.000000


In [11]:
# Extracting features based on characters
def extract_features(X):
    X['char_count'] = X['headline'].apply(len)
    X['word_count'] = X['headline'].apply(lambda x: len(x.split()))
    X['word_density'] = X['char_count'] / (X['word_count']+1)
    X['punctuation_count'] = X['headline'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
    X['upper_case_word_count'] = X['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))
    return X

In [12]:
extract_features(X_train)
extract_features(X_test)
extract_features(validation)

Unnamed: 0,headline,Polarity,Subjectivity,char_count,word_count,word_density,punctuation_count,upper_case_word_count
0,area stand-up comedian questions the deal with...,0.000000,0.000000,65,9,6.500000,2,0
1,dozens of glowing exit signs mercilessly taunt...,-0.700000,1.000000,65,9,6.500000,0,0
2,perfect response to heckler somewhere in prop ...,1.000000,1.000000,62,9,6.200000,1,0
3,gop prays for ossoff lossoff,0.000000,0.000000,28,5,4.666667,0,0
4,trevor noah says the scary truth about trump's...,0.000000,0.800000,65,11,5.416667,1,0
...,...,...,...,...,...,...,...,...
11061,house conservatives claim democrats have faile...,-0.333333,0.366667,65,8,7.222222,0,0
11062,area man having one of his little bursts of en...,-0.143750,0.450000,81,17,4.500000,0,0
11063,there is nothing libertarian about conservatives,0.000000,0.000000,48,6,6.857143,0,0
11064,mike pompeo startled after seeing 'beware of h...,0.000000,0.000000,87,14,5.800000,2,0
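
As a quick sanity check, the character-based features can be recomputed by hand for one short headline from the table above. The sketch below uses only the standard library; `headline_features` is a hypothetical helper name, not part of the notebook, and it simply mirrors the formulas used in `extract_features`.

```python
import string

def headline_features(text):
    """Recompute the character-based features for a single headline."""
    char_count = len(text)
    word_count = len(text.split())
    return {
        "char_count": char_count,
        "word_count": word_count,
        # the +1 in the denominator mirrors extract_features
        "word_density": char_count / (word_count + 1),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "upper_case_word_count": sum(w.isupper() for w in text.split()),
    }

feats = headline_features("gop prays for ossoff lossoff")
print(feats)
```

For this headline the values agree with row 3 of the table: 28 characters, 5 words, a word density of 28 / 6 (about 4.67), and no punctuation or upper-case words.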


In [13]:
X_train.head()

Unnamed: 0,headline,Polarity,Subjectivity,char_count,word_count,word_density,punctuation_count,upper_case_word_count
34059,authoritarian secretary of transportation decl...,0.142857,0.767857,106,15,6.625,0,0
6098,study: marriages between perfectly matched cou...,0.333333,0.688889,88,13,6.285714,1,0
6252,employee wellness programs aren't so voluntary...,0.0,0.0,54,7,6.75,1,0
20826,p is for p*ssy' is the alphabet book of your w...,-0.1,0.4,55,12,4.230769,2,0
24719,nothing going right for area surgeon today,0.285714,0.535714,42,7,5.25,0,0


In [14]:
X_test.head()

Unnamed: 0,headline,Polarity,Subjectivity,char_count,word_count,word_density,punctuation_count,upper_case_word_count
12782,north dakota not heard from in 48 hours,0.0,0.0,39,8,4.333333,0,0
42915,report: it going to take way more than an inco...,0.25,0.3,106,19,5.3,1,0
33043,states' rights rancher ryan bundy to run for n...,0.0,0.0,60,10,5.454545,1,0
1121,watching thousands march in his honor unlocks ...,0.0,0.0,85,13,6.071429,2,0
38782,"during the debate, these two did the unthinkab...",-0.05,0.8,71,12,5.461538,1,0


In [15]:
validation.head()

Unnamed: 0,headline,Polarity,Subjectivity,char_count,word_count,word_density,punctuation_count,upper_case_word_count
0,area stand-up comedian questions the deal with...,0.0,0.0,65,9,6.5,2,0
1,dozens of glowing exit signs mercilessly taunt...,-0.7,1.0,65,9,6.5,0,0
2,perfect response to heckler somewhere in prop ...,1.0,1.0,62,9,6.2,1,0
3,gop prays for ossoff lossoff,0.0,0.0,28,5,4.666667,0,0
4,trevor noah says the scary truth about trump's...,0.0,0.8,65,11,5.416667,1,0


## Implementing a logistic regression model

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [17]:
lr = LogisticRegression(C=1, random_state=42, solver='liblinear')
lr.fit(X_train.drop(['headline'], axis=1), y_train)
predictions = lr.predict(X_test.drop(['headline'], axis=1))

print(classification_report(y_test, predictions))
pd.DataFrame(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

           0       0.62      0.72      0.67      5947
           1       0.60      0.49      0.54      5119

    accuracy                           0.61     11066
   macro avg       0.61      0.60      0.60     11066
weighted avg       0.61      0.61      0.61     11066



Unnamed: 0,0,1
0,4294,1653
1,2629,2490


In [18]:
accuracy_score(y_test,predictions)

0.6130489788541479

This model performs poorly: at 61% accuracy it is only a little better than flipping a coin, and only modestly better than always predicting the majority class (5947 of 11066 test headlines, about 54%). In the next part of the notebook we improve the feature engineering by extracting features with statistical methods.
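
The coin-flip comparison can be made more precise with a majority-class baseline. The sketch below is an illustration rather than part of the original notebook: it rebuilds a label vector from the class supports reported above (5947 and 5119) and scores scikit-learn's `DummyClassifier`. The names `y_demo`, `X_demo` and `baseline` are placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Label vector rebuilt from the support column of the classification report
y_demo = np.array([0] * 5947 + [1] * 5119)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(round(baseline_acc, 3))
```

Always predicting class 0 gives roughly 0.537 accuracy on this split, so the 0.61 of the metadata-only model is a gain of under eight percentage points.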

# Feature extraction based on TF-IDF

## Preprocessing

In [19]:
# Installing dependencies
!pip install textsearch
!pip install contractions
!pip install tqdm
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import contractions
import re



[nltk_data] Downloading package punkt to /Users/linatobon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/linatobon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
# remove some stopwords to capture negation in n-grams if possible
stop_words = nltk.corpus.stopwords.words('english')
stop_words.remove('no')
stop_words.remove('not')
stop_words.remove('but')

# load up a simple porter stemmer - nothing fancy
ps = nltk.porter.PorterStemmer()

def simple_text_preprocessor(document): 
    # lower case
    document = str(document).lower()
    
    # expand contractions
    document = contractions.fix(document)
    
    # remove unnecessary characters
    document = re.sub(r'[^a-zA-Z]',r' ', document)
    document = re.sub(r'nbsp', r'', document)
    document = re.sub(' +', ' ', document)
    
    # simple porter stemming
    document = ' '.join([ps.stem(word) for word in document.split()])
    
    # stopwords removal
    document = ' '.join([word for word in document.split() if word not in stop_words])
    
    return document

stp = np.vectorize(simple_text_preprocessor)

In [21]:
# applying the preprocessor function to the train, test and validation datasets
X_train['Clean'] = stp(X_train['headline'].values)
X_test['Clean'] = stp(X_test['headline'].values)
validation['Clean'] = stp(validation['headline'].values)

## TF-IDF

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
# Fitting TfidfVectorizer on the training text only
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(X_train['Clean'])
tv_matrix = tv_matrix.toarray()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = tv.get_feature_names_out()
tfidf_train = pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)



In [24]:
# extracting features for the test dataset (transform only, using the vocabulary fitted on train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tv_matrix = tv.transform(X_test['Clean'])
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names_out()
tfidf_test = pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

# extracting features for the validation dataset
tv_matrix = tv.transform(validation['Clean'])
tv_matrix = tv_matrix.toarray()
tfidf_valid = pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

In [25]:
# keeping just the predictors in both training and test datasets
X_train_metadata = X_train.drop(['headline', 'Clean'], axis=1).reset_index(drop=True)
X_test_metadata = X_test.drop(['headline', 'Clean'], axis=1).reset_index(drop=True)
val_metadata = validation.drop(['headline', 'Clean'], axis=1).reset_index(drop=True)

In [26]:
val_metadata.head()

Unnamed: 0,Polarity,Subjectivity,char_count,word_count,word_density,punctuation_count,upper_case_word_count
0,0.0,0.0,65,9,6.5,2,0
1,-0.7,1.0,65,9,6.5,0,0
2,1.0,1.0,62,9,6.2,1,0
3,0.0,0.0,28,5,4.666667,0,0
4,0.0,0.8,65,11,5.416667,1,0


In [27]:
X_train_comb = pd.concat([X_train_metadata, tfidf_train], axis=1)
X_test_comb = pd.concat([X_test_metadata, tfidf_test], axis=1)
X_validation = pd.concat([val_metadata, tfidf_valid], axis=1)

## Modeling

### Logistic Regression

In [28]:
lr = LogisticRegression(C=1, random_state=42, solver='liblinear')
lr.fit(X_train_comb, y_train)
predictions_lr = lr.predict(X_test_comb)

print(classification_report(y_test, predictions_lr))
pd.DataFrame(confusion_matrix(y_test, predictions_lr))

              precision    recall  f1-score   support

           0       0.84      0.88      0.86      5947
           1       0.86      0.81      0.83      5119

    accuracy                           0.85     11066
   macro avg       0.85      0.85      0.85     11066
weighted avg       0.85      0.85      0.85     11066



Unnamed: 0,0,1
0,5260,687
1,982,4137


In [29]:
accur_lr_tfidf = round(accuracy_score(y_test,predictions_lr),2)

### Linear SVC

In [30]:
from sklearn.svm import LinearSVC

In [31]:
# Using linear support vector classifier
lsvc = LinearSVC()
# training the model
lsvc.fit(X_train_comb, y_train)
lsvc_predictions = lsvc.predict(X_test_comb)
print(classification_report(y_test, lsvc_predictions))
pd.DataFrame(confusion_matrix(y_test, lsvc_predictions))



              precision    recall  f1-score   support

           0       0.59      1.00      0.74      5947
           1       0.99      0.18      0.31      5119

    accuracy                           0.62     11066
   macro avg       0.79      0.59      0.52     11066
weighted avg       0.77      0.62      0.54     11066



Unnamed: 0,0,1
0,5933,14
1,4184,935


In [32]:
accur_svc_tfidf = round(accuracy_score(y_test,lsvc_predictions),2)

### Decision Tree

In [33]:
from sklearn.tree import DecisionTreeClassifier

In [34]:
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_comb, y_train)
tree_tfidf_predictions = clf.predict(X_test_comb)
print(classification_report(y_test, tree_tfidf_predictions))
pd.DataFrame(confusion_matrix(y_test, tree_tfidf_predictions))

              precision    recall  f1-score   support

           0       0.87      0.90      0.88      5947
           1       0.88      0.84      0.86      5119

    accuracy                           0.87     11066
   macro avg       0.87      0.87      0.87     11066
weighted avg       0.87      0.87      0.87     11066



Unnamed: 0,0,1
0,5337,610
1,799,4320


In [35]:
accur_tree_tfidf = round(accuracy_score(y_test,tree_tfidf_predictions),2)

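Up to this point we have trained three different models on the TF-IDF features. Logistic regression and the decision tree improve clearly on the metadata-only baseline (0.85 and 0.87 accuracy versus 0.61), but the linear SVC does not (0.62). Next we extract features with the TF (term frequency) method to see whether performance improves further.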
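
To make the difference between the two representations concrete, here is a small sketch on a made-up three-document corpus (the documents and the names `docs`, `cv_demo` and `tv_demo` are illustrative assumptions, not data from this notebook). TF keeps raw term counts, whereas TF-IDF down-weights terms that appear in many documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus: "area" and "man" occur in two documents, "wins" in only one
docs = ["area man wins", "area man loses", "senate votes"]

# TF: raw term counts per document
cv_demo = CountVectorizer()
tf = cv_demo.fit_transform(docs).toarray()

# TF-IDF: counts reweighted by inverse document frequency, then L2-normalised
tv_demo = TfidfVectorizer()
tfidf = tv_demo.fit_transform(docs).toarray()

# Column order of both matrices (vocabulary terms sorted by column index)
vocab_demo = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(vocab_demo)
print(tf[0])
print(tfidf[0].round(2))
```

In the first document every term has a TF of 1, but under TF-IDF the rarer term "wins" receives a larger weight than "area" or "man".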

## Extracting features with TF

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))
X_traincv = cv.fit_transform(X_train['Clean']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_traincv = pd.DataFrame(X_traincv, columns=cv.get_feature_names_out())

X_testcv = cv.transform(X_test['Clean']).toarray()
X_testcv = pd.DataFrame(X_testcv, columns=cv.get_feature_names_out())

X_valcv = cv.transform(validation['Clean']).toarray()
X_valcv = pd.DataFrame(X_valcv, columns=cv.get_feature_names_out())
X_valcv.head()



Unnamed: 0,aaa,aaron,aarp,ab,abandon,abaya,abba,abbey,abbi,abc,...,zoo,zookeep,zooland,zoologist,zoom,zoroastrian,zsa,zucker,zuckerberg,zz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
X_train_combin = pd.concat([X_train_metadata, X_traincv], axis=1)
X_test_combin = pd.concat([X_test_metadata, X_testcv], axis=1)
X_val_combin = pd.concat([val_metadata, X_valcv], axis=1)

X_val_combin.head()

Unnamed: 0,Polarity,Subjectivity,char_count,word_count,word_density,punctuation_count,upper_case_word_count,aaa,aaron,aarp,...,zoo,zookeep,zooland,zoologist,zoom,zoroastrian,zsa,zucker,zuckerberg,zz
0,0.0,0.0,65,9,6.5,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,-0.7,1.0,65,9,6.5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1.0,1.0,62,9,6.2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0.0,0.0,28,5,4.666667,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0.0,0.8,65,11,5.416667,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Fitting a logistic model

In [38]:
lr.fit(X_train_combin, y_train)
predictions_lr_tf = lr.predict(X_test_combin)

print(classification_report(y_test, predictions_lr_tf))
pd.DataFrame(confusion_matrix(y_test, predictions_lr_tf))

              precision    recall  f1-score   support

           0       0.87      0.90      0.88      5947
           1       0.88      0.84      0.86      5119

    accuracy                           0.87     11066
   macro avg       0.87      0.87      0.87     11066
weighted avg       0.87      0.87      0.87     11066



Unnamed: 0,0,1
0,5361,586
1,822,4297


In [39]:
accur_lr_tf = round(accuracy_score(y_test,predictions_lr_tf),2)

### Fitting an SVC model

In [40]:
clf = LinearSVC(random_state=0)
clf.fit(X_train_combin,y_train)
clf_predictions_tf = clf.predict(X_test_combin)
print(classification_report(y_test, clf_predictions_tf))
pd.DataFrame(confusion_matrix(y_test, clf_predictions_tf))



              precision    recall  f1-score   support

           0       0.93      0.76      0.84      5947
           1       0.77      0.94      0.84      5119

    accuracy                           0.84     11066
   macro avg       0.85      0.85      0.84     11066
weighted avg       0.86      0.84      0.84     11066



Unnamed: 0,0,1
0,4507,1440
1,330,4789


In [41]:
accur_svc_tf = round(accuracy_score(y_test,clf_predictions_tf),2)

### Fitting a Decision Tree model

In [42]:
clf_tree = DecisionTreeClassifier(random_state=42)
clf_tree.fit(X_train_combin, y_train)
tree_tf_predictions = clf_tree.predict(X_test_combin)
print(classification_report(y_test, tree_tf_predictions))
pd.DataFrame(confusion_matrix(y_test, tree_tf_predictions))

              precision    recall  f1-score   support

           0       0.87      0.90      0.89      5947
           1       0.88      0.85      0.86      5119

    accuracy                           0.88     11066
   macro avg       0.88      0.87      0.87     11066
weighted avg       0.88      0.88      0.88     11066



Unnamed: 0,0,1
0,5353,594
1,783,4336


In [43]:
accur_tree_tf = round(accuracy_score(y_test,tree_tf_predictions),2)

## Model Comparison

In [44]:
models = ['DecisionTree_TFIDF','LogReg_TFIDF','SVC_TFIDF','DecisionTree_TF','LogReg_TF','SVC_TF']
col = [accur_tree_tfidf,accur_lr_tfidf,accur_svc_tfidf, accur_tree_tf,accur_lr_tf, accur_svc_tf ]
data = {'Models':models,'Accuracy':col}
graph_df = pd.DataFrame(data)
graph_df

Unnamed: 0,Models,Accuracy
0,DecisionTree_TFIDF,0.87
1,LogReg_TFIDF,0.85
2,SVC_TFIDF,0.62
3,DecisionTree_TF,0.88
4,LogReg_TF,0.87
5,SVC_TF,0.84


From the table above, the best-performing model is the decision tree trained on TF features (0.88 accuracy), narrowly ahead of the decision tree on TF-IDF features and logistic regression on TF features (both 0.87).
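
Because the top accuracies differ by about one point on a single train/test split, it may be worth checking whether the ranking holds across several splits. The sketch below shows one way to do that with k-fold cross-validation. It runs on synthetic data from `make_classification` so that it is self-contained; the dataset and the names `X_demo`, `y_demo` and `candidates` are assumptions, and the scores it prints say nothing about the sarcasm data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a feature matrix and binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Same model settings as in the notebook
candidates = {
    "LogReg": LogisticRegression(C=1, random_state=42, solver="liblinear"),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each candidate
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the gap between two models is smaller than the spread across folds, the difference seen on a single split should be read with caution.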

## Classifying new data

In [45]:
X_val_combin.head()

Unnamed: 0,Polarity,Subjectivity,char_count,word_count,word_density,punctuation_count,upper_case_word_count,aaa,aaron,aarp,...,zoo,zookeep,zooland,zoologist,zoom,zoroastrian,zsa,zucker,zuckerberg,zz
0,0.0,0.0,65,9,6.5,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,-0.7,1.0,65,9,6.5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1.0,1.0,62,9,6.2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0.0,0.0,28,5,4.666667,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0.0,0.8,65,11,5.416667,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
predictions = clf_tree.predict(X_val_combin)
res = pd.DataFrame(predictions)  # final predictions of the chosen model on the unseen validation data
res.index = X_val_combin.index  # keep the validation index so rows can be compared later
res.columns = ["prediction"]
res.to_csv("prediction_results.csv", index=False)  # the CSV file is saved in the same location as this notebook

In [47]:
submission = pd.read_csv("prediction_results.csv")

In [48]:
submission.value_counts()

prediction
0             6314
1             4752
dtype: int64

## Conclusions
* Data preparation: we dropped blank headlines, converted text to lowercase, expanded contractions, removed non-alphabetic characters and repeated spaces, applied stemming and removed stopwords.
* Feature extraction: we created variables based on sentiment (polarity and subjectivity), on the length of the texts and words, and on the number of punctuation marks. We also used the TF-IDF and TF methods to extract features from word frequencies across the documents.
* Three models were trained for each feature set (TF-IDF and TF). Surprisingly, the best-performing model is the decision tree trained on TF features.
