## Introduction

Now that we have completed all of our models, we can finally implement our own sarcasm detector! We will build it on our best-performing model (other than BERT), which was a Logistic Regression with TF-IDF vectorization and a custom tokenizer. Let's jump right into it.

As always, we start with the libraries.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

And the dataframe

In [3]:
df = pd.read_csv('/Users/lokikeeler/Downloads/train-balanced-sarcasm_2.csv')

In [4]:
df.head()

Unnamed: 0,label,comment,score,ups,downs,date,created_utc,parent_comment,year,SUB_2007scape,...,SUB_television,SUB_tf2,SUB_todayilearned,SUB_trees,SUB_ukpolitics,SUB_unitedkingdom,SUB_videos,SUB_worldnews,SUB_wow,SUB_xboxone
0,0,NC and NH.,2,-1,-1,2016-10-01,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ...",2016,False,...,False,False,False,False,False,False,False,False,False,False
1,0,You do know west teams play against west teams...,-4,-1,-1,2016-11-01,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...,2016,False,...,False,False,False,False,False,False,False,False,False,False
2,0,"They were underdogs earlier today, but since G...",3,3,0,2016-09-01,2016-09-22 21:45:37,They're favored to win.,2016,False,...,False,False,False,False,False,False,False,False,False,False
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,2016-10-01,2016-10-18 21:03:47,deadass don't kill my buzz,2016,False,...,False,False,False,False,False,False,False,False,False,False
4,0,"I don't pay attention to her, but as long as s...",0,0,0,2016-09-01,2016-09-02 10:35:08,do you find ariana grande sexy ?,2016,False,...,False,False,False,False,False,False,False,False,False,False


In [5]:
df = df.drop(columns='parent_comment')

Here is our tokenizer, the same one used in our Logistic Regression model with the TF-IDF vectorizer. It strips punctuation, drops English stop words, and stems the words that remain.

In [6]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# A set makes the stop-word lookup fast for every token
ENGLISH_STOP_WORDS = set(stopwords.words('english'))

def my_tokenizer(sentence):
    stemmer = nltk.stem.PorterStemmer()

    # Remove all punctuation, including apostrophes and exclamation marks
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')

    listofwords = sentence.split(' ')
    listofstemmed_words = []

    for word in listofwords:
        # Skip stop words and the empty strings left by repeated spaces
        if word not in ENGLISH_STOP_WORDS and word != '':
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lokikeeler/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


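Again, we will work with the comment column. First we separate out the label and create our train/test split. X keeps every other column for now, but only X['comment'] is passed to the vectorizer.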

In [7]:
X = df.drop(columns='label')
y = df['label']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [9]:
X_train.shape

(384932, 107)

Now I will apply the same TF-IDF vectorizer with the same parameters as before.

In [10]:
# Using our custom tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(
    min_df=5,
    max_features=500,
    tokenizer=my_tokenizer,
    ngram_range=(1, 3),
)
tfidf.fit(X_train['comment'])

X_train_transformed = tfidf.transform(X_train['comment'])
X_test_transformed = tfidf.transform(X_test['comment'])

X_train_transformed.shape



(384932, 500)
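
To make the `ngram_range=(1, 3)` setting concrete: the vectorizer builds its candidate vocabulary from single tokens, pairs, and triples of consecutive tokens, and `max_features=500` then keeps only the 500 most frequent terms across the corpus. The sketch below is a plain-Python illustration of that n-gram expansion, not scikit-learn's actual implementation. `word_ngrams` is a hypothetical helper, and the token list is a made-up example of what a stemming tokenizer might return.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous n-grams (joined by spaces) for each n in ngram_range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the token list
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

# Hypothetical tokenizer output for a short comment (assumption, not real data)
tokens = ["total", "look", "forward"]
print(word_ngrams(tokens))
# ['total', 'look', 'forward', 'total look', 'look forward', 'total look forward']
```

Three tokens already yield six candidate features, which is why a cap such as `max_features` matters on a corpus of this size.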

Lastly, we fit the logistic regression model.

In [13]:
logreg = LogisticRegression()

In [31]:
logreg.fit(X_train_transformed, y_train)

LogisticRegression()
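
Once the model is fitted, its learned weights can be paired with the vectorizer's vocabulary to see which tokens push a prediction toward the sarcastic class. The helper below is only a sketch of that idea. `top_weighted_features` is a hypothetical name, and the small lists in the demo are made-up values, not results from this model.

```python
def top_weighted_features(feature_names, coefficients, k=5):
    """Return the k most positive and the k most negative (feature, weight) pairs."""
    # Sort ascending by weight: most negative first, most positive last
    paired = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])
    most_negative = paired[:k]
    most_positive = paired[-k:][::-1]
    return most_positive, most_negative

# Made-up demo values (assumption: NOT taken from the fitted model)
names = ["total", "look", "forward", "thank"]
weights = [1.2, -0.3, 0.1, -0.8]
positive, negative = top_weighted_features(names, weights, k=2)
print(positive)  # largest weights first
print(negative)  # most negative weights first
```

With the objects from this notebook, the call would presumably be `top_weighted_features(tfidf.get_feature_names_out(), logreg.coef_[0], k=10)`, assuming a scikit-learn version that provides `get_feature_names_out` (1.0 or later). In a binary logistic regression, positive weights push toward class 1, which is the sarcastic label here.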

Perfect. Now we are ready for our sarcasm detector! The function below takes a sentence, the fitted vectorizer (tfidf), and the fitted model (logistic regression). The model predicts class probabilities for the sentence. The function then prints "Sarcastic" or "Not Sarcastic" and returns the probability array, whose two columns are the probabilities of class 0 (not sarcastic) and class 1 (sarcastic).

In [70]:
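def sarcasm_detector(sentence, vectorizer, model):
    # Vectorize the single sentence with the already-fitted vectorizer
    vector1 = vectorizer.transform([sentence])
    # Shape (1, 2): [P(not sarcastic), P(sarcastic)]
    prediction = model.predict_proba(vector1)
    # Sarcastic when the probability of class 0 falls below 0.5
    if prediction[0][0] < 0.5:
        print("Sarcastic")
    else:
        print("Not Sarcastic")
    return prediction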
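
A variant that returns the label and probability instead of printing may be easier to reuse, for example when scoring many sentences in a loop. The sketch below assumes that class 1 means sarcastic, as in the `label` column. `sarcasm_label` and its `threshold` parameter are hypothetical names, not part of the original notebook.

```python
def sarcasm_label(sentence, vectorizer, model, threshold=0.5):
    """Return (label, p_sarcastic), where label is 1 (sarcastic) or 0 (not sarcastic)."""
    features = vectorizer.transform([sentence])
    proba = model.predict_proba(features)[0]
    # Look up the column for class 1 instead of assuming the column order
    sarcastic_col = list(model.classes_).index(1)
    p_sarcastic = float(proba[sarcastic_col])
    # Strictly greater than the threshold, matching the printing version at 0.5
    label = 1 if p_sarcastic > threshold else 0
    return label, p_sarcastic
```

It would be called the same way as the printing version, for instance `label, p = sarcasm_label("Great haircut, Paul!", tfidf, logreg)`.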

I meant this first sentence to be non-sarcastic. In reality it could be read as serious or sarcastic, but I'm glad the model got it right!

In [80]:
sarcasm_detector("That shirt is blue", tfidf, logreg)

Not Sarcastic


array([[0.59201511, 0.40798489]])

Alright, that one was easy, but it was an important test. Now let's give it a sarcastic sentence.

In [81]:
sarcasm_detector("Great haircut, Paul!", tfidf, logreg)


Sarcastic


array([[0.31247678, 0.68752322]])

And again it gets it right! Let's try another.

In [83]:
sarcasm_detector("I'm looking forward to Demo Day", tfidf, logreg)

Not Sarcastic


array([[0.72186234, 0.27813766]])

Hmmm, this time it got it wrong, since I meant the sentence sarcastically. Let's see what happens if we tweak it slightly by adding "totally" and an exclamation mark.

In [84]:
sarcasm_detector("I'm totally looking forward to Demo Day!", tfidf, logreg)

Sarcastic


array([[0.3179921, 0.6820079]])

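Fascinating! With just a couple of tweaks, the model changed to reading the sentence as sarcastic. Note that my_tokenizer strips all punctuation before vectorization, so the exclamation mark never reaches the model. The shift must therefore come from adding the word "totally", which shows that it has some feature importance in our model.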
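
A quick way to check how much each of the two tweaks can contribute is to replay the punctuation-stripping step from `my_tokenizer` on its own. The sketch below copies only that first loop (`strip_punctuation` is a hypothetical helper name) and compares the tweaked sentence with and without the exclamation mark.

```python
import string

def strip_punctuation(sentence):
    """Remove every character in string.punctuation, as the first loop of my_tokenizer does."""
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')
    return sentence

with_mark = strip_punctuation("I'm totally looking forward to Demo Day!")
without_mark = strip_punctuation("I'm totally looking forward to Demo Day")

print(with_mark)                  # Im totally looking forward to Demo Day
print(with_mark == without_mark)  # True
```

The two strings are identical after stripping, so the vectorizer receives the same tokens with or without the exclamation mark.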

## Conclusion

Using the Logistic Regression with a TF-IDF vectorizer, our best model from a previous journal, I was able to create a working sarcasm detector function. I fed it sarcastic and non-sarcastic sentences, and the model classified two of the three original sentences as I intended. For the one it got wrong, a small change to the wording was enough to flip the prediction, which highlights how much the features the model sees matter.