## Reddit Sarcasm Detection

### Import Libraries

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

### Import CSV

In [2]:
training_csv_1 = pd.read_csv("train-balanced-sarcasm.csv")

In [3]:
training_csv_1["comment"] = training_csv_1["comment"].astype(str)

In [4]:
training_csv_1.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...


### Exploratory Data Analysis

In [5]:
print(f"The training data has {training_csv_1.author.nunique()} unique authors.")
training_csv_1.groupby("author")["label"].mean().value_counts()

The training data has 256561 unique authors.


0.500000    251903
1.000000      2302
0.333333       770
0.400000       405
0.428571       268
             ...  
0.533333         1
0.494253         1
0.555556         1
0.499408         1
0.495050         1
Name: label, Length: 64, dtype: int64

##### Most authors (251,903 of 256,561, about 98%) have a mean label of exactly 0.5, so author carries little signal; consider dropping it
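The per-author check above can be extended to also show how many comments each author has, which helps judge whether a mean of 0.5 is informative. This is a minimal sketch on hypothetical toy data (the `toy` frame and its values are assumptions, not rows from the CSV):

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({
    "author": ["a", "a", "b", "b", "c", "c", "c", "d"],
    "label":  [0,   1,   1,   0,   1,   1,   0,   1],
})

# Mean label and number of comments per author.
per_author = toy.groupby("author")["label"].agg(["mean", "count"])

# Share of authors whose comments are split exactly 50/50.
share_balanced = (per_author["mean"] == 0.5).mean()

print(per_author)
print(f"Authors with mean label 0.5: {share_balanced:.0%}")
```

On the real data, the same two lines would run against `training_csv_1` instead of `toy`.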

In [6]:
print(f"The training data has {training_csv_1.subreddit.nunique()} unique subreddits.")
training_csv_1.groupby("subreddit")["label"].mean().value_counts()

The training data has 14878 unique subreddits.


0.000000    5883
1.000000    2042
0.500000    1242
0.333333     585
0.250000     362
            ... 
0.465517       1
0.506567       1
0.257874       1
0.550802       1
0.373333       1
Name: label, Length: 1430, dtype: int64

##### Subreddit looks more informative than expected: 5,883 subreddits have a mean label of exactly 0 and 2,042 a mean of exactly 1, so it is probably worth keeping
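One caveat worth checking is whether the extreme means come from subreddits with very few comments. A rough sketch of that check, using hypothetical toy data and an arbitrary `MIN_COMMENTS` threshold (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({
    "subreddit": ["x", "y", "y", "y", "y", "z", "z"],
    "label":     [1,   0,   1,   1,   0,   0,   0],
})

MIN_COMMENTS = 2  # hypothetical threshold, tune on the real data

# Mean label and number of comments per subreddit.
stats = toy.groupby("subreddit")["label"].agg(["mean", "count"])

# Keep only subreddits with enough comments for the mean to be meaningful.
reliable = stats[stats["count"] >= MIN_COMMENTS]

print(reliable)
```

If most of the 0.0 and 1.0 means disappear after such a filter, the column may be less informative than the raw counts suggest.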

In [18]:
training_csv_1[["ups", "downs"]]

Unnamed: 0,ups,downs
0,-1,-1
1,-1,-1
2,3,0
3,-1,-1
4,-1,-1
...,...,...
1010821,2,0
1010822,1,0
1010823,1,0
1010824,1,0


##### ups and downs look correlated: downs seems to be -1 whenever ups is negative and 0 otherwise. Let's test this

In [22]:
training_csv_1[training_csv_1["ups"].apply(lambda x: -1 if x <= -1 else 0) != training_csv_1["downs"]]

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
140,0,my comment very similar to this went down a fu...,Schumarker,Android,-6,-6,0,2016-09,2016-09-24 21:50:56,Badumm-tzz
204,0,it really does,Horus_Krishna_2,radiohead,-1,-1,0,2016-09,2016-09-14 20:07:04,"As far as I know, someone's reddit history doe..."
414,0,"meh, my upper body blows his away.",GiveMeSomeIhedigbo,bodybuilding,-6,-6,0,2016-09,2016-09-19 06:27:32,Do you Agree that this version is The BEST Ver...
431,0,such a shitty meme.,Geralt-of_Rivia,AdviceAnimals,-4,-4,0,2016-09,2016-09-02 02:39:44,Front page post with 2000 comments and is 10 h...
454,0,"this sub is for open ended questions, not yes ...",hunterz5,AskReddit,-3,-3,0,2016-09,2016-09-10 01:54:47,Do you think IB/AP classes are truly worth it?...
...,...,...,...,...,...,...,...,...,...,...
1010772,1,"nsfw, thanks.",Underdogg13,pics,-1,-1,0,2009-09,2009-09-06 17:16:11,"(PIC) Penelope Cruz, firm and without shirt. F..."
1010790,1,".. erm .. good for them, i guess .. they're ma...",mijj,Economics,-2,-2,0,2009-10,2009-10-30 01:12:47,There was a rash of muggings in my neighborhoo...
1010791,1,"zombie, frankenstein, jesus, now thats a real ...",Rip_Van_Winkle,pics,-2,-2,0,2009-10,2009-10-31 23:20:10,"jesus christ, that's a funny diagram."
1010801,1,"yes, and there's no such thing as mental illne...",Davin900,worldnews,-1,-1,0,2009-08,2009-08-14 18:34:29,And my parents had a rough upbringing/backgrou...


##### Only 6.1% of rows break this rule, so downs is almost entirely determined by ups. Whether downs is worth keeping is debatable
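The share of rule-breaking rows can be computed directly instead of being read off the filtered frame. The sketch below uses `np.where` on hypothetical toy values (the `toy` frame is an assumption, not data from the CSV):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real ups/downs columns.
toy = pd.DataFrame({
    "ups":   [-1, 3, -6, 2, -1],
    "downs": [-1, 0,  0, 0, -1],
})

# Rule under test: downs is -1 when ups is negative, otherwise 0.
expected_downs = np.where(toy["ups"] <= -1, -1, 0)

# Boolean Series marking the rows where the rule does not hold.
mismatch = expected_downs != toy["downs"]

print(f"{mismatch.mean():.1%} of rows break the rule")
```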

### Build model using Comment Column only (Unigram Model)

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

In [8]:
training_csv_1["comment"] = training_csv_1["comment"].str.lower()

In [9]:
X_train, X_val, y_train, y_val = train_test_split(
    training_csv_1["comment"], 
    training_csv_1["label"], 
    test_size = 0.2
)

In [10]:
def create_ngram_vectorizer(text_train, ngram_range = (1,1)):
    """Fit a CountVectorizer on the training text only and return it."""
    vectorizer = CountVectorizer(ngram_range = ngram_range)
    vectorizer.fit(text_train)
    return vectorizer

In [11]:
unigram_vectorizer = create_ngram_vectorizer(X_train)

In [12]:
X_train_transformed = unigram_vectorizer.transform(X_train)
X_val_transformed = unigram_vectorizer.transform(X_val)

In [13]:
classifier = SGDClassifier()
classifier.fit(X_train_transformed, y_train)

SGDClassifier()

In [14]:
print(f"Training Score: {classifier.score(X_train_transformed, y_train)}")
print(f"Validation Score: {classifier.score(X_val_transformed, y_val)}")

Training Score: 0.6871095392377513
Validation Score: 0.6793328254998368
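The vectorize-then-classify steps above could also be bundled into a single scikit-learn pipeline, so the vocabulary fitted on the training text is reused automatically at prediction time. A minimal sketch on hypothetical toy comments (the data, labels, and fixed `random_state` are assumptions, not part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy comments and labels, not taken from the dataset.
toy_comments = [
    "oh great another monday",
    "sure that will definitely work",
    "the game starts at nine",
    "i bought a new keyboard",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vocabulary and the linear model in one call
# and applies the same vocabulary when predicting.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    SGDClassifier(random_state=0),  # fixed seed is an added assumption
)
model.fit(toy_comments, toy_labels)

predictions = model.predict(["oh great another keyboard", "the game is at nine"])
print(predictions)
```

With four training examples the predictions themselves mean little; the point is only the shape of the workflow.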


### Now what? Bigrams and trigrams, let's go!

In [15]:
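# Refit with unigrams + bigrams (i = 2), then unigrams + bigrams + trigrams (i = 3).
# The two score pairs are printed in that order.
for i in range(2, 4):
    ngram_vectorizer = create_ngram_vectorizer(X_train, ngram_range = (1,i))
    X_train_transformed = ngram_vectorizer.transform(X_train)
    X_val_transformed = ngram_vectorizer.transform(X_val)

    # fit() retrains from scratch, so weights from the previous run are discarded
    classifier.fit(X_train_transformed, y_train)

    print(f"Training Score: {classifier.score(X_train_transformed, y_train)}")
    print(f"Validation Score: {classifier.score(X_val_transformed, y_val)}")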

Training Score: 0.7339784334578191
Validation Score: 0.7072306916098652
Training Score: 0.7626888927361314
Validation Score: 0.7115637644312099
