## Reddit Sarcasm Detection

### Import Libraries

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

### Import CSV

In [2]:
training_csv_1 = pd.read_csv("train-balanced-sarcasm.csv")

In [3]:
training_csv_1["comment"] = training_csv_1["comment"].astype(str)

In [4]:
training_csv_1.head()

| | label | comment | author | subreddit | score | ups | downs | date | created_utc | parent_comment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NC and NH. | Trumpbart | politics | 2 | -1 | -1 | 2016-10 | 2016-10-16 23:55:23 | Yeah, I get that argument. At this point, I'd ... |
| 1 | 0 | You do know west teams play against west teams... | Shbshb906 | nba | -4 | -1 | -1 | 2016-11 | 2016-11-01 00:24:10 | The blazers and Mavericks (The wests 5 and 6 s... |
| 2 | 0 | They were underdogs earlier today, but since G... | Creepeth | nfl | 3 | 3 | 0 | 2016-09 | 2016-09-22 21:45:37 | They're favored to win. |
| 3 | 0 | This meme isn't funny none of the "new york ni... | icebrotha | BlackPeopleTwitter | -8 | -1 | -1 | 2016-10 | 2016-10-18 21:03:47 | deadass don't kill my buzz |
| 4 | 0 | I could use one of those tools. | cush2push | MaddenUltimateTeam | 6 | -1 | -1 | 2016-12 | 2016-12-30 17:00:13 | Yep can confirm I saw the tool they use for th... |


### Exploratory Data Analysis

In [5]:
print(f"The training data has {training_csv_1.author.nunique()} unique authors.")
training_csv_1.groupby("author")["label"].mean().value_counts()

The training data has 256561 unique authors.


0.500000    251903
1.000000      2302
0.333333       770
0.400000       405
0.428571       268
             ...  
0.533333         1
0.494253         1
0.555556         1
0.499408         1
0.495050         1
Name: label, Length: 64, dtype: int64

##### Most authors have a mean label of exactly 0.5, so the author column carries little signal; consider dropping it
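The observation above can be quantified as a single share: in the output, 251903 of 256561 authors (about 98%) have a mean label of exactly 0.5. The sketch below shows one way to compute that share and then drop the column. It runs on a small made-up frame; the `toy` data and the names `author_mean`, `share_balanced` and `toy_without_author` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the training frame (not taken from the real CSV).
toy = pd.DataFrame({
    "author": ["a", "a", "b", "b", "c", "d", "d", "d"],
    "label":  [0,   1,   1,   0,   1,   0,   0,   1],
})

# Mean label per author: 0.5 means the author's comments are evenly split
# between the two labels.
author_mean = toy.groupby("author")["label"].mean()

# Fraction of authors whose mean label is exactly 0.5.
share_balanced = (author_mean == 0.5).mean()
print(f"{share_balanced:.0%} of authors have a mean label of 0.5")

# If that share is high, the column is unlikely to help and can be dropped.
toy_without_author = toy.drop(columns=["author"])
```

On the real data the same three lines would run against `training_csv_1` instead of `toy`.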

### Build model using Comment Column only (Unigram Model)

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

In [7]:
training_csv_1["comment"] = training_csv_1["comment"].str.lower()

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    training_csv_1["comment"], 
    training_csv_1["label"], 
    test_size = 0.2
)

In [9]:
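def create_unigram_transform_func(text_train):
    """Fit a unigram CountVectorizer on text_train and return its transform function."""
    vectorizer = CountVectorizer()  # default ngram_range=(1, 1), i.e. unigrams
    vectorizer.fit(text_train)  # learn the vocabulary from the training text only
    def transform(data):
        return vectorizer.transform(data)
    return transform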

In [10]:
transform_func = create_unigram_transform_func(X_train)

In [11]:
X_train = transform_func(X_train)
X_test = transform_func(X_test)
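To make the unigram transform concrete, here is a minimal sketch of what a fitted `CountVectorizer` produces on a tiny example. The sentences and the names `train_docs`, `vocab` and `counts` are made up for illustration. It relies on the default settings, which lowercase the text and ignore single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training sentences, used only to build a small vocabulary.
train_docs = ["great idea", "what a great great plan"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# Vocabulary terms ordered by their column index in the output matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)

# Words never seen during fit ("timing") are silently ignored at transform time.
counts = vectorizer.transform(["great plan, great timing"]).toarray()
print(counts)
```

Each row of `counts` is one document and each column is the number of times a vocabulary word occurs in it, which is the representation passed to the classifier.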

In [12]:
classifier = SGDClassifier()
classifier.fit(X_train, y_train)

SGDClassifier()

In [13]:
print(classifier.score(X_train, y_train))
print(classifier.score(X_test, y_test))

0.6886417035589741
0.6819000227535787
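The two scores above will vary slightly between runs, because neither `train_test_split` nor `SGDClassifier` is given a seed. The sketch below shows one possible way to make the same unigram model repeatable by fixing `random_state` and wrapping the vectorizer and classifier in a scikit-learn pipeline. The toy `comments` and `labels` and the seed value 0 are assumptions made for illustration, so the resulting accuracy says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the "comment" and "label" columns.
comments = [
    "oh great, another monday", "yeah because that always works",
    "sure, what could go wrong", "wow, totally not obvious",
    "the game starts at nine", "i could use one of those tools",
    "they are favored to win", "thanks for the link",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A fixed random_state makes the split repeatable; stratify keeps labels balanced.
X_tr, X_te, y_tr, y_te = train_test_split(
    comments, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training text only, then the classifier.
model = make_pipeline(CountVectorizer(), SGDClassifier(random_state=0))
model.fit(X_tr, y_tr)

predictions = model.predict(X_te)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

A pipeline also keeps the fit and transform steps together, so the vocabulary cannot accidentally be learned from the test comments.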
