Yi Xiao(xiaoyiyi)

### 1. Data exploration and preparation
##### Loading data

In [3]:
import pandas as pd
data = pd.read_csv('train-balanced-sarcasm.csv')
data = data.drop(columns=['author', 'subreddit', 'date','created_utc'])
data.dropna(subset=['comment'], inplace=True)

In [4]:
data['label'].value_counts()
# The two classes are almost exactly balanced

0    505405
1    505368
Name: label, dtype: int64
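
The counts above can be turned into class shares to make the balance explicit. This is a minimal sketch that only reuses the two counts printed above; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above
counts = {0: 505405, 1: 505368}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"label {label}: {share:.4%}")
```

In pandas, `data['label'].value_counts(normalize=True)` should report the same shares directly.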

In [5]:
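# reset_index returns a new frame, so assign it back to keep a contiguous index after dropna
data = data.reset_index(drop=True)
data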

Unnamed: 0,label,comment,score,ups,downs,parent_comment
0,0,NC and NH.,2,-1,-1,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,-4,-1,-1,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",3,3,0,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,deadass don't kill my buzz
4,0,I could use one of those tools.,6,-1,-1,Yep can confirm I saw the tool they use for th...
5,0,"I don't pay attention to her, but as long as s...",0,0,0,do you find ariana grande sexy ?
6,0,Trick or treating in general is just weird...,1,-1,-1,What's your weird or unsettling Trick or Treat...
7,0,Blade Mastery+Masamune or GTFO!,2,-1,-1,Probably Sephiroth. I refuse to taint his grea...
8,0,"You don't have to, you have a good build, buy ...",1,-1,-1,What to upgrade? I have $500 to spend (mainly ...
9,0,I would love to see him at lolla.,2,-1,-1,Probably count Kanye out Since the rest of his...


In [6]:
print(data.head())

   label                                            comment  score  ups  \
0      0                                         NC and NH.      2   -1   
1      0  You do know west teams play against west teams...     -4   -1   
2      0  They were underdogs earlier today, but since G...      3    3   
3      0  This meme isn't funny none of the "new york ni...     -8   -1   
4      0                    I could use one of those tools.      6   -1   

   downs                                     parent_comment  
0     -1  Yeah, I get that argument. At this point, I'd ...  
1     -1  The blazers and Mavericks (The wests 5 and 6 s...  
2      0                            They're favored to win.  
3     -1                         deadass don't kill my buzz  
4     -1  Yep can confirm I saw the tool they use for th...  


##### Split the label and the other columns into x and y

In [7]:
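# Copy the label instead of popping it, so `data` still has the column when it is saved to CSV later
y = data['label']
x = data.drop(columns=['label'])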

### baseline accuracy

In [6]:
from sklearn.metrics import accuracy_score
import random
baseline_y = []
for i in range(len(x)):
    baseline_y.append(random.randint(0, 1))
print("\nAccuracy score : %f" %(accuracy_score(y, baseline_y)))


Accuracy score : 0.500353
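
A score close to 0.5 is what random guessing should give on a balanced binary dataset. The sketch below repeats the idea on toy labels with a fixed seed (the seed and the sample size are arbitrary assumptions, not values from the notebook).

```python
import random

random.seed(17)  # fixed seed (an arbitrary choice) so the sketch is repeatable

n = 100_000
# Balanced toy labels: half 0, half 1
labels = [i % 2 for i in range(n)]
guesses = [random.randint(0, 1) for _ in range(n)]

accuracy = sum(g == t for g, t in zip(guesses, labels)) / n
print(f"Accuracy of random guessing: {accuracy:.4f}")
```

Any model below should beat this value clearly before its features are considered useful.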


### 2. Extract features
#### feature 1: the existing numeric features score, ups, and downs. We also want to extract features from the comment text and the parent comment text

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
        train_test_split(x, y, random_state=17)

In [8]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
gnb = GaussianNB()
gnb.fit(X_train[['score','ups', 'downs']], y_train)
y_pred = gnb.predict(X_test[['score','ups', 'downs']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.519961


##### judging by the accuracy score (0.519961 versus the 0.500353 random baseline), these three features are not very helpful

#### feature 2: bag of words. Comments are usually short, so the useful context around a word should be small; 1-grams to 3-grams should be suitable features
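
To make the n-gram idea concrete, here is a small sketch that lists the 1-grams to 3-grams of one short comment. `word_ngrams` is a hypothetical helper with a plain whitespace tokenizer; `TfidfVectorizer` uses its own tokenization, so its vocabulary can differ in detail.

```python
def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the comment
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("How original of you")
print(grams)
```

A four-word comment yields 4 + 3 + 2 = 9 n-grams, which shows how quickly the vocabulary grows and why `max_features` is capped in the cells below.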

##### feature 2.1.1 : only analyze comment text,  1-gram to 2-gram this time

In [9]:
from sklearn.model_selection import train_test_split
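# train_test_split returns (X_train, X_test, y_train, y_test), so despite their names
# X_f1 holds the training comments and y_f1 holds the test comments
X_f1, y_f1, y_train_f1, y_test_f1 = \
        train_test_split(x['comment'], y, random_state=17)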
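
The return order of `train_test_split` matters for the cells that follow. A toy sketch (the lists and the variable names are stand-ins, not the notebook's data) showing that the first two results are the feature splits and the last two are the label splits:

```python
from sklearn.model_selection import train_test_split

texts = [f"comment {i}" for i in range(8)]   # toy stand-in for x['comment']
labels = [0, 1, 0, 1, 0, 1, 0, 1]            # toy stand-in for y

# Order of the results: features train, features test, labels train, labels test
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, random_state=17)

print(len(text_train), len(text_test))
print(text_test, label_test)
```

With the default `test_size`, a quarter of the rows goes to the test split.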

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tf_idf_1 = TfidfVectorizer(ngram_range = (1, 2), max_features = 50000)
logistic = LogisticRegression()
# sklearn's pipeline
tfidf_logitistic_pipeline_1 = Pipeline([('tf_idf', tf_idf_1), 
                                 ('logit', logistic)])
tfidf_logitistic_pipeline_1.fit(X_f1, y_train_f1)

Pipeline(memory=None,
     steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=50000, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [11]:
y_pred_f1 = tfidf_logitistic_pipeline_1.predict(y_f1)
print("\nAccuracy score : %f" %(accuracy_score(y_test_f1, y_pred_f1)))


Accuracy score : 0.720939


##### feature 2.1.2 : only analyze comment text, use 1-gram to 3-gram this time

In [5]:
from sklearn.model_selection import train_test_split
# X_f2 holds the training comments and y_f2 the test comments (see the print below)
X_f2, y_f2, y_train_f2, y_test_f2 = \
        train_test_split(x['comment'], y, random_state=17)

In [7]:
print(y_f2)

469600     Starting to feel pretty fucking tired of all t...
639137     It's like that label actually has no meaning b...
240293     Mained Fiora - Reworked Mained AP Tristana - W...
702254     Yeah lol that's right they wouldn't let black ...
889040               No, he made the thread asking jokingly.
118721     Probably - they flaunt a subscription button o...
947749                                          So horrible!
112289     You'd think someone would have foreseen this a...
564578     Elasticity of driver material based on tempera...
1020       I can't believe after ~30 comments I have yet ...
44677                  nah dude OP is tryin to get followers
43630      Remember, everyone: vote democrat if you want ...
822869                             Download the NSA toolkit.
950321                                  How original of you.
995459     Better start sniffing coke and drinking a few ...
1007434    It's like he invented a whole new type of chec...
254789     I wasn't clai

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tf_idf_2 = TfidfVectorizer(ngram_range = (1, 3), max_features = 50000)
logistic = LogisticRegression()
# sklearn's pipeline
tfidf_logitistic_pipeline_2 = Pipeline([('tf_idf', tf_idf_2), 
                                 ('logit', logistic)])

tfidf_logitistic_pipeline_2.fit(X_f2, y_train_f2)
y_pred_f2 = tfidf_logitistic_pipeline_2.predict(y_f2)
print("\nAccuracy score : %f" %(accuracy_score(y_test_f2, y_pred_f2)))


Accuracy score : 0.721762


Accuracy improves only marginally with 1-grams to 3-grams: 0.721762 versus 0.720939 for 1-grams to 2-grams.
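
A rough way to judge such a small gain is to compare it with the sampling noise of an accuracy estimate. The sketch below assumes the default `test_size` of 0.25 and uses the row counts and accuracies printed above; it is a back-of-the-envelope check, not a formal test.

```python
import math

n_rows = 505405 + 505368            # rows left after dropna (label counts above)
n_test = math.ceil(0.25 * n_rows)   # assumes the default test_size of 0.25

acc_12 = 0.720939   # 1-gram to 2-gram accuracy reported above
acc_13 = 0.721762   # 1-gram to 3-gram accuracy reported above

diff = acc_13 - acc_12
# Binomial standard error of one accuracy estimate on n_test samples
std_err = math.sqrt(acc_13 * (1 - acc_13) / n_test)

print(f"test rows: {n_test}")
print(f"difference: {diff:.6f}")
print(f"standard error: {std_err:.6f}")
```

The difference is about the size of one standard error, so the gain from adding 3-grams may be within noise. Both models are scored on the same test rows, so a paired comparison would be a more careful check.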

##### different models

In [16]:
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

svc = LinearSVC()
tf_idf_2 = TfidfVectorizer(ngram_range = (1, 3), max_features = 50000)
clf_pipeline_1 = Pipeline([('tf_idf', tf_idf_2), 
                                 ('svc', svc)])

clf_pipeline_1.fit(X_f2, y_train_f2)
y_pred_f2 = clf_pipeline_1.predict(y_f2)
print("\nAccuracy score : %f" %(accuracy_score(y_test_f2, y_pred_f2)))


Accuracy score : 0.715248


In [10]:
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
clf = tree.DecisionTreeClassifier(max_depth = 4)
tf_idf_2 = TfidfVectorizer(ngram_range = (1, 3), max_features = 50000)
clf_pipeline_2 = Pipeline([('tf_idf', tf_idf_2), 
                                 ('clf', clf)])

clf_pipeline_2.fit(X_f2, y_train_f2)
y_pred_f2 = clf_pipeline_2.predict(y_f2)
print("\nAccuracy score : %f" %(accuracy_score(y_test_f2, y_pred_f2)))


Accuracy score : 0.560164


In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
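clf = GaussianNB()
tf_idf_2 = TfidfVectorizer(ngram_range = (1, 3), max_features = 50000)
# GaussianNB needs dense input; toarray() returns an ndarray, while todense() returns np.matrix.
# A dense matrix with 50000 columns may not fit in memory for this many rows.
clf_pipeline_3 = Pipeline([('tf_idf', tf_idf_2), 
                           ('to_dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
                                 ('clf', clf)])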

clf_pipeline_3.fit(X_f2, y_train_f2)
y_pred_f2 = clf_pipeline_3.predict(y_f2)
print("\nAccuracy score : %f" %(accuracy_score(y_test_f2, y_pred_f2)))
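
The cell above has no recorded output. One likely obstacle is the size of the dense matrix that `GaussianNB` needs. The estimate below assumes the default `test_size` of 0.25 and 8 bytes per value (float64); both are assumptions, not values confirmed by the notebook.

```python
import math

n_rows = 505405 + 505368                        # rows after dropna (label counts above)
n_train = n_rows - math.ceil(0.25 * n_rows)     # assumes the default test_size of 0.25
n_features = 50000                              # max_features in the vectorizer
bytes_per_value = 8                             # assumed float64

dense_bytes = n_train * n_features * bytes_per_value
print(f"training rows: {n_train}")
print(f"dense matrix size: {dense_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense training matrix would need roughly 300 GB. `MultinomialNB` accepts sparse input and could be tried instead, although that is a different model from the one in the cell above.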

##### try to extend the context

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tf_idf_3 = TfidfVectorizer(ngram_range = (2, 3), max_features = 50000)
logistic = LogisticRegression()
# sklearn's pipeline
tfidf_logitistic_pipeline_3 = Pipeline([('tf_idf', tf_idf_3), 
                                 ('logit', logistic)])

tfidf_logitistic_pipeline_3.fit(X_f2, y_train_f2)
y_pred_f3 = tfidf_logitistic_pipeline_3.predict(y_f2)
print("\nAccuracy score : %f" %(accuracy_score(y_test_f2, y_pred_f3)))

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tf_idf_4 = TfidfVectorizer(ngram_range = (2, 4), max_features = 50000)
logistic = LogisticRegression()
# sklearn's pipeline
tfidf_logitistic_pipeline_4 = Pipeline([('tf_idf', tf_idf_4), 
                                 ('logit', logistic)])

tfidf_logitistic_pipeline_4.fit(X_f2, y_train_f2)
y_pred_f4 = tfidf_logitistic_pipeline_4.predict(y_f2)
print("\nAccuracy score : %f" %(accuracy_score(y_test_f2, y_pred_f4)))

##### feature 2.2 : only analyze parent comment text

In [8]:
from sklearn.model_selection import train_test_split
# X_comments holds the training parent comments and y_comments the test parent comments
X_comments, y_comments, y_train_temp, y_test_temp = \
        train_test_split(x['parent_comment'], y, random_state=17)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tf_idf = TfidfVectorizer(ngram_range = (1, 3), max_features = 50000)
logistic = LogisticRegression()
# sklearn's pipeline
tfidf_logitistic_pipeline = Pipeline([('tf_idf', tf_idf), 
                                 ('logit', logistic)])
tfidf_logitistic_pipeline.fit(X_comments, y_train_temp)
y_pred = tfidf_logitistic_pipeline.predict(y_comments)
print("\nAccuracy score : %f" %(accuracy_score(y_test_temp, y_pred)))

The bag-of-words features of the parent comment appear to have only a weak influence on the result.

##### different models

In [10]:
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

svc = LinearSVC()
tf_idf_2 = TfidfVectorizer(ngram_range = (1, 3), max_features = 50000)
clf_pipeline_1 = Pipeline([('tf_idf', tf_idf_2), 
                                 ('svc', svc)])

clf_pipeline_1.fit(X_comments, y_train_temp)
y_pred = clf_pipeline_1.predict(y_comments)
print("\nAccuracy score : %f" %(accuracy_score(y_test_temp, y_pred)))


Accuracy score : 0.574074


In [12]:
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
clf = tree.DecisionTreeClassifier(max_depth = 4)
tf_idf_2 = TfidfVectorizer(ngram_range = (1, 3), max_features = 50000)
clf_pipeline_2 = Pipeline([('tf_idf', tf_idf_2), 
                                 ('clf', clf)])

clf_pipeline_2.fit(X_comments, y_train_temp)
y_pred = clf_pipeline_2.predict(y_comments)
print("\nAccuracy score : %f" %(accuracy_score(y_test_temp, y_pred)))


Accuracy score : 0.518184


##### feature3. sentiment analysis

Sarcasm often says something positive and then something negative, so the mix of sentiment within a comment may be a useful signal.

##### feature 3.1: overall negative or overall positive; compute the fraction of positive tokens and the fraction of negative tokens
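
The two fractions can be illustrated without any sentiment library. The sketch below uses a hypothetical mini lexicon and a whitespace tokenizer purely for illustration; the notebook itself relies on VADER and `nltk.word_tokenize`.

```python
# Hypothetical mini lexicon for illustration only; the notebook uses VADER instead
POSITIVE = {"great", "love", "wonderful"}
NEGATIVE = {"horrible", "hate", "tired"}

def sentiment_fractions(text):
    """Return (neg_frac, pos_frac) over whitespace-separated tokens."""
    tokens = text.lower().split()
    if not tokens:
        # Avoid dividing by zero for empty comments
        return 0.0, 0.0
    neg = sum(token in NEGATIVE for token in tokens)
    pos = sum(token in POSITIVE for token in tokens)
    return neg / len(tokens), pos / len(tokens)

neg_frac, pos_frac = sentiment_fractions("I love being tired on a wonderful Monday")
print(neg_frac, pos_frac)
```

A comment that mixes both kinds of words gets non-zero values for both fractions, which is the pattern the sarcasm hypothesis above is looking for.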

###### data preparation for training the model

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
ss_analyzer = SentimentIntensityAnalyzer()

ss_list = []
neg_frac_list = []
pos_frac_list = []

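for index, row in data.iterrows():
    try:
        # Compound VADER score for the whole comment
        ss = ss_analyzer.polarity_scores(row['comment'])['compound']
        tokens = nltk.word_tokenize(row['comment'])
        neg_counts = 0
        pos_counts = 0
        for word in tokens:
            # Count tokens that VADER scores as entirely negative or entirely positive
            ps = ss_analyzer.polarity_scores(word)
            if ps["neg"] == 1.0:
                neg_counts += 1
            elif ps["pos"] == 1.0:
                pos_counts += 1
        neg_frac = neg_counts / len(tokens)
        pos_frac = pos_counts / len(tokens)
    except Exception:
        # Non-string or empty comments fall back to neutral values
        ss = 0
        neg_frac = 0
        pos_frac = 0
    # Print progress occasionally instead of once per row
    if index % 100000 == 0:
        print(index)
    ss_list.append(ss)
    neg_frac_list.append(neg_frac)
    pos_frac_list.append(pos_frac)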

In [None]:
len(ss_list)

In [None]:
len(neg_frac_list)

In [None]:
len(pos_frac_list)

In [None]:
len(data)

In [None]:
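# Reuse the frame's index so the scores line up row by row even if the index has gaps
s1 = pd.Series(ss_list, index=data.index)
data['ss'] = s1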
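
Assigning a Series to a column aligns on the index, which matters once rows have been dropped. A toy sketch (the frame and scores are made up for illustration) of what happens with and without a matching index:

```python
import pandas as pd

df = pd.DataFrame({"comment": ["a", None, "b", "c"]})
df = df.dropna(subset=["comment"])          # index is now [0, 2, 3]
scores = [0.1, 0.2, 0.3]                    # one score per remaining row

misaligned = df.copy()
misaligned["ss"] = pd.Series(scores)        # Series index is [0, 1, 2]

aligned = df.copy()
aligned["ss"] = pd.Series(scores, index=df.index)

print(misaligned)
print(aligned)
```

In the misaligned frame, row "b" receives the score meant for row "c" and row "c" ends up as NaN. Passing `index=df.index`, or resetting the index before the loop, avoids this.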

In [None]:
print(data)

In [None]:
# Reuse the frame's index so the fractions line up row by row
s2 = pd.Series(neg_frac_list, index=data.index)
data['neg_frac'] = s2
s3 = pd.Series(pos_frac_list, index=data.index)
data['pos_frac'] = s3

In [None]:
print(data)

In [None]:
data.to_csv("data_1.csv", index=False)

In [None]:
data_1 = pd.read_csv('data_1.csv')

In [None]:
print(data_1)

##### feature 3.2 sentiment score

In [25]:
data_1 = pd.read_csv('data_1.csv')
y = data_1.pop('label')
X = data_1
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, random_state=17)

In [None]:
print(X_train)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logistic = LogisticRegression()
logistic.fit(X_train[['ss']], y_train)
# Predict with the logistic model fitted on 'ss'
y_pred = logistic.predict(X_test[['ss']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))

In [28]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
from sklearn.metrics import accuracy_score
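# 'pos_frac' is listed twice here, so the score below reflects pos_frac alone;
# ['neg_frac', 'pos_frac'] was probably intended
logistic.fit(X_train[['pos_frac','pos_frac']], y_train)
y_pred = logistic.predict(X_test[['pos_frac','pos_frac']])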
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.538462
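
Selecting the same column name twice, as in the cell above, gives the model two identical columns rather than two features. A toy sketch (the values are made up) comparing that selection with one that uses both fractions:

```python
import pandas as pd

df = pd.DataFrame({"neg_frac": [0.0, 0.5], "pos_frac": [0.25, 0.0]})

repeated = df[["pos_frac", "pos_frac"]]   # same column selected twice
both = df[["neg_frac", "pos_frac"]]       # two distinct features

print(repeated.shape, list(repeated.columns))
print(both.shape, list(both.columns))
```

Both selections have two columns, but only the second one carries information about negative tokens.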


##### different models

In [26]:
svc = LinearSVC()
from sklearn.metrics import accuracy_score
# Only 'pos_frac' is used here (the column is listed twice)
svc.fit(X_train[['pos_frac','pos_frac']], y_train)
y_pred = svc.predict(X_test[['pos_frac','pos_frac']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.538462


In [29]:
clf = tree.DecisionTreeClassifier(max_depth = 4)
from sklearn.metrics import accuracy_score
# Only 'pos_frac' is used here (the column is listed twice)
clf.fit(X_train[['pos_frac','pos_frac']], y_train)
y_pred = clf.predict(X_test[['pos_frac','pos_frac']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.545810


In [30]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
from sklearn.metrics import accuracy_score
# Only 'pos_frac' is used here (the column is listed twice)
clf.fit(X_train[['pos_frac','pos_frac']], y_train)
y_pred = clf.predict(X_test[['pos_frac','pos_frac']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.498000


### parent comment sentiment calculation

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
ss_analyzer = SentimentIntensityAnalyzer()

p_ss_list = []
p_neg_frac_list = []
p_pos_frac_list = []

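for index, row in data.iterrows():
    try:
        # Compound VADER score for the whole parent comment
        ss = ss_analyzer.polarity_scores(row['parent_comment'])['compound']
        tokens = nltk.word_tokenize(row['parent_comment'])
        neg_counts = 0
        pos_counts = 0
        for word in tokens:
            # Count tokens that VADER scores as entirely negative or entirely positive
            ps = ss_analyzer.polarity_scores(word)
            if ps["neg"] == 1.0:
                neg_counts += 1
            elif ps["pos"] == 1.0:
                pos_counts += 1
        neg_frac = neg_counts / len(tokens)
        pos_frac = pos_counts / len(tokens)
    except Exception:
        # Non-string or empty parent comments fall back to neutral values
        ss = 0
        neg_frac = 0
        pos_frac = 0
    # Print progress occasionally instead of once per row
    if index % 100000 == 0:
        print(index)
    p_ss_list.append(ss)
    p_neg_frac_list.append(neg_frac)
    p_pos_frac_list.append(pos_frac)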

In [None]:
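# Reuse the frame's index so the parent-comment values line up row by row
s1_p = pd.Series(p_ss_list, index=data.index)
data['p_ss'] = s1_p
s2_p = pd.Series(p_neg_frac_list, index=data.index)
data['p_neg_frac'] = s2_p
s3_p = pd.Series(p_pos_frac_list, index=data.index)
data['p_pos_frac'] = s3_p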

In [None]:
print(data)

In [None]:
data_1 = pd.read_csv('data_1.csv')

In [None]:
s1_p = pd.Series(p_ss_list)
data_1['p_ss'] = s1_p
s2_p = pd.Series(p_neg_frac_list)
data_1['p_neg_frac'] = s2_p
s3_p = pd.Series(p_pos_frac_list)
data_1['p_pos_frac'] = s3_p

In [None]:
print(data_1)

In [None]:
data_1.to_csv("data_2.csv", index=False)

In [32]:
data_2 = pd.read_csv('data_2.csv')

##### feature analysis

In [33]:
y = data_2.pop('label')
X = data_2
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, random_state=17)

In [None]:
logistic = LogisticRegression()
from sklearn.metrics import accuracy_score
logistic.fit(X_train[['p_ss']], y_train)
y_pred = logistic.predict(X_test[['p_ss']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))

In [34]:
logistic = LogisticRegression()
from sklearn.metrics import accuracy_score
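# This selects the comment's 'pos_frac' twice rather than the parent-comment features
# ('p_neg_frac', 'p_pos_frac'), so the score matches the earlier comment-only run
logistic.fit(X_train[['pos_frac','pos_frac']], y_train)
y_pred = logistic.predict(X_test[['pos_frac','pos_frac']])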
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.538462


##### different models

In [35]:
svc = LinearSVC()
from sklearn.metrics import accuracy_score
# Uses the comment's 'pos_frac' (listed twice), not the parent-comment features
svc.fit(X_train[['pos_frac','pos_frac']], y_train)
y_pred = svc.predict(X_test[['pos_frac','pos_frac']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.538462


In [36]:
clf = tree.DecisionTreeClassifier(max_depth = 4)
from sklearn.metrics import accuracy_score
# Uses the comment's 'pos_frac' (listed twice), not the parent-comment features
clf.fit(X_train[['pos_frac','pos_frac']], y_train)
y_pred = clf.predict(X_test[['pos_frac','pos_frac']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.545810


In [23]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
from sklearn.metrics import accuracy_score
# Uses the comment's 'pos_frac' (listed twice), not the parent-comment features
clf.fit(X_train[['pos_frac','pos_frac']], y_train)
y_pred = clf.predict(X_test[['pos_frac','pos_frac']])
print("\nAccuracy score : %f" %(accuracy_score(y_test, y_pred)))


Accuracy score : 0.523377
