# Problem Statement

Shakespeare probably saw a world expressing itself on Twitter before anyone did! And he would have known whether a tweet was sarcastic or not. Machines and bots, however, need help from data scientists to decipher sarcasm.

Your task for this competition is to build a machine learning model that, given a tweet, correctly classifies it as sarcastic or non-sarcastic. This is one of the most interesting problems in Natural Language Processing and Text Mining. You never know: the next generation of bots might come and thank you for making them smarter!

The prediction has to be made using only the text of the tweet.

# Dataset

Two files - one each for training and testing - are provided.

**training.csv**: This file contains three columns:

- ID: ID for each tweet
- tweet: the text of the tweet
- label: the label for the tweet (‘sarcastic’ or ‘non-sarcastic’)

**test.csv**: This file has two columns, the ID and the tweet. Predictions on this set are the ones that get judged.

**submission.csv**: This file contains the model's predictions on the test file. It has to contain two columns (ID and label). Each label takes one of the two string values, ‘sarcastic’ or ‘non-sarcastic’.
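The submission format described above can be checked before uploading. The sketch below is only an illustration: the helper name `validate_submission` and the assumption that the header row is exactly `ID,label` are not taken from the competition rules.

```python
import csv

ALLOWED_LABELS = {"sarcastic", "non-sarcastic"}

def validate_submission(rows):
    """Return a list of problems found in CSV rows (header row first).

    Hypothetical helper: assumes the header must be exactly ID,label.
    """
    rows = list(rows)
    if not rows or rows[0] != ["ID", "label"]:
        return ["header must be exactly: ID,label"]
    problems = []
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append("line %d: expected 2 columns, got %d" % (line_no, len(row)))
        elif row[1] not in ALLOWED_LABELS:
            problems.append("line %d: unexpected label %r" % (line_no, row[1]))
    return problems

# Small in-memory example; the last row carries an invalid label.
sample = "ID,label\nT000543656,non-sarcastic\nT000543657,sarcastic\nT000543658,maybe\n"
problems = validate_submission(csv.reader(sample.splitlines()))
print(problems)
```

For a real file, the same function could be fed `csv.reader(open(path))` instead of the in-memory sample.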

# Evaluation

The metric used for evaluating the predictions for this problem is simply the F1-score.
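For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; which class the competition treats as positive is not stated here, so treating ‘sarcastic’ as the positive class is an assumption, and the counts are made up for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 sarcastic tweets caught, 2 false alarms, 2 missed.
score = f1_score(tp=8, fp=2, fn=2)
print(round(score, 4))
```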

The test data is split 25:75 between the public and private leaderboards.

# Import packages

In [1]:
import graphlab as gl
import graphlab.aggregate as agg

In [2]:
import nltk
#nltk.download()
#install the following
#1. Models > Averaged Perceptron Tagger (2.4 MB)
#2. Models > VADER sentiment lexicon (88.4 KB)
#3. Corpora > Stopwords corpus (10.2 KB)

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
sent = SentimentIntensityAnalyzer() #Create a sentiment transformer
tokenizer = RegexpTokenizer(r'\w+') #Create a tokenizer transformer
from nltk.probability import FreqDist

from collections import Counter
import itertools



# Import Data

In [3]:
train = gl.SFrame('data/train_MLWARE1.csv')
test = gl.SFrame('data/test_MLWARE1.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


# Extract required cols

In [4]:
id = test['ID']
train['set'] = 'train'
test['set'] = 'test'
test['label'] = 'None'
train = train.select_columns(['tweet','set','label'])
test = test.select_columns(['tweet','set','label'])

# Feature Engineering

In [5]:
# Join Train-Test
df = train.append(test)

In [6]:
def net_sentiment(text):
    '''
    VADER positive score minus negative score, scoring each tweet only once
    '''
    scores = sent.polarity_scores(text)
    return scores['pos'] - scores['neg']

df['senti_score'] = df['tweet'].apply(net_sentiment)
df['tweet_length'] = df['tweet'].apply(lambda x: len(x))
df['caps_count'] = df['tweet'].apply(lambda x: sum(1 for c in x if c.isupper()))
df['exclaim_count'] = df['tweet'].apply(lambda x: x.count("!"))
df['ques_count'] = df['tweet'].apply(lambda x: x.count("?"))
df['comma_count'] = df['tweet'].apply(lambda x: x.count(","))
df['quote_count'] = df['tweet'].apply(lambda x: x.count("'")) + df['tweet'].apply(lambda x: x.count('"'))
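To see what the character-level features look like for a single tweet, here is a small standalone sketch that does not need GraphLab. The helper name `surface_features` and the sample tweet are made up for illustration; the sentiment score is left out because it needs the VADER lexicon.

```python
def surface_features(tweet):
    """Character-level features mirroring the columns built above."""
    return {
        "tweet_length": len(tweet),
        "caps_count": sum(1 for c in tweet if c.isupper()),
        "exclaim_count": tweet.count("!"),
        "ques_count": tweet.count("?"),
        "comma_count": tweet.count(","),
        "quote_count": tweet.count("'") + tweet.count('"'),
    }

features = surface_features("Wow, I LOVE Mondays!")
print(features)
```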

In [7]:
def pos_ratio(pos_tag):
    '''
    Function to find, for each POS tag, its share of the total tokens
    pos_tag is a list of (word, tag) pairs
    '''
    counts = Counter(tag for word,tag in pos_tag)
    total = len(pos_tag)
    return dict((tag, float(count)/total) for tag,count in counts.items())

def tokens_count(tokens):
    '''
    Function to count the tokens
    '''
    counts = Counter(word for word in tokens)
    return dict((word, float(count)) for word,count in counts.items())
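A quick standalone check of the two helpers is sketched below. The functions are repeated so the snippet runs on its own, and the tagged tokens are hand-written for illustration rather than produced by `nltk.pos_tag`.

```python
from collections import Counter

def pos_ratio(pos_tag):
    """Share of each POS tag among all (word, tag) pairs."""
    counts = Counter(tag for word, tag in pos_tag)
    total = len(pos_tag)
    return dict((tag, float(count) / total) for tag, count in counts.items())

def tokens_count(tokens):
    """Occurrences of each token, as floats."""
    counts = Counter(tokens)
    return dict((word, float(count)) for word, count in counts.items())

# Hand-written (word, tag) pairs, not real tagger output
tagged = [("love", "NN"), ("traffic", "NN"), ("really", "RB"), ("forgot", "VBD")]
ratios = pos_ratio(tagged)
counts = tokens_count(["love", "love", "traffic"])
print(ratios)
print(counts)
```

Both functions return dictionaries, which is what lets the classifier later unpack them into one sparse feature per tag or token.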

In [8]:
df['tokens'] = df['tweet'].apply(lambda x: tokenizer.tokenize(x)) #tokenize
df['tokens'] = df['tokens'].apply(lambda x: [token.lower() for token in x if token.lower() not in stops]) #lower case & remove stop words
df['pos_tag'] = df['tokens'].apply(lambda x: nltk.pos_tag(x)) #Parts of Speech (POS) tags
df['pos_ratio'] = df['pos_tag'].apply(lambda x: pos_ratio(x))
df['tokens'] = df['tokens'].apply(lambda x: [stemmer.stem(w) for w in x]) #Stem the tokens
df['tokens_count'] = df['tokens'].apply(lambda x: tokens_count(x))
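Many tweets in this dataset appear to be stored as stringified byte literals (for example `b'oh yea that makes sense '`), so the tokenizer emits a stray `b` token for them. One possible cleanup, applied before tokenizing, is sketched below. The helper `strip_bytes_wrapper` is hypothetical and was not part of the pipeline that produced the results in this notebook.

```python
import re

# Matches text of the form b'...' or b"..." with the same quote at both ends
_BYTES_WRAPPER = re.compile(r"""^b(['"])(.*)\1$""", re.DOTALL)

def strip_bytes_wrapper(text):
    """Remove a leading b' / b" and the matching trailing quote, if present."""
    match = _BYTES_WRAPPER.match(text.strip())
    return match.group(2) if match else text

cleaned = strip_bytes_wrapper("b'oh yea that makes sense '")
print(repr(cleaned))
```

With GraphLab this could be applied as `df['tweet'].apply(strip_bytes_wrapper)`; whether it helps the leaderboard score is untested here.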

In [9]:
df

tweet,set,label,senti_score,tweet_length,caps_count,exclaim_count,ques_count
b'oh yea that makes sense ' ...,train,sarcastic,0.0,27,0,0,0
Estas enfermedad a un cargo poltico tu como ...,train,sarcastic,0.0,70,1,0,0
@alleygirl2409 until i\'m and all the old men will ...,train,sarcastic,0.0,93,0,0,0
"b""@sarinas it had been chanted peacefully you ...",train,sarcastic,0.331,85,0,0,0
"b""there's nothing like being on vacation and ...",train,sarcastic,-0.161,71,0,0,0
People who are sarcastic tend to be better pro ...,train,sarcastic,0.0,81,2,0,0
b'May I block you too RT But what if he or she ...,train,sarcastic,-0.252,100,5,0,0
b'Wow I really forgot how much I love the traffic ...,train,sarcastic,0.344,55,3,0,0
b'How perfect my internet just went out thanks ...,train,sarcastic,0.485,53,1,0,0
b'Love having no voice $$SAR$$' ...,train,sarcastic,-0.355,31,4,0,0

comma_count,quote_count,tokens,pos_tag,pos_ratio
0,2,"[b, oh, yea, make, sens]","[[b, NN], [oh, MD], [yea, VB], [makes, VBZ], ...","{'MD': 0.2, 'VB': 0.2, 'NN': 0.4, 'VBZ': 0.2} ..."
0,1,"[esta, enfermedad, un, cargo, poltico, tu, c ...","[[estas, NNS], [enfermedad, VBP], [un, ...","{'VBP': 0.1111111111111111, ..."
0,1,"[alleygirl2409, old, men, final, date, sarcasmsun, ...","[[alleygirl2409, JJ], [old, JJ], [men, NNS], ...","{'NN': 0.4444444444444444, ..."
0,2,"[b, sarina, chant, peac, deni, hypocrisysat, mar, ...","[[b, NN], [sarinas, NNS], [chanted, VBD], ...","{'RB': 0.1111111111111111, ' ..."
0,2,"[b, noth, like, vacat, homework] ...","[[b, NN], [nothing, NN], [like, IN], [vacation, ...","{'NN': 0.8, 'IN': 0.2}"
0,1,"[peopl, sarcast, tend, better, problem, solver, ...","[[people, NNS], [sarcastic, JJ], [tend, ...","{'VBP': 0.25, 'NNS': 0.25, 'JJR': 0.125, ' ..."
0,4,"[b, may, block, rt, understand, say, fuck] ...","[[b, NN], [may, MD], [block, VB], [rt, NN], ...","{'MD': 0.14285714285714285, ..."
0,1,"[b, wow, realli, forgot, much, love, traffic, ...","[[b, NN], [wow, NN], [really, RB], [forgot, ...","{'JJ': 0.125, 'RB': 0.125, 'NN': 0.625, ..."
0,2,"[b, perfect, internet, went, thank, time] ...","[[b, NN], [perfect, NN], [internet, NN], [went, ...","{'NNS': 0.16666666666666666, ..."
0,2,"[b, love, voic, sar]","[[b, NN], [love, NN], [voice, NN], [sar, NN]] ...",{'NN': 1.0}

tokens_count
"{'yea': 1.0, 'sens': 1.0, 'make': 1.0, 'b': 1.0, ..."
"{'cargo': 1.0, 'pblico': 1.0, 'esta': 1.0, 'tu': ..."
"{'mar': 1.0, 'old': 1.0, 'men': 1.0, 'ist': 1.0, ..."
"{'b': 1.0, 'hypocrisysat': 1.0, ..."
"{'b': 1.0, 'vacat': 1.0, 'noth': 1.0, 'homework': ..."
"{'better': 1.0, 'solver': 1.0, 'peopl': 1.0, ..."
"{'rt': 1.0, 'b': 1.0, 'may': 1.0, 'fuck': 1.0, ..."
"{'b': 1.0, 'love': 1.0, 'wow': 1.0, 'scene': ..."
"{'perfect': 1.0, 'b': 1.0, 'thank': 1.0, ..."
"{'sar': 1.0, 'b': 1.0, 'love': 1.0, 'voic': ..."


In [10]:
# Extract the tokens for sarcastic and non-sarcastic
sarcasm = list(df.filter_by(['sarcastic'], 'label')['tokens'])
non_sarcasm = list(df.filter_by(['non-sarcastic'], 'label')['tokens'])

In [11]:
# Find the frequency distribution
sarcasm_dist = FreqDist(list(itertools.chain(*sarcasm)))
non_sarcasm_dist = FreqDist(list(itertools.chain(*non_sarcasm)))

In [12]:
print sarcasm_dist.most_common(50)

[('b', 38877), ('love', 5509), ('sarcasm', 4905), ('get', 3282), ('like', 3097), ('go', 3058), ('day', 3057), ('great', 2666), ('mar', 2447), ('sar', 2257), ('good', 2194), ('thank', 2149), ('peopl', 2013), ('know', 1983), ('realli', 1958), ('oh', 1946), ('serious', 1894), ('work', 1890), ('today', 1786), ('time', 1737), ('well', 1717), ('make', 1710), ('one', 1617), ('pkt', 1455), ('fun', 1390), ('feel', 1349), ('night', 1346), ('look', 1320), ('see', 1319), ('thing', 1313), ('right', 1276), ('much', 1260), ('think', 1211), ('want', 1205), ('start', 1201), ('wait', 1156), ('back', 1142), ('would', 1129), ('better', 1117), ('wow', 1111), ('got', 1079), ('say', 1075), ('need', 1067), ('best', 1051), ('morn', 1046), ('school', 1039), ('yay', 1009), ('ist', 989), ('come', 986), ('hour', 963)]


In [13]:
print non_sarcasm_dist.most_common(50)

[('b', 31919), ('rt', 6701), ('day', 5414), ('happi', 3915), ('love', 3477), ('thank', 2566), ('today', 2137), ('father', 2063), ('new', 2036), ('go', 1945), ('amp', 1877), ('get', 1848), ('see', 1822), ('time', 1805), ('make', 1733), ('wait', 1621), ('birthday', 1604), ('life', 1497), ('good', 1456), ('week', 1393), ('one', 1315), ('look', 1261), ('u', 1247), ('great', 1208), ('tomorrow', 1183), ('come', 1143), ('first', 1068), ('work', 1042), ('start', 1030), ('got', 1003), ('like', 973), ('weekend', 971), ('feel', 944), ('best', 886), ('friday', 861), ('smile', 859), ('final', 819), ('excit', 800), ('us', 790), ('next', 786), ('dad', 774), ('friend', 773), ('readi', 773), ('year', 770), ('morn', 768), ('back', 752), ('tonight', 751), ('take', 711), ('beauti', 701), ('bless', 679)]


A few words such as 'sarcasm', 'sar', 'oh', 'realli' and 'serious' stand out in the sarcastic tweets compared to the non-sarcastic ones. The words 'sarcasm' and 'sar' may be leakage introduced by the way the tweets were scraped.
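Rather than comparing the two frequency lists by eye, the tokens can be ranked by how much more often they occur in one class than in the other. The sketch below uses add-one smoothing on relative frequencies. The function name, the smoothing choice and the toy token lists are all assumptions for illustration, not part of the original analysis.

```python
from collections import Counter

def distinctive_tokens(class_a_docs, class_b_docs, min_count=2, top_n=5):
    """Rank tokens by the ratio of their smoothed rate in class A to class B."""
    a_counts = Counter(tok for doc in class_a_docs for tok in doc)
    b_counts = Counter(tok for doc in class_b_docs for tok in doc)
    a_total = float(sum(a_counts.values()))
    b_total = float(sum(b_counts.values()))
    vocab_size = len(set(a_counts) | set(b_counts))
    scored = []
    for tok, count in a_counts.items():
        if count < min_count:
            continue
        # Add-one smoothing keeps the ratio finite for tokens unseen in class B
        a_rate = (count + 1) / (a_total + vocab_size)
        b_rate = (b_counts[tok] + 1) / (b_total + vocab_size)
        scored.append((tok, a_rate / b_rate))
    # Highest ratio first; ties broken alphabetically
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Toy stemmed token lists, made up for illustration
sarcastic_docs = [["oh", "great", "love", "monday"],
                  ["oh", "realli", "love", "traffic"],
                  ["great", "oh", "serious"]]
plain_docs = [["happi", "birthday", "love"],
              ["great", "day", "love", "friend"],
              ["happi", "friday"]]
ranked = distinctive_tokens(sarcastic_docs, plain_docs)
print(ranked)
```

On the real data, the `sarcasm` and `non_sarcasm` token lists from the cells above could be passed in directly.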

# Modelling

In [14]:
# Train-Test Split
train_data = df.filter_by(['train'], 'set')
test_data = df.filter_by(['test'], 'set')

In [15]:
# Train-Validation Split
train_data,val_data = train_data.random_split(.8, seed=123)

In [16]:
# Modelling using GBM
model_gbm1 = gl.boosted_trees_classifier.create(train_data, validation_set=val_data,
                                         target='label',
                                         features=['senti_score','tweet_length','tokens_count','pos_ratio','caps_count','exclaim_count','ques_count','comma_count','quote_count',],
                                         max_iterations=2500, max_depth=6, step_size=0.3,
                                         min_loss_reduction=1e-6, min_child_weight=0.1,
                                         row_subsample=1.0, column_subsample=0.8, random_seed=123,
                                         metric='auto')

In [17]:
model_gbm1

Class                          : BoostedTreesClassifier

Schema
------
Number of examples             : 73155
Number of feature columns      : 9
Number of unpacked features    : 29769
Number of classes              : 2

Settings
--------
Number of trees                : 2500
Max tree depth                 : 6
Training time (sec)            : 238.1658
Training accuracy              : 0.9969
Validation accuracy            : 0.9572
Training log_loss              : 0.0279
Validation log_loss            : 0.1136

In [19]:
model_gbm1.evaluate(val_data)

{'accuracy': 0.9571735655624759,
 'auc': 0.9908550741143605,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +---------------+-----------------+-------+
 |  target_label | predicted_label | count |
 +---------------+-----------------+-------+
 |   sarcastic   |    sarcastic    |  9658 |
 |   sarcastic   |  non-sarcastic  |  570  |
 | non-sarcastic |    sarcastic    |  207  |
 | non-sarcastic |  non-sarcastic  |  7708 |
 +---------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9613298163539542,
 'log_loss': 0.11355411998213585,
 'precision': 0.9790167257982767,
 'recall': 0.9442706296441142,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+-------+------+
 | threshold |      fpr       |      tpr       |   p   |  n   |
 +-----------+----------------+----------------+-------+------+
 |    0.0    |      1.0

In [20]:
# Make Predictions
predictions = model_gbm1.classify(test_data)

In [21]:
submit = gl.SFrame({'ID':id,'label':predictions['class']})
submit.head()

ID,label
T000543656,non-sarcastic
T000543657,sarcastic
T000543658,sarcastic
T000543659,non-sarcastic
T000543660,sarcastic
T000543661,non-sarcastic
T000543662,non-sarcastic
T000543663,non-sarcastic
T000543664,sarcastic
T000543665,non-sarcastic


In [22]:
submit.save('data/submit.csv', format='csv')

This submission achieves the following F1 scores:

- Validation: 0.96132
- Public LB: 0.6642
- Private LB: 0.619765

As you can see, there is a huge gap between the validation score and the public LB score, which makes the problem trickier.

Including higher-order n-grams or skip-grams makes the number of features grow very quickly, into the order of millions.
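The growth is easy to see even on a toy corpus. The sketch below counts the distinct features obtained when all n-grams up to a given order are included. The helper names and the three toy documents are made up for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary_sizes(docs, max_n):
    """Cumulative number of distinct n-gram features for orders 1..max_n."""
    sizes = {}
    seen = set()
    for n in range(1, max_n + 1):
        for doc in docs:
            seen.update(ngrams(doc, n))
        sizes[n] = len(seen)
    return sizes

# Toy stemmed documents, made up for illustration
docs = [["love", "monday", "morn"],
        ["love", "friday", "night"],
        ["great", "monday", "morn"]]
sizes = vocabulary_sizes(docs, max_n=3)
print(sizes)
```

Here six unigrams already become fourteen features once bigrams and trigrams are added. On tens of thousands of tweets, most higher-order n-grams occur only once, so the feature space keeps growing with the corpus.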