# Problem Statement

Shakespeare probably foresaw a world expressing itself on Twitter before anyone else did! He would also have known whether a tweet was sarcastic or not. Our machines and bots, however, need help from data scientists to decipher sarcasm.

Your task in this competition is to build a machine learning model that, given a tweet, classifies it correctly as sarcastic or non-sarcastic. This is one of the most interesting problems in Natural Language Processing and Text Mining. You never know - the next generation of bots might come and thank you for making them smarter!

The prediction has to be made using only the text of the tweet.

# Dataset

Two files - one each for training and testing - are provided.

**training.csv**: This file contains three columns:
- ID: ID for each tweet
- tweet: the text of the tweet
- label: the label for the tweet (‘sarcastic’ or ‘non-sarcastic’)

**test.csv**: This file has two columns containing the ID and the tweet. Predictions on this set are judged.

**submission.csv**: This file contains the model's predictions on the test file. It must contain two columns (ID and label). Each label takes one of two string values, ‘sarcastic’ or ‘non-sarcastic’.

# Evaluation

The metric used for evaluating the predictions for this problem is simply the F1-score.
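
For reference, the F1-score is the harmonic mean of precision and recall for the positive class. The competition text does not say which class is treated as positive; the formula below assumes it is ‘sarcastic’, with $TP$, $FP$ and $FN$ denoting true positives, false positives and false negatives for that class.

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$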

The public:private leaderboard split of the test data is 25:75.

# Import packages

In [2]:
import graphlab as gl
import graphlab.aggregate as agg

In [3]:
train = gl.SFrame('data/train_MLWARE1.csv')
test = gl.SFrame('data/test_MLWARE1.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [5]:
test_id = test['ID']  # named test_id to avoid shadowing the built-in id()
label = train['label']
train_tweet = train['tweet']
test_tweet = test['tweet']

# Approach 1

This approach builds simple features from word counts, i.e. 1-grams, which results in 37782 features.

In [6]:
# Building 1-grams / extracting word counts
train['ngram1'] = gl.text_analytics.count_ngrams(train['tweet'], 1, to_lower=True)
test['ngram1'] = gl.text_analytics.count_ngrams(test['tweet'], 1, to_lower=True)
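
To make the representation concrete, here is a minimal plain-Python sketch of 1-gram counting. It is not GraphLab's implementation: the tokenizer (`\w+` on lower-cased text), the function name `count_unigrams` and the two example tweets are assumptions for illustration, and `count_ngrams` may split text differently.

```python
import re
from collections import Counter

def count_unigrams(text):
    """Lower-case the text, extract word tokens, and count each one."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

# Hypothetical tweets, used only to illustrate the representation
tweets = ["Oh great, another Monday", "Great job, great timing"]

# One word -> count mapping per tweet, analogous to the 'ngram1' column
counts = [count_unigrams(t) for t in tweets]

# The number of unpacked features is the number of distinct words overall
vocabulary = set(word for c in counts for word in c)

print("count of 'great' in tweet 2: %d" % counts[1]["great"])
print("distinct words: %d" % len(vocabulary))
```

Each distinct word becomes one sparse feature, which is why the feature count grows with the size of the vocabulary.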

In [7]:
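# Train-validation split (80/20). Note: `test_data` here is the held-out
# validation set, not the competition test set loaded above as `test`.
train_data,test_data = train.random_split(.8, seed=0)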

In [8]:
# Modelling using GBM
model_gbm1 = gl.boosted_trees_classifier.create(train_data, validation_set=test_data,
                                         target='label', features=['ngram1'],
                                         max_iterations=2500, max_depth=6, step_size=0.3,
                                         min_loss_reduction=1e-6, min_child_weight=0.1,
                                         row_subsample=1.0, column_subsample=0.8, random_seed=123,
                                         metric='auto')

In [9]:
model_gbm1

Class                          : BoostedTreesClassifier

Schema
------
Number of examples             : 72968
Number of feature columns      : 1
Number of unpacked features    : 37782
Number of classes              : 2

Settings
--------
Number of trees                : 2500
Max tree depth                 : 6
Training time (sec)            : 208.5977
Training accuracy              : 0.9719
Validation accuracy            : 0.8926
Training log_loss              : 0.1165
Validation log_loss            : 0.2518

In [11]:
model_gbm1.evaluate(test_data)

{'accuracy': 0.8925804691762138,
 'auc': 0.9609499612714539,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +---------------+-----------------+-------+
 |  target_label | predicted_label | count |
 +---------------+-----------------+-------+
 |   sarcastic   |    sarcastic    |  9289 |
 |   sarcastic   |  non-sarcastic  |  946  |
 | non-sarcastic |    sarcastic    |  1023 |
 | non-sarcastic |  non-sarcastic  |  7072 |
 +---------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9041709251958924,
 'log_loss': 0.2518241364558384,
 'precision': 0.9007951900698216,
 'recall': 0.9075720566682951,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+-----+-------+------+
 | threshold |      fpr       | tpr |   p   |  n   |
 +-----------+----------------+-----+-------+------+
 |    0.0    |      1.0       | 1.0 | 10235 | 8095 |
 |  

In [14]:
# Make Predictions
predictions = model_gbm1.classify(test)

In [15]:
predictions.head()

class,probability
sarcastic,0.647453010082
sarcastic,0.996821284294
sarcastic,0.995850563049
non-sarcastic,0.511426717043
sarcastic,0.973896384239
non-sarcastic,0.980965374038
non-sarcastic,0.999190053495
non-sarcastic,0.871518865228
sarcastic,0.948249518871
non-sarcastic,0.999989347376


In [16]:
submit = gl.SFrame({'ID':test['ID'],'label':predictions['class']})
submit.head()

ID,label
T000543656,sarcastic
T000543657,sarcastic
T000543658,sarcastic
T000543659,non-sarcastic
T000543660,sarcastic
T000543661,non-sarcastic
T000543662,non-sarcastic
T000543663,non-sarcastic
T000543664,sarcastic
T000543665,non-sarcastic


In [17]:
submit.save('data/submit.csv', format='csv')

This model achieves the following F1 scores:

- Validation: 0.904171
- Public LB: 0.6642
- Private LB: 0.619765
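
The validation figure can be cross-checked against the confusion matrix printed by `model_gbm1.evaluate(test_data)`. The sketch below copies the four counts from that output and treats ‘sarcastic’ as the positive class, an assumption that is consistent with the precision and recall GraphLab reports.

```python
# Counts copied from the validation confusion matrix in the evaluate() output
tp = 9289  # target sarcastic, predicted sarcastic
fn = 946   # target sarcastic, predicted non-sarcastic
fp = 1023  # target non-sarcastic, predicted sarcastic
tn = 7072  # target non-sarcastic, predicted non-sarcastic

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / float(tp + fn + fp + tn)

print("precision=%.6f recall=%.6f f1=%.6f accuracy=%.6f"
      % (precision, recall, f1, accuracy))
```

The recomputed values should agree with the reported validation metrics, so the gap to the leaderboard is not an artifact of how the validation F1 was calculated.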

As you can see, there is a huge gap between the validation score and the public LB score, which makes the problem trickier: validation performance does not carry over to the leaderboard.

Including 2-grams and/or 3-grams sharply increases the number of features, to the order of millions. This doesn't improve the F1 score; in fact, the result is worse than the simple 1-gram model.
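
The growth in feature count is easy to see on a toy corpus. The following sketch counts the distinct n-grams up to a given order; the three example sentences, the whitespace tokenizer and the helper names are assumptions for illustration, and growth on real tweets is far steeper than on this tiny example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary_size(docs, max_n):
    """Count distinct n-grams of every order from 1 up to max_n."""
    vocab = set()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return len(vocab)

# Hypothetical documents, used only for illustration
docs = [
    "i just love waiting in line",
    "i love mondays so much",
    "waiting in line is so much fun",
]

sizes = {n: vocabulary_size(docs, n) for n in (1, 2, 3)}
for n in (1, 2, 3):
    print("up to %d-grams: %d features" % (n, sizes[n]))
```

Most higher-order n-grams occur only once or twice, so they add many sparse columns while carrying little signal that generalizes.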

# Approach 2

This approach tries to make the 1-gram model more efficient by reducing the number of features:
- Trimming rare words
- Extracting parts of speech, viz. adjectives
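
As a rough illustration of the first idea, the sketch below trims rare words in plain Python. It is not GraphLab's `RareWordTrimmer`: the helper names, the whitespace tokenizer, the threshold of 2 and the example sentences are all assumptions. The point it shows is that the vocabulary is learned from the training documents and then applied unchanged to new text.

```python
from collections import Counter

def fit_vocabulary(docs, threshold=2):
    """Return the words that occur at least `threshold` times across all docs."""
    totals = Counter(word for doc in docs for word in doc.lower().split())
    return set(word for word, count in totals.items() if count >= threshold)

def trim(doc, vocabulary):
    """Drop every word that is not in the fitted vocabulary."""
    return " ".join(w for w in doc.lower().split() if w in vocabulary)

# Hypothetical training documents, used only for illustration
train_docs = ["so glad it is monday", "monday again so fun", "glad that is over"]
vocab = fit_vocabulary(train_docs, threshold=2)

print(sorted(vocab))
print(trim("so glad monday is finally here", vocab))
```

Applying one fitted vocabulary to both training and test text keeps the feature space identical between the two.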

In [18]:
# Create a RareWordTrimmer transformer
from graphlab.toolkits.feature_engineering import RareWordTrimmer
trimmer = RareWordTrimmer()

# Create a PartOfSpeechExtractor transformer
from graphlab.toolkits.feature_engineering import PartOfSpeechExtractor
transformer = PartOfSpeechExtractor()

In [19]:
# RareWordTrimmer: learn the vocabulary on the training set, then apply it to the test set
train_rare = trimmer.fit_transform(gl.SFrame(train))
test_rare = trimmer.transform(gl.SFrame(test))

In [None]:
# PartOfSpeechExtractor: uses `transformer`, the extractor created above
train_pos = transformer.fit_transform(gl.SFrame(train))
test_pos = transformer.transform(gl.SFrame(test))

This doesn't improve the F1 score on either the validation set or the public LB (0.56).

# Approach 3

This approach uses `pandas` and `nltk` to reduce the number of 1-gram features through stemming.

In [19]:
from __future__ import unicode_literals  # __future__ imports must come first in the cell
import re

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

In [20]:
# Build the stop-word set once instead of on every call
STOPS = set(stopwords.words("english"))

def tweet_to_words(raw_tweet):
    '''
    Function to ...
       Clean tweets
       Extract only text from tweets
       Drop stop words
       Stem the words
       Join words to sentence
    '''
    # Strip a leading b' or b" only at the start of the tweet (assumed to be a
    # bytes-literal prefix); an unanchored pattern would also hit words like "Bob's"
    doc = re.sub(r"^b['\"]", "", raw_tweet)
    letters_only = re.sub("[^a-zA-Z]", " ", doc)
    words = letters_only.lower().split()
    meaningful_words = [w for w in words if w not in STOPS]
    stem_words = [stemmer.stem(mw) for mw in meaningful_words]
    return str(" ".join(stem_words))

In [8]:
train['tweet'] = train_tweet.apply(tweet_to_words)
test['tweet'] = test_tweet.apply(tweet_to_words)
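
To see what the cleaning steps do to a single tweet without needing the NLTK data, here is a reduced sketch. It keeps only the letters-only, lower-casing and stop-word steps; the tiny stop list, the function name `clean_tweet` and the example tweet are assumptions for illustration, and stemming is deliberately left out.

```python
import re

# Tiny illustrative stop list; the notebook uses NLTK's full English list
STOP_WORDS = {"the", "is", "a", "to", "i", "so"}

def clean_tweet(raw_tweet, stop_words=STOP_WORDS):
    """Keep letters only, lower-case, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw_tweet)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stop_words)

# Hypothetical tweet, used only for illustration
cleaned = clean_tweet("I LOVE waiting 2 hours @ the DMV!!! #blessed")
print(cleaned)
```

Note that this kind of cleaning also discards capitalization, punctuation, digits and the `#` and `@` markers.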

This doesn't improve the F1 score on either the validation set or the public LB (0.56).