# Lemmatization, Stemming and StopWords for sarcasm detection

The goal of this notebook is to try a LogisticRegression model for detecting sarcastic comments by Reddit users. We will try different text preprocessing methods (stop words, stemming and lemmatization) and compare the results.
The notebook consists of the following parts:
* Loading and preparing data
* EDA
* LogisticRegression model
    * Bag of Words
    * TfIdf
    * Stop words
    * Stemming
    * Lemmatization
* Adding non-text features
* Conclusion

## Loading data

First we import the libraries we will use. Most of them are well known; the [eli5](https://eli5.readthedocs.io/en/latest/index.html) library is used for inspecting model weights.

In [None]:
import os
import numpy as np
import pandas as pd
import eli5
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import seaborn as sns
from scipy.sparse import csr_matrix, hstack
from datetime import datetime
import string
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [None]:
# Load the data. There is a test dataset in the folder, but we cannot use it because we have no answers. Therefore we create train and test sets from the training data
sarcasm_df = pd.read_csv("../input/sarcasm/train-balanced-sarcasm.csv")

Take a look at the data. There are not many features:
* label - our target feature, we will separate it
* comment - our main text feature
* author - this field should be treated as categorical
* subreddit - the subreddit where the comment was published
* score, ups, downs - the number of votes for/against the comment. According to the Reddit FAQ, ups and downs are fuzzed, so we will use only the score field
* date and created_utc - contain the same information, we will use created_utc
* parent_comment - we should inspect this field, because it's not clear whether we will use it

In [None]:
sarcasm_df.head()

In [None]:
# Here we inspect data for missing values
sarcasm_df.info()

In [None]:
# We can see that the comment column has missing values: it has 1010773 non-null values and it should be 1010825. We will delete these rows completely.
# We also drop the columns 'ups', 'downs', 'date' and convert the string 'created_utc' to datetime format
sarcasm_df.dropna(subset=['comment'], inplace=True)
# drop() returns a new DataFrame, so the result has to be assigned back
sarcasm_df = sarcasm_df.drop(['ups', 'downs', 'date'], axis=1)
sarcasm_df['created_utc'] = sarcasm_df['created_utc'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

In [None]:
# Now we compare the number of instances for each class (1 - sarcasm, 0 - not). We can see that the dataset is balanced: the classes have almost the same size
sarcasm_df['label'].value_counts()

In [None]:
# Here we split our data into the train and test sets
train_df, test_df, train_y, test_y = train_test_split(sarcasm_df, sarcasm_df['label'], test_size=0.33, random_state=17)

## EDA part

Now we build the dataframe for EDA and add the *label* field back. Later we will also add some new features to it that can help us explore the data.

In [None]:
eda_data = train_df[['comment', 'author', 'subreddit','created_utc', 'score','parent_comment']].copy()
eda_data['label'] = train_y

In [None]:
eda_data.head()

First we will look at the additional features. Let's display the most sarcastic subreddits. We filter the data first to keep only subreddits with more than 1000 comments. This may not help to build models, but it lets us "feel" the data before diving into the modeling. And it's fun :)

In [None]:
filtered = eda_data.groupby(['subreddit']).filter(lambda x: x['comment'].count()>1000)
filtered.groupby(['subreddit']).agg({
    'comment':'count',
    'label': 'mean'}).sort_values(by='label', ascending=False).iloc[:10]

And now the most sarcastic authors with more than 20 comments

In [None]:
filtered_authors = eda_data.groupby(['author']).filter(lambda x: x['comment'].count()>20)
filtered_authors.groupby(['author']).agg({
    'comment':'count',
    'label': 'mean'}).sort_values(by='label', ascending=False).iloc[:10]

Let's inspect whether there are any relations between the length of the comment and its sarcastic tone. Maybe sarcastic comments tend to be short and laconic. We need to create a new feature for it. 

In [None]:
# Here we add the new field -  comment_len
eda_data['comment_len'] = eda_data['comment'].apply(len)

Here is a function for printing percentiles for the selected feature. For most datasets it's better and more informative to build a boxplot, but in our case it shows us nothing because of outliers. Therefore we will look at percentiles instead
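
To see why a few extreme values distort averages (and stretch a boxplot's scale) while percentiles stay stable, here is a small sketch. The comment lengths are made up for illustration and are not taken from the dataset.

```python
from statistics import fmean, median

# Made-up comment lengths with a single extreme outlier
lengths = [20, 35, 40, 55, 60, 10000]

# The mean is dragged far away by the outlier; the median (50th percentile) is not
print("mean:", fmean(lengths))
print("median:", median(lengths))
```
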

In [None]:
def percentile_print(data, feature, percentile_list=(25, 50, 75)):
    for percentile in percentile_list:
        print("Percentile", percentile,
              "Sarcasm", np.percentile(data[data['label']==1][feature], percentile),
              "Not", np.percentile(data[data['label']==0][feature], percentile))

In [None]:
# So we see that there is no difference in comment length between the classes. That's a bit disappointing, but we will move forward
percentile_print(eda_data, 'comment_len')

Now let's look at the score field. Maybe sarcastic comments are more popular than non-sarcastic ones; we will look at the *score* feature to check it.

In [None]:
# Unfortunately, the situation is the same for scores: no visible difference.
percentile_print(eda_data, 'score')

Now we add two more features: whether the comment was written on a working day or at the weekend, and whether it was written during the day or at night. Again we're trying to spot any difference; maybe people are more sarcastic at night.

In [None]:
# dayofweek: Monday is 0, so Saturday and Sunday are 5 and 6
eda_data['weekend'] = eda_data['created_utc'].apply(lambda x: x.dayofweek==5 or x.dayofweek==6).astype(int)
eda_data['day']= eda_data['created_utc'].apply(lambda x: x.hour>7 and x.hour<20).astype(int)
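
Both `datetime.weekday()` and the pandas `dayofweek` attribute count from Monday = 0 to Sunday = 6, so Saturday and Sunday are 5 and 6. Here is a minimal standard-library sketch of the two flags. The helper names `is_weekend` and `is_day` are illustrative and not part of the notebook. Since the timestamps are in UTC, "day" here means daytime in UTC, which is not necessarily daytime for the author.

```python
from datetime import datetime

def is_weekend(ts):
    # weekday(): Monday is 0 ... Saturday is 5, Sunday is 6
    return int(ts.weekday() >= 5)

def is_day(ts):
    # Same window as in the notebook: hours 8 through 19 inclusive
    return int(7 < ts.hour < 20)

saturday_noon = datetime(2016, 10, 1, 12, 0, 0)
tuesday_night = datetime(2016, 10, 4, 23, 30, 0)

print(is_weekend(saturday_noon), is_day(saturday_noon))
print(is_weekend(tuesday_night), is_day(tuesday_night))
```
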

In [None]:
# So what have we here? Working days don't make users more sarcastic.
sns.countplot(x='weekend', hue='label', data=eda_data )

In [None]:
# And finally here we can see the difference! Day - that's the time for the sarcasm =)
sns.countplot(x='day', hue='label', data=eda_data )

Now it's time to explore the text features. First let's look at the most popular words in sarcastic comments. We will throw away the most common words, like 'a', 'the', 'and' etc., and will consider unigrams and bigrams separately.

In [None]:
vectorizer_1 = CountVectorizer(stop_words='english', ngram_range=(1, 1))
vectorizer_2 = CountVectorizer(stop_words='english', ngram_range=(2, 2))

We will write a small function, because we don't want to repeat the same code

In [None]:
def freq_words(vectorizer, data):
    X = vectorizer.fit_transform(data)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    freqs = zip(vectorizer.get_feature_names_out(), np.asarray(X.sum(axis=0)).ravel())
    return sorted(freqs, key = lambda x: x[1], reverse=True)[:10]
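
As a rough illustration of what counting unigrams and bigrams means, here is a pure-Python sketch using `collections.Counter`. It splits on whitespace only, so it is a simplification and not what CountVectorizer does internally (no regex tokenization, no stop-word filtering). The helper names and the three comments are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(docs, n, k=3):
    # Count every n-gram over all documents and keep the k most frequent
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split(), n))
    return counts.most_common(k)

docs = ["yeah right", "yeah right sure", "oh yeah"]  # made-up comments
print(top_ngrams(docs, 1))
print(top_ngrams(docs, 2))
```
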

In [None]:
# First column - sarcastic comments, second - not
l = [freq_words(vectorizer_1, eda_data[eda_data['label']==1]['comment']),
     freq_words(vectorizer_1, eda_data[eda_data['label']==0]['comment'])]
list(map(list, zip(*l)))

So we can see that the most popular words for both classes are almost the same. The bigram vectorizer gives a better (more specific) result, as you can see below.

In [None]:
# First column - sarcastic comments, second - not
l = [freq_words(vectorizer_2, eda_data[eda_data['label']==1]['comment']),
     freq_words(vectorizer_2, eda_data[eda_data['label']==0]['comment'])]
list(map(list, zip(*l)))

Now it's time to look at the *parent_comment* feature. It's not clear how to use this information and whether we should use it at all. My suggestion is to find the intersection of words between *comment* and *parent_comment* and hope it will show us something =) Maybe sarcastic comments tend to repeat part of a phrase or some words of the parent comment?

In [None]:
# Again a small function for finding the intersection. This one does the following:
# 1) converts the comment to lowercase, splits it into words and strips punctuation from each word
# 2) does the same for the parent comment
# 3) finds the words in the comment that are also in the parent comment
# 4) returns the ratio of the number of shared words to the number of words in the comment

def find_intersection(comment, parent):
    comment_words = [x.strip(string.punctuation) for x in comment.lower().split()]
    # A set makes each membership test O(1)
    parent_words = {x.strip(string.punctuation) for x in parent.lower().split()}
    if not comment_words:
        return 0.0  # guard against whitespace-only comments
    intersection_words = [x for x in comment_words if x in parent_words]
    return len(intersection_words)/len(comment_words)
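
A worked example of the ratio on a made-up comment/parent pair may help. The function below is a self-contained sketch of the same idea under a different, illustrative name (`overlap_ratio`); returning 0.0 for an empty comment is an assumption of this sketch.

```python
import string

def overlap_ratio(comment, parent):
    # Lowercase, split on whitespace, trim punctuation from both ends of each token
    comment_words = [w.strip(string.punctuation) for w in comment.lower().split()]
    parent_words = {w.strip(string.punctuation) for w in parent.lower().split()}
    if not comment_words:
        return 0.0  # assumption: an empty comment shares nothing
    shared = [w for w in comment_words if w in parent_words]
    return len(shared) / len(comment_words)

# "great" and "idea" appear in the parent, "really" does not: 2 of 3 words are shared
print(overlap_ratio("Great idea, really.", "Is this a great idea?"))
```

Note that repeated words are counted each time they occur in the comment, so "a a b" against the parent "a" also gives 2/3.
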

In [None]:
# Now we add this intersection feature to our dataframe and will look at it
eda_data['intersection'] = [find_intersection(x,y) for x,y in zip(eda_data['comment'], eda_data['parent_comment'])]

In [None]:
sns.boxplot(x = eda_data[eda_data['label']==1]['intersection'])

In [None]:
sns.boxplot(x = eda_data[eda_data['label']==0]['intersection'])

In [None]:
percentile_print(eda_data, 'intersection')

The difference is really small. Anyway we have tried. Now it's time to build our models

## Logistic Regression

Here we define a LogisticRegression model. First we will compare two methods of converting text features to vectors: CountVectorizer ("Bag of Words") and TfidfVectorizer.
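
To make the difference between the two representations concrete, here is a hand-rolled sketch on two tiny, made-up tokenized comments. The TF-IDF part assumes scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); it is an illustration, not the library's implementation.

```python
import math
from collections import Counter

docs = [["yeah", "right"], ["yeah", "sure", "sure"]]  # made-up tokenized comments
vocab = sorted({w for d in docs for w in d})          # ['right', 'sure', 'yeah']

def bag_of_words(doc):
    # Raw term counts, one column per vocabulary word
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf(doc):
    n = len(docs)
    row = []
    for w, tf in zip(vocab, bag_of_words(doc)):
        df = sum(w in d for d in docs)           # number of documents containing w
        idf = math.log((1 + n) / (1 + df)) + 1   # smoothed inverse document frequency
        row.append(tf * idf)
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]               # L2-normalise the row

print(bag_of_words(docs[1]))
print([round(v, 3) for v in tfidf(docs[1])])
```

The word "yeah" occurs in both comments, so its weight is pushed down relative to "sure", which occurs in only one.
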

In [None]:
log_reg = LogisticRegression(random_state=17, solver='lbfgs')

In [None]:
# Function for printing metrics for our predicted result
def print_report(model, x_test, y_test):
    y_pred = model.predict(x_test)
    report = metrics.classification_report(y_test, y_pred)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))
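
For readers who want to see what the report's numbers mean, here is a small sketch that computes precision, recall and accuracy for the positive class by hand. The labels and predictions are made up, and `binary_metrics` is an illustrative helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(pairs)

y_true = [1, 1, 1, 0, 0, 0]  # made-up labels
y_pred = [1, 1, 0, 1, 0, 0]  # made-up predictions
print(binary_metrics(y_true, y_pred))
```
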

In [None]:
# This function helps us not to repeat the same lines of code many times. Here we:
# 1) make pipeline
# 2) train it
# 3) print metrics
# 4) return our trained regression and feature_names, so we will be able to look at the weights
def model_cycle(vectorizer, train_x=train_df['comment'], test_x=test_df['comment']):
    train_vect = vectorizer.fit_transform(train_x)
    test_vect = vectorizer.transform(test_x)
    log_reg.fit(train_vect,train_y)
    print_report(log_reg, test_vect , test_y)
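    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    return (log_reg, list(vectorizer.get_feature_names_out()))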

### CountVectorizer

In [None]:
# This code takes some time to run, be patient
(model, features) = model_cycle(CountVectorizer(ngram_range=(1, 3), max_features=100000))

The accuracy is not bad, but it's just the start. Let's look at the weights of the model to see which features have the greatest weights.

In [None]:
eli5.show_weights(model,
                  feature_names=features,
                  target_names = ['0','1'],
                  )

It looks plausible so far. Let's try one more vectorizer, TfIdf, to extract features.

### TfIdf

In [None]:
# This code takes some time to run, be patient
(model, features) = model_cycle(TfidfVectorizer(ngram_range=(1, 3), max_features=100000))

The results of this vectorizer are better. Let's look at the weights

In [None]:
eli5.show_weights(model,
                  feature_names=features,
                  target_names = ['0','1'],
                  )

The most important features here differ from those of the CountVectorizer method. If we look more carefully, features such as 'yes because' and 'yeah because' should logically be the same feature. Maybe stemming or lemmatization can help here; we will check it further.

### Stop Words

There is a built-in list of the most common English words in the sklearn library (stop_words='english'); let's try it.

In [None]:
(model, features) = model_cycle(CountVectorizer(ngram_range=(1, 3), stop_words='english', max_features=100000))

In [None]:
eli5.show_weights(model,
                  feature_names=features,
                  target_names = ['0','1'],
                  )

In [None]:
(model, features) = model_cycle(TfidfVectorizer(ngram_range=(1, 3), stop_words='english', max_features=100000))

In [None]:
eli5.show_weights(model,
                  feature_names=features,
                  target_names = ['0','1'],
                  )

After applying stop words we can see that accuracy is worse for both vectorizing methods. Maybe the reason lies in the relatively short comments and meager vocabulary: when we throw away common words, we throw away important information too.
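
Here is a toy illustration of how that can happen. The stop list below is a hypothetical mini list (not sklearn's), and both comments are made up: once the common words are removed, a sarcastic-sounding comment and a neutral one become indistinguishable.

```python
# Hypothetical mini stop list, for illustration only
STOP = {"oh", "sure", "that", "will", "because", "the", "a", "is"}

def content_tokens(text):
    # Keep only the words that are not in the stop list
    return [w for w in text.lower().split() if w not in STOP]

sarcastic = "oh sure because that will work"
neutral = "that will work"

print(content_tokens(sarcastic))
print(content_tokens(neutral))
```
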

### Stemming

Now it's time to check stemming. This method cuts off word endings, like this: "house, houses, house’s, houses’ => house". Testing it is not a simple task, because the sklearn library has no such built-in method. We will use the nltk library instead and write our own CountVectorizer subclass (alternatively, we could preprocess the data).

In [None]:
stemmer = PorterStemmer()
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        # Note: analyzer(doc) yields n-grams, so for n > 1 the stemmer receives
        # a multi-word string such as 'yes because' as a single token
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

vectorizer_s = StemmedCountVectorizer(analyzer="word", ngram_range=(1, 3), max_features=100000)
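
To show the effect stemming is meant to have on n-gram features, here is a deliberately crude sketch. `toy_stem` only strips a few suffixes and is NOT the Porter algorithm; it is an assumption made so the example runs without nltk. Stemming is applied to each token before the n-grams are built, so two differently inflected phrases end up as the same bigrams.

```python
def toy_stem(word):
    # Crude suffix stripping for illustration; real stemmers use many more rules
    for suffix in ("'s", "s'", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stemmed_ngrams(text, n):
    # Stem every token first, then build the n-grams from the stemmed tokens
    tokens = [toy_stem(w) for w in text.lower().split()]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print([toy_stem(w) for w in ["house", "houses", "house's", "houses'"]])
print(stemmed_ngrams("The houses burn", 2))
print(stemmed_ngrams("the house burns", 2))
```
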

In [None]:
# This code runs for a really long time. If you want to re-run it, be patient.
(model, features) = model_cycle(vectorizer_s)

The precision here is slightly worse than without the stemmer, but the recall for sarcastic comments is better (0.68 instead of 0.67).

In [None]:
eli5.show_weights(model,
                  feature_names=features,
                  target_names = ['0','1'],
                  )

The same for TfIdf

In [None]:
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

In [None]:
# This code also takes a lot of time. You can make some tea and talk with friends for a bit
(model, features) = model_cycle(StemmedTfidfVectorizer(analyzer="word", ngram_range=(1, 3), max_features=100000))

In [None]:
eli5.show_weights(model,
                  feature_names=features,
                  target_names = ['0','1'],
                  )

Well. The accuracy of the model with and without stemming is almost the same, and again the recall for sarcastic comments is better (0.69 instead of 0.68). But the calculation takes more time. And we can see that our problem with duplicated features is still here. Now let's try lemmatization.

### Lemmatization

The idea of this method is to bring a word to its base form, for example: "seen => see", "drove => drive". The situation here is the same as with stemming, or even trickier. We need to write a tokenizer and strip the punctuation manually, but in the end it works.

In [None]:
from nltk import word_tokenize
from nltk.corpus import wordnet 

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t,wordnet.VERB) for t in word_tokenize(articles)]

vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                             strip_accents = 'unicode',
                             lowercase = True,
                             ngram_range=(1, 3), max_features=100000)

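# Strip punctuation here. regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
train_stripped_comment = train_df['comment'].str.replace(r'[^\w\s]', '', regex=True)
test_stripped_comment = test_df['comment'].str.replace(r'[^\w\s]', '', regex=True)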
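
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace. Here is a quick standard-library sketch of what it removes, on a made-up sentence (`strip_punctuation` is an illustrative helper name):

```python
import re

def strip_punctuation(text):
    # Remove every character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)

print(strip_punctuation("Oh, great... another Monday!"))
print(strip_punctuation("don't_stop"))
```

Digits and underscores count as word characters, so they survive, and apostrophes are removed without inserting a space ("don't" becomes "dont").
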

In [None]:
(model, features) = model_cycle(vectorizer,train_stripped_comment, test_stripped_comment)

We can see a small improvement in precision here and rather strange features. Maybe we should remove digits as well? I'll leave it for the next commit. Now we move to the next vectorizer.

In [None]:
eli5.show_weights(model,
                  feature_names=features,
                  target_names = ['0','1'],
                  )

In [None]:
tf_vectorizer = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents = 'unicode',
                                lowercase = True,
                                ngram_range=(1, 3), max_features=100000)

In [None]:
(model, features) = model_cycle(tf_vectorizer,train_stripped_comment, test_stripped_comment)

In [None]:
eli5.show_weights(model,
                  feature_names=features,
                  target_names = ['0','1'],
                  )

For now this is the best result: average accuracy 0.725, recall for sarcastic comments 0.69

### Non-text features

Now let's try adding other features; maybe we can improve our result. We will not build a pipeline here, but prepare our features one by one. Taking into account our previous tests, we will use TfidfVectorizer with the lemma tokenizer.

In [None]:
# Define vectorizer and convert our text features
train_comment = tf_vectorizer.fit_transform(train_stripped_comment)
test_comment = tf_vectorizer.transform(test_stripped_comment)

In [None]:
# The subreddit feature should be encoded with OneHotEncoder; after that every unique subreddit value becomes a separate feature
enc_sub = OneHotEncoder(handle_unknown='ignore')
train_subreddit = enc_sub.fit_transform(train_df['subreddit'].values.reshape(-1,1))
test_subreddit = enc_sub.transform(test_df['subreddit'].values.reshape(-1,1))
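
Here is a pure-Python sketch of the idea behind one-hot encoding with `handle_unknown='ignore'`: categories are learned from the training data only, and a value that was never seen in training is encoded as an all-zero row. The helper names and subreddit values are illustrative; the real encoder returns a sparse matrix.

```python
def fit_categories(values):
    # Map each category seen in training to a column index
    return {v: i for i, v in enumerate(sorted(set(values)))}

def one_hot(value, categories):
    row = [0] * len(categories)
    if value in categories:  # unseen categories leave the row all zeros
        row[categories[value]] = 1
    return row

cats = fit_categories(["politics", "AskReddit", "politics", "nba"])
print(one_hot("nba", cats))
print(one_hot("worldnews", cats))  # not seen during fitting
```
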

In [None]:
# The same for the author field
enc_aut = OneHotEncoder(handle_unknown='ignore')
train_author = enc_aut.fit_transform(train_df['author'].values.reshape(-1,1))
test_author = enc_aut.transform(test_df['author'].values.reshape(-1,1))

In [None]:
# We will scale our real-valued features
scaler = StandardScaler()
train_scores = scaler.fit_transform(train_df['score'].values.reshape(-1,1))
test_scores = scaler.transform(test_df['score'].values.reshape(-1,1))
train_len = scaler.fit_transform((train_df['comment'].apply(len)).values.reshape(-1,1))
test_len = scaler.transform((test_df['comment'].apply(len)).values.reshape(-1,1))
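
The scaling step can be sketched by hand as follows: the mean and standard deviation are learned from the training values only and then reused for the test values. StandardScaler uses the population standard deviation, which is what `statistics.pstdev` computes. The numbers and helper names are made up for illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    # Learn mean and population standard deviation from training data only
    return fmean(values), pstdev(values)

def scale(values, mean, std):
    return [(v - mean) / std for v in values]

train_vals = [1, 2, 3, 4, 5]  # made-up training scores
test_vals = [3, 10]           # made-up test scores

mean, std = fit_scaler(train_vals)
print(scale(train_vals, mean, std))
print(scale(test_vals, mean, std))  # scaled with the training statistics
```
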

In [None]:
# And finally here we append day, weekend and intersection features

train_df['day'] = train_df['created_utc'].apply(lambda x: x.hour>7 and x.hour<20).astype(int)
test_df['day'] = test_df['created_utc'].apply(lambda x: x.hour>7 and x.hour<20).astype(int)

train_df['intersection'] = [find_intersection(x,y) for x,y in zip(train_df['comment'], train_df['parent_comment'])]
test_df['intersection'] = [find_intersection(x,y) for x,y in zip(test_df['comment'], test_df['parent_comment'])]

# dayofweek: Monday is 0, so Saturday and Sunday are 5 and 6
train_df['weekend'] = train_df['created_utc'].apply(lambda x: x.dayofweek==5 or x.dayofweek==6).astype(int)
test_df['weekend'] = test_df['created_utc'].apply(lambda x: x.dayofweek==5 or x.dayofweek==6).astype(int)

In [None]:
# Here we combine all our features in the big sparse matrix
train_sparse = hstack([train_comment, train_subreddit, train_author, train_scores, train_len,
                       train_df['day'].values.reshape(-1,1), train_df['intersection'].values.reshape(-1,1),
                       train_df['weekend'].values.reshape(-1,1)]).tocsr()
test_sparse = hstack([test_comment, test_subreddit, test_author, test_scores, test_len,
                      test_df['day'].values.reshape(-1,1), test_df['intersection'].values.reshape(-1,1),
                      test_df['weekend'].values.reshape(-1,1)]).tocsr()

In [None]:
# Fit the model and print report
log_reg.fit(train_sparse, train_y)
print_report(log_reg, test_sparse, test_y)

That's disappointing, we haven't improved the best score. But anyway we've tried =)

## Conclusion

Well, guys, there were a lot of calculations here, and most of them brought no improvement in the model's results. Now let's draw the conclusions:
* The EDA part is not vital here, but I still recommend it to "feel" the data
* Using stop words seems unsuitable for relatively short text fields like our comments. In our case it decreased accuracy by approximately 0.05
* TfidfVectorizer with lemmatization showed the best result here and took slightly less time compared to the CountVectorizer
* Non-text features gave us no improvement, but it was still worth trying
* The model tuning section is omitted here, because the notebook is already huge and has many long-running calculations. I left it aside, but you can do it yourself: you just need to uncomment the code below (spoiler: the best value for C will be 1, the default one)
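
For reference, the grid `np.logspace(-3, 1, 5)` used in the tuning code consists of five values evenly spaced on a log scale from 10^-3 to 10^1. In scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so smaller values mean stronger regularization. Here is a dependency-free sketch of the same grid (the `logspace` helper is a stand-in written for this example, not numpy's function):

```python
def logspace(start, stop, num):
    # Exponents are evenly spaced between start and stop (both included)
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

grid = logspace(-3, 1, 5)
print(grid)
```
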

Thanks for reading!

In [None]:
"""This is the commented part with parameters tuning
In our LogisticRegression model we have only one parameter C
and we need to make the pipeline so the train and validation data will not blend.
This code runs a long time, be ready.
"""
"""
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

pipe_logit = make_pipeline(tf_vectorizer, log_reg)
param_grid_logit = {'logisticregression__C': np.logspace(-3, 1, 5)}

grid_logit = GridSearchCV(pipe_logit, 
                          param_grid_logit, 
                          return_train_score=True, 
                          cv=3, n_jobs=-1)

grid_logit.fit(train_stripped_comment, train_y)

grid_logit.best_params_, grid_logit.best_score_

grid_logit.score(test_stripped_comment,test_y)
"""