<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm", which contains more than 1 million Reddit comments labeled as either sarcastic or not. A processed version is available as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [None]:
!ls ../input/sarcasm/

In [None]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
sns.set()
from matplotlib import pyplot as plt

%config InlineBackend.figure_format = 'svg'

In [None]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [None]:
train_df.head()

In [None]:
train_df.info()

Some comments are missing, so we drop the corresponding rows.

In [None]:
train_df.dropna(subset=['comment'], inplace=True)

The class counts confirm that the dataset is indeed balanced.

In [None]:
train_df['label'].value_counts()

We split data into training and validation parts.

## Tasks:
1. Analyze the dataset and make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams which are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).
4. (Optional) Add subreddits as new features to improve model performance. Use the Bag of Words approach here, i.e. treat each subreddit as a separate feature.
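
For task 4, one possible way to treat each subreddit as its own feature is to vectorize the `subreddit` column separately and stack it next to the comment Tf-Idf features. The sketch below is only an illustration under assumptions: `toy_df` is a made-up stand-in for `train_df`, and the names `features` and `subreddit_pipeline` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data with the same column names as train_df (an assumption for illustration).
toy_df = pd.DataFrame({
    'comment': ['yeah, great idea', 'thanks for the link', 'oh sure, that will work',
                'the patch fixed it', 'totally not biased', 'see the docs'],
    'subreddit': ['politics', 'AskReddit', 'politics',
                  'programming', 'politics', 'programming'],
    'label': [1, 0, 1, 0, 1, 0],
})

features = ColumnTransformer([
    # Tf-Idf over unigrams and bigrams of the comment text.
    ('comment_tfidf', TfidfVectorizer(ngram_range=(1, 2)), 'comment'),
    # Bag of Words over subreddits: every subreddit name becomes one column.
    # token_pattern=r'\S+' keeps a whole name as a single token.
    ('subreddit_bow', CountVectorizer(token_pattern=r'\S+', lowercase=False), 'subreddit'),
])

subreddit_pipeline = Pipeline([
    ('features', features),
    ('logit', LogisticRegression(solver='lbfgs', random_state=17)),
])
subreddit_pipeline.fit(toy_df[['comment', 'subreddit']], toy_df['label'])

toy_pred = subreddit_pipeline.predict(toy_df[['comment', 'subreddit']])
print(toy_pred)
```

Whether the extra columns actually help on the real data has to be checked on the validation part.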

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions

In [None]:
# plt.figure(figsize=(10,4))
train_df.loc[train_df['label'] == 1, 'comment'].str.len().apply(np.log1p).hist(alpha=.5, label='sarcastic')
train_df.loc[train_df['label'] == 0, 'comment'].str.len().apply(np.log1p).hist(alpha=.5, label='normal')
plt.legend()

In [None]:
train_df.head()

In [None]:
sub_label = train_df.groupby("subreddit")['label'].agg(['mean', 'sum', 'size'])
sub_label.sort_values(by='sum', ascending=False).head(10)

In [None]:
sub_label[sub_label['size'] > 1000].sort_values(by='mean', ascending=False).head(10)

The same statistics, grouped by author:

In [None]:
author_label = train_df.groupby("author")['label'].agg(['mean', 'sum', 'size'])
author_label.sort_values(by='sum', ascending=False).head(10)

In [None]:
author_label[author_label['size'] > 400].sort_values(by='mean', ascending=False).head(10)

The same statistics, grouped by comment score (non-negative scores first):

In [None]:
score_label = train_df[train_df['score'] >= 0].groupby('score')['label'].agg(['mean', 'sum', 'size'])
score_label[score_label['size'] > 400].sort_values(by='mean', ascending=False).head(10)

And for comments with a score below zero:

In [None]:
score_label2 = train_df[train_df['score'] < 0].groupby('score')['label'].agg(['mean', 'sum', 'size'])
# score_label2.head(10)
score_label2[score_label2['size'] > 400].sort_values(by='mean', ascending=False).head(10)

In [None]:
print('Maximum score: ', train_df['score'].max(), '\n')
print('Minimum score: ', train_df['score'].min(), '\n')
print('Mean score: ', train_df['score'].mean(), '\n')
print('Standard Deviation score: ', train_df['score'].std(), '\n')
print('Median score: ', train_df['score'].median())

In [None]:
max_score = train_df['score'].max()
min_score = train_df['score'].min()

parent_comment_max_score = train_df.loc[train_df['score'] == max_score, 'parent_comment'].iloc[0]
parent_comment_min_score = train_df.loc[train_df['score'] == min_score, 'parent_comment'].iloc[0]

comment_max_score = train_df.loc[train_df['score'] == max_score, 'comment'].iloc[0]
comment_min_score = train_df.loc[train_df['score'] == min_score, 'comment'].iloc[0]

sarcasm_max_score = train_df.loc[train_df['score'] == max_score, 'label'].iloc[0]
sarcasm_max_score = (sarcasm_max_score == 1)

sarcasm_min_score = train_df.loc[train_df['score'] == min_score, 'label'].iloc[0]
sarcasm_min_score = (sarcasm_min_score == 1)

print('The comment "{}", scored the highest at {}, had a parent comment of "{}" and it is labelled as sarcastic: {}'
      .format(comment_max_score, max_score, parent_comment_max_score, sarcasm_max_score), '\n')

print('The comment "{}", scored the lowest at {}, had a parent comment of "{}" and it is labelled as sarcastic: {}'
      .format(comment_min_score, min_score, parent_comment_min_score, sarcasm_min_score))

In [None]:
train_df['date'] = pd.to_datetime(train_df['date'], yearfirst=True)
train_df['year'] = train_df['date'].dt.year
train_df.head()

In [None]:
year_comments = train_df.groupby('year')['label'].agg(['mean', 'size', 'sum'])
year_comments.sort_values(by='sum', ascending=False).head(10)

In [None]:
# plt.figure(figsize=(10,6))
year_comments['mean'].plot(kind='line')
plt.title('Rate of Sarcastic Comments by Year')
plt.ylabel('Share of Sarcastic Comments')

In [None]:
X = train_df['comment']
y = train_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17) 
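
The split above is purely random, which is usually fine for a large balanced dataset. If you want the class proportions to be preserved exactly in both parts, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up data (the `toy_*` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced labels: 50 zeros and 50 ones.
toy_y = np.array([0] * 50 + [1] * 50)
toy_X = np.arange(100).reshape(-1, 1)

# stratify=toy_y keeps the 50/50 class ratio in both parts.
toy_X_train, toy_X_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=17, stratify=toy_y)

print('Share of class 1 in train:', toy_y_train.mean())
print('Share of class 1 in validation:', toy_y_valid.mean())
```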

In [None]:
tf_idf = TfidfVectorizer(ngram_range=(1,2), max_features=60000, min_df=2)
logist = LogisticRegression(n_jobs=4, solver='lbfgs', random_state=17, verbose=1)
tf_idf_logist_pipeline = Pipeline([('tf_idf', tf_idf),
                                  ('logist', logist)])
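
To get a feel for what `ngram_range=(1, 2)` and `min_df=2` do before fitting on the full data, it can help to run the vectorizer on a few short strings. This is only a sketch; `toy_comments` is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy comments, used only to keep the vocabulary readable.
toy_comments = ['great idea', 'great job', 'what a great idea']

# Unigrams and bigrams, no frequency filtering.
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_tfidf.fit_transform(toy_comments)
print(sorted(toy_tfidf.vocabulary_))
print(toy_matrix.shape)  # (number of comments, number of distinct terms)

# min_df=2 keeps only the terms that occur in at least two comments.
filtered_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
filtered_tfidf.fit(toy_comments)
print(sorted(filtered_tfidf.vocabulary_))
```

Note that the default tokenizer ignores one-character tokens such as "a", so they never appear in the vocabulary or in the bigrams.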

In [None]:
# fit
tf_idf_logist_pipeline.fit(X_train, y_train)

In [None]:
# predict
pred = tf_idf_logist_pipeline.predict(X_test)

In [None]:
# accuracy
accuracy_score(y_test, pred)

In [None]:
print('Accuracy score is: {:.2%}'.format(accuracy_score(y_test, pred)))

In [None]:
from sklearn.metrics import classification_report

In [None]:
# classification_report returns a formatted string, so print it to render the table
print(classification_report(y_test, pred))

In [None]:
confusion_matrix(y_test, pred)
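
As a reminder of how the numbers in the classification report relate to the confusion matrix, here is a small worked example on made-up labels (`toy_true` and `toy_pred` are hypothetical). For binary labels 0/1, scikit-learn puts true classes in rows and predicted classes in columns, so `ravel()` yields TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 4 non-sarcastic and 6 sarcastic comments.
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # how many predicted sarcastic are sarcastic
recall = tp / (tp + fn)                     # how many sarcastic comments were found

print('accuracy: {:.2f}, precision: {:.2f}, recall: {:.2f}'.format(accuracy, precision, recall))
```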

In [None]:
# plot confusion matrix
plt.figure(figsize=(10, 6))

conmat = pd.DataFrame(confusion_matrix(y_test, pred), index=['Not Sarcastic', 'Sarcastic'], 
                      columns=['Not Sarcastic', 'Sarcastic'])

ax = sns.heatmap(conmat, annot=True, cbar=False, cmap='viridis', linewidths=0.5, fmt='.0f')
ax.set_title('Confusion Matrix for Sarcasm Detection', fontsize=18, y=1.05)
ax.set_ylabel('Real', fontsize=12)
ax.set_xlabel('Predicted', fontsize=12)
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
# ax.xaxis.set_label_position('bottom')
ax.tick_params(labelsize=10)

In [None]:
import eli5

In [None]:
eli5.show_weights(estimator=tf_idf_logist_pipeline.named_steps['logist'], 
                  vec=tf_idf_logist_pipeline.named_steps['tf_idf'])

In [None]:
# using grid cv
from sklearn.model_selection import GridSearchCV

In [None]:
model = Pipeline([('tfidf',TfidfVectorizer(min_df=2)),
                    ('logit',LogisticRegression(solver='lbfgs', max_iter=3000))])
params = {'tfidf__ngram_range':[(1,1),(1,2)],'tfidf__use_idf':(True,False)}
grid = GridSearchCV(estimator=model, param_grid=params, verbose=1, n_jobs=-1, cv=3)

In [None]:
grid.fit(X_train, y_train)

In [None]:
grid.best_params_
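
Besides `best_params_`, a fitted `GridSearchCV` also keeps the cross-validation scores of every parameter combination in `cv_results_`, and with the default `refit=True` it stores a model already refit on the whole training set in `best_estimator_`. A minimal sketch on made-up data (`toy_grid` and the toy comments are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data: six "sarcastic" (1) and six "normal" (0) comments.
toy_comments = ['yeah right', 'thanks a lot', 'oh sure', 'nice work',
                'yeah sure thing', 'thanks for sharing', 'oh yeah right', 'good point',
                'sure that will work', 'see the docs', 'yeah totally', 'well done']
toy_labels = [1, 0] * 6

toy_model = Pipeline([('tfidf', TfidfVectorizer()),
                      ('logit', LogisticRegression(solver='lbfgs', max_iter=3000))])
toy_params = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': [True, False]}
toy_grid = GridSearchCV(estimator=toy_model, param_grid=toy_params, cv=3)
toy_grid.fit(toy_comments, toy_labels)

# One row per parameter combination, best rank first.
results = (pd.DataFrame(toy_grid.cv_results_)
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(results)

# With refit=True (the default) the best pipeline is already refit and ready to use.
refit_model = toy_grid.best_estimator_
print(refit_model.predict(['yeah right']))
```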

In [None]:
better_model = Pipeline([('tfidf',TfidfVectorizer(min_df=2, ngram_range=(1,2), use_idf=True)),
                    ('logit',LogisticRegression(solver='lbfgs', max_iter=3000))])
better_model.fit(X_train, y_train)

In [None]:
better_pred = better_model.predict(X_test)

In [None]:
accuracy_score(y_test, better_pred)

In [None]:
print('Accuracy score is: {:.2%}'.format(accuracy_score(y_test, better_pred)))

> **slightly better accuracy**

In [None]:
confusion_matrix(y_test, better_pred)

In [None]:
# plot confusion matrix again
plt.figure(figsize=(10, 6))

conmat = pd.DataFrame(confusion_matrix(y_test, better_pred), index=['Not Sarcastic', 'Sarcastic'], 
                      columns=['Not Sarcastic', 'Sarcastic'])

ax = sns.heatmap(conmat, annot=True, cbar=False, cmap='viridis', linewidths=0.5, fmt='.0f')
ax.set_title('Confusion Matrix for Sarcasm Detection', fontsize=18, y=1.05)
ax.set_ylabel('Real', fontsize=12)
ax.set_xlabel('Predicted', fontsize=12)
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.tick_params(labelsize=10)