<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm", which contains more than 1 million Reddit comments labeled as either sarcastic or not. A processed version is available on Kaggle as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [None]:
!ls ../input/sarcasm/

In [None]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [None]:
train_df.head()

In [None]:
train_df.info()

Some comments are missing, so we drop the corresponding rows.

In [None]:
train_df.dropna(subset=['comment'], inplace=True)

We notice that the dataset is indeed balanced.

In [None]:
train_df['label'].value_counts()

We split the data into training and validation parts.

In [None]:
train_texts, valid_texts, y_train, y_valid = \
        train_test_split(train_df['comment'], train_df['label'], random_state=17)

## Tasks:
1. Analyze the dataset and make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams that are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).
4. (Optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.
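
For task 4, one way to treat each subreddit as its own feature is to encode the subreddit name with `CountVectorizer` and stack the result next to the Tf-Idf matrix of the comments. The sketch below is only an illustration on a made-up toy frame: the column name `subreddit`, the toy rows, and all variable names are assumptions, not part of the assignment.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the Reddit comments
toy_df = pd.DataFrame({
    "comment": [
        "oh great, another monday",
        "thanks for the detailed answer",
        "yeah, because that always works",
        "the game starts at eight tonight",
        "sure, what could possibly go wrong",
        "here is the link to the paper",
    ],
    "subreddit": ["politics", "AskReddit", "politics", "nba", "AskReddit", "nba"],
    "label": [1, 0, 1, 0, 1, 0],
})

comment_vec = TfidfVectorizer(ngram_range=(1, 2))
# token_pattern=r"\S+" keeps every subreddit name as one token,
# so each distinct subreddit becomes exactly one binary column
subreddit_vec = CountVectorizer(token_pattern=r"\S+", binary=True)

X_comment = comment_vec.fit_transform(toy_df["comment"])
X_subreddit = subreddit_vec.fit_transform(toy_df["subreddit"])

# Stack both sparse blocks side by side: [comment n-grams | subreddit indicators]
X_full = hstack([X_comment, X_subreddit]).tocsr()

logit = LogisticRegression(C=5.0, solver="liblinear", random_state=17)
logit.fit(X_full, toy_df["label"])
print(X_comment.shape, X_subreddit.shape, X_full.shape)
```

On the real data, both vectorizers would be fit on the training part only and then applied to the validation part with `transform`.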

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions

## Tf-Idf + logistic regression pipeline

### Preprocessing data

In [None]:
train_texts = train_texts.to_frame()
valid_texts = valid_texts.to_frame()
y_train = y_train.to_frame()
y_valid = y_valid.to_frame()

In [None]:
## Number of words in the text ##
train_texts["num_words"] = train_texts["comment"].apply(lambda x: len(str(x).split()))
valid_texts["num_words"] = valid_texts["comment"].apply(lambda x: len(str(x).split()))

## Number of unique words in the text ##
train_texts["num_unique_words"] = train_texts["comment"].apply(lambda x: len(set(str(x).split())))
valid_texts["num_unique_words"] = valid_texts["comment"].apply(lambda x: len(set(str(x).split())))

In [None]:
train_texts = train_texts.reset_index()
valid_texts = valid_texts.reset_index()
y_train = y_train.reset_index()
y_valid = y_valid.reset_index()

In [None]:
train_texts.shape, y_train.shape

In [None]:
train_texts['comment'].tail()

### Tf-Idf vectors

In [None]:
# Get the tfidf vectors #
tfidf_vec = TfidfVectorizer(ngram_range=(1,3))
# Fit on the training comments only, so that validation text does not leak
# into the vocabulary and the IDF weights
tfidf_vec.fit(train_texts['comment'].values.tolist())
train_tfidf = tfidf_vec.transform(train_texts['comment'].values.tolist())
test_tfidf = tfidf_vec.transform(valid_texts['comment'].values.tolist())
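
The vectorizer and the classifier can also be chained in a scikit-learn `Pipeline` (imported at the top of the notebook). A pipeline refits Tf-Idf on whatever data is passed to `fit`, so validation text never influences the vocabulary. This is a minimal sketch on made-up sentences. The toy lists and the step names `tfidf` and `logit` are assumptions for illustration, and `liblinear` is used only because the toy data is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = sarcastic, 0 = not sarcastic
toy_train = [
    "oh great, another meeting",
    "the meeting is at noon",
    "yeah right, that will help",
    "the report is attached",
    "wow, what a brilliant idea",
    "please send the file today",
]
toy_labels = [1, 0, 1, 0, 1, 0]
toy_valid = ["oh great, a wonderful idea", "the file is attached"]

tfidf_logit = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logit", LogisticRegression(C=5.0, solver="liblinear", random_state=17)),
])

# fit() learns the vocabulary and IDF weights from the training texts only
tfidf_logit.fit(toy_train, toy_labels)

# predict_proba() reuses the fitted vocabulary; unseen words are ignored
valid_proba = tfidf_logit.predict_proba(toy_valid)[:, 1]
print(valid_proba)
```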

## Some plots

In [None]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0,16.0), 
                   title = None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS)
    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color='black',
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    width=800, 
                    height=400,
                    mask = mask)
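    # Join all comments into one string; str(text) on a pandas Series
    # would only yield its truncated printed representation
    wordcloud.generate(" ".join(str(t) for t in text))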
    
    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  
    
plot_wordcloud(train_df["comment"], title="Word Cloud of Comments")

In [None]:
from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

In [None]:

## target count ##
cnt_srs = train_df['label'].value_counts()
trace = go.Bar(
    x=cnt_srs.index,
    y=cnt_srs.values,
    marker=dict(
        color=cnt_srs.values,
        colorscale = 'Picnic',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Label Count',
    font=dict(size=18)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="LabelCount")

## target distribution ##
labels = (np.array(cnt_srs.index))
sizes = (np.array((cnt_srs / cnt_srs.sum())*100))

trace = go.Pie(labels=labels, values=sizes)
layout = go.Layout(
    title='Label distribution',
    font=dict(size=18),
    width=600,
    height=600,
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="usertype")

### CV

In [None]:
from sklearn.model_selection import KFold
from sklearn import metrics

In [None]:


def runModel(train_X, train_y, test_X, test_y, test_X2):
    model = LogisticRegression(C=5., solver='sag')
    model.fit(train_X, train_y)
    pred_test_y = model.predict_proba(test_X)[:,1]
    pred_test_y2 = model.predict_proba(test_X2)[:,1]
    return pred_test_y, pred_test_y2, model

print("Building model.")
cv_scores = []
pred_full_test = 0
pred_train = np.zeros([train_texts.shape[0]])
kf = KFold(n_splits=5, shuffle=True, random_state=17)
for dev_index, val_index in kf.split(train_texts):
    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index]
    dev_y, val_y = y_train.iloc[dev_index].label, y_train.iloc[val_index].label
    pred_val_y, pred_test_y, model = runModel(dev_X, dev_y, val_X, val_y, test_tfidf)
    pred_full_test = pred_full_test + pred_test_y
    pred_train[val_index] = pred_val_y
    cv_scores.append(metrics.log_loss(val_y, pred_val_y))
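
The loop above stores out-of-fold probabilities in `pred_train`. scikit-learn's `cross_val_predict` produces the same kind of out-of-fold predictions in a single call. The sketch below runs on synthetic data from `make_classification`. The data and names such as `oof_proba` are illustrative, and `liblinear` replaces `sag` only to keep the toy run fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the Tf-Idf matrix and the labels
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=17)

kf = KFold(n_splits=5, shuffle=True, random_state=17)

# Each row is predicted by the model that did NOT see it during training
oof_proba = cross_val_predict(
    LogisticRegression(C=5.0, solver="liblinear"),
    X_toy, y_toy, cv=kf, method="predict_proba",
)[:, 1]

# One log loss over all out-of-fold predictions
oof_log_loss = log_loss(y_toy, oof_proba)
print(oof_log_loss)
```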

In [None]:
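# Evaluate thresholds on the out-of-fold predictions for the whole training set,
# not only on the last fold left in val_y / pred_val_y
for thresh in np.arange(0.3, 0.36, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(y_train.label, (pred_train>thresh).astype(int))))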

In [None]:
np.mean(cv_scores)

In [None]:
pred_val_y.shape, val_y.shape

In [None]:
model.predict_proba(test_tfidf[3])[:,1]

In [None]:
train_texts['comment'].iloc[1]

In [None]:
test_tfidf[1].shape

### Model without CV

In [None]:

all_model = LogisticRegression(C=5., solver='sag')
all_model.fit(train_tfidf, y_train.label)
pred_test_y = all_model.predict_proba(test_tfidf)[:,1]

In [None]:
metrics.log_loss(y_valid.label, pred_test_y)

In [None]:
thresh = 0.36
metrics.f1_score(y_valid.label, (pred_test_y>thresh).astype(int))

In [None]:
for thresh in np.arange(0.34, 0.4, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(y_valid.label, (pred_test_y>thresh).astype(int))))
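
The threshold scan above can be extended to pick the best threshold automatically. This is a small sketch with made-up labels and probabilities: `y_true`, `y_proba`, and the 0.05 grid step are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_proba = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.30, 0.55])

# Candidate thresholds 0.05, 0.10, ..., 0.95
thresholds = np.round(np.arange(0.05, 1.0, 0.05), 2)

# zero_division=0 returns F1 = 0 when a threshold yields no positive predictions
f1_scores = np.array([
    f1_score(y_true, (y_proba > t).astype(int), zero_division=0)
    for t in thresholds
])

best_idx = int(np.argmax(f1_scores))
best_threshold, best_f1 = thresholds[best_idx], f1_scores[best_idx]
f1_at_half = f1_score(y_true, (y_proba > 0.5).astype(int))
print(f"best threshold {best_threshold}: F1 {best_f1:.3f} (F1 at 0.5: {f1_at_half:.3f})")
```

Tuning the threshold on the same validation set that is used to report the final score gives an optimistic estimate. The out-of-fold predictions from cross-validation are a safer place to choose it.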

## Most predictive words/bigrams

In [None]:
import eli5
eli5.show_weights(all_model, vec=tfidf_vec, top=50, feature_filter=lambda x: x != '<BIAS>')

## Bag of Words approach

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
%%time
from sklearn.pipeline import make_pipeline

text_pipe_logit = make_pipeline(CountVectorizer(),
                                # LogisticRegression's n_jobs is left at its default:
                                # for some reason n_jobs > 1 here won't work
                                # with GridSearchCV's n_jobs > 1
                                LogisticRegression(C=5., solver='sag',
                                                   random_state=17))

text_pipe_logit.fit(train_texts.comment, y_train.label)
print(text_pipe_logit.score(valid_texts.comment, y_valid.label))

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_logit = {'logisticregression__C': np.logspace(-5, 0, 6)}
grid_logit = GridSearchCV(text_pipe_logit, 
                          param_grid_logit, 
                          return_train_score=True, 
                          cv=3, n_jobs=-1)

grid_logit.fit(train_texts.comment, y_train.label)

In [None]:
grid_logit.best_params_, grid_logit.best_score_

In [None]:
def plot_grid_scores(grid, param_name):
    plt.plot(grid.param_grid[param_name], grid.cv_results_['mean_train_score'],
        color='green', label='train')
    plt.plot(grid.param_grid[param_name], grid.cv_results_['mean_test_score'],
        color='red', label='test')
    # C is sampled on a log grid, so use a log-scaled x-axis
    plt.xscale('log')
    plt.xlabel(param_name)
    plt.ylabel('mean CV accuracy')
    plt.legend();

In [None]:
plot_grid_scores(grid_logit, 'logisticregression__C')

In [None]:
grid_logit.score(valid_texts.comment, y_valid.label)