<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm", which contains more than 1 million Reddit comments labeled as either sarcastic or not. A processed version is available as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [None]:
!ls ../input/sarcasm/

In [None]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, StratifiedKFold, learning_curve, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [None]:
train_df.head()

In [None]:
train_df.info()

Some comments are missing, so we drop the corresponding rows.

In [None]:
train_df.dropna(subset=['comment'], inplace=True)

The class counts confirm that the dataset is indeed balanced.

In [None]:
train_df['label'].value_counts()

We split the data into training and validation parts.

In [None]:
train_texts, valid_texts, y_train, y_valid = \
        train_test_split(train_df['comment'], train_df['label'], random_state=17)

## Tasks:
1. Analyze the dataset and make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams that are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).
4. (Optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.
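
For task 2, a minimal sketch of what a single Tf-Idf + logistic regression `Pipeline` could look like. The toy comments and labels below are made up for illustration and are not taken from the corpus; the step names `tfidf` and `logit` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for train_texts / y_train
toy_texts = [
    "oh great, another monday",
    "yeah, because that always works",
    "wow, what a totally original idea",
    "sure, that will definitely end well",
    "the meeting starts at noon",
    "i bought a new keyboard today",
    "the patch fixes the login bug",
    "it rained for most of the afternoon",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer and the classifier are fitted together,
# so raw strings can be passed to fit() and predict()
tfidf_logit_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logit", LogisticRegression(C=1, random_state=17)),
])
tfidf_logit_pipeline.fit(toy_texts, toy_labels)

toy_pred = tfidf_logit_pipeline.predict(["oh sure, that always works"])
```

One benefit of this layout is that the whole pipeline can be cross-validated as a unit, so the vectorizer is refitted on each training fold rather than on all of the data.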

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions

In [None]:
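# unigrams and bigrams; drop terms seen in fewer than 5 comments or in more than 80% of them
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000,
                             min_df=5, max_df=0.8, sublinear_tf=True)
# fit the vocabulary on the training part only, then reuse it for the validation part
train_vectors = vectorizer.fit_transform(train_texts)
test_vectors = vectorizer.transform(valid_texts)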

In [None]:
logit = LogisticRegression(C=1, n_jobs=-1, random_state=17)
logit.fit(train_vectors, y_train)

In [None]:
# mean accuracy on the validation part
logit.score(test_vectors, y_valid)

In [None]:
# try 8 evenly spaced values of C between 0.1 and 2 with 5-fold cross-validation
logit_params = {'C': np.linspace(0.1, 2, 8)}
logit_grid = GridSearchCV(logit, logit_params, cv=5, n_jobs=-1, verbose=True)
logit_grid.fit(train_vectors, y_train)

In [None]:
logit_grid.best_params_['C'], logit_grid.best_score_

In [None]:
# refit on the training part with the best C found by the grid search
logit = LogisticRegression(C=logit_grid.best_params_['C'], n_jobs=-1, random_state=17)
logit.fit(train_vectors, y_train)

In [None]:
y_pred = logit.predict(test_vectors)
accuracy_score(y_valid, y_pred)

In [None]:
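# rows are true labels, columns are predicted labels; fmt='d' prints the counts as integers
sns.heatmap(confusion_matrix(y_valid, y_pred), annot=True, fmt='d', cmap='viridis')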
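
The heatmap shows raw counts. As a rough sketch of how accuracy, precision and recall follow from the four cells of a binary confusion matrix, here is a tiny example with made-up labels (not the notebook's predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to make the arithmetic easy to follow
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes,
# so ravel() yields tn, fp, fn, tp for a binary problem
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all comments classified correctly
precision = tp / (tp + fp)                  # share of predicted-sarcastic that are sarcastic
recall = tp / (tp + fn)                     # share of sarcastic comments that were found

print(tn, fp, fn, tp)               # 3 1 1 3
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```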

In [None]:
import eli5
eli5.show_weights(estimator=logit,
                  vec=vectorizer)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
StandardScaler?

In [None]:
SS = StandardScaler()

In [None]:
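# encode each comment's subreddit as sparse features;
# the vectorizer is fitted on all rows, so its vocabulary also covers validation rows
subreddit = train_df['subreddit']
sr_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
subreddit_vect = sr_vectorizer.fit_transform(subreddit)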

In [None]:
subreddit_vect.shape

In [None]:
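# same random_state and same number of rows as the earlier split,
# so the rows line up with train_texts / valid_texts
train_sub, test_sub = train_test_split(subreddit_vect, random_state=17)
train_sub.shape, train_vectors.shape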
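
The cell above relies on `train_test_split` picking the same rows whenever the number of rows and `random_state` match. A small sketch with synthetic data (each row stores its own index, so the alignment is easy to check) illustrates this, along with the arguably safer option of splitting all arrays in one call.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

n_rows = 20
row_ids = np.arange(n_rows)
labels = row_ids % 2
# Synthetic sparse features: row i holds the single value i
sparse_feats = csr_matrix(row_ids.reshape(-1, 1))

# Two separate calls with the same random_state and the same number of rows
ids_train, ids_valid, lab_train, lab_valid = train_test_split(
    row_ids, labels, random_state=17)
sp_train, sp_valid = train_test_split(sparse_feats, random_state=17)

same_train_rows = np.array_equal(sp_train.toarray().ravel(), ids_train)
same_valid_rows = np.array_equal(sp_valid.toarray().ravel(), ids_valid)

# Alternative: one call over all arrays, which keeps them aligned by construction
a_tr, a_va, b_tr, b_va, s_tr, s_va = train_test_split(
    row_ids, labels, sparse_feats, random_state=17)
one_call_aligned = np.array_equal(s_tr.toarray().ravel(), a_tr)
```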

In [None]:
from scipy.sparse import hstack
# append the subreddit columns to the right of the comment Tf-Idf columns
X_train = hstack([train_vectors, train_sub])
X_test = hstack([test_vectors, test_sub])
X_train.shape, X_test.shape


In [None]:
# refit the same model (same C) on comment + subreddit features
logit.fit(X_train, y_train)
y_pred = logit.predict(X_test)
accuracy_score(y_valid, y_pred)
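
As an alternative to stacking matrices by hand, the two text columns can be combined inside one estimator. The following is a sketch under stated assumptions, not the approach used in this notebook: the toy rows are made up, the transformer names are arbitrary, and `CountVectorizer(binary=True)` with a whitespace-based `token_pattern` is one possible way to get a plain Bag of Words indicator per subreddit.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy rows with the same column names as train_df
toy_df = pd.DataFrame({
    "comment": [
        "oh great, another monday",
        "yeah, because that always works",
        "wow, what a totally original idea",
        "sure, that will definitely end well",
        "the meeting starts at noon",
        "i bought a new keyboard today",
        "the patch fixes the login bug",
        "it rained for most of the afternoon",
    ],
    "subreddit": ["politics", "politics", "AskReddit", "funny",
                  "AskReddit", "funny", "politics", "funny"],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

# A string (not a list) as the column selector passes a 1-D column to each vectorizer
features = ColumnTransformer([
    ("comment_tfidf", TfidfVectorizer(ngram_range=(1, 2)), "comment"),
    ("subreddit_bow", CountVectorizer(token_pattern=r"\S+", binary=True), "subreddit"),
])
model = Pipeline([
    ("features", features),
    ("logit", LogisticRegression(C=1, random_state=17)),
])
model.fit(toy_df[["comment", "subreddit"]], toy_df["label"])

new_row = pd.DataFrame({"comment": ["oh sure, that always works"],
                        "subreddit": ["politics"]})
pred = model.predict(new_row)

# One indicator column per distinct (lowercased) subreddit name
subreddit_vocab = model.named_steps["features"].named_transformers_["subreddit_bow"].vocabulary_
n_subreddit_features = len(subreddit_vocab)
```

With this layout, a subreddit that was not seen during fitting simply yields all-zero subreddit columns at prediction time.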

In [None]:
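# score one hand-written comment as if it had been posted in the 'politics' subreddit
test_sarcasm = ["Very impressive can't you see my excitement"]
test_sarcasm = vectorizer.transform(test_sarcasm)
subredd = sr_vectorizer.transform(['politics'])
test_sarcasm.shape, subredd.shape

# the column order must match X_train: comment features first, then subreddit features
test = hstack([test_sarcasm, subredd])
test.shape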

In [None]:
logit.predict(test)

In [None]:
# confusion matrix for the model trained on comment + subreddit features
sns.heatmap(confusion_matrix(y_valid, y_pred), annot=True, fmt='d', cmap='viridis')