<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm", which contains more than 1 million Reddit comments labeled as either sarcastic or not. A processed version is available as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [1]:
!ls ../input/sarcasm/

In [2]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt

In [3]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [4]:
train_df.head()

In [5]:
train_df.info()

Some comments are missing, so we drop the corresponding rows.

In [6]:
train_df.dropna(subset=['comment'], inplace=True)

We can see that the dataset is indeed balanced.

In [7]:
train_df['label'].value_counts()

We split data into training and validation parts.

In [8]:
train_texts, valid_texts, y_train, y_valid = train_test_split(train_df, train_df['label'], random_state=17)

# EDA

### 1. Looking at the dataset (head, info, describe, columns) 

For the subsequent analysis I will use only (train_texts, y_train).<br>
Let's look at the data by printing the first 10 rows and descriptive statistics.

#### 1.1 Let's look at the data by printing its head

In [9]:
train_texts.head(10)

#### 1.2 Descriptive statistics

In [10]:
train_texts.describe(include='all').T

#### 1.3 Columns

Since we dropped the rows with missing comments, the dataset has no missing values left.
* **label** - label 0/1 - non-sarcastic/sarcastic
* **comment** - the comment itself. Later we will predict which comments are sarcastic and which are not.
    * find the number of words in each sentence and compare length distributions for both sarcastic and non-sarcastic comments
    * score & length & label interaction
* **author** - author's nickname
    * find unique values 
    * find counts
* **subreddit** - name of the subreddit. A subreddit is a specific online community, and the posts associated with it, on the social media website Reddit. Subreddits are dedicated to a particular topic that people write about, and they’re denoted by /r/, followed by the subreddit’s name, e.g., /r/gaming.
    * find unique values
    * calculate the "sarcasticness" of the top subreddits
* **score** - the comment's score, which is simply the number of upvotes minus the number of downvotes.
* **ups** and **downs** - number of ups and downs respectively
    * look for the errors in the data (for example, min value for ups is -261, which is strange)
* **date** - year and month when the comment was written
* **created_utc** - date plus day, hours, minutes, seconds
    * is there a dependency between the day of the week and/or the hour of the day and sarcasm?
* **parent_comment**
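
One way to explore the day-of-week and hour-of-day idea from the list above is to parse `created_utc` and average the label per group. This is only a sketch on a tiny made-up frame: it assumes `created_utc` holds strings that `pd.to_datetime` can parse, which should be checked on the real data.

```python
import pandas as pd

# Toy stand-in for train_texts; the values are made up for illustration.
toy = pd.DataFrame({
    "created_utc": ["2016-10-16 23:55:23", "2016-11-01 00:24:10",
                    "2016-09-22 21:45:37", "2016-10-18 21:03:47"],
    "label": [0, 1, 1, 0],
})

# Assumption: created_utc is a parseable timestamp string.
created = pd.to_datetime(toy["created_utc"])
toy["weekday"] = created.dt.dayofweek   # Monday=0 ... Sunday=6
toy["hour"] = created.dt.hour

# Share of sarcastic comments per weekday and per hour of the day.
sarcasm_by_weekday = toy.groupby("weekday")["label"].mean()
sarcasm_by_hour = toy.groupby("hour")["label"].mean()
print(sarcasm_by_weekday)
print(sarcasm_by_hour)
```

On the full dataset, groups with very few comments will give noisy shares, so it may be worth looking at the group sizes as well.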

#### 1.4 Data instabilities

In [11]:
correct_score = (train_texts['score'] == train_texts['ups'] - train_texts['downs']).sum()
print(f"The score equals ups - downs for only {correct_score} of {train_texts.shape[0]} rows")

Data instabilities noticed:
* `score`, `ups` and `downs` are inconsistent: the score should equal ups - downs, but this does not hold for every row
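
To dig into this inconsistency, it can help to flag the offending rows instead of only counting the correct ones. The sketch below uses a small made-up frame with the same three column names. The negative-`ups` check is an extra sanity check suggested by the column notes, not something the dataset documentation prescribes.

```python
import pandas as pd

# Toy stand-in for train_texts[['score', 'ups', 'downs']]; values are made up.
toy = pd.DataFrame({
    "score": [2, -4, 3, 1],
    "ups": [2, -1, 3, 0],
    "downs": [0, 0, 0, 0],
})

# Rows where the documented relation score == ups - downs is violated.
mismatch = toy["score"] != toy["ups"] - toy["downs"]
# A count of upvotes should not be negative.
negative_ups = toy["ups"] < 0

print(f"Rows where score != ups - downs: {mismatch.sum()} of {len(toy)}")
print(f"Rows with negative ups: {negative_ups.sum()}")
print(toy[mismatch])
```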

#### 1.5 Dealing with missing data and outliers

This is currently out of scope.

### 2. Target variable analysis

In [12]:
sns.countplot(x='label', data=train_texts)

The training set is balanced, so accuracy is a reasonable metric here.
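
The link between class balance and accuracy can be illustrated with a constant baseline: on a balanced set, always predicting one class is right about half of the time, so an accuracy well above 0.5 reflects real signal. The snippet is a sketch on synthetic labels, and `DummyClassifier` is used here only as an illustrative baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic, perfectly balanced labels (50 of each class).
y = np.array([0, 1] * 50)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print(f"Constant-prediction accuracy: {baseline_acc}")
```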

### 3. Feature Analysis

#### 3.1 Looking at columns and determining feature types 

In [13]:
train_texts.info()

We have 10 columns: 1 label, 3 numerical (score, ups, downs) and 6 string/timestamp columns.

#### 3.2 Summarizing data and showing some statistics:

In [28]:
train_texts.describe(include=["object", "bool"])

Some thoughts:
* Almost all the comments are unique
* The number of unique authors is about one third of the number of comments
* Approximately 13 000 subreddits are in the dataset
* The dataset spans 96 months = 8 years of observations

In [32]:
for label, dataset in train_texts.groupby('label'):
    print(f"\nFor label {label} the object data statistics is:\n")
    print(dataset.describe(include=['object', 'bool']))

The separate statistics for labels 0 and 1 don't look noticeably different.

In [16]:
train_texts.describe()

#### 3.3 Analysis

##### 3.3.1 Comment

Let's add a new variable that holds the length of the comment in words.

In [33]:
train_texts['length'] = [len(comment.split()) for comment in train_texts['comment']]

In [34]:
train_texts[['comment', 'length']].head()

In [45]:
train_texts['length'].describe().T

In [56]:
print(f"0.025 quantile of comment's length is equal to {train_texts['length'].quantile(0.025)}")
print(f"0.975 quantile of comment's length is equal to {train_texts['length'].quantile(0.975)}")

It looks like the comment length has some outliers.<br>
For visualization purposes I'm going to filter out some of them. Whether this filtering also helps classification remains to be checked later.

In [111]:
cleaned_train_texts = train_texts[train_texts['length'] <= train_texts['length'].quantile(0.975)] 

Now we can compare the length distributions for 0/1 labels

In [69]:
g = sns.kdeplot(cleaned_train_texts[cleaned_train_texts["label"]==0]["length"], color="red", shade=True)
g = sns.kdeplot(cleaned_train_texts[cleaned_train_texts["label"]==1]["length"], color="blue", shade=True)
g = g.legend(["not sarcasm", "sarcasm"])

In [79]:
cleaned_train_texts[cleaned_train_texts["label"]==0]["length"].plot.hist(bins=50, alpha = 0.5, edgecolor="black", color="red")
cleaned_train_texts[cleaned_train_texts["label"]==1]["length"].plot.hist(bins=50, alpha = 0.5, edgecolor="black", color="blue")

The `length` feature may not be very informative on its own. But maybe splitting it into three bins, [1, 5), [5, 20) and [20, 30], will improve the accuracy score.
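
The binning idea can be sketched with `pd.cut`. The bin edges follow the intervals mentioned above. The labels `short`/`medium`/`long` are hypothetical names chosen for illustration, and the upper edge 31 with `right=False` is just a way to keep a length of 30 inside the last bin.

```python
import pandas as pd

# Toy comment lengths in words; values are made up for illustration.
lengths = pd.Series([1, 4, 5, 19, 20, 30], name="length")

# Left-closed bins: [1, 5), [5, 20), [20, 31). Lengths outside the edges become NaN.
length_bin = pd.cut(lengths, bins=[1, 5, 20, 31], right=False,
                    labels=["short", "medium", "long"])

# One-hot encode the bins so they can be fed to a linear model.
length_dummies = pd.get_dummies(length_bin, prefix="length")
print(length_bin.tolist())
print(length_dummies)
```

Whether these bins actually improve accuracy would still have to be verified on the validation set.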

##### 3.3.2 Score

In [89]:
cleaned_train_texts['score'].describe()

The `score` feature also has outliers, so we filter them out as well.

In [112]:
cleaned_train_texts = cleaned_train_texts[(cleaned_train_texts['score'] <= cleaned_train_texts['score'].quantile(0.975)) & 
                                  (cleaned_train_texts['score'] >= cleaned_train_texts['score'].quantile(0.025))] 

In [117]:
cleaned_train_texts[(cleaned_train_texts["label"]==0)]["score"].plot.hist(bins=20, alpha=0.5, edgecolor="black", color="red")
cleaned_train_texts[(cleaned_train_texts["label"]==1)]["score"].plot.hist(bins=20, alpha=0.5, edgecolor="black", color="blue")

The `score` feature also seems to have only a minor effect on the value of the `label` column.

##### 3.3.3 Subreddits
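
For the subreddit column, the "sarcasticness" of the top subreddits mentioned in the column notes could be estimated as the share of sarcastic comments per subreddit. This is a sketch on made-up rows: the threshold `min_comments` is a hypothetical parameter for dropping subreddits with too few comments to give a stable estimate.

```python
import pandas as pd

# Toy stand-in for train_texts[['subreddit', 'label']]; values are made up.
toy = pd.DataFrame({
    "subreddit": ["politics", "politics", "politics",
                  "AskReddit", "AskReddit", "gaming"],
    "label": [1, 1, 0, 0, 1, 0],
})

# Number of comments and share of sarcastic comments per subreddit.
stats = (toy.groupby("subreddit")["label"]
            .agg(["count", "mean"])
            .rename(columns={"count": "n_comments", "mean": "sarcasm_share"}))

min_comments = 2  # hypothetical threshold; tune it on the real data
top = (stats[stats["n_comments"] >= min_comments]
       .sort_values("sarcasm_share", ascending=False))
print(top)
```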

## Tasks:
1. Analyze the dataset, make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams that are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that)
4. (Optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.
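
As a possible starting point for tasks 2 and 3, here is a minimal sketch of a Tf-Idf + logistic regression pipeline fitted on a handful of made-up comments. The hyperparameters (`ngram_range`, `C`, `max_iter`) are assumptions to be tuned, not recommended values, and the coefficient ranking is a plain alternative to eli5.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up comments standing in for the real comment texts and labels.
texts = [
    "yeah right, that will totally work",
    "oh great, another monday",
    "sure, because that always ends well",
    "wow, what a totally original idea",
    "the match starts at nine tonight",
    "i fixed the bug in the parser",
    "this recipe needs two eggs",
    "the library closes at six",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Unigrams and bigrams weighted by Tf-Idf, followed by logistic regression.
tfidf_logit = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logit", LogisticRegression(C=1.0, max_iter=1000, random_state=17)),
])
tfidf_logit.fit(texts, labels)
preds = tfidf_logit.predict(texts)

# Map column indices back to n-grams and rank them by coefficient.
vocabulary = tfidf_logit.named_steps["tfidf"].vocabulary_
feature_names = sorted(vocabulary, key=vocabulary.get)
coefs = tfidf_logit.named_steps["logit"].coef_[0]
top_sarcastic = sorted(zip(coefs, feature_names), reverse=True)[:5]
print(top_sarcastic)
```

On the real data the pipeline would be fitted on the training comments and scored on the validation comments with `accuracy_score`.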

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions