In [None]:
%autosave 600

# Drunk Anteaters and Text Classification

This kernel demonstrates common techniques for analyzing text data and classifying documents. Examples include:
* Assigning categories to notes from call center reps
* Finding themes in medical records
* Identifying toxic comments on social media

The example here is based on [Quora Insincere Questions Classification](https://www.kaggle.com/c/quora-insincere-questions-classification). The goal is to identify questions that may be "insincere" (definition to follow).

We'll follow this approach to building a useful machine learning model.

<img src="https://i.imgur.com/M5eC2FT.png" width="700">


--
--

## Context
Quora is one of my favorite sites to visit. You can learn about useful things and also totally useless things. Of course, this is quite different from our objective here, which is to say whether or not a question is sincere. Here's an example: it sounds sincere, but is it useful? Not to me, since I don't interact with anteaters. I appreciate it just the same and love Quora for these types of questions!

<img src="https://s4.scoopwhoop.com/anj/cashkaro/27222808.png" width="600">


Back to the task at hand. From the Data tab of the competition, we have some explanation of what is insincere.
 
> An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:
> 
> * Has a non-neutral tone
> * Is disparaging or inflammatory
> * Isn't grounded in reality
> * Uses sexual content for shock value, and not to seek genuine answers


Identifying these characteristics for binary classification can be challenging. Be sure to look at past Kaggle competitions and academic papers for additional resources!


## Data Exploration
Let's first look at the data in tabular form.

In [None]:
import numpy as np 
import pandas as pd
pd.set_option('max_colwidth', 120)

numrows = None  # None means read all rows
train = pd.read_csv('train.csv', index_col=['qid'], nrows=numrows)
train.head()

All zeros? Let's count.

In [None]:
print(train.shape[0], "\n",
      train.target.value_counts(normalize=True))

There are many things we can look at for further exploration. Here are a couple of brief examples:

In [None]:
train['qlength'] = train.question_text.str.len()
train['why'] = train.question_text.str.startswith("Why")
train.head()

In [None]:
import hvplot.pandas

display(train.hvplot.kde('qlength', by='target', xlim=(0,500)),
        pd.crosstab(train.target, train.why, normalize='index'))

## Pre-process the Data and Generate Features
Turning text data into numerical data for consumption by a machine learning model is where the magic comes in. Lucky for us there are now packages that make it easy. 


#### Preprocessing
Most packages have built-in methods for cleaning data. You can specify options that remove punctuation, trim spacing, and remove common words (known as *stop words*). Since social media posts are often defined by these features, we'll leave them in.
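
As a rough sketch of what these options look like in scikit-learn's `CountVectorizer`, the snippet below compares the default settings with a vectorizer that also drops English stop words. The two sentences are made up for illustration and are not from the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only for illustration.
docs = ["Why is the sky blue?", "Is the ocean blue!"]

# Defaults: lowercase the text and keep tokens of 2+ word characters,
# which drops punctuation such as "?" and "!".
default_vec = CountVectorizer()
default_vec.fit(docs)
default_vocab = sorted(default_vec.vocabulary_)

# Same settings, but also remove scikit-learn's built-in English stop words.
stop_vec = CountVectorizer(stop_words="english")
stop_vec.fit(docs)
stop_vocab = sorted(stop_vec.vocabulary_)

print(default_vocab)
print(stop_vocab)
```

The vectorizers used in this kernel pass a custom `token_pattern` instead, so that question marks, exclamation points, and quotes are kept as tokens.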


#### [scikit-learn Count Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())

#### [scikit-learn Tf-Idf Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

Tf-idf stands for *Term Frequency, Inverse Document Frequency*. It weighs how often a word occurs in a document against how often the word occurs across all documents, so words that appear everywhere count for less.
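
To make the weighting concrete, here is a minimal sketch that rebuilds scikit-learn's default tf-idf by hand on three made-up sentences. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up sentences, used only for illustration.
docs = ["the cat sat", "the dog sat", "the cat ran"]

# Term frequency: raw counts of each word in each document.
counts = CountVectorizer().fit_transform(docs).toarray()

# Document frequency: number of documents that contain each word.
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default formula).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit length (L2 norm).
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare against the library's result.
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf, reference))
```

The word "the" appears in every document, so it gets the lowest idf, while rarer words such as "dog" and "ran" get the highest.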

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())

#### Word Embeddings
The approaches above are known as *Bag of Words* approaches because the order of words is not considered. Clearly this is a limitation. Newer approaches use *Word Embeddings*. Embeddings are made by mapping words and phrases from documents into vectors of numbers. The vectors are learned by models trained on large bodies of text. Here is an example of a word embedding known as *GloVe*.

In [None]:
pd.read_csv('../input/embeddings/glove.840B.300d/glove.840B.300d.txt', 
                   header=None, sep=' ', skiprows=2, nrows=5, index_col=[0])

These vectors are typically fed into a neural network built with Keras/TensorFlow or PyTorch. You can also use embeddings in standard machine learning methods such as logistic regression. The most accurate models usually come from word embeddings and deep learning.
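
One simple way to use embeddings with a standard model is to average the word vectors in each question and treat the average as the feature vector. The sketch below shows the idea with a hypothetical three-dimensional embedding table; the `toy_embeddings` values and the `embed_question` helper are invented for illustration, and real GloVe vectors have 300 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for this example.
toy_embeddings = {
    "drunk": np.array([0.9, 0.1, 0.0]),
    "anteaters": np.array([0.2, 0.8, 0.1]),
    "get": np.array([0.1, 0.1, 0.7]),
}

def embed_question(text, embeddings, dim=3):
    """Average the vectors of known words; return zeros if none are known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# "do" is not in the table, so only three words contribute to the average.
features = embed_question("Do anteaters get drunk", toy_embeddings)
print(features)
```

Stacking one such row per question gives a dense matrix that can be passed to `LogisticRegression` in place of the tf-idf matrix. Averaging discards word order, so this is a baseline and not a substitute for a sequence model.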


#### Other Approaches and Packages
* spaCy - many features like part-of-speech, word dependencies, others
* NLTK - Natural Language Tool Kit with many features
* sklearn Latent Dirichlet Allocation - an implementation of [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model)

## Feature Generation and Feature Selection
You can add metafeatures such as those seen above in EDA. Sometimes they help. For feature selection, we'll limit the words analyzed to the most common words as part of building the model.
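
As a small illustration of that kind of feature selection, the sketch below uses the `max_features` option of `TfidfVectorizer`, which keeps only the terms with the highest total counts across the corpus. The sentences are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences: "the" appears 3 times, "cat" twice, all others once.
docs = ["the cat sat", "the cat ran", "the dog"]

# Keep only the two most frequent terms in the corpus.
limited = TfidfVectorizer(max_features=2)
limited.fit(docs)
kept = sorted(limited.vocabulary_)

print(kept)
```

The model below applies the same option with a limit of 50,000 terms.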

## The Classification Model
Logistic Regression is a good choice when trying to balance speed and accuracy. This model uses Tf-idf features.

In [None]:
#%% get libraries and data
import numpy as np 
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

numrows = None
train = pd.read_csv('../input/train.csv', index_col=['qid'], nrows=numrows)
test = pd.read_csv('../input/test.csv', index_col=['qid'], nrows=numrows)
y = train.target.values
display(train.head())

In [None]:
%%time
#%% make word vectors
word_vectorizer = TfidfVectorizer(ngram_range=(1,2),
                                    min_df=3,
                                    max_df=0.9,
                                    token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'",
                                    max_features=50_000,  #basic feature selection
                                    strip_accents='unicode',
                                    use_idf=True,
                                    smooth_idf=True,
                                    sublinear_tf=True)

print("tokenizing")
word_vectorizer.fit(pd.concat((train['question_text'], test['question_text'])))
X = word_vectorizer.transform(train['question_text'])
X_test = word_vectorizer.transform(test['question_text'])

In [None]:
%%time
#%% split out validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=99, stratify=y)

# Logistic Regression
model = LogisticRegression(solver='saga', class_weight='balanced', 
                                C=0.5, max_iter=350, n_jobs=2, verbose=1) #seed not set
model.fit(X_train, y_train)
val_pred = model.predict_proba(X_val)
val_pred[0:5]

## Evaluate
The host wants the predictions in binary format rather than probabilities. The model's `predict` method does return binary labels, but it uses a fixed cutoff of 0.5 on the predicted probability. We can do better if we try a range of cutoffs and choose the one that scores best on the validation set.

The score used to evaluate results is the F1 score. It's known to be effective for data sets imbalanced by a dominant class, in this case the 0 class.
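
To see why accuracy alone can mislead on imbalanced data, here is a minimal sketch on made-up labels. It compares accuracy with F1 for a model that always predicts the dominant class, and then computes F1 by hand from precision and recall for a second set of predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 9 sincere questions (0) and 1 insincere question (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# A useless model that always predicts the dominant class.
y_always_zero = np.zeros_like(y_true)
accuracy = accuracy_score(y_true, y_always_zero)
f1_zero = f1_score(y_true, y_always_zero, zero_division=0)

# A model that catches the insincere question but raises two false alarms.
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(accuracy, f1_zero, f1_manual)
```

The always-zero model looks strong on accuracy but scores zero on F1, because F1 is the harmonic mean of precision and recall on the positive class.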

In [None]:
%%time
#%% find best threshold
def thresh_search(y_true, y_proba):
    best_thresh = 0
    best_score = 0
    for thresh in np.arange(0, 1, 0.1):
        score = f1_score(y_true, y_proba > thresh)
        if score > best_score:
            best_thresh = thresh
            best_score = score
        print(thresh, score)
    return best_thresh, best_score

thresh, best_score = thresh_search(y_val, val_pred[:, 1])
print("\n", thresh, best_score)

In [None]:
# submit
test_pred = model.predict_proba(X_test)[:,1]
sub = pd.read_csv('../input/sample_submission.csv', index_col=['qid'], nrows=numrows)
sub['prediction'] = (test_pred > thresh).astype(int)  # write 0/1 labels, not True/False
sub.to_csv("submission.csv")
sub.head()

That's it!