# Approaching (Almost) Any NLP Problem on Kaggle

In this post I'll talk about approaching natural language processing problems on Kaggle. As an example, we will use the data from this competition. We will create a very basic model first and then improve it using different features. We will also see how deep neural networks can be used, and end this post with some ideas about ensembling in general.

### This covers:
- tfidf 
- count features
- logistic regression
- naive bayes
- svm
- xgboost
- grid search
- word vectors
- LSTM
- GRU
- Ensembling

*NOTE*: This notebook is not meant for achieving a very high score on the Leaderboard for this dataset. However, if you follow it properly, you can get a very high score with some tuning. ;)

So, without wasting any time, let's start with importing some important python modules that I'll be using.

In [6]:
import pandas as pd
import numpy as np
import xgboost as xgb  # needed for the XGBClassifier cell further down
# from tqdm import tqdm
from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
# from keras.preprocessing import sequence, text
# from keras.callbacks import EarlyStopping
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Let's load the datasets

In [7]:
train = pd.read_csv('../Data/train.csv')
test = pd.read_csv('../Data/test.csv')
sample = pd.read_csv('../Data/sample_submission.csv')

A quick look at the data

In [8]:
train.head()

|   | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [9]:
test.head()

|   | id | text |
|---|---|---|
| 0 | id02310 | Still, as I urged our leaving Ireland with suc... |
| 1 | id24541 | If a fire wanted fanning, it could readily be ... |
| 2 | id00134 | And when they had broken down the frail door t... |
| 3 | id27757 | While I was thinking how I should possibly man... |
| 4 | id04081 | I am not sure to what limit his knowledge may ... |


In [10]:
sample.head()

|   | id | EAP | HPL | MWS |
|---|---|---|---|---|
| 0 | id02310 | 0.403494 | 0.287808 | 0.308698 |
| 1 | id24541 | 0.403494 | 0.287808 | 0.308698 |
| 2 | id00134 | 0.403494 | 0.287808 | 0.308698 |
| 3 | id27757 | 0.403494 | 0.287808 | 0.308698 |
| 4 | id04081 | 0.403494 | 0.287808 | 0.308698 |


The problem requires us to predict the author (EAP, HPL or MWS) given the text. In simpler words, this is text classification with 3 different classes.

For this particular problem, Kaggle has specified multi-class log-loss as the evaluation metric. This is implemented in the following way (taken from: https://github.com/dnouri/nolearn/blob/master/nolearn/lasagne/util.py)

In [11]:
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
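
To get a feel for the numbers this metric produces, here is a minimal sketch that uses scikit-learn's `log_loss` (not the function above) on made-up predictions. The labels and probabilities are purely illustrative. A model that gives every class a probability of 1/3 scores ln(3), which is roughly 1.0986, so any useful model on this 3-class problem should land below that.

```python
import numpy as np
from sklearn.metrics import log_loss

# Three samples, one per class (0=EAP, 1=HPL, 2=MWS); toy data, not from the competition.
y_true = np.array([0, 1, 2])

# A model that hedges equally between all three classes.
uniform = np.full((3, 3), 1.0 / 3.0)

# A confident model that is right on every sample.
confident = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90]])

uniform_loss = log_loss(y_true, uniform, labels=[0, 1, 2])
confident_loss = log_loss(y_true, confident, labels=[0, 1, 2])

print("uniform:   %0.4f" % uniform_loss)    # equals ln(3)
print("confident: %0.4f" % confident_loss)  # equals -ln(0.9)
```
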

We use the LabelEncoder from scikit-learn to convert text labels to integers: 0, 1, 2.

In [12]:
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(train.author.values)
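
As a small illustration on toy labels (not the competition data), `LabelEncoder` assigns integers in sorted order of the class names. That is convenient here, because the encoded order EAP, HPL, MWS is the same as the column order in the sample submission.

```python
from sklearn.preprocessing import LabelEncoder

# Toy author labels, only for illustration.
authors = ["EAP", "HPL", "MWS", "EAP", "MWS"]

enc = LabelEncoder()
encoded = enc.fit_transform(authors)

# classes_ holds the sorted label names; the position is the integer code.
class_names = list(enc.classes_)
print(class_names)       # ['EAP', 'HPL', 'MWS']
print(encoded.tolist())  # [0, 1, 2, 0, 2]

# inverse_transform maps integer codes back to the original labels.
decoded = list(enc.inverse_transform([2, 0]))
print(decoded)           # ['MWS', 'EAP']
```
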

Before going further it is important that we split the data into training and validation sets. We can do it using `train_test_split` from the `model_selection` module of scikit-learn.

In [13]:
xtrain, xvalid, ytrain, yvalid = train_test_split(train.text.values, y, 
                                                  stratify=y, 
                                                  random_state=42, 
                                                  test_size=0.1, shuffle=True)
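
The `stratify=y` argument keeps the class proportions the same in both parts of the split. A minimal sketch on synthetic labels (the 50/30/20 class sizes are an assumption made up for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 50 / 30 / 20 samples per class.
labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
texts = np.array(["doc %d" % i for i in range(len(labels))])

x_tr, x_va, y_tr, y_va = train_test_split(texts, labels,
                                          stratify=labels,
                                          random_state=42,
                                          test_size=0.1, shuffle=True)

# Both parts keep the 5:3:2 ratio of the full label array.
train_counts = np.bincount(y_tr).tolist()
valid_counts = np.bincount(y_va).tolist()
print(train_counts)
print(valid_counts)
```
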

In [14]:
xtest = test.text.values

In [15]:
print (xtrain.shape)
print (xtest.shape)
print (xvalid.shape)

(17621,)
(8392,)
(1958,)


## Building Basic Models

Let's start building our very first model. 

Our very first model is a simple TF-IDF (Term Frequency - Inverse Document Frequency) followed by a simple Logistic Regression.

In [5]:
# Always start with these features. They work (almost) every time!
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
                      stop_words='english')

# Fit TF-IDF on the training and validation texts (no labels are used; the test texts are only transformed)
tfv.fit(list(xtrain) + list(xvalid))
xtrain_tfv =  tfv.transform(xtrain) 
xtest_tfv = tfv.transform(xtest)
xvalid_tfv = tfv.transform(xvalid)
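
To see what `min_df`, `ngram_range` and the IDF weighting do, here is a minimal sketch on a three-sentence toy corpus. The sentences and the smaller settings (`ngram_range=(1, 2)`, `min_df=2`) are assumptions chosen so the result is easy to read; they are not the settings used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, only for illustration.
docs = ["the raven spoke once",
        "the raven spoke twice",
        "the old house stood silent"]

# Keep unigrams and bigrams that occur in at least two documents.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)
X = vec.fit_transform(docs)

# Terms that survived the min_df filter, in column order.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)

dense = X.toarray()
the_col = vec.vocabulary_["the"]
raven_col = vec.vocabulary_["raven"]

# "the" occurs in every document, so IDF gives it less weight than "raven".
print(dense[0, the_col], dense[0, raven_col])
```

Rare terms such as "once" or "silent" are dropped by `min_df=2`, which is the same mechanism that `min_df=3` uses above to discard n-grams seen in fewer than three sentences.
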


In [12]:
# Fitting a simple Logistic Regression on TFIDF
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))
# logloss: 0.626 

logloss: 0.626 


In [13]:
clf1_pred = clf.predict_proba(xtest_tfv)

In [14]:
clf1_pred

array([[ 0.33604103,  0.11832738,  0.54563159],
       [ 0.69641819,  0.2014162 ,  0.10216561],
       [ 0.42956329,  0.49145905,  0.07897767],
       ..., 
       [ 0.76259457,  0.12086463,  0.1165408 ],
       [ 0.29219795,  0.07010464,  0.63769741],
       [ 0.46859921,  0.48609867,  0.04530212]])

And there we go. We have our first model with a multiclass logloss of 0.626.

But we are greedy and want a better score. Let's look at the same model with different features.

Instead of using TF-IDF, we can also use word counts as features. This can be done easily using CountVectorizer from scikit-learn.

In [15]:
ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

# Fit the Count Vectorizer on the training and validation texts (the test texts are only transformed)
ctv.fit(list(xtrain) + list(xvalid))
xtrain_ctv =  ctv.transform(xtrain) 
xtest_ctv = ctv.transform(xtest)
xvalid_ctv = ctv.transform(xvalid)
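
For comparison with TF-IDF, a minimal sketch of what `CountVectorizer` produces on two toy phrases (made up for this example): plain integer counts per n-gram, with no re-weighting or normalization.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy phrases, only for illustration.
docs = ["never never more", "once more"]

cv = CountVectorizer(ngram_range=(1, 2))
counts = cv.fit_transform(docs)

# Column names in column order (built from vocabulary_ to stay version independent).
features = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
rows = counts.toarray().tolist()

print(features)
print(rows[0])  # "never" appears twice in the first phrase
print(rows[1])
```
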

In [16]:
# Fitting a simple Logistic Regression on Counts
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))
# logloss: 0.528 

logloss: 0.528 


In [17]:
clf2_pred = clf.predict_proba(xtest_ctv)

Aaaaanddddddd voilà! We just improved our first model by about 0.1 (from 0.626 to 0.528)!!!

Next, let's try a very simple model which was quite famous in ancient times - Naive Bayes.

Let's see what happens when we use naive bayes on these two datasets:

In [18]:
# Fitting a simple Naive Bayes on TFIDF
clf = MultinomialNB()
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))
# logloss: 0.578 

logloss: 0.578 


In [19]:
clf3_pred = clf.predict_proba(xtest_tfv)

Good performance! But the logistic regression on counts is still better! What happens when we use this model on counts data instead?

In [20]:
# Fitting a simple Naive Bayes on Counts
clf = MultinomialNB()
clf.fit(xtrain_ctv, ytrain)
clf_predictions = clf.predict_proba(xvalid_ctv)
print ("logloss: %0.3f " % multiclass_logloss(yvalid, clf_predictions))


logloss: 0.485 


In [21]:
clf4_pred = clf.predict_proba(xtest_ctv)


Whoa! Seems like old stuff still works well!!!! One more ancient algorithm on the list is the SVM. Some people "love" SVMs. So, we must try an SVM on this dataset.

Since SVMs take a lot of time, we will reduce the number of features from the TF-IDF using Singular Value Decomposition before applying SVM. 

Also, note that before applying SVMs, we *must* standardize the data.

In [22]:
# Apply SVD, I chose 120 components. 120-200 components are good enough for SVM model.
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv)
xtest_svd = svd.transform(xtest_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

# Scale the data obtained from SVD. Renaming variable to reuse without scaling.
scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xtest_svd_scl = scl.transform(xtest_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
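
The same two steps can also be chained in a scikit-learn pipeline. The sketch below uses a random sparse matrix as a stand-in for the TF-IDF features (the shapes, density and 20 components are assumptions for the example) and checks that the scaled training features end up with zero mean and unit variance. Passing `random_state` to `TruncatedSVD` is optional, but it makes the projection reproducible from run to run.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random sparse matrices standing in for TF-IDF features (illustrative only).
rng = np.random.RandomState(0)
X_train = sparse.random(200, 500, density=0.05, format="csr", random_state=rng)
X_valid = sparse.random(50, 500, density=0.05, format="csr", random_state=rng)

# SVD turns the sparse matrix into a small dense one; the scaler standardizes it.
reducer = make_pipeline(TruncatedSVD(n_components=20, random_state=0),
                        StandardScaler())

# Fit on the training part only, then apply the same transform to validation data.
Z_train = reducer.fit_transform(X_train)
Z_valid = reducer.transform(X_valid)

print(Z_train.shape, Z_valid.shape)
print(Z_train.mean(axis=0).round(6)[:3], Z_train.std(axis=0).round(6)[:3])
```
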

Now it's time to apply SVM. After running the following cell, feel free to go for a walk or talk to your girlfriend/boyfriend. :P

In [23]:
# Fitting a simple SVM
clf = SVC(C=1.0, probability=True) # since we need probabilities
clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))
# logloss: 0.733 in this run (an earlier run gave 0.739; the SVD above is unseeded, so the value varies)

logloss: 0.733 


In [24]:
svd1_pred = clf.predict_proba(xtest_svd_scl)

Oops! time to get up! Looks like SVM doesn't perform well on this data...! 

Before moving further, let's apply the most popular algorithm on Kaggle: xgboost!

In [25]:
# Fitting a simple xgboost on tf-idf
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_tfv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_tfv.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))
# logloss: 0.782 

logloss: 0.782 


In [26]:
xg1_pred = clf.predict_proba(xtest_tfv.tocsc())

Seems like no luck with XGBoost! But that conclusion is premature: I haven't done any hyperparameter optimization yet. And since I'm lazy, I'll just tell you how to do it and you can do it on your own! ;) This will be discussed in the next section:


## Grid Search

It's a technique for hyperparameter optimization. It is not the most efficient one, but it can give good results if you know the grid you want to use. I specify the parameters that should usually be used in this post: http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/ Please keep in mind that these are the parameters I usually use. There are many other methods of hyperparameter optimization which may or may not be as effective.

In this section, I'll talk about grid search using logistic regression. 

Before starting with grid search we need to create a scoring function. This is accomplished using the `make_scorer` function of scikit-learn.


In [27]:
# mll_scorer = metrics.make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True)
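
As an alternative to a hand-made scorer, scikit-learn also ships a built-in `"neg_log_loss"` scoring string. The sketch below is not the scorer from the cell above; it runs on synthetic data from `make_classification` (an assumption for the example) and only shows the sign convention: scores are negated losses, so higher (closer to zero) is better.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem, only for illustration.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# "neg_log_loss" calls predict_proba internally and returns minus the log loss.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="neg_log_loss", cv=3)

print(scores)            # one negative number per fold
print(-scores.mean())    # average log loss across folds
```
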

Next we need a pipeline. For demonstration here, I'll be using a pipeline consisting of SVD, scaling and then logistic regression. It's easier to understand with more modules in the pipeline than just one ;)

In [28]:
# # Initialize SVD
# svd = TruncatedSVD()
    
# # Initialize the standard scaler 
# scl = preprocessing.StandardScaler()

# # We will use logistic regression here..
# lr_model = LogisticRegression()

# # Create the pipeline 
# clf = pipeline.Pipeline([('svd', svd),
#                          ('scl', scl),
#                          ('lr', lr_model)])

Next we need a grid of parameters:

In [29]:
# param_grid = {'svd__n_components' : [120, 180],
#               'lr__C': [0.1, 1.0, 10], 
#               'lr__penalty': ['l1', 'l2']}

So, for SVD we evaluate 120 and 180 components and for logistic regression we evaluate three different values of C with l1 and l2 penalty. We can now start grid search on these parameters.
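
The size of the search follows directly from the grid: 2 values for the SVD times 3 values of C times 2 penalties gives 12 candidates, and with 2 folds that is 24 fits. A minimal sketch that enumerates them with scikit-learn's `ParameterGrid` (the `n_folds` variable is just a name used for this example):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'svd__n_components': [120, 180],
              'lr__C': [0.1, 1.0, 10],
              'lr__penalty': ['l1', 'l2']}

# ParameterGrid expands the dictionary into every combination of values.
candidates = list(ParameterGrid(param_grid))
n_folds = 2  # number of cross-validation folds assumed for the search

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # total number of model fits
print(candidates[0])              # one combination, as a dict
```
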

In [30]:
# # Initialize Grid Search Model
# model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
#                                  verbose=10, n_jobs=-1, iid=True, refit=True, cv=2)

# # Fit Grid Search Model
# model.fit(xtrain_tfv, ytrain)  # we can use the full data here but im only using xtrain
# print("Best score: %0.3f" % model.best_score_)
# print("Best parameters set:")
# best_parameters = model.best_estimator_.get_params()
# for param_name in sorted(param_grid.keys()):
#     print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 2 folds for each of 12 candidates, totalling 24 fits
[CV] lr__C=0.1, lr__penalty=l1, svd__n_components=120 ................
[CV] lr__C=0.1, lr__penalty=l1, svd__n_components=120 ................
[CV] lr__C=0.1, lr__penalty=l1, svd__n_components=180 ................
[CV] lr__C=0.1, lr__penalty=l1, svd__n_components=180 ................
[CV] lr__C=0.1, lr__penalty=l2, svd__n_components=120 ................
[CV] lr__C=0.1, lr__penalty=l2, svd__n_components=120 ................
[CV] lr__C=0.1, lr__penalty=l2, svd__n_components=180 ................
[CV] lr__C=0.1, lr__penalty=l2, svd__n_components=180 ................



The score comes out similar to what we had for the SVM. The same technique can be used to fine-tune xgboost or even multinomial naive bayes, as below. We will use the TF-IDF data here:

In [None]:
# nb_model = MultinomialNB()

# # Create the pipeline 
# clf = pipeline.Pipeline([('nb', nb_model)])

# # parameter grid
# param_grid = {'nb__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# # Initialize Grid Search Model
# model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
#                                  verbose=10, n_jobs=-1, iid=True, refit=True, cv=2)

# # Fit Grid Search Model
# model.fit(xtrain_tfv, ytrain)  # we can use the full data here but im only using xtrain. 
# print("Best score: %0.3f" % model.best_score_)
# print("Best parameters set:")
# best_parameters = model.best_estimator_.get_params()
# for param_name in sorted(param_grid.keys()):
#     print("\t%s: %r" % (param_name, best_parameters[param_name]))

This is an improvement of 8% over the original naive bayes score!
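
For readers who want something that runs in a few seconds, here is a minimal sketch of the same idea on a tiny synthetic corpus. Everything in it is an assumption made for illustration: the word lists, the 20 documents per class, the count features inside the pipeline and the built-in `"neg_log_loss"` scoring. It only shows the mechanics of searching over `alpha`, not the result on the competition data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Build a toy corpus: shared filler words plus a few class-specific words.
rng = np.random.RandomState(0)
shared = ["the", "and", "of", "night", "house"]
own = {0: ["raven", "pendulum", "usher"],
       1: ["cthulhu", "arkham", "eldritch"],
       2: ["creature", "geneva", "glacier"]}

docs, labels = [], []
for label, words in own.items():
    for _ in range(20):
        tokens = list(rng.choice(shared, size=4)) + list(rng.choice(words, size=2))
        docs.append(" ".join(tokens))
        labels.append(label)

# The vectorizer sits inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([("counts", CountVectorizer()),
                 ("nb", MultinomialNB())])

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
search = GridSearchCV(pipe, {"nb__alpha": alphas},
                      scoring="neg_log_loss", cv=2)
search.fit(docs, labels)

print("Best alpha:", search.best_params_["nb__alpha"])
print("Best log loss: %0.3f" % -search.best_score_)
```
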


### Ensemble and Save Models

In [32]:
print(clf1_pred.shape)
print(clf2_pred.shape)
print(clf3_pred.shape)
print(clf4_pred.shape)
print(svd1_pred.shape)
print(xg1_pred.shape)

(8392, 3)
(8392, 3)
(8392, 3)
(8392, 3)
(8392, 3)
(8392, 3)


In [44]:
def df_creator(predictions):
    # Wrap an (n_samples, 3) probability array in the submission format
    df = pd.DataFrame(predictions, columns=['EAP', 'HPL', 'MWS'])
    df['id'] = test['id']
    return df[['id', 'EAP', 'HPL', 'MWS']]

In [45]:
pred_dfs = []
for predictions in [clf1_pred, clf2_pred, clf3_pred, clf4_pred, svd1_pred, xg1_pred]:
    pred_dfs.append(df_creator(predictions))

In [58]:
for i, df in enumerate(pred_dfs, start=1):
    file_name = 'alg' + str(i) + '.csv'
    print(file_name)
    df.to_csv('../predictions/' + file_name, index=False)

alg1.csv
alg2.csv
alg3.csv
alg4.csv
alg5.csv
alg6.csv


In [48]:
all_simple_pred = pd.concat(pred_dfs)

In [49]:
all_simple_pred.shape

(50352, 4)

In [50]:
all_simple_pred.head()

|   | id | EAP | HPL | MWS |
|---|---|---|---|---|
| 0 | id02310 | 0.336041 | 0.118327 | 0.545632 |
| 1 | id24541 | 0.696418 | 0.201416 | 0.102166 |
| 2 | id00134 | 0.429563 | 0.491459 | 0.078978 |
| 3 | id27757 | 0.612087 | 0.302487 | 0.085426 |
| 4 | id04081 | 0.718934 | 0.155472 | 0.125595 |


In [51]:
by_id_index = all_simple_pred.groupby('id', as_index=False)
simple_en_pred = by_id_index.mean()
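
What the concat-then-groupby step computes is a plain average of the per-model probabilities for each id. A minimal sketch with two made-up models and two hypothetical ids shows this, and also that `groupby` returns rows sorted by id rather than in the original test order. Reordering is optional, since a submission file is matched on the `id` column.

```python
import numpy as np
import pandas as pd

# Two hypothetical models predicting for the same two ids (toy numbers).
ids = ["id_b", "id_a"]
model_1 = pd.DataFrame({"id": ids, "EAP": [0.6, 0.2], "HPL": [0.3, 0.5], "MWS": [0.1, 0.3]})
model_2 = pd.DataFrame({"id": ids, "EAP": [0.4, 0.4], "HPL": [0.3, 0.1], "MWS": [0.3, 0.5]})

# Stack all predictions, then average the rows that share an id.
stacked = pd.concat([model_1, model_2])
blend = stacked.groupby("id", as_index=False).mean()

# groupby sorts by id; restore the original order if it matters.
blend_in_test_order = blend.set_index("id").loc[ids].reset_index()

cols = ["EAP", "HPL", "MWS"]
print(blend)
print(blend_in_test_order)
print(blend[cols].sum(axis=1).tolist())  # averaged probabilities still sum to 1
```
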

In [52]:
simple_en_pred.shape

(8392, 4)

In [53]:
simple_en_pred.head()

|   | id | EAP | HPL | MWS |
|---|---|---|---|---|
| 0 | id00008 | 0.870216 | 0.068997 | 0.060787 |
| 1 | id00011 | 0.142247 | 0.834253 | 0.023501 |
| 2 | id00013 | 0.160619 | 0.761866 | 0.077515 |
| 3 | id00018 | 0.150658 | 0.617991 | 0.231351 |
| 4 | id00020 | 0.102637 | 0.855016 | 0.042347 |


In [55]:
simple_en_pred.to_csv('../predictions/simple_en_pred.csv', index=False)