## Project description:
Excerpts from the works of three horror authors (Edgar Allan Poe, Mary Shelley, and H. P. Lovecraft) are collected, and the goal is to build a supervised learning model that classifies each excerpt by author with acceptable accuracy.

The dataset was downloaded from Kaggle.

## Workflow:
To begin with, we audit the data to see what it looks like and to check its size and dimensionality. Then we preprocess the text, removing punctuation and stop words. Next, we try the TF-IDF vectorizer with several ML algorithms. Finally, we switch to the count vectorizer and apply similar algorithms, to compare which vectorizer and which algorithm fit the documents better.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import xgboost as xgb
from tqdm import tqdm
import sklearn
from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import lightgbm as lgb 
from sklearn.feature_selection import SelectKBest, chi2, f_regression 
nltk.download('averaged_perceptron_tagger')
stop_words = stopwords.words('english')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\30523\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Read and audit the data

In [2]:
train = pd.read_csv('train.csv')

In [3]:
train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [4]:
train.shape

(19579, 3)

### Define the metrics for model performance

In [5]:
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
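
As a quick check of what this metric computes, the sketch below works through a three-sample case by hand. The labels and probabilities are made-up toy values, not taken from the dataset. Because `multiclass_logloss` uses each label as a column index, it expects integer class indices; string labels such as `EAP` would first need to be mapped to integers, which is done here with `np.unique` (one possible choice, assumed for illustration).

```python
import numpy as np

# Hypothetical toy data: three samples, three classes.
labels = np.array(["EAP", "HPL", "MWS"])
classes, actual = np.unique(labels, return_inverse=True)  # string labels to 0, 1, 2
predicted = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])

# One-hot encode, clip, and average the negative log-probability of the true class.
one_hot = np.eye(len(classes))[actual]
clipped = np.clip(predicted, 1e-15, 1 - 1e-15)
loss = -np.sum(one_hot * np.log(clipped)) / len(actual)
print(round(loss, 4))  # about 0.3635
```

Only the probability assigned to the true class of each row contributes to the sum, so confident wrong predictions are penalised heavily.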

### Preprocess the text data
1. Lowercase
2. Remove Punctuation
3. Tokenize
4. Stopwords filtering
5. Lemmatization

In [6]:
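def preprocess(text):
    # 1. Lowercase
    text = text.lower()

    # 2. Remove punctuation character by character
    text_j = "".join([char for char in text if char not in string.punctuation])

    # 3. Tokenize
    words = word_tokenize(text_j)

    # 4. Filter stop words (a set keeps each membership test cheap)
    stop_set = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_set]

    # 5. Lemmatize (WordNetLemmatizer treats every word as a noun by default)
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in filtered_words]

    return lemmatized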
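
To see what the first four steps do to a sentence without needing the NLTK data files, here is a minimal standard-library sketch. It is not the notebook's function: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop word set is deliberately tiny, and a whitespace split stands in for `word_tokenize`.

```python
import string

# Hypothetical, deliberately tiny stop word set for illustration only;
# the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "a", "was", "it", "of"}

def toy_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = text.split()  # stand-in for nltk.word_tokenize
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

sample = "It was the raven's shadow; a half-seen thing!"
tokens = toy_preprocess(sample)
print(tokens)  # ['ravens', 'shadow', 'halfseen', 'thing']
```

Note that stripping punctuation before tokenizing fuses possessives and hyphenated words (`raven's` becomes `ravens`, `half-seen` becomes `halfseen`), which may or may not be desirable for author attribution.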

In [7]:
def listToString(s):
    # join the tokens back into a single space-separated string
    separator = " "
    return separator.join(s)

In [8]:
train['text']=train['text'].apply(preprocess).apply(listToString)

In [9]:
train.head()

Unnamed: 0,id,text,author
0,id26305,process however afforded mean ascertaining dim...,EAP
1,id17569,never occurred fumbling might mere mistake,HPL
2,id11008,left hand gold snuff box capered hill cutting ...,EAP
3,id27763,lovely spring looked windsor terrace sixteen f...,MWS
4,id12958,finding nothing else even gold superintendent ...,HPL


### Split the training set into a training set and a validation set

In [10]:
X=train.text.values
y=train.author

In [11]:
xtrain, xvalid, ytrain, yvalid = train_test_split(X, y, random_state=7, 
                                                  test_size=0.2)
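
The split above is purely random. If the three authors are not equally represented, passing `stratify` keeps the class proportions the same in both parts. The sketch below illustrates this on made-up labels (`toy_X` and `toy_y` are hypothetical stand-ins, not the real data); whether stratification changes the results on this dataset has not been checked here.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels standing in for the author column.
toy_y = ["EAP"] * 50 + ["HPL"] * 30 + ["MWS"] * 20
toy_X = [f"excerpt {i}" for i in range(len(toy_y))]

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=7, stratify=toy_y
)

# The validation part keeps the 50/30/20 ratio of the full label list.
valid_counts = Counter(y_va)
print(valid_counts)
```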

### Try TF-IDF vectorizer

In [12]:
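# Boolean flags are passed as True rather than 1; newer scikit-learn versions validate parameter types
tf_v = TfidfVectorizer(min_df=3, max_features=None,
            strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
            stop_words='english')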

In [13]:
tf_v.fit(xtrain) 
xtrain_tf_v =  tf_v.transform(xtrain) 
xvalid_tf_v = tf_v.transform(xvalid)

In [14]:
xtrain_tf_v.shape

(15663, 11706)

In [15]:
xvalid_tf_v.shape

(3916, 11706)
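
Both matrices have 11706 columns because the vocabulary is learned from the training split only and then reused for the validation split. The toy sketch below shows the same pattern on made-up sentences (all `toy_*` names are hypothetical): terms that never occur in the training text are simply dropped at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sentences, not taken from the dataset.
toy_train = [
    "raven tapped chamber door",
    "monster fled across ice",
    "old house stood hill",
]
toy_valid = ["raven stood door", "cosmic dread beyond stars"]

toy_vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
toy_train_m = toy_vec.fit_transform(toy_train)  # vocabulary is learned here only
toy_valid_m = toy_vec.transform(toy_valid)      # reuses that vocabulary

print(toy_train_m.shape, toy_valid_m.shape)  # same number of columns
print(toy_valid_m[1].nnz)  # 0: the second sentence shares no terms with training
```

Fitting on the training split alone keeps information from the validation excerpts out of the features.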

#### Try different ML algorithms: Naive Bayes, XGB classifier, LGBM classifier

In [16]:
# Naive Bayes
NB_alg = MultinomialNB()
NB_alg.fit(xtrain_tf_v, ytrain)
predictions = NB_alg.predict_proba(xvalid_tf_v)

sklearn.metrics.log_loss(yvalid,predictions)

0.5889268766461253

In [17]:
# XGB classifier
xgb_alg = xgb.XGBClassifier(max_depth=50, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
xgb_alg.fit(xtrain_tf_v, ytrain)
predictions = xgb_alg.predict_proba(xvalid_tf_v)

sklearn.metrics.log_loss(yvalid,predictions)

0.6408480979593377
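
Depending on the installed XGBoost version, `XGBClassifier.fit` may reject string labels such as `EAP` and expect integers `0..n_classes-1` instead. A minimal sketch of one way to handle this with scikit-learn's `LabelEncoder` follows; the `toy_*` lists are hypothetical stand-ins for `ytrain` and `yvalid`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for ytrain / yvalid.
toy_ytrain = ["EAP", "MWS", "HPL", "EAP"]
toy_yvalid = ["HPL", "EAP"]

encoder = LabelEncoder()
ytrain_enc = encoder.fit_transform(toy_ytrain)  # classes are sorted alphabetically
yvalid_enc = encoder.transform(toy_yvalid)      # same mapping reused for validation

print(list(encoder.classes_))  # ['EAP', 'HPL', 'MWS']
print(list(ytrain_enc), list(yvalid_enc))  # [0, 2, 1, 0] [1, 0]
```

The encoded order matches the alphabetical class order scikit-learn estimators use, so the columns of `predict_proba` line up with `encoder.classes_`.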

In [18]:
# Light GBM

gbm_alg = lgb.LGBMClassifier(objective='multiclass',
                             num_class=3,
                             metric='multi_logloss',
                             max_depth=50,
                             num_leaves=30,
                             learning_rate=0.1,
                             n_estimators=1000)
gbm_alg.fit(xtrain_tf_v, ytrain,
            eval_metric='multi_logloss')
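predictions = gbm_alg.predict_proba(xvalid_tf_v)

# Score the LightGBM probabilities. The cell originally assigned them to
# `prediction`, so log_loss re-scored the XGBoost output and reported
# 0.6408480979593377 again; re-run this cell to get the LightGBM loss.
sklearn.metrics.log_loss(yvalid,predictions)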
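
Because every cell reuses one variable name for the predicted probabilities, it is easy to score a stale array by accident. One way to keep each model paired with its own probabilities is a small helper that fits and scores inside a loop. The sketch below is an assumption about how this could be organised, not the notebook's code: `score_models` and the `toy_*` data are hypothetical, and logistic regression is included only as a second example model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

def score_models(models, X_tr, y_tr, X_va, y_va):
    """Fit each model and return {name: validation log loss}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        proba = model.predict_proba(X_va)  # local to this iteration
        scores[name] = log_loss(y_va, proba, labels=model.classes_)
    return scores

# Hypothetical toy data so the sketch runs on its own.
toy_text = ["raven door", "raven chamber", "monster ice", "monster creature",
            "cosmic horror", "cosmic elder"]
toy_author = ["EAP", "EAP", "MWS", "MWS", "HPL", "HPL"]
X_toy = CountVectorizer().fit_transform(toy_text)

toy_scores = score_models(
    {"naive_bayes": MultinomialNB(), "logreg": LogisticRegression(max_iter=1000)},
    X_toy, toy_author,
    X_toy, toy_author,  # scored on the training rows, for illustration only
)
print(toy_scores)
```

In the notebook, the same helper could be called with `xtrain_tf_v`, `ytrain`, `xvalid_tf_v`, and `yvalid`.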

### Count Vectorizer

In [19]:
ct_v = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

In [20]:
ct_v.fit(xtrain)
xtrain_ct_v =  ct_v.transform(xtrain) 
xvalid_ct_v = ct_v.transform(xvalid)

In [21]:
xtrain_ct_v.shape

(15663, 317772)
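
The count matrix has 317772 columns against 11706 for TF-IDF. The most likely reason is that the TF-IDF vectorizer above was built with `min_df=3`, while this count vectorizer keeps the default `min_df=1` and therefore retains every n-gram that occurs even once. The toy sketch below (made-up sentences, hypothetical variable names) shows how strongly `min_df` prunes a (1, 3)-gram vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the dataset.
toy_docs = [
    "dark night dark house",
    "dark night cold wind",
    "dark night old house",
]

full_vocab = CountVectorizer(ngram_range=(1, 3)).fit(toy_docs)
pruned_vocab = CountVectorizer(ngram_range=(1, 3), min_df=3).fit(toy_docs)

print(len(full_vocab.vocabulary_), len(pruned_vocab.vocabulary_))  # 19 3
print(sorted(pruned_vocab.vocabulary_))  # ['dark', 'dark night', 'night']
```

Since the two vectorizers differ in `min_df` as well as in weighting, the comparison between them is not purely TF-IDF versus raw counts.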

#### Reduction of dimensionality
Since the number of columns is huge, we create a second training set with reduced dimensionality so that the XGBoost algorithm can run in reasonable time (otherwise training would take far too long). There are several ways to do this, such as SVD; here I use Select K Best with the chi-squared score.

In [22]:
selector = SelectKBest(chi2, k = 10000)
xtrain_ct_v2=selector.fit_transform(xtrain_ct_v, ytrain)
xvalid_ct_v2=selector.transform(xvalid_ct_v)

#### Try different ML algorithms: Naive Bayes, XGB classifier

In [23]:
# Naive Bayes
NB_alg = MultinomialNB()
NB_alg.fit(xtrain_ct_v, ytrain)
predictions = NB_alg.predict_proba(xvalid_ct_v)

sklearn.metrics.log_loss(yvalid,predictions)

0.4736483840975025

In [24]:
# XGB classifier
xgb_alg = xgb.XGBClassifier(max_depth=50, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
xgb_alg.fit(xtrain_ct_v2, ytrain)
predictions = xgb_alg.predict_proba(xvalid_ct_v2)

sklearn.metrics.log_loss(yvalid,predictions)

0.5916264465747549

## Thus, Naive Bayes with the Count Vectorizer is the best combination so far (validation log loss ≈ 0.474)