##### Problem:
In this year's Halloween playground competition, you're challenged to predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and H. P. Lovecraft.

# Import packages

In [1]:
import re
import pandas as pd
import string
import multiprocessing
from nltk.corpus import stopwords
from flashtext.keyword import KeywordProcessor
from sklearn.model_selection import train_test_split
from sklearn import metrics
import nltk
# libraries for dataset preparation, feature engineering, model training 
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

#import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\c5250435\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Import Dataset

In [2]:
df=pd.read_csv('train.csv',encoding='utf-8')
df.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


## Check the different values available for the column 'author'

In [3]:
df['author'].value_counts()

EAP    7900
MWS    6044
HPL    5635
Name: author, dtype: int64

# Preprocessing

In [4]:
# Clean a text with a sequence of regex substitutions
def regex_filtering(text):
    # empty or missing input falls through and returns None
    if text:
        # remove e-mail header lines (sender, to, subject, ...)
        text = re.sub(r"^(sender|to|copy|from|sent|subject|date|cc|e|von|datum|an|importance|bcc):.*$", " ", text, flags=re.M)
        # remove e-mail addresses
        text = re.sub(r"\S*@\S*\s?", " ", text)
        # remove links (single backslashes, so \. and \S act as regex escapes)
        text = re.sub(r"((https?|ftp|file)://|www\.)\S+", ' ', text, flags=re.MULTILINE)
        # remove dotted tokens such as file names ("name.ext")
        text = re.sub(r"\w*\.\w{1,4}", '', text, flags=re.MULTILINE)
        # replace runs of non-word characters with a space
        text = re.sub(r"([^a-zA-Z0-9\\u00C0-\\u00FF@]|[Ã£Ã¢])+", ' ', text)
        # remove words that contain digits
        text = re.sub(r'\w*\d\w*', ' ', text)
        # remove single characters
        text = re.sub(r'\b\S{1}\s+', ' ', text)
        # remove words made of one repeated character (e.g. "aa", "ooo")
        text = re.sub(r'\b(\w)\1{1,}\s+', ' ', text)
        # remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # collapse runs of whitespace
        text = re.sub(r"\s\s+", ' ', text)
        # remove immediately repeated words
        text = re.sub(r"(\w+\s+)\1{1,}", ' ', text)
        # strip leading and trailing whitespace
        text = text.strip()
        return text

In [5]:
# Remove stopwords (English, German and a few custom terms) from a text
predefined_stopwords = 'horror perfectly'
stopwords_list = (predefined_stopwords.split(' ')
                  + stopwords.words("english")
                  + stopwords.words("german"))

# Build the keyword processor once; every stopword is replaced by a space
keyword_processor_stopwords = KeywordProcessor()
for each in stopwords_list:
    keyword_processor_stopwords.add_keyword(each, ' ')

def tokenize_term(x):
    sentence = keyword_processor_stopwords.replace_keywords(x)
    return sentence.strip()

In [6]:
#Combined function
def preprocess(text):
    return regex_filtering(tokenize_term(text))

## We'll check the preprocessing steps for a single text

In [7]:
x=df['text'][0]
print('Before processing: '+x+'\n')
y=preprocess(x)
print('After processing: '+y)

Before processing: This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.

After processing: process however afforded means ascertaining dimensions dungeon might make circuit return point whence set without aware fact uniform seemed wall
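To see what individual cleaning steps contribute, the sketch below applies a few of them one at a time. The sample sentence and the `www.example.com` address are made up for illustration and are not taken from the dataset. The link pattern uses single backslashes so that `\S` matches any non-whitespace character.

```python
import re
import string

# Made-up sample text (an assumption for illustration, not from train.csv)
sample = "Visit www.example.com now!!  It cost 3 shillings, I heard."

# Step 1: drop links starting with a scheme or "www."
no_links = re.sub(r"((https?|ftp|file)://|www\.)\S+", " ", sample)

# Step 2: drop words that contain digits
no_digits = re.sub(r"\w*\d\w*", " ", no_links)

# Step 3: delete punctuation characters
no_punct = no_digits.translate(str.maketrans("", "", string.punctuation))

# Step 4: collapse whitespace runs and trim the ends
cleaned = re.sub(r"\s+", " ", no_punct).strip()

print(cleaned)
```

Running the steps separately like this makes it easier to spot which substitution removes a given token.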


### Now we'll remove all the null and empty texts from dataframe

In [8]:
print('Before removing null/empty records from data, size of dataframe is {}'.format(df.shape))
df=df.loc[~df['text'].isnull()]
df=df.loc[df['text']!='']
df=df.loc[~df['author'].isnull()]
df=df.loc[df['author']!='']
print('After removing null/empty records from data, size of dataframe is {}'.format(df.shape))

Before removing null/empty records from data, size of dataframe is (19579, 3)
After removing null/empty records from data, size of dataframe is (19579, 3)


#### As can be seen, there were no null or empty records present

In [9]:
df['text_preprocessed']=df['text'].apply(preprocess)
df

Unnamed: 0,id,text,author,text_preprocessed
0,id26305,"This process, however, afforded me no means of...",EAP,process however afforded means ascertaining di...
1,id17569,It never once occurred to me that the fumbling...,HPL,never occurred fumbling might mere mistake
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,left hand gold snuff box capered hill cutting ...
3,id27763,How lovely is spring As we looked from Windsor...,MWS,lovely spring looked Windsor Terrace sixteen f...
4,id12958,"Finding nothing else, not even gold, the Super...",HPL,Finding nothing else even gold Superintendent ...
5,id22965,"A youth passed in solitude, my best years spen...",MWS,youth passed solitude best years spent gentle ...
6,id09674,"The astronomer, perhaps, at this point, took r...",EAP,astronomer perhaps point took refuge suggestio...
7,id13515,The surcingle hung in ribands from my body.,EAP,surcingle hung ribands body
8,id19322,I knew that you could not say to yourself 'ste...,EAP,knew could say stereotomy without brought thin...
9,id00912,I confess that neither the structure of langua...,MWS,confess neither structure languages code gover...


## Now we have to remove the empty processed text rows 

In [10]:
print('Before removing null/empty preprocessed texts from data, size of dataframe is {}'.format(df.shape))
df=df.loc[~df['text_preprocessed'].isnull()]
df=df.loc[df['text_preprocessed']!='']
print('After removing null/empty preprocessed texts from data, size of dataframe is {}'.format(df.shape))

Before removing null/empty preprocessed texts from data, size of dataframe is (19579, 4)
After removing null/empty preprocessed texts from data, size of dataframe is (19572, 4)


#### Seven records were removed because they were null or empty after preprocessing

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19572 entries, 0 to 19578
Data columns (total 4 columns):
id                   19572 non-null object
text                 19572 non-null object
author               19572 non-null object
text_preprocessed    19572 non-null object
dtypes: object(4)
memory usage: 764.5+ KB


In [69]:
texts=df['text_preprocessed'].values
labels=df['author'].values
data = list(labels + '\t' + texts)
print('Total count of unique labels : {} \n Authors are : {}'.format(len(set(labels)),df['author'].value_counts()))

Total count of unique labels : 3 
 Authors are : EAP    7895
MWS    6043
HPL    5634
Name: author, dtype: int64


### Splitting Data into train and test set

In [70]:
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.1, random_state=0)
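Because the three authors are not equally represented, one optional variation is to pass `stratify=` so that the test set keeps roughly the same author proportions as the full data. The sketch below uses toy labels with a 40/30/30 mix, which is an assumption for illustration and not the real class counts. It is not the split used for the results reported in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data with a 40/30/30 class mix (illustrative only)
toy_labels = ['EAP'] * 40 + ['MWS'] * 30 + ['HPL'] * 30
toy_texts = ['text {}'.format(i) for i in range(len(toy_labels))]

# stratify=toy_labels asks for the same class proportions in both parts
tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.1, random_state=0, stratify=toy_labels)

print(Counter(te_labels))
```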

## Generate TF-IDF vectorizers with different parameters

In [81]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(texts)
X_train_tfidf =  tfidf_vect.transform(X_train)
X_test_tfidf =  tfidf_vect.transform(X_test)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(texts)
X_train_tfidf_ngram =  tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram =  tfidf_vect_ngram.transform(X_test)

# character level tf-idf (token_pattern is only used by the 'word' analyzer, so it is omitted here)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram_chars.fit(texts)
X_train_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_train) 
X_test_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_test) 
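The cells above fit each vectorizer on `texts`, that is, on all rows including the ones later used for testing. A common alternative is to fit on the training portion only, so that the vocabulary and IDF weights never see held-out text. The sketch below shows the idea on a tiny made-up corpus. The documents and the `*_demo` names are assumptions for illustration, and the accuracies reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus (illustrative only)
train_docs = ["the raven croaked at midnight",
              "the monster fled north",
              "strange cyclopean ruins"]
test_docs = ["the monster saw eldritch ruins"]

vect_demo = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')

# Learn vocabulary and IDF weights from the training documents only
X_train_demo = vect_demo.fit_transform(train_docs)

# Words unseen during fit ("saw", "eldritch") are simply ignored here
X_test_demo = vect_demo.transform(test_docs)

print(X_train_demo.shape, X_test_demo.shape)
```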

### Create a helper function that trains a classifier and returns its accuracy

In [97]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    
    # neural nets return class probabilities; take the most probable class
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    
    # compare with the held-out labels; y_test is read from the notebook's global scope
    return metrics.accuracy_score(y_test, predictions)
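The helper above reads `y_test` from the notebook's global scope. A variant that takes the validation labels as an argument is easier to reuse with other splits. The sketch below is one way to write it; `train_and_score` and `MajorityClassifier` are hypothetical names introduced here, and the toy classifier only stands in for any object with `fit` and `predict` methods.

```python
def train_and_score(classifier, feature_vector_train, label_train,
                    feature_vector_valid, label_valid):
    """Fit the classifier, predict on the validation features, return accuracy."""
    classifier.fit(feature_vector_train, label_train)
    predictions = classifier.predict(feature_vector_valid)
    # accuracy = share of predictions that equal the true label
    correct = sum(p == t for p, t in zip(predictions, label_valid))
    return correct / len(label_valid)


class MajorityClassifier:
    """Hypothetical stand-in that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_] * len(X)


score = train_and_score(MajorityClassifier(),
                        [[0], [1], [2]], ['EAP', 'EAP', 'HPL'],
                        [[3], [4]], ['EAP', 'MWS'])
print(score)
```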

### Let's try Naive-Bayes classifier first 

We fit scikit-learn's Naive Bayes implementation on each of the three feature sets.

Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. A Naive Bayes classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.
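The independence assumption means a class score is just the log prior plus one log-likelihood term per word. The sketch below implements this for word counts with add-one smoothing on a two-document toy corpus. The helper names and the data are assumptions for illustration. It is a simplified stand-in, not scikit-learn's `MultinomialNB`, which in this notebook is fed TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    vocab = {w for words in docs for w in words}
    return class_counts, word_counts, vocab

def predict_nb(words, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in words:
            if w in vocab:
                # each word contributes independently (add-one smoothing)
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

toy_docs = [["raven", "midnight", "raven"], ["cyclopean", "ruins", "eldritch"]]
toy_labels = ["EAP", "HPL"]
model = fit_nb(toy_docs, toy_labels)
prediction = predict_nb(["raven", "ruins", "raven"], *model)
print(prediction)
```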

In [98]:
# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print("Naive Bayes, WordLevel TF-IDF: {}".format(accuracy))

# Naive Bayes on Ngram Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print("Naive Bayes,  N-Gram Vectors: {}".format(accuracy))

# Naive Bayes on Character Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print("Naive Bayes,CharLevel Vectors: {}".format(accuracy))

Naive Bayes, WordLevel TF-IDF: 0.7957099080694586
Naive Bayes,  N-Gram Vectors: 0.5715015321756894
Naive Bayes,CharLevel Vectors: 0.7017364657814096


### Now, we will be trying out linear classifiers with all three types of tf-idf vectorizers

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function. 
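As a small illustration of the probability step, the sketch below turns linear scores into probabilities with the sigmoid (two classes) and its multi-class generalisation, the softmax. The score values are hypothetical. How scikit-learn's `LogisticRegression` handles the three authors (one-vs-rest or multinomial) depends on the library version and settings.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map one score per class to probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

authors = ["EAP", "HPL", "MWS"]
toy_scores = [2.0, 0.5, -1.0]  # hypothetical linear scores, one per author
probs = softmax(toy_scores)
best = authors[probs.index(max(probs))]
print(best, probs)
```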

In [90]:
# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), X_train_tfidf, y_train, X_test_tfidf)
print ("Logistic Regression, WordLevel TF-IDF: "+ str(accuracy))

# Linear Classifier on Ngram Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print("Logistic Regression, N-Gram Vectors: "+ str(accuracy))

# Linear Classifier on Character Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print("Logistic Regression, CharLevel Vectors: "+ str(accuracy))

Logistic Regression, WordLevel TF-IDF: 0.7916241062308478
Logistic Regression, N-Gram Vectors: 0.5704800817160368
Logistic Regression, CharLevel Vectors: 0.7328907048008172


### Let us try SVC now

Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression. The model looks for the hyperplane that best separates two classes.
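Once a separating hyperplane is known, classifying a point only requires checking which side of it the point falls on. The sketch below shows that decision rule in two dimensions. The weights `w`, the offset `b` and the points are hypothetical values chosen for illustration, and no training is performed. With more than two classes, `svm.SVC` combines several such binary decisions.

```python
def decision_function(w, b, x):
    """Signed score of point x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical hyperplane: the line x1 - x2 = 0
w = [1.0, -1.0]
b = 0.0

points = {"A": [2.0, 0.5], "B": [0.5, 2.0]}

# The sign of the score decides the predicted class (+1 or -1)
sides = {name: (1 if decision_function(w, b, x) >= 0 else -1)
         for name, x in points.items()}
print(sides)
```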

In [96]:
# SVM on Word Level TF IDF Vectors
accuracy = train_model(svm.SVC(), X_train_tfidf, y_train, X_test_tfidf)
print("SVM, WordLevel TF-IDF: {}".format(accuracy))

SVM, WordLevel TF-IDF: 0.4187946884576098


### Using Random Forest 
Random Forest models are ensemble models of the bagging type: many decision trees are trained on random resamples of the data and their predictions are combined. They belong to the tree-based model family.
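The two ingredients of bagging can be sketched in a few lines: draw a bootstrap sample (same size as the data, with replacement) for each tree, then combine the trees' predictions. The sketch below uses a simple majority vote and hypothetical per-tree predictions. It is a simplification and does not reproduce how `RandomForestClassifier` builds or combines its trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the label predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))               # stand-in for row indices of the training set
sample = bootstrap_sample(rows, rng)  # some rows repeat, some are left out

# Hypothetical predictions from five trees for one excerpt
tree_votes = ["EAP", "HPL", "EAP", "MWS", "EAP"]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```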

In [94]:
# RF on Word Level TF IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print("Random Forest, WordLevel TF-IDF: {}".format(accuracy))

Random Forest, WordLevel TF-IDF: 0.6726251276813074
