# Feature engineering on strings

In [3]:
import pandas as pd
import string
import operator
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegressionCV
import warnings
import gensim
from gensim.models import FastText
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
import gensim.corpora as corpora


In [2]:
data = pd.read_csv("train.csv",header=None,low_memory=False)
data_test = pd.read_csv("test.csv",header=None,low_memory=False)

In [4]:
sentences = data[1][1:]
labels = data[2][1:]
sentences_test = []
translator = str.maketrans('', '', string.punctuation)

# Some analysis on the data

$\textbf{Brief description of the data}$
 - Number of sentences (documents) in the whole corpus $\approx$ 1.3 million
 - Number of words $\approx$ 16 million
 - Number of unique words = 97500

$\textbf{Task description}$

Given a question (a string) of variable length, predict if that question is sincere or not.

In [13]:
vec = CountVectorizer().fit(sentences)
bag_of_words = vec.transform(sentences)
sum_words = bag_of_words.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

$\textbf{Most frequent 10 words in the dataset.}$

In [14]:
for word, freq in words_freq[:10]:
    print("Word " + "'\033[1m"+str(word) +"\033[0m'" + " appears " + str(freq) + " times.")

Word 'the' appears 665950 times.
Word 'what' appears 471294 times.
Word 'is' appears 443182 times.
Word 'to' appears 408009 times.
Word 'in' appears 378153 times.
Word 'of' appears 333515 times.
Word 'how' appears 290405 times.
Word 'and' appears 257922 times.
Word 'do' appears 253252 times.
Word 'are' appears 243038 times.


In [7]:
frequencies = np.array(words_freq)
frequencies = [int(x) for x in frequencies[:,1]]
distinct_words = int(len(vec.vocabulary_.items())/2)
print("Total number of words in the dataset are " + 
      "\033[1m"+str(np.sum(frequencies)) +"\033[0m"+ " of which " + 
      "\033[1m"+str(distinct_words)+"\033[0m" + " are distinct.")

Total number of words in the dataset are 15999712 of which 97500 are distinct.


$\textbf{10 of the least frequent words in the dataset (found only once in the corpus).}$

In [8]:
for word, freq in words_freq[-97500:-97490]:
    print("Word " + "\033[1m'"+str(word) +"'\033[0m" + " appears one time.")

Word 'uniersity' appears one time.
Word 'wheath' appears one time.
Word 'subjested' appears one time.
Word 'notafcation' appears one time.
Word 'faulds' appears one time.
Word 'abody' appears one time.
Word '15260' appears one time.
Word 'localbitcoins' appears one time.
Word 'issil' appears one time.
Word 'c720' appears one time.


In [9]:
sincere_question_count = (bag_of_words[0].todense())
sincere_question_words = np.argwhere(sincere_question_count>=1)
for word in sincere_question_words:
    print("("+vec.get_feature_names()[word[1]] + " " 
          + str(sincere_question_count[0,word[1]])+")", end="  ")
print("\n" + sentences[1]+"\n")
lbls = labels.values
insincere_example = np.argwhere(lbls == '1')[1,0]
insincere_question_count = (bag_of_words[insincere_example].todense())
insincere_question_words = np.argwhere(insincere_question_count>=1)
for word in insincere_question_words:
    print("("+vec.get_feature_names()[word[1]] + " " 
          + str(insincere_question_count[0,word[1]]) + ")", end="  ")
print("\n" + sentences[insincere_example+1])

(1960s 1)  (as 1)  (did 1)  (how 1)  (in 1)  (nation 1)  (nationalists 1)  (province 1)  (quebec 1)  (see 1)  (the 1)  (their 1)  
How did Quebec nationalists see their province as a nation in the 1960s?

(are 1)  (babies 3)  (dark 1)  (light 1)  (more 1)  (or 1)  (parents 1)  (skin 2)  (sweeter 1)  (their 1)  (to 1)  (which 1)  
Which babies are more sweeter to their parents? Dark skin babies or light skin babies?


# Training the model with bags of words

In [10]:
Y = data[2][1:]
Y = Y.values
vectorizer = CountVectorizer(min_df=5)
sentences = data[1][1:]
X = vectorizer.fit_transform(list(sentences))

$\textbf{Additional useful parameters for CountVectorizer:}$
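 - lowercase = True (the default): convert all characters to lowercase before tokenizing.
 - analyzer = ‘char_wb’: turn words into character n-grams taken inside word boundaries (the word "apple" contains the 3-grams \["app", "ppl", "ple"\]; n-grams at the edges of a word are padded with a space).
 - max_df/min_df = maximum/minimum document frequency, i.e. the number (or proportion) of documents (sentences in our case) that contain a word (or gram), for it to be kept in the vocabulary.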
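To make the document-frequency filter concrete, here is a minimal from-scratch sketch of a bag of words with `min_df`/`max_df` thresholds. It is not scikit-learn's implementation: the helper names (`build_vocabulary`, `bag_of_words`) and the toy sentences are assumptions made for illustration, tokenization is a plain whitespace split, and `min_df` is taken as an absolute count while `max_df` is taken as a proportion.

```python
from collections import Counter

def build_vocabulary(documents, min_df=1, max_df=1.0):
    """Keep words whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # set() so that a word is counted at most once per document
        doc_freq.update(set(doc.lower().split()))
    kept = sorted(w for w, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)
    return {word: idx for idx, word in enumerate(kept)}

def bag_of_words(document, vocabulary):
    """Count vector of one document over a fixed vocabulary."""
    counts = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            counts[vocabulary[word]] += 1
    return counts

toy_docs = [
    "how do cats sleep",
    "how do dogs sleep",
    "do dogs dream",
]
toy_vocab = build_vocabulary(toy_docs, min_df=2)
print(toy_vocab)                                      # {'do': 0, 'dogs': 1, 'how': 2, 'sleep': 3}
print(bag_of_words("do dogs chase dogs", toy_vocab))  # [1, 2, 0, 0]
```

Words seen in a single toy document ("cats", "dream") are dropped by `min_df=2`, which is the same effect `min_df=5` has on the rare misspellings listed earlier.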

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20)
LR_model = LogisticRegressionCV(cv=5,random_state=0, solver='lbfgs').fit(X_train, y_train)
preds = LR_model.predict(X_test)
preds = [int(x) for x in preds]
y_test = [int(x) for x in y_test]

In [12]:
print("F1 Score on LR for bags of words: " + str(f1_score(y_test,preds)))

F1 Score on LR for bags of words: 0.5400183290056515


# Term frequency-inverse document frequency (TF-IDF)

### Motivation

Instead of using the raw frequency of a word (that is, the number of times it appears) in a given document (in our case, a sentence), we are going to reweight the counts so that words that are extremely frequent throughout the whole set of documents, like "the", "what", "is" and so on (shown above to have very high frequencies), are taken less into account, as they do not tell much about the nature of a given question.
### $$tfidf(t,d,D) = tf(t,d)\cdot idf(t,D)$$
The above function multiplies the frequency of a word $t$ in the current document $d$ (a sentence in our case) by the inverse document frequency of that word over the whole corpus $D$. The two functions, $tf$ and $idf$, are defined as:
### $$idf(t,D)=\log\frac{N}{|\{d\in D : t\in d\}|}$$ 
with
 - $N$: number of documents in the corpus.
 - $|\{d \in D : t \in d\}|$: number of documents in the corpus in which $t$ appears.

and, for the term-frequency function,
### $$tf(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d}:t'\in d\}}$$
This is an augmented frequency: the raw frequency $f_{t,d}$ of the word in the given document, divided by the maximum frequency of any word in that document and rescaled to lie between 0.5 and 1. The adjustment is meant to prevent a bias towards longer documents; it matters little in our case, since every document is a single question and all of them are roughly the same size. Note that `TfidfVectorizer`, used below, implements a different variant by default (raw counts for $tf$, a smoothed $idf$, and L2-normalised rows), so its values will not match these formulas exactly.
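The formulas above can be evaluated by hand on a tiny corpus. The sketch below implements them literally in plain Python; the function names and the three toy questions are assumptions made for illustration, and the numbers will differ from what scikit-learn's `TfidfVectorizer` produces because that class uses a different tf-idf variant by default.

```python
import math
from collections import Counter

def augmented_tf(term, doc_tokens):
    """0.5 + 0.5 * f(t, d) / max frequency of any term in d."""
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, corpus_tokens):
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    return augmented_tf(term, doc_tokens) * idf(term, corpus_tokens)

toy_corpus = [
    "what is the capital of france".split(),
    "what is the best pizza".split(),
    "why is the sky blue".split(),
]

# "the" occurs in every document, so idf = log(3/3) = 0 and the score vanishes.
print(tfidf("the", toy_corpus[1], toy_corpus))    # 0.0
# "pizza" occurs in one document only, so it keeps a high score (log 3).
print(tfidf("pizza", toy_corpus[1], toy_corpus))
```

This is the behaviour the motivation asks for: a word shared by all questions contributes nothing, while a word specific to one question stands out.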

In [13]:
tfidf_vec = TfidfVectorizer(min_df = 5)
X = tfidf_vec.fit_transform(list(sentences))

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20)
LR_model = LogisticRegressionCV(cv=5,random_state=0, solver='lbfgs').fit(X_train, y_train)
preds = LR_model.predict(X_test)
preds = [int(x) for x in preds]
y_test = [int(x) for x in y_test]

In [15]:
print("F1 Score on LR for TF-IDF: " + str(f1_score(y_test,preds)))

F1 Score on LR for TF-IDF: 0.5530106434991914


# Word2Vec embeddings

### Skip gram and continuous bags of words

Summary of the Word2Vec methods:

$\textbf{Vocabulary}$
 - decide on a vocabulary size (select words from the corpus based on their number of appearances)
 - assign a unique index to each word in the vocabulary and, with those indices, create one-hot vectors $x=[0,0,...,1, ... 0]$ 
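The two vocabulary steps can be sketched in a few lines of plain Python. The names `build_index` and `one_hot`, the `min_count` threshold and the toy sentences are assumptions made for illustration; gensim builds its vocabulary internally and never materialises one-hot vectors.

```python
from collections import Counter

def build_index(tokenized_sentences, min_count=1):
    """Map each sufficiently frequent word to a unique integer index."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    # most frequent words first; ties broken alphabetically for determinism
    words = sorted((w for w, c in counts.items() if c >= min_count),
                   key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(words)}

def one_hot(word, word_to_index):
    """Vector of zeros with a single 1 at the index of the word."""
    vec = [0] * len(word_to_index)
    vec[word_to_index[word]] = 1
    return vec

toy_sentences = [["a", "b", "c"], ["a", "c", "d"], ["a", "b"]]
toy_index = build_index(toy_sentences, min_count=2)
print(toy_index)                # {'a': 0, 'b': 1, 'c': 2}
print(one_hot("b", toy_index))  # [0, 1, 0]
```

The word "d" appears only once and is therefore left out of the vocabulary, which is what `min_count` does in the gensim calls further down.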


<img src="imgs/window.png"  style="width: 600px;">

$\textbf{Skip gram}$
 - Try predicting a context (words frequently found nearby), given a single word input.
 - Define a window size $n$ (for example, 5, as in the image above), and pick the words that lie within $n$ positions of the chosen word in a sentence (to its left and right).
 - Train a fully connected neural network with one hidden layer to predict the words found close to the input word. (For the sentence "a b c d e", pick 'c' as the input and "a b d e" as the output.)
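The training examples for skip-gram can be generated with a sliding window. The sketch below is only an illustration of that step: the function name `skipgram_pairs` is an assumption, and refinements that real implementations apply on top of this (such as subsampling of frequent words) are left out.

```python
def skipgram_pairs(tokens, window):
    """Return (input word, context word) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

toy_tokens = "a b c d e".split()
pairs = skipgram_pairs(toy_tokens, window=2)
c_pairs = [p for p in pairs if p[0] == "c"]
print(c_pairs)  # [('c', 'a'), ('c', 'b'), ('c', 'd'), ('c', 'e')]
```

With a window of 2, the input 'c' is paired with "a b d e", matching the example in the list above; words near the edges of the sentence simply get fewer pairs.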

<img src="imgs/SKIP_GRAM.png"  style="width: 400px;">

$\textbf{Continuous bags of words (CBOW)}$

 - similar to skip-gram, but "flip" the training inputs and outputs, i.e. predict a target output given context (words found close to it)
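For comparison, a minimal sketch of how CBOW examples could be built from the same window: the whole context becomes the input and the middle word becomes the target. The function name `cbow_examples` is an assumption made for illustration.

```python
def cbow_examples(tokens, window):
    """Return (context words, target word) examples for every position."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            examples.append((context, target))
    return examples

toy_tokens = "a b c d e".split()
examples = cbow_examples(toy_tokens, window=2)
print(examples[2])  # (['a', 'b', 'd', 'e'], 'c')
```

Each position yields a single example here, whereas skip-gram yields one example per (word, neighbour) pair.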

<img src="imgs/CBOW.png"  style="width: 500px;">

$\textbf{Gensim implementation:}$

$\textbf{Edit the sentences}$ (get rid of the punctuation, lower-case the strings and split them into words)

In [4]:
all_words = []
sentences = data[1][1:]
translator = str.maketrans('', '', string.punctuation)
embedding_sentences = []
for sentence in sentences:
    a = sentence.translate(translator).lower().split()
    embedding_sentences.append(a)
    all_words.extend(a)

$\textbf{Parameters used in our model:}$
 - size = 300: the size of the embedding of each word in the vocabulary
 - window = 5: window of context used, as described in the CBOW/Skip-gram models above
 - min_count = 5: minimum frequency for a word to be taken into account in the dictionary
 - workers = 10: number of worker threads used to train in parallel on multi-core machines
 - sg = 1: not used here, but will be used for the skip-gram model (it just tells the model to train using skip-gram method)

$\textbf{Train a CBOW model}$

In [17]:
model = gensim.models.Word2Vec(
        embedding_sentences,
        size=300,
        window=5,
        min_count=5,
        seed=1,
        workers=10)
model.train(embedding_sentences, total_examples=len(embedding_sentences), epochs=10)

(120006542, 167042420)

$\textbf{Save the model}$

In [18]:
model.save("CBOW_model")

$\textbf{Reload it}$

In [19]:
loaded_gbow_model  = gensim.models.Word2Vec.load("CBOW_model")

$\textbf{Get similar words for some chosen words}$ 

Note that with these models a word has to be in the dictionary for its vector to be available. This problem will be dealt with in the FastText method.

In [20]:
print(loaded_gbow_model.wv.similar_by_word("woman",5))
print(loaded_gbow_model.wv.similar_by_word("king",5))
print(loaded_gbow_model.wv.similar_by_word("dog",5))

[('girl', 0.7663631439208984), ('man', 0.7301400303840637), ('women', 0.6900008916854858), ('lady', 0.6801614761352539), ('person', 0.6645864248275757)]
[('emperor', 0.5211741924285889), ('kings', 0.5113060474395752), ('queen', 0.4816433787345886), ('solomon', 0.4810170531272888), ('monarch', 0.4801141023635864)]
[('puppy', 0.7459352016448975), ('kitten', 0.7062520980834961), ('hamster', 0.6700388789176941), ('cat', 0.6575609445571899), ('dogs', 0.6490512490272522)]


$\textbf{Train a Skip-gram model}$

In [21]:
model = gensim.models.Word2Vec(
        embedding_sentences,
        size=300,
        window=5,
        seed=1,
        sg=1,
        min_count=5,
        workers=10)
model.train(embedding_sentences, total_examples=len(embedding_sentences), epochs=10)

(120005737, 167042420)

$\textbf{Save the model}$

In [22]:
model.save("SG_model")

$\textbf{Reload it}$

In [23]:
load_sg_model = gensim.models.Word2Vec.load("SG_model")

$\textbf{Get similar words for some chosen words}$ 

In [24]:
print(load_sg_model.wv.similar_by_word("woman",5))
print(load_sg_model.wv.similar_by_word("king",5))
print(load_sg_model.wv.similar_by_word("dog",5))

[('man', 0.7393679618835449), ('women', 0.6713191270828247), ('girl', 0.6492231488227844), ('35yearold', 0.5798226594924927), ('milf', 0.5780065059661865)]
[('tut', 0.5243228673934937), ('queen', 0.5077412128448486), ('kings', 0.5044434666633606), ('vikramaditya', 0.480135440826416), ('hrh', 0.4775933623313904)]
[('puppy', 0.6731647253036499), ('dogs', 0.6642792224884033), ('cats', 0.5948817729949951), ('kitten', 0.5760080218315125), ('husky', 0.563368022441864)]


### FastText
FastText is an extension of the Word2Vec models. The idea behind FastText is to represent each word through its character n-grams instead of treating it as a single unit. An n-gram is a group of consecutive letters taken from the word (e.g., the 3-grams of "apple" are "app", "ppl", "ple"), and the final embedding of the word is the sum of the vectors of all its n-grams.

What is great about this method is that we can extract a context (meaning) vector even for words that do not exist at all in the dictionary we created.
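A short sketch shows why an unseen word can still get a vector: it shares many character n-grams with words that were seen. The boundary markers `<` and `>` and the n-gram lengths 3 to 6 follow the FastText paper; the function name `char_ngrams` is an assumption, and this is not gensim's internal code.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

# With the markers, "apple" also yields the edge n-grams "<ap" and "le>".
print(sorted(char_ngrams("apple", 3, 3)))  # ['<ap', 'app', 'le>', 'ple', 'ppl']

seen = char_ngrams("gastroenterology")     # imagine this word is in the vocabulary
unseen = char_ngrams("gastroenteritis")    # and this one is not
shared = seen & unseen
print(len(shared), "n-grams in common, for example", sorted(shared)[:5])
```

Because the vector of the unseen word is built from n-gram vectors learned on other words, its nearest neighbours tend to be words that share a long prefix or suffix with it.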

In [25]:
model_ted = FastText(
    embedding_sentences, 
    size=300, 
    window=5, 
    min_count=5, 
    workers=10,
    sg=1)

In [26]:
model_ted.save("FT_model")

In [27]:
load_ft_model = gensim.models.FastText.load("FT_model")

In [28]:
print(load_ft_model.wv.similar_by_word("woman",5))
print(load_ft_model.wv.similar_by_word("king",5))
print(load_ft_model.wv.similar_by_word("dog",5))

[('girlwoman', 0.887147068977356), ('womanman', 0.88193678855896), ('manwoman', 0.8607190847396851), ('womanly', 0.8196729421615601), ('woman’s', 0.7845233678817749)]
[('kingpin', 0.8398256301879883), ('king’s', 0.8095637559890747), ('kingqueen', 0.7933975458145142), ('kingsley', 0.7807847857475281), ('kingsguard', 0.7293356657028198)]
[('dogcat', 0.7769787311553955), ('dogs', 0.76622474193573), ('dog’s', 0.7232863903045654), ('dogg', 0.7129536271095276), ('puppy', 0.7121647596359253)]


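Verify that "Gastroenteritis" is not present in the vocabulary. (The vocabulary built by `CountVectorizer` is lowercased, so this check only tells us that the capitalised token is absent.)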

In [29]:
word_fq_array = np.array(words_freq)
unique_words = dict.fromkeys(word_fq_array[:,0],1)
if "Gastroenteritis" not in unique_words:
    print("The word is not in the document")

The word is not in the document


$\textbf{Get top 10 similar words to a word that is not found in the vocabulary}$

Get words of similar meaning according to the model. Notice how the FastText model still produces decent similar words for "Gastroenteritis", even though this exact token was not in the training vocabulary.

In [30]:
print(load_ft_model.wv.most_similar("Gastroenteritis"))

[('gastroenterologist', 0.6894631385803223), ('gastroenterology', 0.6843836307525635), ('dnb', 0.6726697087287903), ('orthopedics', 0.6438664197921753), ('orthopedic', 0.6283782720565796), ('pediatrics', 0.6272794008255005), ('dmc', 0.6225731372833252), ('ludhiana', 0.6222456693649292), ('pgi', 0.6205636262893677), ('visakhapatnam', 0.6173926591873169)]


$\textbf{Train a LR model with cross-validation using a fraction of the data*.}$

*Since both converting the words into word vectors and the actual training are computationally expensive, training was done, after splitting, on a fraction of the data containing all the insincere questions and 4 times that number of sincere ones. Each sentence is represented by the average of the embeddings of its words.
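Both ideas in this note, the averaged sentence vector and the reduced training set, fit in a few lines. The sketch below uses a hypothetical two-dimensional embedding table and made-up helper names (`sentence_vector`, `downsample_majority`); the notebook's own function, shown next, does the same with the FastText vectors.

```python
def sentence_vector(tokens, embeddings):
    """Mean of the vectors of the known tokens, or None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

def downsample_majority(labels, ratio=4):
    """Indices of all positive examples plus at most ratio * n_positive negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return pos + neg[:ratio * len(pos)]

# Hypothetical toy embeddings, only for illustration.
toy_embeddings = {"good": [1.0, 0.0], "dog": [0.0, 1.0], "cat": [0.0, 3.0]}

print(sentence_vector(["good", "dog", "zzz"], toy_embeddings))  # [0.5, 0.5]
print(sentence_vector(["zzz"], toy_embeddings))                 # None
print(downsample_majority([1, 0, 0, 0, 0, 0, 0]))               # [0, 1, 2, 3, 4]
```

A sentence with no known word has no vector at all, so its label has to be dropped as well when scoring.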

In [31]:
def train_lr_tf_model():
    i = 0
    Y = data[2][1:]
    Y = Y.values
    X = embedding_sentences
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.05)
    X_train = np.array(X_train)
    insincere = X_train[y_train == '1'][:]
    sincere = X_train [y_train == '0'][:]
    a = []
    y = []
    for sent in insincere:
        word_vectors = [] 
        for single_word in sent:
            if single_word in load_ft_model.wv.vocab: 
                word_vectors.append(load_ft_model.wv.word_vec(single_word))
        word_vectors = np.array(word_vectors)
        if len(word_vectors) != 0:                
            if word_vectors.shape[1] == 300:
                a.append(np.mean(word_vectors,axis=0))
                y.append(1)
    for sent in sincere[:64674*4]:
        word_vectors = []  
        for single_word in sent:
            if single_word in load_ft_model.wv.vocab: 
                word_vectors.append(load_ft_model.wv.word_vec(single_word))
        word_vectors = np.array(word_vectors)
        if len(word_vectors) != 0:
            if word_vectors.shape[1] == 300:

                a.append(np.mean(word_vectors,axis=0))
                y.append(0)
    a = np.array(a)
    y = np.array(y)
    LR_model = LogisticRegressionCV(cv=5,random_state=0, solver='lbfgs').fit(a, y)
    x_test = []
    indexes = []
    for index, sent in enumerate(X_test):
        word_vectors = []     
        for single_word in sent:
            if single_word in load_ft_model.wv.vocab: 
                word_vectors.append(load_ft_model.wv.word_vec(single_word))
        word_vectors = np.array(word_vectors)
        if len(word_vectors) != 0:   
            if word_vectors.shape[1] == 300:
                x_test.append(np.mean(word_vectors,axis=0))
            else:
                indexes.append(index)
        else:
            # no known word in this sentence: remember its position so its label is dropped too
            indexes.append(index)
    
    preds = LR_model.predict(x_test)
    preds = [int(x) for x in preds]
    y_test = [int(x) for x in y_test]
    y_test = np.array(y_test)
    if len(indexes) != 0:
        y_test = np.delete(y_test,indexes)
    print("F1 Score on LR for Fast text sentence embeddings: " + str(f1_score(y_test,preds)))
    return LR_model


In [32]:
train_lr_tf_model()

F1 Score on LR for Fast text sentence embeddings: 0.5380184331797235


LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

### Global Vectors for Word Representation (GloVe)

 - $X$ is the matrix of word co-occurrences, in which $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$. 
 - $X_{i} = \sum_{k} X_{ik}$ is the total number of times any word $k$ occurs in the context of word $i$.
 - Finally, $P_{ij}=P(j|i)=X_{ij}/X_i$ is the probability that word $j$ appears in the context of word $i$.
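The three definitions above can be checked on a toy corpus. This sketch counts plain co-occurrences inside a symmetric window; the function names, the window size and the three toy sentences are assumptions made for illustration, and no distance weighting is applied.

```python
from collections import defaultdict

def cooccurrence_counts(tokenized_sentences, window):
    """X[i][j] = number of times word j occurs within `window` positions of word i."""
    X = defaultdict(lambda: defaultdict(float))
    for tokens in tokenized_sentences:
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for other in range(lo, hi):
                if other != pos:
                    X[word][tokens[other]] += 1.0
    return X

def conditional_probability(X, i, j):
    """P(j | i) = X_ij / X_i, with X_i the sum of row i."""
    X_i = sum(X[i].values())
    return X[i].get(j, 0.0) / X_i

toy = [["ice", "is", "solid"], ["steam", "is", "gas"], ["ice", "is", "cold"]]
X = cooccurrence_counts(toy, window=2)
print(conditional_probability(X, "ice", "solid"))  # 0.25
print(conditional_probability(X, "ice", "gas"))    # 0.0
```

Here the row of "ice" sums to $X_{ice}=4$ ("is" twice, "solid" and "cold" once each), so $P(solid|ice)=1/4$.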

| Probability and Ratio | $k = solid$ | $k = gas$ | $k = water$ | $k = fashion$ |
|-----------------------|-------------|-----------|-------------|---------------|
| $P(k \mid ice)$ | $1.9\times 10^{-4}$ | $6.6\times 10^{-5}$ | $3.0\times 10^{-3}$ | $1.7\times 10^{-5}$ |
| $P(k \mid steam)$ | $2.2\times 10^{-5}$ | $7.8\times 10^{-4}$ | $2.2\times 10^{-3}$ | $1.8\times 10^{-5}$ |
| $P(k \mid ice)/P(k \mid steam)$ | $8.9$ | $8.5\times 10^{-2}$ | $1.36$ | $0.96$ |



Notice how the probability ratio for various probe words $k$ shows how two words $i$ and $j$ (here, ice and steam) relate to each other:
 - $k$ related to ice but not to steam (solid): large ratio
 - $k$ related to steam but not to ice (gas): small ratio
 - $k$ related to both (water) or to neither (fashion): ratio close to 1

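GloVe learns the word vectors by minimizing the cost function $J$ below. Here $V$ is the number of words in the whole vocabulary, $w_i$ and $\tilde{w}_j$ are the word and context vectors with biases $b_i$ and $\tilde{b}_j$, and the weighting function $f$ gives rare co-occurrences a small weight while capping the weight of very frequent ones at 1, so that pairs involving very common words (the link words that are also a problem for the Word2Vec approaches) do not dominate the sum.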

### $$J=\sum_{i,j=1}^{V} f(X_{ij})(w_i^T\tilde{w}_j+b_i+\tilde{b}_j-\log X_{ij})^2$$

### $$f(x)= \begin{cases} (x/x_{max})^\alpha & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$
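As a small numerical illustration of the cost, the sketch below evaluates the weighting function and a single term of the sum. The values $x_{max}=100$ and $\alpha=0.75$ are the ones suggested in the GloVe paper and are only assumptions here; the function names and the toy vectors are made up for the example.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: below 1 for rare pairs, capped at 1 for frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """One term of J: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    if x_ij == 0:
        return 0.0  # f(0) = 0, so pairs that never co-occur contribute nothing
    dot = sum(a * b for a, b in zip(w_i, w_j_tilde))
    return weight(x_ij) * (dot + b_i + b_j_tilde - math.log(x_ij)) ** 2

print(weight(1.0))     # about 0.0316: a pair seen once counts very little
print(weight(100.0))   # 1.0
print(weight(5000.0))  # 1.0: very frequent pairs are not weighted any higher

# The term is zero when the dot product plus biases equals log X_ij.
print(pair_loss([1.0, 0.0], [2.0, 0.0], 0.0, 0.0, math.exp(2.0)))
```

Training therefore pushes $w_i^T\tilde{w}_j + b_i + \tilde{b}_j$ towards $\log X_{ij}$, with each pair's influence controlled by $f$.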

$\textbf{Load a pretrained spaCy GloVe model}$

In [5]:
glove_model =  spacy.load('en_core_web_lg')

In [6]:
glove_model.remove_pipe('ner')

('ner', <spacy.pipeline.EntityRecognizer at 0x2995b3fc518>)

In [7]:
X = sentences.values

$\textbf{Train a model on GloVe embeddings, using the same amount of data as for the FastText embeddings}$

In [9]:
Y = data[2][1:]
Y = Y.values
sentences = data[1][1:]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.05)
insincere = X_train[y_train == '1']
sincere = X_train [y_train == '0']
a = []
y = []
for sent in insincere:
    a.append(glove_model(sent).vector)
    y.append(1)
for sent in sincere[:64674*4]:
    a.append(glove_model(sent).vector)
    y.append(0)

In [10]:
LR_model = LogisticRegressionCV(cv=5,random_state=0, solver='lbfgs').fit(a, y)

$\textbf{Test the model on 5% of the data}$

In [11]:
x_test = []
for sent in X_test:
    x_test.append(glove_model(sent).vector)
preds = LR_model.predict(x_test)
preds = [int(x) for x in preds]
y_test = [int(x) for x in y_test]


In [12]:
print("F1 Score on LR for Glove sentence embeddings: " + str(f1_score(y_test,preds)))

F1 Score on LR for Glove sentence embeddings: 0.5488594481522063


# Model comparisons:

All the models were trained with cross-validated logistic regression (`LogisticRegressionCV`). The FastText and Glove models were trained on only about 25% of the dataset and tested on a separate 5% of it (because of time constraints caused by the high computational cost of the algorithms), so compare Glove and FastText with each other, separately from the word-bagging methods.

$\textbf{Average of the word embeddings across the sentence}$

| Word embedding |  F1 score|
|----------------|  --------|
|Glove       |$$0.5488$$|
|FastText    |$$0.5380$$|

$\textbf{Word bagging methods}$

|Bagging method | F1 score |
|---------------|----------|
|Simple  bagging|$$0.5400$$|
|Tf-idf bagging|$$0.5530$$|

# Future considerations

Until now, we dealt with the string classification problem by transforming the strings in different ways until every sentence had the same vector dimension, so that we could train our ML models. Whether by averaging the word embeddings or by word bagging over a fixed vocabulary, the LR-CV models received inputs of a fixed dimension. This does not, however, capture a crucial aspect of the problem: the original data varies in length, and the order in which the words appear in a sentence matters.

Deep learning architectures are a key way to solve NLP problems better. Both CNNs and RNNs are common practice in this field. Usually, one of the embeddings presented (Word2Vec with the Skip-gram or CBOW method, its FastText extension, Glove etc.) supplies a feature vector for each word, and the model is trained on all of these vectors, keeping the words separate rather than averaging them.

$\textbf{CNN}$

CNN architectures have been used a lot in NLP tasks, even though RNNs would seem better suited to these problems. One interesting line of current research uses CNN architectures for character-level convolutions and feeds the resulting CNN features to the NLP layers that follow.

A more common approach in CNN (and, for that matter, RNN) architectures for NLP tasks is to use a pre-trained word embedding that has a fixed size for each word in the dictionary. Say that we have:
 - a sequence $x$ of $n$ entries 
 - an encoding that gives us a $d$ dimensional vector for every word in the vocabulary

Then we can solve our problem by using a convolutional architecture that has:
 - filters of width $d$ (the full embedding dimension) and an arbitrary window size $w$ (notice how each filter looks at a sliding $w$-gram)
 - a padding large enough that every input has the same size (padding means appending $d$-dimensional vectors of zeros until all the questions have the same sentence length).
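The two ingredients, zero-padding and a filter that slides over $w$ consecutive word vectors, can be sketched without any deep learning library. Everything below is an assumption made for illustration: the helper names, the single hand-written filter, the ReLU and the max-over-time pooling; a real model would learn many filters.

```python
def pad(sentence_vectors, length, d):
    """Truncate or append d-dimensional zero vectors until there are `length` rows."""
    padded = [list(v) for v in sentence_vectors[:length]]
    padded += [[0.0] * d for _ in range(length - len(padded))]
    return padded

def conv_max_pool(matrix, kernel):
    """Slide a w x d filter over the rows, apply ReLU, keep the largest response."""
    w, d = len(kernel), len(kernel[0])
    feats = []
    for start in range(len(matrix) - w + 1):
        total = sum(kernel[r][c] * matrix[start + r][c]
                    for r in range(w) for c in range(d))
        feats.append(max(0.0, total))
    return max(feats)

d = 2                                    # embedding size
short = [[1.0, 2.0], [3.0, 4.0]]         # a 2-word sentence
longer = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]        # one filter with window size w = 2

batch = [pad(s, length=4, d=d) for s in (short, longer)]
print(batch[0])                          # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
print([conv_max_pool(m, kernel) for m in batch])  # [5.0, 2.0]
```

After padding, both sentences are 4 x 2 matrices, and each filter reduces a sentence to one number regardless of where in the sentence the matching $w$-gram occurred.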

<img src="imgs/NLP_CNN.png"  style="width: 1400px; height: 700px;">

$\textbf{RNN}$

GRUs and LSTMs have the primary beneficial property that they can receive truly variable length inputs (by comparison, CNNs needed 0-padding on the input, so that all the training data would end up the same length). A basic RNN cell will receive both an input and state from the previous time-step and calculate an output and a state for the next time-step.

Again, we need some form of embedding for the words in our sentences. Using the same intuition, a word embedding gives us a $d$-dimensional input for each word we encounter while moving through the sentence. Given this, GRUs and LSTMs work in very similar ways:

$\underline{GRU}$

$z = \sigma(x_tU^z + h_{t-1}W^z)$

$r = \sigma(x_tU^r + h_{t-1}W^r)$

$s_t = \tanh(x_tU^s + (h_{t-1} \circ r)W^s)$

$h_t = (1-z) \circ s_t + z \circ h_{t-1}$
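A single GRU step can be written out in plain Python to see how the gates interact and why the sequence length does not matter. This is a sketch under assumptions: the reset gate $r$ scales the previous state inside the candidate $s_t$, as in the standard GRU formulation, biases are omitted as in the equations, and the helper names and weight values are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def vec_mat(v, M):
    """Row vector v (length n) times matrix M (n rows, m columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def gru_step(x_t, h_prev, U, W):
    """One GRU step; U and W are dicts of matrices keyed by 'z', 'r' and 's'."""
    def pre(name, h):
        return [xu + hw for xu, hw in zip(vec_mat(x_t, U[name]), vec_mat(h, W[name]))]
    z = [sigmoid(a) for a in pre("z", h_prev)]            # update gate
    r = [sigmoid(a) for a in pre("r", h_prev)]            # reset gate
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)]
    s_t = [math.tanh(a) for a in pre("s", gated)]         # candidate state
    return [(1 - z_k) * s_k + z_k * h_k for z_k, s_k, h_k in zip(z, s_t, h_prev)]

def run_gru(sequence, hidden, U, W):
    """Feed a sequence of any length, one word vector at a time."""
    h = [0.0] * hidden
    for x_t in sequence:
        h = gru_step(x_t, h, U, W)
    return h

# Toy weights: d = 2 input dimensions, 1 hidden unit.
U = {"z": [[0.1], [0.2]], "r": [[0.3], [-0.1]], "s": [[0.5], [0.5]]}
W = {"z": [[0.1]], "r": [[0.2]], "s": [[0.4]]}

print(run_gru([[1.0, 0.0], [0.0, 1.0]], 1, U, W))                 # 2-word sentence
print(run_gru([[1.0, 0.0]] * 5, 1, U, W))                         # 5-word sentence
```

Both calls return a state of the same size, so no padding is needed before a classifier is applied to the final state.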

$\underline{LSTM}$

$i_t = \sigma(x_tU^i + h_{t-1}W^i+b_i)$

$f_t = \sigma(x_tU^f+h_{t-1}W^f+b_f)$

$o_t = \sigma(x_tU^o + h_{t-1}W^o+b_o)$

$q_t = \tanh(x_tU^q+h_{t-1}W^q+b_q)$

$p_t = f_t \circ p_{t-1}+i_t \circ q_t$

$h_t=o_t \circ \tanh(p_t)$
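The LSTM equations translate almost line by line into code. The sketch below is a plain-Python illustration, not a trainable layer: the helper names are assumptions, and all weights are set to zero so that the result can be checked by hand.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def vec_mat(v, M):
    """Row vector v (length n) times matrix M (n rows, m columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def lstm_step(x_t, h_prev, p_prev, U, W, b):
    """One LSTM step; U, W and b are dicts keyed by 'i', 'f', 'o' and 'q'."""
    def gate(name, activation):
        pre = zip(vec_mat(x_t, U[name]), vec_mat(h_prev, W[name]), b[name])
        return [activation(xu + hw + bias) for xu, hw, bias in pre]

    i_t = gate("i", sigmoid)    # input gate
    f_t = gate("f", sigmoid)    # forget gate
    o_t = gate("o", sigmoid)    # output gate
    q_t = gate("q", math.tanh)  # candidate cell content
    p_t = [f * p + i * q for f, p, i, q in zip(f_t, p_prev, i_t, q_t)]
    h_t = [o * math.tanh(p) for o, p in zip(o_t, p_t)]
    return h_t, p_t

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

d, hidden = 2, 1
U = {g: zeros(d, hidden) for g in "ifoq"}
W = {g: zeros(hidden, hidden) for g in "ifoq"}
b = {g: [0.0] * hidden for g in "ifoq"}

# With zero weights every gate equals 0.5 and the candidate q_t equals 0,
# so the cell state is halved: p_t = 0.5 * 2.0 = 1.0 and h_t = 0.5 * tanh(1.0).
h_t, p_t = lstm_step([1.0, -1.0], [0.0], [2.0], U, W, b)
print(h_t, p_t)
```

Compared with the GRU, the LSTM carries two vectors from step to step: the cell state $p_t$ and the output state $h_t$.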

<img src="imgs/LSTM3-chain.png"  style="width: 600px;">