# FEATURE EXTRACTION USING WORD EMBEDDINGS

Together with sparse representations, we also tried dense representations based on word embeddings. We tried two different approaches: the first trains our own word embeddings on the training set we had, and the second uses pre-trained word embeddings.

First, we prepared the dataset. We merged the training and validation sets to have more data for training, and we kept the test set to evaluate the performance of the model.
We also considered different scenarios:
- including or excluding the Out-Of-Scope samples
- using intents or domains as labels for classification.

The function *get_df(oos, domains)* creates the dataframes used for training and testing, optionally adding a domain label and optionally including the Out-of-Scope samples. In all cases it merges the validation set into the training set.

In [2]:
import nltk
import json
import gensim
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from matplotlib import pyplot as plt
from sklearn import preprocessing

def get_df(oos=False,domains=False) :
    with open('data_full.json') as json_file: 
        data_dict = json.load(json_file) 

    train_data = data_dict['train']
    val_data = data_dict['val']
    test_data = data_dict['test']

    oos_train = data_dict['oos_train']
    oos_val = data_dict['oos_val']
    oos_test = data_dict['oos_test']


    train_df = pd.DataFrame(train_data, columns =['query', 'intent'])
    val_df = pd.DataFrame(val_data, columns =['query', 'intent'])
    test_df = pd.DataFrame(test_data, columns =['query', 'intent'])

    train_oos_df = pd.DataFrame(oos_train,columns=['query','intent'])
    val_oos_df = pd.DataFrame(oos_val,columns=['query','intent'])
    test_oos_df = pd.DataFrame(oos_test,columns=['query','intent'])

    if oos :
        # Concatenate dataframes to consider oos as a specific intent
        train_df = pd.concat([train_df,train_oos_df])
        val_df = pd.concat([val_df,val_oos_df])
        test_df = pd.concat([test_df,test_oos_df])
    
    # Merge the validation set into the training set
    train_df = pd.concat([train_df, val_df])

    if domains:
        with open('domains.json') as json_file:
            domain_dict = json.load(json_file)
        inv_domain_dict = {}
        for domainKey in domain_dict.keys():
            for intent in domain_dict[domainKey]:
                inv_domain_dict[intent] = domainKey
        if oos:
            inv_domain_dict['oos']='oos'
        train_df['domain'] = train_df.apply(lambda row: inv_domain_dict[row['intent']],axis=1)
        test_df['domain'] = test_df.apply(lambda row: inv_domain_dict[row['intent']],axis=1)
    
    return train_df, test_df


df_train, df_test = get_df(oos=True,domains=False)
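
The domain lookup built inside *get_df* can be illustrated in isolation. The following is a minimal sketch, assuming a miniature stand-in for `domains.json`; the dictionary contents and the helper name `invert_domains` are hypothetical and only serve the example.

```python
# Hypothetical miniature of domains.json: domain -> list of intents
domain_dict = {
    "banking": ["transfer", "balance"],
    "travel": ["travel_alert", "book_flight"],
}

def invert_domains(domain_dict, oos=False):
    """Map every intent to the domain that contains it."""
    inv = {intent: domain
           for domain, intents in domain_dict.items()
           for intent in intents}
    if oos:
        # Out-of-scope queries are treated as their own pseudo-domain
        inv["oos"] = "oos"
    return inv

intent_to_domain = invert_domains(domain_dict, oos=True)
labels = [intent_to_domain[i] for i in ["transfer", "oos", "book_flight"]]
print(labels)  # ['banking', 'oos', 'travel']
```

Inverting the dictionary once turns the per-row lookup into a constant-time dictionary access.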



## PREPROCESSING

In the first step we pre-process our data. This step includes only stemming or lemmatization, together with the optional removal of stop words. Other pre-processing operations are not necessary because the sentences in our dataset are already lowercased and free of special characters.

In [2]:
# Preprocessing

def utils_preprocess_text(text, flg_stemm=True, flg_lemm=False, lst_stopwords=None):
            
    ## Tokenize (convert from string to list)
    lst_text = text.split()
    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in 
                    lst_stopwords]
                
    ## Stemming 
    if flg_stemm:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]
                
    ## Lemmatisation 
    if flg_lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]
            
    ## back to string from list
    text = " ".join(lst_text)
    return text

lst_stopwords = nltk.corpus.stopwords.words("english")

In [3]:
# Apply preprocessing (lemmatization only: lst_stopwords is not passed here, so stop words are kept)
df_train["query_clean"] = df_train["query"].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True))
df_test["query_clean"] = df_test["query"].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True))
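
To make the effect of stop-word removal concrete, here is a minimal sketch that does not depend on NLTK. The stop-word set `STOP_WORDS` and the helper `remove_stopwords` are hypothetical and much smaller than the English list used above; stemming and lemmatization are left out.

```python
# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "to", "is", "my", "what", "of"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token found in stop_words."""
    tokens = text.split()
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = remove_stopwords("what is the balance of my savings account")
print(cleaned)  # balance savings account
```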

## TRAINING WORD EMBEDDINGS


We use Word2Vec as the model for training word embeddings, specifically the implementation provided by the Gensim library.

To train our word embeddings on the training set, we need to transform each row of the corpus into a list of unigrams. This is done by the function *corpus_as_lst(corpus)*.

In [4]:
def corpus_as_lst(corpus):
    ## create list of lists of unigrams
    lst_corpus = []
    for string in corpus:
        # splitting on whitespace already yields the unigrams
        lst_corpus.append(string.split())
    return lst_corpus

# Prepare the corpus to be trained by Word2Vec
train_corpus = corpus_as_lst(df_train['query_clean'])


Then we feed the Word2Vec model with the corpus transformed into a list of lists. We specify some parameters, in particular:
- *vector_size* is the size of each word embedding vector.
- *window* is the maximum distance between the target word and the surrounding words that are treated as its (positive) context.
- *sg=1* selects skip-gram as the training algorithm.
- *min_count=1* ignores all words with total frequency lower than 1, so in practice no word is discarded.
- *epochs* is the number of iterations over the corpus.

In [11]:
# Training word embeddings
wc_model = gensim.models.word2vec.Word2Vec(train_corpus, vector_size=300, window=8, min_count=1, sg=1, epochs=30)
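
To illustrate what the *window* parameter means for skip-gram, the sketch below enumerates the (target, context) pairs that a fixed window produces for one sentence. It is a simplified illustration only: the helper `skipgram_pairs` is hypothetical, and Gensim's actual training adds further details (such as negative sampling) that are not shown here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["set", "a", "timer", "please"], window=1)
for pair in pairs:
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbours; a larger window such as the value 8 used above pairs each word with more distant words of the same query.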

Instead of training our own word embeddings, we can also use pre-trained ones. We use the **word2vec-google-news-300** pre-trained vectors, which were trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.

In [3]:
# we load pretrained word embeddings
google_300 = gensim.models.KeyedVectors.load("./models/google_300")

Now that we have the word embeddings, we want to transform each sentence into a single vector. This is done by the **text_to_mean_vector(embeddings, text)** function, which takes as inputs the word embeddings and a sentence.

First, it splits the sentence into tokens and looks up the word vector of each token. If a word is not present in the embedding vocabulary, it is simply ignored.

At this point we have an N×M matrix, where N is the number of in-vocabulary words in the sentence and M is the size of each word vector (in our case 300).

Since we want a single vector for each sentence, we take the mean over the N rows (that is, column by column) and return the resulting M-dimensional vector.

The function *get_word_embdeddings(corpus, model)* applies *text_to_mean_vector* to every sentence of the corpus; a sentence whose tokens are all out-of-vocabulary is represented by a vector of zeros.

In [4]:
def text_to_mean_vector(embeddings, text):
    tokens = text.split()
    vec = []
    for token in tokens:
        try:
            vec.append(embeddings.get_vector(token))
        except KeyError:
            pass   # simply ignore out-of-vocabulary tokens
    if len(vec) != 0:
        # average the word vectors component by component
        return [sum([row[j] for row in vec]) / len(vec) for j in range(len(vec[0]))]
    else:
        return []  # if every token of the sentence is out-of-vocabulary we simply return an empty list

def get_word_embdeddings(corpus, model):
    embeddings_corpus = []
    for c in corpus:
        mean_vec = text_to_mean_vector(model, c)
        if(len(mean_vec)!=0):
            embeddings_corpus.append(mean_vec)
        else:
            embeddings_corpus.append(np.zeros(model.vector_size,)) # if every token of the sentence is out-of-vocabulary, we represent that sentence as a vector of zeros
    return np.array(embeddings_corpus)
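
The averaging step can be checked by hand on a toy example. The sketch below assumes a hypothetical 3-dimensional embedding table (`toy_embeddings`) stored as a plain dictionary instead of Gensim vectors, and a helper `mean_vector` that combines the mean and the zero-vector fallback described above.

```python
# Hypothetical 3-dimensional embeddings; the real vectors have 300 dimensions
toy_embeddings = {
    "set":   [1.0, 0.0, 2.0],
    "timer": [3.0, 2.0, 0.0],
}

def mean_vector(embeddings, text, dim=3):
    """Average the vectors of in-vocabulary tokens; zeros if none is known."""
    rows = [embeddings[t] for t in text.split() if t in embeddings]
    if not rows:
        return [0.0] * dim
    # zip(*rows) iterates over the columns of the N x M matrix
    return [sum(col) / len(rows) for col in zip(*rows)]

print(mean_vector(toy_embeddings, "set a timer"))  # [2.0, 1.0, 1.0]
print(mean_vector(toy_embeddings, "xyzzy"))        # [0.0, 0.0, 0.0]
```

The out-of-vocabulary token "a" does not affect the result: the mean is taken over the two known words only.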


In [22]:
# Extracting word embeddings (here with the pre-trained vectors; to use the embeddings trained above, pass wc_model.wv instead)
X_train = get_word_embdeddings(df_train['query_clean'], google_300)
X_test = get_word_embdeddings(df_test['query_clean'], google_300)

print(X_train.shape)
print(X_test.shape)


(18200, 300)
(5500, 300)


In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

def train_model(model, X_train, X_test, df_train, df_test, domains=False):
    # Getting labels
    if domains:
        y_train = df_train['domain'].values
        y_test = df_test['domain'].values
    else:
        y_train = df_train['intent'].values
        y_test = df_test['intent'].values

    # Fit the classifier passed as argument and predict on the test features
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    return y_pred, y_test

# Training 
lr = LogisticRegression(multi_class='multinomial', max_iter=300)
y_pred, y_test = train_model(lr, X_train, X_test, df_train, df_test, domains=False)


In [16]:
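from sklearn.metrics import classification_report
# classification_report orders its rows by sorted label, so target_names must be
# sorted as well; with the unsorted unique() order (as in the output shown below)
# the names do not line up with the rows
target_names = sorted(df_test['intent'].unique())
print(classification_report(y_test,y_pred,target_names=target_names))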

                           precision    recall  f1-score   support

                translate       0.96      0.87      0.91        30
                 transfer       0.72      0.87      0.79        30
                    timer       0.97      0.97      0.97        30
               definition       0.81      1.00      0.90        30
          meaning_of_life       0.90      0.93      0.92        30
         insurance_change       0.97      0.97      0.97        30
               find_phone       0.79      0.90      0.84        30
             travel_alert       0.68      0.70      0.69        30
              pto_request       0.81      0.87      0.84        30
     improve_credit_score       0.88      1.00      0.94        30
                 fun_fact       0.88      1.00      0.94        30
          change_language       0.57      0.80      0.67        30
                   payday       0.61      0.67      0.63        30
replacement_card_duration       0.70      0.93      0.80     

In [24]:
# confusion matrix
print('Confusion matrix shape:',confusion_matrix(y_test, y_pred).shape)

# accuracy, precision, recall, f1

accuracy = accuracy_score(y_test,y_pred)
precision = precision_score(y_test,y_pred,average='macro')
recall = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
print('Accuracy',accuracy)
print('Precision',precision)
print('Recall',recall)
print('F1',f1)


Confusion matrix shape: (151, 151)
Accuracy 0.7985454545454546
Precision 0.8012696503229092
Recall 0.8890242825607063
F1 0.8369035053953743
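
The macro-averaged scores above are unweighted means of the per-class scores. The sketch below recomputes them from scratch on toy labels to make this explicit; the helper `macro_scores` is hypothetical and is only meant to mirror, to our understanding, what `average='macro'` does in scikit-learn.

```python
def macro_scores(y_true, y_pred):
    """Unweighted mean of per-class precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels: class "a" has P=1, R=2/3; class "b" has P=1/2, R=1
y_true_toy = ["a", "a", "a", "b"]
y_pred_toy = ["a", "a", "b", "b"]
p, r, f = macro_scores(y_true_toy, y_pred_toy)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.75 0.8333 0.7333
```

Because the macro F1 is the mean of the per-class F1 scores, it need not equal the harmonic mean of the macro precision and macro recall (about 0.79 in this toy case, against a macro F1 of about 0.73).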
