# Train Models
<div style="background-color: lightblue; padding: 10px; border-radius: 10px;">

**IMPORTANT INFO:**

The `train_models.ipynb` notebook:
- Is the responsibility of all members of the group. All of you should execute it and ensure it works as expected.
- Has to use the code done by each member in the group to generate features for the challenge.


`models`: A folder containing the trained models. This folder should be created by `train_models.ipynb`, and the models should be stored there after running the notebook. The code should check whether the folder already exists and, if so, not overwrite or re-store the models.

</div>
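The no-overwrite requirement above can be sketched as a small helper. This is a minimal illustration rather than the notebook's actual code: `save_if_absent` is a hypothetical name, and a plain dict stands in for a trained model.

```python
import os
import pickle
import tempfile


def save_if_absent(obj, path):
    """Pickle `obj` to `path` unless a file already exists there.

    Returns True if the file was written, False if it was left untouched.
    """
    if os.path.exists(path):
        return False
    # Create the parent folder (and any missing ancestors) only when needed.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return True


# Demo in a throwaway directory; a dict stands in for a trained model.
demo_root = tempfile.mkdtemp()
demo_path = os.path.join(demo_root, "models", "simple_model.pkl")
first_write = save_if_absent({"version": 1}, demo_path)
second_write = save_if_absent({"version": 2}, demo_path)
with open(demo_path, "rb") as f:
    stored = pickle.load(f)
```

The second call leaves the first file in place, so re-running the notebook would not replace an already stored model.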

## Imports

In [26]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import *
import numpy as np
import sklearn
import pickle
import scipy
import os
import sys

import seaborn as sns
sns.set()

from utils import *

## Load Data

The teacher's problem guide states:

This is a Kaggle challenge: there is no validation/test data with labels.
Therefore, you have to create the following split so that all teams share the same train, validation, and test splits:

In [4]:
path_folder_quora = '../nlp_deliv1_materials/'

# Train and Validation data
train_df = pd.read_csv(os.path.join(path_folder_quora, "quora_train_data.csv"))
# use this to provide the expected generalization results
test_df = pd.read_csv(os.path.join(path_folder_quora,"quora_test_data.csv"))

A_df, te_df = sklearn.model_selection.train_test_split(train_df, test_size=0.05, random_state=123)
tr_df, va_df = sklearn.model_selection.train_test_split(A_df, test_size=0.05, random_state=123)

In [3]:
# dividing X and y for each dataset
y_tr = tr_df['is_duplicate'].values
X_tr_df = tr_df.drop(['is_duplicate'], axis =1)

y_va = va_df['is_duplicate'].values
X_va_df = va_df.drop(['is_duplicate'], axis =1)

y_te = te_df['is_duplicate'].values
X_te_df = te_df.drop(['is_duplicate'], axis =1)

print(f'Training:\n X train {X_tr_df.shape}\n y train {y_tr.shape}\n {"-"*20}')
print(f'Validation:\n X val {X_va_df.shape}\n y val {y_va.shape}\n {"-"*20}')
print(f'Test:\n X test {X_te_df.shape}\n y test {y_te.shape}\n {"-"*20}')

Training:
 X train (291897, 5)
 y train (291897,)
 --------------------
Validation:
 X val (15363, 5)
 y val (15363,)
 --------------------
Test:
 X test (16172, 5)
 y test (16172,)
 --------------------
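The shapes printed above follow from applying a 5% hold-out twice. A quick check of the arithmetic, assuming each held-out partition is the ceiling of 5% of the available rows (the rounding rule is an assumption here, and `holdout_size` is a hypothetical helper that uses integer arithmetic to avoid floating-point rounding):

```python
def holdout_size(n_rows, pct):
    """Ceiling of pct% of n_rows, computed with integers only."""
    return -(-n_rows * pct // 100)


n_total = 291897 + 15363 + 16172   # labelled rows, summed from the shapes above
n_te = holdout_size(n_total, 5)    # first split: 5% held out as test
n_rest = n_total - n_te
n_va = holdout_size(n_rest, 5)     # second split: 5% of the remainder as validation
n_tr = n_rest - n_va

print(n_tr, n_va, n_te)
print(round(n_tr / n_total, 4), round(n_va / n_total, 4), round(n_te / n_total, 4))
```

So roughly 90.25% of the labelled rows are used for training, 4.75% for validation, and 5% for testing.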


# Simple Solution

In [6]:
# convert input data into list of strings

q1_train =  cast_list_as_strings(list(X_tr_df["question1"]))
q2_train =  cast_list_as_strings(list(X_tr_df["question2"]))

q1_val =  cast_list_as_strings(list(X_va_df["question1"]))
q2_val =  cast_list_as_strings(list(X_va_df["question2"]))

q1_test =  cast_list_as_strings(list(X_te_df["question1"]))
q2_test =  cast_list_as_strings(list(X_te_df["question2"]))

Use all the questions in the train partition to build a single list `all_q_train` and fit the `count_vectorizer` on it.

In [5]:
all_q_train = q1_train+q2_train

count_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,1))
count_vectorizer.fit(all_q_train)

CountVectorizer()

In [6]:
# get features (concatenating q1+q2)
X_tr_q1q2 = get_features_from_df(X_tr_df, count_vectorizer) # it converts list as strings and performs count_vectorizer
X_va_q1q2 = get_features_from_df(X_va_df, count_vectorizer)
X_te_q1q2 = get_features_from_df(X_te_df, count_vectorizer)

In [7]:
print(f'Training:\n X train {X_tr_q1q2.shape}\n {"-"*20}')
print(f'Validation:\n X val {X_va_q1q2.shape}\n{"-"*20}')
print(f'Test:\n X test {X_te_q1q2.shape}\n{"-"*20}')

Training:
 X train (291897, 149650)
 --------------------
Validation:
 X val (15363, 149650)
--------------------
Test:
 X test (16172, 149650)
--------------------


In [8]:
# training a simple model
logistic = sklearn.linear_model.LogisticRegression(solver="liblinear",
                                                   random_state=123)
logistic.fit(X_tr_q1q2, y_tr)

LogisticRegression(random_state=123, solver='liblinear')

### Saving simple model
Creating model folder + saving

In [24]:
# save model
if not os.path.isdir("model"):
    os.mkdir("model")

if not os.path.isdir("model/simple_solution"):
    os.mkdir("model/simple_solution")
        
with open('model/simple_solution/simple_model.pkl','wb') as f:
    pickle.dump(logistic,f)

In [25]:
# save dataset with correct features:

# Save as model_name+(X/y)+(tr/va/te) (depending on whether it is a dataset or labels, and which split it belongs to)

with open('model/simple_solution/simple_model_X_tr.pkl','wb') as f:
    pickle.dump(X_tr_q1q2,f)  
with open('model/simple_solution/simple_model_X_va.pkl','wb') as f:
    pickle.dump(X_va_q1q2,f)   
with open('model/simple_solution/simple_model_X_te.pkl','wb') as f:
    pickle.dump(X_te_q1q2,f)
with open('model/simple_solution/simple_model_y_tr.pkl','wb') as f:
    pickle.dump(y_tr,f)
with open('model/simple_solution/simple_model_y_va.pkl','wb') as f:
    pickle.dump(y_va,f)
with open('model/simple_solution/simple_model_y_te.pkl','wb') as f:
    pickle.dump(y_te,f)
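The six dumps above follow the naming convention `model_name + (X/y) + (tr/va/te)`, so they could also be written as a loop. A minimal sketch, assuming a hypothetical helper `dump_splits` and using small lists in place of the real feature matrices:

```python
import os
import pickle
import tempfile


def dump_splits(folder, model_name, splits):
    """Pickle each (kind, split) entry as <model_name>_<kind>_<split>.pkl.

    `splits` maps ("X" or "y", "tr"/"va"/"te") to the object to store.
    Returns the sorted list of file names written.
    """
    os.makedirs(folder, exist_ok=True)
    written = []
    for (kind, split), obj in splits.items():
        name = f"{model_name}_{kind}_{split}.pkl"
        with open(os.path.join(folder, name), "wb") as f:
            pickle.dump(obj, f)
        written.append(name)
    return sorted(written)


# Toy stand-ins for X_tr_q1q2, y_tr, and the other splits.
toy = {
    ("X", "tr"): [[1, 0], [0, 1]], ("y", "tr"): [0, 1],
    ("X", "va"): [[1, 1]],         ("y", "va"): [1],
    ("X", "te"): [[0, 0]],         ("y", "te"): [0],
}
out_dir = os.path.join(tempfile.mkdtemp(), "model", "simple_solution")
files = dump_splits(out_dir, "simple_model", toy)
```

Keeping the convention in one place makes it harder for a file name to drift between the different solutions.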

# Improved Solution

### Text Cleaning and Basic Text Features


Our first approach is to perform some text cleaning and add some basic text features (both are defined in `utils_Alba.ipynb`). We also tested different classifiers, but the logistic model obtained the best performance.

In [7]:
# APPLY PREPROCESSING TO THE DATA

q1_train_preprocessed = preprocess_data(q1_train)
q2_train_preprocessed = preprocess_data(q2_train)
q1_val_preprocessed = preprocess_data(q1_val)
q2_val_preprocessed = preprocess_data(q2_val)
q1_test_preprocessed = preprocess_data(q1_test)
q2_test_preprocessed = preprocess_data(q2_test)

In [6]:
# fit the countvectorizer like in the simple solution
all_q_train_preprocessed = q1_train_preprocessed+q2_train_preprocessed

count_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,1))
count_vectorizer.fit(all_q_train_preprocessed)

In [7]:
# Create a DataFrame with preprocessed columns
X_tr_pre = pd.DataFrame({'question1': q1_train_preprocessed, 'question2': q2_train_preprocessed}, columns=['question1', 'question2'])
#print(X_tr_pre)
X_va_pre = pd.DataFrame({'question1': q1_val_preprocessed, 'question2': q2_val_preprocessed}, columns=['question1', 'question2'])
X_te_pre = pd.DataFrame({'question1': q1_test_preprocessed, 'question2': q2_test_preprocessed}, columns=['question1', 'question2'])

In [8]:
# get features (concatenating q1+q2)
X_tr_q1q2_pre = get_features_from_df(X_tr_pre, count_vectorizer) # it converts list as strings and performs count_vectorizer
X_va_q1q2_pre = get_features_from_df(X_va_pre, count_vectorizer)
X_te_q1q2_pre = get_features_from_df(X_te_pre, count_vectorizer)

In [9]:
# CONSTRUCT DataFrame WITH FEATURES

X_tr_features = build_numeric_features(q1_train, q2_train)
X_va_features = build_numeric_features(q1_val, q2_val)
X_te_features = build_numeric_features(q1_test, q2_test)

In [10]:
# combine count vectorizer data with features in sparse form

X_tr_features_sparse = scipy.sparse.hstack([X_tr_q1q2_pre, scipy.sparse.csr_matrix(X_tr_features)])
X_va_features_sparse = scipy.sparse.hstack([X_va_q1q2_pre, scipy.sparse.csr_matrix(X_va_features)])
X_te_features_sparse = scipy.sparse.hstack([X_te_q1q2_pre, scipy.sparse.csr_matrix(X_te_features)])

In [11]:
# TRAIN LOGISTIC MODEL

logistic2 = sklearn.linear_model.LogisticRegression(solver="liblinear", random_state=123)
logistic2.fit(X_tr_features_sparse, y_tr)

In [12]:
# SAVE MODEL AND DATASET

# save model
if not os.path.isdir("model"):
    os.mkdir("model")

if not os.path.isdir("model/features_solution"):
    os.mkdir("model/features_solution")
        
with open('model/features_solution/features_model.pkl','wb') as f:
    pickle.dump(logistic2,f)
    
# save data
with open('model/features_solution/features_model_X_tr.pkl','wb') as f:
    pickle.dump(X_tr_features_sparse,f)  
with open('model/features_solution/features_model_X_va.pkl','wb') as f:
    pickle.dump(X_va_features_sparse,f)   
with open('model/features_solution/features_model_X_te.pkl','wb') as f:
    pickle.dump(X_te_features_sparse,f)
with open('model/features_solution/features_model_y_tr.pkl','wb') as f:
    pickle.dump(y_tr,f)
with open('model/features_solution/features_model_y_va.pkl','wb') as f:
    pickle.dump(y_va,f)
with open('model/features_solution/features_model_y_te.pkl','wb') as f:
    pickle.dump(y_te,f)

### TF-IDF vectorizer + Basic Features + Distance Features

In [13]:
import pandas as pd
from sklearn import *
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy
import os
import pickle

sns.set()

In [4]:
# TF-IDF vectorizer

tr_df_clean = X_tr_df.fillna('')
va_df_clean = X_va_df.fillna('')
te_df_clean = X_te_df.fillna('')

from sklearn.feature_extraction.text import TfidfVectorizer

questions = list(tr_df_clean['question1']) + list(tr_df_clean['question2'])
tfidf = TfidfVectorizer()
tfidf.fit(questions)  # fit() returns the fitted vectorizer itself, not a feature matrix

In [5]:
# getting TF-IDF representations
from utils import get_features_from_df
X_tr_q1q2 = get_features_from_df(tr_df_clean, tfidf) # it converts list as strings
X_va_q1q2 = get_features_from_df(va_df_clean, tfidf) # it converts list as strings
X_te_q1q2 = get_features_from_df(te_df_clean, tfidf) # it converts list as strings

In [7]:
# loading basic features created in utils_Alba
X_tr_features = pd.read_csv('../X_tr_features.csv')
X_va_features = pd.read_csv('../X_va_features.csv')
X_te_features = pd.read_csv('../X_te_features.csv')

In [6]:
# read distance features (computed in utils_claudia due to conflicts with cython)

dist_df_train = pd.read_csv('model/features_solution/distances/dist_df_train.csv', header = 'infer', index_col=0)
dist_df_val = pd.read_csv('model/features_solution/distances/dist_df_val.csv', header = 'infer', index_col=0)
dist_df_test = pd.read_csv('model/features_solution/distances/dist_df_test.csv', header = 'infer', index_col=0)

In [9]:
# tf-idf+basic features
tfidf_features_bf_tr = scipy.sparse.hstack([X_tr_q1q2, scipy.sparse.csr_matrix(X_tr_features)])
tfidf_features_bf_va = scipy.sparse.hstack([X_va_q1q2, scipy.sparse.csr_matrix(X_va_features)])
tfidf_features_bf_te = scipy.sparse.hstack([X_te_q1q2, scipy.sparse.csr_matrix(X_te_features)])

In [10]:
#tf-idf+basic features + distance features
tfidf_features_dist_tr = scipy.sparse.hstack([tfidf_features_bf_tr, scipy.sparse.csr_matrix(dist_df_train)])
tfidf_features_dist_va = scipy.sparse.hstack([tfidf_features_bf_va, scipy.sparse.csr_matrix(dist_df_val)])
tfidf_features_dist_te = scipy.sparse.hstack([tfidf_features_bf_te, scipy.sparse.csr_matrix(dist_df_test)])

In [11]:
tfidf_feat_dist_logistic = linear_model.LogisticRegression(solver="liblinear",
                                                   random_state=123)
tfidf_feat_dist_logistic.fit(tfidf_features_dist_tr, y_tr)

In [14]:
# SAVE MODEL AND DATASET

# save model
if not os.path.isdir("model"):
    os.mkdir("model")

if not os.path.isdir("model/features_solution/distances/"):
    os.mkdir("model/features_solution/distances/")
        
with open('model/features_solution/distances/TFIDF_BF_DF.pkl','wb') as f:
    pickle.dump(tfidf_feat_dist_logistic,f)
    
# save data
with open('model/features_solution/distances/TFIDF_BF_DF_X_tr.pkl','wb') as f:
    pickle.dump(tfidf_features_dist_tr,f)  
with open('model/features_solution/distances/TFIDF_BF_DF_X_va.pkl','wb') as f:
    pickle.dump(tfidf_features_dist_va,f)   
with open('model/features_solution/distances/TFIDF_BF_DF_X_te.pkl','wb') as f:
    pickle.dump(tfidf_features_dist_te,f)
with open('model/features_solution/distances/TFIDF_BF_DF_y_tr.pkl','wb') as f:
    pickle.dump(y_tr,f)
with open('model/features_solution/distances/TFIDF_BF_DF_y_va.pkl','wb') as f:
    pickle.dump(y_va,f)
with open('model/features_solution/distances/TFIDF_BF_DF_y_te.pkl','wb') as f:
    pickle.dump(y_te,f)

#### Testing different classifiers

In [15]:
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB

In [16]:
xgb_model = XGBClassifier(random_state=123)
nb_model = MultinomialNB()

In [18]:
xgb_trained = xgb_model.fit(tfidf_features_dist_tr, y_tr)
xgb_trained

In [19]:
nb_trained = nb_model.fit(tfidf_features_dist_tr, y_tr)
nb_trained

In [24]:
# saving models

with open('model/features_solution/distances/TFIDF_BF_DF_xgboost.pkl','wb') as f:
    pickle.dump(xgb_trained,f)

with open('model/features_solution/distances/TFIDF_BF_DF_nb.pkl','wb') as f:
    pickle.dump(nb_trained,f)

## Incorporate embeddings

### Word2Vec

In [23]:
#Train
vec = get_Word2Vect_from_clean(pd.DataFrame({'q1': q1_train_preprocessed, 'q2':q2_train_preprocessed}))
dfw2v = pd.DataFrame({'q1':vec[0],'q2':vec[1],'dup':y_tr})
dfw2v.dropna(inplace=True)
y_tr_w2v = dfw2v['dup'].values
X_tr_w2v = np.hstack((np.array([x for x in dfw2v['q1'].values]),np.array([x for x in dfw2v['q2'].values])))
#Validation
vec = get_Word2Vect_from_clean(pd.DataFrame({'q1': q1_val_preprocessed, 'q2':q2_val_preprocessed}))
dfw2v = pd.DataFrame({'q1':vec[0],'q2':vec[1],'dup':y_va})
dfw2v.dropna(inplace=True)
y_va_w2v = dfw2v['dup'].values
X_va_w2v = np.hstack((np.array([x for x in dfw2v['q1'].values]),np.array([x for x in dfw2v['q2'].values])))
#Test
vec = get_Word2Vect_from_clean(pd.DataFrame({'q1': q1_test_preprocessed, 'q2':q2_test_preprocessed}))
dfw2v = pd.DataFrame({'q1':vec[0],'q2':vec[1],'dup':y_te})
dfw2v.dropna(inplace=True)
y_te_w2v = dfw2v['dup'].values
X_te_w2v = np.hstack((np.array([x for x in dfw2v['q1'].values]),np.array([x for x in dfw2v['q2'].values])))
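The `dropna` calls above remove rows for which a question could not be embedded. One common way this happens, assuming `get_Word2Vect_from_clean` averages the word vectors of each question (an assumption; its implementation is not shown here), is a question with no in-vocabulary tokens. A toy sketch in plain Python with a hypothetical two-dimensional vocabulary:

```python
def mean_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical toy vocabulary and tokenized question pairs.
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
pairs = [(["cat", "dog"], ["dog"]), (["zzz"], ["cat"])]
labels = [1, 0]

kept = []
for (q1, q2), label in zip(pairs, labels):
    v1, v2 = mean_vector(q1, toy_vectors), mean_vector(q2, toy_vectors)
    if v1 is None or v2 is None:
        continue  # same effect as dropna: drop the pair together with its label
    kept.append((v1 + v2, label))  # list concatenation, like np.hstack on one row
```

Because rows are dropped independently in each partition, `y_tr_w2v`, `y_va_w2v`, and `y_te_w2v` can be shorter than `y_tr`, `y_va`, and `y_te`, which is why the filtered labels are saved alongside the features.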

In [24]:
logistic3 = sklearn.linear_model.LogisticRegression(solver="liblinear", random_state=123)
logistic3.fit(X_tr_w2v, y_tr_w2v)

In [26]:
# SAVE MODEL AND DATASET
folder = 'word2vect'
# save model
if not os.path.isdir("model"):
    os.mkdir("model")

if not os.path.isdir(f"model/{folder}_solution"):
    os.mkdir(f"model/{folder}_solution")
        
with open(f"model/{folder}_solution/{folder}_model.pkl",'wb') as f:
    pickle.dump(logistic3,f)
    
# save data
with open(f'model/{folder}_solution/{folder}_model_X_tr.pkl','wb') as f:
    pickle.dump(X_tr_w2v,f)  
with open(f'model/{folder}_solution/{folder}_model_X_va.pkl','wb') as f:
    pickle.dump(X_va_w2v,f)   
with open(f'model/{folder}_solution/{folder}_model_X_te.pkl','wb') as f:
    pickle.dump(X_te_w2v,f)
with open(f'model/{folder}_solution/{folder}_model_y_tr.pkl','wb') as f:
    pickle.dump(y_tr_w2v,f)
with open(f'model/{folder}_solution/{folder}_model_y_va.pkl','wb') as f:
    pickle.dump(y_va_w2v,f)
with open(f'model/{folder}_solution/{folder}_model_y_te.pkl','wb') as f:
    pickle.dump(y_te_w2v,f)

#### Word2Vec pre-trained

In [8]:
#Load pretrained word2vec model
from gensim.models import KeyedVectors  # needed unless `utils` already re-exports it

pre = KeyedVectors.load("model/word2vect/pretrained.model")

In [9]:
#Train
vec = get_Word2Vect_from_clean(pd.DataFrame({'q1': q1_train_preprocessed, 'q2':q2_train_preprocessed}), pre)
dfw2v = pd.DataFrame({'q1':vec[0],'q2':vec[1],'dup':y_tr})
dfw2v.dropna(inplace=True)
y_tr_w2v = dfw2v['dup'].values
X_tr_w2v = np.hstack((np.array([x for x in dfw2v['q1'].values]),np.array([x for x in dfw2v['q2'].values])))
#Validation
vec = get_Word2Vect_from_clean(pd.DataFrame({'q1': q1_val_preprocessed, 'q2':q2_val_preprocessed}), pre)
dfw2v = pd.DataFrame({'q1':vec[0],'q2':vec[1],'dup':y_va})
dfw2v.dropna(inplace=True)
y_va_w2v = dfw2v['dup'].values
X_va_w2v = np.hstack((np.array([x for x in dfw2v['q1'].values]),np.array([x for x in dfw2v['q2'].values])))
#Test
vec = get_Word2Vect_from_clean(pd.DataFrame({'q1': q1_test_preprocessed, 'q2':q2_test_preprocessed}), pre)
dfw2v = pd.DataFrame({'q1':vec[0],'q2':vec[1],'dup':y_te})
dfw2v.dropna(inplace=True)
y_te_w2v = dfw2v['dup'].values
X_te_w2v = np.hstack((np.array([x for x in dfw2v['q1'].values]),np.array([x for x in dfw2v['q2'].values])))


In [10]:
logistic4 = sklearn.linear_model.LogisticRegression(solver="liblinear", random_state=123)
logistic4.fit(X_tr_w2v, y_tr_w2v)

In [12]:
# SAVE MODEL AND DATASET
folder = 'word2vect_pre'
# save model
if not os.path.isdir("model"):
    os.mkdir("model")

if not os.path.isdir(f"model/{folder}_solution"):
    os.mkdir(f"model/{folder}_solution")
        
with open(f"model/{folder}_solution/{folder}_model.pkl",'wb') as f:
    pickle.dump(logistic4,f)
    
# save data
with open(f'model/{folder}_solution/{folder}_model_X_tr.pkl','wb') as f:
    pickle.dump(X_tr_w2v,f)  
with open(f'model/{folder}_solution/{folder}_model_X_va.pkl','wb') as f:
    pickle.dump(X_va_w2v,f)   
with open(f'model/{folder}_solution/{folder}_model_X_te.pkl','wb') as f:
    pickle.dump(X_te_w2v,f)
with open(f'model/{folder}_solution/{folder}_model_y_tr.pkl','wb') as f:
    pickle.dump(y_tr_w2v,f)
with open(f'model/{folder}_solution/{folder}_model_y_va.pkl','wb') as f:
    pickle.dump(y_va_w2v,f)
with open(f'model/{folder}_solution/{folder}_model_y_te.pkl','wb') as f:
    pickle.dump(y_te_w2v,f)

### Sentence embeddings

In [None]:
# The remaining cells take a long time to run; using a GPU or reducing
# the dataset size is recommended.

# However, the embeddings and distances should already be computed and
# stored in files in the model folder, so these cells are only needed
# to reproduce the results.
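Since the expensive results are meant to be read from disk when available, a cell can be guarded with a small load-or-compute helper. A minimal sketch; `load_or_compute` and the file name are hypothetical and not part of the notebook's `utils`:

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return the object pickled at `path`, computing and storing it first if missing."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def slow_step():
    calls.append(1)          # record that the expensive path actually ran
    return [0.1, 0.2, 0.3]   # stand-in for embeddings or distances


cache_path = os.path.join(tempfile.mkdtemp(), "cache", "dists_demo.pkl")
first = load_or_compute(cache_path, slow_step)
second = load_or_compute(cache_path, slow_step)
```

The second call reads the pickle instead of recomputing, so the slow step runs only once.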

In [5]:
from sentence_transformers import SentenceTransformer

# Load selected model
model = SentenceTransformer("all-MiniLM-L6-v2")

In [13]:
X_tr = [(df["question1"], df["question2"]) for index, df in tr_df.iterrows()]
y_tr = tr_df["is_duplicate"].values

X_va = [(df["question1"], df["question2"]) for index, df in va_df.iterrows()]
y_va = va_df["is_duplicate"].values

X_te = [(df["question1"], df["question2"]) for index, df in te_df.iterrows()]
y_te = te_df["is_duplicate"].values

In [14]:
# Calculate embeddings (this is the cell that takes most time)

emb1s_tr, emb2s_tr = calculate_embeddings(model, X_tr)
emb1s_va, emb2s_va = calculate_embeddings(model, X_va)
emb1s_te, emb2s_te = calculate_embeddings(model, X_te)

In [17]:
# Calculate distances

dists_tr = calculate_distances(emb1s_tr, emb2s_tr)
dists_va = calculate_distances(emb1s_va, emb2s_va)
dists_te = calculate_distances(emb1s_te, emb2s_te)

291897it [00:52, 5517.16it/s]
15363it [00:02, 5397.93it/s]
16172it [00:02, 5478.22it/s]
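`calculate_distances` comes from `utils` and its definition is not shown here; one plausible choice for comparing two sentence embeddings is cosine distance. A minimal sketch under that assumption, in plain Python with toy vectors (`cosine_distance` and `pairwise_distances` are hypothetical names):

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(embs_a, embs_b):
    """Distance between the i-th vector of each list, one value per pair."""
    return [cosine_distance(u, v) for u, v in zip(embs_a, embs_b)]


# Toy embeddings: the first pair is parallel, the second is orthogonal.
toy_q1 = [[1.0, 0.0], [1.0, 1.0]]
toy_q2 = [[2.0, 0.0], [1.0, -1.0]]
toy_dists = pairwise_distances(toy_q1, toy_q2)
```

Under this assumption each question pair is reduced to a single number, which matches the later `reshape(-1, 1)` that turns the distances into a one-feature matrix for the classifier.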


In [18]:
# Train a classifier for similarities
from sklearn.linear_model import LogisticRegression

data_tr = np.array(dists_tr).reshape(-1, 1)

logistic = LogisticRegression(solver="liblinear", random_state=123)
logistic.fit(data_tr, y_tr)

In [19]:
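# make sure the output folder exists before writing to it
os.makedirs('model/pretrained_embeddings', exist_ok=True)
with open('model/pretrained_embeddings/logistic_distances.pkl','wb') as f:
    pickle.dump(logistic,f)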

In [21]:
# Join embeddings
embs_tr = np.concatenate([emb1s_tr, emb2s_tr], axis=-1)
embs_va = np.concatenate([emb1s_va, emb2s_va], axis=-1)
embs_te = np.concatenate([emb1s_te, emb2s_te], axis=-1)

In [23]:
# Train a classifier for embeddings
from xgboost import XGBClassifier

xgb_model = XGBClassifier(random_state=123)
xgb_model.fit(embs_tr, y_tr)

In [24]:
with open('model/pretrained_embeddings/xgb_model.pkl','wb') as f:
    pickle.dump(xgb_model,f)

In [25]:
# Save all data needed to reproduce the results
# The train data is not used when reproducing the results, so we skip it

#with open('model/pretrained_embeddings/dists_tr.pkl','wb') as f:
#    pickle.dump(dists_tr,f)
with open('model/pretrained_embeddings/dists_va.pkl','wb') as f:
    pickle.dump(dists_va,f)
with open('model/pretrained_embeddings/dists_te.pkl','wb') as f:
    pickle.dump(dists_te,f)
    
#with open('model/pretrained_embeddings/embs_tr.pkl','wb') as f:
#    pickle.dump(embs_tr,f)
with open('model/pretrained_embeddings/embs_va.pkl','wb') as f:
    pickle.dump(embs_va,f)
with open('model/pretrained_embeddings/embs_te.pkl','wb') as f:
    pickle.dump(embs_te,f)
    
#with open('model/pretrained_embeddings/y_tr.pkl','wb') as f:
#    pickle.dump(y_tr,f)
with open('model/pretrained_embeddings/y_va.pkl','wb') as f:
    pickle.dump(y_va,f)
with open('model/pretrained_embeddings/y_te.pkl','wb') as f:
    pickle.dump(y_te,f)