# Train Models
<div style="background-color: lightblue; padding: 10px; border-radius: 10px;">

**IMPORTANT INFO:**

The `train_models.ipynb` notebook:
- Is the responsibility of every member of the group. Each of you should run it and make sure it works as expected.
- Has to use the code written by each group member to generate features for the challenge.


`models`: A folder containing the trained models. This folder should be created by `train_models.ipynb`, and the models should be stored there after running the notebook. The code should check whether the folder already exists and, if it does, not overwrite or store the models again.

</div>
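
The no-overwrite rule above could be sketched roughly as follows. This is only an illustration, not the notebook's actual code: the helper name `save_if_missing` and the demo file name are hypothetical, and the check is done per file rather than per folder, which is an assumption about how strictly the rule should be read.

```python
import os
import pickle
import tempfile


def save_if_missing(obj, path):
    """Pickle `obj` to `path` unless a file already exists there.

    Hypothetical helper: returns True if the file was written,
    False if an existing file was left untouched.
    """
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return True


# Demo in a throwaway directory so no real model folder is touched.
demo_root = tempfile.mkdtemp()
demo_path = os.path.join(demo_root, "models", "demo_model.pkl")
first_write = save_if_missing({"weights": [1, 2, 3]}, demo_path)
second_write = save_if_missing({"weights": [9, 9, 9]}, demo_path)

with open(demo_path, "rb") as f:
    stored = pickle.load(f)
```

The second call leaves the first file in place, so re-running a notebook cell written this way would not replace previously stored models.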

## Imports

In [20]:
import pandas as pd
import matplotlib.pyplot as plt
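# import the sklearn submodules used below explicitly instead of a wildcard import
import sklearn.model_selection
import sklearn.feature_extraction.text
import sklearn.linear_model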
import numpy as np
import sklearn
import pickle
import scipy
import os

import sys

import seaborn as sns
sns.set()

from utils import *

## Load Data

The problem guide states:

This is a Kaggle challenge: There is no validation/test data with labels.
Therefore you have to create the following split in order to share the same train, validation and test splits across teams:

In [2]:
path_folder_quora = '../nlp_deliv1_materials/'

# Train and Validation data
train_df = pd.read_csv(os.path.join(path_folder_quora, "quora_train_data.csv"))
# use this to provide the expected generalization results
test_df = pd.read_csv(os.path.join(path_folder_quora,"quora_test_data.csv"))

A_df, te_df = sklearn.model_selection.train_test_split(train_df, test_size=0.05, random_state=123)
tr_df, va_df = sklearn.model_selection.train_test_split(A_df, test_size=0.05, random_state=123)
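
The two-stage split can be checked on synthetic data. The sketch below uses a made-up 1000-row frame (an assumption, not the Quora file) and only verifies that the three partitions are disjoint, cover every row, and are reproducible with a fixed `random_state`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training file (assumption: 1000 rows).
toy_df = pd.DataFrame({
    "id": range(1000),
    "is_duplicate": [i % 2 for i in range(1000)],
})

# Same recipe as above: first carve out the test set, then the validation set.
rest_df, toy_te = train_test_split(toy_df, test_size=0.05, random_state=123)
toy_tr, toy_va = train_test_split(rest_df, test_size=0.05, random_state=123)

ids_tr, ids_va, ids_te = set(toy_tr["id"]), set(toy_va["id"]), set(toy_te["id"])
no_overlap = not (ids_tr & ids_va) and not (ids_tr & ids_te) and not (ids_va & ids_te)
covers_all = len(ids_tr) + len(ids_va) + len(ids_te) == len(toy_df)

# Repeating the split with the same seed yields the same test rows.
_, toy_te_again = train_test_split(toy_df, test_size=0.05, random_state=123)
reproducible = list(toy_te["id"]) == list(toy_te_again["id"])
```

Because the second split is applied to the remaining 95% of the rows, the validation set is about 4.75% of the original data rather than a full 5%.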

In [3]:
# dividing X and y for each dataset
y_tr = tr_df['is_duplicate'].values
X_tr_df = tr_df.drop(['is_duplicate'], axis=1)

y_va = va_df['is_duplicate'].values
X_va_df = va_df.drop(['is_duplicate'], axis=1)

y_te = te_df['is_duplicate'].values
X_te_df = te_df.drop(['is_duplicate'], axis=1)

print(f'Training:\n X train {X_tr_df.shape}\n y train {y_tr.shape}\n {"-"*20}')
print(f'Validation:\n X val {X_va_df.shape}\n y val {y_va.shape}\n {"-"*20}')
print(f'Test:\n X test {X_te_df.shape}\n y test {y_te.shape}\n {"-"*20}')

Training:
 X train (291897, 5)
 y train (291897,)
 --------------------
Validation:
 X val (15363, 5)
 y val (15363,)
 --------------------
Test:
 X test (16172, 5)
 y test (16172,)
 --------------------


# Simple Solution

In [4]:
# convert input data into list of strings

q1_train =  cast_list_as_strings(list(X_tr_df["question1"]))
q2_train =  cast_list_as_strings(list(X_tr_df["question2"]))

q1_val =  cast_list_as_strings(list(X_va_df["question1"]))
q2_val =  cast_list_as_strings(list(X_va_df["question2"]))

q1_test =  cast_list_as_strings(list(X_te_df["question1"]))
q2_test =  cast_list_as_strings(list(X_te_df["question2"]))

Use all the questions in the training partition to build a single list, `all_q_train`, and fit the `count_vectorizer` on it.

In [5]:
all_q_train = q1_train+q2_train

count_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,1))
count_vectorizer.fit(all_q_train)

CountVectorizer()
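
To illustrate what fitting the vectorizer does, here is a small sketch on two made-up questions (the toy sentences and variable names are assumptions for illustration only). It shows the learned unigram vocabulary and how words not seen during `fit` are dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_questions = ["How are you", "What are you doing"]
toy_vectorizer = CountVectorizer(ngram_range=(1, 1))
toy_vectorizer.fit(toy_questions)

# The vocabulary is learned only from the fitted text; tokens are lowercased by default.
vocab = sorted(toy_vectorizer.vocabulary_)

# "there" was never seen during fit, so it contributes nothing to the counts.
counts = toy_vectorizer.transform(["are you you there"]).toarray()
```

Here `vocab` is `['are', 'doing', 'how', 'what', 'you']` and `counts` is `[[1, 0, 0, 0, 2]]`. Words that appear only in the validation or test questions are ignored in the same way by a vectorizer fitted on training questions alone.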

In [6]:
# get features (concatenating q1+q2)
X_tr_q1q2 = get_features_from_df(X_tr_df, count_vectorizer) # casts the questions to strings and applies count_vectorizer
X_va_q1q2 = get_features_from_df(X_va_df, count_vectorizer)
X_te_q1q2 = get_features_from_df(X_te_df, count_vectorizer)
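
`get_features_from_df` comes from `utils` and its body is not shown here. A plausible sketch, assuming it vectorizes each question column and stacks the two sparse matrices side by side, might look like the following. The function name `sketch_features_from_df` and the toy data are hypothetical.

```python
import pandas as pd
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer


def sketch_features_from_df(df, vectorizer):
    """Hypothetical stand-in for `get_features_from_df` from `utils`."""
    # Cast every entry to str so missing values do not break the vectorizer.
    q1 = [str(q) for q in df["question1"]]
    q2 = [str(q) for q in df["question2"]]
    X_q1 = vectorizer.transform(q1)
    X_q2 = vectorizer.transform(q2)
    # Place the two count matrices side by side: one row per question pair.
    return scipy.sparse.hstack((X_q1, X_q2)).tocsr()


toy_df = pd.DataFrame({
    "question1": ["how are you", "what is nlp"],
    "question2": ["how do you do", None],
})
toy_vec = CountVectorizer().fit(["how are you", "what is nlp", "how do you do"])
X_toy = sketch_features_from_df(toy_df, toy_vec)
```

Under this assumption the feature matrix has twice as many columns as the vocabulary has entries, which would be consistent with an even feature count, although that is an inference rather than something the notebook states.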

In [7]:
print(f'Training:\n X train {X_tr_q1q2.shape}\n {"-"*20}')
print(f'Validation:\n X val {X_va_q1q2.shape}\n{"-"*20}')
print(f'Test:\n X test {X_te_q1q2.shape}\n{"-"*20}')

Training:
 X train (291897, 149650)
 --------------------
Validation:
 X val (15363, 149650)
--------------------
Test:
 X test (16172, 149650)
--------------------


In [8]:
# training a simple model on the training partition only
logistic = sklearn.linear_model.LogisticRegression(solver="liblinear",
                                                   random_state=123)
logistic.fit(X_tr_q1q2, y_tr)

LogisticRegression(random_state=123, solver='liblinear')

### Saving simple model
Create the model folder and save the trained model.

In [24]:
# save model (create the folders only if they do not exist yet)
os.makedirs("model/simple_solution", exist_ok=True)

with open('model/simple_solution/simple_model.pkl','wb') as f:
    pickle.dump(logistic,f)

In [25]:
# save dataset with correct features:

# Save as model_name+(X/y)+(tr/va/te) (depending on whether it's a dataset or labels and which partition it belongs to)

with open('model/simple_solution/simple_model_X_tr.pkl','wb') as f:
    pickle.dump(X_tr_q1q2,f)  
with open('model/simple_solution/simple_model_X_va.pkl','wb') as f:
    pickle.dump(X_va_q1q2,f)   
with open('model/simple_solution/simple_model_X_te.pkl','wb') as f:
    pickle.dump(X_te_q1q2,f)
with open('model/simple_solution/simple_model_y_tr.pkl','wb') as f:
    pickle.dump(y_tr,f)
with open('model/simple_solution/simple_model_y_va.pkl','wb') as f:
    pickle.dump(y_va,f)
with open('model/simple_solution/simple_model_y_te.pkl','wb') as f:
    pickle.dump(y_te,f)
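
A stored model can later be read back with `pickle.load`. The sketch below is a self-contained round trip on a tiny made-up dataset written to a temporary directory (both are assumptions, not the Quora features or the `model/` folder), checking that the restored model predicts the same labels.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in data (assumption: not the real question features).
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
y_demo = np.array([0, 1, 0, 1])
model = LogisticRegression(solver="liblinear", random_state=123).fit(X_demo, y_demo)

# Write the model to a throwaway location, then load it again.
demo_dir = tempfile.mkdtemp()
model_path = os.path.join(demo_dir, "simple_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that produced them.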

# Improved Solution

### Text Cleaning
### Feature Selection (try different models with different features)
### Incorporate embeddings (word2vec, TF-IDF, ...) + train more models
### Pre-trained LLM