## **Quick approach model optimization**

We first take a quick pass to see which model performs best, and then look at what can be tuned to improve its performance.  
To do this we use the [PyCaret](https://pycaret.org/) library and test on a sample of the original dataset.

### Libraries📘

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import datatable as dt
import time
#import pickle
#import warnings
#warnings.simplefilter("ignore")
import matplotlib.pyplot as plt
%pylab
%matplotlib inline

## **Loading and reducing memory**

We load the data and downcast the column dtypes to reduce memory usage.  
Reference: https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
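
To make the idea concrete, here is a minimal sketch of what downcasting buys on a toy frame. The column names and values are illustrative assumptions, not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame with default 64-bit dtypes (values are illustrative only).
toy = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),   # range 0..99 fits in int8
    "small_float": np.linspace(0.0, 1.0, 100),     # float64 by default
})
before = toy.memory_usage(index=False).sum()

# Downcast each column to a narrower dtype that still holds its range.
toy["small_int"] = toy["small_int"].astype(np.int8)
toy["small_float"] = toy["small_float"].astype(np.float32)
after = toy.memory_usage(index=False).sum()

print(f"{before} bytes -> {after} bytes")
```

Keep in mind that float16 only keeps about three significant decimal digits, so downcasting floats that far trades precision for memory.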

In [None]:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype.name

        if col_type not in ['object', 'category', 'datetime64[ns, UTC]']:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

We read only the first hundred thousand rows to check how our model works without overloading memory
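
A row limit applied while reading (such as `max_nrows` in the cell below) keeps rows from the top of the file, whereas a `tail` slice keeps the most recent ones. A minimal sketch of the difference with pandas, using a hypothetical ten-row stand-in for `train.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: ten rows with an increasing ts_id.
csv_text = "ts_id,resp\n" + "\n".join(f"{i},{i * 0.1:.1f}" for i in range(10))

# A read-time row limit keeps the FIRST rows of the file.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Taking the LAST rows means reading the file and then slicing it.
last_rows = pd.read_csv(io.StringIO(csv_text)).tail(3)

print(first_rows["ts_id"].tolist())
print(last_rows["ts_id"].tolist())
```

For time-ordered data the two samples can behave quite differently, so it is worth being explicit about which one is used.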

In [None]:
%%time

df_train = (
    dt.fread('../input/jane-street-market-prediction/train.csv', max_nrows=100000)
      .to_pandas()
      .pipe(reduce_mem_usage)
)
df_train.head()

In [None]:
#size = 100000
#dftrain = df_train.tail(size)

In [None]:
#Index reset
df_train.reset_index(drop=True, inplace=True)
df_train.head()

## EDA

The goal of this notebook is **not to do a thorough data exploration**, since there are already excellent notebooks that cover this task. Building on their findings, we will try to develop a quick, optimized method for predicting in the Jane Street challenge.

Reference: https://www.kaggle.com/muhammadmelsherbini/jane-street-extensive-eda

## **Preprocessing**⚙

* We derive the target from the variable 'resp'
* We drop the rows where the weight is equal to 0
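
As a small illustration of these two steps, here is a sketch on a toy frame. The values are made up; only the column names mirror the competition data.

```python
import pandas as pd

# Toy rows: two of them have zero weight and should be dropped.
toy = pd.DataFrame({
    "weight": [0.0, 1.5, 2.0, 0.0],
    "resp": [0.01, -0.02, 0.03, -0.01],
    "feature_0": [1.0, 2.0, 3.0, 4.0],
})

# Target: 1 when the response is positive, 0 otherwise.
toy["action"] = (toy["resp"] > 0).astype("int")

# Drop the zero-weight rows and renumber the index.
toy = toy.loc[toy["weight"] != 0].reset_index(drop=True)
print(toy)
```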

In [None]:
#Features selection
features = [c for c in df_train.columns if 'feature' in c]
#Target
#Reference: https://www.kaggle.com/iamleonie/utility-function-and-patterns-in-missing-values
df_train['action'] = (df_train['resp'] > 0).astype('int')
#We delete cases when weight is equal to 0
df_train = df_train.loc[df_train.weight != 0]
#We generate the Dataset selection to model
df_train = pd.concat([df_train.weight, df_train[features], df_train.action], axis=1)
#Reindex to keep numbers in a row
df_train.reset_index(drop=True, inplace=True)
df_train.head()

In [None]:
df_train.info()

We handle null and infinite values
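
The three filling strategies considered in this section (zero, median, linear interpolation) can be compared on a toy column first. This is only a sketch on made-up values, to show how differently they treat the same gaps:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"feature_0": [1.0, np.nan, 3.0, np.inf, 10.0]})

# Infinite values become NaN so they are treated like any other gap.
toy = toy.replace([np.inf, -np.inf], np.nan)

filled_zero = toy.fillna(0)                        # strategy 1: constant 0
filled_median = toy.fillna(toy.median())           # strategy 2: column median
filled_linear = toy.interpolate(method="linear")   # strategy 3: between neighbours

print(filled_zero["feature_0"].tolist())
print(filled_median["feature_0"].tolist())
print(filled_linear["feature_0"].tolist())
```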

In [None]:
#Replace infinite values with NaN so they are filled together with the other missing values
df_train.replace([np.inf, -np.inf], np.nan, inplace=True)

In [None]:
#First strategy
#Fill nan values with 0
#df_train.fillna(0,inplace=True)

#Second strategy
#Fill nan values with median
train_median = df_train.median()
df_train = df_train.fillna(train_median)

#Third strategy
#Linear interpolation
#df_train.interpolate(method='linear', inplace=True)

In [None]:
#print('Original size:', df_train.shape[0])
#print('Sample size:', df_train.shape[0])
#print('% Sample:', round((df_train.shape[0]/df_train.shape[0]),2)*100,'%')

In [None]:
#We save the DataFrame so we can reload it and work on it later
#dftrain.to_csv('/kaggle/working/df_tail_slice.csv')

In [None]:
#%%time
#dftrain = pd.read_csv('/kaggle/working/df_tail_slice.csv', index_col=0)

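Alternatively, PyCaret can fill the missing values itself if we pass numeric_imputation='zero' to setup(). Since the median fill above has already been applied, that setting should have little left to impute here.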

### Approaching the best model with PyCaret (AutoML)

In [None]:
!pip install pycaret

In [None]:
%%time

from pycaret.classification import *
exp1 = setup(df_train, target = 'action', feature_selection = True, feature_selection_threshold=0.8,
             remove_multicollinearity=True, multicollinearity_threshold=0.6, numeric_imputation='zero')

Feature selection and the multicollinearity filter (threshold 0.6) reduce the dataset to 26 features.
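
To give an intuition for what a multicollinearity threshold does, here is a simplified stand-in written with pandas. It is an assumption-laden sketch, not PyCaret's actual procedure: it drops the later feature of any pair whose absolute correlation exceeds the threshold, on synthetic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "feature_0": base,
    "feature_1": base + rng.normal(scale=0.1, size=500),  # near-duplicate of feature_0
    "feature_2": rng.normal(size=500),                     # independent noise
})

threshold = 0.6
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a column when it is too correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```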

In [None]:
%%time

compare_models()

The best results are provided by the **Extra Trees Classifier** model.  
Better results are obtained when the missing values are filled with the column medians, i.e. the second strategy above (`df_train.fillna(df_train.median())`).

In [None]:
#We create the model that has given us the best performance: "Extra Trees Classifier"
et_model = create_model('et')

In [None]:
#Try to improve the choice of hyperparameters to optimize performance
tuned_et_model = tune_model(et_model)

In [None]:
#We save the model
save_model(et_model, '/kaggle/working/et_model_saved_15012021')

#Loading the saved model
#et_saved = load_model('/kaggle/working/et_model_saved_21122020')

In [None]:
%%time

# tune multiple models dynamically
top3 = compare_models(n_select = 3)
tuned_et_modeltop3 = [tune_model(i) for i in top3]

In [None]:
tuned_et_modeltop3

In [None]:
#We save the model
save_model(tuned_et_modeltop3, '/kaggle/working/tuned_et_top3_saved_21122020')

In [None]:
#We predict on the hold-out set that PyCaret created in setup() and check for overfitting
et_model_pred = predict_model(et_model)

There does not appear to be much overfitting.
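
For reference, a check of this kind can also be reproduced outside PyCaret. The sketch below uses scikit-learn's `ExtraTreesClassifier` on synthetic data (all data and parameters here are assumptions for illustration) and compares the training score with the hold-out score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on rows the model has seen
hold_acc = model.score(X_hold, y_hold)     # accuracy on unseen rows
gap = train_acc - hold_acc

print(f"train={train_acc:.3f} hold-out={hold_acc:.3f} gap={gap:.3f}")
```

Fully grown tree ensembles tend to score almost perfectly on the rows they were trained on, so the hold-out score (or the cross-validation score) is the more informative number.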

In [None]:
%%time

#finalize_model refits the model on the whole dataset (including the hold-out set) so we can test it the traditional way and on other datasets
et_final = finalize_model(et_model)

We test the model again on the whole training set

In [None]:
%%time

df_train = (
    dt.fread('../input/jane-street-market-prediction/train.csv')
      .to_pandas()
      .pipe(reduce_mem_usage)
)
df_train.head()

In [None]:
##PREPROCESSING##

#Features selection
features = [c for c in df_train.columns if 'feature' in c]
#Target
#Reference: https://www.kaggle.com/iamleonie/utility-function-and-patterns-in-missing-values
df_train['action'] = (df_train['resp'] > 0).astype('int')
#We delete cases when weight is equal to 0
df_train = df_train.loc[df_train.weight != 0]
#We generate the Dataset selection to model
df_train = pd.concat([df_train.weight, df_train[features], df_train.action], axis=1)

#Replace infinite values with NaN first, so they are filled in the next step
df_train.replace([np.inf, -np.inf], np.nan, inplace=True)
#Fill nan values with 0
df_train.fillna(0,inplace=True)
#Reindex to keep numbers in a row
df_train.reset_index(drop=True, inplace=True)

In [None]:
#Function to measure model metrics
from sklearn.metrics import accuracy_score, auc, confusion_matrix, f1_score, precision_score, recall_score, roc_curve

def metrics_models(y_true, y_pred):
    # Confusion matrix (stored under a name that does not shadow the sklearn function)
    cm = confusion_matrix(y_true, y_pred)

    print("The confusion matrix is")
    print(cm)

    print('Accuracy:', accuracy_score(y_true, y_pred))
    print('Precision:', precision_score(y_true, y_pred))
    print('Recall:', recall_score(y_true, y_pred))
    print('F1:', f1_score(y_true, y_pred))

    # ROC curve; with hard 0/1 predictions it has a single operating point
    false_positive_rate, recall, thresholds = roc_curve(y_true, y_pred)
    roc_auc = auc(false_positive_rate, recall)

    print('AUC:', roc_auc)

    plt.plot(false_positive_rate, recall, 'b')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.title('AUC = %0.2f' % roc_auc)

In [None]:
#We calculate performance the traditional way with the metrics_models() function
y_test = df_train['action']

y_pred_test = predict_model(et_final, data = df_train)
y_pred_test = y_pred_test['Label']
y_pred_test = pd.to_numeric(y_pred_test)
metrics_models(y_test, y_pred_test)
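
One caveat about the AUC reported this way: it is computed from hard 0/1 labels, so the ROC curve has a single operating point. The sketch below, on synthetic scores (an assumption, not the notebook's model output), shows how the AUC from continuous scores can differ from the AUC of the same scores thresholded at 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)

# Synthetic scores: positives are shifted up by one unit before a logistic squash.
logits = (y_true - 0.5) + rng.normal(size=2000)
scores = 1.0 / (1.0 + np.exp(-logits))

# Hard labels obtained by thresholding the scores at 0.5.
labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)

print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from labels: {auc_from_labels:.3f}")
```

When a probability-like score is available, passing it to the ROC functions uses the full ranking instead of one threshold.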

## Submitting test predictions on the whole dataset

In [None]:
%%time
folder_path = '../input/jane-street-market-prediction/'
test = dt.fread(folder_path + 'example_test.csv').to_pandas()

In [None]:
##PREPROCESSING##

#Features selection
features = [c for c in test.columns if 'feature' in c]

#Fill nan values with 0
#test.fillna(0,inplace=True)

#Fill nan values with median
test_median = test.median()
test = test.fillna(test_median)
#We check that we have removed all other null and infinite values
#df_train.replace([np.inf, -np.inf], np.nan, inplace=True)
#Reindex to keep numbers in a row
#df_train.reset_index(drop=True, inplace=True)

prediction = predict_model(et_final, data = test)
prediction.head()

In [None]:
#prediction = predict_model(etc_final, data = test)
#sample_prediction_df = pd.DataFrame([prediction.Label], columns=['action'], index=prediction.ts_id)
#Keep only the predicted label, indexed by ts_id and renamed to "action"
sample_prediction_df = prediction.set_index('ts_id')[['Label']].rename(columns={'Label': 'action'})
sample_prediction_df.head()

In [None]:
%%time
#First option
import janestreet
try:
    env = janestreet.make_env() # initialize the environment
    iter_test = env.iter_test()
except Exception:
    janestreet.make_env.__called__ = False # reset the flag so make_env() can be called again
    env = janestreet.make_env() # initialize the environment again
    iter_test = env.iter_test()

    
for (test_df, sample_prediction_df) in iter_test:
    wt = test_df.iloc[0].weight
    if(wt == 0):
        sample_prediction_df.action = 0 
    else:
        #test_median = test_df.median()
        #test_df = test_df.fillna(test_median)
        test_df.fillna(0,inplace=True)
        predictions = predict_model(et_final, data = test_df)['Label'].astype(int)
        sample_prediction_df = predictions.to_frame()
        sample_prediction_df.rename(columns ={'Label':'action'}, inplace = True)
    env.predict(sample_prediction_df)
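
The control flow of this loop can be tried out without the competition API. The sketch below uses a hypothetical `MockEnv` class and a hypothetical `toy_model` function (neither is part of `janestreet` or PyCaret) to show the pattern: one row in, one `action` out, and a `predict` call on every iteration.

```python
import pandas as pd

class MockEnv:
    """Hypothetical stand-in for the competition environment (not the real API)."""

    def __init__(self, frames):
        self._frames = frames
        self.submitted = []

    def iter_test(self):
        for test_df in self._frames:
            # Yield the single-row test frame plus an empty prediction frame.
            yield test_df, pd.DataFrame({"action": [0]}, index=test_df.index)

    def predict(self, prediction_df):
        self.submitted.append(int(prediction_df["action"].iloc[0]))

def toy_model(test_df):
    """Hypothetical model: act when feature_0 is positive."""
    return int(test_df["feature_0"].iloc[0] > 0)

frames = [
    pd.DataFrame({"weight": [0.0], "feature_0": [1.0]}),
    pd.DataFrame({"weight": [2.0], "feature_0": [1.0]}),
    pd.DataFrame({"weight": [1.0], "feature_0": [-1.0]}),
]

mock_env = MockEnv(frames)
for test_df, sample_prediction_df in mock_env.iter_test():
    if test_df["weight"].iloc[0] == 0:
        sample_prediction_df["action"] = 0      # zero-weight rows are skipped
    else:
        sample_prediction_df["action"] = toy_model(test_df)
    mock_env.predict(sample_prediction_df)      # submit on every iteration

print(mock_env.submitted)
```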

In [None]:
#Second option
import janestreet
try:
    env = janestreet.make_env() # initialize the environment
    iter_test = env.iter_test()
except Exception:
    janestreet.make_env.__called__ = False # reset the flag so make_env() can be called again
    env = janestreet.make_env() # initialize the environment again
    iter_test = env.iter_test()

for (test_df, sample_prediction_df) in iter_test:
    #X_test = test_df.loc[:, test_df.drop(columns=(drop_col+['resp']))]
    if test_df['weight'].item() > 0:
        #Predict on the current row (test_df), not on the example test set
        predictions = predict_model(et_final, data = test_df)['Label'].astype(int)
        sample_prediction_df.action = predictions.values
    else:
        sample_prediction_df.action = 0
    #A prediction must be submitted on every iteration
    env.predict(sample_prediction_df)

### Test: loading the saved model

In [None]:
#Loading the saved model
#et_final = load_model('../input/et-model-pycaret/et_model_saved_13012021.pkl')

In [None]:
import pickle

# path file
pkl_path = "../input/et-model-pycaret/et_model_saved_13012021.pkl"
# Load from file
with open(pkl_path, 'rb') as file:
    rf_final = pickle.load(file)

In [None]:
# Calculate the accuracy score and predict target values
# Note: Xtest and Ytest (held-out features and labels) are not defined in this notebook and must be created first
score = rf_final.score(Xtest, Ytest)
print("Test score: {0:.2f} %".format(100 * score))
Ypredict = rf_final.predict(Xtest)

for (test_df, sample_prediction_df) in iter_test:
    X_test = test_df.loc[:, test_df.columns.str.contains('feature')]
    y_preds = rf_final.predict(X_test)
    sample_prediction_df.action = y_preds
    env.predict(sample_prediction_df)

### Test to restart the environment

In [None]:
#Test to restart the environment
import janestreet
try:
    env = janestreet.make_env() # initialize the environment
    iter_test = env.iter_test()
except Exception:
    janestreet.make_env.__called__ = False # reset the flag so make_env() can be called again
    env = janestreet.make_env() # initialize the environment again
    iter_test = env.iter_test()
    

iter_test