Hello.

I find AutoML tools the best way to get baseline models, so here I'm trying another one called EvalML. You can find my other AutoML notebooks [here](https://www.kaggle.com/kritidoneria/code?userId=1260510&sortBy=dateRun&tab=profile&language=Python&privacy=public).

A huge shoutout to [this notebook](https://www.kaggle.com/gauravduttakiit/automate-the-ml-pipelines-with-evalml) for introducing me to this library.
I've also used EvalML to compete in [TPS May](https://www.kaggle.com/kritidoneria/automl-tps-may21-using-evalml).
Another reference for this work is [this Jane Street XGBoost notebook](https://www.kaggle.com/tsnarendran14/jane-street-simple-xgb-model/data).

<h1> Introduction to the library </h1>

Source: https://github.com/alteryx/evalml

EvalML is an AutoML library which builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

**Key Functionality**

1. Automation - Makes machine learning easier. Avoid training and tuning models by hand. Includes data quality checks, cross-validation and more.
2. Data Checks - Catches and warns of problems with your data and problem setup before modeling.
3. End-to-end - Constructs and optimizes pipelines that include state-of-the-art preprocessing, feature engineering, feature selection, and a variety of modeling techniques.
4. Model Understanding - Provides tools to understand and introspect on models, to learn how they'll behave in your problem domain.
5. Domain-specific - Includes repository of domain-specific objective functions and an interface to define your own.

<h1> Installation from PyPI </h1>

In [1]:
!pip install evalml

Collecting evalml
  Downloading evalml-0.24.0-py3-none-any.whl (6.2 MB)
Collecting graphviz>=0.13
  Downloading graphviz-0.16-py2.py3-none-any.whl (19 kB)
Collecting requirements-parser>=0.2.0
  Downloading requirements-parser-0.2.0.tar.gz (6.3 kB)
Collecting kaleido>=0.1.0
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
Collecting lightgbm<3.1.0,>=2.3.1
  Downloading lightgbm-3.0.0-py2.py3-none-manylinux1_x86_64.whl (1.7 MB)
Collecting pyzmq<22.0.0
  Downloading pyzmq-21.0.2-cp37-cp37m-manylinux1_x86_64.whl (6.7 MB)
Collecting sktime>=0.5.3
  Downloading sktime-0.6.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.7 MB)

<h1> Load the Dataset </h1>

In [2]:
import evalml
from evalml import AutoMLSearch
import pandas as pd

In [3]:
# Limiting rows here because of computational bottlenecks
X = pd.read_csv('/kaggle/input/jane-street-market-prediction/train.csv', nrows=1000)
# Note: despite its name, `y` holds the example test features, not the training labels
y = pd.read_csv('/kaggle/input/jane-street-market-prediction/example_test.csv')

<h2> Preprocessing </h2>

In [4]:
# Keep only the columns where less than 7 percent of the values are missing
missing_frac = X.isnull().mean()
final_cols = missing_frac[missing_frac < 0.07]

In [5]:
# Selecting only the required columns
X = X[final_cols.index]
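
To see what the two cells above do, here is a minimal sketch on a toy frame. The column names `a`, `b` and `c` are made up for illustration. `isnull().mean()` gives the fraction of missing values per column, and the comparison keeps only the columns under the 7 percent threshold.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): column "b" is 50% missing, "a" and "c" are complete
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 1.0, np.nan, 2.0],
    "c": [0.5, 0.1, 0.2, 0.3],
})

missing_frac = toy.isnull().mean()         # fraction of NaNs per column
kept = missing_frac[missing_frac < 0.07]   # same threshold as in the notebook
toy_filtered = toy[kept.index]             # select the surviving columns

print(missing_frac.to_dict())
print(list(toy_filtered.columns))
```

Only `a` and `c` survive here, because `b` is far above the threshold.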

In [6]:
# Fill the remaining NA values with each column's median
X = X.fillna(X.median())
import numpy as np  # needed for np.where in the next cell
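
As a quick illustration of the median fill, here is a small sketch on made-up data (the columns `f1` and `f2` are hypothetical). `DataFrame.median()` skips NaNs by default, and `fillna` with the resulting Series fills each column with its own median.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 10.0],
    "f2": [np.nan, 2.0, 4.0, 6.0],
})

medians = toy.median()        # per-column medians, NaNs ignored
filled = toy.fillna(medians)  # each NaN gets the median of its own column

print(medians.to_dict())
print(filled)
```

Keep in mind that in this notebook the medians come from the 1000 loaded rows only, so they are rough estimates of the full-data medians.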

In [7]:
# action is 1 only when resp_1..resp_4 and resp are all positive, otherwise 0
X['action'] = np.where(
    (X.resp_1 > 0) & (X.resp_2 > 0) & (X.resp_3 > 0) & (X.resp_4 > 0) & (X.resp > 0),
    1, 0)
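
The label rule above can be checked on a tiny made-up example. This is only a sketch: the response values are invented, and the `all(axis=1)` form is an equivalent way of writing the chained `&` conditions, not something the original notebook uses.

```python
import numpy as np
import pandas as pd

resp_cols = ["resp_1", "resp_2", "resp_3", "resp_4", "resp"]

# Hypothetical responses: row 0 is positive everywhere, rows 1 and 2 are not
toy = pd.DataFrame({
    "resp_1": [0.1, 0.2, -0.1],
    "resp_2": [0.3, 0.1, 0.2],
    "resp_3": [0.2, -0.4, 0.1],
    "resp_4": [0.5, 0.3, 0.2],
    "resp":   [0.1, 0.2, 0.3],
})

# Chained conditions, as in the notebook cell
chained = np.where(
    (toy.resp_1 > 0) & (toy.resp_2 > 0) & (toy.resp_3 > 0)
    & (toy.resp_4 > 0) & (toy.resp > 0), 1, 0)

# Equivalent compact form: every response column must be positive
compact = (toy[resp_cols] > 0).all(axis=1).astype(int)

print(chained.tolist(), compact.tolist())
```

A single non-positive response is enough to set `action` to 0, which is why the positive class ends up being the minority.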

In [8]:
# Drop the identifiers, the weight, and the resp columns (they define the label) before splitting
features = X.drop(columns=['date', 'weight', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp', 'ts_id', 'action'])
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(features, X['action'], problem_type='binary')
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 99), (200, 99), (800,), (200,))
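
The shapes above show an 80/20 split of the 1000 rows. As a rough sketch of what a stratified 80/20 split looks like, here is a NumPy-only version on made-up labels. That `split_data` stratifies by class for binary problems is my assumption, and the class counts below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 250 positives and 750 negatives
labels = np.array([1] * 250 + [0] * 750)

test_idx = []
for cls in np.unique(labels):
    idx = np.flatnonzero(labels == cls)   # row positions of this class
    rng.shuffle(idx)
    n_test = int(round(0.2 * len(idx)))   # hold out 20% of each class
    test_idx.extend(idx[:n_test].tolist())

test_mask = np.zeros(len(labels), dtype=bool)
test_mask[test_idx] = True

train_labels, test_labels = labels[~test_mask], labels[test_mask]
print(train_labels.shape, test_labels.shape)
print(train_labels.mean(), test_labels.mean())
```

Because the 20% is taken per class, the share of positives is the same in the training and test parts.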

<h1> Run the search for the best classification model </h1>

In [9]:
# Limiting the search (three model families, five batches) for efficiency
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary',
                      allowed_model_families=['xgboost', 'lightgbm', 'catboost'],
                      max_batches=5)
automl.search()

Generating pipelines to search over...



*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary. 
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 5 batches for a total of 24 pipelines. 
Allowed model families: xgboost, catboost, lightgbm



[Interactive FigureWidget plot with a 'Best Score' trace (lines+markers)]

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 9.153

*****************************
* Evaluating Batch Number 1 *
*****************************

XGBoost Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.649
CatBoost Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.630
LightGBM Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.733

*****************************
* Evaluating Batch Number 2 *
*****************************

CatBoost Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.934
CatBoost Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.803
CatBoost Classifier w/ Imputer:
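
A note on the baseline score in the log above: a log loss of 9.153 looks huge next to the boosted models, but it is what you would expect from a model that always predicts the majority class with full confidence. The sketch below assumes that probabilities are clipped at `1e-15` before taking the log (scikit-learn's `log_loss` historically did this; whether EvalML 0.24.0 does the same is my assumption). Under that assumption every wrong prediction costs about 34.5, so the mean loss is that penalty times the share of rows in the minority class.

```python
import math

eps = 1e-15                   # assumed clipping value for predicted probabilities
penalty = -math.log(eps)      # loss of one confidently wrong prediction
baseline_log_loss = 9.153     # mean CV score of the mode baseline from the log

# Correct predictions cost roughly 0, so the mean loss is penalty * error rate.
# The error rate of a mode baseline is the minority-class share.
minority_share = baseline_log_loss / penalty

print(round(penalty, 2), round(minority_share, 3))
```

If the assumption holds, roughly 26.5% of the training rows have `action == 1`.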


<h1> Model rankings and best pipeline </h1>

In [10]:
automl.rankings

| | id | pipeline_name | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters |
|---|---|---|---|---|---|---|---|---|
| 0 | 8 | CatBoost Classifier w/ Imputer | 0.536917 | 0.023235 | 0.555947 | 94.133785 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 1 | 16 | LightGBM Classifier w/ Imputer | 0.562128 | 0.003867 | 0.566204 | 93.858329 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 2 | 13 | XGBoost Classifier w/ Imputer | 0.564013 | 0.011957 | 0.577386 | 93.837743 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 23 | 0 | Mode Baseline Binary Classification Pipeline | 9.152696 | 0.055031 | 9.184469 | 0.0 | False | {'Baseline Classifier': {'strategy': 'mode'}} |
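
The `percent_better_than_baseline` column appears to be the relative improvement of `mean_cv_score` over the baseline score. The formula below is my own inference from the numbers in the table, not taken from the EvalML docs, but it reproduces the reported values to about three decimals.

```python
baseline = 9.152696  # mean_cv_score of the Mode Baseline pipeline

# mean_cv_score values copied from the rankings table
scores = {
    "CatBoost": 0.536917,
    "LightGBM": 0.562128,
    "XGBoost": 0.564013,
}

# Relative improvement over the baseline, in percent (lower log loss is better)
pct_better = {name: 100 * (baseline - s) / baseline for name, s in scores.items()}

for name, pct in pct_better.items():
    print(name, round(pct, 3))
```

Since the baseline is so poor on log loss, all three models land around 94%, so the gaps between them are easier to read from `mean_cv_score` directly.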


In [11]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])


**********************************
* CatBoost Classifier w/ Imputer *
**********************************

Problem Type: binary
Model Family: CatBoost

Pipeline Steps
1. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : median
	 * categorical_fill_value : None
	 * numeric_fill_value : None
2. CatBoost Classifier
	 * n_estimators : 74
	 * eta : 0.05671392060446587
	 * max_depth : 8
	 * bootstrap_type : None
	 * silent : True
	 * allow_writing_files : False

Training
Training for binary problems.
Total training time (including CV): 11.6 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary  Sensitivity at Low Alert Rates # Training # Validation
0                      0.556       0.156 0.654      0.529 0.205                     0.543            0.738                           0.059        533          267
1                      0.544       0.344 0.671      0.

<h1> Making predictions </h1>

In [12]:
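winner = automl.best_pipeline
# `y` holds the example test features; keep only the feature columns used for training
feature_cols = [c for c in X.columns if c in y.columns and c not in ('date', 'weight', 'ts_id')]
# .to_dataframe() converts the Woodwork table returned by this EvalML version (0.24.0)
df_submission = winner.predict_proba(y[feature_cols]).to_dataframe()
df_submission['ts_id'] = y['ts_id']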

In [13]:
df_submission.set_index('ts_id').to_csv('submission.csv')
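
The file written above contains class probabilities rather than a 0/1 decision. If a hard `action` is needed, one option is to threshold the probability of the positive class. This is a minimal sketch on made-up numbers: the column names `0` and `1` for the two classes and the 0.5 cutoff are assumptions, so check them against the actual `predict_proba` output.

```python
import pandas as pd

# Hypothetical stand-in for the predict_proba output: one column per class
proba = pd.DataFrame({0: [0.8, 0.3, 0.55], 1: [0.2, 0.7, 0.45]})
proba["ts_id"] = [0, 1, 2]

threshold = 0.5  # assumed cutoff; tune it for the problem at hand

submission = pd.DataFrame({
    "ts_id": proba["ts_id"],
    # 1 when the positive-class probability exceeds the cutoff, else 0
    "action": (proba[1] > threshold).astype(int),
})

print(submission)
```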