<a href="https://colab.research.google.com/github/CastHash532/flaml-automl/blob/main/Kaggle_starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Authenticating with Kaggle using kaggle.json

Navigate to https://www.kaggle.com. Then go to the [Account tab of your user profile](https://www.kaggle.com/me/account) and select Create API Token. This will trigger the download of kaggle.json, a file containing your API credentials.

Then run the cell below to upload kaggle.json to your Colab runtime.

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
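The shell command above can also be written in Python, which makes it possible to check that the uploaded file really looks like a Kaggle token before moving it. The following is a minimal sketch: the helper name `install_kaggle_token` is hypothetical, and it assumes the token is a JSON object with `username` and `key` fields.

```python
import json
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="kaggle.json", kaggle_dir=None):
    """Validate a Kaggle API token and copy it into the Kaggle config folder.

    Hypothetical helper for illustration; it is not part of the Kaggle API.
    """
    src = Path(src)
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"

    # Assumption: the token is a JSON object with "username" and "key" fields.
    with src.open() as f:
        token = json.load(f)
    if not isinstance(token, dict):
        raise ValueError(f"{src} does not contain a JSON object")
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"{src} is missing fields: {sorted(missing)}")

    kaggle_dir.mkdir(parents=True, exist_ok=True)
    dest = kaggle_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    # Owner read/write only, the equivalent of `chmod 600`.
    os.chmod(dest, 0o600)
    return dest
```

Calling `install_kaggle_token()` with no arguments would copy `kaggle.json` from the working directory into `~/.kaggle/`, mirroring the `mkdir`, `mv`, and `chmod` steps (except that the source file is copied rather than moved).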

## Load data and preprocess



In [None]:
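# Download the competition files; `titanic` matches the competition used in the submission step.
!kaggle competitions download -c titanic
# For a dataset rather than a competition, use the command below instead.
#!kaggle datasets download -d repo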

In [None]:
!ls

In [None]:
import pandas as pd
import numpy as np

ds_train = pd.read_csv('train.csv')
ds_test = pd.read_csv('test.csv')

In [None]:
X_train = ds_train.drop('Survived', axis=1)
y_train = ds_train['Survived']
X_test = ds_test

## Run FLAML
When configuring a FLAML AutoML run, you can specify the task type, time budget, error metric, learner list, whether to subsample, the resampling strategy, and so on. Every argument has a default value that is used when you do not provide one. For example, the default ML learners of FLAML are `['lgbm', 'xgboost', 'catboost', 'rf', 'extra_tree', 'lrl1']`.
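
As an illustration of overriding those defaults, the sketch below builds a settings dictionary that restricts the learner list and fixes the resampling strategy. The values are arbitrary examples, and the keys `estimator_list`, `eval_method`, and `n_splits` are assumed to match the keyword arguments of `AutoML.fit` in the installed FLAML version.

```python
# Settings shared by every run in this hypothetical experiment.
base_settings = {
    "task": "classification",
    "metric": "accuracy",
    "time_budget": 60,  # seconds
}

# Overrides for arguments that would otherwise take FLAML's defaults.
overrides = {
    "estimator_list": ["lgbm", "xgboost", "rf"],  # search only these learners
    "eval_method": "cv",  # use cross-validation rather than the automatic choice
    "n_splits": 5,        # number of cross-validation folds
}

# Later keys win, so overrides take precedence over the base settings.
custom_settings = {**base_settings, **overrides}
print(custom_settings)
```

A dictionary like this would be passed to FLAML as `automl.fit(X_train=X_train, y_train=y_train, **custom_settings)`.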

In [None]:
!pip install "flaml[notebook]"



In [None]:
# Import the AutoML class from the flaml package
from flaml import AutoML
automl = AutoML()

In [None]:
settings = {
    "time_budget": 120,  # total running time in seconds
    "metric": 'accuracy',  # can be: 'r2', 'rmse', 'mae', 'mse', 'accuracy', 'roc_auc', 'roc_auc_ovr',
                           # 'roc_auc_ovo', 'log_loss', 'mape', 'f1', 'ap', 'ndcg', 'micro_f1', 'macro_f1'
    "task": 'classification',  # task type    
    "log_file_name": 'titanic_experiment.log',  # flaml log file
    "seed": 7654321,    # random seed
}

In [None]:
# The main FLAML AutoML API
automl.fit(X_train=X_train, y_train=y_train, **settings)

[flaml.automl: 09-19 15:23:35] {1427} INFO - Evaluation method: cv
[flaml.automl: 09-19 15:23:35] {1473} INFO - Minimizing error metric: 1-accuracy
[flaml.automl: 09-19 15:23:35] {1505} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'lrl1']
[flaml.automl: 09-19 15:23:35] {1735} INFO - iteration 0, current learner lgbm
[flaml.automl: 09-19 15:23:35] {1920} INFO -  at 0.3s,	best lgbm's error=0.2132,	best lgbm's error=0.2132
[flaml.automl: 09-19 15:23:35] {1735} INFO - iteration 1, current learner lgbm
[flaml.automl: 09-19 15:23:35] {1920} INFO -  at 0.5s,	best lgbm's error=0.2132,	best lgbm's error=0.2132
[flaml.automl: 09-19 15:23:35] {1735} INFO - iteration 2, current learner lgbm
[flaml.automl: 09-19 15:23:35] {1920} INFO -  at 0.7s,	best lgbm's error=0.2065,	best lgbm's error=0.2065
[flaml.automl: 09-19 15:23:35] {1735} INFO - iteration 3, current learner xgboost
[flaml.automl: 09-19 15:23:36] {1920} INFO -  at 0.9s,	best xgboost's error

### Best model and metric

In [None]:
# Retrieve the best config and the best learner
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

Best ML learner: xgboost
Best hyperparameter config: {'n_estimators': 5, 'max_leaves': 8, 'min_child_weight': 1.4140048746882663, 'learning_rate': 0.5043918896401698, 'subsample': 0.9603062701962963, 'colsample_bylevel': 0.8820072449625802, 'colsample_bytree': 0.8520230481408825, 'reg_alpha': 0.0015989484628624363, 'reg_lambda': 0.05765164593991627}
Best accuracy on validation data: 0.835
Training duration of best run: 0.2084 s
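
The log reports `Evaluation method: cv` and `Minimizing error metric: 1-accuracy`, which is why the accuracy is recovered as `1 - automl.best_loss`. The sketch below illustrates that relationship with a hand-rolled k-fold loop and a trivial majority-class predictor. The function names and toy labels are hypothetical, and this is not FLAML's internal implementation.

```python
from collections import Counter


def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


def cv_error_majority(labels, n_splits=5):
    """Mean 1-accuracy over folds for a predictor that always outputs
    the most common label seen in the training part of the fold."""
    errors = []
    for train, valid in kfold_indices(len(labels), n_splits):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        wrong = sum(labels[i] != majority for i in valid)
        errors.append(wrong / len(valid))
    return sum(errors) / len(errors)


toy_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # made-up labels, not Titanic data
loss = cv_error_majority(toy_labels, n_splits=5)
print(f"cv error (1-accuracy): {loss:.3f}, accuracy: {1 - loss:.3f}")
```

The quantity being minimized is the averaged fold error, so a lower loss corresponds directly to a higher validation accuracy.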


In [None]:
automl.model.estimator

XGBClassifier(colsample_bylevel=0.8820072449625802,
              colsample_bytree=0.8520230481408825, grow_policy='lossguide',
              learning_rate=0.5043918896401698, max_depth=0, max_leaves=8,
              min_child_weight=1.4140048746882663, n_estimators=5, n_jobs=-1,
              reg_alpha=0.0015989484628624363, reg_lambda=0.05765164593991627,
              subsample=0.9603062701962963, tree_method='hist',
              use_label_encoder=False, verbosity=0)

In [None]:
# Compute predictions on the test dataset
y_pred = automl.predict(X_test)
print('Predicted labels', y_pred)
# Probability of the positive class (Survived = 1)
y_pred_proba = automl.predict_proba(X_test)[:,1]

Predicted labels [0 0 0 0 1 0 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1
 1 0 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 1
 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 1
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 0 0
 1 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0
 1 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 1 0 0
 0 1 1 1 1 1 0 1 0 0 0]


### Submit results to Kaggle

In [None]:
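# Build the submission with explicit columns instead of relying on the index
submission = pd.DataFrame({
    'PassengerId': X_test['PassengerId'],
    'Survived': y_pred,
})
# Write to the working directory so the submit command below finds the file
submission.to_csv('submission.csv', index=False)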
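
Before uploading, it can be worth checking that the file has the layout the competition expects. The sketch below uses only the standard library. `check_submission` is a hypothetical helper, and the expected header (`PassengerId`, `Survived`) and the 0/1 labels are assumptions based on the columns used in this notebook.

```python
import csv


def check_submission(path, expected_header=("PassengerId", "Survived")):
    """Return the number of data rows, raising ValueError on layout problems."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None or tuple(header) != tuple(expected_header):
            raise ValueError(f"unexpected header: {header}")
        n_rows = 0
        for row in reader:
            # Each row needs an id and a binary label written as "0" or "1".
            if len(row) != 2 or row[1] not in {"0", "1"}:
                raise ValueError(f"bad row {n_rows + 1}: {row}")
            n_rows += 1
    return n_rows
```

Calling `check_submission` on the CSV written above returns the number of prediction rows, which can be compared against `len(X_test)` before running the submit command.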

In [None]:
!kaggle competitions submit -c titanic -f submission.csv -m "Flaml AutoML"

100% 2.77k/2.77k [00:05<00:00, 554B/s]
Successfully submitted to Titanic - Machine Learning from Disaster