## Introduction 📝
🎯 Goal: Binary classification based on the provided features

📖 Data:

train.csv / test.csv - the training and test sets

Submissions are evaluated on the area under the ROC curve (AUC) between the predicted probability and the observed target. <br>
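
To make the metric concrete, here is a minimal sketch of the area under the ROC curve computed as the fraction of positive/negative pairs that the scores rank correctly, with ties counting as half. The function name `roc_auc` and the toy labels and scores are hypothetical and only for illustration; the pairwise loop is meant for small inputs.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

On real data, `sklearn.metrics.roc_auc_score(y_true, y_score)` computes the same quantity far more efficiently.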
______________________________________________________________________________________________________________________

### What's in the Notebook?
#### We are going to use cuDF + TPOT AutoML

##### TPOT

TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

<p style="text-align:center;">
<kbd><img src="https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-logo.jpg" width="200"
    align="center"></kbd></p>

You can read about it here: http://epistasislab.github.io/tpot/

##### cuDF
Built on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.

cuDF provides a pandas-like API that will be familiar to data engineers and data scientists, so they can use it to accelerate their workflows without going into the details of CUDA programming.

You can read about it here: https://github.com/rapidsai/cudf
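
As a rough illustration of the pandas-like API, the sketch below runs the same filtering and aggregation code against cuDF when it can be used and against pandas otherwise. The alias `xdf` and the toy data are assumptions made for this example, not part of the competition data.

```python
# Use cuDF when it is importable and a GPU is usable; otherwise fall back to pandas.
try:
    import cudf as xdf
    xdf.DataFrame({"probe": [0]})  # small allocation to check that the GPU works
except Exception:
    import pandas as xdf

# Hypothetical toy data, for illustration only.
df = xdf.DataFrame({"group": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

filtered = df[df["value"] > 1.5]                   # boolean filtering
totals = filtered.groupby("group")["value"].sum()  # aggregation per group

n_rows = len(filtered)
grand_total = float(filtered["value"].sum())
print(totals)
```

Only the import differs between the two libraries here; the DataFrame calls themselves are unchanged.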

______________________________________________________________________________________________________________

In [None]:
import gc
import os
import re
import shutil
import time
import warnings
from pathlib import Path

import datatable as dt
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import seaborn as sns
import spacy

# GPU libraries (RAPIDS)
import cupy as cp
import cudf
import dask_cudf


In [None]:
# Read the training set directly into GPU memory as a cuDF DataFrame.
train = cudf.read_csv('../input/tabular-playground-series-oct-2021/train.csv')

In [None]:
train.head(3)

In [None]:
train.drop('id', axis=1, inplace=True)

In [None]:
num_cols=train.select_dtypes(include=np.number).columns.tolist()

We can cast the numeric columns to float32 to reduce memory usage; for columns read as 64-bit types this halves their footprint.

In [None]:
for col in num_cols:
    train[col]=train[col].astype('float32')
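
To put a number on the saving, the following sketch compares a float64 column with its float32 copy. It uses NumPy on the CPU as a stand-in for the GPU columns, and the random column is hypothetical data for illustration only.

```python
import numpy as np

# Hypothetical column of one million values, for illustration only.
col64 = np.random.default_rng(0).random(1_000_000)  # float64 by default
col32 = col64.astype("float32")

bytes64 = col64.nbytes
bytes32 = col32.nbytes
# Largest rounding error introduced by the narrower type.
max_abs_error = float(np.abs(col64 - col32).max())

print(bytes64, bytes32, max_abs_error)
```

The float32 copy takes half the bytes, at the cost of a small rounding error per value, which is usually acceptable for tabular model inputs.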

In [None]:

gc.collect()

In [None]:
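# Feature columns are named f0, f1, ...; match on the prefix so other columns are never picked up.
features = [f for f in train.columns.tolist() if f.startswith('f')]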
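
Name-based column selection depends on how the match is written. The sketch below, using hypothetical column names, compares a substring test, a prefix test, and a stricter pattern test that only accepts `f` followed by digits.

```python
import re

# Hypothetical column names, for illustration only.
columns = ["f0", "f1", "f2", "fold", "diff_flag", "target"]

by_substring = [c for c in columns if "f" in c]                 # any name containing "f"
by_prefix = [c for c in columns if c.startswith("f")]           # names beginning with "f"
by_pattern = [c for c in columns if re.fullmatch(r"f\d+", c)]   # "f" followed by digits only

print(by_substring)
print(by_prefix)
print(by_pattern)
```

All three give the same result when the only columns are `f0`, `f1`, ... and `target`, but they diverge as soon as another column name happens to contain an `f`.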

In [None]:
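# Move the feature matrix and the target onto the GPU as CuPy arrays.
# Note: as_gpu_matrix() is deprecated in newer cuDF releases, where to_cupy() replaces it.
X = cp.array(train[features].as_gpu_matrix())
Y = cp.array(train['target'])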

In [None]:
del train
gc.collect()

In [None]:
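from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

# Hold out 25% of the rows to score the final pipeline.
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(
    generations=3,
    population_size=2,
    config_dict="TPOT cuML",  # restrict the search to GPU-accelerated cuML/XGBoost estimators
    memory='auto',            # cache fitted pipeline steps in a temporary directory
    scoring='roc_auc',        # same metric as the competition
    max_time_mins=40,         # stop after 40 minutes even if generations remain
    cv=2,
    verbosity=2)
# Pass host (NumPy) arrays to TPOT by copying the CuPy arrays back from the GPU.
tpot.fit(cp.asnumpy(X_train), cp.asnumpy(y_train))
# score() reports the 'roc_auc' metric on the held-out split.
print(tpot.score(cp.asnumpy(X_test), cp.asnumpy(y_test)))
# Export the best pipeline as a standalone Python script.
tpot.export('tpot_digits_pipeline.py')
gc.collect()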
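
The search settings above imply a small budget. According to the TPOT documentation, a run evaluates `population_size + generations * offspring_size` pipelines in total, with `offspring_size` defaulting to `population_size`. The helper below is a hypothetical back-of-the-envelope calculation, not part of TPOT, and it ignores the `max_time_mins` limit, which can stop the search earlier.

```python
def tpot_search_budget(generations, population_size, offspring_size=None, cv=5):
    """Upper bound on pipelines evaluated and cross-validation fits, ignoring time limits."""
    if offspring_size is None:
        offspring_size = population_size  # TPOT's default when offspring_size is not set
    pipelines = population_size + generations * offspring_size
    # Each evaluated pipeline is fitted once per cross-validation fold.
    return pipelines, pipelines * cv


pipelines, fits = tpot_search_budget(generations=3, population_size=2, cv=2)
print(pipelines, fits)
```

With so few candidates, the result is closer to a quick sample of the configuration space than to a full evolutionary search; larger `generations` and `population_size` values trade run time for a wider search.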


#### Create Predictions for Test Data 

In [None]:
test_data = cudf.read_csv('../input/tabular-playground-series-oct-2021/test.csv')

In [None]:
test=cp.array(test_data[features].as_gpu_matrix())

In [None]:
results = tpot.predict_proba(cp.asnumpy(test))

In [None]:
# Column 1 of predict_proba holds the probability of the positive class (target == 1).
test_data['target'] = results[:, 1]
submission = test_data[['id', 'target']]
submission.head()
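
Taking column 1 of the `predict_proba` output relies on the positive class being the second class. In scikit-learn estimators the columns follow the order of the fitted `classes_` attribute, so a more defensive version looks the column up. The arrays below are hypothetical stand-ins for a fitted model's output, using NumPy only.

```python
import numpy as np

# Hypothetical predict_proba-style output: one row per sample, one column per class.
classes = np.array([0, 1])
proba = np.array([[0.9, 0.1],
                  [0.3, 0.7],
                  [0.5, 0.5]])

# Look up the positive class's column rather than assuming it is index 1.
positive_col = int(np.where(classes == 1)[0][0])
positive_proba = proba[:, positive_col]

print(positive_proba)
```

For a binary 0/1 target the lookup returns index 1, which matches the indexing used for the submission.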

In [None]:
submission.to_csv('submission.csv', index=False)