                             

![](https://drive.google.com/uc?id=13IkWjxQcDEX8UMYP4U8YN6O6zc7TVRQD)



For the Tabular Playground Series - October 2021, we have a synthetic dataset generated using CTGAN. The original dataset deals with predicting the biological response of molecules given various chemical properties. Although the features are anonymized, they have properties relating to real-world features.

# **<span style="color:#e76f51;">Goal</span>**
 
The goal is to predict a binary target based on a number of feature columns given in the data. The columns are a mix of scaled continuous features and binary features. The data is synthetically generated by a GAN that was trained on real-world molecular response data.

# **<span style="color:#e76f51;">Data</span>**

**Files**

> - ```train.csv``` -  the training data with the target column
> - ```test.csv``` - the test set; you will be predicting the target for each row in this file (the probability of the binary target)
> - ```sample_submission.csv``` - a sample submission file in the correct format
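
The submission format described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the `id` and `target` column names are assumed to match `sample_submission.csv`, and the ids and probabilities are hypothetical values.

```python
import csv
import io


def write_submission(ids, probabilities, handle):
    """Write one row per test id with its predicted probability of the positive class."""
    writer = csv.writer(handle, lineterminator="\n")
    # Assumed header, taken to match sample_submission.csv.
    writer.writerow(["id", "target"])
    for row_id, probability in zip(ids, probabilities):
        writer.writerow([row_id, f"{probability:.6f}"])


# Hypothetical ids and probabilities, written to an in-memory buffer for inspection.
buffer = io.StringIO()
write_submission([1000000, 1000001], [0.25, 0.731], buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the buffer would write the same rows to disk.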

# **<span style="color:#e76f51;">Metric</span>**

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
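
To make the metric concrete, here is a small sketch that computes the area under the ROC curve as the share of (positive, negative) pairs in which the positive example gets the higher score. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code.

```python
from typing import Sequence


def pairwise_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy labels and scores (hypothetical values).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

The pairwise loop is quadratic, so it is only meant for intuition. On real data, `roc_auc_score` from scikit-learn computes the same quantity far more efficiently.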

<img src="https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

> I will be integrating W&B for visualizations and logging artifacts!
> 
> [TPS October Project on W&B Dashboard](https://wandb.ai/usharengaraju/TPSOctober)
> 
> - To get the API key, create an account on the [website](https://wandb.ai/site).
> - Use secrets to store API keys more securely.

In [None]:
!pip3 install tensorflow_decision_forests --upgrade

import os
import wandb
import logging
import datetime
import warnings
import gc

import numpy as np
import pandas as pd


from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
warnings.filterwarnings('ignore')

import tensorflow_decision_forests as tfdf

sns.set_style('whitegrid')
sns_params = {"palette": sns.color_palette(["#2a9d8f", "#e9c46a"])}

In [None]:
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    secret_value_0 = user_secrets.get_secret("api_key")
    wandb.login(key=secret_value_0)
    anony=None
except Exception:
    # Fall back to anonymous logging when no secret is configured.
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as api_key. \nGet your W&B access token from here: https://wandb.ai/authorize')
    
CONFIG = dict(competition = 'TPSOctober',_wandb_kernel = 'tensorgirl')

In [None]:
train = pd.read_csv("../input/tabular-playground-series-oct-2021/train.csv",nrows = 1000)
test = pd.read_csv("../input/tabular-playground-series-oct-2021/test.csv",nrows = 1000)


# **<span style="color:#e76f51;">Observations</span>**

- There are no missing values in either the train or the test dataset.
- The full train set has 1,000,000 rows and the full test set has 500,000 rows (the cell above loads only the first 1,000 rows of each).
- The binary features are f22, f43, and f242 through f284; the rest of the features are continuous.


Source : https://www.kaggle.com/subinium/tps-oct-simple-eda
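
The feature split noted in the observations can be written down as two lists of column names. This is a small sketch that assumes the columns are named `f0` through `f284`, as the rest of the notebook does; the variable names are illustrative.

```python
# Binary columns named in the observations: f22, f43 and f242 through f284.
binary_features = ["f22", "f43"] + [f"f{i}" for i in range(242, 285)]
binary_set = set(binary_features)

# Every other column from f0 to f284 is treated as continuous.
all_features = [f"f{i}" for i in range(285)]
continuous_features = [name for name in all_features if name not in binary_set]

print(len(binary_features), len(continuous_features))  # 45 240
```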

# **<span style="color:#e76f51;">Basic Statistics of Features</span>**

In [None]:
train.loc[:, 'f0':'f284'].describe().style.background_gradient(cmap='Pastel1')

# **<span style="color:#e76f51;">Target Variable Distribution</span>**

In [None]:
plt.figure(figsize=(15, 7))
sns.kdeplot(train["target"] ,fill=True, color = "#2a9d8f")


# **<span style="color:#e76f51;">Target Class Balance</span>**

In [None]:
plt.figure(figsize=(15, 7))
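# Count the classes from the loaded data instead of hard-coding them.
class_counts = train["target"].value_counts().sort_index()
plt.pie(class_counts, labels=class_counts.index.astype(str), autopct='%1.1f%%', colors=["#2a9d8f", "#e9c46a"])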

# **<span style="color:#e76f51;">Distribution of features Vs Target</span>**

In [None]:
#code copied from https://www.kaggle.com/subinium/tps-oct-simple-eda

fig, axes = plt.subplots(11,11,figsize=(12, 12))
axes = axes.flatten()
sns.set_palette(sns.color_palette(["#2a9d8f", "#e9c46a"]))

for idx, ax in enumerate(axes):
    sns.kdeplot(data=train, x=f'f{idx}', ax=ax, color="#2a9d8f")  # single curve, so set color rather than palette
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.spines['left'].set_visible(False)
    ax.set_title(f'f{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Distribution by feature (f0-f120)', ha='center', fontweight='bold')

fig.tight_layout()
plt.show()

In [None]:
#code copied from https://www.kaggle.com/craigmthomas/tps-oct-2021-eda

cat_features = ["f22", "f43"]
cat_features.extend(["f{}".format(x) for x in range(242, 285)])

fig, axs = plt.subplots(11, 4, figsize=(4*4, 11*3), squeeze=False, sharey=True)

ptr = 0
for row in range(11):
    for col in range(4):  
        # Name the counts explicitly so the column label does not depend on the pandas version.
        x = train[[cat_features[ptr], "target"]].value_counts().sort_index().rename("# of Samples").reset_index()
        sns.barplot(x=cat_features[ptr], y="# of Samples", hue="target", data=x, ax=axs[row][col], **sns_params)
        axs[row][col].set_xlabel(cat_features[ptr])  # label this subplot, not the current global axes
        ptr += 1
        del(x)
plt.tight_layout()    
plt.show()

_ = gc.collect()

# **<span style="color:#e76f51;">W & B Artifacts</span>**

An artifact is a versioned folder of data. Entire datasets can be directly stored as artifacts.

W&B Artifacts are used for dataset versioning and model versioning. They are also used for tracking dependencies and results across machine learning pipelines. Artifact references can be used to point to data in other systems like S3, GCP, or your own system.

You can learn more about W&B artifacts [here](https://docs.wandb.ai/guides/artifacts)
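
As a rough sketch of the versioning workflow, a later run could fetch a logged dataset artifact by name. The function name, the `download_data` job type, and the default arguments below are assumptions for illustration; `latest` is the alias W&B assigns to the newest version of an artifact. The function is only defined here, not called, because calling it needs a W&B login and network access.

```python
def download_training_artifact(project: str = "TPSOctober",
                               artifact_name: str = "training_data:latest") -> str:
    """Fetch a logged dataset artifact and return the local directory it was downloaded to."""
    import wandb  # imported lazily so the function can be defined without W&B installed

    # Start a run so the download is recorded as a dependency of this run.
    run = wandb.init(project=project, job_type="download_data")
    artifact = run.use_artifact(artifact_name, type="dataset")
    local_dir = artifact.download()
    run.finish()
    return local_dir
```

Declaring the artifact with `use_artifact` is what lets W&B track which runs consumed which version of the data.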

![](https://drive.google.com/uc?id=1JYSaIMXuEVBheP15xxuaex-32yzxgglV)

In [None]:
# Save train data to W&B Artifacts
run = wandb.init(project='TPSOctober', name='training_data', anonymous=anony,config=CONFIG) 
artifact = wandb.Artifact(name='training_data',type='dataset')
artifact.add_file("../input/tabular-playground-series-oct-2021/train.csv")

wandb.log_artifact(artifact)
wandb.finish()

Snapshot of the artifacts created  

![](https://drive.google.com/uc?id=1w8g5VUO34Wy6Mi3y2M6-Yu6yOdz7qXqA)


# **<span style="color:#e76f51;">Logging to W & B environment</span>**

In [None]:
# Log Plots to W&B environment
title = "Distribution of Target Feature"
run = wandb.init(project='TPSOctober', name=title,anonymous=anony,config=CONFIG)
fig = sns.kdeplot(train["target"] , color = "#E4916C")
wandb.log({"Distribution of Target Feature": fig})
wandb.finish()

# **<span style="color:#e76f51;">TensorFlow Decision Forests</span>**

Source : https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html

![](https://drive.google.com/uc?id=1u8C0iutX50ajnYPdvnoTyCrtNNECvI_R)

Decision forests are a family of machine learning algorithms with quality and speed competitive with (and often favorable to) neural networks, especially when you’re working with tabular data. They’re built from many decision trees, which makes them easy to use and understand - and you can take advantage of a plethora of interpretability tools and techniques that already exist today.

TF-DF brings this class of models along with a suite of tailored tools to TensorFlow users:

- Beginners will find it easier to develop and explain decision forest models. There is no need to explicitly list or pre-process input features (as decision forests can naturally handle numeric and categorical attributes), specify an architecture (for example, by trying different combinations of layers like you would in a neural network), or worry about models diverging. Once your model is trained, you can plot it directly or analyse it with easy to interpret statistics.
- Advanced users will benefit from models with very fast inference time (sub-microseconds per example in many cases). And, this library offers a great deal of composability for model experimentation and research. In particular, it is easy to combine neural networks and decision forests.

If you’re already using decision forests outside of TensorFlow, here’s a little of what TF-DF offers:

- It provides a slew of state-of-the-art Decision Forest training and serving algorithms such as random forests, gradient-boosted trees, CART, (Lambda)MART, DART, Extra Trees, greedy global growth, oblique trees, one-side-sampling, categorical-set learning, random categorical learning, out-of-bag evaluation and feature importance, and structural feature importance.
- This library can serve as a bridge to the rich TensorFlow ecosystem by making it easier for you to integrate tree-based models with various TensorFlow tools, libraries, and platforms such as TFX.

And for users new to neural networks, you can use decision forests as an easy way to get started with TensorFlow, and continue to explore neural networks from there.
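
To illustrate the idea that a forest is built from many decision trees, here is a toy sketch in which each "tree" is a hand-written one-split stump and the forest's score is the share of trees voting for class 1. It is not how TF-DF trains or stores its trees; the thresholds, feature indices, and function names are all hypothetical.

```python
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], int]


def make_stump(feature_index: int, threshold: float) -> Tree:
    """Return a one-split 'tree' that votes 1 when the feature exceeds the threshold."""
    def stump(row: Sequence[float]) -> int:
        return 1 if row[feature_index] > threshold else 0
    return stump


def forest_predict_proba(trees: List[Tree], row: Sequence[float]) -> float:
    """Average the individual tree votes into a score between 0 and 1."""
    votes = [tree(row) for tree in trees]
    return sum(votes) / len(votes)


# Three hand-picked stumps (hypothetical splits, not learned from data).
trees = [make_stump(0, 0.5), make_stump(1, 0.2), make_stump(0, 0.8)]
row = [0.6, 0.1]
print(forest_predict_proba(trees, row))  # one of three stumps votes 1
```

In a real random forest each tree is learned from a bootstrap sample of the data, which is what makes the averaged vote more stable than any single tree.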


In [None]:


# Convert the dataset into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train, label="target")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test)

# Train the model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# Look at the model.
#model.summary()

# Predict on the test set.
output = model.predict(test_ds)

# Export to a TensorFlow SavedModel.
# Note: the model is compatible with Yggdrasil Decision Forests.
model.save("project/model")

# **<span style="color:#e76f51;">Visualize the Output</span>**

In [None]:
sns.histplot(pd.DataFrame(output),legend = False)

# **<span style="color:#e76f51;">References</span>**

https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html

https://www.kaggle.com/subinium/tps-oct-simple-eda

https://www.kaggle.com/craigmthomas/tps-oct-2021-eda



# Work in progress 🚧