![](https://drive.google.com/uc?id=1fI0ySFEBL9eUfAJjsta27PEF1neZ6DqZ)


Fluctuations are common in financial markets irrespective of the investment strategy adopted. Risks and returns differ by investment type, and other factors affect the stability and volatility of the market. Investment professionals estimate overall returns by taking market fluctuations into account. AI-based algorithms are widely used in financial market trading, and data science has huge potential to improve quantitative researchers' ability to forecast an investment's return.


# **<span style="color:#F7B2B0;">Goal</span>**
 
The goal of this competition is to build a model that forecasts an investment's return rate.

# **<span style="color:#F7B2B0;">Data</span>**

The dataset contains anonymized features generated from market data for a set of investments, recorded across ordered time IDs. Each row in the data corresponds to a single investment observed at a given `time_id`. The task is to predict the `target` for each row, given the anonymized features `f_0` to `f_299`.


**Files**

`train.csv`

`row_id` - A unique identifier for the row.

`time_id` - The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.

`investment_id` - The ID code for an investment. Not all investments have data in all time IDs.

`target` - The target.

`[f_0:f_299]` - Anonymized features generated from market data.

`example_test.csv` - Random data provided to demonstrate what shape and format of data the API will deliver to your notebook when you submit.

`example_sample_submission.csv` - An example submission file provided so the publicly accessible copy of the API provides the correct data shape and format.

`ubiquant/` - The time-series API that will serve the test set. You may need Python 3.7 and a Linux environment to run the example test set through the API offline without errors.

`Time-series API Details` - The API serves the data in batches, with all of the rows for a single `time_id` per batch.
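The per-`time_id` batching can be imitated offline with pandas. The sketch below is only a stand-in for the competition API, not its actual interface: the toy frame, its values, and the helper `iter_time_batches` are hypothetical and exist purely for illustration.

```python
import pandas as pd

# Hypothetical toy data shaped like the competition columns (values are made up).
toy = pd.DataFrame(
    {
        "row_id": ["0_1", "0_2", "1_1", "1_2", "1_3"],
        "time_id": [0, 0, 1, 1, 1],
        "investment_id": [1, 2, 1, 2, 3],
        "f_0": [0.5, -0.2, 0.1, 0.9, -1.3],
    }
)


def iter_time_batches(df):
    """Yield (time_id, batch) pairs, one batch per time_id, in time order."""
    for time_id, batch in df.groupby("time_id", sort=True):
        yield int(time_id), batch.reset_index(drop=True)


# Each batch holds every row that shares a single time_id.
batch_sizes = {time_id: len(batch) for time_id, batch in iter_time_batches(toy)}
print(batch_sizes)
```

Iterating this way during local validation keeps the prediction code close to how it will receive data at submission time, one `time_id` at a time.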


# **<span style="color:#F7B2B0;">Evaluation Metric</span>**

The evaluation metric for this competition is the `Pearson correlation coefficient`.
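As a rough illustration of the metric, here is a minimal NumPy sketch of the Pearson correlation coefficient between targets and predictions. The function name `pearson_corr` and the toy values are assumptions made for this example, and how the competition aggregates the score (for example across time IDs) is not covered here.

```python
import numpy as np


def pearson_corr(y_true, y_pred):
    """Pearson correlation between two 1-D sequences (NaN if either is constant)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Center both series, then divide the covariance term by the product of norms.
    t = y_true - y_true.mean()
    p = y_pred - y_pred.mean()
    denom = np.sqrt((t ** 2).sum() * (p ** 2).sum())
    if denom == 0.0:
        return float("nan")
    return float((t * p).sum() / denom)


# Made-up example: predictions are exactly twice the targets.
targets = [0.1, -0.2, 0.3, 0.0]
predictions = [0.2, -0.4, 0.6, 0.0]
score = pearson_corr(targets, predictions)
print(score)
```

Because the coefficient is invariant to the scale and offset of the predictions, a model is rewarded for ranking and co-movement with the target rather than for matching its magnitude.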



<img src="https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

> I will be integrating W&B for visualizations and logging artifacts!
> 
> [ProbabilisticBNN](https://wandb.ai/usharengaraju/ProbabilisticBNN)
> 
> - To get the API key, create an account on the [website](https://wandb.ai/site).
> - Store the API key in Kaggle Secrets so it is not exposed in the notebook.

In [None]:
import math
import random
import warnings

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import StringLookup
import tensorflow_probability as tfp

import wandb
from wandb.keras import WandbCallback

# Ignore warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    # The secret must be stored under the label "api_key" in Add-ons -> Secrets
    secret_value_0 = user_secrets.get_secret("api_key")
    wandb.login(key=secret_value_0)
    anony = None
except Exception:
    # Fall back to an anonymous W&B run when no secret is available
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as api_key. \nGet your W&B access token from here: https://wandb.ai/authorize')
    
CONFIG = dict(competition = 'ProbabilisticBNN',_wandb_kernel = 'tensorgirl')

In [None]:
train = pd.read_pickle('../input/ump-train-picklefile/train.pkl')
#train = train.sample(10000)

# **<span style="color:#F7B2B0;">Exploratory Data Analysis</span>**

In [None]:
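# Number of time_ids observed for each investment
time_id_count = train.groupby("investment_id")["time_id"].count()
fig, ax = plt.subplots(figsize=(25, 9))
# histplot replaces the deprecated distplot; stat="density" keeps the density scale
sns.histplot(time_id_count, kde=True, stat="density", color="#2a9d8f", ax=ax)
ax.set_title("Count Distribution of time_ids per Investment")
plt.show()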

# **<span style="color:#F7B2B0;">Target Distribution</span>**


In [None]:
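plt.figure(figsize=(25, 9))
# Left: histogram with KDE of the target (histplot replaces the deprecated distplot)
plt.subplot(121)
sns.histplot(train["target"], kde=True, stat="density", color="#2a9d8f")
# Right: box plot of the target to highlight outliers
plt.subplot(122)
sns.boxplot(y="target", data=train, color="#ff355d")
plt.show()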

# **<span style="color:#F7B2B0;">Feature Distribution</span>**

Let's analyze the distribution of the first nine features, `f_0` to `f_8`.

In [None]:
features_plt = ['f_0', 'f_1', 'f_2', 'f_3', 'f_4', 'f_5', 'f_6', 'f_7', 'f_8']
fig, ax = plt.subplots(3, 3, figsize=(18, 18))
for i, feature in enumerate(features_plt):
    # Place each feature on a 3x3 grid (row = i // 3, column = i % 3)
    sns.histplot(
        train[feature], kde=True, stat="density", color="#2a9d8f", ax=ax[i // 3, i % 3]
    ).set_title(f'{feature} Distribution')
plt.show()

In [None]:
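plt.figure(figsize=(25, 9))
# Distribution of rows across investment IDs (histplot replaces the deprecated distplot)
sns.histplot(train['investment_id'], bins=100, kde=True, stat="density", color="#2a9d8f", label='investment_id')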


In [None]:
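plt.figure(figsize=(25, 9))
# Distribution of rows across time IDs (histplot replaces the deprecated distplot)
sns.histplot(train['time_id'], bins=100, kde=True, stat="density", color="#2a9d8f", label='time_id')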


# **<span style="color:#F7B2B0;">Correlation Heatmap</span>**

Let's analyze the correlation between the first nine features, `f_0` to `f_8`.

In [None]:
plt.figure(figsize=(25, 9))
sns.heatmap(train[features_plt].corr(), annot=True, cmap=sns.color_palette("BrBG", 2));

# **<span style="color:#F7B2B0;">Correlation of Target and Features</span>**

Let's analyze the correlation of the first nine features, `f_0` to `f_8`, with the target.

In [None]:
# Pearson correlation of each selected feature with the target
corr = [train['target'].corr(train[feature]) for feature in features_plt]

plt.figure(figsize=(25, 9))
plt.plot(features_plt, corr, 'k')
plt.xlabel('Features')
plt.ylabel('Correlation with target')
plt.title('Correlation between target and features')
plt.show()

# **<span style="color:#F7B2B0;">Preprocessing</span>**

In [None]:
all_features = ['investment_id','f_0','f_1','f_2','target']
TARGET_FEATURE_NAME ="target"
features =  ['investment_id','f_0','f_1','f_2']
train = train[all_features]

In [None]:
# random sampling to create train and validation data
random_selection = np.random.rand(len(train.index)) <= 0.85
train_data = train[random_selection]
valid_data = train[~random_selection]

#converting training and validation data to csv file
train_data_file = "train_data.csv"
valid_data_file = "valid_data.csv"
train_data.to_csv(train_data_file, index=False, header=False)
valid_data.to_csv(valid_data_file, index=False, header=False)

# **<span style="color:#e76f51;">W & B Artifacts</span>**

An artifact is a versioned folder of data. Entire datasets can be stored directly as artifacts.

W&B Artifacts are used for dataset versioning and model versioning. They are also used for tracking dependencies and results across machine learning pipelines. Artifact references can be used to point to data in other systems such as S3, GCP, or your own system.

You can learn more about W&B artifacts [here](https://docs.wandb.ai/guides/artifacts)

![](https://drive.google.com/uc?id=1JYSaIMXuEVBheP15xxuaex-32yzxgglV)

In [None]:
# Save train data to W&B Artifacts
train.to_csv("train_wandb.csv", index = False)
run = wandb.init(project='ProbabilisticBNN', name='training_data', anonymous=anony,config=CONFIG) 
artifact = wandb.Artifact(name='training_data',type='dataset')
artifact.add_file("./train_wandb.csv")
wandb.log_artifact(artifact)
wandb.finish()

# **<span style="color:#e76f51;">🎯tf.data</span>**

[Source](https://www.tensorflow.org/guide/data)

The `tf.data` API is used to build efficient input pipelines that can handle large amounts of data and perform complex data transformations. It also supports a variety of data formats.

<img src="https://storage.googleapis.com/jalammar-ml/tf.data/images/tf.data.png" />

[Image Source](https://www.kaggle.com/jalammar/intro-to-data-input-pipelines-with-tf-data)

A data source is the starting point of any input pipeline. `tf.data.Dataset.from_tensors()` or `tf.data.Dataset.from_tensor_slices()` can be used to construct a dataset from data in memory. The recommended format for input data stored in files is TFRecord, which can be read using `tf.data.TFRecordDataset()`. Supported data sources include NumPy arrays, Python generators, images, TFRecords, CSV files and text files.

<img src="https://storage.googleapis.com/jalammar-ml/tf.data/images/tf.data-read-data.png" />

[Image Source](https://www.kaggle.com/jalammar/intro-to-data-input-pipelines-with-tf-data)

Constructing a `tf.data` input pipeline consists of three phases: Extract, Transform and Load. Extraction involves loading data from different file formats and converting it into a `tf.data.Dataset` object.

## **<span style="color:#e76f51;">🎯tf.data.Dataset</span>**

`tf.data.Dataset` is an abstraction introduced by the `tf.data` API. It represents a sequence of elements, where each element has one or more components. For example, in a tabular data pipeline, an element might be a single training example with a pair of tensor components representing the input features and the label.

A `tf.data.Dataset` can be created in two distinct ways:

- Constructing a dataset from data stored in memory or in files, using a data source
- Constructing a dataset from one or more `tf.data.Dataset` objects, using a data transformation
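
Both ways can be seen in a few lines. The sketch below uses made-up toy arrays (the values and the doubling transformation are assumptions for illustration only): `from_tensor_slices` acts as the data source, while `map` and `batch` are data transformations that each return a new `tf.data.Dataset`.

```python
import numpy as np
import tensorflow as tf

# Toy in-memory data: two feature columns and one label per example (values are made up).
features = {
    "f_0": np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32),
    "f_1": np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32),
}
targets = np.array([0.0, 1.0, 0.0, 1.0], dtype=np.float32)

# Data source: each element is a (features dict, label) pair.
dataset = tf.data.Dataset.from_tensor_slices((features, targets))

# Data transformations: rescale the label, then group elements into batches of 2.
dataset = dataset.map(lambda x, y: (x, y * 2.0)).batch(2)

batches = list(dataset.as_numpy_iterator())
for batch_features, batch_targets in batches:
    print(batch_features["f_0"], batch_targets)
```

Each element keeps the (features, label) structure through every transformation, which is the same structure Keras expects when a dataset is passed to `model.fit`.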

<img src="https://storage.googleapis.com/jalammar-ml/tf.data/images/tf.data-simple-pipeline.png" />

[Image Source](https://www.kaggle.com/jalammar/intro-to-data-input-pipelines-with-tf-data)



In [None]:
def get_dataset_from_csv(csv_file_path, shuffle=False, batch_size=128):

    dataset = tf.data.experimental.make_csv_dataset(
        csv_file_path,
        batch_size=batch_size,
        column_names=all_features,
        label_name=TARGET_FEATURE_NAME,
        num_epochs=1,
        header=False,
        shuffle=shuffle,
    )

    return dataset

In [None]:
def create_model_inputs():
    inputs = {}
    for feature_name in features:
        inputs[feature_name] = layers.Input(
            name=feature_name, shape=(1,), dtype=tf.float32
        )
    return inputs

# **<span style="color:#e76f51;">TensorFlow Probability</span>**

[Source](https://www.tensorflow.org/probability/examples/A_Tour_of_TensorFlow_Probability)

`TensorFlow Probability` is a library for probabilistic reasoning and statistical analysis in TensorFlow and it supports modeling, inference, and criticism through composition of low-level modular components.

`Low-level building blocks`

- Distributions
- Bijectors

`High(er)-level constructs`

- Markov chain Monte Carlo
- Probabilistic Layers
- Structural Time Series
- Generalized Linear Models
- Optimizers

In this tutorial, we will be using distributions and probabilistic layers.

# **<span style="color:#e76f51;">tfp.distributions.Distribution</span>**

A `tfp.distributions.Distribution` is a class with two core methods: `sample` and `log_prob`. The distribution used directly in this tutorial is `tfp.distributions.MultivariateNormalDiag`. The event shape and the batch shape are properties of a Distribution object.

`Event shape` describes the shape of a single draw from the distribution; it may be dependent across dimensions. 

`Batch shape` describes independent, not identically distributed draws, aka a "batch" of distributions.

`tfp.distributions.MultivariateNormalDiag` is used to create a multivariate normal with a diagonal covariance. A multivariate distribution has a non-scalar event shape; for example, a two-dimensional `MultivariateNormalDiag` has an event shape of `[2]`.
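
To make `sample` and `log_prob` concrete without relying on TFP itself, here is a small NumPy sketch of a diagonal-covariance multivariate normal. It is an illustrative re-derivation, not the TFP implementation; the helper name `mvn_diag_log_prob` and the toy parameters are assumptions.

```python
import numpy as np


def mvn_diag_log_prob(x, loc, scale_diag):
    """Log-density of a multivariate normal with diagonal covariance.

    With a diagonal covariance the dimensions are independent, so the joint
    log-density is the sum of the one-dimensional normal log-densities.
    """
    x, loc, scale_diag = (np.asarray(a, dtype=float) for a in (x, loc, scale_diag))
    z = (x - loc) / scale_diag
    per_dim = -0.5 * z ** 2 - np.log(scale_diag) - 0.5 * np.log(2.0 * np.pi)
    # Summing over the last axis collapses the event dimension.
    return per_dim.sum(axis=-1)


rng = np.random.default_rng(0)
loc = np.zeros(2)         # event shape [2]
scale_diag = np.ones(2)   # standard deviations on the diagonal

# Five draws, each a vector of length 2: the array has shape (5, 2).
samples = loc + scale_diag * rng.standard_normal(size=(5, 2))
# One log-probability per draw: shape (5,), because the event axis is summed out.
log_probs = mvn_diag_log_prob(samples, loc, scale_diag)
print(samples.shape, log_probs.shape)
```

The negative of such a log-probability, evaluated at the observed targets, is the loss used to train the model later in this notebook.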

# **<span style="color:#e76f51;">tfp.layers</span>**

The layers used in this tutorial are `tfp.layers.VariableLayer`, `tfp.layers.DenseVariational`, `tfp.layers.IndependentNormal`, `tfp.layers.DistributionLambda` and `tfp.layers.MultivariateNormalTriL`.



`tfp.layers.DenseVariational` : Dense layer with random kernel and bias.

`tfp.layers.DistributionLambda` : Keras layer enabling plumbing TFP distributions through Keras models.

`tfp.layers.IndependentNormal` : An independent normal Keras layer.

`tfp.layers.MultivariateNormalTriL` : A d-variate MVNTriL Keras layer from d + d * (d + 1) // 2 params.

`tfp.layers.VariableLayer`: Simply returns a (trainable) variable, regardless of input.
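
The parameter count quoted for `MultivariateNormalTriL` grows quickly with the number of weights. The plain-Python sketch below works through the arithmetic, assuming the four input features and `hidden_units = [8, 8]` used later in this notebook; the helper names are hypothetical and no TFP call is made.

```python
def mvn_tril_params_size(d):
    """Parameters of a d-variate MVNTriL: d means plus a d x d lower triangle."""
    return d + d * (d + 1) // 2


def posterior_param_count(n_inputs, units):
    """Weights in a dense layer (kernel + bias) and the size of a full-covariance posterior over them."""
    n = n_inputs * units + units
    return n, mvn_tril_params_size(n)


# First hidden layer: 4 inputs -> 8 units.
print(posterior_param_count(4, 8))   # (40, 860)
# Second hidden layer: 8 inputs -> 8 units.
print(posterior_param_count(8, 8))   # (72, 2700)
```

The quadratic growth comes from learning a full covariance between all weights of a layer, which is one reason the hidden layers in this kind of model are usually kept small.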



In [None]:
def run_experiment(model, loss, train_dataset, test_dataset):
    
    # 1. Initialize the W&B run (anonymous if no API key was provided)
    wandb.init(project='ProbabilisticBNN', anonymous=anony)

    # 2. Save model inputs and hyperparameters
    config = wandb.config
    # Log the same learning rate that the optimizer actually uses
    config.learning_rate = learning_rate

    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=learning_rate),
        loss=loss,
        metrics=[keras.metrics.RootMeanSquaredError()]
    )

    print("Start training the model...")
    model.fit(train_dataset, epochs=num_epochs, validation_data=test_dataset,callbacks=[WandbCallback()])
    print("Model training finished.")
    _, rmse = model.evaluate(train_dataset, verbose=0)
    print(f"Train RMSE: {round(rmse, 3)}")

    print("Evaluating model performance...")
    _, rmse = model.evaluate(test_dataset, verbose=0)
    print(f"Test RMSE: {round(rmse, 3)}")

In [None]:
def create_probablistic_bnn_model(train_size):
    inputs = create_model_inputs()
    features = keras.layers.concatenate(list(inputs.values()))
    features = layers.BatchNormalization()(features)

    # Create hidden layers with weight uncertainty using the DenseVariational layer.
    for units in hidden_units:
        features = tfp.layers.DenseVariational(
            units=units,
            make_prior_fn=prior,
            make_posterior_fn=posterior,
            kl_weight=1 / train_size,
            activation="sigmoid",
        )(features)

    # Create a probabilistic output (Normal distribution), and use the `Dense` layer
    # to produce the parameters of the distribution.
    # We set units=2 to learn both the mean and the scale (standard deviation) of the Normal distribution.
    distribution_params = layers.Dense(units=2)(features)
    outputs = tfp.layers.IndependentNormal(1)(distribution_params)

    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

In [None]:
learning_rate = 0.001
dropout_rate = 0.15
batch_size = 256
num_epochs = 2
encoding_size = 16
train_size = len(train_data)  # number of training rows; scales the KL term (kl_weight = 1 / train_size)
hidden_units = [8, 8]

train_dataset = get_dataset_from_csv(
    train_data_file, shuffle=True, batch_size=batch_size
)
valid_dataset = get_dataset_from_csv(valid_data_file, batch_size=batch_size)

In [None]:
# Define the prior weight distribution as Normal of mean=0 and stddev=1.
# Note that, in this example, the prior distribution is not trainable,
# as we fix its parameters.
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    prior_model = keras.Sequential(
        [
            tfp.layers.DistributionLambda(
                lambda t: tfp.distributions.MultivariateNormalDiag(
                    loc=tf.zeros(n), scale_diag=tf.ones(n)
                )
            )
        ]
    )
    return prior_model


# Define variational posterior weight distribution as multivariate Gaussian.
# Note that the learnable parameters for this distribution are the means,
# variances, and covariances.
def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    posterior_model = keras.Sequential(
        [
            tfp.layers.VariableLayer(
                tfp.layers.MultivariateNormalTriL.params_size(n), dtype=dtype
            ),
            tfp.layers.MultivariateNormalTriL(n),
        ]
    )
    return posterior_model

In [None]:
def negative_loglikelihood(targets, estimated_distribution):
    return -estimated_distribution.log_prob(targets)


#num_epochs = 1000
prob_bnn_model = create_probablistic_bnn_model(train_size)
run_experiment(prob_bnn_model, negative_loglikelihood, train_dataset, valid_dataset)

### Acknowledgements

Google supported this work by providing Google Cloud credit.

## References

https://www.tensorflow.org/probability/overview

https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/keras_recipes/ipynb/bayesian_neural_networks.ipynb

https://www.kaggle.com/edwardcrookenden/eda-and-lgbm-baseline-feature-imp

https://www.kaggle.com/datafan07/ubiquant-market-prediction-what-do-we-have-here-

https://www.kaggle.com/sytuannguyen/ubiquant-market-prediction-eda

https://www.kaggle.com/columbia2131/speed-up-reading-csv-to-pickle




# Work in progress 🚧