# Importing packages

In [2]:
# Reading and saving data
import pandas as pd 

# Computations
import numpy as np 

# Deep learning
import tensorflow as tf 

# Keras API 
from tensorflow import keras

# Reading the data 

The file contains two datasets, dataset1 and dataset2, stacked in a single table.

The **DatasetID** column identifies which dataset each row belongs to.

In [7]:
d = pd.read_csv('data/multitasklearnig_task.csv')

In [8]:
print(f'Shape of the dataset: {d.shape}')

Shape of the dataset: (1000030, 10)


In [9]:
print(d.head())

   DatasetID    x1    x2    x3    x4    x5    x6     z   y1  y2
0          1 -1.84 -4.66  2.13 -1.32 -5.86 -4.69  2.89  0.0 NaN
1          1  1.31  0.16  2.94 -0.88  0.15 -3.69  1.25  1.0 NaN
2          1  0.60  4.23 -0.10  0.52  3.04 -0.23 -3.00  1.0 NaN
3          1  1.94  1.68  0.15 -3.16  0.12 -3.80  8.14  1.0 NaN
4          1  2.95  0.25  0.24 -0.47  3.10  0.69 -0.53  0.0 NaN


In [10]:
print(d.tail())

         DatasetID    x1    x2    x3    x4    x5    x6   z  y1         y2
1000025          2 -3.28 -1.20  0.50 -0.43 -2.13 -1.62 NaN NaN -19.905467
1000026          2  2.41  5.34 -3.96 -0.62  3.90 -4.47 NaN NaN   3.440164
1000027          2  2.74  5.80 -4.03 -0.82  6.98  4.53 NaN NaN  17.879392
1000028          2 -0.79 -4.31  3.73  0.83  0.23 -1.59 NaN NaN   0.764875
1000029          2  5.15  2.43 -1.69  1.46  3.36 -0.63 NaN NaN  19.181821


The number of observations for each dataset:

In [11]:
print(d.groupby('DatasetID').size())

DatasetID
1         30
2    1000000
dtype: int64


The first dataset contains 30 rows, while the second contains 1,000,000 rows.

# Splitting the dataset into two subsets

The target variable of the first dataset is binary (0 or 1), so the objective is to model a probability given the feature matrix $X$ and the additional variable $Z$:

$$ p(Y | X, Z) $$

The target variable of the second dataset is continuous, so this is a regression problem:

$$ E[Y | X] $$

In [12]:
# Subsetting
d1 = d[d['DatasetID']==1]
d2 = d[d['DatasetID']==2]

# Resetting the indexes
d1.reset_index(inplace=True, drop=True)
d2.reset_index(inplace=True, drop=True)

Creating the matrices $X_{1}$, $Z_{1}$, $Y_{1}$ for the logistic problem and $X_{2}$, $Y_{2}$ for the regression problem:

In [14]:
# List of features
features_x = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']
features_z = ['z']

# 'Probabilistic' dataset (binary target)
X1, Z1, Y1 = d1[features_x], d1[features_z], d1['y1']

# 'Continuous' dataset (regression target)
X2, Y2 = d2[features_x], d2['y2']
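
The head and tail printed earlier suggest that `y2` is missing for dataset 1 and that `z` and `y1` are missing for dataset 2. A minimal sketch of how that pattern could be checked is shown below. It runs on a small made-up frame named `toy` (hypothetical, not the real data), and the same check could be applied to `d`:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of the file; the values are invented
toy = pd.DataFrame({
    'DatasetID': [1, 1, 2, 2, 2],
    'x1': [0.5, -1.2, 2.0, 0.3, -0.7],
    'z':  [1.1, -0.4, np.nan, np.nan, np.nan],
    'y1': [1.0, 0.0, np.nan, np.nan, np.nan],
    'y2': [np.nan, np.nan, 3.4, -1.9, 0.8],
})

# Split by DatasetID and reset the index of each part
parts = {k: g.reset_index(drop=True) for k, g in toy.groupby('DatasetID')}

# For each part, list the columns that contain only missing values
all_missing = {}
for k, g in parts.items():
    mask = g.isna().all()
    all_missing[k] = mask[mask].index.tolist()

print(all_missing)
```

If the real file follows the same pattern, dataset 1 would report only `y2` as fully missing, and dataset 2 would report `z` and `y1`.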

# Assumptions

The two datasets are assumed to be linked through their data-generating processes: the coefficients of the first model are a scaled version of the coefficients of the second.

$$ p(Y_{1} | X_{1}, Z) = \dfrac{1}{1 + e^{-(\beta_{1}X_{1} + \beta_{2}Z)}} \in (0, 1) $$

$$ E[Y_{2}|X_{2}] = \beta_{3} X_{2} $$

$$ \beta_{1} = \alpha \beta_{3} $$ 
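
To make the link between the two tasks concrete, here is a minimal NumPy sketch of the assumed data-generating process. All numeric values are hypothetical, $\alpha$ is treated as a scalar (an assumption; the text does not state its shape), and the same toy matrix `X` is reused for both tasks purely for illustration. The probability uses the standard logistic function.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 5, 6                        # toy sizes; six features as in features_x
X = rng.normal(size=(n, k))        # stand-in feature matrix
Z = rng.normal(size=n)             # stand-in for the extra variable z

# Hypothetical parameter values, chosen only for illustration
beta_3 = rng.normal(size=k)        # regression coefficients
alpha = 0.5                        # assumed scalar linking the two tasks
beta_1 = alpha * beta_3            # beta_1 = alpha * beta_3
beta_2 = 0.3                       # coefficient on Z

def sigmoid(t):
    """Standard logistic function, mapping real numbers into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

p_y1 = sigmoid(X @ beta_1 + beta_2 * Z)   # p(Y1 | X1, Z)
e_y2 = X @ beta_3                         # E[Y2 | X2]

print(p_y1.round(3))
print(e_y2.round(3))
```

Because $\beta_{1}$ is a multiple of $\beta_{3}$, the linear part of the classification model, `X @ beta_1`, equals `alpha * e_y2`. This is the structure that would let the large regression dataset inform the 30-row classification task.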