# Introduction

### Problem Description

Data-driven approaches are now used in many fields, from business to science. As data storage and computational power have become cheap, machine learning has gained popularity. However, most tools that extract dependencies from data are designed for prediction problems. This notebook considers a decision support simulation and shows that even good predictive models can lead to wrong conclusions under a certain condition, namely endogeneity. More generally, accurate predictions do not guarantee that causal relationships are detected.

Suppose the situation is as follows. A manager can assign a treatment to items in order to increase a target metric. The treatment is binary, i.e., it is either assigned or absent. Because the treatment has a cost, its assignment should be optimized. The manager has a historical dataset of item performance but does not know that, in the past, treatment was assigned predominantly based on the values of just one parameter. Moreover, this parameter is not included in the dataset. To make the situation harder, assume in addition that the omitted parameter cannot be measured, neither for old items nor for new ones. The manager wants to build a system that predicts an item's target metric both with and without treatment. Once this system is deployed, the manager can compare the two cases and decide whether the effect of treatment is worth its cost.
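The decision rule described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `decide_treatment`, `toy_predict`, and `treatment_cost` are hypothetical names, and the toy predictor merely stands in for a fitted model.

```python
import numpy as np

def decide_treatment(predict, features, treatment_cost):
    """Return a boolean mask that is True where the predicted gain exceeds the cost.

    `predict(features, treatment)` is a hypothetical callable returning the
    predicted target metric of each item under the given treatment flags (0 or 1).
    """
    features = np.asarray(features, dtype=float)
    n_items = features.shape[0]
    with_treatment = predict(features, np.ones(n_items))
    without_treatment = predict(features, np.zeros(n_items))
    predicted_effect = with_treatment - without_treatment
    return predicted_effect > treatment_cost

# Toy stand-in for a fitted model (assumption): the effect of treatment
# grows with the single observed feature.
def toy_predict(features, treatment):
    return 2.0 * features[:, 0] + treatment * (1.0 + features[:, 0])

new_items = np.array([[-0.5], [0.2], [1.5]])
decisions = decide_treatment(toy_predict, new_items, treatment_cost=1.1)
```

The rule is only as reliable as the predicted difference between the two cases, which is why a biased estimate of the treatment effect translates directly into wrong decisions.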

If a machine learning approach yields good prediction scores, users are unlikely to suspect that an important variable is omitted (at least until wrong decisions generate some expenses). Hence, domain knowledge and data understanding are still required for data-based modelling. This is particularly important when a dataset contains values produced by someone's decisions, because there is no guarantee that future decisions will not change dramatically. On the flip side, if all factors that affect decisions are included in the dataset, i.e., there is selection on observables for treatment assignment, a sufficiently powerful model is able to estimate the treatment effect correctly.
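The contrast between an omitted decision factor and selection on observables can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the dataset used later in the notebook: the data-generating process is linear, the true effect is a constant `true_effect`, and all variable names and coefficients are hypothetical. Ordinary least squares stands in for the predictive model.

```python
import numpy as np

rng = np.random.default_rng(361)
n = 20_000
true_effect = 1.0  # assumed constant treatment effect

observed = rng.normal(size=n)  # feature that is present in the dataset
hidden = rng.normal(size=n)    # parameter that drove past decisions
# Past treatment was assigned mostly to items with a high hidden value.
treatment = (hidden + 0.3 * rng.normal(size=n) > 0).astype(float)
target = observed + 2.0 * hidden + true_effect * treatment + 0.1 * rng.normal(size=n)

def ols_treatment_coefficient(columns, y):
    """Fit OLS with an intercept and return the coefficient of the last column."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients[-1]

# Hidden parameter omitted: treatment absorbs part of its influence.
effect_without_hidden = ols_treatment_coefficient([observed, treatment], target)
# Hidden parameter included: selection on observables holds.
effect_with_hidden = ols_treatment_coefficient([observed, hidden, treatment], target)
```

Under these assumptions, `effect_with_hidden` should be close to `true_effect`, while `effect_without_hidden` should be several times larger, because the treatment flag acts as a proxy for the omitted parameter.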

Sections of the notebook that illustrate ways to compensate for missing unobservable variables may be released later.

### References

To read more about causality in data analysis, see the following references:

1. *Angrist J, Pischke J-S. Mostly Harmless Econometrics. Princeton University Press, 2009.*

2. *Varian H. Big Data: New Tricks for Econometrics. Journal of Economic Perspectives, 28(2): 3–28, 2014.*

### Software Requirements

This notebook uses only packages that are popular in scientific computing: NumPy, pandas, Matplotlib, seaborn, scikit-learn, and XGBoost. Any of them can be installed with `conda` or `pip`.

# Preparations

### General

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LinearRegression

# Startup settings cannot suppress a warning raised when XGBRegressor is imported, so it is silenced here.
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from xgboost import XGBRegressor

In [2]:
np.random.seed(seed=361)

### Synthetic Dataset Generation

# Perfect Model and Poor Simulation