### 1. Using 04_hotel_cancellation.csv, estimate the treatment effect with ‘different room is assigned’ as the treatment indicator, and interpret its effect on the room being ‘canceled’. Treat all the other columns as covariates.

In [1]:
#!pip install causalinference
#!pip install python-dotenv
#!conda install -c r r-glmnet

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math

from causalinference import CausalModel

In [3]:
# Overview of the dataset
hotel = pd.read_csv("04_hotel_cancellation.csv")
hotel.head()

Unnamed: 0.1,Unnamed: 0,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,days_in_waiting_list,different_room_assigned,is_canceled
0,3,13,2015,27,1,0,False,False
1,4,14,2015,27,1,0,False,False
2,5,14,2015,27,1,0,False,False
3,7,9,2015,27,1,0,False,False
4,8,85,2015,27,1,0,False,True


In [25]:
# Create the treatment indicator
d = hotel[['different_room_assigned']]
d = d.replace({True:1, False:0})

# Create the observed outcome
y = hotel[['is_canceled']]
y = y.replace({True:1, False:0})

# Create the covariates matrix
X = hotel.iloc[:,1:6]

In [6]:
# Create the causal model
causal = CausalModel(D = d.values, Y = y.values, X = X.values)
print(causal.summary_stats)


Summary Statistics

                     Controls (N_c=91673)       Treated (N_t=11221)             
       Variable         Mean         S.d.         Mean         S.d.     Raw-diff
--------------------------------------------------------------------------------
              Y        0.432        0.495        0.051        0.220       -0.381

                     Controls (N_c=91673)       Treated (N_t=11221)             
       Variable         Mean         S.d.         Mean         S.d.     Nor-diff
--------------------------------------------------------------------------------
             X0      116.468      109.178       73.114       85.262       -0.443
             X1     2016.182        0.703     2015.949        0.695       -0.334
             X2       27.249       13.100       28.076       14.651        0.060
             X3       15.804        8.804       15.642        8.714       -0.019
             X4        2.644       18.955        2.422       17.457       -0.012
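
The `Nor-diff` column can be reproduced by hand from the means and standard deviations in the table. The sketch below assumes the usual normalized-difference formula, (mean_t - mean_c) / sqrt((s_c^2 + s_t^2) / 2); the helper name `normalized_diff` is illustrative and not part of `causalinference`.

```python
import math

def normalized_diff(mean_c, sd_c, mean_t, sd_t):
    """Difference in group means scaled by the average of the two group variances."""
    return (mean_t - mean_c) / math.sqrt((sd_c ** 2 + sd_t ** 2) / 2)

# X0 (lead_time) row of the summary table: controls first, then treated
nor_diff_x0 = normalized_diff(116.468, 109.178, 73.114, 85.262)
print(round(nor_diff_x0, 3))  # -0.443, matching the table
```

A common rule of thumb reads magnitudes above roughly 0.25 as a sign of imbalance between the groups; X0 and X1 (lead_time and arrival_date_year, by column order) fall in that range here.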



In [7]:
# includes the treatment variable and the covariates as the model predictors
causal.est_via_ols(adj=1)
print('adj=1', causal.estimates)

adj=1 
Treatment Effect Estimates: OLS

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE     -0.337      0.003   -118.276      0.000     -0.343     -0.332





Observations: 
1. The estimated average treatment effect (ATE) is -0.337: being assigned a different room is associated with a cancellation probability about 33.7 percentage points lower.
2. The p-value is close to zero and the 95% confidence interval (-0.343, -0.332) excludes zero, so the effect of being assigned a different room on the likelihood of cancellation is statistically significant.
3. The negative ATE indicates that, controlling for the other covariates, assigning a different room to a customer is estimated to reduce the chance that the customer cancels the reservation.
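
The columns of the estimates table are tied together by simple arithmetic, which the sketch below reproduces from the printed values. It assumes a normal approximation (z = Est. / S.e., interval = Est. ± 1.96 × S.e.); because the table rounds to three decimals, the recomputed bounds can differ from the printed ones in the last digit.

```python
import math

# Values copied from the OLS estimates table
est, z_reported = -0.337, -118.276

# Recover the standard error from the estimate and the z statistic
se = est / z_reported

# 95% confidence interval under the normal approximation
lower, upper = est - 1.96 * se, est + 1.96 * se

# Two-sided p-value: P(|Z| > |z|) for a standard normal Z
p_value = math.erfc(abs(z_reported) / math.sqrt(2))

print(round(se, 4), round(lower, 3), round(upper, 3), p_value)
```

With |z| above 100 the p-value underflows to zero in floating point, which is why the table prints 0.000.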

In [37]:
# Use logit model to estimate the treatment effect
x = pd.concat([X,d], axis = 1)

model = sm.Logit(y,x)
result = model.fit()
print(result.params)

Optimization terminated successfully.
         Current function value: 0.598443
         Iterations 7
lead_time                    0.005202
arrival_date_year           -0.000375
arrival_date_week_number    -0.004168
arrival_date_day_of_month   -0.001143
days_in_waiting_list         0.001325
different_room_assigned     -2.518556
dtype: float64


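Observations:<br>
The logit model also gives a negative coefficient on the treatment, -2.52. This coefficient is on the log-odds scale, so its size is not directly comparable to the ATE of -0.337, but its sign agrees with the result from the causal model: controlling for the other covariates, assigning a different room to a customer is associated with a lower chance that the booking is canceled.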
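
To put the logit coefficient on a more readable scale, it can be exponentiated into an odds ratio. The sketch below does that, then applies the odds ratio to a baseline probability of 0.432 (the control-group cancellation rate from the summary statistics). That baseline is chosen purely for illustration; the implied probability ignores how the covariates differ across bookings.

```python
import math

# Logit coefficient on different_room_assigned, copied from the output above
coef_treatment = -2.518556

# Odds ratio: multiplicative change in the odds of cancellation under treatment
odds_ratio = math.exp(coef_treatment)

# Illustration: apply the odds ratio to an assumed baseline probability
p_baseline = 0.432
odds_baseline = p_baseline / (1 - p_baseline)
odds_treated = odds_baseline * odds_ratio
p_treated = odds_treated / (1 + odds_treated)

print(round(odds_ratio, 3), round(p_treated, 3))
```

An odds ratio of about 0.08 means the odds of cancellation under treatment are roughly one twelfth of the odds without it, holding the covariates fixed.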

In [44]:
'''
Notes for CausalModel:
# only includes the treatment variable as the model predictors
causal.est_via_ols(adj=0)
print('adj=0', causal.estimates)

# includes the treatment variable, the covariates, and the interactions between the treatment variable and the covariates as model predictors
causal.est_via_ols(adj=2)
print('adj=2', causal.estimates)
'''

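
The three `adj` settings differ only in which regressors enter the OLS fit. The sketch below mimics them with plain NumPy on synthetic data. It assumes that `adj=2` interacts the treatment with the demeaned covariates, so that the coefficient on the treatment can still be read as the ATE; `ols_ate` is an illustrative helper, not the library's implementation.

```python
import numpy as np

def ols_ate(y, d, X, adj):
    """Coefficient on d from an OLS fit whose regressors depend on adj (illustrative helper)."""
    cols = [np.ones(len(y)), d]                          # adj=0: intercept and treatment only
    if adj >= 1:
        cols.append(X)                                   # adj=1: add the covariates
    if adj == 2:
        cols.append(d[:, None] * (X - X.mean(axis=0)))   # adj=2: add treatment x demeaned covariates
    Z = np.column_stack(cols)
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

# Synthetic data: treatment is more likely when x0 is high, and x0 also raises the outcome
rng = np.random.default_rng(0)
n = 5000
X_sim = rng.normal(size=(n, 2))
d_sim = (rng.random(n) < 1 / (1 + np.exp(-X_sim[:, 0]))).astype(float)
y_sim = (-0.3 + 0.2 * X_sim[:, 0]) * d_sim + 0.5 * X_sim[:, 0] + rng.normal(scale=0.5, size=n)

raw_diff = y_sim[d_sim == 1].mean() - y_sim[d_sim == 0].mean()
ate_by_adj = {adj: ols_ate(y_sim, d_sim, X_sim, adj) for adj in (0, 1, 2)}
print(raw_diff, ate_by_adj)
```

In this simulation the true ATE is -0.3. With `adj=0` the estimate equals the raw difference in means and absorbs the confounding through x0, while `adj=2` comes close to the true value.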

### 2. For 04_hotel_cancellation.csv, use double lasso regression to measure the effect of ‘different room is assigned’ on the room being ‘canceled’

In [56]:
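# First stage: estimate d_hat, the predicted probability of treatment given the covariates
logit_1 = sm.Logit(d, X).fit()
d_hat = logit_1.predict(X)

# Combine d_hat with the covariates and the treatment indicator
d_hat = pd.DataFrame(data = {"d_hat": d_hat[:]})
x_new = pd.concat([x,d_hat], axis = 1)

# Second stage: logistic regression of the outcome on the covariates, the treatment, and d_hat
logit_2 = sm.Logit(y, x_new).fit()
print(logit_2.params)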

Optimization terminated successfully.
         Current function value: 0.334283
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.596827
         Iterations 7
lead_time                     0.000658
arrival_date_year             0.000395
arrival_date_week_number      0.005383
arrival_date_day_of_month    -0.004647
days_in_waiting_list          0.004968
different_room_assigned      -2.504256
d_hat                       -11.692611
dtype: float64
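
The cell above fits a two-stage logit rather than a lasso. For comparison, here is a minimal sketch of double-selection lasso on synthetic data: select covariates that predict the outcome, select covariates that predict the treatment, then run OLS of the outcome on the treatment and the union of both sets. Everything in it is an assumption made for illustration: the hand-rolled coordinate-descent solver `lasso_cd`, the helper `double_lasso`, the penalty `alpha=0.1`, and the simulated data with a continuous outcome.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Lasso by cyclic coordinate descent (illustrative helper).

    Minimizes (1 / 2n) * ||y - X b||^2 + alpha * ||b||_1 for centered X and y.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]        # add back column j's current contribution
            rho = X[:, j] @ resid / n         # covariance of column j with the partial residual
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]  # soft threshold
            resid -= X[:, j] * beta[j]
    return beta

def double_lasso(X, d, y, alpha=0.1):
    """Double selection: union of covariates chosen for y and for d, then OLS."""
    Xc = X - X.mean(axis=0)
    keep_y = np.flatnonzero(lasso_cd(Xc, y - y.mean(), alpha))
    keep_d = np.flatnonzero(lasso_cd(Xc, d - d.mean(), alpha))
    selected = np.union1d(keep_y, keep_d)
    Z = np.column_stack([np.ones(len(y)), d, X[:, selected]])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    return coef[1], selected                  # coefficient on d, selected column indices

# Synthetic data: 20 covariates, of which only columns 0, 1 and 2 matter
rng = np.random.default_rng(0)
n, p = 2000, 20
X_sim = rng.normal(size=(n, p))
d_sim = (X_sim[:, 0] + X_sim[:, 1] + rng.normal(size=n) > 0).astype(float)
y_sim = 1.0 * d_sim + 2.0 * X_sim[:, 0] + X_sim[:, 2] + rng.normal(size=n)

tau_hat, selected = double_lasso(X_sim, d_sim, y_sim)
print(tau_hat, selected)
```

Selecting on both equations matters because a covariate that strongly predicts the treatment but only weakly predicts the outcome (column 1 here) could be dropped by an outcome-only lasso and still bias the treatment coefficient.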


### 3. Use bootstrap to estimate the standard error of the treatment effects measured in (2).

In [82]:
# Number of bootstrap resamples
n = 1000

# Vector that stores the treatment effect from each resample
treatment_effects = np.zeros(n)

# Resample rows with replacement, refit both stages, and collect the treatment effect
for i in range(n):
    resample_index = np.random.choice(hotel.index, size = hotel.index.size, replace = True)
    X_resample = X.iloc[resample_index]
    x_resample = x.iloc[resample_index]
    d_resample = d.iloc[resample_index]
    y_resample = y.iloc[resample_index]

    # First stage: predict the treatment from the covariates, as in (2)
    model1 = sm.Logit(d_resample, X_resample).fit(disp = 0)
    d_hat = np.array(model1.predict(X_resample)).reshape(X_resample.shape[0],1)
    x_new = np.hstack((x_resample,d_hat))

    # Second stage: regress the outcome on the covariates, the treatment, and d_hat
    model2 = sm.Logit(y_resample, x_new).fit(disp = 0)

    # Parameters follow the column order of x_new: [-1] is d_hat, [-2] is the treatment
    treatment_effects[i] = np.asarray(model2.params)[-2]

In [83]:
# Bootstrap standard error: the standard deviation of the treatment effects across the n resamples
treatment_effects_se = treatment_effects.std(axis=0)
round(treatment_effects_se, 4)

0.0454

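Bootstrapping estimates the standard error empirically, from the spread of the treatment effect across resamples, rather than from a closed-form formula. The estimate is reliable when the sample is large and reasonably representative of the population, because each resample treats the original sample as a stand-in for the population.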
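
As a sanity check on the method, the bootstrap can be tried on a statistic whose standard error is known in closed form. The sketch below uses the sample mean of simulated data (an assumption made for illustration, unrelated to the hotel dataset) and compares the standard deviation of the bootstrap replicates with the textbook formula s / sqrt(N).

```python
import math
import random
import statistics

random.seed(0)

# Simulated sample of size N = 400
sample = [random.gauss(10.0, 2.0) for _ in range(400)]

# Resample with replacement and recompute the mean each time
n_boot = 500
boot_means = []
for _ in range(n_boot):
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.fmean(resample))

# Bootstrap standard error: standard deviation of the replicate estimates
se_boot = statistics.stdev(boot_means)

# Closed-form standard error of the mean, for comparison
se_formula = statistics.stdev(sample) / math.sqrt(len(sample))

print(round(se_boot, 4), round(se_formula, 4))
```

The two numbers agree closely. Note that the standard deviation of the replicates is already the standard error: it is not divided by the square root of the number of resamples, which only controls how precisely that standard deviation is estimated.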