# Daylight Saving Effect on Heart Attack

We will use the [healthcare dataset from Kaggle](https://www.kaggle.com/datasets/prasad22/healthcare-dataset).

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dowhy.causal_identifier import backdoor
import networkx as nx
from pgmpy.estimators import PC
from pgmpy.models import BayesianModel
import dowhy
from dowhy import CausalModel

from warnings import filterwarnings
filterwarnings('ignore')

# Motivation and Data Processing - 10% of the grade

Motivation, description of dataset and causal questions, description of assumptions, show true causal graph or a reasonable guess (10% grade)

In [None]:
file_path = '/home/mara/workspace/Causality_Eurostat/data/healthcare/healthcare_dataset.csv'
df = pd.read_csv(file_path)

We only consider the first visit of patients with hypertension.

In [None]:
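filtered_df = df[df['Medical Condition'] == 'Hypertension']
# Sort by admission date so that keep='first' really keeps each patient's earliest visit.
filtered_df = filtered_df.sort_values('Date of Admission', key=pd.to_datetime)
filtered_df = filtered_df[~filtered_df['Name'].duplicated(keep='first')]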

Let's see how many unique values each variable has.

In [None]:
for col_name in filtered_df.columns:
    print(f'{col_name}:{filtered_df[col_name].nunique()}')

We decide that the following variables are irrelevant for our analysis or have arbitrary data: ```['Doctor', 'Hospital', 'Room Number', 'Discharge Date']```. We can also drop ```'Name'``` since we already filtered by the first visit of unique individuals.

In [None]:
filtered_df = filtered_df.drop(columns=['Doctor', 'Hospital', 'Room Number', 'Discharge Date', 'Name'])

Let's see the unique values of the columns with at most 10 unique values. For the remaining columns, we only print the name and mark them as continuous.

In [None]:
categoricals = []

for col_name in filtered_df.columns:
    if filtered_df[col_name].nunique() <= 10:
        print(f'{col_name}:{filtered_df[col_name].unique()}')
        categoricals.append(col_name)
    else:
        print(f'{col_name} is a continuous variable.')

We introduce a binary column, ```daylight_saving_march```, set to 1 if the admission date falls within the 3 months following the annual daylight saving time change in March (a period during which people adjust their schedules after losing one hour of sleep), and 0 otherwise. A second column, ```daylight_saving_before_march```, flags admissions in the 3 months before the change, for comparison. Our dataset spans the years 2018-2023, so we construct a dictionary pairing each year with the date of that year's daylight saving time transition.

In [None]:
daylight_saving_dates = {
    2018: '2018-03-25',
    2019: '2019-03-31',
    2020: '2020-03-29',
    2021: '2021-03-28',
    2022: '2022-03-27',
    2023: '2023-03-26'}
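The dates in the dictionary all coincide with the last Sunday of March, which is the European summer-time rule. Assuming that rule is what was intended, the dates could also be derived rather than typed by hand. The helper name `last_sunday_of_march` below is hypothetical, and the sketch only cross-checks the hardcoded values.

```python
from datetime import date, timedelta

def last_sunday_of_march(year):
    """Return the last Sunday of March for `year` (assumes the EU summer-time rule)."""
    last_day = date(year, 3, 31)
    # weekday(): Monday=0 ... Sunday=6, so this is the number of days since the latest Sunday.
    days_since_sunday = (last_day.weekday() + 1) % 7
    return last_day - timedelta(days=days_since_sunday)

# Dates copied from the dictionary above, used only for comparison.
hardcoded = {
    2018: '2018-03-25', 2019: '2019-03-31', 2020: '2020-03-29',
    2021: '2021-03-28', 2022: '2022-03-27', 2023: '2023-03-26',
}

computed = {year: last_sunday_of_march(year).isoformat() for year in hardcoded}
print(computed == hardcoded)
```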

In [None]:
daylight_saving_dates = {year: pd.to_datetime(date) for year, date in daylight_saving_dates.items()}

def in_dst_window(date, months):
    # 1 if `date` lies between the DST change of its own year and `months` months
    # after it (a negative `months` looks back before the change), else 0.
    date = pd.to_datetime(date)
    dst = daylight_saving_dates[date.year]
    bound = dst + pd.DateOffset(months=months)
    return int(min(dst, bound) <= date <= max(dst, bound))

filtered_df['daylight_saving_march'] = filtered_df['Date of Admission'].apply(in_dst_window, months=3)
filtered_df['daylight_saving_before_march'] = filtered_df['Date of Admission'].apply(in_dst_window, months=-3)

Let's see how balanced our data is.

In [None]:
for col_name in filtered_df.columns:
    if filtered_df[col_name].nunique() <= 10:
        print(filtered_df[col_name].value_counts())
        print()

We notice that 25% of hypertension cases occur within the 3 months after the March daylight saving change, while 22% occur in the 3 months before it. Of course, this alone is not enough: we still need to check whether daylight saving actually affects the chances of getting hypertension. We formulate our own hypothesis before applying any causal discovery algorithm.
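A useful reference point for these percentages is the share of calendar days that each window covers, since a 3-month window spans roughly a quarter of the year. The sketch below computes that share under the assumption that admissions are spread uniformly over the days of 2018-2023. The function name `window_share` is hypothetical, and it mirrors the window rule used for the two flag columns.

```python
import pandas as pd

# Transition dates copied from the dictionary defined earlier.
dst_dates = {
    2018: '2018-03-25', 2019: '2019-03-31', 2020: '2020-03-29',
    2021: '2021-03-28', 2022: '2022-03-27', 2023: '2023-03-26',
}
dst_dates = {year: pd.Timestamp(day) for year, day in dst_dates.items()}

def window_share(months):
    """Share of calendar days in 2018-2023 that the window rule would flag."""
    days = pd.date_range('2018-01-01', '2023-12-31', freq='D')
    flagged = 0
    for day in days:
        dst = dst_dates[day.year]
        bound = dst + pd.DateOffset(months=months)
        flagged += int(min(dst, bound) <= day <= max(dst, bound))
    return flagged / len(days)

share_after = window_share(3)    # baseline for daylight_saving_march
share_before = window_share(-3)  # baseline for daylight_saving_before_march
print(f'after: {share_after:.1%}, before: {share_before:.1%}')
```

The "before" share comes out somewhat smaller because the rule looks up the transition date by the admission's own calendar year, so days in late December are never flagged. Observed shares should be compared against these baselines rather than against each other.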

In [None]:
for col_name in filtered_df.columns:
    print(col_name)

We can already form some hypotheses, which motivate the rest of this work. We encode them as a directed graph.

In [None]:
graph_variables = ['Age', 'Gender', 'Medical Condition',
             'Insurance Provider', 'Billing Amount', 'Admission Type', 
             'Medication', 'Test Results', 'daylight_saving_march']

G = nx.DiGraph()
G.add_nodes_from(graph_variables)

edges = [
    ('daylight_saving_march', 'Medical Condition'),
    ('Age', 'Medical Condition'),
    ('Age', 'daylight_saving_march'),
    ('Gender', 'Medical Condition'),
    ('Admission Type', 'daylight_saving_march'),
    ('Medical Condition', 'Medication'),
    ('Age', 'Insurance Provider'),
    ('Admission Type', 'Test Results'),
    ('Billing Amount', 'Medical Condition'),
    ('Age', 'Billing Amount'),
    ('Insurance Provider', 'Billing Amount'),
    ('Test Results', 'Medication')
]

G.add_edges_from(edges)

pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, font_size=8, node_size=700, node_color='skyblue', font_color='black', font_weight='bold', arrowsize=10)

plt.show()
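Before using the hypothesised graph for identification, it can help to confirm that it is a valid DAG and to list the direct causes and the descendants of the treatment. This is a minimal sketch that rebuilds the graph from the same edge list; the variable names `parents_of_treatment` and `descendants_of_treatment` are introduced here for illustration only.

```python
import networkx as nx

# Edge list copied from the hypothesised graph above.
edges = [
    ('daylight_saving_march', 'Medical Condition'),
    ('Age', 'Medical Condition'),
    ('Age', 'daylight_saving_march'),
    ('Gender', 'Medical Condition'),
    ('Admission Type', 'daylight_saving_march'),
    ('Medical Condition', 'Medication'),
    ('Age', 'Insurance Provider'),
    ('Admission Type', 'Test Results'),
    ('Billing Amount', 'Medical Condition'),
    ('Age', 'Billing Amount'),
    ('Insurance Provider', 'Billing Amount'),
    ('Test Results', 'Medication'),
]
G = nx.DiGraph(edges)

# A causal graph must not contain directed cycles.
is_dag = nx.is_directed_acyclic_graph(G)

# Direct causes of the treatment and everything downstream of it.
parents_of_treatment = sorted(G.predecessors('daylight_saving_march'))
descendants_of_treatment = sorted(nx.descendants(G, 'daylight_saving_march'))

print(is_dag, parents_of_treatment, descendants_of_treatment)
```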

# Exploratory Data Analysis - 15% of the grade

Testing correlations/conditional independences (15% grade, follow Tutorial 1&2)

Find all paths between the two nodes ```daylight_saving_march``` and ```Medical Condition```.

In [None]:
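all_paths = list(nx.all_simple_paths(G.to_undirected(), source='daylight_saving_march', target='Medical Condition'))

# Highlight the two endpoints in red; all other nodes stay white.
node_color = ['r' if node in ('daylight_saving_march', 'Medical Condition') else 'w' for node in G.nodes()]

# squeeze=False keeps axs two-dimensional even if there is only one path.
fig, axs = plt.subplots(nrows=1, ncols=len(all_paths), figsize=(50, 10), squeeze=False)
for path, ax in zip(all_paths, axs.flat):
    # Compare edges regardless of direction, since the paths come from the undirected graph.
    edges_on_path = {frozenset(edge) for edge in zip(path[:-1], path[1:])}
    edge_color = ['r' if frozenset(edge) in edges_on_path else 'black' for edge in G.edges()]
    nx.draw_shell(G, with_labels=True, node_color=node_color, edgecolors='black', ax=ax, edge_color=edge_color)
    ax.set_title(' - '.join(path))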

In [None]:
from itertools import combinations

X = ['daylight_saving_march']
Y = ['Medical Condition']

Z = [variable for variable in graph_variables if variable not in X + Y]

all_comb = []
for r in range(1, len(Z) + 1):
    all_comb.extend(combinations(Z, r))

for combination in all_comb:
    # Report the conditioning set that is actually being tested.
    print('Node daylight_saving_march and Medical Condition are d-separated by {}: {}'.format(combination, nx.algorithms.d_separated(G=G, x=set(X), y=set(Y), z=set(combination))))
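Because the hypothesised graph contains the direct edge from ```daylight_saving_march``` to ```Medical Condition```, no conditioning set can d-separate the two nodes in the graph itself. One way to look only at the non-causal (backdoor) paths is to remove the treatment's outgoing edges and test d-separation in the remaining graph, while also excluding descendants of the treatment from the conditioning set. The sketch below does this under the assumption that the edge list above is the true graph. The helper `satisfies_backdoor` is hypothetical, and the lookup of `is_d_separator` accounts for newer networkx releases that renamed `d_separated`.

```python
import networkx as nx

# Edge list copied from the hypothesised graph above.
edges = [
    ('daylight_saving_march', 'Medical Condition'),
    ('Age', 'Medical Condition'),
    ('Age', 'daylight_saving_march'),
    ('Gender', 'Medical Condition'),
    ('Admission Type', 'daylight_saving_march'),
    ('Medical Condition', 'Medication'),
    ('Age', 'Insurance Provider'),
    ('Admission Type', 'Test Results'),
    ('Billing Amount', 'Medical Condition'),
    ('Age', 'Billing Amount'),
    ('Insurance Provider', 'Billing Amount'),
    ('Test Results', 'Medication'),
]
G = nx.DiGraph(edges)

# Newer networkx versions provide is_d_separator; older ones only d_separated.
d_sep = getattr(nx, 'is_d_separator', None) or nx.d_separated

def satisfies_backdoor(graph, treatment, outcome, adjustment):
    """Check the backdoor criterion for `adjustment` (a set of node names)."""
    # 1) No adjustment variable may be a descendant of the treatment.
    if adjustment & nx.descendants(graph, treatment):
        return False
    # 2) The set must block every path that enters the treatment: drop the
    #    treatment's outgoing edges and test d-separation in what remains.
    backdoor_graph = graph.copy()
    backdoor_graph.remove_edges_from(list(graph.out_edges(treatment)))
    return d_sep(backdoor_graph, {treatment}, {outcome}, adjustment)

for candidate in [set(), {'Age'}, {'Age', 'Medication'}]:
    ok = satisfies_backdoor(G, 'daylight_saving_march', 'Medical Condition', candidate)
    print(sorted(candidate), ok)
```

In this graph, ```Age``` is a common cause of treatment and outcome, so the empty set fails, while ```Medication``` is a descendant of the treatment and may not be adjusted for.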

In [None]:
from itertools import permutations
import pingouin as pg

def test_all(df, vars):
    # Marginal
    for var1, var2 in permutations(vars, 2):
        p_val = pg.partial_corr(data=df, x=var1, y=var2, covar=[], method='pearson')['p-val'].item()
        print('{} and {}: p-value is {}'.format(var1, var2, p_val))

    # Conditional
    for var1, var2, cond in permutations(vars, 3):
        p_val = pg.partial_corr(data=df, x=var1, y=var2, covar=[cond], method='pearson')['p-val'].item()
        print('{} and {} given {}: p-value is {}'.format(var1, var2, cond, p_val))
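To make clear what the conditional test measures, the sketch below computes a partial correlation by hand on synthetic data: both variables are regressed on the conditioning variable, and the residuals are correlated. The data and the helper `residuals` are made up for illustration, and pingouin's internal computation may differ in detail, but the textbook definition is the same.

```python
import numpy as np

# Synthetic example: z is a common cause of x and y, with no direct x-y link.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 2 * z + rng.normal(size=n)

def residuals(target, covariate):
    """Residuals of a least-squares fit of `target` on `covariate` plus an intercept."""
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

# Marginal correlation is large because of the shared cause z.
marginal_corr = np.corrcoef(x, y)[0, 1]
# Partial correlation given z: correlate what is left after removing z's linear effect.
partial_corr = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f'marginal: {marginal_corr:.3f}, partial given z: {partial_corr:.3f}')
```

A partial correlation close to zero is consistent with conditional independence only under linear relationships, which is the assumption behind the Pearson-based test used here.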

In [None]:
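def encode_categoricals(df):
    from sklearn.preprocessing import LabelEncoder

    # Work on a copy so the caller's DataFrame is left untouched.
    df = df.copy()
    label_encoder = LabelEncoder()

    for column in df.columns:
        if df[column].dtype == 'object':
            # Map each distinct value to an integer code (in sorted order of the values).
            df[column] = label_encoder.fit_transform(df[column])
            
    return df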
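Label encoding assigns integer codes in sorted order, which imposes an ordering on nominal variables that a Pearson correlation will then treat as meaningful. A possible alternative for such columns, sketched here on a made-up toy frame, is one-hot encoding with `pd.get_dummies`. The column values below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy data with hypothetical values, only to show the shape of the result.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Admission Type': ['Urgent', 'Elective', 'Emergency'],
    'Age': [54, 37, 68],
})

# One 0/1 indicator column per category; drop_first avoids a redundant column.
encoded = pd.get_dummies(toy, columns=['Gender', 'Admission Type'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Each indicator can then be tested separately, at the cost of more variables in the independence tests.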


In [None]:
df_encoded = encode_categoricals(filtered_df)
df_encoded

In [None]:
test_all(df_encoded, graph_variables)

# Identify estimands for backdoor, frontdoor criterion and IVs - 20% of the grade

If they apply, or explain why they don't apply (20% grade, follow Tutorial 3 and 4)

# Estimate the causal effects - 15% of the grade

(e.g. linear, inverse propensity weighting, two stage linear-regression etc) to the estimands you have previously identified (15% grade, follow Tutorial 4)

# Causal discovery results - 20% of the grade

for at least one constraint-based (e.g. SGS, PC) and score-based algorithm (e.g. GES), explain why it works or it doesn't and what is identifiable (20% grade, follow Tutorials 5 and 6)

# Validation and sensitivity analysis - 20% of the grade

(e.g. refutation analysis in DoWhy) and Discussion on the assumptions and results (20% grade)