# Assignment 3

In [9]:
# Initialize and import
#import otter
#grader = otter.Notebook()
import pandas as pd
import numpy as np
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plots
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind

### Assignment instructions

* **How to install 'otter'**: Run `pip install otter-grader` in your Anaconda Command Prompt
* **Otter**: It is an autograder that we will be using for grading your notebooks.
* **grader.check('qn')**: This command runs test cases for nth question `qn` provided to you and displays the result. But these are not the only test cases, these are provided just for your basic testing. There will be some *hidden test cases* on which your answers will be tested as well.
* You are **not** allowed to edit any pre-defined variables and as per the instructions for every question you have to assign your answers to these variables.

### Submission instructions

* Rename your notebook as **YourName_DA3.ipynb**. (e.g. *`JohnDoe_DA3.ipynb`*)
* Download your notebook as a PDF and rename it to **YourName_DA3.pdf**
* Only submit your notebook and PDF in a zip file named **YourName_DA3.zip**

## Dataset
According to the World Health Organization (WHO), strokes are the 2nd leading cause of deaths globally, responsible for approximately 11% of the total deaths. As a researcher, it is important for you to be able to understand what factors impact the likelihood of getting a stroke.

It is believed that smoking and tobacco products in general are a significant contributor. However, a good data scientists uses all available resources at his/her disposal before making any such claims. That is what we will attempt to do in this assignment.

For the sake of simplicitly, we will use a limited-sized publicly-available dataset that provides information regarding 11 clinical features in around 5000 patients, along with whether or not they have experienced any strokes in the past. Each row represents a patient.

In [10]:
# Import dataset
df = pd.read_csv("C:\\Users\\LP\\Desktop\\DAs\\DA3\\strokes.csv")
print(df.head())


      id  gender   age  hypertension  heart_disease ever_married  \
0   9046    Male  67.0             0              1          Yes   
1  51676  Female  61.0             0              0          Yes   
2  31112    Male  80.0             0              1          Yes   
3  60182  Female  49.0             0              0          Yes   
4   1665  Female  79.0             1              0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private          Rural             105.92  32.5     never smoked   
3        Private          Urban             171.23  34.4           smokes   
4  Self-employed          Rural             174.12  24.0     never smoked   

   stroke  
0       1  
1       1  
2       1  
3       1  
4       1  


## Data summary

Each row in this dataset represents a patient with a specific smoking history. We are interested to see if smoking had any effect on whether or not they experienced a stroke. Several attributes of the individuals are stored in this dataset:

- **id**: the patient's unique ID for the study
- **gender**: whether the patient is a 'Male' or a 'Female'
- **age**: the patient's age at the time of the study
- **hypertension**: 1 if the patient is diagnosed with hypertension, and 0 otherwise
- **heart_disease**: 1 if the patient has a history of heart disease, and 0 otherwise
- **ever_married**: 'Yes' if the patiend has ever been married, and 'No' otherwise
- **work_type**: 'children', 'Govt_jov', 'Never_worked', 'Private' or 'Self-employed'
- **Residence_type**: 'Rural' or 'Urban'
- **avg_glucose_level**: average glucose level in the patient's blood
- **bmi**: the patient's body mass index (BMI)
- **smoking_status**: 'formerly smoked', 'never smoked', 'smokes', or 'Unknown' (this is our treatment variable)
- **stroke**: 1 is the patient has had a stroke, and 0 otherwise (this is our outcome variable)

## Data cleaning

**Question 1**: A patient's `id` is unimportant to us. Drop this column.

In [None]:
# Code here #
# dropping column 'id'
#converting all row values to lower case
df = df.applymap(lambda x: x.lower() if type(x) == str else x)

df = df.drop('id', axis=1)
print(df.shape)

(5109, 11)


In [12]:
grader.check("q1")

NameError: name 'grader' is not defined

**Question 2**: There are patients for whom we have no smoking history data (labeled 'Unknown' under `smoking_status`). Drop the rows corresponding to such patients and reset the dataframe index

In [None]:
# Code here #
# dropping rows with smoking_status 'Unknown'
df = df.drop(df[df['smoking_status'] == 'Unknown'].index)
#reset the index
df = df.reset_index(drop=True)
print(df.shape)

(3565, 11)


In [None]:
grader.check("q2")

**Question 3**: We want to be able to work with numeric data. So you will convert the data entries for some columns as described below:

- **gender**: 0 if 'Male' and 1 if 'Female'
- **ever_married**: 0 if 'No', and 1 if 'Yes'
- **work_type**: 0 if 'children', 1 if 'Govt_jov', 2 if 'Never_worked', 3 if 'Private', and 4 if 'Self-employed'
- **Residence_type**: 0 if 'Rural' and 1 if 'Urban'

For simplicity, we will group former and current smokers in the same category.
- **smoking_status**: 1 if 'formerly smoked' or 'smokes' (this will be our treatment group), and 0 if 'never smoked' (this will be our control group)

In [None]:
# Code here #
df.columns = df.columns.str.lower()
# converting gender to binary
df['gender'] = df['gender'].apply(lambda x: 0 if x=='Male' else 1)

#converting ever_married to binary
df['ever_married'] = df['ever_married'].apply(lambda x: 0 if x=='No' else 1)

# converting categories to lower case
#df['work_type'] = df['work_type'].apply(lambda x: x.lower())
#print(df['work_type'].unique())
"""
def work_categories(category):

  if category=='children':
    return 0
  elif category=='govt_job':
    return 1
  elif category=='never_worked':
    return 2
  elif category=='private':
    return 3
  elif category=='self-employed':
    return 4

df['work_type'] = df['work_type'].apply(work_categories)
"""
#converting residence_type to binary
df['residence_type'] = df['residence_type'].apply(lambda x: 0 if x=='Rural' else 1)

# convert smoking_status to binary
df['smoking_status'] = df['smoking_status'].apply(lambda x: 1 if (x=='formerly smoked' or x== 'smokes') else 0)
print(df.head())
#print(df.info())
#


   gender   age  hypertension  heart_disease  ever_married  work_type  \
0       1  67.0             0              1             1          3   
1       1  61.0             0              0             1          4   
2       1  80.0             0              1             1          3   
3       1  49.0             0              0             1          3   
4       1  79.0             1              0             1          4   

   residence_type  avg_glucose_level   bmi  smoking_status  stroke  
0               1             228.69  36.6               0       1  
1               1             202.21   NaN               0       1  
2               1             105.92  32.5               0       1  
3               1             171.23  34.4               0       1  
4               1             174.12  24.0               0       1  


In [None]:
grader.check("q3")

**Question 4**: Finally, drop all the rows with NULL values and reset the index

In [17]:
# Code here #
# Code here #
# dropping all rows with NULL values and resetting index
df.dropna(inplace=True)
df.reset_index(drop=True,inplace=True)
print(df.shape)

(4908, 11)


In [None]:
grader.check("q4")

## Exploratory Data Analysis

We are curious to see how different groups of patients react to the treatment (smoking status). The `smoking_status` variable is 1 if the individual is a smoker.

**Question 5:** 

a) Complete the function given below. Given a column name and dataframe, `treatment_plot` should plot the estimated average treatment effect for **all** groups of that column variable.  *For reference, refer to q5.png, which is what should be output if you input 'gender' (you can have minor differences like colors, but the axis should be the same).*

b) What do you observe in your treatment plot for the column `gender`? Write your answer as a comment.

In [None]:
#df['stroke'] = df['stroke'].astype(int)
#df.drop(columns=['stroke#'],inplace=True)
#print(df['gender'].value_counts())
#print(df['smoking_status'].value_counts())
#print(df['stroke'].value_counts())

#print(df.info())
group = df.groupby(['gender','smoking_status'])['stroke'].mean()
#group['ATE'] = group[1] - group[0]
#print(group)

#print(df.head())
#df['stroke'] = df['stroke'].astype(int)

def treatment_plot(data, col, x_labels=None):
  # ETA for all gender groups
 
  group = df.groupby(['gender','smoking_status'])['stroke'].mean().unstack()
  #
  group.columns = ['Non-Smoker', 'Smoker']
  
  # plot the data
  plt.figure(figsize=(10,6))
  group.plot(kind='bar')
  plt.xlabel('gender')
  plt.ylabel('stroke')

  plt.show()
  

In [None]:
treatment_plot(data= df, col='gender', x_labels=['Male', 'Female'])

#print(df['gender'].unique())
#print(df.info())

**Answer**:Smokers have a higher rate of stroke than non smokers for both male and females. Difference is much larger for males.

**Question 6:** 

a) Plot a correlation heatmap for this dataset (it should be a color coded graph indicating correlation values for each of the columns against every other column).

b) Comment on any notable correlations. For any of these pairs, answer the following questions: Are they causally related? If so, is their causal relationship direct or indirect? Name any confounding variables you suspect. If you do not think they are causally related, comment why.

Note: You will be graded on how critically you have commented, not how much you write. So keep your answers crisp and to the point, but also think deeply.

In [None]:
# Code here #
# Code here #
# selecting numeric columns only
numeric_df = df.select_dtypes(include=[np.number])
corr = numeric_df.corr()
plt.figure(figsize=(10, 8))

# plotting heat map
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

**Answer**: 

## The Effect of Smoking (Treatment)

**Question 7:** 

a) Find the overall estimated average treatment effect (under certain assumptions) of smoking. Store it in the variable `estimated_ATE`.
Note: Your test case may pass even if your value has the wrong sign.

b) Is your result positive or negative? What do the sign (+ve or -ve) and magnitude of your result tell you about the effect of smoking? Write your answer as a comment.

In [None]:
# Code here #
"""
count_0 = (df['stroke']==0).sum()
count_1 = (df['stroke']==1).sum()

print("No Stroke:", count_0)
print("Stroke:", count_1)
"""
estimated_ATE = df[df['smoking_status'] == 1]['stroke'].mean() - df[df['smoking_status'] == 0]['stroke'].mean()
#estimated_ATE = round(estimated_ATE,2)
print("Estimated ATE is : ",estimated_ATE)

**Answer**: **Answer**:
It means smokers have 1.57% more chances of stroke than non smokers(percentage points higher probability) 

c) Does this value reflect the actual accurate treatment effect (effect of smoking) in our population? If not, under what assumptions will this be an accurate representation of the actual ATE?

**Answer**: 

In [None]:
grader.check("q7")

**Question 8**: Does this estimated average treatment effect make sense to you or are we missing something? Explore the data further and look at the distribution of different groups of patients (i.e., patients having different values for different attributes) across the treatment and control groups. Comment on how this distribution **may** impact your observed ATE.

In [None]:
# write any code you need here

**Answer**: 

**Question 9a**: Now write code to find the estimated treatment effect separately within the different groups you explored above. What do your observations tell you? Do you think Simpson's Paradox can be seen manifesting in these observations? If you do observe Simpson's paradox, adjust for these covariates and report the conditional average treatment effect.

In [None]:
# Code here #

**Answer**: 

**Question 9b**: Now we want to explore the effect of conditioning on certain attributes on the overall estimated ATE. Condition on at least 3 attributes from the dataset *one by one* to report how the estimated ATE changes (this will require trial and error). For example, in the first step, condition on attribute *x*. In the next step, condition on some *x* and *y* together, and so on.

Report on your observations. How does the ATE change with every step? What does this tell you about the effect of these attributes on the probability of getting a stroke among smokers and non-smokers.

Hint: refer to lecture slides to see how you can adjust for covariates to find a conditional ATE.

In [None]:
# Code here #

**Answer**: 

**Question 10:** 

a) Calculate the p-value for the treatment and store it in `p_value`.

b) Comment on the statistical significance of your result. What does this p-value say about smoking and strokes? Clearly state your null and alternative hypotheses and the significance level you have chosen for your p-value. Should you reject the null hypothesis?

Note: You are allowed to use scipy for calculating the p-value.

In [None]:
# Code here #
p_value = 

**Answer**: 

In [None]:
grader.check("q10")

We now introduce a biased sample of our dataset.

In [None]:

from google.colab import drive
drive.mount('/content/drive',force_remount = True)
path = "/content/drive/MyDrive/ColabNotebooks/DA3/strokes_bias.csv"
bias_df = pd.read_csv(path)
#print(bias_df.head())
#print(bias_df.info())
print(bias_df.shape)
#print(bias_df.describe())
#print(bias_df.isnull().sum())

**Question 11:** 

a) Clean this data as you did for the previous dataset.

b) Plot estimated average treatment plots for all groups of each of the columns of `bias_df` (except the column `stroke` and any columns that have `float` type data). Hint: Use a loop and the function you made earlier.

In [None]:

# Code here # dropping id column
bias_df.drop(columns=['id'],inplace=True,  errors='ignore')
#print(bias_df.shape)

# converting all column names to lower
bias_df.columns = bias_df.columns.str.lower()

#converting all row values to lower case
bias_df = bias_df.apply(lambda x: x.astype(str).str.lower())
#print(bias_df.head())

# as my null values are 'nan' . its handling these as strings. so replacing back to NA
bias_df = bias_df.replace({'nan': pd.NA, 'na': pd.NA, 'none': pd.NA, '': pd.NA})  

#dropping unknown smoking status and resetting the index
bias_df.drop(bias_df[bias_df['smoking_status']=='unknown'].index, inplace=True)

#resetting index
bias_df.reset_index(drop=True,inplace=True)

# converting gender to binary
if not bias_pd.api.types.is_numeric_dtype(bias_df['gender']):
  bias_df['gender'] = bias_df['gender'].apply(lambda x: 0 if x=='male' else 1)

#converting ever_married to binary
if not bias_pd.api.types.is_numeric_dtype(bias_df['ever_married']):
   bias_df['ever_married'] = bias_df['ever_married'].apply(lambda x: 0 if x=='no' else 1)

def work_categories(category):

  if category=='children':
    return 0
  elif category=='govt_job':
    return 1
  elif category=='never_worked':
    return 2
  elif category=='private':
    return 3
  elif category=='self-employed':
    return 4

bias_df['work_type'] = bias_df['work_type'].apply(work_categories)

#converting residence_type to binary
if not bias_pd.api.types.is_numeric_dtype(bias_df['residence_type']):
  bias_df['residence_type'] = bias_df['residence_type'].apply(lambda x: 0 if x=='rural' else 1)

# convert smoking_status to binary
bias_df['smoking_status'] = bias_df['smoking_status'].apply(
    lambda x: 1 if (pd.notna(x) and x in ['formerly smoked','smokes']) else 0
).astype('Int64')

# dropping all rows with NULL values and resetting index
bias_df = bias_df.dropna().reset_index(drop=True)

print(bias_df.shape)
print(bias_df.head())      



In [None]:
grader.check("q11")

**Question 12:** Find the estimated average treatment effect for this dataset and store it in `bias_ATE`.

In [None]:
# Code here #

In [None]:
bias_ATE

In [None]:
grader.check("q12")

**Question 13**: What is the difference in the estimated treatment effect between the real and biased datasets? Store your answer in `bias_magnitude`

In [None]:
# Code here #
bias_magnitude = 

In [None]:
grader.check("q13")

In [None]:
grader.check_all()