## Exploratory Data Analysis of Stroke Data
By Mahfuz Miah, September 5, 2019

## 1. Synopsis
For this project, I performed an exploratory analysis of stroke data. The dataset was found on Kaggle, uploaded by user SaumyaAgarwal, and can be retrieved at this link: [Stroke Data](https://www.kaggle.com/asaumya/healthcare-dataset-stroke-data). The copy used for the code below was downloaded on September 5, 2019.

The goal of our project was to investigate the relationship between the incidence of stroke and the other information recorded about a patient. In this dataset, we have data about gender, age, hypertension, heart disease, marriage status, work type, residence type, average glucose level, BMI, smoking status and incidence of stroke. Some of these are categorical and some (age, glucose level, BMI) are numeric. We are curious to see how these different measures relate to stroke incidence.

## 2. Background
Here are the definitions of the columns of the data:

- id: Patient ID
- gender: Gender of patient
- age: Age of patient
- hypertension: 0 = no hypertension, 1 = suffering from hypertension
- heart_disease: 0 = no heart disease, 1 = suffering from heart disease
- ever_married: Yes/No
- work_type: Type of occupation
- Residence_type: Area type of residence (Urban/Rural)
- avg_glucose_level: Average glucose level (measured after meal)
- bmi: Body mass index
- smoking_status: Patient's smoking status
- stroke: 0 = no stroke, 1 = suffered stroke

## 3. Understanding and Preparing the Data Set

In [2]:
# import all necessary libraries
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import statistics
import seaborn as sns


In [3]:
# Set up the data

filename = 'train_2v.csv'
def setdata(filename):
    df = pd.read_csv(filename)
    # Strip whitespace and lowercase the column names so that they are easy to manage.
    df.columns = df.columns.str.strip().str.lower()

    # Replace spaces and slashes with underscores and drop parentheses.
    # regex=False treats '(' and ')' as literal characters rather than regex syntax.
    df.columns = (df.columns.str.replace(' ', '_', regex=False)
                            .str.replace('/', '_', regex=False)
                            .str.replace('(', '', regex=False)
                            .str.replace(')', '', regex=False))

    return df

df = setdata(filename)
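
As a quick check of what the renaming steps in `setdata` do, here is a minimal sketch that applies the same chain to a small in-memory CSV. The header names are made up for illustration (they are not the headers of the Kaggle file), and `regex=False` is passed so that the parentheses are handled as literal characters.

```python
import io
import pandas as pd

# Hypothetical messy headers, invented for illustration only.
raw = io.StringIO(" Residence_type ,Avg Glucose (Level),Yes/No\nUrban,100.5,1\n")
demo = pd.read_csv(raw)

# Same normalization chain as setdata(): strip, lowercase, then replace characters.
cols = demo.columns.str.strip().str.lower()
cols = (cols.str.replace(' ', '_', regex=False)
            .str.replace('/', '_', regex=False)
            .str.replace('(', '', regex=False)
            .str.replace(')', '', regex=False))
demo.columns = cols

print(list(demo.columns))
```

Doing the renaming once at load time means every later cell can refer to columns by a predictable lowercase name such as `residence_type`.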

In [4]:
# What's inside the file? Let's take a preview.
def whats_inside(df):
    print("Column values in dataframe: ", list(df.columns.values)) 
    print(df.describe())
    
whats_inside(df)

Column values in dataframe:  ['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke']
                 id           age  hypertension  heart_disease  \
count  43400.000000  43400.000000  43400.000000   43400.000000   
mean   36326.142350     42.217894      0.093571       0.047512   
std    21072.134879     22.519649      0.291235       0.212733   
min        1.000000      0.080000      0.000000       0.000000   
25%    18038.500000     24.000000      0.000000       0.000000   
50%    36351.500000     44.000000      0.000000       0.000000   
75%    54514.250000     60.000000      0.000000       0.000000   
max    72943.000000     82.000000      1.000000       1.000000   

       avg_glucose_level           bmi        stroke  
count       43400.000000  41938.000000  43400.000000  
mean          104.482750     28.605038      0.018041  
std            43.111751      7.770020      0.13310

In [5]:
# Count the unique values in each column of the dataframe.
# Note: nunique() ignores missing values (NaN) by default.
def uniqueval(df):
    for column_name in df.columns:
        print(f"There are {df[column_name].nunique()} unique values in column '{column_name}'.")
uniqueval(df)

There are 43400 unique values in column 'id'.
There are 3 unique values in column 'gender'.
There are 104 unique values in column 'age'.
There are 2 unique values in column 'hypertension'.
There are 2 unique values in column 'heart_disease'.
There are 2 unique values in column 'ever_married'.
There are 5 unique values in column 'work_type'.
There are 2 unique values in column 'residence_type'.
There are 12543 unique values in column 'avg_glucose_level'.
There are 555 unique values in column 'bmi'.
There are 3 unique values in column 'smoking_status'.
There are 2 unique values in column 'stroke'.
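
The counts above say how many levels each column has, but not what those levels are. One possible follow-up, sketched here on a tiny made-up frame rather than on the real data, is `value_counts`, which lists each level with its frequency; passing `dropna=False` also shows missing entries as their own row. All values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from train_2v.csv.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other", "Female"],
    "smoking_status": ["smokes", np.nan, "never smoked", "never smoked", np.nan],
})

for column_name in toy.columns:
    # dropna=False keeps NaN as a separate row in the counts.
    counts = toy[column_name].value_counts(dropna=False)
    print(f"--- {column_name} ---")
    print(counts)
```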


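Our dataframe has 43400 entries, and several of the categorical columns are binary. In hypertension, heart_disease and stroke, 1 means yes and 0 means no; ever_married stores Yes/No, and residence_type has two levels (Urban/Rural). The bmi column has only 41938 non-null values, so 1462 entries are missing.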
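
Missing values and class balance can be checked directly. The following is a minimal sketch on toy data (invented values, not the real file): `isna().sum()` counts missing entries per column, and the mean of a 0/1 column is the share of rows equal to 1. Read that way, the `stroke` mean of 0.018041 in the `describe()` output corresponds to roughly 1.8% of patients.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from train_2v.csv.
toy = pd.DataFrame({
    "bmi": [22.5, np.nan, 31.0, 27.4, np.nan],
    "stroke": [0, 0, 1, 0, 0],
})

# Number of missing entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# For a 0/1 column, the mean equals the fraction of rows equal to 1.
stroke_share = toy["stroke"].mean()
print(f"Share of rows with stroke == 1: {stroke_share:.1%}")
```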

## 4. Feature Engineering

## 5. Exploratory Data Analysis

#### Pros and Cons of the Dataset:
Pros:
- Data is relatively clean and easy to understand.
- It has a variety of categories one can consider when investigating stroke.
- Analysis of the dataset replicates views present in the literature regarding characteristics that correlate with stroke incidence.

Cons:
- We don't know where the dataset originates (whether it is from a particular state or country, when it was collected, etc.).
- We don't know whether there was any bias in collecting this data; this ties in with not knowing the origin of the dataset.
- We don't have the number of strokes per patient; this prevents us from asking whether some characteristics are predictive of repeat strokes.
- Ethnicity and socioeconomic situation are not listed; these may be contributing (or, in this dataframe's case, confounding) variables to consider.
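
The kind of comparison behind "characteristics that correlate with stroke incidence" can be sketched as a per-group stroke rate. This is only an illustration on invented toy data, not a result from the dataset; it assumes a 0/1 `stroke` column and a categorical column such as `hypertension`, as described in the Background section.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from train_2v.csv.
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 1, 1, 0],
})

# Stroke rate (mean of the 0/1 outcome) and group size per hypertension level.
summary = toy.groupby("hypertension")["stroke"].agg(stroke_rate="mean", n="count")
print(summary)
```

Reporting the group size next to the rate matters here because stroke is rare in the dataset, so a rate computed on a small group can be unstable.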


## 6. Future Research Plan
- An analysis that highlights the experimental hypothesis.
- A rollout plan showing how to implement and roll out the experiment.
- An evaluation plan showing what constitutes success in this experiment.

### 6a. Analysis and Hypothesis
We note that increased *blah - fill later* leads to increases in stroke incidence. We hypothesize that giving patients low dose metformin on a daily basis can reduce *blah - fill later* and therefore reduce incidence of stroke.

See link for further background: https://www.ncbi.nlm.nih.gov/pubmed/24119365

### 6b. Rollout Plan
Fifty patients aged 40-50 who have (x range) of (y measure) will be divided into two groups: 25 participants in the control group and 25 participants in the experimental group (the A and B groups, in line with A/B testing).
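
The plan does not say how participants are assigned to the two groups. One common option is simple random assignment, sketched below as an assumption rather than as the chosen method. The participant IDs and the seed value are hypothetical.

```python
import numpy as np

# Hypothetical participant IDs 1..50; real IDs would come from enrollment.
participant_ids = np.arange(1, 51)

# Fixed seed (an arbitrary choice) so the assignment can be reproduced.
rng = np.random.default_rng(seed=2019)
shuffled = rng.permutation(participant_ids)

# First 25 shuffled IDs form group A (control), the remaining 25 group B (experimental).
control_group = np.sort(shuffled[:25])
experimental_group = np.sort(shuffled[25:])

print("Control (A):", control_group)
print("Experimental (B):", experimental_group)
```

Shuffling once and splitting guarantees two groups of exactly 25 with no participant in both, which independent coin flips per participant would not.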

Patients in the control group will be given a placebo pill to 



### 6c. Evaluation Plan
