## DS 3001 Project 2 - Divya Kuruvilla and Glory Gurrola
This project builds predictive models that estimate the likelihood that a person has a stroke.
Our strategy was to first prepare and clean the variables in the data set and then build several kinds of models: linear models, k-means clustering models, and decision trees. The model with the lowest RMSE on the testing data is selected as the final model.
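
Since model selection rests on test-set RMSE, a minimal sketch of that comparison may help make the criterion concrete. The `rmse` helper, the model names, and the toy labels and predictions below are hypothetical placeholders, not results from this project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical test labels and predictions from two candidate models
y_test_example = [0, 0, 1, 0, 1]
candidates = {
    "model_a": [0.1, 0.2, 0.7, 0.0, 0.4],
    "model_b": [0.0, 0.0, 0.0, 0.0, 0.0],
}

# Score every candidate on the same test labels and keep the lowest RMSE
scores = {name: rmse(y_test_example, preds) for name, preds in candidates.items()}
best_model = min(scores, key=scores.get)
print(scores, best_model)
```

Scoring every candidate against the same test labels keeps the comparison fair across model types.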

### Reading in the Data

In [5]:
import numpy as np
import pandas as pd 

# Reading in the Training and Testing Data 
test_df = pd.read_csv('./data/testing_data.csv')
train_df = pd.read_csv('./data/training_data.csv')

### Variables
The data set for this project contains 12 variables, described below:

- age: Patient age, numeric
- avg_glucose_level: Blood sugar levels, numeric
- bmi: Body mass index, numeric
- ever_married: Ever married, dummy/character (Yes, No)
- gender: Male, Female, or Other, character
- heart_disease: Has heart disease, dummy
- hypertension: Has hypertension, dummy
- id: Study identification number
- Residence_type: Type of residence, dummy/character (Urban, Rural)
- smoking_status: Former, never, or current smoker, categorical
- work_type: Employment type, categorical: never worked (Never_worked), homemaker ("children"), public sector employment (Govt_job), private sector employment (Private), self-employed (Self-employed)
- stroke: Suffered a stroke in the sample period

In [3]:
# Look at the Data 
test_df.head()

Unnamed: 0.1,Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,251,30468,Male,58.0,1,0,Yes,Private,Urban,87.96,39.2,never smoked,0
1,252,16523,Female,8.0,0,0,No,Private,Urban,110.89,17.6,Unknown,0
2,253,56543,Female,70.0,0,0,Yes,Private,Rural,69.04,35.9,formerly smoked,0
3,255,32257,Female,47.0,0,0,Yes,Private,Urban,210.95,50.1,Unknown,0
4,259,28674,Female,74.0,1,0,Yes,Self-employed,Urban,205.84,54.6,never smoked,0


The output of "test_df.head()" suggests a few cleaning steps. The "id" variable is only a study identification number and "Unnamed: 0" is a leftover row index, so both can be dropped. To keep the variable names uniform, "Residence_type" was renamed to lowercase to match the other variables.

In [6]:
# Clean Data as noted above 

# Drop "id" and "Unnamed: 0" columns
test_df = test_df.drop(columns=['id', 'Unnamed: 0'])
train_df = train_df.drop(columns=['id', 'Unnamed: 0'])

# Rename "Residence_type" to be "residence_type"
test_df = test_df.rename(columns={'Residence_type':'residence_type'})
train_df = train_df.rename(columns={'Residence_type':'residence_type'})

In [9]:
# Look at the Data again after First Round of Cleaning
test_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,58.0,1,0,Yes,Private,Urban,87.96,39.2,never smoked,0
1,Female,8.0,0,0,No,Private,Urban,110.89,17.6,Unknown,0
2,Female,70.0,0,0,Yes,Private,Rural,69.04,35.9,formerly smoked,0
3,Female,47.0,0,0,Yes,Private,Urban,210.95,50.1,Unknown,0
4,Female,74.0,1,0,Yes,Self-employed,Urban,205.84,54.6,never smoked,0


In [10]:
# Look for Missing Values
missing_train = train_df.isnull().sum()
missing_test = test_df.isnull().sum()
print(missing_train)
print(missing_test)

gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
residence_type         0
avg_glucose_level      0
bmi                  159
smoking_status         0
stroke                 0
dtype: int64
gender                0
age                   0
hypertension          0
heart_disease         0
ever_married          0
work_type             0
residence_type        0
avg_glucose_level     0
bmi                  42
smoking_status        0
stroke                0
dtype: int64


The check shows that the only missing values are in the "bmi" variable: 159 in the training data and 42 in the testing data. To resolve this, the NaN values can be imputed with the mean. Dropping the rows with missing values would discard information in the other variables that may be useful later. Imputing the mean keeps the data set at its full size and can help limit the bias that missing values introduce, provided the missing data is randomly distributed.

In [11]:
# Handle the NaN for "bmi" as noted above by imputing the mean 
train_df['bmi'] = train_df['bmi'].fillna(train_df['bmi'].mean())
test_df['bmi'] = test_df['bmi'].fillna(test_df['bmi'].mean())
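
The cell above fills each data set with its own mean. A common alternative is to compute the fill value from the training data only and reuse it for the testing data, so that no information from the test set enters the preprocessing. The sketch below illustrates that variant on hypothetical toy frames (`train_toy`, `test_toy`); it is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train_df and test_df
train_toy = pd.DataFrame({"bmi": [20.0, 30.0, np.nan, 40.0]})
test_toy = pd.DataFrame({"bmi": [25.0, np.nan]})

# Compute the fill value from the training data only (mean() skips NaN by default)
train_bmi_mean = train_toy["bmi"].mean()

# Reuse the same training mean for both data sets
train_toy["bmi"] = train_toy["bmi"].fillna(train_bmi_mean)
test_toy["bmi"] = test_toy["bmi"].fillna(train_bmi_mean)

print(train_toy["bmi"].tolist())
print(test_toy["bmi"].tolist())
```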

In [12]:
# Do a Final Check and look for Missing Values
missing_train = train_df.isnull().sum()
missing_test = test_df.isnull().sum()
print(missing_train)
print(missing_test)

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64


### Splitting the Data 
The training and testing data can now be split into X and y data sets. Since the goal is to predict the likelihood a person has a stroke, the "stroke" variable is the target variable.

In [None]:
y_train = train_df['stroke']
y_test = test_df['stroke']
X_train = train_df.drop(columns='stroke')
X_test = test_df.drop(columns='stroke')

Additionally, the numeric and categorical columns are listed separately so they are easy to select when the linear models are built.

In [None]:
numeric = ['age', 'avg_glucose_level', 'bmi', 'hypertension', 'heart_disease']
categorical = ['ever_married', 'gender', 'residence_type', 'smoking_status', 'work_type']
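
One way the categorical list could feed a linear model is one-hot encoding. The sketch below uses `pd.get_dummies` on hypothetical toy frames and then reindexes the test columns to match the training columns, so both matrices have the same shape. The toy frames and the two-column `categorical_toy` list are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical subset of the categorical columns
categorical_toy = ["gender", "work_type"]

# Hypothetical toy frames standing in for X_train and X_test
X_train_toy = pd.DataFrame({
    "age": [58.0, 8.0, 70.0],
    "gender": ["Male", "Female", "Other"],
    "work_type": ["Private", "children", "Private"],
})
X_test_toy = pd.DataFrame({
    "age": [47.0, 74.0],
    "gender": ["Female", "Female"],
    "work_type": ["Private", "Self-employed"],
})

# One-hot encode each frame; numeric columns pass through unchanged
X_train_enc = pd.get_dummies(X_train_toy, columns=categorical_toy, dtype=int)
X_test_enc = pd.get_dummies(X_test_toy, columns=categorical_toy, dtype=int)

# Align test columns to the training columns: levels missing from the test
# data are filled with 0, and levels unseen in training are dropped
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

print(X_train_enc.columns.tolist())
print(X_test_enc)
```

For a linear model with an intercept, one level per categorical variable would typically be dropped after alignment to avoid perfectly collinear columns.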

### Graphs and Visualizations
- Summarize Data and Visualize with KDE and histograms
- Address Outliers
- Explain quantitative features of data
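
As a starting point for the outlier item above, one possible approach is the 1.5 × IQR rule, sketched here on a hypothetical toy column. The helper name `iqr_outlier_mask`, the multiplier `k`, and the toy values are assumptions, and the rule is only one of several reasonable choices.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical toy values standing in for a numeric column
toy = pd.DataFrame(
    {"avg_glucose_level": [70.0, 80.0, 85.0, 90.0, 95.0, 100.0, 300.0]}
)

mask = iqr_outlier_mask(toy["avg_glucose_level"])
n_outliers = int(mask.sum())

# describe() gives count, mean, std, min, quartiles, and max in one call
summary = toy["avg_glucose_level"].describe()
print(summary)
print(n_outliers)
```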

### Linear Model - Numeric

### Linear Model - Categorical

### Linear Model - Combined Model (Numeric and Categorical Columns)

### Polynomial Expansion

### Decision Trees

### Analysis 
- Talk about research strategy
- Main findings (summarize tables/plots/statistics)