# TITLE

## Final Project Submission

Please fill out:
* __Student name:__ Cassarra Groesbeck
* __Student pace:__ Part Time/ Flex
* __Scheduled project review date/time:__ 
* __Instructor name:__ 
* __Blog post URL:__



# 1. Introduction 

## 1a. Objectives

## 1b. Business Understanding

# 2. Data Understanding
This data contains 5110 observations with 12 attributes (11 clinical features) for predicting stroke events. 


## 2a. Attribute Information
| Column     | Description   |
|------------|:--------------|
| `id`               | **unique identifier**  |
| `gender`           | **"Male", "Female" or "Other"**  |
| `age`              | **age of the patient** |
| `hypertension`     | **0 if the patient doesn't have hypertension, 1 if the patient has hypertension**  |
| `heart_disease`    | **0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease**   |
| `ever_married`     | **"No" or "Yes"**  |
| `work_type`        | **"children", "Govt_job", "Never_worked", "Private" or "Self-employed"**   |
| `Residence_type`   | **"Rural" or "Urban"**  |
| `avg_glucose_level`| **average glucose level in blood**  |
| `bmi`              | **body mass index** |
| `smoking_status`   | **"formerly smoked", "never smoked", "smokes" or "Unknown"***  |
| `stroke`           | **1 if the patient had a stroke or 0 if not**  |
|    **_*Note:_**      | _"Unknown" in_ `smoking_status` _means that the information is unavailable for this patient_ |


## 2b. Acknowledgements
Data comes from the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset), hosted on [Kaggle](https://www.kaggle.com).

# 3. Imports

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
#plt.style.use('seaborn')
sns.set_style('darkgrid', {'axes.facecolor': '0.9', "grid.color": ".6", "grid.linestyle": ":"})
%matplotlib inline

from imblearn.over_sampling import SMOTE, SMOTENC
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.combine import SMOTETomek

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, plot_confusion_matrix,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report, plot_roc_curve, auc

import warnings
warnings.filterwarnings('ignore')

# 4. Exploring the data

In [3]:
data = pd.read_csv('Data/healthcare-dataset-stroke-data.csv')

In [3]:
# Visual check of raw df
data.head()

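|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|----|--------|-----|--------------|---------------|--------------|-----------|----------------|-------------------|-----|----------------|--------|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |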


In [4]:
# Drop 'id' column
data = data.drop('id', axis=1)

In [5]:
# Identify Target Feature
target = 'stroke'

In [6]:
data.shape

(5110, 11)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB


In [8]:
data.isnull().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
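
Only `bmi` has missing values: 201 of 5110 rows, or roughly 3.9%. As a minimal sketch of how counts and percentages could be reported side by side, the block below uses a small hypothetical frame (`toy`) in place of `data`; the column names come from the dataset, but the rows and the `missing` summary table are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; only the column names come from the notebook.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# isnull() gives a boolean frame: sum() counts the True values per column,
# mean() gives the fraction of True values per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)
```

Running the same two expressions on `data` should reproduce the count of 201 for `bmi` shown above.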

In [24]:
print('-'*30)
print('Distribution of Target Feature')
print('-'*30)
print('COUNTS:')
print(data[target].value_counts())
print('- '*15)
print('PERCENTAGES:')
# Iterate over (class label, fraction) pairs so the printed label is the
# actual class value rather than a positional index
for label, fraction in data[target].value_counts(normalize=True).items():
    print(f'{label}\t{fraction*100:.4}%')
print(f'Name: {target}, dtype: int64') # just for symmetry
print('-'*30)

------------------------------
Distribution of Target Feature
------------------------------
COUNTS:
0    4861
1     249
Name: stroke, dtype: int64
- - - - - - - - - - - - - - - 
PERCENTAGES:
0	95.13%
1	4.873%
Name: stroke, dtype: int64
------------------------------
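
With roughly 95% of rows in class 0 and 5% in class 1, a random split could leave very few stroke cases in one partition. One common option is a stratified split, sketched below on synthetic arrays. The data, `test_size=0.2`, and `random_state=42` are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 3 features, 5% positive class (assumed values).
rng = np.random.default_rng(42)
n_rows = 1000
X = rng.normal(size=(n_rows, 3))
y = np.zeros(n_rows, dtype=int)
y[:50] = 1

# stratify=y keeps the class proportions (approximately) equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"train positive rate: {y_train.mean():.3f}")
print(f"test positive rate:  {y_test.mean():.3f}")
```

The same `stratify` argument could be passed when splitting `data` on `target`, so that the minority class is represented in both the training and test sets.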


In [20]:
show = data.drop(target, axis=1)

print("-"*30)
print('Distribution of Other Features')
print("-"*30)
for column in show.columns:
    n_unique = len(show[column].unique())  # unique() counts NaN as a value
    print("-"*30)
    print(f"UNIQUE VALUES: {n_unique}")
    if n_unique <= 5:
        # Low-cardinality feature: show the count of each value
        print("- "*15)
        print(show[column].value_counts())
    elif show[column].dtype == 'float64':
        # Continuous feature: show summary statistics instead
        print("- "*15)
        print(f'\t\t  MIN: {show[column].min()}')
        print(f'\t\t  MEAN: {round(show[column].mean())}')
        print(f'\t\t  MAX: {show[column].max()}')
        print(f'Name: {column}, dtype: float64')
    print("-"*30)

------------------------------
Distribution of Other Features
------------------------------
------------------------------
UNIQUE VALUES: 3
- - - - - - - - - - - - - - - 
Female    2994
Male      2115
Other        1
Name: gender, dtype: int64
------------------------------
------------------------------
UNIQUE VALUES: 104
- - - - - - - - - - - - - - - 
		  MIN: 0.08
		  MEAN: 43.0
		  MAX: 82.0
Name: age, dtype: float64
------------------------------
------------------------------
UNIQUE VALUES: 2
- - - - - - - - - - - - - - - 
0    4612
1     498
Name: hypertension, dtype: int64
------------------------------
------------------------------
UNIQUE VALUES: 2
- - - - - - - - - - - - - - - 
0    4834
1     276
Name: heart_disease, dtype: int64
------------------------------
------------------------------
UNIQUE VALUES: 2
- - - - - - - - - - - - - - - 
Yes    3353
No     1757
Name: ever_married, dtype: int64
------------------------------
------------------------------
UNIQUE VALUES: 5
- 

# 5. Functions

## 4a. Visualizing the raw data

_NOTES: Not sure if I should include this, or where. I will first need to split and transform my data before doing any visualizations._

Start with a pairplot. 

Note: by default, `sns.pairplot` plots only the numeric (non-object) features.
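
To check in advance which columns count as numeric, `select_dtypes` can list them. This is a small sketch on a hypothetical frame (`toy`) that borrows a few column names from the dataset; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `data` with a mix of text and numeric columns.
toy = pd.DataFrame({
    "gender": ["Male", "Female"],
    "age": [67.0, 61.0],
    "hypertension": [0, 0],
    "stroke": [1, 1],
})

# include="number" keeps integer and float columns and drops text columns.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```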

In [None]:
sns.pairplot(data, plot_kws={'linewidth':0.1}, diag_kws={'linewidth':0.1, 'alpha':1});

The pairplot reveals little beyond showing that `stroke`, `heart_disease`, and `hypertension` have similar distributions: each is a binary feature dominated by the 0 class.

I will replot these three features.

In [None]:
# data
# select from `data` directly; `num` is not defined at this point in the notebook
plots = data[['stroke', 'heart_disease', 'hypertension']]
# setup
fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(10, 3))
fig.set_tight_layout(True)
# plot
for i, col in enumerate(plots.columns):
    sns.histplot(data=plots[col], ax=axes[i], linewidth=0.1, alpha=1, binwidth=.5).set_ylim(0, 5000)
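
To put numbers on the three histograms, the positive rate of each binary feature can be computed from the counts printed in section 4 (249, 276, and 498 positives out of 5110 rows). The dictionary below simply restates those counts; the variable names are illustrative.

```python
# Positive counts taken from the value_counts() output earlier in the notebook.
positive_counts = {"stroke": 249, "heart_disease": 276, "hypertension": 498}
n_rows = 5110

# Fraction of rows where each binary feature equals 1.
positive_rates = {name: count / n_rows for name, count in positive_counts.items()}

for name, rate in positive_rates.items():
    print(f"{name}: {rate:.2%}")
# stroke: 4.87%
# heart_disease: 5.40%
# hypertension: 9.75%
```

All three features are rare events, though hypertension is about twice as common as stroke. On the full frame, `data[['stroke', 'heart_disease', 'hypertension']].mean()` should give the same fractions.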

NOTES GO HERE

In [28]:
"""
# separate into numeric and categorical 
cat = data.select_dtypes(include=object)
num = data.select_dtypes(exclude=object)

# plot
plots = num.drop(target, axis=1)

fig, axes = plt.subplots(ncols=5, nrows=1, figsize=(15, 3))
fig.set_tight_layout(True)

for index, col in enumerate(plots.columns):
    ax = axes[index]#//3][index%3]
    sns.regplot(x = col, y = target, data = num, ax=ax, line_kws={"color": "tab:red"})
    ax.set_xlabel(col)
    ax.set_ylabel("Stroke")
"""
print("Probably won't use this but I do not want to delete yet.")

Probably won't use this but I do not want to delete yet.
