# Lab One: Exploring Table Data
Team: Jack Babcock, Hayden Center, Fidelia Navar, Amory Weinzierl

### Assignment Description
You are to perform preprocessing and exploratory analysis of a data set: exploring the statistical summaries of the features, visualizing the attributes, and addressing data quality. This report is worth 10% of the final grade. Please upload a report (one per team) with all code used, visualizations, and text in a rendered Jupyter notebook. For any visualizations that cannot be embedded in the notebook, please provide screenshots of the output.

Additional information and requirements can be found at https://smu.instructure.com/courses/81978/assignments/465788

## Part I - Business Understanding

The data set that we have chosen for this lab (available at https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) consists of data that may be used to identify whether an individual is at risk of stroke.

## Part II - Data Understanding

### Data Description

#### Importing

In [None]:
import numpy as np
import pandas as pd

print('Pandas:', pd.__version__)
print('Numpy:',  np.__version__)

df = pd.read_csv('healthcare-dataset-stroke-data.csv')

df.head()

#### Formatting

To clean up the data a little bit, we're going to normalize the values of the non-numeric columns to have the same format by setting all values to lowercase and replacing spaces with underscores.

In [None]:
for c in df.columns:
    if df[c].dtype == 'object':
        df[c] = df[c].str.lower()
        
df = df.replace(' ', '_', regex=True)
        
for c in df.columns:
    if df[c].dtype == 'object':
        print(df[c].unique())

All of the columns look good except for the smoking_status column. One of the values in that column is listed as 'unknown'. Though this is technically a value, what it is actually representing is missing information, so let's make that more clear.

In [None]:
# Treat the 'unknown' placeholder as a missing value (NaN)
df['smoking_status'] = df['smoking_status'].mask(df['smoking_status'] == 'unknown', np.nan)

df.info()
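
As a small illustration of why this matters, the sketch below uses a hypothetical toy column (not the real dataset) to show that pandas only treats the entries as missing once the placeholder is replaced with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column for illustration only
status = pd.Series(['smokes', 'unknown', 'never_smoked', 'unknown'])

# Replace the placeholder string with NaN wherever the condition is True
cleaned = status.mask(status == 'unknown', np.nan)

# count() ignores NaN, so the placeholder rows are no longer counted as data
print("non-null before:", status.count())
print("non-null after: ", cleaned.count())
print("missing after:  ", cleaned.isna().sum())
```

Before the replacement, every row counts as non-null; afterwards, the two placeholder rows are reported as missing.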

Some of the categorical columns should be converted into numerical columns. Specifically, the ever_married column should be converted into a binary column like the hypertension, heart_disease, and stroke columns, and the smoking_status column should be converted into an ordinal column. We think this is a meaningful change because there is a clear way to order the values by health risk: never_smoked is 0, formerly_smoked is 1 since it is worse for your health than never smoking, and smokes is 2 since it is worse than formerly_smoked.

In [None]:
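# Encode ever_married as binary (assumes lowercased 'yes'/'no') and smoking_status as ordinal; NaN stays NaN
df['ever_married'] = df['ever_married'].map({'yes': 1, 'no': 0})
df['smoking_status'] = df['smoking_status'].map({'never_smoked': 0, 'formerly_smoked': 1, 'smokes': 2})
df.info()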
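
The ordinal encoding can be sketched on a hypothetical toy column as follows. The `order` dictionary and the sample values are illustrative assumptions; the point is that mapping through a dictionary leaves missing entries as NaN, so they can still be imputed later.

```python
import numpy as np
import pandas as pd

# Hypothetical ordering, mirroring the one described above
order = {'never_smoked': 0, 'formerly_smoked': 1, 'smokes': 2}

# Toy column with one missing entry
status = pd.Series(['smokes', np.nan, 'never_smoked', 'formerly_smoked'])

# Values found in the dictionary are replaced by their rank;
# NaN is not a key, so it stays NaN
encoded = status.map(order)

print(encoded.tolist())
print("missing:", encoded.isna().sum())
```

Because the column still contains NaN, the encoded values are stored as floating point numbers rather than integers.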

In [None]:
# Remove rows whose gender is recorded as 'other'
df.drop(df[df.gender == 'other'].index, inplace=True)

### Data Quality

#### Duplicate Values

The first thing we'll want to do to check the quality of the data is to check for duplicates. First, let's make sure there are no duplicate IDs in the dataset.

In [None]:
if df['id'].is_unique:
    print("No duplicate IDs")
else:
    print("Duplicate IDs found")

Now that we know there are no duplicate IDs, let's check if there are any rows with identical values (excluding the ID).

In [None]:
cols = df.columns.drop('id')

s = df.duplicated(subset=cols, keep='first')

s[s]

The dataset appears to have no exact duplicates. Since no two rows share identical values, we feel safe assuming that each entry in the dataset is unique.
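
For reference, here is a minimal sketch of how this check behaves when a duplicate does exist. The toy frame and its values are hypothetical; only the `duplicated(subset=..., keep='first')` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 differ only in their id
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'age': [40, 40, 52],
    'bmi': [22.5, 22.5, 30.1],
})

# Compare rows on every column except the identifier
feature_cols = toy.columns.drop('id')

# keep='first' flags the second and later copies, not the first occurrence
dupes = toy.duplicated(subset=feature_cols, keep='first')

print(toy[dupes])
```

Here only the row with id 2 is flagged, since it repeats the feature values of the row with id 1.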

#### Missing Values

The second thing to check the dataset for is missing values. We can see them by checking df.info().

In [None]:
df.info()

This shows us that we're missing data from two columns: smoking_status and bmi. Now let's take a look at both of the columns with missing data and see if we want to impute or delete them.
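
A per-column count of missing entries gives the same information more directly than reading the non-null counts from `df.info()`. The sketch below uses a hypothetical toy frame rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    'age': [40.0, 61.0, 52.0],
    'bmi': [22.5, np.nan, 30.1],
    'smoking_status': [0.0, np.nan, np.nan],
})

# isna() marks each missing cell; sum() counts them per column
missing = toy.isna().sum()

# Show only the columns that actually have missing values
print(missing[missing > 0])
```

Applied to the real data, the same two lines would list only the columns that need attention.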

##### BMI

Let's first look at the initial data from the column.

In [None]:
df.bmi.describe()

Now let's attempt to impute the bmi column using KNNImputer with the numerical and boolean columns of the dataset.

In [None]:
from sklearn.impute import KNNImputer

# Numerical and boolean columns used to find neighbors (includes the column being imputed)
impute_cols = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi', 'stroke']

knn = KNNImputer(n_neighbors=3)

# Work on a copy so the original df stays available for comparison
df_imputed = df.copy()
df_imputed[impute_cols] = knn.fit_transform(df[impute_cols].to_numpy())

# Round the imputed values to one decimal place
df_imputed['bmi'] = df_imputed['bmi'].round(1)

print("----- Original -----")
print(df.bmi.describe())
print("----- Imputed ------")
print(df_imputed.bmi.describe())
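
To make the neighbor averaging concrete, here is a minimal sketch on a hypothetical two-column array (age and bmi, with made-up values). With the default uniform weights, `KNNImputer` fills a gap with the mean of that feature over the nearest rows, where distance is measured on the features both rows have.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data, columns: age, bmi
X = np.array([
    [30.0, 22.0],
    [32.0, 24.0],
    [34.0, 29.0],
    [70.0, 35.0],
    [33.0, np.nan],   # bmi is missing for this row
])

imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X)

# The three rows closest in age (30, 32, 34) supply the bmi values 22, 24, 29,
# and their mean is used to fill the gap
print(X_imputed[-1])
```

Because each filled value is an average of observed values, it always lies within the range of its neighbors, which tends to pull imputed points toward the middle of the distribution.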

From the mean and five-number summary of the column before and after imputation, we see very little difference. The imputer seems to have placed slightly more datapoints just above the mean and in a tighter grouping. Let's now visualize the imputation using a histogram.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline 

df_imputed.bmi.plot(kind='hist', alpha=0.5, label="imputed",bins=100)
df.bmi.plot(kind='hist', alpha=0.5, label="original",bins=100)
plt.legend()
plt.show()

The imputed distribution closely follows the original in this visualization, so we consider the imputation successful and have decided to use the imputed data for our visualizations.

In [None]:
df = df_imputed
df.info()

##### Smoking Status

Let's first look at the initial data from the column.

In [None]:
df.smoking_status.describe()

Now let's attempt to impute the smoking_status column using KNNImputer with the numerical and boolean columns of the dataset.

In [None]:
from sklearn.impute import KNNImputer

# Same neighbor features as before, with smoking_status in place of bmi
impute_cols = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'smoking_status', 'stroke']

knn = KNNImputer(n_neighbors=3)

# Work on a copy so the original df stays available for comparison
df_imputed = df.copy()
df_imputed[impute_cols] = knn.fit_transform(df[impute_cols].to_numpy())

# Round the neighbor average to the nearest ordinal level (0, 1, or 2)
df_imputed['smoking_status'] = df_imputed['smoking_status'].round().astype(int)

print("----- Original -----")
print(df.smoking_status.describe())
print("----- Imputed ------")
print(df_imputed.smoking_status.describe())

In [None]:
# matplotlib was imported and configured in the BMI histogram cell above
df_imputed.smoking_status.plot(kind='hist', alpha=0.5, label="imputed", bins=3)
df.smoking_status.plot(kind='hist', alpha=0.5, label="original", bins=3)
plt.legend()
plt.show()

This imputation seems to be much less successful than the previous one: it assigned significantly more never_smoked and formerly_smoked values than smokes values. However, taking into account both how large the population with missing data is and real-world factors that may influence this distribution, we have decided to use this imputed data as well. Regarding the real-world factors, it seems plausible that smokers are more likely to be recorded as smokers because of the relevance of smoking to general health, whereas former smokers or non-smokers might not have that data recorded as often.
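
One way to quantify this shift, beyond the histogram, is to compare the share of each category before and after imputation. The sketch below uses hypothetical toy values (not results from the dataset) with the 0/1/2 encoding described earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns: two missing entries are filled with 0 and 1
original = pd.Series([0.0, 0.0, 1.0, 2.0, np.nan, np.nan])
imputed = pd.Series([0.0, 0.0, 1.0, 2.0, 0.0, 1.0])

# value_counts(normalize=True) ignores NaN and returns each category's share
shares = pd.DataFrame({
    'original': original.value_counts(normalize=True),
    'imputed': imputed.value_counts(normalize=True),
}).sort_index()

print(shares)
```

In this toy case the share of category 2 drops after imputation because none of the filled values land there, which is the same kind of shift described above.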

## Part III - Data Visualization

## Part IV - Exceptional Work