# TM10007 Assignment template -- ECG data

## Data loading and cleaning

Below are functions to load the dataset of your choice. After that, it is all up to you to create and evaluate a classification method. Beware, there may be missing values in these datasets. Good luck!

In [60]:
import os
import zipfile
import math
import statistics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
from scipy.stats import lognorm, shapiro
from sklearn import datasets as ds

# Classifiers
from sklearn import model_selection
from sklearn import metrics
from sklearn import feature_selection 
from sklearn import preprocessing
from sklearn import neighbors
from sklearn import svm

cwd = os.getcwd() # This fn will return the Current Working Directory

zip_path = os.path.join(cwd, 'ecg', 'ecg_data.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(os.path.join(cwd, 'ecg'))

data_path = os.path.join(cwd, 'ecg', 'ecg_data.csv')
data = pd.read_csv(data_path, index_col=0)

## Exploring data
- How many people have a normal ECG?
- How many people have an abnormal ECG?
- Is there any missing data?
- Are there outliers? 
- Is the data normally distributed?
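
The outlier question can be approached in several ways. The block below is a minimal sketch of one common option, the interquartile-range (IQR) rule, run on hypothetical toy data rather than the ECG table. The names `toy` and `count_iqr_outliers` and the factor `k = 1.5` are assumptions made for illustration, not part of the assignment template.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the ECG feature table
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f0", "f1", "f2"])
toy.loc[0, "f0"] = 50.0  # plant one obvious outlier


def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()


outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

On the real data the same function could be called on `x`. Because the rule ignores NaN when computing quantiles, it can run before or after imputation, though the counts will differ.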


In [62]:
# split labels from data
x = data.loc[:, data.columns != 'label']  # everything except the label
y = data['label']  # labels

# normal / abnormal ECGs
total_abnormal_ECG = np.count_nonzero(y)  # current dataset has 146 nonzeros
total_normal_ECG = y.size - np.count_nonzero(y)  # current dataset has 681 zeros
percentage_abnormal = total_abnormal_ECG / (total_abnormal_ECG + total_normal_ECG) * 100  # 17.65 %

# Missing data
x = x.replace(0, np.nan)  # treat every zero as a missing value (NaN)
nan_count = x.isna().sum().sum()  # count missing data -> 10500 in our dataset

# Outliers



ShapiroResult(statistic=nan, pvalue=1.0)


## Missing data
- Remove features if a lot of their data is missing (or replace all missing entries with a single value)
- Remove samples (in this case patients) if a lot of their data is missing
- Use imputation to generate values that fill in the missing data
    - Fill with the mean or median of the feature.
    - Fill with the value from a random other sample.
    - Fill with the most frequent value (good for categorical features).
    - Use regression or machine learning to estimate the missing value.
    - We need to make a choice
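
To make the choice between fill strategies more concrete, here is a small sketch comparing two of the options listed above using scikit-learn's `SimpleImputer`. The toy frame and the variable names are hypothetical. The strategies `"median"` and `"most_frequent"` are existing options of `SimpleImputer`, but this is only an illustration, not necessarily the approach this notebook ends up using.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with a few missing entries and one extreme value
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 100.0],
    "b": [2.0, 2.0, np.nan, 5.0],
})

median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns the fill value per column and applies it
filled_median = pd.DataFrame(median_imputer.fit_transform(toy), columns=toy.columns)
filled_mode = pd.DataFrame(mode_imputer.fit_transform(toy), columns=toy.columns)

print(filled_median)
print(filled_mode)
```

The median is less sensitive to the extreme value in column `a` than the mean would be. To avoid leaking information from the test set, an imputer like this would normally be fitted on the training data only and then applied to the test data with `transform`.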


In [68]:
# Delete missing data when > --% of a feature or sample is missing (for now: only when everything is missing)
x = x.dropna(axis='columns', how='all')  # drop a feature (column) if all of its values are missing
x = x.dropna(axis='rows', how='all')  # drop a patient (row) if all of its values are missing

# Replace the remaining missing values with the median of each feature
x = x.fillna(x.median(numeric_only=True))

# Normally distributed
stat = []
p = []
for col in x.columns:
    if x[col].dtype == 'float64' or x[col].dtype == 'int64':
        s, pv = shapiro(x[col])
        stat.append(s)
        p.append(pv)
    else:
        stat.append(None)
        p.append(None)

# create a new dataframe to store the results
results = pd.DataFrame({'Column': x.columns, 'W': stat, 'p-value': p}) 
mean_p_value = results['p-value'].mean()  # p > 0.05 would mean normality cannot be rejected; the p-values here are far below 0.05, so the features are not normally distributed

      Column         W       p-value
0        0_0  0.493494  2.774571e-43
1        0_1  0.362613  0.000000e+00
2        0_2  0.423284  2.802597e-45
3        0_3  0.383631  0.000000e+00
4        0_4  0.415530  1.401298e-45
...      ...       ...           ...
8995  11_745  0.288316  0.000000e+00
8996  11_746  0.219243  0.000000e+00
8997  11_747  0.231345  0.000000e+00
8998  11_748  0.278828  0.000000e+00
8999  11_749  0.182315  0.000000e+00

[9000 rows x 3 columns]
8.56094515530035e-27
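
The mean p-value condenses 9000 separate tests into a single number, which can be hard to interpret. As a possible alternative, the sketch below reports the fraction of features for which the Shapiro-Wilk test rejects normality. The toy data, the threshold `alpha = 0.05`, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Hypothetical toy data: one roughly Gaussian feature and one clearly skewed feature
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "gaussian": rng.normal(size=200),
    "skewed": rng.exponential(size=200),
})

alpha = 0.05  # assumed significance level

# One Shapiro-Wilk p-value per column
p_values = toy.apply(lambda col: shapiro(col).pvalue)

# True where normality is rejected at the chosen level
rejected = p_values < alpha
print(f"{rejected.mean():.0%} of features reject normality at alpha={alpha}")
```

On the ECG features the same idea would be `(results['p-value'] < 0.05).mean()`. With this many tests, a correction for multiple comparisons may also be worth considering.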


## Splitting data into training and test data
- Subset training and test based on ratios
- Stratification
- Cross-validation?
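
As a rough illustration of how stratification and cross-validation fit together, the sketch below runs stratified 5-fold cross-validation on hypothetical toy data with a class imbalance similar to the one in this dataset. The scaler sits inside a pipeline so that it is refitted on the training part of every fold. The classifier, the number of folds, and the scoring metric are assumptions, not choices made by the assignment.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data: 200 samples, 18 % positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = np.array([1] * 36 + [0] * 164)

# Scaling inside the pipeline avoids using validation data when fitting the scaler
pipeline = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=5))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring="roc_auc")

# Each validation fold keeps roughly the same share of positives
fold_fractions = [y_toy[test_idx].mean() for _, test_idx in cv.split(X_toy, y_toy)]

print(scores)
print(fold_fractions)
```

In this notebook, cross-validation like this would be applied to the training split only, leaving the held-out test set untouched until the final evaluation.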




In [58]:
# Split data
X_train, X_test_DO_NOT_FIT, y_train, y_test_DO_NOT_FIT = model_selection.train_test_split(x, y, test_size=0.25, stratify=y)
y_train_ab = y_train==1
print(y_train_ab)
# X_test_DO_NOT_FIT and y_test_DO_NOT_FIT SHOULD NOT BE USED FOR FITTING

# Scale the data robustly: center on the median and scale by the interquartile range (fit on the training set only)
scaler = preprocessing.RobustScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled_DO_NOT_FIT = scaler.transform(X_test_DO_NOT_FIT)

# Cross-validation
# cv_20fold = model_selection.StratifiedKFold(n_splits=10) --> from lecture 1.2_generalization.ipyb

# Loop over the folds
#for validation_index, test_index in cv_20fold.split(X2, y2):
    # Split the data properly
#    X_validation = X2[validation_index]
#    y_validation = y2[validation_index]
    
#    X_test = X2[test_index]
#    y_test = y2[test_index]


573    False
302     True
186    False
802    False
536    False
       ...  
519    False
156    False
665    False
809    False
419    False
Name: label, Length: 620, dtype: bool
