#**Introduction to Data Handling and Exploration**

Instructor: Dr Mario Rosario Guarracino


---



In [None]:
# from google.colab import drive
# drive.mount('/content/drive')
# % cd "/content/drive/My Drive/2020 09 Cambridge course/PyNotebooks/"

---
###**1. Import the required libraries**


*  Library **sklearn** - machine learning library in Python. 
Contains various algorithms for clustering, classification, data pre-processing, manifold learning, dimensionality reduction and feature extraction. 
Refer to https://scikit-learn.org/stable/modules/classes.html for a full list of functions and to https://scikit-learn.org/stable/user_guide.html#user-guide for the sklearn user guide. 


In [None]:
import warnings
warnings.filterwarnings('ignore')

import sys
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Check library version
np.__version__

###**2. Load the Cleveland Heart UCI dataset**

*   Direct download from UCI (https://archive.ics.uci.edu/ml/datasets/heart+disease)

*   Load from local file

In [None]:
## Direct download - provide the data URL and the attribute names listed on the UCI page
# cleveland_heart = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data', sep=",", names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']) 

## save downloaded csv file locally
# cleveland_heart.to_csv("heart.csv", index = False)

## Load local csv file
cleveland_heart = pd.read_csv("data/heart.csv")

**2.1 Dataframe content and dimensions**

The Cleveland UCI heart dataset contains 13 attributes and the class label:

1. **age** 
2. **sex** (values 0,1 for females/males)
3. **cp**: chest pain type (4 values)
 -  Value 1: typical angina
 -  Value 2: atypical angina
 -  Value 3: non-anginal pain
 -  Value 4: asymptomatic
4. **trestbps**: resting blood pressure
5. **chol**: serum cholesterol in mg/dl
6. **fbs**: fasting blood sugar > 120 mg/dl (Values: 0,1)
7. **restecg**: resting electrocardiographic results 
 - Value 0: normal
 - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
 - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. **thalach**: maximum heart rate achieved
9. **exang**: exercise induced angina
10. **oldpeak**: ST depression induced by exercise relative to rest
11. **slope**: slope of the peak exercise ST segment
 - Value 1: upsloping
 - Value 2: flat
 - Value 3: downsloping
12. **ca**: number of major vessels (0-3) colored by fluoroscopy
13. **thal**: 3 = normal; 6 = fixed defect; 7 = reversible defect 
14. **num**: presence of heart disease in the patient, from 0 (absence) to 1-4 (presence).

**2.2 View the shape and contents of the data**

In [None]:
print(cleveland_heart.shape)
cleveland_heart.head()


*  **Rename** the last column from 'num' to 'Diagnosis_CHD'.
 Since 0 indicates the absence and 1-4 the presence of coronary heart disease (CHD), replace the values 1-4 with 1 in the column 'Diagnosis_CHD'.

In [None]:
cleveland_heart = cleveland_heart.rename(columns = {"num":"Diagnosis_CHD"})

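# Assign the result back instead of using inplace=True on a column selection,
# which is not guaranteed to modify the dataframe in recent pandas versions
cleveland_heart["Diagnosis_CHD"] = cleveland_heart["Diagnosis_CHD"].replace(to_replace=[1,2,3,4], value=1)
print(cleveland_heart.head(n=3))

# Number of instances in each class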
print(cleveland_heart.groupby('Diagnosis_CHD').size())
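
Because every non-zero value of the original label means that disease is present, the same binary label can also be derived with a comparison. The following is a minimal sketch on a made-up toy dataframe (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy labels on the original 0-4 scale (not real dataset rows)
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Any value above 0 indicates disease, so a comparison yields the binary label
toy["Diagnosis_CHD"] = (toy["num"] > 0).astype(int)

# Class counts, comparable to groupby('Diagnosis_CHD').size()
class_counts = toy["Diagnosis_CHD"].value_counts().sort_index()
print(class_counts)
```

The comparison does not depend on listing every possible positive value, which may be convenient if a label scale has more levels.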

In [None]:
cleveland_heart.cp = cleveland_heart.cp.replace({1:'Typical', 2:'Atypical', 3:'NonAnginal', 4:'Asymptomatic'})
cleveland_heart.restecg = cleveland_heart.restecg.replace({0:'Normal', 1:'ST-T abn', 2:'LVH'})
cleveland_heart.slope = cleveland_heart.slope.replace({1:'up', 2:'flat', 3:'down'})
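# Note: 'thal' is read as text because of its '?' entries (see section 3), so these
# integer keys are not expected to match until the column has been converted to numeric
cleveland_heart.thal = cleveland_heart.thal.replace({3:'Normal', 6:'Fixed', 7:'Revers'})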

In [None]:
cleveland_heart.head()

**2.3 Summary statistics**

In [None]:
cleveland_heart.loc[:, ['age','chol', 'trestbps','thalach','oldpeak']].describe()
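
Summary statistics can also be split by class to compare patients with and without a diagnosis. This is a small sketch on invented values (the `toy` dataframe is an assumption for illustration only); the same `groupby` pattern should carry over to `cleveland_heart` once the label column has been binarized:

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "age": [63, 41, 57, 52, 60, 45],
    "chol": [233, 204, 354, 199, 282, 260],
    "Diagnosis_CHD": [0, 0, 1, 0, 1, 1],
})

# Mean and standard deviation of each variable within each class
grouped_stats = toy.groupby("Diagnosis_CHD")[["age", "chol"]].agg(["mean", "std"])
print(grouped_stats)
```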

###**3. Missing data**

---
Check for NA, NaN or null values in the dataset.



In [None]:
# isnull() and isna() are aliases, so both lines print the same per-column counts
print(cleveland_heart.isnull().sum())
print(cleveland_heart.isna().sum())

---
Dataset info

In [None]:
cleveland_heart.info()
# The Dtype column points to the columns that contain missing values: ca and thal are stored as object (mixed datatypes)

In [None]:
print(cleveland_heart['cp'].unique())
print(cleveland_heart['restecg'].unique())
print(cleveland_heart['slope'].unique())
print(cleveland_heart['ca'].unique())
print(cleveland_heart['thal'].unique())
# Unique values in the ca and thal columns show that '?' has been used as an indicator for missing values (can be nan or 0 in other datasets)

In [None]:
# Find number of '?' entries in ca and thal columns
print(sum(cleveland_heart['ca'].values=='?'))
print(sum(cleveland_heart['thal'].values=='?'))

# Replace ? with nan as an exercise (This section of code can also be used with the nan replaced dataset)
# cleveland_heart_withnan = cleveland_heart.replace('?', np.nan)
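
The NaN-based route mentioned in the comment above could look roughly like the following. It is a sketch on a made-up two-column dataframe (`toy`, `toy_nan` and `toy_complete` are illustrative names), assuming that the placeholder is always the string '?':

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns mimicking 'ca' and 'thal': numeric strings plus '?'
toy = pd.DataFrame({"ca": ["0.0", "3.0", "?", "1.0"],
                    "thal": ["3.0", "?", "7.0", "6.0"]})

# Replace the placeholder with NaN, then convert each column to a numeric dtype
toy_nan = toy.replace("?", np.nan).apply(pd.to_numeric)

# The missing entries are now visible to isna()
print(toy_nan.isna().sum())

# Drop every row that has at least one missing value
toy_complete = toy_nan.dropna()
print(toy_complete)
```

Once the placeholders are real NaN values, `isna()`, `dropna()` and the scikit-learn imputers can work on them directly.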

In [None]:
# Remove rows with missing values and convert the strings in ca and thal to integers.
idx_drop = cleveland_heart.loc[cleveland_heart['ca']=='?'].index.tolist()
cleveland_heart.drop(idx_drop, axis=0, inplace=True)
cleveland_heart['ca'] = cleveland_heart['ca'].astype(str).astype(float).astype(int)

idx_drop = cleveland_heart.loc[cleveland_heart['thal']=='?'].index.tolist()
cleveland_heart.drop(idx_drop, axis=0, inplace=True)
cleveland_heart['thal'] = cleveland_heart['thal'].astype(str).astype(float).astype(int)

In [None]:
# Samples retained 
print(cleveland_heart.groupby('Diagnosis_CHD').size())
# cleveland_heart.to_csv("data/heart_processed.csv", index=False)



---


**Note**: Also visit https://scikit-learn.org/stable/modules/impute.html for missing value imputation methods.
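
As an alternative to dropping rows, missing entries can be filled in. The sketch below uses scikit-learn's `SimpleImputer` with the `most_frequent` strategy on an invented toy dataframe; choosing the mode is an assumption made here because `ca` and `thal` take few distinct values, not a recommendation from the course material:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with NaN marking the missing entries
toy = pd.DataFrame({"ca": [0.0, np.nan, 0.0, 2.0],
                    "thal": [3.0, 7.0, np.nan, 7.0]})

# Fill each column with its most frequent value
imputer = SimpleImputer(strategy="most_frequent")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```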


---



###**4. Data Visualization**

**4.1 Boxplots to visualize distribution**

In [None]:
cleveland_heart.plot(kind='box', subplots=True, layout=(4, 4), figsize=(10, 10), fontsize=10)
plt.show()

**4.2 Histograms**

In [None]:
cleveland_heart.hist(figsize=(10, 10))
plt.show()

**4.3 Pairplot** in seaborn. Plots pairwise relationships between attributes in a dataset.


In [None]:
# Pair plots with all the variables
sns.pairplot(cleveland_heart, hue="Diagnosis_CHD", diag_kind="hist", palette =sns.color_palette("Set1", n_colors=2))

In [None]:
# choose only quantitative variables for pairplot.
sns.pairplot(cleveland_heart.loc[:, ['age','chol', 'trestbps','thalach','oldpeak','Diagnosis_CHD']], hue="Diagnosis_CHD", diag_kind="hist" , palette =sns.color_palette("Set1", n_colors=2))
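
A correlation matrix gives a numeric counterpart to the pairwise scatterplots. The sketch below runs on synthetic data in which one variable is constructed to decrease with age (the relationship and all values are assumptions for illustration, not properties of the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.normal(55, 9, size=200)

# Synthetic variables: 'thalach' depends on age by construction, 'chol' does not
toy = pd.DataFrame({
    "age": age,
    "thalach": 220 - age + rng.normal(0, 5, size=200),
    "chol": rng.normal(245, 50, size=200),
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))
```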



---


**4.4 Outlier detection**
Here we show how to detect outliers in each variable by using the standard deviation method. For more outlier detection methods refer to https://scikit-learn.org/stable/modules/outlier_detection.html.

In [None]:
# ---
# Function to remove outliers using the standard deviation method
def remove_outliers_stddev(inp_data, num):
  """Return the values of inp_data within num standard deviations of the mean."""

  # cut-off distance from the mean
  cut_off = np.std(inp_data) * num
  lower_cut = np.mean(inp_data) - cut_off
  upper_cut = np.mean(inp_data) + cut_off
  
  # identify outliers
  outliers = [x for x in inp_data if x < lower_cut or x > upper_cut]
  print('Number of outliers: ', len(outliers))

  # remove outliers
  outliers_removed = [x for x in inp_data if lower_cut <= x <= upper_cut]
  return outliers_removed
# ---

var_list = ['age','chol', 'trestbps', 'thalach','oldpeak']
for var in var_list:
  print(var)
  var_data = np.array(cleveland_heart[var])
  var_data_outliers_removed = remove_outliers_stddev(var_data, 3)
  f, axes = plt.subplots(1, 2, sharey=False, figsize=(6,2))
  # pass the values as y so that the boxes are drawn vertically
  sns.boxplot(y=var_data, ax=axes[0], color='red', width=0.2, boxprops=dict(alpha=.7))
  sns.boxplot(y=var_data_outliers_removed, ax=axes[1], color='blue', width=0.2, boxprops=dict(alpha=.7))

  axes[0].set_title("Before")
  axes[1].set_title('After')
  plt.suptitle('Outlier removal - ' + var, va='bottom')
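
The function above filters each variable separately and returns plain lists, so the results no longer line up with the rows of the dataframe. If whole rows should be dropped instead, a boolean mask is one option. This is a sketch on synthetic data; `stddev_outlier_mask` is a hypothetical helper, not part of the notebook or of any library:

```python
import numpy as np
import pandas as pd

def stddev_outlier_mask(df, columns, num=3):
    """Return a boolean Series that is True for rows with an outlier in any column."""
    values = df[columns]
    # distance from each column mean in units of the population std (as np.std)
    z = (values - values.mean()) / values.std(ddof=0)
    return (z.abs() > num).any(axis=1)

# Synthetic data: 99 typical 'chol' values plus one extreme value
rng = np.random.default_rng(0)
toy = pd.DataFrame({"chol": np.append(rng.normal(245, 20, size=99), 600.0),
                    "age": rng.normal(55, 9, size=100)})

mask = stddev_outlier_mask(toy, ["chol", "age"], num=3)
toy_clean = toy.loc[~mask]
print("Rows flagged:", int(mask.sum()))
```

Filtering with `~mask` keeps the remaining rows intact, so the class labels stay aligned with the features.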



---


**4.5 Principal component analysis (PCA)** for data visualization and outlier inspection. PCA is a dimensionality reduction technique that projects the data onto the orthogonal directions of greatest variance.

In [None]:
# Divide the dataframe into attributes in X and disease diagnosis labels stored in y
X = cleveland_heart.loc[:, ['age','chol', 'trestbps','thalach','oldpeak']]
y = np.array(cleveland_heart.iloc[:,-1].values)
print(X.shape, y.shape)

# scale X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA on unscaled data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# PCA on scaled data
pca_sc = PCA(n_components=2)
X_pca_scaled = pca_sc.fit_transform(X_scaled)

# subplots for comparison between PCA of unscaled and scaled data
f, (ax1, ax2) = plt.subplots(1, 2, sharey=False, figsize=(10, 5))
scatter1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="Spectral", alpha=0.7)
scatter2 = ax2.scatter(X_pca_scaled[:, 0], X_pca_scaled[:, 1], c=y, cmap="Spectral", alpha=0.7)

# produce a legend with the unique colors from the scatter
legend1 = ax1.legend(*scatter1.legend_elements(), title="Diagnosis_CHD")
ax1.add_artist(legend1)
legend2 = ax2.legend(*scatter2.legend_elements(), title="Diagnosis_CHD")
ax2.add_artist(legend2)

# subplot titles
ax1.set_title("PCA of Unscaled data")
ax2.set_title('PCA of Scaled data')

print(X_pca.shape)
# Fraction of variance explained by each component (PCA on scaled data)
print('Explained variance ratio (first two components): \n', '%s'
      % str(pca_sc.explained_variance_ratio_))
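
Since PCA follows the directions of greatest variance, a variable measured on a large numeric scale can dominate the first component when the data are not standardized. The sketch below illustrates this on two independent synthetic features (the scales are assumptions loosely inspired by `chol` and `oldpeak`, not values taken from the dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales
X_toy = np.column_stack([
    rng.normal(245, 50, size=300),   # wide numeric range
    rng.normal(1.0, 1.0, size=300),  # narrow numeric range
])

# Explained variance ratio without and with standardization
ratio_unscaled = PCA(n_components=2).fit(X_toy).explained_variance_ratio_
X_toy_scaled = StandardScaler().fit_transform(X_toy)
ratio_scaled = PCA(n_components=2).fit(X_toy_scaled).explained_variance_ratio_

print("Unscaled:", ratio_unscaled.round(3))
print("Scaled:  ", ratio_scaled.round(3))
```

Without scaling, nearly all of the variance is attributed to the wide-range feature; after scaling, the two components share it roughly evenly because the features are independent.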



---


**4.6 t-distributed stochastic neighbor embedding (t-SNE)** is a manifold learning technique for visualizing high-dimensional data.




In [None]:
# t-SNE with 2 components and the perplexity parameter set to 30
x_tsne = TSNE(n_components=2, random_state=1, perplexity=30).fit_transform(X_scaled)
print(x_tsne.shape)
y = y.astype(str)
col_cut = len(np.unique(y))

# store tsne output in a dataframe
df_for_tsne = pd.DataFrame({'tSNE1': x_tsne[:, 0], 'tSNE2': x_tsne[:, 1],
              'Diagnosis_CHD': y})

# seaborn for scatterplot
plt.figure(figsize=(6, 6))
sns.scatterplot(x="tSNE1", y="tSNE2",
			  hue='Diagnosis_CHD',
			  legend='full',
				palette =sns.color_palette("Set1", n_colors=2), s=45,
			  data=df_for_tsne)
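
The layout produced by t-SNE depends on the perplexity parameter, so it can be worth comparing a few settings before interpreting a plot. The following sketch does this on two synthetic Gaussian blobs standing in for the scaled features (the data and the perplexity values 5 and 30 are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features: two Gaussian blobs in 5 dimensions
X_toy = np.vstack([rng.normal(0, 1, size=(40, 5)),
                   rng.normal(4, 1, size=(40, 5))])
X_toy = StandardScaler().fit_transform(X_toy)

embeddings = {}
for perplexity in (5, 30):
    # perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=1)
    embeddings[perplexity] = tsne.fit_transform(X_toy)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of `embeddings` can then be plotted with the same scatterplot code to see how stable the apparent clusters are across settings.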



---

