<p align="center">
<img src='../../img/VerticaMLPython.png' width="180px">
</p>

# Vertica ML Python Exercise 4

During this exercise, we will:
<ul>
    <li>Normalize the data using the different methods</li>
    <li>Create a PCA model</li>
    <li>Find a suitable number of principal components to keep</li>
</ul>

## Initialization

Let's create a cursor using the vertica_cursor function.

In [2]:
from vertica_ml_python.utilities import vertica_cursor
cur = vertica_cursor("VerticaDSN")

During this study, we will work with the 'heart diseases' dataset introduced in exercise 2. 

In [3]:
from vertica_ml_python import vDataframe
heart = vDataframe('heart', cursor = cur)
print(heart)

| | age | thalach | trestbps | fbs | slope | ca | num | sex | cp | exang | restecg | chol | oldpeak | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.00 | 202.00 | 130.00 | 0.00 | 1.00 | 0.00 | 1 | 1.00 | 2.00 | 0.00 | 2.00 | 204.00 | 0.00 | 3.00 |
| 1 | 34.00 | 192.00 | 118.00 | 0.00 | 1.00 | 0.00 | 1 | 0.00 | 2.00 | 0.00 | 0.00 | 210.00 | 0.70 | 3.00 |
| 2 | 34.00 | 174.00 | 118.00 | 0.00 | 1.00 | 0.00 | 1 | 1.00 | 1.00 | 0.00 | 2.00 | 182.00 | 0.00 | 3.00 |
| 3 | 35.00 | 182.00 | 138.00 | 0.00 | 1.00 | 0.00 | 1 | 0.00 | 4.00 | 0.00 | 0.00 | 183.00 | 1.40 | 3.00 |
| 4 | 35.00 | 130.00 | 120.00 | 0.00 | 2.00 | 0.00 | 2 | 1.00 | 4.00 | 1.00 | 0.00 | 198.00 | 1.60 | 7.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


<object>  Name: heart, Number of rows: 270, Number of columns: 14


This dataset contains information on 270 patients, including:
<ul>
    <li><b>age:</b> age in years</li>
    <li><b>thalach:</b> maximum heart rate achieved</li>
    <li><b>trestbps:</b> resting blood pressure (in mm Hg on admission to the hospital) </li>
    <li><b>fbs:</b> (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) </li>
    <li><b>slope:</b> the slope of the peak exercise ST segment (Value 1: upsloping; Value 2: flat; Value 3: downsloping) </li>
    <li><b>ca:</b> number of major vessels (0-3) colored by fluoroscopy </li>
    <li><b>num:</b> diagnosis of heart disease (angiographic disease status) (Value 1: < 50% diameter narrowing; Value 2: > 50% diameter narrowing)</li>
    <li><b>sex:</b> sex (1 = male; 0 = female) </li>
    <li><b>cp:</b> chest pain type (Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic) </li>
    <li><b>exang:</b> exercise induced angina (1 = yes; 0 = no) </li>
    <li><b>restecg:</b> resting electrocardiographic results (Value 0: normal; Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria) </li>
    <li><b>chol:</b> serum cholesterol in mg/dl </li>
    <li><b>oldpeak:</b> ST depression induced by exercise relative to rest </li>
    <li><b>thal:</b> 3 = normal; 6 = fixed defect; 7 = reversible defect </li>
</ul>

The goal is to distinguish patients with heart complications from the rest.

## Decomposition and Normalization

Let's explore the data by displaying descriptive statistics of all the columns.

In [4]:
heart.describe(method = "categorical")

| | dtype | unique | count | top | top_percent |
|---|---|---|---|---|---|
| "age" | numeric(6,2) | 41 | 270 | 54.00 | 5.926 |
| "thalach" | numeric(7,2) | 90 | 270 | 162.00 | 3.704 |
| "trestbps" | numeric(7,2) | 47 | 270 | 120.00 | 12.593 |
| "fbs" | numeric(5,2) | 2 | 270 | 0.00 | 85.185 |
| "slope" | numeric(5,2) | 3 | 270 | 1.00 | 48.148 |
| "ca" | numeric(5,2) | 4 | 270 | 0.00 | 59.259 |
| "num" | int | 2 | 270 | 1 | 55.556 |
| "sex" | numeric(5,2) | 2 | 270 | 1.00 | 67.778 |
| "cp" | numeric(5,2) | 4 | 270 | 4.00 | 47.778 |


<object>

In [5]:
heart.describe(method = "numerical")

| | count | mean | std | min | 25% | 50% | 75% | max | unique |
|---|---|---|---|---|---|---|---|---|---|
| age | 270 | 54.4333333333334 | 9.10906652389821 | 29.0 | 48.0 | 55.0 | 61.0 | 77.0 | 41 |
| ca | 270 | 0.67037037037037 | 0.943896383488116 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 4 |
| chol | 270 | 249.659259259259 | 51.6862371164313 | 126.0 | 213.0 | 245.0 | 280.0 | 564.0 | 144 |
| cp | 270 | 3.17407407407408 | 0.950090033922864 | 1.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4 |
| exang | 270 | 0.32962962962963 | 0.470951591301383 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2 |
| fbs | 270 | 0.148148148148148 | 0.355906476970731 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2 |
| num | 270 | 1.44444444444445 | 0.497826751588616 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2 |
| oldpeak | 270 | 1.05 | 1.145209839378 | 0.0 | 0.0 | 0.8 | 1.6 | 6.2 | 39 |
| restecg | 270 | 1.02222222222222 | 0.997891208966111 | 0.0 | 0.0 | 2.0 | 2.0 | 2.0 | 3 |


<object>

Many features are categorical and would normally need to be encoded first. However, this is not needed in this exercise, as the main purpose is to understand normalization and decomposition. We will use only the numerical features in this study (age, thalach, trestbps, chol and oldpeak).

In exercise 2, we did not normalize the data because the features we chose had approximately the same range. However, normalization matters whenever an algorithm relies on a notion of distance, since features measured on larger scales would otherwise dominate the result.
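
To make the effect of scale on distances concrete, the sketch below compares the first two patients printed earlier on age and chol only, before and after dividing each difference by the column's standard deviation from the numerical describe output. It is plain Python for illustration, not the vertica_ml_python API; the helper name `squared_contributions` is hypothetical.

```python
import math

# Two patients from the printed sample, restricted to age and chol
patient_a = {"age": 29.0, "chol": 204.0}
patient_b = {"age": 34.0, "chol": 210.0}

# Standard deviations reported by heart.describe(method = "numerical")
std = {"age": 9.10906652389821, "chol": 51.6862371164313}

def squared_contributions(a, b, scale=None):
    """Return each feature's squared difference, optionally divided by its scale."""
    contributions = {}
    for feature in a:
        diff = a[feature] - b[feature]
        if scale is not None:
            diff /= scale[feature]
        contributions[feature] = diff ** 2
    return contributions

raw = squared_contributions(patient_a, patient_b)
scaled = squared_contributions(patient_a, patient_b, scale=std)

raw_distance = math.sqrt(sum(raw.values()))
scaled_distance = math.sqrt(sum(scaled.values()))

# Share of the squared Euclidean distance that comes from chol
raw_chol_share = raw["chol"] / sum(raw.values())
scaled_chol_share = scaled["chol"] / sum(scaled.values())

print(f"raw distance: {raw_distance:.3f}, chol share: {raw_chol_share:.1%}")
print(f"scaled distance: {scaled_distance:.3f}, chol share: {scaled_chol_share:.1%}")
```

On the raw values, chol accounts for more than half of the squared distance even though a 6 mg/dl gap is small relative to that column's spread. Once each difference is expressed in standard deviations, the 5-year age gap dominates instead.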

<b>Question 1: </b>Look at the descriptive statistics of all the features and compare the median and the mean. What do you notice? Explain why the 'zscore' method is a suitable way to normalize the data, then normalize the data.
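
As a reminder of what the different normalization methods compute, here is a minimal plain-Python sketch of three common ones, applied to the ages of the five patients printed earlier. The function names are chosen for illustration and are not taken from the vertica_ml_python API; the use of the sample standard deviation and of the 1.4826 factor for the median absolute deviation are assumptions about the convention, not a statement of what the library does.

```python
import statistics

def zscore(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

def robust_zscore(values):
    """Center on the median and divide by the scaled median absolute deviation."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    # 1.4826 makes the MAD comparable to the standard deviation for normal data
    return [(v - median) / (1.4826 * mad) for v in values]

def minmax(values):
    """Rescale linearly so that the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Ages of the five patients shown in the printed sample
ages = [29.0, 34.0, 34.0, 35.0, 35.0]

z = zscore(ages)
rz = robust_zscore(ages)
mm = minmax(ages)

print("zscore:       ", [round(v, 3) for v in z])
print("robust_zscore:", [round(v, 3) for v in rz])
print("minmax:       ", [round(v, 3) for v in mm])
```

The z-score relies on the mean and the standard deviation, both of which are sensitive to extreme values, whereas the robust variant relies on the median. Comparing the mean and the median of each column is therefore a quick way to judge which method fits the data.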

<b>Question 2: </b>If we decide to build a model, these columns could be correlated with one another. PCA reduces the dimension of the feature space and creates uncorrelated components. Create a PCA model using these features.
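
The idea behind PCA can be sketched by hand for two features. For two standardized columns with correlation r, the covariance matrix is [[1, r], [r, 1]], whose eigenvectors are (1, 1)/sqrt(2) and (1, -1)/sqrt(2) with eigenvalues 1 + r and 1 - r. The plain-Python example below applies this to age and thalach for the five patients printed earlier; it only illustrates the principle on a tiny sample and does not use the library's PCA model, and the helper names are hypothetical.

```python
import math
import statistics

def standardize(values):
    """Z-score a column using the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

# age and thalach of the five patients shown in the printed sample
age = [29.0, 34.0, 34.0, 35.0, 35.0]
thalach = [202.0, 192.0, 174.0, 182.0, 130.0]

z_age, z_thalach = standardize(age), standardize(thalach)

# Both columns are standardized, so their covariance is their correlation
r = covariance(z_age, z_thalach)

# Project the data on the two eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2)
comp_sum = [(a + t) / math.sqrt(2) for a, t in zip(z_age, z_thalach)]
comp_diff = [(a - t) / math.sqrt(2) for a, t in zip(z_age, z_thalach)]

# Eigenvalues sorted in decreasing order, and the share of variance each explains
eigenvalues = sorted([1 + r, 1 - r], reverse=True)
explained_variance = [ev / sum(eigenvalues) for ev in eigenvalues]

print(f"correlation between age and thalach: {r:.3f}")
print(f"covariance between the two components: {covariance(comp_sum, comp_diff):.3e}")
print("explained variance:", [round(ev, 3) for ev in explained_variance])
```

The two projected components have zero covariance, and the stronger the correlation between the original columns, the larger the share of variance carried by the first component alone.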

<b>Question 3: </b>Look at the explained_variance attribute and choose the number of components to keep (we usually keep enough components to bring the cumulative explained variance above 0.95).
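
The 0.95 rule can be written as a short helper. This is a sketch that assumes the explained variance is available as a list of ratios sorted in decreasing order and summing to 1; the function name `components_to_keep` and the example ratios are hypothetical and were not computed from the heart dataset.

```python
from itertools import accumulate

def components_to_keep(explained_variance, threshold=0.95):
    """Return the smallest number of leading components whose
    cumulative explained variance exceeds the threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance), start=1):
        if cumulative > threshold:
            return k
    # The threshold is never exceeded: keep every component
    return len(explained_variance)

# Hypothetical ratios for five components (illustrative values only)
example_ratios = [0.40, 0.25, 0.20, 0.12, 0.03]

k = components_to_keep(example_ratios)
print(f"components to keep: {k}")
```

With these illustrative ratios, the cumulative sum first exceeds 0.95 at the fourth component, so the last one would be dropped.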

<b>Question 4: </b>Why is it important to reduce the number of features before fitting an ML model?