## <img src="https://img.icons8.com/color/48/000000/code.png" style="height:50px;display:inline"> Python - Parkinson's Disease Classification Data Analysis

In this exercise, we are going to do data analysis with Python and Pandas. As this is the first "real" exercise, we will add guidance for some of the tasks.

1. Warmup - Generate 100 samples from $\mathcal{N}(0,1)$ (`np.random.randn`). Compute a 95% CI for the CDF. Plot the true CDF, the estimated CDF and the CI in a single plot. To estimate $\hat{F}_n$, use a histogram (`np.histogram`). Repeat this $K=1000$ times and compute the percentage of experiments in which the interval contained the true CDF (print the value). In addition, plot in a separate figure the *true* CDF together with the best and worst experiments (use $\max_x|F(x) - \hat{F}_n(x)|$ as the quality measure).
    * To compare NumPy arrays element-wise, use `np.less_equal(x1, x2 + eps)` and `np.greater_equal(x1, x2 - eps)`; use `.all()` to verify that all the comparisons were `True`.
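
A single warmup experiment could be sketched as follows. The exercise does not name the method for the confidence band; this sketch assumes the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which gives a band of half-width $\epsilon = \sqrt{\ln(2/\alpha)/(2n)}$ around $\hat{F}_n$. The function name `empirical_cdf_band` and the grid resolution are illustrative choices, not part of the course helpers.

```python
import numpy as np
from scipy.stats import norm

def empirical_cdf_band(samples, grid, alpha=0.05):
    """Histogram-based empirical CDF on `grid` with a DKW confidence band (assumed method)."""
    n = len(samples)
    # One extra left edge below every sample, so that the cumulative counts
    # at position j give the number of samples up to grid[j].
    left = min(samples.min(), grid[0]) - 1.0
    edges = np.concatenate(([left], grid))
    counts, _ = np.histogram(samples, bins=edges)
    f_hat = np.cumsum(counts) / n
    # DKW half-width for a (1 - alpha) band.
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    lower = np.clip(f_hat - eps, 0.0, 1.0)
    upper = np.clip(f_hat + eps, 0.0, 1.0)
    return f_hat, lower, upper

np.random.seed(0)                      # fixed seed only for reproducibility
grid = np.linspace(-3, 3, 601)
samples = np.random.randn(100)

f_hat, lower, upper = empirical_cdf_band(samples, grid)
true_cdf = norm.cdf(grid)

# Did the band contain the true CDF everywhere on the grid?
covered = bool(np.all((lower <= true_cdf) & (true_cdf <= upper)))
# Quality measure from the exercise, evaluated on the grid.
sup_error = np.max(np.abs(true_cdf - f_hat))
print(covered, sup_error)
```

Repeating this $K$ times and averaging `covered` gives the coverage percentage; keeping the `f_hat` with the smallest and largest `sup_error` gives the best and worst experiments.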
    
We are now going to perform some real data analysis on the "Parkinson's Disease Classification Data Set". The data used in this study were gathered from 188 patients with PD (107 men and 81 women), with ages ranging from 33 to 87, at the Department of Neurology in the Cerrahpaşa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82. During the data collection process, the microphone was set to 44.1 kHz and, following the examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.

Various speech signal processing algorithms, including Time Frequency Features, Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TQWT features, have been applied to the speech recordings of Parkinson's Disease (PD) patients to extract clinically useful information for PD assessment. Their outputs are the features of this data set.

2. Load the data with pandas, drop the 'id' column, take a sample ($k=10$, `dataframe.sample(k)`) and view it.
    * The filename is `pd_speech_features.csv`.
3. Compute the empirical correlation between all pairs of features. Show the results in a heatmap.
    * Use pandas `.corr()` to calculate the correlation, and `plt.imshow()` to view the heatmap (2 heatmaps, one for the correlation and one for the absolute correlation). Add a color bar using `plt.colorbar()`.
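
The correlation heatmaps can be sketched on a small synthetic table (the real feature table is not reproduced here; the column names `f0`..`f3` are made up). The sketch uses the axes-level `ax.imshow` / `fig.colorbar` calls, which do the same job as `plt.imshow()` / `plt.colorbar()` but keep the two panels separate.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table (hypothetical columns).
rng = np.random.default_rng(0)
base = rng.standard_normal(200)
toy = pd.DataFrame({
    'f0': base,
    'f1': base + 0.1 * rng.standard_normal(200),    # strongly positive with f0
    'f2': -base + 0.5 * rng.standard_normal(200),   # strongly negative with f0
    'f3': rng.standard_normal(200),                 # unrelated
})

corr = toy.corr()  # pairwise Pearson correlation

# One panel for the signed correlation, one for its absolute value.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
panels = [(corr, 'correlation'), (corr.abs(), 'absolute correlation')]
for ax, (mat, title) in zip(axes, panels):
    im = ax.imshow(mat.values)
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
```

The absolute-value panel makes strong negative relationships (here `f0` vs `f2`) as visible as strong positive ones.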
4. Print the top-20 most correlated pairs of features. Follow these steps:
    * Take the lower triangle of the correlation matrix (as it is symmetrical and we don't care about $Corr(X_i,X_i)$). Use `np.tril()`; note that by default it keeps the diagonal, and `k=-1` excludes it.
    * Consider only positive correlation (because negative correlation has a different, useful meaning). You can do that with `X = X[X > 0]`.
    * From here, these are recommended steps, feel free to achieve the goal in a different way.
        * Assignment to a pandas DataFrame: `X.loc[:,:] = np.(...)`
        * Unstacking the DataFrame (creates a new pivot, read the doc): `df.unstack()`
        * Sorting: `df.sort_values()`
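
One possible way to chain these steps, shown on synthetic data with made-up columns `a`..`d`. Instead of assigning through `.loc`, this sketch builds a new DataFrame from the `np.tril` result; either route should lead to the same table.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table (hypothetical columns).
rng = np.random.default_rng(1)
z = rng.standard_normal(300)
toy = pd.DataFrame({
    'a': z,
    'b': z + 0.05 * rng.standard_normal(300),   # almost a copy of 'a'
    'c': z + 1.0 * rng.standard_normal(300),    # loosely related to 'a'
    'd': -z + 0.1 * rng.standard_normal(300),   # negatively correlated, filtered out below
})
corr = toy.corr()

# Strictly lower triangle: k=-1 zeroes the diagonal of ones as well.
lower = pd.DataFrame(np.tril(corr.values, k=-1),
                     index=corr.index, columns=corr.columns)

# Boolean indexing on a DataFrame keeps positive entries and sets the rest to NaN.
lower = lower[lower > 0]

# unstack() turns the matrix into a Series indexed by (column, row) pairs.
top_pairs = lower.unstack().dropna().sort_values(ascending=False).head(20)
print(top_pairs)
```

With only four toy columns there are fewer than 20 positive pairs, so `head(20)` simply returns all of them.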

5. What does it mean when two different features are highly correlated? From a machine learning perspective, can a classifier learn new insights from highly correlated features? In your answer, address the process of "feature selection" in ML (usually performed as a pre-processing step).
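
As an illustration of what correlation-based feature selection can look like in practice, here is a minimal sketch of a greedy filter. The helper name `drop_correlated` and the threshold of 0.95 are arbitrary choices for the example, not something prescribed by the exercise.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop every column whose |correlation| with any earlier column exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strictly upper triangle: column j is compared with columns i < j.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic example with hypothetical columns.
rng = np.random.default_rng(4)
z = rng.standard_normal(300)
toy = pd.DataFrame({
    'a': z,
    'b': z + 0.01 * rng.standard_normal(300),   # nearly redundant with 'a'
    'c': rng.standard_normal(300),              # independent of 'a'
})

reduced, dropped = drop_correlated(toy)
print(dropped)
```

The filter is order-dependent: of two nearly identical columns, the later one is removed.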

6. Compute the **in-class** correlation between features. Plot a heat map for each class. Address the differences between the heat maps.
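
A sketch of the per-class computation on synthetic data. It assumes the label column is called `class`, which the text above does not confirm; adjust the name to match the loaded file.

```python
import numpy as np
import pandas as pd

# Synthetic data: x and y are tightly linked in class 0 and independent in class 1.
rng = np.random.default_rng(2)
n = 200
labels = np.repeat([0, 1], n)
x = rng.standard_normal(2 * n)
noise = rng.standard_normal(2 * n)
y = np.where(labels == 0, x + 0.1 * noise, noise)
toy = pd.DataFrame({'x': x, 'y': y, 'class': labels})  # 'class' is an assumed column name

# One correlation matrix per class, computed on the features only.
class_corr = {
    label: group.drop(columns=['class']).corr()
    for label, group in toy.groupby('class')
}

for label, corr in class_corr.items():
    print(f"class {label}:\n{corr}\n")
```

Each matrix in `class_corr` can then be passed to `plt.imshow()` to draw one heat map per class.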

7. Consider the features 'numPulses' and 'app_entropy_log_5_coef'. We wish to calculate a 95% confidence interval for the correlation between these features. We will use *Bootstrapping* and the *Chebyshev inequality* (as in Tutorial 2).
    * Implement the bootstrap algorithm to calculate the standard deviation ($\sigma$) of the correlation.
        * You can use the algorithm from the tutorial, but you have to modify it to support 2 arrays.
        * The algorithm will output the empirical correlation of the two input features, and a bootstrap estimation for the std ($\sigma$). Use `K=300` bootstrap samples and `m=100` experiments.
        * Tips:
            * To get values for two columns: `data[['numPulses', 'app_entropy_log_5_coef']]`
            * To calculate correlation, check out `numpy.corrcoef`.
    * Use $\sigma$ to calculate a 95% CI using Chebyshev inequality.
        * Remember to normalize by the size of the data ($N$).
    The final print should look something like this: `95% confidence interval for the correlation: *** ± ***`
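
A paired bootstrap for the correlation might look like the sketch below. This is not the tutorial's algorithm: it uses a single loop of `n_boot` resamples and does not model the separate `m` experiments, the function name `bootstrap_corr_std` is hypothetical, and the data are synthetic. Each resample has the full size $N$, so the resulting standard deviation already estimates the standard error of the correlation; if the tutorial's routine returns a per-sample $\sigma$ instead, the normalization by $N$ mentioned above would still have to be applied.

```python
import numpy as np

def bootstrap_corr_std(x, y, n_boot=300, rng=None):
    """Return the empirical correlation of (x, y) and a bootstrap estimate of its std."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    r_hat = np.corrcoef(x, y)[0, 1]
    boot = np.empty(n_boot)
    for b in range(n_boot):
        # Resample row indices with replacement so (x_i, y_i) pairs stay together.
        idx = rng.integers(0, n, size=n)
        boot[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return r_hat, boot.std(ddof=1)

# Synthetic pair of features with true correlation 0.6.
rng = np.random.default_rng(3)
x = rng.standard_normal(500)
y = 0.6 * x + 0.8 * rng.standard_normal(500)

r_hat, se = bootstrap_corr_std(x, y, n_boot=300, rng=rng)

# Chebyshev: P(|r_hat - r| >= k * se) <= 1 / k^2, so alpha = 0.05 gives k = 1 / sqrt(alpha).
alpha = 0.05
half_width = se / np.sqrt(alpha)
print(f"95% confidence interval for the correlation: {r_hat:.3f} ± {half_width:.3f}")
```

With the real data, `x` and `y` would be the two columns selected via `data[['numPulses', 'app_entropy_log_5_coef']]`. Chebyshev makes no distributional assumption, so the interval is wider than a normal-approximation interval built from the same standard error.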

In [11]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import math
from scipy.stats import norm
import helpers as h


%matplotlib inline


plt.rcParams['figure.figsize'] = (8.0, 8.0)
plt.rcParams['axes.grid'] = True
plt.rcParams['font.size'] = 10


## Q1

In [27]:
N = 100
x = np.arange(-3,3+0.001,0.001)
x.shape
mu = 0
sigma = 1
 
true_cdf = norm.cdf(x, mu, sigma)

k = 1000
count = 0 # counts the number of times the true cdf is inside the CI
norm_list = []
for i in range(k):
    # Estimate F and its CI, then check whether the true CDF lies inside the CI
    F_n, F_ci_down, F_ci_up = h.calc_plot_CI(0.05, N)
    count += 1*((np.greater_equal(true_cdf, F_ci_down).all()) and (np.less_equal(true_cdf, F_ci_up).all()))

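    # Now look for the best and worst estimation of F:
    # record the sup-norm error max_x |F(x) - F_n(x)| of this experiment
    # (assumes F_n is evaluated on the same grid x as true_cdf)
    norm_list.append(np.max(np.abs(true_cdf - F_n)))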


print(f"The percentage of the estimators that were inside the CI is: {count/k * 100:.2f}%")


fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(x, true_cdf, label='true cdf')
ax.legend()



The percentage of the estimators that were inside the CI is: 95.50%

