Collaborative coding using GitHub
===========

Alexandre Perera Luna, Mónica Rojas Martínez

December 15th 2023


# Goal

The objective of this assignment is to construct a project through collaborative coding, showcasing an Exploratory Data Analysis (EDA) and a classification. To facilitate your understanding of GitHub, we will utilize code snippets from previous exercises, allowing you to focus on the process without concerns about the final outcome. The current notebook will serve as the main function in the project, and each participant is required to develop additional components and integrate their contributions into the main branch.


## Requirements

To work with functions created in other Jupyter notebooks, you need to install the package `nbimporter` from a shell with the following command:

<font color='grey'>pip install nbimporter</font>

`nbimporter` allows you to import Jupyter notebooks as modules. Once installed and imported, you can use a command like the following to import a function called *fibonacci*, stored in a notebook named *fibbo_func* in the same path as the present notebook:

<font color='green'>from</font> fibbo_func <font color='green'>import</font> fibonacci  <font color='green'>as</font> fibbo



In [None]:
!pip install nbimporter
!pip install pandas
!pip install seaborn
!pip install scikit-learn

Collecting nbimporter
  Downloading nbimporter-0.3.4-py3-none-any.whl (4.9 kB)
Installing collected packages: nbimporter
Successfully installed nbimporter-0.3.4


In [None]:
## Modify this cell by importing all the modules you need to solve the assignment. Observe that we are importing
## the library nbimporter. You will need it for calling functions created in other notebooks.
import nbimporter
import pandas as pd



## Exercises
As an illustration of the Git workflow, you will analyze the *Parkinson's* dataset, which has been examined in past assignments. Each team member has specific responsibilities that may be crucial for the progress of others, so make sure you organize your tasks accordingly. We have structured the analysis into modules to help you track your tasks, but feel free to deviate from this structure if you prefer.  
Please use Markdown cells to describe your workflow and explain the findings of your work.
Remember that you need both to modify this notebook and to create additional functions outside it. Your work will only be available to others once you commit and merge your changes.


In [None]:
# We will start by loading the parkinson dataset. The rest is up to you!
df = pd.read_csv('parkinsons.data',
                 dtype = { # indicate categorical variables
                     'status': 'category'})
df.head(5)

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


### 1. Cleaning and tidying the dataset

In [None]:
from scat_plt import scat_plt

In [None]:
# your code here

import itertools

import matplotlib.pyplot as plt

# Groups of related numeric columns to compare pairwise
column_groups = [
    ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)'],
    ['MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP'],
    ['MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'MDVP:APQ', 'Shimmer:DDA'],
]

for numeric_cols in column_groups:
    # Get the numeric columns of this group
    numeric_df = df[numeric_cols]

    # Create all pairwise combinations of columns
    column_combinations = list(itertools.combinations(numeric_df.columns, 2))

    # Generate a scatter plot for each combination
    for combination in column_combinations:
        scat_plt(df, numeric_df[combination[0]], numeric_df[combination[1]], 'status')
        plt.xlabel(combination[0])
        plt.ylabel(combination[1])
        plt.title(f'Scatter plot: {combination[0]} vs {combination[1]}')
        plt.show()

NameError: name 'scat_plt' is not defined

In [None]:
import numpy as np

# Create a new DataFrame with the list of columns we want to keep
columnas_a_mantener = [
    'MDVP:Fo(Hz)',
    'MDVP:Fhi(Hz)',
    'MDVP:Flo(Hz)',
    'MDVP:Jitter(Abs)',
    'MDVP:RAP',
    'MDVP:PPQ',
    'Jitter:DDP',
    'MDVP:Shimmer',
    'MDVP:Shimmer(dB)',
    'Shimmer:APQ3',
    'Shimmer:APQ5',
    'MDVP:APQ',
    'Shimmer:DDA'
]
df_filtrado = df[columnas_a_mantener]

# Obtain the absolute correlation matrix
correlation_matrix = df_filtrado.corr().abs()

# Get the positions of the column pairs with over 90% correlation
high_correlation_cols = np.where(correlation_matrix > 0.9)

# Keep each pair only once and avoid self-comparisons
unique_high_correlation_cols = set()
for i, j in zip(*high_correlation_cols):
    if i != j and (j, i) not in unique_high_correlation_cols:
        unique_high_correlation_cols.add((i, j))

# Print the column pairs and their correlations
for col1, col2 in unique_high_correlation_cols:
    correlation_value = correlation_matrix.iloc[col1, col2]
    print(f"Columns {df_filtrado.columns[col1]} and {df_filtrado.columns[col2]} have correlation > 90%: {correlation_value:.4f}")

Columns Shimmer:APQ3 and Shimmer:APQ5 have correlation > 90%: 0.9601
Columns Shimmer:APQ5 and MDVP:APQ have correlation > 90%: 0.9491
Columns MDVP:Jitter(Abs) and MDVP:RAP have correlation > 90%: 0.9229
Columns MDVP:Shimmer and Shimmer:APQ5 have correlation > 90%: 0.9828
Columns MDVP:RAP and Jitter:DDP have correlation > 90%: 1.0000
Columns MDVP:Shimmer(dB) and Shimmer:APQ5 have correlation > 90%: 0.9738
Columns Shimmer:APQ3 and Shimmer:DDA have correlation > 90%: 1.0000
Columns MDVP:Shimmer and Shimmer:APQ3 have correlation > 90%: 0.9876
Columns MDVP:Shimmer(dB) and Shimmer:DDA have correlation > 90%: 0.9632
Columns MDVP:RAP and MDVP:PPQ have correlation > 90%: 0.9573
Columns MDVP:Shimmer(dB) and Shimmer:APQ3 have correlation > 90%: 0.9632
Columns MDVP:Shimmer and Shimmer:DDA have correlation > 90%: 0.9876
Columns MDVP:PPQ and Jitter:DDP have correlation > 90%: 0.9573
Columns MDVP:Jitter(Abs) and Jitter:DDP have correlation > 90%: 0.9229
Columns MDVP:Shimmer(dB) and MDVP:APQ have corr

We can consider eliminating one column from each pair whose correlation exceeds 99%, so that the same information is not repeated.

MDVP:RAP and Jitter:DDP, as well as Shimmer:APQ3 and Shimmer:DDA, have a correlation of 1 (a straight line in the scatter plot). We will delete MDVP:RAP and Shimmer:APQ3.
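The pair filtering described above can also be written by masking the correlation matrix to its strict upper triangle, so that each pair appears once and self-correlations are excluded. The following is a minimal sketch on synthetic data. The helper name `find_redundant_columns`, the 0.99 threshold, and the toy columns are assumptions for illustration and are not part of the assignment code.

```python
import numpy as np
import pandas as pd


def find_redundant_columns(data, threshold=0.99):
    """Return (pairs, to_drop) for columns whose absolute correlation exceeds threshold.

    Hypothetical helper: for each highly correlated pair, the column that
    appears later in the frame is proposed for removal.
    """
    corr = data.corr().abs()
    # Keep only the strict upper triangle: each pair once, no diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    pairs = [
        (row, col, upper.loc[row, col])
        for row in upper.index
        for col in upper.columns
        if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
    ]
    to_drop = sorted({col for _, col, _ in pairs})
    return pairs, to_drop


# Synthetic example: 'b' is an exact multiple of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
toy = pd.DataFrame({"a": a, "b": 3 * a, "c": rng.normal(size=50)})

pairs, to_drop = find_redundant_columns(toy)
reduced = toy.drop(columns=to_drop)
print(pairs)
print(reduced.columns.tolist())
```

Which column of a pair to drop is a judgment call. This sketch drops the later one, whereas the notebook keeps Jitter:DDP and Shimmer:DDA. Either choice removes the duplicated information.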

**cleaned_df**

In [None]:
cleaned_df = df.drop(['Shimmer:APQ3','MDVP:RAP'], axis=1)
cleaned_df

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,0.00007,0.00554,0.01109,0.04374,0.426,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.400,148.650,113.819,0.00968,0.00008,0.00696,0.01394,0.06134,0.626,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.01050,0.00009,0.00781,0.01633,0.05233,0.482,...,0.08270,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,0.00009,0.00698,0.01505,0.05492,0.517,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00908,0.01966,0.06425,0.584,...,0.10470,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,phon_R01_S50_2,174.188,230.978,94.261,0.00459,0.00003,0.00259,0.00790,0.04087,0.405,...,0.07008,0.02764,19.517,0,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050
191,phon_R01_S50_3,209.516,253.017,89.488,0.00564,0.00003,0.00292,0.00994,0.02751,0.263,...,0.04812,0.01810,19.147,0,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895
192,phon_R01_S50_4,174.688,240.005,74.287,0.01360,0.00008,0.00564,0.01873,0.02308,0.256,...,0.03804,0.10715,17.883,0,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728
193,phon_R01_S50_5,198.764,396.961,74.904,0.00740,0.00004,0.00390,0.01109,0.02296,0.241,...,0.03794,0.07223,19.020,0,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306


### 2. Basic EDA based on plots and descriptive statistics

In [None]:
# your code here

### 3. Aggregating and transforming variables in the dataset

In [None]:
# your code

# CALL group_and_average

def group_and_average(df, gv):
    # Grouping and averaging the dataframe by the given variable
    av_df = df.groupby(gv).mean().reset_index()
    return av_df
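
As a quick check of what `group_and_average` returns, the sketch below applies the same grouping to a small made-up table. The column names and values are assumptions for illustration. `numeric_only=True` is added in this sketch so that text columns such as recording names are skipped instead of raising an error in recent pandas versions.

```python
import pandas as pd


def group_and_average(df, gv):
    # Group by the given variable and average the numeric columns.
    # numeric_only=True is an assumption added for this sketch.
    av_df = df.groupby(gv).mean(numeric_only=True).reset_index()
    return av_df


# Made-up recordings: two subjects with several recordings each.
toy = pd.DataFrame({
    "subject_id": ["S01", "S01", "S02", "S02", "S02"],
    "name": ["S01_1", "S01_2", "S02_1", "S02_2", "S02_3"],
    "value": [1.0, 3.0, 10.0, 20.0, 30.0],
    "status": [1, 1, 0, 0, 0],
})

averaged = group_and_average(toy, "subject_id")
print(averaged)
```

The result has one row per subject, and the text column `name` is dropped because it cannot be averaged.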

### 4. Differentiating between controls (healthy subjects) and patients

In [None]:
# your code

grouped_and_averaged = group_and_average(renamed_df, 'subject_id')
grouped_and_averaged['status'] = pd.cut(grouped_and_averaged['status'], bins=[0, 0.5,1], labels=[0,1], include_lowest=True)
z_scored = normalize(grouped_and_averaged, 0)
min_max = normalize(grouped_and_averaged, 1)


from sklearn.neighbors import KNeighborsClassifier


def evaluate_knn(dataset, scenario):
    # Drop all variables that are not predictors
    X = dataset.drop(['status', 'subject_id'], axis=1)
    y = dataset['status']
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X, y)
    # Note: accuracy is measured on the same data used for fitting
    accuracy = knn.score(X, y)
    print(f'The accuracy of {scenario} is: {accuracy:.2f}')
    return accuracy


Scenario = evaluate_knn(grouped_and_averaged, 'Scenario 1')
Scenario2 = evaluate_knn(z_scored, 'Scenario 2')
Scenario3 = evaluate_knn(min_max, 'Scenario 3')
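
The cell above scores each model on the same rows it was fitted on, which tends to overstate accuracy. One possible alternative is cross-validation with the scaling step placed inside a pipeline, so that scaling parameters are learned only from the training folds. The sketch below uses synthetic data. `StandardScaler` and `MinMaxScaler` stand in for the z-score and min-max scenarios, which is an assumption about what `normalize` does, since that function is defined outside this notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the per-subject table: two well-separated classes.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(30, 4)),
    rng.normal(loc=5.0, scale=1.0, size=(30, 4)),
])
y = np.array([0] * 30 + [1] * 30)

# One pipeline per scenario; the labels are illustrative only.
scenarios = {
    "raw": make_pipeline(KNeighborsClassifier(n_neighbors=3)),
    "z-score": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "min-max": make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3)),
}

results = {}
for label, model in scenarios.items():
    # 5-fold cross-validation: each fold is scored on rows held out from fitting.
    scores = cross_val_score(model, X, y, cv=5)
    results[label] = scores.mean()
    print(f"{label}: mean accuracy {results[label]:.2f}")
```

With the real data, `X` and `y` would come from the grouped table, for example `grouped_and_averaged.drop(['status', 'subject_id'], axis=1)` and `grouped_and_averaged['status']`.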