Collaborative coding using GitHub
===========

Alexandre Perera Luna, Mónica Rojas Martínez

December 15th 2023


# Goal

The objective of this assignment is to construct a project through collaborative coding, showcasing an Exploratory Data Analysis (EDA) and a classification. To facilitate your understanding of GitHub, we will utilize code snippets from previous exercises, allowing you to focus on the process without concerns about the final outcome. The current notebook will serve as the main function in the project, and each participant is required to develop additional components and integrate their contributions into the main branch.


## Requirements

To work with functions created in other Jupyter notebooks, you need to install the package `nbimporter` by running the following command in a shell:

<font color='grey'>pip install nbimporter</font>

`nbimporter` allows you to import Jupyter notebooks as modules. Once installed and imported, you can use a command like the following to import a function called *fibbonaci* that is stored in a notebook *fibbo_func* in the same path as the present notebook:

<font color='green'>from</font> fibbo_func <font color='green'>import</font> fibbonaci  <font color='green'>as</font> fibbo



In [None]:
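## Modify this cell to import all the modules you need to solve the assignment. Note that we are importing
## the library nbimporter. You will need it to call functions created in other notebooks.
import nbimporter
import pandas as pd
import matplotlib.pyplot as plt  # used for the boxplots in section 2
import seaborn as sns  # used for the boxplots in section 2
from pandas.api.types import is_numeric_dtype  # used to select numeric columns in section 2
from sklearn.preprocessing import LabelEncoder  # used to encode the 'status' column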



In [None]:
# Here is an example of invoking the Fibonacci function, which should be located in the same directory as the main notebook:
from fibbo_func import fibbonaci as fibbo
fibbo(24)

46368
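
The notebook *fibbo_func* itself is not shown here. As a rough sketch of what it might contain, the cell below defines an iterative `fibbonaci` function. It assumes the convention F(0) = 0 and F(1) = 1, which is consistent with the output 46368 for an input of 24; the actual implementation in the course material may differ.

```python
def fibbonaci(n):
    """Return the n-th Fibonacci number, with F(0) = 0 and F(1) = 1 (assumed convention)."""
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    previous, current = 0, 1
    # After k iterations, `previous` holds F(k)
    for _ in range(n):
        previous, current = current, previous + current
    return previous


print(fibbonaci(24))
```

Saving a function like this in a notebook named *fibbo_func* next to the main notebook is what makes the `from fibbo_func import ...` statement work once `nbimporter` has been imported.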

## Exercises
As an illustration of a Git workflow, you will analyze the *Parkinson's* dataset, which was examined in past assignments. Each team member has specific responsibilities that may be crucial for the progress of others, so make sure you all organize your tasks accordingly. We have structured the analysis into modules to help you track your tasks, but feel free to deviate from this structure if you prefer.  
Please use Markdown cells to describe your workflow and explain the findings of your work.
Remember that you need both to modify this notebook and to create additional functions in separate notebooks. Your work becomes available to others only after you merge your changes into the main branch.


In [None]:
# We will start by loading the Parkinson's dataset. The rest is up to you!
df = pd.read_csv('parkinsons.data',
                 dtype = { # indicate categorical variables
                     'status': 'category'})
df.head(5)

| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 |


### 1. Cleaning and tidying the dataset

In [None]:
# your code here
def renamevars(df, dict_names):
    """Return a copy of df with its columns renamed according to dict_names."""
    renamed_df = df.rename(columns=dict_names)
    return renamed_df

dict_names = {'MDVP:Fo(Hz)':'avFF',
              'MDVP:Fhi(Hz)':'maxFF',
              'MDVP:Flo(Hz)':'minFF',
              'MDVP:Jitter(Abs)':'absJitter' ,
              'MDVP:PPQ': 'ppq',
              'Jitter:DDP': 'ddp',
              'MDVP:Shimmer' : 'lShimer',
              'MDVP:Shimmer(dB)': 'dbShimer',
              'Shimmer:APQ5': 'apq5',
              'MDVP:APQ':'apq',
              'Shimmer:DDA':'dda'}

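# cleaned_df is not defined anywhere in this notebook. As a placeholder (assumption),
# start from a copy of the loaded data until a cleaning step is added.
cleaned_df = df.copy()
renamed_df = renamevars(cleaned_df, dict_names)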
renamed_df

### 2. Basic EDA based on plots and descriptive statistics

In [None]:
# your code here
def obtain_info(renamed_df):
    # 1. Calculate the Number of Observations
    num_observations = len(renamed_df)
    print("Number of Observations:", num_observations)
    # Count the observations for each value of the 'status' column
    status_counts = renamed_df['status'].value_counts().sort_index()
    # Print the counts per status. Iterating over the counts avoids a key mismatch,
    # because 'status' may hold the strings '0' and '1' when read with dtype 'category'
    for status_value, count in status_counts.items():
        print(f"Status {status_value}:", count)
    print(" ")

    # 2. Examine Differences between Controls and Patients
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
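    # Restrict the summary to numeric columns: string columns such as 'name' cannot be averaged
    numeric_df = renamed_df.select_dtypes(include='number').drop(columns='status', errors='ignore')
    grouped_data = numeric_df.groupby(renamed_df['status'], sort=False, observed=True)
    summary_stats = grouped_data.agg(['mean', 'std', 'max', 'min'])
    print("Summary Statistics:")
    # Print one block per variable (the first level of the column MultiIndex)
    for variable in summary_stats.columns.get_level_values(0).unique():
        print(f"\nVariable: {variable}\n")
        print(summary_stats[variable])
        print()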
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')


    # 4. Identify Outliers and Decide on Treatment
    # Boxplot for each variable, split by 'status'. No label encoding is needed here:
    # seaborn accepts a categorical 'status' column directly as the x variable.
    # Filter numeric columns excluding 'status'
    numeric_columns = [column for column in renamed_df.columns if column != 'status' and is_numeric_dtype(renamed_df[column])]
    # Calculate the number of rows needed
    num_columns = len(numeric_columns)
    num_rows = (num_columns + 2) // 3
    # Adjust subplots layout (squeeze=False keeps axes two-dimensional even with a single row)
    fig, axes = plt.subplots(num_rows, 3, figsize=(15, 5 * num_rows), squeeze=False)
    # Remove empty subplots if there are fewer than 3 columns
    for i in range(num_columns, num_rows * 3):
        fig.delaxes(axes.flatten()[i])
    # Iterate over numeric columns and create boxplots
    for i, column in enumerate(numeric_columns):
        row_index = i // 3
        col_index = i % 3
        ax = axes[row_index, col_index]
        sns.boxplot(x='status', y=column, data=renamed_df, ax=ax)
        ax.set_title(f'{column} vs status')

    plt.tight_layout()
    plt.show()

# obtain_info prints the statistics and shows the plots; it does not return a value
obtain_info(renamed_df)

In [None]:
def remove_outliers(group):
    Q1 = group.quantile(0.25)
    Q3 = group.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return group[(group >= lower_bound) & (group <= upper_bound)]

# Assuming 'renamed_df' is your DataFrame
#numeric_columns = renamed_df.select_dtypes(include=['number']).columns

# Apply the remove_outliers function to numeric columns based on 'status' groups
#renamed_df[numeric_columns] = renamed_df.groupby('status')[numeric_columns].apply(remove_outliers).reset_index(drop=True)

# Display the resulting DataFrame
#renamed_df
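
The commented-out lines above combine `groupby(...).apply(...)` with `reset_index(drop=True)`. Because that call discards the original row labels, the assignment may pair values with the wrong rows when the data are not already ordered by `status`. One possible alternative, sketched below on made-up toy data, replaces outliers with `NaN` instead of dropping them, so the result keeps the original index. The helper name `mask_outliers` and the toy values are assumptions for illustration only, not part of the assignment.

```python
import pandas as pd


def mask_outliers(values):
    """Hypothetical helper: replace values outside the 1.5 * IQR fences with NaN."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # where() keeps the original length and index, so rows stay aligned
    return values.where((values >= lower_bound) & (values <= upper_bound))


# Toy data, made up for illustration: the status 0 group contains one extreme value
toy_df = pd.DataFrame({
    'status': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    'avFF': [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 11.0, 12.0, 13.0, 14.0],
})

filtered_df = toy_df.copy()
for column in ['avFF']:
    # transform() evaluates the fences separately within each status group
    filtered_df[column] = toy_df.groupby('status')[column].transform(mask_outliers)

print(filtered_df)
```

Whether the masked rows are then dropped with `dropna()`, imputed, or kept is part of the outlier treatment you decide on as a team.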

### 3. Aggregating and transforming variables in the dataset

In [None]:
# your code here

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
renamed_df['status'] = label_encoder.fit_transform(renamed_df['status'])
# Extract the subject identifier ('Sxx') from the 'name' column
renamed_df['subject_id'] = renamed_df['name'].str.extract(r'(S\d+)', expand=False)

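# Group by subject and compute the mean of each variable.
# Note: group_and_average is not defined in this notebook; presumably it is
# imported from a teammate's notebook (via nbimporter) before this cell runs.
grouped_and_averaged = group_and_average(renamed_df, 'subject_id')

# Averaging turns 'status' into a float; map it back to the labels 0 and 1
grouped_and_averaged['status'] = pd.cut(grouped_and_averaged['status'], bins=[0, 0.5,1], labels=[0,1], include_lowest=True)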
grouped_and_averaged
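
The helper `group_and_average` is called above but its definition is not part of this notebook. The following is a minimal sketch of what such a function could look like, assuming it averages every numeric column within each group; the real helper written by your team may behave differently. The toy rows are made up and only mimic the `name` pattern of the dataset.

```python
import pandas as pd


def group_and_average(df, group_column):
    """Hypothetical helper: average every numeric column within each group."""
    # numeric_only=True skips string columns such as 'name'
    return df.groupby(group_column).mean(numeric_only=True).reset_index()


# Toy data, made up for illustration: two recordings for each of two subjects
toy_df = pd.DataFrame({
    'name': ['phon_R01_S01_1', 'phon_R01_S01_2', 'phon_R01_S02_1', 'phon_R01_S02_2'],
    'avFF': [100.0, 120.0, 200.0, 220.0],
    'status': [1, 1, 0, 0],
})

# Extract the subject identifier ('Sxx') from the recording name
toy_df['subject_id'] = toy_df['name'].str.extract(r'(S\d+)', expand=False)

averaged = group_and_average(toy_df, 'subject_id')
print(averaged)
```

Averaging per subject leaves one row per person, which avoids treating repeated recordings of the same subject as independent observations.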

### 4. Differentiating between controls (healthy subjects) and patients

In [None]:
# your code here