# Project 3: Parkinson's Dataset

Project 2
- Enrique Almazán Sánchez
- Judith Briz Galera


## Introduction

## Objectives

## Previous concerns

## Dependencies (Required Libraries)

In the following cell, we import all the necessary libraries for the project.

- **Pandas**: A Python library used for data analysis and manipulation. It provides flexible and efficient data structures, such as DataFrames, for working with tabular datasets. Pandas offers a wide range of functions and methods for cleaning, transforming, and exploring data, making the data preparation process easier before applying machine learning algorithms.

- **NumPy**: A fundamental library for scientific computing in Python. It provides a data structure called a multidimensional array (ndarray) that allows for efficient operations on data arrays. NumPy is widely used in numerical analysis and data processing, providing functionality for mathematical operations, array manipulation, and statistical calculations.

- **scipy.stats**: A module within SciPy that focuses on statistical functions and probability distributions. It offers tools for working with probability distributions, statistical tests, random variables, and descriptive statistics. It is widely used for statistical analysis and hypothesis testing in scientific research and data analysis.

- **Matplotlib**: A data visualization library in Python. It provides a wide range of functions and methods for creating static plots, such as line plots, bar charts, scatter plots, and contour plots. Matplotlib is highly customizable and allows for adding labels, titles, legends, and other annotations to plots. It is a popular tool for data visualization in data analysis and result presentation.

- **Plotly**: An interactive data visualization library for Python. It enables the creation of interactive and dynamic charts, including scatter plots, line charts, bar charts, and surface plots. Plotly offers a web-based user interface for exploring and manipulating charts, making it easy to create interactive visualizations and present data.

- **Seaborn**: A data visualization library based on Matplotlib. It provides a high-level interface for creating attractive and concise statistical plots. Seaborn simplifies the creation of distribution plots, regression plots, correlation plots, and other common chart types in data analysis. Additionally, Seaborn offers predefined styles and color palettes that enhance the appearance of charts.

- **Scikit-learn (sklearn)**: An open-source machine learning library for Python. It provides a wide range of algorithms and tools for performing machine learning tasks, such as classification, regression, clustering, and feature selection. Scikit-learn stands out for its ease of use and focus on efficiency and scalability. In addition to algorithms, the library also offers utilities for model evaluation, cross-validation, and data preprocessing.

### 1.1. Imports

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

# Visualizing
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Feature Selection
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

# Preprocessing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Figures of merit
# metrics for linear regression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
# metrics for logistic regression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, roc_curve, confusion_matrix, classification_report

# Models to be implemented
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge, ElasticNet

# Cross-validation
from sklearn.model_selection import GridSearchCV

from sklearn.multiclass import OneVsRestClassifier

from sklearn.model_selection import cross_val_score, KFold

### 1.2. Dataset import

First, we import the clean dataset obtained in Project 1, which was already split into train and test sets but has not yet been normalized.

In [2]:
# Import train set
train = pd.read_csv('train_data_bcl.csv')
# Import test set
test = pd.read_csv('test_data_bcl.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'train_data_bcl.csv'
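The error above means the CSV files were not found in the notebook's working directory. As a minimal sketch (the helper name `load_split` and its `data_dir` argument are our own assumptions, not part of the original notebook), the files can be checked before reading so that a missing file is reported clearly:

```python
from pathlib import Path

import pandas as pd


def load_split(data_dir, train_name="train_data_bcl.csv", test_name="test_data_bcl.csv"):
    """Read the train/test CSVs from data_dir, failing early with a clear message."""
    data_dir = Path(data_dir)
    paths = [data_dir / train_name, data_dir / test_name]
    # Collect every missing file so the message lists all of them at once
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing dataset file(s): {missing}")
    return pd.read_csv(paths[0]), pd.read_csv(paths[1])
```

Calling `train, test = load_split(".")` would then reproduce the cell above when both files sit next to the notebook.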

Next, we split each set into its features and its target variable.

In [None]:
# Divide train set in features and target variable
X_train, y_train = (train.drop('Fraction of unvoiced frames', axis=1), train['Fraction of unvoiced frames'])
# Divide test set in features and target variable
X_test, y_test = (test.drop('Fraction of unvoiced frames', axis=1), test['Fraction of unvoiced frames'])
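The same split can be illustrated on a small toy table. This is only a sketch: the helper `split_features_target` and the columns `feature_a` and `feature_b` are hypothetical, and only the target name comes from the cell above.

```python
import pandas as pd

TARGET = "Fraction of unvoiced frames"


def split_features_target(df, target=TARGET):
    """Return (features, target) and fail clearly if the target column is absent."""
    if target not in df.columns:
        raise KeyError(f"Target column {target!r} not found")
    return df.drop(columns=target), df[target]


# Made-up values, used only to show the resulting shapes
toy = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.3],
    "feature_b": [10, 20, 30],
    TARGET: [0.0, 0.5, 0.25],
})
X_toy, y_toy = split_features_target(toy)
print(X_toy.shape, y_toy.shape)
```

Wrapping the split in one function avoids repeating the column name for the train and test sets.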

Finally, we can display the first rows of each set.

In [None]:
# Show features for training set
X_train.head(5)

In [None]:
# Show features for test set
X_test.head(5)

In [None]:
# Show target variable for train set
y_train.head(5)

In [None]:
# Show target variable for test set
y_test.head(5)

### 1.3. Normalization

A function is used to apply different types of normalization, so that we can test which of them gives the best results in the subsequent training, validation and evaluation.
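To make the three options concrete, the sketch below applies each scaler to a single toy column containing one outlier. The values are made up for illustration and are not taken from the Parkinson's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy single-feature column with one outlier (illustrative values only)
col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "ss": StandardScaler().fit_transform(col),  # (x - mean) / standard deviation
    "mm": MinMaxScaler().fit_transform(col),    # (x - min) / (max - min)
    "rs": RobustScaler().fit_transform(col),    # (x - median) / interquartile range
}

for name, values in scaled.items():
    print(name, np.round(values.ravel(), 3))
```

Because the median and interquartile range are barely affected by the outlier, `RobustScaler` keeps the four ordinary values well spread, whereas `MinMaxScaler` compresses them towards 0.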

In [None]:
def normalizing(tp, X_train, X_test, y_train, y_test):
    """
    Normalizes the input datasets using the specified scaler type.

    Parameters:
    - tp (str): Type of scaler to use. Options: "ss" for StandardScaler, "mm" for MinMaxScaler, "rs" for RobustScaler.
    - X_train (array-like): Training feature dataset.
    - X_test (array-like): Testing feature dataset.
    - y_train (array-like): Training target variable.
    - y_test (array-like): Testing target variable.

    Returns:
    Tuple of normalized feature datasets and target variables: (X_train_norm, y_train, X_test_norm, y_test).
    """

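    if tp == "ss":
        # Standardize features to zero mean and unit variance
        scaler = StandardScaler()
    elif tp == "mm":
        # Rescale features to the [0, 1] range
        scaler = MinMaxScaler()
    elif tp == "rs":
        # Center on the median and scale by the interquartile range (robust to outliers)
        scaler = RobustScaler()
    else:
        # Fail early instead of reaching an undefined `scaler` below
        raise ValueError(f'Unknown scaler type {tp!r}; expected "ss", "mm" or "rs".')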

    # Fitting the scaler with the X_train subset and normalizing it
    X_train_norm = scaler.fit_transform(X_train)
    # Normalizing the X_test subset with respect to the values taken from X_train, as the scaler was trained with it
    X_test_norm = scaler.transform(X_test)
    # Shuffle the normalized sets
    X_train_norm, y_train = shuffle(X_train_norm, y_train, random_state=20)
    X_test_norm, y_test = shuffle(X_test_norm, y_test, random_state=20)

    return X_train_norm, y_train, X_test_norm, y_test
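One possible way to compare the three scalers is sketched below. It follows the same pattern as the function above (fit on the training set, then transform the test set), but everything else is an assumption made for illustration: the data are synthetic, `Ridge` stands in for whichever model is trained later, and R² is used as the figure of merit.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic regression data with features on very different scales (assumption)
rng = np.random.default_rng(20)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([2.0, 0.05, 100.0]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=20)

scalers = {"ss": StandardScaler(), "mm": MinMaxScaler(), "rs": RobustScaler()}
scores = {}
for name, scaler in scalers.items():
    # Fit on the training set only, then reuse its statistics on the test set
    X_tr_norm = scaler.fit_transform(X_tr)
    X_te_norm = scaler.transform(X_te)
    # Ridge is a stand-in model; its penalty makes it sensitive to feature scale
    model = Ridge(alpha=1.0).fit(X_tr_norm, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te_norm))

for name, score in scores.items():
    print(f"{name}: R2 = {score:.3f}")
```

A regularized model is used here because the predictions of an unpenalized linear regression do not change under these rescalings, so it would not distinguish between the scalers.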