### Consider the winequality-red.csv dataset (available at the webpage), where the goal is to estimate the quality (sensory appreciation) of a wine based on physicochemical inputs.
### Using an 80-20 training-test split with a fixed seed (random_state=0), you are asked to learn MLP regressors to answer the following questions.
### Given their stochastic behavior, average the performance of each MLP over 10 runs (for reproducibility, consider seeding the MLPs with random_state ∈ {1, ..., 10}).

### 1) [3.5v] Learn an MLP regressor with 2 hidden layers of size 10, rectified linear unit activation on all nodes, and early stopping with 20% of the training data set aside for validation. All remaining parameters (e.g., loss, batch size, regularization term, solver) should be left at their defaults. Plot the distribution of the residues (in absolute value) using a histogram.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Reading the CSV file
df = pd.read_csv("winequality-red.csv", sep=";")

X = df.drop("quality", axis=1)  # Drop the "quality" column to get the features
y = df["quality"]  # Get the "quality" column as the target variable

X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.2, random_state=0)

# Absolute residues, pooled over the 10 seeded runs
res = []

for state in range(1, 11):
    mlp = MLPRegressor(hidden_layer_sizes=(10, 10), activation='relu', random_state=state, early_stopping=True, validation_fraction=0.2)
    mlp.fit(X_train, y_train)

    # Predict and calculate the residues
    pred = mlp.predict(X_test)
    residue = np.abs(y_test - pred)
    res.extend(residue)

# Plot the histogram of the residues using Seaborn
sns.histplot(res, bins = 30)
plt.title("Distribution of Residues (Absolute Value)")
plt.xlabel("Residues (Absolute Value)")
plt.ylabel("Frequency")
plt.show()


### 2) [1.5v] Since we are in the presence of an integer regression task, a recommended trick is to round and bound estimates. Assess the impact of these operations on the MAE of the MLP learnt in the previous question.

In [None]:
from sklearn.metrics import mean_absolute_error

# Initialize lists to store MAE before and after rounding and bounding
mae_original = []
mae_round = []

X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.2, random_state=0)

# Loop through random seeds from 1 to 10
for state in range(1, 11):
    
    mlp = MLPRegressor(hidden_layer_sizes=(10,10), activation = 'relu', random_state=state,
                   early_stopping = True, validation_fraction = 0.2)
    mlp.fit(X_train, y_train)
    
    y_pred = mlp.predict(X_test)

    # Calculate the MAE before rounding and bounding
    mae_original.append(mean_absolute_error(y_test, y_pred))
    
    
    # Round to the nearest integer, then bound to the quality range assumed here, [1, 10]
    round_pred = np.round(y_pred)
    y_rounded = np.clip(round_pred, 1, 10)
    
    # Calculate the MAE after rounding and bounding
    mae_round.append(mean_absolute_error(y_test, y_rounded))

# Calculate the average MAE before and after rounding and bounding
average_mae_original = np.mean(mae_original)
average_mae_round = np.mean(mae_round)

# Print the average MAE before and after
print(f"\nAverage MAE Before Round and Bound Estimates = {average_mae_original}")
print(f"Average MAE After Round and Bound Estimates = {average_mae_round}")
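
To make the effect of the two operations concrete, here is a small sketch on made-up numbers (the targets and predictions are hypothetical, not taken from the wine data). Rounding helps when a prediction is already closest to the true integer and hurts when it falls on the wrong side of a .5 boundary; clipping only matters for predictions outside the assumed [1, 10] range.

```python
import numpy as np

# Hypothetical integer targets and raw regressor outputs (illustrative values only)
y_true = np.array([5, 6, 5, 7, 3])
y_raw = np.array([5.3, 5.8, 5.6, 6.4, 0.2])

# Round to the nearest integer, then bound to the assumed scale [1, 10]
y_fixed = np.clip(np.round(y_raw), 1, 10)

# Per-sample absolute errors before and after the two operations
abs_err_raw = np.abs(y_true - y_raw)
abs_err_fixed = np.abs(y_true - y_fixed)

mae_raw = abs_err_raw.mean()
mae_fixed = abs_err_fixed.mean()

print("per-sample error before:", abs_err_raw)
print("per-sample error after: ", abs_err_fixed)
print(f"MAE before = {mae_raw:.2f}, MAE after = {mae_fixed:.2f}")
```

In this toy case the first two samples improve, the third and fourth get worse, and the last one benefits from the lower bound, so the net change in MAE depends on how the raw predictions are distributed around the integers.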

### 3) [1.5v] Similarly, assess the impact on RMSE of replacing early stopping with a well-defined number of iterations in {20,50,100,200} (where one iteration corresponds to a batch).

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# MLP with Early Stopping
rmse_original = []  

# Arrays for each max_iterations
rmse_20 = []
rmse_50 = []
rmse_100 = []
rmse_200 = []

rmse_arrays = [rmse_20, rmse_50, rmse_100, rmse_200]


iterations = [20, 50, 100, 200]

X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.2, random_state=0)

# Early stopping MLP for RMSE comparison in exercise 4
for random_state in range(1, 11):
    
    original_mlp = MLPRegressor(hidden_layer_sizes=(10,10), activation='relu', random_state=random_state,
                                early_stopping=True, validation_fraction=0.2)
    original_mlp.fit(X_train, y_train)

    y_pred_original = original_mlp.predict(X_test)

    # RMSE
    rmse_original.append(sqrt(mean_squared_error(y_test, y_pred_original)))


# Fixed number of iterations (no early stopping)
# Note: with the default 'adam' solver, max_iter counts epochs (passes over the training data), not single batches
for num_iterations, rmse_array in zip(iterations, rmse_arrays):
    for random_state in range(1, 11):

        mlp_iterations = MLPRegressor(hidden_layer_sizes=(10,10), activation='relu',
                                      random_state=random_state, max_iter=num_iterations)
        mlp_iterations.fit(X_train, y_train)

        y_pred_iterations = mlp_iterations.predict(X_test)

        # Compute the RMSE and store it in the list for this iteration budget
        rmse_array.append(sqrt(mean_squared_error(y_test, y_pred_iterations)))

# Pair each iteration budget with its mean RMSE over the 10 runs
means = [(n_iter, np.mean(rmse_array)) for n_iter, rmse_array in zip(iterations, rmse_arrays)]

# Print the mean RMSE for each iteration budget
for n_iter, mean_value in means:
    print(f"Mean RMSE of {n_iter} iterations: {mean_value}")


mean_rmse_original = np.mean(rmse_original)

print(f"\nMean RMSE with Early Stopping = {mean_rmse_original}")
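
The question counts one iteration as one batch, whereas scikit-learn's `max_iter` counts epochs when a stochastic solver such as the default `adam` is used. The sketch below converts between the two. It assumes the 1599-row red wine file, so that the 80-20 split leaves 1279 training rows, and the default `batch_size='auto'`, i.e. min(200, n_samples). The helper name `batches_per_epoch` is hypothetical.

```python
import math

def batches_per_epoch(n_samples: int, batch_size: int = 200) -> int:
    """Number of mini-batches in one pass over the data (the last one may be smaller)."""
    batch_size = min(batch_size, n_samples)  # mirrors batch_size='auto' = min(200, n_samples)
    return math.ceil(n_samples / batch_size)

n_train = 1279  # assumed: 1599 rows minus a 20% test split of 320 rows
per_epoch = batches_per_epoch(n_train)

# Upper bound on gradient updates for each epoch budget (training may stop sooner on convergence)
for max_iter in (20, 50, 100, 200):
    print(f"max_iter={max_iter:>3} epochs -> up to {max_iter * per_epoch} batch updates")
```

Under these assumptions each epoch holds 7 batches, so `max_iter=20` already corresponds to roughly 140 gradient updates rather than 20 single batches.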

### 4) [1.5v] Critically comment on the results obtained in the previous question, hypothesizing at least one reason why early stopping favors and/or worsens performance.

In [None]:
# Number of iterations run by the early-stopping model
# (original_mlp is the last one trained above, i.e. random_state=10); used to compare the models
print(original_mlp.n_iter_)

Early stopping halts training when performance on the validation set stops improving. This makes it a good way to avoid both underfitting and overfitting, since the model trains up to the point where the error on unseen data starts to grow. With a fixed maximum number of iterations, the RMSE is higher than with early stopping for every model except the one with a maximum of 200 iterations. We can therefore conclude that the other models (20, 50 and 100 iterations) underfit: they did not have enough iterations to train adequately, which does not happen with early stopping.
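
The stopping rule described above can be sketched in a few lines. This is a simplified illustration, not scikit-learn's implementation: it assumes a list of per-epoch validation scores (higher is better) and stops once the score has failed to beat the best value so far by more than `tol` for `patience` consecutive epochs. The function name, the scores, and the parameter values are made up for the example.

```python
def early_stopping_epoch(val_scores, patience=10, tol=1e-4):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            stale = 0
        else:
            stale += 1
        best = max(best, score)
        if stale >= patience:
            return epoch
    return len(val_scores)  # budget exhausted before the rule triggered

# Hypothetical validation scores: a plateau followed by a late improvement
scores = [0.10, 0.20, 0.28, 0.30, 0.30, 0.29, 0.30, 0.31, 0.32]

print(early_stopping_epoch(scores, patience=3))   # rule triggers on the plateau
print(early_stopping_epoch(scores, patience=10))  # runs through all epochs
```

With a patience of 3 the rule fires at epoch 7, even though the later scores would have been higher, which is one way a noisy validation signal can stop training too early.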

Early stopping splits the training data into training and validation sets. One of its drawbacks is sensitivity to the choice of validation set: if the validation set is too small or not representative of the test set, the model may stop too early or too late, leading to underfitting or overfitting. Since both the early-stopping model and the model with max_iter=200 stop at 200 iterations (for early stopping, because 200 is the default max_iter), and the RMSE of the early-stopping model is higher, we conclude that the validation set is probably not very representative of the test set, which leads to a larger error. A further plausible factor is that the early-stopping model fits on only 80% of the training data, since the remaining 20% is held out for validation.
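
To illustrate why the choice of validation set matters, the sketch below scores a fixed set of prediction errors on many random validation subsets and reports how much the validation RMSE moves from split to split. The errors are synthetic (drawn from a seeded Gaussian), and the size 1279 and the 20% fraction only mirror the setup assumed in this notebook; nothing here is measured on the wine data. The helper names are hypothetical.

```python
import math
import random

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def validation_rmse_range(errors, fraction, n_splits=50, seed=0):
    """Min and max RMSE over random validation subsets of the given fraction."""
    rng = random.Random(seed)
    k = max(1, int(len(errors) * fraction))
    values = [rmse(rng.sample(errors, k)) for _ in range(n_splits)]
    return min(values), max(values)

# Synthetic residuals standing in for a trained model's errors (assumption)
data_rng = random.Random(42)
errors = [data_rng.gauss(0.0, 0.7) for _ in range(1279)]

for fraction in (0.02, 0.2):
    low, high = validation_rmse_range(errors, fraction)
    print(f"validation fraction {fraction:.2f}: RMSE between {low:.3f} and {high:.3f}")
```

A narrower range is generally expected for the larger validation fraction, although the exact numbers depend on the seeds chosen here.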