### Consider the winequality-red.csv dataset (available at the webpage) where the goal is to estimate the quality (sensory appreciation) of a wine based on physicochemical inputs. 
### Using an 80-20 training-test split with a fixed seed (random_state=0), you are asked to learn MLP regressors to answer the following questions.
### Given their stochastic behavior, average the performance of each MLP over 10 runs (for reproducibility, consider seeding the MLPs with random_state ∈ {1..10}).

### 1) [3.5v] Learn an MLP regressor with 2 hidden layers of size 10, rectified linear unit activation on all nodes, and early stopping with 20% of the training data set aside for validation. All remaining parameters (e.g., loss, batch size, regularization term, solver) should be set as default. Plot the distribution of the residues (in absolute value) using a histogram.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

    
# Reading the CSV file
df = pd.read_csv("winequality-red.csv", delimiter=";")

X = df.drop("quality", axis=1)  # Drop the "quality" column to get the features
y = df["quality"]  # Get the "quality" column as the target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
# Absolute residues on the test set, one array per run (seed)
res = []

for state in range(1, 11):
    mlp = MLPRegressor(hidden_layer_sizes=(10,10), activation = 'relu', random_state=state,\
                   early_stopping = True, validation_fraction = 0.2)
    mlp.fit(X_train, y_train)
    
    # Predict and calculate the residues
    pred = mlp.predict(X_test)

    res.append(np.abs(y_test - pred))
    
# One group of bars per seed, so the 10 runs can be compared bin by bin
plt.hist(res, bins=20, alpha=0.5, label=[f'Seed {i}' for i in range(1, 11)])
plt.xlabel('Absolute residue')
plt.ylabel('Frequency')
plt.title('Distribution of absolute residues (10 runs)')
plt.legend()
plt.show()

##
## Possible changes:
##  1- Make regression before X_train (as used in the documentation)
##  2- max_iter inside MLPRegressor (removed because I don't remember why I put it there)
##

### 2) [1.5v] Since we are in the presence of an integer regression task, a recommended trick is to round and bound estimates. Assess the impact of these operations on the MAE of the MLP learnt in the previous question.

In [None]:
from sklearn.metrics import mean_absolute_error

# Initialize lists to store MAE before and after rounding and bounding
mae_original = []
mae_round = []

# Define lower and upper bounds for estimates
lower_bound = 0  # Adjust as needed
upper_bound = 10  # Adjust as needed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Loop through random seeds from 1 to 10
for state in range(1, 11):
    # Create and train the MLP regressor
    mlp = MLPRegressor(hidden_layer_sizes=(10,10), activation = 'relu', random_state=state,\
                   early_stopping = True, validation_fraction = 0.2)
    mlp.fit(X_train, y_train)
    
    # Predict on the test set
    pred = mlp.predict(X_test)

    # Calculate the MAE before rounding and bounding
    mae_original.append(mean_absolute_error(y_test, pred))

    # Round to the nearest integer, then bound the estimates to [lower_bound, upper_bound]
    rounded_predictions = np.round(pred)
    y_rounded = np.clip(rounded_predictions, lower_bound, upper_bound)

    # Calculate the MAE after rounding and bounding
    mae_round.append(mean_absolute_error(y_test, y_rounded))

# Print the MAE before and after rounding and bounding for each run
for i in range(10):
    print(f"State {i+1}: MAE Before = {mae_original[i]}, MAE After = {mae_round[i]}")

# Calculate the average MAE before and after rounding and bounding
average_mae_original = np.mean(mae_original)
average_mae_round = np.mean(mae_round)

# Print the average MAE before and after
print(f"\nAverage MAE Before Rounding and Bounding Estimates = {average_mae_original}")
print(f"Average MAE After Rounding and Bounding Estimates = {average_mae_round}")
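
To see why rounding and bounding can move the MAE in either direction for a single estimate, the toy sketch below applies both operations to a handful of made-up predictions. The target values, raw predictions, and the bounds 3 and 8 are hypothetical and chosen only for illustration; they are not results from the wine data.

```python
import numpy as np

# Hypothetical integer targets and raw (continuous) predictions
y_true = np.array([5, 6, 5, 7, 3, 6])
y_raw = np.array([5.3, 5.8, 4.6, 7.4, 2.2, 5.4])

def mae(y, y_hat):
    """Mean absolute error between two arrays."""
    return float(np.mean(np.abs(y - y_hat)))

# Round to the nearest integer, then clip to illustrative bounds [3, 8]
y_fixed = np.clip(np.round(y_raw), 3, 8)

mae_raw = mae(y_true, y_raw)
mae_fixed = mae(y_true, y_fixed)

print(f"MAE before: {mae_raw:.4f}, MAE after: {mae_fixed:.4f}")
```

In this toy case most estimates snap onto the correct integer, which lowers the MAE, while the last one (5.4 rounded to 5 against a target of 6) gets worse. The net effect on real data depends on how often each situation occurs.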


### 3) [1.5v] Similarly, assess the impact on RMSE of replacing early stopping with a well-defined number of iterations in {20,50,100,200} (where one iteration corresponds to a batch).

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# RMSE of the MLP with early stopping (one value per seed)
rmse_original = []

# RMSE lists for each max_iter value
rmse_20 = []
rmse_50 = []
rmse_100 = []
rmse_200 = []

rmse_arrays = [rmse_20, rmse_50, rmse_100, rmse_200]


iterations = [20, 50, 100, 200]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Early stopping MLP
for random_state in range(1, 11):

    original_mlp = MLPRegressor(hidden_layer_sizes=(10,10), activation='relu', early_stopping=True,\
                                validation_fraction=0.2, random_state=random_state)
    original_mlp.fit(X_train, y_train)

    y_pred_original = original_mlp.predict(X_test)

    # RMSE
    rmse_original.append(sqrt(mean_squared_error(y_test, y_pred_original)))

    
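# Fixed number of iterations, no early stopping.
# Note: with the default 'adam' solver, scikit-learn's max_iter counts epochs, not individual batches.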
for num_iterations in iterations:
    for random_state in range(1, 11):

        mlp_iterations = MLPRegressor(hidden_layer_sizes=(10,10), activation='relu', \
                                    max_iter=num_iterations, random_state=random_state)
        mlp_iterations.fit(X_train, y_train)


        y_pred_iterations = mlp_iterations.predict(X_test)

        # Compute the RMSE and store it in the list for this iteration budget
        rmse = sqrt(mean_squared_error(y_test, y_pred_iterations))
        rmse_arrays[iterations.index(num_iterations)].append(rmse)

means = []

# Iterate over the list that holds the RMSE lists for all iteration budgets
for i, rmse_array in enumerate(rmse_arrays):
    mean_value = np.mean(rmse_array)
    means.append((iterations[i], mean_value))

# Prints
for mean_rmse in means:
    print(f"Mean RMSE of {mean_rmse[0]} iterations: {mean_rmse[1]}")


mean_rmse_original = np.mean(rmse_original)

print(f"\nMean RMSE with Early Stopping = {mean_rmse_original}")
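
The question states that one iteration corresponds to a batch, whereas `max_iter` counts epochs for scikit-learn's stochastic solvers. One possible reading, sketched below, is to make a single batch span the whole training set so that each epoch is exactly one gradient update. This is an assumption about the intended setup, not a confirmed requirement, and the data here is a synthetic stand-in generated with `make_regression` rather than the wine dataset.

```python
import math
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data (assumption: NOT the wine dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=11,
                                 noise=5.0, random_state=0)
n_samples = X_demo.shape[0]

# One batch covers every training sample, so one epoch is one gradient update
full_batch = n_samples
batches_per_epoch = math.ceil(n_samples / full_batch)

mlp_full_batch = MLPRegressor(hidden_layer_sizes=(10, 10), activation='relu',
                              batch_size=full_batch, max_iter=20,
                              random_state=1)
# A ConvergenceWarning may be emitted with so few iterations
mlp_full_batch.fit(X_demo, y_demo)

print(f"Batches per epoch: {batches_per_epoch}")
print(f"Epochs run: {mlp_full_batch.n_iter_}")
```

With the default `batch_size='auto'` (the smaller of 200 and the number of samples), an epoch on a larger training set contains several batches, so the same `max_iter` corresponds to more gradient updates.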

### 4) [1.5v] Critically comment on the results obtained in the previous question, hypothesizing at least one reason why early stopping favors and/or worsens performance.
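
One way to ground such a discussion is to check how many epochs the early-stopped model actually ran and how its validation score evolved. The sketch below reads the fitted attributes `n_iter_`, `validation_scores_`, and `best_validation_score_` of an `MLPRegressor`. It uses synthetic stand-in data from `make_regression` (an assumption, so that the snippet runs on its own), and its numbers say nothing about the wine dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data (assumption: NOT the wine dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=11,
                                 noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=0)

# Same configuration as the early-stopping MLP used in the questions above
mlp_es = MLPRegressor(hidden_layer_sizes=(10, 10), activation='relu',
                      early_stopping=True, validation_fraction=0.2,
                      random_state=1)
mlp_es.fit(Xtr, ytr)

# Number of epochs actually run (at most the default max_iter of 200)
print(f"Epochs run: {mlp_es.n_iter_}")
# Validation score (R^2 for a regressor) recorded after each epoch
print(f"Validation scores recorded: {len(mlp_es.validation_scores_)}")
print(f"Best validation score: {mlp_es.best_validation_score_}")
```

If the model stops after only a few epochs, that would support the hypothesis that a small validation set, together with 20% less data left for training, halts learning before the network has converged. A run that reaches the iteration cap instead suggests early stopping had little effect.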