### Problem Statement :

__Part 1:__

Link https://www.mathworks.com/help/deeplearning/gs/fit-data-with-a-neural-network.html, do the following:

Open the Neural Net Fitting app by typing nftool at the command line. Choose the internal “Body Fat” dataset there and run the example after slightly changing the training/validation / testing percentages and the number of neurons. Submit screenshots of your outputs.  

__Part 2:__

Review information about MATLAB’s/Python internal dataset “bodyfat_dataset” by typing “help bodyfat_dataset“. Load the dataset by entering:

[X, T] = bodyfat_dataset;

__A-__ Find the correlation coefficient of each input with the output:

corrcoef(X(i,:),T), i=1,2,…13.

Which inputs are more suitable for linear regression? Find that linear regression (i.e. the linear mixture coefficients) by using those input variables that have better a correlation with the output, and report the results as the baseline: after the aforesaid feature selection, train (i.e. build the regression model) on the first ¾ of the samples, and then test that linear model on the rest ¼. Do so for 2 subsets. Report on the MSE of train and test as the goodness of fit measure (loss function).

__B-__ (1 and 2 required for graduates, optional for undergrads with up to an extra 20 bonus points) Now use a few different types of MLP neural nets on all the 13 input feature dimensionalities. For each requested task, reset (using the command ‘init’) and retrain the neural network model at hand 10 times and report on both mean and variance on training and validation MSEs as follows:

* Create a simple 15-node, one-hidden-layer regression MLP using the command ‘fitnet.’ Use the network with its default settings except for training/validation/test data partitioning ratios, which you shall set to 70%, 15%, and 15 %, respectively. Find the mean and variance of MSEs for training, validation, and testing portions of the dataset from the 10 training repetitions (with random initializations using init(net) ).

* Change the network’s hidden layer size to 2 nodes and then to 80 nodes, with training, validation, and testing ratios set to 30%, 20%, and 50%; and then find the mean and variance of MSEs for training and validation portions of the dataset from the 10 training repetitions (again with random model initializations).

* (Graduate section only, optional bonus, up to 10 points) For the 80-node model, have regularization (weight decay) set at 0.1 and 0.5 and the find mean and variance of MSEs for training and validation portions of the dataset from the 10 repetitions for each case.

### Task


* Train an MLP regression network using fitnet n MATLAB.
* Use default settings except for specific data partitioning ratios.
* Run the network for 10 training repetitions with random initializations.
* Change hidden layer sizes (15, 2, and 80) and compute the mean and variance of MSE for training, validation, and testing sets.
* Add weight regularization for the 80-node model and analyze results.
* Document results in a table and provide explanations.

#### Step 1: Setup and Libraries
Use `TensorFlow/Keras` for the MLP and NumPy for data manipulation.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#### Step 2: Load the Dataset
Read the dataset.

In [4]:
# Load the dataset
url = "https://hbiostat.org/data/repo/bodyfat.csv"
data = pd.read_csv(url)
data.head()

Unnamed: 0,Density,BodyFat,Age,Weight,Height,Neck,Chest,Abdomen,Hip,Thigh,Knee,Ankle,Biceps,Forearm,Wrist
0,1.0708,12.3,23,154.25,67.75,36.2,93.1,85.2,94.5,59.0,37.3,21.9,32.0,27.4,17.1
1,1.0853,6.1,22,173.25,72.25,38.5,93.6,83.0,98.7,58.7,37.3,23.4,30.5,28.9,18.2
2,1.0414,25.3,22,154.0,66.25,34.0,95.8,87.9,99.2,59.6,38.9,24.0,28.8,25.2,16.6
3,1.0751,10.4,26,184.75,72.25,37.4,101.8,86.4,101.2,60.1,37.3,22.8,32.4,29.4,18.2
4,1.034,28.7,24,184.25,71.25,34.4,97.3,100.0,101.9,63.2,42.2,24.0,32.2,27.7,17.7


In [6]:
# # Load the dataset
# data = pd.read_csv('bodyfat.csv')
# data.head()

#### Review information about MATLAB’s/Python internal dataset “bodyfat_dataset” by typing “help bodyfat_dataset“. Load the dataset by entering:

[X, T] = bodyfat_dataset;

In [8]:
# help(bodyfat_dataset)

#### A : To Find the correlation coefficient of each input with the output:

corrcoef(X(i,:),T), i=1,2,…13.

* Which inputs are more suitable for linear regression?
* Find that linear regression (i.e. the linear mixture coefficients) by using those input variables that have better a correlation with the output, and
* Report the results as the baseline: after the aforesaid feature selection, train (i.e. build the regression model) on the first ¾ of the samples, and then test that linear model on the rest ¼. Do so for 2 subsets.
* Report on the MSE of train and test as the goodness of fit measure (loss function).

In [10]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 1: Load the dataset
url = "https://hbiostat.org/data/repo/bodyfat.csv"
data = pd.read_csv(url)

# Step 2: Separate input (X) and output (T)
X = data.iloc[:, :-1].to_numpy()  # All input columns
T = data.iloc[:, -1].to_numpy()   # Output column (BodyFat)

# Step 3: Find correlation coefficients for each input with the output
correlations = [np.corrcoef(X[:, i], T)[0, 1] for i in range(X.shape[1])]
print("Correlation coefficients for each input with the output:")
for i, corr in enumerate(correlations):
    print(f"Input {i+1}: {corr:.4f}")

# Step 4: Select inputs with high correlation for linear regression
threshold = 0.5                                          # Example threshold for correlation
selected_features = [i for i, corr in enumerate(correlations) if abs(corr) > threshold]
print(f"Selected features (indices): {selected_features}")

# Step 5: Create a reduced dataset using the selected features
X_reduced = X[:, selected_features]

# Step 6: Split the data into training and testing subsets
X_train, X_test, T_train, T_test = train_test_split(
    X_reduced, T, test_size=0.25, random_state=42
)

# Step 7: Train a linear regression model
model = LinearRegression()
model.fit(X_train, T_train)

# Step 8: Test the model
T_train_pred = model.predict(X_train)
T_test_pred = model.predict(X_test)

# Step 9: Calculate MSE for training and testing
mse_train = mean_squared_error(T_train, T_train_pred)
mse_test = mean_squared_error(T_test, T_test_pred)

print(f"\nLinear Regression Results:")
print(f"MSE (Train): {mse_train:.4f}")
print(f"MSE (Test): {mse_test:.4f}")

# Step 10: Report the regression coefficients
print(f"Regression Coefficients: {model.coef_}")


Correlation coefficients for each input with the output:
Input 1: -0.3257
Input 2: 0.3466
Input 3: 0.2135
Input 4: 0.7298
Input 5: 0.3221
Input 6: 0.7448
Input 7: 0.6602
Input 8: 0.6198
Input 9: 0.6301
Input 10: 0.5587
Input 11: 0.6645
Input 12: 0.5662
Input 13: 0.6321
Input 14: 0.5856
Selected features (indices): [3, 5, 6, 7, 8, 9, 10, 11, 12, 13]

Linear Regression Results:
MSE (Train): 0.2680
MSE (Test): 0.3933
Regression Coefficients: [ 0.01550398  0.15009994 -0.00646441 -0.00087323 -0.00134432 -0.06600723
  0.08448081  0.11188828  0.00682465  0.02708931]


__B-__ (1 and 2 required for graduates, optional for undergrads with up to an extra 20 bonus points) Now use a few different types of MLP neural nets on all the 13 input feature dimensionalities. For each requested task, reset (using the command ‘init’) and retrain the neural network model at hand 10 times and report on both mean and variance on training and validation MSEs as follows:

* Create a simple 15-node, one-hidden-layer regression MLP using the command ‘fitnet.’ Use the network with its default settings except for training/validation/test data partitioning ratios, which you shall set to 70%, 15%, and 15 %, respectively. Find the mean and variance of MSEs for training, validation, and testing portions of the dataset from the 10 training repetitions (with random initializations using init(net) ).

* Change the network’s hidden layer size to 2 nodes and then to 80 nodes, with training, validation, and testing ratios set to 30%, 20%, and 50%; and then find the mean and variance of MSEs for training and validation portions of the dataset from the 10 training repetitions (again with random model initializations).

* (Graduate section only, optional bonus, up to 10 points) For the 80-node model, have regularization (weight decay) set at 0.1 and 0.5 and the find mean and variance of MSEs for training and validation portions of the dataset from the 10 repetitions for each case.

In [13]:
# Load the dataset
url = "https://hbiostat.org/data/repo/bodyfat.csv"
data = pd.read_csv(url)
data.head()

Unnamed: 0,Density,BodyFat,Age,Weight,Height,Neck,Chest,Abdomen,Hip,Thigh,Knee,Ankle,Biceps,Forearm,Wrist
0,1.0708,12.3,23,154.25,67.75,36.2,93.1,85.2,94.5,59.0,37.3,21.9,32.0,27.4,17.1
1,1.0853,6.1,22,173.25,72.25,38.5,93.6,83.0,98.7,58.7,37.3,23.4,30.5,28.9,18.2
2,1.0414,25.3,22,154.0,66.25,34.0,95.8,87.9,99.2,59.6,38.9,24.0,28.8,25.2,16.6
3,1.0751,10.4,26,184.75,72.25,37.4,101.8,86.4,101.2,60.1,37.3,22.8,32.4,29.4,18.2
4,1.034,28.7,24,184.25,71.25,34.4,97.3,100.0,101.9,63.2,42.2,24.0,32.2,27.7,17.7


In [12]:
# # Load the dataset
# data = pd.read_csv('bodyfat.csv')

In [14]:
# Separate features (X) and target (y)
X = data.drop(columns=['BodyFat'])           # Predictor variables
y = data['BodyFat']                          # Target variable

# Standardize the features and target
scaler_X = StandardScaler()
scaler_y = StandardScaler()

X = scaler_X.fit_transform(X)
y = scaler_y.fit_transform(y.values.reshape(-1, 1)).flatten()

#### Step 3: Define Train/Validation/Test Splits
Set the data partitions as per the requirements.

In [15]:
# Split for 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

#### Step 4: Define the MLP Function
Create a function to build an MLP model with variable hidden nodes and optional regularization.

In [16]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

In [17]:
def build_mlp(input_dim, hidden_size, regularization=0):
    model = Sequential([
        Dense(hidden_size, activation='relu', input_dim=input_dim, kernel_regularizer=l2(regularization)),
        Dense(1, activation='linear')
    ])
    model.compile(optimizer=Adam(learning_rate=0.01), loss='mse')
    return model


#### Step 5: Training and Evaluation Loop
Implement a function to train and evaluate the model 10 times with random initializations.

In [18]:
def train_and_evaluate(X_train, y_train, X_val, y_val, X_test, y_test, hidden_size, train_ratio, val_ratio, test_ratio, regularization=0):
    train_mses, val_mses, test_mses = [], [], []

    for i in range(10):
        # Build and initialize the model
        model = build_mlp(X_train.shape[1], hidden_size, regularization)

        # Train the model
        history = model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=100,
            batch_size=32,
            verbose=0
        )

        # Evaluate MSEs
        train_mses.append(model.evaluate(X_train, y_train, verbose=0))
        val_mses.append(model.evaluate(X_val, y_val, verbose=0))
        test_mses.append(model.evaluate(X_test, y_test, verbose=0))

    # Calculate statistics
    return {
        'train': {'mean': np.mean(train_mses), 'variance': np.var(train_mses)},
        'val': {'mean': np.mean(val_mses), 'variance': np.var(val_mses)},
        'test': {'mean': np.mean(test_mses), 'variance': np.var(test_mses)},
    }

#### Step 6: Run Experiments
__Experiment 1: 15 Nodes__

In [19]:
result_15_nodes = train_and_evaluate(X_train, y_train, X_val, y_val, X_test, y_test, hidden_size=15, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15)

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


In [20]:
# Print results with 4 decimal places
print("Results for 15 Nodes:")
print(f"Train MSE Mean: {result_15_nodes['train']['mean']:.4f}")
print(f"Train MSE Variance: {result_15_nodes['train']['variance']:.4f}")
print(f"Validation MSE Mean: {result_15_nodes['val']['mean']:.4f}")
print(f"Validation MSE Variance: {result_15_nodes['val']['variance']:.4f}")
print(f"Test MSE Mean: {result_15_nodes['test']['mean']:.4f}")
print(f"Test MSE Variance: {result_15_nodes['test']['variance']:.4f}")

Results for 15 Nodes:
Train MSE Mean: 0.0079
Train MSE Variance: 0.0000
Validation MSE Mean: 0.0231
Validation MSE Variance: 0.0001
Test MSE Mean: 0.0282
Test MSE Variance: 0.0000


__Experiment 2: 2 Nodes__

In [21]:
result_2_nodes = train_and_evaluate(X_train, y_train, X_val, y_val, X_test, y_test, hidden_size=2, train_ratio=0.3, val_ratio=0.2, test_ratio=0.5)

In [22]:
# Print results with 4 decimal places
print("Results for 2 Nodes:")
print(f"Train MSE Mean: {result_2_nodes['train']['mean']:.4f}")
print(f"Train MSE Variance: {result_2_nodes['train']['variance']:.4f}")
print(f"Validation MSE Mean: {result_2_nodes['val']['mean']:.4f}")
print(f"Validation MSE Variance: {result_2_nodes['val']['variance']:.4f}")
print(f"Test MSE Mean: {result_2_nodes['test']['mean']:.4f}")
print(f"Test MSE Variance: {result_2_nodes['test']['variance']:.4f}")

Results for 2 Nodes:
Train MSE Mean: 0.0347
Train MSE Variance: 0.0002
Validation MSE Mean: 0.0163
Validation MSE Variance: 0.0001
Test MSE Mean: 0.0305
Test MSE Variance: 0.0000


__Experiment 3: 80 Nodes__

In [23]:
result_80_nodes = train_and_evaluate(X_train, y_train, X_val, y_val, X_test, y_test, hidden_size=80, train_ratio=0.3, val_ratio=0.2, test_ratio=0.5)

In [24]:
# Print results with 4 decimal places
print("Results for 80 Nodes:")
print(f"Train MSE Mean: {result_80_nodes['train']['mean']:.4f}")
print(f"Train MSE Variance: {result_80_nodes['train']['variance']:.4f}")
print(f"Validation MSE Mean: {result_80_nodes['val']['mean']:.4f}")
print(f"Validation MSE Variance: {result_80_nodes['val']['variance']:.4f}")
print(f"Test MSE Mean: {result_80_nodes['test']['mean']:.4f}")
print(f"Test MSE Variance: {result_80_nodes['test']['variance']:.4f}")

Results for 80 Nodes:
Train MSE Mean: 0.0075
Train MSE Variance: 0.0000
Validation MSE Mean: 0.0136
Validation MSE Variance: 0.0000
Test MSE Mean: 0.0258
Test MSE Variance: 0.0000


__Experiment 4: 80 Nodes with Regularization__

In [25]:
result_80_nodes_reg_01 = train_and_evaluate(X_train, y_train, X_val, y_val, X_test, y_test, hidden_size=80, train_ratio=0.3, val_ratio=0.2, test_ratio=0.5, regularization=0.1)
result_80_nodes_reg_05 = train_and_evaluate(X_train, y_train, X_val, y_val, X_test, y_test, hidden_size=80, train_ratio=0.3, val_ratio=0.2, test_ratio=0.5, regularization=0.5)


In [26]:
# Print results with 4 decimal places for regularization = 0.1
print("Results for 80 Nodes with Regularization (0.1):")
print(f"Train MSE Mean: {result_80_nodes_reg_01['train']['mean']:.4f}")
print(f"Train MSE Variance: {result_80_nodes_reg_01['train']['variance']:.4f}")
print(f"Validation MSE Mean: {result_80_nodes_reg_01['val']['mean']:.4f}")
print(f"Validation MSE Variance: {result_80_nodes_reg_01['val']['variance']:.4f}")
print(f"Test MSE Mean: {result_80_nodes_reg_01['test']['mean']:.4f}")
print(f"Test MSE Variance: {result_80_nodes_reg_01['test']['variance']:.4f}")

Results for 80 Nodes with Regularization (0.1):
Train MSE Mean: 0.0507
Train MSE Variance: 0.0003
Validation MSE Mean: 0.0397
Validation MSE Variance: 0.0002
Test MSE Mean: 0.0492
Test MSE Variance: 0.0003


In [27]:
# Print results with 4 decimal places for regularization = 0.5
print("\nResults for 80 Nodes with Regularization (0.5):")
print(f"Train MSE Mean: {result_80_nodes_reg_05['train']['mean']:.4f}")
print(f"Train MSE Variance: {result_80_nodes_reg_05['train']['variance']:.4f}")
print(f"Validation MSE Mean: {result_80_nodes_reg_05['val']['mean']:.4f}")
print(f"Validation MSE Variance: {result_80_nodes_reg_05['val']['variance']:.4f}")
print(f"Test MSE Mean: {result_80_nodes_reg_05['test']['mean']:.4f}")
print(f"Test MSE Variance: {result_80_nodes_reg_05['test']['variance']:.4f}")


Results for 80 Nodes with Regularization (0.5):
Train MSE Mean: 0.0906
Train MSE Variance: 0.0014
Validation MSE Mean: 0.0780
Validation MSE Variance: 0.0011
Test MSE Mean: 0.0885
Test MSE Variance: 0.0015


#### Step 7: Tabulate Results
Compile the results into a table for comparison.

In [28]:
import pandas as pd

# Compile results
results = pd.DataFrame({
    'Configuration': ['15 Nodes', '2 Nodes', '80 Nodes', '80 Nodes (Reg=0.1)', '80 Nodes (Reg=0.5)'],
    'Train MSE Mean': [result_15_nodes['train']['mean'], result_2_nodes['train']['mean'], result_80_nodes['train']['mean'], result_80_nodes_reg_01['train']['mean'], result_80_nodes_reg_05['train']['mean']],
    'Train MSE Variance': [result_15_nodes['train']['variance'], result_2_nodes['train']['variance'], result_80_nodes['train']['variance'], result_80_nodes_reg_01['train']['variance'], result_80_nodes_reg_05['train']['variance']],
    'Val MSE Mean': [result_15_nodes['val']['mean'], result_2_nodes['val']['mean'], result_80_nodes['val']['mean'], result_80_nodes_reg_01['val']['mean'], result_80_nodes_reg_05['val']['mean']],
    'Val MSE Variance': [result_15_nodes['val']['variance'], result_2_nodes['val']['variance'], result_80_nodes['val']['variance'], result_80_nodes_reg_01['val']['variance'], result_80_nodes_reg_05['val']['variance']]
})

print(results)


        Configuration  Train MSE Mean  Train MSE Variance  Val MSE Mean  \
0            15 Nodes        0.007944            0.000010      0.023078   
1             2 Nodes        0.034673            0.000236      0.016295   
2            80 Nodes        0.007502            0.000038      0.013569   
3  80 Nodes (Reg=0.1)        0.050677            0.000280      0.039729   
4  80 Nodes (Reg=0.5)        0.090556            0.001358      0.078018   

   Val MSE Variance  
0          0.000075  
1          0.000060  
2          0.000042  
3          0.000205  
4          0.001078  


In [32]:
# Save the DataFrame to a CSV file with values rounded to 5 decimal places
results.to_csv('results_summary.csv', float_format='%.7f', index=False)

print("Results saved to 'results_summary.csv'")



Results saved to 'results_summary.csv'


#### Step 8: Write Observations
* __Effect of Hidden Layer Size:__ Compare the results for 2, 15, and 80 nodes to analyze the impact of model capacity.
* __Effect of Regularization:__ Observe how weight decay affects generalization and overfitting.
