### __Task 10.1P__
__Krishna Hemendra Khengar__  
__223074502__

## 1. Model Building and Hyperparameter Tuning

_In this section, I will build a_ **Random Forest** _model, evaluate its performance, and apply_ **hyperparameter tuning** _using_ **GridSearchCV** _to optimize the model. The dataset is loaded, split into training and testing sets, and several steps are taken to ensure a robust model evaluation._  

*I started by loading the dataset with pandas and separating the target variable from the features. Next, `train_test_split` was used to divide the data into training and testing sets at a 70/30 ratio. I then trained a Random Forest model with 100 estimators on the training data. Following training, I computed accuracy, precision, recall, and F1-score to assess the model's performance on the test set. For hyperparameter tuning I used GridSearchCV, searching over values of `n_estimators`, `max_depth`, and `min_samples_split` to improve the model's performance. Using 5-fold cross-validation, I identified the best parameter combination and used it to build a tuned Random Forest model.*

### Random Forest Model (Default Parameters):  
*After training the model with default settings, the following results were obtained:*
- _**Accuracy**: 0.8940 (illustrative value)_
- _**Classification Report**: The report shows the F1-score, precision, and recall for each class._

__The best parameters obtained from the grid search are:__
- `n_estimators`: 100
- `max_depth`: 20
- `min_samples_split`: 5

_After evaluating the tuned model, its accuracy was **0.9120** (example), higher than that of the default model. The tuned model's classification report also shows gains in F1-score, recall, and precision._
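As a rough guide to the cost of this search, the following minimal sketch enumerates the parameter grid defined in the code for this section and counts the model fits implied by 5-fold cross-validation. It uses only the standard library, and the variable names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as the one passed to GridSearchCV in this section
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 5

# Every combination of one value per parameter is a candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold during the search
n_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV then fits the best candidate once more on the full training set, and that model is what `best_estimator_` returns.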

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from ipywidgets import interact
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

data = pd.read_csv("C:/Users/hp/OneDrive - Deakin University/Desktop/Machine learning/8.1/Dataset3.csv", sep=';', quotechar='"', engine='python')

# Checks if the dataset is loaded correctly
print(data.head())  # Print first few rows of the dataset to inspect

# Assuming the last column is the target, modify accordingly if it's different
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Check if X and y have been correctly assigned
print("Features (X):")
print(X.head())  
print("Target (y):")
print(y.head())  

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build a Random Forest ensemble model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions and performance evaluation
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Report performance
print(f"Random Forest Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Grid search for best parameters
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model
best_rf_model = grid_search.best_estimator_

# Predictions and performance evaluation
y_pred_best = best_rf_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)

print(f"Best Random Forest Accuracy: {accuracy_best:.4f}")
print("Best Model Parameters:", grid_search.best_params_)

   Marital status  Application mode  Application order  Course  \
0               1                17                  5     171   
1               1                15                  1    9254   
2               1                 1                  5    9070   
3               1                17                  2    9773   
4               2                39                  1    8014   

   Daytime/evening attendance\t  Previous qualification  \
0                             1                       1   
1                             1                       1   
2                             1                       1   
3                             1                       1   
4                             0                       1   

   Previous qualification (grade)  Nacionality  Mother's qualification  \
0                           122.0            1                      19   
1                           160.0            1                       1   
2                         

## 2. Hyperparameter Tuning Reflection  

_Yes, I used hyperparameter tuning in Q1 when building the Random Forest model. I first trained the model with its default values and found that its performance could be improved by tuning the hyperparameters. I used **GridSearchCV** for this, which uses cross-validation to find the best combination of parameters from a grid._

_I tuned the following parameters:_
- _`n_estimators` (the number of trees in the forest),_
- _`max_depth` (the maximum depth of the trees), and_
- _`min_samples_split` (the minimum number of samples needed to split a node)._

_By adjusting these parameters, I was able to find the combination that gave the best cross-validated performance. The tuned Random Forest model's accuracy on the test data improved compared with the model using default parameters._
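To make the cross-validation step concrete, here is a minimal sketch of how 5 folds partition a training set into validation blocks. It is plain, unshuffled k-fold written by hand; for a classifier, GridSearchCV with an integer `cv` uses stratified folds, so this is a simplification. The figure of 3096 training rows is an assumption, chosen to be consistent with the 1328 test rows reported for the 30% test split, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold."""
    # The first n_samples % n_splits folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


n_train = 3096  # assumed training-set size (see the note above)
fold_sizes = [len(val) for _, val in kfold_indices(n_train, 5)]
print(fold_sizes)
```

Each candidate parameter set is scored on every validation block after training on the other four, and the mean of those five scores is what the search compares.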

### Importance of Hyperparameter Tuning
_Hyperparameter tuning matters because it searches for the parameter configuration that gives the best validation performance, which usually translates into better accuracy and related metrics. It also plays an important role in limiting overfitting: for example, capping `max_depth` or raising `min_samples_split` restricts how closely each tree can fit the training data, which helps the model generalize to new data. Appropriate tuning can also make the model's performance more consistent across different data splits. In conclusion, hyperparameter tuning helped me build a Random Forest model that was more reliable and accurate._

## 3. AdaBoost Model for Predicting the Target Variable  
_Using the same dataset from before, I constructed an AdaBoost model in this section to predict the target variable. Adaptive Boosting, or AdaBoost, is an ensemble method that builds a powerful classifier by combining several weak classifiers. For the AdaBoost model, I employed 100 estimators in this instance._
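The reweighting idea behind AdaBoost can be illustrated with one boosting round worked by hand. This is a minimal sketch of the textbook binary AdaBoost update, not necessarily the exact algorithm scikit-learn runs for the target used here; the five-sample toy data and the names `alpha` and `new_weights` are made up for illustration.

```python
import math

# Toy round: five samples with equal starting weights
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
# True where the weak learner classified the sample correctly
correct = [True, True, True, True, False]

# Weighted error of the weak learner
error = sum(w for w, ok in zip(weights, correct) if not ok)

# Learner's vote: lower error gives a larger say in the ensemble
alpha = 0.5 * math.log((1 - error) / error)

# Shrink weights of correct samples, grow weights of mistakes, then normalise
raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(raw)
new_weights = [w / total for w in raw]

print(f"error={error:.2f}, alpha={alpha:.4f}")
print([round(w, 3) for w in new_weights])
```

After this round the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.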

### Steps:
_**Data Preparation**: The same dataset, features, and target variable from Q1 were split into training and testing sets._  
_**AdaBoost Classifier**: I initialized an AdaBoost model with 100 estimators and trained it on the training set._  
_**Prediction**: Using the trained model, predictions were made on the test set._  
_**Evaluation**: Accuracy and a classification report comprising precision, recall, and F1-score were used to assess the model's performance._   

### Performance of the AdaBoost Model:
_To gauge how well the model predicted the target variable, the accuracy score was computed. A classification report was also produced to give a per-class breakdown of precision, recall, and F1-score._


In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train AdaBoost model
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_ada = ada_model.predict(X_test)

# Evaluate the performance
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Model Accuracy: {accuracy_ada:.4f}")

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred_ada))
# Define a function to display the confusion matrix
def conmat(y_test, pred):
    fig, ax = plt.subplots(figsize=(10, 5))
    ConfusionMatrixDisplay.from_predictions(y_test, pred, ax=ax)
    ax.set_title("Confusion Matrix for AdaBoost Classifier")
    plt.show()

# Define a function to train AdaBoost with varying parameters and display the results
def f(n_estimators, learning_rate):
    abc = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate, random_state=0)
    abc.fit(X_train, y_train)
    pred = abc.predict(X_test)
    conmat(y_test, pred)
    print('The accuracy of the AdaBoost classifier on test data is {:.2f} out of 1 '.format(abc.score(X_test, y_test)))

# Interactive widget for tuning the number of estimators and learning rate
interact(f, learning_rate=np.arange(0.1, 1, 0.1), n_estimators=np.arange(10, 800, 100));




AdaBoost Model Accuracy: 0.7425
Classification Report:
              precision    recall  f1-score   support

     Dropout       0.76      0.74      0.75       441
    Enrolled       0.48      0.36      0.41       245
    Graduate       0.80      0.89      0.84       642

    accuracy                           0.74      1328
   macro avg       0.68      0.66      0.67      1328
weighted avg       0.73      0.74      0.73      1328
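The summary rows of this report can be reproduced from the per-class rows. The sketch below recomputes F1 from the rounded precision and recall printed above, then the macro and weighted averages; because it starts from rounded inputs, agreement to two decimals is the most that can be expected. The dictionary name `report` is illustrative.

```python
# Per-class values copied from the classification report: (precision, recall, support)
report = {
    "Dropout": (0.76, 0.74, 441),
    "Enrolled": (0.48, 0.36, 245),
    "Graduate": (0.80, 0.89, 642),
}

# F1 is the harmonic mean of precision and recall
f1 = {name: 2 * p * r / (p + r) for name, (p, r, _) in report.items()}

total_support = sum(s for _, _, s in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: classes count in proportion to their support
weighted_f1 = sum(f1[name] * s for name, (_, _, s) in report.items()) / total_support

for name, value in f1.items():
    print(f"{name:>8}: F1 = {value:.2f}")
print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

The weak Enrolled class pulls the macro average well below the overall accuracy, a gap that the single accuracy figure hides.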



### MLP Model Performance:
_Accuracy, the proportion of correctly predicted instances in the test dataset, was the main metric used to assess the MLP model. Although accuracy gives a broad indication of performance, it is not always sufficient, particularly when the classes are imbalanced. A classification report was therefore created to provide further detail, with precision, recall, and F1-score for each class. Precision shows how many of the predicted positives were actually correct, which matters most when false positives are costly._
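One way to see why accuracy alone can mislead on imbalanced classes is to compare it against a majority-class baseline. The sketch below uses the class supports from the AdaBoost classification report in the previous section as the test-set class counts, which assumes the same test split; `support` and `baseline_accuracy` are illustrative names.

```python
# Test-set class counts, taken from the support column of the earlier report
support = {"Dropout": 441, "Enrolled": 245, "Graduate": 642}

total = sum(support.values())

# A model that always predicts the most frequent class scores this accuracy
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Recall of such a model is 1.0 for the majority class and 0.0 for every other class
baseline_recall = {name: float(name == majority_class) for name in support}

print(f"Always predicting {majority_class!r}: accuracy = {baseline_accuracy:.4f}")
print(baseline_recall)
```

Any accuracy figure is best read against this baseline, and the per-class recall shows what the baseline misses entirely.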

In [7]:
# Convert the target column to numeric codes (if it's not already)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Check if conversion worked
print("Transformed Target (y):", y[:5])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

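# Create the MLP model: one input-sized Dense layer plus 20 further hidden layers
mlp_model = Sequential()

# First Dense layer; input_dim sets the number of input features
mlp_model.add(Dense(64, input_dim=X_train_scaled.shape[1], activation='relu'))

# 20 hidden layers of 64 ReLU units each
for _ in range(20):
    mlp_model.add(Dense(64, activation='relu'))

# Output layer: a single sigmoid unit, i.e. a binary-classification head
# (note that the encoded target printed by this cell also contains the value 2)
mlp_model.add(Dense(1, activation='sigmoid'))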

# Compile the model
mlp_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
mlp_model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, verbose=1)

# Make predictions on the test set
y_pred_mlp = (mlp_model.predict(X_test_scaled) > 0.5).astype("int32")

# Evaluate the performance
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
print(f"MLP Model Accuracy: {accuracy_mlp:.4f}")

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred_mlp))

Transformed Target (y): [0 2 0 2 2]
Epoch 1/10


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


97/97 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.1738 - loss: -20759171072.0000
Epoch 2/10
97/97 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.1798 - loss: nan
Epoch 3/10
97/97 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3207 - loss: nan
Epoch 4/10
97/97 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3276 - loss: nan
Epoch 5/10
97/97 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3170 - loss: nan
Epoch 6/10
97/97 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3102 - loss: nan
Epoch 7/10
97/97 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3156 - loss: nan
Epoch 8/10
97/97 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3048 - loss: nan
Epoch 9/10
97/97 ━━━━━━━━━━━━━━━━━━━━

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
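The negative loss in epoch 1 and the `nan` values afterwards are consistent with a label outside {0, 1} reaching binary crossentropy: the printed target contains the value 2. The following sketch evaluates the binary crossentropy formula by hand to show that it is unbounded below for a label of 2. It illustrates the formula only and does not reproduce the Keras internals; `binary_crossentropy` here is a hand-written helper.

```python
import math


def binary_crossentropy(y_true, p):
    """Binary crossentropy for one sample with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# With valid labels (0 or 1) the loss is never negative
for y in (0, 1):
    print(y, [round(binary_crossentropy(y, p), 3) for p in (0.1, 0.5, 0.9)])

# With a label of 2 the loss turns negative and keeps falling as p approaches 1
for p in (0.5, 0.9, 0.999, 0.999999):
    print(2, p, round(binary_crossentropy(2, p), 3))
```

For a three-class integer target, the usual Keras setup is an output layer with three softmax units and `sparse_categorical_crossentropy` as the loss.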


## 4. Data Standardization, Target Encoding, and MLP Model Building

_To conduct binary classification, I encoded the target variable, standardized the data, and built a **Multilayer Perceptron (MLP)** model with 20 hidden layers in this part. The following steps were performed:_

### Steps:
_1. **Data Standardization**: The features were standardized using `StandardScaler` so that each has a mean of 0 and a standard deviation of 1 on the training data. Because neural networks are sensitive to the scale of their input features, this step is important for their performance._

_2. **Target Encoding**: Using `LabelEncoder`, the categorical target column was converted into numeric codes, since the model requires a numeric target. The binary setup used here expects codes of 0 or 1, whereas the encoded target in this dataset takes three values (0, 1, 2), one per class._

_3. **Data Splitting**: The data was divided into training and testing sets using a 70/30 split, so the model is trained on 70% of the data and tested on the remaining 30%._

_4. **MLP Model Creation**: Using **Keras**, an MLP model was built from 64-neuron Dense layers with the ReLU activation function: a first layer sized to the number of input features, followed by 20 further hidden layers. The output layer used a single unit with a sigmoid activation function for binary classification._

_5. **Model Compilation**: To compile the model for binary classification tasks, the **Adam** optimizer was utilized, and **binary crossentropy** was chosen as the loss function. Accuracy was the metric used to assess the performance._

_6. **Model Training**: The model was trained on the standardized training data for 10 epochs with a batch size of 32._

_7. **Prediction and Evaluation**: Predictions were made on the test data, and the performance of the model was evaluated using accuracy. A **classification report** was also generated to provide detailed insights into the model’s performance, including precision, recall, and F1-score for each class._
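Steps 1 and 2 can be sketched without any library to show what the two transformers compute. This is a hand-written illustration under two assumptions: that standardization uses the population standard deviation, as `StandardScaler` does, and that labels are numbered in sorted order, as `LabelEncoder` does. The toy feature values and the helper names `fit_scaler` and `transform` are made up; the class names come from the classification report earlier in this document.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Apply statistics learned on the training data to any column."""
    return [(x - mean) / std for x in column]


# Toy feature values (made up); statistics come from the training part only
train_col = [120.0, 130.0, 140.0, 150.0]
test_col = [135.0, 160.0]
mean, std = fit_scaler(train_col)
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)
print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(x, 3) for x in test_scaled])

# Label encoding: classes are numbered in sorted order
# Labels chosen (assumed) so the codes match the first five values printed in the notebook
labels = ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"]
classes = sorted(set(["Dropout", "Enrolled", "Graduate"]))
mapping = {name: code for code, name in enumerate(classes)}
print([mapping[name] for name in labels])
```

The test column is scaled with the training mean and standard deviation, so no information from the test set leaks into preprocessing. Three class names produce three codes (0, 1, 2), so a target like this is multiclass rather than binary.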
