# Binary Classification with TensorFlow and Keras Using a Feed-Forward Neural Network

# First of all, what is a Neural Network?
- A **neural network** is a computational model inspired by the structure of the human brain. It consists of layers of interconnected nodes (neurons) that learn patterns from data to perform tasks such as classification, prediction, and recognition.
- Neural networks are widely used in image recognition, natural language processing, recommendation systems, and generative AI.

# Key Definitions
- **TensorFlow:** An open-source machine learning framework developed by Google that provides a flexible platform for building and deploying deep learning models.
- **Keras:** A high-level neural network API, shipped with TensorFlow as `tf.keras`, designed for easy and fast prototyping of deep learning models.
- **Feed-Forward Neural Network:** A type of artificial neural network where information moves in one direction, from input to output, without loops or cycles, making it the simplest form of deep learning architecture.

## Problem Statement: Predicting 'red wine quality'. 
- If Quality >= 5.5, then it is **'good wine'** (1).
- Otherwise it is **'bad wine'** (0).

---

## Initial Step: Load the Data

The key question before preprocessing: how many rows and columns does the data have?

In [None]:
import numpy as np

# Load the dataset, skipping the header row and keeping only the first 12 columns
dataset = np.loadtxt('WineQT.csv', delimiter=',', skiprows=1, usecols=range(12))

# Print dataset dimensions
print(dataset.shape)

In [None]:
import pandas as pd

# Convert dataset to a Pandas DataFrame for easier analysis
df = pd.DataFrame(dataset)

print(df.head())    # Display first 5 rows


In [None]:
# Load dataset with headers for better column identification
df = pd.read_csv('WineQT.csv')  
df.head()   # Show first few rows

In [None]:
# Drop the 'Id' column: it is a row identifier, not a feature
df = df.drop(columns=['Id'])

In [None]:
# Display data set info now
df.info()

In [None]:
df.describe()

Convert the DataFrame back to a NumPy array: Pandas was convenient for inspection, but all preprocessing and training below use NumPy.

In [None]:
dataset = df.to_numpy()

---

## Step 1: Preview the first 5 rows.

In [None]:
# Print floats with 2 decimal places
np.set_printoptions(formatter={'float': lambda x: '{0:0.2f}'.format(x)})

print(dataset[0:5, :])  # Display rows 0 to 4 with all columns

---

## Step 2: Prepare the output

If the last column is less than 5.5, set it to 0, otherwise 1.
- (good wine = 1, bad wine = 0)

In [None]:
dataset[dataset[:, -1] < 5.5, -1] = 0
dataset[dataset[:, -1] >= 5.5, -1] = 1

print(dataset[0:20, :])

We have now converted this problem into a **binary classification** problem because our output labels are now 0's and 1's. 
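
The same labeling rule can also be written without overwriting the quality column. A minimal sketch using `np.where`; the `quality` values are made up for illustration, and only the 5.5 threshold comes from the problem statement:

```python
import numpy as np

# Hypothetical quality scores (illustrative values, not taken from WineQT.csv)
quality = np.array([5.0, 6.0, 7.0, 4.0, 5.0, 8.0])

# 1 = good wine (quality >= 5.5), 0 = bad wine
labels = np.where(quality >= 5.5, 1, 0)

print(labels)  # [0 1 1 0 0 1]
```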

---

## Step 3: Shuffle the rows

Shuffling mixes the rows so that the training and testing sets both contain a representative blend of samples. Without a fixed seed this hurts reproducibility, but it makes the split more reliable. Why?
- Every unseeded shuffle produces a different split and therefore slightly different results, but it makes it very unlikely that the training set ends up with mostly 0's and the testing set with mostly 1's, or vice versa.

In [None]:
np.random.shuffle(dataset)  # Shuffles the rows in place
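
If reproducibility matters, the shuffle can be seeded. A minimal sketch, assuming NumPy's `default_rng` generator and an arbitrary seed of 42 (both are choices made here, not part of the original notebook); `demo` is a stand-in for `dataset`:

```python
import numpy as np

# Stand-in for `dataset`: 6 rows, 2 columns
demo = np.arange(12).reshape(6, 2)

# Two generators created with the same (arbitrary) seed produce the same row order
shuffled_a = np.random.default_rng(42).permutation(demo, axis=0)
shuffled_b = np.random.default_rng(42).permutation(demo, axis=0)

print(np.array_equal(shuffled_a, shuffled_b))  # True
```

Unlike `np.random.shuffle`, `permutation` returns a shuffled copy and leaves the original array untouched.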

---

## Step 4: Split into Training/Testing data.

All the input features that go into the model form *XTRAIN*, and the labels the model needs to predict form *YTRAIN*. *XTEST* and *YTEST* are the held-out counterparts.

A common rule of thumb is an 80/20 split between training and testing.

In [None]:
index_20percent = int(0.2 * len(dataset[:, 0]))

print(index_20percent)  # 228 samples will be tested

In [None]:
# Take samples from start to 228 for Testing
XTEST = dataset[:index_20percent, :-1]  # Extract all columns except the last into X (indicated by :-1)
YTEST = dataset[:index_20percent, -1]   # Extract last column only into y

In [None]:
# Take samples from 228 onwards (: indicates this)
XTRAIN = dataset[index_20percent:, 0:-1]    # Extract all columns besides last
YTRAIN = dataset[index_20percent:, -1]       # Extract only last
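
The slicing above can also be wrapped in a small helper. This is only a sketch: `split_rows` and `test_fraction` are names introduced here for illustration, and the last column is assumed to hold the label, as in the cells above.

```python
import numpy as np

def split_rows(data, test_fraction=0.2):
    """Split a 2-D array into train/test parts; the last column is the label."""
    n_test = int(test_fraction * len(data))
    x_test, y_test = data[:n_test, :-1], data[:n_test, -1]
    x_train, y_train = data[n_test:, :-1], data[n_test:, -1]
    return x_train, y_train, x_test, y_test

# Stand-in data: 10 rows, 3 feature columns + 1 label column
demo = np.arange(40, dtype=float).reshape(10, 4)
x_train, y_train, x_test, y_test = split_rows(demo)

print(x_train.shape, y_train.shape)  # (8, 3) (8,)
print(x_test.shape, y_test.shape)    # (2, 3) (2,)
```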

---

## Step 5: Normalize the data (if needed)

Not all datasets need normalization, but if your input columns do not already sit in a small range such as 0-1, you will probably need to normalize them.

Normalization is needed when input values vary widely in scale. Neural networks learn best when the data lies within a small, consistent range; large differences between features can slow training, cause unstable gradients, and let certain features dominate learning.

A common way to normalize is standardization: subtract each feature's mean and divide by its standard deviation.

In [None]:
import matplotlib.pyplot as plt

plt.hist(XTRAIN[:, 0])
plt.xlabel('0th Column (fixed acidity)')   # the histogram's x-axis holds the feature values
plt.ylabel('Count')
plt.show()

The 0th column, 'fixed acidity', has values ranging from roughly 3 to 15, with most in the 6-10 region.

Since these values are not in the 0-1 range, we want to normalize them.

In [None]:
# Check the split of 0's and 1's in our Y-data, aka what we are predicting.

plt.hist(YTRAIN)
plt.xlabel('Output labels (training)')
plt.ylabel('Count')
plt.show()

plt.hist(YTEST)
plt.xlabel('Output labels (testing)')
plt.ylabel('Count')
plt.show()

**Baseline accuracy** is determined by this split: when one class is more frequent than the other, a model that always predicts the more frequent class is already right more than 50% of the time, so a trained model must beat that figure to be useful.
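
As a rough illustration, the baseline can be computed directly from the label vector. The labels below are made up; with the real data, `YTRAIN` or `YTEST` would take the place of `y`:

```python
import numpy as np

# Hypothetical label vector: 6 good wines (1) and 4 bad wines (0)
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

# Fraction of samples that belong to class 1
frac_positive = y.mean()

# Accuracy of a "model" that always predicts the majority class
baseline = max(frac_positive, 1 - frac_positive)

print(f"Baseline accuracy: {baseline:.2f}")  # Baseline accuracy: 0.60
```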

Now, let's normalize the data with standardization:

In [None]:
# Compute the mean and standard deviation from the training data only,
# then use those same values to normalize the testing data

mean = XTRAIN.mean(axis=0)
XTRAIN -= mean
std = XTRAIN.std(axis=0)
XTRAIN /= std

XTEST -= mean
XTEST /= std   # same std as the training data

It is important to calculate the mean and std from the training data and then reuse those same values for the testing data. Why?
- The testing data must stay unseen. Computing statistics from it would leak information about the test set into preprocessing, so the evaluation would no longer be a **true test** of how the model performs on new data.
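
A self-contained sketch of the same idea on synthetic data. The arrays, the seed, and the names `x_train_scaled` and `x_test_scaled` are assumptions made for this example; unlike the in-place cell above, it leaves the raw arrays untouched:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
x_train = rng.normal(loc=8.0, scale=2.0, size=(100, 3))  # stand-in training features
x_test = rng.normal(loc=8.0, scale=2.0, size=(20, 3))    # stand-in testing features

# Statistics come from the training set only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std   # reuses the training mean and std

print(x_train_scaled.mean(axis=0).round(2))  # approximately 0 for every column
print(x_train_scaled.std(axis=0).round(2))   # approximately 1 for every column
```

The scaled test set will generally not have a mean of exactly 0 or a std of exactly 1, which is expected.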

In [None]:
# mean and std are vectors
# So, we can see the mean and std for each feature we have

print(mean)
print(std)

In [None]:
# Now, look at the distribution of values for the same column again. 

plt.hist(XTRAIN[:, 0])
plt.xlabel('0th Column (fixed acidity, standardized)')
plt.ylabel('Count')
plt.show()

---

## Step 6: Review the Dimensions of the training & testing sets.

We also preview some of the input features and correct labels for both datasets.

In [None]:
# The number of rows in XTRAIN & YTRAIN must be the same.
print(XTRAIN.shape)
print(YTRAIN.shape)

In [None]:
# Same goes for the testing sets.
print(XTEST.shape)
print(YTEST.shape)

In [None]:
# Print the head (first 3) of the datasets
print(XTRAIN[0:3, ])
print(YTRAIN[0:3])
print(XTEST[0:3, ])
print(YTEST[0:3])

---

## Step 7: Create a neural network model

We want to create a network with the following architecture:
- 8 neurons in layer 1
- 4 neurons in layer 2
- 1 neuron as the last layer

In [None]:
from tensorflow.keras.models import Sequential  # Sequential model to stack layers
from tensorflow.keras.layers import Dense       # Dense layer aka fully connected layer

# Initialize a Sequential model (stacks layers in order)
model = Sequential()

# The number of inputs that each neuron receives is the number of columns in the data
model.add(Dense(8, input_dim = len(XTRAIN[0, :]), activation='relu'))

# input_dim = len(XTRAIN[0, :]) calculates the length by using the first row (0) and all columns(:)
# Necessary for first layer as each neuron needs to know how many inputs it is receiving when designing the architecture of the NN.

# All of the outputs from the 8 neurons in layer 1 become inputs to the 4 neurons in layer 2
model.add(Dense(4, activation='relu'))

# Since we are doing binary classification, we want 1 neuron in the last layer.
model.add(Dense(1, activation='sigmoid'))

### Why ReLU and Sigmoid?
- **ReLU (`relu`)** is used in hidden layers to introduce non-linearity. It helps the model learn complex patterns and mitigates the vanishing gradient problem.
- **Sigmoid (`sigmoid`)** is used in the output layer for binary classification. It converts raw scores (logits) into probabilities between 0 and 1.
- **Logits** are the raw, unscaled outputs of a neuron before applying an activation function. They can take any real value, but we apply an activation function (like `sigmoid`) to convert them into interpretable probabilities.
- **The Vanishing Gradient Problem** happens when the learning signal (gradient) becomes too small by the time it reaches the earlier layers of a deep network, so those layers stop learning. This is like trying to learn from feedback that gets quieter and quieter until you can’t hear it anymore. ReLU helps because its gradient is 1 for positive inputs, so the signal does not shrink as it passes back through active neurons.
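
To make the two activations concrete, here is a minimal NumPy sketch of both functions. It illustrates the math only and is not how Keras implements them internally; the `logits` values are arbitrary:

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Map any real-valued logit to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-2.0, 0.0, 3.0])

print(relu(logits))     # [0. 0. 3.]
print(sigmoid(logits))  # approximately [0.119 0.5 0.953]
```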

In [None]:
# Display model architecture
model.summary()  # summary() prints directly and returns None

The `model.summary()` function provides a structured overview of the neural network, showing key details about each layer:

- **Layer Type**: Lists the different layers in the model (e.g., `Dense` for fully connected layers).  
- **Output Shape**: Displays the shape of the data as it moves through each layer.  
  - `(None, 8)`, `(None, 4)`, `(None, 1)` → The `None` indicates that the model can process any batch size.  
  - The numbers (8, 4, 1) represent the number of neurons in each layer.  
- **Number of Parameters**: Shows how many **trainable weights (connections between neurons) and biases (offset values for neurons)** each layer contains. For example, the first layer has **96 parameters**: 11 × 8 = 88 weights plus 8 biases.
- **Total Parameters**: The sum of all layer parameters, indicating the complexity of the model.  
- **Trainable Parameters**: The parameters that are updated during training.  
- **Non-Trainable Parameters**: Parameters that are fixed (e.g., from a pre-trained model).
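
The parameter counts can be checked by hand. A short sketch, assuming 11 input features and the 8-4-1 architecture described above; `dense_params` is a helper name introduced here:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 11 input features -> 8 -> 4 -> 1
layer_sizes = [11, 8, 4, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [96, 36, 5]
print(sum(per_layer))  # 137
```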

### **Why This is Useful**
- Helps ensure the architecture is **correctly structured** before training.  
- Provides insight into the **model’s complexity** (more parameters = more capacity, but also risk of overfitting).  
- Useful for **debugging** if unexpected shapes or parameter mismatches occur.  


In [None]:
import matplotlib.image as mpimg

plt.figure(figsize=(15, 6))

img = mpimg.imread('NN.png')
plt.imshow(img)
plt.axis('off')
plt.show()

The input features enter the network at the first layer, and the prediction leaves it at the last.

Each of the 11 features (columns) is connected to every one of the 8 neurons in the 1st layer.

Those 8 neurons then feed the 4 neurons in the 2nd layer, and those 4 feed the single neuron at the end.

That final neuron, based on all the previous weights and calculations, outputs a score between 0 and 1: the predicted probability that the sample is a good wine.

If the prediction is far from the true label, the weights that connect the nodes are updated so that in future rounds of training the output matches the correct label more closely.
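
The flow described above can be sketched as a forward pass in plain NumPy. The weights here are random rather than trained, the input batch is made up, and the variable names are chosen for this example, so the printed scores carry no meaning beyond showing the shapes and the 0-1 output range:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; weights are random, not trained

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random weights and zero biases for an 11 -> 8 -> 4 -> 1 network
w1, b1 = rng.normal(size=(11, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 4)), np.zeros(4)
w3, b3 = rng.normal(size=(4, 1)), np.zeros(1)

x = rng.normal(size=(5, 11))  # made-up batch of 5 samples with 11 features each

h1 = relu(x @ w1 + b1)        # layer 1 output, shape (5, 8)
h2 = relu(h1 @ w2 + b2)       # layer 2 output, shape (5, 4)
p = sigmoid(h2 @ w3 + b3)     # final scores, shape (5, 1)

print(p.shape)  # (5, 1)
```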

---

## Step 8: Compile the model

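Compiling configures the model for training: it sets the loss function to minimize, the optimizer that updates the weights, and the metrics to report.

A **sequential model** is a plain stack of layers with no cycles, so compiling one rarely raises errors.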

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

For binary classification with a sigmoid output, **binary crossentropy** is the standard choice of loss. Why?
- **Binary crossentropy** measures how well the predicted probabilities match the actual class labels, and it penalizes confident wrong predictions heavily, making it well suited to problems with two possible outcomes.
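
A minimal NumPy sketch of the formula, for intuition only. The clipping constant `eps` and the example probabilities are assumptions made here, and Keras's own implementation may differ in numerical details:

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; eps keeps log() away from 0."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)

print(round(loss_right, 4))     # 0.1643
print(loss_wrong > loss_right)  # True
```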

For optimizer, we chose *adam* but could have chosen any adaptive optimizer like *rmsprop*.
- Adaptive optimizers automatically adjust the learning rate during training, allowing for faster and more stable convergence compared to fixed learning rate optimizers like SGD.
- **Adam (Adaptive Moment Estimation)** is an optimizer that combines the advantages of both RMSprop and momentum, making it effective for most deep learning tasks.
- **RMSprop (Root Mean Square Propagation)** is another optimizer that adapts the learning rate for each parameter, damping large updates that can destabilize training.
- **Momentum Optimizer** accelerates training by allowing the optimizer to keep moving in the same direction even if gradients change slightly, helping escape small local minima and improving convergence speed.
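
The momentum idea is easy to see on a toy problem. The sketch below applies the classical momentum update to f(x) = x², a one-dimensional objective chosen here for illustration; the function name, learning rate, and step count are assumptions, and this is not a substitute for the Keras optimizers:

```python
def minimize_with_momentum(grad_fn, x0, lr=0.1, momentum=0.9, steps=200):
    """Gradient descent with a velocity term that accumulates past gradients."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad_fn(x)
        x = x + velocity
    return x

# Toy objective f(x) = x**2, whose gradient is 2*x and whose minimum is at x = 0
x_final = minimize_with_momentum(lambda x: 2 * x, x0=5.0)

print(abs(x_final) < 1e-2)  # True
```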

For **binary classification** the most widely used metric is *accuracy*, as it is easy to interpret.

---

## Step 9: Train the model

We feed *XTRAIN* into the model and the model calculates errors using *YTRAIN*.

In one *epoch* the model scans through all the rows in *XTRAIN*.

- An **epoch** is one full pass through the entire training dataset.

Increasing the number of *epochs* usually improves accuracy on the training data, because the model sees the entire dataset more often. Beyond some point, however, extra epochs mainly lead to overfitting.

We pass *validation_data=(XTEST, YTEST)* to observe the loss and accuracy on the *TEST* data during the training process.

In [None]:
# Add some callbacks
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callback_a = ModelCheckpoint(filepath = 'my_best_model.keras', monitor='val_loss', save_best_only = True)   # Saves the model only when val_loss improves
callback_b = EarlyStopping(monitor='val_loss', mode='min', patience=20, verbose=1)                          # Stops training once val_loss stops improving, which limits overfitting

## Callbacks: ModelCheckpoint & EarlyStopping
We added them to improve the model's training efficiency.
- **Model Checkpoint** saves the best model based on validation loss.
- **Early Stopping** stops the model if the validation loss does not decrease for 20 epochs. 
- **Validation Loss** is the loss computed on data that is not used to update the weights (here, the testing set). A low validation loss means the model is learning patterns that generalize well, while a validation loss that rises as the training loss keeps falling indicates overfitting.
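
The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the behavior described above, not the Keras source: `early_stopping_epoch` is a hypothetical helper, the loss values are made up, and a patience of 3 is used instead of 20 to keep the example short.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: this is where a checkpoint would be saved
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch  # ran through every epoch

# Made-up validation losses
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]

print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

Training stops at epoch 5, so the later improvement at epoch 6 is never seen; a larger patience trades longer training for a smaller chance of stopping too early.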


In [None]:
history = model.fit(XTRAIN, YTRAIN, validation_data=(XTEST, YTEST), epochs=256, batch_size=10, callbacks=[callback_a, callback_b])

---

## Step 10: Check the learning curves

- The learning curves show the loss, accuracy, and other tracked metrics over the epochs.
- They are available because we stored the return value of `model.fit` in `history`, which records the metrics for every epoch of training.

In [None]:
print(history.params)

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(['training_data', 'validation_data'], loc='lower right')
plt.show()

- The training accuracy increases steadily, meaning the model is learning the training data well.
- The validation accuracy increases at first, then flattens and fluctuates slightly, meaning that after some epochs the model stops improving on unseen data.
- Any early dip in validation accuracy may occur because the model initially struggles to generalize, possibly due to a high learning rate, a small batch size, or overfitting to early training examples before stabilizing.

A widening gap between the two curves suggests overfitting: the model is memorizing the training data instead of generalizing to new data.

Reload the checkpoint saved by `ModelCheckpoint` so we are back at the model with the lowest validation loss.

In [None]:
model.load_weights('my_best_model.keras')

---

## Step 11: Evaluate the model on the training data

This gives a reference point for the results on the testing set.

We are evaluating the model on the same data we used to train it, so this score is optimistic and says little about generalization. It is still a useful sanity check, and a large gap to the testing score is a sign of overfitting.

In [None]:
scores = model.evaluate(XTRAIN, YTRAIN)
print(model.metrics_names)
print(scores)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Displays the accuracy.

---

## Step 12: Evaluate on testing set

This is a **real test** on the model as we are evaluating it on the 'unknown' dataset, aka unseen data. 

In [None]:
scores = model.evaluate(XTEST, YTEST)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

The testing accuracy is typically lower than the accuracy on the training data.

---

## Step 13: Check what the model actually predicts

This is an example of what the model has predicted and a subsequent comparison with the true classes.

In [None]:
print(XTEST[0:5])
print(YTEST[0:5])

In [None]:
prediction = model.predict(XTEST)

In [None]:
print(prediction[0:10])

The raw outputs are confidence scores between 0 and 1, so it is not immediately obvious which of them disagree with the true labels.

In [None]:
print(prediction[0:10].round())

This is clearer once we round. 
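
One detail worth knowing: NumPy rounds values that sit exactly halfway to the nearest even number, so a score of exactly 0.5 rounds to 0. An explicit threshold makes the decision rule unambiguous. A small sketch with made-up scores, shaped like the output of `model.predict`:

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1)
probs = np.array([[0.12], [0.50], [0.51], [0.97]])

rounded = probs.round()                    # exactly 0.5 rounds to the even value 0
thresholded = (probs >= 0.5).astype(int)   # explicit rule: 0.5 and above counts as class 1

print(rounded.ravel())      # [0. 0. 1. 1.]
print(thresholded.ravel())  # [0 1 1 1]
```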

The following scatter plot visualizes the relationship between the **true labels (X-axis)** and the **predicted confidence scores (Y-axis)**.  

In [None]:
plt.plot(YTEST, prediction, '.', alpha=0.3)
plt.xlabel('Correct labels')
plt.ylabel('Predicted confidence scores')
plt.show()

- Each point represents a **single prediction** made by the model.  
- The X-axis (Correct labels) shows the actual class labels **(0 = bad wine, 1 = good wine)**.  
- The Y-axis (Predicted confidence scores) shows the **model’s confidence** in classifying the sample as **good wine (1)**.  

**Interpreting the Graph:**  
- **Points close to 0 or 1 on the Y-axis** = The model is confident in its predictions.  
- **Points near 0.5** = The model is uncertain, meaning the classification is less reliable.  
- **Ideally**, predictions for class **0** should cluster near the bottom (scores close to 0), and predictions for class **1** should cluster near the top (scores close to 1).  
- **If many points are incorrectly positioned**, the model may struggle with classification, possibly due to overfitting or imbalanced data.  

This visualization helps identify patterns in model confidence and potential areas for improvement.  

---

## Key Question: Is 'accuracy' alone sufficient to evaluate our model?

Accuracy is often not the best metric for judging what our model predicts. **Accuracy can be misleading** if the dataset is **imbalanced** (e.g., always predicting "negative" in a dataset where 99% of cases are negative gives 99% accuracy but never detects a positive).

To further assess the model's performance, we use the following metrics:

- **Accuracy**: Measures the percentage of correct predictions.  
  - **High** = The model is making mostly correct predictions.  
  - **Low** = The model is performing poorly, possibly just guessing.

- **Precision**: Measures how often a predicted positive is actually correct.  
  - **High** = Few false positives, reliable positive predictions.  
  - **Low** = Many false positives, model predicts positives incorrectly too often.  

- **Recall (Sensitivity)**: Measures how well the model detects actual positives.  
  - **High** = Few false negatives, model catches most actual positives.  
  - **Low** = Many false negatives, missing actual positive cases.  

- **F1-Score**: A balance between precision and recall.  
  - **High** = Both precision and recall are strong.  
  - **Low** = One or both metrics are weak, indicating poor overall performance.  
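
The four metrics all derive from the counts of true/false positives and negatives. The sketch below computes them by hand on an imbalanced toy example; `binary_metrics` is a helper written for this illustration (returning 0.0 when a denominator is zero is a choice made here), and the data is made up:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the four confusion-matrix counts."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Imbalanced toy data: 99 negatives, 1 positive; the "model" always predicts negative
y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)

print(binary_metrics(y_true, y_pred))  # (0.99, 0.0, 0.0, 0.0)
```

Accuracy looks excellent here, while precision, recall, and F1 expose that the positive class is never detected.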

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(YTEST, prediction.round())
precision = precision_score(YTEST, prediction.round())
recall = recall_score(YTEST, prediction.round())
f1score = f1_score(YTEST, prediction.round())

print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("Precision: %.2f%%" % (precision * 100.0))
print("Recall: %.2f%%" % (recall * 100.0))
print("F1-score: %.2f" % (f1score))

---

## Key Question: How can the performance be improved?

First, we could train for more epochs, for example another 100 or 150 on top of the run above.

In [None]:
## **Testing Improvement 1: Increasing Epochs**
# Continue training the existing model for **100** more epochs (without early stopping) and observe if it improves.

# Keep the returned History object so the numbers below describe this run
history1 = model.fit(XTRAIN, YTRAIN, validation_data=(XTEST, YTEST), epochs=100, batch_size=32, callbacks=[callback_a])

# Display final accuracy of this run
print(f"Final Training Accuracy: {history1.history['accuracy'][-1]:.4f}")
print(f"Final Validation Accuracy: {history1.history['val_accuracy'][-1]:.4f}")

# Plot accuracy over the 100 additional epochs
plt.plot(history1.history['accuracy'], label='Training Accuracy')
plt.plot(history1.history['val_accuracy'], label='Validation Accuracy')

plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Accuracy After Increasing Epochs')
plt.show()

Or we could add more layers to the neural network.

In [None]:
## **Testing Improvement 2: Adding More Layers**
# We'll add an additional hidden layer (making it a **4-layer network**) and see if performance improves.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Define new model with an extra hidden layer
model2 = Sequential()
model2.add(Dense(8, input_dim=len(XTRAIN[0, :]), activation='relu'))
model2.add(Dense(6, activation='relu'))  # New layer added
model2.add(Dense(4, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))

model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the new model
history2 = model2.fit(XTRAIN, YTRAIN, validation_data=(XTEST, YTEST), epochs=50, batch_size=32, callbacks=[callback_a, callback_b])

# Display final accuracy of the deeper model (history2)
print(f"Final Training Accuracy: {history2.history['accuracy'][-1]:.4f}")
print(f"Final Validation Accuracy: {history2.history['val_accuracy'][-1]:.4f}")

# Plot accuracy over the training epochs (at most 50)
plt.plot(history2.history['accuracy'], label='Training Accuracy')
plt.plot(history2.history['val_accuracy'], label='Validation Accuracy')

plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Accuracy After Adding a Layer')
plt.show()

We could also balance the data, meaning create an even mix of positive and negative class values in the training set, so the model does not simply favor the majority class.

In [None]:
## **Testing Improvement 3: Balancing the Data**
# We'll use **random oversampling** to balance the classes and see if it helps improve model performance.

from imblearn.over_sampling import RandomOverSampler

# Apply oversampling to balance the dataset
ros = RandomOverSampler()
XTRAIN_resampled, YTRAIN_resampled = ros.fit_resample(XTRAIN, YTRAIN)

# Train model on the balanced dataset
history3 = model.fit(XTRAIN_resampled, YTRAIN_resampled, validation_data=(XTEST, YTEST), epochs=50, batch_size=32, callbacks=[callback_a, callback_b])

# Display final accuracy after training on the balanced data (history3)
print(f"Final Training Accuracy: {history3.history['accuracy'][-1]:.4f}")
print(f"Final Validation Accuracy: {history3.history['val_accuracy'][-1]:.4f}")

# Plot accuracy over the training epochs (at most 50)
plt.plot(history3.history['accuracy'], label='Training Accuracy')
plt.plot(history3.history['val_accuracy'], label='Validation Accuracy')

plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Accuracy After Balancing the Data')
plt.show()
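
For intuition, random oversampling amounts to duplicating randomly chosen rows of the minority class until the classes are even. The NumPy sketch below shows that idea for binary labels; `random_oversample` is a helper written for this illustration, and it is not the `imblearn` implementation:

```python
import numpy as np

def random_oversample(x, y, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)  # arbitrary seed for repeatable sampling
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.concatenate([x, x[extra_idx]]), np.concatenate([y, y[extra_idx]])

# Toy data: 6 samples of class 1, 2 samples of class 0
x = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([1, 1, 1, 1, 1, 1, 0, 0])

x_bal, y_bal = random_oversample(x, y)

print(np.bincount(y_bal))  # [6 6]
```

Only the training set is resampled; the testing set keeps its natural class balance so the evaluation stays realistic.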

Finally, we could increase or decrease the number of rows in the training and testing sets, meaning alter the split.

In [None]:
## **Testing Improvement 4: Adjusting Train/Test Split**
# We'll change the train/test split to **90% train, 10% test** and check its effect.

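# Adjust train/test split to 90% train, 10% test
# Note: XTRAIN and XTEST are views into `dataset`, so the in-place standardization in Step 5 has already transformed these feature columns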
index_10percent = int(0.1 * len(dataset[:, 0]))

XTEST_new = dataset[:index_10percent, :-1]
YTEST_new = dataset[:index_10percent, -1]
XTRAIN_new = dataset[index_10percent:, :-1]
YTRAIN_new = dataset[index_10percent:, -1]

# Train model with new split
history4 = model.fit(XTRAIN_new, YTRAIN_new, validation_data=(XTEST_new, YTEST_new), epochs=50, batch_size=32, callbacks=[callback_a, callback_b])

# Display final accuracy with the new split (history4)
print(f"Final Training Accuracy: {history4.history['accuracy'][-1]:.4f}")
print(f"Final Validation Accuracy: {history4.history['val_accuracy'][-1]:.4f}")

# Plot accuracy over the training epochs (at most 50)
plt.plot(history4.history['accuracy'], label='Training Accuracy')
plt.plot(history4.history['val_accuracy'], label='Validation Accuracy')

plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Accuracy With a 90/10 Split')
plt.show()