# Binary Classification

## Objectives

- Apply neural network techniques to solve binary classification problems.
- Explore the effectiveness of neural networks on both structured and unstructured data.
- Evaluate model performance using various metrics and visualize training results.

## Background

This notebook employs neural networks for binary classification tasks, using structured data from the Banknote Authentication dataset and unstructured data from the IMDB movie review dataset. It focuses on model construction, training, and evaluation.

## Datasets Used

- Banknote Authentication Dataset: It consists of features extracted from images of genuine and forged banknote-like specimens.
- IMDB Movie Reviews Dataset: It contains 50,000 polarized movie reviews for natural language processing tasks, particularly sentiment analysis.

## The Banknote Authentication dataset

In this notebook, we will solve some binary classification problems with neural networks. 

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [3]:
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout

We will use the Banknote Authentication dataset [https://archive.ics.uci.edu/ml/datasets/banknote+authentication] from the UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/index.php]. 

In [4]:
# Defining the headers
headers = ['variance', 'skewness', 'curtosis', 'entropy', 'target']

In [5]:
# Reading the data
df = pd.read_csv("data_banknote_authentication.txt", header=None, names=headers, na_values="?")
#df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt", 
#                 header=None, names=headers, na_values="?")
print(df.shape)
df.head()

(1372, 5)


| | variance | skewness | curtosis | entropy | target |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


In [6]:
target = df.target.value_counts()
target

target
0    762
1    610
Name: count, dtype: int64

In [7]:
px.bar(x=target.index, y=target.values,  
       width=600, height=400, title='Class Distribution')

In [8]:
X = df[['variance', 'skewness', 'curtosis', 'entropy']]     # Feature Matrix
y = df['target']                                            # Target Variable

In [9]:
# Scale the feature data
scaler = StandardScaler()
Xs = pd.DataFrame(scaler.fit_transform(X), columns=['variance', 'skewness', 'curtosis', 'entropy'])

In [10]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=23)
print('Train = %i cases \t Test = %i cases' %(len(X_train), len(X_test)))

Train = 1097 cases 	 Test = 275 cases
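
Note that the scaler above is fitted on the full dataset before the split, so statistics from the test rows influence the training features. A common alternative is to estimate the mean and standard deviation on the training rows only and reuse them for the test rows. The sketch below illustrates the idea with plain NumPy on synthetic data; the array names and the 80/20 slice are assumptions made for this example. With scikit-learn, this corresponds to calling `fit` on the training set and `transform` on both sets.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix (100 rows, 4 features)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

# Split first (here: a simple 80/20 slice of randomly generated rows)
X_tr, X_te = X_demo[:80], X_demo[80:]

# Estimate the scaling statistics on the training rows only
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)   # population std (ddof=0), as StandardScaler uses

# Apply the same transformation to both sets
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for each feature
print(X_tr_scaled.std(axis=0).round(6))   # approximately 1 for each feature
```

The test rows will not have exactly zero mean and unit variance after this transformation, which is expected: they are scaled with statistics they did not contribute to.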


Let's create a simple model with a `Dense` layer.

Remember, `Dense` is a fully connected layer, a type of artificial neural network layer where every neuron in the current layer is connected to every neuron in the subsequent layer.
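
As a rough illustration of what a fully connected layer computes, the sketch below implements the forward pass `activation(x @ W + b)` in NumPy for the same layer sizes used in this section (4 inputs, 10 hidden units, 1 output). The weights are random placeholders, not trained values, and the helper names are made up for this example.

```python
import numpy as np

def dense_forward(x, W, b, activation):
    """Compute activation(x @ W + b) for a batch of inputs."""
    return activation(x @ W + b)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
x = rng.normal(size=(3, 4))          # batch of 3 samples, 4 features each

# Randomly initialised parameters (illustrative values, not trained weights)
W1, b1 = rng.normal(size=(4, 10)), np.zeros(10)   # 4 inputs -> 10 hidden units
W2, b2 = rng.normal(size=(10, 1)), np.zeros(1)    # 10 hidden units -> 1 output

hidden = dense_forward(x, W1, b1, relu)
prob = dense_forward(hidden, W2, b2, sigmoid)

print(hidden.shape)  # (3, 10)
print(prob.shape)    # (3, 1)
```

Every column of `W1` holds the incoming weights of one hidden unit, which is what "fully connected" means in practice.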

In [11]:
# Define the model architecture
model1 = Sequential([
    Input(shape=(4,)),              # Explicitly define the input shape
    Dense(10, activation='relu'),   # First hidden layer
    Dense(1, activation='sigmoid')  # Output layer
])

# Display model summary
model1.summary()
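
The summary reports the number of trainable parameters per layer. For a `Dense` layer this is one weight per input-unit pair plus one bias per unit. A small check of the arithmetic for this architecture (the helper name `dense_params` is made up for this example):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(4, 10)   # 4 features -> 10 hidden units
output = dense_params(10, 1)   # 10 hidden units -> 1 output unit

print(hidden, output, hidden + output)  # 50 11 61
```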

In [12]:
# Compile the model
model1.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
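
For reference, binary cross-entropy averages `-[y*log(p) + (1-y)*log(1-p)]` over the samples, where `p` is the predicted probability of class 1. The following NumPy sketch shows the computation on made-up values; the clipping constant `eps` is an illustrative choice and not necessarily the one Keras uses internally.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0); eps is an illustrative choice
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)
print(loss_right)  # small loss
print(loss_wrong)  # much larger loss
```

The loss penalizes confident mistakes heavily, which is why it can keep changing even when the accuracy has stopped moving.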

In [13]:
# Train the model
epochs = 10
history1 = model1.fit(X_train, y_train, 
          batch_size=32, 
          epochs=epochs, 
          validation_data=(X_test, y_test));

Epoch 1/10
35/35 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step - accuracy: 0.7281 - loss: 0.6313 - val_accuracy: 0.7745 - val_loss: 0.5701
Epoch 2/10
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7730 - loss: 0.5490 - val_accuracy: 0.8327 - val_loss: 0.5059
Epoch 3/10
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8142 - loss: 0.4995 - val_accuracy: 0.8873 - val_loss: 0.4472
Epoch 4/10
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8614 - loss: 0.4392 - val_accuracy: 0.9055 - val_loss: 0.3971
Epoch 5/10
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9062 - loss: 0.3914 - val_accuracy: 0.9236 - val_loss: 0.3541
Epoch 6/10
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9155 - loss: 0.3474 - val_accuracy: 0.9345 - val_loss: 0.3169
Epoch 7/10
35/35 ━━━━━━━━━━

Let's use `plot_history` for plotting the results.

In [14]:
def plot_history(history):
    '''
    Plotting the results of the neural network training process
    '''
    hist = history.history
    d = pd.DataFrame({'epochs': [epoch + 1 for epoch in history.epoch],
                      'accuracy': hist['accuracy'],
                      'val_accuracy': hist['val_accuracy'],
                      'loss': hist['loss'],
                      'val_loss': hist['val_loss']})
    
    fig = px.line(d, x='epochs', y=['loss', 'val_loss', 'accuracy', 'val_accuracy'],
                  color_discrete_sequence=['orange', 'peru', 'yellowgreen', 'darkolivegreen'],
                  labels={'epochs': 'Epochs', 'value': 'Loss/Accuracy', 'variable': 'Legend'},
                  title='Neural Network Training History', width=800, height=500)
    
    fig.update_traces(mode='lines+markers')
    
    return fig.show()

In [15]:
plot_history(history1)

In [16]:
# Evaluate the model
score1 = model1.evaluate(X_test, y_test, verbose=0)
print('Test loss     = %.4f' % score1[0])
print('Test accuracy = %.4f' % score1[1])

Test loss     = 0.2118
Test accuracy = 0.9455
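
Accuracy alone does not show which kind of error a classifier makes. The sketch below derives confusion-matrix counts, precision, recall, and F1 from predicted probabilities by thresholding at 0.5. The labels and probabilities are made up for illustration; in the notebook, the probabilities would come from the sigmoid output of the model (for example via `model1.predict(X_test)`), and the helper `binary_metrics` is a hypothetical name.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix counts and derived metrics for a binary classifier."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels and probabilities, for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob_demo = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.3]

metrics = binary_metrics(y_true_demo, y_prob_demo)
print(metrics)
```

For banknote authentication, a false negative (a forged note accepted as genuine) and a false positive usually have different costs, so precision and recall are worth reporting alongside accuracy.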


## IMDB Keras dataset 

`IMDB` is a dataset of 50,000 highly polarized reviews from the Internet Movie Database, with 25,000 reviews labeled as positive and 25,000 labeled as negative.

The Keras version of the dataset is widely used for natural language processing (NLP) tasks, specifically sentiment analysis: each review is labeled as positive or negative based on the overall sentiment it conveys.

The dataset is preprocessed so that each review is encoded as a sequence of integers, where each integer is the index of a specific word in a frequency-ranked dictionary. The sequences keep the length of the original reviews, so they differ in length and must be converted to a fixed-size representation before a dense network can use them.

In [17]:
# Loading the IMDB dataset
from tensorflow.keras.datasets import imdb

In [18]:
max_words = 10000

The argument `num_words=max_words` means you will only keep the top `max_words` most frequently occurring words in the training data. Rarer words are not removed from the reviews; they are replaced by the out-of-vocabulary index (`2` by default). This allows you to work with vector data of manageable size.
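
To make the effect of `num_words` concrete, the pure-Python sketch below imitates how an encoded review looks after the vocabulary is capped: any index at or above the limit shows up as the out-of-vocabulary index. The function `cap_vocabulary` and the toy review are invented for this example and are not part of Keras.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [w if w < num_words else oov_char for w in sequence]

# Toy encoded review: 1 marks the start of the sequence
toy_review = [1, 14, 22, 12045, 43, 530, 98765]

capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)
# [1, 14, 22, 2, 43, 530, 2]
```

This is why the value `2` appears repeatedly in some of the loaded reviews.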

In [19]:
# Split the dataset into training and test sets
(X_train2, y_train2), (X_test2, y_test2) = imdb.load_data(num_words=max_words)
print('Train = %i cases \t Test = %i cases' %(len(X_train2), len(X_test2)))

Train = 25000 cases 	 Test = 25000 cases


The `imdb.load_data()` function from Keras does not directly provide an option to change the proportion of the training and testing sets. By default, it splits the dataset into 50% training and 50% testing.

To adjust the proportion, we can manually split the data after loading it using techniques such as slicing or the `train_test_split()` function from scikit-learn.
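
The slicing approach can be sketched as follows: shuffle the row indices once, then cut them at the desired proportion. The helper `shuffle_split` is a hypothetical name, and the toy arrays stand in for the real data.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.2, seed=0):
    """Shuffle the row indices, then use the first `test_size` fraction as the test set."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo, test_size=0.2)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Unlike `train_test_split`, this sketch offers no stratification, so class proportions in the two parts can drift on small or imbalanced datasets.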

In [20]:
# Concatenate the training and testing data
X2 = np.concatenate((X_train2, X_test2), axis=0)
y2 = np.concatenate((y_train2, y_test2), axis=0)

In [21]:
# Get the unique values and their counts
unique_values, counts = np.unique(y2, return_counts=True)

In [22]:
# Plot the distribution of the target variable
px.bar(x=unique_values, y=counts,  
       width=600, height=400, title='Class distribution')

In [23]:
# Some data examples
print('X2[0] has', len(X2[0]), 'elements. The first 5 are:', X2[0][:5], '\ty_label:', y2[0])
print('X2[1] has', len(X2[1]), 'elements. The first 4 are:', X2[1][:4], '\ty_label:', y2[1])
print('X2[3] has', len(X2[3]), 'elements. The first 5 are:', X2[3][:5], '\ty_label:', y2[3])

X2[0] has 218 elements. The first 5 are: [1, 14, 22, 16, 43] 	y_label: 1
X2[1] has 189 elements. The first 4 are: [1, 194, 1153, 194] 	y_label: 0
X2[3] has 550 elements. The first 5 are: [1, 4, 2, 2, 33] 	y_label: 1


You can quickly decode any of these reviews back to English words. Let's do it with the shortest one.

In [24]:
# Finding the smallest sequence 
seq_len = np.array([len(x) for x in X2])
print('Minimum sequence length:', seq_len.min(), 'at the position', seq_len.argmin())   

Minimum sequence length: 7 at the position 27104


In [25]:
print('Smallest sequence:', X2[seq_len.argmin()], '\ty_label:', y2[seq_len.argmin()])

Smallest sequence: [1, 332, 4, 274, 859, 4, 20] 	y_label: 0


What is this review about?

In [26]:
# index is a dictionary mapping words to an integer index.
index = imdb.get_word_index()      
# Reverses it, mapping integer indices to words
reverse_index = dict([(value, key) for (key, value) in index.items()])
# Decoding the review 
print(" ".join([reverse_index.get(i - 3, "#") for i in X2[seq_len.argmin()]])) 
# Decoding the corresponding y_label
y_label = 'Positive review' if y2[seq_len.argmin()] == 1 else 'Negative review' 
print('y_label:', y_label)

# read the book forget the movie
y_label: Negative review
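
The `i - 3` in the decoding step reflects the loader's default convention: indices 0, 1, and 2 are reserved for padding, start of sequence, and unknown words, so the actual word ranks are shifted by 3. The toy example below reproduces the mechanism with an invented word index; the real mapping comes from `imdb.get_word_index()`, and the `decode` helper is a hypothetical name.

```python
# Invented word index for illustration (word -> rank, starting at 1)
toy_index = {"the": 1, "movie": 2, "book": 3, "read": 4, "forget": 5}
reverse_toy = {rank: word for word, rank in toy_index.items()}

def decode(sequence, reverse_index, offset=3, placeholder="#"):
    """Map encoded integers back to words; reserved indices become a placeholder."""
    return " ".join(reverse_index.get(i - offset, placeholder) for i in sequence)

# 1 = start marker; the remaining values are rank + 3
encoded = [1, 7, 4, 6, 8, 4, 5]
decoded = decode(encoded, reverse_toy)
print(decoded)
# prints: # read the book forget the movie
```

The leading `#` is the start-of-sequence marker, which has no entry in the word index.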


### Data Preparation

We have to prepare the data before feeding it to the network. Every input for our neural network needs to have the same size, but the reviews have different lengths. We will therefore vectorize every review into a multi-hot vector of exactly `max_words` numbers: position `j` is set to 1 if the word with index `j` occurs in the review and stays 0 otherwise. Word order and repetition counts are discarded in this encoding.

In [27]:
print('Number of dimensions: ', X2.ndim)
print('Dimensions (or shape):', X2.shape)
print(X2)

Number of dimensions:  1
Dimensions (or shape): (50000,)
[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 3

In [28]:
def vectorize(sequences, dimension = 10000):
    '''
    This function takes a list of sequences (array of lists) and returns 
    a NumPy array of shape (len(sequences), dimension) with 0 and 1.
    '''
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):        
        results[i, sequence] = 1
    return results
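
A toy run makes the encoding easier to inspect. The sketch below uses a variant of the function with an extra `dtype` parameter (an addition for this example, not part of the original function) and also estimates the memory needed for the full 50,000 x 10,000 matrix. `np.zeros` defaults to `float64`, so a smaller dtype can reduce the footprint considerably.

```python
import numpy as np

def multi_hot(sequences, dimension, dtype=np.float32):
    """Return a (len(sequences), dimension) array with 1 where a word index occurs."""
    results = np.zeros((len(sequences), dimension), dtype=dtype)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1   # repeated indices still yield a single 1
    return results

toy_sequences = [[1, 3, 3, 5], [0, 2]]
encoded = multi_hot(toy_sequences, dimension=6)
print(encoded)
# [[0. 1. 0. 1. 0. 1.]
#  [1. 0. 1. 0. 0. 0.]]

# Approximate memory for the full IMDB matrix (50,000 x 10,000 entries)
for dt in (np.float64, np.float32):
    gigabytes = 50_000 * 10_000 * np.dtype(dt).itemsize / 1e9
    print(np.dtype(dt).name, gigabytes, "GB")
```

Note that the repeated index 3 in the first toy sequence produces a single 1, which shows that word counts are not preserved.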

In [29]:
X2 = vectorize(X2)
print('Number of dimensions: ', X2.ndim)
print('Dimensions (or shape):', X2.shape)
print(X2)

Number of dimensions:  2
Dimensions (or shape): (50000, 10000)
[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]]


In [30]:
# Split the data into training and testing sets with a specified proportion
test_size = 0.2
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=test_size, random_state=0)
print('Train = %i cases \t Test = %i cases' %(len(X_train2), len(X_test2)))

Train = 40000 cases 	 Test = 10000 cases


### Model A

Let's use a simple model with fully connected (`Dense`) layers and the `relu` activation function in the hidden layer.

In [31]:
# Defining the model
model2a = Sequential([
    Input(shape=(max_words,)),      # Explicitly define the input shape
    Dense(20, activation='relu'),   # First hidden layer with 20 neurons    
    Dense(1, activation='sigmoid')  # Output layer with 1 neuron (binary output)
])

# Display model summary
model2a.summary()

In [32]:
# Compile the model
model2a.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])

In [33]:
# Train the model
batch_size = 512
history2a = model2a.fit(X_train2, y_train2,
                epochs=epochs,
                batch_size=batch_size,
                validation_data=(X_test2, y_test2));

Epoch 1/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 6s 34ms/step - accuracy: 0.7642 - loss: 0.5180 - val_accuracy: 0.8907 - val_loss: 0.2906
Epoch 2/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - accuracy: 0.9155 - loss: 0.2442 - val_accuracy: 0.8970 - val_loss: 0.2627
Epoch 3/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.9291 - loss: 0.1989 - val_accuracy: 0.8962 - val_loss: 0.2632
Epoch 4/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.9393 - loss: 0.1718 - val_accuracy: 0.8950 - val_loss: 0.2714
Epoch 5/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.9459 - loss: 0.1541 - val_accuracy: 0.8920 - val_loss: 0.2841
Epoch 6/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.9508 - loss: 0.1423 - val_accuracy: 0.8912 - val_loss: 0.2975
Epoch 7/10
79/79 ━━━━

In [34]:
plot_history(history2a)

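Overfitting is characterized by a model that performs well on the training data (low training loss, high training accuracy) but poorly on new data (high validation loss, low validation accuracy), indicating a lack of generalization. The training log shows this pattern: the training accuracy keeps rising (from 0.7642 to 0.9508 over the first six epochs), while the validation loss reaches its minimum at epoch 2 (0.2627) and then climbs to 0.2975 by epoch 6. We will address overfitting by adding a dropout layer.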
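
One simple way to read such a curve is to look for the epoch where the validation loss stops improving, which is the idea behind early stopping. The pure-Python sketch below applies it to the validation losses of the first six epochs from the log above; the helper names and the `patience` value are assumptions for illustration. Keras offers an `EarlyStopping` callback for doing this during training.

```python
# Validation losses for epochs 1-6, copied from the training log of Model A
val_loss = [0.2906, 0.2627, 0.2632, 0.2714, 0.2841, 0.2975]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest loss and that loss."""
    idx = min(range(len(losses)), key=losses.__getitem__)
    return idx + 1, losses[idx]

def stop_epoch(losses, patience=2):
    """Return the 1-based epoch at which training would stop after `patience`
    consecutive epochs without improvement, or None if that never happens."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

print(best_epoch(val_loss))   # (2, 0.2627)
print(stop_epoch(val_loss))   # 4
```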

In [35]:
# Evaluate the model
score2a = model2a.evaluate(X_test2, y_test2, batch_size=batch_size, verbose=0)
print('Test loss     = %.4f' % score2a[0])
print('Test accuracy = %.4f' % score2a[1])

Test loss     = 0.3618
Test accuracy = 0.8828


### Model B

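Let's add a `Dropout` layer to mitigate overfitting. Dropout is a regularization technique: because the network cannot rely on any single unit being present, it is pushed toward more robust, redundant representations.

Dropout consists of randomly setting a fraction of the layer's output features to zero during training; at inference time no units are dropped. The dropout rate is the fraction of features that are zeroed out. It is usually between 0.2 and 0.5, but we will use a higher value (0.8) for our example.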
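
The mechanism can be sketched in NumPy as "inverted" dropout: units are zeroed at random during training and the surviving ones are scaled by `1 / (1 - rate)` so that the expected activation stays the same. This is a minimal illustration with made-up activations, not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 20))            # stand-in for the outputs of a 20-unit layer

dropped = dropout(hidden, rate=0.8, rng=rng)
print((dropped == 0).mean())            # close to 0.8: fraction of zeroed units
print(dropped.mean())                   # close to 1.0: expected value is preserved
print(dropout(hidden, rate=0.8, rng=rng, training=False).mean())  # exactly 1.0
```

With a rate of 0.8 and 20 hidden units, only about 4 units are active in each training step, which explains why the training accuracy of a model with such a layer tends to lag behind its validation accuracy.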

In [36]:
# Defining the model
model2b = Sequential([
    Input(shape=(max_words,)),      # Explicitly define the input shape
    Dense(20, activation='relu'),   # First hidden layer with 20 neurons
    Dropout(0.8),                   # Dropout layer with a rate of 0.8
    Dense(1, activation='sigmoid')  # Output layer with 1 neuron (binary output)
])

# Display model summary
model2b.summary()

In [37]:
# Compile the model
model2b.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])

In [38]:
# Train the model
history2b = model2b.fit(X_train2, y_train2,
                epochs=epochs,
                batch_size=batch_size,
                validation_data=(X_test2, y_test2));

Epoch 1/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 4s 26ms/step - accuracy: 0.6326 - loss: 0.6207 - val_accuracy: 0.8762 - val_loss: 0.3701
Epoch 2/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - accuracy: 0.8143 - loss: 0.4166 - val_accuracy: 0.8935 - val_loss: 0.2980
Epoch 3/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - accuracy: 0.8472 - loss: 0.3530 - val_accuracy: 0.8947 - val_loss: 0.2729
Epoch 4/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - accuracy: 0.8620 - loss: 0.3176 - val_accuracy: 0.8964 - val_loss: 0.2599
Epoch 5/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.8735 - loss: 0.2956 - val_accuracy: 0.8986 - val_loss: 0.2563
Epoch 6/10
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.8752 - loss: 0.2843 - val_accuracy: 0.9003 - val_loss: 0.2529
Epoch 7/10
79/79 ━━━━

In [39]:
plot_history(history2b)

In [40]:
# Evaluate the model
score2b = model2b.evaluate(X_test2, y_test2, batch_size=batch_size, verbose=0)
print('Test loss     = %.4f' % score2b[0])
print('Test accuracy = %.4f' % score2b[1])

Test loss     = 0.2590
Test accuracy = 0.8977


## Conclusions

Key Takeaways:
- Neural networks can handle structured and unstructured data, achieving high accuracy with appropriate preprocessing and model configuration.
- Overfitting is a significant challenge in model training, often necessitating techniques like dropout for more robust generalization.
- Training dynamics, such as the relationship between loss and accuracy over epochs, are crucial for diagnosing model behavior and guiding adjustments to model architecture or training regimen.

## References

- Chollet, F. (2021) *Deep Learning with Python*, Second Edition, Manning Publications Co, chap 2