<a href="https://colab.research.google.com/github/AnovaYoung/SchoolProjects/blob/main/Segmentation_Analysis_Boltzmann_Machines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**A Brief Introduction to the Boltzmann Machine**

A Boltzmann machine is a type of stochastic recurrent neural network that consists of a network of symmetrically connected units. It is used for learning and representing complex probability distributions.

Boltzmann machines can be employed for tasks such as feature learning, dimensionality reduction, and classification. They assign an *energy* to every configuration of the network and use *probabilistic sampling* to favor low-energy configurations.

The network consists of units, or nodes, that are either visible (input units) or hidden. Each node can be in one of two states: active or inactive.

The nodes are connected with weighted edges, and the energy is a function of the states of the nodes and the weights of the edges. The goal of training is to **minimize the energy of the network** on the observed data, so that configurations resembling the data become the most probable ones.
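
To make the energy function concrete, here is a minimal NumPy sketch for a general Boltzmann machine with binary states, using the standard form E(s) = -0.5 * s^T W s - b^T s. The 3-unit network, its weights `W`, and its biases `b` are made-up values for illustration only; they are not taken from the dataset used later.

```python
import numpy as np

def energy(state, weights, bias):
    """Energy of a binary state vector: E(s) = -0.5 * s^T W s - b^T s."""
    state = np.asarray(state, dtype=float)
    return float(-0.5 * state @ weights @ state - bias @ state)

# Hypothetical 3-unit network: symmetric weights with a zero diagonal.
W = np.array([[0.0, 1.0, -2.0],
              [1.0, 0.0, 0.5],
              [-2.0, 0.5, 0.0]])
b = np.array([0.1, -0.2, 0.0])

# Units 0 and 1 share a positive weight, so switching both on lowers the energy.
e_both = energy([1, 1, 0], W, b)
# Units 0 and 2 share a negative weight, so switching both on raises the energy.
e_clash = energy([1, 0, 1], W, b)
print(e_both, e_clash)
```

Because the probability of a configuration is proportional to exp(-E), the first configuration is the more likely of the two under this toy network.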


Biswal, A., ... Hussain, Z. (n.d.). Boltzmann Machine. In Computer Science. ScienceDirect. Retrieved from https://www.sciencedirect.com/topics/computer-science/boltzmann-machine

In [None]:
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-1.1.0-py3-none-any.whl.metadata (1.8 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.5





In [None]:
import pandas as pd

file_path = r'\Users\manov\Downloads\M5-online+retail+ii\online_retail_II.xlsx'
data = pd.read_excel(file_path, sheet_name='Year 2010-2011')

print(data.head())

data_cleaned = data.dropna(subset=['Customer ID'])

print(data_cleaned.head())

  Invoice StockCode                          Description  Quantity  \
0  536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1  536365     71053                  WHITE METAL LANTERN         6   
2  536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3  536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4  536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

          InvoiceDate  Price  Customer ID         Country  
0 2010-12-01 08:26:00   2.55      17850.0  United Kingdom  
1 2010-12-01 08:26:00   3.39      17850.0  United Kingdom  
2 2010-12-01 08:26:00   2.75      17850.0  United Kingdom  
3 2010-12-01 08:26:00   3.39      17850.0  United Kingdom  
4 2010-12-01 08:26:00   3.39      17850.0  United Kingdom  
  Invoice StockCode                          Description  Quantity  \
0  536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1  536365     71053                  WHITE METAL LANTERN         6   
2  536365

**Preprocessing Data**

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Encode categorical data
encoder = OneHotEncoder()
encoded_countries = encoder.fit_transform(data_cleaned[['Country']]).toarray()

# Scale numerical features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_cleaned[['Quantity', 'Price']])

# Combine both
preprocessed_data = pd.concat([pd.DataFrame(encoded_countries), pd.DataFrame(scaled_data)], axis=1)

print(preprocessed_data.head())


    0    1    2    3    4    5    6    7    8    9   ...   29   30   31   32  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   

    33   34   35   36        0         1   
0  0.0  0.0  1.0  0.0 -0.024373 -0.013136  
1  0.0  0.0  1.0  0.0 -0.024373 -0.001017  
2  0.0  0.0  1.0  0.0 -0.016330 -0.010250  
3  0.0  0.0  1.0  0.0 -0.024373 -0.001017  
4  0.0  0.0  1.0  0.0 -0.024373 -0.001017  

[5 rows x 39 columns]


**Transform the Data**

Now I'll create a binary matrix with one row per customer and one column per invoice timestamp, where an entry is 1 if the customer purchased at least one item on that invoice and 0 otherwise. This transformation involves grouping the data by customer ID and invoice date.

In [None]:
# Here I'm grouping data by Customer ID and InvoiceDate
binary_data = data_cleaned.groupby(['Customer ID', 'InvoiceDate']).apply(lambda x: x['StockCode'].nunique()).unstack().fillna(0)

# Convert purchase counts to binary values: 1 if purchased, 0 if not purchased
binary_data[binary_data > 0] = 1

# Display the binary data
print(binary_data.head())

  binary_data = data_cleaned.groupby(['Customer ID', 'InvoiceDate']).apply(lambda x: x['StockCode'].nunique()).unstack().fillna(0)


InvoiceDate  2010-12-01 08:26:00  2010-12-01 08:28:00  2010-12-01 08:34:00  \
Customer ID                                                                  
12346.0                      0.0                  0.0                  0.0   
12347.0                      0.0                  0.0                  0.0   
12348.0                      0.0                  0.0                  0.0   
12349.0                      0.0                  0.0                  0.0   
12350.0                      0.0                  0.0                  0.0   

InvoiceDate  2010-12-01 08:35:00  2010-12-01 08:45:00  2010-12-01 09:00:00  \
Customer ID                                                                  
12346.0                      0.0                  0.0                  0.0   
12347.0                      0.0                  0.0                  0.0   
12348.0                      0.0                  0.0                  0.0   
12349.0                      0.0                  0.0          

**Train the Boltzmann Machine**

Now I will train the Boltzmann machine on the training set, with the goal of learning the underlying probability distribution of the data.
The model is a restricted Boltzmann machine (RBM) with binary units, the same kind of model as **BernoulliRBM** from the sklearn library. Instead of using sklearn, I'm going to code it by hand in TensorFlow, since my PyTorch installation is corrupted.

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

file_path = r'\Users\manov\Downloads\M5-online+retail+ii\online_retail_II.xlsx'
data = pd.read_excel(file_path, sheet_name='Year 2010-2011')

# Cleaning the data by removing incomplete entries
data_cleaned = data.dropna(subset=['Customer ID'])

# Group data by Customer ID and InvoiceDate and then create a binary matrix for purchases
binary_data = data_cleaned.groupby(['Customer ID', 'InvoiceDate']).apply(lambda x: x['StockCode'].nunique()).unstack().fillna(0)
binary_data[binary_data > 0] = 1

# Split the data into training and testing sets
X_train, X_test = train_test_split(binary_data, test_size=0.2, random_state=42)

# Convert the data to numpy arrays
X_train = np.array(X_train, dtype=np.float32)
X_test = np.array(X_test, dtype=np.float32)

class RBM:
    def __init__(self, n_visible, n_hidden, learning_rate=0.01):
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        self.learning_rate = learning_rate

        # Initializing the weights and biases
        self.weights = tf.Variable(tf.random.normal([self.n_visible, self.n_hidden], stddev=0.01))
        self.h_bias = tf.Variable(tf.zeros([self.n_hidden]))
        self.v_bias = tf.Variable(tf.zeros([self.n_visible]))

    def sample_prob(self, probs):
        return tf.nn.relu(tf.sign(probs - tf.random.uniform(tf.shape(probs))))

    def forward_pass(self, v):
        h_prob = tf.nn.sigmoid(tf.matmul(v, self.weights) + self.h_bias)
        h_sample = self.sample_prob(h_prob)
        return h_prob, h_sample

    def backward_pass(self, h):
        v_prob = tf.nn.sigmoid(tf.matmul(h, tf.transpose(self.weights)) + self.v_bias)
        v_sample = self.sample_prob(v_prob)
        return v_prob, v_sample

    def train(self, X, batch_size=64, n_epochs=10):
        dataset = tf.data.Dataset.from_tensor_slices(X).batch(batch_size)
        # Create the optimizer once instead of once per batch
        optimizer = tf.optimizers.SGD(self.learning_rate)
        params = [self.weights, self.h_bias, self.v_bias]

        for epoch in range(n_epochs):
            epoch_error = 0
            for batch in dataset:
                with tf.GradientTape() as tape:
                    v0 = batch
                    h0_prob, h0_sample = self.forward_pass(v0)
                    v1_prob, v1_sample = self.backward_pass(h0_sample)
                    h1_prob, _ = self.forward_pass(v1_sample)

                    # Contrastive-divergence statistics: computed here but not used in the update below
                    positive_grad = tf.matmul(tf.transpose(v0), h0_prob)
                    negative_grad = tf.matmul(tf.transpose(v1_sample), h1_prob)

                    # The update is driven only by the mean squared reconstruction error
                    loss = tf.reduce_mean(tf.square(v0 - v1_prob))

                gradients = tape.gradient(loss, params)
                optimizer.apply_gradients(zip(gradients, params))

                epoch_error += loss.numpy()

            # Note: this is the sum of per-batch mean losses divided by the number of samples
            print(f'Epoch {epoch+1}/{n_epochs}, Reconstruction Error: {epoch_error / len(X):.6f}')

    def reconstruct(self, v):
        h_prob, h_sample = self.forward_pass(v)
        v_prob, v_sample = self.backward_pass(h_sample)
        return v_prob

# Now initializing the RBM
n_visible = X_train.shape[1]
n_hidden = 100
rbm = RBM(n_visible, n_hidden)

# Training the model
rbm.train(X_train, batch_size=64, n_epochs=10)

# Function to compute reconstruction error
def reconstruction_error(rbm, data):
    v = tf.constant(data, dtype=tf.float32)
    v_reconstructed = rbm.reconstruct(v)
    error = tf.reduce_mean(tf.square(v - v_reconstructed))
    return error.numpy()

# reconstruction error for training and testing sets
train_error = reconstruction_error(rbm, X_train)
test_error = reconstruction_error(rbm, X_test)

print(f"Training set reconstruction error: {train_error:.6f}")
print(f"Test set reconstruction error: {test_error:.6f}")


  binary_data = data_cleaned.groupby(['Customer ID', 'InvoiceDate']).apply(lambda x: x['StockCode'].nunique()).unstack().fillna(0)


Epoch 1/10, Reconstruction Error: 0.003938
Epoch 2/10, Reconstruction Error: 0.003938
Epoch 3/10, Reconstruction Error: 0.003937
Epoch 4/10, Reconstruction Error: 0.003936
Epoch 5/10, Reconstruction Error: 0.003936
Epoch 6/10, Reconstruction Error: 0.003935
Epoch 7/10, Reconstruction Error: 0.003934
Epoch 8/10, Reconstruction Error: 0.003934
Epoch 9/10, Reconstruction Error: 0.003933
Epoch 10/10, Reconstruction Error: 0.003932
Training set reconstruction error: 0.249993
Test set reconstruction error: 0.250001


The RBM has been trained, and the training and test reconstruction errors have been calculated. However, the cell also emits a deprecation warning related to the use of DataFrameGroupBy.apply. I'll definitely need to address this warning to keep my code clean and future-proof.

To do so, I need to modify the line where the binary data is created. Instead of applying a lambda to each group, which also operates on the grouping columns, I will select the StockCode column explicitly and aggregate it directly with nunique().

In [None]:
# First group the data by Customer ID and InvoiceDate and then create a binary matrix for purchases
grouped = data_cleaned.groupby(['Customer ID', 'InvoiceDate'])['StockCode'].nunique()
binary_data = grouped.unstack().fillna(0)
binary_data[binary_data > 0] = 1
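
As a quick check of what this grouping produces, here is a minimal sketch on a made-up frame with two customers and two invoice timestamps. The toy rows and the names `toy`, `counts`, and `toy_binary` are assumptions for illustration, not data from the retail file.

```python
import pandas as pd

# Hypothetical mini version of the retail table.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26", "2010-12-01 08:26",
        "2010-12-02 09:00", "2010-12-01 08:26",
    ]),
    "StockCode": ["A", "B", "A", "C"],
})

# Count distinct items per (customer, invoice timestamp), then pivot timestamps into columns.
counts = (
    toy.groupby(["Customer ID", "InvoiceDate"])["StockCode"]
    .nunique()
    .unstack(fill_value=0)
)

# Any positive count becomes 1; missing combinations stay 0.
toy_binary = (counts > 0).astype(int)
print(toy_binary)
```

Each row is a customer and each column an invoice timestamp, so on the real data the matrix is very wide and mostly zeros.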

In [None]:
# Now rewrite the code with the updated preprocessing steps.

file_path = r'\Users\manov\Downloads\M5-online+retail+ii\online_retail_II.xlsx'
data = pd.read_excel(file_path, sheet_name='Year 2010-2011')

data_cleaned = data.dropna(subset=['Customer ID'])

grouped = data_cleaned.groupby(['Customer ID', 'InvoiceDate'])['StockCode'].nunique()
binary_data = grouped.unstack().fillna(0)
binary_data[binary_data > 0] = 1

X_train, X_test = train_test_split(binary_data, test_size=0.2, random_state=42)

X_train = np.array(X_train, dtype=np.float32)
X_test = np.array(X_test, dtype=np.float32)

class RBM:
    def __init__(self, n_visible, n_hidden, learning_rate=0.01):
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        self.learning_rate = learning_rate

        self.weights = tf.Variable(tf.random.normal([self.n_visible, self.n_hidden], stddev=0.01))
        self.h_bias = tf.Variable(tf.zeros([self.n_hidden]))
        self.v_bias = tf.Variable(tf.zeros([self.n_visible]))

    def sample_prob(self, probs):
        return tf.nn.relu(tf.sign(probs - tf.random.uniform(tf.shape(probs))))

    def forward_pass(self, v):
        h_prob = tf.nn.sigmoid(tf.matmul(v, self.weights) + self.h_bias)
        h_sample = self.sample_prob(h_prob)
        return h_prob, h_sample

    def backward_pass(self, h):
        v_prob = tf.nn.sigmoid(tf.matmul(h, tf.transpose(self.weights)) + self.v_bias)
        v_sample = self.sample_prob(v_prob)
        return v_prob, v_sample

    def forward(self, v):
        p_h, h = self.forward_pass(v)
        p_v, v = self.backward_pass(h)
        return v

    def free_energy(self, v):
        # F(v) = -(v . v_bias) - sum_j softplus((v W + h_bias)_j)
        vbias_term = tf.reduce_sum(tf.matmul(v, tf.expand_dims(self.v_bias, 1)), axis=1)
        wx_b = tf.matmul(v, self.weights) + self.h_bias
        # softplus(x) = log(1 + exp(x)), evaluated without overflow for large x
        hidden_term = tf.reduce_sum(tf.nn.softplus(wx_b), axis=1)
        return -vbias_term - hidden_term

def train_rbm(rbm, data, lr=0.01, batch_size=64, n_epochs=10):
    optimizer = tf.optimizers.SGD(lr)
    params = [rbm.weights, rbm.h_bias, rbm.v_bias]
    for epoch in range(n_epochs):
        epoch_error = 0.0
        for i in range(0, len(data), batch_size):
            v0 = data[i:i+batch_size]
            # One Gibbs step (CD-1), run outside the tape so the samples act as constants
            h0_prob, h0_sample = rbm.forward_pass(v0)
            v1_prob, v1_sample = rbm.backward_pass(h0_sample)

            with tf.GradientTape() as tape:
                # Contrastive-divergence loss: free energy of the data minus that of the samples
                loss = tf.reduce_mean(rbm.free_energy(v0) - rbm.free_energy(v1_sample))

            gradients = tape.gradient(loss, params)
            optimizer.apply_gradients(zip(gradients, params))
            epoch_error += loss.numpy()
        # The value logged here is the summed free-energy loss divided by the sample count,
        # so it can be negative even though the label says "Reconstruction Error"
        print(f'Epoch {epoch+1}/{n_epochs}, Reconstruction Error: {epoch_error / len(data):.6f}')

n_visible = X_train.shape[1]
n_hidden = 100
rbm = RBM(n_visible, n_hidden)

train_rbm(rbm, X_train, lr=0.01, batch_size=64, n_epochs=10)

def reconstruction_error(rbm, data):
    v = tf.constant(data, dtype=tf.float32)
    # rbm.forward returns a sampled binary reconstruction, not probabilities
    v_reconstructed = rbm.forward(v)
    error = tf.reduce_mean(tf.square(v - v_reconstructed))
    return error.numpy()

train_error = reconstruction_error(rbm, X_train)
test_error = reconstruction_error(rbm, X_test)

print(f"Training set reconstruction error: {train_error:.6f}")
print(f"Test set reconstruction error: {test_error:.6f}")


Epoch 1/10, Reconstruction Error: -17.709074
Epoch 2/10, Reconstruction Error: -43.134250
Epoch 3/10, Reconstruction Error: -58.942324
Epoch 4/10, Reconstruction Error: -68.630678
Epoch 5/10, Reconstruction Error: -74.400723
Epoch 6/10, Reconstruction Error: -77.651505
Epoch 7/10, Reconstruction Error: -79.063146
Epoch 8/10, Reconstruction Error: -79.106429
Epoch 9/10, Reconstruction Error: -77.998898
Epoch 10/10, Reconstruction Error: -75.989487
Training set reconstruction error: 0.144916
Test set reconstruction error: 0.144841


In the first implementation, the per-epoch reconstruction error stayed almost constant (0.003938 to 0.003932), which shows that the model wasn't learning effectively. The reconstruction error of about 0.25 on both the training and test sets further confirms that the model was not capturing the patterns in the data.
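
The two numbers from the first run can be reproduced without any training, which helps explain them. The sketch below assumes an RBM that still outputs a probability of 0.5 for every visible unit (weights near zero and zero biases give sigmoid(0) = 0.5) and uses made-up sparse binary data; the array shape and sparsity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sparse binary data: 640 rows, 50 columns, about 1% ones.
v = (rng.random((640, 50)) < 0.01).astype(np.float32)

# Untrained model: every visible unit is reconstructed with probability 0.5.
v_prob = np.full_like(v, 0.5)
mse = float(np.mean((v - v_prob) ** 2))

# The first training loop sums per-batch mean losses and divides by the sample count.
batch_size = 64
n_batches = int(np.ceil(len(v) / batch_size))
logged = mse * n_batches / len(v)
print(mse, logged)
```

For binary data, predicting 0.5 everywhere always gives a squared error of 0.25, and dividing the summed batch means by the number of samples scales that by roughly 1/64. Those are the magnitudes seen in the first run (about 0.25 at evaluation and about 0.0039 per epoch), which is consistent with a model that barely moved from its initial state.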

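In contrast, the updated implementation ends with a noticeably lower reconstruction error on both the training and test sets (about 0.145 versus 0.25), which suggests that it is capturing some of the patterns in the data. This is what I wanted! Two caveats apply. First, the per-epoch values printed during training are labelled "Reconstruction Error" but are actually the contrastive-divergence loss, that is, the free energy of the data minus the free energy of the reconstructed samples, so negative values are expected and should not be read as a reconstruction error. Second, the final error here is measured against sampled binary reconstructions (rbm.forward returns v), whereas the first implementation compared against reconstruction probabilities, so the two figures are not strictly comparable.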
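
For reference, taking an SGD step on the free-energy difference is equivalent to the classic CD-1 update, in which the weight change is proportional to positive_grad minus negative_grad. Below is a minimal NumPy sketch of one such step; the function name `cd1_step`, the toy shapes, and the random data are assumptions for illustration and are not part of the notebook's TensorFlow code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, v_bias, h_bias, lr, rng):
    """One CD-1 update on a batch v0 of shape (n, n_visible); returns new parameters."""
    n = v0.shape[0]

    # Positive phase: hidden probabilities and samples given the data.
    h0_prob = sigmoid(v0 @ W + h_bias)
    h0_sample = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)

    # Negative phase: one Gibbs step back to the visible layer and up again.
    v1_prob = sigmoid(h0_sample @ W.T + v_bias)
    v1_sample = (rng.random(v1_prob.shape) < v1_prob).astype(v0.dtype)
    h1_prob = sigmoid(v1_sample @ W + h_bias)

    positive_grad = v0.T @ h0_prob
    negative_grad = v1_sample.T @ h1_prob

    # Move parameters toward the data statistics and away from the model's samples.
    W_new = W + lr * (positive_grad - negative_grad) / n
    v_bias_new = v_bias + lr * (v0 - v1_sample).mean(axis=0)
    h_bias_new = h_bias + lr * (h0_prob - h1_prob).mean(axis=0)
    return W_new, v_bias_new, h_bias_new

# Toy usage with made-up data: 8 samples, 6 visible units, 4 hidden units.
rng = np.random.default_rng(0)
v0 = (rng.random((8, 6)) < 0.3).astype(np.float64)
W = 0.01 * rng.standard_normal((6, 4))
v_bias = np.zeros(6)
h_bias = np.zeros(4)
W2, vb2, hb2 = cd1_step(v0, W, v_bias, h_bias, lr=0.1, rng=rng)
print(W2.shape, vb2.shape, hb2.shape)
```

Writing the update this way makes it explicit which statistics drive learning, and it avoids relying on automatic differentiation through the sampling step.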