# MBDA 770 Final Project

## David Curtis and Jehu Humphries

# Introduction

The dataset used in this project was downloaded from the UCI Machine Learning Repository and is a collection of handwritten letters used for image classification. The dataset is named "Letter Recognition" and was originally created in 1990 by David Slate. There are 20,000 instances in the dataset, each representing a handwritten English capital letter in one of 20 different fonts. Although the scope of this assignment is to conduct exploratory analysis and partition the data, the work undertaken here will feed into the larger goals of the project. The goal of the project is to create an image classification system that iteratively improves itself in a simulated online learning environment. In this paper, the data will be explored for imbalances, the variables will be inspected to understand their distributions, and the data will be partitioned for training, testing, and model improvement.

## Prepare Environment

In [126]:
# used to gather data
import ucimlrepo

# used for exploratory analysis and data partitioning
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from sklearn.preprocessing import StandardScaler

# Used for Data Partitioning
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Used for model SGD training
from sklearn.linear_model import SGDClassifier

# used to create Random Forest model
from sklearn.ensemble import RandomForestClassifier

# used to create Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

# used to create Passive Aggressive Model
from sklearn.linear_model import PassiveAggressiveClassifier

# used for artificial neural network model
from sklearn.neural_network import MLPClassifier

# used for model evaluation
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report

In [128]:
pd.options.display.float_format = '{:.4f}'.format

In [130]:
from ucimlrepo import fetch_ucirepo, list_available_datasets

## Import Data Set

The code snippet below downloads the dataset from the machine learning repository through the UCI repository API and saves the data files in the local environment as two data frames. 

In [134]:
# fetch dataset from UCI 
letter_recognition = fetch_ucirepo(id=59)

In [135]:
# data (as pandas dataframes) 
X = letter_recognition.data.features 
y = letter_recognition.data.targets

# Exploratory Analysis

The features in the dataset are scaled statistical details extracted from each of the images, and the target variable is a categorical variable containing each letter. The code snippet below outlines each of the variables in the two datasets used for this paper. Additionally, a description of each variable is available in the code output below.

In [138]:
# gather variable information 
print(letter_recognition.variables)

     name     role         type demographic                    description  \
0   lettr   Target  Categorical        None                 capital letter   
1   x-box  Feature      Integer        None     horizontal position of box   
2   y-box  Feature      Integer        None       vertical position of box   
3   width  Feature      Integer        None                   width of box   
4    high  Feature      Integer        None                  height of box   
5   onpix  Feature      Integer        None              total # on pixels   
6   x-bar  Feature      Integer        None     mean x of on pixels in box   
7   y-bar  Feature      Integer        None     mean y of on pixels in box   
8   x2bar  Feature      Integer        None                mean x variance   
9   y2bar  Feature      Integer        None                mean y variance   
10  xybar  Feature      Integer        None           mean x y correlation   
11  x2ybr  Feature      Integer        None              mean of

In [139]:
# Extract details from features data frame
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x-box   20000 non-null  int64
 1   y-box   20000 non-null  int64
 2   width   20000 non-null  int64
 3   high    20000 non-null  int64
 4   onpix   20000 non-null  int64
 5   x-bar   20000 non-null  int64
 6   y-bar   20000 non-null  int64
 7   x2bar   20000 non-null  int64
 8   y2bar   20000 non-null  int64
 9   xybar   20000 non-null  int64
 10  x2ybr   20000 non-null  int64
 11  xy2br   20000 non-null  int64
 12  x-ege   20000 non-null  int64
 13  xegvy   20000 non-null  int64
 14  y-ege   20000 non-null  int64
 15  yegvx   20000 non-null  int64
dtypes: int64(16)
memory usage: 2.4 MB


In [140]:
# Extract details of target data frame
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   lettr   20000 non-null  object
dtypes: object(1)
memory usage: 156.4+ KB


In [None]:
# Print head of target data frame
y.head()

In [None]:
# Print head of features data frame
X.head()

In [16]:
# Merge data frames
data = pd.merge(y, X, left_index = True, right_index = True)

In [None]:
# Print head of merged data
data.head()

In [None]:
# Print tail of merged data
data.tail()

In [None]:
# Gather summary statistics of data frame
data.describe()

## Target Variable Representation

In [None]:
# Gather details about possible imbalances to the target variable in the downloaded data
data['lettr'].value_counts()

In [None]:
# Create visualization of the distribution of each letter in the data set.
category_counts = data['lettr'].value_counts()

category_counts.plot(kind='bar')
plt.title('Frequency of Letters')
plt.xlabel('Letter')
plt.ylabel('Frequency')

From the output above, each letter is represented in roughly equal numbers in the dataset. When the data is partitioned, the balance within the training, testing, and online learning subsets is expected to degrade.
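
One way to put a number on "relatively balanced" is to compare the most and least frequent classes. The sketch below uses a small made-up label column (`toy`, not the project data) and a hypothetical helper named `imbalance_ratio`; on the real data the same helper could be given `data['lettr']`.

```python
import pandas as pd

def imbalance_ratio(labels: pd.Series) -> float:
    """Return the count of the most frequent class divided by the least frequent."""
    counts = labels.value_counts()
    return counts.max() / counts.min()

# Toy labels for illustration only (not the Letter Recognition data)
toy = pd.Series(list("AAAABBBBCCC"))
ratio = imbalance_ratio(toy)
print(f"max/min class ratio: {ratio:.2f}")
```

A ratio close to 1 indicates that the classes are nearly equally represented; larger values indicate growing imbalance.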

## "X-Box"

In [None]:
# Create visualizations of the distribution of data in each variable. The visualization below is the horizontal position of each letter.
sns.boxplot(x = "lettr", y = "x-box", data = data, order = sorted(data["lettr"].unique()))

plt.title('Distribution of "x-box" Grouped by Letter')
plt.xlabel("Letter")
plt.ylabel('"x-box" Value')

The boxplot of the horizontal position of the bounding box shows how each letter is placed within its image. Outliers are present in each of the boxplots.

## "Y-Box"

In [None]:
# Create visualization of the vertical position of each letter in the data frame. 
sns.boxplot(x = "lettr", y = "y-box", data = data, order = sorted(data["lettr"].unique()))

plt.title('Distribution of "y-box" Grouped by Letter')
plt.xlabel("Letter")
plt.ylabel('"y-box" Value')

The boxplot of the vertical position of each letter reveals no outliers but an incredibly large range of values. In a classification task, this may indicate a lower variable importance compared to other variables.
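
Whether a wide spread really translates into low importance can only be checked once a model has been fit. The following is a minimal sketch on synthetic data that reads a random forest's `feature_importances_`; the column names `informative` and `noise` and the data itself are made up for illustration and are not taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic labels and two features: one tied to the label, one pure noise
labels = rng.integers(0, 2, size=n)
toy_X = pd.DataFrame({
    "informative": labels * 5 + rng.integers(0, 3, size=n),  # separates the classes
    "noise": rng.integers(0, 16, size=n),                     # wide range, no signal
})

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy_X, labels)

# Impurity-based importances, one value per column, summing to 1
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances are only one possible measure; they are used here because they are available directly on the fitted forest.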

## "Width"

In [None]:
# Create visualization of the width of each letter in the data frame. 
sns.boxplot(x = "lettr", y = "width", data = data, order = sorted(data["lettr"].unique()))

plt.title('Distribution of "Width" Grouped by Letter')
plt.xlabel("Letter")
plt.ylabel('"Width" Value')

Unsurprisingly, the letter I shows a large separation from the other letters in the dataset.

## "High"

In [None]:
# Create visualization of the height of each letter in the data frame.
sns.boxplot(x = "lettr", y = "high", data = data, order = sorted(data["lettr"].unique()))

plt.title('Distribution of "high" Grouped by Letter')
plt.xlabel("Letter")
plt.ylabel('"high" Value')

Several letters reveal how they are commonly written based on the height of the letter in each instance. The letter "Q" is generally written larger than other letters, so much so that it is the first letter to show an outlier at the low end of its boxplot. The letters Z, Y, and J have several very tall images.

## "Onpix"

In [None]:
# Create visualization of the number of "on" pixels of each letter in the data frame.
sns.boxplot(x = "lettr", y = "onpix", data = data, order = sorted(data["lettr"].unique()))

plt.title('Distribution of "onpix" Grouped by Letter')
plt.xlabel("Letter")
plt.ylabel('"onpix" Value')

The variable "onpix" represents the number of "on" pixels in the image and is scaled for future classification. The remaining variables in the data frame are the result of mathematical analysis; they are valuable to classification but more complex than vertical position, horizontal position, width, height, and number of pixels. While these variables may be important features for classification, they do not easily communicate differences between the letters through boxplot visualizations.

## Correlations

In [None]:
# Create a correlation matrix of the features in the data set
correlation_matrix = X.corr()

print(correlation_matrix)

In [None]:
sns.heatmap(correlation_matrix, annot = False, fmt = ".2f", cmap = 'coolwarm',
            square = True, linewidths = .5, cbar_kws = {"shrink": .5})

Unsurprisingly, the greatest positive correlations are found among the variables that describe the height, width, and number of pixels. This is reasonable: the taller and wider the box used to separate the handwritten image from its background, the more pixels the box captures.
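
Reading the strongest relationships off a heatmap can be error-prone, so it may help to list them directly. The sketch below is one way to do that; `top_correlated_pairs` is a hypothetical helper and `toy` is made-up data, not the project's features. On the real data the helper could be called on `X`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, k: int = 3) -> pd.Series:
    """Return the k feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(k)

# Toy frame for illustration only: 'width' and 'onpix' move together, 'other' does not
toy = pd.DataFrame({
    "width": [1, 2, 3, 4, 5, 6],
    "onpix": [2, 4, 6, 8, 10, 13],
    "other": [5, 1, 4, 2, 6, 3],
})
top = top_correlated_pairs(toy, k=1)
print(top)
```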

# Data Partitioning

The next portion of this paper partitions the data for training, testing, and learning. Previous studies trained models on 16,000 instances and validated them on the remaining 4,000, which is an 80/20 split of all the data. This project keeps the 80/20 ratio but applies it to only half of the data, so the initial training set is 50% of the size used in those studies. The data is first separated into two equally sized data frames: 50% of the data is used to train and validate the model, and the other 50% serves as new information that will improve the model. The initial models are expected to suffer in accuracy because 20 different fonts are used in the dataset and some letters may not appear in every font within the training data. However, updating the model with images of known letters written in unseen fonts should improve it. This separation and incremental model improvement is a crucial aspect of a fluid ML system that can rapidly adjust to new environments.
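
The partition sizes implied by this scheme can be worked out with simple arithmetic. The sketch below assumes the 20,000-instance total, the 50% holdout, the 20% validation share, and the five update batches used in this project; the variable names are illustrative only.

```python
import math

total = 20_000               # instances in the full dataset
holdout_fraction = 0.5       # share reserved as simulated "new" data
validation_fraction = 0.2    # share of the modelling half used for validation
n_batches = 5                # number of simulated update batches

new_pool = int(total * holdout_fraction)              # rows held back as new data
modelling = total - new_pool                          # rows for train/validate
n_validate = round(modelling * validation_fraction)   # validation rows
n_train = modelling - n_validate                      # training rows
batch_size = math.ceil(new_pool / n_batches)          # rows per update batch

print(n_train, n_validate, batch_size)
```

Under these assumptions the initial model trains on 8,000 instances, is validated on 2,000, and is then exposed to five batches of 2,000 instances each.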

In [18]:
# Create a data partition of 50% of the data. data1 will be partitioned into a train/validate split and data2 will be split 5 times for iterative improvement.
data1, data2 = train_test_split(data, test_size = 0.5, random_state=42)

In [20]:
# Partition Data1 into a train/validate data split
data_train, data_test = train_test_split(data1, test_size = .2, random_state=43)

In [22]:
# Partition Data2 into new data frames to later simulate new information

# Shuffle the DataFrame
data_shuffled = data2.sample(frac=1, random_state=44).reset_index(drop=True)

# Calculate the size of each partition
partition_size = int(np.ceil(len(data_shuffled) / 5))

# Split the data into 5 equally sized DataFrames
data_parts = [data_shuffled.iloc[i * partition_size:(i + 1) * partition_size] for i in range(5)]

In [24]:
# Extract and rename the data frames from the list
new1, new2, new3, new4, new5 = data_parts

## Class Balance

Because the target variable represents 26 different letters in 20 fonts, there are opportunities for imbalances and for information that the model has not been trained on. The primary concern is how each of the letters is represented in the training data. Ideally, each letter is present in the training data but also appears later in a font the model has not seen, so that the model can improve when it encounters the new information.

In [None]:
# Create visualization of the distribution of each letter in the training set.
category_counts = data_train['lettr'].value_counts()

category_counts.plot(kind='bar')
plt.title('Frequency of Letters in Training Data')
plt.xlabel('Letter')
plt.ylabel('Frequency')

In [None]:
# Create visualization of the distribution of each letter in the validation set.
category_counts = data_test['lettr'].value_counts()

category_counts.plot(kind='bar')
plt.title('Frequency of Letters in Validation Data')
plt.xlabel('Letter')
plt.ylabel('Frequency')

In [None]:
# Create visualization of the distribution of each letter in the "new data".
category_counts = new1['lettr'].value_counts()

category_counts.plot(kind='bar')
plt.title('Frequency of Letters in "New Information 1"')
plt.xlabel('Letter')
plt.ylabel('Frequency')

In [None]:
# Create visualization of the distribution of each letter in the "new data".
category_counts = new2['lettr'].value_counts()

category_counts.plot(kind='bar')
plt.title('Frequency of Letters in "New Information 2"')
plt.xlabel('Letter')
plt.ylabel('Frequency')

In [None]:
# Create visualization of the distribution of each letter in the "new data".
category_counts = new3['lettr'].value_counts()

category_counts.plot(kind='bar')
plt.title('Frequency of Letters in "New Information 3"')
plt.xlabel('Letter')
plt.ylabel('Frequency')

In [None]:
# Create visualization of the distribution of each letter in the "new data".
category_counts = new4['lettr'].value_counts()

category_counts.plot(kind='bar')
plt.title('Frequency of Letters in "New Information 4"')
plt.xlabel('Letter')
plt.ylabel('Frequency')

In [None]:
# Create visualization of the distribution of each letter in the "new data".
category_counts = new5['lettr'].value_counts()

category_counts.plot(kind='bar')
plt.title('Frequency of Letters in "New Information 5"')
plt.xlabel('Letter')
plt.ylabel('Frequency')

Although the classes are very well balanced in the training data, the information that will be used to update the model is not as equally represented. During model improvement, specific actions will be taken to ensure that the model does not become overfit. Although the font used in each instance is not available, the hope is that some combination of font and letter is absent from the training data, so that the model is first exposed to that combination in either the new data or the validation data. One of the goals of this project is to understand how to create a model and then implement a process of continual improvement. Through the partitioning and purposeful exposure to new information, the study will accomplish that goal.
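
If the uneven representation in the update batches became a concern, one option (not used in this project) would be a stratified split, which `train_test_split` supports through its `stratify` argument. The sketch below uses a made-up frame named `toy` purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame for illustration only: 3 classes with 10 rows each
toy = pd.DataFrame({
    "lettr": list("A" * 10 + "B" * 10 + "C" * 10),
    "feature": range(30),
})

# stratify keeps the class proportions equal (up to rounding) in both halves
part_a, part_b = train_test_split(
    toy, test_size=0.5, random_state=42, stratify=toy["lettr"]
)

print(part_a["lettr"].value_counts().sort_index())
print(part_b["lettr"].value_counts().sort_index())
```

Stratifying removes the chance imbalance between partitions, which may or may not be desirable when the aim is to simulate unpredictable new data.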

# Final Project

There are several models that will be explored to evaluate accuracy with complete retraining. 

* Stochastic Gradient Descent
* Naive Bayes
* Random Forest
* Passive Aggressive
* Artificial Neural Network

The goal of evaluating these models is to create model deployment options. Through inspection of incremental learning and batch learning protocols, the project will identify which model is best suited to each environment. Ultimately, the conclusions from this section will allow the organization to answer the following questions:

* Which model type is best suited for my data?
* Which training protocol is best suited for my organization and problem?
* Which model allows me to achieve my desired end state the fastest, with the least amount of retraining?
* How robust is my changing model to changes in the data?

In [26]:
# initial data partitioning training with 50% of the original data
X = data1[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']]
y = data1['lettr']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=43)

In [28]:
X1 = new1[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']]
y1 = new1['lettr']

In [30]:
X2 = new2[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']]
y2 = new2['lettr']

In [32]:
X3 = new3[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']]
y3 = new3['lettr']

In [34]:
X4 = new4[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']]
y4 = new4['lettr']

In [36]:
X5 = new5[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']]
y5 = new5['lettr']

# Stochastic Gradient Descent Model

In [None]:
# define parameters
sgd_clf = SGDClassifier(loss='hinge',
                        learning_rate='optimal',
                        early_stopping=True,
                        validation_fraction=0.1,  
                        n_iter_no_change=20,
                        random_state=42)

# Train with the initial training set
sgd_clf.fit(X_train, y_train)

In [None]:
# Create Predictions
YPred1 = sgd_clf.predict(X_test)

In [None]:
# Evaluate initial performance
initial_accuracy = accuracy_score(y_test, sgd_clf.predict(X_test))
print(f'Initial model accuracy: {initial_accuracy:.4f}')

In [None]:
# create confusion matrix of initial model
cm = confusion_matrix(y_test, YPred1)

# define class labels
class_names = np.unique(y_test)

plt.figure(figsize=(10, 7))  # You might adjust the size to fit 26 classes
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.title('Stochastic Gradient Descent Model Confusion Matrix')
plt.xlabel('Predicted labels')
plt.ylabel('Actual labels')
plt.show()

## SGD Model Applied on New Data Without Additional Training

In [None]:
# Create Predictions
YPred_batch1 = sgd_clf.predict(X1)
YPred_batch2 = sgd_clf.predict(X2)
YPred_batch3 = sgd_clf.predict(X3)
YPred_batch4 = sgd_clf.predict(X4)
YPred_batch5 = sgd_clf.predict(X5)

In [None]:
# Evaluate performance on each batch without any retraining measures
accuracy_batch1 = accuracy_score(y1, YPred_batch1)
accuracy_batch2 = accuracy_score(y2, YPred_batch2)
accuracy_batch3 = accuracy_score(y3, YPred_batch3)
accuracy_batch4 = accuracy_score(y4, YPred_batch4)
accuracy_batch5 = accuracy_score(y5, YPred_batch5)

print(f"Accuracy on batch 1: {accuracy_batch1:.4f}")
print(f"Accuracy on batch 2: {accuracy_batch2:.4f}")
print(f"Accuracy on batch 3: {accuracy_batch3:.4f}")
print(f"Accuracy on batch 4: {accuracy_batch4:.4f}")
print(f"Accuracy on batch 5: {accuracy_batch5:.4f}")

In [None]:
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
    plt.figure(figsize=(8, 8))
    sns.heatmap(cm, annot=True, fmt="d", cmap='Blues', xticklabels=np.unique(y_true), yticklabels=np.unique(y_true))
    plt.title(title)
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

# Plot confusion matrix for each batch
plot_confusion_matrix(y1, YPred_batch1, 'SGD Batch 1')
plot_confusion_matrix(y2, YPred_batch2, 'SGD Batch 2')
plot_confusion_matrix(y3, YPred_batch3, 'SGD Batch 3')
plot_confusion_matrix(y4, YPred_batch4, 'SGD Batch 4')
plot_confusion_matrix(y5, YPred_batch5, 'SGD Batch 5')

## SGD Model With Incremental Learning

In [None]:
# Define batches as DataFrames
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

# Define class labels
classes = np.unique(np.concatenate([y for _, y in new_batches]))

# Initialize data storage for predictions and true labels
predictions = []
true_labels = []

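# The model was already fit above; partial_fit does not support early stopping,
# so switch it off before the incremental updates. No extra pass over the first
# batch is made here, so batch 1 is still unseen when it is scored in the loop.
sgd_clf.set_params(early_stopping=False)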

for i, (X_new, y_new) in enumerate(new_batches):
    # Predict on the current batch
    y_pred = sgd_clf.predict(X_new)
    accuracy = accuracy_score(y_new, y_pred)
    print(f"Accuracy for batch {i + 1}: {accuracy:.4f}") 
    
    # Store predictions and true labels for analysis
    predictions.append(y_pred)
    true_labels.append(y_new)

    # Update model with the current batch
    sgd_clf.partial_fit(X_new, y_new, classes=classes)

In [None]:
def plot_confusion_matrix(cm, classes, title='Confusion Matrix'):
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
    plt.title(title)
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

# Loop through stored predictions and true labels
for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
    cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
    plot_confusion_matrix(cm, np.unique(y_true), title=f'SGD Incremental Learning for Batch {i}')

## SGD Model With Complete Retraining After Each Batch

In [None]:
# Create retraining function
def retrain_model(current_data_X, current_data_y, new_data_X, new_data_y):
    # Combine the new data with the existing data using pandas concat
    updated_data_X = pd.concat([current_data_X, new_data_X], ignore_index=True)
    updated_data_y = pd.concat([current_data_y, new_data_y], ignore_index=True)
    
    # Reinitialize the model
    new_model = SGDClassifier(loss='hinge', penalty='l2', learning_rate='optimal', random_state=42)
    
    # Retrain the model on the combined dataset
    new_model.fit(updated_data_X, updated_data_y)
    
    return new_model, updated_data_X, updated_data_y

In [None]:
# Initialize data storage for predictions and true labels
predictions = []
true_labels = []

# Initialize training data
current_data_X, current_data_y = X_train.copy(), y_train.copy()

# Define your batches
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

# Iterate through each batch
for i, (new_data_X, new_data_y) in enumerate(new_batches, start=1):
    # Retrain the model with the current and new batch of data
    sgd_clf, current_data_X, current_data_y = retrain_model(current_data_X, current_data_y, new_data_X, new_data_y)
    
    # Predict on the new batch and evaluate
    new_predictions = sgd_clf.predict(new_data_X)
    accuracy = accuracy_score(new_data_y, new_predictions)
    print(f"Retrained model accuracy on batch {i}: {accuracy:.4f}")
    
    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(new_predictions)
    true_labels.append(new_data_y)

In [None]:
def plot_confusion_matrix(cm, classes, title='Confusion Matrix'):
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
    plt.title(title)
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

# Loop through stored predictions and true labels
for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
    cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
    plot_confusion_matrix(cm, np.unique(y_true), title=f'SGD With Retraining for Batch {i}')

# Random Forest Classification Model

In [None]:
# initialize model
random_forest_clf = RandomForestClassifier(random_state=42)

In [None]:
# train Random Forest Model
random_forest_clf.fit(X_train, y_train)

In [None]:
# Predictions on the test set
predictions = random_forest_clf.predict(X_test)

# Evaluate its performance on the test set
initial_accuracy = accuracy_score(y_test, random_forest_clf.predict(X_test))
print(f'Initial model accuracy: {initial_accuracy:.4f}')

In [None]:
# create confusion matrix of initial model
cm = confusion_matrix(y_test, predictions)

# define class labels
class_names = np.unique(y_test)

plt.figure(figsize=(10, 7))  # You might adjust the size to fit 26 classes
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted labels')
plt.ylabel('Actual labels')
plt.show()

## Random Forest Model Applied on New Data Without Additional Training

In [None]:
# Create Predictions using original model
YPred_batch1 = random_forest_clf.predict(X1)
YPred_batch2 = random_forest_clf.predict(X2)
YPred_batch3 = random_forest_clf.predict(X3)
YPred_batch4 = random_forest_clf.predict(X4)
YPred_batch5 = random_forest_clf.predict(X5)

In [None]:
# Evaluate performance on each batch without any retraining measures
accuracy_batch1 = accuracy_score(y1, YPred_batch1)
accuracy_batch2 = accuracy_score(y2, YPred_batch2)
accuracy_batch3 = accuracy_score(y3, YPred_batch3)
accuracy_batch4 = accuracy_score(y4, YPred_batch4)
accuracy_batch5 = accuracy_score(y5, YPred_batch5)

print(f"Accuracy on batch 1: {accuracy_batch1:.4f}")
print(f"Accuracy on batch 2: {accuracy_batch2:.4f}")
print(f"Accuracy on batch 3: {accuracy_batch3:.4f}")
print(f"Accuracy on batch 4: {accuracy_batch4:.4f}")
print(f"Accuracy on batch 5: {accuracy_batch5:.4f}")

## Random Forest Model With Retraining After Each Batch

In [None]:
def retrain_model_with_new_batch(model, current_X, current_y, new_X, new_y):
    # Combine the new data with the existing data using pandas concat
    updated_X = pd.concat([current_X, new_X], ignore_index=True)
    updated_y = pd.concat([current_y, new_y], ignore_index=True)
    
    # Re-train the model on the combined dataset
    model.fit(updated_X, updated_y)
    
    return model

In [None]:
# Accumulate the initial training data as pandas DataFrame and Series
accumulated_X = X_train.copy()
accumulated_y = y_train.copy()

# Define batches
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

# Sequentially retrain and evaluate the model with each new batch
for i, (new_X, new_y) in enumerate(new_batches, start=1):
    # Retrain the model with the current and new batch of data
    random_forest_clf = retrain_model_with_new_batch(random_forest_clf, accumulated_X, accumulated_y, new_X, new_y)
    
    # Update the accumulated data with the new batch using pandas concat
    accumulated_X = pd.concat([accumulated_X, new_X], ignore_index=True)
    accumulated_y = pd.concat([accumulated_y, new_y], ignore_index=True)
    
    # Evaluate the retrained model on the test set
    accuracy = accuracy_score(y_test, random_forest_clf.predict(X_test))
    print(f'Retrained model accuracy after batch {i}: {accuracy:.4f}')
    

# Naive Bayes Model

In [None]:
# initialize the model
mnb = MultinomialNB()

In [None]:
# train the model
mnb.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = mnb.predict(X_test)

# Evaluate its performance on the test set
initial_accuracy = accuracy_score(y_test, y_pred)
print(f'Initial model accuracy: {initial_accuracy:.4f}')

## Naive Bayes Model Applied on New Data Without Additional Training

In [None]:
# Create Predictions using original model
YPred_batch1 = mnb.predict(X1)
YPred_batch2 = mnb.predict(X2)
YPred_batch3 = mnb.predict(X3)
YPred_batch4 = mnb.predict(X4)
YPred_batch5 = mnb.predict(X5)

In [None]:
# Evaluate performance on each batch without any retraining measures
accuracy_batch1 = accuracy_score(y1, YPred_batch1)
accuracy_batch2 = accuracy_score(y2, YPred_batch2)
accuracy_batch3 = accuracy_score(y3, YPred_batch3)
accuracy_batch4 = accuracy_score(y4, YPred_batch4)
accuracy_batch5 = accuracy_score(y5, YPred_batch5)

print(f"Accuracy on batch 1: {accuracy_batch1:.4f}")
print(f"Accuracy on batch 2: {accuracy_batch2:.4f}")
print(f"Accuracy on batch 3: {accuracy_batch3:.4f}")
print(f"Accuracy on batch 4: {accuracy_batch4:.4f}")
print(f"Accuracy on batch 5: {accuracy_batch5:.4f}")

In [None]:
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
    plt.figure(figsize=(8, 8))
    sns.heatmap(cm, annot=True, fmt="d", cmap='Blues', xticklabels=np.unique(y_true), yticklabels=np.unique(y_true))
    plt.title(title)
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

# Plot confusion matrix for each batch
plot_confusion_matrix(y1, YPred_batch1, 'Naive Bayes for Batch 1')
plot_confusion_matrix(y2, YPred_batch2, 'Naive Bayes for Batch 2')
plot_confusion_matrix(y3, YPred_batch3, 'Naive Bayes for Batch 3')
plot_confusion_matrix(y4, YPred_batch4, 'Naive Bayes for Batch 4')
plot_confusion_matrix(y5, YPred_batch5, 'Naive Bayes for Batch 5')

## Naive Bayes Model With Incremental Learning

In [None]:
# Define class labels
classes = np.unique(y_train)

# Define batches as DataFrames
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

predictions = []
true_labels = []

for i, (X_new, y_new) in enumerate(new_batches, start=1):
    # Update the model with the new batch using partial_fit
    mnb.partial_fit(X_new, y_new, classes=classes)
    
    # Evaluate the updated model on a consistent test set
    y_pred = mnb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy after batch {i}: {accuracy:.4f}")
    
    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(y_pred)
    true_labels.append(y_test)

In [None]:
def plot_confusion_matrix(cm, classes, title='Confusion Matrix'):
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
    plt.title(title)
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

# Loop through stored predictions and true labels
for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    plot_confusion_matrix(cm, classes, title=f'Naive Bayes Incremental Learning for Batch {i}')

## Naive Bayes With Complete Retraining After Each Batch

In [None]:
# Initialize the Multinomial Naive Bayes model
mnb = MultinomialNB()

# Define classes and original training data
classes = np.unique(y_train)
current_X_train, current_y_train = X_train.copy(), y_train.copy()

# Define new batches
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

predictions = []
true_labels = []

# Iterate through each batch
for i, (new_X, new_y) in enumerate(new_batches):
    # Retrain the model on the data accumulated so far (the current batch is not yet included)
    mnb.fit(current_X_train, current_y_train)

    # Predict on the current batch and evaluate
    y_pred = mnb.predict(new_X)
    accuracy = accuracy_score(new_y, y_pred)
    print(f'Accuracy after retraining with batch {i+1}: {accuracy:.4f}')

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(y_pred)
    true_labels.append(new_y)

    # Update the current training dataset with the current batch for the next iteration
    # This keeps the model learning cumulatively
    current_X_train = pd.concat([current_X_train, new_X], ignore_index=True)
    current_y_train = pd.concat([current_y_train, new_y], ignore_index=True)

In [None]:
def plot_confusion_matrices(predictions, true_labels, classes):
    for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
        cm = confusion_matrix(y_true, y_pred, labels=classes)
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
        plt.title(f'Naive Bayes With Retraining for Batch {i}')
        plt.xlabel('Predicted Labels')
        plt.ylabel('True Labels')
        plt.show()

# Call the function with the stored predictions and labels
plot_confusion_matrices(predictions, true_labels, classes)

# Passive Aggressive Classifier Model

In [None]:
# initialize model
pac = PassiveAggressiveClassifier(max_iter=1000, random_state=42, C=1.0)

In [None]:
# train model
pac.fit(X_train, y_train)

In [None]:
initial_predictions = pac.predict(X_test)
print(f"Initial accuracy: {accuracy_score(y_test, initial_predictions):.4f}")

## Passive Aggressive Model With Incremental Training

In [None]:
# Define new data
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

# Define classes from the training labels and every batch
classes = np.unique(np.concatenate([y_train] + [y for _, y in new_batches]))

predictions = []
true_labels = []

# Process each batch for prediction and subsequent training
for i, (X_new, y_new) in enumerate(new_batches):
    # Apply the model on the current batch to gather predictions
    current_predictions = pac.predict(X_new)
    current_accuracy = accuracy_score(y_new, current_predictions)
    print(f"Accuracy for batch {i + 1}: {current_accuracy:.4f}")

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(current_predictions)
    true_labels.append(y_new)
    
    # Use partial fit to update the model with the current batch
    pac.partial_fit(X_new, y_new, classes=classes)

In [None]:
def plot_confusion_matrices(predictions, true_labels, classes):
    for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
        cm = confusion_matrix(y_true, y_pred, labels=classes)
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
        plt.title(f'PAC Incremental Training for Batch {i}')
        plt.xlabel('Predicted Labels')
        plt.ylabel('True Labels')
        plt.show()

# Call the function with the stored predictions and labels
plot_confusion_matrices(predictions, true_labels, classes)
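
The incremental loops in this notebook share one pattern: score a batch with the current model, then update the model on that same batch with `partial_fit`. The sketch below wraps that pattern in a single helper so it could be reused across classifiers. The helper name `evaluate_then_update` and the synthetic data from `make_classification` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score


def evaluate_then_update(model, batches, classes):
    """Score each batch before the model learns from it, then update the model."""
    scores = []
    for X_batch, y_batch in batches:
        # Predict first so the score reflects data the model has not yet seen
        scores.append(accuracy_score(y_batch, model.predict(X_batch)))
        # Then fold the batch into the model
        model.partial_fit(X_batch, y_batch, classes=classes)
    return scores


# Synthetic stand-in data (hypothetical, not the letter dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=16, n_informative=8, n_classes=3, random_state=0
)
demo_classes = np.unique(y_demo)

# Fit once on an initial slice so predict() is available for the first batch
demo_model = SGDClassifier(random_state=0)
demo_model.partial_fit(X_demo[:200], y_demo[:200], classes=demo_classes)

demo_batches = [(X_demo[200:400], y_demo[200:400]), (X_demo[400:], y_demo[400:])]
demo_scores = evaluate_then_update(demo_model, demo_batches, demo_classes)
print(demo_scores)
```

Any estimator that implements `partial_fit` with a `classes` argument (for example `PassiveAggressiveClassifier` or `MLPClassifier`) should be usable in place of `SGDClassifier` here.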

## Passive Aggressive Model With Retraining After Each Batch

In [None]:
# Define classes and original training data
classes = np.unique(y_train)
current_X_train, current_y_train = X_train.copy(), y_train.copy()

# Define batches
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

predictions = []
true_labels = []

# Iterate through each batch
for i, (new_X, new_y) in enumerate(new_batches):
    # Retrain model on the current training data
    pac.fit(current_X_train, current_y_train)

    # Predict on the current batch and evaluate
    y_pred = pac.predict(new_X)
    accuracy = accuracy_score(new_y, y_pred)
    print(f'Accuracy after retraining with batch {i+1}: {accuracy:.4f}')

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(y_pred)
    true_labels.append(new_y)

    # Update the current training dataset with the current batch for the next iteration
    current_X_train = pd.concat([current_X_train, new_X], ignore_index=True)
    current_y_train = pd.concat([current_y_train, new_y], ignore_index=True)


In [None]:
def plot_confusion_matrices(predictions, true_labels, classes):
    for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
        cm = confusion_matrix(y_true, y_pred, labels=classes)
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
        plt.title(f'PAC with Retraining for Batch {i}')
        plt.xlabel('Predicted Labels')
        plt.ylabel('True Labels')
        plt.show()
        
# Call the function to plot confusion matrices after processing all batches
plot_confusion_matrices(predictions, true_labels, classes)

# Artificial Data Creation

In [None]:
#Partition Data into new data frames to later simulate new information

# Shuffle the DataFrame
data_shuffled = data.sample(frac=1, random_state=44).reset_index(drop=True)

# Calculate the size of each partition
partition_size = int(np.ceil(len(data_shuffled) / 10))

# Split the data into 10 equally sized DataFrames
data_parts = [data_shuffled.iloc[i * partition_size:(i + 1) * partition_size] for i in range(10)]
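
The ceiling-based slicing works cleanly here because 20,000 rows divide evenly by 10. When the row count is not a multiple of the partition count, the final slices come out short or empty. One possible alternative, sketched on a hypothetical 23-row frame (`demo_data` is a stand-in, not the project data), is to split the row positions with `np.array_split`, which keeps partition sizes within one row of each other.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: 23 rows, deliberately not divisible by 10
demo_data = pd.DataFrame({"value": range(23)})
demo_shuffled = demo_data.sample(frac=1, random_state=44).reset_index(drop=True)

# Split the row positions into 10 groups whose sizes differ by at most one
position_groups = np.array_split(np.arange(len(demo_shuffled)), 10)
demo_parts = [demo_shuffled.iloc[positions] for positions in position_groups]

part_sizes = [len(part) for part in demo_parts]
print(part_sizes)
```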

In [None]:
# Extract and rename the data frames from the list
add1, add2, add3, add4, add5, add6, add7, add8, add9, add10 = data_parts

In [None]:
X6 = add1[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y6 = add1['lettr']

In [None]:
X7 = add2[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y7 = add2['lettr']

In [None]:
X8 = add3[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y8 = add3['lettr']

In [None]:
X9 = add4[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y9 = add4['lettr']

In [None]:
X10 = add5[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y10 = add5['lettr']

In [None]:
X11 = add6[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y11 = add6['lettr']

In [None]:
X12 = add7[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y12 = add7['lettr']

In [None]:
X13 = add8[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y13 = add8['lettr']

In [None]:
X14 = add9[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y14 = add9['lettr']

In [None]:
X15 = add10[['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr',
           'x2ybr', 'xy2br', 'x-ege','xegvy', 'y-ege', 'yegvx']] 
y15 = add10['lettr']

In [None]:
# Define batches as DataFrames
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5), (X6, y6), (X7, y7), (X8, y8),
              (X9, y9), (X10, y10), (X11, y11), (X12, y12), (X13, y13), (X14, y14), (X15, y15)]
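
The ten cells above repeat the same column selection for each partition. A more compact sketch is shown below: it loops over the partitions and returns `(features, target)` pairs. Note that the selection lists above name `'x2ybr'` twice, which yields 17 columns. The sketch lists each feature once, which is an assumption about the intended result and would need to match whatever column set the models were trained on. The helper `split_features_target` and the tiny `demo_part` frame are hypothetical.

```python
import pandas as pd

# The 16 feature columns, each listed once (assumed intent)
FEATURES = ['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar',
            'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']


def split_features_target(part, feature_cols=FEATURES, target_col='lettr'):
    """Return the (X, y) pair for one partition."""
    return part[list(feature_cols)], part[target_col]


# Tiny synthetic frame standing in for the real partitions
demo_part = pd.DataFrame({col: [1, 2, 3] for col in FEATURES})
demo_part['lettr'] = ['A', 'B', 'C']
demo_partitions = [demo_part.iloc[:2], demo_part.iloc[2:]]

# One (X, y) tuple per partition, ready to append to a batch list
extra_batches = [split_features_target(part) for part in demo_partitions]
print([X_part.shape for X_part, _ in extra_batches])
```

With the real data, the same list comprehension over `data_parts` would give the ten additional batches in one step.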

## SGD Incremental Learning with Additional Batches

In [None]:
# Define batches as DataFrames
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5), (X6, y6), (X7, y7), (X8, y8),
              (X9, y9), (X10, y10), (X11, y11), (X12, y12), (X13, y13), (X14, y14), (X15, y15)]

# Define class labels
classes = np.unique(np.concatenate([y for _, y in new_batches]))

# Initialize data storage for predictions and true labels
predictions = []
true_labels = []

# Seed the model with the first batch before the loop
# (batch 1 is therefore scored on data the model has already seen)
sgd_clf.partial_fit(new_batches[0][0], new_batches[0][1], classes=classes)

for i, (X_new, y_new) in enumerate(new_batches):
    # Predict on the current batch
    y_pred = sgd_clf.predict(X_new)
    accuracy = accuracy_score(y_new, y_pred)
    print(f"Accuracy for batch {i + 1}: {accuracy:.4f}") 
    
    # Store predictions and true labels for analysis
    predictions.append(y_pred)
    true_labels.append(y_new)

    # Update model with the current batch
    sgd_clf.partial_fit(X_new, y_new, classes=classes)

## SGD Retraining With Additional Batches

In [None]:
# Initialize data storage for predictions and true labels
predictions = []
true_labels = []

# Initialize training data
current_data_X, current_data_y = X_train.copy(), y_train.copy()

# Define your batches
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5), (X6, y6), (X7, y7), (X8, y8),
              (X9, y9), (X10, y10), (X11, y11), (X12, y12), (X13, y13), (X14, y14), (X15, y15)]

# Iterate through each batch
for i, (new_data_X, new_data_y) in enumerate(new_batches, start=1):
    # Retrain the model with the current and new batch of data
    sgd_clf, current_data_X, current_data_y = retrain_model(current_data_X, current_data_y, new_data_X, new_data_y)
    
    # Predict on the new batch and evaluate
    new_predictions = sgd_clf.predict(new_data_X)
    accuracy = accuracy_score(new_data_y, new_predictions)
    print(f"Retrained model accuracy on batch {i}: {accuracy:.4f}")
    
    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(new_predictions)
    true_labels.append(new_data_y)

## PAC Incremental Training with Additional Batches

In [None]:
# Define new data
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5), (X6, y6), (X7, y7), (X8, y8),
              (X9, y9), (X10, y10), (X11, y11), (X12, y12), (X13, y13), (X14, y14), (X15, y15)]

# Define classes from the training labels and every batch
classes = np.unique(np.concatenate([y_train] + [y for _, y in new_batches]))

predictions = []
true_labels = []

# Process each batch for prediction and subsequent training
for i, (X_new, y_new) in enumerate(new_batches):
    # Apply the model on the current batch to gather predictions
    current_predictions = pac.predict(X_new)
    current_accuracy = accuracy_score(y_new, current_predictions)
    print(f"Accuracy for batch {i + 1}: {current_accuracy:.4f}")

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(current_predictions)
    true_labels.append(y_new)
    
    # Use partial fit to update the model with the current batch
    pac.partial_fit(X_new, y_new, classes=classes)

## PAC With Retraining and Additional Batches

In [None]:
# Define classes and original training data
classes = np.unique(y_train)
current_X_train, current_y_train = X_train.copy(), y_train.copy()

# Define your batches
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5), (X6, y6), (X7, y7), (X8, y8),
              (X9, y9), (X10, y10), (X11, y11), (X12, y12), (X13, y13), (X14, y14), (X15, y15)]

predictions = []
true_labels = []

# Iterate through each batch
for i, (new_X, new_y) in enumerate(new_batches):
    # Retrain model on the current training data
    pac.fit(current_X_train, current_y_train)

    # Predict on the current batch and evaluate
    y_pred = pac.predict(new_X)
    accuracy = accuracy_score(new_y, y_pred)
    print(f'Accuracy after retraining with batch {i+1}: {accuracy:.4f}')

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(y_pred)
    true_labels.append(new_y)

    # Update the current training dataset with the current batch for the next iteration
    current_X_train = pd.concat([current_X_train, new_X], ignore_index=True)
    current_y_train = pd.concat([current_y_train, new_y], ignore_index=True)

# Artificial Neural Network

In [None]:
# Initialize the ANN classifier (Validation Loss, 1% point = 93%)
ANNClf = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    learning_rate_init=0.001,
    max_iter=500,
    random_state=42,
)

In [None]:
# scale features
#scaler = StandardScaler()
#X_trainANN = scaler.fit_transform(X_train)
#X_testANN = scaler.transform(X_test)

In [None]:
# Train the model
ANNClf.fit(X_train, y_train)

In [None]:
# Make predictions on test data and evaluate Accuracy
YPred5 = ANNClf.predict(X_test)
ACC5 = accuracy_score(y_test, YPred5)
print("Artificial Neural Network (M5): Accuracy Score:", ACC5)
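
The feature-scaling cell above is commented out, so the network is trained on unscaled features. If scaling were reinstated, one option is to bundle the scaler and the classifier in a pipeline so the scaler is fitted on the training rows only and reused unchanged at prediction time. The sketch below uses synthetic data from `make_classification` as a stand-in. The names `scaled_ann` and `demo_accuracy` are illustrative, and no claim is made about how scaling would affect accuracy on the letter data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the letter dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training split, then trains the network on scaled data
scaled_ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42),
)
scaled_ann.fit(X_tr, y_tr)

# score() applies the already-fitted scaler before predicting
demo_accuracy = scaled_ann.score(X_te, y_te)
print(f"Demo accuracy: {demo_accuracy:.4f}")
```

A pipeline does not expose `partial_fit`, so the incremental experiments would still need to call `scaler.transform` on each batch by hand, as the commented-out cells in the next section suggest.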

## ANN with Incremental Training

In [None]:
#X1S = scaler.transform(X1)
#X2S = scaler.transform(X2)
#X3S = scaler.transform(X3)
#X4S = scaler.transform(X4)
#X5S = scaler.transform(X5)

In [None]:
# define batches and loop through each batch creating predictions
#new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

#for i, (X_new, y_new) in enumerate(new_batches, 1):
    # Update the model with the new batch
#    ANNClf.partial_fit(X_new, y_new, classes=np.unique(y))

    # Evaluate the updated model on the new batch
  #  new_predictions = ANNClf.predict(X_new)
 #   new_accuracy = accuracy_score(y_new, new_predictions)
 #   print(f"Batch {i} accuracy: {new_accuracy}")

In [None]:
# Produce Classification Report
print("Artificial Neural Network (M5): Classification Report:")
print(classification_report(y_test, YPred5))

In [None]:
# Define new data
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

# Define classes from the training labels and every batch
classes = np.unique(np.concatenate([y_train] + [y for _, y in new_batches]))

predictions = []
true_labels = []

# Process each batch for prediction and subsequent training
for i, (X_new, y_new) in enumerate(new_batches):
    # Apply the model on the current batch to gather predictions
    current_predictions = ANNClf.predict(X_new)
    current_accuracy = accuracy_score(y_new, current_predictions)
    print(f"Accuracy for batch {i + 1}: {current_accuracy:.4f}")

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(current_predictions)
    true_labels.append(y_new)
    
    # Use partial fit to update the model with the current batch
    ANNClf.partial_fit(X_new, y_new, classes=classes)

In [None]:
def plot_confusion_matrices(predictions, true_labels, classes):
    for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
        cm = confusion_matrix(y_true, y_pred, labels=classes)
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', xticklabels=classes, yticklabels=classes)
        plt.title(f'ANN Incremental Training for Batch {i}')
        plt.xlabel('Predicted Labels')
        plt.ylabel('True Labels')
        plt.show()

# Call the function with the stored predictions and labels
plot_confusion_matrices(predictions, true_labels, classes)

## ANN with Retraining After Each Batch

In [None]:
# Define classes and original training data
classes = np.unique(y_train)
current_X_train, current_y_train = X_train.copy(), y_train.copy()

# Define batches
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5)]

predictions = []
true_labels = []

# Iterate through each batch
for i, (new_X, new_y) in enumerate(new_batches):
    # Retrain model on the current training data
    ANNClf.fit(current_X_train, current_y_train)

    # Predict on the current batch and evaluate
    y_pred = ANNClf.predict(new_X)
    accuracy = accuracy_score(new_y, y_pred)
    print(f'Accuracy after retraining with batch {i+1}: {accuracy:.4f}')

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(y_pred)
    true_labels.append(new_y)

    # Update the current training dataset with the current batch for the next iteration
    current_X_train = pd.concat([current_X_train, new_X], ignore_index=True)
    current_y_train = pd.concat([current_y_train, new_y], ignore_index=True)


In [None]:
def plot_confusion_matrices(predictions, true_labels, classes):
    for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
        cm = confusion_matrix(y_true, y_pred, labels=classes)
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
        plt.title(f'ANN with Retraining for Batch {i}')
        plt.xlabel('Predicted Labels')
        plt.ylabel('True Labels')
        plt.show()
        
# Call the function to plot confusion matrices after processing all batches
plot_confusion_matrices(predictions, true_labels, classes)

## ANN Incremental Training with Additional Batches

In [None]:
# Define new data
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5), (X6, y6), (X7, y7), (X8, y8),
              (X9, y9), (X10, y10), (X11, y11), (X12, y12), (X13, y13), (X14, y14), (X15, y15)]

# Define classes from the training labels and every batch
classes = np.unique(np.concatenate([y_train] + [y for _, y in new_batches]))

predictions = []
true_labels = []

# Process each batch for prediction and subsequent training
for i, (X_new, y_new) in enumerate(new_batches):
    # Apply the model on the current batch to gather predictions
    current_predictions = ANNClf.predict(X_new)
    current_accuracy = accuracy_score(y_new, current_predictions)
    print(f"Accuracy for batch {i + 1}: {current_accuracy:.4f}")

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(current_predictions)
    true_labels.append(y_new)
    
    # Use partial fit to update the model with the current batch
    ANNClf.partial_fit(X_new, y_new, classes=classes)

## ANN With Retraining and Additional Batches

In [None]:
# Define classes and original training data
classes = np.unique(y_train)
current_X_train, current_y_train = X_train.copy(), y_train.copy()

# Define your batches
new_batches = [(X1, y1), (X2, y2), (X3, y3), (X4, y4), (X5, y5), (X6, y6), (X7, y7), (X8, y8),
              (X9, y9), (X10, y10), (X11, y11), (X12, y12), (X13, y13), (X14, y14), (X15, y15)]

predictions = []
true_labels = []

# Iterate through each batch
for i, (new_X, new_y) in enumerate(new_batches):
    # Retrain model on the current training data
    ANNClf.fit(current_X_train, current_y_train)

    # Predict on the current batch and evaluate
    y_pred = ANNClf.predict(new_X)
    accuracy = accuracy_score(new_y, y_pred)
    print(f'Accuracy after retraining with batch {i+1}: {accuracy:.4f}')

    # Store predictions and actual labels for later confusion matrix analysis
    predictions.append(y_pred)
    true_labels.append(new_y)

    # Update the current training dataset with the current batch for the next iteration
    current_X_train = pd.concat([current_X_train, new_X], ignore_index=True)
    current_y_train = pd.concat([current_y_train, new_y], ignore_index=True)

In [None]:
def plot_confusion_matrices(predictions, true_labels, classes):
    for i, (y_pred, y_true) in enumerate(zip(predictions, true_labels), start=1):
        cm = confusion_matrix(y_true, y_pred, labels=classes)
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
        plt.title(f'ANN with Retraining for Batch {i}')
        plt.xlabel('Predicted Labels')
        plt.ylabel('True Labels')
        plt.show()
        
# Call the function to plot confusion matrices after processing all batches
plot_confusion_matrices(predictions, true_labels, classes)

# PyTorch Neural Network

In [100]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import OneHotEncoder
from torch.utils.data import DataLoader, TensorDataset

In [146]:
X.info()
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x-box   20000 non-null  int64
 1   y-box   20000 non-null  int64
 2   width   20000 non-null  int64
 3   high    20000 non-null  int64
 4   onpix   20000 non-null  int64
 5   x-bar   20000 non-null  int64
 6   y-bar   20000 non-null  int64
 7   x2bar   20000 non-null  int64
 8   y2bar   20000 non-null  int64
 9   xybar   20000 non-null  int64
 10  x2ybr   20000 non-null  int64
 11  xy2br   20000 non-null  int64
 12  x-ege   20000 non-null  int64
 13  xegvy   20000 non-null  int64
 14  y-ege   20000 non-null  int64
 15  yegvx   20000 non-null  int64
dtypes: int64(16)
memory usage: 2.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   lettr   20000 non-null  object
dtypes

In [152]:
# Convert the 'lettr' column of the DataFrame y to a categorical data type
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
y = y.copy()
y['lettr'] = pd.Categorical(y['lettr'])


In [None]:
y.info()

In [102]:
# fetch dataset from UCI
from ucimlrepo import fetch_ucirepo

LRec = fetch_ucirepo(id=59)

In [122]:
# data (as pandas dataframes)
X = LRec.data.features
y = LRec.data.targets   # DataFrame with a single 'lettr' column; DataFrames have no reshape()
Y = y["lettr"]          # target as a Series
X.info()
y.info()
Y.info()

In [None]:
#X = LRec.data                   # Feature variables
#Y = LRec.target.reshape(-1, 1)  # Target variable reshaped
#print("Feature names:", Iris.feature_names)
#print("Target names:", Iris.target_names)
#print("First 5 samples:")
#for i in range(5):
#    print(f"Sample {i+1}: {X[i]} (Class: {Y[i]}, Species: {Iris.target_names[Y[i]]})")

In [96]:
# Convert Pandas DF to Numpy


In [116]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# torch.tensor cannot convert object-dtype letters, so map 'A'-'Z' to class indices 0-25
letter_codes = {chr(ord('A') + k): k for k in range(26)}

XTrain = torch.tensor(X_train.values).float()
yTrain = torch.tensor(y_train.map(letter_codes).astype('int64').values).long()
XTest = torch.tensor(X_test.values).float()
yTest = torch.tensor(y_test.map(letter_codes).astype('int64').values).long()

In [86]:


X_train, X_test, y_train, y_test = train_test_split(X1, Y1, test_size=0.2, random_state=42)

# Assumes X1 is a numeric DataFrame and Y1 a Series of letters 'A'-'Z';
# letters are mapped to class indices 0-25 because torch.tensor rejects object dtype
letter_codes = {chr(ord('A') + k): k for k in range(26)}
X_train, X_test = torch.tensor(X_train.to_numpy()).float(), torch.tensor(X_test.to_numpy()).float()
y_train, y_test = (torch.tensor(s.map(letter_codes).astype('int64').to_numpy()) for s in (y_train, y_test))

for name, tensor in [("X_train", X_train), ("X_test", X_test), ("y_train", y_train), ("y_test", y_test)]:
    print(f"{name} shape: {tensor.shape}")

In [64]:
# Define a neural network with 3 hidden layers
class Net(nn.Module):
    def __init__(self, n_features=16, n_classes=26):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 8)  # First hidden layer
        self.fc2 = nn.Linear(8, 5)  # Second hidden layer
        self.fc3 = nn.Linear(5, 3)  # Third hidden layer
        self.output_layer = nn.Linear(3, n_classes)  # Output layer: one logit per letter

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.output_layer(x)
        return x

In [66]:
# Instantiate the model, sizing the input layer to the number of feature columns
model = Net(n_features=X_train.shape[1])

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Lists to store training history
TrnLoss = []
TrnAccu = []

# Train the model (full-batch updates on the training tensors)
num_epochs = 500
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(X_train)
    Loss = criterion(outputs, y_train)
    Loss.backward()
    optimizer.step()

    # Calculate accuracy
    _, predicted = torch.max(outputs, 1)
    ACC = (predicted == y_train).sum().item() / len(y_train)
    TrnLoss.append(Loss.item())
    TrnAccu.append(ACC)

    print(f"Epoch [{epoch + 1}/{num_epochs}], Training Accuracy: {ACC:.4f}, Loss: {Loss.item():.4f}")

In [47]:
from sklearn.preprocessing import OneHotEncoder

# Create an instance of the OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the y DataFrame
y_encoded = encoder.fit_transform(y.values.reshape(-1, 1)).toarray()
y.info()

<class 'pandas.core.series.Series'>
Index: 10000 entries, 9434 to 15795
Series name: lettr
Non-Null Count  Dtype 
--------------  ----- 
10000 non-null  object
dtypes: object(1)
memory usage: 156.2+ KB
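
One-hot encoding is one way to make the letter labels numeric. `nn.CrossEntropyLoss`, however, is most commonly given integer class indices as targets, so a label-to-index mapping may be the simpler route for the PyTorch models in this section. The sketch below shows such a mapping with scikit-learn's `LabelEncoder` on a few hypothetical letters. The names `demo_letters` and `encoder_demo` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of letter labels standing in for the 'lettr' column
demo_letters = np.array(['C', 'A', 'B', 'A'])

# Each distinct letter receives an integer index in sorted order
encoder_demo = LabelEncoder()
codes = encoder_demo.fit_transform(demo_letters)

# The mapping is reversible, so predictions can be turned back into letters
restored = encoder_demo.inverse_transform(codes)

print(codes.tolist())
print(restored.tolist())
```

The resulting integer array is numeric, so it can be passed to `torch.tensor(...)` without the object-dtype conversion error shown earlier in this section.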


In [None]:
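# Convert Train and Test sets to numpy arrays and then to PyTorch Tensor
X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()

# Map the letter labels 'A'-'Z' to class indices 0-25; torch.tensor cannot convert object-dtype strings
letter_codes = {chr(ord('A') + k): k for k in range(26)}
y_train_np = y_train.map(letter_codes).astype('int64').to_numpy()
y_test_np = y_test.map(letter_codes).astype('int64').to_numpy()

X_train_tensor = torch.tensor(X_train_np).float()
y_train_tensor = torch.tensor(y_train_np).long()
X_test_tensor = torch.tensor(X_test_np).float()
y_test_tensor = torch.tensor(y_test_np).long()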

In [None]:
# Convert Train and Test sets to numpy arrays
#X_train_np = X_train.to_numpy()
#y_train_np = y_train.to_numpy()
#X_test_np = X_test.to_numpy()
#y_test_np = y_test.to_numpy()

# Convert numpy arrays to PyTorch Tensors
#X_train_tensor = torch.tensor(X_train_np).float()
#y_train_tensor = torch.tensor(y_train_np).long()
#X_test_tensor = torch.tensor(X_test_np).float()
#y_test_tensor = torch.tensor(y_test_np).long()

In [None]:
# Change Train and Test sets to Pytorch Tensor
#X_train, y_train = torch.tensor(X_train).float(), torch.tensor(y_train).long()
#X_test, y_test = torch.tensor(X_test).float(), torch.tensor(y_test).long()

In [None]:
# Define a neural network with 3 hidden layers
class Net(nn.Module):
    def __init__(self, n_features=16, n_classes=26):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 8)  # First hidden layer
        self.fc2 = nn.Linear(8, 5)  # Second hidden layer
        self.fc3 = nn.Linear(5, 3)  # Third hidden layer
        self.output_layer = nn.Linear(3, n_classes)  # Output layer: one logit per letter

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.output_layer(x)
        return x

In [None]:
# Instantiate the model, sizing the input layer to the number of feature columns
model = Net(n_features=X_train_tensor.shape[1])

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Integer class targets for the training rows (letters 'A'-'Z' mapped to 0-25)
y_train_idx = torch.tensor([ord(letter) - ord('A') for letter in y_train]).long()

# Lists to store training history
TrnLoss = []
TrnAccu = []

# Train the model on the training tensors (inputs and targets must cover the same rows)
num_epochs = 10
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(X_train_tensor)
    Loss = criterion(outputs, y_train_idx)
    Loss.backward()
    optimizer.step()

    # Calculate accuracy
    _, predicted = torch.max(outputs, 1)
    ACC = (predicted == y_train_idx).sum().item() / len(y_train_idx)
    TrnLoss.append(Loss.item())
    TrnAccu.append(ACC)

    print(f"Epoch [{epoch + 1}/{num_epochs}], Training Accuracy: {ACC:.4f}, Loss: {Loss.item():.4f}")

# Convolutional Neural Network

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

In [None]:
# Define the CNN architecture
# Assumption: each sample's 16 features are arranged as a 1-channel 4x4 grid so Conv2d can be applied
class CNN(nn.Module):
    def __init__(self, num_classes=26):
        super(CNN, self).__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)  # 4x4 -> 2x2
        
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 -> 1x1
        
        # Fully connected layers (32 channels of size 1x1 remain after pooling)
        self.fc1 = nn.Linear(in_features=32*1*1, out_features=128)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(in_features=128, out_features=num_classes)  # one logit per letter
        
    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        
        # Flatten to (batch_size, 32) for the fully connected layers
        x = x.view(x.size(0), -1)
        
        x = self.fc1(x)
        x = self.relu3(x)
        x = self.fc2(x)
        
        return x

# Create an instance of the CNN
model = CNN()

In [None]:
# Define the criterion for calculating the loss, and define the optimizer for updating the model parameters
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [None]:
# Convert X_train and y_train to tensors
# Keep each feature column once (16 features) and reshape every row to (1, 4, 4)
features = X_train.loc[:, ~X_train.columns.duplicated()]
X_train_tensor = torch.tensor(features.to_numpy()).float().view(-1, 1, 4, 4)
# Map the letters 'A'-'Z' to class indices 0-25
y_train_tensor = torch.tensor([ord(letter) - ord('A') for letter in y_train]).long()

# Create a TensorDataset
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)

# Define batch size and create a DataLoader
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Define the optimizer and loss function (cross-entropy, since the targets are class indices)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

num_epochs = 10 

for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

    # Report the loss of the final mini-batch in this epoch
    print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {loss.item():.4f}")

# Save the trained model (optional)
torch.save(model.state_dict(), "trained_model.pth")