<a href="https://colab.research.google.com/github/MrHidr/AUT.CDM.matrix-accelerated-fpm/blob/main/Main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> **Project information**  =>  **SN**: *404112060*  |  **SName**: *Mahdi Heydari*  |  CDM Course | Project 1 | Oct 2025

# A Survey of Matrix Monitoring Methods for Accelerating Frequent Pattern Mining Algorithms | First project of Computational Data Mining Course

## Summary
[[will be added]]

# Step1: Fetching & Cleaning Data

### Installing Required Packages
ucimlrepo : fetches the dataset from UCI (UC Irvine Machine Learning Repository)

pandas : DataFrame handling


In [1]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [2]:
#importing packages
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
import time

### Fetching Online Retail Dataset from UCI
Documentation of UCI repo : https://github.com/uci-ml-repo/ucimlrepo

In [3]:
online_retail = fetch_ucirepo(id=352)

In [4]:
# data (as pandas dataframes)
X = online_retail.data.features
I = online_retail.data.ids
df = pd.concat([I, X], axis=1)
print("fetched dataset 5 rows as sample:\n", df.head(5))
print("\n","="*30,"\n")
df.info()
print("dataset info:\n", df.describe())

fetched dataset 5 rows as sample:
   InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

      InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/2010 8:26       2.55     17850.0  United Kingdom  
1  12/1/2010 8:26       3.39     17850.0  United Kingdom  
2  12/1/2010 8:26       2.75     17850.0  United Kingdom  
3  12/1/2010 8:26       3.39     17850.0  United Kingdom  
4  12/1/2010 8:26       3.39     17850.0  United Kingdom  


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       -----

In [5]:
initial_rows = df.shape[0]
print("Initial row count before any cleaning:", initial_rows)

Initial row count before any cleaning: 541909


In [6]:
# Filter for rows whose Description is exactly "returned"
returned_rows = df[df['Description'] == "returned"]

# Display the 'Description' and other useful columns for these rows
print(returned_rows[['Quantity', 'Description', 'UnitPrice', 'CustomerID']].head(77))

        Quantity Description  UnitPrice  CustomerID
166646         2    returned        0.0         NaN
166647         2    returned        0.0         NaN


### Cleaning Data; Removing Incomplete Records with Missing columns
According to the dataset info, some rows have missing data in the "Description" and/or "CustomerID" columns. As the first part of cleaning, we delete these rows.

In [7]:
df_cleaned = df.dropna(subset=['CustomerID', 'Description'])
rows_removed = initial_rows - df_cleaned.shape[0]

In [8]:
output_rows_count = df_cleaned.shape[0]
print("row count after removing rows: ", output_rows_count)
print("removed row count: ", initial_rows - output_rows_count)

row count after removing rows:  406829
removed row count:  135080


The dataset info shows 406829 rows with a non-null CustomerID, which equals the row count after removing rows with no CustomerID and/or no Description. As a result, every row with no Description also has no CustomerID.
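The relationship between missing Description and missing CustomerID can be checked directly rather than inferred from row counts. The sketch below uses a small invented table (`toy` and its values are assumptions, not rows from the dataset). It counts rows that lack a Description but still have a CustomerID, then compares dropping on CustomerID alone with dropping on both columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail data (values are invented for illustration)
toy = pd.DataFrame({
    "Description": ["MUG", None, "LANTERN", None],
    "CustomerID": [17850.0, np.nan, np.nan, np.nan],
})

missing_desc = toy["Description"].isna()
missing_cust = toy["CustomerID"].isna()

# Rows that lack a Description but still carry a CustomerID
desc_only_missing = int((missing_desc & ~missing_cust).sum())

# Row counts after dropping on CustomerID alone vs. on both columns
rows_after_cust = len(toy.dropna(subset=["CustomerID"]))
rows_after_both = len(toy.dropna(subset=["CustomerID", "Description"]))

print(desc_only_missing, rows_after_cust, rows_after_both)
```

If `desc_only_missing` is 0, the two row counts match, which is the situation described for the real data. The same three expressions can be run on `df` in place of `toy`.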

### Cleaning Data; Negative Values (& Returned Invoices)
The minimum values of 'Quantity' and 'UnitPrice' are -80995.000000 and -11062.060000, so the dataset contains rows with negative quantity or price, which makes these rows invalid.
As mentioned in the project brief, returned invoices should be removed too; this is covered by removing rows with negative "Quantity". We also considered deleting the original purchases that correspond to each negative quantity. However, to avoid losing data from the selected baskets, and not being financial professionals, we remove only the negative rows.
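As a rough illustration of why filtering on negative quantity also covers returns, the sketch below applies the same two conditions to an invented table. It assumes, following the UCI description of this dataset, that an InvoiceNo beginning with "C" marks a cancellation. The `toy` rows are hypothetical.

```python
import pandas as pd

# Invented rows: one cancellation line ("C" prefix) with a negative quantity
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379", "536380"],
    "StockCode": ["85123A", "71053", "85123A", "22632"],
    "Quantity": [6, 6, -1, 5],
    "UnitPrice": [2.55, 3.39, 2.55, 1.85],
})

# Same conditions as the cleaning step: positive quantity, non-negative price
kept = toy[(toy["Quantity"] > 0) & (toy["UnitPrice"] >= 0)]

# Count cancellation invoices that survive the filter
remaining_cancellations = int(
    kept["InvoiceNo"].astype(str).str.startswith("C").sum()
)

print(len(kept), remaining_cancellations)
```

Running the last expression on the real `df_cleaned` would show whether any cancellation lines carry a positive quantity and therefore remain after the filter.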

In [9]:
df_cleaned = df_cleaned[df_cleaned['Quantity'] > 0]
df_cleaned = df_cleaned[df_cleaned['UnitPrice'] >= 0]

In [10]:
cleaned_rows_count = df_cleaned.shape[0]
print("row count after removing rows with neg values: ", cleaned_rows_count)
print("removed row count (Neg Price and/or Neg Quantity): ", output_rows_count - cleaned_rows_count)

row count after removing rows with neg values:  397924
removed row count (Neg Price and/or Neg Quantity):  8905


# Step2: Item-Transaction Matrix

## Count Matrix
(rows = 'InvoiceNo', columns = 'StockCode', values = number of invoice lines for that StockCode in that invoice)

In [11]:
count_matrix = df_cleaned.pivot_table(index='InvoiceNo',
                                      columns='StockCode',
                                      aggfunc='size' ,
                                      fill_value=0)

In [12]:
print(count_matrix.head(10))
print("="*30)
print("number of unique InvoiceNo: ", count_matrix.shape[0] ,"\nnumber of unique StockCode", count_matrix.shape[1])

StockCode  10002  10080  10120  10123C  10124A  10124G  10125  10133  10135  \
InvoiceNo                                                                     
536365         0      0      0       0       0       0      0      0      0   
536366         0      0      0       0       0       0      0      0      0   
536367         0      0      0       0       0       0      0      0      0   
536368         0      0      0       0       0       0      0      0      0   
536369         0      0      0       0       0       0      0      0      0   
536370         1      0      0       0       0       0      0      0      0   
536371         0      0      0       0       0       0      0      0      0   
536372         0      0      0       0       0       0      0      0      0   
536373         0      0      0       0       0       0      0      0      0   
536374         0      0      0       0       0       0      0      0      0   

StockCode  11001  ...  90214V  90214W  90214Y  9021

## Binary Matrix

In [13]:
binary_matrix = (count_matrix > 0).astype(int)
print(binary_matrix.head(10))

StockCode  10002  10080  10120  10123C  10124A  10124G  10125  10133  10135  \
InvoiceNo                                                                     
536365         0      0      0       0       0       0      0      0      0   
536366         0      0      0       0       0       0      0      0      0   
536367         0      0      0       0       0       0      0      0      0   
536368         0      0      0       0       0       0      0      0      0   
536369         0      0      0       0       0       0      0      0      0   
536370         1      0      0       0       0       0      0      0      0   
536371         0      0      0       0       0       0      0      0      0   
536372         0      0      0       0       0       0      0      0      0   
536373         0      0      0       0       0       0      0      0      0   
536374         0      0      0       0       0       0      0      0      0   

StockCode  11001  ...  90214V  90214W  90214Y  9021

As we see, each row stands for one InvoiceNo (basket) and each column shows whether that StockCode appears in the invoice.
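The step from count matrix to binary matrix, and the lookup of a single basket by its invoice number, can be sketched on a tiny invented table. The invoice ids `A1` to `A3` and items `X`, `Y`, `Z` are hypothetical.

```python
import pandas as pd

# Invented invoice lines; item X appears twice in invoice A1
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "A2", "A3", "A3"],
    "StockCode": ["X", "X", "Y", "Y", "X", "Z"],
    "Quantity": [1, 2, 1, 4, 1, 3],
})

# aggfunc='size' counts invoice lines per (InvoiceNo, StockCode) pair
count_toy = toy.pivot_table(index="InvoiceNo",
                            columns="StockCode",
                            aggfunc="size",
                            fill_value=0)

# Binary matrix: 1 if the item appears in the basket at all
binary_toy = (count_toy > 0).astype(int)

# Access one basket by its invoice id and list the items it contains
row_a1 = binary_toy.loc["A1"]
items_in_a1 = list(row_a1[row_a1 == 1].index)

print(count_toy)
print(items_in_a1)
```

Here `count_toy.loc["A1", "X"]` is 2 while `binary_toy.loc["A1", "X"]` is 1, which is the only difference between the two matrices.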

In [14]:
#accessing row data by InvoiceID

# Step3: Streaming Simulation

## Converting InvoiceDate to datetime objects

In [15]:
import pandas as pd
import numpy as np
print(df_cleaned['InvoiceDate'])
df_cleaned['InvoiceDate'] = pd.to_datetime(df_cleaned['InvoiceDate'])
print(df_cleaned['InvoiceDate'])


0          12/1/2010 8:26
1          12/1/2010 8:26
2          12/1/2010 8:26
3          12/1/2010 8:26
4          12/1/2010 8:26
               ...       
541904    12/9/2011 12:50
541905    12/9/2011 12:50
541906    12/9/2011 12:50
541907    12/9/2011 12:50
541908    12/9/2011 12:50
Name: InvoiceDate, Length: 397924, dtype: object
0        2010-12-01 08:26:00
1        2010-12-01 08:26:00
2        2010-12-01 08:26:00
3        2010-12-01 08:26:00
4        2010-12-01 08:26:00
                 ...        
541904   2011-12-09 12:50:00
541905   2011-12-09 12:50:00
541906   2011-12-09 12:50:00
541907   2011-12-09 12:50:00
541908   2011-12-09 12:50:00
Name: InvoiceDate, Length: 397924, dtype: datetime64[ns]


## Dividing into Batches

In [16]:

# 1. Order invoices chronologically by their earliest InvoiceDate
sorted_invoice_index = df_cleaned.groupby('InvoiceNo')['InvoiceDate'].min().sort_values().index

# 2. Re-order the main matrix based on the sorted invoice index
sorted_matrix = binary_matrix.loc[sorted_invoice_index]

# 3. Split the sorted matrix into 10 sequential batches
N_BATCHES = 10
data_stream_batches = np.array_split(sorted_matrix, N_BATCHES)

print(f"Data stream created with {len(data_stream_batches)} batches.")


Data stream created with 10 batches.
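The ordering and splitting logic can be sketched on a small invented example. As a variation on the cell above, this version splits the row positions and selects with `.iloc`, an assumption made here to avoid passing a DataFrame straight to `np.array_split`, which may emit a deprecation warning in recent pandas versions. All names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented invoice lines with dates (invoice A has two lines)
lines = pd.DataFrame({
    "InvoiceNo": ["B", "A", "C", "A", "D", "E"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2010-12-01", "2011-02-10",
        "2010-12-01", "2011-03-15", "2011-04-20",
    ]),
})

# Invented basket matrix whose rows are deliberately out of date order
matrix = pd.DataFrame(
    np.eye(5, dtype=int),
    index=["C", "A", "E", "B", "D"],
    columns=["i1", "i2", "i3", "i4", "i5"],
)

# 1. Order invoices by their earliest date
order = lines.groupby("InvoiceNo")["InvoiceDate"].min().sort_values().index

# 2. Re-order the matrix rows chronologically
sorted_toy = matrix.loc[order]

# 3. Split row positions into sequential chunks, then slice with iloc
n_batches = 2
position_chunks = np.array_split(np.arange(len(sorted_toy)), n_batches)
batches = [sorted_toy.iloc[chunk] for chunk in position_chunks]
batch_sizes = [len(b) for b in batches]

print(batch_sizes, list(batches[0].index))
```

When the row count is not divisible by the number of batches, `np.array_split` gives the earlier chunks one extra row, which matches the 1854/1853 split seen in the batch shapes.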


## Batches info


In [17]:
for i, batch_df in enumerate(data_stream_batches):

    # Get the list of InvoiceNo's for this batch (they are the index)
    batch_invoice_ids = batch_df.index

    # Filter df_cleaned to get all dates for just those invoices
    # .isin() is fast for this
    relevant_dates = df_cleaned[df_cleaned['InvoiceNo'].isin(batch_invoice_ids)]['InvoiceDate']

    print(f"\n--- Batch {i+1} ---")
    print(f"  Shape (Invoices, Items): {batch_df.shape}")
    print(f"  Start Date: {relevant_dates.min().strftime('%Y-%m-%d')}")
    print(f"  End Date:   {relevant_dates.max().strftime('%Y-%m-%d')}")


--- Batch 1 ---
  Shape (Invoices, Items): (1854, 3665)
  Start Date: 2010-12-01
  End Date:   2011-01-16

--- Batch 2 ---
  Shape (Invoices, Items): (1854, 3665)
  Start Date: 2011-01-16
  End Date:   2011-03-09

--- Batch 3 ---
  Shape (Invoices, Items): (1854, 3665)
  Start Date: 2011-03-09
  End Date:   2011-04-20

--- Batch 4 ---
  Shape (Invoices, Items): (1854, 3665)
  Start Date: 2011-04-20
  End Date:   2011-06-01

--- Batch 5 ---
  Shape (Invoices, Items): (1854, 3665)
  Start Date: 2011-06-01
  End Date:   2011-07-12

--- Batch 6 ---
  Shape (Invoices, Items): (1854, 3665)
  Start Date: 2011-07-12
  End Date:   2011-08-23

--- Batch 7 ---
  Shape (Invoices, Items): (1853, 3665)
  Start Date: 2011-08-24
  End Date:   2011-09-28

--- Batch 8 ---
  Shape (Invoices, Items): (1853, 3665)
  Start Date: 2011-09-28
  End Date:   2011-10-27

--- Batch 9 ---
  Shape (Invoices, Items): (1853, 3665)
  Start Date: 2011-10-27
  End Date:   2011-11-18

--- Batch 10 ---
  Shape (Invoices, 

# Step4: Matrix Sketching Algorithms
[[summary of Sketching Goals and reasons]]

## Gaussian Random Projection

In [21]:
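k = 100  # target sketch dimension
n, d = data_stream_batches[0].shape  # invoices in the first batch, number of items
print(n, d)
# Gaussian projection matrix with i.i.d. N(0, 1/k) entries
R = np.random.normal(0, 1.0/np.sqrt(k), (d, k))
print(R)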


1854 3665
[[ 0.22935182 -0.12962995 -0.04734972 ... -0.15989214 -0.07461096
  -0.04932404]
 [-0.0771588   0.19788638  0.0772105  ...  0.16387545 -0.05788329
   0.05666053]
 [ 0.00375485 -0.0758355   0.05552546 ... -0.08199971 -0.04247693
  -0.02314888]
 ...
 [-0.03818312 -0.00679523 -0.10780094 ... -0.01733498 -0.04592168
   0.13136703]
 [-0.26838785  0.08860183 -0.01706817 ...  0.0070579   0.06490605
  -0.07663125]
 [ 0.02486878  0.0133225  -0.16050816 ... -0.14855204 -0.21476544
   0.10249637]]
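With entries drawn from N(0, 1/k), the projection keeps the squared norm of each row unchanged in expectation. This is why the standard deviation is set to 1/sqrt(k). The sketch below checks this on invented sparse binary rows; the sizes, the seed, and the 5% density are assumptions chosen for illustration, not the notebook's values.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility

n, d, k = 50, 1000, 200  # toy sizes, smaller than the notebook's batches

# Sparse binary rows, loosely imitating baskets
A = (rng.random((n, d)) < 0.05).astype(float)

# Gaussian projection matrix with i.i.d. N(0, 1/k) entries
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))

# Sketch: each d-dimensional row becomes a k-dimensional row
B = A @ R

# Ratio of squared row norms after and before projection
orig_sq = np.sum(A ** 2, axis=1)
sketch_sq = np.sum(B ** 2, axis=1)
ratios = sketch_sq / orig_sq
mean_ratio = ratios.mean()

print(B.shape, mean_ratio)
```

The mean ratio should be close to 1. Individual rows scatter around 1, and that scatter shrinks as `k` grows.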


In [25]:
grp_batch_sketches = []
grp_time_records = []

In [32]:
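# Reset the records so that re-running this cell does not accumulate earlier runs
grp_batch_sketches = []
grp_time_records = []

start_total_time = time.time()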

for batch_df in data_stream_batches:

    A_batch = batch_df.values

    batch_start_time = time.time()

    B_sketch = A_batch @ R

    batch_end_time = time.time()

    grp_batch_sketches.append(B_sketch)
    grp_time_records.append(batch_end_time - batch_start_time)

end_total_time = time.time()

print("Whole process took:", end_total_time - start_total_time, "seconds")
print("Average time per batch:", np.mean(grp_time_records), "seconds")
print("max time per batch:", np.max(grp_time_records), "seconds")
print("min time per batch:", np.min(grp_time_records), "seconds")

Whole process took: 1.1718463897705078 seconds
Average time per batch: 0.12589367628097534 seconds
max time per batch: 0.23077917098999023 seconds
min time per batch: 0.08906221389770508 seconds
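Because every batch is multiplied by the same matrix `R`, stacking the per-batch sketches gives the same result as sketching the whole stream at once. This is what makes the batch-by-batch loop a valid streaming simulation. The sketch below illustrates it on invented data; the sizes and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, assumed for reproducibility

d, k = 300, 20  # toy dimensions
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))

# Three invented batches with different numbers of rows
batches = [(rng.random((n_rows, d)) < 0.1).astype(float)
           for n_rows in (4, 3, 5)]

# Sketch each batch separately, as in the streaming loop
batch_sketches = [A_batch @ R for A_batch in batches]
stacked_sketch = np.vstack(batch_sketches)

# Sketch the full matrix in one step
full_sketch = np.vstack(batches) @ R

same = np.allclose(stacked_sketch, full_sketch)
print(stacked_sketch.shape, same)
```

This equality holds only while `R` stays fixed across batches. Drawing a new `R` per batch would produce sketches that cannot be stacked in this way.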
