### The aim of this notebook is to build a training data pipeline that assumes limited (finite) computer memory in the face of ever-growing data.
Specifically, it implements batched loading of a dataset in preparation for training, in a way that scales to data far larger than memory.

## Outcomes
1. Compare pandas' batched loading functionality in the simple case of loading data from a file:
    - CSV
    - Parquet
2. Is there a difference between the two formats, and is it possible to load chunks of rows at a time?
3. Currently, chunking mechanisms are used to fetch data ***d*** sequentially from big data stores to solve the out-of-core memory problem in machine learning. The common approach is an iterable dataset in PyTorch that yields one chunk ***c*** at a time. This results in ***n*** = ***d/c*** network calls, which leaves us at the mercy of the network bandwidth. I think there is a better way to obtain chunked data that needs fewer than ***n*** network calls to the database. Can we prove that?
   - One way is to fetch a large chunk of data ***s*** that is x times larger than the batch size but still fits in memory, so that multiple batches can be extracted from it. When the model has nearly gone through all the data in ***s***, the next chunk is prefetched in parallel so that the model does not have to wait on the network call for its data.
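
The idea in outcome 3 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a finished implementation: `fetch_rows` is a hypothetical stand-in for one network call, `prefetched_batches` is a hypothetical name, and the sketch starts prefetching as soon as the buffer has room rather than when the current chunk is nearly exhausted.

```python
import math
import queue
import threading
from typing import Callable, Iterator, List


def fetch_rows(offset: int, limit: int, total: int) -> List[int]:
    """Hypothetical stand-in for one network call returning rows [offset, offset + limit)."""
    return list(range(offset, min(offset + limit, total)))


def prefetched_batches(
    total: int,
    super_chunk: int,
    batch_size: int,
    fetch: Callable[[int, int, int], List[int]] = fetch_rows,
) -> Iterator[List[int]]:
    """Yield batches cut from super-chunks while a background thread fetches ahead."""
    # maxsize=1: one fetched super-chunk may wait in the queue; the producer
    # can hold one more while it is blocked on put().
    buffer: queue.Queue = queue.Queue(maxsize=1)

    def producer() -> None:
        for offset in range(0, total, super_chunk):
            buffer.put(fetch(offset, super_chunk, total))
        buffer.put(None)  # sentinel: no more data

    threading.Thread(target=producer, daemon=True).start()

    while True:
        chunk = buffer.get()
        if chunk is None:
            break
        # Batches never cross a super-chunk boundary, so the last batch
        # of a super-chunk may be shorter than batch_size.
        for i in range(0, len(chunk), batch_size):
            yield chunk[i:i + batch_size]


# Count the "network calls" for d rows, chunk size c, and super-chunk size s.
d, c, s = 1_000_000, 10_000, 100_000
calls = []


def counting_fetch(offset: int, limit: int, total: int) -> List[int]:
    calls.append(offset)
    return fetch_rows(offset, limit, total)


n_batches = sum(1 for _ in prefetched_batches(d, s, c, counting_fetch))
print(n_batches, len(calls), math.ceil(d / c))  # 100 10 100
```

With these numbers the model still sees 100 batches, but only 10 calls are made instead of the 100 that one call per chunk would need. The cost is memory: up to three super-chunks can be alive at once (one being consumed, one queued, one held by the blocked producer), so ***s*** has to be sized with that in mind. The sketch has no error handling, so a failed fetch would leave the consumer waiting.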
        

In [1]:
import pandas as pd 
import torch
import time
import plotly.graph_objects as go
from scipy.stats import ttest_ind
from typing import Tuple
from tqdm import tqdm
import gc
from scipy.stats import gaussian_kde
import numpy as np

In [4]:
def compare_file_load_times(csv_path: str, pqt_path: str, iterations: int) -> Tuple[list, list]:
    """Time repeated loads of the same 1,000,000 rows from CSV and from Parquet."""
    csv_times = []
    pqt_times = []
    written = False  # the Parquet copy is written once, from the first CSV load

    # Time CSV loading
    print("-------Loading csv file------\n")
    for _ in tqdm(range(iterations)):
        start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        df = pd.read_csv(csv_path, nrows=1000000)
        csv_times.append(time.perf_counter() - start)
        if not written:
            # Write to pqt_path rather than a hard-coded name so the loop below reads this file
            df.to_parquet(pqt_path)
            written = True
        del df
        gc.collect()

    # Time Parquet loading
    print("-------Loading parquet file------\n")
    for _ in tqdm(range(iterations)):
        start = time.perf_counter()
        df = pd.read_parquet(pqt_path)
        pqt_times.append(time.perf_counter() - start)

        del df
        gc.collect()

    return csv_times, pqt_times

In [14]:
def plot_load_time_distributions(csv_times, pqt_times):
    # Generate KDE (smoothed distributions)
    csv_kde = gaussian_kde(csv_times)
    pqt_kde = gaussian_kde(pqt_times)

    x_vals = np.linspace(min(min(csv_times), min(pqt_times)),
                         max(max(csv_times), max(pqt_times)), 500)

    fig = go.Figure()

    fig.add_trace(go.Scatter(
        x=x_vals,
        y=csv_kde(x_vals),
        mode='lines',
        name='CSV Load Time',
        line=dict(color='blue'),
        fill='tozeroy',
        fillcolor='rgba(0,0,255,0.2)'
    ))

    fig.add_trace(go.Scatter(
        x=x_vals,
        y=pqt_kde(x_vals),
        mode='lines',
        name='Parquet Load Time',
        line=dict(color='green'),
        fill='tozeroy',
        fillcolor='rgba(0,128,0,0.2)'
    ))

    fig.update_layout(
        title='Load Time Distribution: CSV vs Parquet',
        xaxis_title='Time (seconds)',
        yaxis_title='Density',
        template='plotly_white',
        legend=dict(title='File Type')
    )

    fig.show()

In [6]:
def check_significance(csv_times, pqt_times):
    t_stat, p_value = ttest_ind(csv_times, pqt_times)
    print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")
    
    if p_value < 0.05:
        print("Result: Statistically significant difference in load times (p < 0.05).")
    else:
        print("Result: No statistically significant difference in load times (p ≥ 0.05).")

In [None]:
csv_times,pqt_times=compare_file_load_times("/kaggle/input/predict-closed-questions-on-stack-overflow/train.csv", "/kaggle/working/train.pqt", iterations=100)

In [15]:
plot_load_time_distributions(csv_times, pqt_times)
check_significance(csv_times, pqt_times)

T-statistic: 98.1910, P-value: 0.0000
Result: Statistically significant difference in load times (p < 0.05).


   1. The file format makes a statistically significant difference to load time, as shown above: the positive t-statistic (98.19) means the CSV loads took longer on average than the Parquet loads. The reason for this comparison will become clearer later, but for the curious: big data systems (warehouses, data lakes, lakehouses) commonly store raw or processed data as files, and to reduce load on SQL-based data warehouses, data scientists may choose to ingest these files as opposed to querying the database. The choice of file format therefore becomes really important, especially as the data scales up.

In [47]:
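class Model:
    """Stand-in for a real model: it only forces every item in the batch to be read."""

    def __call__(self, X):
        # Total length of the string representations of all items in the batch
        return len(''.join([str(x) for x in X]))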

In [53]:
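class ChunkedIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, filepath, chunk_size=100000, preprocess_fn=None):
        self.filepath = filepath
        self.chunk_size = chunk_size
        self.preprocess_fn = preprocess_fn  # optional per-item transform (was accepted but never stored)

    def __iter__(self):
        # Read the first 1,000,000 rows lazily, chunk_size rows at a time
        chunk_iter = pd.read_csv(self.filepath, nrows=1000000, chunksize=self.chunk_size)

        for chunk in chunk_iter:
            X = chunk["BodyMarkdown"].values

            for xi in X:
                yield self.preprocess_fn(xi) if self.preprocess_fn is not None else xi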
                
chunked_dataset = ChunkedIterableDataset("/kaggle/input/predict-closed-questions-on-stack-overflow/train.csv")
dataloader = torch.utils.data.DataLoader(chunked_dataset, batch_size=10000, num_workers=0)
model=Model()

In [45]:
1000000/10000

100.0

In [55]:
def train(dataset):
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=10000, num_workers=0)
    output = 0
    for X_batch in tqdm(dataloader):
        output += model(X_batch)  # uses the module-level `model` defined above
    return output
        
start = time.time()
train(chunked_dataset)
end=time.time()

print(end-start)

100it [00:22,  4.49it/s]

22.296972274780273



