<a href="https://colab.research.google.com/github/Superkart/Pandas_Provenance/blob/main/ProvenanceOnPandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Purpose
- This notebook initializes the setup for a Pandas Provenance Tracker.
- The goal is to manually track the provenance of data transformations performed on pandas DataFrames.
- Provenance tracking allows for transparency, reproducibility, and accountability in data workflows by maintaining a detailed log of operations and changes to data over time.
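
As a rough illustration of what tracking provenance manually can look like, the sketch below records one log entry per transformation. The names `provenance_log` and `record` are hypothetical and not part of this notebook's later code; the entry fields mirror the ones used further down (operation, details, shape, timestamp).

```python
from datetime import datetime

import pandas as pd

# Hypothetical in-memory log: one dict per operation.
provenance_log = []


def record(operation, details, df):
    """Append one provenance entry describing the current state of `df`."""
    provenance_log.append({
        "operation": operation,
        "details": details,
        "shape": df.shape,
        "timestamp": datetime.now(),
    })


# Build a small table and log its creation.
df = pd.DataFrame({
    "Name": ["White", "Black", "Red"],
    "HEX": ["#FFFFFF", "#000000", "#FF0000"],
})
record("CREATE", "built from an in-memory dict", df)

# Apply a transformation and log it by hand.
filtered = df[df["Name"] != "Black"]
record("FILTER", "dropped rows where Name == 'Black'", filtered)

# The log can be viewed as a table of its own.
log_table = pd.DataFrame(provenance_log)
print(log_table[["operation", "details", "shape"]])
```

Logging by hand like this is easy to forget, which is the motivation for wrapping pandas operations so that the logging happens automatically.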

In [1]:
import pandas as pd
from datetime import datetime
import os
from google.colab import drive

## Drive Mounting
Here we mount Google Drive so that we can read the input tables and, later on, store the provenance logs.

In [2]:
drive.mount('/content/drive', force_remount = True)
%cd /content/drive/My Drive/Pandas_Provenance

print(os.listdir('/content/drive/My Drive/Pandas_Provenance'))

Mounted at /content/drive
/content/drive/My Drive/Pandas_Provenance
['color_srgb.csv']


## Accessing Table From Drive
We will use this table to try out the pandas functions that we override with provenance tracking below.

In [3]:
colorTable = pd.read_csv('/content/drive/My Drive/Pandas_Provenance/color_srgb.csv')
colorTable

Unnamed: 0,Name,HEX,RGB
0,White,#FFFFFF,"rgb(100,100,100)"
1,Silver,#C0C0C0,"rgb(75,75,75)"
2,Gray,#808080,"rgb(50,50,50)"
3,Black,#000000,"rgb(0,0,0)"
4,Red,#FF0000,"rgb(100,0,0)"
5,Maroon,#800000,"rgb(50,0,0)"
6,Yellow,#FFFF00,"rgb(100,100,0)"
7,Olive,#808000,"rgb(50,50,0)"
8,Lime,#00FF00,"rgb(0,100,0)"
9,Green,#008000,"rgb(0,50,0)"
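
The `Unnamed: 0` column in the output above usually means the CSV was written with its index included, so the first header field is empty. A minimal sketch, assuming that is the case here, of reading such a file with `index_col=0` so the first column becomes the index instead. The two rows are copied from the table above and read from an in-memory string rather than from Drive.

```python
import io

import pandas as pd

# Two rows from the table above, with an empty first header field.
csv_text = (
    ',Name,HEX,RGB\n'
    '0,White,#FFFFFF,"rgb(100,100,100)"\n'
    '1,Silver,#C0C0C0,"rgb(75,75,75)"\n'
)

# Default read: the unnamed first column is kept as a regular column.
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```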


# Support Functions
This block of code contains the support functions used to build the provenance system.

**ProvenanceDataFrame**:
A class that inherits from pandas.DataFrame.

**provenance_logs**:
A class-level list intended to store every operation that takes place.

**ensure_prov_table**:
A method that checks whether the DataFrame already has a corresponding provenance table (`prov_table`) and, if not, creates one and logs the given operation.

**log_provenance**:
A method called from an overridden pandas function to log the operation, its details, the resulting shape, and a timestamp.

**get_provenance**:
A method that returns the provenance table of the DataFrame it is called on.

In [None]:
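class ProvenanceDataFrame(pd.DataFrame):

    # Register prov_table as subclass metadata so pandas treats it as a
    # plain attribute rather than a possible column.
    _metadata = ["prov_table"]

    # Class-level list, shared by every instance (not used by the methods below).
    provenance_logs = []

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.ensure_prov_table("CREATE", "Initial creation")


    @property
    def _constructor(self):
        # Make pandas operations return ProvenanceDataFrame, not DataFrame.
        return ProvenanceDataFrame


    def ensure_prov_table(self, operation, details=""):
        # Create the provenance table on first use and log the given operation.
        if getattr(self, "prov_table", None) is None:
            self.prov_table = pd.DataFrame(columns=["operation", "details", "shape", "timestamp"])
            self.log_provenance(operation, details, self.shape)


    def log_provenance(self, operation, details, shape):
        new_entry = {
            "operation": operation,
            "details": details,
            "shape": shape,
            "timestamp": datetime.now()
        }
        # DataFrame.append was removed in pandas 2.0, so build a one-row
        # frame and concatenate it instead.
        entry = pd.DataFrame([new_entry])
        if self.prov_table.empty:
            self.prov_table = entry
        else:
            self.prov_table = pd.concat([self.prov_table, entry], ignore_index=True)


    def get_provenance(self):
        # "RETRIEVE" is only logged if no provenance table exists yet.
        self.ensure_prov_table("RETRIEVE", "Retrieve provenance log")
        return self.prov_table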
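
The `_constructor` property is what lets a DataFrame subclass survive pandas operations, and pandas documents `_metadata` as the way to declare extra attributes that should be carried over to results. The sketch below isolates those two mechanisms on a deliberately tiny, hypothetical class (`TaggedFrame` with a `tag` attribute), separate from the provenance class above.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical minimal subclass, used only to illustrate the mechanism."""

    # Attributes listed here are copied to the results of pandas operations.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        # pandas calls this to build results, so they keep the subclass type.
        return TaggedFrame


tf = TaggedFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
tf.tag = "source-A"

# Selecting columns returns a new object built through _constructor.
subset = tf[["a", "b"]]

print(type(subset).__name__)
print(subset.tag)
```

Without `_constructor`, the result of the column selection would be a plain `pandas.DataFrame`, and any custom attribute would be lost along the way.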



# Overriding read_csv()
In this code block we override our first pandas function: the read_csv() method.

We call the original read_csv() method and then log the provenance of the result.

We also attach a new provenance log to each DataFrame that read_csv() creates.

In [None]:
def read_csv_with_provenance(*args, **kwargs):
    print("Reading CSV with provenance tracking...")
    df = pd.read_csv(*args, **kwargs)

    # Attach a fresh provenance log to the newly created DataFrame.
    df.provenance_logs = []
    df.provenance_logs.append({
        "operation": "CREATE DF",
        "shape": df.shape,
        # `datetime` is the class imported via `from datetime import datetime`.
        "timestamp": datetime.now(),
        "source_id": id(df)
    })
    return df
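
For trying the idea out without Drive access, here is a hedged, self-contained variant. It assumes that storing the log in `DataFrame.attrs` (a dictionary pandas provides for user metadata and documents as experimental) is acceptable. The function name `read_csv_logged` and the sample CSV are hypothetical and only serve the illustration.

```python
import io
from datetime import datetime

import pandas as pd


def read_csv_logged(*args, **kwargs):
    """Hypothetical variant: read a CSV and keep the log in df.attrs."""
    df = pd.read_csv(*args, **kwargs)
    df.attrs["provenance_logs"] = [{
        "operation": "CREATE DF",
        "shape": df.shape,
        "timestamp": datetime.now(),
    }]
    return df


# Read from an in-memory string instead of a file on Drive.
sample_csv = "Name,HEX\nWhite,#FFFFFF\nBlack,#000000\n"
colors = read_csv_logged(io.StringIO(sample_csv))

first_entry = colors.attrs["provenance_logs"][0]
print(first_entry["operation"], first_entry["shape"])
```

Keeping the log in `attrs` avoids assigning a new list-valued attribute directly on the DataFrame, which pandas flags with a warning because it can be mistaken for an attempt to create a column.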