In [1]:
import os
import sqlite3
import numpy as np
import pandas as pd

Our prototype is a storage system, `ExtendedObjectStorage`, that serves as a "catalog" for object storage (AWS S3, IBM COS, etc.) using SQLite. It includes all basic file operations, but its main purpose is to support <u>artificial renaming</u> in object storage, as discussed further in our paper *Spark and Object Storage*.

In [2]:
class ExtendedObjectStorage():
    def __init__(self, database):
        self.conn = sqlite3.connect(database)
        self.c = self.conn.cursor()
    
    def __check_directory_exist(self, directory_name):
        '''Check if directory exists'''

        return self.conn.cursor().execute(f''' SELECT count(name) FROM sqlite_master WHERE type='table' AND name='{directory_name}' ''').fetchall()[0][0]

    def create_object(self, directory, filename, data=''):
        assert self.__check_directory_exist(directory), f"Directory '{directory}' does not exist"
        try:
            # Bind values as parameters so quotes in filename or data cannot break the statement
            self.conn.execute(
                f'INSERT INTO {directory} (directory, filename, data) VALUES (?, ?, ?)',
                (f'{directory}/', filename, data)
            )
            self.conn.commit()
        except sqlite3.IntegrityError:
            raise sqlite3.IntegrityError(f"Another file named '{filename}' exists under directory '{directory}'")
        
    
    def get_object(self, directory, filename):
        assert self.__check_directory_exist(directory), f"Directory '{directory}' does not exist"
        try:
            # .iloc[0] raises IndexError when the query returns no rows
            return pd.read_sql(f'SELECT data FROM {directory} WHERE filename = ?',
                               con=self.conn, params=(filename,))['data'].iloc[0]
        except IndexError:
            raise ValueError(f"No file named '{filename}' exists under directory '{directory}'")
    
    def delete_object(self, directory, filename):
        self.get_object(directory, filename)  # check if file exists
        self.c.execute(f'''DELETE FROM {directory}
                           WHERE filename = "{filename}"''')
        self.conn.commit()
    
    def create_directory(self, directory_name):
        assert not self.__check_directory_exist(directory_name), f"Directory '{directory_name}' already exists"
    
        self.c.execute(f'''CREATE TABLE {directory_name}(
        directory VARCHAR(120), 
        filename VARCHAR(120) PRIMARY KEY,
        last_modified DATETIME DEFAULT CURRENT_TIMESTAMP,
        data BLOB
        )''')
        self.conn.commit()
    
    def delete_directory(self, directory_name):
        assert self.__check_directory_exist(directory_name), f"Directory '{directory_name}' does not exist"
        self.c.execute(f'''DROP TABLE {directory_name}''')
        self.conn.commit()
    
    def list_directory(self, directory_name):
        assert self.__check_directory_exist(directory_name), f"Directory '{directory_name}' does not exist"
        return pd.read_sql(f'SELECT * FROM {directory_name}', con=self.conn)
    
    def rename_directory(self, directory_name, new_name):
        assert self.__check_directory_exist(directory_name), f"Directory '{directory_name}' does not exist"
        # Check if another directory under new_name exists
        assert not self.__check_directory_exist(new_name), f"Directory '{new_name}' already exists"

        self.c.execute(f'''ALTER TABLE {directory_name}
                           RENAME TO {new_name}''')
        self.conn.commit()
    
    def rename_object(self, directory, filename, new_name):
        assert self.__check_directory_exist(directory), f"Directory '{directory}' does not exist"

        try:
            # Use this instance's cursor and connection, not the global `db`
            self.c.execute(f'''UPDATE {directory} SET
                            filename = ?,
                            last_modified = CURRENT_TIMESTAMP
                            WHERE filename = ?''', (new_name, filename))
            self.conn.commit()
        except sqlite3.IntegrityError:
            raise sqlite3.IntegrityError(f"Another file named '{new_name}' exists under directory '{directory}'")
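
SQLite cannot bind table names as parameters, so the class interpolates directory names directly into its SQL statements. One possible guard is sketched below, assuming directory names are restricted to plain identifiers. The helper `validate_directory_name` is hypothetical and not part of the prototype.

```python
import re

# Hypothetical helper, not part of the prototype: accept only names that are
# safe to interpolate into SQL as a table name.
_IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def validate_directory_name(name):
    """Return the name unchanged if it is a plain identifier, else raise ValueError."""
    if not isinstance(name, str) or not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Invalid directory name: {name!r}")
    return name
```

Calling such a check at the top of each method that takes a directory name would reject inputs such as `'a; DROP TABLE b'` before they reach the database.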

            

Instantiate `ExtendedObjectStorage` with the database name as a parameter. This immediately creates a new database with that name and connects to it, or simply connects if the database already exists. The connection is made via SQLite.

In [3]:
db = ExtendedObjectStorage(database='mydb.db')
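
The create-or-connect behavior comes from `sqlite3.connect` itself. A minimal sketch of it in isolation, using a temporary directory and a hypothetical file name `demo.db` so that nothing is left on disk:

```python
import os
import sqlite3
import tempfile

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.db')  # hypothetical file name
    existed_before = os.path.exists(path)

    conn = sqlite3.connect(path)  # creates the database file if it is missing
    conn.execute('CREATE TABLE t (x INTEGER)')
    conn.commit()
    conn.close()
    existed_after = os.path.exists(path)

    conn = sqlite3.connect(path)  # the file exists now, so this only connects
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    conn.close()
```

The table created through the first connection is still visible through the second one, which is why reopening `mydb.db` in a later session would show the directories created earlier.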

Create a new directory by passing the directory name. In our implementation, each directory is an SQL table.

In [4]:
db.create_directory('betanir')

You can list the directory. Note the schema: each directory has a directory name field, a filename, a last-modified timestamp, and an optional data field of type BLOB (which shouldn't be used in the context of a catalog). Note that `filename` serves as the primary key of the table, and since each directory is its own table, this ensures there are no duplicate filenames under the same directory.

In [5]:
db.list_directory('betanir')

Unnamed: 0,directory,filename,last_modified,data
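
To check which columns actually form the primary key, the schema can be inspected with SQLite's `PRAGMA table_info`. The sketch below recreates the table from `create_directory` in an in-memory database, purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE betanir(
    directory VARCHAR(120),
    filename VARCHAR(120) PRIMARY KEY,
    last_modified DATETIME DEFAULT CURRENT_TIMESTAMP,
    data BLOB
)''')

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column;
# pk is non-zero for columns that belong to the primary key.
columns = conn.execute('PRAGMA table_info(betanir)').fetchall()
column_names = [name for _, name, _, _, _, _ in columns]
primary_key = [name for _, name, _, _, _, pk in columns if pk]
conn.close()
```

With the schema above, `primary_key` contains only `filename`. Uniqueness per directory follows from each directory being a separate table.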


Let us add some files to examine the data integrity.

In [6]:
for i in range(100):
    db.create_object('betanir', f'tst_{str(i).zfill(2)}.csv')
db.list_directory('betanir')

Unnamed: 0,directory,filename,last_modified,data
0,betanir/,tst_00.csv,2022-02-16 13:43:17,
1,betanir/,tst_01.csv,2022-02-16 13:43:17,
2,betanir/,tst_02.csv,2022-02-16 13:43:17,
3,betanir/,tst_03.csv,2022-02-16 13:43:17,
4,betanir/,tst_04.csv,2022-02-16 13:43:17,
...,...,...,...,...
95,betanir/,tst_95.csv,2022-02-16 13:43:18,
96,betanir/,tst_96.csv,2022-02-16 13:43:18,
97,betanir/,tst_97.csv,2022-02-16 13:43:18,
98,betanir/,tst_98.csv,2022-02-16 13:43:18,


Trying to add a file with an existing name raises a descriptive `IntegrityError`:

In [7]:
db.create_object('betanir', 'tst_01.csv')

IntegrityError: Another file named 'tst_01.csv' exists under directory 'betanir'

As promised, renaming a file is an atomic O(1) operation. Note how the `last_modified` timestamp is updated.

In [8]:
db.rename_object('betanir', 'tst_01.csv', 'tst.csv')
db.list_directory('betanir').head()

Unnamed: 0,directory,filename,last_modified,data
0,betanir/,tst_00.csv,2022-02-16 13:43:17,
1,betanir/,tst.csv,2022-02-16 13:43:35,
2,betanir/,tst_02.csv,2022-02-16 13:43:17,
3,betanir/,tst_03.csv,2022-02-16 13:43:17,
4,betanir/,tst_04.csv,2022-02-16 13:43:17,
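
The rename reduces to a single `UPDATE` on one catalog row. The sketch below shows that statement on an in-memory table, with parameter binding and a `rowcount` check for a missing source file. The function `rename_row` and the table `demo` are hypothetical and only mirror the idea behind `rename_object`.

```python
import sqlite3

def rename_row(conn, table, old_name, new_name):
    """Rename one catalog row; hypothetical helper, not the prototype's method."""
    # `with conn` commits on success and rolls back if the UPDATE raises.
    with conn:
        cur = conn.execute(
            f'UPDATE {table} SET filename = ?, last_modified = CURRENT_TIMESTAMP '
            'WHERE filename = ?',
            (new_name, old_name),
        )
    if cur.rowcount == 0:
        raise ValueError(f"No file named '{old_name}' in '{table}'")

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo (filename TEXT PRIMARY KEY, '
             'last_modified DATETIME DEFAULT CURRENT_TIMESTAMP)')
conn.executemany('INSERT INTO demo (filename) VALUES (?)', [('a.csv',), ('b.csv',)])
rename_row(conn, 'demo', 'a.csv', 'c.csv')
names = sorted(row[0] for row in conn.execute('SELECT filename FROM demo'))
```

Renaming onto a name that is already taken raises `sqlite3.IntegrityError`, because `filename` is the primary key. Renaming a file that is not in the table raises `ValueError` in this sketch instead of silently doing nothing.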


Removing an object also has its own method. Trying to remove a file that does not exist raises an error.

In [9]:
db.delete_object('betanir', 'tst.csv')
db.list_directory('betanir').head()

Unnamed: 0,directory,filename,last_modified,data
0,betanir/,tst_00.csv,2022-02-16 13:43:17,
1,betanir/,tst_02.csv,2022-02-16 13:43:17,
2,betanir/,tst_03.csv,2022-02-16 13:43:17,
3,betanir/,tst_04.csv,2022-02-16 13:43:17,
4,betanir/,tst_05.csv,2022-02-16 13:43:17,


Since this system was originally designed for object storage integration, the `get_object` method would normally retrieve the file from object storage via a GET operation, using the cloud provider's API. The method can therefore be easily modified by users to fit their needs.

In the vanilla prototype we assume the data is stored in the `data` BLOB column of the SQL table, so `get_object` returns the object itself. For the sake of this demonstration, we re-add the file we deleted in the previous cell under its original name, `tst_01.csv`, this time with actual data.

In [10]:
data = 'firstname,secondname,city\nAlbert,Cohen,Haifa\nScott,Cohen,NewYork'
db.create_object('betanir', 'tst_01.csv', data=data)
db.get_object('betanir', 'tst_01.csv')


'firstname,secondname,city\nAlbert,Cohen,Haifa\nScott,Cohen,NewYork'
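
To delegate retrieval to a real object store, `get_object` could confirm the catalog entry and then call a provider-specific fetch function. The sketch below shows one way to structure this, assuming a hypothetical `fetch` callable that maps an object key to its bytes. A plain dict stands in for the object store, and no real cloud API is used.

```python
import sqlite3

def get_object_via(conn, directory, filename, fetch):
    """Confirm the catalog entry exists, then fetch the payload with `fetch`.

    `fetch` is a hypothetical callable (key -> bytes) standing in for the
    cloud provider's GET call; it is not part of the prototype.
    """
    row = conn.execute(
        f'SELECT directory, filename FROM {directory} WHERE filename = ?',
        (filename,),
    ).fetchone()
    if row is None:
        raise ValueError(f"No file named '{filename}' exists under directory '{directory}'")
    return fetch(row[0] + row[1])  # key built as '<directory>/<filename>'

# Stand-in for the object store: a plain dict keyed by object key.
fake_store = {'betanir/tst_01.csv': b'firstname,secondname,city\nAlbert,Cohen,Haifa'}

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE betanir (directory TEXT, filename TEXT PRIMARY KEY)')
conn.execute('INSERT INTO betanir VALUES (?, ?)', ('betanir/', 'tst_01.csv'))
payload = get_object_via(conn, 'betanir', 'tst_01.csv', fake_store.__getitem__)
```

Passing the fetch function as an argument keeps the catalog lookup independent of any particular provider, so the same lookup can be exercised against a stub as shown here.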

A directory can be removed simply by dropping the SQL table with that directory name. Note that trying to list a removed (or nonexistent) directory also results in an error.

In [11]:
db.delete_directory('betanir')

In [12]:
db.list_directory('betanir')

AssertionError: Directory 'betanir' does not exist

Finally, close the connection to the database. You can also remove the database file itself using the commented-out command.

In [13]:
db.conn.close()

# Optional - remove database
# os.remove('./mydb.db')
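
As an alternative to calling `close()` by hand, the connection can be wrapped in `contextlib.closing`, which closes it even if a statement fails. A minimal sketch on an in-memory database (the table `demo` is only for illustration):

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, even if an error is raised.
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute('CREATE TABLE demo (filename TEXT PRIMARY KEY)')
    conn.commit()

try:
    conn.execute('SELECT 1')
    still_open = True
except sqlite3.ProgrammingError:
    # sqlite3 raises ProgrammingError for operations on a closed connection.
    still_open = False
```

Note that `with conn:` on its own only manages transactions (commit or rollback) and does not close the connection, which is why `closing` is used here.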