# Quack Quack - Creating the DuckDB

---

**Ducking the Data with a Database**

Because this project will involve many data transformations, engineered features, and iterative modeling, I need an orderly, robust system to handle all of the data, models, etc. without adding too much complexity to the repository. Instead of creating separate files for each version of the data, I decided to create a small database to store the information effectively and efficiently. Before I can create the database, I need data!

**Hatching the Plan**

I obtained my source data from the article referenced in the `README.md` file. The source data comes as two separate CSV files, both of which are sizeable and take a while to load into a dataframe (or in this case, a database). To reduce file size and speed up reads and writes, I will convert the original source files from CSV to parquet. Then, I will take the raw reservation data and use it to create the first table within the database.

---

In [1]:
import duckdb
import glob
import os
import pandas as pd
import uuid

# Convert Source CSVs to Parquet

---

The following code searches this repository's `./data/` directory for the source CSVs, converts each of them to a parquet file, and then deletes the CSV once the conversion succeeds.

---

In [3]:
# Define the directory containing the CSV files
directory = './data/'

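# Match the source files regardless of the case of the leading letter
# ('h1.csv' or 'H1.csv'), since the later cells read 'H1.parquet' and 'H2.parquet'
file_patterns = [os.path.join(directory, '[hH]1.csv'), os.path.join(directory, '[hH]2.csv')]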

# Initialize a list to hold the matched file paths
csv_files = []

# Loop through the patterns and extend the list with found files
for pattern in file_patterns:
    csv_files.extend(glob.glob(pattern))

# Check if no files were found
if not csv_files:
    print("No matching filepaths found. Stopping execution.")
else:
    # Loop through each found CSV file
    for csv_file in csv_files:
        try:
            # Read the CSV file into a DataFrame
            df = pd.read_csv(csv_file)
            
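            # Take the hotel number from the filename (e.g. 'h1.csv' -> 1),
            # not from the full path, whose second character is '/'
            df['HotelNumber'] = int(os.path.basename(csv_file)[1:2])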
            
            # Define the Parquet file path (same name as the CSV file but with .parquet extension)
            parquet_file = csv_file.replace('.csv', '.parquet')
            
            # Convert the DataFrame to a Parquet file
            df.to_parquet(parquet_file)

            # If the conversion was successful, remove the CSV file
            os.remove(csv_file)
            print(f"Successfully converted and removed {csv_file}")
        except Exception as e:
            print(f"Error converting {csv_file}: {e}")

    print("Conversion completed.")

No matching filepaths found. Stopping execution.
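
One detail in the cell above is worth a note: `csv_file.replace('.csv', '.parquet')` swaps every occurrence of `.csv` in the path, not only the file extension. That is harmless for `./data/`, but it would misbehave if a directory name ever contained `.csv`. As a minimal sketch (the helper name `parquet_path_for` is hypothetical and not part of this notebook), `os.path.splitext` changes only the final extension:

```python
import os

def parquet_path_for(csv_path):
    """Return csv_path with only its final extension replaced by .parquet."""
    root, _extension = os.path.splitext(csv_path)
    return root + '.parquet'

# The usual case behaves exactly like str.replace
print(parquet_path_for('./data/H1.csv'))

# A made-up directory name containing '.csv' is left untouched
print(parquet_path_for('./data.csv.backup/H1.csv'))
```
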


# Generate and Append UUIDs

Since the source data was anonymized, there are no unique identifiers for each reservation. To support database joins and relationships between tables, I will add columns for both a UUID and the source hotel number to differentiate the reservations and preserve the unique details of each hotel.

In [4]:
def generate_uuid_list(df):

    # Generate a UUID for each row and save to a list
    uuid_list = [str(uuid.uuid4()) for _ in range(len(df))]
    
    return uuid_list
    

def extract_hotel_number(filepath):
    
    # Extract the base filename without the extension
    base_filename = os.path.basename(filepath).split('.')[0]
    
    # Extract the hotel number (assuming the format is always H<number>)
    hotel_number = base_filename[1:]
    
    return int(hotel_number)


def add_hotel_number_to_dataframe(input_filepath,
                                  output_filepath = None,
                                  col_hotelnum = 'HotelNumber',
                                  col_id = 'UUID',
                                  save_to_parquet = True,
                                  engine = 'pyarrow',
                                  compression = 'snappy'):
    
    # Read the Parquet file into a DataFrame
    df = pd.read_parquet(input_filepath)
    
    # Extract the hotel number from the filename and append to dataframe
    df[col_hotelnum] = extract_hotel_number(input_filepath)
    
    # Generate and append UUIDs for each reservation
    df[col_id] = generate_uuid_list(df)
    
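    # Save only when requested and an output path was provided;
    # engine and compression are passed by keyword to match the to_parquet signature
    if save_to_parquet and output_filepath is not None:
        df.to_parquet(output_filepath, engine=engine, compression=compression)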
    
    return df
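
`extract_hotel_number` relies on the filename being exactly `H<number>`. For a name such as `H1_with_uuid.parquet`, the slice yields `'1_with_uuid'` and `int()` raises a `ValueError`. The sketch below shows one possible stricter parser built on a regular expression. The function name `parse_hotel_number` and the accepted pattern are assumptions made for illustration, not part of this notebook.

```python
import os
import re

# Assumed naming scheme: 'H' or 'h', then digits, then either the end of the
# name or an underscore-separated suffix (e.g. 'H1', 'h2', 'H1_with_uuid').
_HOTEL_PATTERN = re.compile(r'^[hH](\d+)(?:_|$)')

def parse_hotel_number(filepath):
    """Return the hotel number encoded in a filename such as 'H1.parquet'."""
    base_filename = os.path.basename(filepath).split('.')[0]
    match = _HOTEL_PATTERN.match(base_filename)
    if match is None:
        raise ValueError(f"Cannot find a hotel number in {filepath!r}")
    return int(match.group(1))

print(parse_hotel_number('./data/H1.parquet'))
print(parse_hotel_number('./data/h2.parquet'))
print(parse_hotel_number('./data/H1_with_uuid.parquet'))
```

Raising an explicit error for unexpected filenames keeps a misnamed file from silently receiving the wrong hotel number.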

In [5]:
input_files = ['./data/H1.parquet', './data/H2.parquet']
output_files = ['./data/H1_with_uuid.parquet', './data/H2_with_uuid.parquet']

save = True

# Use descriptive loop names so the built-in input() is not shadowed
for input_file, output_file in zip(input_files, output_files):
    df = add_hotel_number_to_dataframe(input_file, output_file, save_to_parquet=save)
    display(df.head())
    del df

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID
0,0,342,2015,July,27,1,0,0,2,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,c340004a-c472-4c6b-81eb-bf80c314f645
1,0,737,2015,July,27,1,0,0,2,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,ec09aebb-4f9d-4273-abeb-d65c64506b68
2,0,7,2015,July,27,1,0,1,1,0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,60a02574-2682-4f08-b26d-98a3fd319eeb
3,0,13,2015,July,27,1,0,1,1,0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,d73a0de0-7008-40d4-95c2-cece848c308c
4,0,14,2015,July,27,1,0,2,2,0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,1d997f50-7fca-47f8-a293-d8ae13b5361b


Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID
0,0,6,2015,July,27,1,0,2,1,0.0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-03,2,2bdb7c04-c714-45bb-aaed-047f4a053cc1
1,1,88,2015,July,27,1,0,4,2,0.0,...,,0,Transient,76.5,0,1,Canceled,2015-07-01,2,da9d2090-2381-4886-b155-097cd64c1e9c
2,1,65,2015,July,27,1,0,4,1,0.0,...,,0,Transient,68.0,0,1,Canceled,2015-04-30,2,8326f3d0-31c9-48db-ae83-15ddfa91ab12
3,1,92,2015,July,27,1,2,4,2,0.0,...,,0,Transient,76.5,0,2,Canceled,2015-06-23,2,f35b5f7e-25d3-451c-baef-34d401202e49
4,1,100,2015,July,27,2,0,2,2,0.0,...,,0,Transient,76.5,0,1,Canceled,2015-04-02,2,cee25a90-8139-482e-a2b1-ae57f644373e
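
Because the `UUID` column is meant to act as a join key, it may be worth confirming that no identifier repeats across the two hotels. Collisions between version 4 UUIDs are extremely unlikely, but the check is cheap. The sketch below uses plain Python lists as stand-ins for the two `UUID` columns, and the helper name `find_duplicate_ids` is hypothetical.

```python
import uuid
from collections import Counter

def find_duplicate_ids(ids):
    """Return the identifiers that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [value for value, count in counts.items() if count > 1]

# Stand-ins for the UUID columns of the two hotel dataframes
hotel_1_ids = [str(uuid.uuid4()) for _ in range(1000)]
hotel_2_ids = [str(uuid.uuid4()) for _ in range(1000)]

duplicates = find_duplicate_ids(hotel_1_ids + hotel_2_ids)
print(f"{len(duplicates)} duplicated identifiers found")
```

In the notebook itself, the same function could be given the `UUID` columns of both dataframes joined into one list.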


In [6]:
# ## Loop through the source data; generate and append UUIDs; append the hotel number;
# ## then save the new results as a new parquet file.

# input_files = ['./data/H1.parquet', './data/h2.parquet']
# output_files = ['./data/H1_with_uuid.parquet', './data/H2_with_uuid.parquet']

# # Process each file
# for input_file, output_file in zip(input_files, output_files):
#     add_uuids_to_parquet(input_file, output_file)
# # List of Parquet file paths
# file_paths = ['./data/H1_with_uuid.parquet', './data/H2_with_uuid.parquet']

# for file in file_paths:
#     print(file[8])
    
#     temp_df = pd.read_parquet(file)
#     temp_df['HotelNumber'] = file[8]
#     temp_df.to_parquet(file)
# for file in file_paths:
#     temp_df = pd.read_parquet(file)
#     display(temp_df[['uuid', 'HotelNumber']].head(10))
#     del temp_df

# Convert Updated Parquets to DuckDB

---

With the source CSVs converted to parquet and each reservation tagged with a UUID and a hotel number, I can now create the database that the rest of the project pipeline will use.

---

In [7]:
# List of Parquet file paths
file_paths = ['./data/H1_with_uuid.parquet', './data/H2_with_uuid.parquet']

# Path to the DuckDB database file
db_path = './data/hotel_reservations.duckdb'

# Check if the database file exists and remove it if it does
if os.path.exists(db_path):
    os.remove(db_path)

# Initialize connection to DuckDB
conn = duckdb.connect(database=db_path, read_only=False)

# Use the first file to create the table
conn.execute(f"CREATE TABLE source_data AS SELECT * FROM '{file_paths[0]}'")

# For subsequent files, append data to the existing table
for file_path in file_paths[1:]:  # Start from the second item
    conn.execute(f"INSERT INTO source_data SELECT * FROM '{file_path}'")
    
## Confirm successful creation of database and table(s)
display(conn.execute('SELECT * FROM source_data LIMIT 10').df())

conn.close()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID
0,0,342,2015,July,27,1,0,0,2,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,c340004a-c472-4c6b-81eb-bf80c314f645
1,0,737,2015,July,27,1,0,0,2,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,ec09aebb-4f9d-4273-abeb-d65c64506b68
2,0,7,2015,July,27,1,0,1,1,0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,60a02574-2682-4f08-b26d-98a3fd319eeb
3,0,13,2015,July,27,1,0,1,1,0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,d73a0de0-7008-40d4-95c2-cece848c308c
4,0,14,2015,July,27,1,0,2,2,0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,1d997f50-7fca-47f8-a293-d8ae13b5361b
5,0,14,2015,July,27,1,0,2,2,0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,ec0b34e3-fd3a-4bc7-8ac8-fb6ddd41a6f2
6,0,0,2015,July,27,1,0,2,2,0,...,,0,Transient,107.0,0,0,Check-Out,2015-07-03,1,90764149-8563-4d62-bd79-bba503df4198
7,0,9,2015,July,27,1,0,2,2,0,...,,0,Transient,103.0,0,1,Check-Out,2015-07-03,1,1202cd51-8523-4af1-855f-f951fd635182
8,1,85,2015,July,27,1,0,3,2,0,...,,0,Transient,82.0,0,1,Canceled,2015-05-06,1,b0fcba54-a774-4e1f-a607-b417922303b6
9,1,75,2015,July,27,1,0,3,2,0,...,,0,Transient,105.5,0,0,Canceled,2015-04-22,1,35f7ab35-24af-4af0-b7f6-30464d00470c
