# Quack Quack - Creating the DuckDB

---

**Ducking the Data with a Database**

Because this project will involve many data transformations, engineered features, and iterative modeling, I need an orderly, robust system to manage the data and models without adding too much complexity to the repository. Instead of creating a separate file for each version of the data, I decided to create a small database to store the information effectively and efficiently. Before I can create the database, I need data!

**Hatching the Plan**

I obtained my source data from the article referenced in the `README.md` file. The source data comes as two separate CSV files, both sizeable and slow to load into a dataframe (or, in this case, a database). To reduce file size and speed up reads and writes, I will convert the original source files from CSV to parquet. Then I will use the raw reservation data to create the first table in the database.

---

In [8]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import os
import sys

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [10]:
import duckdb
import glob
import pandas as pd
from pathlib import Path
import uuid

# Convert Source CSVs to Parquet

---

The following code loops through this repository's `/data/source/` directory, finds the source CSVs, converts each one to a parquet file, and then deletes the original CSV.

---

In [6]:
# Define the directory containing the CSV files
path = Path('../data/source/')

for csv_file in path.glob('*.csv'):
    try:
        # Read the CSV file into a DataFrame
        df = pd.read_csv(csv_file)
        
        # Take the hotel number from the file name (e.g. 'H1.csv' -> '1')
        df['HotelNumber'] = csv_file.stem[1:2]
        
        # Define the Parquet file path (same name as the CSV file but with .parquet extension)
        parquet_file = csv_file.with_suffix('.parquet')
        
        # Convert the DataFrame to a Parquet file
        df.to_parquet(parquet_file)

        # If the conversion was successful, remove the CSV file
        os.remove(csv_file)
        print(f"Successfully converted and removed {csv_file}")
    except Exception as e:
        print(f"Error converting {csv_file}: {e}")

print("Conversion completed.")

Conversion completed.


# Generate and Append UUIDs

Since the source data was anonymized, there are no unique identifiers for each reservation. To support database joins and relationships between tables, I will add columns for both a UUID and the source hotel number to differentiate the reservations and preserve the unique details of each hotel.

In [9]:
input_files = ['../data/source/H1.parquet', '../data/source/H2.parquet']
output_files = ['../data/H1_with_uuid.parquet', '../data/H2_with_uuid.parquet']

save = True

for input_file, output_file in zip(input_files, output_files):
    df = db_utils.add_hotel_number_to_dataframe(input_file, output_file, save_to_parquet = save)
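
The helper `db_utils.add_hotel_number_to_dataframe` lives in the `src` directory and is not shown in this notebook. As a rough sketch of what this kind of step involves (not the actual implementation), the hypothetical function `add_identifiers` below assumes the data is already loaded into a DataFrame and appends one random UUID per row along with a constant hotel number.

```python
import uuid

import pandas as pd


def add_identifiers(df, hotel_number):
    """Return a copy of `df` with `HotelNumber` and `UUID` columns appended.

    Hypothetical helper for illustration; the real logic is in db_utils.
    """
    out = df.copy()
    # Every row from the same source file gets the same hotel number
    out["HotelNumber"] = hotel_number
    # uuid4() generates a random identifier, one per reservation row
    out["UUID"] = [str(uuid.uuid4()) for _ in range(len(out))]
    return out
```

Working on a copy leaves the input DataFrame untouched, which makes the step safe to rerun while iterating in a notebook.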

# Convert Updated Parquets to DuckDB

---

With the source CSVs converted to parquet and tagged with UUIDs, I can now create the database used in the rest of the project pipeline.

---

In [13]:
# List of Parquet file paths
file_paths = ['../data/H1_with_uuid.parquet', '../data/H2_with_uuid.parquet']

# Path to the DuckDB database file
db_path = '../data/Hotel_reservations.duckdb'

# Check if the database file exists and remove it if it does
if os.path.exists(db_path):
    os.remove(db_path)

# Initialize connection to DuckDB
with duckdb.connect(database=db_path, read_only=False) as conn:
    
    # Use the first file to create the table
    conn.execute(f"CREATE TABLE source_data AS SELECT * FROM '{file_paths[0]}'")
    
    # For subsequent files, append data to the existing table
    for file_path in file_paths[1:]:  # Start from the second item
        conn.execute(f"INSERT INTO source_data SELECT * FROM '{file_path}'")
       
    ## Confirm successful creation of database and table(s)
    display(conn.execute('SELECT * FROM source_data LIMIT 10').df())

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID
0,0,342,2015,July,27,1,0,0,2,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,2a51e73a-1af5-4d4d-acdf-54f80e2df411
1,0,737,2015,July,27,1,0,0,2,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,caf3114e-459e-4665-8c07-49d4fe5b52db
2,0,7,2015,July,27,1,0,1,1,0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,823abb44-a0b7-4c7c-a7f1-457406251f3f
3,0,13,2015,July,27,1,0,1,1,0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,e04d8d68-10ea-42b5-bd8b-22c09dd539f3
4,0,14,2015,July,27,1,0,2,2,0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,c7c8f889-5eb2-471a-ae26-7fcc3be081da
5,0,14,2015,July,27,1,0,2,2,0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,2a58d83b-1f43-4753-943d-46d6b2ab505d
6,0,0,2015,July,27,1,0,2,2,0,...,,0,Transient,107.0,0,0,Check-Out,2015-07-03,1,4c11dffe-27fa-4a67-8d87-5091d39e8bce
7,0,9,2015,July,27,1,0,2,2,0,...,,0,Transient,103.0,0,1,Check-Out,2015-07-03,1,36202b29-aabf-458b-badf-39e459a5fd50
8,1,85,2015,July,27,1,0,3,2,0,...,,0,Transient,82.0,0,1,Canceled,2015-05-06,1,13773a9d-3c62-48b4-a8e5-67d569dea593
9,1,75,2015,July,27,1,0,3,2,0,...,,0,Transient,105.5,0,0,Canceled,2015-04-22,1,bb92603d-a99a-44b9-a937-3a483dc77c43


In [14]:
raise Exception('Interrupt workflow - check duckdb integrity/setup.')

Exception: Interrupt workflow - check duckdb integrity/setup.

In [16]:
# Initialize connection to DuckDB
with duckdb.connect(database=db_path, read_only=True) as conn:
       
    ## Confirm that rows from the second hotel were loaded
    display(conn.execute('SELECT * FROM source_data WHERE HotelNumber = 2 LIMIT 10').df())

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID
0,0,6,2015,July,27,1,0,2,1,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-03,2,256ac979-a662-459f-b71e-49cd5d5966e1
1,1,88,2015,July,27,1,0,4,2,0,...,,0,Transient,76.5,0,1,Canceled,2015-07-01,2,7d9991a4-63e4-4045-a66a-9f118fbd6293
2,1,65,2015,July,27,1,0,4,1,0,...,,0,Transient,68.0,0,1,Canceled,2015-04-30,2,3942535d-dbc9-45cb-a72e-0abb1649fcd1
3,1,92,2015,July,27,1,2,4,2,0,...,,0,Transient,76.5,0,2,Canceled,2015-06-23,2,e05b81f2-19a9-47f9-8405-903c51db5c56
4,1,100,2015,July,27,2,0,2,2,0,...,,0,Transient,76.5,0,1,Canceled,2015-04-02,2,25796554-985e-44a3-b5f0-d6fa89d48ae2
5,1,79,2015,July,27,2,0,3,2,0,...,,0,Transient,76.5,0,1,Canceled,2015-06-25,2,2a835d53-890e-4fea-bdc3-0683ea56927c
6,0,3,2015,July,27,2,0,3,1,0,...,,0,Transient-Party,58.67,0,0,Check-Out,2015-07-05,2,b63ec25b-19ec-4d7d-ab32-91db8c337f36
7,1,63,2015,July,27,2,1,3,1,0,...,,0,Transient,68.0,0,0,Canceled,2015-06-25,2,419f0436-9d09-4510-b6c6-afc724d8b10a
8,1,62,2015,July,27,2,2,3,2,0,...,,0,Transient,76.5,0,1,No-Show,2015-07-02,2,b1ebfb52-cea1-4229-851b-2b9add7f8b4d
9,1,62,2015,July,27,2,2,3,2,0,...,,0,Transient,76.5,0,1,No-Show,2015-07-02,2,9e55cee4-a3b6-4270-b96a-f34aaf713a38


In [18]:
# Initialize connection to DuckDB
with duckdb.connect(database=db_path, read_only=True) as conn:
       
    ## Count the rows loaded into the table
    display(conn.execute('SELECT COUNT(HotelNumber) FROM source_data').df())

Unnamed: 0,count(HotelNumber)
0,119390


# Copy Source Data Table

In [None]:
# Path to your DuckDB database
database_path = '../data/Hotel_reservations.duckdb'

# SQL command to copy the data from an existing table to a new table
copy_table_command = """
CREATE TABLE res_data AS
SELECT * FROM source_data;
"""

with db_utils.duckdb_connection(database_path) as conn:
    conn.execute(copy_table_command)
    print("Table copied successfully.")

In [None]:
file_paths

In [None]:
# for file in file_paths:

#     # Remove intermediate parquet files
#     if os.path.exists(file):
#         os.remove(file)

# Concatenate Updated Data

An earlier version of this workflow deleted the intermediate parquet files listed in `file_paths`. However, a bug in the database creation process resulted in missing data, so the files are now kept and concatenated into a single parquet file that serves as a replacement until the database is fixed.

In [None]:
df1 = pd.read_parquet(file_paths[0])
df1.head()

In [None]:
df2 = pd.read_parquet(file_paths[1])
df2.head()

In [None]:
df_condensed = pd.concat([df1, df2], axis = 0)
df_condensed = df_condensed.reset_index(drop = True)
df_condensed

In [None]:
## Confirm correction of bug affecting one of the target features
df_condensed['IsCanceled'].value_counts(dropna= False, ascending = False)

In [None]:
df_condensed.to_parquet('../data/data_condensed_with_uuid.parquet')
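
Since the concatenated file stands in for the database, it may be worth confirming that nothing was lost or duplicated when the two frames were combined. The sketch below uses a hypothetical helper, `check_concat`, and assumes the identifier column is named `UUID` as in the outputs above.

```python
import pandas as pd


def check_concat(parts, combined, id_col="UUID"):
    """Compare a concatenated frame against the frames it was built from."""
    expected_rows = sum(len(part) for part in parts)
    return {
        # The combined frame should have exactly as many rows as its parts
        "rows_match": len(combined) == expected_rows,
        # Each reservation identifier should appear only once
        "ids_unique": bool(combined[id_col].is_unique),
    }
```

In this notebook the call would be `check_concat([df1, df2], df_condensed)`, and both values should be `True` if the concatenation is sound.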