# Quack Quack - Creating the DuckDB

---

**Ducking the Data with a Database**

Because this project will involve many data transformations, engineered features, and iterative modeling, I need an orderly, robust system to handle the data, models, and related artifacts without adding too much complexity to the repository. Instead of creating separate files for each version of the data, I decided to create a small database to store the information efficiently. Before I can create the database, I need data!

**Hatching the Plan**

I obtained my source data from the article referenced in the `README.md` file. The source data comes as two separate CSV files, both sizeable and slow to load into a dataframe (or, in this case, a database). To reduce file size and speed up reads and writes, I will convert the original source files from CSV to Parquet. Then I will use the raw reservation data to create the first table in the database.

---

In [None]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import sys
import os

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [None]:
import duckdb
import glob
import os
import pandas as pd
import uuid

# Convert Source CSVs to Parquet

---

The following code looks in this repository's `data/source/` directory for the source CSVs (`h1.csv` and `h2.csv`), converts each of them to a Parquet file, and then deletes the CSV.

---

In [None]:
# Define the directory containing the CSV files
directory = '../data/source/'

# Exact paths for 'h1.csv' and 'h2.csv'; glob below keeps only the ones that exist
file_patterns = [os.path.join(directory, 'h1.csv'), os.path.join(directory, 'h2.csv')]

# Initialize a list to hold the matched file paths
csv_files = []

# Loop through the patterns and extend the list with found files
for pattern in file_patterns:
    csv_files.extend(glob.glob(pattern))

# Check if no files were found
if not csv_files:
    print("No matching filepaths found. Stopping execution.")
else:
    # Loop through each found CSV file
    for csv_file in csv_files:
        try:
            # Read the CSV file into a DataFrame
            df = pd.read_csv(csv_file)
            
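            # Tag each row with its hotel number, taken from the file name (e.g. 'h1.csv' -> '1').
            # Slicing the full path would pick up a character of '../' instead.
            df['HotelNumber'] = os.path.basename(csv_file)[1:2]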
            
            # Define the Parquet file path (same name as the CSV file but with .parquet extension)
            parquet_file = csv_file.replace('.csv', '.parquet')
            
            # Convert the DataFrame to a Parquet file
            df.to_parquet(parquet_file)

            # If the conversion was successful, remove the CSV file
            os.remove(csv_file)
            print(f"Successfully converted and removed {csv_file}")
        except Exception as e:
            print(f"Error converting {csv_file}: {e}")

    print("Conversion completed.")
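
The `HotelNumber` column appears intended to hold the digit from the file name (`1` for `h1.csv`, `2` for `h2.csv`). Note that slicing the full path, as in `'../data/source/h1.csv'[1:2]`, returns `'.'`, so the directory part has to be stripped first. Below is a minimal sketch of a more explicit way to extract that digit, assuming the files are always named `h<digits>` or `H<digits>` plus an extension. The helper name `hotel_number_from_path` is hypothetical and not part of this repository.

```python
import os
import re


def hotel_number_from_path(path: str) -> str:
    """Return the digit(s) that follow the leading 'h' in a name like 'h1.csv'."""
    # Work on the file name only, so directory parts such as '../data/' are ignored
    file_name = os.path.basename(path)
    match = re.fullmatch(r"[hH](\d+)\.\w+", file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name}")
    return match.group(1)


print(hotel_number_from_path("../data/source/h1.csv"))  # prints 1
```

Raising on an unexpected name is a design choice for this sketch: it surfaces a misnamed file immediately instead of silently writing a wrong hotel number into every row.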

# Generate and Append UUIDs

Since the source data was anonymized, there are no unique identifiers for each reservation. To support database joins and relationships between tables, I will add columns for both a UUID and the source hotel number to differentiate the reservations and preserve the unique details of each hotel.

In [None]:
input_files = ['../data/source/H1.parquet', '../data/source/H2.parquet']
output_files = ['../data/H1_with_uuid.parquet', '../data/H2_with_uuid.parquet']

save = True

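# Add the identifier columns to each file (names chosen to avoid shadowing the built-in `input`)
for input_file, output_file in zip(input_files, output_files):
    df = db_utils.add_hotel_number_to_dataframe(input_file, output_file, save_to_parquet=save)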
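
The source of `db_utils.add_hotel_number_to_dataframe` is not shown in this notebook. As a rough sketch of the idea only, the snippet below adds a random UUID and a hotel number to a small in-memory dataframe. The function name `add_ids`, the `uuid` column name, and the sample data are assumptions made for illustration, not the project's actual implementation.

```python
import uuid

import pandas as pd


def add_ids(df: pd.DataFrame, hotel_number: int) -> pd.DataFrame:
    """Return a copy of df with a per-row UUID and a constant hotel number."""
    out = df.copy()
    # uuid4() is random, so every row receives its own identifier
    out["uuid"] = [str(uuid.uuid4()) for _ in range(len(out))]
    out["HotelNumber"] = hotel_number
    return out


# Tiny stand-in for the reservation data
sample = pd.DataFrame({"IsCanceled": [0, 1, 0]})
tagged = add_ids(sample, hotel_number=1)
print(tagged["uuid"].is_unique)  # prints True
```

Because `uuid4` values are random, rerunning the step produces different identifiers, so the IDs only become stable once the output has been saved and reused.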

# Convert Updated Parquets to DuckDB

---

With the source files converted to Parquet and tagged with UUIDs, I can now create the database that the rest of the project pipeline will use.

---

In [None]:
# List of Parquet file paths
file_paths = ['../data/H1_with_uuid.parquet', '../data/H2_with_uuid.parquet']

# Path to the DuckDB database file
db_path = '../data/Hotel_reservations.duckdb'

# Check if the database file exists and remove it if it does
if os.path.exists(db_path):
    os.remove(db_path)

# Initialize connection to DuckDB
conn = duckdb.connect(database=db_path, read_only=False)

# Use the first file to create the table
conn.execute(f"CREATE TABLE source_data AS SELECT * FROM '{file_paths[0]}'")

# For subsequent files, append data to the existing table
for file_path in file_paths[1:]:  # Start from the second item
    conn.execute(f"INSERT INTO source_data SELECT * FROM '{file_path}'")

In [None]:
## Confirm successful creation of database and table(s)
display(conn.execute('SELECT * FROM source_data LIMIT 10').df())

conn.close()

# Copy Source Data Table

In [None]:
# Path to your DuckDB database
database_path = '../data/Hotel_reservations.duckdb'

# SQL command to copy the data from an existing table to a new table
copy_table_command = """
CREATE TABLE res_data AS
SELECT * FROM source_data;
"""

with db_utils.duckdb_connection(database_path) as conn:
    conn.execute(copy_table_command)
    print("Table copied successfully.")

In [None]:
file_paths

In [None]:
# for file in file_paths:

#     # Remove intermediate parquet files
#     if os.path.exists(file):
#         os.remove(file)

# Concatenate Updated Data

A previous version of this workflow deleted the intermediate Parquet files once the database was built. However, a bug in the database creation process resulted in missing data, so the files are now kept. The concatenated data will serve as a replacement until the database is fixed.

In [None]:
df1 = pd.read_parquet(file_paths[0])
df1.head()

In [None]:
df2 = pd.read_parquet(file_paths[1])
df2.head()

In [None]:
df_condensed = pd.concat([df1, df2], axis = 0)
df_condensed = df_condensed.reset_index(drop = True)
df_condensed

In [None]:
## Confirm correction of bug affecting one of the target features
df_condensed['IsCanceled'].value_counts(dropna=False, ascending=False)

In [None]:
df_condensed.to_parquet('../data/data_condensed_with_uuid.parquet')