# Converting Data Source Files to Database

---


I want to convert the source data (CSV files) into a database so that the full dataset is easier to analyze and manage.
I chose DuckDB because it integrates well with pandas and is designed for analytical workflows. This will matter later in my workflow, when I perform EDA and feature engineering.

To support the database, I will first assign each reservation a universally unique identifier (UUID) to serve as the primary key for my tables. Then I will split the source data into logical groups that mimic a live database, which makes the data easier to query later.
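As a rough sketch of that plan, the snippet below gives every row of a small made-up frame a UUID and then splits it into per-topic tables that share the key. The column groupings (`booking_cols`, `guest_cols`, `pricing_cols`) and the toy values are assumptions for illustration, not the final schema.

```python
import uuid

import pandas as pd

# Toy stand-in for the reservation data; the values are made up for illustration.
reservations = pd.DataFrame(
    {
        "HotelName": ["H1", "H1", "H2"],
        "LeadTime": [342, 7, 23],
        "Adults": [2, 1, 2],
        "ADR": [0.0, 75.0, 96.14],
    }
)

# One UUID per row, stored as text, acts as the primary key.
reservations["UUID"] = [str(uuid.uuid4()) for _ in range(len(reservations))]

# Hypothetical logical groups: each table keeps the key plus a subset of columns.
booking_cols = ["UUID", "HotelName", "LeadTime"]
guest_cols = ["UUID", "Adults"]
pricing_cols = ["UUID", "ADR"]

bookings = reservations[booking_cols]
guests = reservations[guest_cols]
pricing = reservations[pricing_cols]

# The shared key lets the pieces be joined back together when needed.
rejoined = bookings.merge(guests, on="UUID").merge(pricing, on="UUID")
print(rejoined.shape)
```

Because every table carries the same `UUID` column, any subset can be joined back to the others without relying on row order.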

---

# Imports

In [1]:
import duckdb
import pandas as pd
import uuid

# Load Data

In [2]:
h1 = pd.read_csv('../../data/source/H1.csv')
h2 = pd.read_csv('../../data/source/H2.csv')

In [3]:
h1.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [4]:
h1['HotelName'] = 'H1'
h2['HotelName'] = 'H2'

In [7]:
data = pd.concat([h1, h2], axis = 0)
data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelName
0,0,342,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,H1
1,0,737,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,H1
2,0,7,2015,July,27,1,0,1,1,0.0,...,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,H1
3,0,13,2015,July,27,1,0,1,1,0.0,...,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,H1
4,0,14,2015,July,27,1,0,2,2,0.0,...,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,H1
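One thing to keep in mind: `pd.concat` keeps each frame's own row labels by default, so the combined frame reuses index values from both hotels. The sketch below uses two tiny made-up frames (`a` and `b` stand in for `h1` and `h2`) to show the effect and how `ignore_index=True` would renumber the rows if unique labels are wanted.

```python
import pandas as pd

# Two toy frames standing in for h1 and h2; the values are illustrative only.
a = pd.DataFrame({"LeadTime": [342, 737], "HotelName": "H1"})
b = pd.DataFrame({"LeadTime": [6, 88], "HotelName": "H2"})

# Default behaviour: each frame keeps its own labels, so they repeat.
stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())      # [0, 1, 0, 1]
print(stacked.index.is_unique)     # False

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Repeated labels are harmless for loading into a database, but label-based lookups such as `data.loc[0]` would then return one row per hotel.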


# Add UUIDs to Reservations

In [8]:
data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelName
0,0,342,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
1,0,737,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
2,0,7,2015,July,27,1,0,1,1,0.0,...,,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
3,0,13,2015,July,27,1,0,1,1,0.0,...,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
4,0,14,2015,July,27,1,0,2,2,0.0,...,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03,H1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,23,2017,August,35,30,2,5,2,0.0,...,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06,H2
79326,0,102,2017,August,35,31,2,5,3,0.0,...,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07,H2
79327,0,34,2017,August,35,31,2,5,2,0.0,...,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07,H2
79328,0,109,2017,August,35,31,2,5,2,0.0,...,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07,H2


In [9]:
# Generate UUIDs for each row in the dataframe
data['UUID'] = [uuid.uuid4() for _ in range(len(data))]
data['UUID'].head()

0    f7add1ce-8fdb-4351-a20a-3c5cc66fc87d
1    990a199b-069b-4e78-9036-fe8ea73a1f3e
2    91d6303b-16f8-41b2-a195-f6bcb9ee605a
3    df309bdf-5fd8-4789-9b52-01de65e43a22
4    007c51b0-4d0e-40ca-b550-34c93d2ca9a7
Name: UUID, dtype: object
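Since `uuid.uuid4()` draws random values, uniqueness is overwhelmingly likely rather than strictly guaranteed, so a cheap sanity check before using the column as a primary key seems worthwhile. The following is a minimal sketch using only the standard library; the sample size of 1000 is arbitrary.

```python
import uuid

# Generate a batch of random (version 4) UUIDs.
ids = [uuid.uuid4() for _ in range(1000)]

# A set drops duplicates, so equal lengths mean every ID is distinct.
all_unique = len(set(ids)) == len(ids)
print(all_unique)

# Converting to text gives the canonical 36-character form,
# which is easy to store and compare across tools.
as_text = [str(u) for u in ids]
print(len(as_text[0]))   # 36
print(ids[0].version)    # 4
```

The same check on the real frame would be `data['UUID'].is_unique`.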

In [12]:
data.head().T

Unnamed: 0,0,1,2,3,4
IsCanceled,0,0,0,0,0
LeadTime,342,737,7,13,14
ArrivalDateYear,2015,2015,2015,2015,2015
ArrivalDateMonth,July,July,July,July,July
ArrivalDateWeekNumber,27,27,27,27,27
ArrivalDateDayOfMonth,1,1,1,1,1
StaysInWeekendNights,0,0,0,0,0
StaysInWeekNights,0,0,1,1,2
Adults,2,2,1,1,2
Children,0.0,0.0,0.0,0.0,0.0


## Prepare Data Types for Database

In [13]:
data['ReservationStatusDate'] = pd.to_datetime(data['ReservationStatusDate'])
data['ReservationStatusDate']

0       2015-07-01
1       2015-07-01
2       2015-07-02
3       2015-07-02
4       2015-07-03
           ...    
79325   2017-09-06
79326   2017-09-07
79327   2017-09-07
79328   2017-09-07
79329   2017-09-07
Name: ReservationStatusDate, Length: 119390, dtype: datetime64[ns]
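The conversion above assumes every value parses cleanly. A slightly more defensive sketch is shown below on a toy frame: an explicit `format` with `errors="coerce"` turns unparseable strings into `NaT` so they can be counted, and the nullable `Int64` dtype is one option for a whole-number column that pandas reads as float when values are missing. The toy values, including the bad date and the missing `Children` entry, are invented for illustration.

```python
import pandas as pd

# Toy data with one bad date and one missing count, for illustration only.
toy = pd.DataFrame(
    {
        "ReservationStatusDate": ["2015-07-01", "2015-07-02", "not a date"],
        "Children": [0.0, 1.0, None],
    }
)

# Unparseable strings become NaT instead of raising an error.
toy["ReservationStatusDate"] = pd.to_datetime(
    toy["ReservationStatusDate"], format="%Y-%m-%d", errors="coerce"
)
bad_dates = int(toy["ReservationStatusDate"].isna().sum())
print(bad_dates)  # 1

# Nullable Int64 keeps whole numbers while still allowing missing values.
toy["Children"] = toy["Children"].astype("Int64")
print(toy["Children"].dtype)  # Int64
```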

In [14]:
data.dtypes

IsCanceled                              int64
LeadTime                                int64
ArrivalDateYear                         int64
ArrivalDateMonth                       object
ArrivalDateWeekNumber                   int64
ArrivalDateDayOfMonth                   int64
StaysInWeekendNights                    int64
StaysInWeekNights                       int64
Adults                                  int64
Children                              float64
Babies                                  int64
Meal                                   object
Country                                object
MarketSegment                          object
DistributionChannel                    object
IsRepeatedGuest                         int64
PreviousCancellations                   int64
PreviousBookingsNotCanceled             int64
ReservedRoomType                       object
AssignedRoomType                       object
BookingChanges                          int64
DepositType                       

# Create Database and Add Data

## Define Custom Functions to Create DB Table

In [21]:
def map_dtype(dtype):
    """
    Map pandas dtype to DuckDB SQL type.

    Args:
        dtype (pandas.dtype): The dtype of the pandas DataFrame column.

    Returns:
        str: The corresponding DuckDB SQL type.
    """
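    if pd.api.types.is_integer_dtype(dtype):
        # Note: DuckDB INTEGER is 32-bit; BIGINT would match pandas int64 exactly
        return "INTEGER"
    elif pd.api.types.is_float_dtype(dtype):
        # Note: DuckDB FLOAT is single precision; DOUBLE would preserve float64 values
        return "FLOAT"
    elif pd.api.types.is_bool_dtype(dtype):
        return "BOOLEAN"
    elif pd.api.types.is_datetime64_any_dtype(dtype):
        # DATE keeps only the calendar date and drops any time component
        return "DATE"
    else:
        # Everything else (object/string columns) is stored as text
        return "VARCHAR"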

def generate_create_table_sql(data, table_name):
    """
    Generate a SQL CREATE TABLE statement based on the DataFrame schema.

    Args:
        data (pandas.DataFrame): The DataFrame to generate the table schema from.
        table_name (str): The name of the table to be created.

    Returns:
        str: The SQL CREATE TABLE statement.
    """
    columns = []
    for col, dtype in data.dtypes.items():
        duckdb_type = map_dtype(dtype)
        if col == 'UUID':  # Ensure UUID is UUID type
            duckdb_type = "UUID"
        columns.append(f'"{col}" {duckdb_type}')

    create_table_sql = f'CREATE TABLE "{table_name}" ({", ".join(columns)});'
    return create_table_sql

def insert_dataframe_to_duckdb(con, data, table_name):
    """
    Insert a DataFrame into a DuckDB table.

    Args:
        con (duckdb.DuckDBPyConnection): The DuckDB connection object.
        data (pandas.DataFrame): The DataFrame to insert.
        table_name (str): The name of the table to insert the data into.
    """
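    # Register the DataFrame under an explicit view name rather than relying on
    # DuckDB locating a Python variable that happens to be called "data"
    con.register("df_to_insert", data)
    con.execute(f'INSERT INTO "{table_name}" SELECT * FROM df_to_insert')
    con.unregister("df_to_insert")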
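For comparison, here is a sketch of a mapping that keeps the 64-bit widths pandas uses by default. `map_dtype_wide` is a hypothetical variant, not the function used in this notebook, and the toy frame exists only to show the generated statement.

```python
import pandas as pd
from pandas.api import types as ptypes


def map_dtype_wide(dtype):
    """Hypothetical variant of map_dtype that keeps 64-bit widths."""
    if ptypes.is_bool_dtype(dtype):
        return "BOOLEAN"
    if ptypes.is_integer_dtype(dtype):
        return "BIGINT"      # 64-bit integer
    if ptypes.is_float_dtype(dtype):
        return "DOUBLE"      # double-precision float
    if ptypes.is_datetime64_any_dtype(dtype):
        return "TIMESTAMP"   # keeps the time component, unlike DATE
    return "VARCHAR"


# Toy frame with one column per branch of interest; values are illustrative.
toy = pd.DataFrame(
    {
        "LeadTime": [342, 7],
        "ADR": [0.0, 75.0],
        "ReservationStatusDate": pd.to_datetime(["2015-07-01", "2015-07-02"]),
        "Meal": ["BB", "BB"],
    }
)

columns = [f'"{col}" {map_dtype_wide(dtype)}' for col, dtype in toy.dtypes.items()]
sql = f'CREATE TABLE "toy" ({", ".join(columns)});'
print(sql)
```

Whether the wider types are worth it depends on the data: the narrower ones save space, while the wider ones avoid any rounding or overflow surprises.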


## Create the Database

In [26]:
# Create a DuckDB connection and a new database file
con = duckdb.connect(database='../../data/reservation_data.duckdb', read_only=False)

# Generate the CREATE TABLE statement
table_name = 'reservations'
create_table_sql = generate_create_table_sql(data, table_name)
print("Generated CREATE TABLE SQL:")
print(create_table_sql)

# Execute the CREATE TABLE statement
con.execute(create_table_sql)

# Insert the dataframe into the DuckDB table
con.execute('PRAGMA enable_progress_bar')
insert_dataframe_to_duckdb(con, data, table_name)

# Close the connection
con.close()


Generated CREATE TABLE SQL:
CREATE TABLE "reservations" ("IsCanceled" INTEGER, "LeadTime" INTEGER, "ArrivalDateYear" INTEGER, "ArrivalDateMonth" VARCHAR, "ArrivalDateWeekNumber" INTEGER, "ArrivalDateDayOfMonth" INTEGER, "StaysInWeekendNights" INTEGER, "StaysInWeekNights" INTEGER, "Adults" INTEGER, "Children" FLOAT, "Babies" INTEGER, "Meal" VARCHAR, "Country" VARCHAR, "MarketSegment" VARCHAR, "DistributionChannel" VARCHAR, "IsRepeatedGuest" INTEGER, "PreviousCancellations" INTEGER, "PreviousBookingsNotCanceled" INTEGER, "ReservedRoomType" VARCHAR, "AssignedRoomType" VARCHAR, "BookingChanges" INTEGER, "DepositType" VARCHAR, "Agent" VARCHAR, "Company" VARCHAR, "DaysInWaitingList" INTEGER, "CustomerType" VARCHAR, "ADR" FLOAT, "RequiredCarParkingSpaces" INTEGER, "TotalOfSpecialRequests" INTEGER, "ReservationStatus" VARCHAR, "ReservationStatusDate" DATE, "HotelName" VARCHAR, "UUID" UUID);



In [30]:
with duckdb.connect(database='../../data/reservation_data.duckdb') as con:
    result = con.execute(f'SELECT * FROM "{table_name}" LIMIT 25').df()

In [31]:
result

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelName,UUID
0,0,342,2015,July,27,1,0,0,2,0.0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,H1,f7add1ce-8fdb-4351-a20a-3c5cc66fc87d
1,0,737,2015,July,27,1,0,0,2,0.0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,H1,990a199b-069b-4e78-9036-fe8ea73a1f3e
2,0,7,2015,July,27,1,0,1,1,0.0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,H1,91d6303b-16f8-41b2-a195-f6bcb9ee605a
3,0,13,2015,July,27,1,0,1,1,0.0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,H1,df309bdf-5fd8-4789-9b52-01de65e43a22
4,0,14,2015,July,27,1,0,2,2,0.0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,H1,007c51b0-4d0e-40ca-b550-34c93d2ca9a7
5,0,14,2015,July,27,1,0,2,2,0.0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,H1,1d051ece-d5bc-405f-a30f-e4d611c52456
6,0,0,2015,July,27,1,0,2,2,0.0,...,,0,Transient,107.0,0,0,Check-Out,2015-07-03,H1,8e597b85-39f4-4db5-9123-545c766846ca
7,0,9,2015,July,27,1,0,2,2,0.0,...,,0,Transient,103.0,0,1,Check-Out,2015-07-03,H1,d8c1a4c9-12fc-4ee8-b992-41b59fa3ebf4
8,1,85,2015,July,27,1,0,3,2,0.0,...,,0,Transient,82.0,0,1,Canceled,2015-05-06,H1,491a160c-1c37-40c6-a0bb-f33a272ba8d1
9,1,75,2015,July,27,1,0,3,2,0.0,...,,0,Transient,105.5,0,0,Canceled,2015-04-22,H1,cfb7cc07-e3b7-4ffc-bf50-830ddc282b38
