# Concatenating Data Sources

---

**Introduction**

To begin this analysis, I combined data from two separate CSV files, each containing the reservation records of a different hotel.

To distinguish the two datasets, I added a `HotelNumber` column to each ("H1" for the first hotel, "H2" for the second) before concatenating them into a single dataframe. The identifier keeps every row traceable to its source hotel, so reservation patterns can still be analyzed and compared across the two properties after concatenation.
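The tag-then-concatenate step can be sketched on toy data. The two small frames and their values below are made up for illustration and are not taken from the hotel files; only the `HotelNumber`, `IsCanceled`, and `LeadTime` column names come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the two hotel files (values are illustrative only).
toy_h1 = pd.DataFrame({"IsCanceled": [0, 1], "LeadTime": [342, 7]})
toy_h2 = pd.DataFrame({"IsCanceled": [0, 0, 1], "LeadTime": [23, 102, 34]})

# Tag each frame with its source hotel, then stack the rows.
toy = pd.concat(
    [toy_h1.assign(HotelNumber="H1"), toy_h2.assign(HotelNumber="H2")],
    ignore_index=True,
)

# With the identifier in place, a per-hotel comparison is a single groupby.
cancel_rate = toy.groupby("HotelNumber")["IsCanceled"].mean()
print(cancel_rate)
```

Without the identifier column, the rows from the two hotels would be indistinguishable once stacked, and a per-hotel comparison like this would not be possible.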

Working from one unified dataset also simplifies the subsequent data processing, feature engineering, and modeling steps, since each transformation is applied once rather than separately per file.

---

In [1]:
import pandas as pd

# Load and Concatenate Data

In [2]:
## Load each dataset, add a HotelNumber column identifying the source hotel, and parse the status date

df_h1 = pd.read_csv('../../data/source/H1.csv')
df_h1['HotelNumber'] = 'H1'
df_h1['HotelNumber'] = df_h1['HotelNumber'].astype('string')
df_h1['ReservationStatusDate'] = pd.to_datetime(df_h1['ReservationStatusDate'], format='%Y-%m-%d')

df_h2 = pd.read_csv('../../data/source/H2.csv')
df_h2['HotelNumber'] = 'H2'
df_h2['HotelNumber'] = df_h2['HotelNumber'].astype('string')
df_h2['ReservationStatusDate'] = pd.to_datetime(df_h2['ReservationStatusDate'], format='%Y-%m-%d')
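Since both files go through the same three steps, the loading logic could be factored into one helper. The sketch below is one possible shape, not the notebook's own code: `load_hotel` is a hypothetical name, and the in-memory CSV is a made-up two-row sample used only so the example runs without the real files.

```python
import io

import pandas as pd


def load_hotel(source, hotel_id):
    """Read one hotel's CSV, tag each row with hotel_id, and parse the status date.

    `load_hotel` is a hypothetical helper; `source` is anything
    `pd.read_csv` accepts (a path or a file-like object).
    """
    df = pd.read_csv(source)
    df["HotelNumber"] = hotel_id
    df["HotelNumber"] = df["HotelNumber"].astype("string")
    df["ReservationStatusDate"] = pd.to_datetime(
        df["ReservationStatusDate"], format="%Y-%m-%d"
    )
    return df


# Made-up sample standing in for a real hotel file.
sample_csv = io.StringIO(
    "IsCanceled,ReservationStatusDate\n"
    "0,2015-07-01\n"
    "1,2015-07-02\n"
)
sample = load_hotel(sample_csv, "H1")
print(sample.dtypes)
```

With such a helper, each hotel would be loaded with a single call, and any later change to the parsing would apply to both files at once.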

In [3]:
data = pd.concat([df_h1, df_h2], axis=0).reset_index(drop=True)
data

,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber
0,0,342,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
1,0,737,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
2,0,7,2015,July,27,1,0,1,1,0.0,...,,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
3,0,13,2015,July,27,1,0,1,1,0.0,...,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
4,0,14,2015,July,27,1,0,2,2,0.0,...,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03,H1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,23,2017,August,35,30,2,5,2,0.0,...,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06,H2
119386,0,102,2017,August,35,31,2,5,3,0.0,...,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07,H2
119387,0,34,2017,August,35,31,2,5,2,0.0,...,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07,H2
119388,0,109,2017,August,35,31,2,5,2,0.0,...,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07,H2
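A quick sanity check after concatenating is to confirm that no rows were lost or mislabeled and that the index is unique. The sketch below assumes a hypothetical helper named `check_concat` and runs on made-up toy frames. It also assumes every part is non-empty, because an empty part would not appear in `value_counts`.

```python
import pandas as pd


def check_concat(parts, combined, id_col="HotelNumber"):
    """Check row counts per source and index uniqueness after a concat.

    `parts` maps each identifier to its source frame. Hypothetical helper;
    raises AssertionError when something does not add up.
    """
    expected = {name: len(df) for name, df in parts.items()}
    actual = combined[id_col].value_counts().to_dict()
    assert len(combined) == sum(expected.values()), "row count mismatch"
    assert actual == expected, f"per-hotel counts differ: {actual} vs {expected}"
    assert combined.index.is_unique, "duplicate index labels; was the index reset?"
    return actual


# Toy frames standing in for df_h1 and df_h2 (values are illustrative).
part_a = pd.DataFrame({"LeadTime": [342, 7], "HotelNumber": ["H1", "H1"]})
part_b = pd.DataFrame({"LeadTime": [23, 102, 34], "HotelNumber": ["H2"] * 3})

combined = pd.concat([part_a, part_b], axis=0).reset_index(drop=True)
counts = check_concat({"H1": part_a, "H2": part_b}, combined)
print(counts)
```

The index check is the reason for `reset_index(drop=True)`: without it, each frame keeps its own 0-based labels and the combined index contains duplicates.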


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                       Non-Null Count   Dtype         
---  ------                       --------------   -----         
 0   IsCanceled                   119390 non-null  int64         
 1   LeadTime                     119390 non-null  int64         
 2   ArrivalDateYear              119390 non-null  int64         
 3   ArrivalDateMonth             119390 non-null  object        
 4   ArrivalDateWeekNumber        119390 non-null  int64         
 5   ArrivalDateDayOfMonth        119390 non-null  int64         
 6   StaysInWeekendNights         119390 non-null  int64         
 7   StaysInWeekNights            119390 non-null  int64         
 8   Adults                       119390 non-null  int64         
 9   Children                     119386 non-null  float64       
 10  Babies                       119390 non-null  int64         
 11  Meal                      
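In the `info()` output above, `Children` has 119386 non-null values out of 119390 entries, so 4 rows are missing a value. A compact way to list only the columns with gaps might look like the sketch below. `missing_summary` is a hypothetical helper, and the toy frame is made up for illustration.

```python
import pandas as pd


def missing_summary(df):
    """Return count and share of missing values, only for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with one gap in Children (values are illustrative).
toy = pd.DataFrame({"Adults": [2, 2, 1, 1], "Children": [0.0, None, 0.0, 1.0]})
report = missing_summary(toy)
print(report)
```

Applied to the concatenated dataframe, a summary like this would show which columns need imputation or row filtering before modeling.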

# Save Results

In [5]:
data.to_parquet('../../data/source/concatenated_data.parquet', compression='zstd')

# data.to_excel('../../data/source/concatenated_data.xlsx', index=False)

In [6]:
df_h1.to_parquet('../../data/source/data_h1.parquet', compression='zstd')
df_h2.to_parquet('../../data/source/data_h2.parquet', compression='zstd')