## Hotel Demand Forecasting - Data Preparation

## 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np

### 1.1. Reading in datasets

In [2]:
resort = pd.read_csv('H1_cleaned.csv')
hotel = pd.read_csv('H2_cleaned.csv')

## 2. Feature Engineering

### 2.1. Company variable - Assigning null as 0, non-null as 1

In [3]:
# Coerce Company to numeric; entries that cannot be parsed become NaN
resort['Company'] = pd.to_numeric(resort['Company'], errors='coerce')
# hasCompany: 1 when a company ID is present, 0 when it is missing
resort['hasCompany'] = np.where(resort['Company'].isnull(), 0, 1)

hotel['Company'] = pd.to_numeric(hotel['Company'], errors='coerce')
hotel['hasCompany'] = np.where(hotel['Company'].isnull(), 0, 1)

### 2.2. Agent variable - Assigning null as 0, non-null as 1

In [4]:
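# Coerce Agent to numeric; entries that cannot be parsed become NaN
resort['Agent'] = pd.to_numeric(resort['Agent'], errors='coerce')
# hasAgent: 1 when an agent ID is present, 0 when it is missing
resort['hasAgent'] = np.where(resort['Agent'].isnull(), 0, 1)

hotel['Agent'] = pd.to_numeric(hotel['Agent'], errors='coerce')
hotel['hasAgent'] = np.where(hotel['Agent'].isnull(), 0, 1)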
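
The Company and Agent flags follow the same coerce-then-flag pattern on both data frames, so the step could be wrapped in a small helper. The sketch below is only an illustration: `add_presence_flag` is a hypothetical name, the toy values are made up, and it assumes a missing ID shows up as a non-numeric string such as `'NULL'` or as an empty cell.

```python
import numpy as np
import pandas as pd


def add_presence_flag(df, column, flag_name):
    """Return a copy of df with a 0/1 flag marking rows where `column` holds a numeric ID."""
    out = df.copy()
    # Entries that cannot be parsed as numbers (assumed to mean "no ID") become NaN
    out[column] = pd.to_numeric(out[column], errors='coerce')
    # 1 when an ID is present, 0 when it is missing
    out[flag_name] = np.where(out[column].isnull(), 0, 1)
    return out


# Toy data standing in for the real bookings (values are made up)
toy = pd.DataFrame({'Company': ['110', 'NULL', '223', None]})
flagged = add_presence_flag(toy, 'Company', 'hasCompany')
print(flagged['hasCompany'].tolist())  # [1, 0, 1, 0]
```

Returning a copy keeps the input frame unchanged, which makes the helper safe to call on both `resort` and `hotel` for either column.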

### 2.3. Country variable - Assigning PRT (Portugal) as 1, Non-PRT countries as 0

Because both the resort and the hotel are located in Portugal, which is also the largest single source of bookings, a separate column is created to distinguish domestic from foreign bookings.

In [5]:
resort['isPRT'] = np.where(resort['Country'] == 'PRT', 1, 0)
hotel['isPRT'] = np.where(hotel['Country'] == 'PRT', 1, 0)
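
One consequence of this comparison is that a booking with a missing Country compares unequal to 'PRT' and is therefore flagged 0, the same as a foreign booking. The toy example below (made-up values, not the real data) illustrates this, along with an equivalent boolean-cast formulation.

```python
import numpy as np
import pandas as pd

# Made-up sample: two domestic bookings, one foreign, one with no country recorded
toy = pd.DataFrame({'Country': ['PRT', 'GBR', np.nan, 'PRT']})

# Same expression as in the notebook cell
toy['isPRT'] = np.where(toy['Country'] == 'PRT', 1, 0)

# Equivalent formulation: cast the boolean mask to integers
alt = (toy['Country'] == 'PRT').astype(int)

print(toy['isPRT'].tolist())  # [1, 0, 0, 1]
print(toy['isPRT'].mean())    # 0.5, the share of domestic bookings in the toy sample
```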

### 2.4. Dropping Unnecessary Features/Variables

- ArrivalDateYear - Excluded because year is not cyclical; even assuming a year-on-year increase, the data covers too few years to estimate such a trend reliably.
- ArrivalDateDayOfMonth - Excluded for insufficient data points as well.
- ReservationStatus - Strongly correlated with the dependent variable (IsCanceled): a cancelled booking simply appears as 'Canceled' in the reservation status, so the column adds no information beyond the target.
- ReservationStatusDate, Agent, Company, Country - Also dropped in the cell below; Agent, Company and Country are superseded by the hasAgent, hasCompany and isPRT indicators created above.
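
The claim about ReservationStatus can be checked with a cross-tabulation: if every status value maps to exactly one value of IsCanceled, the column reveals the target outright. The sketch below uses a made-up toy frame, and the status labels in it are assumptions for illustration rather than values taken from the notebook's output.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are made up)
toy = pd.DataFrame({
    'IsCanceled': [0, 1, 1, 0, 1],
    'ReservationStatus': ['Check-Out', 'Canceled', 'No-Show',
                          'Check-Out', 'Canceled'],
})

# Rows: status values; columns: target values; cells: booking counts
table = pd.crosstab(toy['ReservationStatus'], toy['IsCanceled'])
print(table)

# The feature leaks the target if each status occurs with only one target value
is_leaky = bool((table > 0).sum(axis=1).eq(1).all())
print(is_leaky)  # True
```

Running the same cross-tabulation on `resort` or `hotel` before the drop would show whether the real columns behave this way.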

In [6]:
resort_cleaned = resort.drop(columns=['ArrivalDateYear',
                                      'ArrivalDateDayOfMonth',
                                      'ReservationStatus',
                                      'ReservationStatusDate',
                                      'Agent', 'Company', 'Country'])

hotel_cleaned = hotel.drop(columns=['ArrivalDateYear',
                                    'ArrivalDateDayOfMonth',
                                    'ReservationStatus',
                                    'ReservationStatusDate',
                                    'Agent', 'Company', 'Country'])

Index(['IsCanceled', 'LeadTime', 'ArrivalDateYear', 'ArrivalDateMonth',
       'ArrivalDateWeekNumber', 'ArrivalDateDayOfMonth',
       'StaysInWeekendNights', 'StaysInWeekNights', 'Adults', 'Children',
       'Babies', 'Meal', 'Country', 'MarketSegment', 'DistributionChannel',
       'IsRepeatedGuest', 'PreviousCancellations',
       'PreviousBookingsNotCanceled', 'ReservedRoomType', 'AssignedRoomType',
       'BookingChanges', 'DepositType', 'Agent', 'Company',
       'DaysInWaitingList', 'CustomerType', 'ADR', 'RequiredCarParkingSpaces',
       'TotalOfSpecialRequests', 'ReservationStatus', 'ReservationStatusDate',
       'hasCompany', 'hasAgent', 'isPRT'],
      dtype='object')
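
Since both frames drop the same columns, the list could be defined once and the result verified afterwards. The following is a minimal sketch on an empty toy frame; `DROP_COLS` is a hypothetical name, and the toy frame carries only a handful of the columns listed above.

```python
import pandas as pd

# Hypothetical shared list of the columns removed in section 2.4
DROP_COLS = ['ArrivalDateYear', 'ArrivalDateDayOfMonth',
             'ReservationStatus', 'ReservationStatusDate',
             'Agent', 'Company', 'Country']

# Empty toy frame with the dropped columns plus a few that are kept
toy = pd.DataFrame(columns=DROP_COLS + ['IsCanceled', 'LeadTime',
                                        'hasCompany', 'hasAgent', 'isPRT'])

toy_cleaned = toy.drop(columns=DROP_COLS)

# Sanity check: none of the dropped columns survive
remaining = set(DROP_COLS) & set(toy_cleaned.columns)
print(remaining)                      # set()
print(toy_cleaned.columns.tolist())   # ['IsCanceled', 'LeadTime', 'hasCompany', 'hasAgent', 'isPRT']
```

By default `drop` raises a `KeyError` if any listed column is absent, which doubles as a check that both frames share the expected schema.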