<center><h1>Data Manipulation: Forming a Weekly Time Series<br><p style="font-size:8">(Data Manipulation and ARIMA Modeling with Pyramid)</p></h1></center>

# 0. Prelim

## 0.1 Packages

In [39]:
from pathlib import Path
import numpy as np
import pandas as pd

## 0.2 Paths

In [3]:
DATA_FOLDER = Path("../../../data")
RAW_DATA_FOLDER = DATA_FOLDER / "raw"

# 1. Extract Data

0       2015-07-01
1       2015-07-01
2       2015-07-01
3       2015-07-01
4       2015-07-01
           ...    
40055   2017-08-31
40056   2017-08-30
40057   2017-08-29
40058   2017-08-31
40059   2017-08-31
Length: 40060, dtype: datetime64[ns]

In [30]:
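# First attempt (fails): when pd.to_datetime receives a DataFrame, it assembles dates
# from numeric year/month/day components, so a month name such as "July" cannot be
# converted, and the `format` argument does not help here.
pd.to_datetime(pd.DataFrame(
    df_bookings[['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']].values.astype(str)
), format="%Y-%B-%d")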

ValueError: Unable to parse string "July" at position 0
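
One way to avoid the error above is to join the three components into a single string and parse that with an explicit format. The sketch below uses a small hypothetical frame (`toy`, with made-up rows) in place of `df_bookings`, and assumes an English locale so that `%B` matches month names such as "July".

```python
import pandas as pd

# Hypothetical stand-in for the three arrival-date columns (values are illustrative).
toy = pd.DataFrame({
    "ArrivalDateYear": [2015, 2016, 2017],
    "ArrivalDateMonth": ["July", "February", "August"],
    "ArrivalDateDayOfMonth": [1, 29, 31],
})

# Build strings such as "2015-July-1" from the three columns.
date_strings = (
    toy["ArrivalDateYear"].astype(str)
    + "-" + toy["ArrivalDateMonth"]
    + "-" + toy["ArrivalDateDayOfMonth"].astype(str)
)

# Parse with an explicit format: %Y = 4-digit year, %B = full month name, %d = day.
arrival_dates = pd.to_datetime(date_strings, format="%Y-%B-%d")
print(arrival_dates)
```

Passing a Series of strings, rather than a DataFrame, is what lets the `format` argument take effect.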

In [45]:
# A. Extract
df_bookings = pd.read_csv(RAW_DATA_FOLDER / "H1.csv", parse_dates=['ReservationStatusDate'])

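# B. Transform
# Join year, month name, and day into strings such as "2015-July-1", then parse them
# with an explicit format instead of relying on format inference.
df_bookings['ArrivalDate'] = pd.to_datetime(
    df_bookings[['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']]
        .astype(str)
        .agg("-".join, axis=1),
    format="%Y-%B-%d"
)
# The component columns are no longer needed once ArrivalDate exists.
df_bookings.drop(
    columns=['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth'],
    inplace=True
)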

# C. Display
print("Descriptive Statistics:")
print("-----------------------")
print(f"Shape : {df_bookings.shape}")
print("\n\n")
print(f"Columns:")
print(f"--------")
display(df_bookings.dtypes)
print("\n\n")
display(df_bookings)

Descriptive Statistics:
-----------------------
Shape : (40060, 29)



Columns:
--------


IsCanceled                              int64
LeadTime                                int64
ArrivalDateWeekNumber                   int64
StaysInWeekendNights                    int64
StaysInWeekNights                       int64
Adults                                  int64
Children                                int64
Babies                                  int64
Meal                                   object
Country                                object
MarketSegment                          object
DistributionChannel                    object
IsRepeatedGuest                         int64
PreviousCancellations                   int64
PreviousBookingsNotCanceled             int64
ReservedRoomType                       object
AssignedRoomType                       object
BookingChanges                          int64
DepositType                            object
Agent                                  object
Company                                object
DaysInWaitingList                 






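| | IsCanceled | LeadTime | ArrivalDateWeekNumber | StaysInWeekendNights | StaysInWeekNights | Adults | Children | Babies | Meal | Country | ... | Agent | Company | DaysInWaitingList | CustomerType | ADR | RequiredCarParkingSpaces | TotalOfSpecialRequests | ReservationStatus | ReservationStatusDate | ArrivalDate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 342 | 27 | 0 | 0 | 2 | 0 | 0 | BB | PRT | ... | | | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 | 2015-07-01 |
| 1 | 0 | 737 | 27 | 0 | 0 | 2 | 0 | 0 | BB | PRT | ... | | | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 | 2015-07-01 |
| 2 | 0 | 7 | 27 | 0 | 1 | 1 | 0 | 0 | BB | GBR | ... | | | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 | 2015-07-01 |
| 3 | 0 | 13 | 27 | 0 | 1 | 1 | 0 | 0 | BB | GBR | ... | 304 | | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 | 2015-07-01 |
| 4 | 0 | 14 | 27 | 0 | 2 | 2 | 0 | 0 | BB | GBR | ... | 240 | | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 2015-07-03 | 2015-07-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40055 | 0 | 212 | 35 | 2 | 8 | 2 | 1 | 0 | BB | GBR | ... | 143 | | 0 | Transient | 89.75 | 0 | 0 | Check-Out | 2017-09-10 | 2017-08-31 |
| 40056 | 0 | 169 | 35 | 2 | 9 | 2 | 0 | 0 | BB | IRL | ... | 250 | | 0 | Transient-Party | 202.27 | 0 | 1 | Check-Out | 2017-09-10 | 2017-08-30 |
| 40057 | 0 | 204 | 35 | 4 | 10 | 2 | 0 | 0 | BB | IRL | ... | 250 | | 0 | Transient | 153.57 | 0 | 3 | Check-Out | 2017-09-12 | 2017-08-29 |
| 40058 | 0 | 211 | 35 | 4 | 10 | 2 | 0 | 0 | HB | GBR | ... | 40 | | 0 | Contract | 112.80 | 0 | 1 | Check-Out | 2017-09-14 | 2017-08-31 |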


# 2. Weekly Aggregation

In [53]:
df_weekly_cancelations.columns

Index(['IsCanceled'], dtype='object')

In [59]:
# A. Create Aggregation
# Sum IsCanceled per week of ReservationStatusDate ("W" = weekly bins ending on Sunday).
df_weekly_cancelations = df_bookings.resample("W", on='ReservationStatusDate')['IsCanceled'].sum().to_frame()

# B. Display
print("Descriptive Statistics:")
print("-----------------------")
print(f"Shape : {df_weekly_cancelations.shape}")
print(f"Date Range: {df_weekly_cancelations.index.min():%Y-%m-%d} - {df_weekly_cancelations.index.max():%Y-%m-%d}")
print("\n\n")
print("Columns:")
print("--------")
display(df_weekly_cancelations.dtypes)
print("\n\n")
display(df_weekly_cancelations)

Descriptive Statistics:
-----------------------
Shape : (148, 1)
Date Range: 2014-11-23 - 2017-09-17



Columns:
--------


IsCanceled    int64
dtype: object






| ReservationStatusDate | IsCanceled |
| --- | --- |
| 2014-11-23 | 1 |
| 2014-11-30 | 0 |
| 2014-12-07 | 0 |
| 2014-12-14 | 0 |
| 2014-12-21 | 0 |
| ... | ... |
| 2017-08-20 | 32 |
| 2017-08-27 | 14 |
| 2017-09-03 | 4 |
| 2017-09-10 | 0 |
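
To make the weekly binning concrete, here is a minimal sketch on a hypothetical four-row frame (`toy`, with illustrative dates and values). It assumes pandas' default behaviour for the `"W"` alias: weeks end on Sunday and each bin is labelled with that Sunday, which is consistent with the index values shown above (for example 2014-11-23).

```python
import pandas as pd

# Hypothetical bookings: one row per reservation (dates and flags are illustrative).
toy = pd.DataFrame({
    "ReservationStatusDate": pd.to_datetime(
        ["2017-08-14", "2017-08-20", "2017-08-21", "2017-09-04"]
    ),
    "IsCanceled": [1, 1, 0, 1],
})

# Weekly bins keyed on the date column; each bin is labelled with its closing Sunday.
weekly = toy.resample("W", on="ReservationStatusDate")["IsCanceled"].sum()
print(weekly)
```

Here the Monday 2017-08-14 and Sunday 2017-08-20 rows land in the same bin, and the week ending 2017-09-03, which has no rows, is kept with a sum of 0. That is why the aggregated frame forms a regular weekly series even when some weeks contain no bookings.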
