# Case Study: Cleaning and Processing Hotel Booking Data

## Background
You are a data analyst at a hotel management company that wants to analyze booking data to understand cancellation patterns and customer preferences. The dataset, `hotel_bookings.csv`, contains hotel booking information such as customer name, booking date, room type, and cancellation status. However, it has problems such as missing values and inconsistently formatted data, particularly in the `customer_name` and `reservation_status` columns. Your task is to clean the data, join in additional information, and extract patterns with regular expressions to prepare the data for further analysis.

## Source Datasets
- **`hotel_bookings.csv`**: Contains columns such as `hotel`, `is_canceled`, `lead_time`, `arrival_date_year`, `arrival_date_month`, `arrival_date_week_number`, `arrival_date_day_of_month`, `stays_in_weekend_nights`, `stays_in_week_nights`, `adults`, `children`, `babies`, `meal`, `country`, `market_segment`, `distribution_channel`, `is_repeated_guest`, `previous_cancellations`, `previous_bookings_not_canceled`, `reserved_room_type`, `assigned_room_type`, `booking_changes`, `deposit_type`, `agent`, `company`, `days_in_waiting_list`, `customer_type`, `adr`, `required_car_parking_spaces`, `total_of_special_requests`, `reservation_status`, `reservation_status_date`.
- **`hotel_type_info.csv`**: A supplementary dataset describing each hotel type, with the columns `hotel` and `hotel_description`.

## Exercise Instructions
This exercise has two main parts: **Data Join and Validation** and **Data Cleaning and Regular Expressions**. Follow the steps below and complete the code provided.

## Part 1: Data Join and Validation


In [1]:
import pandas as pd

# Load dataset
hotel_df = pd.read_csv('hotel_bookings.csv')
hotel_type_info = pd.read_csv('hotel_type_info.csv')

In [5]:
hotel_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

In [6]:
hotel_type_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   hotel              2 non-null      object
 1   hotel_description  2 non-null      object
dtypes: object(2)
memory usage: 164.0+ bytes


In [7]:
# Step 1: Join the datasets
# Use pd.merge() to perform an inner join
# Write your code here
df_inner = pd.merge(hotel_df, hotel_type_info, on = 'hotel', how = 'inner')
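
An inner join silently drops any booking whose `hotel` value has no match in `hotel_type_info`, so it can be worth checking the match before relying on the result. The sketch below is one possible check, not part of the exercise. It uses small toy frames in place of the real files, and the `Airport Hotel` row is invented to show what an unmatched key looks like. It relies on the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for hotel_df and hotel_type_info (values are illustrative only)
bookings = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Airport Hotel"],
    "is_canceled": [0, 1, 0, 0],
})
type_info = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel"],
    "hotel_description": ["Beachfront Resort", "Urban Business Hotel"],
})

# validate="many_to_one" raises MergeError if `hotel` is duplicated in type_info;
# indicator=True adds a `_merge` column showing where each row found a match.
checked = pd.merge(
    bookings, type_info, on="hotel", how="left",
    validate="many_to_one", indicator=True,
)

# Rows marked "left_only" are the ones an inner join would drop
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["hotel"].tolist())
```

In this notebook, the inner join result has 119390 rows, the same as `hotel_df`, so no bookings were dropped.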

In [15]:
# Step 2: Validate the join result
# Check for missing values in the hotel_description column
# Write your code here
df_inner['hotel_description'].isna().sum()

np.int64(0)

In [28]:
# Step 3: Report the number of bookings per hotel type
# Write your code here
df_inner['hotel_description'].value_counts()

hotel_description
Urban Business Hotel    79330
Beachfront Resort       40060
Name: count, dtype: int64

## Part 2: Data Cleaning and Regular Expressions

In [29]:
import re

In [30]:
# Step 1: Handling missing values
# Identify missing values and impute them
# Write your code here
df_inner.isna().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

In [None]:
df_inner.shape[0]

119390
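
The cells above identify the missing values (4 in `children`, 488 in `country`, 16340 in `agent`) but do not impute them. The sketch below shows one way the imputation could look, using a toy frame rather than `df_inner`. The fill rules are assumptions chosen for illustration and are not prescribed by the exercise; a different rule (for example the median for `children`) may suit the analysis better.

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the kinds of gaps reported above (values are illustrative)
sample = pd.DataFrame({
    "children": [0.0, np.nan, 2.0],
    "country": ["PRT", None, "GBR"],
    "agent": [240.0, np.nan, 9.0],
})

# Assumed fill rules, not taken from the exercise:
fill_values = {
    "children": 0,         # assume a missing count means no children
    "country": "Unknown",  # keep the row but flag the unknown origin
    "agent": 0,            # assume 0 can stand for "no agent"
}
imputed = sample.fillna(value=fill_values)

# children holds whole numbers only, so it can be cast to int after filling
imputed["children"] = imputed["children"].astype(int)

# Total number of missing values left after imputation
print(imputed.isna().sum().sum())
```

Passing a dictionary to `fillna` keeps every rule in one place, which makes the assumptions easy to review and change.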

In [40]:
# Step 2: Extracting information with regex
# Extract the region from country
df_inner['country'].unique()

array(['PRT', 'GBR', 'USA', 'ESP', 'IRL', 'FRA', nan, 'ROU', 'NOR', 'OMN',
       'ARG', 'POL', 'DEU', 'BEL', 'CHE', 'CN', 'GRC', 'ITA', 'NLD',
       'DNK', 'RUS', 'SWE', 'AUS', 'EST', 'CZE', 'BRA', 'FIN', 'MOZ',
       'BWA', 'LUX', 'SVN', 'ALB', 'IND', 'CHN', 'MEX', 'MAR', 'UKR',
       'SMR', 'LVA', 'PRI', 'SRB', 'CHL', 'AUT', 'BLR', 'LTU', 'TUR',
       'ZAF', 'AGO', 'ISR', 'CYM', 'ZMB', 'CPV', 'ZWE', 'DZA', 'KOR',
       'CRI', 'HUN', 'ARE', 'TUN', 'JAM', 'HRV', 'HKG', 'IRN', 'GEO',
       'AND', 'GIB', 'URY', 'JEY', 'CAF', 'CYP', 'COL', 'GGY', 'KWT',
       'NGA', 'MDV', 'VEN', 'SVK', 'FJI', 'KAZ', 'PAK', 'IDN', 'LBN',
       'PHL', 'SEN', 'SYC', 'AZE', 'BHR', 'NZL', 'THA', 'DOM', 'MKD',
       'MYS', 'ARM', 'JPN', 'LKA', 'CUB', 'CMR', 'BIH', 'MUS', 'COM',
       'SUR', 'UGA', 'BGR', 'CIV', 'JOR', 'SYR', 'SGP', 'BDI', 'SAU',
       'VNM', 'PLW', 'QAT', 'EGY', 'PER', 'MLT', 'MWI', 'ECU', 'MDG',
       'ISL', 'UZB', 'NPL', 'BHS', 'MAC', 'TGO', 'TWN', 'DJI', 'STP',
       'KNA', 'E
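
The unique values include `'CN'` alongside three-letter codes such as `'CHN'`, so a format check may be useful before extracting anything from `country`. The sketch below uses the standard `re` module on a handful of codes copied from the output above. The rule "exactly three uppercase letters" is an assumption about the intended format, and `None` stands in for the missing value.

```python
import re

# A few codes copied from the unique() output above, plus a missing value
codes = ["PRT", "GBR", "CN", "CHN", None, "USA"]

# Assumption: a well-formed code is exactly three uppercase ASCII letters
ALPHA3 = re.compile(r"[A-Z]{3}")


def is_alpha3(code):
    """Return True only for strings made of exactly three uppercase letters."""
    return isinstance(code, str) and ALPHA3.fullmatch(code) is not None


# Codes that are present but do not follow the assumed format
malformed = [c for c in codes if c is not None and not is_alpha3(c)]
print(malformed)
```

`fullmatch` is used instead of `search` so that longer or shorter strings that merely contain three letters are not accepted.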

In [45]:
# Identify the patterns in reservation_status
# Write your code here
df_inner['reservation_status'].value_counts()


reservation_status
Check-Out    75166
Canceled     43017
No-Show       1207
Name: count, dtype: int64
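
All three statuses share one shape: a capitalised word, optionally followed by a hyphen and a second capitalised word. The sketch below expresses that shape as a regular expression and uses it together with a small normalising function. The messy input strings are invented for illustration, and `normalize_status` is a hypothetical helper, not part of the exercise.

```python
import re

# The three statuses reported by value_counts() above
statuses = ["Check-Out", "Canceled", "No-Show"]

# Assumed pattern: a capitalised word, optionally followed by "-" and another one
STATUS = re.compile(r"[A-Z][a-z]+(?:-[A-Z][a-z]+)?")


def normalize_status(raw):
    """Trim whitespace and capitalise each hyphen-separated part."""
    parts = re.split(r"\s*-\s*", raw.strip())
    return "-".join(p.capitalize() for p in parts)


# Hypothetical messy variants, invented to exercise the function
messy = [" check-out", "CANCELED", "no - show "]
cleaned = [normalize_status(s) for s in messy]
print(cleaned)
```

After normalising, `STATUS.fullmatch` can serve as a final check that every value follows the expected shape.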

In [52]:
# Filter bookings by country
# Country codes that start with P
# Write your code here
df_p_country = df_inner[df_inner['country'].fillna('').str.contains('^P', regex=True)]
df_p_country

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,hotel_description
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,,,0,Transient,0.00,0,0,Check-Out,01-07-15,Beachfront Resort
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,,,0,Transient,0.00,0,0,Check-Out,01-07-15,Beachfront Resort
6,Resort Hotel,0,0,2015,July,27,1,0,2,2,...,,,0,Transient,107.00,0,0,Check-Out,03-07-15,Beachfront Resort
7,Resort Hotel,0,9,2015,July,27,1,0,2,2,...,303.0,,0,Transient,103.00,0,1,Check-Out,03-07-15,Beachfront Resort
8,Resort Hotel,1,85,2015,July,27,1,0,3,2,...,240.0,,0,Transient,82.00,0,1,Canceled,06-05-15,Beachfront Resort
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119340,City Hotel,0,110,2017,August,35,29,0,5,2,...,14.0,,0,Transient,171.00,0,2,Check-Out,03-09-17,Urban Business Hotel
119351,City Hotel,0,72,2017,August,35,31,0,3,2,...,14.0,,0,Transient,134.82,0,1,Check-Out,03-09-17,Urban Business Hotel
119357,City Hotel,0,47,2017,August,35,31,1,3,1,...,423.0,,0,Transient,91.02,0,0,Check-Out,04-09-17,Urban Business Hotel
119366,City Hotel,0,210,2017,August,35,28,2,5,2,...,7.0,,0,Transient,85.59,0,1,Check-Out,04-09-17,Urban Business Hotel


In [54]:
# Country codes that contain neither R nor T
# Write your code here
# Note: with na=False, rows whose country is missing are kept by the negation
df_inner[~df_inner['country'].str.contains('R|T', na=False)]

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,hotel_description
12,Resort Hotel,0,68,2015,July,27,1,0,4,2,...,240.0,,0,Transient,97.00,0,3,Check-Out,05-07-15,Beachfront Resort
13,Resort Hotel,0,18,2015,July,27,1,0,4,2,...,241.0,,0,Transient,154.77,0,1,Check-Out,05-07-15,Beachfront Resort
30,Resort Hotel,0,118,2015,July,27,1,4,10,1,...,,,0,Transient,62.00,0,2,Check-Out,15-07-15,Beachfront Resort
36,Resort Hotel,0,15,2015,July,27,2,1,3,2,...,240.0,,0,Transient,98.00,0,0,Check-Out,06-07-15,Beachfront Resort
42,Resort Hotel,0,16,2015,July,27,2,2,3,2,...,,,0,Transient,123.00,0,0,Check-Out,07-07-15,Beachfront Resort
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119383,City Hotel,0,164,2017,August,35,31,2,4,2,...,42.0,,0,Transient,87.60,0,0,Check-Out,06-09-17,Urban Business Hotel
119384,City Hotel,0,21,2017,August,35,30,2,5,2,...,394.0,,0,Transient,96.14,0,2,Check-Out,06-09-17,Urban Business Hotel
119385,City Hotel,0,23,2017,August,35,30,2,5,2,...,394.0,,0,Transient,96.14,0,0,Check-Out,06-09-17,Urban Business Hotel
119387,City Hotel,0,34,2017,August,35,31,2,5,2,...,9.0,,0,Transient,157.71,0,4,Check-Out,07-09-17,Urban Business Hotel
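
The two filters above differ in how the pattern is positioned: `^P` is anchored to the start of the code, while `R|T` matches anywhere in it. The sketch below replays both patterns with the standard `re` module on a few codes taken from the earlier `unique()` output, with `None` standing in for a missing value. Unlike the second pandas filter, this sketch leaves missing codes out of both results, which is an assumption about the desired behaviour.

```python
import re

# Sample codes taken from the unique() output; None stands in for a missing value
codes = ["PRT", "POL", "ESP", "GBR", "USA", "CHN", None]

# "^P" anchors the match at the start, so "ESP" (P at the end) is not selected
starts_with_p = [c for c in codes if c is not None and re.search(r"^P", c)]

# "[RT]" matches R or T at any position, equivalent to the alternation "R|T"
without_r_or_t = [c for c in codes if c is not None and not re.search(r"[RT]", c)]

print(starts_with_p)
print(without_r_or_t)
```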


In [57]:
# Step 3: Final validation and saving the dataset
# Write your code here
# hotel_df.to_csv('hotel_bookings_cleaned.csv', index=False)
# print() each check so that all of them are displayed, not only the last expression
print(df_inner.isnull().sum())
print(df_inner.dtypes)
print(df_inner.head())
df_inner.to_csv('hotel_bookings_cleaned.csv', index=False)
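
A saved file can be validated by reading it back and comparing it with the frame in memory. The sketch below does this with a toy frame and a temporary directory, so it does not touch the real output file. It also parses `reservation_status_date` under the assumption that the values are day-month-two-digit-year, which is what the displayed rows (for example `15-07-15`) suggest but the source does not state.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_inner (values are illustrative only)
cleaned = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel"],
    "reservation_status": ["Check-Out", "Canceled"],
    "reservation_status_date": ["01-07-15", "06-05-15"],
})

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hotel_bookings_cleaned.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The saved file should have the same shape and column order as the frame in memory
same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)

# Assumption: dates are day-month-two-digit-year, as the displayed rows suggest
dates = pd.to_datetime(reloaded["reservation_status_date"], format="%d-%m-%y")
print(same_shape, same_columns, dates.dt.year.tolist())
```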