# Final Exam (UAS) Project: Streamlit
- **Name:** Syahrizal Yonanda Mahfiridho
- **Dataset:** https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand
- **Website URL:** [To be filled in once the Streamlit app is deployed]



## Defining the Business Questions

- How does the price vary by season?
- How does the nightly price vary from month to month?
- How does the nightly price vary by room type?
- What is the average price across market segments and room types?

## Import All Packages/Libraries Used

In [None]:
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns

# Default figure size for all seaborn/matplotlib plots
sns.set(rc={'figure.figsize': (15, 7.5)})

## Data Wrangling

### Gathering Data

In [None]:
df = pd.read_csv("hotel_bookings.csv")

In [None]:
pd.set_option('display.max_columns', None)
df

### Assessing Data

In [None]:
df[df.duplicated(keep=False)]

In [None]:
df.drop_duplicates(inplace=True)
df.info()

In [None]:
df.isnull().sum()

### Cleaning Data

In [None]:
df.drop(columns=['company', 'agent'], inplace=True)

In [None]:
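# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not update df
df["country"] = df["country"].fillna(df["country"].mode()[0])
df.describe()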

In [None]:
# "Undefined" is treated as "SC"; assign back instead of using inplace on a column
df["meal"] = df["meal"].replace("Undefined", "SC")

In [None]:
df["children"].value_counts()

In [None]:
df.sort_values('children',ascending = False)

In [None]:
df.drop(328, axis=0, inplace=True)

In [None]:
df["babies"].value_counts()

In [None]:
df.sort_values('babies',ascending = False)

In [None]:
df.drop([46619,78656], axis=0, inplace=True)

In [None]:
df["children"]=df["children"].fillna(0.0).astype(int)

In [None]:
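# Let pandas infer the date format: an explicit format that does not match the
# data, combined with errors='coerce', would silently turn every value into NaT
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], errors='coerce')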

In [None]:
df["adr"].value_counts(ascending=False)

In [None]:
adr_index = df.loc[df['adr'] == 0.00].index

In [None]:
df.drop(adr_index, axis=0, inplace=True)

In [None]:
df["distribution_channel"].value_counts()

In [None]:
distribution_channel_index = df.loc[df['distribution_channel'] == "Undefined"].index

In [None]:
df.drop(distribution_channel_index, axis=0, inplace=True)

In [None]:
df["reservation_status"].value_counts()

In [None]:
reservation_status_index = df.loc[df['reservation_status'] == "No-Show"].index

In [None]:
df.drop(reservation_status_index, axis=0, inplace=True)

In [None]:
lead_time_index = df.loc[df['lead_time'] == 0].index

In [None]:
df.drop(lead_time_index , axis=0,inplace = True)

In [None]:
sns.boxplot(x="adr",data=df)

In [None]:
df.sort_values('adr',ascending = False)

In [None]:
df.drop([48515,14969] , axis=0,inplace = True)

In [None]:
adult_index = df.loc[df['adults'] < 1].index
adult_index

In [None]:
df.drop(adult_index , axis=0,inplace = True)

In [None]:
df["required_car_parking_spaces"].value_counts()

In [None]:
df.sort_values("required_car_parking_spaces",ascending=False)

In [None]:
df.drop([29045,29046],axis=0,inplace=True)
df.describe()
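
The cleaning cells above remove outliers by hard-coded row labels (for example 328 or 46619), which stay valid only while the CSV and the earlier steps are unchanged. Below is a minimal sketch of the same idea expressed as boolean conditions. The frame `toy` is made-up data and `MAX_CHILDREN` is a hypothetical threshold chosen for illustration, not a value taken from this notebook.

```python
import pandas as pd

# Tiny synthetic stand-in for hotel_bookings.csv (all values are made up).
toy = pd.DataFrame({
    "adults": [2, 0, 2, 1],
    "children": [0, 0, 10, 1],
    "adr": [95.0, 80.0, 120.0, 0.0],
})

# Hypothetical cut-off, used only for this illustration.
MAX_CHILDREN = 5

# Keep rows that have at least one adult, a plausible number of children,
# and a non-zero average daily rate.
keep = (
    (toy["adults"] >= 1)
    & (toy["children"] <= MAX_CHILDREN)
    & (toy["adr"] != 0)
)
cleaned = toy.loc[keep].copy()
print(cleaned)
```

Because the filter describes the rows rather than naming them, it can be re-run unchanged if the dataset is refreshed.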

## Exploratory Data Analysis (EDA)

### Explore ...

In [None]:
#collect arrival date in one column
df['arrival_date'] = df[["arrival_date_year","arrival_date_month","arrival_date_day_of_month"]].apply(lambda x: '/'.join(x.dropna().astype(str)),axis=1)

In [None]:
#convert arrival date to date format
df['arrival_date'] = pd.to_datetime(df['arrival_date'], format='%Y/%B/%d', errors='coerce')
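
As an alternative to the row-wise `apply` above, the three date parts can be joined with vectorized string operations, which is usually faster on large frames. A small sketch on made-up rows, assuming the same column names as the dataset:

```python
import pandas as pd

# Made-up rows that mimic the three arrival date columns.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
})

# Build "YYYY/Month/D" strings column-wise instead of row by row.
date_text = (
    toy["arrival_date_year"].astype(str)
    + "/" + toy["arrival_date_month"]
    + "/" + toy["arrival_date_day_of_month"].astype(str)
)

# %B matches the full English month name.
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y/%B/%d")
print(toy["arrival_date"])
```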

In [None]:
df["Total_Guests"]=df["adults"]+df["children"]

In [None]:
total_guests_index = df.loc[df['Total_Guests'] == 0].index

In [None]:
df.drop(total_guests_index, axis=0, inplace=True)

In [None]:
def season(x):
    # Map a month name to its meteorological season (northern hemisphere)
    if x in ['December', 'January', 'February']:
        return "Winter"
    if x in ['March', 'April', 'May']:
        return "Spring"
    if x in ['June', 'July', 'August']:
        return "Summer"
    if x in ['September', 'October', 'November']:
        return "Autumn"

In [None]:
df['Seasons']=df['arrival_date_month'].apply(season)
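
The same month-to-season mapping can also be written as a dictionary lookup with `Series.map`. This is only a sketch on a made-up series; `SEASONS` and `SEASON_BY_MONTH` are names introduced here for illustration.

```python
import pandas as pd

# Season -> months, mirroring the grouping used in season().
SEASONS = {
    "Winter": ["December", "January", "February"],
    "Spring": ["March", "April", "May"],
    "Summer": ["June", "July", "August"],
    "Autumn": ["September", "October", "November"],
}

# Invert to month -> season so it can be used as a lookup table.
SEASON_BY_MONTH = {
    month: season_name
    for season_name, months in SEASONS.items()
    for month in months
}

months = pd.Series(["January", "July", "October", "April"])
seasons = months.map(SEASON_BY_MONTH)
print(seasons.tolist())
```

Month names that are missing from the dictionary map to `NaN`, which makes unexpected values easy to spot.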

In [None]:
df["Total_Days"]=df["stays_in_weekend_nights"]+df["stays_in_week_nights"]

## Visualization & Explanatory Analysis

In [None]:
resort_hotel_df = df.loc[(df["hotel"] == "Resort Hotel") & (df["is_canceled"] == 0)]
city_hotel_df =df.loc[(df["hotel"] == "City Hotel") & (df["is_canceled"] == 0)]

### Question 1: How does the price vary by season?

In [None]:
sns.barplot(data = city_hotel_df , x='Seasons' , y ='adr')
plt.title("Average room price per night by season (City Hotel)", fontsize=16)
plt.xlabel("Seasons", fontsize=16)
plt.ylabel("Price", fontsize=16)
plt.show()

In [None]:
sns.barplot(data = resort_hotel_df , x='Seasons' , y ='adr')
plt.title("Average room price per night by season (Resort Hotel)", fontsize=16)
plt.xlabel("Seasons", fontsize=16)
plt.ylabel("Price", fontsize=16)
plt.show()

### Question 2: How does the nightly price vary from month to month?

In [None]:
resort_month=resort_hotel_df.groupby(["arrival_date_month"])["adr"].mean().reset_index()
city_month=city_hotel_df.groupby(["arrival_date_month"])["adr"].mean().reset_index()
ordered_months = ["January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"]
resort_month.index = pd.CategoricalIndex(resort_month["arrival_date_month"],categories = ordered_months,ordered=True)
city_month.index = pd.CategoricalIndex(city_month["arrival_date_month"],categories = ordered_months,ordered=True)
resort_month = resort_month.sort_index()
city_month = city_month.sort_index()
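
A shorter route to calendar-ordered months is to make the month column itself an ordered categorical before grouping, so the grouped result already comes out in calendar order. A minimal sketch on made-up prices, assuming the same column names as above:

```python
import pandas as pd

ordered_months = ["January", "February", "March", "April", "May", "June",
                  "July", "August", "September", "October", "November", "December"]

# Made-up bookings; only the column names follow the dataset.
toy = pd.DataFrame({
    "arrival_date_month": ["March", "January", "March", "February", "January"],
    "adr": [100.0, 60.0, 120.0, 70.0, 80.0],
})

# An ordered categorical makes groupby sort by calendar order, not alphabetically.
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=ordered_months, ordered=True
)

# observed=True keeps only the months that actually occur in the data.
monthly = (
    toy.groupby("arrival_date_month", observed=True)["adr"]
    .mean()
    .reset_index()
)
print(monthly)
```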

In [None]:
sns.barplot(data = city_month , x='arrival_date_month' , y ='adr')
plt.xlabel("Month in city", fontsize=16)
plt.ylabel("Price", fontsize=16)
plt.show()

In [None]:
sns.barplot(data = resort_month , x='arrival_date_month' , y ='adr')
plt.xlabel("Month in resort", fontsize=16)
plt.ylabel("Price", fontsize=16)
plt.show()

### Question 3: How does the nightly price vary by room type?

In [None]:
resort_room=resort_hotel_df.groupby(["reserved_room_type"])["adr"].mean().reset_index()
city_room=city_hotel_df.groupby(["reserved_room_type"])["adr"].mean().reset_index()
ordered_room = ["A","B","C","D","E","F","G","H"]
resort_room.index = pd.CategoricalIndex(resort_room["reserved_room_type"],categories = ordered_room,ordered=True)
city_room.index = pd.CategoricalIndex(city_room["reserved_room_type"],categories = ordered_room,ordered=True)
resort_room = resort_room.sort_index()
city_room = city_room.sort_index()

In [None]:
sns.barplot(data = city_room , x='reserved_room_type' , y ='adr')
plt.xlabel("room in city", fontsize=16)
plt.ylabel("Price", fontsize=16)
plt.show()

In [None]:
sns.barplot(data = resort_room , x='reserved_room_type' , y ='adr')
plt.xlabel("room in resort", fontsize=16)
plt.ylabel("Price", fontsize=16)
plt.show()

### Question 4: Average price by market segment and room type

In [None]:
sns.barplot(data = city_hotel_df , x='market_segment' , y ='adr',hue="reserved_room_type")
plt.xlabel("Market segment for city", fontsize=16)
plt.ylabel("Price", fontsize=16)
plt.show()

In [None]:
sns.barplot(data = resort_hotel_df , x='market_segment' , y ='adr',hue="reserved_room_type")
plt.xlabel("Market segment for resort", fontsize=16)
plt.ylabel("Price", fontsize=16)
plt.show()

## Building the Model

In [None]:
df['reservation_status'] = df['reservation_status'].astype('category')
y = df['reservation_status'].cat.codes
y_names = list(df['reservation_status'].cat.categories)
X = df.drop(columns=['reservation_status'])

In [None]:
X.drop(["country","is_canceled",'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number','arrival_date_day_of_month',"meal","assigned_room_type","adr","required_car_parking_spaces","reservation_status_date","adults","children","babies","days_in_waiting_list","arrival_date","Total_Days"],axis=1,inplace=True)

In [None]:
X = pd.get_dummies(X, drop_first=True)

### Training Model

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1,stratify=y)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

### SVM


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

start_time = datetime.now()
svm = SVC(kernel='rbf')
svm.fit(x_train_scaled, y_train)
y_pred = svm.predict(x_test_scaled)
# classification_report expects (y_true, y_pred), in that order
print(classification_report(y_train, svm.predict(x_train_scaled)))  # training set
print(classification_report(y_test, y_pred))  # test set
end_time = datetime.now()
estimated = end_time - start_time
print(estimated)
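
scikit-learn's metric functions take the true labels first and the predictions second. A small sketch with made-up labels shows why the order matters: swapping the two arguments swaps precision and recall.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 3 actual positives, 2 predicted positives, 1 of them correct.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# Correct order: (y_true, y_pred)
correct_precision = precision_score(y_true, y_pred)
correct_recall = recall_score(y_true, y_pred)

# Swapped order: the roles of "truth" and "prediction" are exchanged.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(correct_precision, correct_recall)
print(swapped_precision, swapped_recall)
```

Accuracy is symmetric in its two arguments, so the swap is easy to miss when only the accuracy row of the report is checked.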

### Model Evaluation

In [None]:
start_time = datetime.now()
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
kfold =KFold(n_splits=3)
scores = cross_validate(svm,x_train_scaled,y_train,cv=kfold)
end_time = datetime.now()
estimated = end_time-start_time
print(estimated)

In [None]:
scores
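
Plain `KFold` without shuffling splits the rows in their stored order and does not preserve the class ratio in each fold. One possible variation, sketched here on synthetic data from `make_classification` rather than on the booking data, is `StratifiedKFold` with shuffling. The scaler sits inside the pipeline so that each fold fits it only on that fold's training part.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced two-class data (a stand-in, not the hotel dataset).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, weights=[0.7, 0.3], random_state=1
)

# Scaling happens inside each fold, so no statistics leak from validation rows.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Each fold keeps roughly the same class ratio as the full data.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
scores_toy = cross_validate(model, X_toy, y_toy, cv=cv, scoring="accuracy")
print(scores_toy["test_score"].mean())
```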

In [None]:
start_time = datetime.now()
from sklearn.model_selection import GridSearchCV
param = {'C': [1, 2, 3], 'kernel': ['linear', 'rbf']}
svm = SVC()
grid_search =GridSearchCV(svm,param_grid=param,scoring='accuracy',cv = 3)
grid_search.fit(x_train_scaled,y_train)
end_time = datetime.now()
estimated = end_time-start_time
print(estimated)

In [None]:
grid_search.best_params_

In [None]:
svm = SVC(C=3,kernel ='rbf')
svm.fit(x_train_scaled,y_train)
y_pred = svm.predict(x_test_scaled)
print(classification_report(y_test,y_pred))

### Saving the Model

In [None]:
X = df.drop(columns=['reservation_status'])

In [None]:
X.drop(["country","is_canceled",'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number','arrival_date_day_of_month',"meal","assigned_room_type","adr","required_car_parking_spaces","reservation_status_date","adults","children","babies","days_in_waiting_list","arrival_date"],axis=1,inplace=True)

In [None]:
y = df['reservation_status'].cat.codes

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1,stratify=y)

In [None]:
numeric_columns = x_train.select_dtypes(exclude='object').columns

In [None]:
cat_columns = x_train.select_dtypes(include='object').columns

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
                               ('scaler', StandardScaler(with_mean=False))])
# handle_unknown='ignore' keeps predict() from failing on categories unseen during training
cat_transformer = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                            ('onehot', OneHotEncoder(handle_unknown='ignore')),
                            ('scaler', StandardScaler(with_mean=False))])

In [None]:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
     transformers=[
          ('num', numeric_transformer, numeric_columns),
          ('cat', cat_transformer, cat_columns)])

In [None]:
from sklearn.svm import SVC
pipe = Pipeline([('processing',preprocessor),('model',SVC(C=3,kernel ='rbf'))])
pipe.fit(x_train,y_train)

In [None]:
preprocessor

In [None]:
y_pred = pipe.predict(x_test)
print(classification_report(y_test,y_pred))

In [None]:
import joblib
save =  joblib.dump(pipe,'Hotel Prediction model.pkl')
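
A saved pipeline can be restored with `joblib.load` and then accepts raw, unprocessed rows, which is how an app such as the planned Streamlit site would typically use it. The sketch below runs the full round trip on a tiny made-up pipeline; `toy_pipe`, the two columns, and the file name `toy_model.pkl` are assumptions for illustration and are not the model trained above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Made-up training rows with one numeric and one categorical feature.
toy = pd.DataFrame({
    "lead_time": [10, 200, 35, 150, 5, 300],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel",
              "Resort Hotel", "City Hotel", "City Hotel"],
})
labels = [1, 0, 1, 0, 1, 0]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["lead_time"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["hotel"]),
])
toy_pipe = Pipeline([("processing", pre), ("model", SVC(C=3, kernel="rbf"))])
toy_pipe.fit(toy, labels)

# Save to a temporary file, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# The restored pipeline applies the preprocessing itself before predicting.
new_booking = pd.DataFrame({"lead_time": [20], "hotel": ["City Hotel"]})
prediction = restored.predict(new_booking)
print(prediction)
```

Loading requires a scikit-learn version compatible with the one used for saving, so it helps to pin the version in the app's requirements.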

## Conclusion

- Conclusion for question 1:
    - Both the City and Resort hotels are most expensive in summer, with an average price above 120
    - In the other three seasons, the Resort hotel is considerably cheaper than the City hotel

- Conclusion for question 2:
    - The Resort hotel's price peaks in August, far above every other month
    - City hotel prices are nearly the same from month to month, but relatively high compared with the Resort hotel

- Conclusion for question 3:
    - In both the City and Resort hotels, room type G has the highest price

- Conclusion for question 4:
    - Direct bookings are very prominent in both the City and Resort hotels
    - Complementary bookings are very few