# Machine Learning Model

**Preprocessing**
- handling missing values
- encoding categorical features
- splitting the data into training and testing sets
- scaling numerical features




**Model Training**
- Train both Random Forest and XGBoost models
- Compare the accuracy of the models

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import pickle

#load data
df = pd.read_csv("Dataset_B_hotel.csv")

df.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2,0,1,2,Meal Plan 1,0.0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,INN00002,2,0,2,3,Not Selected,0.0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1,0,2,1,Meal Plan 1,0.0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.0,0,Canceled
3,INN00004,2,0,0,2,Meal Plan 1,0.0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.0,0,Canceled
4,INN00005,2,0,1,1,Not Selected,0.0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.5,0,Canceled


**handling missing values**: done so that missing values in the dataset do not cause errors or make the results less accurate when training the model.
- handling missing values for numeric columns = missing values in numeric columns are filled with that column's median. The median is used because it is more robust to outliers than the mean.
- handling missing values for categorical columns = missing values in categorical columns are filled with that column's mode (the most frequent value).
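
To make the median-versus-mean point concrete, here is a minimal sketch on a made-up five-row frame. The column names `price` and `meal` and all values are hypothetical and are not taken from `Dataset_B_hotel.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); 1000.0 plays the role of an outlier.
toy = pd.DataFrame({
    "price": [60.0, 65.0, 70.0, np.nan, 1000.0],
    "meal": ["Plan 1", "Plan 1", None, "Plan 2", "Plan 1"],
})

price_mean = toy["price"].mean()      # pulled upward by the outlier
price_median = toy["price"].median()  # stays near the typical prices

# Fill numeric gaps with the median and categorical gaps with the mode.
toy["price"] = toy["price"].fillna(price_median)
toy["meal"] = toy["meal"].fillna(toy["meal"].mode()[0])

print(price_mean, price_median)
print(toy)
```

Filling with the mean here would insert a price far above every ordinary row, while the median keeps the filled value in the usual range.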

**encoding categorical features**: categorical columns are converted to numeric form with `LabelEncoder` (LabelEncoder: maps each unique category in a column to an integer). Why is encoding needed? Because the Random Forest and XGBoost models, as used here, process numeric data.
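
As a small illustration of what `LabelEncoder` does, the sketch below encodes a short hypothetical list of meal-plan labels and recovers the category-to-integer mapping from `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, shaped like the type_of_meal_plan column.
meal_plans = ["Meal Plan 1", "Not Selected", "Meal Plan 1", "Meal Plan 2"]

encoder = LabelEncoder()
codes = encoder.fit_transform(meal_plans)

# classes_ holds the sorted unique categories; a category's code is its index.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(codes)

print(codes)
print(mapping)
```

The notebook refits one `label_encoder` for each column, so after the last call it only holds the classes of `booking_status`. Keeping one encoder per column would be one way to preserve every mapping for decoding later.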

**splitting the data into training and testing sets**: the data is split into two parts.
- training set (train): the data used to train the model, typically 80% of the dataset.
- testing set (test): the data used to test the model's performance after training, typically 20% of the dataset.
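
The 80/20 split can be sketched on a tiny made-up array. The names `X_toy` and `y_toy` are hypothetical; `test_size=0.2` and `random_state=42` mirror the values used in the notebook's own cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
y_toy = np.array([0, 1] * 5)          # 10 labels

# 20% of 10 rows goes to the test set; the fixed seed makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

Because `random_state` is fixed, rerunning the cell gives the same rows in each set, which keeps the accuracy comparison between models reproducible.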

**scaling numerical features**: `StandardScaler` is used to put the numeric features on a common scale, so the model is not affected by differences in scale between features.
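
A minimal sketch of the scaling step on made-up numbers follows. The arrays and the name `toy_scaler` are hypothetical. It shows the pattern the notebook follows: fit on the training rows only, then reuse the same statistics for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values).
X_train_toy = np.array([[0.0, 100.0], [1.0, 200.0], [2.0, 300.0]])
X_test_toy = np.array([[1.0, 400.0]])

toy_scaler = StandardScaler()
X_train_s = toy_scaler.fit_transform(X_train_toy)  # learns mean and std from training rows
X_test_s = toy_scaler.transform(X_test_toy)        # applies the training mean and std

print(toy_scaler.mean_)
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```

After scaling, each training column has mean 0 and standard deviation 1. The test row is shifted and scaled with the training statistics, so no information from the test set leaks into preprocessing.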

In [4]:
#handling missing values
#handling missing values for numeric columns
df.fillna(df.select_dtypes(include=['float64', 'int64']).median(), inplace=True)

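#handling missing values for categorical columns
#assign the result back: fillna(inplace=True) on df[col] raises a FutureWarning and stops working in pandas 3.0
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna(df[col].mode()[0])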

#encode categorical columns
label_encoder = LabelEncoder()

#encode 'type_of_meal_plan', 'room_type_reserved', 'market_segment_type', and 'booking_status'
df['type_of_meal_plan'] = label_encoder.fit_transform(df['type_of_meal_plan'])
df['room_type_reserved'] = label_encoder.fit_transform(df['room_type_reserved'])
df['market_segment_type'] = label_encoder.fit_transform(df['market_segment_type'])
df['booking_status'] = label_encoder.fit_transform(df['booking_status'])

#split the data into features (X) and target (y)
X = df.drop('booking_status', axis=1)  #drop target variable
y = df['booking_status']  #target variable

#remove any non-numeric columns like 'Booking_ID'
X = X.select_dtypes(include=[np.number])  #selects only numeric columns

#split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#scale numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#train the models (Random Forest and XGBoost)
#train Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

#train XGBoost Classifier
xgb_model = XGBClassifier(n_estimators=100, random_state=42)
xgb_model.fit(X_train_scaled, y_train)

#evaluate models
rf_pred = rf_model.predict(X_test_scaled)
xgb_pred = xgb_model.predict(X_test_scaled)

rf_accuracy = accuracy_score(y_test, rf_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)

print(f"Random Forest Accuracy: {rf_accuracy}")
print(f"XGBoost Accuracy: {xgb_accuracy}")


Random Forest Accuracy: 0.9026878015161958
XGBoost Accuracy: 0.8898690558235699


Accuracy results for the two models above, Random Forest and XGBoost:
- **Random Forest Accuracy: 0.9026878015161958** -> Random Forest has an accuracy of about 90.27%, meaning about 90.27% of its predictions match the actual values in the test data.
- **XGBoost Accuracy: 0.8898690558235699** -> XGBoost has an accuracy of about 88.99%. Comparing the accuracy of the two models, Random Forest predicts the test data better than XGBoost.

=> so the best model by accuracy is ***Random Forest***.
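
For reference, accuracy is the fraction of test rows whose predicted label equals the actual label. The sketch below checks this by hand on five hypothetical labels and compares the result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions match the actual values.
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

manual_accuracy = (y_true == y_hat).mean()       # correct predictions / total
sklearn_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, sklearn_accuracy)
```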


**save model using pickle** -> the best model, i.e. the one with the higher accuracy

In [5]:
#save the best model (Pickle)
if rf_accuracy > xgb_accuracy:
    best_model = rf_model
    print("Random Forest is the best model.")
else:
    best_model = xgb_model
    print("XGBoost is the best model.")

#save the best model to a file
with open('best_model.pkl', 'wb') as file:
    pickle.dump(best_model, file)

print("Best model saved successfully.")

Random Forest is the best model.
Best model saved successfully.


In [6]:
#load the best model from the file
with open('best_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
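
The models above were trained on `X_train_scaled`, so a reloaded model expects input transformed by the same fitted `StandardScaler`, while the notebook pickles only the model. One possible approach, sketched here on synthetic data, is to save the scaler together with the model. The file name `model_and_scaler.pkl`, the dictionary keys, and the `demo_*` names are assumptions for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the hotel data (hypothetical).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42)
demo_model.fit(demo_scaler.transform(X_demo), y_demo)

# Save the model and the fitted scaler in one file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "model_and_scaler.pkl")
with open(path, "wb") as f:
    pickle.dump({"model": demo_model, "scaler": demo_scaler}, f)

# Reload and predict: scale new rows with the saved scaler first.
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_rows = X_demo[:5]
restored_pred = bundle["model"].predict(bundle["scaler"].transform(new_rows))
original_pred = demo_model.predict(demo_scaler.transform(new_rows))

print(restored_pred)
```

The reloaded pair gives the same predictions as the in-memory objects, which is a quick sanity check after loading a pickle file.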