# Data Analytics Competition Find IT UGM - Hotel Price Prediction

## Team Oh Data Euy:
- Gerend Christopher 
- Felix Fernando 
- Jeremy

## Problem:
Kyozo, one of the world's hotel chains, needs the help of data consultants to build a price prediction model for the development of its new hotels. As a data consultant, you are given a dataset covering the thousands of hotels the chain currently owns.

However, Kyozo's team is not experienced with data. You receive the dataset as-is, and you must match the columns and values provided to a set of example hotels whose prices are to be predicted.

## Goal:
Build a model that predicts the development price of new hotels with good performance (measured by Mean Absolute Error).
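
Mean Absolute Error (MAE) is the average of the absolute differences between actual and predicted values, MAE = (1/n) Σ |y_i − ŷ_i|. The snippet below is a minimal sketch of that formula; the function name and the sample prices are hypothetical and are not taken from the competition data. In practice, `sklearn.metrics.mean_absolute_error` computes the same quantity.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical prices, for illustration only
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 230.0]

mae = mean_absolute_error(actual, predicted)
print(mae)
```

Because MAE is expressed in the same unit as the target, a lower value means the predicted prices are, on average, closer to the actual prices.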


# Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm import tqdm
from IPython.display import display

# Model Library
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
import optuna

from catboost import CatBoostClassifier, Pool, cv

import lightgbm as lgb

import xgboost as xgb

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from category_encoders import OrdinalEncoder as oe

# custom plot seaborn
plt.rcParams["figure.figsize"] = (8,6)
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params, palette='tab10')

np.random.seed(10)
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# Data Preparation + Cleansing

### Features

`facilities` - facilities provided by the hotel

`rating` - rating given by guests

`location` - city where the hotel is located

### Target variable:
`Price` - hotel development price
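
The sample rows displayed in this notebook show `rating` values such as `7.8 Very GoodFrom 10 reviews`, where a score, a text label, and a review count are fused into one string. The sketch below shows one possible way to split such strings with `Series.str.extract`. The regular expression and the output column names (`rating_score`, `rating_label`, `review_count`) are assumptions inferred from those sample rows, not the authors' method, and other formats may exist in the full dataset.

```python
import pandas as pd

# Pattern inferred from sample rows such as "7.8 Very GoodFrom 10 reviews" (assumption)
RATING_REGEX = (
    r"^\s*(?P<rating_score>\d+(?:\.\d+)?)\s*"
    r"(?P<rating_label>.*?)\s*"
    r"From\s+(?P<review_count>\d+)\s+reviews?\s*$"
)

# Two strings copied from the displayed rows, plus one missing value
ratings = pd.Series(
    ["7.8 Very GoodFrom 10 reviews", "5.6 GoodFrom 4 reviews", None]
)

# Rows that are missing or do not match the pattern become NaN in every column
parsed = ratings.str.extract(RATING_REGEX)
parsed["rating_score"] = pd.to_numeric(parsed["rating_score"])
parsed["review_count"] = pd.to_numeric(parsed["review_count"])

print(parsed)
```

Rows that fail to match stay as NaN, so unexpected formats can be counted afterwards instead of being silently misparsed.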

### Load Data

In [6]:
df_features = pd.read_csv('data/train_features.csv') # Read the training features
df_labels = pd.read_csv('data/train_label.csv') # Read the training labels

df_test_features = pd.read_csv('data/test_feature.csv', index_col="ID") # Read the test features

In [7]:
df_labels.columns # List the columns of df_labels

Index(['Price'], dtype='object')

In [8]:
df_features.columns # List the columns of df_features

Index(['facilities', 'rating', 'location'], dtype='object')

In [9]:
df_test_features.columns # List the columns of df_test_features

Index(['facilities', 'rating', 'location'], dtype='object')
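
The training features and the `Price` labels are loaded from separate files, so they need to be combined before modeling. The sketch below joins two frames column-wise by position. It assumes that row i of the features corresponds to row i of the labels, which the source does not state explicitly. The helper name `combine_features_labels` and the price values are hypothetical.

```python
import pandas as pd


def combine_features_labels(features, labels):
    """Join features and labels by row position (assumes identical row order)."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    return pd.concat(
        [features.reset_index(drop=True), labels.reset_index(drop=True)],
        axis=1,
    )


# Tiny stand-ins for df_features and df_labels; the prices are made up
features = pd.DataFrame(
    {
        "facilities": ["BARRestaurant", "InternetRestaurant"],
        "rating": ["7.3 Very GoodFrom 6 reviews", "7.2 Very GoodFrom 30 reviews"],
        "location": ["Uberlandia", "Stokol"],
    }
)
labels = pd.DataFrame({"Price": [100, 200]})

train = combine_features_labels(features, labels)
print(train.columns.tolist())
```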

# Data Cleansing

In [11]:
df_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3066 entries, 0 to 3065
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   facilities  2765 non-null   object
 1   rating      2429 non-null   object
 2   location    3066 non-null   object
dtypes: object(3)
memory usage: 72.0+ KB


In [16]:
df_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3066 entries, 0 to 3065
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Price   3066 non-null   object
dtypes: object(1)
memory usage: 24.1+ KB
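
`Price` is stored with dtype `object` rather than as a number, which suggests that at least some entries are strings. Before computing an error metric, it can help to find out which entries do not parse as numbers. The sketch below is one way to list them; the helper name and the sample values are hypothetical, since the actual format of `Price` is not shown here.

```python
import pandas as pd


def non_numeric_entries(column):
    """Return the non-null values that pandas cannot parse as numbers."""
    converted = pd.to_numeric(column, errors="coerce")
    return column[converted.isna() & column.notna()]


# Hypothetical values, for illustration only
sample = pd.Series(["120", "95.5", "1,200", None], name="Price")

bad = non_numeric_entries(sample)
print(bad.tolist())
```

Inspecting the returned values first makes it possible to choose a cleaning rule based on the formats that actually occur, rather than guessing.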


In [14]:
df_test_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 767 entries, 0 to 766
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   facilities  767 non-null    object
 1   rating      767 non-null    object
 2   location    767 non-null    object
dtypes: object(3)
memory usage: 24.0+ KB


In [10]:
df_features.isna().sum() # Count null values per column in df_features

facilities    301
rating        637
location        0
dtype: int64
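
Besides the 301 missing values, the `facilities` strings shown in the sample rows concatenate amenity names without separators and with inconsistent casing and spelling (for example `RestaurantBARSwimmingPools` and `intrnetRestaurantgym`). The sketch below turns such strings into binary flags using case-insensitive keyword matching. The keyword list, the flag names, and the choice to treat a missing value as "no amenities" are all assumptions made for illustration, not the authors' approach.

```python
import pandas as pd

# Keywords inferred from the displayed sample rows (assumption).
# "inte?rnet" also matches the misspelling "intrnet".
FACILITY_PATTERNS = {
    "has_restaurant": r"restaurant",
    "has_bar": r"bar",
    "has_gym": r"gym",
    "has_pool": r"pool",
    "has_internet": r"inte?rnet",
}


def facility_flags(facilities):
    """Build one 0/1 column per keyword; missing values yield all zeros."""
    text = facilities.fillna("")
    flags = {
        name: text.str.contains(pattern, case=False, regex=True)
        for name, pattern in FACILITY_PATTERNS.items()
    }
    return pd.DataFrame(flags).astype(int)


sample = pd.Series(["RestaurantBARSwimmingPools", "intrnetRestaurantgym", None])
flags = facility_flags(sample)
print(flags)
```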

In [13]:
df_labels.isna().sum() # Count null values per column in df_labels

Price    0
dtype: int64

In [17]:
df_test_features.isna().sum() # Count null values per column in df_test_features

facilities    0
rating        0
location      0
dtype: int64
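
The null counts are easier to compare as a share of the rows. The sketch below redoes that arithmetic from the numbers reported above (3066 training rows, 301 missing `facilities`, 637 missing `rating`). On the live DataFrame, `df_features.isna().mean() * 100` should give the same percentages.

```python
import pandas as pd

n_rows = 3066  # number of rows reported by df_features.info()
null_counts = pd.Series({"facilities": 301, "rating": 637, "location": 0})

# Percentage of training rows with a missing value, per column
missing_share = (null_counts / n_rows * 100).round(1)
print(missing_share)
```

The training features contain missing values while the test features contain none, so any handling of missing values only needs to be fitted on the training set.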

In [24]:
df_features

Unnamed: 0,facilities,rating,location
0,RestaurantBARSwimmingPools,7.8 Very GoodFrom 10 reviews,Stokol
1,intrnetRestaurantgym,5.6 GoodFrom 4 reviews,Machlessvile
2,restaurantgympoolBar,7.2 Very GoodFrom 38 reviews,Wanderland
3,BARRestaurant,7.3 Very GoodFrom 6 reviews,Uberlandia
4,InternetRestaurant,7.2 Very GoodFrom 30 reviews,Stokol
...,...,...,...
3061,barInternet,,Andeman
3062,restaurantBarInternet,8.1 ExcellentFrom 4 reviews,Uberlandia
3063,Barrestaurantswimmingpools,6.7 Very GoodFrom 10 reviews,Willsmian
3064,Restaurant,,Hallerson
