# Data Analytics Competition Find IT UGM - Kyozo Hotel Price

## Team Oh Data Euy:
- Gerend Christopher 
- Felix Fernando 
- Jeremy

## Problem Statement:
Kyozo, a global hotel chain, needs a data consultant to build a price prediction model for the development of its new hotels. As the data consultant, you are given a dataset covering the thousands of hotels the chain currently operates.

However, Kyozo's team are not data experts. The dataset is provided as is, and you must match its columns and values to a set of sample hotels whose prices are to be predicted.

## Goal:
Build a model that predicts the development price of new hotels with good performance (measured by Mean Absolute Error).
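
As a reminder of what the metric measures: Mean Absolute Error is the average of the absolute differences between actual and predicted prices. The sketch below computes it by hand. The prices and predictions are made-up illustration values, not competition data, and `mean_absolute_error_manual` is a hypothetical helper name.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices, for illustration only
actual = [13500, 13000, 19000]
predicted = [12500, 15000, 19000]

mae = mean_absolute_error_manual(actual, predicted)
print(mae)  # 1000.0
```

`sklearn.metrics.mean_absolute_error`, which the notebook imports, computes the same quantity.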


# Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm import tqdm
from IPython.display import display

# Model Library
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, roc_auc_score, mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
import optuna

from catboost import CatBoostClassifier, Pool, cv

import lightgbm as lgb

import xgboost as xgb

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from category_encoders import OrdinalEncoder as oe

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# custom plot seaborn
plt.rcParams["figure.figsize"] = (8,6)
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params, palette='tab10')

np.random.seed(10)
%matplotlib inline



# Data Preparation

### Features

`facilities` - facilities provided by the hotel

`rating` - rating given by guests

`location` - city where the hotel is located

### Target variable:
`Price` - hotel development price

### Load Data

In [13]:
df_features = pd.read_csv('data/train_features.csv') # read the training features
df_labels = pd.read_csv('data/train_label.csv') # read the training labels
df_test_features = pd.read_csv('data/test_feature.csv') # read the test features

In [3]:
display(df_labels.columns) # show the columns of df_labels
display(df_features.columns) # show the columns of df_features
display(df_test_features.columns) # show the columns of df_test_features

Index(['Price'], dtype='object')

Index(['facilities', 'rating', 'location'], dtype='object')

Index(['facilities', 'rating', 'location'], dtype='object')

In [16]:
df_features.head()

Unnamed: 0,facilities,rating,location
2035,RestaurantSwimmingPoolsBARintrnet,7.5 Very GoodFrom 35 reviews,Hallerson
502,barrestaurantinternet,7.0 Very GoodFrom 14 reviews,Andeman
1152,poolBARRestaurant,7.8 Very GoodFrom 10 reviews,Stokol
1029,BarrestaurantGymswimmingpools,5.7 GoodFrom 37 reviews,Andeman
1404,BARSwimmingPoolsrestaurantinternet,5.8 GoodFrom 6 reviews,Uberlandia


In [17]:
df_test_features.head()

Unnamed: 0,ID,facilities,rating,location
0,0,GymrestaurantbarInternetSwimmingPools,8.0 ExcellentFrom 1 reviews,Stokol
1,1,Poolrestaurantgyminternetbar,7.4 Very GoodFrom 22 reviews,Hallerson
2,2,BARSwimmingPoolsInternetgym,0.0 FairFrom 4 reviews,Hallerson
3,3,gymSwimmingPoolsBARintrnetRestaurant,6.8 Very GoodFrom 13 reviews,Andeman
4,4,gymRestaurantpoolbarintrnet,0.0 FairFrom 9 reviews,Hallerson


### Data Cleansing

In [11]:
print('Features Dataset')
df_features.info()  # info() prints directly and returns None, so display() is not needed

print('Labels Dataset')
df_labels.info()

print('Test Features Dataset')
df_test_features.info()

Features Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3066 entries, 0 to 3065
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   facilities  2765 non-null   object
 1   rating      2429 non-null   object
 2   location    3066 non-null   object
dtypes: object(3)
memory usage: 72.0+ KB

Labels Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3066 entries, 0 to 3065
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Price   3066 non-null   object
dtypes: object(1)
memory usage: 24.1+ KB

Test Features Dataset
<class 'pandas.core.frame.DataFrame'>
Int64Index: 767 entries, 0 to 766
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   facilities  767 non-null    object
 1   rating      767 non-null    object
 2   location    767 non-null    object
dtypes: object(3)
memory usage: 24.0+ KB


In [10]:
df_features.isna().sum() # count null values per column in df_features

facilities    301
rating        637
location        0
dtype: int64

In [13]:
df_labels.isna().sum() # count null values per column in df_labels

Price    0
dtype: int64

In [17]:
df_test_features.isna().sum() # count null values per column in df_test_features

facilities    0
rating        0
location      0
dtype: int64
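
To put the null counts in proportion, the short calculation below converts them to percentages of the 3066 training rows. The counts are copied from the `df_features` outputs above. In pandas, `df_features.isna().mean()` would give the same shares as fractions.

```python
# Counts taken from the df_features.info() / isna().sum() outputs
total_rows = 3066
missing = {"facilities": 301, "rating": 637, "location": 0}

# Percentage of rows with a missing value, per column
missing_share = {col: round(100 * n / total_rows, 1) for col, n in missing.items()}
print(missing_share)  # {'facilities': 9.8, 'rating': 20.8, 'location': 0.0}
```

Roughly one training row in five has no rating, which is why the handling of `rating` nulls gets its own discussion further down.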

In [24]:
df_features

Unnamed: 0,facilities,rating,location
0,RestaurantBARSwimmingPools,7.8 Very GoodFrom 10 reviews,Stokol
1,intrnetRestaurantgym,5.6 GoodFrom 4 reviews,Machlessvile
2,restaurantgympoolBar,7.2 Very GoodFrom 38 reviews,Wanderland
3,BARRestaurant,7.3 Very GoodFrom 6 reviews,Uberlandia
4,InternetRestaurant,7.2 Very GoodFrom 30 reviews,Stokol
...,...,...,...
3061,barInternet,,Andeman
3062,restaurantBarInternet,8.1 ExcellentFrom 4 reviews,Uberlandia
3063,Barrestaurantswimmingpools,6.7 Very GoodFrom 10 reviews,Willsmian
3064,Restaurant,,Hallerson


In [18]:
df_train = df_features.join(df_labels)
df_train

Unnamed: 0,facilities,rating,location,Price
0,RestaurantBARSwimmingPools,7.8 Very GoodFrom 10 reviews,Stokol,"13,500avg/night"
1,intrnetRestaurantgym,5.6 GoodFrom 4 reviews,Machlessvile,"13,000avg/night"
2,restaurantgympoolBar,7.2 Very GoodFrom 38 reviews,Wanderland,"19,000avg/night"
3,BARRestaurant,7.3 Very GoodFrom 6 reviews,Uberlandia,"6,000avg/night"
4,InternetRestaurant,7.2 Very GoodFrom 30 reviews,Stokol,"20,000avg/night"
...,...,...,...,...
3061,barInternet,,Andeman,"31,625avg/night"
3062,restaurantBarInternet,8.1 ExcellentFrom 4 reviews,Uberlandia,"30,500avg/night"
3063,Barrestaurantswimmingpools,6.7 Very GoodFrom 10 reviews,Willsmian,"14,000avg/night"
3064,Restaurant,,Hallerson,"8,500avg/night"


In [19]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3066 entries, 0 to 3065
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   facilities  2765 non-null   object
 1   rating      2429 non-null   object
 2   location    3066 non-null   object
 3   Price       3066 non-null   object
dtypes: object(4)
memory usage: 95.9+ KB


### Handling Null

READ THIS: if both are null, let's just drop those rows (Ger)

In [20]:
# check both facilities and rating null values

df_train[df_train['facilities'].isnull() & df_train['rating'].isnull()]

Unnamed: 0,facilities,rating,location,Price
16,,,Machlessvile,"3,200avg/night"
44,,,Uberlandia,"17,000avg/night"
58,,,Stokol,"1,800avg/night"
73,,,Stokol,"23,050avg/night"
79,,,Stokol,"1,800avg/night"
...,...,...,...,...
2998,,,Machlessvile,"3,500avg/night"
3015,,,Stokol,"8,000avg/night"
3021,,,Uberlandia,"8,000avg/night"
3030,,,Willsmian,"3,700avg/night"


In [21]:
# drop if facilities and rating both null

df_train = df_train.dropna(subset=['facilities', 'rating'], how='all').reset_index(drop=True)
df_train

Unnamed: 0,facilities,rating,location,Price
0,RestaurantBARSwimmingPools,7.8 Very GoodFrom 10 reviews,Stokol,"13,500avg/night"
1,intrnetRestaurantgym,5.6 GoodFrom 4 reviews,Machlessvile,"13,000avg/night"
2,restaurantgympoolBar,7.2 Very GoodFrom 38 reviews,Wanderland,"19,000avg/night"
3,BARRestaurant,7.3 Very GoodFrom 6 reviews,Uberlandia,"6,000avg/night"
4,InternetRestaurant,7.2 Very GoodFrom 30 reviews,Stokol,"20,000avg/night"
...,...,...,...,...
2862,barInternet,,Andeman,"31,625avg/night"
2863,restaurantBarInternet,8.1 ExcellentFrom 4 reviews,Uberlandia,"30,500avg/night"
2864,Barrestaurantswimmingpools,6.7 Very GoodFrom 10 reviews,Willsmian,"14,000avg/night"
2865,Restaurant,,Hallerson,"8,500avg/night"
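
The `how='all'` argument matters here: it removes a row only when every column listed in `subset` is null, whereas the default `how='any'` would also remove rows with just one of them missing. A small sketch on a toy frame (illustrative values only) shows the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "facilities": ["barInternet", np.nan, np.nan, "Restaurant"],
    "rating": ["8.1 ExcellentFrom 4 reviews", "6.0 Very GoodFrom 43 reviews", np.nan, np.nan],
    "location": ["Stokol", "Andeman", "Stokol", "Hallerson"],
})

# how='all' drops a row only when every column in `subset` is null
dropped_all = toy.dropna(subset=["facilities", "rating"], how="all")

# how='any' (the default) drops a row when at least one of them is null
dropped_any = toy.dropna(subset=["facilities", "rating"], how="any")

print(len(dropped_all), len(dropped_any))  # 3 1
```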


READ THIS: it's hard to fill the NaNs when the facilities are missing, so Ger suggests just dropping those rows

In [22]:
# show rows where facilities is null

df_train[df_train['facilities'].isnull()]

Unnamed: 0,facilities,rating,location,Price
28,,6.0 Very GoodFrom 43 reviews,Wanderland,"15,000avg/night"
57,,10.0 ExcellentFrom 1 review,Wanderland,"20,000avg/night"
102,,6.4 Very GoodFrom 1 review,Andeman,"10,000avg/night"
133,,6.0 Very GoodFrom 43 reviews,Andeman,"15,000avg/night"
145,,6.0 Very GoodFrom 43 reviews,Hallerson,"15,000avg/night"
...,...,...,...,...
2706,,6.4 Very GoodFrom 1 review,Stokol,"10,000avg/night"
2728,,7.6 Very GoodFrom 1 review,Stokol,"11,000avg/night"
2746,,8.3 ExcellentFrom 4 reviews,Ubisville,"35,000avg/night"
2767,,6.0 Very GoodFrom 43 reviews,Wanderland,"15,000avg/night"


In [23]:
# drop null values in facilities

df_train = df_train.dropna(subset=['facilities']).reset_index(drop=True)
df_train

Unnamed: 0,facilities,rating,location,Price
0,RestaurantBARSwimmingPools,7.8 Very GoodFrom 10 reviews,Stokol,"13,500avg/night"
1,intrnetRestaurantgym,5.6 GoodFrom 4 reviews,Machlessvile,"13,000avg/night"
2,restaurantgympoolBar,7.2 Very GoodFrom 38 reviews,Wanderland,"19,000avg/night"
3,BARRestaurant,7.3 Very GoodFrom 6 reviews,Uberlandia,"6,000avg/night"
4,InternetRestaurant,7.2 Very GoodFrom 30 reviews,Stokol,"20,000avg/night"
...,...,...,...,...
2760,barInternet,,Andeman,"31,625avg/night"
2761,restaurantBarInternet,8.1 ExcellentFrom 4 reviews,Uberlandia,"30,500avg/night"
2762,Barrestaurantswimmingpools,6.7 Very GoodFrom 10 reviews,Willsmian,"14,000avg/night"
2763,Restaurant,,Hallerson,"8,500avg/night"


READ THIS: now what about rating? We could drop the rows, or fill in only the rating number or the category (e.g. Excellent), but the number of reviews can't be filled in

In [120]:
# drop null values rating

df_train = df_train.dropna(subset=['rating']).reset_index(drop=True)
df_train

Unnamed: 0,facilities,rating,location,Price,restaurant,bar,pool,internet,gym
0,restaurant bar pool,7.8 Very GoodFrom 10 reviews,Stokol,"13,500avg/night",1,1,1,0,0
1,internet restaurant gym,5.6 GoodFrom 4 reviews,Machlessvile,"13,000avg/night",1,0,0,1,1
2,restaurant gym pool bar,7.2 Very GoodFrom 38 reviews,Wanderland,"19,000avg/night",1,1,1,0,1
3,bar restaurant,7.3 Very GoodFrom 6 reviews,Uberlandia,"6,000avg/night",1,1,0,0,0
4,internet restaurant,7.2 Very GoodFrom 30 reviews,Stokol,"20,000avg/night",1,0,0,1,0
...,...,...,...,...,...,...,...,...,...
2322,restaurant bar,7.6 Very GoodFrom 3 reviews,Andeman,"5,000avg/night",1,1,0,0,0
2323,bar restaurant pool gym,7.8 Very GoodFrom 351 reviews,Andeman,"30,000avg/night",1,1,1,0,1
2324,restaurant bar internet,8.1 ExcellentFrom 4 reviews,Uberlandia,"30,500avg/night",1,1,0,1,0
2325,bar restaurant pool,6.7 Very GoodFrom 10 reviews,Willsmian,"14,000avg/night",1,1,1,0,0
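
The notebook drops rows with a null rating. The alternative raised in the note above, filling in only the rating number, could look roughly like the sketch below. It is not what the notebook does: it assumes a numeric `rate_num` column has already been extracted, and both the per-location median strategy and the `rate_missing` flag are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "location": ["Stokol", "Stokol", "Stokol", "Andeman", "Andeman"],
    "rate_num": [7.8, 7.2, np.nan, 5.6, np.nan],
})

# Keep a flag so a model can still tell which ratings were imputed
toy["rate_missing"] = toy["rate_num"].isna().astype(int)

# Fill with the median rating of hotels in the same location,
# falling back to the global median if a location has no ratings at all
location_median = toy.groupby("location")["rate_num"].transform("median")
global_median = toy["rate_num"].median()
toy["rate_num"] = toy["rate_num"].fillna(location_median).fillna(global_median)

print(toy["rate_num"].tolist())
```

The trade-off is more training rows at the cost of noisier rating values, so it would need to be compared against the drop-based approach on validation MAE.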


### Feature Engineering

READ THIS: split rating into the number, the category (Excellent, Good, etc.), and the number of reviews; remove avg/night from price; split facilities (watch out for case sensitivity); Ger also thinks we should encode which facilities each hotel has

In [60]:
# normalise the facilities string and encode each keyword as a 0/1 column

def transform_strings_facilities(df, keywords):
    """
    Normalise the 'facilities' column and add one 0/1 indicator column per keyword.

    Args:
        df (pd.DataFrame): Frame with a non-null string column 'facilities'.
        keywords (list): Lowercase facility names to detect and encode.

    Returns:
        pd.DataFrame: The same frame, with 'facilities' rewritten as
        space-separated keywords and one indicator column per keyword.
    """
    data = df['facilities'].str.lower()  # make matching case-insensitive

    # Fix the recurring misspelling 'intrnet'
    data = data.str.replace('intrnet', 'internet', regex=False)

    # Collapse 'swimmingpools', 'swimmingpool' and 'pools' into 'pool'
    data = data.str.replace(r'(?:swimming)?pools?', 'pool', regex=True)

    # Surround each keyword with spaces so the concatenated names separate
    for keyword in keywords:
        data = data.str.replace(keyword, f' {keyword} ', regex=False)

    # One 0/1 indicator column per keyword
    for keyword in keywords:
        df[keyword] = data.str.contains(keyword, regex=False).astype(int)

    # Collapse repeated whitespace into single spaces
    df['facilities'] = data.str.split().str.join(' ')

    return df

In [61]:
keywords = ['restaurant', 'bar', 'pool', 'internet', 'gym']
df_train = transform_strings_facilities(df_train, keywords)

df_train

Unnamed: 0,facilities,rating,location,Price,restaurant,bar,pool,internet,gym
0,restaurant bar pool,7.8 Very GoodFrom 10 reviews,Stokol,"13,500avg/night",1,1,1,0,0
1,internet restaurant gym,5.6 GoodFrom 4 reviews,Machlessvile,"13,000avg/night",1,0,0,1,1
2,restaurant gym pool bar,7.2 Very GoodFrom 38 reviews,Wanderland,"19,000avg/night",1,1,1,0,1
3,bar restaurant,7.3 Very GoodFrom 6 reviews,Uberlandia,"6,000avg/night",1,1,0,0,0
4,internet restaurant,7.2 Very GoodFrom 30 reviews,Stokol,"20,000avg/night",1,0,0,1,0
...,...,...,...,...,...,...,...,...,...
2760,bar internet,,Andeman,"31,625avg/night",0,1,0,1,0
2761,restaurant bar internet,8.1 ExcellentFrom 4 reviews,Uberlandia,"30,500avg/night",1,1,0,1,0
2762,bar restaurant pool,6.7 Very GoodFrom 10 reviews,Willsmian,"14,000avg/night",1,1,1,0,0
2763,restaurant,,Hallerson,"8,500avg/night",1,0,0,0,0


In [62]:
df_test_features = transform_strings_facilities(df_test_features, keywords)
df_test_features

Unnamed: 0,ID,facilities,rating,location,restaurant,bar,pool,internet,gym
0,0,gym restaurant bar internet pool,8.0 ExcellentFrom 1 reviews,Stokol,1,1,1,1,1
1,1,pool restaurant gym internet bar,7.4 Very GoodFrom 22 reviews,Hallerson,1,1,1,1,1
2,2,bar pool internet gym,0.0 FairFrom 4 reviews,Hallerson,0,1,1,1,1
3,3,gym pool bar internet restaurant,6.8 Very GoodFrom 13 reviews,Andeman,1,1,1,1,1
4,4,gym restaurant pool bar internet,0.0 FairFrom 9 reviews,Hallerson,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...
762,762,bar pool restaurant gym,8.1 ExcellentFrom 9 reviews,Andeman,1,1,1,0,1
763,763,internet gym pool restaurant bar,8.0 ExcellentFrom 4 reviews,Wanderland,1,1,1,1,1
764,764,restaurant gym internet bar pool,7.4 Very GoodFrom 19 reviews,Andeman,1,1,1,1,1
765,765,gym internet bar pool restaurant,9.0 ExcellentFrom 17 reviews,Hallerson,1,1,1,1,1


In [64]:
# rows where none of the facility keywords was detected
display(df_train[(df_train[keywords] == 0).all(axis=1)])
display(df_test_features[(df_test_features[keywords] == 0).all(axis=1)])

Unnamed: 0,facilities,rating,location,Price,restaurant,bar,pool,internet,gym


Unnamed: 0,ID,facilities,rating,location,restaurant,bar,pool,internet,gym
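
Both frames come back empty, so every remaining row matched at least one keyword. To spot-check individual raw strings without pandas, the same normalisation can be applied one value at a time. `facility_flags` is a hypothetical helper, not part of the notebook, and it assumes the only variants are the ones visible in the data (`intrnet`, `SwimmingPools`, mixed case).

```python
import re

KEYWORDS = ["restaurant", "bar", "pool", "internet", "gym"]


def facility_flags(raw):
    """Return a {keyword: 0/1} dict for one concatenated facilities string."""
    text = raw.lower().replace("intrnet", "internet")
    # Treat 'swimmingpools', 'pools' and 'pool' as the same facility
    text = re.sub(r"(?:swimming)?pools?", "pool", text)
    return {kw: int(kw in text) for kw in KEYWORDS}


print(facility_flags("RestaurantSwimmingPoolsBARintrnet"))
# {'restaurant': 1, 'bar': 1, 'pool': 1, 'internet': 1, 'gym': 0}
```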


READ THIS: Splitting rating

In [None]:
# split rating into number, category, and review count

def transform_rating(df):
    """
    Split 'rating' (e.g. '7.8 Very GoodFrom 10 reviews') into three columns:
    rate_num (float), rate_cat (str) and review_num (int).
    Rows whose rating does not match this format are dropped.
    """
    pattern = (r'^\s*(?P<rate_num>\d+(?:\.\d+)?)\s*(?P<rate_cat>.*?)\s*'
               r'From\s+(?P<review_num>\d+)\s+reviews?\s*$')
    parts = df['rating'].str.extract(pattern)

    # drop rows without a parsable rating
    keep = parts['rate_num'].notna()
    df = df[keep].reset_index(drop=True)
    parts = parts[keep].reset_index(drop=True)

    df['rate_num'] = parts['rate_num'].astype('float64')
    df['rate_cat'] = parts['rate_cat']
    df['review_num'] = parts['review_num'].astype('int64')

    return df

In [202]:
df_train = transform_rating(df_train)
display(df_train)
df_test_features = transform_rating(df_test_features)
display(df_test_features)

Unnamed: 0,facilities,rating,location,Price,restaurant,bar,pool,internet,gym,rate_num,rate_cat,review_num
0,restaurant bar pool,7.8 Very GoodFrom 10 reviews,Stokol,"13,500avg/night",1,1,1,0,0,7.8,Very Good,10
1,internet restaurant gym,5.6 GoodFrom 4 reviews,Machlessvile,"13,000avg/night",1,0,0,1,1,5.6,Good,4
2,restaurant gym pool bar,7.2 Very GoodFrom 38 reviews,Wanderland,"19,000avg/night",1,1,1,0,1,7.2,Very Good,38
3,bar restaurant,7.3 Very GoodFrom 6 reviews,Uberlandia,"6,000avg/night",1,1,0,0,0,7.3,Very Good,6
4,internet restaurant,7.2 Very GoodFrom 30 reviews,Stokol,"20,000avg/night",1,0,0,1,0,7.2,Very Good,30
...,...,...,...,...,...,...,...,...,...,...,...,...
2299,restaurant bar,7.6 Very GoodFrom 3 reviews,Andeman,"5,000avg/night",1,1,0,0,0,7.6,Very Good,3
2300,bar restaurant pool gym,7.8 Very GoodFrom 351 reviews,Andeman,"30,000avg/night",1,1,1,0,1,7.8,Very Good,351
2301,restaurant bar internet,8.1 ExcellentFrom 4 reviews,Uberlandia,"30,500avg/night",1,1,0,1,0,8.1,Excellent,4
2302,bar restaurant pool,6.7 Very GoodFrom 10 reviews,Willsmian,"14,000avg/night",1,1,1,0,0,6.7,Very Good,10


Unnamed: 0,ID,facilities,rating,location,restaurant,bar,pool,internet,gym,rate_num,rate_cat,review_num
0,0,gym restaurant bar internet pool,8.0 ExcellentFrom 1 reviews,Stokol,1,1,1,1,1,8.0,Excellent,1
1,1,pool restaurant gym internet bar,7.4 Very GoodFrom 22 reviews,Hallerson,1,1,1,1,1,7.4,Very Good,22
2,2,bar pool internet gym,0.0 FairFrom 4 reviews,Hallerson,0,1,1,1,1,0.0,Fair,4
3,3,gym pool bar internet restaurant,6.8 Very GoodFrom 13 reviews,Andeman,1,1,1,1,1,6.8,Very Good,13
4,4,gym restaurant pool bar internet,0.0 FairFrom 9 reviews,Hallerson,1,1,1,1,1,0.0,Fair,9
...,...,...,...,...,...,...,...,...,...,...,...,...
762,762,bar pool restaurant gym,8.1 ExcellentFrom 9 reviews,Andeman,1,1,1,0,1,8.1,Excellent,9
763,763,internet gym pool restaurant bar,8.0 ExcellentFrom 4 reviews,Wanderland,1,1,1,1,1,8.0,Excellent,4
764,764,restaurant gym internet bar pool,7.4 Very GoodFrom 19 reviews,Andeman,1,1,1,1,1,7.4,Very Good,19
765,765,gym internet bar pool restaurant,9.0 ExcellentFrom 17 reviews,Hallerson,1,1,1,1,1,9.0,Excellent,17
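
The structure of a single rating string can also be checked with a plain regular expression. The sketch below is a standalone illustration, not the notebook's own code: `parse_rating` and `RATING_RE` are hypothetical names, and the pattern assumes every value looks like `<number> <category>From <n> review(s)`.

```python
import re

# number, optional category text, then 'From <n> review' or 'From <n> reviews'
RATING_RE = re.compile(r"^(\d+(?:\.\d+)?)\s*(.*?)\s*From (\d+) reviews?$")


def parse_rating(raw):
    """Split e.g. '7.8 Very GoodFrom 10 reviews' into (7.8, 'Very Good', 10)."""
    match = RATING_RE.match(raw.strip())
    if match is None:
        return None  # unexpected format; the caller decides what to do
    number, category, reviews = match.groups()
    return float(number), category, int(reviews)


print(parse_rating("7.8 Very GoodFrom 10 reviews"))  # (7.8, 'Very Good', 10)
print(parse_rating("10.0 ExcellentFrom 1 review"))   # (10.0, 'Excellent', 1)
```

Matching on the pattern rather than on fixed character positions handles two-digit ratings such as `10.0` and the singular `review` without special cases.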


In [None]:
display(df_train['rate_cat'].unique())
display(df_test_features['rate_cat'].unique())
display(df_train['rate_num'].unique())
display(df_test_features['rate_num'].unique())
display(df_train['review_num'].unique())
display(df_test_features['review_num'].unique())

array(['Very Good', 'Good', 'Excellent', 'Fair'], dtype=object)

array(['Excellent', 'Very Good', 'Fair', 'Good'], dtype=object)

array([ 7.8,  5.6,  7.2,  7.3,  5.4,  7.9,  7.7,  6.9,  9.6,  7.4,  2.4,
        8.1,  8.8,  8. ,  6.7, 10. ,  8.6,  7. ,  6.4,  8.4,  5.9,  7.1,
        5.8,  6.2,  6.3,  8.2,  7.6,  6.8,  6. ,  6.1,  4. ,  4.2,  8.7,
        8.9,  9.3,  5.2,  5.7,  6.5,  4.9,  3.9,  7.5,  9.5,  8.3,  6.6,
        5.5,  9.1,  8.5,  4.3,  9.8,  4.4,  3.6,  2.8,  5. ,  3.7,  9.4,
        2. ,  5.3,  9.2,  5.1,  2.5,  3.2])

array([ 8. ,  7.4,  0. ,  6.8,  7.1,  5.1,  5.8,  7.3,  4.4,  7.8,  8.1,
        7.2,  8.7,  7.6,  9.6,  6.5,  8.5,  6.4,  7.5,  8.4,  6.9,  6.3,
        3.6, 10. ,  7.7,  5.2,  8.2,  9.3,  7.9,  6. ,  4.3,  7. ,  6.6,
        8.6,  9.4,  6.7,  9.1,  2.5,  8.9,  8.3,  6.2,  9.8,  8.8,  5. ,
        9.2,  4.9,  5.6,  4. ,  2. ,  6.1,  2.8,  3.7,  9. ,  2.3,  5.4,
        5.7,  5.9,  9.5,  5.5])

array([ 10,   4,  38,   6,  30,  13,   3,   7,   8,  23,   2,  50,   1,
        16,  36,  43,  12,   9,  46,  24,  29,  17,  28, 226,  14,  25,
        20,  44,  21,   5,  11,  94,  97,  37,  41,  18,  39,  49,  34,
        32,  79,  31, 145,  59,  61,  35,  15,  40,  26,  68,  22,  27,
       351,  57,  19, 152, 125,  33, 154,  42])

array([ 1, 22,  4, 13,  9, 23,  5, 10, 17, 16,  3, 19, 14, 18, 20, 15,  8,
       11,  7,  2, 12, 21,  6, 24, 25])

READ THIS: price just goes last

In [203]:
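# remove the 'avg/night' unit and the thousands separators, then convert to integer
# (str.replace removes the literal suffix; str.rstrip would treat it as a set of characters)

df_train['Price'] = df_train['Price'].str.replace('avg/night', '', regex=False).str.replace(',', '', regex=False).astype('int64')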
df_train

Unnamed: 0,facilities,rating,location,Price,restaurant,bar,pool,internet,gym,rate_num,rate_cat,review_num
0,restaurant bar pool,7.8 Very GoodFrom 10 reviews,Stokol,13500,1,1,1,0,0,7.8,Very Good,10
1,internet restaurant gym,5.6 GoodFrom 4 reviews,Machlessvile,13000,1,0,0,1,1,5.6,Good,4
2,restaurant gym pool bar,7.2 Very GoodFrom 38 reviews,Wanderland,19000,1,1,1,0,1,7.2,Very Good,38
3,bar restaurant,7.3 Very GoodFrom 6 reviews,Uberlandia,6000,1,1,0,0,0,7.3,Very Good,6
4,internet restaurant,7.2 Very GoodFrom 30 reviews,Stokol,20000,1,0,0,1,0,7.2,Very Good,30
...,...,...,...,...,...,...,...,...,...,...,...,...
2299,restaurant bar,7.6 Very GoodFrom 3 reviews,Andeman,5000,1,1,0,0,0,7.6,Very Good,3
2300,bar restaurant pool gym,7.8 Very GoodFrom 351 reviews,Andeman,30000,1,1,1,0,1,7.8,Very Good,351
2301,restaurant bar internet,8.1 ExcellentFrom 4 reviews,Uberlandia,30500,1,1,0,1,0,8.1,Excellent,4
2302,bar restaurant pool,6.7 Very GoodFrom 10 reviews,Willsmian,14000,1,1,1,0,0,6.7,Very Good,10
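
For a single value, the price conversion amounts to removing the unit suffix and the thousands separators. A minimal standalone sketch, where `parse_price` is a hypothetical helper and the format `<digits with commas>avg/night` is assumed from the rows shown above:

```python
def parse_price(raw):
    """Convert e.g. '13,500avg/night' to the integer 13500."""
    suffix = "avg/night"
    text = raw.strip()
    if text.endswith(suffix):
        text = text[: -len(suffix)]  # cut the literal suffix
    return int(text.replace(",", ""))


print(parse_price("13,500avg/night"))  # 13500

# str.rstrip removes any trailing characters from the given set,
# not the literal suffix, which can surprise on other inputs:
print("training".rstrip("ing"))  # tra
```

Because digits are not among the characters in `'avg/night'`, a character-set strip happens to be safe for these prices, but removing the literal suffix states the intent more directly.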


# EDA

# Modeling

# Submission