# Start of Data Exploration

In [5]:
# Imports
import sklearn as sk
import pandas as pd

In [None]:
train_data = pd.read_csv('data/training_set_VU_DM.csv')
train_data.head(5) # Show top 5

# Manual Column exploration
---
## Main columns
- `srch_id` identifies a single search, so it groups all hotels shown to one 'user' for that query.
- `booking_bool` is the target label: 1 if the hotel was booked, 0 otherwise.
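Both observations can be checked with a simple group-by. The following is a minimal sketch on a hand-made frame: the values are invented for illustration, and the name `srch_id` for the search identifier column is an assumption to verify against `train_data.columns`.

```python
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "srch_id": [1, 1, 1, 2, 2, 3],   # assumed name of the search identifier
    "prop_id": [10, 11, 12, 10, 13, 11],
    "booking_bool": [0, 1, 0, 0, 0, 1],
})

# Number of hotels listed per search.
rows_per_search = toy.groupby("srch_id").size()

# A search counts as "booked" if any of its listed hotels was booked.
booked_per_search = toy.groupby("srch_id")["booking_bool"].max()
search_booking_rate = booked_per_search.mean()

print(rows_per_search.to_dict())
print(round(search_booking_rate, 3))
```

On the real data, the same two lines applied to `train_data` show how many candidate hotels each search contains and how imbalanced the target is.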

## Categorical features
The following features are categorical (to be one-hot encoded):

User-specific
- `site_id`: which Expedia website (point of sale) was used
- `visitor_location_country_id`: country the user is located in
- `srch_destination_id`: destination the user searched for
- `srch_saturday_night_bool`: boolean, true if the stay includes a Saturday night

Hotel-specific:
- `prop_id`: ID of the hotel
- `prop_brand_bool`: boolean, true if the hotel is part of a chain
- `promotion_flag`: whether a promotion is displayed for the hotel

Expedia vs. competitors 1-8:
- `comp{i}_rate`: +1 if Expedia has a lower price than competitor `i`, 0 if the prices are the same, -1 if Expedia's price is higher, null if there is no competitive data
- `comp{i}_inv`: +1 if competitor `i` has no availability, 0 if both have availability, null if there is no competitive data

## Numerical features

User-specific
- `visitor_hist_starrating`: mean star rating of hotels the user booked previously
- `visitor_hist_adr_usd`: mean price per night (USD) of hotels the user booked previously
- `srch_length_of_stay`: number of nights **searched**
- `srch_booking_window`: number of days between the search date and the start of the stay **searched**
- `srch_adults_count`: number of adults **searched**
- `srch_children_count`: number of children **searched**
- `srch_room_count`: number of rooms **searched**
- `random_bool`: if sort was random at time of search
- `gross_booking_usd`: ❗Training-only❗ total payment for the hotel, including taxes, etc.

Hotel-specific
- `prop_starrating`: star rating of hotel (1-5)
- `prop_review_score`: average review score of hotel (1-5)
- `prop_location_score_1`: score1 of hotel's location desirability
- `prop_location_score_2`: score2 of hotel's location desirability
- `prop_log_historical_price`: logarithm of the hotel's recent average price (0 means the hotel was not sold in that period)
- `price_usd`: displayed price of hotel.
    - ❗ Important: Different countries have different conventions for displaying taxes and fees.
    - The value can be per night or for the whole stay.
- `srch_query_affinity_score`: log of the probability that the hotel is clicked in Internet searches

User-hotel coupled:
- `orig_destination_distance`: distance between hotel and customer at search-time (null means no distance calculated)

Expedia vs. competitors 1-8:
- `comp{i}_rate_percent_diff`: absolute percentage difference between Expedia's and competitor `i`'s price, null if there is no competitive data
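Several of these numerical columns contain nulls and live on very different scales. A minimal sketch of one way to handle both, assuming median imputation followed by min-max scaling (the notebook does not state a strategy, so both choices are assumptions, and the toy values are invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for two numerical columns; values are invented.
toy = pd.DataFrame({
    "prop_starrating": [3.0, 4.0, 5.0, 2.0],
    "orig_destination_distance": [120.5, np.nan, 15.0, 980.0],
})

# Fill nulls with the column median, then rescale each column to [0, 1].
numeric_pipeline = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())
scaled = numeric_pipeline.fit_transform(toy)

print(scaled)
```

Median imputation hides the fact that a value was missing; if "no distance calculated" is itself informative, a separate indicator column may be worth considering.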


## Unknown type
- `date_time`
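One possible way to make `date_time` usable is to parse it as a timestamp and derive calendar features. This is a sketch only: the timestamp format in the toy data and the derived column names (`search_month`, `search_dayofweek`, `search_hour`) are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in; the timestamp format is assumed, the values are invented.
toy = pd.DataFrame({"date_time": ["2013-04-04 08:32:15", "2012-12-31 23:59:00"]})

parsed = pd.to_datetime(toy["date_time"])

# Hypothetical derived features; the names are not from the dataset.
toy["search_month"] = parsed.dt.month
toy["search_dayofweek"] = parsed.dt.dayofweek  # Monday == 0
toy["search_hour"] = parsed.dt.hour

print(toy)
```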

# Initial feature transformation

In [53]:
# Imports for feature transformation
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

In [None]:
# Based on above specs, we encode our data
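The encoding described by the column notes could be wired together with `ColumnTransformer`. The sketch below uses a toy frame with invented values and only a handful of columns; the column lists and the median imputation strategy are assumptions, not the notebook's final choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for train_data; values are invented.
toy = pd.DataFrame({
    "site_id": [5, 14, 5, 24],
    "promotion_flag": [0, 1, 0, 1],
    "prop_starrating": [3.0, 4.0, np.nan, 2.0],
    "price_usd": [104.77, 170.74, 179.80, 602.77],
})

# Illustrative subsets of the categorical and numerical columns listed above.
categorical_cols = ["site_id", "promotion_flag"]
numerical_cols = ["prop_starrating", "price_usd"]

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler()), numerical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array in this small sketch
)

X_encoded = preprocessor.fit_transform(toy)
print(X_encoded.shape)
```

On the full training set a sparse output is likely the better default, since the one-hot block dominates the width of the matrix.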

# Initial model feature selection

In [49]:
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC

In [51]:
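# SelectFromModel needs coef_ or feature_importances_; SVC only exposes coef_ with a linear kernel.
model = SVC(kernel='linear')
feature_selector = SelectFromModel(model)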

In [None]:
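# Fit the selector (it is not callable); X_encoded is a placeholder for the encoded features.
feature_selector.fit(X_encoded, train_data['booking_bool'])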
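As a self-contained illustration of model-based feature selection, the sketch below runs `SelectFromModel` on synthetic data from `make_classification` rather than on the Expedia set. The linear kernel is used because `SelectFromModel` reads the fitted estimator's `coef_`; the dataset sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC

# Synthetic data: 10 features, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Fit the selector; by default it keeps features whose absolute
# coefficient is at least the mean absolute coefficient.
selector = SelectFromModel(SVC(kernel="linear"))
selector.fit(X, y)

mask = selector.get_support()        # boolean mask over the 10 input features
X_selected = selector.transform(X)   # only the kept columns

print(mask)
print(X_selected.shape)
```

A kernel SVM scales poorly with the number of rows, so on the full training set a linear model such as `LinearSVC` or a subsample may be more practical; that trade-off is not evaluated here.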