# Machine Learning in Python - Group Project 2

**Due Friday, April 14th by 16:00.**

*include contributors names here (such as Name1, Name2, ...)* (Group Name)

## General Setup

In [1]:
# Add any additional libraries or submodules below

# Display plots inline
%matplotlib inline

# Data libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn modules
import sklearn

In [2]:
# Plotting defaults
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 80

In [3]:
# Load data
d = pd.read_csv("hotel.csv")

In [4]:
d.head()

| | is_canceled | hotel | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Resort Hotel | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | C | 3 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 |
| 1 | 0 | Resort Hotel | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | C | 4 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 |
| 2 | 0 | Resort Hotel | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | C | 0 | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 |
| 3 | 0 | Resort Hotel | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | A | 0 | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 |
| 4 | 0 | Resort Hotel | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | A | 0 | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 |


For details about the data set, please check the Project Description PDF file and the related article on the Learn page.

## 1. Introduction

*This section should include a brief introduction to the task and the data (assume this is a report you are delivering to a client).* 

- If you use any additional data sources, you should introduce them here and discuss why they were included.

- Briefly outline the approaches being used and the conclusions that you are able to draw.

## 2. Exploratory Data Analysis and Feature Engineering

*Include a detailed discussion of the data with a particular emphasis on the features of the data that are relevant for the subsequent modeling.* 

- Including visualizations of the data is strongly encouraged - all code and plots must also be described in the write up. 
- Think carefully about whether each plot needs to be included in your final draft - your report should include figures but they should be as focused and impactful as possible.

*Additionally, this section should also implement and describe any preprocessing / feature engineering of the data.*

- Specifically, this should be any code that you use to generate new columns in the data frame `d`. All of this processing is explicitly meant to occur before we split the data into training and testing subsets.
- Processing that will be performed as part of an sklearn pipeline can be mentioned here but should be implemented in the following section.

**All code and figures should be accompanied by text that provides an overview / context to what is being done or presented.**

In [5]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 30 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   is_canceled                     119390 non-null  int64  
 1   hotel                           119390 non-null  object 
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

Several columns relate to the arrival date. We decided to use the week number because it is easy to interpret and sits at a useful granularity: `arrival_date_day_of_month` is too fine to generalise well (and is only meaningful when combined with `arrival_date_month`), while `arrival_date_month` is too coarse and loses detail. We also treat `arrival_date_year` and `arrival_date_week_number` as categorical variables, because we have no reason to expect a linear effect across their values and we cannot generalise to values outside the observed range.

In [6]:
d.drop(['arrival_date_month', 'arrival_date_day_of_month'], axis=1, inplace=True)

In [7]:
d.arrival_date_week_number = d.arrival_date_week_number.astype('category')
d.arrival_date_year = d.arrival_date_year.astype('category')
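To illustrate what treating the week number as a category means for a model, the minimal sketch below uses a small hypothetical frame (`toy`, not the hotel data) and `pd.get_dummies`. Each distinct week becomes its own indicator column, so no linear trend across weeks is imposed.

```python
import pandas as pd

# Hypothetical toy data, not taken from hotel.csv
toy = pd.DataFrame({"arrival_date_week_number": [1, 27, 27, 53]})
toy["arrival_date_week_number"] = toy["arrival_date_week_number"].astype("category")

# Each distinct week gets its own indicator column
dummies = pd.get_dummies(toy, columns=["arrival_date_week_number"])
print(dummies.columns.tolist())
```

In an actual model this encoding would more likely be done inside an sklearn pipeline, which is an assumption about the later workflow rather than something fixed here.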

In [8]:
d.hotel.unique()

array(['Resort Hotel', 'City Hotel'], dtype=object)

There are only two kinds of hotel, and we are interested in the cancellation rate for each hotel type, so we set `hotel` as a categorical variable.

In [9]:
d.hotel = d.hotel.astype('category')

We also set `is_canceled`, `meal`, `market_segment`, `distribution_channel`, `is_repeated_guest`, `reserved_room_type`, `assigned_room_type`, `deposit_type`, `customer_type` as categorical variables for better interpretation.

In [10]:
# Cast the remaining categorical columns in one pass
categorical_cols = ['is_canceled', 'meal', 'market_segment', 'distribution_channel',
                    'is_repeated_guest', 'reserved_room_type', 'assigned_room_type',
                    'deposit_type', 'customer_type']
for col in categorical_cols:
    d[col] = d[col].astype('category')

As for `country`, we did not include it in our model because we want the model to be unbiased and non-discriminatory.

In [11]:
d.drop(['country'], axis=1, inplace=True)

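For simplicity, we combined `previous_cancellations` and `previous_bookings_not_canceled` into a single feature, `previous_cancel_rate`: the share of a guest's previous bookings that were canceled, set to 0 when there are no previous bookings.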

In [12]:
# Share of previous bookings that were canceled; 0 when there is no booking history
previous_total = d.previous_cancellations + d.previous_bookings_not_canceled
d['previous_cancel_rate'] = (d.previous_cancellations / previous_total).where(previous_total != 0, 0)

In [13]:
d.drop(['previous_cancellations', 'previous_bookings_not_canceled'], axis=1, inplace=True)
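As a quick check of how `previous_cancel_rate` behaves, the sketch below computes it on a small hypothetical frame (`toy`, with invented values rather than rows from the hotel data). It uses a vectorized division and replaces the undefined 0/0 case with 0.

```python
import pandas as pd

# Hypothetical toy data, not taken from hotel.csv
toy = pd.DataFrame({
    "previous_cancellations":         [0, 1, 3, 0],
    "previous_bookings_not_canceled": [0, 1, 1, 5],
})

total = toy["previous_cancellations"] + toy["previous_bookings_not_canceled"]

# 0/0 yields NaN in pandas; treat guests with no history as rate 0
toy["previous_cancel_rate"] = (toy["previous_cancellations"] / total).where(total != 0, 0)
print(toy)
```

Note that a guest with no booking history and a guest who has never canceled both receive a rate of 0 under this definition.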

We next check which columns contain N/A values.

In [14]:
d.columns[d.isna().any()]

Index(['children', 'agent', 'company'], dtype='object')

Next we deal with the N/A values, starting with `agent` and `company`. We assume that N/A in these columns indicates an individual traveler (who made and paid for the booking themselves), so we encode it as 0 and set both columns as categorical variables (casting to `int` first to avoid float-valued category labels).

In [15]:
d.agent = d.agent.fillna(0)
d.agent = d.agent.astype(int).astype('category')

d.company = d.company.fillna(0)
d.company = d.company.astype(int).astype('category')
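The effect of this fill-then-cast step can be seen on a small hypothetical column (`toy["agent"]`, with invented values). This is only a sketch of the same chain of calls, showing that the missing entries end up in a single category labelled 0 and that the labels are integers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN stands for a booking without an agent
toy = pd.DataFrame({"agent": [np.nan, 304.0, 240.0, np.nan]})

# Fill missing values with 0, cast to int to drop the ".0", then to category
toy["agent"] = toy["agent"].fillna(0).astype(int).astype("category")
print(toy["agent"].cat.categories.tolist())
```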

After the steps above, only one column still contains N/A values: `children`. We assume that N/A means there are no children in that booking, so we set it to 0.

In [16]:
d.children = d.children.fillna(0).astype(int)

In [17]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 26 columns):
 #   Column                       Non-Null Count   Dtype   
---  ------                       --------------   -----   
 0   is_canceled                  119390 non-null  category
 1   hotel                        119390 non-null  category
 2   lead_time                    119390 non-null  int64   
 3   arrival_date_year            119390 non-null  category
 4   arrival_date_week_number     119390 non-null  category
 5   stays_in_weekend_nights      119390 non-null  int64   
 6   stays_in_week_nights         119390 non-null  int64   
 7   adults                       119390 non-null  int64   
 8   children                     119390 non-null  int64   
 9   babies                       119390 non-null  int64   
 10  meal                         119390 non-null  category
 11  market_segment               119390 non-null  category
 12  distribution_channel         119390 non-null

## 3. Model Fitting and Tuning

*In this section you should detail your choice of model and describe the process used to refine and fit that model.*

- You are strongly encouraged to explore many different modeling methods (e.g. logistic regression, classification trees, SVC, etc.) but you should not include a detailed narrative of all of these attempts. 
- At most this section should mention the methods explored and why they were rejected - most of your effort should go into describing the model you are using and your process for tuning and validating it.

*For example, if you considered a logistic regression model, a classification tree, and an SVC model and ultimately settled on the logistic regression approach, then you should mention that the other two approaches were tried but do not include any of the code or any in-depth discussion of these models beyond why they were rejected. This section should then detail the development of the logistic regression model in terms of features used, interactions considered, and any additional tuning and validation which ultimately led to your final model.*

**This section should also include the full implementation of your final model, including all necessary validation. As with figures, any included code must also be addressed in the text of the document.**
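As a rough illustration of the kind of workflow described above, the sketch below puts preprocessing and a logistic regression into one sklearn pipeline and validates it with cross-validation. It is not the final model of this report: the data are synthetic, and the two feature columns, the target rule, and all settings are assumptions chosen only to make the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the hotel data (hypothetical values)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "lead_time": rng.integers(0, 400, size=n),
    "deposit_type": rng.choice(["No Deposit", "Non Refund"], size=n),
})
# Synthetic target: longer lead times are made to cancel more often
y = (X["lead_time"] + rng.normal(0, 100, size=n) > 200).astype(int)

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["lead_time"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["deposit_type"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated AUC; preprocessing is refit inside each fold
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Keeping the preprocessing inside the pipeline means it is fitted only on the training part of each fold, which avoids leaking information from the validation data.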

## 4. Discussion & Conclusions


*In this section you should provide a general overview of **your final model**, its **performance**, and **reliability**.* 

Your report must include the following:

* Some discussion of the features that are most important for predicting a cancelation - we do not need discussion of specific coefficient values but direction of the effect should be clear (e.g. the earlier a booking is made the more likely it is to be canceled).

* A validated assessment of your model's performance, but this must be specifically discussed in the context of bookings and running a hotel. 

* It is not sufficient to report summary statistics like the accuracy or AUC - you must address the performance in terms of potential gains and losses for the hotel (e.g. think about what happens if your model predicts a cancelation that does not actually occur and a room ends up being double booked or vice versa). 

* Explain why you think your particular model would or would not be economically viable.
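One way to frame performance in terms of gains and losses is to attach a cost to each type of error and total them over a validation set. The sketch below does this with hypothetical labels and hypothetical per-booking costs (`cost_false_positive` and `cost_false_negative` are invented figures, not estimates for any real hotel).

```python
import numpy as np

# Hypothetical labels: 1 = canceled, 0 = not canceled
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Hypothetical per-booking costs (assumed for illustration only)
cost_false_positive = 150.0  # predicted cancel, guest arrives: room may be double booked
cost_false_negative = 80.0   # missed cancellation: room may stay empty

# Count each type of error
false_positives = int(np.sum((y_pred == 1) & (y_true == 0)))
false_negatives = int(np.sum((y_pred == 0) & (y_true == 1)))

total_cost = (false_positives * cost_false_positive
              + false_negatives * cost_false_negative)
print(false_positives, false_negatives, total_cost)
```

Comparing this total against the cost of a baseline policy (for example, never overbooking) gives a more direct view of economic viability than accuracy alone.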

## 5. References

*In this section, you should present a list of external sources (except the course materials) that you used during the project, if any*

- Additional data sources can be cited here, along with related Python documentation and any other web sources that you benefited from.