<a href="https://colab.research.google.com/github/Samarjeet-singh-chhabra/EDA-Hotel-Booking-Analysis/blob/main/Hotel_Booking_Analysis_Samarjeet_Singh_Chhabra.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement


Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions!

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.

Explore and analyze the data to discover important factors that govern the bookings.

# **Let's start our work**

## Adding required libraries

In [40]:
# Importing Libraries
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


## Adding the CSV file




In [None]:
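# Google Drive sharing link for the dataset
url = "https://drive.google.com/file/d/1EGYfR6Q0LIN7DWJg9rp0WMwVcW9F9UG4/view?usp=sharing"
# Build a direct-download link from the file ID (second-to-last path segment)
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]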
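
The link rewrite above works by pulling the file ID out of the sharing URL. A minimal sketch of the same step as a reusable helper, assuming the sharing link has the shape `.../file/d/<file_id>/view?...`; the function name `drive_share_to_download_url` is hypothetical and not part of the original notebook.

```python
def drive_share_to_download_url(share_url: str) -> str:
    """Turn a Google Drive sharing link into a direct-download link.

    Assumes the link looks like .../file/d/<file_id>/view?..., so the
    file ID is the second-to-last segment when splitting on '/'.
    """
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id


share_url = "https://drive.google.com/file/d/1EGYfR6Q0LIN7DWJg9rp0WMwVcW9F9UG4/view?usp=sharing"
download_url = drive_share_to_download_url(share_url)
print(download_url)
```

If the sharing link has a different shape, the index `-2` would pick the wrong segment, so it is worth printing the result once before passing it to `pd.read_csv`.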

## Reading the CSV file

In [None]:
# Loading the CSV into a DataFrame named hotels_df
hotels_df = pd.read_csv(url)

## Exploring the Dataset

In [None]:
#Checking the shape of the dataset
hotels_df.shape

In [None]:
#Checking basic information of all columns of the dataset
hotels_df.info()

In [None]:
# Looking at the min, max, mean, etc. By default describe() covers only numeric columns; with include='all', categorical columns show NaN for mean, 25%, 50%, 75% and max.
hotels_df.describe()
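
Note that `describe()` summarizes only numeric columns unless `include='all'` is passed. A small sketch of the difference on a toy frame; `toy_df` and its values are made up for illustration and stand in for `hotels_df`.

```python
import pandas as pd

# Toy frame standing in for hotels_df (values are illustrative only)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel'],
    'lead_time': [10, 200, 35],
})

numeric_summary = toy_df.describe()            # numeric columns only
full_summary = toy_df.describe(include='all')  # also includes text columns

print(list(numeric_summary.columns))
print(full_summary.loc[['unique', 'top', 'mean']])
```

In the full summary, the text column gets `unique`, `top` and `freq`, while its `mean` and quartile rows are NaN.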

In [None]:
#printing first 5 rows of dataset
hotels_df.head()

In [None]:
#printing last 5 rows of dataset
hotels_df.tail()

## Checking the kinds of values in some columns (unique values)
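
One possible way to run this check is sketched below on a toy frame. `toy_df` and its values are assumptions for illustration; on the real data the same calls would be applied to `hotels_df`.

```python
import pandas as pd

# Toy frame standing in for hotels_df (values are illustrative only)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel'],
    'meal': ['BB', 'HB', 'BB'],
    'lead_time': [10, 200, 35],
})

# Number of distinct values in every column
distinct_counts = toy_df.nunique()

# The distinct values themselves, for the non-numeric columns only
unique_values = {
    col: toy_df[col].unique().tolist()
    for col in toy_df.select_dtypes(exclude='number').columns
}

print(distinct_counts)
print(unique_values)
```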


# Cleaning the Dataset

In [32]:
# Creating a copy of the dataframe so the original stays untouched
hotels_df_copy = hotels_df.copy()

In [19]:
# Counting duplicate rows, if any
hotels_df_copy[hotels_df_copy.duplicated()].shape 

(31994, 32)

In [31]:
# Dropping duplicate rows
hotels_df_copy.drop_duplicates(inplace = True)
hotels_df_copy.shape

(87333, 30)
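
To make the count-then-drop step concrete, here is a minimal sketch on a toy frame (`toy_df` and its values are assumptions, not rows from the dataset). `duplicated()` flags every row that repeats an earlier one, so the number of flagged rows equals the number of rows that `drop_duplicates()` removes.

```python
import pandas as pd

# Toy frame with two repeats of the first row (values are illustrative only)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'City Hotel'],
    'lead_time': [10, 10, 200, 10],
})

n_duplicates = int(toy_df.duplicated().sum())  # rows repeating an earlier row
deduped = toy_df.drop_duplicates()             # keeps the first occurrence

print(n_duplicates, deduped.shape)
```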

In [21]:
# Checking for null values in each column
hotels_df.isna().sum(axis = 0)

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

## Agent, company, children and country have the most null values; checking and fixing them

In [None]:
# Checking the content of the agent column
agent_list = hotels_df['agent'].tolist()
list(agent_list)

In [None]:
# Checking the content of the company column
company_list = hotels_df['company'].tolist()
list(company_list)

In [None]:
# Checking the content of the country column
country_list = hotels_df['country'].tolist()
list(country_list)

In [None]:
# Checking the content of the children column
children_list = hotels_df['children'].tolist()
list(children_list)
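
Printing whole columns as lists is hard to read on a dataset of this size. A more compact alternative, sketched here on a toy frame, is to tabulate null counts and shares per column and use `value_counts(dropna=False)` for a single column. `toy_df` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for hotels_df (values are illustrative only)
toy_df = pd.DataFrame({
    'agent': [9.0, np.nan, 240.0, np.nan],
    'country': ['PRT', 'GBR', None, 'PRT'],
    'children': [0.0, 1.0, 0.0, 0.0],
})

# Null count and percentage of nulls for every column
null_summary = pd.DataFrame({
    'null_count': toy_df.isna().sum(),
    'null_pct': (toy_df.isna().mean() * 100).round(1),
})

# Frequency of each value in one column, with NaN counted as its own entry
agent_counts = toy_df['agent'].value_counts(dropna=False)

print(null_summary)
print(agent_counts)
```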

## Fixing nulls in children, country, agent and company


In [33]:
# Handling null values (assigning the result back, since chained inplace fills trigger warnings in newer pandas versions)
hotels_df_copy['children'] = hotels_df_copy['children'].fillna(0)
hotels_df_copy['country'] = hotels_df_copy['country'].fillna('Others')
hotels_df_copy['agent'] = hotels_df_copy['agent'].fillna(0)

# Dropping two columns that are not needed for this analysis
hotels_df_copy.drop(labels = ['previous_bookings_not_canceled','company'], axis=1, inplace = True)
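
The three fills can also be written as one `fillna` call with a per-column mapping, assigning the result back to the frame. A minimal sketch on a toy frame; `toy_df`, `fill_values` and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for hotels_df_copy (values are illustrative only)
toy_df = pd.DataFrame({
    'children': [0.0, np.nan, 2.0],
    'country': ['PRT', None, 'GBR'],
    'agent': [9.0, np.nan, 240.0],
})

# One replacement value per column, mirroring the choices made above
fill_values = {'children': 0, 'country': 'Others', 'agent': 0}
toy_df = toy_df.fillna(value=fill_values)

print(toy_df)
```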

In [None]:
# Checking whether any null values remain in the dataframe
hotels_df_copy.isnull().sum()

## Converting columns to the required datatypes

In [36]:
# Converting datatype from float to int.
hotels_df_copy[['children', 'agent']] = hotels_df_copy[['children', 'agent']].astype('int64')

In [41]:
# Changing the datatype of column 'reservation_status_date' to datetime.
hotels_df_copy['reservation_status_date'] = pd.to_datetime(hotels_df_copy['reservation_status_date'], format = '%Y-%m-%d')
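
Once the column is a datetime, its parts are available through the `.dt` accessor, which is handy for grouping by year or month later. A short sketch on a toy frame; `toy_df` and the two dates are assumptions for illustration.

```python
import pandas as pd

# Toy frame with date strings in the same YYYY-MM-DD layout (illustrative only)
toy_df = pd.DataFrame({'reservation_status_date': ['2015-07-01', '2016-12-31']})

toy_df['reservation_status_date'] = pd.to_datetime(
    toy_df['reservation_status_date'], format='%Y-%m-%d'
)

# Extracting parts of the date through the .dt accessor
status_year = toy_df['reservation_status_date'].dt.year
status_month = toy_df['reservation_status_date'].dt.month_name()

print(status_year.tolist(), status_month.tolist())
```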

## Adding important columns.

In [43]:
# Adding total days of stay in hotels.
hotels_df_copy['total_stay'] = hotels_df_copy['stays_in_weekend_nights']+hotels_df_copy['stays_in_week_nights']

# Adding total number of guest as column.
hotels_df_copy['total_people'] = hotels_df_copy['adults']+hotels_df_copy['children']+hotels_df_copy['babies']
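
After deriving these columns, it may be worth checking for rows where the total stay or the total number of guests is zero, since such bookings could distort later averages. The sketch below shows the check on a toy frame; `toy_df`, its values and the name `suspicious` are assumptions for illustration, and whether such rows exist in the real data is not established here.

```python
import pandas as pd

# Toy frame standing in for hotels_df_copy (values are illustrative only)
toy_df = pd.DataFrame({
    'stays_in_weekend_nights': [2, 0, 1],
    'stays_in_week_nights': [3, 0, 4],
    'adults': [2, 0, 1],
    'children': [1, 0, 0],
    'babies': [0, 0, 0],
})

# Same derived columns as above
toy_df['total_stay'] = toy_df['stays_in_weekend_nights'] + toy_df['stays_in_week_nights']
toy_df['total_people'] = toy_df['adults'] + toy_df['children'] + toy_df['babies']

# Rows with no nights or no guests
suspicious = toy_df[(toy_df['total_stay'] == 0) | (toy_df['total_people'] == 0)]

print(suspicious)
```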

# **Final dataset check before finding insights**

In [None]:
hotels_df_copy.info()

# **Getting insights from the data**

**Now we will get valuable and meaningful insights out of our dataframe while keeping the problem statement in mind.**
