<a href="https://colab.research.google.com/github/AdityaSingh1907/EDA_Capstone-Project/blob/main/EDA_on_AirBnb_Bookings_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Aditya Singh


# **Project Summary -**

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalised way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognised around the world. Analysing the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analysed and used for security, business decisions, understanding customers' and hosts' behaviour and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.
The goal is to explore and analyse the data to discover key insights.

# **GitHub Link -**

https://github.com/AdityaSingh1907/EDA_Capstone-Project

# **Problem Statement**


1. Find the neighbourhood group with the highest share of listings.
2. Show the distribution of prices across neighbourhoods according to the room category.
3. Analyse the average price based on room type.
4. Find the relationship between neighbourhood group and availability.
5. Calculate the number of guests by month of the year.


#### **Define Your Business Objective?**

The short-term rental market has seen a significant increase in popularity in recent years, with Airbnb being one of the leading platforms in this industry. However, the industry is highly competitive, and property owners and managers must navigate a complex market to optimize their revenue and occupancy rates. In this context, there is a need for a comprehensive analysis of Airbnb booking data to identify key trends, patterns, and factors that influence short-term rental prices and occupancy rates. By understanding these factors, property owners and managers can make informed decisions about their pricing strategies, property listing optimization, and other business decisions that can impact their success in the short-term rental market. Therefore, the problem statement for this project is to conduct an Airbnb booking analysis to identify trends, patterns, and factors that influence the short-term rental market, with the goal of providing insights and recommendations for property owners and managers to optimize their revenue and occupancy rates.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset

file_path = '/content/drive/MyDrive/Colab Notebooks/Airbnb NYC 2019 (1).csv'
df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

df.shape

### Dataset Information

In [None]:
# Dataset Info

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print(df.isnull().sum())
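
Raw null counts are easier to judge next to the share of rows they represent. The snippet below is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration; only the column names mirror the dataset), showing one way to report both figures side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Airbnb data (not the real dataset).
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "price": [150, 80, 120, 60],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# Number of nulls and percentage of nulls per column, largest share first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing)
```

Applied to the notebook, the same two expressions would run on `df` instead of `toy`.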

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar=False)


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns

In [None]:
# Dataset Describe

df.describe(include='all')

### Variables Description



*   id : a unique id identifying an Airbnb listing
*   name : name describing the accommodation
*   host_id : a unique id identifying an Airbnb host
*   host_name : name under which the host is registered
*   neighbourhood_group : a group of neighbourhoods (a wider area)
*   neighbourhood : area that falls under a neighbourhood_group
*   latitude : latitude coordinate of the listing
*   longitude : longitude coordinate of the listing
*   room_type : category of the listed room
*   price : price of the listing
*   minimum_nights : the minimum number of nights required for a single stay
*   number_of_reviews : total count of reviews given by visitors
*   last_review : date of the last review given
*   reviews_per_month : rate of reviews given per month
*   calculated_host_listings_count : total number of listings registered under the host
*   availability_365 : the number of days in a year for which the listing is available

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

df.isnull().sum().sort_values(ascending= False)


In [None]:
df.fillna(0, inplace=True)
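
Filling every null with 0 also writes 0 into text and date columns such as `name` and `last_review`. A possible alternative, sketched here on a small hypothetical frame (the values are made up; only the column names come from the dataset), fills each column with a value of its own type and leaves missing dates as `NaT`. The `"Unknown"` placeholder is an assumption, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real data is loaded from the CSV above.
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "host_name": ["Ana", "Ben", None],
    "last_review": ["2019-05-21", None, "2018-11-03"],
    "reviews_per_month": [0.5, np.nan, 1.2],
})

# Fill per column: text columns get a label, the review rate gets 0.
# Columns that are not listed (last_review) are left untouched.
filled = toy.fillna({
    "name": "Unknown",
    "host_name": "Unknown",
    "reviews_per_month": 0,
})

# Parse dates; a missing review date stays NaT instead of becoming 0.
filled["last_review"] = pd.to_datetime(filled["last_review"], errors="coerce")

print(filled)
```

With this approach the later date handling would only need to drop `NaT` rows rather than look for a sentinel year.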

In [None]:
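# we're excluding latitude & longitude as they are coordinates, and id & host_id as they're unique ids
# a list is used because recent pandas versions do not accept a set as a column indexer
col_after_excluding = [col for col in df.columns if col not in {'latitude', 'longitude', 'id', 'host_id'}]
df[col_after_excluding].describe()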

In [None]:
dist_col_list = df[col_after_excluding].describe().columns.tolist()

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(24, 10))
axes = axes.flatten()
for col, ax in zip(dist_col_list, axes):
    sns.histplot(x=col, data=df, ax=ax, kde=True, element='poly')
    ax.set_title(f'Column {col} skewness : {df[col].skew()}')

plt.tight_layout(h_pad=0.4, w_pad=0.7)

From the distributions of the filtered numerical columns, we can conclude that all of them, including price, are positively skewed. Availability, however, is spread fairly uniformly across the days of the year, which suggests that listings of all kinds are available throughout the year.
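
A common way to work with positively skewed columns such as price is a log transform. The following is a minimal sketch on hypothetical prices (the numbers are invented for illustration), comparing skewness before and after `np.log1p`; it is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices with two expensive outliers.
price = pd.Series([40, 55, 60, 75, 90, 100, 120, 150, 400, 2000], name="price")

# Skewness of the raw values and of the log-transformed values.
raw_skew = price.skew()
log_skew = np.log1p(price).skew()

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

On data like this the transformed column is still right-skewed, but noticeably less so, which tends to make histograms easier to read.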

In [None]:
plt.figure(figsize=(10, 6))
heatmap = sns.heatmap(df[dist_col_list].corr(), linewidths=0, vmin=-1, annot=True, cmap="YlGnBu")
plt.show()

number_of_reviews and reviews_per_month are highly correlated, for obvious reasons. The price column, however, has very low correlation with the other features. Let's check how it differs from or relates to them.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(23, 11))
ax = axes.flatten()

sns.set_theme(style="white")
sns.scatterplot(data=df, x='longitude', y='latitude', hue='neighbourhood_group', ax=ax[0]);
ax[0].set_title('Location of neighbourhood groups')
sns.scatterplot(data=df[df['price'] < 300], x='longitude', y='latitude', hue='price', size="price", sizes=(20, 60), palette='GnBu_d', ax=ax[1])
ax[1].set_title('Variation of Price based on Location ($0 - 300)')
sns.scatterplot(data=df, x='longitude', y='latitude', hue='number_of_reviews', size="number_of_reviews", sizes=(20, 150), palette='GnBu_d', hue_norm=(0, 5), ax=ax[2])
ax[2].set_title('Variation of number of reviews given based on location')
sns.scatterplot(data=df, x='longitude', y='latitude', hue='availability_365', style="room_type", palette='GnBu_d', ax=ax[3])
ax[3].set_title('Availability in terms of Room Type')
plt.show()

### What all manipulations have you done and insights you found?

1. The first plot shows the location of the neighbourhood groups of New York City, which is the area our dataset covers.

2. In the second plot, we considered only listings priced up to USD 300, since the 75th percentile of price is around USD 175. It shows how prices vary across the city. The south of Manhattan and the north of Brooklyn are among the expensive areas of New York.

3. In the third plot, we can see the review count rising towards the outskirts of the city.

4. In the last plot, we tried to visualize availability by room type. Although availability is well spread across room types, there is still a pattern in which the heart of New York stays the busiest, or booked most of the time.
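
The price cut-off used in the second plot can also be derived from the data instead of being chosen by eye. The sketch below uses hypothetical prices and the conventional 1.5 × IQR upper fence; both the sample values and the choice of fence are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical listing prices in USD, including one extreme value.
price = pd.Series([50, 60, 75, 90, 100, 120, 150, 175, 300, 5000])

# Quartiles and the usual upper fence for flagging outliers.
q1, q3 = price.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Keep only listings at or below the fence for plotting.
trimmed = price[price <= upper_fence]

print(f"Q3 = {q3}, upper fence = {upper_fence}, rows kept = {len(trimmed)}")
```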

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Find the highest share of listings in Neighbourhood Group.

In [None]:
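# Chart - 1 visualization code
# Neighbourhood Group

# Dependent Variable Column Visualization
# The title is passed to plot() because pandas draws the pie on a new figure,
# so a plt.title() call made before plotting would not land on the pie chart.
df.neighbourhood_group.value_counts().plot(kind='pie',
                              figsize=(13,7),
                              autopct="%1.1f%%",
                              startangle=180,
                              shadow=True,
                              title="Neighbourhood Group",
                              )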

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to explain through the area each category covers in a circle, using different colours, and it is frequently used wherever such comparisons matter. I therefore used a pie chart, which helped me compare the percentage share of each value of the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

The pie chart above shows that most Airbnb listings in New York are in Manhattan and Brooklyn, which together hold the highest share of listings.
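
The percentages drawn on the pie can also be computed directly, which makes the shares easy to quote. This is a minimal sketch on a hypothetical series (the counts are invented and do not reflect the real dataset):

```python
import pandas as pd

# Hypothetical neighbourhood_group values; the real column has about 49,000 rows.
groups = pd.Series(
    ["Manhattan"] * 5 + ["Brooklyn"] * 4 + ["Queens"] * 1,
    name="neighbourhood_group",
)

# Share of listings per group, in percent.
share = groups.value_counts(normalize=True).mul(100).round(1)
top_group = share.idxmax()

print(share)
print("Largest share:", top_group)
```

In the notebook the same call would be made on `df['neighbourhood_group']`.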

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No, this chart does not reveal any insight that points to negative growth.

#### Chart - 2 - Distribution of Prices in neighbourhoods according to the Room Category.

In [None]:
# Chart - 2 visualization code

fig = plt.figure(figsize=(24, 6))
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False)
sns.lineplot(data=df, x='neighbourhood', y='price', hue='room_type')
plt.title('Distribution of Prices in neighbourhoods')
sns.despine(fig)

##### 1. Why did you pick the specific chart?

The chart used here is a line plot, which is commonly used to display trends and the relationship between two variables. In this case, it shows how price varies across neighbourhoods while also highlighting the differences between room types.

##### 2. What is/are the insight(s) found from the chart?

Clearly, the Entire home/apt room type maintains a higher price range in almost all neighbourhoods.
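
One way to check this claim numerically is to pivot prices by neighbourhood and room type and count how often entire homes come out on top. The sketch below runs on hypothetical rows (neighbourhood names are real places, but the prices are invented) and uses the median as the summary statistic, which is an assumption rather than the notebook's choice.

```python
import pandas as pd

# Hypothetical listings; prices are made up for illustration.
toy = pd.DataFrame({
    "neighbourhood": ["Harlem", "Harlem", "Harlem", "Astoria", "Astoria", "Astoria"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [200, 80, 180, 120, 60, 70],
})

# Median price for every neighbourhood / room type combination.
median_price = toy.pivot_table(
    index="neighbourhood", columns="room_type", values="price", aggfunc="median"
)

# Fraction of neighbourhoods where entire homes cost more than private rooms.
entire_higher = (median_price["Entire home/apt"] > median_price["Private room"]).mean()

print(median_price)
print("Share of neighbourhoods where entire homes cost more:", entire_higher)
```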

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Chart - 3 - Analyse The Avg Price Based on Room Type.



In [None]:
# Chart - 3 visualization code

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(22, 6))
ax = axes.flatten()

mean_price_df = df.groupby('room_type', as_index=False)[['price']].mean()
sns.barplot(data=mean_price_df, x='room_type', y='price', palette='GnBu_d', ax=ax[0])
ax[0].set_title("Avg Price vs Rooms type");

labels = df['room_type'].value_counts().index
sizes = df['room_type'].value_counts().values
ax[1].pie(sizes, labels=labels, autopct='%1.1f%%', colors = ['#009999','#007399','#20B2AA'])
ax[1].set_title('Proportion of Room Types')

sns.countplot(data=df, x='room_type', ax=ax[2])
ax[2].set_title('Room Type Counts')

sns.despine(fig)
plt.tight_layout(h_pad=0.5, w_pad=0.8)


##### 1. Why did you pick the specific chart?

The first plot is a bar plot. It helps compare the average price across different room types.

The second plot is a pie chart that shows the proportion of each room type. It helps in understanding the relative frequency of the room types in the dataset.

The third plot is a count plot. It helps in understanding the absolute frequency of each room type in the dataset.

##### 2. What is/are the insight(s) found from the chart?

These visualizations provide insights into the distribution and frequency of the different room types in the dataset, as well as the average price of each room type. These insights could be useful for businesses in the hospitality industry to understand market trends and adjust their pricing and marketing strategies accordingly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can ultimately help businesses optimize their operations and improve customer satisfaction, leading to positive business outcomes such as increased revenue and customer loyalty.
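
Because price is right-skewed, the mean per room type can be pulled up by a few expensive listings. A small sketch on hypothetical prices (values invented for illustration) shows how the mean, median and count can be reported together:

```python
import pandas as pd

# Hypothetical listings; the 1000 USD entry plays the role of an outlier.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 3 + ["Private room"] * 3 + ["Shared room"] * 2,
    "price": [100, 150, 1000, 50, 70, 90, 30, 40],
})

# Mean, median and number of listings for each room type.
summary = toy.groupby("room_type")["price"].agg(["mean", "median", "count"])

print(summary)
```

Where the mean sits far above the median, as for the entire-home rows here, the median is usually the safer figure to quote as a typical price.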

#### Chart - 4 - Find The Relationship between 'Neighbourhood_group' and 'Availability_365'

In [None]:
# Chart - 4 visualization code

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(24, 8))
ax = axes.flatten()

sns.lineplot(data=df, x='neighbourhood_group', y='availability_365', hue='room_type', ax=ax[0])
ax[0].set_title('Room Availability throughout Neighbourhood/Room Type')
sns.scatterplot(data=df[df['price'] < 500], x="availability_365", y='price', hue='room_type', alpha=.9, palette="muted", ax=ax[1])
ax[1].set_title('Price vs Availability (Range $ 0 - 500)')

sns.countplot(data=df[df['availability_365']  == 365], x='neighbourhood_group', hue='room_type', palette='GnBu_d', ax=ax[2])
ax[2].set_title('Property Available 365 days')
sns.despine(fig)


##### 1. Why did you pick the specific chart?

I have used three types of chart in this visualization:

1. Line plot: it shows the trend of room availability across the neighbourhood groups for each room type.

2. Scatter plot: it shows the relationship between the price and the availability of the properties.

3. Count plot: it shows the count of properties that are available throughout the year in each neighbourhood group, by room type.



##### 2. What is/are the insight(s) found from the chart?

Using these graphs, we gained insights into the availability and price of properties in different neighbourhood groups by room type. These insights can help businesses make data-driven decisions by understanding the demand for and supply of properties in each area.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can ultimately lead to positive business impact, such as increased occupancy rates and revenue.
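
The count plot of year-round listings can be complemented by a proportion, so that large and small neighbourhood groups are comparable. The following is a sketch on hypothetical rows (the availability values are made up):

```python
import pandas as pd

# Hypothetical listings with their availability in days per year.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Queens", "Queens"],
    "availability_365": [365, 0, 120, 365, 365, 30, 0, 200],
})

# Fraction of listings in each group that are available all 365 days.
full_year_share = (
    toy["availability_365"].eq(365)
    .groupby(toy["neighbourhood_group"])
    .mean()
)

print(full_year_share)
```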

#### Chart - 5 - Calculate No of Guests in months of the year






In [None]:
# Chart - 5 visualization code
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')

df['Days'], df['Month'], df['Years'] = (df['last_review'].dt.day, df['last_review'].dt.month, df['last_review'].dt.year)
df['last_review'] = pd.to_datetime(df['last_review']).dt.date


In [None]:
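# missing last_review values were filled with 0 earlier, so after the date conversion
# they become the epoch date 1970-01-01 (or NaT if they fail to parse)
# we remove both kinds of rows before plotting
filtered_df = df[df['Years'].notna() & (df['Years'] != 1970)]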

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(22, 6))
ax = axes.flatten()

sns.histplot(data=filtered_df, x='Days', hue='room_type', multiple="stack", ax=ax[0])
ax[0].set_title('No of Guest in days of the months')
sns.histplot(data=filtered_df, x='Month', hue='room_type', multiple="stack", ax=ax[1])
ax[1].set_title('No of Guest in months of the years')
sns.despine(fig, left=True)

##### 1. Why did you pick the specific chart?

The histogram is a popular graphing tool. It is used to summarize discrete or continuous data that are measured on an interval scale. It is often used to illustrate the major features of the distribution of the data in a convenient form. It is also useful when dealing with large data sets (greater than 100 observations). It can help detect any unusual observations (outliers) or any gaps in the data.

Thus, I used it to display the distribution of the number of guests by day of the month and by month of the year, split by room type. This plot helps in understanding how guests are distributed across the month and the year for the different room types.

##### 2. What is/are the insight(s) found from the chart?

We derived the day and month from the last_review date. This is not fully accurate, since many guests probably choose not to leave a review (an assumption). Still, we can see a trend in which the first and last days of the month have the most guests. There is also a surge in guest count in the middle of the year, in June.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

With the insights gained from these plots, businesses can adjust their pricing strategy based on the demand for different room types in different months of the year. They can also identify peak and off-peak periods and plan their operations accordingly.
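
The monthly pattern can also be read from a plain table of counts. This is a minimal sketch on hypothetical review dates (invented for illustration), using the last review month as a rough proxy for guest activity, as the chart above does:

```python
import pandas as pd

# Hypothetical last_review values, including one missing date.
raw_dates = pd.Series(["2019-06-23", "2019-06-30", "2019-01-01", None, "2018-06-15"])
dates = pd.to_datetime(raw_dates, errors="coerce")

# Count reviews per calendar month, ignoring rows without a date.
month_counts = dates.dt.month.dropna().astype(int).value_counts().sort_index()
peak_month = month_counts.idxmax()

print(month_counts)
print("Peak month:", peak_month)
```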

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# numeric_only=True restricts the correlation to numeric columns (text/date columns are skipped)
corr = df.corr(numeric_only=True)
cmap = sns.diverging_palette(5, 250, as_cmap=True)

def magnify():
    return [dict(selector="th",
                 props=[("font-size", "7pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')])
]

# Styler.set_precision was removed in pandas 2.0; format(precision=2) replaces it
corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format(precision=2)\
    .set_table_styles(magnify())

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between all the variables along with the correlation coefficients, I used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

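The correlation heatmap above covers the numeric columns id, host_id, latitude, longitude, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count and availability_365. As noted in the data wrangling section, number_of_reviews and reviews_per_month are highly correlated, while price has very low correlation with the other features.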
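
To read the strongest relationships off a correlation matrix without scanning every cell, the upper triangle can be flattened and sorted. The sketch below uses a small hypothetical frame (values invented; `numeric_only=True` assumes pandas 1.5 or newer):

```python
import numpy as np
import pandas as pd

# Hypothetical listings; only the column names mirror the dataset.
toy = pd.DataFrame({
    "number_of_reviews": [0, 5, 10, 20, 40, 80],
    "reviews_per_month": [0.0, 0.3, 0.6, 1.1, 2.2, 4.1],
    "price": [150, 60, 300, 90, 120, 75],
    "room_type": ["Private room", "Shared room", "Entire home/apt",
                  "Private room", "Entire home/apt", "Private room"],
})

# Correlation between numeric columns only.
corr = toy.corr(numeric_only=True)

# Keep each pair once (upper triangle, no diagonal) and sort by strength.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

The first entry of `pairs` is the most strongly correlated pair of columns, regardless of sign.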

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df[['price', 'calculated_host_listings_count', 'number_of_reviews', 'minimum_nights']])

##### 1. Why did you pick the specific chart?

Pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps to form some simple classification models by drawing some simple lines or make linear separation in our data-set.

Thus, I used a pair plot to analyse the patterns in the data and the relationships between the features. It conveys much the same information as the correlation map, but in graphical form.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Businesses can invest in neighbourhoods that have more demand for certain room types, or offer promotions during off-peak seasons when demand is lower.

# **Conclusion**

This Airbnb dataset is rich in data but not in features. From the analysis above, we can conclude that:

* Most visitors don't prefer shared rooms; they tend to choose a private room or an entire home.
* Manhattan and Brooklyn are the two most distinguished, expensive and posh areas of NY.
* Although the location of a property has a strong influence on its price, a property in a popular location will not necessarily stay occupied most of the time.
* Performing a regression on this dataset may result in a high error rate, as the features it provides are poor predictors of a property's valuation.
* We can see this by looking at the correlation heatmap. We would need more features such as bedrooms, bathrooms, property age (which we guess would be a very important one), the tax rate applicable on the land, extra room amenities, and distance to the nearest hospital, stores or schools. These features might be strongly related to price.
* We could use a time series analysis to predict the occupancy rate at a particular time of the month or a particular time of the season.
* It would be better if we had the average guest rating of each property. That would help in understanding the property and could also be a factor in deciding the price (a low-rated property tends to lower its price).




### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***