<a href="https://colab.research.google.com/github/Ank339/MODULE-2-project/blob/main/Airbnb_booking_Analyses_By_Ankit_Sharma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airbnb Booking Analyses



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analyzing the data from the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.
This dataset has around **49,000 observations** and **16 columns**, with a mix of categorical and numeric values.



# **GitHub Link -**

# **Problem Statement**



Let's explore and analyze the dataset to find some insights. A few guiding questions are listed below:

1. What can we learn about different hosts and areas?
2. What can we learn from room types and their prices by area?
3. What can we learn from the data (e.g., locations, prices, reviews)?
4. Which hosts are the busiest, and why?
5. Which hosts charge the highest prices?
6. Is there any traffic difference among different areas, and what could be the reason for it?
7. What is the correlation between different variables?
8. What is the room count across NYC by listed room type?

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


### Mounting drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
airbnb_df = pd.read_csv('/content/drive/MyDrive/module 2/Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
airbnb_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airbnb_df.shape

### Dataset Information

In [None]:
airbnb_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

len(airbnb_df[airbnb_df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(airbnb_df.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(airbnb_df.isnull(), cbar=True)

### What did you know about your dataset?

The dataset comes from the travel industry: Airbnb is a home-sharing site that connects travelers with hosts who offer unique places to stay, for both long-term and short-term stays.
Our main goal is to optimize revenue and identify the best rental opportunities. By analyzing the data, hosts can understand market trends and competition, set optimal pricing strategies, and adjust their rental offerings to meet demand.

The dataset has 48,895 rows and 16 columns. There are no duplicate rows in the dataset.
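
The null counts printed earlier can also be expressed as a share of all rows, which makes it easier to judge whether a column is worth keeping. The following is a minimal sketch on a made-up toy frame; `toy_df` and `missing_summary` are hypothetical names introduced here for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for airbnb_df; all values are invented for illustration.
toy_df = pd.DataFrame({
    "name": ["Cozy room", None, "Loft", "Studio"],
    "host_name": ["Ann", "Bob", None, "Dee"],
    "price": [100, 150, 80, 200],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of nulls per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually contain nulls.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(toy_df)
print(summary)
```

Calling `missing_summary(airbnb_df)` on the real data should give the same kind of table for the full dataset.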


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb_df.columns

In [None]:
# Dataset Describe
airbnb_df.describe()

### Variables Description


* id                   - Unique listing ID
* name                 - Name of the listing
* host_id              - Unique host ID
* host_name            - Name of the host
* neighbourhood_group  - Borough (e.g., Manhattan, Brooklyn)
* neighbourhood        - Neighbourhood within the borough
* latitude             - Latitude of the listing
* longitude            - Longitude of the listing
* room_type            - Type of room listed
* price                - Price of the listing
* minimum_nights       - Minimum number of nights per booking
* number_of_reviews    - Number of reviews
* last_review          - Date of the last review
* reviews_per_month    - Average number of reviews per month
* calculated_host_listings_count - Total number of listings by the host
* availability_365     - Number of days in the year the listing is available
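
A few of these descriptions can be checked directly against the data. The sketch below uses a made-up toy frame (`toy_df` is a hypothetical name) and assumes that `last_review` holds date strings, that `availability_365` is a day count, and that `minimum_nights` is a lower bound on the stay.

```python
import pandas as pd

# Toy stand-in for three columns of airbnb_df; values are invented.
toy_df = pd.DataFrame({
    "last_review": ["2019-05-21", None, "2018-10-19"],
    "minimum_nights": [1, 3, 30],
    "availability_365": [365, 0, 194],
})

# Parse the review date; missing or unparseable values become NaT.
toy_df["last_review"] = pd.to_datetime(toy_df["last_review"], errors="coerce")

# A day count per year should stay within 0..365.
availability_ok = toy_df["availability_365"].between(0, 365).all()

# A minimum stay should be at least one night.
min_nights_ok = (toy_df["minimum_nights"] >= 1).all()

print(toy_df.dtypes)
print("availability in range:", availability_ok)
print("minimum_nights valid:", min_nights_ok)
```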









### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for i in airbnb_df.columns.tolist():
  print("No. of unique values in ",i,"is",airbnb_df[i].nunique(),".")

## 3. ***Data Wrangling***

## Data Cleaning


In [None]:
# Creating a copy of the current dataset and assigning to df
df=airbnb_df.copy()

# Removing unnecessary Columns
df.drop(['latitude','longitude','last_review','reviews_per_month'],axis=1,inplace=True)
df.head(1)

### Data Wrangling Code

In [None]:
# 1. What can we learn about different hosts and areas?
hosts_areas = df.groupby(['host_name','neighbourhood_group'])['calculated_host_listings_count'].max().reset_index()
hosts_areas.sort_values(by='calculated_host_listings_count',ascending=False).head()


In [None]:
# 2.What we learn from room type and their prices according to area?
room_price_area_wise = df.groupby(['neighbourhood_group','room_type'])['price'].max().reset_index()
room_price_area_wise.sort_values(by='price', ascending=False).head(10)

In [None]:
# 3.What can we learn from Data? (ex: locations, prices, reviews, etc)
areas_reviews = df.groupby(['neighbourhood_group'])['number_of_reviews'].max().reset_index()
areas_reviews.sort_values(by='number_of_reviews',ascending=False)

In [None]:
# 4.Which hosts are the busiest and what's the reason?
busiest_hosts = df.groupby(['host_id','host_name','room_type'])['number_of_reviews'].max().reset_index()
busiest_hosts = busiest_hosts.sort_values(by='number_of_reviews',ascending=False).head(10)
busiest_hosts

In [None]:
# 5. Which Hosts are charging higher price?

Highest_price = df.groupby(['host_id','host_name','room_type','neighbourhood_group'])['price'].max().reset_index()
Highest_price = Highest_price.sort_values(by='price',ascending=False).head(10)
Highest_price

In [None]:
# 6. Is there any traffic difference among different areas and what could be the reason for it?
traffic_areas = df.groupby(['neighbourhood_group','room_type'])['minimum_nights'].count().reset_index()
traffic_areas = traffic_areas.sort_values(by='minimum_nights',ascending=False).head()
traffic_areas

In [None]:
# 7. Number of reviews according to price :

price_area = df.groupby(['price'])['number_of_reviews'].max().reset_index()
price_area.head(10)


### What all manipulations have you done and insights you found?

Answer Here - The following manipulations were carried out:
1. We learnt about different hosts and areas.
2. We learnt from room types and their prices by area.
3. We learnt from the data (e.g., locations, prices, reviews).
4. We learnt about the busiest hosts.
5. We found the hosts charging the highest prices.
6. We looked at the traffic difference among different areas.
7. We looked at the number of reviews by price.

Insights found:
1. Sonder (NYC) has the highest number of listings, located in Manhattan.
2. Entire home/apt is the most common room type in the Airbnb dataframe, and it also commands the highest prices.
3. The number of reviews in Queens and Manhattan is higher than in any other area.
4. The busiest hosts are Dona, Ji, etc.
5. The hosts charging the highest prices are Jelena, Kathrine, etc.
6. Traffic differs across areas: Manhattan, Brooklyn & Queens have most of the traffic.
7. Customers lean towards cheaper options, i.e. the 0 - 2000 price range.
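
Grouping on `host_name` alone can merge different hosts who share a name. As a hedged alternative, the sketch below counts listings per `host_id` on a made-up toy frame; `toy_df` and `listings_per_host` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-in: two different hosts are both called "Alex".
toy_df = pd.DataFrame({
    "host_id": [1, 1, 1, 2, 3, 3],
    "host_name": ["Alex", "Alex", "Alex", "Alex", "Sam", "Sam"],
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn",
                            "Queens", "Brooklyn", "Brooklyn"],
})

# Count rows per (host_id, host_name) so that same-named hosts stay separate.
listings_per_host = (
    toy_df.groupby(["host_id", "host_name"])
    .size()
    .reset_index(name="listing_count")
    .sort_values("listing_count", ascending=False)
)

print(listings_per_host)
```

In this toy data the two hosts named Alex are reported separately, with 3 listings and 1 listing, rather than as a single host with 4.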


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 To learn from room type and their prices according to area

# Hard-coded (neighbourhood_group, room_type) pairs used for this chart
neighbourhood_group = ['Brooklyn', 'Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Staten Island', 'Queens', 'Bronx', 'Queens', 'Bronx']
room_type = ['Entire home/apt', 'Entire home/apt', 'Private room', 'Private room', 'Private room', 'Entire home/apt', 'Entire home/apt', 'Private room', 'Shared room', 'Entire home/apt']

# Count how often each room type appears: start each type at 0 and
# add 1 every time it is seen in the room_type list.
room_dict = {}
for i in room_type:
    room_dict[i] = room_dict.get(i, 0) + 1

# Bar chart: room types (dictionary keys) on the x-axis,
# their counts (dictionary values) on the y-axis.
plt.bar(list(room_dict.keys()), list(room_dict.values()), color='green', edgecolor='blue')
plt.title('Room Types')
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.
Because room type is a categorical variable, a bar chart makes it easy to compare the count of each room type and to see at a glance which type is most common.

##### 2. What is/are the insight(s) found from the chart?

Answer Here - We found that **Entire home/apt** is the most common room type overall, and that prices for Entire home/apt are **high** in **Brooklyn** and **Manhattan**.

##### 3. Will the gained insights help creating a positive business impact?


Answer Here - This insight points to a **negative** impact on the business: in **Brooklyn**, **Manhattan** and other areas where **Entire home/apt** prices are high, customers looking for an affordable stay within their budget are less likely to book with Airbnb, which weakens demand in these areas.
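
The wrangling step compares areas by their maximum price, which a single extreme listing can dominate. One possible cross-check, sketched here on made-up toy data, is to compare the median and the maximum side by side; `toy_df`, `median_price` and `max_price` are hypothetical names, and using the median is a suggestion rather than the notebook's method.

```python
import pandas as pd

# Toy stand-in: one Manhattan listing is an extreme outlier.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Entire home/apt", "Private room", "Private room"],
    "price": [180, 200, 10000, 90, 150, 60, 70],
})

# Median price per area and room type (robust to outliers).
median_price = toy_df.pivot_table(
    index="neighbourhood_group", columns="room_type",
    values="price", aggfunc="median",
)

# Maximum price per area and room type (driven by the outlier).
max_price = toy_df.pivot_table(
    index="neighbourhood_group", columns="room_type",
    values="price", aggfunc="max",
)

print(median_price)
print(max_price)
```

In the toy data the Manhattan Entire home/apt maximum is 10000 while the median is 200, which shows how far apart the two summaries can be.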

#### Chart - 2

In [None]:
# Chart - 2 Which hosts are the busiest and what is the reason?

name_hosts = busiest_hosts['host_name']
review_got = busiest_hosts['number_of_reviews']

fig = plt.figure(figsize =(10,5))

plt.bar(name_hosts,review_got, color ='purple', width =0.5)
plt.xlabel('Name of the Host')
plt.ylabel('Review')
plt.title("Busiest Host in terms of reviews")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here - Bar charts compare a value across the levels of a categorical or nominal variable.
I used a bar chart to show the maximum number of reviews per host.

##### 2. What is/are the insight(s) found from the chart?

Answer Here - The busiest hosts are Dona, Ji, Maya, Carol and Danielle.
These hosts list Entire home/apt and Private room types, which most people prefer, and their review counts are also higher.

##### 3. Will the gained insights help creating a positive business impact?



Answer Here - Yes, above visualization can help create a positive business impact:

1. **Identify target audience**: By knowing that most people prefer entire homes and private rooms, businesses can target their marketing efforts towards those types of listings.
2. **Improve reviews**: The insights show that reviews are important for success. Businesses can focus on improving their reviews by providing excellent customer service and amenities.
3. **Increase bookings**: By understanding the preferences of their target audience and improving their reviews, businesses can increase their chances of getting bookings.
4. **Optimize pricing**: The insights can also be used to optimize pricing. Businesses can see what other hosts are charging for similar listings and adjust their prices accordingly.
5. **Make data-driven decisions**: The insights gained from the data can be used to make data-driven decisions about all aspects of the business, from marketing to operations.

By taking action on the insights gained from the data, businesses can improve their performance and achieve a positive business impact.

#### Chart - 3

In [None]:
# Chart - 3 Hosts are charging higher price.

name_of_host = Highest_price['host_name']
price_charge = Highest_price['price']

fig = plt.figure(figsize =(10,5))

plt.bar(name_of_host,price_charge, color ='brown', width =0.5)
plt.xlabel('Name of the Host')
plt.ylabel('Price')
plt.title("Hosts with maximum price charges")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here - I used a bar plot because it makes the maximum price per host easy to compare.

##### 2. What is/are the insight(s) found from the chart?

Answer Here - We found the **10 hosts** who charge the **maximum price**:
Jelena, Kathrine, Erin, Matt, Olson, Amy, Rum, Jessica, Sally, Jack

The maximum price is **10000 USD**.

##### 3. Will the gained insights help creating a positive business impact?



Answer Here - Yes, the gained insights can help create a positive business impact:

**Identify high-value hosts**:
 These hosts are likely to be experienced and reliable, attracting guests willing to pay a premium.
Understanding their qualities can help attract and retain similar high-value hosts.    
**Optimize pricing strategy**: Knowing the maximum price guests are willing to pay can inform pricing strategies.
Hosts can adjust their prices to maximize revenue while maintaining competitiveness.         
**Improve service quality**: Studying the qualities of these high-value hosts can reveal best practices.
Other hosts can learn from them to improve their own service quality and potentially increase their prices.   
**Targeted marketing**:These hosts may have a specific target audience or marketing strategy.
Understanding their approach can help develop more effective marketing campaigns.    
**Positive reputation**:   High-value hosts can contribute to the platform's reputation for quality and reliability.
This can attract more guests and hosts, leading to overall business growth.

#### Chart - 4

In [None]:
# Chart - 4  Number of Reviews VS Price

price_list = price_area['price']
review = price_area['number_of_reviews']
fig =plt.figure(figsize =(10,5))

plt.scatter(price_list, review)
plt.xlabel('Price')
plt.ylabel('Number of reviews')
plt.title('Number of Reviews VS Price')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here - A **scatter plot** uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an **individual data point**. Scatter plots are used to observe relationships between variables.

Thus, I used a scatter plot to depict the relationship between the **number of reviews** given by customers and the **price** they pay.

##### 2. What is/are the insight(s) found from the chart?

The visualization shows that most customers stay where **prices** are **lower**: listings priced **under 2000** have the **most** reviews.
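
To put numbers on this pattern, prices can be grouped into bands and the reviews totalled per band. The sketch below uses made-up toy data; `toy_df`, the band edges in `bins` and the names in `labels` are assumptions chosen for illustration, not values taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the price and review columns; values are invented.
toy_df = pd.DataFrame({
    "price": [50, 120, 450, 900, 2500, 8000],
    "number_of_reviews": [300, 150, 40, 10, 2, 0],
})

# Hypothetical price bands; each band includes its upper edge.
bins = [0, 100, 500, 2000, 10000]
labels = ["0-100", "101-500", "501-2000", "2001-10000"]
toy_df["price_band"] = pd.cut(
    toy_df["price"], bins=bins, labels=labels, include_lowest=True
)

# Total number of reviews collected in each price band.
reviews_by_band = toy_df.groupby("price_band", observed=False)["number_of_reviews"].sum()

print(reviews_by_band)
```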

##### 3. Will the gained insights help creating a positive business impact?


Answer Here - Yes, the above gained insights can help create a positive business impact:

**Set competitive pricing**: Businesses can use the data to set competitive pricing for their services. By offering prices that are lower than or comparable to the competition, businesses can attract more customers.    
**Improve customer experience:** The data shows that customers are more likely to leave reviews for places where they have had a positive experience. Businesses can use this information to identify areas where they can improve the customer experience, such as providing better customer service or offering more amenities.     
**Increase brand awareness:** The data shows that lower-priced listings collect more reviews than higher-priced ones. Businesses can use this information to increase brand awareness by promoting their positive reviews on social media and other online platforms.

#### Chart - 5

In [None]:
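# Chart - 5 Is there any traffic difference among different areas and what could be the reason for it.

# Combine area and room type into one label so that bars for the same
# room type in different areas do not overlap
areas_Traffic = traffic_areas['neighbourhood_group'] + ' - ' + traffic_areas['room_type']
# count() of minimum_nights gives the number of listings in each group
room_stayed = traffic_areas['minimum_nights']

fig = plt.figure(figsize =(10,5))

plt.bar(areas_Traffic,room_stayed, color ="blue", width = 0.2)

plt.xlabel("Area - Room Type")
plt.ylabel("Number of Listings")
plt.title("Traffic by Area and Room Type")
plt.xticks(rotation=45, ha='right')
plt.show()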

##### 1. Why did you pick the specific chart?

Answer Here - A bar graph compares a quantity across different groups, and it works best when the differences between groups are large.

Thus, I used a bar plot to compare traffic, measured as the count of minimum_nights entries, across the busiest area and room type combinations.

##### 2. What is/are the insight(s) found from the chart?

Answer Here - From this visualization we found that most people stay in **Entire home/apt and Private room** listings, with a preference for Entire home/apt. Traffic is much higher in **Manhattan, Brooklyn & Queens**, and visitors prefer rooms whose **listing price** is **lower**.

##### 3. Will the gained insights help creating a positive business impact?

Answer Here - Yes, the above insight can create a strong positive impact on the business:    
**Identify audience**: People looking for places to stay in Manhattan, Brooklyn, and Queens who are on a budget.     
**Highlighting the benefits of staying in an entire home/apt**: More space, privacy, and amenities.     
**Offer competitive pricing**: Price your listings lower than similar properties in the same area.    
**Promote your listings in areas with high traffic**: Advertise your listings on websites and social media platforms that are popular with people looking for places to stay in areas like Manhattan, Brooklyn, and Queens.      
**Provide excellent customer service**: Respond quickly to inquiries and requests, and make sure your guests have a positive experience.


#### Chart - 6 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code -

# Use only the numeric columns for the correlation matrix
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(10, 8))
sns.set(font_scale=1.2)  # Increasing font size for readability

# Defining custom color palette with more contrast
cmap = sns.diverging_palette(220, 20, as_cmap=True)

# Drawing the heatmap; annot=True prints the correlation coefficient in each cell
sns.heatmap(corr, cmap=cmap, annot=True, fmt=".2f", annot_kws={"size": 12, "weight": "bold"},
            linewidths=0.5, cbar_kws={"shrink": 0.8, "label": "Correlation Coefficient"},
            square=True, linecolor='white', vmin=-1, vmax=1)

# Adding title
plt.title('Correlation Heatmap', fontsize=18)

# Rotate y-axis labels for better readability
plt.yticks(rotation=0)

# Showing plot
plt.tight_layout()  # Adjusting layout to prevent clipping of labels
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here - A **correlation matrix** is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.

Thus, to see the correlation between all the variables along with the **correlation coefficients**, I used a **correlation heatmap**.

##### 2. What is/are the insight(s) found from the chart?

Answer Here - The correlation heatmap shows the relationships between the variables. For example, **id** and **host_id** are slightly more correlated with each other than any other pair of variables.
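
The strongest pairs can also be read off programmatically instead of by eye. The following is a minimal sketch on made-up toy data; `toy_df`, `pairs` and `ranked` are hypothetical names, and identifier columns such as `id` and `host_id` are left out on the assumption that their correlation carries no business meaning.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for the numeric columns; values are invented.
toy_df = pd.DataFrame({
    "price": [50, 80, 120, 200, 400],
    "minimum_nights": [1, 2, 2, 3, 30],
    "number_of_reviews": [120, 90, 60, 20, 1],
    "availability_365": [10, 200, 150, 365, 300],
})

# Pearson correlation matrix (the pandas default).
corr = toy_df.corr()

# Collect each unordered pair of columns once, skipping the diagonal.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Rank the pairs by the absolute strength of their correlation.
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)

for (col_a, col_b), value in ranked:
    print(f"{col_a} vs {col_b}: {value:.2f}")
```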

#### Chart - 7 - Pair Plot

In [None]:
# Pair Plot visualization code -

sns.pairplot(df, hue = 'room_type')

# Show plot
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot shows the pairwise relationships between features, which helps to find the set of features that best explains a relationship between two variables or that forms the most separated clusters. It can also suggest simple classification models, by drawing lines that linearly separate the data.

Thus, I used a pair plot to analyse patterns in the data and the relationships between the features. It conveys information similar to the correlation map, but as scatter plots and distributions rather than coefficients.

##### 2. What is/are the insight(s) found from the chart?

Answer Here -
From the above chart I learned that people prefer more space, such as Entire home/apt and Private room listings, and that they also want cheaper options. The areas where we are lacking are therefore clear: we need to revisit the pricing strategy; strengthen the after-market strategy, i.e. collect reviews from customers; give the other room types the recognition they need and target them as well; and, most importantly, expand the coverage of Airbnb by collaborating with more partners to handle the traffic.

#### Chart - 8 - Count Plot

In [None]:

# Chart - 8 Count of each room type across neighbourhood groups

plt.rcParams['figure.figsize'] = (8, 5)
ax = sns.countplot(y='room_type', hue='neighbourhood_group', data=df, palette='bright')

# Annotate each bar with its share of all listings
total = len(df['room_type'])
for p in ax.patches:
    if not p.get_width() > 0:
        continue  # skip empty patches so no stray 0.0% labels are drawn
    percentage = '{:.1f}%'.format(100 * p.get_width()/total)
    x = p.get_x() + p.get_width() + 0.02
    y = p.get_y() + p.get_height()/2
    ax.annotate(percentage, (x, y))

plt.title('Count of each room type in NYC')
plt.xlabel('Room Counts')
plt.xticks(rotation=90)
plt.ylabel('Room Type')

plt.show()


### Insights Found       
Manhattan has the most Entire home/apt listings, around 27% of all listed properties, followed by Brooklyn with around 19.6%.
Private rooms are most common in Brooklyn, at 20.7% of all listed properties, followed by Manhattan with 16.3% and Queens with 6.9%.
We can infer that Brooklyn, Queens and the Bronx have more private rooms, while Manhattan, which has the highest number of listings in all of NYC, has more Entire home/apt listings.
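
Shares like these can also be computed as a table rather than read from bar labels. The sketch below uses `pd.crosstab` with `normalize="all"` on made-up toy data, so the percentages it prints are illustrative and are not the figures quoted above; `toy_df` and `share` are hypothetical names.

```python
import pandas as pd

# Toy stand-in with 10 listings; values are invented.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 5 + ["Brooklyn"] * 4 + ["Queens"],
    "room_type": (["Entire home/apt"] * 3 + ["Private room"] * 2
                  + ["Entire home/apt"] + ["Private room"] * 3
                  + ["Private room"]),
})

# Share of all listings in each (room type, area) cell, in percent.
share = pd.crosstab(
    toy_df["room_type"], toy_df["neighbourhood_group"], normalize="all"
) * 100

print(share.round(1))
```

Because `normalize="all"` divides every cell by the total number of rows, the cells of the table sum to 100, which matches how the chart labels are computed.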

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?

**Business Objective :-**   

### 1. Focus on popular areas and room types:

Target hosts in Manhattan, Brooklyn, and Queens, as these areas have the most traffic.
Encourage hosts to list entire homes/apartments, as this is the most popular room type and commands higher prices.
### 2. Offer competitive pricing:

Analyze the pricing strategies of successful hosts, such as Jelena and Kathrine.
Encourage hosts to price their listings competitively, especially if they are located in popular areas or offering desirable room types.
### 3. Improve customer experience:

Encourage hosts to improve their customer service and hospitality to increase the number of positive reviews.
Offer incentives to hosts who consistently receive high ratings.
### 4. Target budget-conscious customers:

Promote the platform to budget-conscious customers by highlighting the availability of affordable listings.
Offer discounts or promotions to attract new customers and encourage repeat bookings.
### 5. Personalize the experience:

Use data and analytics to understand customer preferences and tailor the platform accordingly.
Offer personalized recommendations to customers based on their previous searches and bookings.
### 6. Manage the Traffic:

Traffic should be managed, because this is a good time to expand the business by collaborating with more hosts in the busiest areas, which will generate more revenue for Airbnb.

# **Conclusion**

We found that the host Sonder (NYC) has the highest number of listings in Manhattan, followed by Blueground.

We found that Entire home/apt is the most common room type overall, and that prices for Entire home/apt are high in Brooklyn and Manhattan.

From the visualizations we can say that most people prefer lower-priced stays, and reviews are higher for those listings.

The busiest hosts are Dona, Ji, Maya, Carol and Danielle. These hosts list Entire home/apt and Private room types, which most people prefer, and their review counts are also higher.

The 10 hosts charging the maximum price are Jelena, Kathrine, Erin, Matt, Olson, Amy, Rum, Jessica, Sally and Jack. The maximum price is 10000 USD.

Most people stay in Entire home/apt and Private room listings in Manhattan, Brooklyn & Queens, and visitors prefer rooms whose listing price is lower.

We have examined the correlations between the different variables.

Manhattan has the most Entire home/apt listings, around 27% of all listed properties, followed by Brooklyn with around 19.6%. Private rooms are most common in Brooklyn, at 20.7% of all listed properties, followed by Manhattan with 16.3% and Queens with 6.9%. We can infer that Brooklyn, Queens and the Bronx have more private rooms, while Manhattan, which has the highest number of listings in all of NYC, has more Entire home/apt listings.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***