<a href="https://colab.research.google.com/github/Dr-Datalogy/AirBnb-Booking-Exploratory-Data-Analysis/blob/main/AirBnb_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airbnb Booking Exploratory Data Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** SAEEDUR RAHMAN


# **Project Summary -**


The analysis of the Airbnb dataset provides valuable insights that can guide strategic decision-making to achieve the client's business objectives. The data highlights potential seasonality in pricing, with variations in average prices across different months. To leverage this information effectively, a dynamic pricing strategy is recommended, wherein prices are adjusted based on demand fluctuations. During peak periods, such as summer or holidays, prices can be increased to capitalize on higher demand, while special promotions or discounts can be offered during off-peak months to stimulate bookings.

Targeted marketing campaigns aligned with seasonal trends can further amplify the impact of dynamic pricing. By focusing on promoting summer getaways or winter retreats during the corresponding months, the client can resonate with the preferences and interests of potential customers. Retargeting strategies can also be employed to engage previous customers with personalized offers or reminders, enhancing customer retention and loyalty.

Effective inventory and resource management are pivotal to ensuring seamless operations during peak periods. Adequate preparation, such as ensuring sufficient inventory or resource availability and possibly hiring seasonal staff, can help meet the increased demand, thereby maximizing revenue potential. Simultaneously, developing special packages or experiences tailored for off-peak months can help maintain consistent revenue streams and customer engagement throughout the year.

Customer engagement and loyalty programs play a crucial role in fostering long-term relationships and enhancing the overall customer experience. By implementing personalized offers or loyalty programs based on customer preferences and historical booking data, the client can incentivize repeat bookings and drive customer satisfaction. Encouraging customers to provide feedback and reviews further enhances service quality and builds a positive brand reputation.

Market diversification is another strategic avenue to explore, aiming to reduce dependency on seasonal fluctuations in specific regions. By expanding target markets both geographically and demographically, the client can tap into new customer segments and revenue streams, fostering business growth and resilience.

Continuous monitoring and adaptation are essential in the ever-evolving hospitality industry. Investing in advanced technology solutions, such as data analytics tools and CRM platforms, enables the client to gather real-time insights, streamline operations, and make data-driven decisions. Regularly updating pricing, promotions, and marketing strategies based on evolving market dynamics ensures alignment with customer preferences and competitive landscape changes.

In conclusion, leveraging the insights from the Airbnb dataset and implementing strategic recommendations tailored to seasonal trends, customer preferences, and market dynamics can empower the client to optimize revenue streams, enhance customer satisfaction, and foster sustainable growth in the competitive marketplace. Adopting a proactive approach to adapt to changing market conditions and continuously refining strategies based on data-driven insights are pivotal to achieving long-term success and fulfilling the client's business objectives.

# **GitHub Link -**

https://github.com/Dr-Datalogy/AirBnb-Booking-Exploratory-Data-Analysis

# **Problem Statement**


The client's Airbnb listings show significant month-to-month fluctuations in occupancy and revenue, with noticeable declines during off-peak periods. This points to an opportunity to optimize pricing strategies and run targeted marketing campaigns that stimulate demand and strengthen revenue throughout the year.

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/Colab Notebooks/AirBnb_EDA_Project/Airbnb NYC 2019.csv'
data=pd.read_csv(file_path)
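
The guidelines above ask for exception handling so that the notebook runs in one go. A minimal sketch of a guarded loader is shown below. The helper name `load_csv_safely` is hypothetical and not part of the original notebook; it simply returns `None` instead of raising when the file is missing or cannot be parsed.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv_safely(path: str) -> Optional[pd.DataFrame]:
    """Hypothetical helper: return a DataFrame, or None if the file cannot be read."""
    csv_path = Path(path)
    # Fail early with a readable message if the Drive path is wrong or not mounted.
    if not csv_path.is_file():
        print(f"File not found: {csv_path}")
        return None
    try:
        return pd.read_csv(csv_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        # Empty or malformed CSV files are reported instead of crashing the notebook.
        print(f"Could not parse {csv_path}: {exc}")
        return None
```

With this in place, the loading cell could check `if data is None:` before continuing, which keeps later cells from failing with a less helpful error.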

### Dataset First View

In [None]:
# Dataset First Look
data.head(5)


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'The shape of Airbnb Dataset is {data.shape}')

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count duplicate rows in the entire dataset
duplicate_count = data.duplicated().sum()

print(f"Number of duplicate rows in the dataset: {duplicate_count}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count missing values in each column
missing_values_count = data.isnull().sum()

print("Missing values count for each column:")
print(missing_values_count)


In [None]:
total_cells = np.prod(data.shape)  # np.product was removed in NumPy 2.0; np.prod is the supported name
total_missing = missing_values_count.sum()

#calculating the percentage of missing values
percentage_missing = np.round((total_missing / total_cells) * 100, 2)
print("Percentage of missing values in the dataset:", percentage_missing)
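
The overall percentage hides which columns are responsible for the gaps. As a rough sketch, the same idea can be applied per column. The `toy` frame below is made-up data standing in for the Airbnb dataset; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Airbnb data (values are made up).
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [120, 80, 95, 150],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# isnull().mean() gives the fraction of missing cells in each column.
missing_pct = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

Columns near the top of this ranking are the natural candidates for dropping or imputing.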

In [None]:
# Filling missing values (assign the result back; inplace=True on a column selection may not update the DataFrame)
data['name'] = data['name'].fillna('Not_mapped')
data['host_name'] = data['host_name'].fillna('Not_mapped')


In [None]:
data = data.drop(['last_review','reviews_per_month'], axis =1)

In [None]:
data.head(2)

In [None]:
# Re-check the null count for every column after filling and dropping
missing_values_count = data.isnull().sum()
missing_values_count

In [None]:
total_cells = np.prod(data.shape)  # np.product was removed in NumPy 2.0; np.prod is the supported name
total_missing = missing_values_count.sum()

#calculating the percentage of missing values
percentage_missing = np.round((total_missing / total_cells) * 100, 2)
print("Percentage of missing values in the dataset:", percentage_missing)

In [None]:
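# Viewing correlation of the numerical values
plt.figure(figsize=(20,12))
# numeric_only=True skips text columns, which otherwise raise an error on pandas >= 2.0
abnb_corr = data.corr(numeric_only=True)
_ = sns.heatmap(abnb_corr, cbar=True, annot=True, cmap="Greens")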

### What did you know about your dataset?

**id**: Unique identifier for each listing.

**name**: Name or title of the listing.

**host_id**: Unique identifier for the host.

**host_name**: Name of the host.

**neighbourhood_group**: Grouping of neighbourhoods (the NYC borough, e.g., Manhattan or Brooklyn).

**neighbourhood**: Specific neighbourhood where the listing is located.

**latitude and longitude**: Geographic coordinates of the listing.

**room_type**: Type of room offered (e.g., Entire home/apt, Private room, Shared room).

**price**: Price of the listing per night.

**minimum_nights**: Minimum number of nights required for booking.

**number_of_reviews**: Total number of reviews received for the listing.

**last_review**: Date of the last review.

**reviews_per_month**: Average number of reviews received per month.

**calculated_host_listings_count**: Number of listings managed by the host.

**availability_365**: Number of days the listing is available in a year.

## ***2. Understanding Your Variables***

In [None]:
# Creating a backup of the dataset (copy() is needed; plain assignment only creates a second name for the same object)
air_df2 = data.copy()

In [None]:
# Dataset Columns
air_df2.columns

In [None]:
# grouping the data according to the categories
host_ar_rept = air_df2.groupby(['host_name','neighbourhood_group'])['calculated_host_listings_count'].max()
# Reset index
host_ar_rept=host_ar_rept.reset_index()
# sorting the values calculated_host_listings_count
host_ar_rept.sort_values(by='calculated_host_listings_count', ascending=False).head(5)

We can conclude that the host with the most listings is Sonder (NYC), with 327 listings in Manhattan.
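
One caveat with grouping on `host_name` is that two different hosts can share a name. The sketch below, on made-up data, counts listings per `host_id` instead and keeps the name only as a label. It is an illustration of the idea under that assumption, not a replacement for the report above.

```python
import pandas as pd

# Toy data: two different hosts are both called "Alex" (ids 1 and 2).
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "host_name": ["Alex", "Alex", "Alex", "Sam", "Sam", "Sam"],
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn",
                            "Queens", "Queens", "Queens"],
})

# Count rows per (host_id, host_name) so hosts sharing a name stay separate.
listings_per_host = (
    toy.groupby(["host_id", "host_name"])
       .size()
       .reset_index(name="listings")
       .sort_values("listings", ascending=False)
)
top_host = listings_per_host.iloc[0]
print(listings_per_host)
```

Grouping by name alone would have merged the two "Alex" hosts into a single entry with three listings.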

In [None]:
# Dataset Describe
data.describe()

### Variables Description

**id**: Unique identifier for each listing.

**name**: Name or title of the listing.

**host_id**: Unique identifier for the host.

**host_name**: Name of the host.

**neighbourhood_group**: Grouping of neighbourhoods (the NYC borough, e.g., Manhattan or Brooklyn).

**neighbourhood**: Specific neighbourhood where the listing is located.

**latitude and longitude**: Geographic coordinates of the listing.

**room_type**: Type of room offered (e.g., Entire home/apt, Private room, Shared room).

**price**: Price of the listing per night.

**minimum_nights**: Minimum number of nights required for booking.

**number_of_reviews**: Total number of reviews received for the listing.

**last_review**: Date of the last review.

**reviews_per_month**: Average number of reviews received per month.

**calculated_host_listings_count**: Number of listings managed by the host.

**availability_365**: Number of days the listing is available in a year.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable (pandas is already imported as pd above).

for column in data.columns:
    unique_values = data[column].unique()
    print(f"Unique values for '{column}': {unique_values}\n")


In [None]:
for column in data.columns:
    unique_count = len(data[column].unique())
    print(f"Number of unique values for '{column}': {unique_count}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Drop rows with any missing values
data.dropna(inplace=True)


In [None]:
#Handle Duplicates:
data.drop_duplicates(inplace=True)


In [None]:
#Convert Data Types:
data['room_type'] = data['room_type'].astype('category')
data['neighbourhood'] = data['neighbourhood'].astype('category')


### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Visualizing the hosts with most listings.
# selecting top 10 host
top_hosts=host_ar_rept.sort_values(by='calculated_host_listings_count', ascending=False).head(10)

plt.rcParams['figure.figsize'] = (12,6)
host_name = top_hosts['host_name']
host_listings = top_hosts['calculated_host_listings_count']
plt.bar(host_name, host_listings)
plt.title('Hosts with most listings in NYC',{'fontsize':18})
plt.xlabel('Host Names',{'fontsize':18})
plt.ylabel('Number of host listings',{'fontsize':18})
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because it compares a numerical value across categories effectively. Here it shows that Sonder (NYC) has the highest number of listings.

##### 2. What is/are the insight(s) found from the chart?

Sonder (NYC) has the most listings, followed by Blueground, Kara, Kazuya, and Jeremy & Laura.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive impact:**
Knowing how listings are distributed across hosts helps to understand which property categories and types dominate, and where listings are sparse, so that supply can be increased in the weaker segments.


#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.boxplot(x='room_type', y='price', data=data)
plt.title('Price Distribution by Room Type')
plt.show()


##### 1. Why did you pick the specific chart?

I picked the box plot because it displays the distribution of a numerical variable (price) across different categories (room_type). It provides insights into the central tendency, variability, and potential outliers within each category.

##### 2. What is/are the insight(s) found from the chart?

We can observe the median price (the line inside the box) for each room type, which gives us a sense of the central value.
The height of the box represents the interquartile range (IQR), indicating the variability of prices within each room type.
Points beyond the whiskers (which by default extend up to 1.5 times the IQR past the box) can be considered outliers, showing exceptionally high or low prices for specific room types.
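
The 1.5 times IQR rule described above can also be applied numerically to list the outliers that the box plot draws as individual points. The sketch below uses made-up prices; the `is_outlier` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Toy prices per room type (values are made up; 400 is the intended outlier).
toy = pd.DataFrame({
    "room_type": ["Private room"] * 6 + ["Entire home/apt"] * 6,
    "price": [50, 55, 60, 65, 70, 400, 120, 130, 140, 150, 160, 170],
})

# Quartiles are computed per room type, then mapped back onto each row.
grouped = toy.groupby("room_type")["price"]
q1 = toy["room_type"].map(grouped.quantile(0.25))
q3 = toy["room_type"].map(grouped.quantile(0.75))
iqr = q3 - q1

# A price is flagged when it falls more than 1.5 * IQR outside the box.
toy["is_outlier"] = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
print(toy[toy["is_outlier"]])
```

Computing the bounds per room type matters here: a price of 400 is extreme for a private room in this toy data, but a single global threshold could miss it or flag ordinary entire-home prices.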

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the box plot reveals consistent outliers at the lower end of the price range for specific room types, it might suggest that those listings are either underpriced (leading to missed revenue) or offer fewer amenities/services than perceived competitors.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
top_neighbourhoods = data['neighbourhood'].value_counts().head(10)
sns.barplot(x=top_neighbourhoods.index, y=top_neighbourhoods.values)

plt.title('Top 10 Neighbourhoods by Listings')
plt.xticks(rotation=45)
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
#Creating a Report on number of Reviews vs Price

area_price = air_df2.groupby(['price'])['number_of_reviews'].max().reset_index()
area_price.head(5)


In [None]:
price = area_price['price']
max_reviews = area_price['number_of_reviews']
fig = plt.figure(figsize = (10, 5))

# creating the scatter plot (each point is the maximum review count seen at that price)
plt.scatter(price, max_reviews)

plt.xlabel("Price")
plt.ylabel("Maximum number of reviews")
plt.title("Price vs Number of Reviews")
plt.show()

##### 1. Why did you pick the specific chart?

I chose the scatter plot to visualize the relationship between price and number of reviews. Scatter plots are effective for visualizing relationships between two numerical variables and can help in identifying patterns, trends, or correlations between them.

##### 2. What is/are the insight(s) found from the chart?

Listings priced under USD 2000 receive the most reviews compared with higher price ranges. This implies a need to focus on gaining reviews in the other price ranges as well.
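
The visual impression from the scatter plot can be backed by a correlation coefficient. The sketch below uses made-up values rather than the Airbnb data. The rank-based coefficient is computed as a Pearson correlation on ranks, which is equivalent to Spearman's coefficient and is used here only to keep the example dependent on pandas alone.

```python
import pandas as pd

# Toy data: reviews fall as price rises (values are made up).
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 500],
    "number_of_reviews": [90, 70, 40, 20, 5],
})

# Pearson measures linear association between the two columns.
pearson = toy["price"].corr(toy["number_of_reviews"])

# Pearson on ranks equals Spearman's rank correlation (monotonic association).
spearman = toy["price"].rank().corr(toy["number_of_reviews"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A rank-based measure is less sensitive to a few very expensive listings than the linear one, which is useful for a price column with a long right tail.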

##### 3. Will the gained insights help creating a positive business impact?



Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here


#### Chart - 5

In [None]:
# Creating a Report on host name and reviews on it
busy_hosts = air_df2.groupby(['host_name','host_id','room_type'])['number_of_reviews'].max()
busy_hosts=busy_hosts.reset_index()
busy_hosts = busy_hosts.sort_values(by='number_of_reviews', ascending=False).head(10)


In [None]:
busy_hosts

In [None]:
# Analysis of the data by visualization
fig,ax=plt.subplots(figsize=(14,8))
sns.barplot(data=busy_hosts,x='host_name',y='number_of_reviews',ax=ax,capsize=.2)
ax.set(title='Busiest Hosts')
plt.xlabel('Host Names',{'fontsize':18})
plt.ylabel('Number of Reviews',{'fontsize':18})
plt.show()

##### 1. Why did you pick the specific chart?

I chose the bar plot because it is suitable for comparing values across categories.

##### 2. What is/are the insight(s) found from the chart?

The busiest hosts are:

1.   Dona
2.   Ji
3.   Maya
4.   Carol
5.   Danielle

These hosts list their room type as Entire home/apt or Private room, which are the types most guests prefer.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Types of rooms
room_types = data['room_type'].value_counts()
sns.barplot(x=room_types.index, y=room_types.values)
plt.title('Distribution of Room Types')
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?


A bar chart was chosen because it effectively represents the distribution of categorical data. For the variable room_type, which likely contains distinct categories like 'Entire home/apt', 'Private room', and 'Shared room', a bar chart provides a clear visual representation of the count or frequency of each category.

##### 2. What is/are the insight(s) found from the chart?


The chart reveals the distribution of listings based on room types. Specifically, you can easily see which room type has the highest count and which ones are less common. This provides a snapshot of the popularity or availability of different room types within the dataset.
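
The counts behind the bar chart can also be expressed as shares, which makes the dominance of a room type easier to state. A minimal sketch on made-up data follows; the 5/4/1 split is an assumption chosen for readability, not the real distribution.

```python
import pandas as pd

# Toy room types (made-up split: 5 entire homes, 4 private rooms, 1 shared room).
toy = pd.Series(
    ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"],
    name="room_type",
)

# normalize=True returns fractions instead of raw counts.
room_share = (toy.value_counts(normalize=True) * 100).round(1)
print(room_share)
```

On the real data, `data['room_type'].value_counts(normalize=True)` would give the corresponding shares directly.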

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



**Positive Impact**: Knowing the distribution of room types can guide pricing strategies, marketing efforts, and resource allocation. For instance, if 'Entire home/apt' is the most popular, the business might focus more on acquiring such listings or promoting them more aggressively.

**Negative Impact**: If a particular room type (e.g., 'Shared room') has very few listings compared to others, and there's a significant demand for it, the platform might be missing out on potential revenue. Not addressing this gap could lead to negative growth or reduced customer satisfaction if users are looking specifically for that type of accommodation.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Bar Chart - Number of Listings by Room Type (Categorical data)
listings_by_type = data['room_type'].value_counts()
sns.barplot(x=listings_by_type.index, y=listings_by_type.values)
plt.title('Number of Listings by Room Type')
plt.xlabel("Room type")
plt.ylabel("No. Of Listings")
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is suitable for comparing a quantity across categories. Here it compares the number of listings for each room type.

##### 2. What is/are the insight(s) found from the chart?

Comparing the room types (Entire home/apt, Private room, Shared room) by number of listings shows that Entire home/apt has the most listings, followed by Private room, with Shared room last.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Entire home/apt has the most listings, which shows hosts' interest in offering this type of property. Shared rooms have comparatively few listings.
The focus should be on the shared room type to increase its listings.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(12,8))
sns.scatterplot(x=data.longitude,y=data.latitude,hue=data.neighbourhood_group)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(12,4))
# Keep only listings that can be booked for a single night
df = data[data['minimum_nights']==1]
# Average price for each (room type, neighbourhood group) pair
df1 = df.groupby(['room_type','neighbourhood_group'])['price'].mean().sort_values(ascending=True)
df1.plot(kind='bar')
plt.title('Average Price for rooms in neighbourhood group')
plt.ylabel('Average Daily Price')
plt.xlabel('Room Type, Neighbourhood Group')
plt.show()
print('List of Average Price per night based on the neighbourhood group')
pd.DataFrame(df1).sort_values(by='room_type')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
fig = plt.figure(figsize=(12,4))
review_50 = data[data['number_of_reviews']>=50]
df2 = review_50['neighbourhood_group'].value_counts()
df2.plot(kind='bar',color=['r','b','g','y','m'])
plt.title('Listings with at Least 50 Reviews by Neighbourhood Group')
plt.ylabel('Number of Listings')
plt.xlabel('Neighbourhood Group')
plt.show()
print('Count of listings with at least 50 reviews v/s neighbourhood group')
pd.DataFrame(df2)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.style.use('fivethirtyeight')
plt.figure(figsize=(13,7))
plt.title("Neighbourhood Group")
g = plt.pie(data.neighbourhood_group.value_counts(), labels=data.neighbourhood_group.value_counts().index,autopct='%1.1f%%', startangle=180)
plt.show()

##### 1. Why did you pick the specific chart?


The pie chart is chosen to visualize the distribution or proportion of each neighbourhood_group in the dataset. Pie charts are particularly effective when the goal is to show relative proportions of a whole.

##### 2. What is/are the insight(s) found from the chart?


The dominant neighbourhood_group can be identified based on the largest slice.
It's possible to observe the relative sizes and proportions of each neighbourhood_group in the dataset.
For instance, if "Brooklyn" has the largest slice, it indicates that the dataset has a significant number of listings from Brooklyn compared to other neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numerical_data = data.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix
correlation_matrix = numerical_data.corr()

# Plotting the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
numerical_data = data.select_dtypes(include=['float64', 'int64'])

# Creating the pair plot
sns.pairplot(numerical_data)
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)  # Adjust the title position if needed
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

**Neighbourhood Distribution**: The majority of listings are concentrated in specific neighbourhood groups, particularly Manhattan and Brooklyn. This concentration suggests potential areas of high demand and can guide strategic decisions for hosts and property managers.


**Price Trends:** The dataset showcased varying price ranges across different room types and neighbourhoods. This variability highlights the importance of dynamic pricing strategies, taking into account factors such as location, seasonality, and amenities.

**Availability Patterns:** Certain listings had high availability throughout the year, suggesting they might be more suited for long-term stays or might be priced competitively to attract guests consistently.

**To leverage these insights for business growth:**

**Targeted Marketing**: By understanding the popular neighbourhoods and their corresponding price ranges, hosts can tailor their marketing efforts to attract the right audience.

**Optimized Pricing**: Implementing dynamic pricing models can help hosts maximize revenue by adjusting prices based on demand, seasonality, and other factors.
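
As a starting point for the pricing recommendations above, a table of typical prices per segment gives a baseline to adjust from. The sketch below builds such a table on made-up data. Using the median rather than the mean is a choice made here to limit the influence of extreme prices, not something taken from the analysis itself.

```python
import pandas as pd

# Toy listings (values are made up).
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4,
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"] * 2,
    "price": [200, 240, 100, 120, 150, 170, 70, 90],
})

# One row per neighbourhood group, one column per room type, median price in each cell.
median_price = toy.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="median",
)
print(median_price)
```

Each cell can then serve as the reference price for that segment when testing discounts or peak-period increases.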

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***