<a href="https://colab.research.google.com/github/AkashDas-AD/EDR/blob/main/EDA_AirBnb_Booking_Analysis_AkashDas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Name - AirBnb Booking Analysis**





##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual


# **Project Summary -**

This analysis examines the factors that influence Airbnb prices in New York City and identifies patterns across the dataset's variables. The findings provide useful information for travelers and hosts in the city, along with insights for Airbnb's business.

This project involved exploring and cleaning a dataset to prepare it for analysis. The data exploration process involved identifying and understanding the characteristics of the data, such as the data types, missing values, and distributions of values. The data cleaning process involved identifying and addressing any issues or inconsistencies in the data, such as errors, missing values, or duplicate records and outliers.

Through this process, we were able to identify and fix issues with the data and ensure that it was ready for further analysis. This is an important step in any data analysis project, as it allows us to work with high-quality data and avoid potential biases or errors that could affect the results. The clean, prepared data can now be used to answer specific research questions.

We then used data visualization to explore and understand patterns in the data. The observations and insights identified through this process will be useful for future analysis and decision-making related to Airbnb, and they also provide useful information for travelers and hosts.




# **GitHub Link -**

In [None]:
https://github.com/AkashDas-AD/EDR


# **Problem Statement**



#### **Define Your Business Objective?**

Airbnb hosts and the platform itself strive to optimize the hosting experience and attract more bookings. The challenge is to analyze Airbnb booking patterns, user preferences, and pricing strategies to uncover valuable insights. By conducting an in-depth Exploratory Data Analysis (EDA), we aim to understand the dynamics influencing booking trends, user behavior, and host success. The ultimate goal is to provide actionable insights for hosts to enhance their listings and for Airbnb to refine its platform features, ultimately fostering a deeper understanding of the vacation rental market.

Business Objective:

Maximize Airbnb Host Success:

Enhance host strategies by leveraging data insights, resulting in increased bookings and revenue on the Airbnb platform.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt     #for visualization
%matplotlib inline
import seaborn as sns               #for visualization
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
# Upload the file to session storage, then load the dataset

df=pd.read_csv('Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

shape=df.shape
print( "No. of Rows and Columns in Dataset are :", shape)

### Dataset Information

In [None]:
# Dataset Info
print("Dataset Info :")
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_data=df.duplicated().sum()
print("Duplicate Values in Dataset are :", duplicate_data)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print("Missing Values/Null Values Count :")
print(df.isna().sum())

In [None]:
# Visualizing the missing values

columns=list(df.columns)
null_counts = df[columns].isnull().sum()                  # Calculate total null values in each column
null_counts = null_counts.reset_index()                   # Convert it into data frame for plotting
null_counts.columns = ['Column', 'Null Count']

plt.figure(figsize=(10, 6))
sns.barplot(x='Column', y='Null Count', data=null_counts)
plt.xticks(rotation=90)
plt.title('Null Values Count by Column')
plt.xlabel('Columns')
plt.ylabel('Number of Null Values')
plt.show()

### What did you know about your dataset?

**DATASET OVERVIEW** :

The dataset comprises 48895 rows and 16 columns, with each row representing a unique listing. Features include host details, neighbourhood information, room types, prices, availability, and reviews.

**Data Types Check** :
Evaluation of data types ensured proper handling of numeric, categorical, and datetime features.

**Categorical variables** - name, host_name, neighbourhood_group, neighbourhood and room_type

**Numerical variables** - id, host_id, latitude, longitude, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365

**Datetime variable** - last_review (loaded as text and converted to datetime during data wrangling)

**Missing Values** :
Notable missing values were observed in the "last_review" and "reviews_per_month" columns.

**Duplicate Values** :
There are no duplicate values in the dataset.
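
The checks above (data types, missing values, unique values) can also be collected into a single table. The following is a minimal sketch, not part of the original analysis: `summarize_columns` is a hypothetical helper name, and the toy rows are made up to stand in for the real dataset.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, null share and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null_count": frame.isna().sum(),
        "null_pct": (frame.isna().mean() * 100).round(2),
        "n_unique": frame.nunique(),
    })

# Toy rows standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [80, 150, 95],
    "last_review": ["2019-05-21", None, "2018-11-03"],
})

summary = summarize_columns(toy)
print(summary)
```

On the real data, calling `summarize_columns(df)` should give the same information as `df.info()`, `df.isna().sum()` and `df.nunique()` in one view.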


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

columns=list(df.columns)
print("Columns present in Dataset are :")
columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description


* **id** - Unique ID of the listing
* **name** - Name of the listing
* **host_id** - Unique ID of the host
* **host_name** - Name of the host
* **neighbourhood_group** - Neighbourhood group (borough) of the listing
* **neighbourhood** - Area within the neighbourhood group
* **latitude** - Latitude of the listing
* **longitude** - Longitude of the listing
* **room_type** - Type of listing
* **price** - Price of listing
* **minimum_nights** - Minimum nights to be paid for
* **number_of_reviews** - Number of reviews
* **last_review** - Date of the last review
* **reviews_per_month** - Number of reviews per month
* **calculated_host_listings_count** - Total number of listings of the host
* **availability_365** - Number of days the listing is available in a year

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for col in df.columns:
  print(f"No. of unique values in {col} = {df[col].nunique()}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
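# Write your code to make your dataset analysis ready.
print(null_counts)       # null count of each column, calculated previously

# Assign the result back instead of using inplace=True on a single column,
# which may not update df in recent pandas versions
df['name'] = df['name'].fillna('unknown')
df['host_name'] = df['host_name'].fillna('no_name')
# last_review is left as missing: listings without reviews have no review date
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)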


In [None]:
df.isna().sum() # verify if all null values have been handled or not

In [None]:
#Removing the outliers in the data using the price column

# visualizing the outliers using boxplot

sns.boxplot(x=df['price'])
plt.title('Box Plot for Price')
plt.show()

In [None]:
#Remove outliers using IQR technique.

Q1=df['price'].quantile(0.25)
Q3=df['price'].quantile(0.75)

IQR = Q3 - Q1
lower_bound= Q1 - 1.5 * IQR
upper_bound= Q3 + 1.5 * IQR

# .copy() so that later column assignments work on an independent DataFrame, not a slice
df = df[(df['price'] > lower_bound) & (df['price'] < upper_bound)].copy()

In [None]:
# visualizing the dataset using boxplot after removing the outliers

sns.boxplot(x=df['price'])
plt.title('Boxplot for price after removing outliers')
plt.show()

In [None]:
#Analyzing the unique neighbourhood groups, room types, and their respective counts, as well as computing average prices,
#host listing counts, and minimum night stays for each room type and each neighbourhood group.

#Unique Neighbourhood Group
unique_neighbourhood_group = df['neighbourhood_group'].value_counts()
print("Unique Neighbourhood Group present in Dataset , each accompanied by its respective count :")
print(unique_neighbourhood_group)
print("---------------------------------------------------------------------")

#Count of neighbourhood within each neighbourhood group
unique_neighbourhood = df.groupby('neighbourhood_group')['neighbourhood'].nunique()
print("Count of Neighbourhood present within each Neighbourhood Group :")
print(unique_neighbourhood)
print("---------------------------------------------------------------------")

#Count of unique room type present
unique_room_type = df['room_type'].value_counts()
print("Unique Room Type present in Dataset,each accompanied by its respective count :")
print(unique_room_type)
print("---------------------------------------------------------------------")

#Average price for each room type
room_avg_price = df.groupby('room_type').describe()['price']['mean'].round(2)
print("Average Price of each Room type present :")
print(room_avg_price)
print("----------------------------------------------------------------------")

#Average price of each room type in each neighbourhood group
room_avg_price = df.groupby(['neighbourhood_group','room_type']).describe()['price']['mean'].round(2)
print("Average Price of each Room Type in each Neighbourhood Group :")
print(room_avg_price)
print("---------------------------------------------------------------------")

#Average number of host listings count for each room type
print("Average number of Host listings count for each room type :")
room_avg_host_listing_count = df.groupby('room_type').describe()['calculated_host_listings_count']['mean'].round(2)
print(room_avg_host_listing_count)
print("---------------------------------------------------------------------")

#Average Host listings count for each Room Type in each Neighbourhood Group
avg_host_listing_count = df.groupby(['neighbourhood_group','room_type']).describe()['calculated_host_listings_count']['mean'].round(2)
print('Average Host listings count for each Room Type in each Neighbourhood Group')
print(avg_host_listing_count)
print("---------------------------------------------------------------------")

#Average number of minimum nights for each room type
room_avg_nights=df.groupby('room_type').describe()['minimum_nights']['mean'].astype(int)
print("Average Number of Minimum Nights for each room type present :")
print(room_avg_nights)

In [None]:
#Analyzing the booking count for each month and for each room type.
#Note: the month of last_review is used here as a proxy for when bookings occurred.

#Converting the data type of last_review to datetime and extracting the month from it
df['last_review'] = pd.to_datetime(df['last_review'])
df['month'] = df['last_review'].dt.month.astype(float)

#All months Bookings count for each room type
booking_counts_by_room_type =df.groupby(['room_type', 'month']).size().reset_index(name='booking_count')
print("Count of Bookings made for each month across all room types.")
print(booking_counts_by_room_type)
print('--------------------------------------------------------------')

#Top 3 months with the maximum booking counts for each room type
booking_counts_by_rtype=df.groupby(['room_type', 'month']).size()
top_3_months_by_room_type_max = booking_counts_by_rtype.groupby('room_type', group_keys=False).nlargest(3).reset_index(name='booking_count')
print("Top 3 months for each room type when the maximum number of bookings occurred :")
print(top_3_months_by_room_type_max)
print('----------------------------------------------------------')

#Top 3 months with the minimum booking counts for each room type
top_3_months_by_room_type_min = booking_counts_by_rtype.groupby('room_type', group_keys=False).nsmallest(3).reset_index(name='booking_count')
print("Top 3 months for each room type when the minimum number of bookings occurred :")
print(top_3_months_by_room_type_min)


In [None]:
#Analyzing the booking count for each year and for each room type.

#Extracting year from last_review column.
df['year'] = df['last_review'].dt.year.astype(float).astype('Int64')

#All years' Booking count for each room type
booking_counts_by_room_type_yearly = df.groupby(['room_type', 'year']).size().reset_index(name='booking_count_yearly')
print("Count of Bookings made for each year across all room types.")
print(booking_counts_by_room_type_yearly)
print('--------------------------------------------------------------')

#Top 3 years with the maximum booking counts for each room type
booking_counts_by_rtype_yearly = df.groupby(['room_type', 'year']).size()
top_3_years_by_room_type_max = booking_counts_by_rtype_yearly.groupby('room_type', group_keys=False).nlargest(3).reset_index(name='booking_count_yearly')
print("Top 3 years for each room type when the maximum number of bookings occurred :")
print(top_3_years_by_room_type_max)
print('----------------------------------------------------------')

#Top 3 years with the minimum booking counts for each room type
top_3_years_by_room_type_min = booking_counts_by_rtype_yearly.groupby('room_type', group_keys=False).nsmallest(3).reset_index(name='booking_count_yearly')
print("Top 3 years for each room type when the minimum number of bookings occurred :")
print(top_3_years_by_room_type_min)
print('-----------------------------------------------------------')

In [None]:
#Calculating the average nights booked, price per night, income per listing and how many listings there are for
#short-term and long-term rentals.

#Assuming each review in number_of_reviews corresponds to at least 1 night booked, hence using number_of_reviews
#Average nights booked per listing
average_nights_booked = int(df['number_of_reviews'].mean())
print(f"The average number of nights booked per listing is: {average_nights_booked}")
print('-------------------------------------------------------------------------')

#Average price per night
df['average_price_per_night'] = df['price']/df['minimum_nights']
average_price_per_night =df['average_price_per_night'].mean()
print(f"The average price per night is: ${average_price_per_night:.2f}")
print('-------------------------------------------------------------------------')

#Average income per listing
df['income'] = df['price'] * df['calculated_host_listings_count']           #Calculate the income for each listing
average_income_per_listing = df['income'].sum()/len(df['income'])           #Calculate the average income per listing
print(f"The average income per listing is: ${average_income_per_listing:.2f}")
print('-------------------------------------------------------------------------')

# Analysis whether Short-Term rentals(minimum nights<=30) are more preferred or Long-Term-rentals(minimum nights > 30)

# Classify listings as short-term or long-term
df['rental_type'] = pd.cut(df['minimum_nights'], bins=[0, 30, float('inf')], labels=['Short-term', 'Long-term'])
rental_type_counts = df['rental_type'].value_counts()
print("Count of Short-term Rentals:")
print(rental_type_counts['Short-term'])
print("Count of Long-term Rentals:")
print(rental_type_counts['Long-term'])




In [None]:
# Calculate how many hosts have multiple listings or a single listing, and who the top hosts with multiple listings are.

# Keep one row per host_id so that hosts, not listings, are counted
hosts = df.drop_duplicates(subset='host_id')

#Identify hosts with multiple listings (i.e., hosts with more than one listing)
multiple_listings_hosts = hosts[hosts['calculated_host_listings_count'] > 1]

# Display the count and percentage of hosts with multiple listings
total_hosts = len(hosts)
multiple_listings_count = len(multiple_listings_hosts)
single_listings_count = total_hosts - multiple_listings_count

print(f"\nTotal Hosts: {total_hosts}")
print(f"Hosts with Multiple Listings: {multiple_listings_count} ({(multiple_listings_count / total_hosts) * 100:.1f}%)")
print(f"Hosts with Single Listing: {single_listings_count} ({(single_listings_count / total_hosts) * 100:.1f}%)")

top_hosts = df.groupby(['host_name', 'room_type']).size().unstack(fill_value=0)
top_hosts['total_listing'] = top_hosts.sum(axis=1)
top_10_hosts = top_hosts.sort_values(by='total_listing', ascending=False).head(10)

print(top_10_hosts)

### What all manipulations have you done and insights you found?

**Data Manipulations:**
Identified and handled the null values in the respective columns:

* name - filled with `unknown`
* host_name - filled with `no_name`
* last_review - left as missing (NaT)
* reviews_per_month - filled with 0

Removed the outliers in the price column using the IQR method.
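
To make the IQR rule concrete, here is a small worked sketch on made-up prices. `iqr_bounds` is a hypothetical helper name used only for illustration; the multiplier 1.5 and the strict inequalities mirror the wrangling code above.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices with one obvious outlier
prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 1000])

lower, upper = iqr_bounds(prices)
# Keep only values strictly inside the fences, as in the notebook
kept = prices[(prices > lower) & (prices < upper)]

print(f"lower={lower}, upper={upper}, rows kept={len(kept)} of {len(prices)}")
```

Here Q1 is 70 and Q3 is 110, so the fences are 10 and 170 and only the 1000 value is dropped.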

**Neighbourhood Group Analysis**: Identified the count of unique neighbourhood groups and neighbourhoods, as well as room type distributions. Additionally, calculated the average price for each room type and neighbourhood group combination, offering insight into pricing trends.

**Room Type Booking Analysis**: Analyzed booking trends by month, year and room type. First, the last_review column was converted to datetime format and the month and year were extracted from it. Then the total number of bookings for each room type in each month was calculated, giving a month-wise breakdown of bookings. Additionally, the top 3 months and top 3 years with the highest booking counts were identified for each room type. Similarly, the 3 months and 3 years with the lowest booking counts were identified for each room type, giving insight into off-peak times for bookings.
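
The month-wise counting can be illustrated on a handful of made-up rows. This is only a sketch under the same assumption the analysis makes, namely that the month of `last_review` stands in for booking activity; rows without a review date drop out of the grouping.

```python
import pandas as pd

# Made-up listings; None means the listing has no review yet
toy = pd.DataFrame({
    "room_type": ["Private room"] * 5 + ["Entire home/apt"] * 4,
    "last_review": ["2019-06-01", "2019-06-15", "2019-05-03", "2018-06-20", None,
                    "2019-07-01", "2019-07-09", "2019-01-02", None],
})

# Convert to datetime and extract the month (NaN where the date is missing)
toy["last_review"] = pd.to_datetime(toy["last_review"])
toy["month"] = toy["last_review"].dt.month

# Count rows per (room_type, month); groupby skips missing months by default
counts = toy.groupby(["room_type", "month"]).size()

# Busiest month for each room type
top_month = counts.groupby("room_type", group_keys=False).nlargest(1)
print(top_month)
```

In this toy data, June is the busiest month for private rooms (3 rows) and July for entire homes (2 rows).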

**Rental Type Analysis**: Identified the average listing price and average income per listing and also classified listings as short-term or long-term based on minimum nights.
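
The short-term versus long-term split relies on how `pd.cut` treats bin edges. A minimal sketch with made-up `minimum_nights` values shows that intervals are closed on the right by default, so exactly 30 nights still counts as short-term:

```python
import pandas as pd

# Made-up minimum_nights values around the 30-night threshold
min_nights = pd.Series([1, 3, 30, 31, 90])

# Bins are (0, 30] and (30, inf): right edges are included by default
rental_type = pd.cut(
    min_nights,
    bins=[0, 30, float("inf")],
    labels=["Short-term", "Long-term"],
)

rental_counts = rental_type.value_counts()
print(rental_counts)
```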

**Host Listings Analysis**: Analyzed the distribution of listings per host, identified hosts with multiple listings, and identified the top 10 hosts with the maximum number of listings.



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Yearly Booking Count

In [None]:
# Chart - 1 visualization code

yearly_booking_counts =booking_counts_by_room_type_yearly.groupby('year')['booking_count_yearly'].sum().reset_index()

plt.figure(figsize=(10, 6))
sns.lineplot(x='year', y='booking_count_yearly', data=yearly_booking_counts, marker='o')
plt.title('Yearly Booking Counts')
plt.xlabel('Year')
plt.ylabel('Total Bookings')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?


The chosen graph is a line graph. It shows trends over time by connecting data points with a line, making it easy to visualize changes and patterns.

##### 2. What is/are the insight(s) found from the chart?

After 2014, the number of listings (counted by year of last review) increased sharply year on year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Listings have increased year on year, which is a positive sign that the business is growing.

#### Chart - 2 - Distribution of Airbnb Prices

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10,6))
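# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(df['price'], kde=True, stat='density')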
plt.xlabel('Price')
plt.ylabel('Density')
plt.title('Distribution of Airbnb Prices')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

The chosen graph is a histogram. It represents the distribution of numerical data by dividing it into bins or intervals, which helps to visualize the frequency of data points within each range.
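
The binning behind a histogram can be sketched directly with NumPy. The prices and the 50-dollar bin edges below are made up for illustration and are not the bins seaborn picks for Chart 2.

```python
import numpy as np

# Made-up prices standing in for the price column
prices = np.array([20, 45, 60, 75, 80, 95, 110, 140, 180, 250])

# Fixed 50-dollar bins; each bin includes its left edge,
# and the last bin also includes its right edge
bin_counts, bin_edges = np.histogram(prices, bins=[0, 50, 100, 150, 200, 250, 300])

for left, right, n in zip(bin_edges[:-1], bin_edges[1:], bin_counts):
    print(f"{left:>3} to {right:>3}: {n} listings")
```

Each bar of the histogram corresponds to one of these counts (or to the count rescaled to a density).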

##### 2. What is/are the insight(s) found from the chart?

* The range of prices being charged on Airbnb appears to be from 20 to 330 dollars, with the majority of listings falling in the price range of 50 to 150 dollars.
* The distribution of prices appears to have a peak in the 50 to 150 dollars range, with a relatively lower density of listings in higher and lower price ranges.
* There may be fewer listings available at prices above 250 dollars, as the density of listings drops significantly in this range.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Airbnb hosts could focus on optimizing pricing within the 50 to 150 dollar range, where most listings are concentrated. Additionally, given the lower density of listings above 250 dollars, hosts could explore offering unique amenities, experiences, or discount offers to justify higher prices and attract premium customers.

#### Chart - 3 - Preferred Neighborhoods Group

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(10, 6))
sns.countplot(x='neighbourhood_group', data=df)
plt.xticks(rotation=90)
plt.xlabel('Neighborhood Group')
plt.ylabel('Count')
plt.title('Preferred Neighborhoods Group')
plt.show()

##### 1. Why did you pick the specific chart?

The chosen graph is a countplot. It is suitable for showing the frequency or distribution of categorical data, and it helps in quickly comparing the number of listings in each neighbourhood group.
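
The numbers that a countplot draws are the category frequencies, which can also be read off with `value_counts()`. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up neighbourhood_group values for six listings
groups = pd.Series(
    ["Manhattan", "Brooklyn", "Manhattan", "Queens", "Brooklyn", "Manhattan"],
    name="neighbourhood_group",
)

listing_counts = groups.value_counts()                 # bar heights of the countplot
listing_share = groups.value_counts(normalize=True)    # same counts as shares of the total

print(listing_counts)
print(listing_share.round(2))
```

Printing the counts next to the chart can make the comparison between groups easier to quote in the insights.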

##### 2. What is/are the insight(s) found from the chart?

Manhattan and Brooklyn are the most preferred neighbourhood groups, followed by Queens, the Bronx, and Staten Island.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Impact: Manhattan and Brooklyn appear to be the top areas for growth and additional revenue.
* Negative Impact: On the downside, more listings in an area lead to more competition and increased costs, such as property maintenance, marketing, and possibly higher property purchase prices.



#### Chart - 4 - Preferred Room Type

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(10, 6))
sns.countplot(x='room_type', data=df)
plt.xticks(rotation=90)
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.title('Preferred Room Type')

plt.show()

##### 1. Why did you pick the specific chart?

The chosen chart is a countplot. It is suitable for comparing the frequency or distribution of categorical data, making it ideal for visualizing the number of listings of each room type.

##### 2. What is/are the insight(s) found from the chart?

Entire homes/apartments are in high demand, indicating a preference for more private and exclusive accommodations. Private rooms are also highly preferred, suggesting a balance between privacy and affordability. Shared rooms have a lower count, indicating a lower level of preference.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing which room types guests prefer benefits the entire Airbnb platform. Hosts can use this knowledge to adjust their listings, creating a positive chain reaction.

#### Chart - 5 - Average Price Trend Over Years

In [None]:
# Chart - 5 visualization code

average_price_by_year = df.groupby('year')['price'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(x='year', y='price', data=average_price_by_year, marker='o')
plt.title('Average Price Trend Over Years')
plt.xlabel('Year')
plt.ylabel('Average Price')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I chose the line plot to visually represent the annual evolution of average listing prices from 2011 to 2019. This specific chart is effective in highlighting trends and patterns in pricing over the specified time period.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates a general decline in the average price of listings from 2011 to 2019, with 2013 as an exception.
This suggests that the business may have adopted a pricing strategy aimed at maintaining competitiveness or attracting a broader customer base.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The drop in prices can help the business adjust its pricing strategies to match market trends and what customers expect.

#### Chart - 6 - Average Airbnb Price Trend Over Years by Room Type

In [None]:
# Chart - 6 visualization code

average_price_by_year_room_type = df.groupby(['year', 'room_type'])['price'].mean().reset_index()

# Plotting the data with Seaborn lineplot
plt.figure(figsize=(10, 6))
sns.lineplot(x='year', y='price', hue='room_type', data=average_price_by_year_room_type, marker='o')
plt.title('Average Airbnb Price Trend Over Years by Room Type')
plt.xlabel('Year')
plt.ylabel('Average Price')
plt.grid(True)
plt.legend(title='Room Type')
plt.show()

##### 1. Why did you pick the specific chart?

The chosen graph is a line graph. It shows trends over time by connecting data points with a line, making it easy to visualize changes and patterns for each room type.

##### 2. What is/are the insight(s) found from the chart?

**Entire Home/Apartment:**

Prices for entire homes/apartments fluctuate, with a significant spike in 2013 and slight variations in subsequent years. This suggests that external factors may be influencing prices, and it may be worth investigating the reasons behind the 2013 spike.

**Private Room:**

Private room prices peaked in 2011 and 2013, followed by a consistent decrease over the years. This drop in prices might be because people's preferences changed or because more competition came in. Another possibility is that lowering prices is a strategy to get more customers, as guests often prefer lower-priced listings.

**Shared Room:**

Shared room prices started in 2014 and have generally decreased since then. This could be because of changes in how the market works or because more people prefer private spaces over shared ones.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding the fluctuations in prices for each room type enables businesses to make strategic adjustments, potentially attracting more customers and making a positive business impact.
Adapting pricing strategies based on customer demand trends can also positively impact customer satisfaction.

#### Chart - 7 - Average price in different neighbourhood_group

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(8, 6))
sns.barplot(x='neighbourhood_group', y='price', data=df)
plt.xlabel('Neighbourhood Group')
plt.ylabel('Average Price')
plt.title('Average price in different neighbourhood_group')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

The chosen graph is a bar plot. It is suitable for comparing a numerical value across categories, and it helps in quickly comparing the average price of each neighbourhood group.
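
seaborn's `barplot` aggregates with the mean by default, so the bar heights in Chart 7 are per-group average prices. A minimal sketch on made-up rows shows the equivalent calculation with `groupby`:

```python
import pandas as pd

# Made-up listings: two per neighbourhood group
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Bronx", "Bronx"],
    "price": [150, 170, 100, 120, 60, 80],
})

# Mean price per group, sorted from most to least expensive
avg_price = (
    toy.groupby("neighbourhood_group")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_price)
```

Running the same `groupby` on `df` would give the exact values behind the bars, which the chart only shows approximately.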

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Manhattan has the most expensive prices, and Brooklyn is next, followed by Queens, Staten Island, and the Bronx. Even though Manhattan keeps prices high, it attracts a lot of hosts and guests.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can positively impact business decisions. Hosts can use this information to set competitive and strategic prices for their listings based on the neighborhood. Understanding the pricing dynamics can attract guests looking for specific locations and help optimize revenue.

#### Chart - 8 - Listings by Top Neighborhoods in NYC

In [None]:
# Chart - 8 visualization code

# Get the top 10 neighborhoods
top_10_neighbourhoods = df['neighbourhood'].value_counts().nlargest(10).reset_index()
top_10_neighbourhoods.columns = ['neighbourhood', 'listing_count']
plt.figure(figsize=(10, 6))
sns.barplot(x='neighbourhood', y='listing_count', data=top_10_neighbourhoods)
plt.xlabel('Neighbourhood', fontsize=14)
plt.ylabel('Total Listing Counts', fontsize=14)
plt.title('Listings by Top Neighborhoods in NYC', fontsize=15)
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

The chosen graph is a bar plot. It is suitable for comparing values across categories, and it helps in quickly comparing the listing counts of the top neighbourhoods.

##### 2. What is/are the insight(s) found from the chart?

The top neighborhoods in New York City in terms of listing counts are Williamsburg, Bedford-Stuyvesant, Harlem, Bushwick, and the Upper West Side.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Hosts can identify the most popular neighbourhoods and adjust their pricing or marketing strategies accordingly.
Businesses can focus marketing efforts or partnerships in these high-demand neighbourhoods to attract more customers.

#### Chart - 9 - Preferred Room Types in Each Neighbourhood Group

In [None]:
# Chart - 9 visualization code

preferred_room_types = df.groupby(['neighbourhood_group', 'room_type']).size().reset_index(name='count')
preferred_room_types_sorted = preferred_room_types.sort_values(by='count', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='count', y='neighbourhood_group', hue='room_type', data=preferred_room_types_sorted)
plt.xlabel('Count of Listings')
plt.ylabel('Neighbourhood Group')
plt.title('Preferred Room Types in Each Neighbourhood Group')
plt.legend(title='Room Type', title_fontsize='12')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

The chosen graph is a grouped bar plot. It is suitable for comparing counts across two categorical variables, and it helps in quickly comparing the listing counts of each room type within each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

**Manhattan**: Entire homes are the most preferred, followed by private rooms and shared rooms.

**Brooklyn**: Private rooms are the most preferred, followed by entire homes and shared rooms.

**Queens**: Private rooms are the most preferred, followed by entire homes and shared rooms.

**Bronx**: Private rooms are the most preferred, followed by entire homes and shared rooms.

**Staten Island**: Private rooms are the most preferred, followed by entire homes.
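
The per-group ranking above can also be derived from a count table rather than read off the chart. The following sketch uses made-up rows; `pd.crosstab` builds the table and `idxmax` picks the most common room type in each row.

```python
import pandas as pd

# Made-up listings: three in each of two neighbourhood groups
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 3 + ["Brooklyn"] * 3,
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Private room", "Entire home/apt"],
})

# Rows are neighbourhood groups, columns are room types, cells are listing counts
room_counts = pd.crosstab(toy["neighbourhood_group"], toy["room_type"])

# Column label with the highest count in each row
most_common = room_counts.idxmax(axis=1)

print(room_counts)
print(most_common)
```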

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Hosts can tailor their listings based on the predominant preferences in each neighbourhood, potentially attracting more guests.
Understanding room type preferences allows hosts to optimize their offerings and cater to the most popular choices in a given area.

#### Chart - 10 - Top 10 Hosts Based on Number of Listings in NYC

In [None]:
# Chart - 10 visualization code

top_hosts = df['host_name'].value_counts().nlargest(10).reset_index()
top_hosts.columns = ['host_name', 'listing_count']
plt.figure(figsize=(10, 6))
sns.barplot(x='host_name', y='listing_count', data=top_hosts)
plt.xlabel('Top 10 Hosts', fontsize=14)
plt.ylabel('Total NYC Listings', fontsize=14)
plt.title('Top 10 Hosts Based on Number of Listings in NYC', fontsize=15)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

A bar chart was chosen because it compares a numeric value (the number of listings) across categories (hosts), which makes it easy to see how the top hosts rank.

##### 2. What is/are the insight(s) found from the chart?

The top three hosts by total listings are Michael, David, and John, with 383, 368, and 276 listings, respectively.

There is a relatively large gap between the top two hosts and the rest. For example, John has 276 listings, which is significantly fewer than Michael's 383.

Within the top 10, Mike has 184 listings, less than half of Michael's 383. This could indicate that there is a lot of variation in the success of different hosts on Airbnb.
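
Because the ranking counts rows by `host_name`, different hosts who share a first name are added together. A minimal sketch of a per-account ranking, assuming `host_id` identifies a single host account (the toy rows below are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows: two different accounts are both named "Michael".
toy = pd.DataFrame({
    "host_id":   [1, 1, 1, 2, 3, 4],
    "host_name": ["Michael", "Michael", "Michael", "Michael", "David", "David"],
})

# Counting by name merges every host who shares that name.
by_name = toy["host_name"].value_counts()

# Counting by host_id keeps accounts separate; the name is kept for labelling.
by_host = (
    toy.groupby(["host_id", "host_name"])
       .size()
       .reset_index(name="listing_count")
       .sort_values("listing_count", ascending=False)
)
print(by_name)
print(by_host.head(10))
```

In the toy data, "Michael" has four listings when counted by name but the largest single account has only three, so the two rankings can differ.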

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11 - Average Price by Room Type in Each Neighbourhood Group

In [None]:
# Chart - 11 visualization code


plt.figure(figsize=(12, 6))
sns.barplot(x='neighbourhood_group', y='price', hue='room_type', data=df, palette='Set2')
plt.title('Average Price by Room Type in Each Neighbourhood Group', fontsize=16)
plt.xlabel('Neighbourhood Group', fontsize=14)
plt.ylabel('Average Price', fontsize=14)
plt.legend(title='Room Type')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen because it compares a numeric summary (the average price) across categories (neighbourhood group and room type), which makes the differences quick to see.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Manhattan has the highest prices across all room types, while the Bronx and Staten Island have the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can guide pricing strategies and help target customers based on neighborhood affordability, creating a positive business impact.
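
By default `sns.barplot` draws the mean of `y` for each category. The numbers behind the bars can be tabulated directly; this is a small sketch on hypothetical toy rows rather than the notebook's actual output:

```python
import pandas as pd

# Hypothetical toy rows standing in for the notebook's `df`.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Bronx", "Bronx", "Bronx"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [200, 300, 120, 100, 50, 70],
})

# Mean price per (neighbourhood_group, room_type), one column per room type.
avg_price = (
    toy.groupby(["neighbourhood_group", "room_type"])["price"]
       .mean()
       .unstack("room_type")
       .round(2)
)
print(avg_price)
```

A table like this makes it possible to quote exact averages next to the chart instead of reading them off the axis.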

#### Chart - 12 - Monthly Booking Counts by Room Type

In [None]:

# booking_counts_by_room_type was calculated earlier

plt.figure(figsize=(12, 6))
sns.lineplot(data=booking_counts_by_room_type, x='month', y='booking_count', hue='room_type', marker='o')

# Customize the plot
plt.title('Monthly Booking Counts by Room Type')
plt.xlabel('Month')
plt.ylabel('Booking Count')
plt.xticks(range(1, 13))
plt.grid(True)
plt.legend(title='Room Type')
plt.show()


##### 1. Why did you pick the specific chart?

The chosen graph is a line chart. It shows trends over time by connecting data points with a line, making changes and patterns easy to see.

##### 2. What is/are the insight(s) found from the chart?


Bookings for entire homes/apartments and private rooms increased in June. Bookings for shared rooms also increased in June, but the number of shared room listings is significantly lower.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Increasing rates during June, when bookings rise, could help increase revenue.
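
How `booking_counts_by_room_type` was built is not shown in this section. One possible construction is sketched below, under the assumption that review dates in a `last_review` column are used as a proxy for booking activity; both that assumption and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; `last_review` as a booking proxy is an assumption.
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Shared room"],
    "last_review": ["2019-06-03", "2019-06-21", "2019-06-15", "2019-01-02", None],
})

# Parse dates; missing or unparseable values become NaT and are dropped.
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")
dated = toy.dropna(subset=["last_review"]).copy()
dated["month"] = dated["last_review"].dt.month

# One row per (month, room_type) with the number of records in that month.
monthly_counts = (
    dated.groupby(["month", "room_type"])
         .size()
         .reset_index(name="booking_count")
)
print(monthly_counts)
```

The result has the `month`, `room_type`, and `booking_count` columns that the line plot expects.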


#### Chart - 13 - Average Minimum nights by Room Type

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10, 6))
sns.barplot(x='room_type', y='minimum_nights', data=df)
plt.xlabel('Room Type')
plt.ylabel('Minimum Nights')
plt.title('Average Minimum nights by Room Type')
plt.show()




##### 1. Why did you pick the specific chart?

A bar chart was chosen because it compares a numeric summary (the average minimum nights) across categories (room types), which makes the values quick to compare.

##### 2. What is/are the insight(s) found from the chart?

The average minimum stay is highest for entire homes/apartments, at about 9 nights, followed by shared rooms at about 7 nights and private rooms at about 5 nights.
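
The bars show means, and a mean can be pulled upward by a few listings with very long minimum stays. A quick check is to compare the mean with the median per room type; the sketch below uses hypothetical toy values to show the effect:

```python
import pandas as pd

# Hypothetical toy values: one listing requires a 365-night minimum stay.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "minimum_nights": [2, 3, 3, 365, 1, 1, 2, 4],
})

# Mean, median and maximum of minimum_nights for each room type.
summary = toy.groupby("room_type")["minimum_nights"].agg(["mean", "median", "max"])
print(summary)
```

If the two statistics differ a lot on the real data, `sns.barplot` also accepts an `estimator` argument (for example `estimator=np.median`) to plot medians instead of means.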

#### Chart - 14 - Visualize longitude and latitude of the listings in the Airbnb NYC dataset with room types and average price

In [None]:
# Scatter plot of listing longitude and latitude in the Airbnb NYC dataset, coloured by room type.

sns.set(rc={"figure.figsize": (10, 8)})

ax = sns.scatterplot(x='longitude', y='latitude', hue='room_type', data=df, palette='muted')

# Set the title of the plot
ax.set_title('Distribution of type of rooms across NYC', fontsize=14)
plt.show()

# Price variation across NYC neighbourhood groups: same coordinates, coloured by price.
lat_long = df.plot(kind='scatter', x='longitude', y='latitude', label='price_variations', c='price',
                  cmap=plt.get_cmap('jet'), colorbar=True, alpha=0.4, figsize=(10, 8))
lat_long.legend()
plt.show()


The range of prices for accommodations in Manhattan is particularly high, indicating that it is the most expensive place to stay in NYC, as shown in the price plot above.

Manhattan is likely to attract many tourists and visitors because it has more attractions to visit, which would explain why prices there are higher than in the other neighbourhood groups.

Travelers are likely to spend more days in this area because of its popular amenities, high concentration of tourist attractions, and public transport.




#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

numeric_df = df.select_dtypes(include=['int64', 'float64'])
heatmap_data = numeric_df.corr()

plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap="coolwarm", fmt=".2f", linewidths=.5)
plt.title("Airbnb Dataset Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is a graphical representation of the pairwise relationships between multiple variables. It uses colors to indicate the strength and direction of each relationship, with more intense colors typically representing stronger correlations. This visualization helps to quickly identify patterns, trends, and potential relationships among variables.

##### 2. What is/are the insight(s) found from the chart?

There is a moderate positive correlation (0.58) between the host_id and id columns. Both are identifiers rather than measurements, so this most likely reflects that IDs are assigned in increasing order over time (newer hosts tend to have newer listings) and carries no business meaning.

There is a weak positive correlation (0.17) between the price column and the calculated_host_listings_count column, which suggests that hosts with more listings tend to charge slightly higher prices for their listings.

There is a weak positive correlation (0.23) between the calculated_host_listings_count column and the availability_365 column, which suggests that hosts with more listings tend to have more days of availability in the next 365 days.

There is a moderate positive correlation (0.59) between the number_of_reviews column and the reviews_per_month column, which suggests that listings with more total reviews tend to have more reviews per month.
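
Reading individual values off a heatmap is error-prone, so it can help to list the column pairs ranked by the strength of their correlation. The sketch below is one way to do that on hypothetical synthetic data; identifier columns such as `id` and `host_id` are dropped first because they are labels rather than measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data standing in for the notebook's `df`.
rng = np.random.default_rng(0)
n = 200
reviews = rng.poisson(20, size=n)
toy = pd.DataFrame({
    "id": np.arange(n),
    "host_id": np.arange(n) + rng.integers(0, 50, size=n),
    "number_of_reviews": reviews,
    "reviews_per_month": reviews / 12 + rng.normal(0, 0.2, size=n),
    "price": rng.integers(40, 400, size=n),
})

# Pearson correlation of the numeric, non-identifier columns.
corr = toy.select_dtypes(include="number").drop(columns=["id", "host_id"]).corr()

# Keep each pair once (upper triangle, no diagonal) and rank by absolute value.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Each row of `pairs` is indexed by the two column names, so the strongest relationships appear at the top regardless of sign.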

#### Chart - 16 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df)

# show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot consists of multiple scatterplots arranged in a grid, with each scatterplot showing the relationship between two variables.
It can be used to visualize relationships between multiple variables and to identify patterns in the data.
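
A pair plot over every numeric column of a large table can be slow to draw and hard to read. One option, sketched here with a hypothetical helper `pairplot_frame` and synthetic data, is to select a few columns of interest and plot a random sample of rows:

```python
import numpy as np
import pandas as pd

def pairplot_frame(df, columns, n_rows=2000, seed=0):
    """Return `df[columns]` without missing rows, down-sampled to at most `n_rows`."""
    subset = df[columns].dropna()
    if len(subset) > n_rows:
        subset = subset.sample(n=n_rows, random_state=seed)
    return subset

# Hypothetical synthetic data standing in for the notebook's `df`.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "id": np.arange(5000),
    "price": rng.integers(40, 400, size=5000),
    "minimum_nights": rng.integers(1, 30, size=5000),
    "number_of_reviews": rng.poisson(20, size=5000),
})

small = pairplot_frame(toy, ["price", "minimum_nights", "number_of_reviews"])
print(small.shape)
```

The returned frame can then be passed to `sns.pairplot` in place of the full `df`; the column list and sample size are illustrative choices, not values taken from the notebook.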

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Hosts should strategically position their properties in Manhattan and Brooklyn to increase bookings. Choosing popular neighborhoods within these areas can enhance visibility.

Focusing on entire home or private room listings is beneficial, as guests tend to prefer these options over shared accommodations.

Although guests generally prefer lower-priced listings, it is essential to understand seasonal trends and customer behavior. Hosts should adjust pricing by season: lower prices during off-season months and higher prices during the festive season. Guests typically prefer longer stays in entire homes, so minimum night requirements should be set accordingly.

In competitive neighborhoods like Manhattan and Brooklyn, hosts need to implement effective strategies to differentiate themselves. Encouraging guests to provide feedback and reviews can improve visibility and attract more bookings, ultimately maximizing

# **Conclusion**


Despite being the most expensive, Manhattan witnesses the highest number of bookings, closely followed by Brooklyn, indicating the significance of these locations in guest preferences.

The noticeable spike in 2013 prompts an investigation to uncover the reasons behind this anomaly, providing insights into external factors that may have influenced pricing during that period. Understanding the dynamics behind the price spike can offer valuable information for making informed decisions in future pricing strategies.

From 2011 to 2019 there is a general decrease in average prices, which means hosts need to understand seasonal trends and customer preferences and provide excellent service that justifies their prices. This strategic approach not only enhances guest satisfaction but also contributes to increased revenue.

Single-listing hosts dominate the platform, particularly in Manhattan and Brooklyn, highlighting the competitive landscape for individual hosts.

Beyond the popularity of specific neighborhoods, the analysis underscores the importance of understanding seasonality, providing offers, and ensuring excellent service to stand out in a competitive market.

Guests consistently show a preference for lower-priced listings, with a gradual increase in the volume of guests in these listings from 2011 to 2019, emphasizing the strategic importance of pricing for hosts.
