# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

This project performs an Exploratory Data Analysis (EDA) of Airbnb listings in New York City for the year 2019. The main goal is to uncover useful patterns and insights about listings, pricing, availability, and host behaviour across different neighbourhoods. Analysing this data helps us understand which factors influence pricing and which locations or property types perform better. The dataset supports exploring trends in the short-term rental market and how Airbnb properties vary across boroughs such as Manhattan, Brooklyn, Queens, the Bronx and Staten Island.

# **GitHub Link -**


# **Problem Statement**



The short-term rental market has changed how people travel: platforms such as Airbnb offer an alternative to conventional accommodation. In cities with a very large number of listings, such as New York, companies need to understand the main trends in listings, guest preferences and pricing to stay competitive and profitable.

New York City is a popular tourist destination with a varied range of Airbnb properties. These range from entire homes to shared rooms and are located in neighbourhoods that differ in popularity and price. This variation creates both opportunities and challenges in balancing supply and demand, setting prices and assessing host performance.

The goal of this project is to perform exploratory data analysis (EDA) in Python on a dataset of approximately 49,000 Airbnb listings in New York City. The analysis examines room type distributions, price ranges, guest activity (measured through reviews) and availability patterns across neighbourhoods. The findings are intended to support strategic business decisions in the form of pricing recommendations, targeted marketing and host engagement practices.


#### **Define Your Business Objective?**

To assist Airbnb hosts and the company in optimizing listing strategies.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams
import math


### Dataset Loading

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Load Dataset

df=pd.read_csv('Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
# Dataset First Look

df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
missing_counts = df.isnull().sum()

# Keep only columns with missing values
missing_counts = missing_counts[missing_counts > 0]

# Horizontal bar plot: counts on the x-axis, column names on the y-axis
plt.figure(figsize=(10, 6))
sns.barplot(x=missing_counts.values, y=missing_counts.index)

# Labels
plt.title('Missing Values per Column')
plt.xlabel('Number of Missing Values')
plt.ylabel('Columns')
plt.show()

### What did you know about your dataset?

The dataset belongs to the hospitality industry, focusing on Airbnb’s operations in New York City. Airbnb, since 2008, has offered travelers unique and personalized accommodations worldwide. Data from the listings help Airbnb make business decisions, improve services, and understand customer and host behavior.

The given dataset contains 48,895 rows and 16 columns, consisting of both categorical and numerical variables like neighbourhood group, room type, price, availability, and reviews.

There are missing values in name (16), host_name (21), last_review (10,052), and reviews_per_month (10,052). There are no duplicate entries.

The goal is to explore and analyze the dataset to discover key insights into pricing, availability, and customer behavior.
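The missing-value counts above can also be read as a share of all rows. The following is a minimal sketch on a small made-up frame; `toy` and `missing_summary` are hypothetical names used only for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Airbnb data (not the real dataset).
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [120, 80, 95, 150],
    "reviews_per_month": [0.5, np.nan, np.nan, 2.1],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the loaded dataset should give the same kind of table for the real columns.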

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')


### Variables Description



id: Unique listing ID.

name: Name/title of the listing.

host_id:	Unique ID assigned to each host.

host_name:	Name of the host.

neighbourhood_group:	Major location(e.g., Manhattan, Brooklyn).

neighbourhood:	Specific area or neighborhood within the location.

latitude: Geographical latitude of the listing.

longitude:	Geographical longitude of the listing.

room_type:	Type of listing (Entire home/apt, Private room, Shared room).

price:	Price of the listing per night (in USD).

minimum_nights:	Minimum number of nights required per booking.

number_of_reviews:	Total number of reviews received.

last_review:	Date of the most recent review.

reviews_per_month:	Average number of reviews received per month.

calculated_host_listings_count:	Total number of listings the host has.

availability_365:	Number of available days in a year (0-365).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

In [None]:
df.dtypes

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
data_df=df.copy()
data_df.replace('?',np.nan,inplace=True)
data_df.isnull().sum()

In [None]:
data_df

In [None]:
# Converting last review column into datetime format.
data_df['last_review']=pd.to_datetime(data_df['last_review'])
data_df.dtypes

In [None]:
# Check whether the number of missing values matches the number of listings with zero reviews.
empty_reviews=data_df[data_df['number_of_reviews']==0]
empty_reviews.shape


In [None]:
# Listings with zero reviews have missing values in the last_review and reviews_per_month columns.
# Since both are undefined for listings with no reviews, the missing values are set to zero.
data_df.loc[data_df['number_of_reviews'] == 0, ['last_review', 'reviews_per_month']] = 0
data_df.isnull().sum()

In [None]:
# Replacing null values in the name column with the text 'Hotel name not mentioned'
data_df['name']=data_df['name'].fillna('Hotel name not mentioned')
data_df[data_df['name']=='Hotel name not mentioned']

In [None]:
# Listings named 'Not available' get the same placeholder (the result is assigned so the change persists).
data_df['name']=data_df['name'].replace('Not available', 'Hotel name not mentioned')
data_df[data_df['name']=='Hotel name not mentioned']

In [None]:
# Similarly, filling null values in the host_name column with the text 'Host name not mentioned'
data_df['host_name'] = data_df['host_name'].fillna('Host name not mentioned')

In [None]:
# Number of listings available every day of the year.
no_available_allyear = data_df[data_df['availability_365']==365].shape
print("Number of listings available for 365 days:",no_available_allyear[0])


In [None]:
hotel_available = data_df[data_df['availability_365']==365]
# Names of hotels available for 365 days
hotel_available_name=hotel_available['name']
hotel_available_name

In [None]:
# Finding average price of hotels
avg_price=data_df['price'].mean()
print("Average price of a hotel is",round(avg_price,3))

In [None]:
# Finding median price of a hotel
median_price=data_df['price'].median()
print("Median price of a hotel is",round(median_price,3))

In [None]:
# Average minimum-night requirement (minimum_nights is a booking rule, not the actual length of stay)
avg_nights=data_df['minimum_nights'].mean()
print("Average minimum nights required is",round(avg_nights,3))

In [None]:
# Storing the names of the different neighbourhoods.
neighbourhood_names=data_df['neighbourhood'].unique()
print(neighbourhood_names)

In [None]:
# Finding the number of hotels for each neighbourhood
neighbourhood_freq=data_df['neighbourhood'].value_counts()
neighbourhood_freq

In [None]:
# Storing the names of the different neighbourhood groups
neighbourhood_group_names=data_df['neighbourhood_group'].unique()
print(neighbourhood_group_names)


In [None]:
# Finding the number of hotels for each neighbourhood group
neighbourhood_group_freq=data_df['neighbourhood_group'].value_counts()
neighbourhood_group_freq

In [None]:
# Number of listings for each room type
room_type=data_df['room_type'].value_counts()
room_type

In [None]:
# Listings that belong to the host with the most listings
max_listings=data_df['calculated_host_listings_count'].max()
most_listings=data_df[data_df['calculated_host_listings_count']==max_listings]
most_listings.shape

In [None]:
# Listings with 500 or more reviews
hotels_500=data_df[data_df['number_of_reviews']>=500]
hotels_500


In [None]:
# Average price of listings with 500 or more reviews
avg_price_500=hotels_500['price'].mean()
print("Average price of listings with 500 or more reviews is",round(avg_price_500,3))

In [None]:
# Top 25 listings by number of reviews
popular_hotels=data_df.sort_values(by='number_of_reviews',ascending=False).head(25)
popular_hotels

In [None]:
# Average price of popular hotels
popular_hotels_price=popular_hotels['price'].mean()
print("Average price of popular hotels is",round(popular_hotels_price,3))

In [None]:
# Finding out average price of various room types
avg_price_per_room = data_df.groupby('room_type')['price'].mean()
round(avg_price_per_room,2)


In [None]:
# Median price of various room types.
median_price_per_room = data_df.groupby('room_type')['price'].median()
round(median_price_per_room,2)

In [None]:
# Average price of listings in all the neighbourhoods
neighbourood_avg_price=data_df.groupby('neighbourhood')['price'].mean()
round(neighbourood_avg_price,2)

In [None]:
# Median price of hotels in all the neighbourhoods.
neighbourhood_median_price=data_df.groupby('neighbourhood')['price'].median()
round(neighbourhood_median_price,2)

In [None]:
# Average price of hotels for each neighbourhood group
neighbourhood_group_avg_price=data_df.groupby('neighbourhood_group')['price'].mean()
round(neighbourhood_group_avg_price,2)

In [None]:
# Median price of hotels for each neighbourhood group.
neighbourhood_group_median_price=data_df.groupby('neighbourhood_group')['price'].median()
round(neighbourhood_group_median_price,2)

In [None]:
# Maximum price for different neighbourhood groups.
max_price_neighbourhood=data_df.groupby('neighbourhood_group')['price'].max()
round(max_price_neighbourhood,2)

In [None]:
# Average minimum-night requirement for different locations
avg_stay_duration=data_df.groupby('neighbourhood_group')['minimum_nights'].mean()
round(avg_stay_duration,2)

In [None]:
# Median minimum-night requirement for different locations.
median_stay_duration=data_df.groupby('neighbourhood_group')['minimum_nights'].median()
round(median_stay_duration,3)

In [None]:
# Largest minimum-night requirement for different locations.
max_stay_duration=data_df.groupby('neighbourhood_group')['minimum_nights'].max()
round(max_stay_duration,2)

In [None]:
# Availability of hotels in different locations.
availability=data_df.groupby('neighbourhood_group')['availability_365'].mean()
round(availability,2)

In [None]:
# Median availability of different hotels for different location.
median_availability=data_df.groupby('neighbourhood_group')['availability_365'].median()
round(median_availability,2)

In [None]:
# Average Availability of different room types.
availability_room=data_df.groupby('room_type')['availability_365'].mean()
round(availability_room,2)

In [None]:
# Median Availability of different room types.
median_availability_room=data_df.groupby('room_type')['availability_365'].median()
round(median_availability_room,2)

In [None]:
# Finding the count of different types of rooms.
room_type_count=data_df['room_type'].value_counts()
room_type_count

In [None]:
# Average minimum nights for different room types
avg_stay_rooms=data_df.groupby('room_type')['minimum_nights'].mean()
round(avg_stay_rooms,2)

In [None]:
# Listings with the largest minimum-night requirements:
longest_stays=data_df.sort_values(by='minimum_nights',ascending=False).head(20)
longest_stays

In [None]:
# Which room type has the better availability
room_availability=data_df.groupby('room_type')['availability_365'].mean()
round(room_availability,2)

In [None]:
# Finding the highest price.
max_price=data_df[data_df['price']==data_df['price'].max()]
max_price

In [None]:
# Top 25 hotels according to their price.
high_price=data_df.sort_values(by='price',ascending=False).head(25)
high_price

In [None]:
#Finding out the rooms in the luxury segment.
luxury_rooms=data_df[(data_df['price']>=2000) & (data_df['room_type']=='Private room')]
luxury_rooms

In [None]:
# What is the average number of reviews per listing for different room types?
avg_reviews_by_roomtype=data_df.groupby('room_type')['number_of_reviews'].mean()
avg_reviews_by_roomtype

In [None]:
# What is the median number of reviews per listing for different room types?
median_reviews_by_roomtype=data_df.groupby('room_type')['number_of_reviews'].median()
median_reviews_by_roomtype

In [None]:
# Number of reviews for the various room types.
reviews_by_roomtype=data_df.groupby('room_type')['number_of_reviews'].sum()
reviews_by_roomtype

In [None]:
# Number of reviews for different locations.
reviews_by_neighbourhood_group=data_df.groupby('neighbourhood_group')['number_of_reviews'].sum().sort_values(ascending=False)
reviews_by_neighbourhood_group

In [None]:
# Average reviews for different locations.
avg_reviews_by_neighbourhood_group=data_df.groupby('neighbourhood_group')['number_of_reviews'].mean().sort_values(ascending=False)
avg_reviews_by_neighbourhood_group

In [None]:
# Median reviews for different locations
median_reviews_by_neighbourhood_group=data_df.groupby('neighbourhood_group')['number_of_reviews'].median().sort_values(ascending=False)
median_reviews_by_neighbourhood_group

In [None]:
# Finding correlation between availability and number of reviews.
data_df[['availability_365', 'number_of_reviews']].corr()


In [None]:
# Finding the correlation between price and number of reviews.
data_df[['price','number_of_reviews']].corr()

### What all manipulations have you done and insights you found?

Manipulations:

There were 16 missing values in the name column, 21 in the host_name column, and 10,052 in each of the last_review and reviews_per_month columns. Missing values in name (listing name) and host_name (name of the listing's host) were replaced with "Hotel name not mentioned" and "Host name not mentioned" respectively. Missing values in both review columns were replaced with zeroes. There were also 11 listings with a price of zero; because they had reviews, they were treated as real properties rather than errors. Since they are genuine and very few compared with the total number of rows, they were kept unchanged. The last_review column was converted to datetime format.
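The zero-price check described above can be sketched as a pair of boolean filters. This is an illustration on made-up rows, assuming only the `price` and `number_of_reviews` columns; `toy`, `zero_price` and the other names are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the real dataset has far more listings.
toy = pd.DataFrame({
    "price": [0, 150, 0, 80],
    "number_of_reviews": [12, 3, 0, 40],
})

# Listings priced at zero.
zero_price = toy[toy["price"] == 0]

# Zero-price listings that still have guest reviews, i.e. likely real properties.
zero_price_with_reviews = zero_price[zero_price["number_of_reviews"] > 0]

# Share of all rows affected, to judge whether keeping them distorts the analysis.
share = len(zero_price) / len(toy)

print(len(zero_price), len(zero_price_with_reviews), round(share, 4))
```

On the real data the same filters applied to `data_df` would show how many zero-price rows exist and how small their share is.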

Insights and assumptions:

1. Manhattan is the most preferred neighbourhood group.

  Manhattan has the most listings (21,661), with Brooklyn a close second (20,104). However, Brooklyn listings collected more reviews in total (486,574) than Manhattan listings (454,569). Using reviews as a proxy for bookings, this suggests Brooklyn sees more guest engagement and may be preferred by guests.

2. Private rooms are the most listed or preferred type of property.

  Not by listing count: Entire homes/apartments have more listings (25,409) and more total reviews (580,403) than Private rooms (22,326 listings, 538,346 reviews).
  Per listing, however, Private rooms average more reviews (24.11) than Entire homes/apartments (22.84), possibly because of lower cost and suitability for short stays or solo travelers. Shared rooms have just 19,256 total reviews and 16.6 reviews per listing on average.

  So, entire homes are the most listed, but private rooms may be booked more often per listing, which would indicate strong guest preference.

3. Higher prices lead to more reviews.

  Price and number of reviews have a very weak negative correlation (-0.048), indicating that higher-priced listings do not attract more reviews. If anything, the number of reviews decreases slightly as price increases, though the effect is minimal.
  So, price is not a strong driver of review frequency.
  It is also possible that affordable listings are booked more often, and therefore reviewed more often.

4. Higher availability leads to more reviews.

  The correlation is positive but weak (0.17).

  This implies that listings available for more days of the year tend to get somewhat more reviews.
  So, availability helps, but it is not the only driver of customer engagement.

5. Manhattan is the costliest neighbourhood group.

  Yes, Manhattan is the costliest neighbourhood group, with the highest average listing price (196.88 dollars) and the highest median listing price (150 dollars). This reflects its premium location and high demand compared to other groups.
  
6. Staten Island has the highest availability.

  Staten Island has the highest mean availability (199 days) and median availability (219 days) of all the neighbourhood groups, suggesting that listings here are available for most of the year.
  This may reflect less competition or demand, so listings remain open longer.

7. Entire homes/apartments are significantly costlier than private or shared rooms.

  Yes, Entire homes/apartments have a significantly higher mean price (211.79) and median price (160) than Private rooms (mean 89.78, median 70) and Shared rooms (mean 70.13, median 45).
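Several of the comparisons above (listing counts, total reviews and average reviews per room type) can be produced in one table with pandas named aggregation. The sketch below uses made-up rows; `toy` and `room_summary` are hypothetical names and the numbers are not from the dataset.

```python
import pandas as pd

# Hypothetical rows with the same column names as the Airbnb data.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "number_of_reviews": [10, 30, 50, 10, 5],
    "price": [200, 180, 90, 70, 40],
})

# One row per room type, one column per statistic.
room_summary = (
    toy.groupby("room_type")
       .agg(
           listings=("number_of_reviews", "size"),
           total_reviews=("number_of_reviews", "sum"),
           avg_reviews=("number_of_reviews", "mean"),
           median_price=("price", "median"),
       )
       .sort_values("total_reviews", ascending=False)
)

print(room_summary)
```

Keeping the counts, totals and per-listing averages side by side makes it easier to see when a category leads on volume but not on engagement per listing.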










## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Price Distribution-Univariate

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(25,20))
plt.boxplot(data_df['price'])
plt.yticks([0, 500, 1000, 2000, 5000, 10000])
plt.title('Price Distribution')
plt.ylabel('Price (USD)')
plt.show()

##### 1. Why did you pick the specific chart?




A box plot was chosen because it displays outliers effectively and shows where most listing prices lie, giving a clear visual representation of the price distribution.


##### 2. What is/are the insight(s) found from the chart?

Most listing prices lie between 0 and 500. Although prices are concentrated at the low end, a few outlier listings are priced at 10,000. These are the extreme outliers in this dataset; listings priced at 5,000 or above can also be considered outliers because they are far from the bulk of the data and few in number. Prices are therefore widely spread in the upper range, and listings priced above 2,000 are also rare.
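A box plot flags outliers using the interquartile range (IQR), and the same rule can be computed directly. The sketch below applies the common 1.5 x IQR fence to a made-up price series; the values and the name `prices` are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Hypothetical prices, including two extreme values.
prices = pd.Series([40, 60, 75, 90, 110, 150, 200, 500, 10000], name="price")

# Quartiles and interquartile range.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Points above the upper fence are the ones a box plot draws as outliers.
upper_fence = q3 + 1.5 * iqr
outliers = prices[prices > upper_fence]

print(f"Q1={q1}, Q3={q3}, upper fence={upper_fence}")
print(outliers.tolist())
```

Applied to `data_df['price']`, this would give a data-driven threshold to compare against the visually chosen cut-offs of 2,000 and 5,000.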


##### 3. Will the gained insights help creating a positive business impact?


Yes. Knowing that most listing prices fall below 500 helps focus pricing and marketing where demand is highest. While high-priced listings are rare, targeting this luxury segment with specialized offers can drive high-margin sales. These insights also help avoid over-investing in low-demand premium offerings.


Are there any insights that lead to negative growth? Justify with specific reason.


Yes. Ignoring the luxury segment as mere outliers may cause missed high-value opportunities. Over-focusing on the budget segment could also trigger price wars and reduce profitability.

#### Chart - 2 Availability Distribution - Univariate

In [None]:
# Chart - 2 visualization code


plt.figure(figsize=(10,7))
sns.histplot(data_df['availability_365'],kde=True)

plt.xlabel('Availability in days')
plt.ylabel('Count')
plt.title('Availability Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a histogram to clearly visualize the distribution of property availability in days, as it shows frequency patterns effectively.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that a large number of listings have availability between 0 and 50 days, many of them under 20 days, while another significant group is available year-round.
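The two groups visible in the histogram can be counted by binning availability into labelled ranges with `pd.cut`. This is a sketch on made-up values; the bin edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical availability values in days per year.
availability = pd.Series([0, 0, 5, 45, 120, 200, 364, 365, 365])

# Right-inclusive bins: (-1, 0], (0, 50], (50, 364], (364, 365].
bins = [-1, 0, 50, 364, 365]
labels = ["0 days", "1-50 days", "51-364 days", "365 days"]
bucket = pd.cut(availability, bins=bins, labels=labels)

# Count listings per bucket, kept in bin order rather than sorted by frequency.
bucket_counts = bucket.value_counts().sort_index()

print(bucket_counts)
```

Separating the "0 days" bucket from "1-50 days" helps distinguish listings that are effectively inactive from those that are merely rarely available.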

##### 3. Will the gained insights help creating a positive business impact?

These insights can create a positive business impact by helping identify high-performing listings and encouraging hosts to increase availability, potentially improving occupancy rates and revenue. Additionally, platforms can prioritize support or promotions for listings with consistent availability to maintain supply.

Are there any insights that lead to negative growth? Justify with specific reason.

On the other hand, the large number of listings with very low availability may point to inactive hosts, seasonal closures, or listings used for non-rental purposes. If left unaddressed, this could lead to reduced inventory, poor customer experience, and a loss of potential income, harming platform growth and reputation.



#### Chart - 3 Amount of Listings per Neighbourhood Groups. (Univariate)

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12,7))
sns.countplot(x=data_df['neighbourhood_group']) # countplot tallies the listings in each neighbourhood group
plt.xlabel('Neighbourhood Groups')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot was suitable because it clearly shows how many listings exist in each neighbourhood group, allowing easy visual comparison between the groups.

##### 2. What is/are the insight(s) found from the chart?

Manhattan has the highest number of listings, followed closely by Brooklyn. Next come Queens and the Bronx, and finally Staten Island with the fewest listings.

##### 3. Will the gained insights help creating a positive business impact?

Yes. Understanding that Manhattan and Brooklyn dominate in listing counts helps focus marketing, resource allocation, and operational efforts where supply is concentrated. Businesses can prioritize these locations for promotions, partnerships, and expansion plans.
In contrast, areas like Staten Island and the Bronx represent potential growth markets if demand increases, allowing businesses to explore niche opportunities with less competition.


Are there any insights that lead to negative growth? Justify with specific reason.


Focusing solely on high-density areas like Manhattan and Brooklyn could lead to market saturation and increased competition, driving prices down. Ignoring smaller areas like Staten Island or the Bronx could mean missing out on untapped markets where supply is low but future demand could grow.

#### Chart - 4 Room Type Share Among Listings(Univariate)


In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(8,8))
plt.pie(room_type_count, labels=room_type_count.index, autopct='%1.1f%%', startangle=140)
plt.title('Room Type Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

The pie chart shows what portion of total listings each room type occupies, making it a suitable choice for analyzing room type distribution. It is also clear and easy to understand.

##### 2. What is/are the insight(s) found from the chart?

The largest portion is Entire home/apt, at 52% of all listings. Private rooms come second with a significant 45.7% share, and Shared rooms come last with a very small 2.4%.
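The percentages shown on the pie chart can be computed directly with `value_counts(normalize=True)`. The sketch below uses a made-up series of ten room types, so its shares differ from the real ones; `room_types` and `share_pct` are hypothetical names.

```python
import pandas as pd

# Hypothetical room types for ten listings.
room_types = pd.Series(
    ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"] * 1,
    name="room_type",
)

# Fraction of listings per room type, expressed as a percentage.
share_pct = (room_types.value_counts(normalize=True) * 100).round(1)

print(share_pct)
```

Having the shares as numbers makes it possible to quote them in the text without reading them off the chart.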


##### 3. Will the gained insights help creating a positive business impact?

Yes. Knowing that entire homes/apartments dominate the market (52%), followed by private rooms (45.7%), helps businesses focus their listings, marketing, and service offerings on the room types that make up most of the supply. Since shared rooms make up only 2.4%, resources can be optimized by prioritizing the more common room types. This insight also helps in strategic pricing and service differentiation, targeting customers based on their preferred room type.

Are there any insights that could lead to negative growth? Justify with specific reason.

Focusing exclusively on high-demand room types might cause businesses to ignore niche markets like shared rooms. While small in share, shared rooms may attract budget-conscious travelers or group bookings, representing a missed opportunity if overlooked.

#### Chart - 5 Minimum Nights Distribution(Univariate)

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(15,10))
sns.boxplot(data_df['minimum_nights'])
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen because it effectively highlights the distribution and presence of outliers in minimum_nights, which often has extreme values. It's a simple and clear way to visualize skewed data.

##### 2. What is/are the insight(s) found from the chart?

Most listings allow short stays, but there are outliers with unusually high minimum stay requirements. This shows a wide variation in booking policies across listings.
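How extreme the minimum-night outliers are can be quantified with a few upper percentiles. This is a sketch on a made-up series; the 30-night threshold is an arbitrary illustrative choice, not a rule taken from the dataset.

```python
import pandas as pd

# Hypothetical minimum-night requirements, with one extreme value.
min_nights = pd.Series([1, 1, 2, 2, 3, 3, 5, 7, 30, 1000])

# Median and upper percentiles show how far the tail stretches.
quantiles = min_nights.quantile([0.5, 0.9, 0.99])

# Share of listings requiring more than 30 nights (illustrative threshold).
share_over_30 = (min_nights > 30).mean()

print(quantiles)
print(share_over_30)
```

If the median stays small while the 99th percentile is very large, the long tail is driven by a handful of listings rather than by typical booking policies.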

##### 3. Will the gained insights help creating a positive business impact?

Yes, identifying listings with very high minimum stays can help platforms or hosts reassess their booking policies to align with guest expectations and improve occupancy rates.

Are there any insights that lead to negative growth? Justify with specific reason.

Yes, listings with very high minimum night requirements might deter bookings and reduce visibility. These policies, if not intentional, could negatively impact host revenue and user satisfaction.



#### Chart - 6 Comparison of Prices Across Room Types(Bivariate)

In [None]:
# Chart - 6 visualization code
# Average price vs Room type
plt.figure(figsize=(10,7))
plt.bar(avg_price_per_room.index, avg_price_per_room.values, color='green')
plt.title('Average Price per Room Type')
plt.xlabel('Room Type')
plt.ylabel('Average Price')
plt.show()

In [None]:
avg_price_per_room

In [None]:
# Median price vs Room type
plt.figure(figsize=(10,7))
plt.bar(median_price_per_room.index, median_price_per_room.values, color='green')
plt.title('Median Price per Room Type')
plt.xlabel('Room Type')
plt.ylabel('Median Price')
plt.show()

In [None]:
median_price_per_room

##### 1. Why did you pick the specific chart?

A bar chart was used to compare the average and median prices across different room types. This helps in understanding how pricing varies based on the type of accommodation (Entire home/apt, Private room, Shared room), and gives a clearer picture of both the typical and average costs.

##### 2. What is/are the insight(s) found from the chart?

Entire home/apt has the highest average price at 211.79 dollars and a median price of 160 dollars, placing it in the premium segment.
Private rooms average 89.78 dollars with a median of 70 dollars, representing mid-tier affordability.
Shared rooms are the lowest in price, with an average of 70.13 dollars and a median of 45 dollars, making them suited to budget travelers.

For every room type the mean is higher than the median, which suggests that prices are right-skewed: a small number of expensive listings pull the average up, while the ranking of room types by price stays the same under both measures.
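How the mean and median compare within each room type can be checked numerically rather than by eye. The sketch below computes both, their gap, and the sample skewness per group on made-up prices; `toy`, `price_stats` and `price_skew` are hypothetical names.

```python
import pandas as pd

# Hypothetical prices: each room type has one expensive listing.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "price": [100, 150, 170, 600, 50, 60, 80, 210],
})

# Mean and median side by side; a positive gap means the mean sits above the median.
price_stats = toy.groupby("room_type")["price"].agg(["mean", "median"])
price_stats["gap"] = price_stats["mean"] - price_stats["median"]

# Sample skewness per room type; values above zero indicate a right-skewed distribution.
price_skew = toy.groupby("room_type")["price"].skew()

print(price_stats)
print(price_skew)
```

A gap near zero together with skewness near zero would support a symmetric reading, while a clearly positive gap points to a few high prices lifting the average.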



##### 3. Will the gained insights help creating a positive business impact?

Yes. These pricing insights are valuable in several business aspects:

Pricing strategy: Define clear pricing tiers for each room type.

Targeted marketing: Aim high-end listings such as entire homes at premium users and promote private/shared rooms to budget travelers.

Revenue optimization: Encourage hosts to upgrade their listings or adjust prices within their tier to improve profitability.


Are there any insights that lead to negative growth? Justify with specific reason.

Yes, potential risks include:

Over-dependence on premium listings (entire homes) may alienate price-sensitive travelers, reducing occupancy.
Focusing too much on budget listings could limit revenue potential and brand positioning.
Ignoring pricing flexibility might miss opportunities for seasonal or demand-based pricing, affecting competitiveness and revenue growth.

#### Chart - 7 Comparison of Price across different Neighbourhood groups(Bivariate)

In [None]:
# Chart - 7 visualization code
# Average price vs Neighbourhood group
plt.bar(neighbourhood_group_avg_price.index, neighbourhood_group_avg_price.values, color='red')
plt.title('Average Price per Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Average Price')
plt.show()

In [None]:
neighbourhood_group_avg_price

In [None]:
# Median price vs Neighbourhood group.
plt.bar(neighbourhood_group_median_price.index, neighbourhood_group_median_price.values, color='red')
plt.title('Median Price per Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Median Price')
plt.show()

In [None]:
neighbourhood_group_median_price

##### 1. Why did you pick the specific chart?

A bar chart was chosen to clearly display how listing prices differ between neighbourhood groups. Since both average and median prices are continuous values mapped against distinct categories (the neighbourhood groups), this format allows a clean comparison across regions and shows how pricing varies spatially across the city.

##### 2. What is/are the insight(s) found from the chart?

The analysis confirms that Manhattan is the most premium neighbourhood group, with an average price of 196.88 dollars and a median of 150 dollars. This suggests the presence of both consistently high-priced listings and a few extremely costly outliers.
Brooklyn follows with an average of 124.38 dollars, while Staten Island shows a moderate price level at 114.81 dollars.
Queens and the Bronx emerge as the most economical options, with average prices of around 99.5 dollars and 87.5 dollars, and medians of 75 dollars and 65 dollars, respectively.

Notably, the gap between average and median prices is widest in Manhattan, which may reflect a more polarized market where luxury listings influence the mean disproportionately.
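One way to see how much a few luxury listings move a group's mean is to cap prices at a ceiling and compare the means before and after. This is a sketch on made-up rows; the cap of 1000 and the names `toy`, `cap` and `comparison` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical prices: one group contains a single very expensive listing.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 5 + ["Bronx"] * 5,
    "price": [100, 150, 200, 250, 5000, 50, 60, 70, 80, 90],
})

# Cap prices at a hypothetical ceiling so extreme values cannot dominate the mean.
cap = 1000
toy["price_capped"] = toy["price"].clip(upper=cap)

# Mean price per group with and without the cap, and the difference between them.
comparison = toy.groupby("neighbourhood_group")[["price", "price_capped"]].mean()
comparison["mean_shift"] = comparison["price"] - comparison["price_capped"]

print(comparison)
```

A large shift for one group and none for another indicates that the first group's average is being driven by its most expensive listings.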

##### 3. Will the gained insights help creating a positive business impact?

Yes. These insights allow hosts and platforms to align their pricing strategies and marketing efforts with neighbourhood-specific demand:

In premium locations like Manhattan, there’s clear scope for offering luxury services and exclusive experiences to maximize revenue.

Meanwhile, Queens and the Bronx could serve as hubs for budget travel, long stays, or student accommodations, optimizing for volume rather than premium margins.

Understanding price variation also supports location-based segmentation, allowing operators to diversify their offerings based on traveler affordability and preferences.

Are there any insights that lead to negative growth? Justify with specific reason.

Yes, overly concentrating investments in high-price areas like Manhattan may lead to oversaturation and diminished returns, especially during off-peak seasons. Conversely, neglecting lower-cost areas like the Bronx might mean missing out on steady demand from budget-conscious travelers.
Also, if pricing strategies are based solely on averages without acknowledging the skew from outliers, listings may become misaligned with customer expectations. A balanced pricing approach rooted in both mean and median values is essential for sustainable growth.














#### Chart - 8 Availability vs Price (Bivariate)

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10,6))
sns.scatterplot(x='availability_365', y='price', data=data_df)
plt.title('Availability vs Price')
plt.xlabel('Days Available per Year')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it’s ideal for visualizing the relationship between two continuous variables, in this case, days available in a year and price. It helps to quickly identify whether a pattern, trend, or correlation exists between how long a listing is available and what price it charges.

##### 2. What is/are the insight(s) found from the chart?

Most listings are clustered at lower price ranges (below 500), regardless of availability.
Listings with full availability (close to 365 days) generally fall within low to moderate price ranges, indicating that highly available listings tend to be budget or mid-range.

A few high-priced listings exist even with low availability, likely representing premium or exclusive properties. No clear direct relationship between availability and price was observed, suggesting that pricing decisions may depend more on location, room type, or property features than availability alone.
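
The visual impression of "no clear relationship" can be backed by a correlation coefficient. The following is a minimal sketch on invented values (not the project's data), assuming the same `availability_365` and `price` column names. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which is less sensitive to extreme prices.

```python
import pandas as pd

# Toy stand-in for data_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "availability_365": [0, 30, 90, 180, 270, 365],
    "price": [120, 80, 300, 60, 150, 90],
})

# Pearson measures linear association and is sensitive to outliers.
pearson_r = toy_df["availability_365"].corr(toy_df["price"])

# Spearman (rank correlation) measures monotonic association instead.
spearman_r = toy_df["availability_365"].rank().corr(toy_df["price"].rank())

print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
```

Values close to zero for both coefficients would support the reading of the scatter plot; on the real data, the same two lines can be run against `data_df`.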

##### 3. Will the gained insights help creating a positive business impact?

Understanding that full-year availability does not guarantee premium pricing can help businesses guide hosts towards strategies other than maximizing availability to increase revenue. Insights from this chart encourage focusing on property upgrades, service quality, or targeting premium markets rather than simply keeping listings open year-round.


Are there any insights that lead to negative growth? Justify with specific reason.

Assuming that increasing availability alone leads to higher prices could result in missed revenue opportunities and poor strategy execution. Hosts might unnecessarily keep listings available year-round, increasing operational costs without corresponding revenue gains. Without strategic pricing and quality improvements, oversupply could push prices down in competitive areas, risking revenue stagnation.

#### Chart - 9 Price Distribution across Neighbourhood Groups (Bivariate)

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(15,8))
sns.boxplot(x='neighbourhood_group', y='price', data=data_df)
plt.title('Price Distribution by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen to visualize price distribution across neighbourhood groups. This chart makes it easy to compare median prices, overall price spread, and outliers in each neighbourhood, highlighting differences in how properties are priced by area.



##### 2. What is/are the insight(s) found from the chart?

The median prices differ clearly across neighbourhood groups, highlighting distinct price tiers (e.g., Manhattan being priced higher than the Bronx).
Outliers are visible, showing extreme high-priced listings, especially in the higher-priced neighbourhood groups.
The price spread is wider in the premium neighbourhood groups, indicating more variability and potential for both high-budget and mid-range bookings.
Lower-priced neighbourhood groups show tighter price distributions, suggesting more standardized pricing.
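
The outliers drawn by a box plot are, by default, the points beyond 1.5 times the interquartile range (IQR) from the quartiles. A rough sketch of counting them per neighbourhood group is shown below; the frame is a made-up stand-in for `data_df`, and `count_outliers` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for data_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 6 + ["Queens"] * 6,
    "price": [100, 120, 150, 180, 200, 2000, 50, 60, 70, 80, 90, 100],
})

def count_outliers(prices: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences used by a standard box plot."""
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return int(((prices < lower_fence) | (prices > upper_fence)).sum())

# Number of box-plot outliers in each neighbourhood group.
outliers_per_group = toy_df.groupby("neighbourhood_group")["price"].apply(count_outliers)
print(outliers_per_group)
```

Dividing these counts by the group sizes would give the share of outliers, which is easier to compare between large and small boroughs.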

##### 3. Will the gained insights help creating a positive business impact?
Understanding the price distribution helps to:

optimize pricing strategies by identifying typical price ranges,
detect overpriced listings for adjustment,
inform hosts about where their pricing stands relative to competitors, and
focus marketing on neighbourhood groups with higher median prices to drive revenue growth.


Are there any insights that lead to negative growth? Justify with specific reason.

If outlier pricing is ignored, customers may avoid listings perceived as overpriced, leading to lower occupancy. Similarly, underpricing in higher-priced neighbourhood groups could result in lost revenue opportunities. Without addressing these insights, inconsistent pricing strategies could harm both host earnings and platform reputation.



#### Chart - 10 Frequency of Room Types in Each Neighbourhood Group (Bivariate)

In [None]:
# Chart - 10 visualization code
heatmap_data = pd.crosstab(data_df['neighbourhood_group'], data_df['room_type'])

# Plotting the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt='d')

# Chart titles and labels
plt.title('Frequency of Room Types in Each Neighbourhood Group')
plt.xlabel('Room Type')
plt.ylabel('Neighbourhood Group')

plt.show()

##### 1. Why did you pick the specific chart?

The heatmap was chosen because:

Both variables (neighbourhood_group and room_type) are categorical.
It allows for quick visual comparison across multiple combinations of neighborhood groups and room types.

Using color intensity, it clearly highlights where concentrations of certain room types are higher or lower.

It is more intuitive and easier to read than a large frequency table.





##### 2. What is/are the insight(s) found from the chart?

Manhattan has the highest total number of listings, closely followed by Brooklyn.
Staten Island has the fewest listings, both in total and for each room type.
Manhattan has the most Entire homes/apartments and Shared rooms, while Brooklyn has the most Private rooms. Manhattan is the only neighbourhood group where Entire homes/apartments are the most common room type; in every other neighbourhood group, Private rooms have the highest count. Shared rooms are the least common room type in all neighbourhood groups.
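
Raw counts are dominated by the size of each borough, so it can help to look at the room-type mix as row-wise shares as well. The sketch below is one possible way to do this with `pd.crosstab(..., normalize="index")`; the frame is an invented stand-in for `data_df`, using the same column names.

```python
import pandas as pd

# Toy stand-in for data_df; the rows are invented for illustration only.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 6 + ["Brooklyn"] * 6,
    "room_type": (
        ["Entire home/apt"] * 3 + ["Private room"] * 2 + ["Shared room"]
        + ["Private room"] * 3 + ["Entire home/apt"] * 2 + ["Shared room"]
    ),
})

# Share of each room type within a neighbourhood group (each row sums to 1).
room_share = pd.crosstab(
    toy_df["neighbourhood_group"], toy_df["room_type"], normalize="index"
)

# Most common room type in each neighbourhood group.
dominant_room_type = room_share.idxmax(axis=1)

print(room_share.round(2))
print(dominant_room_type)
```

The normalized table can be passed to `sns.heatmap` in the same way as the count table, with `fmt='.2f'` instead of `fmt='d'`.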

##### 3. Will the gained insights help creating a positive business impact?

Yes, the insights can help Airbnb focus its strategy effectively. Since Manhattan and Brooklyn have the highest total listings, targeted marketing and optimized pricing in these areas can drive higher revenue. Knowing Manhattan leads in Entire homes/apartments and Shared rooms allows premium services to be focused there, while Brooklyn’s strength in Private rooms suggests promoting affordable stay options. Identifying low listing counts in Staten Island highlights potential for market expansion. Overall, these insights enable better resource allocation, inventory planning, and strategic growth.


Are there any insights that lead to negative growth? Justify with specific reason.

Yes, certain patterns could risk negative growth. Manhattan’s over-concentration of Entire homes/apartments risks market saturation, leading to reduced occupancy and price drops. Similarly, the heavy reliance on Private rooms in other neighborhoods may limit profitability if guest preferences shift. The consistently low supply of Shared rooms across all neighborhoods might indicate a missed opportunity in the budget traveler segment if the issue is supply-based rather than demand-driven. Ignoring this could lead to untapped revenue potential.

#### Chart - 11 Neighbourhood Group vs Number of Reviews (Bivariate)

In [None]:
# Chart - 11 visualization code

# Plotting
plt.figure(figsize=(8, 5))
sns.barplot(x=reviews_by_neighbourhood_group.values, y=reviews_by_neighbourhood_group.index, palette='Blues_r')

plt.title('Total Reviews by Neighbourhood Group')
plt.xlabel('Total Number of Reviews')
plt.ylabel('Neighbourhood Group')
plt.show()

In [None]:
reviews_by_neighbourhood_group

In [None]:
# Average number of reviews vs Neighbourhood groups
sns.barplot(x=avg_reviews_by_neighbourhood_group.values,y=avg_reviews_by_neighbourhood_group.index)
plt.title('Average Reviews per Neighbourhood Group')
plt.xlabel('Average Number of Reviews')
plt.ylabel('Neighbourhood Group')


In [None]:
avg_reviews_by_neighbourhood_group

In [None]:
# Median reviews vs Neighbourhood groups

sns.barplot(x=median_reviews_by_neighbourhood_group.values,y=median_reviews_by_neighbourhood_group.index)
plt.title('Median Reviews per Neighbourhood Group')
plt.show()

In [None]:
median_reviews_by_neighbourhood_group

##### 1. Why did you pick the specific chart?

A bar chart was selected for its clarity in showcasing how customer interaction—measured through the number of reviews—varies across neighbourhood groups. Reviews are a proxy for both demand and guest satisfaction, and using total, average, and median values provides a well-rounded view of both popularity and consistency of engagement in each area.



##### 2. What is/are the insight(s) found from the chart?

Brooklyn leads in total reviews with over 486,000, slightly ahead of Manhattan's 454,000, confirming both as Airbnb hotspots.

Interestingly, Staten Island, despite having the lowest total reviews (11,541), shows the highest average number of reviews per listing (30.94). This points to fewer listings that are each reviewed more often, possibly reflecting more bookings per property or higher guest satisfaction.

The median review count is also highest in Staten Island (12), while Manhattan has the lowest median at 4, suggesting a larger number of less-reviewed or newer properties in high-density areas.

These nuances indicate that volume of bookings doesn’t always translate to stronger engagement per listing—important when assessing performance.
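
The series plotted in this chart (`reviews_by_neighbourhood_group` and its average and median counterparts) are computed outside this section. One plausible way to obtain equivalent values in a single step is a named aggregation, sketched below on invented numbers; the output column names `total`, `average` and `median` are chosen here for illustration.

```python
import pandas as pd

# Toy stand-in for data_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn"] * 5 + ["Staten Island"] * 3,
    "number_of_reviews": [0, 1, 4, 5, 90, 20, 30, 40],
})

# Total, mean and median review counts per neighbourhood group in one table.
review_stats = toy_df.groupby("neighbourhood_group")["number_of_reviews"].agg(
    total="sum", average="mean", median="median"
)

print(review_stats.sort_values("total", ascending=False))
```

In this toy example the larger group wins on total reviews while the smaller group wins on reviews per listing, which is the same pattern the chart shows for Brooklyn and Staten Island.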

##### 3. Will the gained insights help creating a positive business impact?

Hosts in Staten Island may benefit from their high engagement rates, and Airbnb could leverage that to position the borough as an underrated yet satisfying experience.
In contrast, Brooklyn and Manhattan can continue to be prioritized for marketing due to their volume, but may require strategies to ensure individual listing visibility and guest retention.
Queens, with solid average engagement, could be targeted for expansion and competitive pricing to boost traffic further.

These insights enable smarter resource allocation, listing promotion, and customer targeting across distinct market profiles.

Are there any insights that lead to negative growth? Justify with specific reason.

Focusing solely on high-traffic areas like Brooklyn and Manhattan without addressing low median engagement may result in many underperforming listings lost in the crowd, especially in saturated zones.
Conversely, underestimating regions like Staten Island, which show strong per-listing performance, could mean missing niche opportunities where demand may be less competitive but more loyal.
Strategic neglect of such areas may suppress overall platform diversity and long-term customer satisfaction.

#### Chart - 12 Room Type vs Number of Reviews (Bivariate)

In [None]:
# Chart - 12 visualization code
# Plotting bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=reviews_by_roomtype.values, y=reviews_by_roomtype.index, palette='Blues_r')
plt.title('Total Number of Reviews by Room Type')
plt.xlabel('Total Number of Reviews')
plt.ylabel('Room Type')
plt.show()


In [None]:
reviews_by_roomtype

In [None]:
# Average reviews per room type
# Plot bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=avg_reviews_by_roomtype.values, y=avg_reviews_by_roomtype.index, palette='Blues_r')
plt.title('Average Number of Reviews by Room Type')
plt.show()

In [None]:
avg_reviews_by_roomtype

In [None]:
# Median reviews per room type
# Plot bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=median_reviews_by_roomtype.values, y=median_reviews_by_roomtype.index, palette='Blues_r')
plt.title('Median Number of Reviews by Room Type')
plt.show()

In [None]:
median_reviews_by_roomtype

##### 1. Why did you pick the specific chart?

A bar chart was used because it clearly shows how each room type compares in terms of guest interaction. Reviews are a strong sign of how often listings are booked and how engaged guests are, and comparing total, average, and median reviews gives a more complete picture.

##### 2. What is/are the insight(s) found from the chart?

Entire homes/apartments have the highest total reviews at 580,403, slightly ahead of private rooms with 538,346, showing that both are heavily used.
However, when looking at average reviews per listing, private rooms (24.11) perform slightly better than entire homes (22.84), showing strong engagement despite lower pricing.
Shared rooms have the lowest total (19,256) and average reviews (16.6), and also the lowest median (4), indicating they are the least used and least engaged room type.

This means that while entire homes dominate in volume, private rooms may offer better per-listing performance.

##### 3. Will the gained insights help creating a positive business impact?
These insights can help Airbnb and hosts balance between volume and engagement.
Promoting private rooms can attract budget-conscious guests while still offering strong guest activity.
Entire homes remain essential for guests seeking privacy or space, making them ideal for premium targeting.
Knowing shared rooms lag behind allows room for specific marketing or redesigning strategies for that segment.


Are there any insights that lead to negative growth? Justify with specific reason.

Ignoring the shared room category could mean missing out on the budget travel market, especially younger or solo travelers.
On the other hand, focusing only on entire homes might limit access to more price-sensitive guests.
A lack of variety in room types could also limit customer reach, so promoting a balanced mix is important for long-term growth.

#### Chart - 13 Availability vs Neighbourhood Group (Bivariate)

In [None]:
# Chart - 13 visualization code
# Average availability vs Neighbourhood group
# Plot bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=availability.values, y=availability.index, palette='Greens_r')
plt.title('Average Availability (Days per Year) by Neighbourhood Group')
plt.xlabel('Average Availability (Days per Year)')
plt.ylabel('Neighbourhood Group')
plt.show()

In [None]:
availability


In [None]:
# Median availability vs Neighbourhood groups.
sns.barplot(x=median_availability.values,y=median_availability.index)
plt.title('Median Availability (Days per Year) by Neighbourhood Group')
plt.xlabel('Median Availability (Days per Year)')
plt.ylabel('Neighbourhood Group')
plt.show()

In [None]:
median_availability

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it clearly compares availability across neighbourhood groups, making it easy to see which areas have more or fewer days open for booking.



##### 2. What is/are the insight(s) found from the chart?

Staten Island leads with the highest average availability at approximately 200 days/year, and a very high median of 219 days, suggesting consistently open listings throughout the year.

The Bronx and Queens follow with medians of 148 and 98 days respectively, indicating relatively high availability.

In contrast, Brooklyn (28) and Manhattan (36) have very low median availability, meaning a significant number of listings are open for short durations only.

This suggests that in high-demand areas, listings are either booked frequently or simply not available all year, possibly due to part-time hosts or regulatory limitations.
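
A low median can be driven by many listings that report no available days at all. A quick check of that possibility is sketched below on invented values; only the column names `neighbourhood_group` and `availability_365` come from the notebook, and `zero_share` and `toy_median` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for data_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Staten Island"] * 4,
    "availability_365": [0, 0, 10, 300, 0, 200, 250, 365],
})

# Share of listings in each group with zero available days in the year.
is_unavailable = toy_df["availability_365"].eq(0)
zero_share = is_unavailable.groupby(toy_df["neighbourhood_group"]).mean()

# Median availability per group, for comparison with the share above.
toy_median = toy_df.groupby("neighbourhood_group")["availability_365"].median()

print(pd.DataFrame({"zero_share": zero_share, "median_days": toy_median}))
```

If a large share of listings in a borough shows zero availability, the low median may say more about inactive or fully blocked calendars than about listings being booked out.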

##### 3. Will the gained insights help creating a positive business impact?
These insights reveal that supply in high-demand areas like Brooklyn and Manhattan is limited not by lack of listings, but by reduced availability.

Airbnb can work with hosts in these neighbourhoods to encourage year-round availability, where possible, unlocking additional booking days and maximizing earning potential.
At the same time, neighbourhoods like Staten Island, which already offer high availability, could be positioned as alternative extended-stay options, especially for guests planning longer visits.

Are there any insights that lead to negative growth? Justify with specific reason.

Yes.
The low median availability in Brooklyn and Manhattan could create supply shortages despite high demand.

If guests consistently find limited dates available, they might turn to competitors or lose trust in the platform’s consistency in those areas.

Furthermore, it signals dependency on part-time hosts, which introduces unpredictability. If not addressed, this could hinder platform reliability and slow growth in key markets.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Select numeric columns
numeric_cols = data_df.select_dtypes(include=['float64', 'int64'])

# Compute correlation matrix
correlation_matrix = numeric_cols.corr()

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5, fmt=".2f")
plt.title('Correlation Heatmap of Numeric Features')
plt.show()

##### 1. Why did you pick the specific chart?


We chose a correlation heatmap to visualize the relationships between all numeric variables at once, allowing quick identification of strong positive or negative correlations. The chart shows which factors are closely linked and which, such as price against number of reviews or availability, are not.

##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap shows that most numeric features in the dataset have weak relationships with one another. A few noticeable insights include:

number_of_reviews and reviews_per_month have a strong positive correlation, meaning listings with more reviews tend to receive them more frequently.

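id and host_id show moderate correlation. Both are identifiers, and this most likely reflects that they are assigned roughly in order of creation (newer hosts tend to have newer listings) rather than any meaningful business relationship.

availability_365 and calculated_host_listings_count have a slight positive relationship, suggesting hosts with more listings tend to keep them available for more days.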

Price has little to no correlation with most other variables, implying it’s influenced more by non-numeric factors like location or property type.

Overall, the insights help identify key relationships for modeling and pricing strategies while confirming that the features are mostly uncorrelated, reducing the risk of redundancy.
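
To read the strongest relationships off a correlation matrix without scanning the heatmap, the upper triangle can be flattened and sorted by absolute value. The sketch below runs on synthetic data generated for illustration (not the project's data), with column names borrowed from the notebook; dropping the `id` column first is a choice made here, since identifiers carry no business meaning.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_df; generated only to make the sketch runnable.
rng = np.random.default_rng(0)
n = 200
reviews = rng.poisson(20, n)
toy_df = pd.DataFrame({
    "id": np.arange(n),
    "price": rng.integers(40, 400, n),
    "number_of_reviews": reviews,
    "reviews_per_month": reviews / 12 + rng.normal(0, 0.1, n),
    "availability_365": rng.integers(0, 366, n),
})

# Correlation matrix without the identifier column.
corr = toy_df.drop(columns=["id"]).corr()

# Keep each pair once: mask everything except the strict upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Strongest relationships first, regardless of sign.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs.head())
```

On the real data, the same steps applied to `correlation_matrix` give a ranked list that can be quoted directly in the insights.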

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# Create pair plot
sns.pairplot(data_df)
plt.suptitle('Pair Plot', y=1.02)  # y > 1 keeps the title clear of the top row of panels
plt.show()

##### 1. Why did you pick the specific chart?

We chose a pair plot to visually explore relationships and distributions among all numeric variables simultaneously, helping identify correlations, trends, and outliers.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most variables are weakly related, though patterns exist between number_of_reviews and reviews_per_month, and between latitude and longitude, reflecting geographical clustering.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Tailor pricing by neighbourhood since Manhattan is the most expensive, while Bronx and Queens are more affordable.

Encourage hosts in Brooklyn and Manhattan to increase availability to meet high demand.

Promote private rooms and entire homes differently, targeting budget and premium guests respectively.

Focus marketing efforts on lower-volume areas like the Bronx and Staten Island to boost bookings.

Support hosts to maintain high review scores and improve guest satisfaction.

Regularly check listings with zero or very low prices to ensure data accuracy.

Collect and use guest feedback to fix neighbourhood or room-type specific issues.

Use dynamic pricing and seasonal offers to optimize occupancy and revenue.

Run targeted campaigns to attract guests to high-availability but low-demand areas like Staten Island.

Continuously monitor booking and review trends to adapt strategies and stay competitive.
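
The recommendation above to check listings with zero or very low prices could be sketched as a simple filter. The frame below is invented for illustration, and `MIN_PLAUSIBLE_PRICE` is a hypothetical threshold that would need to be agreed with the business before use.

```python
import pandas as pd

# Toy stand-in for data_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [0, 5, 80, 150],
})

# Hypothetical cut-off (in dollars) below which a nightly price looks like a data error.
MIN_PLAUSIBLE_PRICE = 10

# Listings to send for manual review rather than silently dropping them.
suspicious = toy_df[toy_df["price"] < MIN_PLAUSIBLE_PRICE]
print(suspicious)
```

Flagging rather than deleting keeps the decision with whoever owns the data, which matters if some very low prices turn out to be genuine promotions.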



# **Conclusion**

Through this project, we were able to understand how Airbnb operates across New York City. We noticed that Manhattan and Brooklyn dominate the market, both in terms of pricing and the number of listings, while Queens, the Bronx, and Staten Island offer more budget-friendly stays that attract a different kind of audience. We also saw that entire homes and apartments collect the most reviews in total, suggesting that many travelers value comfort and privacy, although private rooms perform slightly better per listing.

As we explored further, we discovered how pricing, availability, and host activity vary from one area to another. Tourist-heavy zones tend to have higher prices and limited availability, while quieter neighborhoods cater more to locals or budget travelers. We also found that a few hosts manage several listings, which highlights the growing professional side of hosting on Airbnb.

Overall, this analysis helped us turn raw data into a meaningful story about travel behavior and market trends. It reminded us how powerful data can be when used to understand people — their preferences, habits, and choices. Beyond the charts and graphs, this project taught us how data storytelling connects numbers to real-life experiences, giving us a deeper appreciation for how analytics can shape better decisions for both hosts and travelers.

In [None]:
from google.colab import files  # Colab-only helper that provides files.download
data_df.to_csv('Airbnbfinal_data.csv', index=False)
files.download("Airbnbfinal_data.csv")
