# **Project Name**    - Exploratory Data Analysis of Airbnb NYC 2019 Listings



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This project performs Exploratory Data Analysis (EDA) on the Airbnb NYC 2019 dataset to understand pricing patterns, availability behavior, room type distribution, and customer engagement across Airbnb listings. The raw dataset contains 48,895 records and 16 columns, with a mix of numerical and categorical variables related to listings, hosts, pricing, and demand.

The analysis began with understanding the dataset structure, identifying missing values, checking for duplicate records, and performing basic data wrangling. Missing review-frequency values were filled with zero (a listing with no reviews has no monthly review rate), duplicate records were checked and removed, and invalid or extreme price values were filtered out to protect data quality. These preprocessing steps left the dataset clean, consistent, and suitable for further analysis.

Univariate analysis was used to study individual variables such as price, availability, room type, and minimum nights. This revealed that most listings fall within the budget to mid-range price segment and that short-term stays dominate the platform. Bivariate analysis helped examine relationships between variables such as price vs room type, price vs reviews, and availability vs demand, highlighting how pricing and location influence customer engagement. Multivariate analysis further combined multiple variables to provide deeper insights into how price, room type, and location jointly impact demand.

A total of 20 meaningful visualizations were created to support the analysis. Each chart was selected carefully based on the type of variables involved and the business question being addressed. Insights from the analysis provide actionable recommendations for pricing optimization, availability management, and targeted marketing strategies. Overall, this project demonstrates how EDA can convert raw data into valuable business insights for data-driven decision-making.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Airbnb operates a large online marketplace connecting hosts and travelers, generating vast amounts of listing-level data related to pricing, availability, room types, and customer engagement. However, raw data alone does not provide clear insights into how these factors influence demand and booking behavior. The challenge is to analyze this data to identify pricing patterns, availability trends, and demand drivers while addressing data quality issues such as missing values and outliers. Without proper analysis, decision-making related to pricing strategy, inventory planning, and marketing initiatives may be inefficient or ineffective.

#### **Define Your Business Objective?**

The primary business objective of this analysis is to use exploratory data analysis to understand how pricing, room type, location, availability, and customer engagement influence Airbnb listing performance. The goal is to derive actionable insights that help optimize pricing strategies, improve availability planning, enhance host performance, and support data-driven business decisions that can increase bookings, customer satisfaction, and overall platform revenue.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset (pandas is already imported above as pd)
df = pd.read_csv('/content/drive/MyDrive/AlmaBetter_Full Stack Data Science & AI/Module_2/Airbnb NYC 2019.csv')


### Dataset First View

In [None]:
# Dataset First Look
df.head()



- This step displays the first five rows of the dataset.

It helps in understanding:

- Column names

- Type of values in each column

- Overall structure of the dataset

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


- The dataset contains 48,895 rows and 16 columns.

- Rows represent individual Airbnb listings.

- Columns represent features such as price, room type, neighbourhood, availability, and reviews.

### Dataset Information

In [None]:
# Dataset Info
df.info()


The df.info() function provides a concise summary of the dataset, including the total number of non-null values, data types of each column, and memory usage.

From the output, we observe that:

- The dataset contains both numerical and categorical variables.

- Numerical columns include price, minimum_nights, number_of_reviews, and availability_365.

- Categorical columns include room_type, neighbourhood_group, and neighbourhood.

- Some columns contain missing values, which need to be handled before analysis.

This step is essential for identifying data quality issues and deciding appropriate data cleaning strategies.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


This step checks whether the dataset contains any duplicate rows, meaning identical records repeated more than once.

- The duplicated() function identifies duplicate rows.

- The sum() function counts how many such rows exist.

Why this check matters:

- Duplicate records can bias analysis and visualizations.

- They may artificially inflate counts, averages, or trends.

- Removing duplicates ensures data integrity and reliability.
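The duplicate check described above can be sketched on a small, made-up frame. The `toy` data and the extra check on the `id` column are illustrative assumptions, not part of the original notebook; in the notebook itself the same calls would run against `df`.

```python
import pandas as pd

# Hypothetical toy listings; the real notebook would use the loaded `df` instead.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [100, 80, 80, 120],
    "room_type": ["Private room", "Shared room", "Shared room", "Private room"],
})

# Count rows that are exact copies of an earlier row.
n_exact_duplicates = int(toy.duplicated().sum())

# Count rows whose listing id repeats, even if other columns differ.
n_repeated_ids = int(toy.duplicated(subset="id").sum())

# Keep the first occurrence of each duplicated row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_exact_duplicates, n_repeated_ids, len(deduplicated))
```

Checking a key column such as `id` separately can be useful because two rows may describe the same listing without being identical in every column.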

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


This step checks for missing (null) values in each column of the dataset.

- The isnull() function identifies missing values.

- The sum() function counts the total missing values column-wise.

From the output, we observe that:

- Some columns such as name, host_name, last_review, and reviews_per_month contain missing values.

- Other important numerical and categorical columns such as price, room_type, neighbourhood_group, and availability_365 do not contain missing values.
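Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a small missing-value summary on made-up data; the `toy` frame and the name `missing_summary` are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a text column and a numeric column.
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [150, 90, 60, 200],
    "reviews_per_month": [1.2, np.nan, np.nan, 0.4],
})

# Absolute number of missing values per column.
missing_count = toy.isnull().sum()

# Share of missing values per column, as a percentage.
missing_pct = (toy.isnull().mean() * 100).round(1)

# Keep only columns that actually have gaps, worst first.
missing_summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .query("missing > 0")
    .sort_values("percent", ascending=False)
)
print(missing_summary)
```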

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
plt.title("Heatmap of Missing Values")
plt.show()

Visualization of Missing Values

- A heatmap is used to visually represent the presence of missing values in the dataset.

Why this visualization was chosen:

- A heatmap provides a quick and clear overview of missing data patterns.

- It helps identify whether missing values are random or concentrated in specific columns.

Insights from the visualization:

- Missing values are concentrated in a few specific columns.

- Most of the dataset contains complete records, indicating overall good data quality.

Business Impact:

- Since missing values are limited to descriptive fields (name, host_name) and review-related fields, they can be handled without negatively affecting business analysis.

- This confirms that the dataset is suitable for further exploratory analysis without complex imputation techniques.

### What did you know about your dataset?

The dataset contains 48,895 rows and 16 columns with a mix of numerical and categorical variables. There are no duplicate records, indicating good data integrity. Missing values are confined to a few descriptive and review-related columns (name, host_name, last_review, and reviews_per_month), while key business-related columns such as price, room type, and availability contain no missing values. Overall, the dataset is well-structured and suitable for further exploratory analysis.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns


In [None]:
# Dataset Describe
df.describe()


### Variables Description

id: Unique identifier for each Airbnb listing.

name: Name or title of the Airbnb listing provided by the host.

host_id: Unique identifier for each host on the Airbnb platform.

host_name: Name of the host offering the listing.

neighbourhood_group: Broad geographical area (borough) where the listing is located.

neighbourhood: Specific neighborhood within the borough.

latitude: Geographic latitude coordinate of the listing location.

longitude: Geographic longitude coordinate of the listing location.

room_type: Type of accommodation offered (Entire home/apt, Private room, Shared room).

price: Cost per night for staying at the listing.

minimum_nights: Minimum number of nights required to book the listing.

number_of_reviews: Total number of reviews received by the listing.

last_review: Date of the most recent review received by the listing.

reviews_per_month: Average number of reviews received per month.

calculated_host_listings_count: Number of listings owned by the same host.

availability_365: Number of days the listing is available for booking in a year.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
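# Write your code to make your dataset analysis ready.
# Listings with no reviews have no monthly review rate, so fill missing values with 0.
# Assigning the result back avoids in-place modification of a column selection.
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Remove duplicates (if any)
df = df.drop_duplicates()

# Handle invalid and extreme prices: keep 0 < price < 1000
df = df[(df['price'] > 0) & (df['price'] < 1000)].copy()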

### What all manipulations have you done and insights you found?

The missing values in the reviews_per_month column were replaced with zero to indicate listings that have not received any reviews. Duplicate records were checked and removed so that each Airbnb listing is uniquely represented. Listings with invalid price values, such as zero-priced entries, were removed, and extreme price values of 1000 or more were filtered out to avoid distortion in the analysis.

From these manipulations, it was observed that the dataset is largely clean, with missing values mainly related to review activity rather than critical business attributes. Price values vary widely, indicating the presence of both budget and premium listings. After cleaning, the dataset became more reliable and suitable for meaningful exploratory data analysis.
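The cleaning steps described above can also be wrapped in one reusable function, which makes it easy to compare row counts before and after. This is a minimal sketch on made-up data; the function name `clean_listings` and the `max_price` parameter are hypothetical and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd


def clean_listings(raw: pd.DataFrame, max_price: float = 1000) -> pd.DataFrame:
    """Return a cleaned copy of the listings (illustrative helper, not from the notebook)."""
    cleaned = raw.copy()
    # A listing with no reviews has no monthly review rate, so treat it as 0.
    cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
    # Drop rows that are exact copies of an earlier row.
    cleaned = cleaned.drop_duplicates()
    # Keep prices strictly between 0 and max_price.
    in_range = (cleaned["price"] > 0) & (cleaned["price"] < max_price)
    return cleaned.loc[in_range].reset_index(drop=True)


# Hypothetical toy data: one duplicate, one zero price, one extreme price.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "price": [100, 0, 0, 5000, 250],
    "reviews_per_month": [np.nan, 0.5, 0.5, 2.0, np.nan],
})
cleaned = clean_listings(toy)
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Working on a copy leaves the raw frame untouched, so the cell can be re-run without reloading the CSV.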

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Price Distribution (Histogram)
sns.histplot(df['price'], bins=50, kde=True)
plt.title("Distribution of Listing Prices")
plt.show()


##### 1. Why did you pick the specific chart?

To understand the overall distribution and spread of listing prices.

##### 2. What is/are the insight(s) found from the chart?

Prices are right-skewed; most listings are low to mid-priced, with few expensive ones.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
Helps Airbnb focus on budget and mid-range demand, which forms the majority of the market.

Negative growth insight:
Overpricing listings may reduce demand and bookings.
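The right skew seen in the histogram can be confirmed numerically: in a right-skewed distribution the mean sits above the median and the skewness statistic is positive. The prices below are made up for illustration; in the notebook the same calls would apply to `df['price']`.

```python
import pandas as pd

# Hypothetical nightly prices with a long right tail.
prices = pd.Series([40, 55, 60, 75, 90, 100, 120, 150, 400, 900], name="price")

mean_price = prices.mean()      # pulled upward by the expensive listings
median_price = prices.median()  # robust to the expensive listings
skewness = prices.skew()        # positive for a right-skewed distribution

print(f"mean={mean_price:.1f}, median={median_price:.1f}, skew={skewness:.2f}")
```

Because of this gap, the median is usually the safer "typical price" to quote for skewed data.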

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Price Outliers (Box Plot)
sns.boxplot(x=df['price'])
plt.title("Price Outliers")
plt.show()


##### 1. Why did you pick the specific chart?

To identify extreme price values.

##### 2. What is/are the insight(s) found from the chart?

Few listings are priced significantly higher than the majority.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Helps identify premium segments.

Negative growth:
Extreme pricing can lead to poor occupancy and low reviews.
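The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule by default, and the same rule can be applied directly to list the outlying prices. This is a sketch on made-up values, offered as an alternative to a fixed cutoff rather than as the method used in the notebook.

```python
import pandas as pd

# Hypothetical nightly prices with two very expensive listings.
prices = pd.Series([40, 55, 60, 75, 90, 100, 120, 150, 400, 900], name="price")

# Quartiles and interquartile range.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences used by the default box plot whiskers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Listings that fall outside the fences.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(outliers.tolist())
```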

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Room Type Distribution
sns.countplot(x='room_type', data=df)
plt.title("Room Type Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To understand inventory composition.

##### 2. What is/are the insight(s) found from the chart?

Insights:
Entire homes and private rooms dominate listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Supports supply planning.

Negative growth:
Over-supply of one room type may reduce competitiveness.

#### Chart - 4

In [None]:
# Chart - 4 Availability Distribution
sns.histplot(df['availability_365'], bins=50)
plt.title("Availability Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze host availability behavior.

##### 2. What is/are the insight(s) found from the chart?

Many listings have low availability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Identifies opportunity to increase availability.

Negative growth:
Low availability limits revenue potential.

#### Chart - 5

In [None]:
# Chart - 5 Number of Reviews Distribution
sns.histplot(df['number_of_reviews'], bins=50)
plt.title("Number of Reviews Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

Reviews indicate customer engagement.

##### 2. What is/are the insight(s) found from the chart?

Most listings have few reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
New host onboarding strategies.

Negative growth:
Low reviews reduce trust and bookings.

#### Chart - 6

In [None]:
# Chart - 6 Minimum Nights Distribution
sns.histplot(df['minimum_nights'], bins=50)
plt.title("Minimum Nights Distribution")
plt.show()


##### 1. Why did you pick the specific chart?


To understand booking restrictions.

##### 2. What is/are the insight(s) found from the chart?

Insights:
Short stays are common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Supports short-term travel demand.

Negative growth:
High minimum nights may discourage bookings.

#### Chart - 7

In [None]:
# Chart - 7 Listings by Neighbourhood Group
sns.countplot(x='neighbourhood_group', data=df)
plt.title("Listings by Neighbourhood Group")
plt.show()


##### 1. Why did you pick the specific chart?

Location affects demand and pricing.

##### 2. What is/are the insight(s) found from the chart?

Insights:
Manhattan has the most listings, with Brooklyn close behind; together they account for the large majority of listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Helps location-based marketing.

Negative growth:
Over-concentration risks saturation.

#### Chart - 8

In [None]:
# Chart - 8 Price vs Room Type
sns.boxplot(x='room_type', y='price', data=df)
plt.title("Price vs Room Type")
plt.show()


##### 1. Why did you pick the specific chart?

To compare pricing across room types.

##### 2. What is/are the insight(s) found from the chart?

Insights:
Entire homes are most expensive.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Room-type based pricing.

Negative growth:
High prices reduce demand for entire homes.

#### Chart - 9

In [None]:
# Chart - 9 Average Price by Neighbourhood Group
df.groupby('neighbourhood_group')['price'].mean().plot(kind='bar')
plt.title("Average Price by Neighbourhood Group")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze location-based pricing.

##### 2. What is/are the insight(s) found from the chart?

Insights:
Manhattan has the highest prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Supports premium pricing.

Negative growth:
High prices may reduce budget traveler demand.
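Since prices are right-skewed, the mean alone can overstate what a typical listing costs in each borough. The sketch below reports count, mean, and median side by side on made-up data; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Hypothetical listings; one expensive Manhattan listing inflates the mean.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Queens"],
    "price": [150, 200, 850, 90, 110, 70],
})

# Summarise price per neighbourhood group with several statistics at once.
price_by_group = (
    toy.groupby("neighbourhood_group")["price"]
    .agg(["count", "mean", "median"])
    .sort_values("median", ascending=False)
)
print(price_by_group)
```

In this toy example the Manhattan mean is twice its median, which shows how a single premium listing can distort a bar chart of averages.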

#### Chart - 10

In [None]:
# Chart - 10 Price vs Number of Reviews
sns.scatterplot(x='price', y='number_of_reviews', data=df)
plt.title("Price vs Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

To understand price vs demand.

##### 2. What is/are the insight(s) found from the chart?

Insights:
Lower prices receive more reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Encourages competitive pricing.

Negative growth:
Overpricing leads to fewer bookings.

#### Chart - 11

In [None]:
# Chart - 11 Availability by Room Type
sns.boxplot(x='room_type', y='availability_365', data=df)
plt.title("Availability by Room Type")
plt.show()


##### 1. Why did you pick the specific chart?

Room type influences availability.

##### 2. What is/are the insight(s) found from the chart?

Insights:
Entire homes have lower availability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
High-demand segment.

Negative growth:
Low availability restricts earnings.

#### Chart - 12

In [None]:
# Chart - 12 Availability vs Reviews
sns.scatterplot(x='availability_365', y='number_of_reviews', data=df)
plt.title("Availability vs Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

To check if availability increases demand.

##### 2. What is/are the insight(s) found from the chart?

Insights:
Availability alone does not guarantee reviews.

##### 3. Will the gained insights help creating a positive business impact?
Positive impact:
Focus on quality, not just availability.

Negative growth:
Unused availability means lost revenue.


#### Chart - 13

In [None]:
# Chart - 13 Price vs Minimum Nights
sns.scatterplot(x='minimum_nights', y='price', data=df)
plt.title("Price vs Minimum Nights")
plt.show()


##### 1. Why did you pick the specific chart?

To study how booking restrictions influence pricing.

##### 2. What is/are the insight(s) found from the chart?

Higher minimum nights often correlate with higher prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:

Useful for long-stay pricing models.

Negative growth insight:

Strict rules reduce bookings.

#### Chart - 14 - Correlation Heatmap

In [None]:
# The linear relationships and correlation strength between numerical variables in the dataset.
plt.figure(figsize=(10,6))
sns.heatmap(df[['price',
                'minimum_nights',
                'number_of_reviews',
                'reviews_per_month',
                'availability_365']].corr(),
            annot=True,
            cmap='coolwarm')

plt.title("Correlation Heatmap of Numerical Variables")
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is ideal for simultaneously examining multiple numerical variables and understanding the strength and direction of their linear relationships. It provides a compact visual summary that would be difficult to interpret using separate charts or tables.

##### 2. What is/are the insight(s) found from the chart?



Insights Found

Most numerical variables show weak correlations, indicating that pricing and demand are influenced by multiple independent factors.

Price does not have a strong linear relationship with reviews or availability.
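The heatmap uses Pearson correlation, which only measures linear association. For skewed variables such as price, a rank-based Spearman correlation can reveal a monotonic trend that Pearson understates. The comparison below uses made-up values and is a suggested supplementary check, not part of the original analysis.

```python
import pandas as pd

# Hypothetical listings where reviews fall steadily as price rises, but not linearly.
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 900],
    "number_of_reviews": [40, 30, 20, 10, 1],
})

# Pearson: strength of the linear relationship.
pearson = toy.corr(method="pearson").loc["price", "number_of_reviews"]

# Spearman: strength of the monotonic (rank-order) relationship.
spearman = toy.corr(method="spearman").loc["price", "number_of_reviews"]

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```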

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df[['price',
                 'availability_365',
                 'number_of_reviews',
                 'reviews_per_month']])
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is specifically chosen for multivariate exploratory analysis when the objective is to understand how multiple numerical variables interact with each other simultaneously. Unlike individual scatter plots, a pair plot provides a matrix of relationships, allowing us to observe patterns, trends, distributions, and potential correlations across several variables in one consolidated view.

This chart is particularly useful in EDA because it helps identify:

Linear or non-linear relationships

Clusters or patterns in data

Redundancy between variables

Variables that may influence each other

Such insights cannot be efficiently captured using isolated bivariate charts.


##### 2. What is/are the insight(s) found from the chart?

There is no strong linear relationship between price and number of reviews.

Listings with lower prices tend to have higher review counts, but the trend is scattered.

Availability and reviews per month do not show a strong dependency on price.

Each numerical variable displays a different distribution pattern, reinforcing the idea that demand and pricing are influenced by multiple factors.

#### Chart - 16

In [None]:
# Chart-16-Price by Neighbourhood Group & Room Type
plt.figure(figsize=(10,6))
sns.boxplot(x='neighbourhood_group', y='price', hue='room_type', data=df)
plt.title("Price Distribution by Neighbourhood Group and Room Type")
plt.show()


Why did you pick this chart?

This chart was chosen to analyze the combined effect of location and room type on pricing. A grouped box plot allows comparison of price distributions across neighbourhood groups while simultaneously differentiating room types. This multilevel comparison cannot be achieved effectively using single-variable or simple bivariate charts.

Insights Found

Entire homes in Manhattan command the highest prices.

Private and shared rooms are priced significantly lower across all locations.

Price variability is highest in Manhattan.
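The grouped box plot can be backed by a small table of median prices for each combination of neighbourhood group and room type. This sketch uses made-up data; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical listings across two boroughs and two room types.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [220, 300, 110, 150, 60, 80],
})

# Median price for every neighbourhood group / room type pair.
median_price = toy.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="median",
)
print(median_price)
```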

#### Chart - 17

In [None]:
# Chart 17: Price vs Number of Reviews by Room Type
plt.figure(figsize=(8,6))
sns.scatterplot(x='price', y='number_of_reviews', hue='room_type', data=df)
plt.title("Price vs Reviews by Room Type")
plt.show()


Why did you pick this chart?

This chart is ideal for understanding how pricing and customer engagement interact across different room types. Adding a categorical hue allows comparison of demand behavior within each room category.

Insights Found

Lower-priced private rooms tend to receive more reviews.

Entire homes show higher prices but lower engagement.

#### Chart - 18

In [None]:
# Chart 18: Price Distribution by Room Type (Violin Plot)
plt.figure(figsize=(8,6))
sns.violinplot(x='room_type', y='price', data=df)
plt.title("Price Distribution by Room Type")
plt.show()


Why did you pick this chart?

A violin plot combines the advantages of a box plot and a density plot, making it ideal for understanding both price distribution and concentration across room types. It reveals pricing patterns that are not visible in simple summary statistics.

Insights Found

Entire homes have a wider and higher price range.

Private rooms show more consistent pricing.

Shared rooms cluster at lower prices.

#### Chart - 19

In [None]:
# Chart 19: Room Type Mix by Neighbourhood Group (Stacked Bar Chart)
pd.crosstab(df['neighbourhood_group'], df['room_type']).plot(
    kind='bar', stacked=True, figsize=(10,6))
plt.title("Room Type Composition by Neighbourhood Group")
plt.show()


Why did you pick this chart?

A stacked bar chart is best suited to analyze the composition of categories within another category. This chart helps understand how room types are distributed within each neighbourhood group.

Insights Found

Manhattan has a higher proportion of entire homes.

Other neighbourhood groups show a larger share of private rooms.
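A stacked bar of raw counts makes proportions hard to compare when neighbourhood groups differ greatly in size. Normalising the crosstab by row gives the share of each room type within a group. The sketch below uses made-up data to show the idea.

```python
import pandas as pd

# Hypothetical listings: four per borough, with different room type mixes.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4,
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room", "Private room"],
})

# normalize="index" converts each row of counts into shares that sum to 1.
room_share = pd.crosstab(
    toy["neighbourhood_group"], toy["room_type"], normalize="index"
)
print(room_share)
```

Calling `room_share.plot(kind='bar', stacked=True)` on the result would draw bars of equal height, so the composition of each group can be compared directly.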

#### Chart - 20

In [None]:
# Chart 20: Reviews per Month vs Price (Scatter Plot)
plt.figure(figsize=(8,6))
sns.scatterplot(x='price', y='reviews_per_month', data=df)
plt.title("Price vs Reviews per Month")
plt.xlabel("Price")
plt.ylabel("Reviews per Month")
plt.show()

Why did you pick this chart?

This chart was selected to analyze the relationship between pricing and review frequency, which acts as a proxy for booking regularity and customer engagement. A scatter plot is the most appropriate choice because it allows direct visualization of how changes in price relate to variations in review activity across listings. Unlike total review counts, reviews per month provide a time-normalized measure of demand, making this relationship more meaningful for business analysis.

What insights were found?

Listings with lower prices generally receive more reviews per month, indicating higher booking frequency.

As price increases, review frequency tends to decrease, though the relationship is not perfectly linear.

High-priced listings show sparse review activity, suggesting niche demand.
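A scatter plot with tens of thousands of points can hide the trend, so one way to check the pattern described above is to bin prices into bands and compare the median review frequency per band. The band edges and labels below are hypothetical choices made for illustration, as is the `toy` data.

```python
import pandas as pd

# Hypothetical listings with review frequency falling as price rises.
toy = pd.DataFrame({
    "price": [45, 60, 95, 120, 180, 240, 400, 750],
    "reviews_per_month": [3.0, 2.0, 2.5, 1.5, 1.0, 0.8, 0.3, 0.1],
})

# Assumed price bands: (0, 100], (100, 250], (250, 1000].
bins = [0, 100, 250, 1000]
labels = ["budget", "mid-range", "premium"]
toy["price_band"] = pd.cut(toy["price"], bins=bins, labels=labels)

# Median review frequency within each price band.
median_reviews = toy.groupby("price_band", observed=True)["reviews_per_month"].median()
print(median_reviews)
```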

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

Based on the insights obtained from the exploratory data analysis, the client can achieve the business objective by adopting a data-driven pricing and inventory strategy. Since the analysis shows that most demand comes from budget and mid-range listings, hosts should be encouraged to price competitively within these ranges to maximize bookings and reviews. Location-based pricing should be implemented, as listings in high-demand areas such as Manhattan can sustain higher prices compared to other neighbourhoods.

The client should also focus on optimizing availability, especially in high-demand locations, by motivating part-time hosts to increase listing availability. Improving host engagement and review generation is equally important, as listings with more frequent reviews tend to attract higher demand. New or low-review listings can be supported through promotions or visibility boosts.

Additionally, the platform should avoid a one-size-fits-all strategy. Instead, it should use a multi-factor approach that considers price, room type, location, availability, and customer engagement together. By aligning pricing, availability, and marketing strategies with these insights, the client can improve booking rates, enhance customer satisfaction, and drive sustainable business growth.

# **Conclusion**

This exploratory data analysis of the Airbnb NYC 2019 dataset provided valuable insights into pricing behavior, availability patterns, room type distribution, and customer demand across the platform. The analysis revealed that Airbnb listings are predominantly budget to mid-range priced, with a smaller segment of premium listings. Pricing is strongly influenced by room type and location, with entire homes and listings in Manhattan commanding higher prices. Customer engagement, measured through reviews and review frequency, tends to be higher for competitively priced listings, highlighting the importance of affordable pricing strategies.

The study also showed that availability alone does not guarantee higher demand, emphasizing that factors such as price, room type, and perceived value play a more significant role in attracting bookings. Multivariate analysis confirmed that no single variable drives demand or pricing, reinforcing the need for a multi-factor decision-making approach. These insights can help Airbnb and hosts optimize pricing, improve availability planning, and design targeted marketing strategies.

However, the analysis also highlighted certain drawbacks. A large number of listings have low review counts, which can negatively impact trust and visibility on the platform. Extremely high-priced listings tend to receive fewer reviews, which this analysis uses as a proxy for bookings, indicating potential overpricing issues. Additionally, limited availability in high-demand areas restricts revenue potential, while strict minimum night requirements may discourage short-term travelers.

Overall, this project demonstrates how structured exploratory data analysis can transform raw listing data into actionable business insights. While the findings provide a strong foundation for decision-making, incorporating additional factors such as seasonal trends, customer ratings, and booking history could further enhance the analysis.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***