# **Project Name**    - EDA Airbnb NYC Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Airbnb is an online marketplace that connects people who want to rent out their homes with people who are looking for accommodation in that locale. It currently covers more than 100,000 cities in over 220 countries and regions worldwide. For hosts, it is a way to earn income from their property, with some protection against potential damage. For guests, it offers an alternative to hotels across a wide range of locations and prices.

For this project we analyze Airbnb's New York City (NYC) data from 2019. NYC is one of the most famous cities in the world and a top global destination for visitors drawn to its museums, entertainment, restaurants, and commerce. According to the Office of the New York State Comptroller, NYC hosted 66.6 million visitors in 2019.

Analyzing the thousands of listings on Airbnb is crucial for the company. Our main objective is to find the key metrics that influence property listings on the platform. To do this, we explore and visualize the Airbnb NYC dataset using basic exploratory data analysis (EDA) techniques. We examine how listings are distributed by location, price range, room type, listing name, and other related factors. Looking at the dataset from these different angles yields insights that can help Airbnb's marketing, finance, and technical teams make strategic, data-driven decisions.

# **GitHub Link -**

**GitHub Link** - https://github.com/Kundan64

# **Problem Statement**


*  For this project we analyze Airbnb's New York City (NYC) data from 2019. NYC is one of the most famous cities in the world and a top global destination for visitors drawn to its museums, entertainment, restaurants, and commerce.
*  Our main objective is to find the key metrics that influence property listings on the platform. To do this, we explore and visualize the Airbnb NYC dataset using basic exploratory data analysis (EDA) techniques.
*  We examine how Airbnb listings are distributed by location, price range, room type, listing name, and other related factors.





#### **Define Your Business Objective?**

The primary objective of Airbnb is to provide a platform for people to rent out their homes, apartments, or rooms to travelers. Its supporting objectives are:

*   To provide a user-friendly platform for hosts to list their properties and for travelers to search and book accommodations.
*   To ensure the safety and security of both hosts and travelers by implementing a robust verification process and providing insurance coverage.
*   To offer competitive pricing for accommodations to attract more travelers and increase revenue for hosts.
*   To expand the business globally by entering new markets and establishing partnerships with local businesses.
*   To continuously improve the platform by incorporating feedback from users and implementing new features and technologies.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
Airbnb = pd.read_csv('/content/drive/MyDrive/Data_Sets/Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
# Dataset First Look
Airbnb.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
Airbnb.shape

### Dataset Information

In [None]:
# Dataset Info
Airbnb.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
Airbnb.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
Airbnb.isnull().sum()

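In [None]:
# Drop rows with null values (this also removes listings that have no reviews)
Airbnb = Airbnb.dropna()
# The chart cells below refer to the cleaned frame as df
df = Airbnb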

### What did you know about your dataset?

The dataset comes from Airbnb, an online marketplace that connects people looking for short-term accommodation with hosts. We analyze price, trends, availability, and other factors, and the insights behind them.

Before null rows are dropped, the dataset has 48,895 rows and 16 columns. There are 16 missing values in the name column, 21 in host_name, and 10,052 each in last_review and reviews_per_month. There are no duplicate rows in the dataset.
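Since the rows with missing review fields are removed by `dropna()`, it can help to check how many rows that costs. The sketch below uses a small made-up frame (the column names follow the dataset, the values are invented) and shows one possible alternative, under the assumption that a missing `reviews_per_month` simply means the listing has no reviews yet.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb frame; values are invented for illustration
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "number_of_reviews": [12, 0, 0, 3],
    "last_review": ["2019-05-01", None, None, "2019-06-10"],
    "reviews_per_month": [0.8, np.nan, np.nan, 0.2],
})

# Missing values per column
missing_per_column = toy.isnull().sum()

# Rows lost if every row containing any null is dropped
rows_lost = len(toy) - len(toy.dropna())

# Alternative (assumption): a missing reviews_per_month means zero reviews per month
filled = toy.copy()
filled["reviews_per_month"] = filled["reviews_per_month"].fillna(0)

print(missing_per_column)
print("Rows lost by dropna():", rows_lost)
```

Whether to drop or fill depends on the question being asked; dropping keeps only listings that have been reviewed at least once.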

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Airbnb.columns

In [None]:
# Dataset Describe
Airbnb.describe(include="all")

### Variables Description

* **id** : Unique identifier for each listing.
* **name** : Name of the listing.
* **host_id** : ID of the host.
* **host_name** : Name of the host.
* **neighbourhood_group** : Broad neighbourhood classification.
* **neighbourhood** : Specific neighbourhood.
* **latitude** : Latitude of the listing.
* **longitude** : Longitude of the listing.
* **room_type** : Type of room.
* **price** : Price per night.
* **minimum_nights** : Minimum number of nights for a booking.
* **number_of_reviews** : Total number of reviews.
* **last_review** : Date of the last review.
* **reviews_per_month** : Average number of reviews per month.
* **calculated_host_listings_count** : Number of listings by the host.
* **availability_365** : Number of days available per year.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
Airbnb.nunique()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Top 5 neighbourhoods by number of listings
top_neighborhoods = df['neighbourhood'].value_counts().head(5)

plt.figure(figsize=(10, 6))
top_neighborhoods.plot.bar(color=['violet', 'indigo', 'b', 'g', 'y'])
plt.title('Top 5 Neighborhoods by Number of Listings', fontsize=16)
plt.xlabel('Neighborhood', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

I used a bar chart to show the number of listings in the top 5 neighbourhoods.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the five neighbourhoods with the most listings and how their listing counts compare.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the top neighborhoods dominate the majority of listings (as might be reflected in the bar chart), businesses may become over-concentrated in a few areas. This could lead to market saturation in those neighborhoods, where the supply outweighs the demand.
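One way to put a number on this concentration is the share of all listings that falls in the top few neighbourhoods. The sketch below is a minimal illustration on invented data; `top_n` and the neighbourhood labels are hypothetical.

```python
import pandas as pd

# Toy data: ten listings spread over three made-up neighbourhoods
toy = pd.DataFrame({"neighbourhood": ["A"] * 5 + ["B"] * 3 + ["C"] * 2})

counts = toy["neighbourhood"].value_counts()

# Share of all listings held by the top_n neighbourhoods
top_n = 2
top_share = counts.head(top_n).sum() / counts.sum()

print(f"Top {top_n} neighbourhoods hold {top_share:.0%} of listings")
```

On the real frame the same two lines could be applied to `df['neighbourhood']`.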

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Distribution of room_type.
room_type_counts = df['room_type'].value_counts()

plt.figure(figsize=(8, 6))
room_type_counts.plot(kind='bar', color='purple', edgecolor='black')
plt.title('Distribution of Room Types', fontsize=16)
plt.xlabel('Room Type', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how listings are distributed across the room types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By increasing the availability of high-demand room types, businesses can capture a larger share of the market. For example, offering more listings of the "Entire home/apt" type may attract customers looking for privacy.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Distribution of price
plt.figure(figsize=(8, 6))
sns.histplot(df['price'], bins=20, kde=True, color='green')
plt.title('Distribution of Price', fontsize=16)
plt.xlabel('Price', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram shows the distribution of a single numerical variable, here the listing price, and the KDE curve gives a smoothed view of the same distribution.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how listing prices are distributed, that is, which price ranges contain the most listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By aligning with the price points that most customers are comfortable with, businesses can increase their booking frequency, leading to higher revenue and occupancy rates.
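To find the price points where most listings sit, summary statistics can complement the histogram, since a few very expensive listings can stretch the axis. The following is a minimal sketch on invented prices; the numbers are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Invented prices with one extreme value, for illustration only
prices = pd.Series([50, 80, 100, 120, 150, 200, 5000], name="price")

mean_price = prices.mean()      # pulled upwards by the extreme value
median_price = prices.median()  # robust to the extreme value

# Upper quantiles show where the long tail starts
upper_quantiles = prices.quantile([0.90, 0.95, 0.99])

# A log transform compresses the tail, which can make a histogram easier to read
log_prices = np.log1p(prices)

print("mean:", round(mean_price, 2), "median:", median_price)
print(upper_quantiles)
```

If the mean sits well above the median, plotting `log_prices` or capping the price range may give a clearer picture than the raw histogram.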

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Average price by neighbourhood_group.
average_price_by_neighbourhood = df.groupby('neighbourhood_group')['price'].mean().reset_index()

plt.figure(figsize=(8, 6))
sns.barplot(x='neighbourhood_group', y='price', data=average_price_by_neighbourhood, palette='viridis')
plt.title('Average Price by Neighbourhood Group', fontsize=16)
plt.xlabel('Neighbourhood Group', fontsize=12)
plt.ylabel('Average Price', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are widely used to compare different categories or groups of data. They are especially effective for visualizing categorical data.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the average listing price for each neighbourhood group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If a neighbourhood group has a high average price, but the number of bookings or reviews is relatively low, it suggests overpricing for that location.
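The overpricing check described above can be sketched by putting average price and review volume side by side. The rule used here (above-median price combined with below-median reviews) is a hypothetical threshold chosen for illustration, and the data is invented.

```python
import pandas as pd

# Toy data; group labels and values are made up
toy = pd.DataFrame({
    "neighbourhood_group": ["X", "X", "Y", "Y", "Z", "Z"],
    "price": [200, 300, 80, 120, 90, 110],
    "number_of_reviews": [1, 2, 40, 60, 30, 50],
})

summary = toy.groupby("neighbourhood_group").agg(
    avg_price=("price", "mean"),
    total_reviews=("number_of_reviews", "sum"),
)

# Hypothetical rule: pricier than the median group but reviewed less than the median group
flagged = summary[
    (summary["avg_price"] > summary["avg_price"].median())
    & (summary["total_reviews"] < summary["total_reviews"].median())
]

print(summary)
print("Possibly overpriced:", list(flagged.index))
```

Reviews are only a proxy for bookings, so a flagged group is a prompt for a closer look rather than a conclusion.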

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Filter out extreme values for price and minimum_nights
filtered_data = df[(df['price'] <= 1000) & (df['minimum_nights'] <= 30)]

plt.figure(figsize=(10, 6))
plt.scatter(filtered_data['price'], filtered_data['minimum_nights'], alpha=0.5, color='teal')
plt.title('Correlation Between Price and Minimum Nights', fontsize=16)
plt.xlabel('Price ($)', fontsize=12)
plt.ylabel('Minimum Nights', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows each listing as a point, which makes it easy to see whether price and minimum nights are related.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the relationship between price and minimum nights for listings priced at $1,000 or less with a minimum stay of 30 nights or fewer.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If there is a noticeable cluster of listings with high prices and high minimum nights, this could point to a subset of listings that may not be as attractive to a broad customer base.
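A scatter plot gives a visual impression; a correlation coefficient puts a single number on the linear relationship. The sketch below applies the same filter as the chart to invented data and computes the Pearson correlation with pandas.

```python
import pandas as pd

# Toy listings; values are invented for illustration
toy = pd.DataFrame({
    "price": [60, 80, 100, 150, 300, 90],
    "minimum_nights": [1, 3, 2, 30, 2, 5],
})

# Same filter as the chart: drop extreme prices and very long minimum stays
filtered = toy[(toy["price"] <= 1000) & (toy["minimum_nights"] <= 30)]

# Pearson correlation: +1 is a perfect positive linear relationship, -1 a perfect negative one
pearson_r = filtered["price"].corr(filtered["minimum_nights"])

print(f"Pearson r between price and minimum_nights: {pearson_r:.2f}")
```

A value close to zero would suggest that minimum stay tells little about price, at least linearly.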

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Distribution of neighbourhood_group.
neighbourhood_group_counts = df['neighbourhood_group'].value_counts()

plt.figure(figsize=(8, 6))
neighbourhood_group_counts.plot(kind='bar', color='orange', edgecolor='black')
plt.title('Distribution of Neighbourhood Group', fontsize=16)
plt.xlabel('Neighbourhood Group', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is a good fit when the goal is to compare values across distinct categories, here the number of listings in each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the number of listings in each neighbourhood group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the bar chart shows that a few neighbourhood groups have a significantly higher number of listings, this could indicate market concentration.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Map of listings using latitude and longitude.
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='longitude', y='latitude', hue='neighbourhood_group', palette='viridis')
plt.title('Map of Listings', fontsize=16)
plt.xlabel('Longitude', fontsize=12)
plt.ylabel('Latitude', fontsize=12)
plt.legend(title='Neighbourhood Group', fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Plotting longitude against latitude places each listing at its geographic position, so the scatter plot works as a simple map, with colour marking the neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

The chart shows where listings are located across the city and how they cluster by neighbourhood group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the scatter plot of listings by latitude and longitude shows that most listings are concentrated in just a few neighbourhoods, this could signal a lack of diversification. A downturn or saturation in those popular areas could then severely affect business growth, because there is little presence in less popular regions to fall back on.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Number of reviews by neighbourhood_group.
reviews_by_neighbourhood_group = df.groupby('neighbourhood_group')['number_of_reviews'].sum().reset_index()

plt.figure(figsize=(8, 6))
sns.barplot(x='neighbourhood_group', y='number_of_reviews', data=reviews_by_neighbourhood_group, palette='viridis')
plt.title('Number of Reviews by Neighbourhood Group', fontsize=16)
plt.xlabel('Neighbourhood Group', fontsize=12)
plt.ylabel('Number of Reviews', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is a good fit when the goal is to compare values across distinct categories, here the total number of reviews in each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the total number of reviews received by listings in each neighbourhood group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Some neighbourhood groups may have significantly fewer reviews compared to others. This could indicate that these areas are less popular or have lower activity.
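Total reviews depend heavily on how many listings a group has, so a group can look less active simply because it is smaller. A minimal sketch on invented data that reports reviews per listing next to the totals:

```python
import pandas as pd

# Toy data: group A has more listings, group B has more reviews per listing
toy = pd.DataFrame({
    "neighbourhood_group": ["A", "A", "A", "A", "B", "B"],
    "number_of_reviews": [10, 20, 30, 40, 45, 45],
})

review_summary = toy.groupby("neighbourhood_group")["number_of_reviews"].agg(
    total_reviews="sum",
    listings="count",
    reviews_per_listing="mean",
)

print(review_summary)
```

In this toy example group A has the higher total while group B has the higher average per listing, which is the kind of difference a totals-only chart hides.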

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Average reviews per month by room_type.
reviews_per_month_by_room_type = df.groupby('room_type')['reviews_per_month'].mean().reset_index()

plt.figure(figsize=(8, 6))
sns.barplot(x='room_type', y='reviews_per_month', data=reviews_per_month_by_room_type, palette='viridis')
plt.title('Average Reviews per Month by Room Type', fontsize=16)
plt.xlabel('Room Type', fontsize=12)
plt.ylabel('Average Reviews per Month', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is a good fit when the goal is to compare values across distinct categories, here the average reviews per month for each room type.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the average number of reviews per month for each room type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   Insight: "Shared room" listings might show significantly lower average reviews per month than "Entire home/apt" or "Private room" listings.



#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Availability of listings across the year.
plt.figure(figsize=(10, 6))
plt.hist(df['availability_365'], bins=30, color='skyblue', edgecolor='black')
plt.title('Availability of Listings Across the Year', fontsize=16)
plt.xlabel('Number of Available Days (365 = Always Available)', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?


A histogram shows the distribution of a single numerical variable, here the number of days per year a listing is available.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how listings are distributed by the number of days they are available per year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   If the histogram shows a significant portion of listings with low availability (e.g., ≤30 days/year), it suggests that many hosts are infrequently active or treat their listings as secondary.
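The share of low-availability listings can be read off directly instead of estimated from the histogram. A minimal sketch on invented values, using the 30-day cut-off mentioned above:

```python
import pandas as pd

# Toy availability values (days per year); invented for illustration
toy = pd.DataFrame({"availability_365": [0, 10, 30, 200, 365]})

# The mean of a boolean mask is the fraction of rows where the condition holds
low_availability_share = (toy["availability_365"] <= 30).mean()

print(f"Listings available 30 days or fewer: {low_availability_share:.0%}")
```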




#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Keep one row per host so that each host is counted once
host_listings = df.drop_duplicates(subset='host_id')['calculated_host_listings_count']

plt.figure(figsize=(10, 6))
# One bin per integer count; align='left' centres each bar on its count
plt.hist(host_listings, bins=range(1, host_listings.max() + 2), color='orange', edgecolor='black', align='left')
plt.title('Host Activity: Number of Listings per Host', fontsize=16)
plt.xlabel('Number of Listings', fontsize=12)
plt.ylabel('Number of Hosts', fontsize=12)
plt.xticks(range(1, min(20, host_listings.max() + 1)))
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram shows the distribution of a single numerical variable, here the number of listings per host.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how many hosts have a given number of listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hosts with a higher number of listings likely contribute significantly to the platform's revenue. Understanding the distribution of listings (e.g., single-property hosts vs. multi-property hosts) can help tailor host engagement to each group.
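The split between single-property and multi-property hosts can be computed by counting listings per `host_id`. The sketch below uses invented IDs and is meant only to illustrate the calculation.

```python
import pandas as pd

# Toy listings: host 1 has two listings, host 2 has one, host 3 has three
toy = pd.DataFrame({
    "id": [101, 102, 103, 104, 105, 106],
    "host_id": [1, 1, 2, 3, 3, 3],
})

listings_per_host = toy.groupby("host_id")["id"].count()

single_hosts = int((listings_per_host == 1).sum())
multi_hosts = int((listings_per_host > 1).sum())

# Share of all listings that belong to multi-property hosts
multi_share = listings_per_host[listings_per_host > 1].sum() / listings_per_host.sum()

print("Single-property hosts:", single_hosts)
print("Multi-property hosts:", multi_hosts)
print(f"Listings held by multi-property hosts: {multi_share:.0%}")
```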

#### Chart - 12

In [None]:
# Chart - 12 visualization code
filtered_data = df[df['price'] <= 1000]

# Create a boxplot for price comparison by room_type
plt.figure(figsize=(10, 6))
sns.boxplot(x='room_type', y='price', data=filtered_data, palette='Set2')
plt.title('Price Comparison by Room Type', fontsize=16)
plt.xlabel('Room Type', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is well suited to comparing the distribution, spread, and outliers of a numerical variable across categories, here price across room types.

##### 2. What is/are the insight(s) found from the chart?

*   The boxplot will show how the prices of the listings are distributed within each room type category. Each room type will have a box representing the interquartile range (IQR), a line (median) inside the box, and whiskers extending to the minimum and maximum values within a defined range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By visualizing the price distribution for different room types, you can gain insights into which types of listings are more expensive or affordable. For example, private rooms may have a different price distribution than entire homes or shared rooms.
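The quantities a box plot draws (quartiles, median, interquartile range) can also be tabulated per room type. A minimal sketch on invented prices:

```python
import pandas as pd

# Toy prices for two room types; values are invented
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 5,
    "price": [100, 150, 200, 250, 300, 40, 60, 80, 100, 120],
})

# Quartiles per room type, one row per room type and one column per quantile
stats = toy.groupby("room_type")["price"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range: the height of the box in the box plot
stats["iqr"] = stats[0.75] - stats[0.25]

print(stats)
```

Comparing the median and IQR columns gives the same reading as the chart in numeric form.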

#### Chart - 13

In [None]:
# Chart - 13 visualization code
reviews_per_month = df['reviews_per_month']

# Plot the distribution of reviews_per_month
plt.figure(figsize=(10, 6))
plt.hist(reviews_per_month, bins=30, color='lightgreen', edgecolor='black')
plt.title('Distribution of Reviews per Month', fontsize=16)
plt.xlabel('Reviews per Month', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram shows the distribution of a single numerical variable, here the number of reviews a listing receives per month.

##### 2. What is/are the insight(s) found from the chart?

You'll get an understanding of how reviews are distributed across listings. Most listings may have few reviews per month (which is typical for some Airbnb properties), while a small number of properties may receive many reviews per month.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Listings with consistently high reviews per month may indicate that they are highly popular, well-maintained, or have a loyal customer base. Listings with very few or no reviews per month might indicate underperformance or lack of visibility.

#### Chart - 14 - Box Plot of Price by Neighbourhood

In [None]:
# Get the top 5 neighborhoods by number of listings
top_5_neighborhoods = df['neighbourhood'].value_counts().head(5).index

# Filter the dataset for these neighborhoods
filtered_data = df[df['neighbourhood'].isin(top_5_neighborhoods) & (df['price'] <= 1000)]

# Create a boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='neighbourhood', y='price', data=filtered_data, palette='Set2')
plt.title('Price Comparison by Neighborhood', fontsize=16)
plt.xlabel('Neighborhood', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is well suited to comparing the distribution, spread, and outliers of a numerical variable across categories.

##### 2. What is/are the insight(s) found from the chart?

The chart compares the price distribution across the top five neighbourhoods by number of listings, for listings priced at $1,000 or less.
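A correlation heatmap is another way to summarise how the numeric columns relate to one another in a single view. The sketch below is a minimal example on an invented frame (the column names follow the dataset, the values do not), using `DataFrame.corr()` and `seaborn.heatmap`.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy numeric columns; values are invented for illustration
toy = pd.DataFrame({
    "price": [60, 80, 100, 150, 300, 90],
    "minimum_nights": [1, 3, 2, 30, 2, 5],
    "number_of_reviews": [40, 12, 25, 3, 8, 30],
    "availability_365": [120, 200, 30, 365, 300, 0],
})

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()

plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap (toy data)")
plt.tight_layout()
plt.show()
```

On the real frame, `df[numeric_columns].corr()` would take the place of `toy.corr()`, where `numeric_columns` is a hypothetical list of the columns of interest.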

#### Chart - 15 - Scatter Plot of Price vs. Availability

In [None]:
# Chart - 15 visualization code
# Filter out extreme price values for clarity
filtered_data = df[df['price'] <= 1000]

# Create a scatter plot for price and availability_365
plt.figure(figsize=(10, 6))
plt.scatter(filtered_data['price'], filtered_data['availability_365'], alpha=0.5, color='blue')
plt.title('Correlation Between Price and Availability (365 Days)', fontsize=16)
plt.xlabel('Price ($)', fontsize=12)
plt.ylabel('Availability (Days)', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows each listing as a point, which makes it easy to see whether price and yearly availability are related.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the relationship between price and availability for listings priced at $1,000 or less.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

*   Increase Booking Rates.
*   Enhance Host Engagement and Retention.
*   Optimize Pricing for Maximum Revenue.
*   Expand into New Markets.
*   Improve Guest Experience.





# **Conclusion**

In conclusion, the Airbnb analysis project successfully provided valuable insights into pricing variation, availability patterns, and location-based trends within the Airbnb dataset.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***