<a href="https://colab.research.google.com/github/Ashish-Kumar-Vaish/Airbnb-EDA/blob/main/Airbnb_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Airbnb**



# **Project Summary -**

This project explores Airbnb listings in New York City to derive insights that can benefit both Airbnb hosts and the company. The dataset includes attributes such as price, neighborhood, room type, number of reviews, availability, and geographic coordinates for each listing. The project involves data cleaning, visualization, and modeling to understand pricing and customer behavior.

# **GitHub Link -**

https://github.com/Ashish-Kumar-Vaish/Airbnb-EDA

# **Problem Statement**


The goal is to analyze Airbnb listings in New York City to understand patterns in pricing, room types, availability, and location. The project will help hosts and Airbnb optimize pricing and property availability based on real-world data.

#### **Define Your Business Objective?**

* Identify the key factors that affect Airbnb prices in NYC.
* Help hosts optimize listing features (like room type, location, or availability).
* Provide insights that support Airbnb's strategy for pricing and availability.

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
path = "/content/drive/MyDrive/Colab Notebooks/Datasets/Copy of Copy of Airbnb NYC 2019.csv"
df = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
print("Missing values per column:\n", missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(9,3))
missing_values.plot(kind = "bar", color=sns.color_palette("Set2"))
plt.title("Missing Values Per Column")
plt.xlabel("Column Name")
plt.ylabel("Missing Count")
plt.xticks(rotation=0)
plt.show()

### What did you know about your dataset?

1. The dataset contains 48,895 rows and 16 columns.
2. No duplicate entries were found in the dataset.
3. Columns last_review and reviews_per_month contain missing values.
4. id, name, host_id and host_name are identifiers and may not be required for modeling.
5. There may be outliers in price and minimum_nights.
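
Point 5 can be checked numerically. The sketch below applies the common 1.5 × IQR rule to a small made-up series; the helper name `iqr_bounds` and the sample values are illustrative assumptions, not part of the original notebook or dataset.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices for illustration only (not taken from the dataset)
sample_prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 5000])

lower, upper = iqr_bounds(sample_prices)
outliers = sample_prices[(sample_prices < lower) | (sample_prices > upper)]

print(f"Fences: [{lower}, {upper}]")
print("Flagged values:", outliers.tolist())
```

The same helper could be called on `df['price']` or `df['minimum_nights']`; whether 1.5 is a suitable multiplier for such skewed columns is a judgment call.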

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* ***id:*** `Unique listing ID`
* ***name:*** `Name of the listing`
* ***host_id:*** `Unique host ID`
* ***host_name:*** `Name of the host`
* ***neighbourhood_group:*** `District or region within the city (e.g. Manhattan, Brooklyn)`
* ***neighbourhood:*** `More detailed neighborhood within a district/region`
* ***latitude, longitude:*** `Geo coordinates of the listing`
* ***room_type:*** `Type of listing - Entire home/apt, Private room, Shared room`
* ***price:*** `Price per night in USD`
* ***minimum_nights:*** `Minimum number of nights required to book`
* ***number_of_reviews:*** `Total number of reviews for the listing`
* ***last_review:*** `Date of the last review (NaN if no reviews)`
* ***reviews_per_month:*** `Average reviews per month (NaN if no reviews)`
* ***calculated_host_listings_count:*** `Total listings managed by the host`
* ***availability_365:*** `Number of days the listing is available in a year`

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
print("Number of duplicate rows:", df.duplicated().sum())
df.drop(["id", "host_name", "name", "host_id"], axis=1, inplace=True)
df

In [None]:
# A listing with no reviews has no monthly review rate, so fill with 0
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df.isnull().sum()

In [None]:
# Parse last_review as datetime; listings without reviews keep NaT (missing)
df['last_review'] = pd.to_datetime(df['last_review'])
df.isnull().sum()

### What all manipulations have you done and insights you found?

* Checked for duplicate rows. None were found; if there were any, they could be dropped with df.drop_duplicates(inplace=True).
* Dropped the identifier columns id, host_name, host_id, and name because they are not needed for modeling.
* Replaced NULL values in reviews_per_month with 0, since a listing with no reviews has no monthly review rate.
* Converted last_review to datetime. Listings without reviews keep a missing value (NaT), because filling a datetime column with a text placeholder such as "No review" would break its dtype.
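
A minimal sketch of these wrangling steps on a made-up DataFrame. The `toy` frame and its values are assumptions for illustration only. In this sketch, missing dates are left as `NaT` so that the column keeps its datetime dtype.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; column names mirror the dataset
toy = pd.DataFrame({
    "reviews_per_month": [1.2, np.nan, 0.4, 0.4],
    "last_review": ["2019-05-21", None, "2018-11-19", "2018-11-19"],
    "price": [149, 225, 89, 89],
})

# Drop exact duplicate rows (the last two rows are identical)
toy = toy.drop_duplicates().reset_index(drop=True)

# A listing with no reviews has 0 reviews per month
toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)

# Parse the dates; missing values become NaT
toy["last_review"] = pd.to_datetime(toy["last_review"])

print(toy.dtypes)
print(toy.isnull().sum())
```

Assigning the result back (`toy[col] = toy[col].fillna(...)`) avoids relying on `inplace=True` on a selected column, which newer pandas versions warn about.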


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(7,4))
df["room_type"].value_counts().plot(kind = "bar", color=sns.color_palette("Set2"))
plt.title("Distribution of Room Types")
plt.xlabel("Room Type")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for displaying the frequency of each category in a categorical variable like room_type.

##### 2. What is/are the insight(s) found from the chart?

* The most common room types are Entire home/apt and Private room.
* Shared rooms are the least frequent room type among NYC Airbnb listings.
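
To put numbers on "most common" and "least frequent", the counts can be converted into percentage shares. A small sketch, assuming made-up room types (the `rooms` series is not taken from the dataset):

```python
import pandas as pd

# Made-up room types for illustration only
rooms = pd.Series(
    ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"] * 1,
    name="room_type",
)

# normalize=True returns fractions instead of raw counts
shares = rooms.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

On the real data the equivalent call would be `df["room_type"].value_counts(normalize=True)`.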

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(7,4))
df['neighbourhood_group'].value_counts().plot(kind = 'bar', color=sns.color_palette("Set3"))
plt.title("Distribution of Neighbourhood Groups")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Number of Listings")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for displaying the frequency of each category in a categorical variable like neighbourhood_group.

##### 2. What is/are the insight(s) found from the chart?

* Manhattan has the most Airbnb listings of any neighbourhood group.
* Staten Island has the fewest listings.

#### Chart - 3

In [None]:
df['price'].describe()

In [None]:
# Share of listings priced below 550 USD
below_550 = (df['price'] < 550).mean()
print(f"Listings below 550 USD: {below_550 * 100:.2f}%")

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(9,4))
df.loc[df['price'] < 550, 'price'].plot(kind = 'hist', bins = 30)
plt.title("Distribution of Price below 550")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is ideal for visualizing the distribution of a numerical value like price.

##### 2. What is/are the insight(s) found from the chart?

* Most of the listings are priced below 200 USD, peaking around the 50-150 range.
* There are very few listings above 300 USD, indicating those are premium stays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Hosts can set competitive pricing in the 50-150 range to maximize bookings.
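
The 550 USD cutoff used for this chart is a fixed value. One alternative, shown here only as a sketch on made-up prices, is to derive the cutoff from a quantile so that it adapts to the data; the 0.90 level is an arbitrary assumption.

```python
import pandas as pd

# Made-up prices for illustration only
prices = pd.Series([40, 55, 70, 85, 100, 120, 150, 200, 400, 10000])

# Value below which roughly 90% of the prices fall
cutoff = prices.quantile(0.90)

# Fraction of listings kept by the cutoff, and the trimmed series for plotting
share_below = (prices < cutoff).mean()
trimmed = prices[prices < cutoff]

print(f"Cutoff: {cutoff:.0f} USD, share below: {share_below:.0%}")
```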

#### Chart - 4

In [None]:
df["minimum_nights"].describe()

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(9,4))
df[df['minimum_nights'] <= 30]['minimum_nights'].plot(kind = 'hist', bins=30, color='green')
plt.title("Distribution of Minimum Nights")
plt.xlabel("Minimum Nights")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

Since minimum_nights is a numerical column with discrete values, a histogram reveals patterns clearly.

##### 2. What is/are the insight(s) found from the chart?

* Most hosts require a minimum stay of 1-3 nights.

#### Chart - 5

In [None]:
df["number_of_reviews"].describe()

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(9,4))
df[df['number_of_reviews'] <= 100]['number_of_reviews'].plot(kind = 'hist', bins = 70)
plt.title("Distribution of Number of Reviews")
plt.xlabel("Number of Reviews")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is ideal for visualizing the distribution of a numerical variable like number_of_reviews.

##### 2. What is/are the insight(s) found from the chart?

* Most listings have fewer than 40 reviews.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(9,4))
df.groupby('neighbourhood_group')['price'].mean().plot(kind = 'bar', color=sns.color_palette("pastel"))
plt.title("Average Price by Neighbourhood Group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Average Price")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart makes it easy to compare group-wise statistics, such as the average price per neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

* Manhattan has the highest average Airbnb price.
* Staten Island has the lowest average price.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Hosts can adjust pricing based on location trends to stay competitive.

**Negative Insight:**
* Overpriced listings in lower demand areas may suffer from low booking rates.
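
Because a few very expensive listings can pull an average upward, it may help to look at the median next to the mean. A sketch on made-up data (the `toy` frame and its prices are assumptions, not dataset values):

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Queens", "Queens", "Queens"],
    "price": [100, 150, 2000, 60, 80, 100],
})

# Mean, median, and group size side by side
summary = toy.groupby("neighbourhood_group")["price"].agg(["mean", "median", "count"])
print(summary)
```

In this toy example the single 2000 USD listing lifts the Manhattan mean far above its median, while the Queens mean and median coincide.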

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(6,4))
df.groupby('room_type')['price'].mean().plot(kind = 'bar', color=sns.color_palette("Set3"))
plt.title("Average Price by Room Type")
plt.xlabel("Room Type")
plt.ylabel("Average Price")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing average prices across different categories of room_type.

##### 2. What is/are the insight(s) found from the chart?

* Entire home/apt has the highest average price.
* Shared rooms are the cheapest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Hosts can adjust pricing based on room type.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(6,4))
df.groupby('room_type')['availability_365'].mean().plot(kind = 'bar', color=sns.color_palette("Set1"))
plt.title("Average Availability by Room Type")
plt.xlabel("Room Type")
plt.ylabel("Average Availability (Days)")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is best for comparing numeric averages across categories.

##### 2. What is/are the insight(s) found from the chart?

* Private rooms have the highest average availability.
* Shared rooms have the lowest availability on average.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(9,4))
df.groupby('neighbourhood_group')['number_of_reviews'].mean().plot(kind = 'bar', color=sns.color_palette("Set1"))
plt.title("Average Number of Reviews by Neighbourhood Group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Average Number of Reviews")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is best for comparing the average number of reviews per listing across neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

* Staten Island and Queens have the highest average review counts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Airbnb can spotlight listings in less-reviewed neighbourhood groups to improve their visibility.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(9,5))
df.groupby('neighbourhood')['price'].mean().sort_values(ascending = False).head(10).plot(kind = 'bar', color=sns.color_palette("pastel"))
plt.title("Top 10 Neighbourhoods by Average Price")
plt.xlabel("Neighbourhood")
plt.ylabel("Average Price")
plt.xticks(rotation = 45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is best for comparing average price across different neighbourhoods.

##### 2. What is/are the insight(s) found from the chart?

* Top 10 Neighbourhoods by average price are plotted.
* Fort Wadsworth and Woodrow have the highest average prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Helps new hosts or investors choose high-revenue zones.
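
An average can be driven by a single listing when a neighbourhood has very few of them, so it may be worth checking listing counts alongside the ranking. The sketch below uses made-up neighbourhoods `A`, `B`, `C` and a hypothetical threshold `min_listings`; neither comes from the original notebook.

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame({
    "neighbourhood": ["A", "B", "B", "B", "C", "C", "C"],
    "price": [800, 200, 220, 240, 100, 120, 140],
})

# Average price and number of listings per neighbourhood
stats = toy.groupby("neighbourhood")["price"].agg(avg_price="mean", listings="count")

# Hypothetical threshold: ignore neighbourhoods with too few listings
min_listings = 3
ranked = (
    stats[stats["listings"] >= min_listings]
    .sort_values("avg_price", ascending=False)
)
print(ranked)
```

Here neighbourhood `A` has the highest average but only one listing, so it drops out of the ranking once the threshold is applied.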

#### Chart - 11

In [None]:
# Chart - 11 visualization code
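plt.figure(figsize=(9,4))
# Pass the current axes so DataFrame.plot draws on this figure instead of opening a new one
df[df['price'] < 550].plot(kind = 'scatter', x = 'price', y = 'number_of_reviews', alpha=0.1, ax=plt.gca())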
plt.xlabel("Price")
plt.ylabel("Number of Reviews")
plt.title("Price vs Number of Reviews")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is the best choice to analyze relationships between two continuous variables, like price and number_of_reviews.

##### 2. What is/are the insight(s) found from the chart?

* Most reviewed listings are in the 50-150 price range.
* Listings with high prices have fewer reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Hosts can adjust prices to increase visibility.
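
The visual impression from the scatter plot can be backed by a rank correlation. A minimal sketch on made-up values; Spearman is chosen here as an assumption because it only requires a monotonic relationship and is less sensitive to extreme prices than Pearson.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 400],
    "number_of_reviews": [90, 60, 45, 10, 2],
})

# Spearman rank correlation between price and review count
rho = toy.corr(method="spearman").loc["price", "number_of_reviews"]
print(f"Spearman rho: {rho:.2f}")
```

A value close to -1 would indicate that higher-priced listings consistently have fewer reviews; the real dataset will likely show a much weaker relationship than this toy example.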

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(11,5))
correlation = df[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']].corr()

plt.imshow(correlation, cmap='coolwarm')
plt.title("Correlation Heatmap of Numerical Values")
plt.colorbar(label = 'Correlation Coefficient')
plt.xticks(ticks = range(len(correlation.columns)), labels=correlation.columns, rotation = 45)
plt.yticks(ticks = range(len(correlation.index)), labels=correlation.index)
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap helps to spot positive or negative correlations between features.

##### 2. What is/are the insight(s) found from the chart?

* number_of_reviews and reviews_per_month are positively correlated.
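
Reading exact values off a colour scale is hard, so the coefficients can also be written into the cells. The sketch below uses random made-up columns `a`, `b`, `c` (assumptions for illustration) and pins the colour scale to [-1, 1] so that zero maps to the neutral midpoint of `coolwarm`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up numeric data for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
corr = toy.corr()

fig, ax = plt.subplots(figsize=(5, 4))

# Fix the colour scale to [-1, 1] so 0 sits at the neutral midpoint
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="Correlation Coefficient")

ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

ax.set_title("Annotated Correlation Heatmap")
plt.show()
```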

#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code
numeric_cols = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']

pd.plotting.scatter_matrix(df.loc[df['price'] < 550, numeric_cols], alpha=0.1, figsize=(12,10))
plt.show()

##### 1. Why did you pick the specific chart?

A scatter matrix shows the pair relationships between several variables in one grid.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? Explain Briefly.

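* Price competitively: most listings are priced below 200 USD, with a peak in the 50-150 USD range.
* Adjust prices by location and room type; Manhattan and Entire home/apt listings have the highest average prices.
* Avoid overpricing in lower-demand areas, where it may lead to low booking rates.
* Spotlight listings in less-reviewed areas to improve their visibility.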

# **Conclusion**

* Manhattan and Brooklyn are the most active neighbourhood groups, but with different price dynamics.
* Room type, location, and availability are key factors affecting how many people book and leave reviews.
* Visualizations showed important patterns in bookings, pricing, and host performance.