<a href="https://colab.research.google.com/github/mani-github2021/AirBnb_Booking_Analysis/blob/main/AirBnb_Bookings_Analysis_EDA_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    



##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual


# **Project Summary**

Airbnb has changed the way people travel since 2008, offering unique experiences worldwide. With millions of listings, the platform generates a large amount of data; the dataset used here alone contains around 49,000 observations and 16 columns of mixed types. This data is essential for many purposes: making the platform safer, deciding on business strategies, understanding how users and hosts behave, evaluating performance, targeting ads better, and developing new services. Exploring this dataset helps Airbnb learn from its users and improve its services for everyone.

# **GitHub Link**

https://github.com/mani-github2021/AirBnb_Booking_Analysis

# **Problem Statement**


Analyze the Airbnb NYC 2019 dataset to understand various aspects of Airbnb listings in New York City, such as distribution, pricing, and relationships between different variables, and to provide insights that can help improve business strategies for hosts and Airbnb as a platform.
The analysis involves importing and examining the dataset, understanding the variables, performing data wrangling, and visualizing relationships between variables. Insights gained from the analysis will help Airbnb hosts and the platform optimize their offerings.

#### **Define Your Business Objective?**

To gain insights from the Airbnb NYC 2019 dataset that can help improve host performance, optimize pricing strategies, and enhance guest satisfaction. This includes understanding data distribution, identifying key factors affecting prices, and finding patterns that can be leveraged for better decision-making.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
path='/content/Airbnb NYC 2019.csv'
df = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.rcParams['figure.figsize'] = (10,5)
df.isna().sum().plot.bar()
plt.show()

### What did you know about your dataset?

The dataset has 48,895 rows and 16 columns, with missing values in `name`, `host_name`, `last_review`, and `reviews_per_month`.
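To see how much of each column is missing, and not only which columns are affected, the null counts can be turned into percentages. The following is a minimal sketch; `missing_summary` is a hypothetical helper name and the toy frame is made-up data standing in for the real `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (columns with none are dropped)."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "host_name": ["Ana", "Ben", "Cy", "Di"],
    "last_review": [None, None, "2019-05-01", "2019-06-01"],
    "reviews_per_month": [np.nan, np.nan, 0.5, 1.2],
})
missing_report = missing_summary(toy)
print(missing_report)
```

In the notebook, the same helper could be called as `missing_summary(df)`.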

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

*These are the variables in the dataset*

id: Unique identifier

name: Listing title

host_id: Host identifier

host_name: Host name

neighbourhood_group: Grouped neighbourhood

neighbourhood: Specific neighbourhood

latitude: Geographic latitude

longitude: Geographic longitude

room_type: Type of room

price: Listing price

minimum_nights: Minimum stay

number_of_reviews: Total reviews

last_review: Date of last review

reviews_per_month: Avg. reviews/month

calculated_host_listings_count: Host listings count

availability_365: Availability (days)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"Number of unique values for {col}: {df[col].nunique(dropna=False)}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Remove duplicates
df = df.drop_duplicates()

# Handle missing values (filling or dropping)
df = df.dropna(subset=['name', 'host_name'])  # Dropping rows where 'name' or 'host_name' is missing
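# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not modify df
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)  # Listings without reviews get 0 reviews per month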


In [None]:
# Split the listings by whether they have any reviews
# Data without reviews
df_no_reviews = df[df['reviews_per_month'] == 0]

# Data with reviews
df_with_reviews = df[df['reviews_per_month'] > 0]

In [None]:
#Number of listings for each neighborhood group
df['neighbourhood_group'].value_counts().reset_index(name='listings')

In [None]:
#Number of listings for each neighborhood
df['neighbourhood'].value_counts().reset_index(name='listings')

In [None]:
#Number of listings by room type without reviews
df_no_reviews['room_type'].value_counts().reset_index(name='listings')

In [None]:
#Number of listings by room type with reviews
df_with_reviews['room_type'].value_counts().reset_index(name='listings')

In [None]:
#Calculating the percentage of hosts with reviews
total_hosts = df['host_id'].nunique()
hosts_with_reviews = df_with_reviews['host_id'].nunique()
percentage_hosts_with_reviews = (hosts_with_reviews / total_hosts) * 100
print(f"Percentage of hosts with reviews: {percentage_hosts_with_reviews:.2f}%")

### What all manipulations have you done and insights you found?

Removed duplicate entries to ensure data integrity.

Dropped rows with missing 'name' or 'host_name' to maintain essential information.

Filled missing 'reviews_per_month' with 0 to handle null values effectively.

Grouped data by neighborhood and room type to understand distribution and concentration.
- Manhattan has the maximum number of listings (21643) among neighborhood groups
- Staten Island has the minimum number of listings (373) among neighborhood groups

Separated data based on the presence of reviews to analyze differences in listing characteristics.
- Entire home/apt has the maximum number of listings without reviews (5072) by room type
- Shared room has the minimum number of listings without reviews (313) by room type

Counted listings per neighborhood group and room type to identify popular areas and types.
- Entire home/apt is identified as the most popular room type, with the maximum number of listings (20321)

Calculated the percentage of hosts with reviews to gauge engagement.
- Percentage of hosts with reviews: 80.78%
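The room-type counts for listings with and without reviews can also be produced in one table instead of two separate frames. This is only a sketch of that alternative: the `review_status` column and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use the cleaned `df`
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 4],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Shared room", "Private room"],
    "reviews_per_month": [0.0, 1.2, 0.0, 0.4, 2.0],
})

# Label each listing by review status instead of splitting into two frames
toy["review_status"] = toy["reviews_per_month"].gt(0).map(
    {True: "with reviews", False: "no reviews"}
)

# Rows: room type, columns: review status, values: number of listings
room_by_reviews = pd.crosstab(toy["room_type"], toy["review_status"])
print(room_by_reviews)

# Share of hosts that have at least one reviewed listing
hosts_with_reviews = toy.loc[toy["review_status"] == "with reviews", "host_id"].nunique()
host_share = hosts_with_reviews / toy["host_id"].nunique() * 100
print(f"Percentage of hosts with reviews: {host_share:.2f}%")
```

A cross-tabulation keeps both groups side by side, which makes it easier to compare room types than reading two `value_counts()` outputs.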

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
#Distribution of Prices
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], bins=50, kde=True)
plt.title('Distribution of Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram (seaborn's `histplot`) shows the distribution of a variable as the frequency of values within each bin. I used this chart to understand the price distribution of Airbnb listings.

##### 2. What is/are the insight(s) found from the chart?

Most listings are priced below $500. There are a few listings with extremely high prices.
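A statement like this can be backed by a number. The sketch below computes the share of listings under a price threshold and a few quantiles; the prices are made-up toy values, and the 500 threshold simply mirrors the figure mentioned above.

```python
import pandas as pd

# Hypothetical toy prices; in the notebook this would be df['price']
prices = pd.Series([50, 80, 100, 120, 150, 200, 250, 400, 1000, 10000], name="price")

threshold = 500
# Mean of a boolean Series is the fraction of True values
share_below = (prices < threshold).mean() * 100

# Median and upper quantiles show how far the tail stretches
quantiles = prices.quantile([0.5, 0.95, 0.99])

print(f"{share_below:.1f}% of listings are priced below ${threshold}")
print(quantiles)
```

Comparing the median with the upper quantiles is a quick way to judge how strongly a few expensive listings stretch the histogram.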

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps the hosts understand the competitive pricing.

#### Chart - 2

In [None]:
#Average Price by Neighborhood Group
plt.figure(figsize=(10, 6))
sns.barplot(x='neighbourhood_group', y='price', data=df)
plt.title('Average Price by Neighborhood Group')
plt.xlabel('Neighborhood Group')
plt.ylabel('Average Price')
plt.show()

##### 1. Why did you pick the specific chart?

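A bar chart compares a numeric value across the categories of a categorical variable. By default, seaborn's `barplot` plots the mean of the y variable for each category, with error bars showing a confidence interval. I used this to compare average prices across neighbourhood groups.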

##### 2. What is/are the insight(s) found from the chart?

Manhattan has the highest average price among the neighbourhood groups.
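Because the mean is sensitive to a few very expensive listings, it can help to look at the median next to it. The following sketch uses made-up toy rows, not the real data, to show one way to tabulate both per neighbourhood group.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use `df`
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Queens"],
    "price": [150, 200, 2000, 90, 110, 80],
})

# Mean, median and listing count per group, highest mean first
price_stats = (
    toy.groupby("neighbourhood_group")["price"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(price_stats)
```

In the toy data a single 2000 listing pulls the Manhattan mean far above its median, which is the kind of gap worth checking before setting prices from the bar chart alone.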


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps hosts in setting appropriate prices based on location.

#### Chart - 3

In [None]:
# Room Type Distribution
plt.figure(figsize=(8, 8))
room_counts = df['room_type'].value_counts()
plt.pie(room_counts, labels=room_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Room Type Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions. I picked this chart to understand the popularity of each type of room.

##### 2. What is/are the insight(s) found from the chart?

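Entire home/apt is the most common room type and shared rooms are the least common. Since the chart counts listings, it reflects supply and is only a rough proxy for demand.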

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps hosts see which room types are most common on the platform.

#### Chart - 4

In [None]:
#Availability by Neighborhood Group
plt.figure(figsize=(10, 6))
sns.boxplot(x='neighbourhood_group', y='availability_365', data=df)
plt.title('Availability by Neighborhood Group')
plt.xlabel('Neighborhood Group')
plt.ylabel('Availability (days)')
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot summarizes the distribution of a variable through its median, quartiles, and outliers. I picked this chart to see how availability varies across neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

Some neighborhoods have listings with high availability year-round.
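To say which groups those are, the boxplot can be complemented with a small table. This is a sketch on made-up toy rows; the cut-off of 300 days for "high availability" is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use `df`
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 3 + ["Staten Island"] * 3,
    "availability_365": [0, 30, 365, 200, 340, 365],
})

HIGH_AVAILABILITY_DAYS = 300  # assumed cut-off, not taken from the dataset

# Median availability and share of highly available listings per group
availability_stats = toy.groupby("neighbourhood_group")["availability_365"].agg(
    median="median",
    share_high=lambda s: (s >= HIGH_AVAILABILITY_DAYS).mean() * 100,
)
print(availability_stats)
```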

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps hosts manage expectations and strategies based on availability trends.

#### Chart - 5

In [None]:
# Keep only the numeric columns; coercing text columns with pd.to_numeric
# would turn them into NaN and leave empty rows in the heatmap
df_numeric = df.select_dtypes(include='number')

# Create correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df_numeric.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap visually represents data using colors to indicate the magnitude of values in a matrix or table. I used this to explore relationships between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

Identifies strong and weak correlations among variables like price, reviews, and availability.
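Reading the strongest pairs off a heatmap by eye is error-prone, so they can also be listed directly. The sketch below is one possible approach; `top_correlations` is a hypothetical helper and the toy frame contains made-up values.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n variable pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Indices of the upper triangle, so each pair appears once and the diagonal is skipped
    rows, cols = np.triu_indices(len(corr.columns), k=1)
    pairs = pd.DataFrame({
        "var_a": corr.columns[rows],
        "var_b": corr.columns[cols],
        "corr": corr.to_numpy()[rows, cols],
    })
    order = pairs["corr"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n).reset_index(drop=True)


# Hypothetical toy frame; the text column is ignored by select_dtypes
toy = pd.DataFrame({
    "room_type": ["Private room", "Shared room", "Private room", "Shared room", "Private room"],
    "price": [300, 100, 250, 80, 150],
    "number_of_reviews": [1, 5, 10, 20, 40],
    "reviews_per_month": [0.1, 0.5, 1.0, 2.0, 4.0],
})
strongest = top_correlations(toy, n=1)
print(top_correlations(toy))
```

In the notebook, `top_correlations(df)` would give a ranked list to cite alongside the heatmap. Identifier columns such as `id` and `host_id` are numeric too and may need to be dropped first.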

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify key factors influencing pricing and availability.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?

- Adjust pricing strategies based on neighborhood and competition.
- Focus on high-demand room types to maximize occupancy and revenue.
- Improve listings in underperforming neighborhoods by enhancing features and amenities.

# **Conclusion**

This analysis provides valuable insights into Airbnb listings in NYC, helping hosts and the platform make informed decisions to optimize performance and enhance user experience.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***