<a href="https://colab.research.google.com/github/RallapatiSaiTejith/Data_wrangling_project/blob/main/Data_wrangling_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airbnb Bookings Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**


Introduction :

The Airbnb Bookings Analysis project is an exploratory data analysis (EDA) aimed at understanding patterns, trends, and key factors influencing Airbnb bookings. As Airbnb has become a dominant force in the short-term rental industry, analyzing booking data can provide valuable insights into customer preferences, seasonal demand, pricing strategies, and host performance. This project leverages real-world Airbnb data to uncover trends that can help hosts optimize their listings and travelers make informed decisions.




Objectives :

The primary goals of this project are:

* Analyze Booking Trends: Identify seasonal fluctuations, popular travel months, and booking frequency across different locations.
* Price Analysis: Examine pricing variations based on location, property type, and amenities.
* Host Performance Metrics: Evaluate factors contributing to high-rated hosts, such as response rate, reviews, and pricing strategies.
* Guest Preferences: Understand traveler preferences by analyzing booking durations, property features, and guest reviews.
* Geospatial Analysis: Explore the impact of location on booking rates, availability, and pricing.
* Revenue Estimation: Predict potential revenue based on historical data and current trends.





Data Description :

The dataset used in this analysis is sourced from Airbnb’s publicly available listings data. It typically includes:

* Listing ID and name
* Host details (e.g., host ID, response rate, and listing count per host)
* Location (latitude, longitude, and neighborhood)
* Property type and room type
* Pricing details (price per night, minimum nights, service fees, etc.)
* Availability and booking trends (availability across different time frames)
* Customer ratings and review scores




Methodology:

The analysis follows a structured approach:

* Data Collection and Cleaning:
    * Handling missing or inconsistent values
    * Removing duplicates and outliers
    * Converting data types where necessary
* Exploratory Data Analysis (EDA):
    * Statistical summary of numerical variables
    * Visualization of price distributions, booking trends, and host characteristics
    * Correlation analysis to identify relationships between features
* Predictive Insights:
    * Using regression models to predict pricing trends and potential revenue generation





Key Findings and Insights:

The project aims to derive actionable insights such as:

* Peak Booking Seasons: Identifying months with the highest demand.
* Optimal Pricing Strategies: Understanding the price range that maximizes occupancy while maintaining profitability.
* Host Success Factors: Determining what separates high-earning hosts from others.
* Location-Based Demand: Recognizing the most sought-after areas for Airbnb stays.
* Customer Sentiment Trends: Assessing guest preferences based on reviews and ratings.





Tools and Technologies Used

* Programming Language: Python (Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn)
* Data Visualization: Matplotlib, Seaborn, Plotly






# **GitHub Link -**

https://github.com/RallapatiSaiTejith/Data_wrangling_project

# **Problem Statement**


The short-term rental industry has experienced significant growth, with Airbnb leading the market as a preferred accommodation choice for travelers worldwide. However, both hosts and guests face challenges in optimizing their decisions. Hosts struggle with setting the right pricing, maximizing occupancy, and understanding factors that contribute to guest satisfaction. On the other hand, travelers find it difficult to choose properties that best match their preferences and budgets. This project aims to analyze Airbnb booking data to uncover trends, pricing strategies, host performance indicators, and guest preferences. By leveraging data-driven insights, we seek to help hosts enhance their listings and travelers make informed choices, ultimately improving the overall Airbnb experience.

#### **Define Your Business Objective?**

The primary business objective of this project is to extract meaningful insights from Airbnb booking data to support data-driven decision-making for stakeholders, including Airbnb hosts, travelers, and market analysts. The specific objectives include:

* Optimizing Revenue for Hosts: By identifying the best pricing strategies, seasonal demand patterns, and key success factors, hosts can maximize their earnings while ensuring high occupancy rates.
* Enhancing Guest Experience: Understanding guest preferences through booking trends, review sentiment, and property features enables hosts to improve their offerings and attract more bookings.
* Improving Market Competitiveness: Analyzing location-based demand and pricing trends helps hosts stay competitive by adjusting their pricing and services accordingly.
* Supporting Investment Decisions: Providing insights into profitable property locations and high-demand listing features aids real estate investors in making informed investment choices.
* Predicting Future Trends: Using historical data to forecast demand, pricing, and revenue helps both hosts and Airbnb strategize for future growth.
* Facilitating Policy and Market Analysis: Policymakers and market analysts can leverage findings to understand the economic impact of Airbnb and design regulations that balance growth and local community interests.

By addressing these objectives, this project aims to create value for Airbnb stakeholders and contribute to a more efficient and profitable short-term rental ecosystem.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')
data=pd.read_csv('/content/drive/MyDrive/Copy of Airbnb NYC 2019.csv')
data

### Dataset First View

In [None]:
data.head()

In [None]:
data.tail()

### Dataset Rows & Columns count

In [None]:
data.shape

### Dataset Information

In [None]:
data.dtypes

#### Duplicate Values

In [None]:
# count fully duplicated rows (duplicated() alone only returns a True/False flag per row)
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
data.isnull().sum()

In [None]:
sns.heatmap(data.isnull())
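
The raw null counts are easier to compare as percentages of the dataset. The following is a minimal sketch of such a summary; the helper name `missing_summary` and the small `sample` frame are made up for illustration and are not part of the Airbnb data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny made-up frame standing in for the real dataset
sample = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [120, 80, 95, 150],
    "reviews_per_month": [0.5, None, None, 1.2],
})

print(missing_summary(sample))
```

In the notebook, the same helper could be called as `missing_summary(data)`.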

### What did you know about your dataset?

Dataset Size & Structure:

The dataset contains 48,895 rows and 16 columns.

* Columns & Data Types: Includes text/categorical features (name, host_name, neighbourhood_group, neighbourhood, room_type, and last_review, which is a date stored as text) and numerical features (price, minimum_nights, number_of_reviews, etc.).
* Missing Values:
    * name (16 missing values)
    * host_name (21 missing values)
    * last_review (10,052 missing values)
    * reviews_per_month (10,052 missing values, likely due to properties with no reviews)
* Key Statistics:
    * Price Range: Min = $0, Max = $10,000, Median = $106 (suggesting extreme outliers)
    * Minimum Nights: Max = 1,250 nights, with a median of 3 nights (the maximum is likely an unrealistic value)
* Reviews: Some listings have 0 reviews, while the maximum is 629 reviews.
* Availability: Some listings are available for 0 days, while others are available for 365 days.
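
The gap between the median price and the maximum suggests outliers. One common way to flag them is the 1.5 × IQR rule; the sketch below applies it to a handful of made-up prices (not the real column), and the helper name `iqr_bounds` is hypothetical.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices that mimic a skewed column with one extreme value
prices = pd.Series([0, 60, 80, 100, 106, 120, 150, 200, 10000], name="price")

low, high = iqr_bounds(prices)
outliers = prices[(prices < low) | (prices > high)]

print(f"fences: {low} to {high}")
print(outliers)
```

On the real data this would be `iqr_bounds(data['price'])`. Whether flagged rows should be removed or only inspected is a separate decision.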

## ***2. Understanding Your Variables***

In [None]:
data.columns

In [None]:
data.describe()

### Variables Description

1. Identification Variables:

* id (int) – Unique identifier for each listing.
* name (str, nullable) – The title or name of the listing.
* host_id (int) – Unique identifier for the host.
* host_name (str, nullable) – Name of the host.

2. Location Variables:

* neighbourhood_group (str) – The broader borough in NYC where the listing is located (e.g., Manhattan, Brooklyn).
* neighbourhood (str) – Specific neighborhood within the borough.
* latitude (float) – Geographic coordinate (latitude) of the listing.
* longitude (float) – Geographic coordinate (longitude) of the listing.

3. Property & Booking Details:

* room_type (str) – Type of listing:
    * Entire home/apt – The entire place is available for guests.
    * Private room – Guests share some spaces.
    * Shared room – Guests share sleeping space with others.
* price (int) – The cost per night in USD.
* minimum_nights (int) – Minimum number of nights required for a booking.
* availability_365 (int) – Number of days per year the listing is available for booking.

4. Review & Popularity Metrics:

* number_of_reviews (int) – Total reviews the listing has received.
* last_review (date, nullable) – Date of the most recent review.
* reviews_per_month (float, nullable) – Average number of reviews per month.

5. Host Activity Variables:

* calculated_host_listings_count (int) – Number of listings the host has.

### Check Unique Values for each variable.




In [None]:
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

data info

In [None]:
data.info()

value count

In [None]:
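# count listings per room type (calling value_counts() on the whole DataFrame would count identical rows instead)
data['room_type'].value_counts()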

rename


In [None]:
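# rename() returns a new DataFrame; data itself keeps 'number_of_reviews', which later cells rely on
data.rename({'number_of_reviews':'no_of_rev'},axis=1)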

In [None]:
# sort data in increasing order of price
data.sort_values('price',inplace=True)
data

In [None]:
#filter data
data[data['neighbourhood_group']=='Manhattan']

In [None]:
#new columns
data['total_price']=data['price']*data['minimum_nights']
data

In [None]:
data.isnull().sum()

In [None]:
# drop every row that has a missing value in any column
# (this also removes all listings that have no reviews)
data.dropna(inplace=True)
data
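
Dropping every incomplete row also discards the listings that simply have no reviews yet. As an alternative, the gaps could be filled instead. This is only a sketch under stated assumptions: the helper `fill_review_gaps`, the "Unknown" placeholder, and the `sample` rows are illustrative choices, not what the notebook does above.

```python
import pandas as pd


def fill_review_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Keep unreviewed listings by filling review fields instead of dropping rows."""
    out = df.copy()
    # No reviews means a review rate of zero rather than an unknown rate
    out["reviews_per_month"] = out["reviews_per_month"].fillna(0)
    # Parse the dates; listings without reviews keep a missing date (NaT)
    out["last_review"] = pd.to_datetime(out["last_review"])
    # Placeholder text for missing names (a hypothetical choice)
    out[["name", "host_name"]] = out[["name", "host_name"]].fillna("Unknown")
    return out


# Made-up rows with the same kinds of gaps as the real dataset
sample = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room"],
    "host_name": ["Ana", "Ben", None],
    "number_of_reviews": [12, 0, 3],
    "last_review": ["2019-05-21", None, "2019-06-02"],
    "reviews_per_month": [0.8, None, 0.2],
})

cleaned = fill_review_gaps(sample)
print(len(sample), len(cleaned))
print(cleaned.isnull().sum())
```

With this approach no rows are lost; only `last_review` stays empty for listings that were never reviewed.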

1. Data Exploration

* Checked dataset structure (48,895 rows, 16 columns).
* Identified missing values in:
    * name (16 missing)
    * host_name (21 missing)
    * last_review (10,052 missing)
    * reviews_per_month (10,052 missing)

2. Unique Value Analysis

* Hosts & Listings:
    * 37,457 unique hosts (some manage multiple properties).
    * 47,905 unique listing names (some duplicates).
* Location: Listings span 5 boroughs and 221 neighborhoods.
* Room Type Distribution: Entire home/apt, Private room, Shared room (only 3 unique values).

3. Data Transformation

* Created total price column: Multiplied price and minimum nights.
* Sorted the listings by price and dropped rows with missing values.




## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
plt.figure(figsize= (4,3))
sns.barplot(data=data,x='neighbourhood_group',y='price')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart gives a simple comparison of the average price across neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

* Manhattan has the highest average price among the neighbourhood groups.
* The Bronx has the lowest average price among the neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By making price disparities visible across regions, businesses can adjust offerings, attract the right audience, and improve conversion and satisfaction.
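
By default `sns.barplot` draws the mean of `y` for each category, so a few very expensive listings can pull a bar upwards. Comparing the mean with the median per group is a quick check. The rows below are made up to show the effect and are not taken from the dataset.

```python
import pandas as pd

# Made-up listings: one extreme price in Manhattan
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Bronx", "Bronx", "Bronx"],
    "price": [150, 200, 10000, 60, 70, 80],
})

# Mean, median and number of listings per neighbourhood group
stats = sample.groupby("neighbourhood_group")["price"].agg(["mean", "median", "count"])
print(stats)
```

In the notebook the same aggregation could be run on `data`. If the median is preferred for the chart, `sns.barplot` accepts an `estimator` argument for that purpose.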

#### Chart - 2

In [None]:
plt.figure(figsize= (4,3))
sns.barplot(data=data,x='calculated_host_listings_count',y='neighbourhood_group')
plt.show()

##### 1. Why did you pick the specific chart?

Neighborhood names are categorical and can often be long. Horizontal bars make it easier to read longer labels without rotating or truncating them.

##### 2. What is/are the insight(s) found from the chart?

* Manhattan has the highest average host listings count (calculated_host_listings_count) among the neighbourhood groups.
* Staten Island, Brooklyn and the Bronx have the lowest host listings counts among the neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Market Targeting & Expansion, Competitive Positioning, Strategic Pricing
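
Note that this bar chart plots the mean of calculated_host_listings_count per group, which is not the same thing as the number of listings in each group. The sketch below, on made-up rows, shows how the two measures can rank the groups differently.

```python
import pandas as pd

# Made-up rows: Manhattan has fewer listings here, but they belong to a large host
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn", "Queens"],
    "calculated_host_listings_count": [50, 50, 1, 2, 1, 3],
})

# Number of listings (rows) per neighbourhood group
listing_counts = sample["neighbourhood_group"].value_counts()

# Average number of listings owned by the hosts in each group
mean_host_listings = (
    sample.groupby("neighbourhood_group")["calculated_host_listings_count"].mean()
)

print(listing_counts)
print(mean_host_listings)
```

For a chart of listing counts per group, `sns.countplot` on `neighbourhood_group` would be the usual choice.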

#### Chart - 3

In [None]:
plt.figure(figsize= (4,3))
sns.boxplot(data=data,x='room_type',y='price')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot shows the price distribution and spread, allows comparison across room types, and makes outliers and price gaps easy to spot.

##### 2. What is/are the insight(s) found from the chart?

* Entire homes/apartments charge premium prices.
* Shared rooms are cheaper than the other two room types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Better Pricing Algorithms → Helps Airbnb fine-tune dynamic pricing based on real trends.
* Improved Search & Filtering → If private rooms are overpriced in certain areas, Airbnb can suggest alternatives.

#### Chart - 4

In [None]:
# Minimum Nights vs. Number of Reviews
plt.figure(figsize= (20,3))
sns.boxplot(data=data,x='minimum_nights',y='number_of_reviews')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?





I picked a box plot for number of reviews vs. minimum nights because it’s the best way to visualize how review activity changes with different stay lengths while identifying patterns and anomalies.


##### 2. What is/are the insight(s) found from the chart?

* Listings with Shorter Minimum Stays Get More Reviews
* Longer Minimum Stay Listings Have Fewer Reviews

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More Reviews → Higher Visibility & Bookings, Optimizing Pricing & Minimum Stay Strategy
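
Because minimum_nights takes many distinct values, the box plot has a very crowded x-axis. Grouping the values into a few stay-length buckets can make the same comparison easier to read. The bucket edges and the rows below are assumptions chosen for illustration, not values from the notebook.

```python
import pandas as pd

# Made-up listings
sample = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 5, 7, 14, 30, 90, 365],
    "number_of_reviews": [40, 25, 30, 12, 8, 5, 3, 1, 0],
})

# Hypothetical buckets: up to 3 nights, up to a week, up to a month, longer
bins = [0, 3, 7, 30, float("inf")]
labels = ["1-3", "4-7", "8-30", "31+"]
sample["stay_bucket"] = pd.cut(sample["minimum_nights"], bins=bins, labels=labels)

# Median number of reviews per bucket
median_reviews = (
    sample.groupby("stay_bucket", observed=True)["number_of_reviews"].median()
)
print(median_reviews)
```

The new `stay_bucket` column could then be used as the x-axis of the box plot instead of the raw minimum_nights values.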

#### Chart - 5

In [None]:
#Availability vs. Price
plt.figure(figsize= (20,3))
sns.scatterplot(data=data,x='availability_365',y='price')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a scatter plot for Availability vs. Price because it’s the best way to visualize how price correlates with the number of available days per year.

##### 2. What is/are the insight(s) found from the chart?




* No Strong Correlation Between Price & Availability
* High-Priced Listings Tend to Have Lower Availability
* Budget-Friendly Listings Are Available Year-Round


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* If luxury listings with low availability charge premium rates, hosts can limit availability to create exclusivity.
* If budget-friendly listings with high availability get more bookings, hosts can keep their calendars open year-round.
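
A reading of the scatter plot can be backed up with a correlation coefficient. The sketch below computes both the Pearson (linear) and Spearman (rank-based) coefficients with `DataFrame.corr`. The five rows are made up only to show that the two can differ when prices are skewed, and they say nothing about the real data.

```python
import pandas as pd

# Made-up rows: price rises with availability, but far from linearly
sample = pd.DataFrame({
    "availability_365": [0, 30, 90, 180, 365],
    "price": [50, 60, 100, 400, 5000],
})

# Pearson measures linear association; Spearman works on ranks,
# so it is less affected by extreme prices
pearson = sample.corr(method="pearson").loc["availability_365", "price"]
spearman = sample.corr(method="spearman").loc["availability_365", "price"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

In the notebook the same two calls could be made on `data[['availability_365', 'price']]`.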

#### Chart - 6

In [None]:
# Host Listings vs. Price
plt.figure(figsize= (20,3))
sns.scatterplot(data=data,x='calculated_host_listings_count',y='price')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this graph because it helps uncover how host behavior and business scale influence pricing strategies—a key factor for both Airbnb's operations and host success.

##### 2. What is/are the insight(s) found from the chart?

* Multi-Listing Hosts Tend to Have Slightly Lower Prices.
* Single-Listing Hosts Charge a Wider Range of Prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Business Impact by Audience.
* The relationship between host listing count and pricing reveals patterns that can directly inform Airbnb’s strategy, host behavior, and guest expectations.

#### Chart - 7

In [None]:
# Reviews per Month vs. Price
plt.figure(figsize= (20,3))
sns.scatterplot(data=data,x='reviews_per_month',y='price')
plt.show()

##### 1. Why did you pick the specific chart?

* Reviews per Month = Demand Signal
* Helps Identify the Sweet Spot for Pricing

##### 2. What is/are the insight(s) found from the chart?

 * Lower-Priced Listings Get More Reviews per Month
 * High-Priced Listings Receive Fewer Reviews per Month

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The relationship between Reviews per Month and Price offers valuable guidance for both hosts and Airbnb to optimize their strategy for higher engagement and revenue.



#### Chart - 8

In [None]:
#Host Listings vs. Reviews
plt.figure(figsize= (20,3))
sns.scatterplot(data=data,x='calculated_host_listings_count',y='number_of_reviews')
plt.show()

##### 1. Why did you pick the specific chart?

* Visualizing Scale vs. Engagement
* Easy to Spot Patterns & Outliers

##### 2. What is/are the insight(s) found from the chart?

* Most Reviews Come from Hosts with Fewer Listings
* Multi-Listing Hosts Show Varying Review Performance

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
#Listings Count by Room Type
plt.figure(figsize= (20,3))
room_type_counts = data['room_type'].value_counts()
plt.pie(room_type_counts, labels=room_type_counts.index, autopct='%1.1f%%')
plt.show()

##### 1. Why did you pick the specific chart?

* Clear Proportional Comparison
* Visual Simplicity

##### 2. What is/are the insight(s) found from the chart?

* Entire home/apt usually makes up the largest portion, indicating a preference for full privacy.
* Private rooms are the second most common, often used by travelers seeking budget-friendly stays or local interactions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Product-Market Fit & Inventory Planning
* Pricing & Positioning Strategy

#### Chart - 10

In [None]:

plt.figure(figsize= (20,5))
neighbourhood_group_count= data['neighbourhood_group'].value_counts()
plt.pie(neighbourhood_group_count,labels=neighbourhood_group_count.index, autopct='%1.1f%%')
plt.show()



##### 1. Why did you pick the specific chart?

*  Instant Visual Insight
*  Limited Categories = Clean Visuals

##### 2. What is/are the insight(s) found from the chart?

*  Growth Opportunities
* Targeted Marketing & Investment

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* For Hosts & Property Managers
* For City Planners or Tourism Boards

#### Chart - 11

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data['availability_365'], bins=50, kde=False, color='skyblue')
plt.xlabel('Availability (Days)')
plt.ylabel('Frequency')
plt.title('Distribution of Availability')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a histogram for the "Listings Count vs Availability" chart because it's the best way to visualize a distribution, especially when you're looking at how many listings fall into different availability ranges throughout the year.

##### 2. What is/are the insight(s) found from the chart?

* 365-day listings for consistent revenue.
* Low-availability listings for re-engagement or offboarding.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Identify Revenue Opportunities
* Listings with 365 availability are full-time — great candidates for promotions, premium support, or professional tools.

#### Chart - 13

In [None]:
# Distribution of number of reviews
plt.figure(figsize= (20,3))
sns.histplot(data=data,x='number_of_reviews',bins=50,kde=False)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram gives a fast, intuitive view of how reviews are distributed across thousands of listings.

Great for presentations or dashboards — even non-technical viewers can get the message.

##### 2. What is/are the insight(s) found from the chart?

High number of reviews usually = more bookings.

It also reflects consistent guest traffic over time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Identify Top-Performing Listings
* Strategic Investment & Growth

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_df = data.select_dtypes(include=['number'])
corr_matrix = numeric_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True, linewidths=0.5)
plt.show()



##### 1. Why did you pick the specific chart?

* You're Measuring Relationships, Not Distributions
* Smart for Feature Selection & Business Strategy

##### 2. What is/are the insight(s) found from the chart?

 * reviews_per_month vs number_of_reviews → Very strong positive correlation (0.99)
 * availability_365 vs number_of_reviews → Moderate positive correlation (~0.36)
 *  minimum_nights vs other features → Very low or negative correlations
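
Coefficients read off a heatmap are easy to misread, so it can help to list the pairs as a ranked table. The following is a minimal sketch; the helper name `ranked_correlations` and the toy numbers are made up, and only the column names match the dataset.

```python
import numpy as np
import pandas as pd


def ranked_correlations(df: pd.DataFrame) -> pd.Series:
    """List each pair of numeric columns once, strongest correlation first."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Order by absolute strength, keeping the sign in the values
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)


# Toy numbers for illustration only
sample = pd.DataFrame({
    "number_of_reviews": [1, 2, 3, 4, 5],
    "reviews_per_month": [2, 4, 6, 8, 10],
    "price": [5, 3, 4, 1, 2],
})

pairs = ranked_correlations(sample)
print(pairs)
```

Running `ranked_correlations(data)` in the notebook would give the exact values behind the heatmap, which can then be compared with the figures quoted above.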

#### Chart - 15 - Pair Plot

In [None]:
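# Pair Plot visualization code
# restrict to the main numeric features; id, host_id and the coordinates add little to a pair plot
pair_cols = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']
sns.pairplot(data[pair_cols])
plt.show()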

##### 1. Why did you pick the specific chart?

I chose a pairplot (also known as a scatterplot matrix) for the Airbnb NYC 2019 dataset because it is an effective visualization tool for exploring relationships between multiple numerical variables simultaneously.

##### 2. What is/are the insight(s) found from the chart?

* The presence of extreme prices suggests a diverse market, with luxury listings coexisting alongside budget options. Outliers may need special consideration in pricing models or could represent unique properties (e.g., entire homes in premium neighborhoods like Manhattan).

* Listings with longer minimum stays tend to be priced higher, possibly targeting long-term renters or reflecting hosts’ preference for stable bookings. This could indicate a niche for extended-stay accommodations in NYC.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

*  Focus on High-Engagement Listings

Action: Prioritize and invest in listings with ≥50 reviews and high availability — they already have momentum.

* Target Underutilized Listings for Optimization

Action: Offer training, photography support, or pricing tools to help these hosts convert better.

* Use Room Type & Neighbourhood Insights for Strategy


Action: Invest marketing in top-performing room types & zones, while launching pilots in underbooked areas.

*  Leverage Price Independence

Action: Use smart pricing tools and A/B testing to maximize revenue without hurting engagement.

*  Clean Data & Platform Inventory

Action: Perform regular audits to clean up inactive listings.

*  Predict & Promote Rising Stars (Pair Plot Insights)

Action: Create a “Boost” program to accelerate promising listings with ad credits or homepage placement.














# **Conclusion**

By conducting an in-depth exploratory data analysis of Airbnb bookings, this project provides valuable insights for both Airbnb hosts and travelers. Hosts can optimize their pricing and listing strategies, while travelers can make informed decisions about their stays. The findings from this analysis can also be beneficial for market analysts and policymakers to understand the impact of short-term rentals on local economies.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***