# **Project Name**    - Airbnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This project explores the Airbnb NYC 2019 dataset to uncover key insights. The dataset contains information on nearly 49,000 Airbnb listings in New York City. The goal of the project is to use this data to answer the following questions:

*   Which neighbourhood groups and neighbourhoods have the most Airbnb listings?
*   What are the most common types of Airbnb listings?
*   How does the price of Airbnb listings vary by neighbourhood group, type of listing, and other factors?
*   How actively are Airbnb listings reviewed by guests?
*   What is the relationship between different factors in the dataset, such as the price of a listing and the number of reviews it has received?

# **Problem Statement**


Airbnb is a popular platform for people to rent out their homes to guests. The company has millions of listings around the world. This data can be used to understand the Airbnb market and to make better decisions about how to improve the platform.

The specific problem this project addresses is extracting key insights from the Airbnb dataset. By understanding the data, Airbnb can make better decisions about how to improve the platform and serve its customers.


#### **Define Your Business Objective?**

The business objective of this project is to use the Airbnb dataset to identify key insights that can improve the Airbnb platform and better serve its customers. These insights can inform decisions about how to grow the business and improve the customer experience.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # standard-library math; `from numpy import math` fails on NumPy 2.x
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams



import warnings
warnings.filterwarnings('ignore')


In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset

csv_file_path = '/content/drive/MyDrive/Project/AirBnb EDA project/Airbnb NYC 2019.csv'

dataset = pd.read_csv(csv_file_path)


### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(dataset.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(dataset.isnull(), cbar=False)

### What did you know about your dataset?

There are 48895 rows and 16 columns in the dataset.

The columns are:
id: Unique ID for the listing

name: Name of the listing

host_id: Unique ID for the host

host_name: Name of the host

neighbourhood_group: Neighborhood group

neighbourhood: Neighborhood

latitude: Latitude

longitude: Longitude

room_type: Type of listing

price: Price of the listing

minimum_nights: Minimum number of nights to be paid for

number_of_reviews: Number of reviews

last_review: Date of the last review

reviews_per_month: Number of reviews per month

calculated_host_listings_count: Number of listings the host has

availability_365: Availability around the year

There are no duplicate values in the dataset.

The data types of the columns are:

float64(3): latitude, longitude, reviews_per_month

int64(7): id, host_id, price, minimum_nights, number_of_reviews,
calculated_host_listings_count, availability_365

object(6): name, host_name, neighbourhood_group, neighbourhood, room_type, last_review

The dataset is from Airbnb and contains information on nearly 49,000 listings in New York City for 2019.
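
The column overview above can also be collected into a single table instead of being read off several separate outputs. The sketch below is one possible way to do that. `summarize_columns` is a hypothetical helper, and the small `toy` frame only stands in for `dataset`, so its values are illustrative rather than taken from the real data.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, unique count."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": (frame.isnull().mean() * 100).round(1),
        "unique": frame.nunique(),
    })
    # Columns with the most gaps come first
    return summary.sort_values("missing", ascending=False)


# Illustrative stand-in for `dataset` (values are made up)
toy = pd.DataFrame({
    "price": [100, 250, 80, 60],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
    "reviews_per_month": [0.5, np.nan, 1.2, np.nan],
})

summary = summarize_columns(toy)
print(summary)
```

In the notebook, the same helper could be called as `summarize_columns(dataset)`.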

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe(include='all')

### Variables Description

*   id: Unique ID for the listing.
*   name: Name of the listing.
*   host_id: Unique ID for the host.
*   host_name: Name of the host.
*   neighbourhood_group: Neighborhood group.
*   neighbourhood: Neighborhood.
*   latitude: Latitude.
*   longitude: Longitude.
*   room_type: Type of listing.
*   price: Price of the listing.
*   minimum_nights: Minimum number of nights to be paid for.
*   number_of_reviews: Number of reviews.
*   last_review: Date of the last review.
*   reviews_per_month: Number of reviews per month.
*   calculated_host_listings_count: Number of listings the host has.
*   availability_365: Availability around the year


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in dataset.columns:
    print(f"No. of unique values in {column} is {dataset[column].nunique()}.")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Import the scaler used at the end of this cell
from sklearn.preprocessing import MinMaxScaler

# Create a copy so the raw dataset stays untouched
df = dataset.copy()
# Handle outliers in 'price'
price_upper_limit = df['price'].quantile(0.95)
df = df[df['price'] <= price_upper_limit]

# Handle outliers in 'minimum_nights'
minimum_nights_upper_limit = df['minimum_nights'].quantile(0.95)
df = df[df['minimum_nights'] <= minimum_nights_upper_limit]

# Drop remaining rows with missing values; this also removes listings
# whose 'last_review' or 'reviews_per_month' is empty
df.dropna(inplace=True)

# Calculate 'host_experience' using 'number_of_reviews' column (example calculation)
df['host_experience'] = df['number_of_reviews']

# Encode categorical variables using one-hot encoding
categorical_columns = ['neighbourhood_group', 'room_type']
df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Scale numerical features using Min-Max scaling
scaler = MinMaxScaler()
numerical_columns = ['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'host_experience']
df_encoded[numerical_columns] = scaler.fit_transform(df_encoded[numerical_columns])


### What all manipulations have you done and insights you found?

Outlier Handling: Rows with a 'price' above the 95th percentile were removed, followed by rows with a 'minimum_nights' above the 95th percentile of the remaining data, so that extreme values do not skew the analysis.
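
The percentile-based filter can be illustrated on a small example. This is a minimal sketch, not the notebook's actual cell: `drop_above_quantile` is a hypothetical helper and the prices are made up to include one extreme value.

```python
import pandas as pd


def drop_above_quantile(frame: pd.DataFrame, column: str, q: float = 0.95) -> pd.DataFrame:
    """Keep only the rows whose value in `column` is at or below the q-th quantile."""
    upper_limit = frame[column].quantile(q)
    return frame[frame[column] <= upper_limit]


# Made-up prices with a single extreme listing
toy = pd.DataFrame({"price": [50, 60, 70, 80, 90, 100, 110, 120, 130, 10_000]})
trimmed = drop_above_quantile(toy, "price")

print("mean before:", toy["price"].mean())
print("mean after: ", trimmed["price"].mean())
```

A single extreme value dominates the mean before filtering, which is the skew the filter is meant to remove. Because the notebook applies the filters one after the other, the second threshold is computed on data that has already lost its highest-priced rows.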

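Missing Value Handling: Rows with any missing value were removed so that every remaining row is complete. This also discards listings whose 'last_review' or 'reviews_per_month' is missing, so the analysis that follows covers only listings with complete review information.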
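
Dropping rows is only one way to handle the gaps. The sketch below compares it with an alternative on made-up data. It assumes, as something to verify on the real dataset rather than a confirmed fact, that a missing 'reviews_per_month' simply means the listing has no reviews yet, in which case the gap can be filled with 0 instead of discarding the row.

```python
import numpy as np
import pandas as pd

# Made-up rows: two listings have no reviews and therefore no review fields
toy = pd.DataFrame({
    "number_of_reviews": [12, 0, 3, 0],
    "last_review": ["2019-05-01", None, "2019-06-15", None],
    "reviews_per_month": [0.8, np.nan, 0.2, np.nan],
})

# Option used in this notebook: drop every row that has any missing value
dropped = toy.dropna()

# Alternative: first check that the gaps coincide exactly with zero-review listings
gaps_are_zero_review = (
    toy["reviews_per_month"].isnull() == (toy["number_of_reviews"] == 0)
).all()

filled = toy.copy()
if gaps_are_zero_review:
    # A listing with no reviews has a review rate of zero
    filled["reviews_per_month"] = filled["reviews_per_month"].fillna(0)

print(len(toy), len(dropped), len(filled))
```

Filling keeps unreviewed listings in the price and availability analysis, at the cost of leaving 'last_review' empty for them.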

Feature Creation: 'host_experience' is set equal to 'number_of_reviews' as a placeholder. The dataset has no 'host_since' column, so a date-based host tenure cannot be computed from it.

Feature Encoding: Using one-hot encoding to transform categorical variables like 'neighbourhood_group' and 'room_type' into numerical form suitable for analysis.
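
The effect of `drop_first=True` is easy to miss, so here is a small sketch on made-up room types. It shows that the first category (in sorted order) gets no column of its own and is represented by all dummy columns being zero.

```python
import pandas as pd

# Made-up listings covering three room types
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
})

encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True)

# Only two dummy columns remain; 'Entire home/apt' is the implicit baseline
print(encoded.columns.tolist())
print(encoded)
```

Because the baseline category has no column, the encoded frame suits correlation or modelling work, while charts that compare all categories side by side are easier to build from the original, unencoded column.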

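Feature Scaling: Applying Min-Max scaling to the numerical features in df_encoded so that each one lies between 0 and 1 and features with different units can be compared. The unscaled values remain available in df.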
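
With its default range of 0 to 1, Min-Max scaling maps each value x to (x - min) / (max - min). The sketch below reproduces that formula by hand on made-up prices; `min_max_scale` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import pandas as pd


def min_max_scale(series: pd.Series) -> pd.Series:
    """Rescale a numeric series linearly so its minimum maps to 0 and its maximum to 1."""
    span = series.max() - series.min()
    if span == 0:
        # A constant column carries no spread; map every value to 0
        return series * 0.0
    return (series - series.min()) / span


prices = pd.Series([50, 100, 150, 250])  # made-up prices
scaled = min_max_scale(prices)
print(scaled.tolist())
```

The transformation is linear, so the ordering of values and their Pearson correlations are unchanged. The scaled numbers no longer carry the original units, which matters when a chart axis is labelled "Price".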

Potential Insights (hypothetical examples):

Price and Availability: After handling outliers in 'price', you might find that the majority of listings fall within a certain price range. You can then analyze how price impacts availability throughout the year.

Booking Durations: Handling outliers in 'minimum_nights' can show that most listings have shorter booking requirements, and a small percentage have longer stays. This can help understand the dominant booking patterns.

Host Tenure Impact: Because 'host_experience' is currently a copy of 'number_of_reviews', it cannot reveal anything beyond what the review count already shows. A genuine tenure feature would need a host start date, which this dataset does not contain.

Neighborhood Insights: One-hot encoding 'neighbourhood_group' might highlight differences in average prices and review counts among different neighborhood groups.

Room Type Trends: By encoding 'room_type', you could compare average prices and reviews for different room types to see which types are more popular and potentially more profitable.

Normalized Features: Scaling the numerical features could allow you to see trends in terms of proportions rather than absolute values. For instance, you could compare the ratio of 'number_of_reviews' to 'availability_365' for different listings.
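
Several of the potential insights above come down to grouped summaries and simple ratios. The following sketch shows one way to compute them. The `toy` frame and all of its values are made up for illustration, and the column name `reviews_per_available_day` is an assumption introduced here.

```python
import pandas as pd

# Made-up listings; in the notebook the unencoded `df` would play this role
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "price": [200, 120, 150, 70, 60],
    "number_of_reviews": [10, 40, 5, 25, 30],
    "availability_365": [365, 0, 100, 50, 300],
})

# Listing count, median price and mean review count per neighbourhood group
by_group = toy.groupby("neighbourhood_group").agg(
    listings=("price", "size"),
    median_price=("price", "median"),
    mean_reviews=("number_of_reviews", "mean"),
)
print(by_group)

# Ratio of reviews to available days; zero availability becomes NaN
# so that the division never produces infinity
available_days = toy["availability_365"].where(toy["availability_365"] > 0)
toy["reviews_per_available_day"] = toy["number_of_reviews"] / available_days
print(toy[["number_of_reviews", "availability_365", "reviews_per_available_day"]])
```

Grouping on the original categorical column avoids having to reassemble categories from dummy columns.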

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Set the style for the plots
sns.set(style="whitegrid")

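# Create a bar plot of listing counts per neighbourhood group.
# The raw 'dataset' is used here: df_encoded only holds dummy columns
# (with the first category dropped), so it cannot show every group.
plt.figure(figsize=(10, 6))
sns.countplot(x='neighbourhood_group', data=dataset, palette="Set3",
              order=dataset['neighbourhood_group'].value_counts().index)
plt.title("Distribution of Listings by Neighbourhood Group")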
plt.xlabel("Neighborhood Group")
plt.ylabel("Number of Listings")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

I selected the bar plot (countplot) for visualizing the distribution of listings across different neighborhood groups. This choice was made due to several reasons:

**Categorical Representation**: The data consists of categorical information related to various neighborhood groups. A bar plot effectively displays the distribution of categories within a dataset.

**Comparative Visualization**: The bar plot facilitates easy comparison between different neighborhood groups by visually representing the number of listings in each group.

**Intuitive Interpretation**: Using bars to represent counts simplifies interpretation, ensuring that viewers can understand the distribution without complex analysis.

**Popularity Insights**: The chart offers insights into the popularity of each neighborhood group, enabling identification of groups with higher listing concentrations.

**Supports Project Objective**: Aligned with the project's goal of deriving insights from Airbnb data, this chart provides an initial understanding of listing distribution geographically.

##### 2. What is/are the insight(s) found from the chart?


**The chart's insights include**:

*   Manhattan has the highest listings count, followed by Brooklyn and Queens.
*   Staten Island has the fewest listings.
*   Manhattan and Brooklyn are popular for Airbnb rentals, possibly due to attractions and accessibility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially create a positive business impact:

*   By identifying popular neighborhood groups like Manhattan and Brooklyn, Airbnb could focus on enhancing services and offerings in these areas to attract more guests and hosts.
*   Insights into less popular groups, like Staten Island, could guide targeted marketing efforts to increase listings and guest bookings.

No insight from this chart points directly to negative growth. The focus is on leveraging strengths and optimizing opportunities in high-demand areas.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
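# Create a scatter plot for price vs. number of reviews.
# 'df' holds the unscaled values, so both axes keep their original units
# (df_encoded would show Min-Max scaled values between 0 and 1).
plt.figure(figsize=(10, 6))
sns.scatterplot(x='price', y='number_of_reviews', data=df, alpha=0.5)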
plt.title("Price vs. Number of Reviews")
plt.xlabel("Price")
plt.ylabel("Number of Reviews")
plt.show()

##### 1. Why did you pick the specific chart?

I chose the scatter plot to visualize the relationship between the price of Airbnb listings and the number of reviews they have received. Here's the reasoning behind this choice:

**Quantitative Relationship**: The scatter plot is effective for showing the quantitative relationship between two numerical variables, in this case, the price and the number of reviews.

**Pattern Identification**: Scatter plots help identify any potential patterns, trends, or correlations between variables. We can determine whether there's a tendency for higher-priced listings to receive more reviews or if there's any dispersion in the data.

**Variable Distribution**: By plotting each data point, we can observe the distribution of listings across various price ranges and review counts, providing insights into the concentration of data points.

**Insight into Guest Behavior**: Understanding how price correlates with the number of reviews can provide insights into guest behavior. It helps to gauge whether guests are more likely to leave reviews for higher-priced listings, impacting host reputation and overall guest experience.

##### 2. What is/are the insight(s) found from the chart?


The scatter plot reveals the following insights:

*   The majority of listings have relatively low prices and a varying number of reviews.
*   There is a concentration of listings with low to moderate prices and a moderate number of reviews.
*   Some higher-priced listings receive a lower number of reviews, suggesting that price might not always correlate strongly with review count.
*   The plot does not show a clear linear trend between price and number of reviews, indicating that other factors beyond price might influence review behavior.
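
A visual impression of "no clear linear trend" can be backed by a number. The sketch below compares the Pearson coefficient, which measures linear association, with the Spearman coefficient, computed here as the Pearson correlation of the ranks. The data is made up so that the relationship is perfectly monotonic but not linear; it does not describe the real listings.

```python
import pandas as pd

# Made-up data: reviews fall as price rises, but not along a straight line
toy = pd.DataFrame({
    "price": [50, 100, 200, 250, 500],
    "number_of_reviews": [20, 10, 5, 4, 2],
})

# Pearson: strength of the linear relationship
pearson = toy["price"].corr(toy["number_of_reviews"])

# Spearman: Pearson correlation of the ranks, sensitive to any monotonic relationship
spearman = toy["price"].rank().corr(toy["number_of_reviews"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

When the two coefficients differ noticeably, the relationship is monotonic but curved, which a scatter plot alone can make hard to judge. In the notebook, the same two lines could be applied to the 'price' and 'number_of_reviews' columns.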

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the scatter plot can help create a positive business impact:

**Informed Pricing Strategy**: Understanding that higher-priced listings might not always result in more reviews can guide Airbnb's pricing strategy. They can focus on providing value beyond price to encourage more reviews and guest satisfaction.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Create a box plot of price by room type.
# The unencoded, unscaled 'df' is used so that all room types appear side by
# side in one chart and prices keep their original units.
plt.figure(figsize=(10, 6))
sns.boxplot(x='room_type', y='price', data=df)
plt.title("Distribution of Prices by Room Type")
plt.xlabel("Room Type")
plt.ylabel("Price")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

I chose the box plot to visualize the distribution of prices for different room types. Here's the rationale behind this choice:

**Comparative Distribution**: The box plot is ideal for comparing the distribution of a numerical variable (prices) across different categories (room types).

**Variability Insights**: It helps understand the spread, central tendency, and outliers within each room type's price distribution.

**Room Type Impact**: The chart allows observing whether prices significantly differ between private and shared rooms, aiding in pricing strategies.

**Identify Trends**: Patterns, such as higher median prices for private rooms, can be quickly identified.

##### 2. What is/are the insight(s) found from the chart?

The box plots reveal the following insights:

Private Room Pricing: Properties listed as private rooms tend to have a wider price range, with some higher-priced outliers.

Shared Room Pricing: Shared room listings generally have lower prices and a narrower price distribution compared to private rooms.

Price Variation: Private rooms exhibit more price variability, indicating a broader range of pricing strategies among hosts for this room type.

Shared Room Consistency: Shared room prices are more consistent within a relatively lower price range, suggesting less variability in pricing for this type.

In summary, the box plots highlight the pricing distribution differences between private rooms and shared rooms, providing insights into the pricing strategies and variability associated with each room type.
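
The quantities a box plot draws (quartiles, median and interquartile range) can also be tabulated, which makes the comparison between room types explicit. This is a minimal sketch on made-up prices; the figures are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up prices, four listings per room type
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4 + ["Shared room"] * 4,
    "price": [120, 150, 180, 250, 50, 70, 90, 130, 30, 35, 40, 45],
})

# Quartiles per room type, one row per type
stats = toy.groupby("room_type")["price"].quantile([0.25, 0.5, 0.75]).unstack()
stats.columns = ["q1", "median", "q3"]

# Interquartile range: the height of each box in the plot
stats["iqr"] = stats["q3"] - stats["q1"]
print(stats)
```

A larger interquartile range corresponds to a taller box, so the table gives a direct numeric reading of how much prices vary within each room type.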

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the box plots can help create a positive business impact:

Optimized Pricing Strategy: By observing the distribution of prices for different room types, Airbnb can tailor their pricing strategies to match guest preferences. For instance, they can adjust prices for private and shared rooms based on the willingness of guests to pay for each type.
There are no insights that lead to negative growth. The focus is on leveraging pricing insights to optimize business strategies rather than identifying negative impacts.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
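# Calculate the correlation matrix on numeric and boolean columns only
# (text columns such as 'name' make corr() fail on recent pandas versions)
correlation_matrix = df_encoded.select_dtypes(include=['number', 'bool']).corr()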

# Set up the matplotlib figure
plt.figure(figsize=(12, 8))

# Create a heatmap of the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title("Correlation Heatmap of Numerical Variables")
plt.show()

##### 1. Why did you pick the specific chart?

I chose the correlation heatmap to visualize the relationships between numerical variables in the dataset. Here's the reasoning:

Comprehensive Insight: The heatmap provides a comprehensive view of the correlations between multiple numerical variables at once, allowing us to identify which variables are positively, negatively, or weakly correlated.

Data Exploration: It helps in understanding how different variables interact with each other. Positive correlations suggest that as one variable increases, the other tends to increase, and vice versa.

Pattern Identification: The heatmap aids in identifying potential patterns or trends between variables. Strong correlations could indicate dependencies that could impact decision-making.

Variable Selection: This visualization can help guide variable selection for further analysis, focusing on variables with meaningful relationships.

##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap provides the following insights:

*   There is a positive correlation between the number of reviews and reviews per month, indicating that listings with more reviews tend to have higher reviews per month.
*   Price shows a negative correlation with the private room indicator, suggesting that private rooms tend to have lower prices than the other room types taken together, most notably entire homes.
*   There is a positive correlation between availability around the year and minimum nights, indicating that listings with longer minimum stay requirements might have more availability.
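
Reading the strongest relationships off a large annotated heatmap is error-prone, so it can help to list them. The sketch below ranks variable pairs by absolute correlation. `strongest_pairs` is a hypothetical helper and the data is made up for illustration.

```python
from itertools import combinations

import pandas as pd


def strongest_pairs(frame: pd.DataFrame, top: int = 3) -> pd.DataFrame:
    """List variable pairs ordered by the absolute value of their Pearson correlation."""
    corr = frame.corr()
    # Visit each unordered pair once, which skips the diagonal and mirrored entries
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["var_a", "var_b", "r"])
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top).reset_index(drop=True)


# Made-up numeric columns
toy = pd.DataFrame({
    "number_of_reviews": [10, 20, 30, 40, 50],
    "reviews_per_month": [1, 2, 3, 4, 5],
    "price": [100, 90, 120, 80, 110],
})

top = strongest_pairs(toy)
print(top)
```

In the notebook, the helper could be applied to the numeric columns of `df_encoded` to confirm which pairs actually stand out.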

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select a subset of columns for the pair plot
pairplot_columns = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

# Create a pair plot
sns.pairplot(df_encoded[pairplot_columns])
plt.suptitle("Pair Plot of Numerical Variables", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?


I chose the pair plot to visualize the relationships between selected numerical variables. Here's the reasoning behind this choice:

Comprehensive Insights: The pair plot allows us to visualize the relationships between multiple numerical variables simultaneously, providing a comprehensive view of potential correlations and patterns.

Multivariate Exploration: It's effective for exploring how different numerical variables interact with each other and identify potential trends, clusters, or outliers.

Data Exploration: Pair plots help in uncovering hidden relationships between variables, which might not be immediately apparent when looking at individual scatter plots.

Quick Comparison: The matrix of scatter plots enables a quick visual comparison between multiple pairs of variables, facilitating efficient data exploration.

##### 2. What is/are the insight(s) found from the chart?

The pair plot provides the following insights:

*   There is a positive linear correlation between the number of reviews and the reviews per month, indicating that listings with more reviews tend to have higher reviews per month.
*   The majority of listings have low to moderate prices and minimum nights requirements.
*   Listings with a higher number of reviews often have more availability around the year.
*   There is no clear linear correlation between the calculated host listings count and other variables, indicating that this factor might not strongly influence other numerical variables.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective of using the Airbnb dataset to identify key insights and improve the Airbnb platform, I would suggest the following actions to the client:

Optimize Pricing Strategy: Utilize insights from the analysis of price distributions for different room types. Adjust pricing strategies to align with guest preferences and maximize revenue. Consider the fact that higher prices might not always lead to more reviews, and focus on value-added services.

Enhance Customer Experience: Leverage insights from reviews and review-related metrics. Identify factors contributing to positive reviews and prioritize those aspects to enhance the overall guest experience. Address any issues raised in reviews to improve customer satisfaction.

Host Experience Improvement: Explore the relationship between host experience (approximated in this analysis by the number of reviews, since the dataset has no host start date) and various metrics. Use this information to assist hosts in improving their listing quality and service, which in turn can lead to better guest experiences.

Seasonal Availability Management: Analyze availability patterns throughout the year. Adjust minimum nights requirements and pricing strategies during high-demand seasons to attract more guests and optimize occupancy rates.

Neighborhood Insights: Understand the distribution of listings across different neighborhood groups and specific neighborhoods. Focus marketing efforts on popular areas and improve offerings in areas with lower listing concentration.

Listing Type Optimization: Evaluate the distribution of listing types (e.g., private rooms, shared rooms, entire homes) and their respective prices. Refine the mix of listing types based on customer preferences to ensure a diverse range of offerings.

Host Engagement and Training: Provide insights to hosts on factors that positively impact the guest experience, reviews, and overall performance. Offer training or resources to help hosts enhance their listings and interactions with guests.

Continuous Analysis and Iteration: Regularly update the analysis to stay informed about changes in trends and guest preferences. Continuously adapt strategies based on new insights and evolving market dynamics.

# **Conclusion**

In conclusion, the analysis of the Airbnb dataset has provided valuable insights that can guide strategic decisions to improve the Airbnb platform and better serve its customers. By examining various aspects of the dataset, we have gained insights into pricing strategies, guest experiences, host engagement, and neighborhood preferences. These insights can be instrumental in optimizing various facets of the business to create a positive impact.

Key takeaways from the analysis include the importance of value-driven pricing, prioritizing customer satisfaction through improved guest experiences, and leveraging host expertise to enhance listing quality. Additionally, understanding neighborhood trends and seasonal availability can aid in targeted marketing efforts and overall business growth.

Moving forward, it is recommended to continuously monitor and analyze the data to stay updated with changing market dynamics and evolving guest preferences. By incorporating data-driven decision-making into the business strategy, Airbnb can foster growth, innovation, and a better overall experience for both hosts and guests.

Ultimately, the knowledge extracted from the Airbnb dataset can contribute to the company's mission of creating a seamless and enjoyable travel experience, fostering positive guest reviews, and strengthening its position in the competitive hospitality market.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***