# **Project Name**    - AirBnb Bookings Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual




# **Project Summary -**

Since its inception in 2008, Airbnb has redefined the travel and hospitality industry by offering a unique and personalized way to find accommodations. The platform connects millions of guests and hosts worldwide, generating an immense amount of data. This data can provide actionable insights for improving customer experiences, guiding business strategies, and optimizing platform performance. With the growing reliance on data-driven decisions, it has become increasingly important to analyze Airbnb’s data to understand customer behavior, pricing trends, and other critical factors impacting the platform's success.

This project focuses on conducting an Exploratory Data Analysis (EDA) of the Airbnb NYC 2019 dataset. The dataset contains 48,895 rows and 16 columns, offering a comprehensive snapshot of Airbnb listings in New York City. These columns include information about hosts, listing locations, room types, prices, availability, and customer reviews. The primary objective of this analysis is to extract meaningful patterns and insights that can help Airbnb’s stakeholders, including management, hosts, and customers, make well-informed decisions.

# **GitHub Link -**

Link - https://github.com/Chetna03/AirBnb-Bookings-Analysis-EDA

# **Problem Statement**


The goal of this analysis is to explore and understand the factors that affect the performance of Airbnb listings, including pricing, booking behavior, and customer ratings. By conducting a comprehensive exploratory data analysis on the Airbnb dataset, we aim to uncover insights related to the following:



1.   Price Analysis: What factors (e.g., location, amenities, property type) influence the price of Airbnb listings?
2.   Customer Ratings: How do factors such as cleanliness, host responsiveness, and listing features affect customer ratings and reviews?
3.   Location and Popularity: How does the geographical location of listings impact their popularity, pricing, and booking frequency?
4.   Host Characteristics: What are the characteristics of successful hosts, and how do their listings differ from less successful ones?
5.   Booking Patterns: What patterns can be observed in booking frequency over time (e.g., seasonal trends, demand fluctuations)?

By examining these questions, we aim to extract actionable insights that could help Airbnb hosts optimize their listings and improve customer satisfaction, while also providing recommendations for potential pricing strategies, property improvements, and marketing techniques.

This analysis will leverage various data visualization and statistical techniques to uncover hidden patterns and relationships in the dataset.

#### **Define Your Business Objective?**

The primary objective of this exploratory data analysis is to provide actionable insights that help Airbnb hosts, property managers, and the platform itself optimize their business performance. Specifically, the analysis aims to:


1.   Maximize Revenue: Identify the key factors that influence pricing, and develop strategies to help hosts optimize their pricing models based on property type, location, and market demand.
2.   Improve Guest Experience: Analyze customer ratings and reviews to identify areas of improvement for hosts, such as cleanliness, communication, and amenities, to enhance guest satisfaction and improve ratings.
3.   Increase Booking Rates: Understand seasonal trends and demand patterns to help hosts adjust their availability and pricing accordingly, leading to better booking rates.
4.   Targeted Marketing: Identify trends and patterns in customer preferences, enabling hosts to create targeted marketing strategies that appeal to specific customer segments.
5.   Optimize Listings: Uncover trends related to successful listings, allowing hosts to better tailor their offerings to attract guests and differentiate themselves in a competitive market.

The overall aim is to help Airbnb hosts make data-driven decisions that lead to higher occupancy rates, improved customer satisfaction, and ultimately increased profitability, while also assisting the platform in refining its services and user experience.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('https://raw.githubusercontent.com/Chetna03/AirBnb-Bookings-Analysis-EDA/main/Airbnb%20NYC%202019.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

In [None]:
# Check for duplicate values on the id
df[df.duplicated(subset=["id"])]

In [None]:
# Check for duplicate values on the name
df[df.duplicated(subset=["name"])]

In [None]:
# Check for the differences in the duplicated name values
df[df["name"] == "Loft Suite @ The Box House Hotel"]

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values with a bar chart

missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]

plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar', color='skyblue')
plt.title("Missing Values Count by Column", fontsize=16)
plt.xlabel("Columns", fontsize=12)
plt.ylabel("Number of Missing Values", fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

### What did you know about your dataset?

The dataset contains details about Airbnb listings in NYC for 2019, including features like neighborhood, room type, price, and availability.

**Dataset Information:**

The dataset contains 16 columns with different data types: integers, floats, and strings.

**Duplicate Values:**

There are 0 duplicate rows in the dataset.
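
A full-row duplicate check and a per-column duplicate check answer different questions, which is why the notebook runs both. The following is a minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from the dataset) showing that rows can share a `name` without being duplicate rows.

```python
import pandas as pd

# Hypothetical toy frame: two listings share a name but differ elsewhere
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Loft Suite", "Loft Suite", "Garden Room"],
    "price": [199, 179, 90],
})

# Rows that are identical across every column
full_row_dupes = toy.duplicated().sum()

# Rows whose name appears more than once (keep=False flags every occurrence)
name_dupes = toy.duplicated(subset=["name"], keep=False).sum()

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Rows sharing a name: {name_dupes}")
```

Repeated names are therefore not a data-quality problem on their own; they only matter if the remaining columns match as well.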

**Missing/Null Values:**

Key columns with missing values:

* name - 16 missing entries
* host_name - 21 missing entries
* last_review & reviews_per_month - 10,052 missing entries each
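
Raw counts are easier to judge as a share of all rows; for example, 10,052 of 48,895 rows is about 20.6%. The sketch below shows one way to build such a summary. It runs on a small hypothetical frame (`toy`), so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the listings table
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "reviews_per_month": [0.5, np.nan, np.nan, 2.0],
    "price": [100, 80, 150, 60],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, largest first
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_count", ascending=False)

print(missing_summary)
```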


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
df[df['price'] == 0]

In [None]:
df[df['minimum_nights'] > 365]

### Variables Description

*  id - Unique ID of the listing
*  name - Name of the listing
*  host_id - Unique ID of the host
*  host_name - Name of the host
*  neighbourhood_group - Broad area of the city (borough)
*  neighbourhood - Specific locality within the area
*  latitude - Latitude of the listing
*  longitude - Longitude of the listing
*  room_type - Type of accommodation (Entire home/apt, Private room, Shared room)
*  price - Price per night
*  minimum_nights - Minimum number of nights per booking
*  number_of_reviews - Total number of reviews
*  last_review - Date of the last review
*  reviews_per_month - Average number of reviews per month
*  calculated_host_listings_count - Number of listings by the host
*  availability_365 - Number of days the listing is available in a year


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique().to_frame("Unique Values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Changing data types
df['last_review'] = pd.to_datetime(df['last_review'])
df['neighbourhood_group'] = df['neighbourhood_group'].astype('category')
df['neighbourhood'] = df['neighbourhood'].astype('category')
df['room_type'] = df['room_type'].astype('category')

In [None]:
# Handling missing values
df['name'] = df['name'].fillna('Unknown')
df['host_name'] = df['host_name'].fillna('Unknown')
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
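# Placeholder date for listings that were never reviewed (kept as a Timestamp to match the column dtype)
df['last_review'] = df['last_review'].fillna(pd.Timestamp('2000-01-01'))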

In [None]:
# Remove unimportant columns
# df.drop(columns=['id','host_id'], inplace=True)

In [None]:
# New Columns
df['never_reviewed'] = df['number_of_reviews'] == 0

price_bins = [0, 50, 150, 300, 10000]
price_labels = ['Budget', 'Mid-range', 'Premium', 'Luxury']
df['price_category'] = pd.cut(df['price'],bins=price_bins,labels=price_labels)

stay_bins = [0, 3, 7, 30, 365]
stay_labels = ['Short Stay','Week Stay','Month Stay','Extended Stay']
df['stay_category'] = pd.cut(df['minimum_nights'],bins=stay_bins,labels=stay_labels)

# Rough proxy only: assumes every available day is booked at the listed nightly price
df['estimated_revenue_per_year'] = df['price']*df['availability_365']

In [None]:
# Removing Outliers
df = df[df['price'] > 0]
df = df[df['minimum_nights'] <= 365].copy()  # .copy() avoids SettingWithCopyWarning when new columns are added later

In [None]:
# Log-transform the price to reduce right skew; np.log1p computes log(1 + x)
df['log_price'] = np.log1p(df['price'])

In [None]:
df = df.rename(columns={
    'id': 'listing_id',                   # Unique identifier for each listing
    'name': 'listing_name',               # Name/title of the Airbnb listing
    'host_id': 'host_unique_id',          # Unique ID of the host
    'host_name': 'host_full_name',        # Full name of the host
    'neighbourhood_group': 'city_area',   # Broad area in the city
    'neighbourhood': 'locality',          # Specific locality of the listing
    'latitude': 'lat',                    # Latitude of the listing
    'longitude': 'lon',                   # Longitude of the listing
    'room_type': 'accommodation_type',    # Type of room (Entire home, Private room, etc.)
    'price': 'nightly_rate',              # Price per night for the listing
    'minimum_nights': 'min_stay',         # Minimum stay required
    'number_of_reviews': 'total_reviews', # Total number of reviews received
    'last_review': 'last_review_date',    # Date of the last review
    'reviews_per_month': 'avg_reviews_per_month',  # Average reviews per month
    'calculated_host_listings_count': 'total_listings_by_host', # Number of listings by the host
    'availability_365': 'availability_days_per_year', # Number of days available in a year
    'last_review_year': 'review_year'     # No such column exists at this point; rename() silently skips missing keys
})



### What all manipulations have you done and insights you found?

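The main manipulations applied to the dataset were:

* Converted `last_review` to datetime and `neighbourhood_group`, `neighbourhood`, and `room_type` to the `category` dtype.
* Filled missing `name` and `host_name` values with 'Unknown', missing `reviews_per_month` with 0, and missing `last_review` with the placeholder date 2000-01-01.
* Added the derived columns `never_reviewed`, `price_category`, `stay_category`, `estimated_revenue_per_year`, and `log_price`.
* Removed listings with a price of 0 or a minimum stay longer than 365 nights.
* Renamed the columns to more descriptive names (for example, `price` to `nightly_rate`).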
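
One detail of the binning step is worth spelling out: by default `pd.cut` builds right-closed intervals such as (0, 50], so a value equal to the lowest edge is left unlabelled. The sketch below reuses the price bins from the wrangling code on a few hypothetical prices chosen to sit on the bin edges; the `include_lowest=True` variant is shown only for comparison and is not what the notebook uses.

```python
import pandas as pd

# Hypothetical prices placed on and around the bin edges
prices = pd.Series([0, 50, 51, 150, 300, 10000])

price_bins = [0, 50, 150, 300, 10000]
price_labels = ["Budget", "Mid-range", "Premium", "Luxury"]

# Default: intervals are (0, 50], (50, 150], ... so a price of 0 gets NaN
default_cut = pd.cut(prices, bins=price_bins, labels=price_labels)

# include_lowest=True closes the first interval on the left, so 0 becomes 'Budget'
inclusive_cut = pd.cut(
    prices, bins=price_bins, labels=price_labels, include_lowest=True
)

print(pd.DataFrame({
    "price": prices,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

In this notebook the unlabelled rows are harmless, because listings with a price of 0 are removed in the outlier step anyway.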

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Distribution of Airbnb listings across different Stay Categories (Short, Week, Month, Extended)

In [None]:
# Set Seaborn style
sns.set_style("whitegrid")

# Create a figure with a specified size
plt.figure(figsize=(8, 5))

# Create a count plot for 'stay_category' with hue for differentiation
ax = sns.countplot(
    x="stay_category", # We want to analyse 'stay_category' column
    data=df, # Dataframe name
    hue="stay_category",  # Different colours for each 'stay_category'
    palette=plt.cm.Set2.colors,  # Set color palette
    linewidth=2,  # Add border around bars
    width=0.5  # Adjust bar width
)

# Loop through each bar (patch) in the count plot
for p in ax.patches:
    ax.annotate(
        f'{int(p.get_height())}',  # Convert the bar height (count) to an integer and format it as text
        (p.get_x() + p.get_width() / 2, p.get_height()),  # Position the label at the center-top of each bar
        ha='center',  # Horizontally center the text
        va='bottom',  # Align text just above the bar
        fontsize=10  # Set font size
    )


# Set labels and title
plt.xlabel("Stay Category", fontsize=12, labelpad=12)
plt.ylabel("Number of Listings", fontsize=12, labelpad=12)
plt.title("Distribution of Stay Categories in Airbnb Listings", fontsize=14, pad=15)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

* A count plot is well suited to categorical data, as it shows the number of listings in each stay_category.
* It provides a quick and clear comparison of different stay types.

##### 2. What is/are the insight(s) found from the chart?

*   The majority of listings cater to short-term travelers: they require a minimum booking of only 1-3 nights and mainly target tourists, business travelers, etc.
*   Around 15,000 listings require a minimum booking in the Week Stay or Month Stay range, making them suitable for work trips, short relocations, or staycations.
* Only a small number of listings have a minimum booking requirement of more than 30 nights, making this the least common option.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:

* Current or new hosts can focus on short stay bookings since they dominate the market.
* Offering special pricing or discounts for weekly or monthly stays might improve occupancy rates.
* If market demand shifts toward long-term stays, expanding extended stay options could be profitable, as it brings higher, more stable earnings with reduced overhead costs.


⚠️ Potential Negative Growth:

* High competition in short stays could lead to price wars and lower profits.
* If demand for long stays increases, lack of listings in extended stays might be a missed business opportunity.

#### Chart - 2 : Distribution of Airbnb listings across different Price Categories (Budget, Mid-range, Premium, Luxury)

In [None]:
import matplotlib.pyplot as plt

# Count occurrences of each price category
price_category_counts = df['price_category'].value_counts()

# Set figure size
plt.figure(figsize=(8, 8))

# Create Pie Chart
plt.pie(price_category_counts,
        labels=[f"{label} ({count})" for label, count in zip(price_category_counts.index, price_category_counts)],  # Show category name with count
        autopct='%1.1f%%',  # Display percentage values
        startangle=120,  # Rotate the chart for better visibility
        colors=plt.cm.Set2.colors,  # Use a predefined color palette
        wedgeprops={'edgecolor': 'black'})  # Add black edges for better separation

# Add Title
plt.title('Distribution of Price Categories', fontsize=14, pad=15)  # Set title with padding

# Display the Pie Chart
plt.show()


##### 1. Why did you pick the specific chart?

* A pie chart is ideal for visualizing the proportion of each price category in the dataset.
* It clearly shows the market share of Budget, Mid-range, Premium, and Luxury listings.

##### 2. What is/are the insight(s) found from the chart?

*   Most listings fall in the Mid-range category (\$50 to \$150 per night), which suggests that hosts mainly target travelers who want affordable stays without giving up quality.
*   Luxury listings are significantly fewer, indicating a smaller high-end market.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:

* Since most listings fall in the lower price categories, Airbnb can offer discounts, promotions, and travel bundles to boost bookings.
* The low number of luxury listings suggests a gap in the high-end market. Airbnb can encourage hosts to list high-end properties and market them to business travelers and high-income tourists.


⚠️ Potential Negative Growth:

* A lack of luxury listings may push away high-spending travelers to competitors like premium hotels, limiting market reach.
* If listing prices stay concentrated at the lower end, it could affect Airbnb’s brand perception, making it seem like a low-cost platform instead of a diverse marketplace.

#### Chart - 3 : Distribution of Airbnb Listings Across Different Accommodation Types (Entire Home, Private Room, Shared Room)

In [None]:
import matplotlib.pyplot as plt  # Import Matplotlib for visualization

# Count occurrences of each accommodation type
accommodation_counts = df['accommodation_type'].value_counts().reset_index()
accommodation_counts.columns = ['accommodation_type', 'count']  # Rename columns for clarity

# Sort data in ascending order for better visualization
# accommodation_counts = accommodation_counts.sort_values(by='count', ascending=True)

# Create the figure
plt.figure(figsize=(7, 4))

# Plot horizontal solid lines (Lollipop sticks)
plt.hlines(
    y=accommodation_counts['accommodation_type'],  # Y-axis values (accommodation types)
    xmin=0,  # Start of the line at 0 on the x-axis
    xmax=accommodation_counts['count'],  # End of the line at the count value
    color='seagreen',  # Line color
    linestyle='solid',  # Solid line style
    linewidth=3  # Line thickness
)

# Plot scatter points (Lollipop heads)
plt.scatter(
    accommodation_counts['count'],  # X-axis values (listing counts)
    accommodation_counts['accommodation_type'],  # Y-axis values (accommodation types)
    color='seagreen',  # Point color
    s=60  # Size of the points
)

# Add count annotations next to points
for i in range(len(accommodation_counts)):
    plt.annotate(
        f"{accommodation_counts['count'].iloc[i]}",  # Text: count value
        (accommodation_counts['count'].iloc[i], accommodation_counts['accommodation_type'].iloc[i]),  # Position of annotation
        textcoords="offset points",  # Use offset points for better placement
        xytext=(10, -3),  # Move text slightly (10 right, 3 down)
        ha='left',  # Align text to the left
        fontsize=10,  # Font size of annotation text
        bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=0.3')  # Box styling for readability
    )

# Adjust axis limits for better spacing
plt.ylim(-0.5, len(accommodation_counts) - 0.5)  # Adds space above and below y-axis labels
plt.xlim(0, accommodation_counts['count'].max() * 1.3)  # Extends x-axis by 30% beyond max count

# Labels and title
plt.xlabel('Number of Listings', fontsize=12, labelpad=13)
plt.ylabel('Accommodation Type', fontsize=12, labelpad=13)
plt.title('Listings Count by Accommodation Type', fontsize=14, pad=15)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

* The lollipop chart effectively shows the distribution of different accommodation types.
* Compared to a bar chart, it provides a cleaner representation with minimal clutter, making it easier to compare categories.

##### 2. What is/are the insight(s) found from the chart?

* "Entire home/apt" and "Private room" are the most common accommodation types, with more than 40,000 listings combined (around 98% of all listings), which suggests that hosts mostly offer, and guests mostly expect, a private space.
* Shared rooms have minimal listings, indicating lower market demand.
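
Percentage shares like the one quoted above can be checked directly with `value_counts(normalize=True)`. This is a minimal sketch on a made-up series of 100 room types; the 52/46/2 split is a hypothetical stand-in, not the dataset's actual counts.

```python
import pandas as pd

# Hypothetical sample of 100 listings (illustrative split only)
room_types = pd.Series(
    ["Entire home/apt"] * 52 + ["Private room"] * 46 + ["Shared room"] * 2,
    name="accommodation_type",
)

# normalize=True returns fractions instead of raw counts
shares = (room_types.value_counts(normalize=True) * 100).round(1)

# Combined share of the two dominant types
combined = shares["Entire home/apt"] + shares["Private room"]

print(shares)
print(f"Entire home/apt + Private room: {combined:.1f}%")
```

On the real data, the same two lines applied to `df['accommodation_type']` would give the exact shares behind the chart.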

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:

* Since "Entire home/apt" and "Private room" make up 98% of the listings, focusing on these types ensures maximum market reach and revenue potential.
* Offering discounts on shared rooms can test whether low listings are due to limited supply or genuinely low demand, helping identify untapped market potential.

⚠️ Potential Negative Growth:

* "Entire home/apt" dominance in the market creates excessive competition which could lower prices, reducing profitability for hosts in crowded markets.
* With only 2% of listings being shared rooms, investing in this category has a high risk of low occupancy and low revenue, making it an unprofitable choice.

#### Chart - 4 : Distribution of Airbnb Listings Across Top 10 City-Locality Combinations

In [None]:
# Combine 'city_area' and 'locality' into a single column for grouping
df['city_locality'] = df['locality'].astype(str) + ', ' + df['city_area'].astype(str)

# Group by 'city_locality' and count the number of listings
city_locality_counts = df.groupby('city_locality').size().reset_index(name='Total Listings')

# Sort the data in descending order based on the total listings
city_locality_counts = city_locality_counts.sort_values(by='Total Listings', ascending=False)

# Select the top 10 city-locality combinations
city_locality_counts = city_locality_counts.head(10)

# Set figure size
plt.figure(figsize=(10, 6))

# Create a horizontal bar plot
ax = sns.barplot(
    x='Total Listings',  # Set the x-axis to display the count of listings
    y='city_locality',   # Set the y-axis to display the city-locality names
    hue='city_locality',
    data=city_locality_counts,  # Use the grouped DataFrame as the data source
    palette="Greens_r"  # Set palette for visual appeal
)

# Add labels to each bar to display the exact count
for index, value in enumerate(city_locality_counts['Total Listings']):
    ax.text(
        value + 2,  # Position the text slightly to the right of the bar
        index,       # Align with the corresponding bar on the y-axis
        str(value),  # Convert count to string and display as text
        va='center', # Vertically center the text on the bar
        fontsize=10  # Set font size for readability
    )

# Set plot labels and title
plt.xlabel('Total Listings')  # Label for x-axis
plt.ylabel('City & Locality') # Label for y-axis
plt.title('Top 10 City & Locality Listings') # Chart title

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

* A horizontal bar chart effectively displays categorical data.
* The descending order of bars makes it easy to identify the top localities.
* Long locality names fit better on the y-axis compared to a vertical bar chart.

##### 2. What is/are the insight(s) found from the chart?

* Brooklyn and Manhattan dominate the top 10, indicating high Airbnb activity in these areas.
* Williamsburg, Brooklyn (3,917 listings) has the highest number of Airbnb listings, followed by Bedford-Stuyvesant, Brooklyn (3,709 listings).
* Among the top 10 areas, Brooklyn has only 4 localities but the highest number of listings within them, whereas Manhattan has more localities but relatively fewer listings per locality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:

* Investors and property owners can focus on top-performing localities (Williamsburg, Bedford-Stuyvesant, Harlem, Bushwick) to maximize returns, as they have high listing concentrations and audience.
* Since Manhattan has more localities but fewer listings in each, new hosts can choose less crowded areas to face less competition while still attracting tourists.

⚠️ Potential Negative Growth:

* Areas like Williamsburg and Bedford-Stuyvesant have a high number of listings, increasing competition and price pressure.
* If all hosts focus on the same top listing areas, it would lower the likelihood of opening Airbnbs in new neighborhoods.

#### Chart - 5 : Distribution of Average Reviews Per Month (Log Scaled) for Airbnb Listings

In [None]:
# Set figure size for better visibility
plt.figure(figsize=(8, 5))

# Plot Kernel Density Estimate (KDE) of log-transformed 'avg_reviews_per_month'
# np.log1p(x) is used to apply log transformation while handling zero values (log(1 + x))
sns.kdeplot(np.log1p(df['avg_reviews_per_month']), fill=True, color="orange")

# Compute the minimum and maximum log-transformed values for axis scaling
log_min = np.log1p(df['avg_reviews_per_month']).min()
log_max = np.log1p(df['avg_reviews_per_month']).max()

# Generate 15 evenly spaced tick positions in log scale for the x-axis
ticks = np.linspace(log_min, log_max, 15)  # Adjust num for more/fewer ticks

# Convert log-scale ticks back to original scale for meaningful interpretation
tick_labels = [round(np.expm1(t), 1) for t in ticks]  # np.expm1(t) = e^t - 1

# Set custom x-axis ticks with corresponding labels in original scale
plt.xticks(ticks, tick_labels)

# Generate 15 evenly spaced tick positions for the y-axis based on plot limits
# plt.gca().get_ylim() returns the (min, max) limits of the y-axis
y_min, y_max = plt.gca().get_ylim()  # Extract y-axis limits
y_ticks = np.linspace(y_min, y_max, 15)  # Create 15 evenly spaced y-ticks

# Format y-axis labels to 2 decimal places
y_labels = [round(y, 2) for y in y_ticks]

# Set custom y-axis ticks
plt.yticks(y_ticks, y_labels)

# Set axis labels and title
plt.xlabel("Average Reviews Per Month")  # Original scale for better understanding
plt.ylabel("Density")
plt.title("KDE Plot (Log Scaled) of Average Reviews Per Month")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

* A KDE Plot helps visualize the distribution of the avg_reviews_per_month column more smoothly than a histogram.
* The data is expected to be right-skewed, so a log transformation spreads out the crowded low values and makes patterns easier to see.
* KDE plots provide insights into the density and spread of the reviews rather than just counts.
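
The chart code relies on `np.log1p` for the transformation and `np.expm1` to turn tick positions back into readable values. The sketch below shows that round trip on a few hypothetical review rates, including 0, which a plain logarithm could not handle.

```python
import numpy as np

# Hypothetical reviews-per-month values, including a zero
reviews = np.array([0.0, 0.5, 1.0, 5.0, 58.5])

# log1p(x) = log(1 + x), so 0 maps to 0 instead of -inf
logged = np.log1p(reviews)

# expm1 is the exact inverse and recovers the original scale
restored = np.expm1(logged)

# Evenly spaced positions on the log axis, labelled in original units
ticks = np.linspace(logged.min(), logged.max(), 5)
tick_labels = np.round(np.expm1(ticks), 1)

print("log1p values:", np.round(logged, 3))
print("tick labels :", tick_labels)
```

Because the ticks are evenly spaced in log space, their labels grow faster and faster on the original scale, which is what stretches out the crowded low end of the distribution.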

##### 2. What is/are the insight(s) found from the chart?

* The distribution is concentrated at the lower end (right-skewed), meaning most listings receive very few reviews per month.
* The long tail shows that a few listings receive significantly higher reviews per month, but they are rare.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Airbnb can analyze these successful listings to identify best practices (e.g., pricing, amenities, customer engagement) and apply them to other listings as well.
* Hosts can try to improve visibility of their listings through promotions, discounts, or better guest experiences to encourage more reviews.

⚠️ Potential Negative Growth:
* Hosts with lower reviews might fail to gain visibility and struggle to attract new guests. This can lead to lower occupancy rates and revenue losses.

#### Chart - 6 : Distribution of Hosts by Total Listings (Log Scaled)

In [None]:
# Count hosts by portfolio size
# Each host appears once per listing, so keep one row per host before counting;
# otherwise a host with N listings would be counted N times.
host_counts = (
    df.drop_duplicates(subset='host_unique_id')['total_listings_by_host']
    .value_counts()
    .sort_index()
)

# Create a line chart with figure size 10x6
plt.figure(figsize=(10, 6))

# Plot the data
plt.plot(
    host_counts.index,  # X-axis values: Total listings per host
    host_counts.values,  # Y-axis values: Number of hosts
    marker='o',  # Marker style: 'o' represents small circles at each data point
    linestyle='-',  # Line style: '-' represents a solid line connecting the points
    color='teal'  # Line color: Teal for better visualization
)

# Apply log scale to both axes for better visualization of distribution
plt.xscale('log')  # Log scale for x-axis (total listings per host)
plt.yscale('log')  # Log scale for y-axis (number of hosts)

# Set actual values as tick marks on both axes
x_min, x_max = host_counts.index.min(), host_counts.index.max()  # Get min and max values for x-axis
y_min, y_max = host_counts.values.min(), host_counts.values.max()  # Get min and max values for y-axis

# Generate log-spaced tick marks for x-axis
x_ticks = np.logspace(
    np.log10(x_min),  # Exponent of the starting value (10^start)
    np.log10(x_max),  # Exponent of the ending value (10^stop)
    num=10,  # Number of ticks to generate
    base=10  # Base of the log scale (default is 10)
).astype(int)  # Convert values to integers for clean axis labels

# Generate log-spaced tick marks for y-axis
y_ticks = np.logspace(
    np.log10(y_min),  # Exponent of the starting value (10^start)
    np.log10(y_max),  # Exponent of the ending value (10^stop)
    num=10,  # Number of ticks to generate
    base=10  # Base of the log scale (default is 10)
).astype(int)  # Convert values to integers for clean axis labels

plt.xticks(x_ticks, x_ticks)  # Set x-axis tick labels
plt.yticks(y_ticks, y_ticks)  # Set y-axis tick labels

# Add labels and title
plt.xlabel('Total Listings per Host')  # X-axis label
plt.ylabel('Number of Hosts')  # Y-axis label
plt.title('Distribution of Hosts by Total Listings')  # Chart title

# Add grid for better readability
plt.grid(
    True,  # Enable grid
    which="both",  # Apply grid to both major and minor ticks
    linestyle="--",  # Dashed line style for better visibility
    linewidth=0.5  # Thin grid lines to avoid cluttering the chart
)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

* The line chart effectively shows how the number of hosts declines as the number of listings per host increases.
* The logarithmic scale ensures better visibility of both small and large values, helping to uncover trends that would otherwise be hidden in a linear scale.

##### 2. What is/are the insight(s) found from the chart?

* Most Airbnb hosts list only one property, which suggests that the platform is dominated by individual homeowners.
* As the number of listings per host increases, the number of hosts drops sharply.
* Some hosts manage dozens or even hundreds of listings, showing the presence of large-scale operators or property management businesses.
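
A caveat when counting hosts by portfolio size: in a listing-level table, a host with several listings appears on several rows, so counting rows overstates the number of large hosts. The sketch below contrasts the two counts on a hypothetical five-row frame that borrows the renamed column names from this notebook.

```python
import pandas as pd

# Hypothetical listing-level data: hosts 1 and 2 have one listing, host 3 has three
toy = pd.DataFrame({
    "host_unique_id": [1, 2, 3, 3, 3],
    "total_listings_by_host": [1, 1, 3, 3, 3],
})

# Row-level count: the three-listing host is counted three times
listing_level = toy["total_listings_by_host"].value_counts().sort_index()

# Host-level count: keep one row per host before counting
host_level = (
    toy.drop_duplicates(subset="host_unique_id")["total_listings_by_host"]
    .value_counts()
    .sort_index()
)

print("Rows per portfolio size:")
print(listing_level)
print("Hosts per portfolio size:")
print(host_level)
```

Only the host-level series matches a y-axis labelled "Number of Hosts"; the row-level series answers the different question of how many listings belong to hosts of each size.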

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Airbnb can support large hosts, who bring in significant business, with tools for managing many properties at once, such as bulk price updates, automated messages, cleaning schedules, and earnings tracking, saving them time and increasing revenue.
* Giving small hosts more visibility helps them stand out and not get hidden by big hosts.


⚠️ Potential Negative Growth:
* If small hosts think they can't compete with big hosts, they might leave, leading to fewer listings on Airbnb.

#### Chart - 7 : Distribution of Listings by Availability Days

In [None]:
# Set figure size
plt.figure(figsize=(8, 5))

# Create histogram for availability_days_per_year
plt.hist(
    df['availability_days_per_year'],  # Data to plot
    bins=52,  # Number of bins (weekly distribution for a year)
    color='seagreen',  # Bar color
    edgecolor='black',  # Outline color for better visibility
    alpha=0.7  # Transparency level (0 = fully transparent, 1 = solid)
)

# Add labels and title
plt.xlabel('Availability Days Per Year')  # X-axis label
plt.ylabel('Number of Listings')  # Y-axis label
plt.title('Histogram of Availability Days Per Year')  # Chart title

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

* It shows the distribution clearly and helps visualize how listings are spread across different availability ranges.

##### 2. What is/are the insight(s) found from the chart?

* A large number of listings fall in the first bin (roughly 0 to 7 available days), meaning many hosts open their property for only a few days per year or not at all, likely when they are away or during peak tourist seasons and events.
* The large number of listings in the upper range (300+ days of availability) suggests that these properties are not casual listings but dedicated short-term rental businesses, run by hosts who treat Airbnb as a primary income source.
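
Chart 7 uses `bins=52`, which splits the 0 to 365 range into bins about a week wide. The sketch below uses `np.histogram` on a handful of hypothetical availability values to show where the first bin ends and that listings with 0 available days land in it.

```python
import numpy as np

# Hypothetical availability values spanning the full 0-365 range
availability = np.array([0, 0, 3, 7, 8, 200, 365])

# Same binning rule as the histogram: 52 equal-width bins over the data range
counts, edges = np.histogram(availability, bins=52)

bin_width = edges[1] - edges[0]

print(f"Bin width: {bin_width:.2f} days")
print(f"First bin covers {edges[0]:.2f} to {edges[1]:.2f} days")
print(f"Listings in first bin: {counts[0]}")
```

The first bar of the histogram therefore mixes listings that are never available with those open for up to about a week, so it should not be read as "1 to 7 days" only.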

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* With fewer properties offering month-long or year-round availability, new hosts can enter the market by focusing on extended stays.


⚠️ Potential Negative Growth:
* With so many listings already offering short stays, new hosts may struggle to stand out and attract guests.

#### Chart - 8 : Revenue Distribution of Listings (Log Scaled)

In [None]:
# Set figure size for better visualization
plt.figure(figsize=(12, 5))

# Create a boxplot to show revenue distribution
sns.boxplot(x=df['estimated_revenue_per_year'], color='teal')

# Apply log scale for better visualization of skewed revenue values
plt.xscale('log')

# Get the maximum revenue value from the dataset
max_revenue = df['estimated_revenue_per_year'].max()

# Generate log-spaced tick values for better readability on log scale
tick_values = np.logspace(
    0,  # Start at 10^0 (which is 1) to avoid zero values on a log scale
    np.log10(max_revenue),  # End at log10(max_revenue) to cover the entire range
    20,  # Generate 20 evenly spaced tick marks for clarity
    dtype=int  # Convert values to integers for cleaner tick labels
)

# Set tick labels and rotate for better readability
plt.xticks(tick_values, tick_values, rotation=90)

plt.xlabel('Estimated Revenue Per Year')  # Label x-axis
plt.title('Boxplot of Estimated Revenue Per Year')  # Add title to describe the visualization

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

* Boxplot helps identify the spread of estimated revenue, showing median, quartiles, and outliers clearly.



##### 2. What is/are the insight(s) found from the chart?

* 25% of listings earn ₹0 per year, meaning a significant portion of properties generate no revenue.
* Median revenue is around ₹4,000 - ₹5,000 per year, indicating that half of the listings earn less than this amount.
* A large portion of listings earn below ₹30,000 per year, as seen from the boxplot's main concentration.
* There are many extreme high-revenue outliers, indicating that a small percentage of listings earn disproportionately more.
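
The quartile figures quoted above can be confirmed directly from the column, which is useful because a value of zero has no position on a logarithmic axis and so cannot be read from the plot itself. This is a minimal sketch; the `toy` DataFrame and its values are hypothetical, and only the column name comes from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df; values are made up for illustration
toy = pd.DataFrame(
    {"estimated_revenue_per_year": [0, 0, 1500, 4000, 5000, 12000, 30000, 250000]}
)

rev = toy["estimated_revenue_per_year"]

# First quartile, median, and third quartile of estimated revenue
quartiles = rev.quantile([0.25, 0.5, 0.75])

# Share of listings with exactly zero estimated revenue
zero_share = (rev == 0).mean()

print(quartiles)
print(f"Listings with zero revenue: {zero_share:.0%}")
```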

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Airbnb hosts can study what makes high-earning properties successful, such as their location, amenities, pricing, and guest reviews. By following similar strategies, they can increase their own earnings.


⚠️ Potential Negative Growth:
* 25% of listings earn nothing, likely due to bad location, high prices, or low demand, which could lead to hosts leaving and less activity on the platform.
* The chart suggests that a small number of top listings account for a disproportionate share of revenue, so an economic downturn that affects them could cause large losses.

#### Chart - 9 : Distribution of Price Categories Across Accommodation Types

In [None]:
# Grouping data by accommodation type and price category, counting occurrences
grouped_data = df.groupby(['accommodation_type', 'price_category'], observed=False).size().reset_index(name='count')

# Set figure size for better visualization
plt.figure(figsize=(8, 6))

# Create a grouped bar chart
ax = sns.barplot(
    x='accommodation_type',  # X-axis: accommodation type
    y='count',               # Y-axis: count of listings
    hue='price_category',    # Grouped by price category
    data=grouped_data,       # Data source
    palette='viridis',       # Color theme
    width=0.8                # Adjust bar width for better spacing
)

# Annotate each bar with its height value (count of listings)
for p in ax.patches:
    if p.get_height() > 0:  # Avoid adding annotations to zero-height bars
        ax.annotate(
            text=int(p.get_height()),  # Convert height to an integer label
            xy=(p.get_x() + p.get_width() / 2, p.get_height()),  # Position at the top of the bar
            ha='center', va='bottom',  # Center align text
            fontsize=8, color='black'  # Text formatting
        )

# Set axis labels and title
plt.xlabel('Accommodation Type')
plt.ylabel('Count of Listings')
plt.title('Distribution of Price Categories Across Accommodation Types')

# Add a legend for price categories
plt.legend(title="Price Category")

# Add grid lines for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

* It effectively compares the different price categories within each accommodation type.
* It shows the proportion of each accommodation type in every price range, making it easy to identify dominant categories.

##### 2. What is/are the insight(s) found from the chart?

* For Entire home/apt -
  * Entire homes are in high demand for mid-range & premium travelers—ideal for families & groups.
  * Luxury entire homes exist, but competition is lower, offering potential for premium investment.
  * Budget entire homes are nearly non-existent, meaning either low demand or high operating costs.

* For Private room -
  * Private rooms are best for mid-range & budget travelers—the most competitive segment.
  * Premium & luxury private rooms have low demand, meaning they may not be worth the investment.
  * Private rooms are the top choice for affordability seekers, like solo travelers & backpackers.

* For Shared room -
  * Shared rooms are only viable for budget-conscious travelers, like backpackers or students.
  * Mid-range shared rooms exist but are far less preferred compared to private & entire homes.
  * Premium & luxury shared stays have no demand, so investing in them is not a good business decision.
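
Because the three accommodation types differ widely in total listings, it can help to express each price category as a share of its accommodation type before comparing them. The sketch below assumes the notebook's `accommodation_type` and `price_category` columns; the `toy` rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df; rows are made up for illustration
toy = pd.DataFrame({
    "accommodation_type": ["Entire home/apt"] * 4 + ["Private room"] * 4 + ["Shared room"] * 2,
    "price_category": [
        "Mid-range", "Mid-range", "Premium", "Luxury",  # Entire home/apt
        "Budget", "Budget", "Mid-range", "Mid-range",   # Private room
        "Budget", "Budget",                             # Shared room
    ],
})

# Row-normalized crosstab: each row sums to 100%
share = pd.crosstab(
    toy["accommodation_type"], toy["price_category"], normalize="index"
) * 100

print(share.round(1))
```

Note that these shares describe how listings are distributed, which is supply; they indicate demand only indirectly.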

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Mid-range & Premium account for most of the listings in the Entire home/apt category, so investing in these segments is likely to be profitable, as they attract families and tourists.
* Mid-range & Budget listings make up most of the Private room segment, showing that affordability is a key selling point for solo travelers & backpackers and making it a steady revenue stream.
* Luxury Entire Homes are few in number but still attract high-spending travelers, yielding higher profit margins per booking.
* Shared rooms are popular among budget travelers, confirming demand for low-cost stays and supporting stable revenue in the budget segment.

⚠️ Potential Negative Growth:
* Premium & Luxury Private Rooms are a risky market because very few such listings exist, suggesting that travelers don't prefer expensive private rooms in shared spaces.
* Travelers expect affordability in shared rooms, not premium or luxury experiences, so investing in premium shared stays is likely a waste of resources.

#### Chart - 10 : Distribution of Accommodation Types Across City Areas

In [None]:
# Group data: Count occurrences of each accommodation type in each city area
grouped_data = df.groupby(['city_area', 'accommodation_type'], observed=False).size().unstack(fill_value=0)

# Set up the plot
fig, ax = plt.subplots(figsize=(8, 6))

# Define colors for better visualization
rental_colors = [
    '#0A9396',
    '#94D2BD',
    '#EE9B00'
]

# Create a stacked bar chart
grouped_data.plot(
    kind='bar',        # Plot type → 'bar' creates a bar chart
    stacked=True,      # Stacked bars → Shows cumulative values for each accommodation type
    ax=ax,             # Use the predefined axis 'ax' for the plot
    color=rental_colors  # Apply the custom color scheme defined above
)

# Customize labels, title, and legend
plt.xlabel('Stay Area')  # X-axis label: Represents different city areas
plt.ylabel('Total Count')  # Y-axis label: Total number of accommodations
plt.title('Stacked Bar Chart of Accommodation Types by Stay Area')  # Chart title
plt.legend(title='Accommodation Type')  # Add a legend to indicate colors for each accommodation type

# Add a grid for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

* A stacked bar chart was chosen because it effectively visualizes the distribution of accommodation types across different city areas while maintaining the total count for each area.
* This allows us to see both the relative proportion and absolute numbers of each accommodation type in different regions at a glance.

##### 2. What is/are the insight(s) found from the chart?

* Manhattan has the highest number of entire home/apartment listings, making it a prime area for independent rentals.
* Brooklyn and Manhattan dominate the market, with a nearly equal number of total accommodations.
* Shared rooms are the least preferred accommodation type overall, with Manhattan and Brooklyn having the highest counts but still significantly lower than other categories.
* Staten Island has the lowest number of listings, making it a less competitive market compared to other areas.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Since Manhattan has the highest demand for entire homes/apartments, investors and hosts can charge premium prices in this area.
* Brooklyn has a high number of private rooms, making it a prime location for budget travelers and backpackers. Investing in more private rooms here could be profitable.
* Staten Island has very few listings, which could be a potential opportunity for new hosts to enter a market with less competition.

⚠️ Potential Negative Growth:
* Since Manhattan and Brooklyn already have a high volume of listings, new hosts there may struggle with stiff competition.
* Shared rooms have the least demand across all areas, which means investing in shared accommodations might not yield high returns compared to private rooms or entire homes.

#### Chart - 11 : Distribution of Accommodation Types Across Stay Categories

In [None]:
# Group data by 'stay_category' and 'accommodation_type', count occurrences, and reshape
stay_accommodation_counts = df.groupby(['stay_category', 'accommodation_type'], observed=True).size().unstack(fill_value=0)

# Calculate the total number of listings per stay category (row-wise sum)
total_listings_per_category = stay_accommodation_counts.sum(axis=1)

# Convert counts to percentages for a 100% stacked bar chart
accommodation_percentage = round(stay_accommodation_counts.div(total_listings_per_category, axis=0) * 100, 1)

# Define colors for visualization
accommodation_colors = ['#0A9396', '#94D2BD', '#EE9B00']  # Colors for different accommodation types

# Create a figure and axis for the plot
fig, ax = plt.subplots(figsize=(7, 5))

# Plot a 100% stacked bar chart
accommodation_percentage.plot(
    kind='bar',       # Create a bar chart
    stacked=True,     # Stack the bars (for a 100% stacked chart)
    ax=ax,           # Use the created axis 'ax' to plot
    color=accommodation_colors  # Assign custom colors to the bars
)

# Rotate x-axis labels for better readability
plt.xticks(rotation=0)

# Labels & Title
plt.xlabel('Stay Category', fontsize=12)
plt.ylabel('Percentage (%)', fontsize=12)
plt.title('Percentage Distribution of Accommodation Types by Stay Category', fontsize=14)

# Add percentage annotations on bars
for bar in ax.containers:
    ax.bar_label(
        bar,              # The current bar section
        fmt='%.1f%%',     # Format label as a percentage with 1 decimal place (e.g., 25.4%)
        label_type='center',  # Place label inside the center of the bar
        fontsize=10,       # Set font size of the labels
        color='black'      # Ensure label text is black for visibility
    )

# Add a grid for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

* A 100% stacked bar chart makes it easy to compare how different accommodation types are distributed across the stay categories.
* Since the total for each stay category is different, this chart helps in understanding relative proportions rather than absolute values.

##### 2. What is/are the insight(s) found from the chart?

* Private rooms & entire homes/apts have approximately similar shares for short stays, indicating that travelers prefer privacy for brief stays.
* Entire homes/apartments are more preferred for week-long, month-long, and extended stays compared to shared or private rooms. This suggests that people staying longer prefer more privacy and comfort.
* Shared rooms have the lowest proportion in all stay categories, showing they are the least preferred accommodation type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Businesses should invest more in entire homes/apartments and private rooms, as travelers prefer them regardless of stay duration.
* Shared rooms have limited demand, so hosts may need to bundle them with extra amenities or target niche audiences like budget travelers or students.

⚠️ Potential Negative Growth:
* If too many shared room listings exist, they might suffer from low occupancy rates and bring lower revenue.
* Property owners should assess whether it is viable to continue offering shared rooms or if converting them into private rooms would be more profitable.

#### Chart - 12 : Distribution of Price Categories Across City Areas

In [None]:
# Grouping data by city area and price category, counting occurrences
city_price_group_data = df.groupby(['city_area', 'price_category'], observed=False).size().reset_index(name='count')

# Set figure size for better visualization
fig, ax = plt.subplots(figsize=(9, 6))

# Define one color per city area (five boroughs), since hue is 'city_area'
colors = ['#0A9396', '#94D2BD', '#E9D8A6', '#EE9B00', '#CA6702']

# Create a horizontal grouped bar plot
sns.barplot(
    data=city_price_group_data,  # Data source
    x='count',  # X-axis: Number of listings
    hue='city_area',  # Bar color: City area
    y='price_category',  # Y-axis: Price category
    ax=ax,  # Use the predefined axis
    palette=colors  # Assign custom colors to each city area
)

# Add value labels to each bar for clarity
for container in ax.containers:
    for bar in container:
        ax.text(
            bar.get_x() + bar.get_width() + 2,  # Move label slightly to the right
            bar.get_y() + bar.get_height() / 2,  # Center vertically within the bar
            int(bar.get_width()),  # Convert count to integer for display
            ha='left',  # Align text to the left
            va='center',  # Center text vertically
            fontsize=9,  # Set font size
            color='black'  # Ensure good contrast for readability
        )

# Set labels and title for better understanding
ax.set_xlabel("Number of Listings", fontsize=12)  # Label for x-axis
ax.set_ylabel("Price Category", fontsize=12)  # Label for y-axis
ax.set_title("Price Distribution across City Area", fontsize=14)  # Chart title
ax.legend(title="City Area")  # Legend title

# Display the final plot
plt.show()


##### 1. Why did you pick the specific chart?

* It allows for easy comparison of price categories across different city areas.
* It also helps in identifying dominant price segments in each area.

##### 2. What is/are the insight(s) found from the chart?

* Brooklyn and Manhattan dominate the mid-range and premium categories, indicating high demand for moderately priced listings.
* Budget listings are relatively high in Queens and Brooklyn, suggesting affordability and a larger market for cost-conscious travelers.
* Luxury listings are concentrated in Manhattan and Brooklyn, while Staten Island has almost no luxury listings, signaling low demand for high-end accommodations there.
* The Bronx has a lower number of listings overall, showing it is a less popular choice for travelers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Hosts can optimize pricing strategies by aligning with area-specific demand trends.
* Investors can focus on Brooklyn and Manhattan for premium and luxury segments where there is clear demand.
* Budget-friendly listings in Queens and Bronx can be expanded to attract more price-sensitive travelers.

⚠️ Potential Negative Growth:
* Staten Island has very low demand across all categories, indicating limited business opportunities there. Investing in new listings may not be profitable.
* The Bronx has fewer premium and luxury listings, suggesting either low demand or an underserved market.

#### Chart - 13 : Relationship Between Nightly Rate and Estimated Revenue

In [None]:
plt.figure(figsize=(9, 5))
sns.scatterplot(data=df, x="nightly_rate", y="estimated_revenue_per_year", alpha=0.5)

plt.xscale("log")
plt.yscale("log")

# Generate log-spaced tick marks for x-axis
x_ticks = np.logspace(
    0,  # Exponent of the starting value (10^start)
    np.log10(df['nightly_rate'].max()),  # Exponent of the ending value (10^stop)
    num=10,  # Number of ticks to generate
    base=10  # Base of the log scale (default is 10)
).astype(int)  # Convert values to integers for clean axis labels

# Generate log-spaced tick marks for y-axis
y_ticks = np.logspace(
    0,  # Exponent of the starting value (10^start)
    np.log10(df['estimated_revenue_per_year'].max()),  # Exponent of the ending value (10^stop)
    num=10,  # Number of ticks to generate
    base=10  # Base of the log scale (default is 10)
).astype(int)

plt.xticks(x_ticks,x_ticks)
plt.yticks(y_ticks,y_ticks)

# Titles and labels
plt.title("Do Higher Prices Lead to Higher Revenue?")
plt.xlabel("Nightly Rate")
plt.ylabel("Estimated Revenue per Year")

# Show plot
plt.show()

##### 1. Why did you pick the specific chart?

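* A scatter plot shows the relationship between two numerical variables (nightly rate and estimated revenue per year).
* Using log scales on both axes spreads out the heavily skewed values, making patterns easier to see.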

##### 2. What is/are the insight(s) found from the chart?

* The plot shows a positive correlation between nightly rate and estimated revenue per year, meaning that as nightly rates increase, revenue also tends to increase.
* The budget and mid-range listings dominate the market as most data points are clustered in the lower nightly rate range.
* Some listings with the same nightly rate have very different revenue levels. This indicates that occupancy rate plays a critical role—high rates alone do not guarantee high revenue.
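
The strength of the positive relationship described above can be quantified with a correlation on log-transformed values, which mirrors the log-log axes of the plot. This is a minimal sketch, not part of the original analysis: the `toy` DataFrame is hypothetical, and rows with non-positive values are dropped because their logarithm is undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df; values are made up for illustration
toy = pd.DataFrame({
    "nightly_rate": [40, 60, 80, 100, 150, 200, 400, 1000],
    "estimated_revenue_per_year": [0, 3000, 2500, 9000, 7000, 20000, 15000, 60000],
})

# Keep only rows where both values are positive so log10 is defined
positive = toy[(toy["nightly_rate"] > 0) & (toy["estimated_revenue_per_year"] > 0)]

log_rate = np.log10(positive["nightly_rate"])
log_rev = np.log10(positive["estimated_revenue_per_year"])

# Pearson correlation of the log-transformed values
r = log_rate.corr(log_rev)

print(f"Rows used: {len(positive)}, log-log correlation: {r:.2f}")
```

A value of `r` well below 1 on the real data would be consistent with the third insight: nightly rate alone does not determine revenue.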

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Higher prices can lead to higher revenue, but only when occupancy remains high.
* The best-performing listings balance nightly rate and occupancy for sustained earnings.

⚠️ Potential Negative Growth:
* If a listing is too expensive, it risks low bookings, leading to lower total revenue than a moderately priced, fully booked one.

#### Chart - 14 : Distribution of Revenue Across Availability

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

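# Create scatter plot with marginal histograms in teal.
# jointplot builds its own figure, so its size is set with `height`
# rather than a separate plt.figure() call (which would leave an empty figure).
plot = sns.jointplot(
    data=df,
    x="estimated_revenue_per_year",
    y="availability_days_per_year",
    kind="scatter",
    alpha=0.5,
    s=10,
    color="teal",  # Change color to teal
    height=6  # Side length of the square figure, in inches
)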

# Apply log scale to both axes
plot.ax_joint.set_xscale("log")
plot.ax_joint.set_yscale("log")

# Set xticks and yticks to actual values
xticks = [10**i for i in range(1, 7)]
yticks = [10**i for i in range(0, 4)]
plot.ax_joint.set_xticks(xticks)
plot.ax_joint.set_yticks(yticks)
plot.ax_joint.set_xticklabels(xticks)
plot.ax_joint.set_yticklabels(yticks)

# Set labels for axes
plot.ax_joint.set_xlabel("Estimated Revenue Per Year")
plot.ax_joint.set_ylabel("Availability (Days Per Year)")

# Set a title that matches the plotted variables, with proper spacing
plt.suptitle("Estimated Revenue vs. Availability (Days Per Year)", y=1.02)

# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

* Helps visualize the relationship between two numerical variables (availability vs. revenue).
* Histograms on the sides show distribution of each variable separately.

##### 2. What is/are the insight(s) found from the chart?

* Maximizing availability is key to earning more revenue, but not the only factor (pricing, demand, etc., also matter).
* Few listings with low availability still earn high revenue, likely due to premium pricing.
* Most listings have low revenue, meaning competition is high, and only a few make substantial income.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Business owners can maximize revenue by increasing availability or optimizing occupancy rates.
* New hosts can strategically set pricing and availability to enter the profitable zone.

⚠️ Potential Negative Growth:
* Simply increasing availability without optimizing other factors may lead to stagnation or inefficiency.


#### Chart - 15 : Distribution of Revenue Across Accommodation Types

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set figure size
plt.figure(figsize=(8, 5))

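# Create violin plot with one violin per accommodation type
sns.violinplot(
    data=df,
    x="accommodation_type",  # One category per x position, matching the x-axis label
    y="estimated_revenue_per_year",
    hue="accommodation_type",  # Color-code by accommodation type
    palette="viridis",
    legend=False  # The x-axis already labels each type
)

# Set log scale for better visibility if revenue varies significantly
plt.yscale("log")

# Label the powers of ten from ₹10 to ₹1M
y_ticks = [10**i for i in range(1, 7)]
plt.yticks(y_ticks, ["₹10", "₹100", "₹1K", "₹10K", "₹100K", "₹1M"])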

# Set labels and title
plt.xlabel("Accommodation Type")
plt.ylabel("Estimated Revenue Per Year")
plt.title("Distribution of Revenue Across Accommodation Types")


# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

* The violin plot effectively shows the distribution and variability of estimated revenue across different accommodation types.
* It allows for a clear comparison of revenue between Entire home/apt, Private room, and Shared room.

##### 2. What is/are the insight(s) found from the chart?

* Entire homes/apartments tend to generate the highest revenue. Private rooms and shared rooms have a more compressed revenue distribution, meaning they typically generate lower revenue.
* Private and shared rooms generally earn less, but some earn relatively high revenue, possibly due to unique locations, excellent reviews, or long-term bookings.
* Revenue potential varies within each category, meaning other factors like pricing, location, and features play a crucial role in maximizing earnings.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Hosts can maximize revenue by offering entire homes/apartments, as they tend to generate higher earnings.
* Understanding revenue distribution across accommodation types helps in setting competitive pricing and improving amenities.

⚠️ Potential Negative Growth:
* Shared rooms generate significantly lower revenue, which may discourage hosts from listing them.
* If low-earning listings are not handled well, hosts might struggle with costs and make less profit.

#### Chart - 16 : Distribution of Reviews Across Accommodation Types

In [None]:
# Set figure size for better visualization
plt.figure(figsize=(8, 5))

sns.boxplot(
    data=df,
    x="total_reviews",
    y="accommodation_type",
    hue="accommodation_type",  # Color-code by accommodation type
    palette="Set2",  # Use a visually appealing color palette
    legend=False  # Avoid duplicate legend entries
)

# Set x-axis to logarithmic scale for better data distribution
plt.xscale("log")

# Generate 10 logarithmically spaced ticks from 1 to max review count
log_ticks = np.logspace(0, np.log10(df["total_reviews"].max()), num=10, base=10).astype(int)

# Set x-axis ticks using log-spaced values
plt.xticks(log_ticks, log_ticks)

plt.xlabel("Total Reviews")  # Label for the X-axis
plt.ylabel("Accommodation Type")  # Label for the Y-axis
plt.title("Distribution of Reviews Across Accommodation Types")  # Set plot title

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

* A box plot helps compare the distribution of reviews across different accommodation types, highlighting medians, outliers, and variations.
* Using a logarithmic scale on the x-axis ensures better visualization of review counts.

##### 2. What is/are the insight(s) found from the chart?

* All accommodation types have significant outliers with very high review counts, indicating that some listings receive much more engagement than others.
* The median number of reviews is similar for all three accommodation types, suggesting that listings across categories receive comparable engagement on average.
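
The claim that medians are similar across accommodation types is easy to verify with a grouped summary. The sketch below assumes the notebook's `accommodation_type` and `total_reviews` columns; the `toy` rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df; values are made up for illustration
toy = pd.DataFrame({
    "accommodation_type": (
        ["Entire home/apt"] * 4 + ["Private room"] * 4 + ["Shared room"] * 3
    ),
    "total_reviews": [1, 5, 9, 200, 2, 6, 8, 300, 3, 7, 150],
})

# Median, maximum, and number of listings per accommodation type
summary = toy.groupby("accommodation_type")["total_reviews"].agg(
    ["median", "max", "count"]
)

print(summary)
```

Comparing the `median` and `max` columns side by side shows how far the outliers sit from a typical listing in each group.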

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Identifying the factors behind the success of high-review listings can help optimize other listings for better performance.
* Since all accommodation types get a similar number of reviews on average, businesses can invest in any type without worrying about big differences in customer interest.

⚠️ Potential Negative Growth:
* The presence of significant outliers suggests that a small number of listings dominate the market, making it harder for new or lower-performing listings to compete.
* Some listings may struggle to gain visibility if engagement is concentrated in a few popular ones, leading to lower occupancy rates and potential revenue loss.

#### Chart - 17 : Distribution of Average Availability Days Across City Areas

In [None]:
# Set figure size
plt.figure(figsize=(6, 4))

# Create a bar plot for availability days per year across city areas
sns.barplot(data=df, x="city_area", hue="city_area", y="availability_days_per_year", palette="viridis", width=0.5)

# Set labels and title
plt.xlabel("City Area")
plt.ylabel("Average Availability Days per Year")
plt.title("Availability Days Per Year Across City Areas")

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

* A bar plot is ideal for comparing categorical data (city area) against a numerical variable (availability days per year).
* This plot is effective for quickly comparing availability days across different city areas, making it easy to identify trends or outliers.

##### 2. What is/are the insight(s) found from the chart?

* Staten Island and Bronx have the most available days, possibly due to fewer bookings.
* Brooklyn has the lowest availability, meaning properties may get booked frequently or face rental restrictions.
* Queens and Manhattan have moderate availability, indicating a balanced rental pattern.
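
By default, seaborn's `barplot` draws the mean of each group, so the ranking above corresponds to a grouped mean. The following is a minimal sketch of that calculation; the `toy` DataFrame and its values are hypothetical, and only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df; values are made up for illustration
toy = pd.DataFrame({
    "city_area": [
        "Staten Island", "Staten Island", "Bronx", "Bronx", "Queens",
        "Queens", "Manhattan", "Manhattan", "Brooklyn", "Brooklyn",
    ],
    "availability_days_per_year": [200, 220, 150, 190, 140, 150, 100, 130, 80, 120],
})

# Mean availability per city area, highest first
mean_availability = (
    toy.groupby("city_area")["availability_days_per_year"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_availability)
```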

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Areas with high availability (e.g., Staten Island, Bronx) may need competitive pricing or special promotions to attract more bookings.
* Owners in low-availability areas (e.g., Brooklyn) can maximize earnings by increasing rates or offering premium services.
* Investors can choose areas based on occupancy trends, ensuring higher returns on properties in high-demand locations.

⚠️ Potential Negative Growth:
* High availability in Staten Island and Bronx suggests lower demand, which could lead to price drops and reduced profitability.

#### Chart - 18 : Distribution of Revenue Across Host Listing Counts and Price Categories

In [None]:
plt.figure(figsize=(8, 5))

# Scatter plot with 'price_category' as hue
sns.scatterplot(
    data=df,
    x="total_listings_by_host",
    y="estimated_revenue_per_year",
    hue="price_category",  # Color by price category
    alpha=0.5,
    palette="viridis"  # Choose a color palette
)

plt.xscale("log")
plt.yscale("log")

plt.xticks([1, 2, 5, 10, 20, 50, 100, 300], [1, 2, 5, 10, 20, 50, 100, 300])
plt.yticks([100, 1000, 10000, 100000, 1000000], ["₹100", "₹1K", "₹10K", "₹100K", "₹1M"])

plt.xlabel("Total Listings by Host (Log Scale)")
plt.ylabel("Estimated Revenue per Year (Log Scale)")
plt.title("Do Hosts with More Listings Earn More?")

plt.legend(title="Price Category")  # Legend for price categories
plt.show()


##### 1. Why did you pick the specific chart?

* A scatter plot helps compare revenue across different hosts with varying listing counts.
* Using a log scale makes it easier to see patterns across a wide range of values.
* Adding colors for price categories helps analyze how different pricing strategies impact revenue.

##### 2. What is/are the insight(s) found from the chart?

* Hosts with more listings tend to earn higher revenue, but there are exceptions. Some hosts with fewer listings still make substantial revenue, while some with many listings earn relatively less.
* Luxury and premium listings generate significantly higher revenue, even with fewer listings. Budget and mid-range listings, even in large numbers, often remain in lower revenue ranges.
* The wide spread of revenue values for hosts with similar numbers of listings suggests that factors like pricing, occupancy rate, location, and demand play a major role.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Hosts managing multiple budget listings might consider upgrading their properties or improving pricing strategies to increase revenue per unit.
* Encouraging hosts to invest in premium or luxury categories can boost overall platform revenue.

⚠️ Potential Negative Growth:
* Since cheaper listings mostly make less money, hosts who add more budget properties without offering something unique may have trouble getting bookings and earning good revenue.
* If too many hosts shift towards luxury listings due to high revenue potential, there might not be enough affordable stays, which could drive away budget travelers.

#### Chart - 19 : Relationships Between Host Listings, Revenue, Availability, and Reviews Across Stay Categories

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select relevant numeric columns for pairplot
selected_columns = ["total_listings_by_host", "estimated_revenue_per_year", "availability_days_per_year", "total_reviews"]

# Keep only the columns that actually exist in the dataset
available_columns = [col for col in selected_columns if col in df.columns]

# Use 'stay_category' for hue only if it is present in the dataset
hue_column = "stay_category" if "stay_category" in df.columns else None
plot_columns = available_columns + ([hue_column] if hue_column else [])

# Create the pairplot (pairplot builds its own figure, so no plt.figure() call is needed)
if len(available_columns) > 1:
    sns.pairplot(
        df[plot_columns],
        hue=hue_column,
        palette="husl" if hue_column else None,  # Palette only applies when hue is set
        diag_kind="kde"
    )
    plt.show()
else:
    print("Not enough valid columns found for pairplot. Check column names.")


##### 1. Why did you pick the specific chart?

* The pairplot helps visualize relationships between multiple numerical variables while categorizing them by stay type, making it easier to identify patterns and trends.
* It highlights correlations, distributions, and clustering among key business metrics such as revenue, availability, and reviews, which are crucial for decision-making.

##### 2. What is/are the insight(s) found from the chart?

* Short Stay listings receive the most reviews, indicating frequent guest turnover, while longer stays (Month Stay and Extended Stay) receive fewer reviews, possibly due to lower turnover.
* Hosts with multiple listings generate more revenue, which may encourage property investors to scale their operations and manage multiple properties effectively.
* High-revenue listings exist across different stay categories, suggesting that factors beyond stay duration, such as pricing strategy and property quality, influence revenue.
* Listings with very low availability still receive a high number of reviews, implying that certain properties remain in demand even with limited availability.
* Short Stay listings are strongly represented across all variables, suggesting they dominate the market compared to other stay categories.
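
The review and revenue patterns read off the pairplot can be cross-checked with a grouped summary. The following is a minimal sketch, not the notebook's actual analysis: it uses a small made-up frame named `toy` (hypothetical values) in place of `df`, and assumes the column names `stay_category`, `total_reviews`, and `estimated_revenue_per_year` used in the chart above. The helper `summarize_by_stay` is introduced here for illustration only.

```python
import pandas as pd

# Toy stand-in for the notebook's df; all values are invented for illustration.
toy = pd.DataFrame({
    "stay_category": ["Short Stay", "Short Stay", "Short Stay",
                      "Month Stay", "Month Stay", "Extended Stay"],
    "total_reviews": [120, 45, 80, 6, 10, 2],
    "estimated_revenue_per_year": [30000, 12000, 18000, 9000, 15000, 20000],
})

def summarize_by_stay(frame):
    """Return listing counts and median reviews/revenue per stay category."""
    return (
        frame.groupby("stay_category", observed=True)
        .agg(
            listings=("total_reviews", "size"),
            median_reviews=("total_reviews", "median"),
            median_revenue=("estimated_revenue_per_year", "median"),
        )
        .sort_values("median_reviews", ascending=False)
    )

summary = summarize_by_stay(toy)
print(summary)
```

Medians are used rather than means because review and revenue distributions in listing data are typically right-skewed; calling `summarize_by_stay(df)` on the real frame would show whether the visual impression holds.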

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Short Stay is the most popular category, so businesses can focus on improving pricing, marketing, and services for short-term rentals to increase earnings.
* Some listings make much higher revenue than others, so hosts can adjust pricing and features to attract more guests and earn more.
* Hosts with many listings earn more money, which means property owners can grow their business by managing multiple properties.
* Longer stays, while fewer in number, may provide stable income, allowing businesses to balance between short-term profits and long-term stability.

⚠️ Potential Negative Growth:
* Many listings earn very little revenue, meaning some hosts may struggle to make a profit, which could force them to lower prices or leave the market.
* Keeping a property available all year does not always mean higher earnings, so hosts may be wasting money keeping their listings open without enough demand.
* Longer stays get fewer reviews, making it harder for those listings to build trust and attract new guests, which can lower bookings.
* Some high-earning listings are not available often, which could mean supply problems or poor management, stopping businesses from making more money.

#### Chart - 20 : Availability Analysis Across Stay Duration and Price Categories

In [None]:
# Create a pivot table for the heatmap
heatmap_data = df.pivot_table(
    index="stay_category",  # Rows represent stay categories
    columns="price_category",  # Columns represent price categories
    values="availability_days_per_year",  # Values to be plotted
    aggfunc="mean",  # Aggregate function to calculate mean availability
    observed=False  # Avoid future warning from pandas
)

# Set figure size
plt.figure(figsize=(8, 5))

# Create a heatmap using Seaborn
sns.heatmap(heatmap_data, annot=True, cmap="coolwarm", fmt=".1f", linewidths=0.5)

# Set labels and title
plt.xlabel("Price Category")
plt.ylabel("Stay Category")
plt.title("Heatmap: Availability Days Per Year by Stay & Price Category")

# Rotate y-ticks for better readability
plt.yticks(rotation=0)  # 0-degree rotation (horizontal labels)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

* The heatmap shows the mean availability (days per year) for every combination of stay category and price category.
* It helps identify which stay and price combinations have higher or lower availability.
* The color-coded representation allows for quick understanding, with warm colors indicating high values and cool colors indicating low values.

##### 2. What is/are the insight(s) found from the chart?

* Budget and mid-range short-term stays have lower availability, likely because they are booked frequently.
* Luxury and premium extended stays have the highest availability, likely because fewer people can afford long stays in high-end accommodations.
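
A mean in a heatmap cell can rest on very few listings, so it helps to look at cell counts next to the averages. The sketch below is a hedged illustration on a made-up frame named `toy` (hypothetical values), assuming the column names `stay_category`, `price_category`, and `availability_days_per_year` from the heatmap code above.

```python
import pandas as pd

# Toy stand-in for the notebook's df; all values are invented for illustration.
toy = pd.DataFrame({
    "stay_category": ["Short Stay", "Short Stay", "Short Stay",
                      "Extended Stay", "Extended Stay"],
    "price_category": ["Budget", "Budget", "Luxury", "Luxury", "Luxury"],
    "availability_days_per_year": [30, 50, 200, 300, 340],
})

# Mean availability per (stay, price) cell, as plotted in the heatmap.
mean_table = toy.pivot_table(
    index="stay_category",
    columns="price_category",
    values="availability_days_per_year",
    aggfunc="mean",
)

# Number of listings behind each cell; empty combinations show up as 0.
count_table = pd.crosstab(toy["stay_category"], toy["price_category"])

print(mean_table)
print(count_table)
```

Cells with a count of zero appear as `NaN` in the mean table, and cells with only a handful of listings deserve a cautious reading before they drive pricing recommendations.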

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Budget & Mid-range short stays have lower availability, indicating high demand. Hosts can implement dynamic pricing to maximize revenue.
* Luxury extended stays have high availability, meaning lower demand. Hosts can offer discounts for long-term stays to improve occupancy rates.

⚠️ Potential Negative Growth:
* If budget/mid-range stays have very low availability, travelers may switch to competitors (e.g., hotels or other platforms) if they can’t find affordable options.
* If luxury properties are vacant for long periods, hosts might leave the platform or shift to long-term rentals outside Airbnb, leading to revenue loss.


#### Chart - 21 : Accommodation Type Distribution Across City Areas and Price Categories

In [None]:
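# Aggregate listing counts by accommodation type, price category, and city area
df_counts = df.groupby(["accommodation_type", "price_category", "city_area"], observed=True).size().reset_index(name="count")

# Create a FacetGrid with one panel per accommodation type
g = sns.FacetGrid(df_counts, col="accommodation_type", col_wrap=3, height=5, sharey=False)

# Fix the bar and hue order so that positions and colors match across panels
area_order = list(df_counts["city_area"].unique())
price_order = list(df_counts["price_category"].unique())

# Create grouped bar charts: city areas on the x-axis, bars colored by price category
g.map_dataframe(sns.barplot, x="city_area", y="count", hue="price_category", order=area_order, hue_order=price_order, palette="coolwarm")

# Adjust layout
g.set_axis_labels("City Area", "Count")
g.set_titles(col_template="Accommodation Type: {col_name}")
g.add_legend(title="Price Category")
# Show plot
plt.show()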

##### 1. Why did you pick the specific chart?

* The chart categorizes listings by accommodation type, making it easier to compare trends across city areas.
* It highlights pricing distribution, helping identify demand and profitability in different market segments.
* The facet grid structure ensures clarity by preventing data overlap and making insights more actionable.

##### 2. What is/are the insight(s) found from the chart?

* Manhattan & Brooklyn have the highest number of listings across all accommodation types.
* Shared & private rooms are almost exclusively budget-friendly, while premium and luxury options are more common in entire homes/apts.
* Queens & Bronx have significantly fewer listings compared to Manhattan and Brooklyn.
* Shared Rooms are the least popular accommodation type, mostly found in Brooklyn & Manhattan.
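
The claim that shared and private rooms skew toward budget prices can be expressed as row-wise shares rather than raw counts. The following is a minimal sketch on a made-up frame named `toy` (hypothetical values); the category labels `Entire home/apt`, `Private room`, `Shared room`, `Budget`, `Premium`, and `Luxury` are assumptions about how the notebook names them.

```python
import pandas as pd

# Toy stand-in for the notebook's df; all values are invented for illustration.
toy = pd.DataFrame({
    "accommodation_type": ["Entire home/apt"] * 4 + ["Private room"] * 3 + ["Shared room"],
    "price_category": ["Luxury", "Premium", "Budget", "Luxury",
                       "Budget", "Budget", "Premium",
                       "Budget"],
})

# Share of each price category within every accommodation type (rows sum to 1).
share = pd.crosstab(
    toy["accommodation_type"],
    toy["price_category"],
    normalize="index",
)

print(share.round(2))
```

Normalizing by row removes the effect of Manhattan and Brooklyn simply having more listings, so the price mix of each accommodation type can be compared directly.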

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:
* Manhattan has the highest number of luxury entire-home listings, which suggests strong demand and makes it a prime area for premium investments.
* Queens and Bronx have fewer listings; if demand for budget-friendly accommodations there is strong, this presents an opportunity for business expansion.
* Shared room accommodations are concentrated in Brooklyn and Manhattan, which points to potential for expansion in budget-friendly stays.

⚠️ Potential Negative Growth:
* Oversaturation in Manhattan and Brooklyn could reduce occupancy rates and limit profitability due to high competition.
* Luxury listings in Queens, Bronx, and Staten Island are rare, which may reflect low demand and makes premium investments in these areas risky.
* Neglecting budget accommodations in Queens and Bronx could result in missed opportunities for growth in an underserved market.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.
