# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name- Supriya Misal**


# **Project Summary -**


Project Overview:
The objective of this project is to analyze Airbnb's New York City (NYC) dataset from 2019 to gain insights into the key metrics that influence property listings on the platform. NYC is a global tourist hotspot known for its museums, entertainment, restaurants, and commerce, making it a significant market for Airbnb. Through basic exploratory data analysis (EDA) techniques, we aim to understand the distribution of Airbnb listings based on their location, price range, room type, listing name, and other related factors.

Key Objectives:

* Explore and visualize the Airbnb dataset for NYC.
* Identify and analyze key metrics that influence property listings on Airbnb.
* Provide insights into the distribution of listings based on location, price, room type, and listing attributes.
* Offer recommendations or insights that could be valuable for Airbnb as a company.

Project Components:

Data Collection: Gather the Airbnb NYC dataset from 2019, including information on listings, reviews, host details, and geographic data.

Data Cleaning: Preprocess the dataset to handle missing values, outliers, and inconsistencies to ensure data quality.

Exploratory Data Analysis (EDA): Conduct basic EDA techniques, including summary statistics, data visualizations, and correlations, to understand the dataset's characteristics and identify trends and patterns.

Location Analysis: Analyze the geographic distribution of Airbnb listings in NYC, including popular neighborhoods, boroughs, and their respective property types.

Price Range Analysis: Investigate the pricing patterns of Airbnb listings, including factors affecting price variations, such as location, room type, and property attributes.

Room Type Analysis: Examine the distribution of different room types (e.g., entire homes, private rooms, shared rooms) on the platform.

Listing Name Analysis: Analyze the impact of listing names and descriptions on listing popularity and pricing.


# **GitHub Link -**

https://github.com/SupriyaMisal/Capston_project_Airbnb_EDA/tree/8d51914eeac31ab0e2c0cb466db87839c44c8fee

# **Problem Statement**


In an era where data-driven insights drive business success, Airbnb, a leading online marketplace for lodging and travel experiences, seeks to harness the wealth of historical booking data at its disposal to enhance the experiences of both hosts and guests. This project endeavors to conduct a comprehensive Exploratory Data Analysis (EDA) on Airbnb's booking data with the objective of uncovering valuable insights and addressing key questions to optimize the platform's operations and bolster customer and host satisfaction.

#### **Define Your Business Objective?**


"Enhancing Airbnb Experiences Through Data Analysis"

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:

from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Data Sets/Airbnb NYC 2019.csv'  # Path to the Airbnb NYC 2019 CSV in your Google Drive; adjust as needed.
airbnb_df=pd.read_csv(path)
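
The guidelines above mention exception handling, so here is a minimal sketch of a loader that fails with a clear message when the file is missing. The helper name `load_airbnb_csv` is hypothetical and not part of the original notebook; it only wraps `pd.read_csv`.

```python
from pathlib import Path

import pandas as pd


def load_airbnb_csv(csv_path):
    """Read the listings CSV, raising a clear error if the file is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    return pd.read_csv(csv_path)


# Example usage (the path is whatever location holds your copy of the file):
# airbnb_df = load_airbnb_csv("/content/drive/MyDrive/Data Sets/Airbnb NYC 2019.csv")
```

Checking the path first gives a more readable error than the traceback raised deep inside `read_csv`.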


### Dataset First View

In [None]:
airbnb_df.head()

### Dataset Rows & Columns count

In [None]:
airbnb_df.shape

### Dataset Information

In [None]:
airbnb_df.info()

#### Duplicate Values

In [None]:
len(airbnb_df[airbnb_df.duplicated()])

In [None]:
airbnb_df.drop_duplicates(inplace=True)


#### Missing Values/Null Values

In [None]:
missing_values = airbnb_df.isnull().sum()
missing_values

Visualization of null values

In [None]:
import missingno as msno
import matplotlib.pyplot as plt

In [None]:
missing_values = airbnb_df.isnull().sum()

# Create a missing value matrix using missingno
msno.matrix(airbnb_df)

# Add labels and show the plot
plt.xlabel("Columns")
plt.ylabel("Rows")
plt.title("Missing Value Matrix")
plt.show()

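**Fill NA values with zero: 'last_review' and 'reviews_per_month' are important columns for the analysis, and about 20% of their values are null. Note that `fillna(0)` is applied to the whole DataFrame, so any missing 'name' or 'host_name' values are also replaced with 0.**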

In [None]:
airbnb_df.fillna(0,inplace=True)

In [None]:
missing_values = airbnb_df.isnull().sum()
missing_values
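
As an alternative to a blanket `fillna(0)`, the sketch below fills each column with a value that suits its type. It is only an illustration on a made-up four-column sample; the helper name `fill_missing_by_column` and the `"Unknown"` placeholder are assumptions, not part of the original notebook.

```python
import pandas as pd


def fill_missing_by_column(df):
    """Return a copy with column-specific fill values instead of a blanket zero."""
    filled = df.copy()
    # No reviews yet means a review rate of zero.
    filled["reviews_per_month"] = filled["reviews_per_month"].fillna(0)
    # Text columns get a readable placeholder rather than the number 0.
    filled["name"] = filled["name"].fillna("Unknown")
    filled["host_name"] = filled["host_name"].fillna("Unknown")
    # last_review is left missing so it can later be parsed as a date (NaT).
    return filled


# Made-up sample rows, used only to demonstrate the helper.
sample = pd.DataFrame({
    "name": ["Cozy loft", None],
    "host_name": [None, "Ana"],
    "last_review": ["2019-05-21", None],
    "reviews_per_month": [0.21, None],
})
print(fill_missing_by_column(sample))
```

Keeping `last_review` missing avoids mixing the integer 0 into a date column.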

### What did you know about your dataset?

The dataset comes from Airbnb, and the task is to analyze its booking details. It is a collection of data related to listings, bookings, and other information from the Airbnb platform.

The dataset has 48895 rows and 16 columns. After the cleaning steps above, no missing or duplicate values remain.

## ***2. Understanding Your Variables***

In [None]:
list(airbnb_df.columns)

In [None]:
airbnb_df.describe()

### Variables Description

**id**: Unique ID of the listing.

**name**: Name of the listing.

**host_id**: ID of the host who runs the property.

**host_name**: Name of the host.

**neighbourhood_group**: Higher-level geographic grouping of neighbourhoods.

**neighbourhood**: Locality where the property is located.

**latitude**: Numeric north-south position of the property.

**longitude**: Numeric east-west position of the property.

**room_type**: Entire home/apt, Private room, or Shared room.

**price**: Price per night.

**minimum_nights**: Minimum number of nights a guest must book.

**number_of_reviews**: Total number of reviews received.

**last_review**: Date of the most recent review.

**reviews_per_month**: Average number of reviews received per month.

**calculated_host_listings_count**: Number of listings the host has on Airbnb.

**availability_365**: Number of days in the year the listing is available for booking.

### Check Unique Values for each variable.

In [None]:
airbnb_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

**Question 1:**
"How many unique neighborhoods are there in total across all neighborhood groups in the dataset?"

In [None]:
neigh_df=airbnb_df.groupby("neighbourhood_group")["neighbourhood"].nunique()
print("Total neighbourhood :",sum(neigh_df))
df1=pd.DataFrame(neigh_df).reset_index()
df1

**Question 2:** How many reviews did listings of each room type receive?

In [None]:
# Sum the reviews per room type (count() would only count the listings in each group)
review_list=airbnb_df.groupby("room_type")["number_of_reviews"].sum()
df2=pd.DataFrame(review_list)
df2

**Question 3:** For the hosts with the most reviews, compare yearly availability, total number of reviews, and total minimum nights.

In [None]:
df3 = airbnb_df.groupby("host_name").agg({"availability_365":"max","number_of_reviews":"sum","minimum_nights":"sum"}).sort_values(by="number_of_reviews",ascending=False).reset_index().head(10)
df3

**Question 4** :
Which host has the most listings in the dataset, and how many listings do they have?





In [None]:
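# calculated_host_listings_count repeats on every listing of a host, so take the max rather than the sum
df4=airbnb_df.groupby("host_name").agg({"calculated_host_listings_count":"max"}).sort_values(by="calculated_host_listings_count",ascending=False).head()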
df4
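
Grouping by `host_name` merges different hosts who happen to share a name. The sketch below, on made-up toy data, counts listings per `host_id` instead and carries the name along only as a label. The column `listing_count` and the sample names are assumptions for illustration.

```python
import pandas as pd

# Toy data: two different hosts (ids 1 and 2) share the name "Michael".
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "host_name": ["Michael", "Michael", "Michael", "Dana", "Dana", "Dana"],
})

# Count rows per (host_id, host_name) pair so same-named hosts stay separate.
listings_per_host = (
    toy.groupby(["host_id", "host_name"])
    .size()
    .reset_index(name="listing_count")
    .sort_values("listing_count", ascending=False)
)
print(listings_per_host)
```

On this toy data the two hosts named "Michael" appear as separate rows with 2 and 1 listings, rather than one combined row with 3.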

**Question 5:** Which 10 neighbourhoods (with their neighbourhood groups) have the highest average price, and what is their average number of reviews?

In [None]:
df5=airbnb_df.groupby(["neighbourhood","neighbourhood_group"]).agg({"price":"mean","number_of_reviews":"mean"}).sort_values(by=["price"],ascending=False).head(10)
df5


**Question 6:** Calculate the total number of reviews and the average reviews per month for each neighborhood.

In [None]:
df6 =airbnb_df.groupby('neighbourhood').agg({'number_of_reviews':"sum","reviews_per_month":"mean"}).reset_index().head()
df6

**Question 7:** Create a new column that categorizes listings into three price tiers ("Low", "Medium", and "High") based on their price, and compare the average price and average host listing count per value category.

In [None]:
def price_categ(value):
  if value<=100:
    return "Low"
  elif value<=180:
    return "Medium"
  else:
    return "High"

value=150  # fixed example price, so the notebook runs end to end without user input

result=price_categ(value)
print(result)


In [None]:
airbnb_df["value_cat"] = airbnb_df["price"].apply(price_categ)

# Average price and average host listing count for each price tier
df7 = airbnb_df.groupby("value_cat").agg({"price":"mean","calculated_host_listings_count":"mean"})
df7
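
The same three tiers can also be assigned in a vectorized way with `pd.cut`. This is a sketch on a handful of made-up prices; the bin edges mirror the `<= 100` and `<= 180` thresholds used in `price_categ`.

```python
import numpy as np
import pandas as pd

# Made-up prices chosen to sit on and around the tier boundaries.
prices = pd.Series([0, 100, 101, 180, 181, 1000], name="price")

# Bins are right-inclusive by default, matching `<= 100` and `<= 180`.
tiers = pd.cut(
    prices,
    bins=[-np.inf, 100, 180, np.inf],
    labels=["Low", "Medium", "High"],
)
print(tiers.tolist())
# ['Low', 'Low', 'Medium', 'Medium', 'High', 'High']
```

`pd.cut` returns an ordered categorical, which keeps the tiers in Low, Medium, High order when grouping or plotting.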

**Question 8:** Calculate the average price for each neighbourhood group.

In [None]:
df8 = airbnb_df.groupby("neighbourhood_group")["price"].mean().reset_index()
df8

**Question 9:** What is the average minimum number of nights required for a booking in each neighborhood group?

In [None]:
ert = airbnb_df.groupby(["neighbourhood_group", "room_type"])["minimum_nights"].mean().reset_index()
df9= ert.set_index(["neighbourhood_group", "room_type"])
df9

***Question 10:*** What is the sum of room nights as per Airbnb listings in the dataset?

In [None]:
room_booking=airbnb_df.groupby("room_type")["minimum_nights"].sum()
df10=pd.DataFrame(room_booking)
df10

**Question 11:** How has the number of reviews changed over the years, and are there any notable trends or patterns?

In [None]:
# Convert 'last_review' to datetime; the 0 placeholders left by fillna(0) become NaT
airbnb_df['last_review'] = pd.to_datetime(airbnb_df['last_review'].replace(0, np.nan), errors='coerce')

# Extract year and month from the 'last_review' column
airbnb_df['year'] = airbnb_df['last_review'].dt.year
airbnb_df['month'] =airbnb_df['last_review'].dt.month

# Count listings by the year and month of their most recent review
df11 =airbnb_df.groupby(['year', 'month']).size().reset_index(name='review_count')
df11
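
A minimal sketch, on made-up dates, of counting listings by the month of their last review while keeping missing dates as `NaT`. Because `last_review` stores only the most recent review of each listing, these counts describe when listings were last reviewed, not the total review volume per month.

```python
import pandas as pd

# Made-up sample: three listings with a last review, one without.
toy = pd.DataFrame({"last_review": ["2019-05-21", "2019-05-03", "2018-11-19", None]})

# Missing or unparseable entries become NaT instead of a placeholder date.
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

# Drop listings without a review, then count per calendar month.
valid = toy.dropna(subset=["last_review"])
monthly = valid["last_review"].dt.to_period("M").value_counts().sort_index()
print(monthly)
```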

**Question 12:** Calculate a booking percentage for each listing as its minimum nights relative to its yearly availability.


In [None]:
airbnb_df["Booking Percentage"]=(airbnb_df["minimum_nights"]/airbnb_df["availability_365"])*100
# Listings with availability_365 == 0 give inf after the division; set those to 0
airbnb_df["Booking Percentage"]=airbnb_df["Booking Percentage"].replace(np.inf, 0)
airbnb_df[["minimum_nights","availability_365","Booking Percentage"]].head()
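
The division above can also be written so that `inf` never appears. The sketch below uses made-up values and a hypothetical column name, `min_nights_pct_of_availability`, which describes the ratio literally: it is minimum nights as a share of available days, a rough proxy rather than a measured booking rate.

```python
import numpy as np
import pandas as pd

# Made-up sample, including two listings with zero availability.
toy = pd.DataFrame({
    "minimum_nights": [1, 30, 3, 2],
    "availability_365": [365, 60, 0, 0],
})

# Treat zero availability as undefined so the division never produces inf.
available_days = toy["availability_365"].replace(0, np.nan)
toy["min_nights_pct_of_availability"] = (
    toy["minimum_nights"] / available_days * 100
).fillna(0)
print(toy)
```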



**Question 13:** Among listings with a minimum stay of two nights or fewer, identify the top 5 neighbourhoods by highest review count, along with their average price.

In [None]:
filtered_listings = airbnb_df[airbnb_df['minimum_nights'] <= 2]

average_price_by_neighborhood = filtered_listings.groupby('neighbourhood').agg({"price": "mean", "number_of_reviews": "max"}).sort_values("number_of_reviews",ascending=False)
df13= average_price_by_neighborhood.reset_index().head()
df13

**Question 14:** Create a new column "total_turnover" and find the top hosts by that value.

In [None]:
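# Assumption: "turnover" is approximated as the summed nightly price of each host's listings
airbnb_df["total_turnover"] = airbnb_df.groupby("host_id")["price"].transform("sum")

# Keep one row per host, then rank hosts by turnover
df14 = (airbnb_df[["host_id", "host_name", "total_turnover"]]
        .drop_duplicates("host_id")
        .sort_values("total_turnover", ascending=False))
df14.head()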

### What all manipulations have you done and insights you found?


* Entire home/apt listings have the highest number of reviews and shared rooms the fewest, so hosts should work on improving reviews for shared rooms.
* The host name Michael has a large number of bookings and reviews.
* High-priced listings make up 17.74% of all listings.
* Customers pay the highest amounts in Staten Island and Manhattan, at $800.
* The top three hosts by turnover are John, Marc & Youna, and Rachel.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**Question 1:**
"How many unique neighborhoods are there in total across all neighborhood groups in the dataset?"

In [None]:
neigh_df=airbnb_df.groupby("neighbourhood_group")["neighbourhood"].nunique()
print("Total neighbourhood :",sum(neigh_df))
df1=pd.DataFrame(neigh_df).reset_index()
df1

In [None]:
import matplotlib.pyplot as plt
plt.bar(df1["neighbourhood_group"], df1['neighbourhood'], color='skyblue')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Count of Unique Neighborhoods')
plt.title('Unique Neighborhoods Across Neighborhood Groups')

plt.show()

##### 1. Why did you pick the specific chart?

The first question asks how many unique neighborhoods each neighborhood group contains, that is, which group has the most or the fewest. A bar chart suits this well: each bar represents one neighborhood group, so the number of unique neighborhoods is easy to compare across groups.

##### 2. What is/are the insight(s) found from the chart?

1. The chart compares different parts of the city.

2. It identifies the neighborhood group with the most unique neighborhoods.

3. As a stakeholder, we can check which areas cover a large number of neighborhoods with properties.

4. As a customer, we can see which areas offer more options for accommodation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As mentioned, from the data and chart an Airbnb stakeholder can see which areas have a large number of properties. Where an area has few properties, Airbnb can partner with more hosts there and avoid concentrating listings in specific neighborhoods.

#### Chart - 2

Question 2: How many reviews did listings of each room type receive?

In [None]:
# Sum the reviews per room type (count() would only count the listings in each group)
review_list=airbnb_df.groupby("room_type")["number_of_reviews"].sum()
df2=pd.DataFrame(review_list)
df2

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize =(10, 7))

plt.pie(df2["number_of_reviews"],labels = df2.index, autopct='%1.1f%%',startangle=180,colors =["#ea5545", "#0bb4ff", 'green'])
plt.title('Airbnb Reviews by Room Type')

# show plot
plt.show()

##### 1. Why did you pick the specific chart?

Clear Comparison: A pie chart allows you to easily compare the proportion of reviews for each room type. It's straightforward to see which room type has the most reviews and how it compares to the others.

Visual Simplicity: Pie charts are visually simple and intuitive. They can quickly convey the relative sizes of categories without requiring complex data interpretation.

Percentage Representation: Pie charts often include percentage labels, which can provide viewers with precise information about each category's share.
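
The percentages a pie chart displays can also be computed directly, which is handy for quoting exact figures in the write-up. A minimal sketch on made-up review counts:

```python
import pandas as pd

# Made-up review counts for four listings.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "number_of_reviews": [40, 30, 10, 20],
})

# Share of all reviews received by each room type, in percent.
review_totals = toy.groupby("room_type")["number_of_reviews"].sum()
review_share_pct = (review_totals / review_totals.sum() * 100).round(1)
print(review_share_pct)
```

On this toy data the shares are 40.0, 50.0, and 10.0 percent for entire homes, private rooms, and shared rooms respectively.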

##### 2. What is/are the insight(s) found from the chart?

This chart can help you make informed decisions if you're an Airbnb host or property manager. For example, if you see that "Entire home/apt" is the most popular room type, you might consider investing in or listing more properties of that type. Conversely, if "Shared room" is less popular, you might adjust your pricing or marketing strategies for those listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Host Preference: Hosts should look into the low number of reviews for shared rooms, strengthen their marketing strategy, and find out why those listings receive few reviews.

Guest Preferences: The chart may provide insights into guest preferences. If certain room types consistently receive more reviews, it could indicate that guests prefer those types of accommodations.

#### Chart - 3

Question 3: For the hosts with the most reviews, compare yearly availability, total number of reviews, and total minimum nights.

In [None]:
df3 = airbnb_df.groupby("host_name").agg({"availability_365":"max","number_of_reviews":"sum","minimum_nights":"sum"}).sort_values(by="number_of_reviews",ascending=False).reset_index().head(10)
df3

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# df3 (top 10 hosts by total number of reviews) is created in the previous cell

# Filter the top 5 hosts with the highest number of reviews
top_hosts = df3.head(5)

# Define data for the chart
host_names = top_hosts['host_name']
reviews = top_hosts['number_of_reviews']
minimum_nights = top_hosts['minimum_nights']

# Set the width of the bars
bar_width = 0.2

# Create index values for the x-axis positions
x = np.arange(len(host_names))

# Plotting the grouped bar chart
plt.figure(figsize=(12, 6))
plt.bar(x, reviews, bar_width, label='Total Reviews', alpha=0.7)
plt.bar(x + bar_width, minimum_nights, bar_width, label='Total Minimum Nights', alpha=0.7)

# Customize the plot
plt.xlabel('Host Names')
plt.ylabel('Count')
plt.title('Host Metrics Comparison (Top 5 Hosts by Number of Reviews)')
plt.xticks(x, host_names, rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

"Host Name" is a categorical variable, and bar charts are commonly used to display categorical data. Each bar represents a different host, making it easy to compare the number of reviews and total minimum nights between hosts.


##### 2. What is/are the insight(s) found from the chart?

1. The chart shows which hosts have the most reviews and the highest total minimum nights.
2. It shows how the number of reviews is distributed across the top hosts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If a host receives a low number of reviews, we can increase marketing and promotion, help improve host performance, and revisit pricing.

#### Chart - 4

**Question 4:** Which host has the most listings in the dataset, and how many listings do they have?

In [None]:
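# calculated_host_listings_count repeats on every listing of a host, so take the max rather than the sum
df4=airbnb_df.groupby("host_name").agg({"calculated_host_listings_count":"max"}).sort_values(by="calculated_host_listings_count",ascending=False).head()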
df4

In [None]:
import matplotlib.pyplot as plt
host_names = df4.index
plt.figure(figsize=(8, 8))
plt.pie(df4['calculated_host_listings_count'], labels=host_names, autopct='%1.1f%%', startangle=180)
plt.title('Top 5 Hosts by Total Calculated Host Listings Count')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

1. Pie charts are effective for showing the proportion of individual parts in relation to the whole. Here the goal is to visualize each top host's listings count as a share of the combined count, and a pie chart shows how each part (each host) contributes to the whole.

2. Simplicity and clarity.

##### 2. What is/are the insight(s) found from the chart?

1.Contribution of Top Hosts

2.Comparison of Hosts

3.Percentage Labels

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the pie chart can provide valuable information for business decision-making. Positive impacts can be achieved through recognition, strategic partnerships, and enhancing customer experiences. However, there are also potential negative impacts related to dependency on a small number of hosts, market imbalances, and quality control challenges. To mitigate potential negative outcomes, a platform should consider strategies to diversify its host base, implement quality assurance measures, and maintain a balanced marketplace while still recognizing and rewarding top-performing hosts.

#### Chart - 5

**Question 5:** Which 10 neighbourhoods (with their neighbourhood groups) have the highest average price, and what is their average number of reviews?

In [None]:
df5=airbnb_df.groupby(["neighbourhood","neighbourhood_group"]).agg({"price":"mean","number_of_reviews":"mean"}).sort_values(by=["price"],ascending=False).head(10)
df5

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df5.reset_index(), x="price", y="number_of_reviews", hue="neighbourhood_group")
plt.title("Scatter Plot of Price vs. Number of Reviews (Top 10 Neighbourhoods)")
plt.xlabel("Average Price")
plt.ylabel("Average Number of Reviews")
plt.show()

##### 1. Why did you pick the specific chart?

The scatterplot is chosen because it effectively conveys the relationship between the two variables and shows how different neighborhood groups compare in terms of Airbnb price and number of reviews, making it a suitable choice for the given data and analysis goals.

##### 2. What is/are the insight(s) found from the chart?

Price and Number of Reviews Relationship: We can examine whether there is any visible relationship between the average Airbnb price and the average number of reviews. For example, the number of reviews may tend to decrease as prices increase, or vice versa.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the scatterplot can have both positive and negative business impacts. The key is to carefully analyze the data, consider the broader business context, and use the insights to inform strategic decisions. Positive impacts can include optimized pricing and targeted marketing, while negative impacts may arise from pricing sensitivity, disparities across neighborhoods, or issues related to data quality and outliers. It's crucial to weigh the potential risks and benefits before implementing any changes to your business strategy.

Price vs. Popularity: You can see how the average price relates to the average number of reviews. If there is a clear trend, such as lower-priced neighborhoods having more reviews, this insight could be used to adjust pricing strategies. For example, you might consider lowering prices in neighborhoods with fewer reviews to attract more customers.

Neighborhood Segmentation: The use of different colors for different neighborhood groups allows you to identify which groups perform better in terms of reviews. This can help in understanding which areas are more popular and may warrant more marketing efforts or investment.
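
To put a number on the price-versus-reviews relationship that the scatterplot shows visually, a correlation coefficient can be computed. The sketch below uses made-up values; Spearman's coefficient is obtained here as the Pearson correlation of the ranks, which avoids any extra dependency.

```python
import pandas as pd

# Made-up values in which reviews fall steadily as price rises.
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 400],
    "number_of_reviews": [90, 70, 40, 20, 5],
})

# Pearson measures linear association between the raw values.
pearson = toy["price"].corr(toy["number_of_reviews"])

# Spearman is Pearson on ranks, so it is less sensitive to extreme prices.
spearman = toy["price"].rank().corr(toy["number_of_reviews"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A coefficient near -1 or +1 indicates a strong monotonic relationship, while values near 0 suggest price alone says little about review counts.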

#### Chart - 6

**Question 6:** Calculate the total number of reviews and the average reviews per month for each neighborhood.

In [None]:
df6 =airbnb_df.groupby('neighbourhood').agg({'number_of_reviews':"sum","reviews_per_month":"mean"}).reset_index().head()
df6

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")

# Create a barplot for the sum of 'number_of_reviews'
plt.subplot(1, 2, 1)
sns.barplot(x='number_of_reviews', y='neighbourhood', data=df6, palette='viridis')
plt.title('Total Number of Reviews by Neighbourhood')

# Create a barplot for the mean of 'reviews_per_month'
plt.subplot(1, 2, 2)
sns.barplot(x='reviews_per_month', y='neighbourhood', data=df6, palette='viridis')
plt.title('Mean Reviews per Month by Neighbourhood')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The combination of these two barplots provides a comprehensive view of how reviews are distributed across different neighborhoods, both in terms of total numbers and average monthly rates. This approach is effective for exploring and comparing these two aspects of the data within the context of neighborhood analysis.

##### 2. What is/are the insight(s) found from the chart?

1.You can identify which neighborhoods have the highest total number of reviews. These neighborhoods may be more popular among guests or have more listings on the platform.

2.This chart can reveal which neighborhoods tend to have a higher average number of reviews per month, indicating a consistent level of guest activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from these charts can lead to positive business impacts if used strategically and with a balanced approach. However, there are potential pitfalls, including increased competition and neglecting other essential factors, that hosts and property owners should be aware of. It's crucial to use these insights as part of a broader strategy that takes into account all aspects of the guest experience and market dynamics to achieve sustainable growth.

#### Chart - 7

**Question 7:** Create a new column that categorizes listings into three price tiers ("Low", "Medium", and "High") based on their price, and compare the average price and average host listing count per value category.

In [None]:
def price_categ(value):
  if value<=100:
    return "Low"
  elif value<=180:
    return "Medium"
  else:
    return "High"

value=150  # fixed example price, so the notebook runs end to end without user input

result=price_categ(value)
print(result)


In [None]:
airbnb_df["value_cat"] = airbnb_df["price"].apply(price_categ)

# Average price and average host listing count for each price tier
df7 = airbnb_df.groupby("value_cat").agg({"price":"mean","calculated_host_listings_count":"mean"})
df7

In [None]:
import matplotlib.pyplot as plt

categories = df7.index
mean_prices = df7['price']
mean_host_counts = df7['calculated_host_listings_count']

fig, ax1 = plt.subplots(figsize=(10, 6))

# Plot the mean prices as bars
ax1.bar(categories, mean_prices, color="#87bc45", alpha=0.7, label='Mean Price')
ax1.set_xlabel('Value Category')
ax1.set_ylabel('Mean Price', color="#2e2b28")
ax1.tick_params('y', colors="#2e2b28")

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()
ax2.plot(categories, mean_host_counts, color='r', marker='o', label='Mean Host Count')
ax2.set_ylabel('Mean Host Count', color='r')
ax2.tick_params('y', colors='r')

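# Add title, rotated x labels, and one legend covering both axes
ax1.set_title('Mean Price and Mean Host Count by Value Category')
ax1.tick_params(axis='x', labelrotation=45)
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(handles1 + handles2, labels1 + labels2, loc='upper left')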

# Display the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Comparison Across Categories: There are three value categories (Low, Medium, and High) and two metrics to compare across them: mean price and mean host listings count. Because the two metrics are on different scales, the chart combines bars for the mean price with a line on a second y-axis for the mean host listings count, so both can be read within each category and compared between categories.

##### 2. What is/are the insight(s) found from the chart?

The chart provides a visual representation of how mean price and mean host listings count vary across the value categories, allowing for initial insights and hypotheses about the data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Pricing Strategy Optimization: Understanding how mean prices vary across different value categories can inform pricing strategies. For example, if the data shows that higher-priced categories have lower mean host counts, a business might decide to focus on premium offerings in those categories. Conversely, if lower-priced categories have higher host counts, the business might adopt a competitive pricing strategy to attract more hosts.

#### Chart - 8

**Question 8:** Calculate the average price for each neighbourhood group.

In [None]:
df8 = airbnb_df.groupby("neighbourhood_group")["price"].mean().reset_index()
df8

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.bar(df8["neighbourhood_group"], df8["price"])
plt.xlabel("Neighbourhood Group")
plt.ylabel("Mean Price")
plt.title("Mean Prices by Neighbourhood Group")
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

Comparison of Categories: Bar charts are excellent for comparing categories (in this case, neighborhood groups) and their corresponding values (mean prices). Each bar represents a distinct category, making it easy to see how the mean prices differ between neighborhood groups.

Clarity: Bar charts provide a clear and concise visual representation of the data. It's easy to read and interpret, even for individuals who may not be familiar with data visualization techniques.

##### 2. What is/are the insight(s) found from the chart?

Price Variability: You can easily see which neighborhood groups tend to have higher or lower mean prices. The differences in bar heights represent the variations in average prices across the different groups.

Expensive Neighborhoods: Identify the neighborhood groups with the highest mean prices. These are likely to be considered more upscale or expensive areas for Airbnb listings.

Affordable Neighborhoods: Conversely, you can also identify the neighborhood groups with the lowest mean prices. These may be considered more budget-friendly areas for Airbnb stays.

Price Range: The chart allows you to see the range of mean prices across all neighborhood groups, helping you understand the overall distribution of prices in the dataset.
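
Mean prices can be pulled upward by a few very expensive listings, so it can help to look at the median alongside the mean. A minimal sketch on made-up prices:

```python
import pandas as pd

# Made-up prices; one Manhattan listing is an extreme outlier.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Bronx"] * 3,
    "price": [100, 150, 200, 5000, 60, 80, 100],
})

# Compare mean and median price per group, with the number of listings.
price_summary = toy.groupby("neighbourhood_group")["price"].agg(["mean", "median", "count"])
print(price_summary)
```

In this toy example the Manhattan mean is 1362.5 while its median is 175, which shows how a single outlier can distort a bar chart of means.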

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Identifying Profitable Markets: If the chart reveals that certain neighborhood groups consistently have higher average prices, this insight can help businesses focus their marketing and sales efforts on those areas, potentially leading to increased revenue and profitability.

Negative Growth Scenarios:

Overpricing: If the chart highlights that a business has been consistently charging higher prices in neighborhoods where competitors offer similar products or services at lower prices, this can lead to a negative impact. Customers in those neighborhoods may choose competitors over the business, resulting in lost sales and market share.

#### Chart - 9

**Question 9:** What is the average minimum number of nights required for a booking in each neighborhood group?

In [None]:
ert = airbnb_df.groupby(["neighbourhood_group", "room_type"])["minimum_nights"].mean().reset_index()
df9= ert.set_index(["neighbourhood_group", "room_type"])
df9

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Pivot df9 so that neighbourhood groups are rows and room types are columns
heatmap_data = df9.pivot_table(values="minimum_nights", index="neighbourhood_group", columns="room_type")

plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap="YlGnBu")
plt.title("Mean Minimum Nights by Neighbourhood Group and Room Type")
plt.xlabel("Room Type")
plt.ylabel("Neighbourhood Group")
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is a suitable choice for displaying the mean minimum nights for different combinations of "neighbourhood_group" and "room_type" because it conveys the whole grid of values at a glance through color.
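
The grid that feeds a heatmap can be built in one step with `groupby` followed by `unstack`. This is a sketch on made-up data, shown as one possible way to shape the table rather than the notebook's own method.

```python
import pandas as pd

# Made-up minimum-night values for two groups and two room types.
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Queens", "Queens"],
    "room_type": ["Private room", "Shared room", "Private room", "Shared room"],
    "minimum_nights": [2, 4, 3, 5],
})

# Mean minimum nights per (group, room type), then move room_type into columns.
grid = (
    toy.groupby(["neighbourhood_group", "room_type"])["minimum_nights"]
    .mean()
    .unstack("room_type")
)
print(grid)
```

The resulting frame has one row per neighbourhood group and one column per room type, which is the shape `sns.heatmap` expects.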

##### 2. What is/are the insight(s) found from the chart?

Variations: The heatmap helps identify how minimum nights vary across neighbourhood groups. You may notice that certain neighbourhood groups have consistently higher or lower minimum nights across all room types. This information can be useful for property owners or renters.

Comparative Analysis: By comparing the colors in different cells, you can easily identify which "neighbourhood_group" and "room_type" combinations have the longest or shortest minimum nights. This could be valuable for market research or property management decisions.
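The comparison described above can also be done numerically rather than by eye. The following is a minimal sketch on a small made-up frame: the values in `toy` are illustrative and not taken from the dataset, and only the column names are assumed to match `airbnb_df`.

```python
import pandas as pd

# Illustrative data only; column names mirror airbnb_df
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Manhattan", "Manhattan", "Queens", "Queens"],
    "room_type": ["Private room", "Shared room"] * 3,
    "minimum_nights": [2, 4, 10, 6, 3, 1],
})

# One mean per (neighbourhood_group, room_type) pair, i.e. one value per heatmap cell
cells = toy.groupby(["neighbourhood_group", "room_type"])["minimum_nights"].mean()

longest = cells.idxmax()   # pair with the highest mean minimum nights
shortest = cells.idxmin()  # pair with the lowest mean minimum nights
print("Longest:", longest, "Shortest:", shortest)
```

Reading the extremes off the grouped values avoids relying on color perception when two cells have similar shades.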

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impacts:

Optimized Pricing: If you find that certain "room_type" categories have significantly higher minimum nights in specific "neighbourhood_group" areas, you can adjust your pricing strategy accordingly. Lowering prices for rooms with longer minimum night requirements in less popular neighborhoods could attract more bookings.

Targeted Marketing: You can use the insights to tailor your marketing efforts. For example, if you notice that a particular "room_type" is in high demand in a specific neighborhood, you can create targeted marketing campaigns to promote those listings.


Negative Business Impacts:

Overpricing: Misinterpreting the heatmap could lead to overpricing rooms in areas with lower demand or longer minimum nights. This might discourage potential guests and result in lower occupancy rates.

#### Chart - 10

In [None]:
room_booking=airbnb_df.groupby("room_type")["minimum_nights"].sum()
df10=pd.DataFrame(room_booking)
df10

In [None]:
import matplotlib.pyplot as plt

# Create a bar chart
plt.figure(figsize=(8, 6))  # Adjust the figure size as needed
plt.bar(df10.index, df10['minimum_nights'],color="#9080ff")
plt.xlabel('Room Type')
plt.ylabel('Total Minimum Nights')
plt.title('Total Minimum Nights by Room Type')


# Show the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Categorical Data: Room types are categorical data, and a bar chart is well suited to comparing a single numeric value across categories. Each bar represents a room type, and the height of each bar corresponds to the total minimum nights associated with that room type.

Simple Comparison: Bar charts make relative sizes easy to read. In this case, you can quickly see which room types account for a larger or smaller total of minimum nights.

##### 2. What is/are the insight(s) found from the chart?

1. Comparison Across Room Types: The bars show how the total minimum nights differs between room types, so the room type with the largest total stands out immediately.

2. Interpretation: The total is a sum over listings, so it reflects both how many listings a room type has and how long their minimum stays are.
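One caveat when reading totals: the sum of `minimum_nights` grows with the number of listings in a room type, so a tall bar may reflect listing count rather than longer stay requirements. The sketch below uses a hypothetical `toy` frame with made-up values to put the total, the listing count, and the mean side by side.

```python
import pandas as pd

# Made-up values; only the column names are assumed to match airbnb_df
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 6 + ["Private room"] * 2,
    "minimum_nights": [2, 3, 1, 2, 3, 2, 5, 7],
})

# Total, number of listings, and mean minimum nights per room type
summary = toy.groupby("room_type")["minimum_nights"].agg(
    total="sum", listings="count", mean="mean"
)
print(summary)
```

In this toy example the room type with the larger total has the smaller mean, which is why the mean is worth checking before drawing conclusions from the totals.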

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

1. Identifying High-Demand Room Types: If the bar chart shows that certain room types have significantly higher total minimum nights compared to others, it can help the business focus its marketing and pricing strategies on those high-demand room types. This could potentially lead to increased revenue and occupancy rates.

2. Pricing Strategy: The insights from the chart may lead to adjustments in pricing strategy. If certain room types are in high demand, the business may consider increasing the prices for those rooms during peak periods to maximize revenue.

Negative Impact:

1. Underperforming Room Types: If the bar chart reveals that some room types have consistently low total minimum nights, it might indicate that these room types are not attracting customers. This could lead to negative growth if resources continue to be allocated to these unprofitable room types.

#### Chart - 11

**Question 11:** How has the number of reviews changed over the years, and are there any notable trends or patterns?


In [None]:
# Convert 'last_review' column to datetime format
airbnb_df['last_review'] = pd.to_datetime(airbnb_df['last_review'])

# Extract year and month from the 'last_review' column
airbnb_df['year'] = airbnb_df['last_review'].dt.year
airbnb_df['month'] =airbnb_df['last_review'].dt.month

# Group by year and month and count the listings whose most recent review falls in each month
df11 =airbnb_df.groupby(['year', 'month']).size().reset_index(name='review_count')
df11

In [None]:
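import matplotlib.pyplot as plt
import pandas as pd

# Build one date per (year, month) pair so that months within the same year
# are spread along the x-axis instead of stacking on a single year value
period = pd.to_datetime(df11[['year', 'month']].astype(int).assign(day=1))

# Create a line plot to visualize the frequency of reviews over time
plt.figure(figsize=(12, 6))
plt.plot(period, df11['review_count'], marker='o', linestyle='-', color='b')
plt.title('Frequency of Reviews Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Reviews')
plt.grid(True)
plt.show()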

##### 1. Why did you pick the specific chart?

A line plot was chosen to visualize the frequency of reviews over time because it is well suited to showing how a numerical variable (in this case, the number of reviews) changes over a continuous time period (years and months).

##### 2. What is/are the insight(s) found from the chart?

Trend Identification: Look for overall trends in the number of reviews over the years. Is there a noticeable increase or decrease in review activity? Identifying trends can be useful for understanding how popular your listings have been over time.

Seasonal Variations: Examine whether there are recurring patterns or seasonal variations in the review frequency. For example, you might see spikes in reviews during certain months, which could correspond to tourist seasons or holidays.

Yearly Changes: Identify any significant changes or events in the dataset by observing abrupt shifts in the line. These changes could be due to changes in property management, marketing efforts, or external factors.
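To look at monthly and yearly patterns separately, the dates can be bucketed into calendar months and then rolled up by year. This is a minimal sketch on made-up dates; the `toy` frame is hypothetical and only the `last_review` column name is taken from the notebook.

```python
import pandas as pd

# Made-up review dates, including one listing that was never reviewed
toy = pd.DataFrame({
    "last_review": ["2018-11-03", "2018-11-20", "2019-01-15", "2019-06-01", "2019-06-23", None]
})
toy["last_review"] = pd.to_datetime(toy["last_review"])

# Drop listings without a review date, then count per calendar month
reviewed = toy.dropna(subset=["last_review"])
monthly = reviewed.groupby(reviewed["last_review"].dt.to_period("M")).size()

# Roll the monthly counts up to one value per year
yearly = monthly.groupby(monthly.index.year).sum()
print(monthly)
print(yearly)
```

The monthly series is the one to inspect for seasonal spikes, while the yearly series shows the overall trend.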

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from analyzing review frequency over time can positively impact your business by optimizing strategies, improving operations, and capitalizing on successful periods. However, they can also reveal negative growth trends or issues that need immediate attention. It's crucial to act upon these insights proactively, addressing challenges and capitalizing on opportunities to ensure long-term success in the hospitality or property rental business.

#### Chart - 12

**Question 12:** Calculate the booking percentage and plot histograms of 'minimum_nights', 'availability_365', and the booking percentage.

In [None]:
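import numpy as np

# Listings with availability_365 == 0 divide by zero and produce inf; those values are set to 0
airbnb_df["Booking Percentage"] = (airbnb_df["minimum_nights"] / airbnb_df["availability_365"]) * 100
airbnb_df["Booking Percentage"] = airbnb_df["Booking Percentage"].replace(np.inf, 0)
airbnb_df[["minimum_nights", "availability_365", "Booking Percentage"]].head()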
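The ratio above divides by `availability_365`, which can be 0. An alternative way to handle that case, sketched here on a hypothetical `toy` frame with made-up values, is to mask the zeros before dividing so that no infinite values are produced in the first place.

```python
import numpy as np
import pandas as pd

# Made-up values; the second listing has zero availability
toy = pd.DataFrame({
    "minimum_nights": [1, 3, 30, 2],
    "availability_365": [365, 0, 60, 100],
})

# Replace zero availability with NaN so the division yields NaN instead of inf
availability = toy["availability_365"].replace(0, np.nan)

# Fill the undefined ratios with 0, matching the convention used in the notebook
toy["Booking Percentage"] = (toy["minimum_nights"] / availability * 100).fillna(0)
print(toy)
```

Whether 0 is the right fill value for listings with no availability is a judgment call; leaving them as NaN would exclude them from the histogram instead.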




In [None]:
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.hist(airbnb_df["minimum_nights"], bins=20)
plt.xlabel("Minimum Nights")
plt.ylabel("Frequency")
plt.title("Distribution of Minimum Nights")

plt.subplot(1, 3, 2)
plt.hist(airbnb_df["availability_365"], bins=20)
plt.xlabel("Availability 365")
plt.ylabel("Frequency")
plt.title("Distribution of Availability 365")

plt.subplot(1, 3, 3)
plt.hist(airbnb_df["Booking Percentage"], bins=20)
plt.xlabel("Booking Percentage")
plt.ylabel("Frequency")
plt.title("Distribution of Booking Percentage")

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Histograms are valuable for understanding the distribution of individual variables. Here's why histograms are necessary:

"Minimum Nights" Histogram: This histogram helps you see the distribution of the "minimum_nights" variable, which is essential to understand the typical duration required for bookings. It can reveal if there are any common values or outliers in this variable.

"Availability 365" Histogram: This histogram provides insights into the distribution of the "availability_365" variable, helping you understand how often listings are available throughout the year. It can reveal patterns in availability, such as whether most listings are available year-round or only seasonally.

"Booking Percentage" Histogram: The histogram for "Booking Percentage" allows you to examine the distribution of this calculated metric. It helps you understand the distribution of booking percentages across listings and identify any unusual patterns or outliers in this variable.

##### 2. What is/are the insight(s) found from the chart?

The "Minimum Nights" histogram can reveal the most common duration required for bookings. If there's a peak at a specific value, it suggests a preferred booking duration among listings.

The "Availability 365" histogram can show how frequently listings are available throughout the year. It may help identify whether most listings are available year-round or only for certain periods.

The "Booking Percentage" histogram can provide insights into how booking percentages are distributed among listings. You might identify a typical range of booking percentages and whether there are any outliers with exceptionally high or low booking percentages.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from these histograms have the potential to create a positive business impact by informing booking policies, marketing strategies, and pricing. Insights that are not acted upon could lead to negative growth or missed opportunities, so the business should use them proactively and adapt its strategies to changing market conditions in the hospitality industry. The specific actions taken in response will determine whether the outcome is positive or negative.

#### Chart - 13

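**Question 13:** Identify the top 5 property names per neighbourhood with the lowest price and highest reviews. The code below approximates this at the neighbourhood level: for listings with a minimum stay of at most 2 nights, it takes the 5 neighbourhoods with the highest maximum review count and reports their average price.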

In [None]:
filtered_listings = airbnb_df[airbnb_df['minimum_nights'] <= 2]

average_price_by_neighborhood = filtered_listings.groupby('neighbourhood').agg({"price": "mean", "number_of_reviews": "max"}).sort_values("number_of_reviews",ascending=False)
df13= average_price_by_neighborhood.reset_index().head()
df13

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# df13 holds the top 5 neighbourhoods by maximum number of reviews
neighbourhoods = df13['neighbourhood']
average_prices = df13['price']
max_reviews = df13['number_of_reviews']

# Set the width of the bars
bar_width = 0.35

# Create index values for the x-axis positions
x = np.arange(len(neighbourhoods))

# Plotting the grouped bar chart
plt.figure(figsize=(12, 6))
plt.bar(x, average_prices, bar_width, label='Average Price', alpha=0.7)
plt.bar(x + bar_width, max_reviews, bar_width, label='Maximum Number of Reviews', alpha=0.7)

# Customize the plot
plt.xlabel('Neighbourhood')
plt.ylabel('Value')
plt.title('Neighbourhood Comparison (Top 5 by Maximum Number of Reviews)')
plt.xticks(x + bar_width/2, neighbourhoods, rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart allows you to visually compare two different metrics (average price and maximum number of reviews) side by side for each of the top 5 neighborhoods by maximum number of reviews.

##### 2. What is/are the insight(s) found from the chart?

Neighborhoods with High Maximum Reviews.

Price Variation Among Neighborhoods.

Relationship Between Reviews and Prices.

Differences in Popularity.
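The chart works at the neighbourhood level. For a listing-level view, one option is to rank listings within each neighbourhood by price and review count. The sketch below assumes a `name` column for the listing name and uses a hypothetical `toy` frame with made-up values; it keeps 2 listings per neighbourhood only to stay small, and the same pattern works with 5.

```python
import pandas as pd

# Made-up listings; "name" is assumed to be the listing-name column
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "neighbourhood": ["Harlem", "Harlem", "Harlem", "Astoria", "Astoria", "Astoria"],
    "price": [80, 60, 60, 120, 90, 90],
    "number_of_reviews": [10, 50, 200, 5, 300, 40],
})

# Within each neighbourhood: cheapest first, ties broken by most reviews
ranked = toy.sort_values(
    ["neighbourhood", "price", "number_of_reviews"],
    ascending=[True, True, False],
)

# Keep the first 2 rows of each neighbourhood in the ranked order
top_per_neighbourhood = ranked.groupby("neighbourhood").head(2)
print(top_per_neighbourhood)
```

Sorting by price first treats low price as the primary criterion; swapping the order of the sort keys would prioritize review count instead.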

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can have both positive and negative implications for a business. The key is to use these insights strategically, adjusting pricing, marketing, and customer experience efforts to capitalize on strengths and address weaknesses in a way that maximizes growth and profitability. It's important to strike a balance between offering competitive prices and providing a high-quality experience to guests in order to sustain positive business growth.

#### Chart - 14 - Correlation Heatmap

In [None]:
plt.figure(figsize=(12, 8))
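# numeric_only=True skips text columns such as room_type, which would otherwise raise an error in recent pandas versions
corr_matrix = airbnb_df.corr(numeric_only=True)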
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix for Airbnb Dataset')
plt.show()

##### 1. Why did you pick the specific chart?

 Correlation Matrix: A correlation matrix is a table that shows the correlation coefficients between many variables. Heatmaps are particularly well-suited for displaying this type of data because they can efficiently represent a large amount of information in a visually intuitive way.

 Heatmap is a widely accepted and effective choice for visualizing correlation matrices, making it easier for data analysts and viewers to identify patterns and relationships within the data.

##### 2. What is/are the insight(s) found from the chart?

1. Strength of Relationships:
A correlation matrix heatmap allows you to quickly identify the strength and direction of relationships (positive or negative) between pairs of variables. Positive correlations are indicated by warmer colors (closer to red), while negative correlations are indicated by cooler colors (closer to blue).

2. Highly Correlated Variables:
You can identify pairs of variables that are highly correlated (either positively or negatively) by looking for cells in the heatmap with strong colors (either very red or very blue). This suggests that changes in one variable are associated with changes in the other.
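Strongly correlated pairs can also be listed directly from the correlation matrix instead of being picked out by color. The following is a minimal sketch on a hypothetical `toy` frame with made-up values; it ranks each pair of numeric columns by the absolute value of its correlation.

```python
from itertools import combinations

import pandas as pd

# Made-up values; room_type is included to show that text columns are skipped
toy = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "number_of_reviews": [50, 40, 30, 20, 10],
    "minimum_nights": [1, 3, 2, 5, 4],
    "room_type": ["a", "b", "a", "b", "a"],
})

corr = toy.corr(numeric_only=True)

# One entry per unordered pair of columns, so the diagonal and mirrored cells are excluded
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})

# Strongest relationships first, regardless of sign
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations at the top alongside strong positive ones.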

#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select the specific columns you want for the pair plot
selected_columns = ['price', 'minimum_nights', 'number_of_reviews','calculated_host_listings_count']

# Create a pair plot for the selected columns
sns.pairplot(data=airbnb_df[selected_columns])

# Show the plots
plt.show()

##### 1. Why did you pick the specific chart?

The choice of a pair plot in this context is appropriate because it allows for a comprehensive exploration of relationships and distributions among the selected numerical variables in the Airbnb dataset.

##### 2. What is/are the insight(s) found from the chart?

Scatterplots: Scatterplots in a pair plot can reveal the strength and direction of relationships between variables. For example, you can look for linear or non-linear trends in the data points. Positive or negative correlations between variables may be evident.

Outliers: Outliers can be identified in scatterplots as data points that deviate significantly from the general trend. Outliers might indicate errors in the data or unusual observations.
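Outliers spotted in the scatterplots can be flagged numerically as well. The sketch below applies the common 1.5 × IQR rule to a made-up price series; this rule is a widely used convention and an assumption here, not something the notebook itself applies.

```python
import pandas as pd

# Made-up prices with one extreme value
prices = pd.Series([60, 75, 80, 90, 100, 110, 120, 150, 5000], name="price")

# Interquartile range: spread of the middle 50% of the values
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```

Flagged values are candidates for a closer look rather than automatic removal, since a very high price can be a genuine listing.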

In [None]:
airbnb_df.columns

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.


Analysis regarding property: Conduct thorough market research to understand industry trends, customer preferences, reviews, and property listings. This will help the client make informed decisions and identify opportunities and threats.

Avoid Dependency: If an area has few listings but a high number of bookings, the client should reduce its dependency on those few properties by increasing the number of listings in that area.

Pricing Optimization: Use the data to optimize pricing for listings based on factors like location, room type, and historical booking percentages to maximize revenue.

Demand Forecasting: Predict future demand for listings in different neighborhoods to help hosts better plan for availability and pricing.

Host Performance Analysis: Evaluate host performance using metrics like reviews per month, total turnover, and booking percentage to identify top-performing hosts.

Neighborhood Analysis: Analyze which neighborhoods have the highest demand and the potential for growth in the short and long term.


Marketing and Promotion: Determine which types of listings (e.g., entire homes, private rooms) are more popular and tailor marketing efforts accordingly.

Host Training and Support: Identify hosts with lower booking percentages and provide them with support or training to improve their listing's performance.

Customer Satisfaction: Use review data to assess customer satisfaction and make improvements based on feedback.
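As an illustration of the host performance and host support suggestions above, host-level metrics can be built with a single grouped aggregation. This sketch is hypothetical: it assumes a `host_id` column alongside the "Booking Percentage" column computed earlier, and the values in `toy` are made up.

```python
import pandas as pd

# Made-up listings; host_id is an assumed column name
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "number_of_reviews": [10, 30, 5, 100, 20, 0],
    "Booking Percentage": [2.0, 4.0, 50.0, 1.0, 3.0, 0.0],
})

# One row per host: listing count, total reviews, and mean booking percentage
host_summary = (
    toy.groupby("host_id")
    .agg(
        listings=("number_of_reviews", "size"),
        total_reviews=("number_of_reviews", "sum"),
        mean_booking_pct=("Booking Percentage", "mean"),
    )
    .sort_values("total_reviews", ascending=False)
)
print(host_summary)
```

Hosts near the bottom of such a table would be the natural candidates for the training and support suggestion.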


# **Conclusion**

We identified some points that can help Airbnb in its business:

● Entire home/apt listings have a high number of reviews and Shared room listings have fewer, so hosts should work on increasing reviews for shared rooms.

● Michael has a large number of bookings and also a large number of reviews.

● 17.74% of listings fall in the high-price category.

● Customers pay the highest amount in Staten Island and Manhattan, at $800.

● The top three hosts based on turnover are John, Marc & Youna, and Rachel.


Airbnb data analysis provides valuable insights for both hosts and travelers. Hosts can use these insights to optimize their listings and maximize their income, while travelers can make more informed choices to enhance their stay. However, it's essential for all parties to consider factors such as location, price, property type, and reviews to ensure a satisfying and enjoyable Airbnb experience. Additionally, staying informed about local regulations and compliance is crucial to avoid potential issues.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***