# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual (Rohit Panchal)


# **Project Summary -**


* Airbnb operates as a global online marketplace that connects hosts and travelers, facilitating short-term stays in residential properties. The company’s business model is primarily based on creating a platform where hosts can offer unique accommodations, ranging from entire homes and apartments to private or shared rooms, to guests seeking short-term rentals. As a mediator, Airbnb charges a service fee from both the hosts and guests, which is a critical part of its revenue generation strategy.

* In this exploratory data analysis (EDA) project, the primary objective is to examine Airbnb’s dataset to understand how certain business-related factors influence its profit or loss, revenue generation, and overall market dynamics. Specifically, we are tasked with identifying variables that might be driving changes in room prices, the booking patterns of different locations, and the impact of customer reviews on pricing strategies.

* The dataset provided for this analysis includes columns such as room type, price, location, reviews, and neighborhood group. However, some key columns, like last_review and reviews_per_month, are sparsely filled or missing, creating potential challenges in deriving complete insights. Filling these gaps and ensuring data integrity is crucial for making accurate business decisions.

* Our analysis revolves around several key factors:

1. The variation in room prices based on room type and location
2. How different neighborhood groups prefer specific room types and locations
3. The relationship between customer reviews and room prices

* Our goal is to not only derive insights that will help Airbnb optimize its pricing strategies and improve its service offerings but also to address any data inconsistencies or fraudulent activity that might be affecting its business. Moreover, we aim to provide recommendations for future business expansion, improved host engagement, and enhanced customer satisfaction.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Analyze the relationship between room type (Private room, Shared room, Entire home/apt) and key factors such as:

* Pricing: How prices vary by room type, location (latitude/longitude), and neighborhood.
* Availability: Room type availability patterns across different locations.
* Reviews: How reviews impact pricing for each room type.

#### Investigate:

* Price fluctuations for each room type across locations.
* Neighborhood groups' booking preferences and prices.
* Location characteristics influencing room type demand.
* Review and availability impacts on pricing for each room type.

#### Research Questions:

* How do prices differ among Private rooms, Shared rooms, and Entire home/apt across various locations?
* Which neighborhood groups book which room types at what prices?
* How do reviews and availability affect pricing for each room type?
* Which room types receive high attention from specific groups?


#### **Define Your Business Objective?**

To provide a platform that connects people looking to rent out their homes with those seeking short-term accommodations. The core objectives include:

1. Facilitate Short-Term Rentals

2. Offer Unique Experiences

3. Expand to Diverse Markets

4. Provide a Flexible Earning Opportunity for Hosts

5. Build a Community-Based Trust Model

6. Maximize Value for Both Hosts and Guests

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Important Libraries for data Analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load the dataset using pandas
data = pd.read_csv("/content/Airbnb NYC 2019.csv")

# Make all columns visible when a DataFrame is displayed
pd.set_option("display.max_columns",None)

### Dataset First View

In [None]:
# Dataset first look: head(10) shows the first 10 rows (the default is 5)
data.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count by function data.shape
Rows,Columns = data.shape
print(f"Rows: {Rows} & Columns: {Columns}")

### Dataset Information

In [None]:
# Dataset Info
data.info()

# Shows the non-null count and the data type of each column

Three columns in the dataset have a float data type, seven have an int data type, and six have an object data type.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

# This counts fully duplicated rows; checking the primary key is a more reliable way to find duplicate entries

In [None]:
# In this dataset the id column is the primary key, so check it for duplicate values
data["id"].duplicated().sum()

# There are no duplicate values in our dataset

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Count the missing values in each column and assign the result to a variable
missing_values = data.isnull().sum()

# Build a data frame containing only the columns that have missing values
missing_df = missing_values[missing_values > 0].reset_index()

# Rename the columns of the new data frame so they are easy to reference
missing_df.columns = ["columns","missing_count"]

# Show missing_df
missing_df


In [None]:
# Visualizing the missing values

# Set the figure size of the plot box
plt.figure(figsize=(6,4))

# Use sns to plot barplot to visualize the missing df
sns.barplot(data=missing_df,x="columns",y="missing_count",color="red")

# Set X labels
plt.xlabel("Columns",fontsize=12)

# Set Y label
plt.ylabel("Count",fontsize=12)

# Rotate x ticks by 90 degrees
plt.xticks(rotation=90)

# Set the title for the visualization
plt.title("Count of missing values",fontsize=14,color="green")

# Show the graph
plt.show()

#### Insight
* Our dataset contains four columns with missing values. We will impute the missing values in two of them, 'name' and 'host_name', using values from neighbouring rows, since they have relatively few empty rows. However, we will not impute missing values in 'reviews_per_month' and 'last_review', because a significant proportion of their rows are empty (over 20%), and filling them could skew our analysis and lead to inaccurate insights.
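The "over 20%" threshold above is easier to check as a percentage than as a raw count. A minimal sketch of that calculation follows; the `toy` frame and its values are made up for illustration and only stand in for the real `data` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are invented, not taken from the Airbnb data)
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Flat", "Room"],
    "host_name": ["Ann", "Bob", "Cy", "Di", "Ed"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2, 2.0],
})

# isnull() gives True/False per cell; the column mean of that is the missing share
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have gaps, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct.round(1))
```

Running the same two lines on `data` instead of `toy` would show which columns cross whatever threshold is chosen for skipping imputation.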

### What did you know about your dataset?

Here is the description for each column in the data set:

* id: Unique identifier for the listing.
* name: Name of the listing.
* host_id: Unique identifier for the host.
* host_name: Name of the host.
* neighbourhood_group: Larger geographic area the listing belongs to.
* neighbourhood: Specific neighborhood where the listing is located.
* latitude: Latitude coordinate of the listing location.
* longitude: Longitude coordinate of the listing location.
* room_type: Type of room offered (Entire home/apt, Private room, or Shared room).
* price: Price per night for the listing room based on type of room.
* minimum_nights: Minimum number of nights required for a booking.
* number_of_reviews: Total number of reviews the listing has received.
* last_review: Date of the most recent review.
* reviews_per_month: Average number of reviews per month.
* calculated_host_listings_count: Total number of listings the host has.
* availability_365: Number of available days for booking in a year.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data[["price","minimum_nights","number_of_reviews","availability_365"]].describe()

### Variables Description

* Price: The price range is broad (from 0 to 10,000), with a mean of about 152.7. Prices of 0 are not plausible for a paid listing; they might indicate typos or data-entry mistakes and should be removed before analyzing prices. The highest prices, particularly near 10,000, may be genuine listings for well-maintained, furnished rooms in prime locations, although the data alone cannot confirm this. The 25th percentile is near 70 and the 75th percentile is close to 175, so half of all listings are priced between these two values.

* Minimum Nights: The minimum number of nights ranges from 1 to 1,250, with a mean of about 7. However, 75% of listings require a minimum stay of 5 nights or fewer, so a small number of listings with much longer minimum stays pull the mean up and may need review.

* Number of Reviews: The number of reviews ranges from 0 to 629. Listings with no reviews seem valid, as they may be new or their guests may not have left reviews. Listings with many reviews are likely to be popular, frequently booked rooms; the count alone does not say whether the reviews are positive.

* Availability_365: Some rooms show 0 availability, which could mean they are fully booked, removed from the market, or rented privately by the landlord. About 25% of the listings show 0 availability, which raises questions about their status. Listings with 365 days of availability may have issues like poor reviews, a bad location, or high prices, making them less attractive. Investigating these rooms can help landlords improve them and attract more customers.
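One common way to turn the outlier remarks above into a concrete check is the 1.5 × IQR rule. This is a swapped-in convention, not something the notebook itself applies, and the `prices` values below are invented for illustration.

```python
import pandas as pd

# Hypothetical nightly prices (made up; the real column is data["price"])
prices = pd.Series([0, 60, 70, 100, 120, 175, 200, 10000], name="price")

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Zero prices are checked separately because the lower fence is usually negative
zero_price = prices[prices == 0]
outliers = prices[(prices < lower) | (prices > upper)]

print(f"IQR fences: {lower} to {upper}")
print("Zero-price rows:", len(zero_price))
print("Flagged outliers:", outliers.tolist())
```

A flagged value is only a candidate for review. Whether a very high price is an error or a genuine luxury listing still has to be judged from the other columns.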

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
Unique_values = data.nunique()
Unique_values

In [None]:
# Visualization of the unique values in data

# Set the figure size
plt.figure(figsize=(6,4))

# Plot graph for the unique values
Unique_values.plot(kind="bar")

# Set x labels
plt.xlabel("Columns",color="red")

# Set y label
plt.ylabel("Count",color="red")

# Set the title
plt.title("Count of Unique values",color="red")

# Show the graph
plt.show()

#### Insights
* id: Each value is unique, with 48,895 unique entries.
* name: Mostly unique, but some listings share the same name.
* host_id: Identifies the host; one host can own several listings, so this column has fewer unique values than id.
* price: Contains 674 unique price values, indicating that many listings share the same price.
* neighbourhood_group: Consists of 5 unique groups.
* room_type: Has 3 unique room types (Entire home/apt, Private room, Shared room).



## 3. ***Data Wrangling***

In [None]:
# check for the null values in dataset
data.isnull().sum()

In [None]:
# Check the data type of each column and correct it where it is wrong
data.info()

In [None]:
data.head()

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Fill the null rows of the name column using the bfill (backfill) method
data["name"] = data["name"].bfill()

In [None]:
# Fill the null rows of the host_name column using the bfill (backfill) method
data["host_name"] = data["host_name"].bfill()

In [None]:
# Change the data type of the last_review column: it contains dates but is stored as object
data["last_review"] = pd.to_datetime(data["last_review"])

In [None]:
# Add a price_per_night column: price divided by minimum_nights
data["price_per_night"] = data["price"] / data["minimum_nights"]

In [None]:
data.isnull().sum()

In [None]:
# After wrangling our data check for further analysis
data.info()

### What all manipulations have you done and insights you found?


* Check for Null Values:
Columns "name" and "host_name" have some null values, while "last_review" and "reviews_per_month" have more than 20% null values.

* Check Data Types:
All columns have the correct data types, except for "last_review", which is of type object but should be a date.

* Handle Null Values:
Fill null values in "name" and "host_name" using the bfill method (backfill). This fills each null row with the next valid value below it.

* Convert Data Type:
Change the data type of the "last_review" column from object to datetime using pd.to_datetime().

* Add New Column:
Create a new column called "price_per_night" by dividing the "price" by "minimum_nights".

* Final Check:
Verify that all columns, except "last_review" and "reviews_per_month", now have no null values and that all data types are correct.
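To make the backfill step concrete, here is a minimal sketch on a made-up Series. It also shows two things worth keeping in mind: a gap in the last row has nothing below it to copy, and `fillna("bfill")` inserts the literal text "bfill" rather than backfilling. The `"Unknown"` placeholder is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical listing names with two gaps (values are invented)
names = pd.Series(["Loft", np.nan, "Studio", np.nan], name="name")

# Backfill copies the NEXT valid value upward into each gap
backfilled = names.bfill()

# The trailing gap has no later value, so it is still missing after bfill();
# here it is replaced with a placeholder label (an illustrative choice)
cleaned = backfilled.fillna("Unknown")

# For comparison: passing the string "bfill" to fillna() treats it as the
# fill VALUE, so every gap becomes the text "bfill"
literal = names.fillna("bfill")

print(cleaned.tolist())
print(literal.tolist())
```

Checking `isnull().sum()` after the fill, as the final check above does, is what catches a leftover trailing gap.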

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Set the figure size
plt.figure(figsize=(5,4))

# Use count plot
sns.countplot(data=data,x="room_type",color="green")

# Set x labels
plt.xlabel("Room type",color="blue")

# Set y labels
plt.ylabel("Number of rooms",color="blue")

# Set title
plt.title("Count of each room type",color="blue")

# Show chart
plt.show()


##### 1. Why did you pick the specific chart?

A count plot is ideal for visualizing the distribution of categorical data, such as the room types in this dataset. It quickly shows how many listings belong to each category (private room, entire home/apartment, or shared room). By counting the occurrences of each room type, we can:

* Identify which room type is most common in the dataset.
* Determine the share of private rooms, entire homes/apartments, and shared rooms in the market.

##### 2. What is/are the insight(s) found from the chart?

* In the dataset, the entire home/apartment category has the largest share, with roughly 25,000 listings, making up around 50% of the data. This suggests that a significant portion of the business caters to families or groups who prefer booking entire spaces.

* The next most common category is private rooms, with over 20,000 listings. This indicates that couples or single travelers also form a substantial part of the business.

* Shared rooms contribute very little, with only a little over 1,000 listings. This shows that shared rooms are far less common than the other two types in this market.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


* The insights gained from the count plot can significantly impact our business strategy. Since entire homes/apartments make up the largest share of listings, we should focus on enhancing their appeal to travelers. Strategies could include offering promotional coupons for hosts or developing features that highlight the benefits of booking entire spaces. As we grow, prioritizing entire homes/apartments in our listings will help us attract a larger customer base.

* Private rooms, being the second-largest category, also warrant attention. We need to enhance their attractiveness, addressing potential customer concerns related to safety and privacy. If we expand into other regions, it will be essential to ensure that our offerings include a healthy mix of private rooms and entire apartments.

* Lastly, the shared room category has minimal contribution to our business. We should investigate the reasons for this low listing number. Possible factors could include location issues, privacy concerns, or quality perceptions. Identifying regions where shared rooms have higher demand versus those with low interest will be crucial for making informed decisions on how to improve this segment of our offerings.
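The shares quoted for this chart can be read off the bar heights, but they can also be computed directly. A minimal sketch, using a made-up `toy` frame in place of `data`:

```python
import pandas as pd

# Hypothetical listings (counts are invented to give round percentages)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"] * 1
})

# normalize=True returns each category's fraction of all rows
share = toy["room_type"].value_counts(normalize=True) * 100
print(share.round(1))
```

Applying the same `value_counts(normalize=True)` call to `data["room_type"]` would give the exact percentages behind the count plot.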

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Set the figure size
plt.figure(figsize=(5,4))

# Use count plot
sns.countplot(data=data,x="neighbourhood_group",color="green")

# Set x labels
plt.xlabel("Neighbourhood group",color="blue")

# Set y labels
plt.ylabel("Number of listings",color="blue")

# Set title
plt.title("Number of listings in each neighbourhood group",color="blue")

# Show chart
plt.show()


##### 1. Why did you pick the specific chart?


A count plot is a visualization tool used for categorical data. It displays the count of each category within a specified column in the dataset. Typically, the categorical variable is placed on the x-axis, while the count of occurrences is represented on the y-axis. This allows for an easy comparison of the frequency of different categories, helping to identify trends and distributions within the data.








##### 2. What is/are the insight(s) found from the chart?

* Two of the five neighbourhood groups contribute most of our business: Brooklyn and Manhattan together account for more than 80% of all listings.

* In contrast, Queens contributes a little over 5,000 listings, while Staten Island and Bronx contribute very few; combined, they make up less than 10% of the total.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The insights gained are crucial for growing our business. We should focus on the Bronx and Staten Island groups, which currently contribute very little, and investigate the reasons behind this low engagement. Potential factors could include:


1. Location issues: Are there enough listings available in these areas?
2. Quality of rooms: Are the accommodations not meeting customer expectations?
3. Pricing concerns: Are prices too high compared to the quality or amenities offered?
4. Customer preferences: What types of rooms are potential customers in these neighborhoods looking for?
* Additionally, we should also pay attention to the Queens group, as they similarly show lower contributions. It's essential to identify any specific issues affecting this group and determine if there are competitors in the market that dominate these neighborhoods.

* On the other hand, the performance of Brooklyn and Manhattan is impressive, contributing more than 80% of our business. We need to understand what drives their engagement:

1. Location advantages: Are there unique features or attractions in these neighborhoods?
2. Pricing benefits: Are our prices competitive or better than those in other areas?
3. Preferred offerings: What facilities or services do customers from these neighborhoods value the most?

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Set the figure size
plt.figure(figsize=(5,4))

# Use scatter plot
sns.scatterplot(data=data,x="room_type",y="price")

# Set x labels
plt.xlabel("Room type",color="blue")

# Set y labels
plt.ylabel("Price",color="blue")

# Set title
plt.title("Price vs Room type",color="blue")

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot displays the individual data points and is most often used to show the relationship between two continuous variables. Here one axis is categorical (room type), so the plot shows how prices are spread within each room type. A third variable, such as neighbourhood group, could be added through the hue argument, which colors the points by that variable; this chart uses only room type and price.

##### 2. What is/are the insight(s) found from the chart?

The private room and entire home/apartment categories exhibit a broad price range, starting from 0 and extending up to 10,000. Both room types have listings across the full spectrum, indicating a wide variety in pricing. Shared rooms, on the other hand, have a much smaller price range, with prices topping out at around 2,000. This difference suggests that shared rooms tend to be more budget-friendly, while private rooms and entire apartments cater to a broader range of budgets, from affordable to luxury options.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The insights from the chart are crucial for understanding and improving our business. The observation that some rooms have a price of zero raises potential concerns:

* It could be due to fraud, where landlords list their properties at zero to bypass the platform's payment system, allowing them to collect 100% of the rental fees privately.
* Alternatively, it might be a data entry mistake, although given the frequency it appears more deliberate than just a few typos.
* If deliberate, this indicates that some landlords may be using the platform to advertise their properties for free and then handle transactions off-platform, which could hurt our revenue model. Implementing stricter verification or validation for pricing could help address this problem.

* On the other hand, the high prices for many private rooms and entire homes/apartments (greater than 2,000) seem legitimate. Properties in prime locations or those that are well-maintained and furnished typically command higher prices.

* As for shared rooms, the data shows that people are less willing to pay more, with prices staying below about 2,000. This suggests that shared rooms cater to budget-conscious travelers, and their limited price range reflects this.
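The price ranges and the number of zero-price listings discussed above can be confirmed with a small table instead of being read off the scatter plot. This is a sketch on invented values; on the real data the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical listings (prices are made up for illustration)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room",
                  "Shared room", "Shared room"],
    "price": [0, 300, 80, 2500, 40, 60],
})

# Lowest, typical, and highest price within each room type
price_range = toy.groupby("room_type")["price"].agg(["min", "median", "max"])

# How many listings have a price of exactly zero
zero_price_count = int((toy["price"] == 0).sum())

print(price_range)
print("Zero-price listings:", zero_price_count)
```

Knowing the exact zero-price count helps decide whether the fraud hypothesis or the data-entry explanation is more plausible.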

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Set the figure size
plt.figure(figsize=(12,6))

# Use scatter plot of the listing coordinates, colored by room type
sns.scatterplot(data=data,hue="room_type",x="longitude",y="latitude")

# Set x labels
plt.xlabel("Longitude",color="blue")

# Set y labels
plt.ylabel("Latitude",color="blue")

# Set title
plt.title("Listing locations by room type",color="blue")

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

* A scatter plot is ideal for comparing data in three dimensions. In our case, we can use the longitude and latitude columns to map the locations and add a third dimension using room type. This allows us to visually identify:

* Which room type (private room, entire apartment, or shared room) is preferred in specific locations.
* Whether private rooms are more common in certain neighborhoods, or if entire apartments dominate in others.

* This geographical distribution can help us understand customer preferences based on location, allowing us to tailor our offerings and marketing strategies to target specific regions more effectively.

##### 2. What is/are the insight(s) found from the chart?

* The scatter plot reveals that all types of rooms (private rooms, entire apartments, and shared rooms) appear across every region, with no clear preference for a specific room type in any particular location. This suggests that customers choose room types based on their individual requirements rather than location-based preferences. The absence of a specific distribution pattern indicates that factors other than location drive the choice of room type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


* It seems that the output from the scatter plot is not as useful for making impactful business decisions. The mixed data shows that all room types (private rooms, entire apartments, and shared rooms) are booked across every location, so we can't link specific room types to particular regions. While shared rooms have fewer occurrences, they still appear everywhere, making it difficult to draw any clear conclusions from this chart.

* As a result, this plot doesn’t provide actionable insights for decision-making or business growth strategies. We might need to explore other aspects of the data, such as pricing trends, customer preferences, or booking patterns, to gather more meaningful insights for improving our services.

#### Chart - 5

In [None]:
# See how room types are distributed among listings that are available for all 365 days
availability_365 = data["room_type"][data["availability_365"] == 365].value_counts()
availability_365

In [None]:
# See how room types are distributed among listings with 0 days of availability
availability_0 = data["room_type"][data["availability_365"] == 0].value_counts()
availability_0

In [None]:
# Chart - 5 visualization code

# Create one figure with two side-by-side axes for the two charts
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))

# Use pie chart to show the distribution of room type in 365 days
ax1.pie(availability_365,labels=availability_365.index,autopct="%1.1f%%")
ax1.set_title("availability 365 days",color="red")

# Use pie chart to show the distribution of room type in 0 days
ax2.pie(availability_0,labels=availability_0.index,autopct="%1.1f%%")
ax2.set_title("availability 0 days",color="red")

# show the chart
plt.show()




##### 1. Why did you pick the specific chart?

* A pie chart shows how a categorical variable (here, room type) is split and how much each category contributes to the whole.

We draw one pie chart for each of two availability groups:

* Listings available for 365 days
* Listings available for 0 days

##### 2. What is/are the insight(s) found from the chart?

* Among listings available for 365 days, 52.6% are private rooms, 37% are entire homes/apartments, and about 10% are shared rooms.

* Among listings available for 0 days, the distribution is slightly different: 50.6% are entire homes/apartments, 47.7% are private rooms, and only 1.7% are shared rooms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Private rooms make up the largest share of the listings available for 365 days, which raises concerns. We need to investigate why these rooms remain available for the entire year. Are they not being booked by anyone, or are they simply open for services throughout the year? The same inquiry should be made for entire homes/apartments and shared rooms to understand why they have year-long availability.

* On the other hand, we also need to focus on rooms that are unavailable for the entire year. What's the reason behind this? Are they fully booked by someone for the year, or is there another reason for their unavailability?

* Since a significant portion of our business revenue comes from entire homes/apartments and private rooms, it's crucial to investigate why they are not available year-round. This could impact our revenue and operations, so it needs careful consideration.
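The pie charts answer "among fully available listings, which room types are they?". A different question is "within each room type, what fraction is fully available?", and the two are easy to confuse. A minimal sketch of both calculations, on an invented `toy` frame:

```python
import pandas as pd

# Hypothetical listings (availability values are made up)
toy = pd.DataFrame({
    "room_type": ["Private room"] * 4 + ["Entire home/apt"] * 4,
    "availability_365": [365, 365, 0, 120, 365, 0, 0, 200],
})

# Boolean flag: is the listing available all year?
full_year = toy["availability_365"] == 365

# Reading 1 (what the pie chart shows):
# among full-year listings, the share belonging to each room type
share_among_full_year = (
    toy.loc[full_year, "room_type"].value_counts(normalize=True) * 100
)

# Reading 2: within each room type, the share of listings that are full-year
rate_within_type = full_year.groupby(toy["room_type"]).mean() * 100

print(share_among_full_year.round(1))
print(rate_within_type.round(1))
```

The second reading controls for how many listings each room type has, so it is the more direct input for deciding which room type has an availability problem.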

#### Chart - 6

In [None]:
# Count the listings of each room type in each neighbourhood group using groupby
count_room_type = data.groupby(["neighbourhood_group","room_type"]).size().reset_index(name="count")
count_room_type

In [None]:
# Chart - 6 visualization code

# Set the figure size
plt.figure(figsize=(10,4))

# Use grouped bar plot
sns.barplot(data=count_room_type,y="count",x="neighbourhood_group",hue="room_type")

# Set x labels
plt.xlabel("Neighbourhood group",color="blue")

# Set y labels
plt.ylabel("Number of rooms",color="blue")

# Set title
plt.title("Number of rooms by neighbourhood group and room type",color="blue")

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

* A bar chart is used for displaying the categorical distribution of the data. In this case, we have 5 neighborhood groups and 3 types of rooms, and we need to find how many rooms of each type belong to each group. A grouped bar chart shows this in one chart: the x-axis shows the groups, and each room type is drawn as a separate bar within every group.

##### 2. What is/are the insight(s) found from the chart?

* Insights from the chart show that out of the 5 groups, 2 are the most active in our business, 2 are much less active, and 1 is moderately active. Brooklyn and Manhattan are the most active, and very few of their rooms are shared rooms; they contribute significantly to our business through apartments and private rooms. Manhattan has the highest number of apartments of any group, followed by Brooklyn and then Queens. In Brooklyn and Queens, private rooms are the most common room type, and the remaining groups, Bronx and Staten Island, also have more private rooms than apartments.
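Because the groups differ so much in size, raw counts make it hard to compare the room-type mix of a large group with that of a small one. A row-normalized cross-tabulation is one way to compare them on equal footing; the sketch below uses invented rows rather than the real `data` frame.

```python
import pandas as pd

# Hypothetical listings (made up to give round percentages)
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4,
    "room_type": ["Entire home/apt"] * 3 + ["Private room"]
                 + ["Entire home/apt"] * 2 + ["Private room"] * 2,
})

# normalize="index" makes each row (group) sum to 1, i.e. the mix within a group
mix = pd.crosstab(
    toy["neighbourhood_group"], toy["room_type"], normalize="index"
) * 100

print(mix.round(1))
```

Each row then reads as "of the listings in this group, this percentage is of each room type", regardless of how many listings the group has.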

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The insights we gathered from the chart are helpful for growing our business in terms of revenue generation and expansion. We see that Manhattan is primarily interested in booking apartments, so we should provide more apartment-related suggestions to this group. Brooklyn tends to book more private rooms, but their interest in apartments is not far behind. To increase profit, we should offer more benefits for booking apartments in Brooklyn. Queens also shows a strong interest in private rooms, so we should apply a similar strategy for this group and look for ways to moderately increase our business within the Queens group. Lastly, we need to pay serious attention to Bronx and Staten Island, as these groups are much less active in the business.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Set the figure size
plt.figure(figsize=(8,4))

# Use histogram
sns.histplot(data=data,x="price",bins=10)

# Set x labels
plt.xlabel("Price",color="red")

# Set y labels
plt.ylabel("Count",color="red")

# Set title
plt.title("Distribution of price across 10 bins",color="red")

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

* Histograms are used to visualize the distribution of continuous data across different bins. We can set as many bins as needed. In this case, we have a price column for rooms, which varies based on location, type of room, and other factors. To understand how prices are distributed in the dataset, we use a histogram of the price column.

##### 2. What is/are the insight(s) found from the chart?

* We can clearly see that more than 90% of the prices in our data fall between 0 and 1000, while the remaining prices are spread thinly across the other bins. The price distribution is right-skewed: most prices are concentrated at the low end, with a long tail of expensive listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The given insights are helpful for growing our business and for deciding where to look for fraudulent activity. In this dataset, more than 90% of the prices fall between 0 and 1000, and with only ten bins the histogram hides all detail inside that range, so it deserves a closer look. In my view, one possibility is that some landlords show lower prices for apartments and private rooms to attract more customers and then demand higher amounts later. This could be an attempt to keep part of the transaction off the platform or to avoid paying platform fees, but the chart alone cannot confirm it.
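The shape of the price distribution can also be summarized numerically, which avoids depending on the choice of bins. A minimal sketch with invented prices follows; `Series.skew()` returns a positive value when the long tail is on the high side.

```python
import pandas as pd

# Hypothetical nightly prices with one very expensive listing (values made up)
prices = pd.Series([50, 60, 70, 80, 90, 100, 120, 150, 200, 5000])

# Sample skewness: positive means a long tail toward high prices
skewness = prices.skew()

# Share of listings priced below 1000
share_under_1000 = (prices < 1000).mean() * 100

# A mean far above the median is another sign of a long right tail
print("skewness:", round(skewness, 2))
print("share under 1000:", share_under_1000)
print("mean:", prices.mean(), "median:", prices.median())
```

For a closer look at the crowded low end, the histogram could be redrawn on a restricted range (for example, prices below 1000) or with more bins.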

#### Chart - 8

In [None]:
# Find the sum of the price paid by group
price_sum = data.groupby(["neighbourhood_group"]).agg({"price":"sum"}).reset_index()
price_sum

In [None]:
# Chart - 8 visualization code

# Set the figure size
plt.figure(figsize=(10,4))

# Draw a pie chart and print each group's percentage on its slice
plt.pie(data=price_sum,x="price",labels="neighbourhood_group",autopct="%1.1f%%")

# Set title
plt.title("Contribution in price by Each Group",color="blue")

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

* A pie chart is used to show the proportion of each category in relation to the whole. In this case, we want to visualize the total sum of listing prices for each neighbourhood group. With the help of a pie chart, we can clearly see the percentage contribution of each group to the overall total.

##### 2. What is/are the insight(s) found from the chart?

* The insights from the pie chart are consistent with other factors in the dataset. Manhattan's contribution is the highest, followed by Brooklyn and then Queens. The two groups, Bronx and Staten Island, are much less active, as they contribute very little.
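The ranking read from the pie chart can also be expressed as exact percentages. This is a minimal sketch on a hypothetical `sample` frame that reuses the notebook's column names `neighbourhood_group` and `price`; the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the notebook's `data` DataFrame
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Queens", "Bronx"],
    "price": [200, 300, 150, 100, 50],
})

# Total listed price per group
price_sum = sample.groupby("neighbourhood_group")["price"].sum()

# Each group's share of the overall total, in percent, largest first
share_pct = (price_sum / price_sum.sum() * 100).sort_values(ascending=False)

print(share_pct.round(1))
```

Note that a sum of listed prices grows with the number of listings in a group, so a large share may reflect many listings rather than high prices.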

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The insights we gathered from the chart are helpful for growing our business in terms of revenue generation and expansion. We see that Manhattan is primarily interested in booking apartments, so we should provide more apartment-related suggestions to this group. Brooklyn tends to book more private rooms, but their interest in apartments is not far behind. To increase profit, we should offer more benefits for booking apartments in Brooklyn. Queens also shows a strong interest in private rooms, so we should apply a similar strategy for this group and look for ways to moderately increase our business within the Queens group. Lastly, we need to pay serious attention to Bronx and Staten Island, as these groups are much less active in the business.

#### Chart - 9

In [None]:
# How reviews affect the price of the room type
price_to_reviews = data.groupby(["room_type"]).agg({"price":"mean","reviews_per_month":"sum"}).reset_index()
price_to_reviews

In [None]:
# Chart - 9 visualization code

# Set the figure size
plt.figure(figsize=(8,4))

# Use line plot (color is omitted because hue already sets the color per room type)
sns.lineplot(data=price_to_reviews,x="price",y="reviews_per_month",hue="room_type",linewidth=2,marker="o")

# Set x label
plt.xlabel("Average price",color="red")

# Set y label
plt.ylabel("Reviews per month (sum)",color="red")

# plot and change the color and style of the grid
plt.grid(color="green",linestyle="--",linewidth=0.5)

# Set title
plt.title("Review vs Price",color="red")

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

* A line chart is used for continuous data to show the relationship between two variables. It displays data points connected by straight lines, making it easy to observe how one value changes as the other increases. Line charts are particularly effective for highlighting changes in values, identifying patterns, and comparing multiple datasets.

##### 2. What is/are the insight(s) found from the chart?

* The insights from the line chart show that prices tend to be lower for rooms with fewer reviews and higher for those with more reviews across all room types. However, the relationship between price and reviews is not directly proportional. While lower reviews are associated with lower prices, as reviews increase more than tenfold, the price does not increase at the same rate, indicating a weaker connection between price and review count.
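Because the aggregated table holds only one point per room type, a listing-level check may give a more reliable picture of the price and review relationship. The sketch below is an assumption-laden illustration: `sample` is a hypothetical frame with invented values, and the Pearson correlation is computed separately within each room type.

```python
import pandas as pd

# Hypothetical listings; column names follow the notebook, values are invented
sample = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "price": [150, 200, 250, 300, 60, 70, 80, 90],
    "reviews_per_month": [2.0, 1.5, 1.0, 0.5, 0.5, 1.0, 1.5, 2.0],
})

# Pearson correlation between price and reviews_per_month within each room type
corr_by_type = {
    room_type: group["price"].corr(group["reviews_per_month"])
    for room_type, group in sample.groupby("room_type")
}

for room_type, corr in corr_by_type.items():
    print(f"{room_type}: {corr:+.2f}")
```

Values near zero on the real data would support the conclusion that reviews and price are only weakly connected.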

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* While the insights gained are somewhat helpful for growing our business, they are not entirely reliable. Some customers are not interested in leaving reviews, and others may provide them without much thought. We cannot assume that a high number of reviews always means a high price; the price could instead reflect factors like local facilities or location. For example, a property in a prime area with limited facilities might have few reviews yet remain expensive, while a well-reviewed property in a less central area stays cheaper. To improve accuracy, we should encourage customers to provide genuine reviews as much as possible, perhaps by requesting feedback after they’ve received the service.

#### Chart - 10 - Correlation Heatmap

In [None]:
# Choose data that are required for heat map
num_data = data[["latitude","longitude","price","minimum_nights","number_of_reviews","last_review","availability_365"]]
num_data

In [None]:
# Correlation Heatmap visualization code

# Set the figure size
plt.figure(figsize=(8,4))

# Use heatmap (numeric_only=True skips non-numeric columns such as a text last_review)
sns.heatmap(num_data.corr(numeric_only=True),annot=True)

# Set title
plt.title("Correlation Heatmap",color="red")

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

* A correlation heatmap shows the pairwise correlation between numerical columns, indicating how strongly two variables move together (it does not by itself show that one causes the other). In our case, we have columns like 'price' and 'number_of_reviews', and the heatmap is used to visualize their relationship. A value of 1 indicates a perfect positive correlation, 0 indicates no linear correlation, and -1 indicates a perfect negative correlation between the variables.

##### 2. What is/are the insight(s) found from the chart?

* The insights from the chart are not very promising, as none of the columns in our dataset show a strong correlation. Some variables have a slight positive or negative correlation, but overall, the price of the room is not significantly affected by any single variable in the dataset. There are several other factors displayed in the chart, but they also lack strong relationships.
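To read the heatmap's price row more directly, the correlations with `price` can be listed and sorted by strength. This is a sketch on a hypothetical `sample` frame with invented values; `numeric_only=True` is used on the assumption that `last_review` may be stored as text, in which case it is left out of the calculation.

```python
import pandas as pd

# Hypothetical listings; column names follow the notebook, values are invented
sample = pd.DataFrame({
    "price": [50, 80, 120, 200, 400],
    "minimum_nights": [1, 2, 3, 2, 30],
    "number_of_reviews": [100, 60, 40, 10, 2],
    "availability_365": [10, 200, 365, 90, 300],
    "last_review": ["2019-06-01", "2019-05-20", None, "2019-01-15", None],
})

# Correlation of every numeric column with price, strongest first
corr_with_price = (
    sample.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(corr_with_price.round(2))
```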

#### Chart - 11 - Pair Plot

In [None]:
# Select data for the pair plot
pair = data[["price","minimum_nights","number_of_reviews","last_review","availability_365","room_type"]]

In [None]:
# Pair Plot visualization code

# Use pair plot; pairplot creates its own figure, so a separate plt.figure call
# would only produce an extra empty figure
sns.pairplot(pair,hue="room_type")

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

* A pair plot is used to visualize the relationships between multiple variables in a dataset. It creates a grid of scatter plots for each pair of variables, making it possible to observe patterns, correlations, and distributions.

##### 2. What is/are the insight(s) found from the chart?

* In every case, we observe that the price for entire homes/apartments is consistently high, followed by private rooms. Regarding availability and minimum nights, the price for entire homes/apartments does not fluctuate much and remains high across all scenarios.
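The ordering of room types by price can be confirmed with a simple summary table. The sketch below uses a hypothetical `sample` frame with invented prices; the median is shown next to the mean because a few very expensive listings can pull the mean upward.

```python
import pandas as pd

# Hypothetical listings; column names follow the notebook, values are invented
sample = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Entire home/apt",
        "Private room", "Private room",
        "Shared room", "Shared room",
    ],
    "price": [180, 220, 70, 90, 40, 60],
})

# Median, mean and number of listings per room type, most expensive first
price_by_type = (
    sample.groupby("room_type")["price"]
    .agg(["median", "mean", "count"])
    .sort_values("median", ascending=False)
)

print(price_by_type)
```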

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Airbnb’s core business model focuses on building a trusted community where hosts and guests can mutually benefit. The platform empowers hosts by giving them a flexible way to generate income from their property, while guests gain access to a diverse range of accommodations that traditional hotels might not offer. To maintain and expand this model, Airbnb needs to continuously analyze market trends, host behavior, customer preferences, and pricing strategies.

1. Room Type and Price Analysis:

* From the dataset, we observed that different room types have varying price ranges, which significantly influence Airbnb’s revenue. The most common room types are entire homes or apartments, private rooms, and shared rooms. According to our analysis, entire homes/apartments and private rooms are the most frequently booked room types. This trend is reflected in the higher number of bookings and availability for these types, which suggests that customers prefer more private and personal spaces. Shared rooms, on the other hand, are less popular, potentially due to concerns over privacy or personal space.

* Pricing is a critical component of Airbnb’s revenue. We found that the average price across all room types is around $152 per night. However, there are certain outliers where rooms are priced as low as 0, which raises a red flag. This could indicate either a data entry mistake or, more concerningly, fraudulent behavior on the part of hosts. Some hosts might list rooms at a price of 0 to attract potential customers but charge them higher rates through private transactions to avoid platform fees. Such activities need to be closely monitored and rectified to maintain transparency and trust on the platform.

2. Location-based Price Variations:

* The location of a listing plays a crucial role in determining its price. From our analysis, it is evident that properties located in popular areas like Brooklyn and Manhattan tend to command higher prices. These areas contribute significantly to Airbnb’s revenue due to their desirability among travelers, who are often willing to pay a premium for staying in central or attractive locations. For example, rooms with scenic views in hilly areas or properties near the sea tend to have higher price tags due to the location’s appeal.

* On the flip side, neighborhoods like Bronx and Staten Island contribute less to the overall business. This could be due to lower demand in these regions or fewer hosts offering properties. To expand Airbnb’s business in these areas, the company may need to investigate why they are underperforming and introduce targeted marketing campaigns or incentives to attract more guests and hosts. Expanding the range of available properties in underrepresented locations could also enhance the diversity of options for travelers.

3. The Role of Reviews in Pricing and Bookings:

* One of the assumptions made during our analysis was that customer reviews would strongly influence room prices. Generally, properties with higher reviews are expected to be priced higher, given that positive feedback typically leads to higher demand. However, the data revealed that there is no clear, strong correlation between reviews and prices. While some highly reviewed properties are priced higher, many listings with a significant number of positive reviews still have low prices. This could be due to various factors, such as the host’s pricing strategy, location, or room type.

* Moreover, we noticed that many columns related to reviews, such as last_review and reviews_per_month, contain missing data. This lack of information limits our ability to fully understand customer feedback patterns and their impact on pricing. It is essential for Airbnb to encourage hosts to fill in these fields accurately so that the company can make more informed decisions based on customer feedback. By improving the data collection process, Airbnb can gain deeper insights into how reviews affect booking trends and room prices.

4. Neighborhood Groups and Customer Preferences:

* The dataset reveals that Airbnb operates in five main neighborhood groups, but there is potential to expand into more regions or create additional groups within existing markets. Each neighborhood group shows different preferences for room types and locations. For instance, Manhattan and Brooklyn are hotspots for entire homes and apartments, while other groups might prefer different types of accommodations.

* To increase its market share, Airbnb could consider expanding its offerings by introducing new types of rooms, such as garden rooms, poolside rooms, or family-friendly accommodations. These new options could cater to specific customer preferences and attract new demographics, thereby boosting revenue.

5. Addressing Fraudulent Listings and Data Integrity:

* One of the critical issues uncovered during our analysis is the presence of listings with a price of $0. While there could be legitimate reasons for some properties being listed at such prices (e.g., promotions or discounts), it is more likely that these represent data entry errors or fraudulent activities by hosts. Fraudulent listings, where hosts underreport prices to avoid platform fees, can damage Airbnb’s reputation and result in lost revenue. To mitigate this risk, Airbnb should implement stronger data validation processes to ensure the accuracy of pricing information and prevent misuse of the platform.
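The data validation suggested above could start with a few simple flags. The following is a minimal sketch, not Airbnb's actual process: `flag_suspect_listings`, the `max_price` threshold of 5000, and the `id` column are hypothetical names and values chosen for illustration, and the `sample` rows are invented.

```python
import pandas as pd

# Hypothetical listings; `id` and all values are assumptions for illustration
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [0, 120, 85, 10000],
    "number_of_reviews": [5, 0, 12, 1],
    "reviews_per_month": [0.4, None, 1.1, 0.1],
})


def flag_suspect_listings(df, max_price=5000):
    """Return a copy of df with boolean flag columns (threshold is an assumption)."""
    flagged = df.copy()
    # A price of 0 or below is either a data entry error or worth a manual check
    flagged["zero_price"] = flagged["price"] <= 0
    # Unusually high prices are flagged as well, using an arbitrary cut-off
    flagged["extreme_price"] = flagged["price"] > max_price
    # Missing review activity limits any review-based analysis
    flagged["missing_reviews_per_month"] = flagged["reviews_per_month"].isna()
    # A listing needs review if any price flag is raised
    flagged["needs_review"] = flagged[["zero_price", "extreme_price"]].any(axis=1)
    return flagged


flagged = flag_suspect_listings(sample)
print(flagged.loc[flagged["needs_review"], ["id", "price"]])
print("Rows missing reviews_per_month:", int(flagged["missing_reviews_per_month"].sum()))
```

Flagging rather than dropping rows keeps the original data intact, so each suspicious listing can still be inspected before any decision is made.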

# **Conclusion**

* In conclusion, this EDA project has provided valuable insights into the factors influencing Airbnb’s business. The analysis shows that room type, location, and host activity play significant roles in determining the company’s revenue generation. Brooklyn and Manhattan are key contributors to Airbnb’s success, while areas like Bronx and Staten Island require more attention to boost engagement and revenue.

* The findings also suggest that Airbnb could expand its room type offerings to cater to more diverse customer needs, such as introducing specialized rooms for families, nature lovers, or luxury travelers. Improving data integrity and addressing fraudulent listings are critical steps in ensuring that Airbnb continues to provide a transparent and trustworthy platform.

* By leveraging the insights gained from this analysis, Airbnb can refine its pricing strategies, enhance host and guest experiences, and expand into new markets, ultimately maximizing value for both hosts and guests while continuing to grow its business globally.

## Note

"All the explanations in this project for every chart, summary, and conclusion were written by me, and I used ChatGPT to rewrite everything".
