<a href="https://colab.research.google.com/github/Nakulcj7/EDA-project/blob/main/Airbnb_booking_analysis(capstone_project).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -Airbnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
**Name** - Nakul CJ


# **Project Summary -**

Airbnb, Inc. is an American company that operates an online marketplace for lodging, homestays, vacation rentals, and tourism activities. Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present a more unique, personalized way of experiencing the world.

The Airbnb NYC 2019 dataset consists of information on 48,895 Airbnb listings in New York City, and the main objective of this project was to conduct an EDA of the dataset to obtain valuable insights into the Airbnb market in the city. To accomplish this goal, the EDA was divided into two primary phases: data cleaning and preprocessing, and exploratory analysis.

The first stage involved identifying and addressing missing or inconsistent data, eliminating irrelevant variables and outliers, and ensuring that the data was appropriate for analysis. Following the data cleaning and preprocessing, the exploratory analysis was carried out, which involved examining the variables' distributions, identifying patterns and relationships between variables, and using visualizations to obtain insights into the Airbnb market in New York City. This process revealed some interesting findings, such as the most popular neighbourhoods, the most common room types, and the average price of listings in different neighbourhoods.

This report presents insights and recommendations from the EDA for both Airbnb hosts and the company. Hosts in high-demand neighbourhoods are advised to offer discounts to gain a competitive edge, while those in less popular areas should consider lowering their prices or providing more significant discounts to attract more guests.

This project demonstrated the importance of data analysis in comprehending and enhancing business performance. Through analyzing the Airbnb NYC 2019 dataset, valuable insights were acquired, which could be used by Airbnb hosts and the company to make better-informed decisions. The project emphasizes the potential of data analysis to improve business performance and highlights the significance of making data-driven decisions.



# **GitHub Link -**

https://github.com/Nakulcj7/EDA-project.git

# **Problem Statement**



The Airbnb dataset is used to understand the factors that impact booking rates and customer satisfaction, in order to identify areas for improvement and optimize revenue.

#### **Define Your Business Objective?**

The business objective for performing an exploratory data analysis (EDA) on the Airbnb dataset is to identify opportunities to improve the user experience and drive revenue growth. By analyzing patterns and trends in the data, the company can gain insights into customer preferences and behaviors, as well as identify potential areas for improvement in terms of pricing and availability. This information can be used for strategic decisions around marketing, product development, and service enhancements to ultimately increase bookings, customer satisfaction, and profitability.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime
from datetime import date

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = ('/content/drive/My Drive/')

### Dataset First View

In [None]:
# Dataset First Look
airbnb_df = pd.read_csv(path + 'Airbnb NYC 2019.csv')

In [None]:
airbnb_df.head()

In [None]:
airbnb_df.shape

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# DataFrame.shape is a (rows, columns) tuple, so it can be unpacked directly
rows_count, columns_count = airbnb_df.shape
print(f"Number of rows: {rows_count}")
print(f"Number of columns: {columns_count}")



### Dataset Information

In [None]:
# Dataset Info
airbnb_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Identifying duplicates
duplicate_values = airbnb_df.duplicated()
# print(duplicate_values)
# Counting the number of duplicates
duplicate_values_count = duplicate_values.sum()
print(f'There are {duplicate_values_count} duplicate rows in the dataset.')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_count = airbnb_df.isnull().sum()
print(missing_count)

In [None]:
# Visualizing the missing values
missing_count.plot(kind='bar', color='Green', width= 0.7)

### What did you know about your dataset?
From the basic analysis, four columns have missing/null values: name, host_name, last_review and reviews_per_month.
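As a small illustration of the missing-value check, the sketch below reports null counts and percentages for the columns that contain nulls. The toy DataFrame and the helper name `missing_summary` are assumptions made for this example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration)
toy_df = pd.DataFrame({
    'name': ['Cozy loft', None, 'Sunny room', 'Studio'],
    'host_name': ['Ana', 'Ben', None, 'Dan'],
    'price': [120, 80, 60, 200],
    'reviews_per_month': [0.5, np.nan, np.nan, 1.2],
})

def missing_summary(df):
    """Return null counts and percentages for columns with at least one null."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    return summary[summary['missing_count'] > 0].sort_values('missing_count', ascending=False)

toy_missing = missing_summary(toy_df)
print(toy_missing)
```

Reporting percentages alongside counts makes it easier to judge whether dropping rows or dropping a whole column is the less costly option.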


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns =list(airbnb_df.columns)
print(columns)

In [None]:
# Dataset Describe
airbnb_df.describe()

In [None]:
# Dataset Describe(on selected columns)
numerical_colums_description = airbnb_df[['price','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']]
numerical_colums_description.describe()

### Variables Description

There are some problems with the dataset,such as:


*   The minimum price is zero, which cannot be correct for a property and probably indicates incorrect data entry in the price column.
*   The minimum availability is zero days for some properties. These may be old properties that are no longer available for booking, new properties that are listed but not yet bookable, or both.

*  The minimum number of reviews is zero. These are probably new properties that have been listed recently and have not yet received any reviews.
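One way to quantify the issues listed above is to count how many rows hold a zero in each of the affected columns. This is a minimal sketch on made-up values; the toy DataFrame is an assumption, and only the column names come from the dataset.

```python
import pandas as pd

# Toy data (illustrative values only)
toy_df = pd.DataFrame({
    'price': [0, 150, 80, 0],
    'availability_365': [0, 365, 0, 0],
    'number_of_reviews': [0, 12, 3, 0],
})

# Comparing the whole frame to 0 gives a boolean frame; summing counts the True values per column
zero_counts = (toy_df == 0).sum()
print(zero_counts)
```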







### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
def num_of_unique_elements(columns):
  # Print the number of distinct non-null values in each column
  for column_name in columns:
    print(f'Number of unique elements in {column_name}: {airbnb_df[column_name].nunique()}')

# The function prints its results and returns None, so it is called directly instead of being wrapped in print()
num_of_unique_elements(columns)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Making a copy of dataset before making changes to it
airbnb = airbnb_df.copy()

In [None]:
# Finding columns null values or missing values
airbnb.isnull().sum()

In [None]:
# Removing rows with the missing names of the property name

airbnb = airbnb[~airbnb['name'].isnull()]

airbnb.isnull().sum()


In [None]:
# Removing the column "reviews_per_month"

if 'reviews_per_month' in airbnb.columns:
  airbnb.drop('reviews_per_month', axis = 1, inplace = True)

airbnb.isnull().sum()


In [None]:
# Removing the rows corresponding to null values for "host_name"

airbnb= airbnb[~airbnb['host_name'].isnull()]
airbnb.isnull().sum()

In [None]:
# Checking whether the number of rows with number_of_reviews equal to 0 matches the number of missing values in last_review

# Number of missing values in last_review
count_1 = airbnb['last_review'].isnull().sum()
print(f"The missing values in the last_review are: {count_1}")

# Number of rows where number_of_reviews is 0
count_2 = (airbnb['number_of_reviews'] == 0).sum()
print(f"The values in the number_of_reviews as 0 are: {count_2}")


This indicates that the number of rows with number_of_reviews equal to 0 is exactly equal to the number of missing values in last_review, which is expected.
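Equal counts suggest, but do not strictly prove, that the same rows are involved. A stricter check compares the two conditions row by row, as in this sketch on toy data (the toy values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy_df = pd.DataFrame({
    'number_of_reviews': [0, 5, 0, 2],
    'last_review': [np.nan, '2019-05-01', np.nan, '2018-11-20'],
})

no_reviews = toy_df['number_of_reviews'] == 0
no_last_review = toy_df['last_review'].isnull()

# True only if every row with zero reviews is also a row with a missing last_review, and vice versa
same_rows = bool((no_reviews == no_last_review).all())
print(same_rows)
```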

Assumption

To fill in the missing values in the last_review column, the approach taken is to replace them with the most recent date available in the column. This assumption is based on the belief that properties with 0 number_of_reviews are typically new listings on Airbnb. However, it is important to note that there could be some older properties with zero reviews, although this scenario is rare.
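The fill strategy described above can be sketched on a toy Series as follows. The values and the `was_imputed` flag are assumptions added for illustration; the flag is not in the original notebook, but it is one way to keep imputed rows identifiable later.

```python
import pandas as pd

# Toy date strings with gaps (illustrative values only)
toy_dates = pd.Series(['2019-07-08', None, '2018-01-15', None])

# Parse to datetime; missing or unparseable entries become NaT
parsed = pd.to_datetime(toy_dates, errors='coerce')

# Remember which rows were missing before filling
was_imputed = parsed.isnull()

# Fill the gaps with the most recent observed date
filled = parsed.fillna(parsed.max())
print(filled)
```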


In [None]:
# Filter rows with non-null values for 'last_review' column
# .copy() makes an independent DataFrame, so the assignment below does not trigger a SettingWithCopyWarning
last_review_null_removed = airbnb[~airbnb['last_review'].isnull()].copy()

# Convert format of 'last_review' column from string to datetime
last_review_null_removed['last_review'] = pd.to_datetime(last_review_null_removed['last_review'], format='%Y-%m-%d')

# Print information about the modified dataframe
last_review_null_removed.info()

In [None]:
# Convert the 'last_review' column of the main dataframe to datetime format
airbnb['last_review'] = pd.to_datetime(airbnb['last_review'], errors='coerce')

# Find the latest date in the 'last_review' column
latest_date = airbnb['last_review'].max()

# Replace null values with the latest date
# (assigning the result back avoids calling fillna with inplace=True on a column selection)
airbnb['last_review'] = airbnb['last_review'].fillna(latest_date)

# Check the number of null values in the dataframe
airbnb.isnull().sum()

In [None]:
# There is also a problem with the price column

airbnb['price'].describe()

The minimum value of price can't be zero

In [None]:
#checking how many values in price column are zero.

airbnb[airbnb['price'] == 0]

In [None]:
# Since only a few rows have zero as the price, these rows can be removed from the dataset
airbnb = airbnb[airbnb['price'] != 0]
airbnb

### What all manipulations have you done and insights you found?

The columns name, host_name, last_review and reviews_per_month had missing/null values.

*   The name column has 16 missing values, so the rows with a missing name have been removed.
*   The host_name column has 21 missing values, so the rows with a missing host_name have been removed.

*   The reviews_per_month column has 10037 missing values. These cannot be calculated from the number_of_reviews column because the dataset does not specify how many months each property has been listed. Had that been provided, reviews_per_month could have been calculated by dividing number_of_reviews by the number of months for which the property has been listed. The reviews_per_month column has therefore been dropped.
*   To fill in the missing values in the last_review column, they are replaced with the most recent date available in the column. This assumption is based on the belief that properties with zero number_of_reviews are typically new listings on Airbnb. However, there could be some older properties with zero reviews, although this scenario is considered unlikely.

*   The rows with a price of zero have been removed from the dataset, as there were very few such rows and a property's price can't be zero.
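The cleaning steps listed above can also be collected into a single function, which makes the sequence easier to rerun and test. This is a sketch under assumptions: the function name `clean_listings` and the toy data are hypothetical, and the notebook itself performs these steps cell by cell.

```python
import numpy as np
import pandas as pd

def clean_listings(df):
    """Apply the cleaning steps described above and return a new DataFrame."""
    # Drop rows without a property name or host name
    cleaned = df.dropna(subset=['name', 'host_name'])
    # Drop reviews_per_month if it is present
    cleaned = cleaned.drop(columns=['reviews_per_month'], errors='ignore')
    # Parse last_review and fill the gaps with the most recent date
    last_review = pd.to_datetime(cleaned['last_review'], errors='coerce')
    cleaned = cleaned.assign(last_review=last_review.fillna(last_review.max()))
    # A zero price is treated as a data-entry error
    cleaned = cleaned[cleaned['price'] != 0]
    return cleaned.reset_index(drop=True)

# Toy data (illustrative values only)
toy_df = pd.DataFrame({
    'name': ['Loft', None, 'Studio', 'Room'],
    'host_name': ['Ana', 'Ben', 'Cy', 'Dee'],
    'last_review': ['2019-06-01', '2019-07-01', None, '2018-03-10'],
    'reviews_per_month': [1.0, 2.0, np.nan, 0.3],
    'price': [100, 90, 75, 0],
})

cleaned_toy = clean_listings(toy_df)
print(cleaned_toy)
```

Because every step returns a new DataFrame, the input is left unchanged, which mirrors the notebook's choice to work on a copy of the raw data.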







## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# neighbourhood_group with their associated number of properties

airbnb['neighbourhood_group'].value_counts()



In [None]:
# Chart - 1 visualization code

plt.figure(figsize = (6,4))

# creating the countplot with Seaborn

ax = sns.countplot(data = airbnb, x = 'neighbourhood_group')

for bars in ax.containers:
    ax.bar_label(bars)


# setting plot title and labels
plt.title('Top 5 neighbourhood_groups with highest listings', fontsize = 12)
plt.xlabel('neighbourhood_group', fontsize = 10)
plt.ylabel('Number of properties', fontsize = 10)

# changing fontsize of tick labels
plt.tick_params(axis='x', labelsize=8)
plt.tick_params(axis='y', labelsize=8)

# display plot
plt.show()

##### 1. Why did you pick the specific chart?

A countplot shows how many times each category appears in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Manhattan has the highest number of properties, followed by Brooklyn, Queens, Bronx and Staten Island.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The positive business impact would be that identifying the top neighbourhood groups with the highest number of listings can help the platform focus its marketing efforts and resources on those areas. This targeted approach can lead to increased bookings and revenue.

The negative impact would arise if further analysis reveals a significant decrease in bookings or occupancy rates for a specific neighbourhood group, which may indicate declining popularity or demand in that area. In such cases, the platform may need to reassess its marketing strategies or explore new neighbourhoods to maintain growth.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Top 3 properties names with highest listings

airbnb['name'].value_counts().head(3)

In [None]:
# Chart - 2 visualization code

plt.figure(figsize =(11,6))

# creating the barplot with Seaborn
sns.barplot(x= airbnb['name'].value_counts().head(3).index,
            y= airbnb['name'].value_counts().head(3).values,
            palette= 'rocket',width = 0.6)


# setting plot title and labels
plt.title('Top 3 Property names with highest listings', fontsize = 17)
plt.xlabel('Property', fontsize = 15)
plt.ylabel('Number of Listings', fontsize = 15)

# changing fontsize of tick labels
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)

# rotate x-axis labels
plt.xticks(rotation= 30)

# display plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is a common and effective way to visually represent categorical data, where the x-axis represents the categories and the y-axis represents the frequency or count of each category. A bar chart is a simple and easy-to-understand visual representation of the data, making it accessible to a wider audience.

##### 2. What is/are the insight(s) found from the chart?

The properties with the names Hillside Hotel, Home away from home, and New york Multi-unit building are the top 3 most listed properties according to property name.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The positive business impact would be that understanding the most popular property names can help the platform identify trends and preferences among users. This information can be used to optimize listing titles and descriptions to attract more bookings and increase user engagement.
*   The negative impact would be that if the analysis reveals that the top property names are associated with a specific location or neighborhood that is experiencing a decline in popularity or demand, it may negatively impact bookings and revenue for that area. The platform may need to reassess its marketing strategies or explore new locations to mitigate the negative impact.



#### Chart - 3

In [None]:
# Top 5 host_name who are caretakers of maximum properties

airbnb['host_name'].value_counts().head()

In [None]:
# Chart - 3 visualization code
plt.figure(figsize = (7,5))

# creating the barplot with Seaborn
sns.barplot(x= airbnb['host_name'].value_counts().head().index,
            y= airbnb['host_name'].value_counts().head().values,
            palette= 'cubehelix',width = 0.5)


# setting plot title and labels
plt.title('Top 5 host_name with highest listings', fontsize = 15)
plt.xlabel('host_name', fontsize = 14)
plt.ylabel('Number of Listings', fontsize = 14)

# changing fontsize of tick labels
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)

# display plot
plt.show()


##### 1. Why did you pick the specific chart?

Bar charts are easy to interpret, even for those unfamiliar with data visualization. The height or length of each bar directly corresponds to the value being represented, allowing for quick comparisons and insights.

##### 2. What is/are the insight(s) found from the chart?

The top 5 hosts with the most listings are Michael, David, Sonder(NYC), John, and Alex.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The positive impact would be that the platform can leverage these insights to establish partnerships with the top hosts, offering them exclusive benefits, support, and resources. Collaborating with successful hosts can lead to improved host retention, enhanced property offerings, and increased customer satisfaction.
*   The negative impact would be that, if this analysis reveals a heavy reliance on a few top hosts, it may lead to an imbalance in the platform's host ecosystem. Over-dependence on a small number of hosts creates a vulnerability where any adverse events impacting those hosts can significantly affect the platform's operations and revenue. Diversification and encouraging a broader base of successful hosts can help mitigate this risk.



#### Chart - 4

In [None]:
# Visualizing Property density in neighbourhood groups

# Chart - 4 visualization code
# creating scatter plot showing the density of properties in different neighbourhood_group
sns.scatterplot(x= 'longitude', y= 'latitude', data= airbnb, hue= 'neighbourhood_group')

# set the x-axis and y-axis labels (x is longitude and y is latitude, matching the scatterplot call above)

plt.xlabel('longitude', fontsize = 12)
plt.ylabel('latitude', fontsize = 12)

# set the title
plt.title('Property Location Plot', fontsize = 15 )

# display the plot
plt.show()



##### 1. Why did you pick the specific chart?

A scatter plot is a type of graph that is used to display the relationship between two variables. It is useful when you want to visualize how the values present in the two columns are related.

##### 2. What is/are the insight(s) found from the chart?

We have already seen that the neighbourhood_group Manhattan has the highest number of properties listed. The scatter plot above also shows that Manhattan has the highest density of properties.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The positive impact would be that this provides insights into underrepresented or underserved neighborhood groups with lower property densities. This information can help identify potential expansion opportunities for the platform. By targeting these areas, the platform can attract more hosts and guests, diversify its offerings, and tap into new markets.
*   The negative impact would be that, if the scatter plot reveals a significant concentration of properties in a few select neighborhood groups, it may lead to oversaturation and intense competition within those areas. This can result in downward pressure on prices and reduced profitability for hosts.



#### Chart - 5

In [None]:
# room_type available

airbnb['room_type'].value_counts()

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(5,5))
plt.pie(list(airbnb['room_type'].value_counts()),labels=list(airbnb['room_type'].value_counts().index),autopct='%1.1f%%',explode = [0.06, 0.06, 0.06])
plt.show()

##### 1. Why did you pick the specific chart?



A pie chart is a type of graph that is used to display the proportion or percentage of each category in a data set. It is useful when you want to visualize how different categories contribute to a whole.
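The percentages that a pie chart displays can also be computed directly, which is handy for quoting exact figures in the write-up. A minimal sketch on a toy Series (the values are assumptions for illustration):

```python
import pandas as pd

# Toy room types (illustrative values only)
toy_room_types = pd.Series(
    ['Entire home/apt', 'Private room', 'Entire home/apt', 'Shared room',
     'Private room', 'Entire home/apt', 'Entire home/apt', 'Private room'],
    name='room_type',
)

# normalize=True returns each category's share of the total instead of raw counts
shares_pct = (toy_room_types.value_counts(normalize=True) * 100).round(1)
print(shares_pct)
```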


##### 2. What is/are the insight(s) found from the chart?

Entire home/apt is the most common room type, followed by private room. Shared rooms account for only a small percentage of listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The positive impact would be that the distribution of room types influences pricing strategies. Hosts can adjust their prices based on the demand and supply dynamics of each room type. Understanding the popularity of different room types allows hosts to price competitively and maximize their revenue potential while ensuring the affordability and value for guests.
*   The negative impact would be that if the majority of the distribution is skewed towards a few highly popular room types, there may be a risk of oversaturation and intense competition among hosts offering similar room types. This can lead to reduced profitability and potential negative impacts on hosts' business growth.



#### Chart - 6

In [None]:
# Price range of listed properties

airbnb['price'].sort_values(ascending= False)

In [None]:
# Categorizing properties based on price

# creating a function to categorize properties based on their price

def categorize_price(price):
    # Low: below 50, Medium: 50 up to (but not including) 2000, High: 2000 and above
    if price < 50:
        return 'Low'
    elif price < 2000:
        return 'Medium'
    else:
        return 'High'

# applying the function to the 'price' column of the 'airbnb' DataFrame
airbnb['price_category'] = airbnb['price'].apply(categorize_price)

In [None]:
price_category = airbnb.groupby('price_category', as_index = False)['id'].count().rename(columns={'id':'Count'}).sort_values('Count', ascending = False)
price_category

In [None]:
# Chart - 6 visualization code

ax = sns.countplot(data = airbnb, x = 'price_category')

for bars in ax.containers:
    ax.bar_label(bars)

# setting plot title and labels
plt.title('Count of Property based on Price category', fontsize = 15)
plt.xlabel('Price category', fontsize = 14)
plt.ylabel('Count of Properties', fontsize = 14)

# changing fontsize of tick labels
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)

# display plot
plt.show()




##### 1. Why did you pick the specific chart?

Countplots are particularly useful when you want to understand how many times each category appears in a dataset.

##### 2. What is/are the insight(s) found from the chart?

Most properties fall in the medium price category, followed by a smaller number in the low price category and very few in the high price category.
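The same three price categories can be produced without a row-by-row function by using `pd.cut`. This is an alternative sketch, not the notebook's method; the thresholds 50 and 2000 are taken from the categorization code above, and the toy prices are assumptions.

```python
import numpy as np
import pandas as pd

# Toy prices chosen to sit on and around the category boundaries
toy_prices = pd.Series([30, 50, 199, 2000, 10000])

# right=False makes each bin closed on the left: [0, 50), [50, 2000), [2000, inf)
price_category = pd.cut(
    toy_prices,
    bins=[0, 50, 2000, np.inf],
    labels=['Low', 'Medium', 'High'],
    right=False,
)
print(price_category.value_counts())
```

`pd.cut` returns an ordered categorical, so plots and tables keep the Low, Medium, High order instead of sorting the labels alphabetically.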

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The positive impact would be that these insights from the countplot can guide product development decisions. By analyzing the popularity of price categories, the platform can identify opportunities to expand its offerings or introduce new features targeted at specific price segments. This allows the platform to diversify its portfolio and attract a wider range of guests.
*   The negative impact would be that if the count plot shows a lack of diversity in price categories, with a majority of properties concentrated in a narrow range, it may limit the platform's ability to cater to a broader market. This lack of pricing options can result in missed opportunities to attract guests with different budget preferences, leading to reduced growth potential.



#### Chart - 7

In [None]:
# Creating a new column named last_review_year from the last_review_column

airbnb['last_review_year'] = airbnb['last_review'].dt.year

In [None]:
# Number of unique years for last_review_year

airbnb['last_review_year'].nunique()

In [None]:
# Chart - 7 visualization code

# Years for which the last review is available for a property

ax = sns.countplot(data= airbnb, x= 'last_review_year')

for bars in ax.containers:
  ax.bar_label(bars)

# setting plot title and labels
plt.title('Count of Property based on last review year', fontsize = 14)
plt.xlabel('Last review year', fontsize = 13)
plt.ylabel('Count of Properties', fontsize = 13)

# changing fontsize of tick labels
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)

# display plot
plt.show()

##### 1. Why did you pick the specific chart?



Countplots are particularly useful when you want to understand how many times each category appears in a dataset.


##### 2. What is/are the insight(s) found from the chart?

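Most properties were last reviewed in 2019, and the number of properties per last-review year increases gradually from 2014. Note that the 2019 count also includes listings with no reviews, because their missing last_review values were filled with the most recent date in the column.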
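Because missing last_review values were filled with the most recent date, it can be informative to compare the yearly counts with and without the zero-review listings. The sketch below shows the idea on toy data; the values are assumptions for illustration only.

```python
import pandas as pd

# Toy data: the first row has zero reviews, so its date stands in for an imputed value
toy_df = pd.DataFrame({
    'last_review': pd.to_datetime(['2019-07-08', '2019-07-08', '2018-05-01', '2019-06-30']),
    'number_of_reviews': [0, 3, 10, 7],
})

# .dt.year extracts the calendar year from a datetime column
toy_df['last_review_year'] = toy_df['last_review'].dt.year

# Counts per year for all listings, and for listings that actually have reviews
all_counts = toy_df['last_review_year'].value_counts().sort_index()
reviewed_counts = (
    toy_df.loc[toy_df['number_of_reviews'] > 0, 'last_review_year']
    .value_counts()
    .sort_index()
)
print(all_counts, reviewed_counts, sep='\n')
```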

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The positive impact would be that the platform can assess the performance of properties over time. It helps identify properties that consistently receive positive reviews and maintain a high level of guest satisfaction. Such insights enable the platform to highlight these well-performing properties, boosting their visibility and attracting more bookings.
*   The negative impact would be that if there is a significant number of properties with outdated or no reviews, it may indicate a lack of guest engagement or activity on the platform. This can negatively impact the overall trust and confidence of potential guests, leading to decreased bookings and slower growth.



#### Chart - 8

In [None]:
# Top 5 Number of days for which the properties are available

airbnb['availability_365'].value_counts().head()

In [None]:
# Calculating median

airbnb['availability_365'].median()


In [None]:
# Box plot to see the variation of values in column availability_365

plt.figure(figsize = (6,6))
sns.boxplot(data= airbnb, y= 'availability_365' )
plt.show()

In [None]:
# Checking whether the properties which are having 0 as availability days are operational or not

availability_0_check = airbnb.loc[airbnb['availability_365'] == 0,['availability_365','last_review']]
availability_0_check

This shows that most properties with an availability of 0 days still have a last review date, so these properties are treated as operational. The 0 values are therefore replaced with the median of the availability_365 column.
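The replacement described above, with the median taken over the non-zero values only, can be sketched on a toy Series like this. The values are assumptions, and `Series.mask` is used here as an alternative to `np.where`.

```python
import pandas as pd

# Toy availability values (illustrative only)
toy_availability = pd.Series([0, 0, 30, 90, 365, 0, 180])

# Median computed only over the non-zero values
nonzero_median = toy_availability[toy_availability != 0].median()

# mask() replaces entries where the condition is True and leaves the rest untouched
replaced = toy_availability.mask(toy_availability == 0, nonzero_median)
print(nonzero_median, replaced.tolist())
```

Excluding the zeros before taking the median matters: if they made up a large share of the column, the overall median would itself be pulled towards zero.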

In [None]:
# Selecting rows with non-zero availability, used to compute the median of the availability_365 column

availability_0_removed = airbnb[airbnb['availability_365'] != 0]

# Replacing 0 days with the median of the non-zero values in the availability_365 column

airbnb['availability_365'] = np.where(airbnb['availability_365'] == 0, availability_0_removed['availability_365'].median(),airbnb['availability_365'] )


In [None]:
# Calculating median after replacing 0 values in the availability_365 column

airbnb['availability_365'].median()


In [None]:
airbnb['availability_365'].value_counts().head()


In [None]:
# Chart - 8 visualization code

# Box plot to see the variation of values in column availability_365 after replacing 0 values with median

plt.figure(figsize = (6,6))
sns.boxplot(data= airbnb, y= 'availability_365' )
plt.show()

##### 1. Why did you pick the specific chart?

A box plot, also known as a box-and-whisker plot, displays several key characteristics of a dataset, including the median, quartiles, and any outliers.Box plots are useful because they allow you to quickly visualize the spread and skewness of a dataset, as well as identify any outliers that may be present.
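The quantities a box plot draws can be computed by hand, which helps when interpreting the chart. The sketch below uses the 1.5 × IQR whisker rule, the default in matplotlib and seaborn box plots; the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy values with one obvious extreme point
toy_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = toy_values.quantile(0.25)
median = toy_values.median()
q3 = toy_values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_values[(toy_values < lower_fence) | (toy_values > upper_fence)]
print(q1, median, q3, outliers.tolist())
```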

##### 2. What is/are the insight(s) found from the chart?

The first box plot shows that the data in the availability_365 column is highly skewed. After the 0 values are replaced with the median, the median of the column increases and the skewness is reduced. This makes the availability_365 column more suitable for further analysis.

**Q. Who is the best host for each room type?**

In [None]:
# Function to get the best host in each room_type

def best_host_in_room_type(room_type_name):
  temp_df = airbnb[airbnb['room_type'] == room_type_name]
  # Naming the counts Series before reset_index gives the same column name across pandas versions
  temp_df_count = temp_df[['room_type','host_name']].value_counts().head(1).rename('Number of Properties').reset_index()
  return temp_df_count

# Making a list containing the names of all the room types

room_type_list = list(airbnb['room_type'].unique())
room_type_list

# Creating an empty list to store the dataframe returned by the above function
best_host_in_room_type_df_list = []

# Using loop to input room_type names to the function and appending the returned dataframe from the function to the list
for i in room_type_list:
   best_host_in_room_type_df_list.append(best_host_in_room_type(i))

# concatenating the dataframes returned by the function to get the final dataframe
best_host_in_room_type_df = pd.concat(best_host_in_room_type_df_list)
best_host_in_room_type_df.sort_values('Number of Properties', ascending= False)


The above table shows the best host for each room_type, based on the number of properties they handle. Sonder is the best host for the Entire home/apt room type, followed by David for Private room and Sergii for Shared room. Sonder is also the best host in Manhattan. So, if a customer requires an Entire home/apt in Manhattan, Sonder should be recommended.

**Q. Who is the best host in neighbourhood_group ?**

In [None]:
# Function to get the best host in each neighbourhood_group

def best_host_in_neighbourhood_group(neighbourhood_group_name):
  temp_df = airbnb[airbnb['neighbourhood_group'] == neighbourhood_group_name]
  # Naming the counts Series before reset_index gives the same column name across pandas versions
  temp_df_count = temp_df[['neighbourhood_group','host_name']].value_counts().head(1).rename('Number of Properties').reset_index()
  return temp_df_count

# Making a list containing the names of all the neighbourhood_groups

neighbourhood_group_list = list(airbnb['neighbourhood_group'].unique())
neighbourhood_group_list

# Creating an empty list to store the dataframe returned by the above function
best_host_in_neighbourhood_group_df_list = []

# Using loop to input neighbourhood_group names to the function and appending the returned dataframe from the function to the list
for i in neighbourhood_group_list:
   best_host_in_neighbourhood_group_df_list.append(best_host_in_neighbourhood_group(i))

# concatenating the dataframes returned by the function to get the final dataframe
best_host_in_neighbourhood_group_df = pd.concat(best_host_in_neighbourhood_group_df_list)
best_host_in_neighbourhood_group_df.sort_values('Number of Properties', ascending= False)


The above table shows the best host in each neighbourhood_group, based on the number of properties they handle. Sonder is the best host in the neighbourhood_group Manhattan.
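The three "best host" questions in this section follow the same pattern, so they could share one helper that takes the grouping column as a parameter. This is a sketch under assumptions: the name `top_host_by` and the toy data are hypothetical, and "best" is taken to mean "most listings", as in the tables above.

```python
import pandas as pd

# Toy data (illustrative values only)
toy_df = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Queens'],
    'host_name': ['Sonder', 'Sonder', 'Ana', 'Michael', 'Michael', 'Ben'],
})

def top_host_by(df, group_col):
    """Return the host with the most listings within each value of group_col."""
    counts = (
        df.groupby([group_col, 'host_name'])
          .size()
          .rename('Number of Properties')
          .reset_index()
    )
    # Sort so the largest count comes first, then keep the first row of each group
    counts = counts.sort_values('Number of Properties', ascending=False)
    return counts.drop_duplicates(subset=group_col).reset_index(drop=True)

top_hosts = top_host_by(toy_df, 'neighbourhood_group')
print(top_hosts)
```

Calling the helper with `'room_type'` or `'price_category'` as `group_col` would cover the other two questions without repeating the loop-and-concatenate code.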

**Q. Who is the best host in each price category?**

In [None]:
# Function to get the best host in each price_category

def best_host_in_price_category(price_category_name):
  # Keep only the listings that belong to the requested price_category
  temp_df = airbnb[airbnb['price_category'] == price_category_name]
  # Count listings per host and keep the top one.
  # reset_index(name=...) names the count column regardless of the pandas version
  # (the old rename(columns={0: ...}) only works where the count column is called 0).
  temp_df_count = temp_df[['price_category','host_name']].value_counts().head(1).reset_index(name='Number of Properties')
  return temp_df_count

# Making a list containing the names of all the price categories
price_category_list = list(airbnb['price_category'].unique())

# Creating an empty list to store the dataframes returned by the above function
best_host_in_price_category_df_list = []

# Using a loop to pass each price_category name to the function and appending the returned dataframe to the list
for i in price_category_list:
  best_host_in_price_category_df_list.append(best_host_in_price_category(i))

# Concatenating the dataframes returned by the function to get the final dataframe
best_host_in_price_category_df = pd.concat(best_host_in_price_category_df_list)
best_host_in_price_category_df.sort_values('Number of Properties', ascending=False)




The above table shows the best host in each price_category, based on the number of properties they handle. Michael is the best host in the Medium price category, followed by Kazuya for Low and Henry for High. Michael is also the best host in Brooklyn. So, if a customer wants a property in Brooklyn in the medium price category, Michael should be recommended.
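One caveat on rankings by `host_name`: different hosts can share a first name, so a count by name may merge several hosts into one. The sketch below uses invented rows to show how counting by `host_id` (a column in the Airbnb NYC 2019 dataset) can give a different answer than counting by name. It is an illustration under that assumption, not a result from the real data.

```python
import pandas as pd

# Invented rows: three different hosts (ids 101, 303, 404) share the name "Michael"
toy = pd.DataFrame({
    "host_id":   [101, 101, 202, 202, 202, 303, 404],
    "host_name": ["Michael", "Michael", "Sonder", "Sonder", "Sonder", "Michael", "Michael"],
})

# Counting by name merges every "Michael" into one host
by_name = toy["host_name"].value_counts()

# Counting by (host_id, host_name) keeps distinct hosts apart
by_id = toy.groupby(["host_id", "host_name"]).size().sort_values(ascending=False)

print("Top by name:", by_name.idxmax(), by_name.max())
print("Top by id:  ", by_id.idxmax(), by_id.max())
```

In this toy example the name-based count favours "Michael", while the id-based count shows that the single host with the most listings is "Sonder".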

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
airbnb_numerical_columns = airbnb[['price','minimum_nights','number_of_reviews','calculated_host_listings_count','availability_365','last_review_year']]
# Computing the correlation matrix using the corr method
corr = airbnb_numerical_columns.corr()
plt.figure(figsize=(6,6))
heatmap = sns.heatmap(corr, cmap= 'Spectral', annot = True, vmax = 1, vmin = -1)
plt.show()

##### 1. Why did you pick the specific chart?

Heatmaps are graphical representations of data that use color-coding to represent values in a matrix. They are commonly used to visualize the correlation between different variables in a dataset.

##### 2. What is/are the insight(s) found from the chart?

The heatmap indicates that there is no strong linear correlation between the numerical columns of the dataset. The values in each numerical column are not strongly related to the values in the other columns. Note that a correlation coefficient near zero only rules out a linear relationship; it does not by itself prove that the columns are independent.
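To read the heatmap more systematically, the correlation matrix can be flattened into a ranked list of column pairs. The following is a minimal sketch on synthetic data: the values are randomly generated, and the link between `minimum_nights` and `availability_365` is planted on purpose so that the ranking has something to find. It says nothing about the real dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for airbnb_numerical_columns (values are made up)
toy = pd.DataFrame({
    "price": rng.normal(150, 50, n),
    "minimum_nights": rng.integers(1, 30, n),
    "number_of_reviews": rng.integers(0, 300, n),
})
# Planted relationship, for demonstration only
toy["availability_365"] = toy["minimum_nights"] * 10 + rng.normal(0, 5, n)

corr = toy.corr()

# One entry per unordered pair of columns (skips the diagonal and mirrored duplicates)
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)},
    name="correlation",
)

# Rank pairs by the absolute strength of the correlation
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Applied to the notebook's `corr` matrix, the same ranking would list the most strongly related column pairs first, which makes it easier to check whether any coefficient is large enough to matter.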

#### Chart - 10 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(airbnb_numerical_columns, diag_kind='hist')
plt.show()

##### 1. Why did you pick the specific chart?

Pairplots can be a useful tool for exploring and identifying patterns and trends in your data, and for identifying potential correlations between variables. They can help to quickly visualize and understand the relationships between multiple variables in a dataset.

##### 2. What is/are the insight(s) found from the chart?

The above pairplot provides a visual representation of the correlation between numerical columns in the dataset. It shows that variables with a positive correlation coefficient in the heatmap also display a positive correlation in the corresponding scatter plot, while variables with a negative correlation coefficient show a negative correlation. If the correlation coefficient is close to zero, the pairplot also shows that the variables have little or no relationship with each other. In this way, the pairplot can help to identify and confirm patterns and relationships in the data, and can be a useful tool in exploratory data analysis.
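A scatter plot can reveal a relationship that is monotonic but not linear, which the default (Pearson) coefficient understates. One way to check for this is to compare Pearson with the Spearman rank correlation, computed here as Pearson on ranks. The sketch uses made-up data and is only meant to illustrate the check.

```python
import numpy as np
import pandas as pd

# Made-up data: y increases with x at every step, but far from linearly
x = pd.Series(np.linspace(0, 10, 50))
y = np.exp(x)

# Pearson measures linear association
pearson = x.corr(y)

# Spearman rank correlation is the Pearson correlation of the ranks
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

If a pair of columns in the pairplot looks curved rather than scattered, a Spearman value well above the Pearson value would support that reading.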

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

Based on the insights gained from the EDA of the Airbnb NYC 2019 dataset, here are some potential solutions to the business objectives:


1.   Improve occupancy rate:

*   Improve the listing description and photos to better showcase the property and increase its appeal to potential guests.

2.   Increase revenue:

*   Analyze pricing trends across neighborhoods using different criteria, and adjust the pricing strategy to maximize revenue.
*   Provide additional amenities or services to increase the perceived value of the listing and justify higher prices.

3.   Enhance guest experience:

*   Provide high-quality amenities and services, such as reliable Wi-Fi, comfortable bedding, and helpful local recommendations, to improve guest satisfaction and encourage positive reviews.
*   Use guest feedback to identify areas for improvement and implement changes to enhance the overall guest experience.
*   Respond promptly and courteously to guest inquiries and complaints to build a positive reputation and encourage repeat business.
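The pricing analysis suggested under "Increase revenue" could start from a summary of price by neighbourhood_group and room_type. The following is a minimal sketch on invented rows: the column names match the dataset, but the prices and the output names `median_price` and `listings` are assumptions made for illustration.

```python
import pandas as pd

# Invented listings used only to demonstrate the aggregation
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [200, 300, 100, 150, 60, 80],
})

# Median price and number of listings for each (neighbourhood_group, room_type) pair
price_summary = (
    toy.groupby(["neighbourhood_group", "room_type"])["price"]
       .agg(median_price="median", listings="count")
)
print(price_summary)
```

The median is used here rather than the mean because a few very expensive listings can pull the mean upward. A host could compare their own price against the median for their area and room type before adjusting it.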










# **Conclusion**



1.   The largest numbers of listed properties are in Manhattan and Brooklyn, followed by Queens, the Bronx, and Staten Island.

2.   Hillside Hotel, Home away from home, and New york Multi-unit building are the top 3 most listed properties by property name.

3.   The top 5 hosts with the most listings are Michael, David, Sonder(NYC), John, and Alex.

4.   Manhattan has the highest density of properties.

5.   Most people prefer an entire home/apartment.

6.   Most properties belong to the medium price category.

7.   Most of the properties were reviewed in 2019. There has been a gradual increase in reviews each year since 2014.

8.   Sonder is the best host for the Entire home/apt room type, followed by David for Private room and Sergii for Shared room. Sonder is also the best host in Manhattan. So, if a customer requires an Entire home/apt in Manhattan, Sonder should be recommended.

9.   Michael is the best host in the Medium price category, followed by Kazuya for Low and Henry for High. Michael is also the best host in Brooklyn. So, if a customer wants a property in Brooklyn in the medium price category, Michael should be recommended.

10.   The heatmap and pairplot indicate that there is no strong linear correlation between the numerical columns of the dataset. The values in each numerical column are not strongly related to the values in the other columns.

Overall, the Airbnb NYC 2019 dataset provides valuable insights into the short-term rental market in New York City. The dataset can be used by hosts to better understand their competition and adjust their pricing strategy, improve occupancy rate, and enhance customer experience.



















### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***