<a href="https://colab.research.google.com/github/Rupayan93/Pro/blob/main/Airbnb_EDA_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team  -**   Rupayan Paul

# **Project Summary -**

Since its inception in 2008, Airbnb has changed the way guests and hosts experience travel, offering a personalized alternative to traditional accommodation. The platform as a whole generates data from millions of listings; this project explores one slice of it, the Airbnb NYC 2019 dataset, which contains around 49,000 observations across 16 columns of categorical and numeric values. The analysis aims to understand customer and host behavior, identify what drives listing performance, and extract actionable insights that can guide marketing and strategic business decisions.

# **GitHub Link -**

https://github.com/Rupayan93/Pro.git

# **Problem Statement**


As Airbnb has grown into a globally recognized service, the challenge lies in making effective use of the vast amount of data generated by its listings. This data, spanning categorical and numeric values, holds untapped potential for improving key aspects of the platform. The task is to analyze the dataset comprehensively and uncover patterns and insights that can enhance customer experiences, improve listing performance, ensure security, and inform strategic business decisions.



#### **Define Your Business Objective?**

The project aims to analyze the Airbnb listings dataset. The primary objectives are to:

* Understand customer preferences.
* Enhance listing performance.
* Drive marketing initiatives.
* Inform business strategy.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from dateutil.relativedelta import relativedelta

### Dataset Loading

In [None]:
# Load Dataset
# Mount your Google Drive so that you can access the files there.
# You'll receive an authentication prompt. Complete it.
from google.colab import drive
drive.mount('/content/drive')

filepath="/content/drive/MyDrive/Colab Notebooks/AIRBNB EDA/Airbnb NYC 2019.csv"
ar_df=pd.read_csv(filepath)

### Dataset First View

In [None]:
# Dataset First Look
ar_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

ar_df.shape

### Dataset Information

In [None]:
# Dataset Info
ar_df.dtypes

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_count = ar_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
Miss_val=(ar_df.isnull().sum())
print(Miss_val)

In [None]:
# Visualizing the missing values

missing_percentage = (Miss_val / len(ar_df)) * 100

# Create a bar plot to visualize missing values
plt.figure(figsize=(10, 6))
missing_percentage.plot(kind='bar', color='skyblue')
plt.title('Percentage of Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Percentage Missing (%)')
plt.xticks(rotation=45, ha='right')
plt.show()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
ar_df.columns

In [None]:
# Dataset Describe
ar_df.describe()

### Variables Description




Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for column in ar_df.columns:
    unique_counts = ar_df[column].nunique()
    print(f"Unique counts for {column}: {unique_counts}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Remove the rows with null last_review values, as they are less than 30% of the data.
# .copy() avoids a SettingWithCopyWarning when new columns are added below.
ar_df = ar_df[~ar_df.last_review.isna()].copy()
ar_df = ar_df.drop(columns=['name', 'latitude', 'longitude'])  # Drop columns not required for our analysis

# Extract year and month from last_review into 2 new columns for further analysis.
ar_df['last_review'] = pd.to_datetime(ar_df['last_review'])  # Convert to datetime
ar_df['month_name'] = ar_df['last_review'].dt.strftime('%B')  # Month name in a new column 'month_name'
ar_df['year'] = ar_df['last_review'].dt.year  # Year in a new column 'year'


# Add a month-difference column to derive the status as Active or Inactive
max_review_date = ar_df['last_review'].max()  # The most recent date on which any review was received

def months_since(date):
    # relativedelta(...).months alone ignores whole years, so add years * 12 for the total
    delta = relativedelta(max_review_date, date)
    return delta.years * 12 + delta.months

ar_df['month_difference'] = ar_df['last_review'].apply(months_since)  # Total months since the last review
ar_df['status'] = ar_df['month_difference'].apply(lambda x: 'Active' if x <= 6 else 'Inactive')  # Status column


### What all manipulations have you done and insights you found?

1) First, I checked the percentage of null values and found two columns with about 20% nulls, so I dropped the rows where last_review is missing.

2) The id column is unique for every listing, and neighbourhood and neighbourhood_group already describe each listing's locality, so the name, latitude and longitude columns are not required. I dropped them from the dataframe.

3) With the data cleaned up, I converted the last_review column to datetime format and extracted the month and year into two new columns, 'month_name' and 'year'.

4) I also added a month_difference column, which records how many months have passed since a listing last received a review. It is computed by comparing the latest date in the last_review column with the last_review date of each row. Finally, I added a status column that marks a listing as Inactive if its month difference is more than 6 months, and Active otherwise.
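
The month-difference step in point 4 has one subtlety worth illustrating: `relativedelta(...).months` holds only the months component (0 to 11) and leaves whole years in `.years`. The sketch below is a minimal illustration, assuming two made-up dates and a hypothetical helper name `months_between`; it is not taken from the dataset.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

def months_between(later, earlier):
    """Whole months elapsed from `earlier` to `later`, including full years."""
    delta = relativedelta(later, earlier)
    return delta.years * 12 + delta.months

# Hypothetical dates, chosen only to show the effect of whole years
reference = datetime(2019, 7, 8)    # stands in for the latest review date
old_review = datetime(2017, 6, 1)   # a little over two years earlier

component_only = relativedelta(reference, old_review).months  # months component only
total_months = months_between(reference, old_review)          # years * 12 + months

status = 'Active' if total_months <= 6 else 'Inactive'
print(component_only, total_months, status)
```

With the months component alone, a listing last reviewed more than two years ago would look like it was reviewed one month ago, so the total is the safer input for the Active/Inactive rule.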

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

#Checking for correlation:

sns.set(style="ticks")
sns.pairplot(ar_df, kind="scatter")
plt.show()

##### 1. Why did you pick the specific chart?

I used a scatter pair plot to check whether any strong correlation exists among the numerical variables. No strong relationship is apparent.

##### 2. What is/are the insight(s) found from the chart?

The scatter plots above show some patterns:

1) A relationship between year and number of reviews, which suggests that the number of visitors is increasing year by year.

2) A negative correlation between month_difference and number of reviews.
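
A visual impression from a pair plot can be cross-checked with a correlation matrix. The sketch below is only an illustration, assuming a small made-up frame named `toy` that reuses three column names from this notebook; the values are invented and are not results from the real data.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'number_of_reviews': [120, 80, 45, 20, 5],
    'month_difference': [0, 1, 3, 8, 20],
    'price': [100, 150, 90, 200, 120],
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr(method='pearson')

# Coefficient for one pair; a negative value means one falls as the other rises
r = corr.loc['month_difference', 'number_of_reviews']
print(corr.round(2))
print(f"month_difference vs number_of_reviews: {r:.2f}")
```

On the real dataframe the same call could be applied to the numeric columns, and the resulting matrix could be passed to `sns.heatmap` for a compact view.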

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the above insights, we can conclude that the business shows a positive sign, as it appears to be acquiring more travellers year on year.

We can also see that listings with a higher month_difference have fewer reviews, which suggests that these listings are not much preferred by travellers.

#### Chart - 2

In [None]:
# Chart - 2 visualization code


fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Box Plot for 'price'
sns.boxplot(x='price', data=ar_df, ax=axes[0], showfliers=False)
axes[0].set_title('Distribution of Prices')


# Plot 2: Bar Plot for 'status'
status_counts = ar_df['status'].value_counts()
sns.barplot(x=status_counts.index, y=status_counts.values, ax=axes[1])
axes[1].set_title('Status Distribution')

# Plot 3: Pie Chart for 'room_type'
room_type_counts = ar_df['room_type'].value_counts()
axes[2].pie(room_type_counts, labels=room_type_counts.index, autopct='%1.1f%%', startangle=90)
axes[2].set_title('Room Type Distribution')

# Adjust layout
plt.tight_layout()

# Show the subplots
plt.show()

##### 1. Why did you pick the specific chart?

1. I picked a box plot for the price column to see the distribution of prices.
2. For Active and Inactive status, I chose a bar chart because we are comparing counts.
3. I used a pie chart to understand which room types people prefer.

##### 2. What is/are the insight(s) found from the chart?

1. The box plot shows that the bulk of listings are priced between 75 USD and 155 USD, which suggests this is the range people prefer. There are some outliers, which I have hidden using showfliers=False.
2. From the bar chart, we can see that around 5,000 listings are inactive, whereas around 30,000 listings are active across these neighbourhood groups.
3. From the pie chart, we can conclude that people prefer entire apartments, followed by private rooms. Shared rooms are the least preferred.
4. From the above point, we can also infer that people are mostly travelling with families, which would explain why entire apartments are booked more. There seem to be very few backpackers visiting, as shared rooms are not booked that much.
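
The price range read off a box plot can also be computed directly from quartiles. The sketch below is a minimal illustration, assuming a small made-up price series (not the real data). It uses the common 1.5 × IQR rule that box plots apply by default to decide which points count as outliers.

```python
import pandas as pd

# Made-up prices for illustration only
prices = pd.Series([40, 60, 75, 90, 110, 130, 155, 200, 300, 1000], name='price')

# Quartiles: the box in a box plot spans q1 to q3
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as fliers (hidden by showfliers=False)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```

Reporting the quartiles alongside the chart makes a statement such as "priced between X and Y" reproducible even when the fliers are hidden.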

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The above insights can create a positive impact as follows:

1. We can look for new listings priced below 155 USD, as this price range is the most preferred. Adding new listings in this range will give travellers more options and help acquire new customers.
2. The bar chart shows that around 5,000 listings have not been operating for the last 6 months, which is a loss to the business, as these listings could have helped yield more revenue. We need to find out the reason for their inactivity and, if possible, get them operating again.
3. From the room type preference, we can assume that most bookings are for entire apartments, which groups or families are more likely to book. We should therefore focus on adding entire apartments and private rooms to our listings, giving travellers a larger number of options to select from.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Explore and understand the Neighbourhood Group

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Box Plot for 'price'
sns.boxplot(x='neighbourhood_group', y='price', data=ar_df, showfliers=False, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Prices')


sum_reviews_by_neighbourhood = ar_df.groupby('neighbourhood_group')['number_of_reviews'].sum().reset_index()



# Plot 2: Bar Plot for 'number_of_reviews'
sns.barplot(x='neighbourhood_group', y='number_of_reviews', data=sum_reviews_by_neighbourhood, ax=axes[0, 1])
axes[0, 1].set_title('Number of Reviews by Neighbourhood Group')

# Plot 3: Bar Plot for 'number_of_reviews' with 'room_type' as hue
sns.barplot(x='neighbourhood_group', y='number_of_reviews', hue='room_type', data=ar_df, ax=axes[1, 0])
axes[1, 0].set_title('Average Number of Reviews by Room Type and Neighbourhood Group')

# Plot 4: Bar Plot for 'status'
sns.countplot(x='neighbourhood_group', hue='status', data=ar_df, ax=axes[1, 1])
axes[1, 1].set_title('Status Distribution by Neighbourhood Group')



# Adjust layout
plt.tight_layout()

# Show the subplots
plt.show()

##### 1. Why did you pick the specific chart?

1. Box plot: to understand the distribution of price across all neighbourhood groups.
2. Bar charts: to compare the reviews received by each neighbourhood group, the room types within each neighbourhood group, and the status of listings within each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

Insights:

1. Manhattan is the costliest neighbourhood group. The least costly is the Bronx.
2. Brooklyn is the most preferred neighbourhood group, followed by Manhattan. The least preferred neighbourhood group is Staten Island.
3. In Brooklyn, Staten Island and the Bronx, the most preferred room type is the entire apartment, whereas in Manhattan and Queens the most preferred room type is the private room.
4. Shared rooms have very little demand in Staten Island.
5. Inactive listings are more numerous in Brooklyn and Manhattan.
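
Rankings like the ones above can be backed by a small summary table per neighbourhood group. The sketch below is only an illustration, assuming a made-up frame named `toy` with invented values; the output column names `median_price`, `total_reviews` and `listings` are hypothetical labels chosen for this example.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn',
                            'Brooklyn', 'Bronx', 'Bronx'],
    'price': [200, 180, 120, 100, 70, 60],
    'number_of_reviews': [10, 30, 50, 40, 5, 15],
})

# One row per neighbourhood group, using named aggregation
summary = (
    toy.groupby('neighbourhood_group')
    .agg(
        median_price=('price', 'median'),
        total_reviews=('number_of_reviews', 'sum'),
        listings=('price', 'size'),
    )
    .sort_values('median_price', ascending=False)
)
print(summary)
```

The median is used here rather than the mean because a few very expensive listings can pull the mean upward.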

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights will create a positive impact:

1. Search for and add new affordable listings in Manhattan, which will attract travellers.
2. Focus on adding new entire apartments to our listings in Brooklyn, as the demand for entire apartments is high there.
3. In Staten Island, focus on entire apartments and private rooms, as shared rooms are less in demand in this neighbourhood group.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

average_availability_by_country = ar_df.groupby('neighbourhood_group')['availability_365'].mean().reset_index()

theta = np.linspace(0.0, 2 * np.pi, len(average_availability_by_country), endpoint=False)
width = 2 * np.pi / len(average_availability_by_country)

# Create the radial column chart
plt.figure(figsize=(8, 8))
ax = plt.subplot(111, projection='polar')
bars = ax.bar(theta, average_availability_by_country['availability_365'], width=width, color='skyblue', edgecolor='black', alpha=0.7)

# Adjust the angle of the labels
ax.set_theta_offset(np.pi / 2)
ax.set_theta_direction(-1)

# Set the labels
ax.set_xticks(theta)
ax.set_xticklabels(average_availability_by_country['neighbourhood_group'])

plt.title('Average Availability by Neighbourhood Group (Radial Column Chart)')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this chart to show availability over the year in an interesting way.

##### 2. What is/are the insight(s) found from the chart?

Insights:

1. Manhattan and Brooklyn have lower availability.
2. Staten Island has high availability throughout the year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Below are the actions and impacts:

1. Low availability in Brooklyn and Manhattan shows how much these two places are liked by travellers. We need to focus on adding new listings in these regions so that we can provide sufficient availability and travellers can book our listings.
2. For now, Staten Island seems to have sufficient listings, and there is no need to focus on adding new ones.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#price variations over time


fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Average Price Over the Years
average_price_by_year = ar_df.groupby('year')['price'].mean().reset_index()
sns.lineplot(x='year', y='price', data=average_price_by_year, marker='o', color='blue', ax=axes[0])
axes[0].set_title('Price Over the Years')
axes[0].set_xlabel('Year')
axes[0].set_ylabel(' Price')
axes[0].grid(True)

# Plot 2: Average Number of Reviews Over the Years
average_reviews_by_year = ar_df.groupby('year')['number_of_reviews'].mean().reset_index()
sns.lineplot(x='year', y='number_of_reviews', data=average_reviews_by_year, marker='o', color='green', ax=axes[1])
axes[1].set_title('Reviews Over the Years')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Number of Reviews')
axes[1].grid(True)


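# Plot 3: Number of listings by year of last review
listings_by_year = ar_df.groupby('year')['id'].count().reset_index()

# Pass ax explicitly so the bar plot lands in the third subplot
sns.barplot(x='year', y='id', data=listings_by_year, palette='viridis', ax=axes[2])
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Number of Listings')
axes[2].set_title('Listings Over Years')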




# Adjust layout
plt.tight_layout()

# Show the subplots
plt.show()

##### 1. Why did you pick the specific chart?

My aim here is to check the price trend, and a line chart is well suited to trend analysis.

##### 2. What is/are the insight(s) found from the chart?

1. From the price over the years, we can see that prices were very high in 2013, followed by a large dip from 2014.
2. The number of visitors also increased from 2014.
3. The number of listings also increased from 2014.


From the above visuals, we can see that the number of listings has gradually increased since 2014. This appears to have lowered prices, which in turn attracted more people.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the above, we can conclude that:

We need to focus on adding more listings, which will create competition among hosts and lead to more affordable prices that attract visitors to book with us.

#### Chart - 6

In [None]:
# Chart - 6 visualization code


ar_df['month_name'] = pd.Categorical(ar_df['month_name'], categories=[
    'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'
], ordered=True)
# Calculate the average price and average number of reviews over each month
average_price_by_month = ar_df.groupby('month_name')['price'].mean().reset_index()
average_reviews_by_month = ar_df.groupby('month_name')['number_of_reviews'].mean().reset_index()

# Create subplots for both line and bar plots
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Line graph for Average Price Over Months
sns.lineplot(x='month_name', y='price', data=average_price_by_month, marker='o', color='blue', ax=axes[0])
axes[0].set_title('Price Over Months')
axes[0].set_xlabel('Month')
axes[0].set_ylabel('Average Price')
axes[0].grid(True)
axes[0].tick_params(axis='x', rotation=90)
# Plot 2: Bar graph for Average Number of Reviews Over Months
sns.barplot(x='month_name', y='number_of_reviews', data=average_reviews_by_month, palette='viridis', ax=axes[1])
axes[1].set_title('Reviews Over Months')
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Average Number of Reviews')
axes[1].grid(True)
axes[1].tick_params(axis='x', rotation=90)
# Adjust layout
plt.tight_layout()

# Show the subplots
plt.show()

##### 1. Why did you pick the specific chart?

Again, we have done a trend analysis of prices across months and a comparison of reviews across months, so we used a line chart and a bar chart.

##### 2. What is/are the insight(s) found from the chart?

From the above, we can conclude that people are more likely to visit these places in June and July, possibly because of the summer vacation period.
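
Monthly charts like these depend on the months appearing in calendar order rather than alphabetical order, which is what the ordered categorical in the code cell provides. The sketch below is a minimal illustration, assuming a made-up frame named `toy` with invented review counts.

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Made-up values for illustration only
toy = pd.DataFrame({
    'month_name': ['July', 'January', 'June', 'July'],
    'number_of_reviews': [30, 10, 20, 50],
})

# An ordered categorical makes groupby sort by calendar position
toy['month_name'] = pd.Categorical(toy['month_name'], categories=months, ordered=True)

# observed=True keeps only the months that actually occur in the data
by_month = toy.groupby('month_name', observed=True)['number_of_reviews'].mean()

print(by_month)
print("Busiest month:", by_month.idxmax())
```

Without the categorical conversion, plain strings would sort alphabetically, placing January after February's neighbours such as April and August.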

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. June and July see the highest number of visitors, possibly because of summer vacations. We can focus on rolling out offers and ads for bookings during this period.

#### Chart - 7

In [None]:
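# Chart - 7 visualization code
# Heatmap of average number of reviews by neighbourhood group and month

# Assumed definition (missing from the original cell), inferred from the
# variable name and the colorbar label: mean reviews per group and month
average_reviews_by_month_and_country = (
    ar_df.groupby(['neighbourhood_group', 'month_name'])['number_of_reviews']
    .mean()
    .reset_index()
)

heatmap_data = average_reviews_by_month_and_country.pivot(index='neighbourhood_group', columns='month_name', values='number_of_reviews')

# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, annot=True, cmap='viridis', fmt='.1f', cbar_kws={'label': 'Average Number of Reviews'})
plt.title('Average Number of Reviews by Month and Neighbourhood Group')
plt.xlabel('Month')
plt.ylabel('Neighbourhood Group')
plt.show()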

##### 1. Why did you pick the specific chart?

I used a heatmap to show the variation across months and neighbourhood groups using colour.

##### 2. What is/are the insight(s) found from the chart?

From the heatmap, we can conclude that:

1) In June and July, visitors are visiting all of these neighbourhood groups, which confirms this period as the best time to travel. We need to focus on running marketing campaigns and rolling out offers to attract more visitors.

2) In December, Staten Island hosts a good number of visitors in comparison to all other neighbourhood groups. For December, we can focus on ad campaigns that promote our listings in Staten Island.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It helps create a positive impact by identifying the periods when we should focus more on marketing and rolling out offers.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

OBJECTIVES:

1) The first objective is to understand the reason behind the roughly 5,000 inactive listings, most of which are in Brooklyn and Manhattan, our most preferred locations. Understanding the reason and helping hosts reactivate their listings can increase our listings in these locations, which directly supports our aim of adding more listings. These places also have low availability throughout the year, so this will help the business tackle the availability problem to an extent.

2) We need to add new listings, as availability in our two most preferred neighbourhood groups is low, which results in missing out on a good share of revenue. We should focus on adding more listings in all neighbourhood groups, but mostly in Brooklyn and Manhattan.

3) We need to roll out offers targeting visitors in June and July for all the neighbourhood groups. We should do the same for Staten Island in December.
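
To prioritise the first objective, it can help to look at the share of inactive listings within each neighbourhood group and not only the raw counts. The sketch below is only an illustration, assuming a made-up frame named `toy` with invented statuses; it does not reproduce the real figures.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn'] * 4 + ['Manhattan'] * 4 + ['Queens'] * 2,
    'status': ['Active', 'Active', 'Active', 'Inactive',
               'Active', 'Active', 'Inactive', 'Inactive',
               'Active', 'Active'],
})

# Raw counts of Active and Inactive listings per neighbourhood group
counts = pd.crosstab(toy['neighbourhood_group'], toy['status'])

# Row-wise shares: each neighbourhood group sums to 1.0
shares = pd.crosstab(toy['neighbourhood_group'], toy['status'], normalize='index')

print(counts)
print(shares.round(2))
```

A group with many listings will naturally have many inactive ones, so the share column gives a fairer comparison between large and small neighbourhood groups.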


# **Conclusion**




1.   A total of around 35,000 listings are present across these five neighbourhood groups, of which around 5,000 have not received any review in the last 6 months, which indicates that they are not active. We need to check and understand why these listings became inactive.

2.   Overall, people prefer entire apartments most, followed by private rooms, whereas shared rooms are the least preferred. We can focus more on listing new entire apartments and private rooms, as visitors seem to book these room types for their stay.

3. From analysing the prices, we found that listing prices declined after 2013, which coincided with an increase in the number of visitors and the number of listings.

4. The monthly price trend shows that prices peak from December to February, whereas they seem to be lowest in August.

5. June and July experience the highest number of visitors, possibly because of summer vacations. We can roll out offers during this period to attract more visitors.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***