
# **Project Name**    - Airbnb Bookings Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Problem Statement**


Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analysing the millions of listings available through Airbnb is crucial for the company. This dataset has around 49,000 rows and 16 columns. Explore and analyze the data to discover key insights.

#### **Define Your Business Objective?**

To understand the behaviour and performance of customers and hosts on the platform, in order to inform business decisions, guide marketing initiatives, and support the rollout of innovative additional services.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
file_path =  'Airbnb NYC 2019.csv'
df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print(f'Number of rows in the dataset are {rows}')
print(f'Number of columns in the dataset are {columns}')

### Dataset Information

In [None]:
# Dataset Info (info is a method, so it must be called to print the summary)
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
column_wise_missing_values = df.isna().sum()
total_missing_values = column_wise_missing_values.sum()
print(column_wise_missing_values)
print('\n')
print(f'Total missing values are {total_missing_values}')

In [None]:
# Visualizing the missing values
column_wise_missing_values.plot(kind='bar')
plt.title('Missing values per column')
plt.xlabel('columns')
plt.ylabel('number of missing values')

In [None]:
sns.heatmap(df.isna())

### What did you know about your dataset?

1. It is a relatively small dataset with about 49,000 entries and 16 columns.

2. The dataset has no duplicate entries.

3. The null values in the 'last_review' and 'reviews_per_month' columns probably exist because no reviews were published for those listings.
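
The third point can be checked directly rather than assumed. The sketch below uses a small made-up frame (`sample` and its values are illustrative, not rows from the real dataset) to test whether the rows with missing review fields are exactly the rows with zero reviews. Running the same comparison on `df` would confirm or refute the assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb data; the values are invented for illustration.
sample = pd.DataFrame({
    'number_of_reviews': [0, 5, 0, 12],
    'last_review': [None, '2019-05-21', None, '2018-11-03'],
    'reviews_per_month': [np.nan, 0.4, np.nan, 1.2],
})

no_reviews = sample['number_of_reviews'] == 0
missing_rpm = sample['reviews_per_month'].isna()
missing_last = sample['last_review'].isna()

# True only if all three masks flag exactly the same rows.
nulls_match_zero_reviews = bool(
    (no_reviews == missing_rpm).all() and (no_reviews == missing_last).all()
)
print(nulls_match_zero_reviews)
```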

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe

df.describe()

### Variables Description 

1. id : Unique ID of the listing.

2. name : Name of the listing.

3. host_id : Unique ID of the host.

4. host_name : Name of the host.

5. neighbourhood_group : Location (borough) of the listing.

6. neighbourhood : Area within the neighbourhood group.

7. latitude : Latitude of the listing.

8. longitude : Longitude of the listing.

9. room_type : Type of room offered (entire home/apt, private room or shared room).

10. price : Price of the listing.

11. minimum_nights : Minimum number of nights that must be booked.

12. number_of_reviews : Number of reviews.

13. last_review : Date of the last review.

14. reviews_per_month : Number of reviews per month.

15. calculated_host_listings_count : Total number of listings belonging to the host.

16. availability_365 : Number of days in the year the listing is available.

### Check Unique Values for each variable.

In [None]:
column_list = list(df.columns.values)

In [None]:
# Check Unique Values for each variable.
for item in column_list:
  print(f'Unique {item} : {df[item].nunique()}')

## 3. ***Data Wrangling***

### Data Wrangling Code

Dealing with '0' values in price column.

In [None]:
# Removing the listings with price 0, which is an anomaly.
# .copy() avoids SettingWithCopyWarning when new columns are added below.
df = df[df['price'] != 0].copy()

Converting dtype of last review column for analysis.

In [None]:
# Converting last_review column to datetime object (%m is the month; %M would parse minutes).
df['last_review'] = pd.to_datetime(df['last_review'], format='%Y-%m-%d')

In [None]:
# Extracting month.
df['last_review_month'] = df['last_review'].dt.strftime('%b')
df['last_review_month'] = df['last_review_month'].fillna('0')

In [None]:
# checking unique values.
df['last_review_month'].unique()

In [None]:
# Extracting year from the modified column to add a new column named year to the dataframe.
df['year'] = df['last_review'].dt.year

In [None]:
# Filling missing years with 0 and changing the dtype to int instead of float.
df['year'] = df['year'].fillna(0).astype(int)

In [None]:
# Dropping the last_review column.
df.drop(['last_review'], axis=1, inplace=True)

In [None]:
# Renaming year to last_review_year
df.rename(columns={'year': 'last_review_year'}, inplace=True)

In [None]:
# First view after the operations.
df.head(100)

Specific Ideas that can be inferred from the dataframe.



In [None]:
#neighbourhood wise and neighbourhood_group wise count of listings
county = df.groupby('neighbourhood_group')['id'].count().reset_index(name='listings').sort_values(by=['listings'])
cities = df.groupby('neighbourhood')['id'].count().reset_index(name='listings')

In [None]:
county

In [None]:
cities

In [None]:
# neighbourhood with maximum listings.
cities[cities['listings'] == cities['listings'].max()]

In [None]:
# neighbourhood_group with maximum listings.
county[county['listings'] == county['listings'].max()]

In [None]:
# Host with maximum listings
max_listings_count = df['calculated_host_listings_count'].max()
max_listings_host_id = df.loc[df['calculated_host_listings_count'] == max_listings_count, 'host_id'].unique()

In [None]:
# unique() returns an array, so take its first element instead of calling int() on the array.
print(f'Host with id {max_listings_host_id[0]} has maximum listings of {max_listings_count}')

In [None]:
# Exploring different room types.
df['room_type'].unique()

In [None]:
# Determining which room type is most expensive.
avg_room_type_tariff = df.groupby('room_type')['price'].mean().round(2).to_frame()
avg_room_type_tariff[avg_room_type_tariff['price'] == avg_room_type_tariff['price'].max()]

In [None]:
# Comparison between price of room types in different neighbourhood groups.

avg_price = round(df.groupby(['room_type', 'neighbourhood_group'])['price'].agg(['mean', 'median']),2)
avg_price.rename({'mean':'avg_price','median':'median_price'}, axis=1, inplace=True)
avg_price

In [None]:
room_availability = round(df.groupby(['room_type', 'neighbourhood_group'])['availability_365'].agg(['mean', 'median']),2)
room_availability.rename({'mean':'avg_availability','median':'median_availability'}, axis=1, inplace=True)
room_availability

In [None]:
# listing with maximum reviews
df[df['number_of_reviews'] == df['number_of_reviews'].max()]

In [None]:
df_copy = df.drop(['id','name', 'host_id', 'host_name', 'last_review_month'], axis=1 )

### What all manipulations have you done and insights you found?

1. Some listings have a price of 0. This could be a price update issue on the server side or the client side. These observations are removed before further analysis.
2. The last_review column is converted to a datetime object, and the month and year are extracted for further analysis.
3. Brooklyn and Manhattan have the largest number of listings among the neighbourhood groups.
4. Host with id 219517861 has the maximum number of listings, 327, and all of them are located in Manhattan.
5. There are three room types: shared room, private room and entire home/apt. The third is the most expensive of the three on average.
6. The mean and median of availability_365 and price were computed for each room type in each neighbourhood group.
7. A copy of df is generated after dropping columns that might not be necessary for the visualisation section.
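
As a minimal sketch of the date handling in point 2, the snippet below parses a few made-up date strings (the `dates` series is illustrative, not taken from the dataset) and extracts the year, using 0 as the placeholder for listings without a review. Note that `%m` is the month directive, whereas `%M` would be read as minutes.

```python
import pandas as pd

# Hypothetical last_review values, including one missing entry.
dates = pd.Series(['2019-05-21', None, '2018-11-03'])

# Missing entries become NaT after parsing.
parsed = pd.to_datetime(dates, format='%Y-%m-%d')

# dt.year is float when NaT is present, so fill before casting to int.
year = parsed.dt.year.fillna(0).astype(int)
print(year.tolist())
```

A placeholder of 0 keeps the column as a plain integer, but it has to be filtered out again (for example with `year > 0`) before plotting or computing statistics.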


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Count plot of neighbourhood groups.
plt.figure(figsize=(10,6))
sns.countplot(data=df_copy, x='neighbourhood_group')
plt.show()

##### 1. Why did you pick the specific chart?

* This chart compares the number of listings in different neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

 * Manhattan has the most number of listings followed by Brooklyn.
 * Staten Island and Bronx have the lowest.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

 * The major chunk of Airbnb's revenue likely comes from the neighbourhood groups Brooklyn and Manhattan, as they have the largest number of listings. Targeted campaigns could attract more hosts by showing how lucrative renting through Airbnb can be in Brooklyn and Manhattan.
 * For the rest of the neighbourhood groups, the company could make property owners aware of the potential of their properties and how they could benefit from Airbnb.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# count plot of room types.
plt.figure(figsize=(8,6))
sns.countplot(data=df_copy, x='room_type')
plt.show()

##### 1. Why did you pick the specific chart?

* The above chart shows the number of listings of each room type.

##### 2. What is/are the insight(s) found from the chart?

* Private rooms and entire homes/apartments far outnumber shared rooms.
* Assuming supply follows demand, this suggests that demand is highest for private rooms and entire homes/apartments.
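
The counts behind a count plot can also be read as numbers. The sketch below is illustrative only: the `rooms` series is made up, and the same two calls would apply to `df_copy['room_type']`. It shows absolute counts and percentage shares per room type.

```python
import pandas as pd

# Hypothetical room types for six listings.
rooms = pd.Series(
    ['Entire home/apt', 'Private room', 'Private room',
     'Shared room', 'Entire home/apt', 'Private room'],
    name='room_type',
)

# Absolute number of listings per room type, most common first.
counts = rooms.value_counts()

# Share of each room type as a percentage of all listings.
shares = rooms.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```

Shares make it easier to compare room types across neighbourhood groups of very different sizes.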

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, it helps create a positive business impact, as the chart helps to identify the room types that are in high demand.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10,6))
x = df_copy.groupby(['room_type', 'neighbourhood_group'])['room_type'].count().reset_index(name='number')
sns.barplot(data=x, x='neighbourhood_group', y='number', hue='room_type')
plt.show()

##### 1. Why did you pick the specific chart?

* This chart shows the number of rooms of each type available in the various neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

* Manhattan and Brooklyn have the most rooms among the neighbourhood groups, and the most preferred room type is entire home/apt.
* Bronx and Staten Island have very few rooms available compared to the other neighbourhood groups.
* Shared rooms are the least common of the room types.
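
The grouped bar chart is a picture of a two-way frequency table. As a hedged sketch on made-up rows (`toy` is illustrative, not the real data), `pd.crosstab` produces that table directly, with zeros for combinations that do not occur.

```python
import pandas as pd

# Hypothetical listings with their neighbourhood group and room type.
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
    'room_type': ['Entire home/apt', 'Private room', 'Private room',
                  'Private room', 'Shared room'],
})

# Rows are neighbourhood groups, columns are room types, cells are listing counts.
table = pd.crosstab(toy['neighbourhood_group'], toy['room_type'])
print(table)
```

Applied to `df_copy`, the same call would give the exact counts that the bar heights represent.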


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this graph helps to identify the most and least preferred room types in each neighbourhood group, along with their counts.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(16,6))
sns.boxplot(data=df, x='room_type', y='availability_365',ax=ax1)
sns.boxplot(data=df, x='neighbourhood_group', y='availability_365',ax=ax2)
plt.show()


##### 1. Why did you pick the specific chart?

* Chart 1 helps to understand the median availability of different room types.
* Chart 2 helps to understand the median availability of rooms in different neighbourhoods.


##### 2. What is/are the insight(s) found from the chart?

* Chart 1: More than 50% of private rooms and entire homes/apts are available for fewer than 50 days of the year, while shared rooms have a median availability of nearly 100 days.
* Chart 2: Brooklyn and Manhattan have the lowest availability of rooms, whereas Staten Island and Bronx have rooms available for 150-200 days.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the above insights help to create a positive business impact.

#### Chart - 5

In [None]:
# Chart - 5 visualization code: Box plot
new_df = df_copy.loc[:,['price', 'availability_365', 'number_of_reviews', 'reviews_per_month', 'minimum_nights', 'last_review_year' ]]
# Grid layout (named so they do not overwrite the rows/columns from df.shape).
n_rows = 2
n_cols = 3
fig = plt.figure(figsize= (15,5))
for i, column in enumerate(new_df.columns):
  ax = fig.add_subplot(n_rows, n_cols, i+1)
  # Draw each column's box plot on its own subplot.
  sns.boxplot(data=new_df, x=column, ax=ax)
fig.tight_layout()
plt.show()



In [None]:
# Box plot after removing placeholder years (0) and minimum_nights of 50 or more.

d = df.loc[df['last_review_year'] > 0, 'last_review_year']
e = df.loc[df['minimum_nights'] < 50, 'minimum_nights']


sns.set_style("whitegrid")
f,axes = plt.subplots(nrows=1, ncols=2, figsize=(20,10))
sns.boxplot(x=d, ax=axes[0])
sns.boxplot(x=e, ax=axes[1])

plt.show()

In [None]:
# Countplot of last_review_year for general idea
plt.figure(figsize=(10,6))
sns.countplot(x=d)
plt.show()

##### 1. Why did you pick the specific chart?

 * These charts show the distribution of values and help to find outliers.

##### 2. What is/are the insight(s) found from the chart?

* For price per night, values above 500 appear to be outliers.
* In the number_of_reviews box plot, the outliers are established listings that are very popular with customers.
* The minimum_nights box plot helps to identify outliers, including listings with a minimum stay of more than one year.
* The last_review_year box plot provides information about actively rented and reviewed listings.
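
Reading outliers off a box plot can be made explicit with the 1.5 × IQR rule, which is what matplotlib and seaborn use for the whiskers by default. The sketch below applies it to a made-up `prices` series (illustrative values, not the real price column). The cut-off it produces depends on the data, so it need not match the visual estimate of 500 given above.

```python
import pandas as pd

# Hypothetical nightly prices with one extreme value.
prices = pd.Series([50, 80, 100, 120, 150, 200, 5000], name='price')

# Quartiles and interquartile range.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(lower, upper)
print(outliers.tolist())
```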

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this will help in optimizing the business. Communicating with the hosts of the outlying listings in the above plots could help bring those listings in line with the rest and improve business for both the company and the host.


#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Neighbourhood group vs prices
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(16,6))
sns.histplot(data=df_copy[df_copy['price'] < 5000], x='neighbourhood_group', y='price', hue='room_type', ax=ax1)

sns.histplot(data=df_copy, x='neighbourhood_group', y='number_of_reviews', hue='room_type',ax=ax2)
plt.show()

##### 1. Why did you pick the specific chart?

This chart depicts the prices and reviews received for the various room types in different neighbourhood groups as a spectrum.
* Chart 1 shows the spectrum of prices tagged for individual room types in various neighbourhood groups.
* Chart 2 shows the spectrum of reviews received by individual room types in various neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

* In chart 1, the prices of rooms in Manhattan and Brooklyn are distributed over higher values, led by entire home/apt.
* In chart 2, the number of reviews shows the same trend. The only significant difference is the room type that receives the most reviews: private rooms are reviewed the most.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The following insights help to create a positive business impact:
* The diverse prices of a particular room type can be attributed to the location, population density and quality of life offered by the neighbourhood groups.
* From the second insight, it can be assumed that private rooms are most frequently visited by new customers, as they are reviewed the most.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Keep only listings that have a last review year (0 marks listings without reviews).
reviewed = df[df['last_review_year'] > 0]

f, axes = plt.subplots(nrows=2, ncols=2, figsize=(20,10))
sns.scatterplot(data=df, x='reviews_per_month', y='price', hue='room_type', ax=axes[0][0])
sns.scatterplot(data=df, x='availability_365', y='price', hue='room_type', ax=axes[0][1])
sns.scatterplot(data=reviewed, x='last_review_year', y='minimum_nights', hue='room_type', ax=axes[1][0])
sns.scatterplot(data=reviewed, x='availability_365', y='minimum_nights', hue='room_type', ax=axes[1][1])
plt.show()

##### 1. Why did you pick the specific chart?

* These charts help to identify where most observations are concentrated.

##### 2. What is/are the insight(s) found from the chart?

* From chart 1, it can be observed that the highest-priced listings have reviews per month values close to one.
* From chart 2, most of the rooms are priced in the range of 1-1000.
* In chart 3, listings with a minimum night constraint above 365 were last reviewed before 2019.
* Chart 4 shows outliers with a minimum night constraint of more than 1000.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* The insight from chart 1 suggests that hosts of highly priced rooms either expect customers who prefer luxury over value for money or have overvalued their property, which can hurt both their own business and the company's.
* The minimum night constraint can be a reason for less frequent reviews, which probably reflects very few customers. This can be inferred from chart 3, where the listing with a minimum night constraint of 1200 was last reviewed in 2014.

Addressing above issues can make the business function better.

#### Chart - 8 - Pair plot

In [None]:
# Chart - 8 visualization code
df_copy1 = df_copy.loc[(df_copy['price'] < 2000) & (df_copy['reviews_per_month'] < 20) & (df_copy['minimum_nights'] < 200)]
df_copy1=df_copy1[['room_type', 'price', 'minimum_nights', 'reviews_per_month', 'availability_365', 'last_review_year']]
sns.pairplot(df_copy1, hue='room_type')
plt.show()

##### 1. Why did you pick the specific chart?

Pair plot helps in comparison of multiple variables in little space.

##### 2. What is/are the insight(s) found from the chart?

* Most of the highly priced properties have lower minimum night constraints.
* There are some highly priced shared rooms which have fewer reviews per month.
* Listings last reviewed before 2015 have 0.5-1 reviews per month.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Following insights can lead to negative growth:
* Some shared rooms are overpriced. This can affect the business, since shared rooms are expected to be cheap.
* For some listings, the minimum night constraint inhibits frequent customer visits. This can be inferred from the above pair plot.

Addressing these issues can create a positive business impact.



In [None]:
# Map-like view of the listings: longitude on the x-axis, latitude on the y-axis.
plt.figure(figsize=(20,10))
sns.scatterplot(data=df_copy, x='longitude', y='latitude', hue='neighbourhood_group')
plt.show()

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(12,10))
df_copy2 = df_copy1.iloc[:,1:]
corr = df_copy2.corr()
sns.heatmap(corr, annot=True)
plt.show()


##### 1. Why did you pick the specific chart?

* Helps to establish correlation between different variables.

##### 2. What is/are the insight(s) found from the chart?

Negative Correlation:
* Minimum nights and reviews per month.
* Price and reviews per month.
* Minimum nights and last review year.
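
A heatmap only colours the numbers returned by `DataFrame.corr`, which computes Pearson correlation by default. The sketch below uses made-up values (`toy` is illustrative, not the real data) to show how a negative coefficient between minimum nights and reviews per month would be read from that table. The Spearman variant is an addition here, not something the notebook computes; it is rank-based and so less sensitive to the extreme values seen in the box plots.

```python
import pandas as pd

# Hypothetical listings: longer minimum stays paired with fewer reviews per month.
toy = pd.DataFrame({
    'minimum_nights': [1, 2, 3, 7, 30],
    'reviews_per_month': [4.0, 3.1, 2.5, 1.0, 0.1],
    'price': [80, 120, 100, 150, 90],
})

# Pearson (linear) correlation, the default used for the heatmap.
pearson = toy.corr(method='pearson')

# Spearman (rank) correlation as a more outlier-robust comparison.
spearman = toy.corr(method='spearman')

print(pearson.loc['minimum_nights', 'reviews_per_month'])
print(spearman.loc['minimum_nights', 'reviews_per_month'])
```

A negative coefficient describes an association only. It does not by itself show that lowering the minimum nights constraint would increase reviews.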


Suggestions based on the analysis above:

* The major chunk of Airbnb's revenue likely comes from the neighbourhood groups Brooklyn and Manhattan, as they have the largest number of listings. Targeted campaigns could attract more hosts by showing how lucrative renting through Airbnb can be in Brooklyn and Manhattan.
* For the rest of the neighbourhood groups, the company could make property owners aware of the potential of their properties and how they could benefit from Airbnb.
* The preference for different room types seen in the above analysis could be used to pitch them to customers more effectively.
* The diverse prices of a particular room type can be attributed to the location, population density and quality of life offered by the neighbourhood groups.
* From the above analysis, it can be assumed that private rooms are most frequently visited by new customers, as they are reviewed the most. This information could be used to encourage more hosts to offer private rooms.
* Reducing the minimum nights constraint may increase the frequency of customer visits. This is suggested by the correlation heatmap.
* Repricing overpriced shared rooms, which are expected to be the cheapest, can bring more business to the hosts and to the company.
* For the listings that were not reviewed after 2016-17, a deeper check of the reviews and an enquiry with the host could help to find the reason.

# **Conclusion**

This exploratory analysis of the Airbnb NYC 2019 data shows that listings are concentrated in Manhattan and Brooklyn, while Staten Island and Bronx have the fewest. Entire homes/apartments and private rooms far outnumber shared rooms, and prices vary considerably with neighbourhood group and room type. Reviews per month are negatively correlated with both price and the minimum nights constraint, and a small number of listings with very high prices or very long minimum stays stand out as outliers. Overall, the analysis provides insights into the Airbnb market that can inform strategic decisions for both the company and its hosts.
