# **Project Name**    -



##### **Project Type**    - EDA on AirBNB dataset
##### **Contribution**    - Individual
##### **Name**            - Laxmikant Mukkawar


# **Project Summary -**

This project involves an exploratory data analysis (EDA) of Airbnb listings in New York City for 2019. The goal of the analysis is to uncover key insights into pricing patterns, host behavior, and geographical trends that can assist stakeholders in optimizing their business strategies.

The dataset consists of various features, including listing prices, host information, geographical location (latitude and longitude), room types, availability, and review metrics. By analyzing this data, we aim to answer several business-critical questions:

- What are the price distribution patterns across different neighborhoods and room types?
- Are there geographical clusters of high or low prices in the city?
- How does the number of listings affect host activity in terms of reviews and availability?
- What is the relationship between the number of reviews, price, and availability of listings?

## Key Insights:
- **Price Distribution:** The analysis revealed that certain neighborhoods (e.g., Manhattan) tend to have higher average prices for listings, especially for entire apartments/homes. Other neighborhoods, such as parts of Brooklyn, have lower-priced listings, particularly for shared and private rooms.
- **Host Activity:** Hosts with fewer listings tend to receive more reviews, suggesting a more personalized or dedicated hosting approach. There is no strong correlation between the number of listings a host manages and the availability of their listings.
- **Review Trends:** Listings with higher availability do not necessarily receive more reviews per month. Pricing also does not have a clear linear relationship with the number of reviews: while some expensive listings do receive many reviews, others receive few, suggesting that factors other than price (e.g., location, service quality) influence review volume.

# **GitHub Link -**

https://github.com/Laxmi884/EDA_Airbnb

# **Problem Statement**


The New York City Airbnb market is highly competitive, with thousands of listings spread across different neighborhoods, room types, and price ranges. Hosts and property managers face the challenge of setting optimal prices, maintaining high availability, and attracting positive reviews in a crowded marketplace.

#### **Define Your Business Objective?**

The primary objective of this analysis is to provide Airbnb hosts, property managers, and potential investors with data-driven insights to improve the performance of their listings in New York City. Specifically, the analysis aims to help hosts achieve the following:

1. Optimize Pricing: By analyzing price trends across different neighborhoods and room types, the objective is to help hosts set competitive prices that maximize both occupancy and revenue. Understanding how prices vary by location, room type, and other factors will enable hosts to position their listings more effectively in the market.
2. Increase Listing Visibility and Engagement: Hosts can increase guest engagement by understanding the relationship between reviews and factors like price, availability, and location. The analysis will help identify strategies that drive more reviews and guest satisfaction, leading to higher visibility on the Airbnb platform.
3. Improve Availability Management: The analysis will help hosts manage availability more efficiently by exploring how the number of days a listing is available throughout the year affects booking rates and reviews. This can lead to better resource allocation and potentially higher occupancy rates.
4. Identify High-Demand Locations: By identifying geographical clusters where demand is high (based on pricing, reviews, and availability), the objective is to help hosts target neighborhoods with strong rental demand. This can inform decisions about where to list new properties or where to focus marketing efforts.
5. Support Business Growth: For hosts with multiple listings, the analysis provides insights into how the number of listings affects overall performance, such as reviews and availability. These insights will help hosts scale their operations efficiently while maintaining a high level of guest satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is followed and the questions below are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [2]:
# Load Dataset
df = pd.read_csv('Airbnb NYC 2019.csv')
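
Since the guidelines ask for exception handling and a notebook that runs in one go, the load step could be wrapped in a small helper. This is only a sketch: the function name `load_listings` is hypothetical and not part of the original notebook, and it assumes the CSV sits at the path passed in.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read the listings CSV, failing early with a clear message if it is missing."""
    path = Path(csv_path)
    if not path.is_file():
        # Raise a descriptive error instead of a bare pandas traceback
        raise FileNotFoundError(f"Dataset not found at: {path.resolve()}")
    return pd.read_csv(path)
```

In the notebook this would be called as `df = load_listings('Airbnb NYC 2019.csv')`.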

### Dataset First View

In [3]:
# Dataset First Look
df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


### Dataset Rows & Columns count

In [4]:
# Dataset Rows & Columns count
df.shape

(48895, 16)

### Dataset Information

In [5]:
# Dataset Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

#### Duplicate Values

In [6]:
# Dataset Duplicate Value Count
df.duplicated().sum()

0

In [7]:
#checking duplicates for each column 
for col in df.columns:
    print(f"Duplicates in {col} - {df[col].duplicated().sum()}")

Duplicates in id - 0
Duplicates in name - 989
Duplicates in host_id - 11438
Duplicates in host_name - 37442
Duplicates in neighbourhood_group - 48890
Duplicates in neighbourhood - 48674
Duplicates in latitude - 29847
Duplicates in longitude - 34177
Duplicates in room_type - 48892
Duplicates in price - 48221
Duplicates in minimum_nights - 48786
Duplicates in number_of_reviews - 48501
Duplicates in last_review - 47130
Duplicates in reviews_per_month - 47957
Duplicates in calculated_host_listings_count - 48848
Duplicates in availability_365 - 48529


#### Missing Values/Null Values

In [8]:
# Missing Values/Null Values Count
df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

### What did you know about your dataset?

The dataset has 16 columns and 48895 entries. Most columns have no null values. The exceptions are 'last_review' and 'reviews_per_month', which have 10052 null values each, and 'name' and 'host_name', which have only 16 and 21 null values respectively.
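
The null counts above can also be expressed as percentages, which makes it easier to judge which columns matter. The sketch below uses a small hypothetical frame (`toy`) rather than the real data. It also checks one assumption worth verifying on the full dataset: that `reviews_per_month` is missing exactly where `number_of_reviews` is 0, as the third row of the first view suggests.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings table, for illustration only
toy = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "number_of_reviews": [9, 0, 45, 0],
    "reviews_per_month": [0.21, np.nan, 0.38, np.nan],
})

# Missing values per column as counts and as a percentage of rows
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(2),
})
summary = summary[summary["missing"] > 0]  # keep only columns with gaps
print(summary)

# Assumption to verify: review metrics are null exactly when there are no reviews
no_reviews = toy["number_of_reviews"] == 0
nulls_match_zero_reviews = bool((toy["reviews_per_month"].isnull() == no_reviews).all())
print(nulls_match_zero_reviews)
```

If the check holds on the real data, filling `reviews_per_month` with 0 would be a defensible choice, whereas `last_review` has no natural fill value.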

## ***2. Understanding Your Variables***

In [9]:
# Dataset Columns
column_names = list(df.columns)
column_names

['id',
 'name',
 'host_id',
 'host_name',
 'neighbourhood_group',
 'neighbourhood',
 'latitude',
 'longitude',
 'room_type',
 'price',
 'minimum_nights',
 'number_of_reviews',
 'last_review',
 'reviews_per_month',
 'calculated_host_listings_count',
 'availability_365']

### Variables Description

In [10]:
# Dataset Describe
df.describe()


| | id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 38843.0 | 48895.0 | 48895.0 |
| mean | 19017140.0 | 67620010.0 | 40.728949 | -73.95217 | 152.720687 | 7.029962 | 23.274466 | 1.373221 | 7.143982 | 112.781327 |
| std | 10983110.0 | 78610970.0 | 0.05453 | 0.046157 | 240.15417 | 20.51055 | 44.550582 | 1.680442 | 32.952519 | 131.622289 |
| min | 2539.0 | 2438.0 | 40.49979 | -74.24442 | 0.0 | 1.0 | 0.0 | 0.01 | 1.0 | 0.0 |
| 25% | 9471945.0 | 7822033.0 | 40.6901 | -73.98307 | 69.0 | 1.0 | 1.0 | 0.19 | 1.0 | 0.0 |
| 50% | 19677280.0 | 30793820.0 | 40.72307 | -73.95568 | 106.0 | 3.0 | 5.0 | 0.72 | 1.0 | 45.0 |
| 75% | 29152180.0 | 107434400.0 | 40.763115 | -73.936275 | 175.0 | 5.0 | 24.0 | 2.02 | 2.0 | 227.0 |
| max | 36487240.0 | 274321300.0 | 40.91306 | -73.71299 | 10000.0 | 1250.0 | 629.0 | 58.5 | 327.0 | 365.0 |


##### We observe that the minimum value of price is zero, which is not a plausible listing price. We will fix this during data wrangling.

The variables can be grouped by type as follows:
- Categorical Variables: neighbourhood_group, neighbourhood, room_type
- Numerical Variables: price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365, latitude, longitude
- Text Variables: name, host_name
- Date Variables: last_review
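
The grouping above can be mirrored in the dataframe's dtypes. The following is a minimal sketch on a hypothetical three-row frame (`toy`). Converting the low-cardinality text columns to the pandas `category` dtype is an optional step that the original notebook does not perform.

```python
import pandas as pd

# Hypothetical sample with two categorical columns and one numerical column
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [149, 225, 150],
})

# Cast each categorical variable to the memory-efficient 'category' dtype
categorical_cols = ["neighbourhood_group", "room_type"]
for col in categorical_cols:
    toy[col] = toy[col].astype("category")

# Numerical variables can then be selected by dtype instead of by name
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(toy.dtypes)
print(numeric_cols)
```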

### Check Unique Values for each variable.

In [11]:
# Check Unique Values for each variable.
df.nunique() 

id                                48895
name                              47905
host_id                           37457
host_name                         11452
neighbourhood_group                   5
neighbourhood                       221
latitude                          19048
longitude                         14718
room_type                             3
price                               674
minimum_nights                      109
number_of_reviews                   394
last_review                        1764
reviews_per_month                   937
calculated_host_listings_count       47
availability_365                    366
dtype: int64
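
These unique counts tie back to the per-column duplicate counts shown earlier: the number of duplicated entries in a column equals the number of rows minus the number of distinct values, provided missing values are counted as one distinct value. For `host_id`, 48895 - 37457 = 11438 matches exactly. For `name`, 48895 - 47905 = 990 is one more than the reported 989, because `nunique()` ignores NaN by default. A small sketch with hypothetical names illustrates the relationship:

```python
import numpy as np
import pandas as pd

# Hypothetical listing names with one repeated value and two missing values
names = pd.Series(["Cozy room", "Cozy room", "Loft", np.nan, np.nan])

dup_count = int(names.duplicated().sum())       # repeats after the first occurrence
unique_with_nan = names.nunique(dropna=False)   # NaN counted as one distinct value
unique_without_nan = names.nunique()            # default: NaN ignored

print(dup_count, len(names) - unique_with_nan, len(names) - unique_without_nan)
```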

## 3. ***Data Wrangling***

### Data Wrangling Code

#### Converting data type of date columns to datetime datatype

In [13]:
#converting the date column to datetime datatype
df['last_review'] = pd.to_datetime(df['last_review'],format='%Y-%m-%d',errors = 'coerce')

#this can be done using the below method as well

#df.loc[~df['last_review'].isnull(),'last_review'] = df.loc[~df['last_review'].isnull(),'last_review'].apply(lambda x: pd.to_datetime(x,format='%Y-%m-%d'))
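
To illustrate what `errors='coerce'` does in the conversion above, here is a small sketch on a hypothetical series: values that are missing or do not match the format become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one missing value, one malformed string
raw = pd.Series(["2018-10-19", None, "not a date"])

# Unparseable or missing entries are coerced to NaT rather than raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum())
```

One consequence is that malformed dates are silently merged with genuinely missing ones, so it can be worth comparing the null count before and after the conversion.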

#### Checking if it is converted correctly

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              48895 non-null  int64         
 1   name                            48879 non-null  object        
 2   host_id                         48895 non-null  int64         
 3   host_name                       48874 non-null  object        
 4   neighbourhood_group             48895 non-null  object        
 5   neighbourhood                   48895 non-null  object        
 6   latitude                        48895 non-null  float64       
 7   longitude                       48895 non-null  float64       
 8   room_type                       48895 non-null  object        
 9   price                           48895 non-null  int64         
 10  minimum_nights                  48895 non-null  int64         
 11  nu

#### Function to identify most popular room type

In [16]:
def most_pop_room(df):
    # Count how many listings exist for each room type
    # (the dataset records listings, not bookings)
    count_df = df['room_type'].value_counts().reset_index()
    count_df.columns = ['room_type', 'count']  # Rename columns for readability

    # Find the highest listing count
    max_value = count_df['count'].max()

    # Keep every room type that shares the highest count (handles ties)
    most_popular_rooms = count_df[count_df['count'] == max_value]

    # Return as a {room_type: count} dictionary
    result = dict(zip(most_popular_rooms['room_type'], most_popular_rooms['count']))

    return result

# Example usage
a = most_pop_room(df)

if len(a) > 1:
    print("There are multiple room types with the same number of listings. They are as follows:")
    for room_type, count in a.items():
        print(f"Room type - {room_type}, No. of listings - {count}")
else:
    room_type, count = list(a.items())[0]
    print(f"The most popular room type is {room_type} with {count} listings")

The most popular room type is Entire home/apt with 25409 listings


#### Function to identify most popular neighbourhood

In [17]:
def most_pop_neighbourhood(df):
    # Count how many listings exist in each neighbourhood group
    # (the dataset records listings, not bookings)
    count_df = df['neighbourhood_group'].value_counts()

    # Find the highest listing count
    max_value = count_df.max()

    # Keep every neighbourhood group with the highest count and convert to a dictionary
    result = count_df[count_df == max_value].to_dict()

    return result

# Example usage
pop_neighbourhood = most_pop_neighbourhood(df)

if len(pop_neighbourhood) > 1:
    print("There are multiple neighbourhood groups with the same number of listings. They are as follows:")
    for neighbourhood, count in pop_neighbourhood.items():
        print(f"Neighbourhood group - {neighbourhood}, No. of listings - {count}")
else:
    neighbourhood, count = list(pop_neighbourhood.items())[0]
    print(f"The most popular neighbourhood group is {neighbourhood} with {count} listings")

The most popular neighbourhood group is Manhattan with 21661 listings


#### Inspecting the dataframe showed that the minimum value of price is 0, which is not a plausible listing price, so we will fix this.

In [18]:
# Replace zero prices with the mean price
# (the mean is computed before the replacement, so it still includes the zero entries)
df['price'] = df['price'].replace(0,df['price'].mean())

#### Checking if zeros are replaced correctly in the price column

In [20]:
df['price'].describe()

count    48895.000000
mean       152.755045
std        240.143242
min         10.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64
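
Because the price column contains extreme values (the maximum is 10000), the mean is pulled upward by outliers. One alternative, shown here purely as a sketch and not as the method used in this notebook, is to fill zero prices with the median of the positive prices. The series below is hypothetical.

```python
import pandas as pd

# Hypothetical prices: one invalid zero and one extreme outlier
prices = pd.Series([0, 80, 100, 120, 10000])

positive = prices[prices > 0]
mean_fill = prices.mean()        # sensitive to the outlier, and includes the zero
median_fill = positive.median()  # robust to the outlier, ignores the zero

# Replace only the zero entries, leaving all other prices untouched
cleaned = prices.mask(prices == 0, median_fill)

print(mean_fill, median_fill)
print(cleaned.tolist())
```

In this toy case the mean-based fill would be far above every typical price, while the median stays inside the bulk of the data.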

#### Adding a new column named listings_category. This will help us categorize hosts by their listings count and perform analysis later

In [21]:
# Bins are right-inclusive: (0, 1], (1, 5], (5, 10], (10, inf]; '10+ listings' means more than 10
df['listings_category'] = pd.cut(df['calculated_host_listings_count'],
                                 bins=[0, 1, 5, 10, float('inf')],
                                 labels=['1 listing', '2-5 listings', '6-10 listings', '10+ listings'])
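
To make the bin edges concrete, the sketch below applies the same `bins` and `labels` to a few hypothetical listing counts. By default `pd.cut` uses intervals that are closed on the right, so a host with exactly 10 listings falls in '6-10 listings' and the last category starts at 11.

```python
import pandas as pd

# Hypothetical host listing counts chosen to sit on and around the bin edges
counts = pd.Series([1, 2, 5, 6, 10, 11, 327])

labels = ['1 listing', '2-5 listings', '6-10 listings', '10+ listings']
categories = pd.cut(counts, bins=[0, 1, 5, 10, float('inf')], labels=labels)

# Show each count next to the category it was assigned
for value, category in zip(counts, categories):
    print(value, '->', category)
```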

#### Getting skewness and kurtosis for price column

In [22]:
# Skewness and kurtosis of the price column
# (double quotes outside so the f-strings also parse on Python versions before 3.12)
print(f"The skewness of price is: {df['price'].skew()}")
print(f"The kurtosis of price is: {df['price'].kurt()}")

The skewness of price is: 19.12117771378404
The kurtosis of price is: 585.7690535466286


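From the skewness and kurtosis values above, we can infer that the price distribution is strongly positively skewed, i.e. it has a long right tail. The very high kurtosis (pandas reports excess kurtosis, for which a normal distribution scores about 0) indicates heavy tails, i.e. the presence of extreme outliers such as the maximum price of 10000.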
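
Two common follow-ups to a finding like this are a log transform, which compresses the right tail, and an IQR-based count of high outliers. Neither step is part of the original analysis. The sketch below demonstrates both on hypothetical prices, not on the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: mostly modest values plus one extreme listing
prices = pd.Series([50] * 40 + [100] * 40 + [150] * 15 + [500] * 4 + [10000])

raw_skew = prices.skew()
log_skew = np.log1p(prices).skew()  # log(1 + price) compresses the right tail

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile
q1, q3 = prices.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_outliers = int((prices > upper_fence).sum())

print(f"skew raw={raw_skew:.2f}, skew log={log_skew:.2f}")
print(f"values above {upper_fence}: {n_outliers}")
```

On data like this, the log-transformed skewness is much smaller than the raw skewness, which is why log-scaled axes are often used when plotting price distributions.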