# Exploratory Data Analysis

## Introduction
The dataset used in this analysis is the New York City Airbnb Open Data (2019), which describes Airbnb listings across the five boroughs of New York City: Manhattan, Brooklyn, Queens, Bronx, and Staten Island. The data was collected from the Airbnb platform in 2019 and covers several aspects of short-term rental activity in the city.

Each record in the dataset represents a unique Airbnb listing and contains information such as the host’s details, neighborhood, room type, price, minimum number of nights required, availability throughout the year, and the number of reviews.

Analyzing this dataset is valuable because it can reveal trends and patterns in the short-term rental market of one of the world’s most dynamic cities. Through this exploratory data analysis (EDA), we aim to understand how location, pricing, and availability vary across neighborhoods, identify popular areas and property types, and examine how host activity relates to the market.

The insights derived from this analysis can be useful for multiple stakeholders:
- Hosts, who can optimize their pricing and improve listing performance based on market trends.
- Travelers, who can make more informed choices regarding affordability and neighborhood preferences.
- Airbnb’s policy and operations teams, who can use the findings to ensure fair usage, monitor saturation levels, and support sustainable tourism practices.


Overall, this analysis aims to provide a data-driven understanding of Airbnb activity across New York City, highlighting both economic and spatial dynamics within the urban short-term rental ecosystem.

## Data Loading & Overview

In [1]:
# Importing necessary libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Loading the dataset.
df = pd.read_csv('../data/AB_NYC_2019.csv')
df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | | | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |
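
The preview shows `last_review` holding date strings, with blanks for listings that have no reviews. If date arithmetic is needed later, one option is to parse that column at load time. The following is a minimal sketch of that idea, not part of the original notebook: `sample_csv` is a hypothetical in-memory stand-in built from a few columns of three preview rows, so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../data/AB_NYC_2019.csv: a few columns of
# three rows taken from the preview.
sample_csv = io.StringIO(
    "id,name,room_type,price,number_of_reviews,last_review,reviews_per_month\n"
    "2539,Clean & quiet apt home by the park,Private room,149,9,2018-10-19,0.21\n"
    "2595,Skylit Midtown Castle,Entire home/apt,225,45,2019-05-21,0.38\n"
    "3647,THE VILLAGE OF HARLEM....NEW YORK !,Private room,150,0,,\n"
)

# parse_dates converts last_review to a datetime column; blanks become NaT.
sample = pd.read_csv(sample_csv, parse_dates=["last_review"])

print(sample.dtypes)
print("Missing last_review values:", sample["last_review"].isna().sum())
```

With the real file, the same `parse_dates=["last_review"]` argument could be passed to the `pd.read_csv('../data/AB_NYC_2019.csv')` call.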


In [3]:
# Shape of the dataset.
print("Dataset Dimensions: ", df.shape)

Dataset Dimensions:  (48895, 16)


In [4]:
# Column names.
print("\nColumn Names:\n", df.columns.tolist())


Column Names:
 ['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']


In [5]:
# Info about datatypes, non-null counts, memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

#### Identifying missing values

In [6]:
# Missing values per column
print("\nMissing Values: \n", df.isnull().sum())


Missing Values: 
 id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
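
The identical counts for `last_review` and `reviews_per_month` (10,052 each) suggest that both are blank for listings that have never been reviewed. The sketch below shows one way to check that and, assuming a rate of zero is an acceptable stand-in, to fill `reviews_per_month`. The `sample` frame is a hypothetical stand-in built from the five preview rows; on the full data the same expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: review-related columns from the five preview rows.
sample = pd.DataFrame({
    "id": [2539, 2595, 3647, 3831, 5022],
    "number_of_reviews": [9, 45, 0, 270, 9],
    "last_review": ["2018-10-19", "2019-05-21", None, "2019-07-05", "2018-11-19"],
    "reviews_per_month": [0.21, 0.38, np.nan, 4.64, 0.10],
})

# Is reviews_per_month missing exactly where the listing has no reviews?
no_reviews = sample["number_of_reviews"] == 0
missing_rate = sample["reviews_per_month"].isna()
same_rows = bool((no_reviews == missing_rate).all())
print("Missing only where number_of_reviews == 0:", same_rows)

# Assumption: a listing with no reviews has a rate of 0 reviews per month.
# last_review stays missing, because there is no review date to fill in.
cleaned = sample.assign(reviews_per_month=sample["reviews_per_month"].fillna(0))
print(cleaned.isna().sum())
```

Filling with zero is a modelling choice rather than a fact about the data, so it is worth confirming the check on the full dataset before applying it.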


In [7]:
# Total Duplicate rows
print("\nNumber of Duplicate rows: \n", df.duplicated().sum())


Number of Duplicate rows: 
 0
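
`df.duplicated()` compares entire rows, so it would not flag a listing that appears twice with one field changed. A complementary check is to look for repeated values in the `id` column. The sketch below uses hypothetical rows to illustrate the difference; whether the real dataset contains any repeated ids is not shown here.

```python
import pandas as pd

# Hypothetical rows: the second and third share an id but differ in price,
# so a full-row comparison treats them as distinct.
sample = pd.DataFrame({
    "id": [2539, 2595, 2595],
    "host_id": [2787, 2845, 2845],
    "price": [149, 225, 230],
})

full_row_dupes = int(sample.duplicated().sum())
id_dupes = int(sample.duplicated(subset=["id"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Repeated ids:", id_dupes)
```

On the full data, `df.duplicated(subset=["id"]).sum()` (or `df["id"].is_unique`) would give the corresponding check.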


#### Understanding key columns
| Column | Description |
|---|---|
| price | The nightly cost (in USD) of the Airbnb listing. Crucial for analyzing affordability and market segmentation. |
| neighbourhood_group | The NYC borough the listing belongs to: one of _Manhattan_, _Brooklyn_, _Queens_, _Bronx_, or _Staten Island_. Useful for comparing areas geographically. |
| room_type | The category of the listing: _Entire home/apt_, _Private room_, or _Shared room_. Influences both price and availability trends. |
| availability_365 | The number of days in the year (0 to 365) on which the listing is available for booking. A rough indicator of host engagement. |
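
As a quick illustration of how these four columns might be summarized, the sketch below counts room types, takes the median price per borough, and counts listings with no available days. The `sample` frame is a hypothetical stand-in built from the five preview rows, so the numbers it prints say nothing about the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: the four key columns from the five preview rows.
sample = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt"],
    "price": [149, 225, 150, 89, 80],
    "availability_365": [365, 355, 365, 194, 0],
})

# Number of listings per room type.
room_counts = sample["room_type"].value_counts()

# Median nightly price per borough; the median is less sensitive to
# extreme prices than the mean.
median_price = sample.groupby("neighbourhood_group")["price"].median()

# Listings with zero bookable days in the year.
never_available = int((sample["availability_365"] == 0).sum())

print(room_counts)
print(median_price)
print("Listings with availability_365 == 0:", never_available)
```

The median is used here as a design choice; replacing it with `.mean()` or `.describe()` gives other views of the same column.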