# NYC Airbnb Data Overview

## Objective
Explore and understand the structure, size, and quality of the NYC Airbnb dataset
before performing data cleaning and analysis.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
data = pd.read_csv('../data/raw/Airbnb_Open_Data.csv', low_memory=False)

In [4]:
data.head()

Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,...,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,...,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,...,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",
3,1002755,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,...,$74,30.0,270.0,7/5/2019,4.64,4.0,1.0,322.0,,
4,1003689,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,...,$41,10.0,9.0,11/19/2018,0.1,3.0,1.0,289.0,"Please no smoking in the house, porch or on th...",


In [5]:
data.tail()

Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
102594,6092437,Spare room in Williamsburg,12312296767,verified,Krik,Brooklyn,Williamsburg,40.70862,-73.94651,United States,...,$169,1.0,0.0,,,3.0,1.0,227.0,No Smoking No Parties or Events of any kind Pl...,
102595,6092990,Best Location near Columbia U,77864383453,unconfirmed,Mifan,Manhattan,Morningside Heights,40.8046,-73.96545,United States,...,$167,1.0,1.0,7/6/2015,0.02,2.0,2.0,395.0,House rules: Guests agree to the following ter...,
102596,6093542,"Comfy, bright room in Brooklyn",69050334417,unconfirmed,Megan,Brooklyn,Park Slope,40.67505,-73.98045,United States,...,$198,3.0,0.0,,,5.0,1.0,342.0,,
102597,6094094,Big Studio-One Stop from Midtown,11160591270,unconfirmed,Christopher,Queens,Long Island City,40.74989,-73.93777,United States,...,$109,2.0,5.0,10/11/2015,0.1,3.0,1.0,386.0,,
102598,6094647,585 sf Luxury Studio,68170633372,unconfirmed,Rebecca,Manhattan,Upper West Side,40.76807,-73.98342,United States,...,$206,1.0,0.0,,,3.0,1.0,69.0,,


In [11]:
data.shape

(102599, 26)

- The dataset contains 102,599 rows and 26 columns.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              102599 non-null  int64  
 1   NAME                            102349 non-null  object 
 2   host id                         102599 non-null  int64  
 3   host_identity_verified          102310 non-null  object 
 4   host name                       102193 non-null  object 
 5   neighbourhood group             102570 non-null  object 
 6   neighbourhood                   102583 non-null  object 
 7   lat                             102591 non-null  float64
 8   long                            102591 non-null  float64
 9   country                         102067 non-null  object 
 10  country code                    102468 non-null  object 
 11  instant_bookable                102494 non-null  object 
 12  cancellation_pol

In [7]:
data.isna().sum()

id                                     0
NAME                                 250
host id                                0
host_identity_verified               289
host name                            406
neighbourhood group                   29
neighbourhood                         16
lat                                    8
long                                   8
country                              532
country code                         131
instant_bookable                     105
cancellation_policy                   76
room type                              0
Construction year                    214
price                                247
service fee                          273
minimum nights                       409
number of reviews                    183
last review                        15893
reviews per month                  15879
review rate number                   326
calculated host listings count       319
availability 365                     448
house_rules     

In [8]:
data.isna().sum().sum()

np.int64(190769)

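- Column names are inconsistently formatted: some use spaces (`host id`), some use underscores (`host_identity_verified`), and capitalization varies (`NAME`, `Construction year`).
- In total, 190,769 values are missing across the dataset.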
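
One way to address the inconsistent column names is to normalize them to lowercase snake_case. The sketch below is a minimal illustration, not the notebook's actual cleaning step: `normalize_columns` is a hypothetical helper, and the empty `sample` frame only borrows a few column names from the dataset.

```python
import pandas as pd


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lowercase, snake_case column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                 # drop stray leading/trailing spaces
        .str.lower()                            # NAME -> name
        .str.replace(r"\s+", "_", regex=True)   # host id -> host_id
    )
    return out


# Empty stand-in frame that reuses a few column names from the dataset
sample = pd.DataFrame(
    columns=["NAME", "host id", "host_identity_verified", "Construction year"]
)
cleaned = normalize_columns(sample)
print(list(cleaned.columns))
```

Returning a copy keeps the raw frame untouched, which makes it easier to compare before and after during cleaning.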


In [13]:
(data.isna().sum() / data.shape[0] * 100).sort_values(ascending=False)

license                           99.998051
house_rules                       50.810437
last review                       15.490404
reviews per month                 15.476759
country                            0.518524
availability 365                   0.436651
minimum nights                     0.398639
host name                          0.395715
review rate number                 0.317742
calculated host listings count     0.310919
host_identity_verified             0.281679
service fee                        0.266084
NAME                               0.243667
price                              0.240743
Construction year                  0.208579
number of reviews                  0.178364
country code                       0.127682
instant_bookable                   0.102340
cancellation_policy                0.074075
neighbourhood group                0.028265
neighbourhood                      0.015595
long                               0.007797
lat                             

- `license` has almost complete missingness at 99.998%, meaning it is essentially unusable for analysis.
- `house_rules` has a high missing rate at 50.81%, so about half of the entries lack this information.
- `last review` and `reviews per month` both have around 15% missing values, with nearly identical counts (15,893 and 15,879). This is moderate and may require imputation or careful handling; the head and tail samples suggest these gaps often coincide with listings that have 0 reviews.
- Most other columns have very low missingness, typically below 1%, including `country`, `availability 365`, `minimum nights`, and many others.
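
The missingness review above can be turned into a small report that also flags columns to consider dropping. This is a sketch under assumptions: `missing_report` is a hypothetical helper, the 50% `drop_threshold` is an illustrative cutoff rather than a value chosen in this notebook, and the `sample` frame is made-up data.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame, drop_threshold: float = 50.0) -> pd.DataFrame:
    """Summarize missing values per column, sorted from most to least missing."""
    report = pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_pct": df.isna().mean() * 100,
    })
    # Flag columns whose missing share exceeds the (assumed) threshold
    report["drop_candidate"] = report["missing_pct"] > drop_threshold
    return report.sort_values("missing_pct", ascending=False)


# Made-up rows for illustration only
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "license": [np.nan, np.nan, np.nan, "41662/AL"],
    "last review": ["10/19/2021", np.nan, "7/5/2019", "11/19/2018"],
})
report = missing_report(sample)
print(report)
```

Whether a flagged column is actually dropped remains a judgment call; a half-empty text column such as `house_rules` may still be useful as a presence/absence feature.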


In [15]:
data.describe()

Unnamed: 0,id,host id,lat,long,Construction year,minimum nights,number of reviews,reviews per month,review rate number,calculated host listings count,availability 365
count,102599.0,102599.0,102591.0,102591.0,102385.0,102190.0,102416.0,86720.0,102273.0,102280.0,102151.0
mean,29146230.0,49254110000.0,40.728094,-73.949644,2012.487464,8.135845,27.483743,1.374022,3.279106,7.936605,141.133254
std,16257510.0,28539000000.0,0.055857,0.049521,5.765556,30.553781,49.508954,1.746621,1.284657,32.21878,135.435024
min,1001254.0,123600500.0,40.49979,-74.24984,2003.0,-1223.0,0.0,0.01,1.0,1.0,-10.0
25%,15085810.0,24583330000.0,40.68874,-73.98258,2007.0,2.0,1.0,0.22,2.0,1.0,3.0
50%,29136600.0,49117740000.0,40.72229,-73.95444,2012.0,3.0,7.0,0.74,3.0,1.0,96.0
75%,43201200.0,73996500000.0,40.76276,-73.93235,2017.0,5.0,30.0,2.0,4.0,2.0,269.0
max,57367420.0,98763130000.0,40.91697,-73.70522,2022.0,5645.0,1024.0,90.0,5.0,332.0,3677.0


Some columns contain invalid or unrealistic values:

- `minimum nights` has a minimum value of **-1223**, which is not logically possible.
- `availability 365` ranges from -10 to 3677, although valid values should fall between 0 and 365.
- Extremely high values in `minimum nights` (up to 5645) suggest data entry errors or outliers.
- `Construction year` ranges from 2003 to 2022, with a median of 2012. This suggests most properties are relatively modern, though the field will require validation.
- `number of reviews` is highly right-skewed: the median is 7 and the 75th percentile is 30, while the maximum is 1024. Many listings have few or no reviews, and a small number have very high review counts.
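
The range problems listed above could be flagged with simple bounds checks. The following is a sketch, not the notebook's cleaning code: `out_of_range` is a hypothetical helper, the upper bound of 365 for `minimum nights` is an assumption chosen for illustration, and the `sample` values merely echo the extremes seen in `describe()`.

```python
import numpy as np
import pandas as pd

# Made-up rows that echo the extremes reported by describe()
sample = pd.DataFrame({
    "minimum nights": [10.0, -1223.0, 5645.0, 3.0, np.nan],
    "availability 365": [286.0, -10.0, 3677.0, 352.0, 100.0],
})

MAX_MIN_NIGHTS = 365  # assumed upper bound, not taken from the dataset


def out_of_range(col: pd.Series, low: float, high: float) -> pd.Series:
    """True where a non-missing value falls outside [low, high]."""
    return col.notna() & ~col.between(low, high)


bad_nights = out_of_range(sample["minimum nights"], 1, MAX_MIN_NIGHTS)
bad_avail = out_of_range(sample["availability 365"], 0, 365)

# Inspect the offending rows before deciding how to treat them
print(sample[bad_nights | bad_avail])

# One option: blank out invalid values so they are handled with other missing data
cleaned = sample.copy()
cleaned.loc[bad_nights, "minimum nights"] = np.nan
cleaned.loc[bad_avail, "availability 365"] = np.nan
```

Values that were already missing are deliberately not flagged, so the counts reflect only entries that are present but invalid.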



In [None]:
data.describe(include='object')

Unnamed: 0,NAME,host_identity_verified,host name,neighbourhood group,neighbourhood,country,country code,instant_bookable,cancellation_policy,room type,price,service fee,last review,house_rules,license
count,102349,102310,102193,102570,102583,102067,102468,102494,102523,102599,102352,102326,86706,50468,2
unique,61281,2,13190,7,224,1,1,2,3,4,1151,231,2477,1976,1
top,Home away from home,unconfirmed,Michael,Manhattan,Bedford-Stuyvesant,United States,US,False,moderate,Entire home/apt,$206,$41,6/23/2019,#NAME?,41662/AL
freq,33,51200,881,43792,7937,102067,102468,51474,34343,53701,137,526,2443,2712,2


- The `price` and `service fee` columns are stored as strings and include currency symbols. These will need to be converted to numeric values during data cleaning.
- `country` and `country code` each contain a single value (`United States` and `US`), so they carry no information for analysis.
- `last review` holds dates but is stored as `object` rather than as a datetime type.
- `instant_bookable` is effectively boolean, with 2 unique values (`False` being the more frequent), but is stored as `object`.
- `license` has only 1 unique value, and just 2 of its 102,599 entries are non-null.
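
The type issues noted above could be handled roughly as follows. This is a sketch under assumptions: `currency_to_number` is a hypothetical helper, the `sample` values are made up, the thousands separator in `"$1,060"` is assumed to be possible in the real data, and the `%m/%d/%Y` date format is inferred from the rows shown earlier.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "price": ["$966", "$1,060", None],
    "service fee": ["$193", "$28", "$124"],
    "last review": ["10/19/2021", "5/21/2022", None],
    "instant_bookable": [False, True, None],
})


def currency_to_number(col: pd.Series) -> pd.Series:
    """Strip '$', commas, and whitespace, then convert to a numeric dtype."""
    stripped = col.str.replace(r"[$,\s]", "", regex=True)
    # Anything that still fails to parse becomes NaN instead of raising
    return pd.to_numeric(stripped, errors="coerce")


converted = sample.copy()
for name in ["price", "service fee"]:
    converted[name] = currency_to_number(converted[name])

# Parse month/day/year strings; missing or unparseable entries become NaT
converted["last review"] = pd.to_datetime(
    converted["last review"], format="%m/%d/%Y", errors="coerce"
)

# Nullable boolean dtype keeps missing values as <NA> instead of forcing True/False
converted["instant_bookable"] = converted["instant_bookable"].astype("boolean")

print(converted.dtypes)
```

Using `errors="coerce"` trades strictness for robustness, so it is worth comparing the missing-value counts before and after conversion to see how many entries failed to parse.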


### Next Steps

The identified data quality issues will be addressed in the data cleaning phase:
missing values will be handled appropriately, invalid entries corrected, and outliers treated.
Categorical variables will be standardized, and low-value or redundant columns
considered for removal.
