## Airbnb Market Analysis and Price Prediction Project

This project explores New York City Airbnb data from Inside Airbnb, focusing on uncovering market patterns and predicting listing prices with machine learning. We analyze trends in reviews, availability, and listing characteristics to help inform pricing strategy and highlight high-performing listings.

In [38]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [39]:
#loading dataframes
calendar=pd.read_csv('/Users/zoewalp/Desktop/DAProjects/airbnb_nyc/nyc_calendar.csv')
reviews=pd.read_csv('/Users/zoewalp/Desktop/DAProjects/airbnb_nyc/nyc_reviews.csv')
listings=pd.read_csv('/Users/zoewalp/Desktop/DAProjects/airbnb_nyc/nyc_listings.csv')
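
The absolute paths above only resolve on the author's machine. A minimal sketch of a more portable loader, assuming the three CSVs sit together in one folder; `load_airbnb_tables`, `TABLE_FILES`, and the `data_dir` parameter are hypothetical names, not part of the original notebook:

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping; the file names match the ones loaded above.
TABLE_FILES = {
    "calendar": "nyc_calendar.csv",
    "reviews": "nyc_reviews.csv",
    "listings": "nyc_listings.csv",
}


def load_airbnb_tables(data_dir):
    """Read each CSV in TABLE_FILES from data_dir into a dict of DataFrames."""
    data_dir = Path(data_dir)
    tables = {}
    for name, filename in TABLE_FILES.items():
        path = data_dir / filename
        # Fail early with a readable message if a file is missing.
        if not path.exists():
            raise FileNotFoundError(f"Expected {filename} in {data_dir}")
        tables[name] = pd.read_csv(path)
    return tables
```

With this in place, something like `tables = load_airbnb_tables("airbnb_nyc")` followed by `calendar = tables["calendar"]` would replace the three hard-coded calls, with the folder name adjusted to wherever the data lives.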

### Examining Calendar Dataframe

In [40]:
#previewing data in calendar dataframe
calendar.head()
calendar.info()
calendar.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999999 entries, 0 to 999998
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   listing_id      999999 non-null  int64  
 1   date            999999 non-null  object 
 2   available       999999 non-null  object 
 3   price           999999 non-null  object 
 4   adjusted_price  0 non-null       float64
 5   minimum_nights  999636 non-null  float64
 6   maximum_nights  999636 non-null  float64
dtypes: float64(3), int64(1), object(3)
memory usage: 53.4+ MB


| | listing_id | date | available | price | adjusted_price | minimum_nights | maximum_nights |
|---|---|---|---|---|---|---|---|
| count | 999999.0 | 999999 | 999999 | 999999 | 0.0 | 999636.0 | 999636.0 |
| unique | | 366 | 2 | 291 | | | |
| top | | 2025-05-02 | f | $150.00 | | | |
| freq | | 2741 | 708182 | 55845 | | | |
| mean | 2511877.0 | | | | | 33.959834 | 706.838358 |
| std | 1746287.0 | | | | | 48.877865 | 550.669339 |
| min | 2539.0 | | | | | 1.0 | 3.0 |
| 25% | 798008.0 | | | | | 30.0 | 95.0 |
| 50% | 2298731.0 | | | | | 30.0 | 1125.0 |
| 75% | 4121173.0 | | | | | 30.0 | 1125.0 |


In [41]:
#check for missing values

calendar.isnull().sum()

listing_id             0
date                   0
available              0
price                  0
adjusted_price    999999
minimum_nights       363
maximum_nights       363
dtype: int64

In [53]:
#note that 'minimum_nights' and 'maximum_nights' each have 363 missing values
#dropping the adjusted_price column since every entry is missing
if 'adjusted_price' in calendar.columns:
    calendar_cleaned = calendar.drop(columns=['adjusted_price'])
else:
    calendar_cleaned = calendar.copy()  # just copy if column not present
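
The `info()` output above shows `date`, `available`, and `price` stored as `object`, so they need converting before any numeric work. A minimal sketch, assuming prices look like `$150.00` (possibly with thousands separators) and `available` holds only `t`/`f`; `convert_calendar_types` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd


def convert_calendar_types(df):
    """Return a copy with date, available, and price in analysis-ready dtypes."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    # Map the 't'/'f' flags to booleans; any other value becomes missing.
    out["available"] = out["available"].map({"t": True, "f": False})
    # Strip '$' and ',' so strings such as '$1,250.00' parse as floats.
    out["price"] = pd.to_numeric(
        out["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    return out


# Toy rows in the same format as the calendar file (values are illustrative).
sample = pd.DataFrame(
    {
        "date": ["2025-05-02", "2025-05-03"],
        "available": ["f", "t"],
        "price": ["$150.00", "$1,250.00"],
    }
)
converted = convert_calendar_types(sample)
print(converted.dtypes)
```

On the real data this would be applied as `calendar_cleaned = convert_calendar_types(calendar_cleaned)`. Using `errors="coerce"` turns unparseable prices into `NaN` instead of raising, so it is worth checking `isnull().sum()` again afterwards.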


### Examining Reviews Dataframe

In [43]:
#previewing data in reviews dataframe
reviews.head()
reviews.info()
reviews.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971379 entries, 0 to 971378
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     971379 non-null  int64 
 1   id             971379 non-null  int64 
 2   date           971379 non-null  object
 3   reviewer_id    971379 non-null  int64 
 4   reviewer_name  971375 non-null  object
 5   comments       971127 non-null  object
dtypes: int64(3), object(3)
memory usage: 44.5+ MB


| | listing_id | id | date | reviewer_id | reviewer_name | comments |
|---|---|---|---|---|---|---|
| count | 971379.0 | 971379.0 | 971379 | 971379.0 | 971375 | 971127 |
| unique | | | 5431 | | 120946 | 933198 |
| top | | | 2023-09-04 | | David | . |
| freq | | | 1234 | | 6887 | 1181 |
| mean | 1.828369e+17 | 5.23937e+17 | | 165724500.0 | | |
| std | 3.55486e+17 | 4.967769e+17 | | 161662100.0 | | |
| min | 2539.0 | 3149.0 | | 29.0 | | |
| 25% | 10280550.0 | 416963600.0 | | 32277570.0 | | |
| 50% | 28635040.0 | 5.724516e+17 | | 109059000.0 | | |
| 75% | 52666690.0 | 9.471487e+17 | | 258154000.0 | | |


In [44]:
#convert date to datetime
reviews['date'] = pd.to_datetime(reviews['date'])

In [45]:
#counting fully duplicated rows, then removing them
print(reviews.duplicated().sum())
reviews.drop_duplicates(inplace=True)

In [46]:
reviews_cleaned = reviews.copy()
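
`drop_duplicates()` with no arguments only removes rows that are identical in every column. Since `id` appears to identify a review, a stricter check is to look for repeated `id` values. The `describe()` output also shows `.` as the most frequent comment, so it may be worth flagging comments with no real text. A minimal sketch under those assumptions; `flag_review_issues` and the two flag columns it adds are hypothetical:

```python
import pandas as pd


def flag_review_issues(df):
    """Return a copy with boolean flags for repeated ids and empty-looking comments."""
    out = df.copy()
    # True for every row after the first that reuses a review id.
    out["is_repeat_id"] = out["id"].duplicated(keep="first")
    # Missing comments, or comments with no word character at all (e.g. '.').
    comments = out["comments"].fillna("").astype(str)
    out["is_blank_comment"] = ~comments.str.contains(r"\w", regex=True)
    return out


# Toy rows for illustration only.
sample = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "comments": ["Great host", ".", ".", None],
    }
)
flagged = flag_review_issues(sample)
print(flagged[["is_repeat_id", "is_blank_comment"]].sum())
```

Flagging rather than dropping keeps the decision reversible: blank comments still count as reviews for volume metrics, even if they carry nothing for text analysis.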

### Examining Listings Dataframe

In [47]:
#previewing data in listings dataframe
listings.head()
listings.info()
listings.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37018 entries, 0 to 37017
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            37018 non-null  int64  
 1   listing_url                                   37018 non-null  object 
 2   scrape_id                                     37018 non-null  int64  
 3   last_scraped                                  37018 non-null  object 
 4   source                                        37018 non-null  object 
 5   name                                          37016 non-null  object 
 6   description                                   36010 non-null  object 
 7   neighborhood_overview                         19575 non-null  object 
 8   picture_url                                   37017 non-null  object 
 9   host_id                                       37018 non-null 

| | id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37018.0 | 37018 | 37018.0 | 37018 | 37018 | 37016 | 36010 | 19575 | 37017 | 37018.0 | ... | 25352.0 | 25341.0 | 25342.0 | 5396 | 37018 | 37018.0 | 37018.0 | 37018.0 | 37018.0 | 25358.0 |
| unique | | 37018 | | 3 | 2 | 35355 | 30614 | 14677 | 36238 | | ... | | | | 2006 | 2 | | | | | |
| top | | https://www.airbnb.com/rooms/2539 | | 2025-05-02 | city scrape | Water View King Bed Hotel Room | Keep it simple at this peaceful and centrally-... | This furnished apartment is located in the Fin... | https://a0.muscache.com/pictures/6998e77e-4564... | | ... | | | | Exempt | f | | | | | |
| freq | | 1 | | 27161 | 22056 | 30 | 93 | 101 | 35 | | ... | | | | 2844 | 29409 | | | | | |
| mean | 4.37292e+17 | | 20250500000000.0 | | | | | | | 171420500.0 | ... | 4.826887 | 4.744965 | 4.644073 | | | 72.739802 | 51.326436 | 19.406262 | 0.029094 | 0.819498 |
| std | 5.144149e+17 | | 15.28536 | | | | | | | 188865800.0 | ... | 0.408087 | 0.390465 | 0.486165 | | | 235.28625 | 221.236309 | 89.511742 | 0.406795 | 1.851575 |
| min | 2539.0 | | 20250500000000.0 | | | | | | | 1678.0 | ... | 0.0 | 0.0 | 0.0 | | | 1.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 21400000.0 | | 20250500000000.0 | | | | | | | 17682110.0 | ... | 4.82 | 4.66 | 4.54 | | | 1.0 | 0.0 | 0.0 | 0.0 | 0.08 |
| 50% | 50375050.0 | | 20250500000000.0 | | | | | | | 86553620.0 | ... | 4.96 | 4.85 | 4.765 | | | 2.0 | 1.0 | 0.0 | 0.0 | 0.26 |
| 75% | 9.261013e+17 | | 20250500000000.0 | | | | | | | 303664300.0 | ... | 5.0 | 5.0 | 4.95 | | | 8.0 | 2.0 | 2.0 | 0.0 | 0.94 |


In [48]:
#checking for missing values
listings.isnull().sum()

id                                                  0
listing_url                                         0
scrape_id                                           0
last_scraped                                        0
source                                              0
                                                ...  
calculated_host_listings_count                      0
calculated_host_listings_count_entire_homes         0
calculated_host_listings_count_private_rooms        0
calculated_host_listings_count_shared_rooms         0
reviews_per_month                               11660
Length: 79, dtype: int64

In [49]:
#finding columns where more than 50% of values are missing

missing = listings.isnull().sum()
missing_percent = (missing / len(listings)) * 100
missing_percent[missing_percent>50].sort_values(ascending=False)

calendar_updated    100.000000
license              85.423308
dtype: float64
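
The same missing-value check is repeated for each dataframe, so it can be wrapped in a small helper. A minimal sketch; `missing_summary` and its `min_pct` parameter are hypothetical names, and the toy frame below is illustrative rather than taken from the listings data:

```python
import pandas as pd


def missing_summary(df, min_pct=0.0):
    """Tabulate missing counts and percentages per column, highest first.

    Only columns whose missing percentage is strictly above min_pct are kept.
    """
    summary = pd.DataFrame(
        {
            "missing_count": df.isnull().sum(),
            # mean() of a boolean mask is the fraction of True values.
            "missing_pct": df.isnull().mean() * 100,
        }
    )
    summary = summary[summary["missing_pct"] > min_pct]
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame: 'a' is complete, 'b' is 75% missing, 'c' is entirely missing.
sample = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [None, None, None, 1.0],
        "c": [None, None, None, None],
    }
)
result = missing_summary(sample, min_pct=50)
print(result)
```

Calling `missing_summary(listings, min_pct=50)` should surface the same two columns as the cell above, with the raw counts alongside the percentages.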

In [50]:
# Step 1: Calculate missing values and percentage
missing = listings.isnull().sum()
missing_percent = (missing / len(listings)) * 100

# Step 2: Identify columns to drop (>50% missing)
cols_to_drop = missing_percent[missing_percent > 50].index

# Step 3: Drop those columns (drop() already returns a new DataFrame)
listings_cleaned = listings.drop(columns=cols_to_drop)

# Step 4 (optional): Confirm the shape
print(f"Original shape: {listings.shape}")
print(f"Cleaned shape: {listings_cleaned.shape}")


Original shape: (37018, 79)
Cleaned shape: (37018, 77)


In [56]:
reviews_cleaned.head()

| | listing_id | id | date | reviewer_id | reviewer_name | comments |
|---|---|---|---|---|---|---|
| 0 | 2539 | 55688172 | 2015-12-04 | 25160947 | Peter | Great host |
| 1 | 2539 | 97474898 | 2016-08-27 | 91513326 | Liz | Nice room for the price. Great neighborhood. J... |
| 2 | 2539 | 105340344 | 2016-10-01 | 90022459 | Евгений | Very nice apt. New remodeled. |
| 3 | 2539 | 133131670 | 2017-02-20 | 116165195 | George | Great place to stay for a while. John is a gre... |
| 4 | 2539 | 138349776 | 2017-03-19 | 118432644 | Carlos | . |
