In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv("Dataset/airbnb.csv")
df.head()

Unnamed: 0,id,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,...,longitude,name,neighbourhood,number_of_reviews,review_scores_rating,thumbnail_url,zipcode,bedrooms,beds,price
0,6242629,Apartment,Private room,"{TV,""Cable TV"",""Wireless Internet"",""Air condit...",2,2.0,Real Bed,flexible,False,Chicago,...,-87.686821,Beautiful apartment in tri-taylor w/private ba...,Little Italy/UIC,12,100,https://a0.muscache.com/im/pictures/7b8e236e-2...,60612,1,1,53.0
1,18688272,Apartment,Private room,"{Internet,""Wireless Internet"",""Air conditionin...",2,1.0,Real Bed,strict,True,NYC,...,-73.960776,Quintessential W'burg Apartment,Williamsburg,119,92,https://a0.muscache.com/im/pictures/33099113/4...,11211,1,1,103.0
2,15435501,Apartment,Private room,"{TV,Internet,""Wireless Internet"",""Air conditio...",2,1.0,Real Bed,moderate,True,LA,...,-118.320883,Private Room in Hollywood Oasis,Hollywood,47,94,https://a0.muscache.com/im/pictures/978947c3-3...,90028,1,1,58.0
3,15822779,House,Private room,"{""Wireless Internet"",Breakfast,""Pets live on t...",2,1.5,Real Bed,moderate,True,SF,...,-122.408987,Bright Bernal room with garden,Bernal Heights,49,94,https://a0.muscache.com/im/pictures/49742373/3...,94110,1,1,115.0
4,10752305,Apartment,Entire home/apt,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",5,1.0,Real Bed,strict,True,NYC,...,-73.986186,Modern/Spacious 1 Bedroom Next To Times Square,Hell's Kitchen,23,98,https://a0.muscache.com/im/pictures/e3b8a0a7-f...,10019,1,4,179.0


Distance to the nearest metro station is a candidate feature for predicting price. Hypothesis: the closer a listing is to a metro station, the more convenient its location and the higher the price it can charge.

Metro station data for each of the 6 cities was found online; the key fields are the latitude and longitude of each station.

In [5]:
cities = ["Boston", "Chicago", "DC", "LA", "NYC", "SF"]
metro_dfs = {}
for city in cities:
    metro_dfs[city] = pd.read_csv(f"Dataset/metro/{city}.csv")
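Because the later per-row lookup indexes the metro tables by city name, a quick sanity check on the loaded tables can catch mismatches early. The sketch below is only an illustration: check_metro_tables is a hypothetical helper, the toy tables are made up, and the required column names 'X' and 'Y' are taken from how the distance helper in this notebook reads the station coordinates.

```python
import pandas as pd

def check_metro_tables(metro_tables: dict, listing_cities) -> list:
    """Return a list of problems found; an empty list means every city is covered."""
    problems = []
    for city in sorted(set(listing_cities)):
        if city not in metro_tables:
            problems.append(f"{city}: no metro table loaded")
            continue
        # 'X' (longitude) and 'Y' (latitude) are the columns the distance helper reads.
        missing = {"X", "Y"} - set(metro_tables[city].columns)
        if missing:
            problems.append(f"{city}: missing column(s) {sorted(missing)}")
    return problems

# Toy tables (made-up values) standing in for the real CSV files.
toy_tables = {
    "NYC": pd.DataFrame({"X": [-73.99], "Y": [40.75]}),
    "SF": pd.DataFrame({"X": [-122.41]}),  # deliberately lacks 'Y'
}
problems = check_metro_tables(toy_tables, ["NYC", "SF", "LA"])
print(problems)
```

On the real data, the equivalent call would be check_metro_tables(metro_dfs, df["city"]).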

In [11]:
from haversine import haversine

A helper function computes the haversine (great-circle) distance from a given latitude and longitude to the nearest metro station. The haversine package returns kilometres by default.

In [16]:
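def nearest_metro(lat, lon, metro_df: pd.DataFrame):
    # Haversine distance (km by default) from (lat, lon) to every station;
    # 'Y' is read as the station latitude and 'X' as the station longitude.
    distances = metro_df.apply(lambda row: haversine((row['Y'], row['X']), (lat, lon)), axis=1)
    return distances.min()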
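The row-wise apply makes one Python call per station, which can get slow once it is repeated for every listing. As a possible alternative, the haversine formula can be written directly with NumPy so that all stations are handled in a single array operation. This is a sketch, not the notebook's method: nearest_metro_vectorized is a hypothetical name, the Earth radius constant is an assumed mean value, and the toy station table is made up.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometres

def nearest_metro_vectorized(lat, lon, metro_df: pd.DataFrame) -> float:
    """Distance in km from (lat, lon) to the closest station in metro_df."""
    # Station coordinates in radians; 'Y' = latitude, 'X' = longitude.
    st_lat = np.radians(metro_df["Y"].to_numpy(dtype=float))
    st_lon = np.radians(metro_df["X"].to_numpy(dtype=float))
    lat_r, lon_r = np.radians(lat), np.radians(lon)

    # Haversine formula, evaluated for all stations at once.
    a = (np.sin((st_lat - lat_r) / 2) ** 2
         + np.cos(lat_r) * np.cos(st_lat) * np.sin((st_lon - lon_r) / 2) ** 2)
    distances = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return float(distances.min())

# Toy stations on the equator (made-up values): one at 0°E, one at 10°E.
toy_stations = pd.DataFrame({"X": [0.0, 10.0], "Y": [0.0, 0.0]})
d_km = nearest_metro_vectorized(0.0, 1.0, toy_stations)
print(round(d_km, 3))  # one degree of longitude at the equator, in km
```

The function takes the same arguments as nearest_metro, so it could be swapped into the per-row call if the results agree on a sample of listings.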

This function is applied to every row of the dataset to create the "nearest_metro" feature: the distance in kilometres from each listing to the closest station in its own city.

In [17]:
df["nearest_metro"] = df.apply(lambda row: nearest_metro(row['latitude'], row['longitude'], metro_dfs[row['city']]), axis=1)
df["nearest_metro"]

0       0.757019
1       0.402480
2       0.466555
3       1.257526
4       0.110683
          ...   
4995    0.230362
4996    0.370254
4997    1.230959
4998    4.682683
4999    0.515101
Name: nearest_metro, Length: 5000, dtype: float64
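One way to start probing the hypothesis is a rank correlation between "nearest_metro" and "price" within each city, since price levels differ a lot between cities. The sketch below is an assumption about how such a check might look, not a result from this dataset: distance_price_rank_corr is a hypothetical helper, and the toy frame contains invented numbers only. It computes Spearman's rho as the Pearson correlation of the ranks.

```python
import pandas as pd

def distance_price_rank_corr(frame: pd.DataFrame) -> pd.Series:
    """Spearman rank correlation of nearest_metro vs. price, one value per city."""
    out = {}
    for city, group in frame.groupby("city"):
        # Pearson correlation of the ranks equals Spearman's rho.
        out[city] = group["nearest_metro"].rank().corr(group["price"].rank())
    return pd.Series(out, name="spearman_rho")

# Invented toy data: in "A" price falls with distance, in "B" it rises.
toy = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B", "B", "B"],
    "nearest_metro": [0.1, 0.5, 1.0, 2.0, 0.2, 0.4, 0.8],
    "price": [200.0, 150.0, 120.0, 80.0, 50.0, 60.0, 70.0],
})
rho = distance_price_rank_corr(toy)
print(rho)
```

A negative value for a city would be consistent with the hypothesis (closer to a station, higher price); on the real data this would be called as distance_price_rank_corr(df).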