## Plotting Airbnb prices in Boston
### with Lets-Plot


Lets-Plot (https://github.com/JetBrains/lets-plot) is an open-source plotting library for statistical data.

- `ggplot2`-like API
- GeoPandas Support (https://github.com/JetBrains/lets-plot/blob/master/docs/geopandas.md)
- Built-in interactive maps (https://github.com/JetBrains/lets-plot/blob/master/docs/interactive_maps.md)

### In this notebook

- Read Airbnb listings data and do simple data preparation.

- Create a simple "price" scatter plot on a map.
  - Read Boston neighbourhoods GeoJSON file.

- Create a choropleth of Boston showing average price per neighbourhood.

- Create an interactive and more detailed choropleth of Boston showing average price per precinct:
  - Read Boston precincts GeoJSON file.
  - Perform spatial join of Airbnb listings (coordinates) and Boston precincts (boundaries).
  - Add an interactive base-map layer to the plot.

- Create a similar interactive choropleth of Boston, but showing our proprietary "rating" of Boston precincts.

### Tools

- Lets-plot: all visualizations.
- Geopandas: operations with spatial datasets.
- Rtree: handling of spatial indices.
- Pandas: operations with regular datasets.

### Data

https://github.com/csboutique/csboutique.github.io/tree/master/paper_1/data

#### Airbnb listings - Boston

listings.csv contains the property price, geographic coordinates, name of neighbourhood, number of reviews per month and other information.

neighbourhoods.geojson contains Boston neighbourhoods boundaries. 


#### Boston Precincts

boston_precincts.geojson contains Boston precincts.

In [1]:
import pandas as pd
import geopandas as gpd
from lets_plot import *

LetsPlot.setup_html()
LetsPlot.set(maptiles_lets_plot(theme="dark"))
# LetsPlot.set(maptiles_lets_plot(theme="light"))
import lets_plot
lets_plot.__version__

'4.7.3'

### Airbnb listings

In [2]:
listings = pd.read_csv('boston/boston_listings.csv')
listings.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group               float64
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

#### Data Preparation

In [3]:
# Initial clean-up
listings = listings[[
    "neighbourhood", 
    "latitude", 
    "longitude", 
    "room_type", 
    "price", 
    "reviews_per_month",
    "availability_365"]].dropna()

In [4]:
listings["room_type"].unique()

array(['Entire home/apt', 'Private room', 'Shared room'], dtype=object)

Observations.
There are three listed types of properties:

- Shared room
- Private room
- Entire home/apt

Assumptions. 

The "room" type is better suited for our Exploratory Data Analysis (EDA) because "rooms" are characterized by fewer variables and are more "comparable" than the "Entire home/apt" property type.

Let us take a look at the price histogram summarizing the price distribution for all three property types.

# (Figure 1)

In [5]:
(ggplot() + 
 ggtitle("Airbnb price distribution grouped by the property type.") +
 geom_histogram(aes(x="price", fill="room_type", alpha="room_type"), 
                data=listings,
                tooltips=layer_tooltips().anchor("middle_center").min_width(25),
                position="identity", bins=100) + 
 geom_vline(xintercept=320, color="red", linetype="dashed", size=1) + 
 xlim(0, 600) +
 scale_alpha_manual([0.3, 1, 0.3], guide="none") +
 scale_fill_discrete(name="Property type") +
 ggsize(720, 350) + 
 theme(legend_position='bottom', axis_line_y='blank'))

Observations.

- There are very few "Shared room" listings.
- "Private room" price distribution looks good.
- A "Private room" priced above $320 can be considered an outlier.
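Observations like these can also be checked numerically. The sketch below uses a small made-up sample (not the Boston data) and the conventional 1.5 x IQR rule as one possible way to pick a price cutoff. The $320 threshold in this notebook was read off the histogram, so the rule and all values here are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy sample; the notebook itself works on the `listings` DataFrame.
sample = pd.DataFrame({
    "room_type": ["Private room"] * 6 + ["Entire home/apt"] * 3 + ["Shared room"],
    "price": [50, 60, 70, 80, 90, 400, 150, 200, 250, 30],
})

# Number of listings per property type.
counts = sample["room_type"].value_counts()
print(counts)

# Upper fence of the 1.5 * IQR rule for "Private room" prices.
private = sample.loc[sample["room_type"] == "Private room", "price"]
q1, q3 = private.quantile(0.25), private.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Listings priced above the fence are outlier candidates.
outliers = private[private > upper_fence]
print(upper_fence, outliers.tolist())
```

A rule-based fence like this is only a starting point; comparing it with the histogram is still worthwhile before dropping any rows.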

Data preparation.

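- Remove all "Shared room" and "Entire home/apt" listings.
- Remove outliers (price of $320 or more) and listings priced at $5 or less.
- Keep only listings that are available more than 30 days a year.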

In [6]:
listings = listings[(listings["room_type"]=="Private room")]
listings = listings[(listings["price"]>5)]
listings = listings[(listings["price"]<320)]
# Availability at least 1 month a year
listings = listings[(listings["availability_365"]>30)]

In [7]:
listings.head(3)

Unnamed: 0,neighbourhood,latitude,longitude,room_type,price,reviews_per_month,availability_365
3,Roslindale,42.292438,-71.135765,Private room,65,0.65,60
24,South End,42.344957,-71.074857,Private room,148,3.22,89
31,Dorchester,42.298349,-71.059104,Private room,55,2.22,365


### The "price" scatter plot on a map of Boston.

#### Read Boston neighbourhoods GeoJSON file.

Use the Geopandas `read_file()` function to create a GeoDataframe with Boston neighbourhood geometries (boundaries).

In [8]:
# Neighbourhoods GeoJSON
neighbourhoods = gpd.read_file('boston/boston_neighbourhoods.geojson')
neighbourhoods.head(3)

Unnamed: 0,neighbourhood,neighbourhood_group,geometry
0,Roslindale,,"MULTIPOLYGON (((-71.12593 42.27200, -71.12575 ..."
1,Jamaica Plain,,"MULTIPOLYGON (((-71.10499 42.32609, -71.10488 ..."
2,Mission Hill,,"MULTIPOLYGON (((-71.09043 42.33576, -71.09275 ..."


#### Predefine some plot settings.

Predefine color palettes, plot size and theme once to use these settings later on all plots. 

We will use the Color Brewer palette named "YlOrBr".
For more, see the [Color Brewer](https://colorbrewer2.org/#type=sequential&scheme=YlOrBr&n=9) palettes online.

In [9]:
# Color palettes.
fill_YlOrBr = scale_fill_brewer(palette="YlOrBr", na_value="rgba(0,0,0,0)")
color_YlOrBr = scale_color_brewer(palette="YlOrBr")

# fill_YlOrBr = scale_fill_brewer(palette="Oranges", na_value="rgba(0,0,0,0)")
# color_YlOrBr = scale_color_brewer(palette="Oranges")

# fill_YlOrBr = scale_fill_brewer(palette="Greens", na_value="rgba(0,0,0,0)")
# color_YlOrBr = scale_color_brewer(palette="Greens")

In [10]:
# Plot size and theme.
map_settings = (ggsize(720, 520) +
      coord_fixed() +
      theme(axis_line='blank', axis_text='blank',
            axis_ticks='blank', axis_title='blank',
            axis_tooltip='blank',
            legend_position='bottom'))

#### Airbnb listings scatter plot.

Create a plot with two layers:
1. `geom_polygon`: background map layer. 

   The `geom_polygon()` function "understands" GeoDataframe format and all we need is to pass the "neighbourhoods" GeoDataframe to the `map` parameter of `geom_polygon()`.

2. `geom_point`: scatter layer.
  
   The `geom_point()` function "understands" GeoDataframe format as well. But in this case, "listings" is a regular Dataframe containing the "longitude" and "latitude" columns. So, in this example we will use `geom_point()` as a usual `ggplot2` layer and specify the 'x' and 'y' aesthetics as "longitude" and "latitude".


# (Figure 2)

In [27]:
# Price
(ggplot() +
 geom_polygon(map=neighbourhoods, fill="light-gray", color="gray", alpha=.2) +
 geom_point(aes("longitude", "latitude", color="price"), 
            data=listings,            
            size=5) +
 map_settings + color_YlOrBr)

### Choropleth of Boston showing average price per neighbourhood.

#### Compute average listing price per 'neighbourhood'.

In [12]:
# Only the numeric "price" column remains after grouping, so a plain mean() is enough.
neighbourhood_price = listings[[
    'neighbourhood', 
    'price']].groupby('neighbourhood').mean().reset_index()
    
neighbourhood_price.head(3)

Unnamed: 0,neighbourhood,price
0,Allston,76.955224
1,Back Bay,115.407407
2,Bay Village,99.0


We will create the choropleth using a single `geom_polygon()` layer.

We have the "neighbourhood_price" dataframe containing 2 columns: "neighbourhood" and "price".

The "neighbourhoods" GeoDataframe also contains the "neighbourhood" column, as well as "geometry" (neighbourhood boundaries).

To create a choropleth (i.e. to fill each neighbourhood polygon with a color reflecting its average listing price) we need to "join" these two data structures into a single dataset containing both the "price" and the "geometry".

The `map_join` parameter is provided for exactly this purpose.

We will join by the "neighbourhood" column: 

`map_join="neighbourhood"`

And pass the "neighbourhood_price" dataframe via the `data` parameter and the "neighbourhoods" GeoDataframe via the `map` parameter.
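Conceptually, `map_join` pairs the rows of `data` with the rows of `map` that share a key value. The sketch below mimics that pairing with a plain pandas merge on toy data. It illustrates the idea only and is not a description of how Lets-Plot performs the join internally; the values are hypothetical and the placeholder strings stand in for real geometries.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the two structures being joined.
prices = pd.DataFrame({
    "neighbourhood": ["Allston", "Back Bay"],
    "price": [77.0, 115.4],
})
boundaries = pd.DataFrame({
    "neighbourhood": ["Allston", "Back Bay", "Bay Village"],
    "geometry": ["<polygon A>", "<polygon B>", "<polygon C>"],
})

# Keep every boundary; in this merge, areas without a price get NaN.
joined = boundaries.merge(prices, on="neighbourhood", how="left")
print(joined)
```

The key column must hold matching values on both sides, which is why the same "neighbourhood" spelling is needed in the listings and in the GeoJSON file.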


# (Figure 3)

In [13]:
(ggplot() +
 geom_polygon(aes(fill="price"), 
            data=neighbourhood_price, 
            map=neighbourhoods,
            map_join=['neighbourhood', 'neighbourhood'],
            color="gray",
            tooltips=layer_tooltips()
                .line("@neighbourhood")
                .line("Avg. price ($)|^fill")) +
 map_settings + fill_YlOrBr)

### Interactive and more detailed choropleth of Boston showing average price per precinct.

Neighbourhoods are too large to be very helpful.

Precincts are smaller and give a more detailed picture.

[Analyze Boston](https://data.boston.gov/) is a treasure trove of spatial data about Boston.

Boston Precincts in various formats, including GeoJSON, can be downloaded from this page: 
https://data.boston.gov/dataset/precincts1

#### Read Boston precincts GeoJSON file.

We are going to use two columns in this dataset: 
- "OBJECTID" - for the purpose of the spatial join (see later)
- "geometry" - precincts boundaries.

In [14]:
# Precincts GeoJSON
precincts = gpd.read_file('boston/boston_precincts.geojson')

# Drop all unnecessary columns: 
# keep only the object id and geometry columns.
precincts = precincts[["OBJECTID", "geometry"]]
precincts.head(3)

Unnamed: 0,OBJECTID,geometry
0,1,"POLYGON ((-71.01305 42.38532, -71.01328 42.385..."
1,2,"POLYGON ((-71.03131 42.37025, -71.03005 42.369..."
2,3,"POLYGON ((-71.02716 42.37030, -71.02786 42.369..."


#### The first glance at Boston precincts.

In [15]:
(ggplot() +
 geom_polygon(aes(fill="OBJECTID"), 
                data=precincts, 
                color="white", show_legend=False) +
 map_settings)

#### Spatial join of Airbnb listings (coordinates) and Boston precincts (boundaries).

The problem.

Our "listings" dataframe does not contain any reference to the precinct in which a property is located.

Therefore we have to determine this ourselves, based exclusively on the listing coordinates and the precinct polygons.

The solution.

Fortunately, the GeoPandas `sjoin()` function can perform such a spatial join operation on two GeoDataframes.

See Geopandas docs: https://geopandas.org/docs/user_guide/mergingdata.html

But, to be able to use this function, we have to convert the "listings" Dataframe into a GeoDataframe.

We do it by creating the "geometry" column using the "longitude" and "latitude" columns:

```
listings_gdf = gpd.GeoDataFrame(
    listings,
    geometry=gpd.points_from_xy(listings["longitude"], listings["latitude"]),
    crs="EPSG:4326")
```

In [16]:
# Convert 'listings' Dataframe to GeoDataframe
listings_gdf = gpd.GeoDataFrame(
    listings,
    geometry=gpd.points_from_xy(listings["longitude"], listings["latitude"]),
    crs="EPSG:4326")
listings_gdf = listings_gdf.drop(columns=["longitude", "latitude"])
listings_gdf = listings_gdf.reset_index(drop=True)
listings_gdf.head(3)

Unnamed: 0,neighbourhood,room_type,price,reviews_per_month,availability_365,geometry
0,Roslindale,Private room,65,0.65,60,POINT (-71.13577 42.29244)
1,South End,Private room,148,3.22,89,POINT (-71.07486 42.34496)
2,Dorchester,Private room,55,2.22,365,POINT (-71.05910 42.29835)


Now we are ready to spatially join these two GeoDataframes: 

```
listings_with_precincts = gpd.sjoin(listings_gdf, precincts, 
                                    how="left", predicate='within')
```

As a result of this step, the "OBJECTID" column is added to the listings (in the new "listings_with_precincts" GeoDataframe). We will use this column later as a `key` connecting prices and precinct boundaries.
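To make the `within` predicate more concrete, the sketch below assigns toy points to toy polygons with a hand-written ray-casting test. This is a swapped-in illustration in plain Python, not what GeoPandas does internally (it relies on its own geometry engine and a spatial index); the function names and coordinates are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside `polygon`.

    `polygon` is a list of (x, y) vertices; points exactly on an edge
    are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def left_join_within(points, polygons):
    """For each point, return the id of the containing polygon, or None."""
    result = []
    for x, y in points:
        match = None
        for poly_id, vertices in polygons.items():
            if point_in_polygon(x, y, vertices):
                match = poly_id
                break
        result.append(match)
    return result


# Two hypothetical square "precincts" and three "listings".
toy_precincts = {
    1: [(0, 0), (1, 0), (1, 1), (0, 1)],
    2: [(1, 0), (2, 0), (2, 1), (1, 1)],
}
toy_listings = [(0.5, 0.5), (1.5, 0.2), (3.0, 3.0)]

print(left_join_within(toy_listings, toy_precincts))
```

The third point lies outside both squares and gets no match, which mirrors how a left join keeps a listing even when no precinct contains it.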

In [17]:
# Join precincts with listings spatially.
listings_with_precincts = gpd.sjoin(listings_gdf, precincts, 
                                    how="left", predicate='within')
listings_with_precincts.head(3)

Unnamed: 0,neighbourhood,room_type,price,reviews_per_month,availability_365,geometry,index_right,OBJECTID
0,Roslindale,Private room,65,0.65,60,POINT (-71.13577 42.29244),181.0,182.0
1,South End,Private room,148,3.22,89,POINT (-71.07486 42.34496),11.0,12.0
2,Dorchester,Private room,55,2.22,365,POINT (-71.05910 42.29835),120.0,121.0


#### Compute average listing price per precinct.

The "precinct_price" dataset will contain columns: "OBJECTID" (or precinct key), "price" (avg. listing price per precinct) and "neighbourhood".
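One detail worth keeping in mind: because the spatial join uses `how="left"`, a listing that falls outside every precinct keeps a missing "OBJECTID", and pandas `groupby` skips missing keys by default (`dropna=True`). The toy data below is hypothetical and only illustrates that behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical joined listings; the last row matched no precinct.
toy = pd.DataFrame({
    "neighbourhood": ["Allston", "Allston", "Allston", "Brighton"],
    "OBJECTID": [83.0, 83.0, 84.0, np.nan],
    "price": [60, 80, 100, 90],
})

# Mean price per (neighbourhood, precinct); the row with a NaN key is dropped.
toy_precinct_price = (
    toy.groupby(["neighbourhood", "OBJECTID"])["price"]
    .mean()
    .reset_index()
)
print(toy_precinct_price)
```

Such unmatched listings therefore do not contribute to any precinct average.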

In [18]:
# Compute mean price by precinct.
precinct_price = listings_with_precincts[[
    'neighbourhood',
    'OBJECTID', 
    'price']].groupby(['neighbourhood', "OBJECTID"]).mean().reset_index()
precinct_price.head(3)

Unnamed: 0,neighbourhood,OBJECTID,price
0,Allston,83.0,70.0
1,Allston,84.0,116.285714
2,Allston,85.0,69.714286


#### Interactive choropleth map of prices.

Create a plot using two layers:

1. `geom_livemap()`: adds an interactive base-map layer which you can zoom and pan.

2. `geom_polygon()`: choropleth map layer.
   
   We already know how to create a choropleth using Lets-Plot's `geom_polygon()`.

   This time we will again use the `data`, `map` and `map_join` parameters:
   ```
            data=precinct_price, 
            map=precincts,
            map_join="OBJECTID",
   ```

# (Figure 4)

In [25]:
(ggplot() +
 geom_livemap(zoom=12, location = [-71.085, 42.334]) +
 geom_polygon(aes(fill="price"), 
            data=precinct_price, 
            map=precincts,
            map_join=["OBJECTID", "OBJECTID"],
            alpha=0.7,  
            tooltips=layer_tooltips()
                .format("price", "d")
                .line("@neighbourhood")
                .line("Price ($)|^fill")
  ) +
 map_settings + fill_YlOrBr)

### Extras (fun) 
### Interactive choropleth of Boston showing "rating" of Boston precincts.

Formula.

Our proprietary "rating" formula is based on the following assumptions:

- The more expensive location, the lower the rating.
- The more popular location, the higher the rating.  

We don't have a "popularity" figure in the "listings" dataset, so we are going to use the "Reviews per month" value as a proxy for the "popularity".


![](https://latex.codecogs.com/svg.latex?\Large&space;R=Re-Pr) 



Before computing the "rating" value we will normalize the reviews-per-month values and the prices (simply scale them to the range [0..1]).

We will normalize the resulting "rating" value as well and then show it on the map formatted as a percentage, where 100% indicates the highest possible "rating".
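As a worked example of these steps, the sketch below applies min-max scaling, `(x - min) / (max - min)`, to three made-up listings and then computes `R = Re - Pr`. The numbers are hypothetical, and plain Python lists stand in for the pandas columns.

```python
def min_max(values):
    """Scale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical listings: price in dollars and reviews per month.
toy_prices = [50, 100, 150]
toy_reviews = [4.0, 1.0, 2.0]

norm_price = min_max(toy_prices)      # cheapest -> 0, most expensive -> 1
norm_reviews = min_max(toy_reviews)   # least reviewed -> 0, most reviewed -> 1

# Rating: popular and cheap listings score high.
raw_rating = [rev - pr for rev, pr in zip(norm_reviews, norm_price)]
toy_rating = min_max(raw_rating)

for p, r, score in zip(toy_prices, toy_reviews, toy_rating):
    print(f"price={p:>3}  reviews={r}  rating={score:.0%}")
```

Because the final step rescales the result again, the rating is relative: the best listing in the sample always gets 100% and the worst gets 0%.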

In [20]:
def norm(series):
    """
    Simple normalization. 
    """
    return (series - series.min()) / (series.max() - series.min())

In [21]:
normalized_price = norm(listings_with_precincts['price'])
normalized_reviews = norm(listings_with_precincts['reviews_per_month'])

# Use a simple formula: reviews - price 
# to compute the "rating".
rating = norm(normalized_reviews - normalized_price)

In [22]:
(ggplot() + geom_histogram(aes(rating)) + 
 ggtitle('Distribution of Airbnb listings "rating" in Boston'))

#### Compute both: average listing price and average "rating" per precinct.

In [23]:
# Add the "rating" column to the "listings_with_precincts" GeoDataframe. 
listings_with_precincts["rating"] = rating

# Compute mean rating and price by precinct.
precinct_rating_price = listings_with_precincts[[
    'neighbourhood',
    'OBJECTID', 
    'rating', 
    'price']].groupby(['neighbourhood', "OBJECTID"]).mean().reset_index()
precinct_rating_price.head(3)

Unnamed: 0,neighbourhood,OBJECTID,rating,price
0,Allston,83.0,0.456099,70.0
1,Allston,84.0,0.400209,116.285714
2,Allston,85.0,0.474077,69.714286


#### Interactive choropleth map of prices and "ratings".

# (Figure 5)

In [26]:
(ggplot() +
 geom_livemap(zoom=12, location = [-71.085, 42.334]) +
 geom_polygon(aes(fill="rating"), 
            data=precinct_rating_price, 
            map=precincts,
            map_join=["OBJECTID", "OBJECTID"],
            alpha=0.7,  
            # show_legend=False,  
            tooltips=layer_tooltips()
                .format("rating", ".0%")
                .format("price", "d")
                .line("@neighbourhood")
                .line("Rating|@rating")
                .line("Price ($)|@price")
  ) +
 map_settings + scale_fill_brewer(palette="Greens", na_value="rgba(0,0,0,0)"))

### Conclusion

We have illustrated how Lets-Plot (and its `ggplot2`-like API) makes it easy to create interactive geospatial charts in Python.