Description: the implementation of "The Giraffe System"

Version: 1.2.0.20210721

Group name: YYDS

Authors: Haodong Liu and Jichen Zhao

Airbnb has become a popular platform among holidaymakers and tourists for lodging and rental houses. A host can manage his/her listings, and a guest can select one that fits his/her unique and personalised travel plans. A public Airbnb dataset is used here for the visualisation tasks. It contains summary information and metrics for some listings in New York City (NYC), New York, USA in 2019. The data table is stored in the CSV file `Airbnb_NYC_2019.csv`, which was downloaded from [the corresponding dataset info page on Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data).

Two information visualisation "systems" have been implemented: "The Giraffe System" (hereinafter called **Giraffe**) and "The Zebra System" (hereinafter called **Zebra**). This is because the visualisation tasks are defined in the same context but with different content. For example, both Giraffe and Zebra explore a task to consume information by analysing the data, but the specifications differ. In either case, we expect both of them to provide general insights into the NYC listings for 2019, since the visualisation tasks should help visualise and understand the primary data features and correlations.

# The Giraffe System

## Importing Modules

**NOTE:** Please ensure that no exception is thrown in this section before executing the other sections.

In [1]:
import json
import altair as alt
import pandas as pd

## Preparing Data

In [2]:
data = pd.read_csv('Airbnb_NYC_2019.csv')  # Load raw data from the data file.
print('The number of listings:', len(data))

The number of listings: 48895


The dataset contains too many records for Altair's default `MaxRows` check, which rejects datasets with more than 5000 rows, and we do not want to bypass the check. Hence, we randomly select 5000 listings as the items to investigate for demonstration purposes. Missing values are then examined for further data processing.

In [3]:
data = data.sample(n = 5000, random_state = 0)
data.isnull().sum()  # List the number of null values for each column.

id                                   0
name                                 2
host_id                              0
host_name                            5
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       1036
reviews_per_month                 1036
calculated_host_listings_count       0
availability_365                     0
dtype: int64
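
As a minimal sketch of why `random_state` matters in the sampling step above: fixing the seed should return the same rows on every run, so the null counts stay comparable between executions. The `toy` table and its values are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy table standing in for the real listings.
toy = pd.DataFrame({
    'id': range(1, 11),
    'reviews_per_month': [0.5, None, 1.2, None, 3.0, 0.1, None, 2.2, 0.9, 1.1],
})

# Two samples drawn with the same seed select the same rows.
sample_a = toy.sample(n = 4, random_state = 0)
sample_b = toy.sample(n = 4, random_state = 0)
same_rows = sample_a.index.equals(sample_b.index)

# Count the null values for each column, as in the cell above.
null_counts = toy.isnull().sum()
```

Without a fixed seed, each run would pick different listings, and the figures quoted in the following sections would no longer be reproducible.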

Each item (i.e., a listing) originally has 16 attributes as follows. We would keep relevant attributes for visualisation tasks.

| Attribute | Description | Kept |
| -- | -- | :--: |
| `id` | The listing ID | √ |
| `name` | The listing name | |
| `host_id` | The host ID | √ |
| `host_name` | The host name | |
| `neighbourhood_group` | One of the 5 boroughs in NYC | √ |
| `neighbourhood` | One of the neighbourhoods in NYC | √ |
| `latitude` | The latitude coordinate | √ |
| `longitude` | The longitude coordinate | √ |
| `room_type` | One of the room types defined by Airbnb | √ |
| `price` | The price in US dollars for a night stay | √ |
| `minimum_nights` | The minimum number of nights that a guest can book | |
| `number_of_reviews` | The number of reviews | |
| `last_review` | The date of the latest review | |
| `reviews_per_month` | The number of reviews per month | √ |
| `calculated_host_listings_count` | The number of different listings for a particular host | √ |
| `availability_365` | The number of days for which a particular listing is available in a year | |

**NOTE:**

1. The attributes `name` and `host_name` would be removed. We already have unique IDs for listings and hosts, and we are not interested in their names. Dropping them also avoids any potential ethical issue.
2. The attributes `neighbourhood_group`, `neighbourhood`, and `room_type` are categorical. This attribute type could be vital for information visualisation.
3. The attributes `minimum_nights` and `availability_365` would be removed. These attributes could depend heavily on the host preferences, and we are not interested in such future data.
4. The attribute `number_of_reviews` would be removed. The listings could be added at different times, so we reckon that the attribute `reviews_per_month` would be more meaningful. It contains missing values because a particular listing could have no reviews. In this case, we could simply fill these values with 0.
5. The attribute `last_review` would be removed. We would focus on the generic trend, distribution, etc. This attribute could contribute little to visualisation tasks, since its value could be null and we do not have another clear date for comparison.
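
The reasoning in note 4 could be checked before the columns are dropped. The sketch below assumes that a null `reviews_per_month` should occur exactly where `number_of_reviews` is 0. The `toy` rows are hypothetical, and in the notebook the same comparison would be run on `data`.

```python
import pandas as pd

# Hypothetical rows; in the notebook, `data` would be used before dropping columns.
toy = pd.DataFrame({
    'number_of_reviews': [0, 12, 0, 3],
    'reviews_per_month': [None, 1.5, None, 0.2],
})

# True when a null `reviews_per_month` occurs exactly where there are no reviews.
nulls_match_zero_reviews = bool(
    (toy['reviews_per_month'].isnull() == (toy['number_of_reviews'] == 0)).all()
)

# If the check holds, filling with 0 is a faithful reading of "no reviews yet".
filled = toy.fillna({'reviews_per_month': 0})
```

If the check failed on the real data, filling with 0 would hide a different kind of missingness and would deserve a second look.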

In [4]:
data.drop(
    ['name', 'host_name', 'minimum_nights', 'number_of_reviews', 'last_review', 'availability_365'],
    axis = 1,
    inplace = True)
data.fillna({'reviews_per_month': 0}, inplace = True)

In [5]:
data

Unnamed: 0,id,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,reviews_per_month,calculated_host_listings_count
43813,33893655,138798990,Manhattan,Tribeca,40.72430,-74.01110,Entire home/apt,225,0.00,1
32734,25798461,195803,Manhattan,NoHo,40.72555,-73.99283,Entire home/apt,649,0.40,1
25276,20213045,2678122,Brooklyn,Williamsburg,40.71687,-73.95012,Entire home/apt,300,0.35,3
36084,28670432,115993835,Brooklyn,Sunset Park,40.64036,-74.00822,Private room,26,1.36,5
17736,13920697,29513490,Brooklyn,Bedford-Stuyvesant,40.68370,-73.93325,Entire home/apt,125,0.12,1
...,...,...,...,...,...,...,...,...,...,...
28768,22226156,14103679,Manhattan,Murray Hill,40.74623,-73.97404,Private room,199,4.04,1
30348,23463017,9241743,Manhattan,Harlem,40.80899,-73.94225,Private room,80,0.06,1
3812,2297192,11732741,Manhattan,Financial District,40.70825,-74.00495,Entire home/apt,1000,0.00,1
38366,30212188,224414117,Manhattan,Hell's Kitchen,40.75523,-73.99827,Private room,107,3.52,30


## Visualising Data

Let us first define some common variables.

In [6]:
borough_label = 'Borough'
host_label = 'Host'
init_max_price = 500
listing_count_label = 'The number of listings'
max_price = data['price'].max()
min_price = data['price'].min()
neighbourhood_label = 'Neighbourhood'
per_cent_label = 'Per cent'
price_label = 'Price'
reviews_label = 'The number of reviews per month'
room_type_label = 'Room type'
select_all_label = 'All'

price_selection = alt.selection_single(
    bind = alt.binding_range(
        max = max_price,
        min = min_price,
        name = 'Max price: ',
        step = 1
    ),
    fields = ['max_price_filter'],
    init = {'max_price_filter': init_max_price},
    name = 'price_selection'
)  # A price filter.

Before visualisation, it is necessary to understand [the interactive nature](https://altair-viz.github.io/user_guide/interactions.html) of charts created using Altair. In plain English, the following features are worth taking advantage of.

- If tooltips are enabled, the corresponding value of a chart element is displayed when the mouse hovers over the element.
- If legend selection is enabled, the values of a specific category are highlighted when the corresponding legend entry is clicked. The values of multiple categories can also be highlighted.
  > A multi selection is similar to a single selection, but it allows for multiple chart objects to be selected at once. By default, chart elements can be added to and removed from the selection by clicking on them while holding the `Shift` key.
- If scale binding is enabled, the chart can be panned and zoomed.
- If filters are provided, the specific input elements can be used for filtering.

The 7 visualisation tasks are defined as follows. Giraffe shares almost the same sections as Zebra from the start up to this point because they are necessary preparations. However, the following sections could vary considerably from those of Zebra since we perform the same visualisation tasks using different design decisions.

| Task | Action | Specification |
| -- | -- | -- |
| #1 | Analyse and consume | Discover the number of listings by borough and room type to find the borough with the most listings and the most entire homes/apartments. |
| #2 | Analyse and produce | Derive the per cent of each room type by borough to compare between the 2 categories. |
| #3 | Search | Look up the number of Manhattan's neighbourhoods in the top 10 neighbourhoods by the number of listings. |
| #4 | Search | Browse the host ranking by the number of reviews per month and the number of listings to find the host ranking first in each case. |
| #5 | Search | Locate the most popular price range for each borough/room type. |
| #6 | Search | Explore any noticeable pattern in the price distribution by room type. |
| #7 | Query | Identify, compare, and summarise the correlations among prices, locations, the number of listings, boroughs, and room types. |

### Bar Charts: Stacked or Grouped?

People might be interested in questions like "who has the most...?" when it comes to comparisons. Bar charts would be a good choice. However, if there are multiple categories for grouping, we should consider whether the charts need to be stacked or grouped.

**NOTE:**

1. Tasks associated: **#1**, **#2**. The per cent is calculated based on the existing data.
2. The design decision for Giraffe here is to use stacked bar charts to visualise the number of listings by borough and the per cent of room types by borough.
3. Tooltips are enabled for each bar.
4. Legend selection is enabled.

In [7]:
# Plot the stacked bar charts.
legend_selection = alt.selection_multi(bind = 'legend', fields = ['room_type'])

base = alt.Chart(data).mark_bar().encode(
    x = alt.X(
        'neighbourhood_group:N',
        axis = alt.Axis(title = borough_label),
        sort = '-y'
    ),
    color = alt.Color('room_type:N', legend = alt.Legend(title = room_type_label)),
    opacity = alt.condition(legend_selection, alt.value(1), alt.value(0.2))
).properties(width = alt.Step(40)).add_selection(legend_selection)

subplot_left = base.encode(
    y = alt.Y('count():Q', axis = alt.Axis(title = listing_count_label)),
    tooltip = [alt.Tooltip('room_type:N', title = room_type_label), alt.Tooltip('count(room_type):Q', title = listing_count_label)]
).properties(title = listing_count_label + ' by ' + borough_label.lower())  # Visualise the number of listings by borough.

subplot_right = base.transform_aggregate(count = 'count():Q', groupby = ['neighbourhood_group', 'room_type']).transform_joinaggregate(
    total = 'sum(count):Q', groupby = ['neighbourhood_group']
).transform_calculate(
    per_cent = alt.datum.count / alt.datum.total
).encode(
    y = alt.Y(
        'count:Q',
        axis = alt.Axis(format = '%', title = per_cent_label + ' of ' + room_type_label.lower()),
        stack = 'normalize'
    ),
    tooltip = [alt.Tooltip('room_type:N', title = room_type_label), alt.Tooltip('per_cent:Q', format = '.2%', title = per_cent_label)]  # The tooltip shows a per cent, so it is titled accordingly.
).properties(title = per_cent_label + ' of ' + room_type_label.lower() + 's by ' + borough_label.lower())  # Visualise the per cent of room types by borough.

subplot_left | subplot_right
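
The per cent derived by the aggregate, join-aggregate, and calculate transforms above can be cross-checked outside the chart. The following is a minimal pandas sketch of the same three steps, using a hypothetical `toy` table rather than the sampled listings.

```python
import pandas as pd

# Hypothetical listings: 3 in Manhattan and 2 in Brooklyn.
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn'],
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room', 'Private room', 'Shared room'],
})

# Step 1 (aggregate): count listings per borough and room type.
counts = (
    toy.groupby(['neighbourhood_group', 'room_type'])
    .size()
    .reset_index(name = 'count')
)

# Step 2 (join-aggregate): attach the borough total to every row.
counts['total'] = counts.groupby('neighbourhood_group')['count'].transform('sum')

# Step 3 (calculate): derive the per cent of each room type within its borough.
counts['per_cent'] = counts['count'] / counts['total']
```

Within each borough the `per_cent` values sum to 1, which is what the normalised stack in the right-hand chart relies on.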

### Bar Charts: Horizontal or Vertical?

Sometimes bar charts are used for visualising a ranking. We would not say that one orientation surpasses the other, but which one might be more suitable for a specific scenario?

**NOTE:**

1. Task associated: **#3**.
2. The design decision for Giraffe here is to use a horizontal bar chart to visualise the top 10 neighbourhoods by the number of listings.
3. Tooltips are enabled for each bar.
4. Legend selection is enabled.

In [8]:
print('The number of unique neighbourhoods:', data['neighbourhood'].nunique())

The number of unique neighbourhoods: 184


In [9]:
neighbourhood_top10 = data['neighbourhood'].value_counts().head(10).index  # Get the top 10 neighbourhoods by the number of listings.
data_neighbourhood_top10 = data.loc[data['neighbourhood'].isin(neighbourhood_top10)]  # Keep data of the top 10 neighbourhoods.

# Plot the horizontal bar chart.
legend_selection = alt.selection_multi(bind = 'legend', fields = ['neighbourhood_group'])
alt.Chart(
    data_neighbourhood_top10,
    title = 'Top 10 ' + neighbourhood_label.lower() + 's by ' + listing_count_label.lower()
).mark_bar().encode(
    x = alt.X('count():Q', axis = alt.Axis(title = listing_count_label)),
    y = alt.Y(
        'neighbourhood:N',
        axis = alt.Axis(title = neighbourhood_label),
        sort = '-x'
    ),
    color = alt.Color('neighbourhood_group:N', legend = alt.Legend(title = borough_label)),
    opacity = alt.condition(legend_selection, alt.value(1), alt.value(0.2)),
    tooltip = 'count():Q'
).properties(height = alt.Step(40)).add_selection(legend_selection)
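
Task #3 asks for a single number, which can also be looked up directly as a sanity check on what the chart shows. This is a minimal sketch on a hypothetical `toy` table with `top_n = 2`. In the notebook, `data` and a value of 10 would take their place.

```python
import pandas as pd

# Hypothetical listings across 4 neighbourhoods.
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 5 + ['Brooklyn'] * 4 + ['Queens'],
    'neighbourhood': ['Harlem'] * 3 + ['SoHo'] * 2 + ['Williamsburg'] * 4 + ['Astoria'],
})

top_n = 2  # The notebook uses 10.
top = toy['neighbourhood'].value_counts().head(top_n).index

# Map each top neighbourhood to its borough, one row per neighbourhood.
borough_of_top = toy.loc[
    toy['neighbourhood'].isin(top), ['neighbourhood', 'neighbourhood_group']
].drop_duplicates()

manhattan_in_top = int((borough_of_top['neighbourhood_group'] == 'Manhattan').sum())
```

In the chart, the same answer is read by counting the bars in Manhattan's colour.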

### Bar Charts: Ordered or Unordered?

A ranking bar chart usually consists of a categorical attribute and a quantitative attribute by which the bars could be ordered. Is it always good practice to visualise the data in a specific order?

**NOTE:**

1. Task associated: **#4**.
2. The design decision for Giraffe here is to use ordered bar charts to visualise the top 10 hosts by the number of reviews per month and the number of listings.
3. Tooltips are enabled for each bar.

In [10]:
print('The number of unique hosts:', data['host_id'].nunique())

The number of unique hosts: 4630


In [11]:
# Get the top 10 hosts by the number of reviews per month.
hosts = data['host_id'].unique()
n_reviews = []

for host in hosts:
    data_reviews = data.loc[data['host_id'] == host]
    n_reviews.append(data_reviews['reviews_per_month'].sum())

data_host_reviews_top10 = pd.DataFrame({'host_id': hosts, 'reviews_per_month': n_reviews})
data_host_reviews_top10 = data_host_reviews_top10.nlargest(10, 'reviews_per_month')

# Get the top 10 hosts by the number of listings.
host_top10 = data['host_id'].value_counts().head(10).index

# Keep the specific columns of the data of the top 10 hosts by the number of listings.
data_host_listings_top10 = data.loc[data['host_id'].isin(host_top10)]
data_host_listings_top10 = data_host_listings_top10[['host_id', 'calculated_host_listings_count']].drop_duplicates()
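
The per-host loop above sums `reviews_per_month` one host at a time. As a hedged alternative, the same totals could be computed with a single `groupby`, which tends to be faster when there are thousands of hosts. The `toy` table is hypothetical.

```python
import pandas as pd

# Hypothetical listings belonging to 3 hosts.
toy = pd.DataFrame({
    'host_id': [1, 1, 2, 3, 3, 3],
    'reviews_per_month': [0.5, 1.0, 2.0, 0.1, 0.2, 0.3],
})

# Sum the reviews per month for each host, then keep the largest totals.
top_hosts = (
    toy.groupby('host_id', as_index = False)['reviews_per_month']
    .sum()
    .nlargest(2, 'reviews_per_month')
)
```

The result has the same two columns as `data_host_reviews_top10`, so it could feed the chart below without further changes. Tie-breaking between hosts with equal totals may differ from the loop version.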

In [12]:
# Plot the ordered bar charts.
base = alt.Chart().mark_bar().encode(
    y = alt.Y(
        'host_id:N',
        axis = alt.Axis(title = host_label + ' ID'),
        sort = '-x'
    )
).properties(height = alt.Step(40))

subplot_left = base.encode(
    x = alt.X('reviews_per_month:Q', axis = alt.Axis(title = reviews_label)),
    tooltip = 'reviews_per_month:Q'
).properties(
    data = data_host_reviews_top10,
    title = 'Top 10 ' + host_label.lower() + 's by ' + reviews_label.lower()
)  # Visualise the top 10 hosts by the number of reviews per month.

subplot_right = base.encode(
    x = alt.X('calculated_host_listings_count:Q', axis = alt.Axis(title = listing_count_label)),
    tooltip = 'calculated_host_listings_count:Q'
).properties(
    data = data_host_listings_top10,
    title = 'Top 10 ' + host_label.lower() + 's by ' + listing_count_label.lower()
)  # Visualise the top 10 hosts by the number of listings.

subplot_left | subplot_right

### Relationship: Histograms or Line Charts?

Line charts might be preferred when we try to visualise a relationship or trend, but histograms could be versatile too. Why not just try and compare them?

**NOTE:**

1. Task associated: **#5**.
2. The design decision for Giraffe here is to use histograms to visualise the relationship between the number of listings and prices, by room type and borough.
3. Legend selection is enabled.
4. Scale binding is enabled.
5. A price filter is provided.

In [13]:
# Plot step-interpolated area charts so that they read as histograms.
legend_selection_left = alt.selection_multi(bind = 'legend', fields = ['room_type'])
legend_selection_right = alt.selection_multi(bind = 'legend', fields = ['neighbourhood_group'])

base = alt.Chart(data).transform_filter(alt.datum['price'] <= price_selection['max_price_filter']).encode(
    x = alt.X(
        'price:Q',
        axis = alt.Axis(title = price_label + ' (binned)'),
        bin = alt.Bin(step = 20)
    ),
    y = alt.Y(
        'count():Q',
        axis = alt.Axis(title = listing_count_label),
        stack = None)
).add_selection(price_selection, alt.selection_interval(bind = 'scales'))

subplot_left = base.mark_area(interpolate = 'step').encode(
    color = alt.Color(
        'room_type:N',
        legend = alt.Legend(symbolStrokeWidth = 5, title = room_type_label)
    ),
    opacity = alt.condition(legend_selection_left, alt.value(0.6), alt.value(0.2))
).properties(title = 'By ' + room_type_label.lower()).add_selection(legend_selection_left)  # Visualise the relationship between the number of listings and prices, by room type.

subplot_right = base.mark_area(interpolate = 'step').encode(
    color = alt.Color(
        'neighbourhood_group:N',
        legend = alt.Legend(symbolStrokeWidth = 5, title = borough_label),
        scale = alt.Scale(scheme = 'dark2')
    ),
    opacity = alt.condition(legend_selection_right, alt.value(0.6), alt.value(0.2))
).properties(title = 'By ' + borough_label.lower()).add_selection(legend_selection_right)  # Visualise the relationship between the number of listings and prices, by borough.

(subplot_left | subplot_right).properties(
    title = 'The relationship between ' + listing_count_label.lower() + ' and ' + price_label.lower() + 's'
).resolve_scale(color = 'independent').configure_title(anchor = 'middle')
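
To answer task #5 numerically, the binning and the price filter can be reproduced in pandas. The sketch below assumes that bins of width 20 start at multiples of 20 and that the slider is at 500. The `toy` prices are hypothetical.

```python
import pandas as pd

# Hypothetical prices for 2 room types.
toy = pd.DataFrame({
    'room_type': ['Entire home/apt'] * 4 + ['Private room'] * 5,
    'price': [120, 125, 130, 165, 45, 50, 55, 70, 600],
})

max_price = 500  # Assumed slider position.
step = 20        # Same bin width as in the charts above.

# Apply the price filter, then label each listing with the start of its bin.
binned = toy[toy['price'] <= max_price].copy()
binned['bin_start'] = (binned['price'] // step) * step

# Count listings per room type and bin, then keep the fullest bin per room type.
counts = binned.groupby(['room_type', 'bin_start']).size().reset_index(name = 'count')
most_popular = (
    counts.sort_values('count', ascending = False)
    .drop_duplicates('room_type')
    .set_index('room_type')['bin_start']
)
```

Each value in `most_popular` is the lower edge of the tallest step for that room type. Grouping by `neighbourhood_group` instead would give the per-borough answer.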

### Distribution: Box Plots or Violin Plots?

Both plots could provide insights into the distribution of a quantitative attribute, and violin plots could also show the density. That does not mean that violin plots are better. In the context of distribution, which one would be preferred?

**NOTE:**

1. Task associated: **#6**.
2. The design decision for Giraffe here is to use a box plot to visualise the primary distribution of prices by room type.
3. Tooltips are enabled for each box.

In [14]:
# Have some preliminary findings about the extreme values and the distribution.
room_types = data['room_type'].unique()
prices_room_type_stats = pd.DataFrame()

for prices_room_type in [data.loc[data['room_type'] == room_type] for room_type in room_types]:
    prices_room_type_stats = pd.concat([prices_room_type_stats, prices_room_type.describe()['price']], axis = 1)

prices_room_type_stats.columns = room_types
prices_room_type_stats

Unnamed: 0,Entire home/apt,Private room,Shared room
count,2595.0,2290.0,115.0
mean,212.536416,87.529258,71.817391
std,324.402129,93.504706,85.790502
min,10.0,0.0,20.0
25%,120.0,50.0,30.0
50%,160.0,70.0,47.0
75%,229.5,95.0,79.0
max,10000.0,2000.0,725.0
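
The summary table above is built by looping over the room types. As a hedged alternative, a `groupby` followed by `describe` should give the same layout in one expression. The `toy` prices are hypothetical.

```python
import pandas as pd

# Hypothetical prices for 2 room types.
toy = pd.DataFrame({
    'room_type': ['Private room', 'Private room', 'Shared room', 'Shared room', 'Shared room'],
    'price': [50, 70, 20, 30, 40],
})

# One column per room type, one row per summary statistic.
stats = toy.groupby('room_type')['price'].describe().T
```

The columns come out in alphabetical order of room type rather than in order of first appearance, which is the only visible difference from the loop version.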


In [15]:
# Plot the box plot.
alt.Chart(
    data,
    title = 'Primary distribution of ' + price_label.lower() + 's by ' + room_type_label.lower()
).transform_filter(
    (alt.datum['price'] >= min_price) & (alt.datum['price'] <= init_max_price)
).mark_boxplot().encode(
    y = alt.Y('price:Q', axis = alt.Axis(title = price_label)),
    column = alt.Column(
        'room_type:N',
        header = alt.Header(
            labelOrient = 'bottom',
            labelPadding = 0,
            title = room_type_label,
            titleOrient = 'bottom'
        ),
        spacing = 0
    )
).properties(width = 100).configure_title(anchor = 'middle').configure_view(stroke = None)
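
To make the box plot easier to read, the quantities it encodes can be computed by hand. This is a minimal sketch assuming the common 1.5 × IQR rule for the whiskers. The `prices` series is hypothetical, and pandas may interpolate quartiles slightly differently from the charting library.

```python
import pandas as pd

# Hypothetical prices with one extreme value.
prices = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 1000])

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed whisker rule: reach the furthest points within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the fences are drawn individually as outliers.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)].tolist()
```

A single extreme price barely moves the box but would stretch the axis, which is why the chart above restricts prices to the range up to `init_max_price`.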

### Heatmap: Hue or Saturation?

It is incredibly convenient to generate a heatmap based on geo-location for this dataset thanks to the `latitude` and `longitude` attributes. Selecting a suitable colour scheme would be vital for successful visualisation. We reckon that it is better to use saturation of the same hue. However, we would like to pretend to forget it and perform the specific visualisation task. XD
> You live, and you learn.

**NOTE:**

1. Task associated: **#7**. Some other charts are also created in addition to the heatmap to complete the visualisation task.
2. The design decision for Giraffe here is to use hue to visualise the price distribution by location.
3. Tooltips are enabled for almost all chart elements.
4. Scale binding is enabled for the sub-chart illustrating the number of listings by price.
5. A price filter and a borough filter are provided.

In [16]:
# Plot the map part using hue.
nyc_geojson = open('NYC.geojson')
boroughs = data['neighbourhood_group'].unique().tolist()

borough_selection = alt.selection_single(
    bind = alt.binding_select(
        labels = [select_all_label] + boroughs,
        name = 'Borough: ',
        options = [None] + boroughs),
    fields = ['properties.\\boro_name'], 
    init = {'properties.\\boro_name': select_all_label},
    name = 'borough_selection'
)

subplot_left_base = alt.Chart(alt.Data(values = json.load(nyc_geojson)['features'])).mark_geoshape(stroke = 'white').encode(
    color = alt.condition(borough_selection, alt.value('#e4e4e4'), alt.value('#f4f4f4')),
    tooltip = 'properties.boro_name:N'
).properties(height = 500, width = 600).add_selection(borough_selection)  # Plot the map.

nyc_geojson.close()

base = alt.Chart(data).transform_filter(
    r"datum['price'] <= price_selection['max_price_filter'] & (borough_selection['properties\\.boro_name'] == null | datum['neighbourhood_group'] == borough_selection['properties\\.boro_name'])"
).add_selection(price_selection)

loc_listing = base.mark_circle(size = 15).encode(
    latitude = 'latitude:Q',
    longitude = 'longitude:Q',
    color = alt.Color(
        'price:Q',
        legend = alt.Legend(orient = 'none', title = price_label),
        scale = alt.Scale(scheme = 'rainbow')
    ),
    tooltip = [alt.Tooltip('room_type:N', title = room_type_label), alt.Tooltip('price:Q', title = price_label)]
).properties(title = price_label + ' distribution by location')  # Plot the price points to generate a heatmap.

subplot_top_right = base.mark_bar().encode(
    x = alt.X('count():Q', axis = alt.Axis(title = listing_count_label)),
    y = alt.Y('room_type:N', axis = alt.Axis(title = room_type_label), sort = '-x'),
    tooltip = 'count():Q'
).properties(
    height = alt.Step(40),
    title = room_type_label + 's by ' + listing_count_label.lower()
).add_selection(borough_selection)  # Visualise the room types by the number of listings under current filters.

subplot_bottom_right = base.mark_bar().encode(
    x = alt.X(
        'price:Q',
        axis = alt.Axis(title = price_label + ' (binned)'),
        bin = alt.Bin(step = 20)),
    y = alt.Y('count():Q', axis = alt.Axis(title = listing_count_label)),
    tooltip = 'count():Q'
).properties(
    title = listing_count_label + ' by ' + price_label.lower()
).add_selection(borough_selection, alt.selection_interval(bind = 'scales'))  # Visualise the number of listings by price under current filters.

((subplot_left_base + loc_listing) | (subplot_top_right & subplot_bottom_right)).resolve_scale(color = 'independent').configure_view(strokeWidth = 0)
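
The filter expression shared by the three sub-charts above combines a price cap with an optional borough match. Its intended logic can be sketched in pandas as follows. The helper `apply_filters` and the `toy` table are hypothetical and only illustrate the behaviour. They are not part of Giraffe.

```python
import pandas as pd

def apply_filters(df, max_price, borough = None):
    """Keep rows priced at or below max_price, optionally within one borough."""
    mask = df['price'] <= max_price

    # No borough selected ("All") means that every borough passes.
    if borough is not None:
        mask = mask & (df['neighbourhood_group'] == borough)

    return df[mask]

# Hypothetical listings.
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn', 'Queens'],
    'price': [150, 900, 80, 60],
})

all_boroughs = apply_filters(toy, max_price = 500)
manhattan_only = apply_filters(toy, max_price = 500, borough = 'Manhattan')
```

Because all sub-charts share the same filter, the room type ranking and the price histogram always describe exactly the points drawn on the map.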