In [None]:
!pip install -r https://raw.githubusercontent.com/EluciDATALab/elucidatalab.starterkits/main/notebooks/SK_3_2_Data_Exploration/requirement.txt

In [None]:
from starterkits.starterkit_3_2.support import *
from starterkits.starterkit_3_2.visualizations import *

from pathlib import Path
DATA_PATH = Path('../../data/')

import logging
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

# Starter Kit 3.2: Data Exploration

<div id="description"></div>

## Description

Before starting to develop complex algorithms to solve a data-driven problem, you need to gain a thorough understanding of the corresponding data. **Data exploration** is the process through which you will gain such an understanding and will eventually be able to derive viable working hypotheses related to the problem at hand, useful for the further processing of the data.

Throughout this process, different types of variables (e.g. numerical, categorical, etc.) will require different types of treatment. In addition, by looking at how different variables are related to each other, you will understand that it is probably not enough to consider each of them individually, but that you will have to take into account the, often complex, interactions between them. Furthermore, you will identify which are the relevant variables in your dataset and which ones, rather than improving the quality of your analysis, will worsen its outcome (e.g. because they will introduce noise or bias to your algorithm). Finally, data exploration will help point out which data quality issues your dataset may have and which actions you should take to mitigate them.

<div id="business-goal"></div>

## Business Goal

The goal behind this Starter Kit is to lay out a **series of analyses** that will teach you how to **explore individual variables**, look at **how pairs of variables are related** and study the **complex interaction between groups of variables**. You will learn how to conduct a quantitative and visual inspection of your dataset and you will be prepared for the next step in your advanced analytics routine.

<div id="application-context"></div>

## Application Context

Data exploration is useful to
- identify usual and unusual values for a variable
- study the evolution of a variable over time
- uncover significant relationships between different variables
- assess data quality and suitability
- ...

<div id="data-requirements"></div>

## Data Requirements

The type of data that we target in this Starter Kit has the following characteristics:
- The data consists of multiple variables
- These variables are of different types, such as:
  - Numerical, i.e. values or observations that can be measured and ordered, such as age
  - Categorical, i.e. values or observations that can be sorted into groups or categories, such as gender
  - Time-based, i.e. a series of data points indexed in time order; most commonly, a time series is a sequence taken at successive equally spaced points in time
- Data is not expected to be perfect, since some values might be missing

<div id="starter-kit-outline"></div>

## Starter Kit Outline

In this Starter Kit we follow a systematic approach to explore a dataset, based on the number of variables under consideration. We start by exploring individual variables, such as the marks obtained by the students in an exam; this is the simplest form of analysis and is called univariate analysis. We continue by studying pairs of variables, such as the marks obtained by students in combination with their gender; this analysis is a little more complex than univariate analysis and is called bivariate analysis. Finally, we study the relationship between several variables, such as the marks obtained by students, their gender, their age, their address, their number of siblings, etc.; this is called multivariate analysis.

<div id="pronto-dataset"></div>

## Pronto Dataset

This exploration will be showcased on the **Pronto dataset**. It refers to the bike trips made in Seattle using the city's bike sharing system (Pronto). The dataset can be downloaded from https://www.kaggle.com/pronto/cycle-share-dataset and contains two types of data: trips and stations. In this notebook, we will mainly explore the trip data and to a lesser extent the station data.

Some of the exploratory analyses presented here were inspired by both an analysis Sirris applied to a private dataset and [the analysis presented by VanderPlas](https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/).

The dataset contains details about:
- The trips
- The stations

In the trip data, the following information is reported for each trip:
- `trip_id`: numeric ID of the bike trip
- `starttime`: day and time when the trip started, in Pacific Standard Time (PST)
- `stoptime`: day and time when the trip ended, in PST
- `bikeid`: ID attached to the bike used for the trip
- `tripduration`: duration of the bike trip, in seconds
- `from_station_name`: name of the station where the trip started
- `to_station_name`: name of the station where the trip ended
- `from_station_id`: ID of the station where the trip started
- `to_station_id`: ID of the station where the trip ended
- `usertype`: type of ticket used by the biker; two values are possible:
  - `Short-Term Pass Holder` for riders who purchased a 24-Hour or a 3-Day Pass
  - `Annual Member` for riders who purchased an Annual Membership
- `gender`: gender of the rider (ternary: male, female or other)
- `birthyear`: birth year of the bike rider

An excerpt of the data is shown in the table below.

In [None]:
df_trips = get_trips(DATA_PATH)
df_trips.head(3)

Station data contains the list of stations where the bikes are hosted. For each station, the following information is reported:
- `name`: name of the street intersection where the bike station is located
- `terminal`: code that uniquely identifies the station
- `lat`: latitude where the station is located
- `lon`: longitude where the station is located
- `dockcount`: number of docks available in the station
- `online`: date when the station was placed in service
- `elevation`: elevation of the station above sea level

An excerpt of the data is shown in the table below.

In [None]:
df_stations = get_stations(DATA_PATH)
df_stations.head(3)

To familiarise ourselves with this data, we start by plotting the location of the Pronto stations on a map of Seattle.

In [None]:
Map().add_markers(df_stations, marker=folium.Circle)

From the map above we discover that there seem to be two groups of stations. One is located in the University district (north) and the other is located downtown and in its surroundings (e.g., Capitol Hill).

In this Starter Kit we will not perform any further exploration of the station data. However, we will use the location and elevation of the different stations in the exploration of the trip data. For this reason, we merge the elevation and the latitude and longitude coordinates of the different stations with the trip data. An excerpt of the merged data that we will use in the rest of this Starter Kit is shown in the table below.

In [None]:
df_trips = get_merged_data_with_elevation(DATA_PATH)
df_trips.head(3)

<div id="data-exploration"></div>

## Data exploration

A typical data exploration process goes by the following stages:
- Exploration of the data availability, i.e. the missing data
- Data pre-processing, e.g. to infer the missing data, clean some fields
- Exploration of univariate fields, i.e. exploration of the different fields independently
- Exploration of bivariate fields, i.e. exploration of two fields together and their possible relationship
- Exploration of multivariate fields, i.e. exploration of multiple fields together

Note that these steps can be performed iteratively, e.g., the results of a multivariate analysis can lead to a hypothesis that requires more data pre-processing or a univariate exploration.


<div id="data-availability"></div>

### Data Availability

We start by analysing whether the dataset contains missing values in order to verify its quality.

In [None]:
df_trips.info()

We can see that there are 142,832 entries in total. Most variables seem to be complete, except `gender` and `birthyear`, which only contain 87,348 entries. The reason is that `Short-Term Pass Holders` do not need to specify their age and gender, so this information is simply missing. This is confirmed by the percentage of data present for each of these two variables, computed per `usertype` category:

In [None]:
df_trips.groupby('usertype')[['gender', 'birthyear']].apply(lambda x: x.count() / len(x) * 100)

<div id="data-preprocessing"></div>

### Data Preprocessing

We perform some data transformations to facilitate the subsequent exploration. First, we impute the missing values for `gender` with "Unknown". In this way, we keep them separate from the value "Other".

In [None]:
df_trips['gender'] = df_trips['gender'].fillna('Unknown')

In [None]:
df_trips['gender'].unique()

In addition, we
- convert `tripduration` (in seconds, stored as a float) to a `datetime.timedelta` object and store it as new variable `tripdurationTimeDelta`;
- add a new variable `tripdurationMinutes` indicating the duration of a trip in minutes (stored as a float);
- convert `birthyear` to an integer and create a new variable `age` (stored as an integer);
- extract from the `starttime` variable the following new variables: `month` (as a textual abbreviation), `day` (as a textual abbreviation of the day of the week), and `hour` (as an integer);
- use the `starttime` variable to index the data.

The table below presents an excerpt of the dataset after these transformations.

In [None]:
df_trips = preprocess_trips_dataset(df_trips)
df_trips.head(3)
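
The helper `preprocess_trips_dataset` belongs to the Starter Kit support code and its implementation is not shown here. Purely as an illustration, the transformations listed above could be written with pandas along the following lines. The function name `preprocess_sketch`, the toy data, the use of nullable integers and the choice to compute `age` relative to the year in which the trip started are all assumptions made for this sketch, not the Starter Kit's actual code.

```python
import pandas as pd

# Toy stand-in for the trip data; column names follow the dataset description.
toy = pd.DataFrame({
    "starttime": pd.to_datetime(["2015-07-06 08:15:00", "2015-07-11 14:40:00"]),
    "tripduration": [600.0, 1500.0],   # seconds
    "birthyear": [1985.0, None],       # missing for the second trip
})


def preprocess_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation of the transformations listed above."""
    out = df.copy()
    out["tripdurationTimeDelta"] = pd.to_timedelta(out["tripduration"], unit="s")
    out["tripdurationMinutes"] = out["tripduration"] / 60
    # Nullable integers keep missing birth years as <NA> instead of failing.
    out["birthyear"] = out["birthyear"].astype("Int64")
    # Assumption: age is measured in the year in which the trip started.
    out["age"] = out["starttime"].dt.year - out["birthyear"]
    # Three-letter English abbreviations, independent of the system locale.
    out["month"] = out["starttime"].dt.month_name().str[:3]
    out["day"] = out["starttime"].dt.day_name().str[:3]
    out["hour"] = out["starttime"].dt.hour
    return out.set_index("starttime")


toy_processed = preprocess_sketch(toy)
```

On the toy data, the second trip has no birth year, so its `age` stays missing rather than being coerced to a number.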

<div id="univariate-analysis"></div>

### Univariate Analysis

Univariate analysis is the simplest form of analysis and considers a single attribute at a time. In this section, we will consider several of the variables individually, in order to illustrate the different ways of analysing single variables.

The dataset contains nearly 150,000 trips performed during the time span of 1 year.

In [None]:
get_data_stats(df=df_trips)

<div id="trip-duration"></div>

#### Trip Duration

Trip duration directly affects bike availability and the type of service that can be provided by Pronto. For this reason we start by focusing on this field. As a first analysis, we visually check whether there are outliers in trip *duration*, i.e., we plot the distribution of trip durations to spot trips that are much shorter or longer than the rest. Please note that the y-axis is reported in logarithmic scale to improve the readability of the graph.

In [None]:
ax = plot_trip_duration_values(df=df_trips)

We can observe that bike usage can vary from a few minutes to 8 hours. The latter is probably a limitation due to the service hours of the Pronto service, i.e., the maximum amount of time that a bike can be rented during one single day. By just looking at the distribution we cannot notice any anomaly.

Specific types of plots can be used to highlight the distribution/shape of the data, namely boxplots and violin plots.

Boxplots are a standard way of showing the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3) and maximum. Boxplots can also help us identify outliers and their values.

Violin plots show the shape of a data set by means of an estimate of its Probability Density Function (PDF), i.e. a density plot. The width of the density plot describes how frequently values occur in the data set: wider regions indicate values that occur more often and narrower regions indicate values that occur less often.

In [None]:
plot_trips_duration_statistics(df=df_trips)

Since trip duration in the dataset is mostly short, we observe a compact boxplot where long trip durations are considered as extreme values. Similarly, the wide region on the left side of the violin plot shows that bikes are mostly used for short trips.
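
To make the notion of "extreme values" more concrete, the sketch below computes the five-number summary on a handful of made-up durations and flags outliers with the common 1.5 * IQR rule (Tukey's convention). Whether `plot_trips_duration_statistics` applies exactly this rule is an assumption; the numbers are invented for illustration and are not taken from the Pronto data.

```python
import numpy as np

# Made-up trip durations in minutes: mostly short trips plus one very long one.
durations = np.array([4.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 15.0, 480.0])

q1, median, q3 = np.percentile(durations, [25, 50, 75])
iqr = q3 - q1

# Tukey's convention: points farther than 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]

five_number_summary = (durations.min(), q1, median, q3, durations.max())
```

Here the box spans 7 to 12 minutes, so the single 480-minute trip lies far beyond the upper fence of 19.5 minutes and is the only value flagged.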

<div id="user-age"></div>

#### User Age

User age should influence bike usage (e.g., due to health and working status). Consequently, it is relevant to explore the typical age of users. As in the previous analysis, we analyse the age distribution to spot outliers and surface potentially interesting insights.

In [None]:
print(f'The youngest subscribed user is {df_trips.age.min():.0f} years old.')
print(f'The oldest subscribed user is {df_trips.age.max():.0f} years old.')

Again, let's start by plotting the age distribution.

In [None]:
plot_user_age(df=df_trips)

From this distribution we can see that the age of the majority of users is centered around 30 years. More interestingly, we can notice that the distribution of the ages is not smooth: there is an unexpected peak at 28 years.

To inspect this anomaly, we proceed with an analysis of the birth years.

In [None]:
plot_user_birth_year(df=df_trips)

From the graph above we discover that the anomaly may depend on the users' declared birth year. The histogram indicates that bike rental seems to be quite common for users born in 1985 and 1987, but not for those born in 1986. It is unlikely that these peaks come from a 'baby boom' in the Seattle area in these specific years. Instead, they may be caused by a limited number of users born in those years who used the service intensively during the year (e.g., a sports club with frequent commuters, a recurring city trip organized by some youth association, etc.). Unfortunately, the data does not include that type of information. Still, the interested reader can further inspect this anomaly by, for example, analysing whether there are bikes that are frequently moving back and forth from the same locations in the same period of time for users that were born in 1985 and 1987.

<div id="gender"></div>

#### Gender

Besides age, it might also be interesting to see whether gender has an influence on bike usage. Looking at this variable, it seems that men make almost 4 times as many trips as women.

In [None]:
plot_trip_count_per_gender(df=df_trips)

Here too, we should note that we do not know the identity of the rider of each trip. Consequently, we cannot exclude that this bar plot is biased by a limited number of men who used the service more frequently. We can check during the trip duration vs. gender bivariate analysis whether men performed shorter trips more often.

<div id="stations"></div>

#### Stations

The Pronto rental service relies on bikes available at stations. For this reason, it is interesting to explore how bikers used these stations. The bar plots below provide the top 10 most and least popular stations of arrival.

In [None]:
plot_station_ranks_by_popularity(df=df_trips, n_stations=10)

As we can see, there is a big difference in terms of trips toward these stations. This can lead to unbalanced stations, where extra bikes need to be continuously moved toward stations with missing bikes. This aspect will be further inspected later in this Starter Kit.

<div id="bivariate-analysis"></div>

### Bivariate Analysis

Bivariate analysis explores the relations between 2 variables. In the following section, we will analyse two variables of the dataset at a time, exploring different approaches for performing bivariate analyses.

<div id="number-of-trips-vs-hour-in-the-day"></div>

#### Number of Trips vs Hour in the Day

A first hypothesis that can be made is that there are more trips during rush hours for the weekdays and during the afternoon for the weekend. Let's try to validate this hypothesis by looking at the number of trips per hour, distinguishing between weekdays and weekends.

In [None]:
plot_trips_per_hour(df=df_trips)

From the plots above it becomes clear that there are different patterns between weekdays and weekends: 
- during weekdays there are two peaks in the number of trips, occurring during the morning and evening rush hours
- on weekends the distribution is much more uniform throughout the day (or at least throughout the period comprised between the late morning and the early evening)
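
The counting behind such a plot can be sketched with pandas as follows. The toy timestamps are invented, and the actual implementation of `plot_trips_per_hour` may differ; the sketch only shows one plausible way to split trips into weekdays and weekends and count them per hour.

```python
import numpy as np
import pandas as pd

# Invented start times: four weekday trips and two weekend trips.
toy = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2015-07-06 08:05", "2015-07-06 08:40", "2015-07-06 17:20",  # Monday
        "2015-07-07 08:10",                                            # Tuesday
        "2015-07-11 13:00", "2015-07-12 15:30",                        # Sat, Sun
    ])
})

toy["hour"] = toy["starttime"].dt.hour
# dayofweek: Monday=0, ..., Sunday=6, so 5 and 6 are the weekend.
toy["day_type"] = np.where(toy["starttime"].dt.dayofweek >= 5, "weekend", "weekday")

# Rows: hour of the day; columns: weekday / weekend; values: number of trips.
trips_per_hour = toy.groupby(["hour", "day_type"]).size().unstack(fill_value=0)
```

Plotting each column of `trips_per_hour` as a separate bar chart would give one panel for weekdays and one for weekends.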

<div id="number-of-trips-during-rush-hours-vs-stations-of-arrival"></div>

#### Number of Trips During Rush Hours vs Stations of Arrival

We can further inspect the rush hours during weekdays to understand whether the places of gathering share some similarities.
As a first step, we divide trips in two groups:
- morning rush hours: trips done on weekdays between 7:00 and 10:00
- evening rush hours: trips done on weekdays between 16:00 and 19:00

Then, we retain for both groups the top 10 stations of arrival. 
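
A minimal sketch of this selection is given below, on invented data with single-letter station names. It assumes half-open time windows (7:00 inclusive to 10:00 exclusive, and likewise for the evening), which the text does not specify.

```python
import pandas as pd

toy = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2015-07-06 07:30", "2015-07-06 08:45", "2015-07-07 09:10",  # weekday mornings
        "2015-07-07 17:00", "2015-07-08 18:30", "2015-07-08 16:05",  # weekday evenings
        "2015-07-11 08:00",                                            # Saturday morning
        "2015-07-06 12:00",                                            # weekday midday
    ]),
    "to_station_name": ["A", "A", "B", "C", "C", "D", "A", "B"],
})

is_weekday = toy["starttime"].dt.dayofweek < 5
hour = toy["starttime"].dt.hour

# Assumption: windows are half-open, e.g. 7:00 <= start < 10:00.
morning = toy[is_weekday & (hour >= 7) & (hour < 10)]
evening = toy[is_weekday & (hour >= 16) & (hour < 19)]

# value_counts sorts by frequency, so head(10) keeps the ten most visited stations.
top_morning = morning["to_station_name"].value_counts().head(10)
top_evening = evening["to_station_name"].value_counts().head(10)
```

The Saturday and midday trips fall outside both groups and therefore do not contribute to either ranking.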

The resulting data is shown in the bar plot below. The y-axis describes how many times a given station was reached. A dashed line at y=1400 is drawn to ease the comparison between the bar plots.

In [None]:
plot_station_of_arrivals_in_rush_hours(df=df_trips)

Comparing the bar plots above, we discover that more stations of arrival exceed 1400 visits (see the dashed line as a reference) in the morning rush hours than in the evening rush hours. This may indicate that bikers tend to converge in the morning rush hours toward a limited number of stations (e.g., close to workplaces), whereas in the evening rush hours bikers go in different directions (e.g., to different residential areas). This hypothesis can also explain why stations that are quite popular in the morning rush hours (e.g., Republican St & Westlake Ave N, 9th Ave N & Mercer St) do not appear in the top ten stations of arrival in the evening rush hours.


To further compare the difference between stations of arrival in the morning and evening hours, we plot them on the map of Seattle. As markers we use:
- green circles for stations that were reached in the morning rush hours
- red circles for stations that were reached in the evening rush hours

The size of each circle is proportional to the number of times that the station was reached.

In [None]:
plot_difference_between_morning_evening_arrival(df=df_trips)

From the map above we can see that:
- In the area of South Lake Union (just below the Lake Union) there are green stations but no red stations. This may mean that this location is more commonly reached by working bikers.
- Green stations seem to be more 'condensed' in downtown than red stations. This may be due to the fact that red stations are reached after working hours. Part of the bikers may decide from 16:00 to 19:00 to go back home (in more residential areas).
- None of the top 10 stations is located in the University district.

<div id="trip-duration-vs-age"></div>

#### Trip Duration vs Age

We are interested in verifying whether trip duration is influenced by the age of a biker. For this purpose, we visualise via a scatterplot the joint distribution of these two variables and compute their Pearson correlation.

In [None]:
plot_trip_duration_age_joint_distribution(df_trips)

As we can see, there is a broad age dispersion over a relatively narrow distribution of trip durations. Longer-lasting trips occur for many ages. Not surprisingly, the Pearson correlation is close to zero, i.e., age does not seem to play an important role in the duration of trips.
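
For reference, the Pearson correlation is the covariance of the two variables divided by the product of their standard deviations. The sketch below spells this out with NumPy on invented ages and durations; the helper name `pearson_r` and the toy values are assumptions for illustration, and in practice `numpy.corrcoef` or `pandas.Series.corr` give the same number.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: centred cross-product over the product of norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Invented values, only to exercise the function.
ages = [22, 30, 35, 41, 58]
minutes = [9.0, 12.0, 8.0, 11.0, 10.0]
r = pearson_r(ages, minutes)
```

A value of `r` near 0 indicates no linear relationship, while values near 1 or -1 indicate a strong positive or negative linear relationship.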

Note that riding speed might also play a role. However, as we only know where each trip started and ended, we only have the straight-line distance between the stations, not the actual length of the route each rider took. We therefore cannot estimate the speed, and no further analysis can be done.
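
To illustrate the limitation, the straight-line (air) distance between two stations can be computed from their coordinates with the haversine formula. Dividing it by the trip duration yields at best a lower bound on the average speed, because the route actually ridden is at least as long as the straight line. The function names and the coordinates below are hypothetical and are not taken from the station data.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def min_speed_kmh(distance_km, duration_s):
    """Lower bound on average speed: straight-line distance over elapsed time."""
    return distance_km / (duration_s / 3600)


# Hypothetical coordinates of two stations and a 10-minute trip between them.
air_km = haversine_km(47.62, -122.32, 47.61, -122.34)
speed_floor = min_speed_kmh(air_km, duration_s=600)
```

Any detour, stop or one-way street makes the true riding speed higher than `speed_floor`, which is why the analysis is not pursued further.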

<div id="trip-duration-vs-gender"></div>

#### Trip Duration vs Gender

We can check whether the gender influences the trip duration.

In [None]:
plot_gender_influence_on_trip_duration(df_trips)

As we can see, women bike longer on average (12.14 min) than men (9.55 min). The fact that the mean duration of the `Unknown` group is much longer (36 min) suggests that short-term users do not use the bikes for commuting, but rather for leisure purposes.

<div id="age-vs-gender"></div>

#### Age vs Gender

In the univariate analysis, we highlighted a peak of 28-year-old users. In the present analysis, we can further investigate it by seeing whether or not it is consistent across genders. The graph below reports the age distribution per gender.

In [None]:
plot_age_distribution_per_gender(df_trips)

We can see that the age distribution is relatively similar between men and women, with a more important usage around 30 years compared to younger and older ages.

The peak at 28 years only clearly stands out for men (and `Other`), and remains difficult to interpret.

<div id="user-type-vs-day-of-the-week"></div>

#### User Type vs Day of the Week

Bike users can be distinguished based on the fact that they have an Annual Membership or a Short-Term Pass. Let's analyse their trips over the days of the week.

In [None]:
plot_trips_over_days(df_trips)

We can notice that Short-Term Pass users are more frequently riding a bike on weekends and much less on weekdays, as opposed to annual members. This suggests that Pronto's choice of offering two types of subscriptions meets the needs of two types of users. The hypothesis is that leisure bikers may often subscribe to a Short-Term Pass whereas bikers with an annual membership are probably commuters that need a bike for commuting to work.

<div id="gender-vs-day-of-the-week"></div>

#### Gender vs Day of the Week

From the univariate analysis we know that bike usage is more frequent for men than for women. Here we explore how this usage is distributed over the days of the week.

In [None]:
plot_gender_over_days(df_trips)

In the plot we can see that there is no obvious difference in the gender ratio. The most noteworthy difference is the increase of the `Unknown` gender category over the weekend, which, as we already saw, can be explained by the fact that this group represents Short-Term Pass holders, who travel mostly on weekends.

<div id="stations-vs-gender"></div>

#### Stations vs Gender

It is possible that there is a relationship between a biker's gender and the location where the biker is going. In this section we analyse whether there exist stations of arrival that are more popular for one gender than for the others. To this end, we split the trips according to gender. For each group, we compute how frequently each station was reached (station of arrival). Finally, for each station, we compute the ratio between the frequency of arrival for men and women. More specifically, we define:
- `ratio_f2m` the ratio between the visit frequency for women w.r.t. men. 
- `ratio_m2f` the ratio between the visit frequency for men w.r.t. women (i.e., the inverse of `ratio_f2m`)
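
The two ratios can be sketched with pandas as shown below. The sketch reads "visit frequency" as the share of a gender's trips that end at a given station, which is an assumption; the Starter Kit's own computation in `plot_stations_popularity_per_gender` may normalise differently. The toy data and station names are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Male", "Male", "Female", "Male"],
    "to_station_name": ["A", "A", "A", "B", "B", "B", "B", "A"],
})

# Share of each gender's trips arriving at each station (columns sum to 1).
freq = pd.crosstab(toy["to_station_name"], toy["gender"], normalize="columns")

ratios = pd.DataFrame({
    "ratio_f2m": freq["Female"] / freq["Male"],
    "ratio_m2f": freq["Male"] / freq["Female"],
})
```

Normalising per gender matters here: since men make far more trips overall, comparing raw counts would make almost every station look more popular among men.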

In [None]:
plot_stations_popularity_per_gender(df_trips)

The bar plot above reports the top 5 and bottom 5 stations with the highest difference in ratio. If we focus our attention on the dashed line we can see that the top 5 stations (on the left) are at least twice as popular among women as among men. This popularity is quite striking especially for the first two stations (12th Ave & E Yesler Way, and Fred Hutchinson Cancer Research Center / Fairview Ave N & Ward St). Similar considerations are valid for the bottom 5 stations (on the right) that are more popular among men than among women.

The existence of such a gender-related bias in the reached destinations is interesting to explore since it allows us to make further assumptions on how city services or workplaces are organized in Seattle. These aspects will be investigated later during the multivariate analysis.

<div id="arrival-vs-departure-unbalanced-stations"></div>

#### Arrival vs. Departure: Unbalanced Stations

From the univariate analysis we know that some stations are more popular than others in terms of arrivals. In this section we further inspect this aspect by checking for each station how many bikes arrive and depart. This analysis is important to identify **unbalanced stations**, namely stations that Pronto needs to take care of when redistributing bikes among stations.

In the plot below we show, for each station, how many times it was the origin (x-axis) and the destination (y-axis) of a trip. We also include a dashed line to easily identify hub stations where the number of arrivals and departures is similar.

In [None]:
plot_trips_per_station(df_trips)

We can observe the existence of unbalanced stations, namely stations that are depicted far away from the diagonal. A station can be unbalanced because many users pick up bikes from there, but few drop them off, or vice versa.
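
One simple way to quantify this imbalance is the difference between arrivals and departures per station, sketched below on invented data. The column name `net_arrivals` and the toy station names are assumptions for illustration; the Starter Kit's plotting helpers may use a different measure.

```python
import pandas as pd

toy = pd.DataFrame({
    "from_station_name": ["Hill", "Hill", "Hill", "Downtown", "Lake"],
    "to_station_name":   ["Downtown", "Downtown", "Lake", "Hill", "Downtown"],
})

departures = toy["from_station_name"].value_counts()
arrivals = toy["to_station_name"].value_counts()

# Align both counts on the station name; stations missing on one side get 0.
balance = pd.DataFrame({"departures": departures, "arrivals": arrivals}).fillna(0).astype(int)

# Positive: more bikes arrive than leave; negative: more bikes leave than arrive.
balance["net_arrivals"] = balance["arrivals"] - balance["departures"]

# Stations farthest from the diagonal come first.
most_unbalanced = balance.sort_values("net_arrivals", key=lambda s: s.abs(), ascending=False)
```

Because every trip is one departure and one arrival, the `net_arrivals` column always sums to zero over all stations.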

From the graph above it is hard to extract further information on these stations. For this reason we show on top of the map of Seattle where these stations are located. To spot insights related to (**_unbalanced_**) stations, we draw stations using different colors and sizes:
- green circle: unbalanced station with more arrivals than departures; the bigger its size, the higher the unbalance 
- red circle: unbalanced station with more departures than arrivals; the bigger its size, the higher the unbalance

To ease the visualisation, we only plot the top 10 (**_unbalanced_**) stations.

In [None]:
plot_unbalanced_stations(df=df_trips, n_stations=10)

The map strikingly shows that:
- the unbalanced stations with highest amount of departures (red circles) are all located on Capitol Hill
- the unbalanced stations with highest amount of arrivals (green circles) are all located downtown

This last finding suggests another possible explanation for the many stations of arrival downtown: bikers may be more prone to use bikes to move downhill. To explore this hypothesis, we plot the same unbalanced stations below, this time using the elevation above sea level for the size of each circle.

In [None]:
plot_unbalanced_stations_with_elevation(df=df_trips)

Comparing this map to the one before, we can confirm that:
- unbalanced stations with more departures than arrivals are the ones located uphill
- unbalanced stations with more arrivals than departures are the ones located downhill

This observation provides useful insights for the Pronto system. Indeed, in case of the opening of new stations, we can already foresee whether the station would be unbalanced or not and take this into account for handling the logistics.

<div id="multivariate-analysis"></div>

### Multivariate Analysis

We can extend the previous analyses by analyzing multiple variables simultaneously in order to explore the dataset deeper and obtain more insights.

<div id="number-of-trips-and-trip-duration-vs-day-of-the-week-and-month-of-the-year"></div>

#### Number of Trips and Trip Duration vs Day of the Week and Month of the Year

As a starting point, let's inspect how bikes are used through the different days of the week and months of the year by drawing a bar plot where the bars represent the number of trips performed on each different day/month (see scale on the left side). On top of the bar plot, we plot a line that depicts the median duration of trips (see scale on the right side).

In [None]:
plot_trip_duration_over_time(df_trips, time='both')

The bar plot shows that there are slightly fewer trips on the weekend, but that they tend to last longer than during weekdays. Across weekdays, the number of trips is similar.

We draw the hypothesis that there are two types of users. On the one hand, we have leisure bikers who use the bike recreationally, essentially during the weekend. On the other hand, we have weekday users who use the bike for commuting to work.

It also appears that users tend to use bikes more during warmer periods, i.e., summer, than during the winter, probably because recreational users only bike on warmer days.
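
The two quantities combined in this kind of plot, the number of trips and the median duration per day, can be obtained in a single aggregation. The sketch below uses invented values and the `day` and `tripdurationMinutes` columns introduced during pre-processing; the output column names `n_trips` and `median_minutes` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "day": ["Mon", "Mon", "Mon", "Sat", "Sat"],
    "tripdurationMinutes": [8.0, 10.0, 9.0, 20.0, 30.0],
})

# One row per day: trip count (bars) and median duration (line).
summary = toy.groupby("day")["tripdurationMinutes"].agg(
    n_trips="count",
    median_minutes="median",
)
```

Replacing `"day"` with `"month"` gives the corresponding per-month summary.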

<div id="trip-durations-for-unbalanced-stations"></div>

#### Trip Durations for Unbalanced Stations

In a previous analysis, we validated that the elevation plays a major role in explaining why some stations are unbalanced. In other terms, the elevation embodies the 'cost' that a user has to face when moving between stations that are at different levels. This aspect, i.e., the cost of reaching nodes within a network, is a general problem that is faced in network analysis for identifying the best routing. For this reason, in this section, we repeat the previous analysis by focusing on the 'cost' of reaching stations without the need of using the elevation. By making this problem more generic, the proposed analysis can be applied also to other networks where there are 'costs' for traveling between nodes which influence routing.

We hence redefine the previous problem setting as follows:
- each bike station represents a node of a network;
- each node is connected through an edge to all other nodes of the network;
- edges are bidirectional and each direction has an associated cost.

In our case, the cost is represented by the time required for traveling between two stations. We assume that given the same distance, the time required for moving between two stations is determined by the difference in elevations, i.e., biking uphill requires more time than biking downhill.

As a first step to perform this analysis, we compute the trip duration among (unbalanced) stations. An excerpt of these trip durations is provided in the table below.

In [None]:
df_trip_duration = get_unbalanced_trip_duration(df=df_trips)
df_trip_duration.sort_values('count', ascending=False).iloc[1:3, :]

The table above shows the cost (i.e., the time it takes) to reach node Cal Anderson Park / 11th Ave & Pine St from node E Harrison St & Broadway Ave E, and to reach the latter station from the former. We see that the time difference is 0.6 min, i.e., it is (slightly) faster to go in one direction than in the other.

From the (full version of the) table above, we plot a matrix depicting the difference in trip duration between each pair of (unbalanced) stations. The color scale indicates how strongly positive or negative the difference in the median trip duration is. To make the stations located in Capitol Hill (high elevation) easier to spot, their names are reported in bold.
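
Such a difference matrix can be sketched as the table of median durations per direction minus its own transpose. The toy stations and durations below are invented, and the sign convention (entry for the pair a, b is the median of a to b minus the median of b to a) is an assumption about how the Starter Kit defines the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "from_station_name": ["Hill", "Hill", "Hill", "Downtown", "Downtown"],
    "to_station_name":   ["Downtown", "Downtown", "Downtown", "Hill", "Hill"],
    "tripdurationMinutes": [5.0, 6.0, 7.0, 9.0, 11.0],
})

# Rows: origin; columns: destination; values: median duration of that direction.
median_min = (
    toy.groupby(["from_station_name", "to_station_name"])["tripdurationMinutes"]
       .median()
       .unstack()
)

# Entry (a, b) holds median(a -> b) minus median(b -> a).
diff = median_min - median_min.T
```

With this convention, a negative entry means the row-to-column direction is the faster one, as for the downhill ride from `Hill` to `Downtown` in the toy data.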

In [None]:
plot_trip_duration_difference_between_stations(df=df_trips)

We can observe that the connection from 12th Ave & E Mercer St to 2nd Ave & Spring St has the largest negative difference (most intense blue color). In addition, we can see that the former station is located in Capitol Hill (its name is marked in bold), while the latter is located downtown (no bold). Similarly, the connection from Cal Anderson Park / 11th Ave & Pine St, which is also located in Capitol Hill, to Republican St & Westlake Ave N, located downtown, also shows a large negative difference.

In fact, all stations located in Capitol Hill show a significant negative difference w.r.t. the stations located downtown (the 5 leftmost columns). This is consistent with the previous observation that the first group of stations are located at a higher elevation than the second group.

Between the stations located downtown (5 topmost rows and 5 leftmost columns) we can observe light (positive and negative) differences, which might be due to the particular relief characteristics and traffic organization (e.g., one way roads) of Seattle's downtown.

Finally, between the stations located in Capitol Hill (5 bottommost rows and 5 rightmost columns), we observe a nearly null or significant positive difference, again likely revealing particular relief and traffic organization characteristics of Capitol Hill. An exception is the large positive difference observed between 12th Ave & E Mercer St and 15th Ave E & E Thomas St, which cannot be explained merely by such characteristics of Capitol Hill. Indeed, these two stations are close by and share similar elevations, yet they present a big difference in terms of trip duration when bikers move back and forth. To explain this result we have to keep in mind that the path chosen by bikers might not necessarily be the shortest one, as they might just decide to take longer paths for shopping or sightseeing.

<div id="influence-of-gender-on-stations-of-arrival-during-the-week"></div>

#### Influence of Gender on Stations of Arrival During the Week

During the bivariate analysis we discovered that there is a relationship between gender and the stations of arrival. In this analysis we want to further explore how this relationship evolves during the week.

As a first step, we take the top ten stations of arrival on weekdays during the morning rush hour. We focus on this time window since we assume that it is used by bikers to go to work. For these stations, we check whether they are more popular among men or women.
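
The selection described above could be sketched as follows. This is only an illustration under stated assumptions: the column names (`stoptime`, `to_station`, `gender`), the rush-hour window (07:00 to 09:00) and the definition of popularity (the share of each gender's rush-hour arrivals that end at a station) are hypothetical and may differ from what the plotting function below actually does.

```python
import pandas as pd

# Toy arrivals; 2015-06-01 is a Monday and 2015-06-06 a Saturday.
arrivals = pd.DataFrame({
    "stoptime": ["2015-06-01 08:10", "2015-06-01 08:20", "2015-06-02 07:45",
                 "2015-06-02 08:30", "2015-06-06 08:15", "2015-06-03 14:00"],
    "to_station": ["X", "X", "X", "Y", "Z", "Y"],
    "gender": ["Male", "Female", "Male", "Female", "Male", "Male"],
})


def rush_hour_gender_share(df, n=10, start_hour=7, end_hour=9,
                           time_col="stoptime", station_col="to_station",
                           gender_col="gender"):
    """Per-gender share of weekday rush-hour arrivals for the n busiest stations."""
    t = pd.to_datetime(df[time_col])
    # Monday=0 ... Friday=4; keep arrivals inside the assumed rush-hour window.
    mask = (t.dt.dayofweek < 5) & (t.dt.hour >= start_hour) & (t.dt.hour < end_hour)
    rush = df[mask]
    # The n stations with the most rush-hour arrivals overall.
    top = rush[station_col].value_counts().head(n).index
    # Each column sums to 1: fraction of that gender's arrivals per station.
    share = pd.crosstab(rush[station_col], rush[gender_col], normalize="columns")
    return share.loc[top]


share = rush_hour_gender_share(arrivals)
print(share)
```

Normalizing per gender rather than using raw counts keeps the comparison meaningful when one gender accounts for many more trips than the other.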

In [None]:
plot_stations_of_arrival_on_weekdays_during_morning_rush_hour(df=df_trips)

We can see that Pier 69 / Alaskan Way & Clay St is twice as popular among women as among men, whereas 9th Ave N & Mercer St is twice as popular among men as among women. For all other stations the difference is smaller, i.e., their popularity as a station of arrival during the morning rush hour is similar for both genders.

If we compare this bar plot with the previous one in section [Stations vs Gender](#Stations-vs-Gender), we realize that the differences in ratio are considerably smaller. This may indicate that on weekdays, during rush hours, both men and women are heading toward the same gathering places (i.e., where they work).

To test this hypothesis, we now focus on stations of arrival during the weekend. The assumption is that during weekends bikers may go to places other than their work location (following their personal interests and hobbies, for example). We therefore take all stations of arrival that are reached on weekends and check, for each of them, whether it is more popular among men or among women.

In [None]:
plot_arrival_station_frequency_weekend(df=df_trips, gender='both')

We discover that during the weekend men and women have different favourite stations of arrival. Moreover, if we compare the last three bar plots, we see that the men-to-women ratio of the top stations of arrival changes during the week. Indeed, on weekends there are seven stations in total where either the women-to-men or the men-to-women ratio is above 2, whereas during weekdays this was the case for only two stations. This seems to imply that, in the absence of constraints (such as the location of the workplace), gender plays a role in where bikers go.
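
The count of stations with a ratio above 2 in either direction can be reproduced with a few lines of pandas. The sketch below uses made-up popularity shares and hypothetical column names (`Female`, `Male`); it only illustrates the thresholding logic, not the actual figures of this dataset.

```python
import pandas as pd

# Made-up popularity shares per station and gender.
popularity = pd.DataFrame(
    {"Female": [0.30, 0.10, 0.20, 0.40], "Male": [0.10, 0.25, 0.15, 0.50]},
    index=["S1", "S2", "S3", "S4"],
)


def count_skewed_stations(share, threshold=2.0, a="Female", b="Male"):
    """Count stations where a/b or b/a is strictly above the threshold."""
    ratio = share[a] / share[b]
    # b/a > threshold is equivalent to a/b < 1/threshold.
    skewed = (ratio > threshold) | (ratio < 1 / threshold)
    return int(skewed.sum()), ratio


n_skewed, ratio = count_skewed_stations(popularity)
print(n_skewed)
print(ratio)
```

Applying the same function to the weekday and weekend shares gives two counts that can be compared directly.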

We complete this analysis by plotting the previously identified stations on a map. As markers we use:
- blue circles for the top 10 stations which are reached in the morning rush hours during weekdays
- green circles for the top 10 stations which are reached by women during weekends
- red circles for the top 10 stations which are reached by men during weekends

(Note that red circles appear in a light red tone, close to orange; superposition of blue and red circles results in a dark red color; and superposition of green and blue circles results in a dark green color.)
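
Since overlapping markers can be hard to tell apart by color alone, the overlaps between the three groups can also be listed explicitly with set operations. The station identifiers below are placeholders, not actual stations from the dataset.

```python
# Placeholder station identifiers for the three top-10 groups.
weekday_top = {"S1", "S2", "S3", "S4"}
women_weekend_top = {"S3", "S5", "S6"}
men_weekend_top = {"S4", "S6", "S7"}


def marker_overlaps(weekday, women, men):
    """Return the stations shared between the groups, plus weekday-only ones."""
    return {
        "weekday_and_women": weekday & women,   # blue + green markers
        "weekday_and_men": weekday & men,       # blue + red markers
        "women_and_men": women & men,           # green + red markers
        "weekday_only": weekday - women - men,  # blue marker only
    }


overlaps = marker_overlaps(weekday_top, women_weekend_top, men_weekend_top)
for name, stations in overlaps.items():
    print(name, sorted(stations))
```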

In [None]:
plot_stations(df=df_trips, n_stations=10)

The map above further confirms that:
- The popular arrival stations during weekdays do not match the popular stations reached by men and women during weekends (as the previous top-10 listings indicated)
- There are popular locations of arrival (during weekends) in the University district. These stations do not appear in the top 10 of reached stations during the morning rush hour during weekdays.

In order to further explain the differences observed during weekends, a more in-depth knowledge about the city and the shared habits of its inhabitants would be required, which falls outside the scope of this Starter Kit.

However, it might still be interesting to further investigate how the 10+ stations in the University district are used during the week, as the larger group of stations in downtown Seattle and its surroundings seems to dominate over the rest. This is left as an exercise for the reader.
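
One possible starting point for this exercise is to count the arrivals at a chosen set of stations per day of the week. The sketch below assumes hypothetical column names (`stoptime`, `to_station`) and placeholder station identifiers; the list of University district stations would have to be compiled from the station data.

```python
import pandas as pd

# Toy arrivals; 2015-06-01 is a Monday, 2015-06-06 a Saturday, 2015-06-07 a Sunday.
toy_trips = pd.DataFrame({
    "stoptime": ["2015-06-01 09:00", "2015-06-01 17:30",
                 "2015-06-06 11:00", "2015-06-07 12:00"],
    "to_station": ["U1", "U2", "U1", "D1"],
})

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def arrivals_by_weekday(df, stations, time_col="stoptime",
                        station_col="to_station"):
    """Count arrivals at the given stations for each day of the week."""
    subset = df[df[station_col].isin(stations)]
    days = pd.to_datetime(subset[time_col]).dt.day_name()
    # Reindex so that days without arrivals appear with a count of 0.
    return days.value_counts().reindex(DAYS, fill_value=0)


counts = arrivals_by_weekday(toy_trips, stations={"U1", "U2"})
print(counts)
```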

<div id="conclusion"></div>

## Conclusion

The data exploration process is by definition heavily influenced by the type of data and the goal of the exploration. Hence, it is difficult or even impossible to provide clear and detailed instructions that apply in every case. There is no fixed set of steps that must always be carried out in the same order.

However, the intuition is always the same: you progress from simple questions to more complex ones. The data is first analyzed one feature at a time (univariate analysis) to gain a first understanding of the data and the domain. Then more complex (bivariate or multivariate) analyses can be performed. Over the course of the exploration, the complexity of the analyses and of the hypotheses built should increase. Nevertheless, this approach is not linear but iterative: a multivariate analysis will often lead back to, e.g., additional pre-processing or univariate analysis.

In this Starter Kit we have shown how to perform data exploration on a concrete industrial use case. We have demonstrated how appropriate statistical and visualisation techniques can help a data scientist discover interesting patterns and gain insights without applying any complex algorithm. This example can serve as a template to inspire the exploration of other datasets.

<div id="#additional-information"></div>

## Additional Information

Copyright © 2022 Sirris

This Starter Kit was developed in the context of the EluciDATA project (http://www.elucidata.be). For more information, please contact info@elucidata.be.

 
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Notebook"), to deal in the Notebook without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Notebook, and to permit persons to whom the Notebook is provided to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies of the Notebook and/or copies of substantial portions of the Notebook.

THE NOTEBOOK IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL SIRRIS, THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, DIRECT OR INDIRECT, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE NOTEBOOK OR THE USE OR OTHER DEALINGS IN THE NOTEBOOK.