# Exercise: EDA

This exercise is a continuation of the exploratory data analysis of the Citibike Trip Histories dataset. The first section shows the code that prepares the data, followed by a section with the initial analysis. Your task is to complete the EDA of the Citibike dataset. The instructions are stated in the last section of this notebook.

**IMPORTANT:** Copy this notebook and make changes in that copy. Do not push changes to this notebook.

In [1]:
import os
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import zscore

%matplotlib inline


## Data - CitiBike Trip Histories

CitiBike provides the data of the bike share through this website: https://www.citibikenyc.com/system-data

For this exercise, we'll be using their trip history data which may be found [here](https://s3.amazonaws.com/tripdata/index.html). 

In [2]:
data = pd.read_csv('./202102-citibike-tripdata.csv')
data.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,304,2021-02-01 00:04:23.0780,2021-02-01 00:09:27.7920,3175,W 70 St & Amsterdam Ave,40.77748,-73.982886,4045,West End Ave & W 60 St,40.77237,-73.99005,27451,Subscriber,1996,2
1,370,2021-02-01 00:07:08.8080,2021-02-01 00:13:19.4670,3154,E 77 St & 3 Ave,40.773142,-73.958562,3725,2 Ave & E 72 St,40.768762,-73.958408,35000,Subscriber,1991,1
2,635,2021-02-01 00:07:55.9390,2021-02-01 00:18:31.0390,502,Henry St & Grand St,40.714211,-73.981095,411,E 6 St & Avenue D,40.722281,-73.976687,49319,Subscriber,1980,2
3,758,2021-02-01 00:08:42.0960,2021-02-01 00:21:20.7820,3136,5 Ave & E 63 St,40.766368,-73.971518,3284,E 88 St & Park Ave,40.781411,-73.955959,48091,Customer,1969,0
4,522,2021-02-01 00:09:32.6820,2021-02-01 00:18:15.4100,505,6 Ave & W 33 St,40.749013,-73.988484,3687,E 33 St & 1 Ave,40.743227,-73.974498,48596,Subscriber,1988,1


## Feature Extraction

In [3]:
data['starttime'] = pd.to_datetime(data['starttime'])
data['stoptime'] = pd.to_datetime(data['stoptime'])

data['dayofweek'] = data['starttime'].dt.dayofweek
data['hourofday'] = data['starttime'].dt.hour
data['year'] = data['starttime'].dt.year

## Feature Transformation

In [4]:
data['duration_min'] = data['tripduration']/60

## Feature Generation

In [5]:
data['age'] = data['starttime'].dt.year - data['birth year']
data.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,dayofweek,hourofday,year,duration_min,age
0,304,2021-02-01 00:04:23.078,2021-02-01 00:09:27.792,3175,W 70 St & Amsterdam Ave,40.77748,-73.982886,4045,West End Ave & W 60 St,40.77237,-73.99005,27451,Subscriber,1996,2,0,0,2021,5.066667,25
1,370,2021-02-01 00:07:08.808,2021-02-01 00:13:19.467,3154,E 77 St & 3 Ave,40.773142,-73.958562,3725,2 Ave & E 72 St,40.768762,-73.958408,35000,Subscriber,1991,1,0,0,2021,6.166667,30
2,635,2021-02-01 00:07:55.939,2021-02-01 00:18:31.039,502,Henry St & Grand St,40.714211,-73.981095,411,E 6 St & Avenue D,40.722281,-73.976687,49319,Subscriber,1980,2,0,0,2021,10.583333,41
3,758,2021-02-01 00:08:42.096,2021-02-01 00:21:20.782,3136,5 Ave & E 63 St,40.766368,-73.971518,3284,E 88 St & Park Ave,40.781411,-73.955959,48091,Customer,1969,0,0,0,2021,12.633333,52
4,522,2021-02-01 00:09:32.682,2021-02-01 00:18:15.410,505,6 Ave & W 33 St,40.749013,-73.988484,3687,E 33 St & 1 Ave,40.743227,-73.974498,48596,Subscriber,1988,1,0,0,2021,8.7,33


#### Distance

Another feature we can generate from the data is distance. Because the provided coordinates are longitudes and latitudes measured in degrees, a distance computed directly from these points would also be in degrees (and not meters). 

There's actually a library that specifically handles geospatial data called `geopy` ([Link](https://geopy.readthedocs.io/en/stable/#module-geopy.distance)). For simplicity's sake in this tutorial, we use an existing function that calculates the great-circle distance using the Haversine formula given the starting and ending longitudes and latitudes: `calculate_distance(lat1, lon1, lat2, lon2)`

Credits to [Wayne Dyck](https://gist.github.com/rochacbruno/2883505) for the function.

In [6]:
def calculate_distance(lat1, lon1, lat2, lon2):
    """
    Calculates the distance provided a pair of longitudes and latitudes
    using the Haversine formula
    
    Returns the distance in kilometers.
    """
    radius = 6371 # km

    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d

In [0]:
data['distance_km'] = data.apply(lambda x: calculate_distance(x['start station latitude'], x['start station longitude'],
                                        x['end station latitude'], x['end station longitude']), axis=1)
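
The row-wise `apply` above calls the function once per trip, which can be slow on several hundred thousand rows. As a rough sketch only, the same Haversine formula can be written with NumPy so that it works on whole columns at once. The name `haversine_km` and the `radius_km` parameter are assumptions made for this illustration; they are not part of the credited function.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; accepts scalars or array-like columns."""
    # Convert every coordinate from degrees to radians
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term, same expression as in calculate_distance
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Sanity check: one degree of latitude along a meridian (about 111.2 km)
one_degree_km = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(float(one_degree_km), 1))
```

In this notebook, the call would take the four coordinate columns directly, for example `haversine_km(data['start station latitude'], data['start station longitude'], data['end station latitude'], data['end station longitude'])`.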

# Exploratory Data Analysis

We will do the following:
1. Examine the size and structure of the data
2. Examine each field individually
3. Examine relationships/correlations
4. Identify anomalies/outliers

## 1. Size and structure of the data

In [0]:
data.shape

After the feature transformations, the dataset now has 634,631 observations with 21 variables.

In [0]:
data.info()

The dataset has a variety of datatypes: integer and float values, date and time, and strings. There are no variables with null/missing values.

In [0]:
# Get descriptive statistics of quantitative variables
data.describe()

Although these variables are all quantitative, it doesn't really make sense to look at the statistics for unique IDs and spatial data like `start station id`, `start station latitude`, `start station longitude`, and `bikeid`, among others. Here, we will just focus on the values for `birth year`, `duration_min`, `age`, and `distance_km`.

A few insights from the `data.describe()` function:

1. The minimum `birth year` is 1885 which has a remarkably large difference from the 25th percentile value. Can this be anomalous data?
2. There is a very large difference between the maximum and 75th percentile values of `duration_min` and `age`.
3. Observations #1 & #2 indicate possible outliers in the data. 
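
One possible way to follow up on these insights is to list the rows whose derived age is implausibly high. The sketch below uses made-up rows and an assumed cutoff of 100 years (`MAX_PLAUSIBLE_AGE` is a hypothetical name and value, not something defined by the dataset).

```python
import pandas as pd

# Toy rows standing in for the trip data; values are made up for illustration
toy = pd.DataFrame({"birth year": [1996, 1969, 1885, 1991]})
toy["age"] = 2021 - toy["birth year"]

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff, chosen only for this example

# Keep only the rows whose age exceeds the cutoff
suspect = toy[toy["age"] > MAX_PLAUSIBLE_AGE]
print(suspect)
```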

In [0]:
data.describe(include=object)

These are the descriptive stats for the categorical variables. 

## 2. Examining individual variables

Now, we can start generating simple visualizations to help us better understand the values in each variable.

#### User type

In [0]:
sns.set_theme(style="whitegrid")

usertype_plot = sns.catplot(x="usertype", kind="count", order=["Customer", "Subscriber"], data=data)

There are more one-time users of CitiBike than there are subscribers.

#### Age

In [0]:
age_plot = sns.catplot(x="age", kind="count", data=data)
age_plot.set(ylim=(0,7500)) # Limit the maximum y-axis value because of one outlier count for age 52

In [0]:
data["age"].value_counts()

#### Starting Stations

In [0]:
start_stations = data["start station name"].value_counts().rename_axis('Station name').reset_index(name='counts')
start_stations = start_stations.nlargest(10, 'counts')
start_stations

In [0]:
start_station_plot = sns.catplot(y="Station name", x="counts", orient="h", kind="bar", data=start_stations)

#### Ending stations

In [0]:
end_stations = data["end station name"].value_counts().rename_axis('Station name').reset_index(name='counts')
end_stations = end_stations.nlargest(10, 'counts')
end_stations

In [0]:
end_station_plot = sns.catplot(y="Station name", x="counts", orient="h", kind="bar", data=end_stations)

We can see from both plots that the top 10 start and end stations are consistent with each other. This indicates high-traffic areas that can be potential locations for adding more bikes and bike docks. 

#### Origin-Destination Pairs

Here we create `od_trips` which contains the origin-destination pairs derived from the unique pairs of `start station name` and `end station name`.

In [0]:
## Count of rides per OD
od_trips = data.groupby(['start station name', 'end station name'], as_index=False)['bikeid'].count()
od_trips = od_trips.rename(columns={"start station name": "start", "end station name": "end", "bikeid": "total_trips"}, errors="raise")
od_trips.head()

In [0]:
od_trips["od"] = od_trips["start"] + " to " + od_trips["end"]
od_trips

In [0]:
od_rank_plot = sns.catplot(y="od", x="total_trips", orient="h", kind="bar", data=od_trips.nlargest(10, "total_trips"))

#### Origin-Destination Matrix

In this part, we focus on analyzing the number of trips between the top 10 `start` and `end` stations. It would be impractical to visually analyze all possible pairs because the dataset is too big.

In [0]:
# Get the trips between the top 10 stations.

od_topten = od_trips[od_trips.start.isin(start_stations["Station name"]) & od_trips.end.isin(end_stations["Station name"])]
od_topten

In [0]:
# Create a matrix

od_matrix = pd.pivot_table(od_topten, index='start', columns='end', values='total_trips', aggfunc=np.sum, fill_value=0)
od_matrix
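
As an aside, an origin-destination count matrix can also be built straight from trip rows with `pd.crosstab`, without grouping first. This is only a sketch on toy data; the station labels `A`, `B`, and `C` are placeholders, not stations from the dataset.

```python
import pandas as pd

# Each row is one trip between two placeholder stations
trips = pd.DataFrame({
    "start": ["A", "A", "B", "C", "A"],
    "end":   ["B", "B", "A", "A", "C"],
})

# Rows are origins, columns are destinations, cells are trip counts
od_counts = pd.crosstab(trips["start"], trips["end"])
print(od_counts)
```

Pairs with no trips show up as `0`, which matches the `fill_value=0` behaviour of the pivot table above.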

In [0]:
# Generate a heatmap

od_heatmap = sns.heatmap(od_matrix)

In [0]:
# Change color palette

od_heatmap = sns.heatmap(od_matrix, cmap="YlGnBu", annot=True, fmt="d")

#### Gender

In [0]:
gender_plot = sns.catplot(x="gender", kind="count", order=[1, 2, 0], data=data)

It seems that the dataset contains mostly zero (0) values for the gender. We can continue our analysis by removing those trips with unknown gender.

In [0]:
gender_plot = sns.catplot(x="gender", kind="count", order=[1, 2], data=data[data["gender"] > 0])

#### Day of week

In [0]:
dow_plot = sns.catplot(x="dayofweek", kind="count", data=data)

The values for `dayofweek` start with `0` for `Monday` and end with `6` for `Sunday`. Based on the bar plot above, most trips happen on Wednesdays, Fridays and Saturdays.
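
To make the numeric coding easier to read, pandas can return the weekday name alongside the number. A small sketch on three dates from February 2021 (the `labels` frame is just an illustration):

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2021-02-01", "2021-02-06", "2021-02-07"]))

labels = pd.DataFrame({
    "dayofweek": stamps.dt.dayofweek,  # 0 = Monday ... 6 = Sunday
    "day_name": stamps.dt.day_name(),  # human-readable weekday name
})
print(labels)
```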

#### Hour of day

In [0]:
hod_plot = sns.catplot(x="hourofday", kind="count", data=data)

From the plot, the number of trips starts increasing from 12 noon and peaks at 5PM. 

#### Duration in minutes

In [0]:
data["duration_min"].describe()

#### Detect and remove outliers

Here, outliers are defined as values that are more than 3 standard deviations away from the mean. We detect them by computing the z-score of each value, which expresses its distance from the mean in units of the standard deviation.

In [0]:
z_scores = zscore(data["duration_min"]) 

# Get their absolute values for easy filtering
abs_z_scores = np.abs(z_scores)

# An array of boolean values with same length as the original dataset. 
# True if value is less than 3 standard deviations from the mean or not an outlier. Otherwise, False.
filtered_entries = (abs_z_scores < 3) 

# Array of boolean values where value is True if it is an outlier, otherwise False.
duration_min_outliers = (abs_z_scores >= 3)
data[duration_min_outliers].duration_min # Show outlier values
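
To see what the z-score rule does on a small scale, here is a sketch with made-up durations, computed by hand with NumPy. It assumes the population standard deviation (`ddof=0`), which is also the default used by `scipy.stats.zscore`.

```python
import numpy as np

# Twenty ordinary trips of 10 minutes and one extreme trip of 1000 minutes
durations = np.array([10.0] * 20 + [1000.0])

# z-score: distance from the mean in units of standard deviation (ddof=0)
z = (durations - durations.mean()) / durations.std()

is_outlier = np.abs(z) >= 3
kept = durations[~is_outlier]
print(is_outlier.sum(), kept.max())
```

Note that the mean and standard deviation are themselves pulled up by extreme values, so this rule tends to flag only the most extreme points.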

#### Histograms with KDE

In [0]:
duration_plot = sns.displot(data=data[filtered_entries], x="duration_min", kde=True)

In [0]:
# Zoom in a little closer. Let's limit the x-axis to only show values up to 200

duration_plot = sns.displot(data=data[filtered_entries], x="duration_min", kde=True)
duration_plot.set(xlim=(0, 200))

In [0]:
# Zoom in more. Let's limit the x-axis to only show values up to 100

duration_plot = sns.displot(data=data[filtered_entries], x="duration_min", kde=True)
duration_plot.set(xlim=(0, 100))

#### ECDF with Rug Plots

In [0]:
duration_plot = sns.displot(data=data[filtered_entries], x="duration_min", kind="ecdf", rug=True)
duration_plot.set(xlim=(0, 100))

## 3. Examine relationships/correlations

### User type and gender

In [0]:
usertype_gender_plot = sns.catplot(x="gender", 
                                   kind="count", 
                                   hue="usertype", 
                                   palette={"Customer": "g", "Subscriber": "m"}, 
                                   data=data)

- Users with 24-hour or 3-day passes (Customers) mostly did not have their gender information recorded.
- There are more male subscribers than female subscribers. 

### Trip distance, duration and user type

In [0]:
dur_dist_user_plot = sns.jointplot(data=data[filtered_entries], 
                                   x="duration_min", 
                                   y="distance_km", 
                                   hue="usertype", 
                                   palette={"Customer": "g", "Subscriber": "m"}, 
                                   alpha=0.5)
dur_dist_user_plot.set_axis_labels("Trip duration (min)", "Trip distance (km)", labelpad=10)
dur_dist_user_plot.fig.set_size_inches(10.5, 6.5)

In [0]:
z_scores = zscore(data["distance_km"]) 

# Get their absolute values for easy filtering
abs_z_scores = np.abs(z_scores)

# An array of boolean values with same length as the original dataset. 
# True if value is less than 3 standard deviations from the mean or not an outlier. Otherwise, False.
filtered_dist = (abs_z_scores < 3) 

# Array of boolean values where value is True if it is an outlier, otherwise False.
dist_outliers = (abs_z_scores >= 3)
data[dist_outliers].distance_km # Show outlier values

In [0]:
# Plot with outliers removed for both duration and distance
dur_dist_user_plot = sns.jointplot(data=data[filtered_entries & filtered_dist], 
                                   x="duration_min", 
                                   y="distance_km", 
                                   hue="usertype", 
                                   palette={"Customer": "g", "Subscriber": "m"}, 
                                   alpha=0.5)
dur_dist_user_plot.set_axis_labels("Trip duration (min)", "Trip distance (km)", labelpad=10)
dur_dist_user_plot.fig.set_size_inches(10.5, 6.5)

### Get correlation between quantitative variables

In [0]:
data_to_corr = data[["gender", "duration_min", "age", "distance_km"]]
data_to_corr

In [0]:
# Get z scores of the following variables
z_scores = zscore(data_to_corr[["duration_min", "age", "distance_km"]]) 

# Get their absolute values for easy filtering
abs_z_scores = np.abs(z_scores)

# An array of boolean values with same length as the original dataset. 
# True if value is less than 3 standard deviations from the mean or not an outlier. Otherwise, False.
filtered_rows = (abs_z_scores < 3).all(axis=1) 

# Boolean array that is True for rows where at least one variable is an outlier, otherwise False.
data_to_corr_outliers = (abs_z_scores >= 3).any(axis=1)
data_to_corr[data_to_corr_outliers] # Show rows containing outlier values

In [0]:
# Remove outliers
data_to_corr = data_to_corr[filtered_rows]

#### Correlation Heatmap

In [0]:
data_corr_heatmap = sns.heatmap(data_to_corr.corr(), 
                                center=0, 
                                cmap="YlGnBu", 
                                annot=True, 
                                vmin=-1, 
                                vmax=1)

In [0]:
mask = np.triu(np.ones_like(data_to_corr.corr(), dtype=bool))

data_corr_heatmap = sns.heatmap(data_to_corr.corr(), 
                                center=0, 
                                cmap="YlGnBu", 
                                annot=True, 
                                mask=mask,
                                vmin=-1, 
                                vmax=1)

Variables `distance_km` and `duration_min` showed some positive correlation. We can still include them as features in a modeling task later.

----
----

# Exercise Proper: Continuation of EDA

We are done examining the individual characteristics of each variable in the dataset. Your task is to continue examining the remaining relationships (bivariate/multivariate) between variables. Here are some that you can prioritize:

- Bivariate: Trip duration & time of day
- Bivariate: User type & age
- Bivariate: Start station & user type
- Bivariate: End station & user type
- Bivariate: End station & gender
- Multi: Start station, end station & trip duration
- Multi: Start station, end station & unique users
- Multi: Start station, end station & user type
- Multi: Trip distance, duration & gender
- Multi: Average trip distance per unique user, average duration per unique user & user type
- Multi: Average trip distance per unique user, average duration per unique user, gender & user type

After creating simple visualizations for each, write down your observations in a separate cell. You do not have to interpret yet why those relationships appear. We're still doing EDA. Relax ;)

## Bivariate: Trip duration & time of day

Create a df with only `duration_min` and `hourofday`.

In [0]:
dur_hour_df = data[["duration_min", "hourofday"]]

In [0]:
dur_hour_sum_df = dur_hour_df[filtered_entries].groupby("hourofday").sum()
dur_hour_sum_df

In [0]:
dur_hour_mean_df = dur_hour_df[filtered_entries].groupby("hourofday").mean()
dur_hour_mean_df

In [0]:
dist_dur_id_type_mean_df = data[filtered_entries & filtered_dist]
dist_dur_id_type_mean_df = dist_dur_id_type_mean_df[dist_dur_id_type_mean_df["distance_km"] != 0]
dist_dur_id_type_mean_df = dist_dur_id_type_mean_df[["bikeid", "usertype", "duration_min", "distance_km"]].groupby(by=["bikeid", "usertype"]).mean()
dist_dur_id_type_mean_df

In [0]:
dur_hour_mean_plot = sns.catplot(x=dur_hour_mean_df.index, y="duration_min", data=dur_hour_mean_df, kind="bar")
dur_hour_mean_plot.fig.set_size_inches(15.5, 6.5)
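
The separate sum and mean tables above could also be produced in one pass with `agg`, which makes it easy to add a count and a median for comparison. This is a sketch on toy values, not the notebook's data.

```python
import pandas as pd

toy = pd.DataFrame({
    "hourofday":    [8, 8, 8, 17, 17],
    "duration_min": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# One row per hour, one column per statistic
summary = toy.groupby("hourofday")["duration_min"].agg(["count", "sum", "mean", "median"])
print(summary)
```

The median is less affected by very long trips than the mean, so comparing the two columns can hint at skew within an hour.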

## Bivariate: User type & age

In [0]:
# Get the 10 most common ages of bike users
user_age = data["age"].value_counts().rename_axis('age').reset_index(name='counts')
user_age = user_age.nlargest(10, 'counts')
user_age

In [0]:
data_user_age = data[data["age"].isin(user_age["age"])]
data_user_age

In [0]:
age_usertype_plot = sns.catplot(y="age",
                                kind="count",
                                hue="usertype",
                                orient="h",
                                data=data_user_age)
plt.title("Top 10 ages in relation to usertype")

In [0]:
not_52 = data[data["age"].isin(user_age["age"])]
not_52 = not_52[not_52["age"] != 52]
not_52["age"].unique()

In [0]:
age_usertype_plot = sns.catplot(y="age",
                                kind="count",
                                hue="usertype",
                                orient="h",
                                data=not_52)
plt.title("Top 10 ages in relation to usertype (without Age 52)")

#### Observations
- Most rides in February 2021 are attributed to riders with an age of 52, who are mostly Customers, for a total of about 450k rides.
- Other than the top spot (52 years old), the remaining nine ages are in the late 20s to early 30s and are mostly Subscribers, each with a total of up to about 7,000 rides.

## Bivariate: Start station & user type

In [0]:
data_start_stations = data[data["start station name"].isin(start_stations["Station name"])]
data_start_stations

In [0]:
start_station_usertype_plot = sns.catplot(y="start station name",
                                          kind="count",
                                          hue="usertype",
                                          orient="h",
                                          data=data_start_stations)
plt.title("Subscriber vs. Customer count comparison in the top 10 start stations")

### Observations:
- All stations from the top 10 start stations with the most counts have more customers than subscribers
- 1 Ave & E 68 St has the most Customer users
- W 21 St & 6 Ave has the most Subscriber users
- Grand St & Elizabeth St has the least Customer users
- E 33 St & 1 Ave has the least Subscriber users

## Bivariate: End station & user type

This section looks at the relationship between the 10 end stations with the most trips and the user type at each station.

In [0]:
# Get the data that has an end station value in the top 10 end stations.
data_end_stations = data[data["end station name"].isin(end_stations["Station name"])]
data_end_stations

In [0]:
# Plot the relationship
end_station_usertype_plot = sns.catplot(y="end station name",
                                        kind="count",
                                       hue="usertype",
                                        orient="h",
                                       data=data_end_stations)
plt.title("Trips Made for Each User Type in the Top 10 End Stations")

### Observations:
- Most of the trips ending at the top 10 end stations were made by Customers rather than Subscribers.
- The highest and lowest counts for each of the user types are distributed among different stations
    - 1 Ave & E 68 St has the highest customer user type count.
    - Grand St & Elizabeth St has the lowest customer user type count.
    - W 21 St & 6 Ave has the highest subscriber usertype count.
    - E 33 St & 1 Ave has the lowest subscriber usertype count.

## Bivariate: End station & gender

In [0]:
end_stations = data["end station name"].value_counts().rename_axis('Station name').reset_index(name='counts')
end_stations = end_stations.nlargest(10, 'counts')
end_stations

In [0]:
data_end_station = data[data['end station name'].isin(end_stations['Station name'])]
data_end_station

In [0]:
gender_endstation_plot = sns.catplot(y = "end station name", kind = "count", hue = "gender", data = data_end_station)
plt.title("Top 10 End Stations for Each Gender (With Unknown)")

In [0]:
#Remove entries with '0' gender
data_without_0 = data_end_station.loc[data_end_station['gender'] != 0] 

In [0]:
gender_endstation_plot = sns.catplot(y = "end station name", kind = "count", hue = "gender", data = data_without_0)
plt.title("Top 10 End Stations for Each Gender")

In [0]:
male_counter = data_without_0.loc[data_without_0['gender'] ==1]
male_counter = male_counter["end station name"].value_counts().rename_axis('Station name').reset_index(name='counts')
male_counter

In [0]:
female_counter = data_without_0.loc[data_without_0['gender'] ==2]
female_counter = female_counter["end station name"].value_counts().rename_axis('Station name').reset_index(name='counts')
female_counter

#### Observations:
- All stations from the top 10 end stations have more males than females
- All stations from the top 10 end stations have significantly more riders with no registered gender
- W 21 St & 6 Ave has the most male riders
- 1 Ave & E 68 St has the most female riders
- E 33 St & 1 Ave has the fewest male riders
- Pershing Square North has the fewest female riders

## Multi: Start station, end station & trip duration

This section looks at the average trip duration (in minutes) between the top 10 start and end stations.

In [0]:
# Get the combinations of the origin and destination stations that are not outliers.
od_trips_duration = data[filtered_entries & filtered_dist].groupby(['start station name', 'end station name'], as_index=False)['duration_min'].mean()
od_trips_duration = od_trips_duration.rename(columns={"start station name": "start", "end station name": "end", "duration_min": "avg_duration"}, errors="raise")
od_trips_duration.head()

In [0]:
# Get the combinations that are in the top 10 start and end stations
od_topten_duration = od_trips_duration[od_trips_duration.start.isin(start_stations["Station name"]) & od_trips_duration.end.isin(end_stations["Station name"])]
od_topten_duration

In [0]:
od_matrix_duration = pd.pivot_table(od_topten_duration, index='start', columns='end', values='avg_duration', aggfunc=np.sum, fill_value=0)
od_matrix_duration

In [0]:
# Generate a heatmap for the matrix
od_heatmap_duration = sns.heatmap(od_matrix_duration)
plt.title("The Average Trip Duration in Minutes of the Top 10 Start and End Stations")

### Observations:
- The trip from Broadway & W 60 St to Grand St & Elizabeth St has the highest duration in minutes.
- Trips from Broadway & W 60 St and 1 Ave & E 68 St tend to have durations longer than 20 minutes.

## Multi: Start station, end station & unique users

In [0]:
od_bikeid = data.groupby(['start station name', 'end station name'], as_index=False).agg({"bikeid": "nunique"})
od_bikeid = od_bikeid.rename(columns={"start station name": "start", "end station name": "end", "bikeid": "bikeid_count"}, errors="raise")
od_bikeid.sort_values(by=['bikeid_count'])

In [0]:
od_unique_topten = od_bikeid[od_bikeid.start.isin(start_stations["Station name"]) & od_bikeid.end.isin(end_stations["Station name"])]
od_unique_matrix = pd.pivot_table(od_unique_topten, index='start', columns='end', values='bikeid_count', aggfunc=np.sum, fill_value=0)
od_unique_matrix

In [0]:
od_heatmap = sns.heatmap(od_unique_matrix, cmap="YlGnBu", annot=True, fmt="d")
plt.title("Number of unique bikeid's for the top 10 start-end station pair")

### Observations:
- The highest number of unique bikeid's is on the route from W 33 St & 7 Ave to E 33 St & 1 Ave.
- Multiple start-end pairs have no bikeid's at all, meaning no trips were recorded between them.
- There is very little difference compared to the heatmap of total rides.
- For most pairs, the number of unique bikeid's is the same as the total number of rides.

## Multi: Start station, end station & user type

In [0]:
od_trips = data.groupby(['start station name', 'end station name','usertype'], as_index=False)['bikeid'].count()
od_trips = od_trips.rename(columns={"start station name": "start", "end station name": "end", "bikeid": "total_trips"}, errors="raise")
od_trips.head()

In [0]:
od_trips["od"] = od_trips["start"] + " to " + od_trips["end"]
customer_trips = od_trips[od_trips["usertype"] == "Customer"]
customer_trips

In [0]:
customer_rank_plot = sns.catplot(y="od", x="total_trips", orient="h", kind="bar", data=customer_trips.nlargest(10, "total_trips"))
plt.title("Top 10 trips of Customers")

In [0]:
subscriber_trips = od_trips[od_trips["usertype"] == "Subscriber"]
subscriber_trips

In [0]:
subscriber_rank_plot = sns.catplot(y="od", x="total_trips", orient="h", kind="bar", data=subscriber_trips.nlargest(10, "total_trips"))
plt.title("Top 10 trips of Subscribers")

#### Observations
- Most of the trips were made by Customers.
    - The trip from 1 Ave & E 62 St to 1 Ave & E 68 St has the most Customer trips, with over 140 trips.
- The trip with the most Subscribers is from W 21 St & 6 Ave to 9 Ave & W 22 St, with over 100 rides.
    - Customers on this trip have over 120 rides.

## Multi: Trip distance, duration & gender

In [0]:
#Add distance to data
data['distance_km'] = data.apply(lambda x: calculate_distance(x['start station latitude'], x['start station longitude'],
                                        x['end station latitude'], x['end station longitude']), axis=1)

In [0]:
data['duration_min'] = data['tripduration']/60

In [0]:
# Remove outliers
z_scores = zscore(data["duration_min"])

abs_z_scores = np.abs(z_scores)

filtered_duration = (abs_z_scores < 3)

In [0]:
# Remove outliers
z_scores = zscore(data["distance_km"]) 

abs_z_scores = np.abs(z_scores)

filtered_distance = (abs_z_scores < 3)

In [0]:
dist_dur_gender_plot = sns.jointplot(data=data[filtered_duration & filtered_distance], x = "duration_min",
                                     y = "distance_km",
                                     hue = "gender",
                                     palette={0: "green", 1: "violet", 2:"red"},
                                    alpha =0.5)
dist_dur_gender_plot.set_axis_labels("Trip duration (min)", "Trip distance (km)", labelpad=10)
dist_dur_gender_plot.fig.set_size_inches(12.5, 6.5)
dist_dur_gender_plot.fig.suptitle("Trip Duration and Distance for Each Gender (With Unknown)")

In [0]:
data_without_0 = data.loc[data['gender'] != 0]

In [0]:
# Remove outliers
z_scores = zscore(data_without_0["distance_km"])

abs_z_scores = np.abs(z_scores)

filtered_distance_without_0 = (abs_z_scores < 3)

In [0]:
# Remove outliers
z_scores = zscore(data_without_0["duration_min"])

abs_z_scores = np.abs(z_scores)

filtered_duration_without_0 = (abs_z_scores < 3)

In [0]:
dist_dur_gender_plot_without0 = sns.jointplot(data=data_without_0[filtered_duration_without_0 & filtered_distance_without_0], x = "duration_min",
                                     y = "distance_km",
                                     hue = "gender",
                                     palette={1: "green", 2:"red"},
                                     alpha =0.5)
dist_dur_gender_plot_without0.set_axis_labels("Trip duration (min)", "Trip distance (km)", labelpad=10)
dist_dur_gender_plot_without0.fig.set_size_inches(12.5, 6.5)
dist_dur_gender_plot_without0.fig.suptitle("Trip Duration and Distance for Each Gender")

#### Observations:
- Riders with no registered gender have higher trip distances
- Riders with no registered gender have longer trip durations
- Male riders have higher trip distances than female riders
- Male riders have longer trip durations than female riders

## Multi: Average trip distance per unique user, average duration per unique user & user type

Multiple `bikeid` values appear in trips by both Customers and Subscribers, so the data is grouped by both `bikeid` and `usertype`.
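
One way to check this claim is to count the distinct user types recorded for each `bikeid`. The sketch below uses toy rows; the names `types_per_bike` and `shared_bikes` are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "bikeid":   [1, 1, 2, 3, 3, 3],
    "usertype": ["Customer", "Subscriber", "Customer",
                 "Subscriber", "Subscriber", "Customer"],
})

# Number of distinct user types seen for each bikeid
types_per_bike = toy.groupby("bikeid")["usertype"].nunique()

# bikeids that appear under more than one user type
shared_bikes = types_per_bike[types_per_bike > 1].index.tolist()
print(shared_bikes)
```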

In [0]:
dist_dur_id_type_mean_df = data[filtered_entries & filtered_dist]
dist_dur_id_type_mean_df = dist_dur_id_type_mean_df[dist_dur_id_type_mean_df["distance_km"] != 0]
dist_dur_id_type_mean_df = dist_dur_id_type_mean_df[["bikeid", "usertype", "duration_min", "distance_km"]].groupby(by=["bikeid", "usertype"]).mean()
dist_dur_id_type_mean_df

In [0]:
dist_dur_id_type_mean_plot = sns.jointplot(data=dist_dur_id_type_mean_df,
                                     x = "distance_km",
                                     y = "duration_min",
                                     hue = "usertype",
                                     alpha =0.5,
                                    ylim=(0, 140))
dist_dur_id_type_mean_plot.set_axis_labels("Trip distance (km)", "Trip duration (min)", labelpad=10)
dist_dur_id_type_mean_plot.fig.set_size_inches(12.5, 6.5)
dist_dur_id_type_mean_plot.fig.suptitle("Average trip distance versus average trip duration per user per type of user")

## Multi: Average trip distance per unique user, average duration per unique user, gender & user type

### Gender 0 - Unknown

In [0]:
dist_dur_id_type_mean_df = data[filtered_entries & filtered_dist]
dist_dur_id_type_mean_df = dist_dur_id_type_mean_df[["bikeid", "usertype", "duration_min", "distance_km", "gender"]].groupby(by=["bikeid", "usertype"]).mean()
dist_dur_id_type_mean_df = dist_dur_id_type_mean_df[dist_dur_id_type_mean_df["distance_km"] != 0]

dist_dur_id_type_mean_df_0 = dist_dur_id_type_mean_df[dist_dur_id_type_mean_df["gender"] == 0]

In [0]:
dist_dur_id_type_mean_plot = sns.jointplot(data=dist_dur_id_type_mean_df_0,
                                     x = "distance_km",
                                     y = "duration_min",
                                     hue = "usertype",
                                     alpha =0.5,
                                    ylim=(0, 140))
dist_dur_id_type_mean_plot.set_axis_labels("Trip distance (km)", "Trip duration (min)", labelpad=10)
dist_dur_id_type_mean_plot.fig.set_size_inches(12.5, 6.5)
dist_dur_id_type_mean_plot.fig.suptitle("Average trip distance versus average trip duration per user for each type of user with unknown gender")

### Gender 1 - Male

In [0]:
dist_dur_id_type_mean_df_1 = dist_dur_id_type_mean_df[dist_dur_id_type_mean_df["gender"] == 1]

In [0]:
dist_dur_id_type_mean_plot_1 = sns.jointplot(data=dist_dur_id_type_mean_df_1,
                                     x = "distance_km",
                                     y = "duration_min",
                                     hue = "usertype",
                                     alpha =0.5,
                                    ylim=(0, 140))
dist_dur_id_type_mean_plot_1.set_axis_labels("Trip distance (km)", "Trip duration (min)", labelpad=10)
dist_dur_id_type_mean_plot_1.fig.set_size_inches(12.5, 6.5)
dist_dur_id_type_mean_plot_1.fig.suptitle("Average trip distance versus average trip duration per user for each type of user with male gender")

### Gender 2 - Female

In [0]:
dist_dur_id_type_mean_df_2 = dist_dur_id_type_mean_df[dist_dur_id_type_mean_df["gender"] == 2]

In [0]:
dist_dur_id_type_mean_plot_2 = sns.jointplot(data=dist_dur_id_type_mean_df_2,
                                     x = "distance_km",
                                     y = "duration_min",
                                     hue = "usertype",
                                     alpha =0.5,
                                    ylim=(0, 140))
dist_dur_id_type_mean_plot_2.set_axis_labels("Trip distance (km)", "Trip duration (min)", labelpad=10)
dist_dur_id_type_mean_plot_2.fig.set_size_inches(12.5, 6.5)
dist_dur_id_type_mean_plot_2.fig.suptitle("Average trip distance versus average trip duration per user for each type of user with female gender")