# Project: Communicate Data Findings
> Exploring Ford-GoBike Data

<a id='sources'></a>
## Data Sources

>1. **Name:** result.csv
><ul>   
>    <li><b>Definition:</b> Ford GoBike System - Data</li>
>    <li><b>Source:</b> <a href ="https://www.fordgobike.com/system-data">https://www.fordgobike.com/system-data</a></li>    
>    <li><b>Version:</b> Files from 01.2018 - 02.2019</li>
></ul>


#### Import the required libraries:

In [1]:
#Import important libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
import zipfile
import requests
import geopy.distance
from sklearn.cluster import KMeans

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

ModuleNotFoundError: No module named 'seaborn'

<a id='analysis'></a>
## Explanatory Data Visualization

In [None]:
df = pd.read_csv("result_clean.csv")
df_station_names = pd.read_csv("df_station_names.csv")
#the CSV round trip loses the datatypes again, so restore them here (a typed file format such as HDF5 might avoid this)
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])
    
for col in ["member_birth_year"]:
    df[col] = df[col].astype("int")

for col in ["start_station_id", "end_station_id", "member_birth_year", "bike_id"]:
    df[col] = df[col].astype("str")
    
kmeans = KMeans(n_clusters=3).fit(df_station_names[["station_longitude", "station_latitude"]])
df_station_names["label"] = kmeans.labels_
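#note: KMeans numbers its clusters arbitrarily, so check that this mapping still matches the clusters after every re-fit
mapping = {0: "San Francisco", 1: "San José", 2: "East Bay"}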
df_station_names["label_name"] = df_station_names["label"].map(mapping)
df_station_names.drop_duplicates(subset = ["new_id"], inplace = True)
df = df.merge(df_station_names[["new_id", "label"]], left_on = "start_station_id_new", right_on = "new_id", how = "outer")
df["label_name"] = df["label"].map(mapping)
df["age"] = df["member_birth_year"].apply(lambda x: 2018 - int(x))
df['month_year'] = pd.to_datetime(df["start_time"]).dt.to_period('M')
df['day_month_year'] = pd.to_datetime(df["start_time"]).dt.to_period('D')
df["dayofweek"] = df["start_time"].apply(lambda x: x.dayofweek)
df["start_hr"] = df["start_time"].apply(lambda x: x.hour)
df["end_hr"] = df["end_time"].apply(lambda x: x.hour)
df_age = df.query("age != 2018 and age < 100").copy()
bins = [x for x in range(10,101, 10)]
df_age["age_bins"] = pd.cut(df_age.age, bins=bins,precision=0,include_lowest=False)
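The preparation cell maps the cluster numbers 0, 1 and 2 to area names by hand. Because KMeans assigns cluster numbers arbitrarily, that mapping can silently change when the model is re-fitted. A minimal sketch of a more robust alternative is to derive the names from the cluster centroids. The helper `name_clusters` and the example coordinates are hypothetical. The sketch assumes that the southernmost centroid is San José and that, of the other two, the one further west is San Francisco.

```python
def name_clusters(centroids):
    """Return {cluster_index: area_name} for three (longitude, latitude) centroids.

    Assumption: the southernmost centroid is San José; of the other two,
    the one further west (smaller longitude) is San Francisco.
    """
    if len(centroids) != 3:
        raise ValueError("expected exactly three centroids")
    indices = range(3)
    # Smallest latitude = southernmost cluster.
    san_jose = min(indices, key=lambda i: centroids[i][1])
    rest = [i for i in indices if i != san_jose]
    # Smallest (most negative) longitude = westernmost of the remaining two.
    san_francisco = min(rest, key=lambda i: centroids[i][0])
    east_bay = next(i for i in rest if i != san_francisco)
    return {san_francisco: "San Francisco", san_jose: "San José", east_bay: "East Bay"}


# Illustrative centroids (rough city coordinates, not values from the dataset).
example_centroids = [(-121.89, 37.33), (-122.27, 37.80), (-122.42, 37.77)]
example_mapping = name_clusters(example_centroids)
print(example_mapping)
```

With the fitted model from the cell above, `name_clusters(kmeans.cluster_centers_)` should work the same way, since the model was fitted on the columns in the order longitude, latitude.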

**Source:** <a href = https://kepler.gl/>kepler.gl</a>
> These are the three areas we will take a closer look at: San Francisco, East Bay and San José.

![All Stations](Images/stations_kepler.png)


> San Francisco and East Bay

![Upper Two Cluster](Images/stations_1.png)


> San José

![Lower Cluster](Images/stations_2.png)

### Who are the people using this service? Let's find out. First, we will look at the trip durations.

In [None]:
#https://stackoverflow.com/questions/42741687/python-histogram-outline
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
n = 1
for i, x in enumerate(["San Francisco", "East Bay", "San José"]):
    df_new = df.query(f"label_name == '{x}'")

    bin_size = 100
    bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)   

    plt.hist(df_new.duration_sec, bins = bins, label = x, color = sns.color_palette("viridis")[n], edgecolor = "black", lw = 0.4);
    n += 2
    
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,3500);
plt.title("Frequency of trip durations per area in seconds")
plt.xlabel("Seconds")
plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
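Overlaid histograms make it hard to compare typical durations exactly, so a numeric summary per area can back up the visual impression. The following is a minimal plain-Python sketch of a median per area. The function name and the sample rows are made up for illustration and are not values from the dataset.

```python
from collections import defaultdict
from statistics import median


def median_duration_by_area(trips):
    """Group (area, duration_sec) pairs by area and return each area's median duration."""
    durations = defaultdict(list)
    for area, duration_sec in trips:
        durations[area].append(duration_sec)
    return {area: median(values) for area, values in durations.items()}


# Made-up sample rows for illustration only.
sample_trips = [
    ("San Francisco", 540), ("San Francisco", 660), ("San Francisco", 900),
    ("East Bay", 420), ("East Bay", 480), ("East Bay", 780),
    ("San José", 500), ("San José", 610),
]
medians = median_duration_by_area(sample_trips)
print(medians)
```

On the real data frame, `df.groupby("label_name")["duration_sec"].median()` should give the equivalent result.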

### The distributions look similar across the three areas, although trips in East Bay seem to be slightly shorter. What about gender?

In [None]:
fig, ax = plt.subplots(figsize =(12,5), dpi =110)
sns.countplot(x = "label_name", data = df,  order=df.label_name.value_counts().index,palette="viridis",
              hue = "member_gender", edgecolor = "black", lw = 0.5);
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)

n_1 = ax.patches[0].get_height() + ax.patches[3].get_height() + ax.patches[6].get_height() + ax.patches[9].get_height()
n_2 = ax.patches[1].get_height() + ax.patches[4].get_height() + ax.patches[7].get_height() + ax.patches[10].get_height()
n_3 = ax.patches[2].get_height() + ax.patches[5].get_height() + ax.patches[8].get_height() + ax.patches[11].get_height()

for i, p in enumerate(ax.patches):
    if i in [0,3,6,9]:
        n = n_1
        
    elif i in [1,4,7,10]:
        n = n_2
    
    else:
        n = n_3
    
    ax.annotate('{:10.0f}%'.format(p.get_height()/n*100), (p.get_x()-0.05, p.get_height()+45000))

plt.title("Relative user frequency by gender and area");
plt.xlabel("");
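The annotations above depend on the order in which the bars are drawn, which is easy to get wrong. As a cross-check, the same per-area percentages can be computed directly from the raw rows. This is a minimal plain-Python sketch. The function name and the sample rows are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def gender_share_by_area(rows):
    """Return {area: {gender: percent}}; the percentages of each area sum to 100."""
    pair_counts = Counter(rows)                      # (area, gender) -> number of trips
    area_totals = Counter(area for area, _ in rows)  # area -> number of trips
    shares = {}
    for (area, gender), count in pair_counts.items():
        shares.setdefault(area, {})[gender] = 100 * count / area_totals[area]
    return shares


# Made-up sample rows for illustration only.
sample_rows = (
    [("San Francisco", "Male")] * 7 + [("San Francisco", "Female")] * 3
    + [("East Bay", "Male")] * 3 + [("East Bay", "Female")] * 2
)
shares = gender_share_by_area(sample_rows)
print(shares)
```

On the real data frame, `pd.crosstab(df["label_name"], df["member_gender"], normalize="index") * 100` should produce the same table in one call.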

### Judging by the plot and the relative frequencies, the male share is above 60% in all three areas. But are these users willing to subscribe to the service, or are most of them 'normal' customers?

In [None]:
value_ct = df.user_type.value_counts().iloc[:31]
fig, ax = plt.subplots(figsize = (12,5), dpi = 110)
sns.countplot(x = "user_type", data = df, order=value_ct.index, palette = "viridis", lw = 0.5, edgecolor = "black");
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
for p in ax.patches:
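    #use the actual total instead of hard-coded counts, so the percentages stay correct if the data changes
    ax.annotate('{:10.0f}%'.format(p.get_height()/value_ct.sum()*100), (p.get_x()+0.31, p.get_height()+40000))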
plt.title("Users By Type");
plt.xlabel("");

### And how is the age of the users distributed?

In [None]:
fig, ax = plt.subplots(figsize = (15,4), dpi = 110)
palette = sns.color_palette("viridis")
df_plot = df.query("age != 2018 and age < 73")
#use one shared category order so that bars for the same age line up across the overlaid plots
age_order = sorted(df_plot["age"].unique())
for lbl, name, color in [(0, "San Francisco", palette[1]), (2, "East Bay", palette[3]), (1, "San José", palette[5])]:
    sns.countplot(x = "age", data = df_plot[df_plot["label"] == lbl], order = age_order, color = color, label = name, lw = 0.5, edgecolor = "black");
plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
plt.title("Age distribution per area")
plt.legend();
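The ages used here are derived in the preparation cell as 2018 minus the birth year, and that cell also groups them into decade bins. A minimal plain-Python sketch of that logic follows. It assumes that a birth year of 0 is a placeholder for missing data (which is what the `age != 2018` filter suggests) and that the bins are right-closed, as with the default behaviour of `pd.cut`. The names `REFERENCE_YEAR`, `BIN_EDGES` and `age_bin` are hypothetical.

```python
from bisect import bisect_left

REFERENCE_YEAR = 2018
BIN_EDGES = list(range(10, 101, 10))  # 10, 20, ..., 100


def age_bin(birth_year):
    """Return the right-closed decade interval (low, high] for a birth year, or None."""
    age = REFERENCE_YEAR - birth_year
    if age == REFERENCE_YEAR or age >= 100:
        return None  # assumed placeholder birth year 0, or an implausible age
    i = bisect_left(BIN_EDGES, age)
    if i == 0 or i == len(BIN_EDGES):
        return None  # age <= 10 or age > 100 falls outside every bin
    return (BIN_EDGES[i - 1], BIN_EDGES[i])


print(age_bin(1988))  # age 30 belongs to the interval (20, 30]
print(age_bin(1987))  # age 31 belongs to the interval (30, 40]
```

Note that an age of exactly 30 lands in the lower bin because the upper edge is inclusive.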

### The East Bay age distribution is broader than San Francisco's, and San José has the youngest users on average. The next plots focus on the time components of our data. What about trips per day?

In [None]:
fig, ax = plt.subplots(figsize = (12,4), dpi = 110)
sns.countplot(x = "dayofweek", data = df, palette = "viridis", lw = 0.5, edgecolor = "black");

plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
plt.title("Relative frequency of trips per day")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);
plt.xlabel("")
plt.ylim(0,500000)
for p in ax.patches:
    ax.annotate('{:10.0f}%'.format(p.get_height()/len(df)*100), (p.get_x()+0.1, p.get_height()+20000))
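The tick labels above rely on weekday number 0 being Monday, which holds for pandas `dayofweek` as well as for `datetime.weekday()`. A minimal plain-Python sketch of the share per weekday follows. The function name and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_shares(start_times):
    """Return {day_name: percent of trips starting on that weekday} (Monday is 0)."""
    counts = Counter(ts.weekday() for ts in start_times)
    total = len(start_times)
    return {DAY_NAMES[d]: 100 * counts.get(d, 0) / total for d in range(7)}


# Made-up sample timestamps for illustration only.
sample_starts = [
    datetime(2018, 1, 1, 8, 5),    # a Monday
    datetime(2018, 1, 1, 17, 30),  # the same Monday
    datetime(2018, 1, 3, 8, 45),   # a Wednesday
    datetime(2018, 1, 6, 13, 0),   # a Saturday
]
day_shares = weekday_shares(sample_starts)
print(day_shares)
```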


### It looks like users ride more often on weekdays than on weekends. And when do they start their trips?

In [None]:
fig, ax = plt.subplots(figsize = (11,5), dpi = 110)

sns.countplot(x = "start_hr", data = df, palette = "viridis", ax = ax, lw = 0.5, edgecolor = "black");

plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
plt.title("Relative frequency of trips per starting hour")
plt.xlabel("Starting hour")
plt.ylim(0,400000)
for p in ax.patches:
    ax.annotate('{:10.1f}%'.format(p.get_height()/len(df)*100), (p.get_x()-0.8,p.get_height()+15000))

ax.text(0-1.15, ax.patches[0].get_height()+13000, '{:10.1f}%'.format(ax.patches[0].get_height()/len(df)*100));


### The most frequent starting hours are 8 and 17. People may be using the bikes before and after work, which would make sense because our dataset contains many subscribers of working age. You only subscribe to something if you want to use it regularly, so integration into working or study life is a plausible explanation.

### Now we will see if the average duration is dependent on the weekday.

In [None]:
#creating the legend object for the next plot
legend_obj = []
colors = [sns.color_palette("viridis")[0],
          sns.color_palette("viridis")[1],
          sns.color_palette("viridis")[2],
          sns.color_palette("viridis")[3],
          sns.color_palette("viridis")[4],
          sns.color_palette("viridis")[5],
          (163/255, 199/255, 70/255)]
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for i, s in enumerate(days):
    legend_obj.append(plt.scatter([],[],color = colors[i]));

In [None]:
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot

fig, ax = plt.subplots(figsize = (12,7), dpi = 110)
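#average only the plotted column; a bare .mean() fails on the string columns in recent pandas versions
sns.boxplot(x = "dayofweek", y = "duration_sec", data = df.groupby(["dayofweek", "month_year"], as_index = False)["duration_sec"].mean(), palette = "viridis")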

plt.tight_layout()
#cur_axes = plt.gca()
#cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
plt.xlabel("")
plt.ylabel("Duration in seconds")
plt.title("Average trip duration per day")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);
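Each box in this plot summarises monthly averages, not individual trips: the data is first averaged per weekday and month, and the box then shows how those monthly averages vary. A minimal plain-Python sketch of this two-stage aggregation follows. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def monthly_means_by_weekday(trips):
    """trips: iterable of (dayofweek, month, duration_sec).

    Return {dayofweek: [mean duration of each month]}, i.e. the values
    that a box plot per weekday would summarise.
    """
    # Stage 1: collect durations per (weekday, month) group.
    groups = defaultdict(list)
    for dayofweek, month, duration_sec in trips:
        groups[(dayofweek, month)].append(duration_sec)
    # Stage 2: one mean per group, gathered per weekday.
    per_day = defaultdict(list)
    for (dayofweek, month), values in sorted(groups.items()):
        per_day[dayofweek].append(mean(values))
    return dict(per_day)


# Made-up sample rows for illustration only.
sample = [
    (0, "2018-01", 600), (0, "2018-01", 800),
    (0, "2018-02", 500),
    (5, "2018-01", 1200), (5, "2018-02", 1000), (5, "2018-02", 1400),
]
box_values = monthly_means_by_weekday(sample)
print(box_values)
```

One consequence of this design is that every month carries the same weight in a box, regardless of how many trips it contains.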

### This means that bikes are used less often on weekends, but the average trip lasts longer than during the week!

In [None]:
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (12,7), dpi = 110)
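#average only the plotted column; a bare .mean() fails on the string columns in recent pandas versions
sns.boxplot(x = "dayofweek", y = "duration_sec", data = df.groupby(["dayofweek", "month_year", "label_name"], as_index = False)["duration_sec"].mean(), palette = "viridis", hue = "label_name")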
plt.tight_layout()
#cur_axes = plt.gca()
#cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left =True)
plt.xlabel("")
plt.ylabel("Duration in seconds")
plt.title("Average trip duration per day per area")
ax.set(xticklabels=["Mon", "Tue", "Wed","Thu", "Fri","Sat", "Sun"]);
box = ax.get_position()
ax.set_position([box.x0, box.y0,box.width * 0.8,box.height])

### This trend applies to all areas. We can also see that users in San Francisco have, on average, the longest trips, followed by East Bay and then San José. And what is the average starting hour per day?

In [None]:
fig, ax = plt.subplots(figsize = (12,7), dpi = 110)
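#average only the plotted column; a bare .mean() fails on the string columns in recent pandas versions
sns.boxplot(x = "dayofweek", y = "start_hr", data = df.groupby(["dayofweek","month_year"], as_index =False)["start_hr"].mean(), palette = "viridis")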
plt.tight_layout()
#cur_axes = plt.gca()
#cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
plt.xlabel("")
plt.ylabel("Starting hour")
plt.title("Average starting hour per day")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);

In [None]:
fig, ax = plt.subplots(figsize = (12,7), dpi = 110)
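#average only the plotted column; a bare .mean() fails on the string columns in recent pandas versions
sns.boxplot(x = "dayofweek", y = "start_hr", data = df.groupby(["dayofweek", "month_year", "label_name"], as_index = False)["start_hr"].mean(), palette = "viridis", hue = "label_name")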

plt.tight_layout()
#cur_axes = plt.gca()
#cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
plt.xlabel("")
plt.ylabel("Starting hour")
plt.title("Average starting hour per day per area")

ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);


### Looking at each area is interesting: users in East Bay and San José not only have shorter trips on average, they also start their trips later than users in San Francisco.

### For the final visualizations, let's visualize the trips.

### First, we will look at San Francisco. To get some insight, the visualization only contains routes with more than 1000 trips:

![San Francisco Trips with more than 1000 trips](Images/san_francisco_1000.png)

### We can see that most of the trips are close to the beach. Next is East Bay, showing routes with more than 500 trips:

![East Bay Trips with more than 500 trips](Images/east_bay_500.png)

### Here the main routes are much more spread out than in San Francisco. It also looks like people use the service to cover short distances quickly. For San José, we will look at routes with more than 200 trips:

![San Jose Trips with more than 200 trips](Images/san_jose_200.png)

### In San José, the routes are spread over most of the stations.