## **Problem Statement**
- 🔍 Analyze users' bike usage patterns with respect to day and week, weather, and ride distance, and build visualizations of them.
- 🔮 Suggest approaches for building a prediction model that optimizes bike allocation, based on those visualizations.

In [None]:
# Imports
import pandas as pd
import numpy as np
import plotly.graph_objects as go

from datetime import datetime
from sklearn.preprocessing import MinMaxScaler

## Understanding the weather ⛅️
The weather data gives us insight into ***how weather events in a particular region affect bike usage on a daily basis***. Before diving deeper into bike usage patterns, we perform exploratory data analysis on the weather data.

In [None]:
df_weather = pd.read_csv("/kaggle/input/sf-bay-area-bike-share/weather.csv")

# We only require date, zip_code, and events for our analysis
df_weather = df_weather[["date", "zip_code", "events"]].copy()  # copy so later column assignments do not act on a view

# Convert date to datetime format for uniform analytic exercises
df_weather["date"] = pd.to_datetime(df_weather["date"])
df_weather["dd"] = df_weather["date"].dt.day
df_weather["mm"] = df_weather["date"].dt.month
df_weather["yyyy"] = df_weather["date"].dt.year

# Sort the weather dataframe by zip_code ascending and date ascending
df_weather.sort_values(by=["zip_code", "date"], inplace=True)
df_weather = df_weather[["zip_code", "dd", "mm", "yyyy", "events"]]

# Fill missing values in events with "Sunny"
df_weather["events"] = df_weather["events"].fillna("Sunny")

# Normalize noisy labels, i.e. 'rain' -> 'Rain', 'Fog-Rain' -> 'Rain', 'Rain-Thunderstorm' -> 'Rain'
# Also create zip_code_str of type 'str' for visualization
df_weather["events"] = df_weather["events"].replace(["rain", "Fog-Rain", "Rain-Thunderstorm"], ["Rain", "Rain", "Rain"])
df_weather["zip_code_str"] = df_weather["zip_code"].astype(str)
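
As a minimal sketch of the normalization in the cell above, the snippet below applies the same two steps to a toy Series. The sample values are made up for illustration; only the label names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the `events` column; the values are illustrative only.
events = pd.Series([None, "rain", "Fog-Rain", "Rain-Thunderstorm", "Fog", "Rain"])

# Days with no recorded event are treated as Sunny, as in the cell above.
events = events.fillna("Sunny")

# Collapse every rain-like label into a single "Rain" category.
rain_like = {"rain": "Rain", "Fog-Rain": "Rain", "Rain-Thunderstorm": "Rain"}
events = events.replace(rain_like)

print(events.value_counts())
```

After this step only three categories remain (`Sunny`, `Rain`, `Fog`), which is what the colour mapping in the later plots relies on.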

After preprocessing, the weather data looks like this:

In [None]:
df_weather.head(10)

Next, we group the weather data to count the number of weather events per ZIP code for each year:

In [None]:
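# Aggregate the weather based on number of events per year per zip_code
# The counts column is named explicitly so the lookup below works across pandas versions
df_weather_eda = (
    df_weather[["yyyy", "zip_code_str", "events"]]
    .value_counts()
    .rename("count")
    .to_frame()
    .sort_values(by=["yyyy", "zip_code_str"])
)

In [None]:
# Data for visualization
weather_data = {}

# Keys are (year, ZIP code, event) tuples; values are day counts
for k, v in df_weather_eda["count"].items():
    if k[0] not in weather_data:
        weather_data[k[0]] = {}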
    if k[2] in weather_data[k[0]]:
        weather_data[k[0]][k[2]]["zip_code"] += [k[1]]
        weather_data[k[0]][k[2]]["count"] += [v]
    else:
        weather_data[k[0]][k[2]] = {
            "zip_code": [k[1]],
            "count": [v]
        }
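
The nested dictionary above is one way to shape the counts for plotting. As an alternative sketch (not the approach used in this notebook), the same counts can be laid out as one table per year with `groupby(...).size()` and `unstack`. The toy frame below is made up and only mirrors the column names used above.

```python
import pandas as pd

# Toy data mirroring the columns used above; values are illustrative only.
toy = pd.DataFrame({
    "yyyy": [2013, 2013, 2013, 2014, 2014],
    "zip_code_str": ["94107", "94107", "94063", "94107", "94107"],
    "events": ["Sunny", "Rain", "Sunny", "Sunny", "Sunny"],
})

# Count days per (year, ZIP code, event), then spread events into columns.
counts = (
    toy.groupby(["yyyy", "zip_code_str", "events"])
    .size()
    .unstack("events", fill_value=0)
)

# One small table per year: rows are ZIP codes, columns are events.
for year, table in counts.groupby(level="yyyy"):
    print(year)
    print(table.droplevel("yyyy"))
```

Each per-year table could then be passed column by column to `go.Bar`, one trace per event.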

Once the data is prepared for visualization, we plot bar graphs depicting the yearly weather patterns (i.e. **Sunny** 🌞, **Rain** 🌧, **Fog** ☁️) for every ZIP code.

In [None]:
colour = {"Sunny": "darkorange", "Rain": "mediumturquoise", "Fog": "grey"}
for year, events in weather_data.items():
    data = []
    for event, event_data in events.items():
        data += [
            go.Bar(
                x=event_data["zip_code"],
                y=event_data["count"],
                name=event,
                marker_color=colour[event]
            )
        ]
    fig = go.Figure(data=data)
    fig.update_layout(barmode='group',
                      title=f"{year} Weather Events",
                      xaxis={"title": "ZIP Codes"},
                      yaxis={"title": "No. of days"})
    fig.show()

Based on the data above, we can easily conclude that...
> "It is always (mostly) sunny in SF bay area 🌞"

## Understanding the stations 🚏
Now that we have a clearer understanding of the distribution of the weather data, we can move on to the patterns in the bike station dataset.

📝 **NOTE:** Here we focus on *exploring the distribution of dock capacity per station*.

In [None]:
df_station = pd.read_csv("/kaggle/input/sf-bay-area-bike-share/station.csv")

# Choose relevant columns
df_station = df_station[["id", "name", "city", "dock_count"]].copy()  # copy so the in-place sort below does not act on a view

# Sort the values based on city and dock_count in ascending order
df_station.sort_values(by=["city", "dock_count"], inplace=True)

Once we have performed the necessary pre-processing, the station data looks like this:

In [None]:
df_station.head(10)

The dock count has the following summary statistics:

In [None]:
pd.DataFrame(df_station.dock_count.describe())

Grouping by `dock_count`, we count how many stations share each dock capacity.

In [None]:
pd.DataFrame(df_station.dock_count.value_counts()).set_axis(["No. of stations"], axis="columns")

The following are the no. of stations present in each city.

In [None]:
pd.DataFrame(df_station.city.value_counts()).set_axis(["No. of stations"], axis="columns")

Now we visualize the dock capacity of each station in every city using pie charts.

In [None]:
# Data for visualization
cities = df_station.city.unique()

for city in cities:
    _df = df_station[df_station.city == city]
    data = [go.Pie(labels=_df.name.to_list(), values=_df.dock_count.to_list(), hole=.4)]
    fig = go.Figure(data=data)
    fig.update_traces(hoverinfo='label+percent', textinfo='value')
    fig.update_layout(title=f"Max. dock capacity in {city.strip()}")
    fig.show()

## Understanding the trips 🛣
Now that we understand the patterns in the station and weather data, we turn to the individual records in the trips data. This dataset should help us make inferences on:
- 🚲 Patterns for total trips per bike.
- 🚉 Patterns for average trip duration and total trip duration per weather event for each bike.
- 🚉 Patterns for total no. of trips, average daily trip duration, and total trip duration per route.
- ⛅️ Patterns for average daily trips and total trips per weather event per start station.
- 🚏 Patterns for popular route distribution based on number of trips and total trip duration.

📝 **NOTE:** Deeper patterns can be found in the trips data for every city, but to limit the scope of this research we only look at trips that start and end within San Francisco 🌉.

In [None]:
df_trip = pd.read_csv("/kaggle/input/sf-bay-area-bike-share/trip.csv")

# Filter the trips for only the ZIP Codes where weather data is available
zip_code = df_weather.zip_code_str.unique()
df_trip = df_trip[df_trip["zip_code"].isin(zip_code)]

# Choose relevant columns
df_trip = df_trip[["bike_id", "duration", "start_date", "start_station_id", "end_station_id", "subscription_type", "zip_code"]]

# Normalize the duration to minutes and convert start_date to datetime type
df_trip["duration"] = df_trip["duration"] // 60
print(df_trip["duration"].quantile(0.9))  # 90th-percentile trip duration in minutes, for reference
df_trip = df_trip[df_trip["duration"] <= 60 * 24].copy()  # Drop trips longer than 24 hours
df_trip["start_date"] = pd.to_datetime(df_trip["start_date"])

# Create individual values for date for efficient querying
df_trip["dd"] = df_trip["start_date"].dt.day
df_trip["mm"] = df_trip["start_date"].dt.month
df_trip["yyyy"] = df_trip["start_date"].dt.year

# Drop 'start_date' column since we have already split it into 'dd', 'mm', 'yyyy'
df_trip.drop(columns=["start_date"], inplace=True)

# Join the weather data with the trip data to gain contextual insights on the effects of weather 
df_trip = pd.merge(df_trip, 
                   df_weather, 
                   left_on=["zip_code", "dd", "mm", "yyyy"], 
                   right_on=["zip_code_str", "dd", "mm", "yyyy"], 
                   how="left").drop(["zip_code_y"], axis=1)

# Filtering trips other than San Francisco (SF)
# NOTE: There are 95272 trips overall, out of which:
# 1. 83857 trips originated from one of the stations in SF
# 2. 83858 trips ended at one of the stations of SF
# 3. 83856 trips were between stations in SF
# 4. one trip originated from one of the stations in SF but did not end in one of the stations in SF
# 5. two trips originated from the stations outside of SF but ended in one of the stations in SF
sf_station_ids = df_station[df_station["city"] == "San Francisco"]["id"].values
df_trip = df_trip[df_trip["start_station_id"].isin(sf_station_ids)]
df_trip = df_trip[df_trip["end_station_id"].isin(sf_station_ids)]
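
Because the join in the cell above is a left join, trips on days without a matching weather row keep a missing `events` value. A minimal sketch of how one might check for that, using toy frames with made-up values and the `indicator` option of `merge`:

```python
import pandas as pd

# Toy trips and weather keyed by ZIP code and date parts; values are illustrative.
trips = pd.DataFrame({
    "bike_id": [1, 2, 3],
    "zip_code": ["94107", "94107", "94063"],
    "dd": [1, 2, 1], "mm": [9, 9, 9], "yyyy": [2013, 2013, 2013],
})
weather = pd.DataFrame({
    "zip_code": ["94107", "94107"],
    "dd": [1, 2], "mm": [9, 9], "yyyy": [2013, 2013],
    "events": ["Sunny", "Rain"],
})

# Both ZIP code columns should share a dtype (strings here) for the keys to line up.
merged = trips.merge(weather, on=["zip_code", "dd", "mm", "yyyy"],
                     how="left", indicator=True)

# "left_only" marks trips that found no weather row for that ZIP code and day.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "trip(s) without weather data")
```

Rows flagged this way would be silently dropped by later `groupby` calls on `events`, so it can be worth knowing how many there are.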

The resulting trips data looks roughly like this:

In [None]:
df_trip.head(10)

In [None]:
subs = df_trip["subscription_type"].value_counts()
subs_per_event = df_trip.groupby(["subscription_type"])["events"].value_counts()
subs_per_event_per_bike = df_trip.groupby(["subscription_type", "events"])["bike_id"].value_counts()

ids = []
labels = []
parents = []
values = []

for k, v in subs.items():
    ids += [k.strip()]
    labels += [k.strip()]
    parents += [""]
    values += [v]
    
for k, v in subs_per_event.items():
    ids += ["|".join(k)]
    labels += [k[1]]
    parents += [k[0]]
    values += [v]
    
for k, v in subs_per_event_per_bike.items():
    ids += ["{}|{}|{}".format(*k)]
    labels += [f"B-{k[2]}"]
    parents += ["|".join(k[:2])]
    values += [v]

fig = go.Figure(
    go.Sunburst(
        ids=ids,
        labels=labels,
        parents=parents,
        values=values,
        branchvalues="total",
        maxdepth=3
    )
)
fig.update_layout(margin=dict(t=30, l=30, r=30, b=30),
                  title="Total trips per bike per event per subscription")
fig.show()

The data is unevenly distributed *(i.e. **91759 Subscribers** and **3513 Customers**)*, so it is biased towards Subscribers. Regardless of `subscription_type`, most trips occur on `Sunny` days, followed by `Rain` and lastly `Fog`.
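
Because raw counts are dominated by Subscribers, comparing shares within each subscription type can be more informative. A minimal sketch on toy data (the rows are made up; `normalize=True` is a standard `value_counts` option):

```python
import pandas as pd

# Toy trips with an 8:2 imbalance between subscription types; illustrative only.
toy = pd.DataFrame({
    "subscription_type": ["Subscriber"] * 8 + ["Customer"] * 2,
    "events": ["Sunny"] * 6 + ["Rain"] * 2 + ["Sunny", "Rain"],
})

# Share of each weather event within each subscription type.
shares = (
    toy.groupby("subscription_type")["events"]
    .value_counts(normalize=True)
)
print(shares)
```

Within-group shares sum to 1 for each subscription type, so the two groups can be compared directly despite their different sizes.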


Similar to the visualization above, we now explore the distribution of trips with respect to total trip duration and average trip duration per bike.
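
The sunburst cells in this section repeat the same list-building logic. As a sketch of how it could be factored out, the hypothetical helper `sunburst_inputs` below (not part of the original notebook) builds the `ids`, `labels`, `parents`, and `values` lists from three aggregated Series. The toy frame is made up. Note that `branchvalues="total"` expects each parent's value to equal the sum of its children, which holds for counts and sums but not for means.

```python
import pandas as pd

def sunburst_inputs(level1, level2, level3):
    """Build ids/labels/parents/values lists from three aggregated Series.

    level1 is indexed by subscription type, level2 by (type, event),
    and level3 by (type, event, bike_id).
    """
    ids, labels, parents, values = [], [], [], []
    for key, value in level1.items():
        ids.append(key)
        labels.append(key)
        parents.append("")  # top-level sectors have no parent
        values.append(value)
    for key, value in level2.items():
        ids.append("|".join(key))
        labels.append(key[1])
        parents.append(key[0])
        values.append(value)
    for key, value in level3.items():
        ids.append("{}|{}|{}".format(*key))
        labels.append(f"B-{key[2]}")
        parents.append("|".join(key[:2]))
        values.append(value)
    return ids, labels, parents, values

# Toy trips; values are illustrative only.
toy = pd.DataFrame({
    "subscription_type": ["Subscriber", "Subscriber", "Customer"],
    "events": ["Sunny", "Rain", "Sunny"],
    "bike_id": [10, 11, 10],
    "duration": [5, 7, 30],
})
ids, labels, parents, values = sunburst_inputs(
    toy.groupby("subscription_type")["duration"].sum(),
    toy.groupby(["subscription_type", "events"])["duration"].sum(),
    toy.groupby(["subscription_type", "events", "bike_id"])["duration"].sum(),
)
print(ids)
```

The four returned lists can be passed straight to `go.Sunburst(ids=..., labels=..., parents=..., values=...)`.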

In [None]:
subs = df_trip.groupby(["subscription_type"])["duration"].sum()
subs_per_event = df_trip.groupby(["subscription_type", "events"])["duration"].sum()
subs_per_event_per_bike = df_trip.groupby(["subscription_type", "events", "bike_id"])["duration"].sum()

ids = []
labels = []
parents = []
values = []

for k, v in subs.items():
    ids += [k.strip()]
    labels += [k.strip()]
    parents += [""]
    values += [v]
    
for k, v in subs_per_event.items():
    ids += ["|".join(k)]
    labels += [k[1]]
    parents += [k[0]]
    values += [v]
    
for k, v in subs_per_event_per_bike.items():
    ids += ["{}|{}|{}".format(*k)]
    labels += [f"B-{k[2]}"]
    parents += ["|".join(k[:2])]
    values += [v]

fig = go.Figure(
    go.Sunburst(
        ids=ids,
        labels=labels,
        parents=parents,
        values=values,
        branchvalues="total",
        maxdepth=3
    )
)
fig.update_layout(margin=dict(t=30, l=30, r=30, b=30),
                  title="Total trip duration per bike per event per subscription")
fig.show()

In [None]:
subs = df_trip.groupby(["subscription_type"])["duration"].mean()
subs_per_event = df_trip.groupby(["subscription_type", "events"])["duration"].mean()
subs_per_event_per_bike = df_trip.groupby(["subscription_type", "events", "bike_id"])["duration"].mean()

ids = []
labels = []
parents = []
values = []

for k, v in subs.items():
    ids += [k.strip()]
    labels += [k.strip()]
    parents += [""]
    values += [v]
    
for k, v in subs_per_event.items():
    ids += ["|".join(k)]
    labels += [k[1]]
    parents += [k[0]]
    values += [v]
    
for k, v in subs_per_event_per_bike.items():
    ids += ["{}|{}|{}".format(*k)]
    labels += [f"B-{k[2]}"]
    parents += ["|".join(k[:2])]
    values += [v]

fig = go.Figure(
    go.Sunburst(
        ids=ids,
        labels=labels,
        parents=parents,
        values=values,
        maxdepth=3
    )
)
fig.update_layout(margin=dict(t=30, l=30, r=30, b=30),
                  title="Avg. trip duration per bike per event per subscription")
fig.show()

In [None]:
route_duration = df_trip.groupby(["yyyy", "events", "start_station_id", "end_station_id"])["duration"].sum()

stations = df_station[df_station["id"].isin(sf_station_ids)][["name", "id"]].sort_values(by=["id"])
station_names = stations.name.values
station_ids = stations.id.values

route_data = {}
for k, v in route_duration.items():
    if not k[0] in route_data:
        route_data[k[0]] = {}
    if not k[1] in route_data[k[0]]:
        route_data[k[0]][k[1]] = None
        
for year, v in route_data.items():
    for event, data in v.items():
        starts = []
        for start in station_ids:
            ends = []
            for end in station_ids:
                try:
                    ends += [route_duration[(year,event,start,end)]]
                except KeyError:
                    ends += [None]
            starts += [ends]
        route_data[year][event] = np.transpose(np.array(starts))
        
for year, year_data in route_data.items():
    for event, data in year_data.items():
        fig = go.Figure(
            go.Heatmap(
                z=data,
                y=list(station_names),
                x=list(station_names),
                type='heatmap',
                colorscale='Viridis'
            )
        )
        fig.update_layout(
            title=f"{year} total trip duration per route ({event})",
            xaxis={"title": "Start Station"},
            yaxis={"title": "End Station"},
            margin=dict(t=30, r=10, b=10, l=10), 
            width=700, 
            height=600,
            autosize=False
        )
        fig.show()
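
The nested loops above fill a station-by-station matrix one cell at a time. As an alternative sketch (not what the cells in this notebook do), `pivot_table` followed by `reindex` yields the same layout, with `NaN` where a route has no trips. The toy frame and station ids below are made up.

```python
import pandas as pd

# Toy trips between three hypothetical stations; values are illustrative only.
toy = pd.DataFrame({
    "start_station_id": [1, 1, 2, 3],
    "end_station_id": [2, 2, 1, 3],
    "duration": [10, 20, 5, 7],
})
station_ids = [1, 2, 3]

# Rows are end stations and columns are start stations, matching the
# transposed layout that the heatmaps above use.
matrix = (
    toy.pivot_table(index="end_station_id", columns="start_station_id",
                    values="duration", aggfunc="sum")
    .reindex(index=station_ids, columns=station_ids)
)
print(matrix)
```

Swapping `aggfunc="sum"` for `"mean"` or `"count"` would give the average-duration and trip-count variants of the same matrix.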

In [None]:
route_duration = df_trip.groupby(["yyyy", "events", "start_station_id", "end_station_id"])["duration"].mean()

stations = df_station[df_station["id"].isin(sf_station_ids)][["name", "id"]].sort_values(by=["id"])
station_names = stations.name.values
station_ids = stations.id.values

route_data = {}
for k, v in route_duration.items():
    if not k[0] in route_data:
        route_data[k[0]] = {}
    if not k[1] in route_data[k[0]]:
        route_data[k[0]][k[1]] = None
        
for year, v in route_data.items():
    for event, data in v.items():
        starts = []
        for start in station_ids:
            ends = []
            for end in station_ids:
                try:
                    ends += [route_duration[(year,event,start,end)]]
                except KeyError:
                    ends += [None]
            starts += [ends]
        route_data[year][event] = np.transpose(np.array(starts))
        
for year, year_data in route_data.items():
    for event, data in year_data.items():
        fig = go.Figure(
            go.Heatmap(
                z=data,
                y=station_names,
                x=station_names,
                type='heatmap',
                colorscale='Viridis'
            )
        )
        fig.update_layout(
            title=f"{year} avg. trip duration per route ({event})",
            xaxis={"title": "Start Station"},
            yaxis={"title": "End Station"},
            margin=dict(t=30, r=10, b=10, l=10), 
            width=700, 
            height=600,
            autosize=False
        )
        fig.show()

In [None]:
route_count = df_trip.groupby(["yyyy", "events", "start_station_id"])["end_station_id"].value_counts()

stations = df_station[df_station["id"].isin(sf_station_ids)][["name", "id"]].sort_values(by=["id"])
station_names = stations.name.values
station_ids = stations.id.values

route_data = {}
for k, v in route_count.items():
    if not k[0] in route_data:
        route_data[k[0]] = {}
    if not k[1] in route_data[k[0]]:
        route_data[k[0]][k[1]] = None
        
for year, v in route_data.items():
    for event, data in v.items():
        starts = []
        for start in station_ids:
            ends = []
            for end in station_ids:
                try:
                    ends += [route_count[(year,event,start,end)]]
                except KeyError:
                    ends += [None]
            starts += [ends]
        route_data[year][event] = np.transpose(np.array(starts))
        
for year, year_data in route_data.items():
    for event, data in year_data.items():
        fig = go.Figure(
            go.Heatmap(
                z=data,
                y=station_names,
                x=station_names,
                type='heatmap',
                colorscale='Viridis'
            )
        )
        fig.update_layout(
            title=f"{year} trips per route ({event})",
            xaxis={"title": "Start Station"},
            yaxis={"title": "End Station"},
            margin=dict(t=30, r=10, b=10, l=10), 
            width=700, 
            height=600, 
            autosize=False
        )
        fig.show()

In [None]:
avg_sstation_trips = df_trip.groupby(["start_station_id", "events", "yyyy", "mm", "dd"])["duration"].mean()

sstation_data = {}
for k, v in avg_sstation_trips.items():
    if not k[0] in sstation_data:
        sstation_data[k[0]] = {}
    if k[1] in sstation_data[k[0]]:
        sstation_data[k[0]][k[1]]["date"] += [f"{k[4]}-{k[3]}-{k[2]}"]
        sstation_data[k[0]][k[1]]["avg"] += [v]
    else:
        sstation_data[k[0]][k[1]] = {
            "date": [f"{k[4]}-{k[3]}-{k[2]}"],
            "avg": [v]
        }

colour = {"Sunny": "darkorange", "Rain": "mediumturquoise", "Fog": "grey"}
for station_id, event_data in sstation_data.items():
    fig = go.Figure()
    category_array = []
    for event, data in event_data.items():
        fig.add_trace(
            go.Scatter(
                x=data["date"],
                y=data["avg"],
                name=f'{event}',
                mode="markers",
                marker_color=colour[event]
            )
        )
        category_array += data["date"]
    category_array.sort(key=lambda date: datetime.strptime(date, '%d-%m-%Y'))
    fig.update_xaxes(rangeslider_visible=True, categoryorder="array", categoryarray=category_array)
    station_name = df_station[df_station.id == station_id].name.values[0]
    fig.update_layout(yaxis={"title": f"Avg. duration of bikes used ({station_name})"})
    fig.show()

In [None]:
max_sstation_trips = df_trip.groupby(["start_station_id", "events", "yyyy", "mm", "dd"])["duration"].sum()
        
max_sstation_data = {}
for k, v in max_sstation_trips.items():
    if not k[0] in max_sstation_data:
        max_sstation_data[k[0]] = {}
    if k[1] in max_sstation_data[k[0]]:
        max_sstation_data[k[0]][k[1]]["date"] += [f"{k[4]}-{k[3]}-{k[2]}"]
        max_sstation_data[k[0]][k[1]]["sum"] += [v]
    else:
        max_sstation_data[k[0]][k[1]] = {
            "date": [f"{k[4]}-{k[3]}-{k[2]}"],
            "sum": [v]
        }
        
colour = {"Sunny": "darkorange", "Rain": "mediumturquoise", "Fog": "grey"}
for station_id, event_data in max_sstation_data.items():
    fig = go.Figure()
    category_array = []
    for event, data in event_data.items():
        fig.add_trace(
            go.Scatter(
                x=data["date"],
                y=data["sum"],
                name=f'{event}',
                mode="markers",
                marker_color=colour[event]
            )
        )
        category_array += data["date"]
    category_array.sort(key=lambda date: datetime.strptime(date, '%d-%m-%Y'))
    fig.update_xaxes(rangeslider_visible=True, categoryorder="array", categoryarray=category_array)
    station_name = df_station[df_station.id == station_id].name.values[0]
    fig.update_layout(yaxis={"title": f"Total duration of bikes used ({station_name})"})
    fig.show()

## Understanding the status of bike station 🚏
Now that we have a deeper understanding of the trip dataset, we make some more detailed inferences about the docking status of the bike stations.

🚨 **ALERT:** This dataset is huge, with ***~71 million records***, because it works like a cron job that polls each bike station's current dock status.
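
Given the size of this file, loading it in one call may exhaust memory on smaller machines. A minimal sketch, assuming the columns `station_id`, `bikes_available`, `docks_available`, and `time` used in this notebook, reads the file in chunks and keeps only the last reading per station per hour. The in-memory CSV and the chunk size are made up for illustration; in practice the path to `status.csv` and a much larger chunk size would be used.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for status.csv; values are illustrative only.
csv_text = """station_id,bikes_available,docks_available,time
2,5,10,2013-08-29 12:06:01
2,4,11,2013-08-29 12:07:01
2,4,11,2013-08-29 13:01:01
3,7,8,2013-08-29 12:06:01
"""

hourly_parts = []
# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, parse_dates=["time"]):
    chunk["hour"] = chunk["time"].dt.floor("h")
    # Keep the last reading per station and hour within this chunk.
    hourly_parts.append(
        chunk.drop_duplicates(subset=["station_id", "hour"], keep="last")
    )

# A station-hour can straddle two chunks, so deduplicate once more at the end.
hourly = (
    pd.concat(hourly_parts, ignore_index=True)
    .drop_duplicates(subset=["station_id", "hour"], keep="last")
)
print(hourly[["station_id", "hour", "docks_available"]])
```

Only the reduced hourly frame is held in memory at the end, which is the same granularity the cell below works with.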

In [None]:
df_status = pd.read_csv("/kaggle/input/sf-bay-area-bike-share/status.csv")

# Convert 'time' to pandas datetime type
df_status["time"] = pd.to_datetime(df_status["time"])

# Splitting the 'time' into 'dd', 'mm', 'yyyy', 'hh'
df_status["dd"] = df_status["time"].dt.day
df_status["mm"] = df_status["time"].dt.month
df_status["yyyy"] = df_status["time"].dt.year
df_status["hh"] = df_status["time"].dt.hour

# Dropping 'time' and 'bikes_available'
df_status.drop(columns=["time", "bikes_available"], inplace=True)

# We take distinct records for each hourly data 
# since in the current format the dataset is recorded every minute
df_status.drop_duplicates(inplace=True)
df_status.drop_duplicates(subset=["station_id", "dd", "mm", "yyyy", "hh"], inplace=True, keep="last")

# Filter the status for stations in SF
df_status = df_status[df_status.station_id.isin(sf_station_ids)]

# Join the trip data and bike status data to gain contextual insights on the effects of weather on bike station status
df_trip = df_trip[["start_station_id", "dd", "mm", "yyyy", "events"]].drop_duplicates()
df_status = pd.merge(df_status, 
                     df_trip, 
                     left_on=["station_id", "dd", "mm", "yyyy"], 
                     right_on=["start_station_id", "dd", "mm", "yyyy"], 
                     how="left").drop(["start_station_id"], axis=1)
df_status.drop_duplicates(inplace=True)

In [None]:
df_status.head(10)

In [None]:
avg_docks_vacant = df_status.groupby(["station_id", "yyyy", "mm", "dd"])["docks_available"].mean()

In [None]:
businesses = df_status.station_id.unique()

business_pulse = {}
for business in businesses:
    subset = df_status[df_status.station_id == business][["dd", "mm", "yyyy"]]
    business_days = list(subset.drop_duplicates().itertuples(index=False))
    business_pulse[business] = {
        "xaxis": {
            year: [f"{day.dd}-{day.mm}-{day.yyyy}" for day in business_days if day.yyyy == year]
            for year in subset.yyyy.unique()
        },
        "yaxis": {
            year: [avg_docks_vacant[(business,day.yyyy,day.mm,day.dd)] for day in business_days if day.yyyy == year]
            for year in subset.yyyy.unique()
        }
    }
    all_dock_count = [avg_docks_vacant[(business,day.yyyy,day.mm,day.dd)] for day in business_days]
    business_pulse[business]["hline_y_abs"] = sum(all_dock_count) // len(all_dock_count)
    business_pulse[business]["hline_y"] = sum(all_dock_count) / len(all_dock_count)

In [None]:
for business, pulse in business_pulse.items():
    data = [
        go.Scatter(x=xdata,y=pulse["yaxis"][year],mode='lines',name=f"{year}")
        for year, xdata in pulse["xaxis"].items()
    ]
    fig = go.Figure(data=data)
    fig.update_xaxes(rangeslider_visible=True)
    fig.add_hline(y=pulse["hline_y"],
                  line_dash="dot",
                  annotation_text=f"Baseline Average = {pulse['hline_y_abs']}/Hour", 
                  annotation_position="bottom right")
    fig.update_layout(title=f"Avg. Hourly Active Usage (HAU) for Bikes in {df_station[df_station['id'] == business].name.values[0]}",
                      yaxis={"title": "HAU"})
    fig.show()

In [None]:
df_pulse_avg = [
    {
        "station_name": df_station[df_station.id == station_id].name.values[0],
        "avg_bikes_used_per_hour_per_day": v["hline_y"],
        "is_performing_above_avg": v["hline_y"] > (df_station[df_station.id == station_id].dock_count.values[0] / 2)
    }
    for station_id, v in business_pulse.items()
]
pd.DataFrame(df_pulse_avg)

In [None]:
df_data = []
for station_id, v in business_pulse.items():
    bikes_available = []
    
    station_name = df_station[df_station.id == station_id].name.values[0]
    dock_count = df_station[df_station.id == station_id].dock_count.values[0]
    
    for year, aba in v["yaxis"].items():
        bikes_available += aba
    bikes_available = np.array(bikes_available) / dock_count
    
    min_max_scaler = MinMaxScaler()
    norm_bikes_available = min_max_scaler.fit_transform(bikes_available.reshape(-1,1))
    
    bins = [0, 0.5, 0.7]
    names = ["Bad", "Normal", "Good"]
    d = dict(enumerate(names, 1))
    
    cat_bikes_available = np.vectorize(d.get)(np.digitize(norm_bikes_available[:,0], bins))
    all_count = len(cat_bikes_available)
    
    df_data += [
        {
            "station_name": station_name,
            "underperformed_days": (cat_bikes_available == "Bad").sum(),
            "normal_days": (cat_bikes_available == "Normal").sum(),
            "optimal_days": (cat_bikes_available == "Good").sum(),
            "underperformed_days_percent": ((cat_bikes_available == "Bad").sum() / all_count) * 100,
            "normal_days_percent": ((cat_bikes_available == "Normal").sum() / all_count) * 100,
            "optimal_days_percent": ((cat_bikes_available == "Good").sum() / all_count) * 100
        }
    ]

pd.DataFrame(df_data).sort_values(by=["underperformed_days_percent", "normal_days_percent"], ascending=False)
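
To make the binning step above concrete, here is a small sketch on made-up ratios. With `bins = [0, 0.5, 0.7]`, `np.digitize` returns 1 for values in [0, 0.5), 2 for values in [0.5, 0.7), and 3 for values of 0.7 and above, which the dictionary then maps to labels.

```python
import numpy as np

bins = [0, 0.5, 0.7]
names = ["Bad", "Normal", "Good"]
label_of = dict(enumerate(names, 1))  # {1: "Bad", 2: "Normal", 3: "Good"}

# Toy min-max-scaled ratios; values are illustrative only.
scaled = np.array([0.0, 0.49, 0.5, 0.69, 0.7, 1.0])

# Each value gets the index of the bin it falls into (left edge inclusive).
bin_index = np.digitize(scaled, bins)
labels = [label_of[int(i)] for i in bin_index]
print(labels)
# ['Bad', 'Bad', 'Normal', 'Normal', 'Good', 'Good']
```

Because the ratios are min-max scaled per station first, the labels describe each station relative to its own range rather than on an absolute scale.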

In [None]:
pd.DataFrame(df_data).describe()

In [None]:
df_status.groupby(["events","station_id","dd","mm","yyyy"])["docks_available"].sum()