Students: s1807333 Julie L. Clausen, s180317 Helle Achari, s174129 Frederik Warberg

# Introduction

This project centers on the theme of mobility and aims to address today's pressing global challenges with sustainable transportation solutions. Throughout the project the focus is on shared mobility services, which offer emission-free and efficient options that help combat climate change, social disparities, and more.
In the project we predict bike-sharing demand for clusters of stations using data from Citi Bike, a prominent bike-sharing system in the United States. Tasks include spatial clustering, demand prediction modeling, and fleet repositioning. Furthermore, we extend the analysis by exploring additional research questions, uncovering new insights in mobility data using the data science cycle.

(Extensions: The report may include optional extensions such as expanding the dataset, creating visualizations, and employing advanced data science techniques to enhance the project's value.)

As a final result, we aim for this project to leverage data science to optimize mobility operations, foster innovation, and contribute to a sustainable future in urban transportation.

In [8]:
import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

In [9]:
data = pd.read_csv("Trips_2018.csv")

In [10]:
data.head()

Unnamed: 0.1,Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_latitude,start_station_longitude,end_station_id,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender
0,0,970,2018-01-01 13:50:57.4340,2018-01-01 14:07:08.1860,72.0,40.767272,-73.993929,505.0,40.749013,-73.988484,31956,Subscriber,1992,1
1,1,723,2018-01-01 15:33:30.1820,2018-01-01 15:45:33.3410,72.0,40.767272,-73.993929,3255.0,40.750585,-73.994685,32536,Subscriber,1969,1
2,2,496,2018-01-01 15:39:18.3370,2018-01-01 15:47:35.1720,72.0,40.767272,-73.993929,525.0,40.755942,-74.002116,16069,Subscriber,1956,1
3,3,306,2018-01-01 15:40:13.3720,2018-01-01 15:45:20.1910,72.0,40.767272,-73.993929,447.0,40.763707,-73.985162,31781,Subscriber,1974,1
4,4,306,2018-01-01 18:14:51.5680,2018-01-01 18:19:57.6420,72.0,40.767272,-73.993929,3356.0,40.774667,-73.984706,30319,Subscriber,1992,1


# Data cleaning 

The given data consists of multiple variables. Before any in-depth analysis, we make sure that the data does not contain missing values or implausible measurements, as these could distort the results.

As a first step, we strip the data of variables that are excluded from the analysis: Unnamed: 0, bikeid and usertype.

We list the columns that are being excluded from the raw data.

In [None]:
excluded_variables = ["Unnamed: 0", "bikeid", "usertype"]

We then drop the specified columns.

In [None]:
data.drop(excluded_variables, axis = 1, inplace = True)

The new data is saved as a new CSV file, data1.csv, and read back in as data1.

In [None]:
data.to_csv("data1.csv", index = False)
data1 = pd.read_csv("data1.csv")

We check the new data to see if the columns are excluded correctly.

In [None]:
data1.head()

The new data is checked for missing values and cleaned accordingly. We use the .isna() method from the pandas package to create a Boolean mask specifying where values are missing, print the affected rows, and then drop them with .dropna().

In [None]:
missing_value = data1.isna()
rows_w_nan = data1[missing_value.any(axis = 1)]
print(rows_w_nan)

In [None]:
data1_cleaned = data1.dropna()
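
Printing every row with a missing value can be hard to read on a dataset of this size. As a minimal sketch, the same check can be summarized per column. The toy DataFrame below is an illustrative stand-in for data1, and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data1 (illustrative values only, not taken from Trips_2018.csv)
toy = pd.DataFrame({
    "tripduration": [970, 723, 496, 306],
    "start_station_id": [72.0, np.nan, 72.0, 72.0],
    "end_station_id": [505.0, 3255.0, np.nan, 447.0],
    "birth_year": [1992, 1969, 1956, 1974],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_nan = toy[toy.isna().any(axis=1)]

# Drop those rows, mirroring the dropna() call used on data1
toy_cleaned = toy.dropna()
print(f"{len(toy)} rows before cleaning, {len(toy_cleaned)} rows after")
```

A per-column count makes it easier to see whether the missing values are concentrated in a few variables, such as the station IDs, before deciding to drop rows.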

We now convert starttime and stoptime from strings to datetime values so that durations can be computed from them. To leave the cleaned DataFrame untouched, we first copy it to a new DataFrame called data2.

In [None]:
data2 = data1_cleaned.copy()
data2["starttime"] = pd.to_datetime(data2["starttime"], format="%Y-%m-%d %H:%M:%S.%f")
data2["stoptime"] = pd.to_datetime(data2["stoptime"], format="%Y-%m-%d %H:%M:%S.%f")

In [None]:
data2.head()

We look at starttime and stoptime and eliminate implausible trips, defined as trips shorter than 1 minute or longer than 5 hours.

In [None]:
data2 = data2[(data2['stoptime'] - data2['starttime']).dt.total_seconds() >= 60]  
data2 = data2[(data2['stoptime'] - data2['starttime']).dt.total_seconds() <= 5 * 60 * 60]
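
The two filters can also be expressed as a single Boolean mask. The sketch below uses a small toy DataFrame with made-up timestamps; the constant names MIN_SECONDS and MAX_SECONDS are introduced here for illustration only.

```python
import pandas as pd

# Toy trips (illustrative timestamps only)
toy = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2018-01-01 13:50:57.434",  # about 16 minutes: kept
        "2018-01-01 15:00:00.000",  # 30 seconds: too short
        "2018-01-01 08:00:00.000",  # 6.5 hours: too long
    ]),
    "stoptime": pd.to_datetime([
        "2018-01-01 14:07:08.186",
        "2018-01-01 15:00:30.000",
        "2018-01-01 14:30:00.000",
    ]),
})

MIN_SECONDS = 60           # 1 minute
MAX_SECONDS = 5 * 60 * 60  # 5 hours

# Duration in seconds computed from the two timestamps
duration_s = (toy["stoptime"] - toy["starttime"]).dt.total_seconds()

# between() is inclusive at both ends by default
valid = duration_s.between(MIN_SECONDS, MAX_SECONDS)
filtered = toy[valid]
print(filtered)
```

Computing the duration once and reusing the mask avoids subtracting the two columns twice.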

We have now cleaned and updated our dataset through several steps. To check whether some steps should be modified or excluded, or whether further cleaning and adjustments are needed, we evaluate the current dataset using descriptive statistics.

In [None]:
data2_descriptive_stats = data2.describe()
pd.options.display.float_format = '{:.2f}'.format
print(data2_descriptive_stats)

Tripduration:
The dataset contains approximately 17.5 million entries.
The average trip duration is about 824.58 seconds (approximately 13.7 minutes).
The standard deviation is around 808.80 seconds, nearly as large as the mean, indicating substantial variability in trip durations.
The shortest trip duration is 61 seconds, while the longest trip duration is 20,106 seconds (about 5 hours and 35 minutes). The maximum exceeds the 5-hour cutoff because the filter was applied to stoptime - starttime rather than to the tripduration column.

Start_station_id and end_station_id:
The columns "start_station_id" and "end_station_id" have similar summary statistics.
They both have a count of approximately 17.5 million entries.
The statistics include the mean, standard deviation, minimum, and maximum values of the station IDs.

Birth_year:
The dataset contains approximately 17.5 million entries.
The average birth year is approximately 1978.99, i.e. users were on average born around 1979.
The standard deviation of approximately 11.93 indicates some variability in birth years.
The minimum birth year is 1885, which is implausible for an active rider in 2018 and points to erroneous entries. The maximum birth year is 2002.

Gender:
The dataset contains approximately 17.5 million entries.
The "gender" column seems to have three unique values represented as 0, 1, and 2.
The mean of approximately 1.15 suggests that value 1 is the most common, although a mean is only a rough indicator for a categorical code.
The standard deviation of approximately 0.54 indicates some variability in gender values.

DESCRIPTIVE STATS:

Central tendency (mean):
For "tripduration," the average trip duration is about 824.58 seconds (approximately 13.7 minutes).
For "start_station_id" and "end_station_id," the mean is of limited use because station IDs are identifiers rather than measurements.
For "birth_year," the average birth year is around 1978.99, i.e. users were on average born around 1979.
For "gender," the mean of approximately 1.15 indicates that value 1 is likely the most common.

Spread (standard deviation):
For "tripduration," the standard deviation is approximately 808.80 seconds, indicating substantial variability in trip durations.
For "start_station_id" and "end_station_id," the standard deviations are relatively high, which reflects the wide range of station IDs.
For "birth_year," the standard deviation of approximately 11.93 suggests some variability in birth years.
For "gender," the standard deviation of approximately 0.54 indicates variability in gender values.

Distribution (minimum and maximum):
The "tripduration" column has a wide distribution with a minimum of 61 seconds and a maximum of 20,106 seconds, indicating trips of varying lengths.
The "start_station_id" and "end_station_id" columns show the range of station IDs in our dataset.
The "birth_year" column spans from 1885 to 2002, with some users having birth years that fall outside the plausible range.
The "gender" column has three unique values (0, 1, and 2), which suggests that it is not binary and may represent different gender categories.
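
Since gender is a categorical code and birth_year contains implausible values, frequency counts and a derived age are often more informative than means. The sketch below illustrates this on made-up toy data; the threshold MAX_PLAUSIBLE_AGE is an assumption chosen for the example and is not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for data2 (illustrative values only)
toy = pd.DataFrame({
    "birth_year": [1992, 1969, 1885, 1974, 2002, 1988],
    "gender": [1, 1, 0, 2, 1, 2],
})

# Share of each gender code instead of its mean
gender_share = toy["gender"].value_counts(normalize=True).sort_index()
print(gender_share)

# Age in the data year 2018
toy["age"] = 2018 - toy["birth_year"]

# Hypothetical plausibility threshold (an assumption for this sketch)
MAX_PLAUSIBLE_AGE = 100
implausible = toy[toy["age"] > MAX_PLAUSIBLE_AGE]
print(implausible)
```

Rows flagged this way could either be dropped or kept with the age set to missing, depending on whether age is used in the later analysis.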

To gain a deeper understanding of our dataset, we create visualizations.

In [None]:
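# Count arrivals and departures per station; a histogram of the raw
# station IDs would only show how the ID numbers are distributed
arrivals_per_station = data2['end_station_id'].value_counts()
departures_per_station = data2['start_station_id'].value_counts()

# Plot histogram for arrivals
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(arrivals_per_station, bins=20, edgecolor='k')
plt.xlabel('Number of Arrivals per Station')
plt.ylabel('Number of Stations')
plt.title('Distribution of Arrivals')

# Plot histogram for departures
plt.subplot(1, 2, 2)
plt.hist(departures_per_station, bins=20, edgecolor='k')
plt.xlabel('Number of Departures per Station')
plt.ylabel('Number of Stations')
plt.title('Distribution of Departures')

plt.tight_layout()
plt.show()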

# Prediction challenge 

The optimal number of clusters is taken as the value of k at which the distortion in the Elbow Method plot starts to level off, the so-called "elbow" point.

CHAT GPT: 
Determine the Optimal Number of Clusters: The code aims to find the optimal number of clusters by performing K-Means clustering with a range of different cluster numbers (from 20 to 49). It uses the Elbow Method to do this.

Elbow Method Calculation: Within a loop, the code initializes K-Means clustering for each value of k in the specified range. It fits the K-Means model to the station coordinates and calculates the distortion (Sum of Squared Distances, SSD) for each k. The distortion is a measure of how close the data points within each cluster are to the cluster's centroid. The distortion is stored in the distortions list for each k.

Elbow Method Plot: After calculating the distortions for different values of k, the code plots the Elbow Method graph. The x-axis represents the number of clusters (k), and the y-axis represents the distortion (SSD). The graph typically shows a curve where the distortion decreases as the number of clusters increases. The "elbow" point on the graph, where the rate of decrease starts to slow down, is often considered the optimal number of clusters.

In [None]:
import pandas as pd
from sklearn.cluster import KMeans

# Extract the latitude and longitude features
stations = data2[['start_station_latitude', 'start_station_longitude']]

# Find the optimal number of clusters using the Elbow Method
distortions = []
K = range(20, 50)  # Adjust the range based on your minimum required clusters (at least 20)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(stations)
    distortions.append(kmeans.inertia_)

# Plot the Elbow Method graph
plt.plot(K, distortions, 'bx-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Distortion (SSD)')
plt.title('The Elbow Method for Optimal k')
plt.show()
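
Reading the elbow off a plot is somewhat subjective. As a rough complement, one simple heuristic picks the k where the distortion curve bends most sharply, measured by the largest second difference. The helper elbow_by_second_difference below is a hypothetical function written for this sketch (it is not part of scikit-learn), and the distortion values are made up.

```python
import numpy as np

def elbow_by_second_difference(ks, distortions):
    """Return the k with the largest second difference of the distortion curve.

    This is a simple heuristic, not a definitive criterion: it only considers
    interior points and can be sensitive to noise in the distortions.
    """
    ks = np.asarray(list(ks))
    d = np.asarray(distortions, dtype=float)
    # Second difference d[i-1] - 2*d[i] + d[i+1] for each interior point
    second_diff = d[:-2] - 2 * d[1:-1] + d[2:]
    return int(ks[1:-1][np.argmax(second_diff)])

# Made-up distortions for k = 20..26 with a visible bend at k = 23
toy_ks = range(20, 27)
toy_distortions = [100.0, 80.0, 62.0, 46.0, 43.0, 41.0, 40.0]

best_k = elbow_by_second_difference(toy_ks, toy_distortions)
print(best_k)
```

With the notebook's own results, the function could be called as elbow_by_second_difference(K, distortions), and the outcome should still be checked against the plot.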

In [None]:
import pandas as pd
from sklearn.cluster import KMeans

# Extract relevant columns from the cleaned data (data2), as in the elbow step
stations = data2[['start_station_latitude', 'start_station_longitude']]

# Number of clusters (at least 20)
num_clusters = 20

# Initialize K-Means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=0)

# Fit the K-Means model to the coordinates
kmeans.fit(stations)

# Add cluster labels to the dataset
data2['cluster'] = kmeans.labels_

# Now 'data2' contains a new column 'cluster' with cluster labels

# Print the number of trips in each cluster (one row per trip)
print(data2['cluster'].value_counts())

# Optional: Visualize the clusters on a map
import matplotlib.pyplot as plt
plt.scatter(data2['start_station_longitude'], data2['start_station_latitude'], c=data2['cluster'], cmap='viridis')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Spatial Clustering of Stations')
plt.show()
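
Because the trip table has one row per trip, fitting K-Means on it weights each station by its number of trips and is slow on millions of rows. A possible alternative, sketched here on made-up toy data, is to cluster the unique stations and then map the labels back to the trips. This is an illustration under those assumptions, not the method used above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy trips: six stations in two well-separated groups (illustrative values)
trips = pd.DataFrame({
    "start_station_id": [1, 1, 1, 2, 3, 4, 4, 5, 6],
    "start_station_latitude": [40.70, 40.70, 40.70, 40.71, 40.72,
                               40.80, 40.80, 40.81, 40.82],
    "start_station_longitude": [-74.00, -74.00, -74.00, -74.01, -74.00,
                                -73.95, -73.95, -73.96, -73.95],
})

coord_cols = ["start_station_latitude", "start_station_longitude"]

# One row per station
stations = (
    trips.drop_duplicates("start_station_id")[["start_station_id"] + coord_cols]
    .reset_index(drop=True)
    .copy()
)

# Cluster the stations (2 clusters for the toy data)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
stations["cluster"] = kmeans.fit_predict(stations[coord_cols])

# Map the station-level labels back onto the trips
trips = trips.merge(stations[["start_station_id", "cluster"]],
                    on="start_station_id", how="left")

stations_per_cluster = stations["cluster"].value_counts()
trips_per_cluster = trips["cluster"].value_counts()
print(stations_per_cluster)
print(trips_per_cluster)
```

Keeping the station-level table separate also makes it possible to report the number of stations per cluster and the number of trips per cluster as two distinct quantities.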


In [7]:
import pandas as pd

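# NOTE: 'df' is not defined earlier in this notebook. It is assumed to be a
# DataFrame of trips that contains a 'cluster' label together with the
# columns 'starttime', 'longitude' and 'latitude'.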

# Ensure the 'starttime' column is in datetime format
df['starttime'] = pd.to_datetime(df['starttime'])

# Extract the hour from the 'starttime' column
df['hour'] = df['starttime'].dt.hour

# Group the data by 'cluster' and 'hour'
grouped_data = df.groupby(['cluster', 'hour'])

# Iterate through each group
for (cluster, hour), group_df in grouped_data:
    print(f"Cluster {cluster}, Hour {hour}:")
    print(group_df[['starttime', 'longitude', 'latitude']])
    print("\n")

Cluster 0, Hour 0:
                    starttime  longitude   latitude
1916  2018-04-14 00:56:31.876 -73.999744  40.716021
4025  2018-03-09 00:52:49.445 -73.997047  40.714131
5524  2018-03-18 00:26:15.638 -73.992663  40.718939
6863  2018-03-01 00:05:30.556 -73.991908  40.716059
14279 2018-09-22 00:28:03.831 -73.992663  40.718939
16619 2018-11-04 00:10:04.602 -73.999733  40.719105
17140 2018-09-11 00:41:05.645 -73.991930  40.711731
19329 2018-02-28 00:11:56.053 -73.989900  40.714275
22435 2018-12-04 00:30:04.472 -73.995960  40.718822
22461 2018-02-20 00:03:33.809 -73.989900  40.714275
22832 2018-08-31 00:39:45.693 -73.994224  40.715815
24109 2018-04-24 00:33:13.532 -73.995960  40.718822
24473 2018-09-08 00:15:58.181 -73.999733  40.719105
32828 2018-06-23 00:09:19.581 -73.999733  40.719105
37265 2018-08-27 00:11:58.859 -73.999744  40.716021
41122 2018-06-28 00:14:14.196 -73.992663  40.718939
46591 2018-07-30 00:11:38.200 -73.991930  40.711731
49720 2018-08-12 00:22:19.082 -73.989900  40.
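
For demand prediction, the per-group printout is usually condensed into counts of departures per cluster and hour. The sketch below shows one way to do this on a made-up toy DataFrame; the column names date and departures are introduced here for illustration.

```python
import pandas as pd

# Toy trips with cluster labels (illustrative values only)
toy = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2018-04-14 00:56:31",
        "2018-04-14 00:10:04",
        "2018-04-14 08:05:00",
        "2018-04-14 08:30:00",
        "2018-04-15 08:45:00",
    ]),
    "cluster": [0, 0, 0, 1, 1],
})

# Split the timestamp into calendar date and hour of day
toy["date"] = toy["starttime"].dt.date
toy["hour"] = toy["starttime"].dt.hour

# Count departures for every (cluster, date, hour) combination
hourly = (
    toy.groupby(["cluster", "date", "hour"])
    .size()
    .rename("departures")
    .reset_index()
)
print(hourly)
```

Note that hours without any departures do not appear in the result, so they would need to be filled with zeros before the table is used as a time series.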

# Exploratory component 1

In this first exploratory component, our objective is to examine the impact of age and gender on bike rentals. The original dataset provides users' birth years and gender, allowing us to employ a variety of analytical methods.
Our goal is to segment users by age and, in addition, to stratify by gender in order to explore variations in bike rental patterns between genders. We thereby hope to arrive at a more detailed analysis that investigates whether these gender-related differences in bike rentals vary across age groups.

By applying statistical methods such as correlation analysis, time trend evaluation, and linear regression, we hope to understand how gender and age influence rental demand.
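
As a minimal sketch of the planned segmentation, users can be binned into age groups with pd.cut and cross-tabulated against gender. The bin edges and labels below are assumptions chosen for illustration, and the toy data is made up.

```python
import pandas as pd

# Toy stand-in for data2 (illustrative values only)
toy = pd.DataFrame({
    "birth_year": [1992, 1969, 1956, 1974, 1992, 2000],
    "gender": [1, 1, 2, 1, 2, 2],
})

# Age in the data year 2018
toy["age"] = 2018 - toy["birth_year"]

# Hypothetical age bins; intervals are right-inclusive, e.g. (25, 40]
bins = [0, 25, 40, 60, 120]
labels = ["<=25", "26-40", "41-60", ">60"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Number of trips per age group and gender code
counts = pd.crosstab(toy["age_group"], toy["gender"])
print(counts)
```

The resulting table gives the number of trips in each age group and gender combination, which can then be compared across groups or normalized to shares.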

# Exploratory component 2

In the second exploratory component we consider the impact of weather on bike rentals. We wish to explore the relationship between weather features and bike rentals by computing correlation coefficients (e.g., Pearson's correlation) between bike rental demand and various weather features. The correlations should identify which weather factors have the strongest association with bike rentals. Furthermore, we aim to train predictive models (e.g., regression models or machine learning models) that use weather features as input to predict bike rental demand. We hope to quantify the impact of weather on rentals and make forecasts based on future weather predictions. Additionally, we consider temporal trends by creating time series plots of bike rental patterns by hour, day, or month, using the starttime variable as our timestamp.
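
The correlation step could look roughly like the sketch below. The trip data contains no weather information, so the columns temperature_c and precipitation_mm are hypothetical names for an external weather dataset, and all values are made up for illustration.

```python
import pandas as pd

# Toy daily table: rental counts joined with hypothetical weather features
toy = pd.DataFrame({
    "rentals": [120, 150, 90, 200, 60, 180],
    "temperature_c": [10, 14, 8, 22, 5, 19],
    "precipitation_mm": [2.0, 0.0, 5.0, 0.0, 8.0, 0.5],
})

# Pearson correlation of every weather feature with the rental count
corr_with_rentals = toy.corr(method="pearson")["rentals"].drop("rentals")
print(corr_with_rentals)
```

Pearson's correlation only captures linear association, so a weak coefficient does not rule out a non-linear effect of a weather feature on demand.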

# Conclusion

Word-count (MAX. 2000-2500):

In [20]:
import nbformat

# Path to the Jupyter Notebook file
notebook_path = "IBA.ipynb"

# Read the notebook (the file handle gets its own name so that it does
# not overwrite the path variable)
with open(notebook_path, "r", encoding="utf-8") as notebook_file:
    notebook_content = nbformat.read(notebook_file, as_version=4)

# Initialize word count
word_count = 0

# Iterate through the notebook cells
for cell in notebook_content.cells:
    if cell.cell_type == "markdown":
        # Extract and split the content of each markdown cell into words
        cell_text = cell["source"]
        words = cell_text.split()
        word_count += len(words)

# Display the total word count
print(f"Total word count in the notebook: {word_count}")


Total word count in the notebook: 1654


Everybody contributed equally to the project and the code, but the main contributors to each section can be found in the table below.

Contribution table: 

| Section         | Contributors        |
|-----------------|---------------------|
| Introduction    |             |
| Data Cleaning   |      |
| Prediction challenge, clustering   |   |
| Prediction challenge, prediction model  |  |
| Prediction challenge, required number of bicycles   |  |
| Exploratory research Q1 |             |
| Exploratory research Q2   |             |
| Conclusion    |             |
