# Making Capital Bikeshare system data usable

## Unzipping files

We'll add all of the files into a new folder (`/unzippedData`)

**Warning**: requires at least 5.63GB of space

In [44]:
import os
import zipfile

for file in os.listdir("./capitalBikeshareData"):
    filepath = "./capitalBikeshareData/" + file
    with zipfile.ZipFile(filepath, 'r') as zip_ref:
        zip_ref.extractall("./unzippedData")

## Analysing data before any further data preparation

Since we want to work on as much data as possible, we want to combine the data into one csv file. This however requires the csv files to be the same for the most part. The most important part is for each of the csv files to each have similar/same features. Let's check that.

In [45]:
import os
import pandas as pd

columnNamesDict = dict()
for file in os.listdir("./unzippedData"):
    filepath = "./unzippedData/" + file
    if not os.path.isfile(filepath):
        continue
    fileDataframe = pd.read_csv(filepath, low_memory=False)
    columnNames = ""
    for name in fileDataframe.columns:
        columnNames += name + ", "
    columnNames = columnNames[:-2]
    if columnNames not in columnNamesDict:
        print(filepath)
        print(columnNames + "\n")
        columnNamesDict[columnNames] = fileDataframe.columns.values.tolist()


./unzippedData/2010-capitalbikeshare-tripdata.csv
Duration, Start date, End date, Start station number, Start station, End station number, End station, Bike number, Member type

./unzippedData/202004-capitalbikeshare-tripdata.csv
ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual



Csv files before April 2020 have columns `Duration`, `Start date`, `End date`, `Start station number`, `Start station`, `End station number`, `End station`, `Bike number`, `Member type`.

Csv files after April 2020 have columns `ride_id`, `rideable_type`, `started_at`, `ended_at`, `start_station_name`, `start_station_id`, `end_station_name`, `end_station_id`, `start_lat`, `start_lng`, `end_lat`, `end_lng`, `member_casual`.

These factors make combining datasets difficult.


## Combining data into single files based on features

Since we have two different sets of column names / features for two different time ranges, we'll try to make one csv file for both sets.

**Warning**: takes a long time (~8 minutes) and requires at least 5.63 GB of space

In [46]:
import os
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

combinedDataBeforeApril2020 = pd.DataFrame(
    {'Duration': [], 'Start date': [], 'End date': [], 'Start station number': [],
     'Start station': [], 'End station number': [], 'End station': [], 'Bike number': [],
     'Member type': []})

combinedDataAfterApril2020 = pd.DataFrame({'ride_id': [], 'rideable_type': [], 'started_at': [], 'ended_at': [],
                              'start_station_name': [],
                              'start_station_id': [], 'end_station_name': [], 'end_station_id': [], 'start_lat': [],
                              'start_lng': [],
                              'end_lat': [], 'end_lng': [], 'member_casual': []})

columnNamesDict = set()
for file in os.listdir("./unzippedData"):
    filepath = "./unzippedData/" + file

    if not os.path.isfile(filepath):
        continue

    fileDataframe = pd.read_csv(filepath, low_memory=False, dtype="string") # String type is important to not convert integers into floats (becomes a problem by adding .0 to the end of an integer)

    if fileDataframe.columns[0] == 'ride_id':
        combinedDataAfterApril2020 = pd.concat([combinedDataAfterApril2020, fileDataframe], ignore_index=True)
    else:
        combinedDataBeforeApril2020 = pd.concat([combinedDataBeforeApril2020, fileDataframe], ignore_index=True)

combinedDataAfterApril2020.to_csv('./combinedDataAfterApril2020.csv')

combinedDataBeforeApril2020.to_csv('./combinedDataBeforeApril2020.csv')

## Assembling a combined csv file based on common features

Since the features after April 2020 are named better, we'll try to use their names.

**Warning**: takes a long time and requires at least 5.63 GB of space

In [47]:
# Function for assigning correct datatypes
def set_data_types_combined(df):
    df['started_at'] = pd.to_datetime(df['started_at'], format='ISO8601', yearfirst=True)
    df['ended_at'] = pd.to_datetime(df['started_at'], format='ISO8601', yearfirst=True)
    df.astype({"is_member": "bool"})


trimmedDataAfterApril2020 = pd.read_csv('./combinedDataAfterApril2020.csv', low_memory=False, index_col=0, dtype={"Start station number": "string", "End station number": "string"})
trimmedDataAfterApril2020.drop(columns=["ride_id", "rideable_type", "start_lat", "end_lat", "start_lng", "end_lng"], axis=1, inplace=True)
trimmedDataAfterApril2020.columns = ["started_at", "ended_at", "start_station_name", "start_station_id", "end_station_name", "end_station_id", "is_member"]
trimmedDataAfterApril2020.loc[trimmedDataAfterApril2020['is_member'] == "member",'is_member'] = True
trimmedDataAfterApril2020.loc[trimmedDataAfterApril2020['is_member'] == "casual",'is_member'] = False


trimmedDataBeforeApril2020 = pd.read_csv('./combinedDataBeforeApril2020.csv', low_memory=False, index_col=0, dtype={"start_station_id": "string", "end_station_id": "string"})
trimmedDataBeforeApril2020.drop(columns=["Duration", "Bike number"], axis=1, inplace=True)
trimmedDataBeforeApril2020.columns = ["started_at", "ended_at", "start_station_id", "start_station_name", "end_station_id", "end_station_name", "is_member"]
trimmedDataBeforeApril2020.loc[trimmedDataBeforeApril2020['is_member'] == "Member",'is_member'] = True
trimmedDataBeforeApril2020.loc[trimmedDataBeforeApril2020['is_member'] == "Casual",'is_member'] = False

combinedData = pd.concat([trimmedDataAfterApril2020, trimmedDataBeforeApril2020], ignore_index=True)
set_data_types_combined(combinedData)
combinedData.to_csv('./combinedData.csv')

# In-depth analysis

## Analysing data in greater detail

Now that we have multiple large datasets, we can do further investigation into how good our data is.

In [None]:
# Loading in data if required, takes a long time
combinedData = pd.read_csv('./combinedData.csv', low_memory=False, index_col=0, dtype={"start_station_id": "string", "end_station_id": "string"})
combinedDataAfterApril2020 = pd.read_csv('./combinedDataAfterApril2020.csv')
combinedDataBeforeApril2020 = pd.read_csv('./combinedDataBeforeApril2020.csv')

In [43]:

print(combinedData.isna().sum()/len(combinedData))
print(combinedDataAfterApril2020.isna().sum()/len(combinedDataAfterApril2020))
print(combinedDataBeforeApril2020.isna().sum()/len(combinedDataAfterApril2020))

print()
print(combinedData.start_station_id.value_counts())
print(combinedData.end_station_id.value_counts())



started_at            0.000000
ended_at              0.000000
start_station_name    0.046966
start_station_id      0.046966
end_station_name      0.050064
end_station_id        0.050086
is_member             0.000000
dtype: float64
ride_id               0.000000e+00
rideable_type         0.000000e+00
started_at            0.000000e+00
ended_at              0.000000e+00
start_station_name    1.027012e-01
start_station_id      1.027012e-01
end_station_name      1.094751e-01
end_station_id        1.095228e-01
start_lat             5.822848e-07
start_lng             5.822848e-07
end_lat               1.550042e-03
end_lng               1.550042e-03
member_casual         0.000000e+00
dtype: float64
Duration                0.000000e+00
Start date              0.000000e+00
End date                0.000000e+00
Start station number    0.000000e+00
Start station           0.000000e+00
End station number      0.000000e+00
End station             0.000000e+00
Bike number             8.151988e-07
Me