# Making Capital Bikeshare system data usable

## Unzipping files

We'll extract all of the zipped files into a new folder (`./unzippedData`).

**Warning**: requires at least 5.63 GB of free disk space

In [5]:
import os
import zipfile

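for file in os.listdir("./capitalBikeshareData"):
    filepath = "./capitalBikeshareData/" + file
    # Skip anything in the folder that is not a zip archive
    if not zipfile.is_zipfile(filepath):
        continue
    with zipfile.ZipFile(filepath, 'r') as zip_ref:
        zip_ref.extractall("./unzippedData")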

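The disk space requirement can be checked before extracting anything. The following is a minimal sketch, not part of the original notebook: the helper names `total_uncompressed_size` and `has_room_for` are hypothetical, and it assumes that the uncompressed sizes recorded in each archive's metadata are accurate.

```python
import os
import shutil
import zipfile


def total_uncompressed_size(zip_dir):
    """Sum the uncompressed sizes recorded in every zip archive in zip_dir."""
    total = 0
    for name in os.listdir(zip_dir):
        path = os.path.join(zip_dir, name)
        # is_zipfile returns False for directories and non-zip files
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as archive:
                total += sum(info.file_size for info in archive.infolist())
    return total


def has_room_for(zip_dir, target_dir="."):
    """Return True if the drive holding target_dir has enough free space."""
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= total_uncompressed_size(zip_dir)
```

Calling `has_room_for("./capitalBikeshareData")` before the extraction loop would show whether the extracted files fit on the current drive.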


## Analysing data before any further data preparation

To work with as much data as possible, we want to combine everything into a single CSV file. That only works if the individual CSV files share the same (or at least very similar) columns, so let's check the column names of each file first.

In [10]:
import os
import pandas as pd

# Maps each distinct header (as a comma-separated string) to its list of column names
columnNamesDict = dict()
for file in os.listdir("./unzippedData"):
    filepath = "./unzippedData/" + file
    if not os.path.isfile(filepath):
        continue
    # nrows=0 reads only the header row, which is all we need here
    fileDataframe = pd.read_csv(filepath, nrows=0)
    columnNames = ", ".join(fileDataframe.columns)
    # Print each distinct set of columns once, together with the first file that uses it
    if columnNames not in columnNamesDict:
        print(filepath)
        print(columnNames + "\n")
        columnNamesDict[columnNames] = fileDataframe.columns.values.tolist()


./unzippedData/2010-capitalbikeshare-tripdata.csv
Duration, Start date, End date, Start station number, Start station, End station number, End station, Bike number, Member type

./unzippedData/202004-capitalbikeshare-tripdata.csv
ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual





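CSV files from before April 2020 have the columns `Duration`, `Start date`, `End date`, `Start station number`, `Start station`, `End station number`, `End station`, `Bike number`, `Member type`.

CSV files from April 2020 onward have the columns `ride_id`, `rideable_type`, `started_at`, `ended_at`, `start_station_name`, `start_station_id`, `end_station_name`, `end_station_id`, `start_lat`, `start_lng`, `end_lat`, `end_lng`, `member_casual`.

The two sets share no column names, and each contains columns with no counterpart in the other (for example `Duration` and `Bike number` in the older files, or the coordinates `start_lat`, `start_lng`, `end_lat`, `end_lng` in the newer ones), so the files cannot simply be concatenated.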
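To make the mismatch concrete, the sketch below compares the two column lists printed above. The `OLD_TO_NEW` mapping is an assumption made for illustration only: it pairs columns that appear to describe the same thing based on their names, and the notebook does not confirm that their values use the same format.

```python
old_columns = ['Duration', 'Start date', 'End date', 'Start station number',
               'Start station', 'End station number', 'End station',
               'Bike number', 'Member type']

new_columns = ['ride_id', 'rideable_type', 'started_at', 'ended_at',
               'start_station_name', 'start_station_id', 'end_station_name',
               'end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
               'member_casual']

# No column name appears in both formats
shared = set(old_columns) & set(new_columns)
print(shared)  # set()

# ASSUMPTION: a guessed name-based correspondence, not verified against the data
OLD_TO_NEW = {
    'Start date': 'started_at',
    'End date': 'ended_at',
    'Start station number': 'start_station_id',
    'Start station': 'start_station_name',
    'End station number': 'end_station_id',
    'End station': 'end_station_name',
    'Member type': 'member_casual',
}

renamed = [OLD_TO_NEW.get(name, name) for name in old_columns]
shared_after = [name for name in renamed if name in new_columns]
only_old = [name for name in renamed if name not in new_columns]
only_new = [name for name in new_columns if name not in renamed]

print(only_old)  # ['Duration', 'Bike number']
print(only_new)
```

Even under this guessed mapping, some columns remain on each side without a counterpart, which is why the two formats are kept apart here.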


## Combining data into one CSV file per column set

Since the files use two different sets of column names for two different time ranges, we'll create one combined CSV file for each set.

**Warning**: takes a long time (~8 minutes) and requires at least 5.63 GB of space

In [13]:
import os
import pandas as pd

# Collect the dataframes for each column set and concatenate once at the end;
# calling pd.concat inside the loop would copy all previously loaded rows every time.
framesBeforeApril2020 = []
framesFromApril2020 = []

for file in os.listdir("./unzippedData"):
    filepath = "./unzippedData/" + file

    if not os.path.isfile(filepath):
        continue

    fileDataframe = pd.read_csv(filepath)

    # Files from April 2020 onward start with a `ride_id` column
    if fileDataframe.columns[0] == 'ride_id':
        framesFromApril2020.append(fileDataframe)
    else:
        framesBeforeApril2020.append(fileDataframe)

combinedDataAfterApril2020 = pd.concat(framesFromApril2020, ignore_index=True)
combinedDataBeforeApril2020 = pd.concat(framesBeforeApril2020, ignore_index=True)

combinedDataAfterApril2020.to_csv('./combinedDataAfterApril2020.csv')
combinedDataBeforeApril2020.to_csv('./combinedDataBeforeApril2020.csv')

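If holding all of the data in memory at once is a problem, the same split can be done by streaming rows straight from the source files into the two output files. This is a rough sketch using only the standard library rather than the pandas approach above. The function name `combine_by_schema` is hypothetical, the files are assumed to be UTF-8 encoded, and unlike `DataFrame.to_csv` with its default settings, the output has no extra index column.

```python
import csv
import os


def combine_by_schema(source_dir, new_schema_out, old_schema_out):
    """Stream every CSV in source_dir into one of two output files.

    Files whose first header cell is 'ride_id' go to new_schema_out,
    all others to old_schema_out. Rows are copied one at a time.
    The output files should live outside source_dir.
    """
    header_written = {new_schema_out: False, old_schema_out: False}
    with open(new_schema_out, "w", newline="", encoding="utf-8") as new_f, \
            open(old_schema_out, "w", newline="", encoding="utf-8") as old_f:
        writers = {new_schema_out: csv.writer(new_f),
                   old_schema_out: csv.writer(old_f)}
        for name in sorted(os.listdir(source_dir)):
            path = os.path.join(source_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if not header:  # empty file or blank first line
                    continue
                if header[0] == "ride_id":
                    target = new_schema_out
                else:
                    target = old_schema_out
                # Write the header only for the first file of each kind
                if not header_written[target]:
                    writers[target].writerow(header)
                    header_written[target] = True
                writers[target].writerows(reader)
```

With the folder layout used in this notebook, the call would be `combine_by_schema("./unzippedData", "./combinedDataAfterApril2020.csv", "./combinedDataBeforeApril2020.csv")`.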
