# Creating the Database
---


Citi Bike provides a valuable transportation service to people who live in or travel to NYC. It maintains an [open Citi Bike database](https://ride.citibikenyc.com/system-data) that is free for the public to [download](https://s3.amazonaws.com/tripdata/index.html). In this notebook, we will build a SQL database populated with that data. Since the data format changed in February 2021, we'll focus on NYC trip data from before 2021.
  
First, we will build a web scraper to pull all the necessary data. After downloading the data, we will merge the files and import the data into a PostgreSQL database. 
  
⚠️ Citi Bike trip data has used a new format since February 2021.  
⚠️ If you encounter issues when running the code, please check the [Issues](https://github.com/Fitzmon/CitiBike_Analysis/issues?q=is%3Aissue+is%3Aclosed) page.
  


---
References: 
- [Building a Citibike Database with Python](https://medium.com/@fausto.manon/building-a-citibike-database-with-python-9849a59fb90c)  
- [Analysis and prediction of Citi Bike usage in the unpredictable 2020](https://towardsdatascience.com/analysis-and-prediction-of-citi-bike-usage-in-the-unpredictable-2020-3401da26881b)

---
## Table of Contents

- [01. Importing Libraries](#01.-Importing-Libraries)
- [02. Data Preparation](#02.-Data-Preparation)
- [03. Building a PostgreSQL Database](#03.-Building-a-PostgreSQL-Database)
- [04. Unresolved Issues](#04.-Unresolved-Issues)

---
### 01. Importing Libraries

In [1]:
# Web scraping libraries
import requests
import urllib.request
from bs4 import BeautifulSoup

# Downloading, moving and unzipping files
import webbrowser
from time import sleep
import shutil 
import os
from zipfile import ZipFile

# DataFrame exploration and manipulation
import pandas as pd
from glob import glob

# PostgreSQL interaction
import psycopg2
import linecache
from psycopg2 import sql
from psycopg2 import Error
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

---
### 02. Data Preparation

From the project directory, run `jupyter.exe notebook` to start Jupyter Notebook.

In [2]:
url = 'https://s3.amazonaws.com/tripdata/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'xml')
data_files = soup.find_all('Key')
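
The bucket root returns an XML listing in which each object name sits in a `Key` element, which is what the cell above collects. As a minimal sketch of that extraction, the following parses a small hand-written listing with the standard library's `xml.etree.ElementTree` rather than BeautifulSoup. The sample XML and the helper name `list_keys` are illustrative assumptions, not the real bucket contents.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the shape of an S3 bucket listing (illustrative only)
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>tripdata</Name>
  <Contents><Key>201306-citibike-tripdata.zip</Key></Contents>
  <Contents><Key>JC-201509-citibike-tripdata.csv.zip</Key></Contents>
  <Contents><Key>index.html</Key></Contents>
</ListBucketResult>"""


def list_keys(xml_text):
    """Return the text of every Key element, whatever namespace it is in."""
    root = ET.fromstring(xml_text)
    # Tags look like '{namespace}Key', so compare only the part after '}'
    return [el.text for el in root.iter() if el.tag.split('}')[-1] == 'Key']


keys = list_keys(SAMPLE_LISTING)
print(keys)
```

Note that the listing can include entries that are not trip archives (here `index.html`), so the names still need filtering afterwards.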

If you started Jupyter Notebook from the project directory, `os.getcwd()` returns the current path. Otherwise, set `data_loc` to the path of the folder where you store the data.

In [3]:
# Get current work directory
cwd = os.getcwd()
data_loc = os.path.join(cwd, 'data', '')  # the empty last part keeps a trailing separator
#data_loc = '..//data/'

# Create a list with picked years
years = []
for year in range(2013, 2021):
    years.append(str(year))

In the step below, we're going to retrieve the names of all the zip files from the [site](https://s3.amazonaws.com/tripdata/index.html). Our study area is NYC, but Citi Bike also extends to Jersey City and Hoboken in New Jersey, so we exclude the files whose names contain `JC`. We also keep only the files from before 2021.

In [4]:
# Instantiate empty list
zip_files = []
filter_files = []

# Populate list with zip file names
for file in range(len(data_files)-1):
    zip_files.append(data_files[file].get_text())

# Filter list with data file names
for year in years:
    for file in zip_files:
        if year in file and file not in filter_files: 
            if 'JC' not in file:
                filter_files.append(file)

# Sort the list
filter_files.sort()
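
The filtering rule can also be written as one reusable function, which makes it easy to check against a few sample names. This is only a sketch: the function name `filter_trip_files`, the extra `.zip` check, and the sample file names are assumptions added for illustration.

```python
def filter_trip_files(names, years, exclude_marker='JC'):
    """Keep .zip names that mention one of the years and do not contain exclude_marker."""
    kept = set()  # a set removes duplicates, like the 'not in filter_files' check
    for name in names:
        if not name.endswith('.zip'):
            continue  # skip non-archive entries such as index.html
        if exclude_marker in name:
            continue  # skip Jersey City files
        if any(year in name for year in years):
            kept.add(name)
    return sorted(kept)


sample_names = ['201912-citibike-tripdata.csv.zip',
                '201306-citibike-tripdata.zip',
                'JC-201509-citibike-tripdata.csv.zip',
                '202101-citibike-tripdata.csv.zip',
                'index.html']
sample_years = [str(year) for year in range(2013, 2021)]
selected = filter_trip_files(sample_names, sample_years)
print(selected)
```

Like the cell above, this matches the year anywhere in the name, so it relies on the file names starting with the year and month.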

The `webbrowser.open_new(url)` function opens the URL in the default browser.  
<div class="alert alert-block alert-warning">
Here I suggest that you <b>change the default download location</b> (usually "C:\Users\your_name\Downloads") to the project's data folder from the Settings page in the browser.
</div>

In [5]:
# Download New York City zip files
for file in filter_files:
    if not os.path.exists(data_loc + file):
        webbrowser.open_new(url + file)
        sleep(5)

After downloading all the respective .zip files, we unzip them into the data folder and delete the archives. Option 1 assumes the files were downloaded to the data folder; Option 2 reads them from the default download folder instead.  


In [6]:
source = 'C:/Users/Hui/Downloads/'
destination = data_loc
#destination = '..//data/'

#-----OPTION 1: Save in project folder-----
# Unzip files and clean up data folder
for file in filter_files:
    file_name = destination + file
    with ZipFile(file_name) as zip_ref:
        zip_ref.extractall(destination)
    os.remove(file_name)

#-----OPTION 2: Save in default folder-----
# Unzip files and clean up data folder
# for file in filter_files:
#     file_name = source + file
#     with ZipFile(file_name) as zip_ref:
#         zip_ref.extractall(destination)
#     os.remove(file_name)
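
`extractall` unpacks every member of an archive, including any sub-folders or helper files it may contain. As a hedged alternative, the sketch below extracts only the `.csv` members and writes them flat into the destination, which is what the later `os.listdir(data_loc)` step expects. The helper name `extract_csv_members` is hypothetical, and the presence of `__MACOSX` or `._` entries in some archives is an assumption.

```python
import os
import shutil
from zipfile import ZipFile


def extract_csv_members(zip_path, dest_dir):
    """Extract only the .csv members of zip_path into dest_dir, flattening folders."""
    extracted = []
    with ZipFile(zip_path) as zip_ref:
        for member in zip_ref.namelist():
            base = os.path.basename(member)
            # Skip directories, non-CSV files and macOS metadata entries
            if not base.endswith('.csv') or base.startswith('._') or member.startswith('__MACOSX'):
                continue
            target = os.path.join(dest_dir, base)
            with zip_ref.open(member) as src, open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```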

Next, we need to standardise the column headers so that every file uses the same format. In `pandas`, a column in a DataFrame can only have one data type.

In [7]:
files = os.listdir(data_loc)

for csv in files:
    if csv.endswith('.csv'):
        df = pd.read_csv(data_loc + csv)
        df = df.rename(columns=({'trip_duration':'tripduration',
                             'start_time':'starttime',
                             'stop_time':'stoptime',
                             'start_station_id':'start station id',
                             'start_station_name':'start station name',
                             'start_station_latitude':'start station latitude',
                             'start_station_longitude':'start station longitude',
                             'end_station_id':'end station id',
                             'end_station_name':'end station name',
                             'end_station_latitude':'end station latitude',
                             'end_station_longitude':'end station longitude',
                             'bike_id':'bikeid',
                             'user_type':'usertype',
                             'birth_year':'birth year',
                             'gender':'gender'}))
        df.to_csv(data_loc + csv, index = None)
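
The mapping above only catches the underscore spellings. If some monthly files use other variants (for example capitalised headers such as `Trip Duration`, which is an assumption here), a normalising function is more forgiving. The sketch below reduces each header to lowercase letters and digits before looking it up; `CANONICAL`, `LOOKUP` and `normalize_header` are hypothetical names.

```python
# Canonical headers used by the rest of the notebook
CANONICAL = ['tripduration', 'starttime', 'stoptime',
             'start station id', 'start station name',
             'start station latitude', 'start station longitude',
             'end station id', 'end station name',
             'end station latitude', 'end station longitude',
             'bikeid', 'usertype', 'birth year', 'gender']


def _key(name):
    """Reduce a header to lowercase letters and digits, e.g. 'Trip Duration' -> 'tripduration'."""
    return ''.join(ch for ch in name.lower() if ch.isalnum())


LOOKUP = {_key(name): name for name in CANONICAL}


def normalize_header(name):
    """Map any spelling variant to its canonical header; leave unknown names unchanged."""
    return LOOKUP.get(_key(name), name)


print([normalize_header(h) for h in ['Trip Duration', 'start_station_id', 'Birth Year']])
```

Because `DataFrame.rename` accepts a function, this could be applied as `df = df.rename(columns=normalize_header)`.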

Now that the column headers are consistent across all the downloaded data, we separate the files by year into folders named from 2013 to 2020. Then, we merge all the `.csv` files in each folder into one.

In [8]:
# Create new folders
for year in years:
    year_dir = os.path.join(data_loc, year)
    if not os.path.exists(year_dir):
        os.mkdir(year_dir)
    
# Move from project folder to folders named by year
for item in os.listdir(data_loc):
    for year in years:
        if year in item and item.endswith('.csv'):
            shutil.move(data_loc + item, data_loc + year)

# Merge files
for year in years:
    ny_files = sorted(glob(os.path.join(data_loc, year, '*.csv')))
    ny_trip_data = pd.concat((pd.read_csv(file) for file in ny_files), ignore_index = True)
    ny_trip_data.to_csv(data_loc + year + '/ny_trip_data_' + year + '.csv', index = False)
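
`pd.concat` loads a whole year of trips into memory, and because the merged file is written into the same folder, running the cell again would pick it up as an input. As a sketch of one way around both points, the function below streams rows with the standard `csv` module, writes the header once, and skips a previous merge result. The name `merge_csv_files` is hypothetical, and it assumes every input file has an identical header.

```python
import csv
import os
from glob import glob


def merge_csv_files(folder, output_name):
    """Concatenate every CSV in folder into output_name, writing the header once."""
    output_path = os.path.join(folder, output_name)
    # Exclude a previous merge result so re-running does not duplicate rows
    inputs = sorted(p for p in glob(os.path.join(folder, '*.csv'))
                    if os.path.abspath(p) != os.path.abspath(output_path))
    header = None
    with open(output_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for path in inputs:
            with open(path, newline='') as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f'{path} has a different header: {file_header}')
                writer.writerows(reader)
    return output_path
```

Raising on a header mismatch is a deliberate choice: silently merging files with different columns would produce rows of different widths, which the database import cannot load.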

---
### 03. Building a PostgreSQL Database
Please establish a connection with PostgreSQL and create a new database.

In [9]:
# Connect to PostgreSQL
conn = psycopg2.connect("user=postgres password='password'")
# CREATE DATABASE cannot run inside a transaction block, so enable autocommit
conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)

# Obtain a DB cursor
cursor = conn.cursor()
db_name = "citibike_data"

# Create DB in PostgreSQL
create_database = f"CREATE DATABASE {db_name};"
cursor.execute(create_database)
print("Database created successfully in PostgreSQL")

# Closing PostgreSQL connection
if conn:
    cursor.close()
    conn.close()
    print("PostgreSQL connection is closed")

Database created successfully in PostgreSQL
PostgreSQL connection is closed


In [10]:
# Building a table for New York City data
try:
    conn = psycopg2.connect(user = 'postgres',
                           password = 'password',
                           host = '127.0.0.1',
                           port = '5432',
                           database = 'citibike_data')
    
    cursor = conn.cursor()
    
    # Some data types have been amended below to account for all data in the csv (i.e. blank cells)
    create_table_query = '''CREATE TABLE new_york_city(
                            trip_duration INT,
                            start_time TIMESTAMP,
                            stop_time TIMESTAMP,
                            start_station_id INT,
                            start_station_name TEXT,
                            start_station_latitude FLOAT,
                            start_station_longitude FLOAT,
                            end_station_id INT,
                            end_station_name TEXT,
                            end_station_latitude FLOAT,
                            end_station_longitude FLOAT,
                            bike_id INT,
                            user_type TEXT,
                            birth_year TEXT,
                            gender INT
                            ); '''
    
    cursor.execute(create_table_query)
    conn.commit()
    print("Table created successfully in PostgreSQL")

except (Exception, psycopg2.DatabaseError) as error:
    print("Error while creating PostgreSQL table:", error)
finally:
    # closing database connection
    if (conn):
        cursor.close()
        conn.close()
        print("PostgreSQL connection is closed")

Table created successfully in PostgreSQL
PostgreSQL connection is closed


In [11]:
# Populating the NYC table
try:
    conn = psycopg2.connect(user = 'postgres',
                           password = 'password',
                           host = '127.0.0.1',
                           port = '5432',
                           database = 'citibike_data')
    
    cursor = conn.cursor()
    
    for year in years:
        if year == '2013' or year == '2016': # skipped: these years hit a syntax issue
            continue
        filename = data_loc + year + '/ny_trip_data_' + year + '.csv'
        with open(filename, 'r') as data:
            next(data) # skip the header row
            cursor.copy_from(data, 'new_york_city', sep=',')
        
        conn.commit()
    
    print("Table updated successfully in PostgreSQL ")
        
except (Exception, psycopg2.DatabaseError) as error:
    print ("Error while updating PostgreSQL table:", error)
    
finally:
    if conn:
        cursor.close()
        conn.close()
        print("PostgreSQL connection is closed")

Error while updating PostgreSQL table: extra data after last expected column
CONTEXT:  COPY new_york_city, line 1: "680.0,2017-01-01 00:00:21,2017-01-01 00:11:41,3226.0,W 82 St & Central Park West,40.78275,-73.97137,..."

PostgreSQL connection is closed
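
The error says that a row holds more fields than the 15 columns of `new_york_city`. Two plausible causes, neither confirmed here, are monthly files whose differently spelled headers became extra columns during the merge, and quoted values containing commas, which `copy_from` does not interpret as CSV quoting. A first diagnostic step could be to count the fields per row with a real CSV parser. The helper name `find_bad_rows` is hypothetical.

```python
import csv

EXPECTED_COLUMNS = 15  # number of columns declared in CREATE TABLE new_york_city


def find_bad_rows(path, expected=EXPECTED_COLUMNS, limit=5):
    """Return up to `limit` (row_number, field_count) pairs whose width differs from expected."""
    bad = []
    with open(path, newline='') as f:
        # Row numbers start at 1, so the header is row 1
        for row_number, row in enumerate(csv.reader(f), start=1):
            if len(row) != expected:
                bad.append((row_number, len(row)))
                if len(bad) >= limit:
                    break
    return bad
```

If the header itself is reported, the merged file has more columns than the table, which points back to the header standardisation step. If only some data rows are reported under the raw comma split but not here, quoting is the likely cause, and `cursor.copy_expert` with a `COPY ... FROM STDIN WITH (FORMAT csv)` statement may be worth trying. Note also that values such as `680.0` are not valid input for an `INT` column, so the column types may need revisiting once the column count is fixed.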


### 04. Unresolved Issues

- ~~separate the data by years~~ 
- ~~improve pd.read() function~~
- PostgreSQL error -> 'Error while updating PostgreSQL table: extra data after last expected column'