## Data Engineering Summary


#### This notebook provides recommendations on the best path for accessing citibike data
##### The two data access recommendations are:
- A: Master csv file should be subsampled for global analysis
    - i.e. how has ridership changed over the 10+ year history of the company
<br>
- B: Local sqlite server should be used for local analysis
    - i.e. give me all data with start station == times square for during the pandemic
    - i.e. group by start station and get average trip duration
<br>

##### Method A helps to reduce the full dataset down to a subsample that is representative of the program history
##### Method B helps the user to access a specific subset of data without having to load all data into memory
- i.e. without the sql database you would have to load all relevant data then do group by

    

### Method A - Master CSV Subsampling

##### read an example full dataset from aws

In [6]:
import pandas as pd
import random

# Reading data directly from aws using "Link Address"
df_aws_full = pd.read_csv('https://schwinning.s3.us-east-2.amazonaws.com/JC-201909-citibike-tripdata.csv.zip')

# FileNotFoundError: [Errno 2] No such file or directory: 'https://schwinning.s3.us-east-2.amazonaws.com/JC-201909-citibike-tripdata.csv.zip'
    # filename = "https://schwinning.s3.us-east-2.amazonaws.com/JC-201909-citibike-tripdata.csv.zip"
    # n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    # s = 100 #desired sample size
    # skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    # df = pd.read_csv(filename, skiprows=skip)
    
df_aws_full.shape

(49244, 15)

##### read an example full dataset from aws and randomly pull 5% of it

In [8]:
import pandas as pd
import random

filename = "https://schwinning.s3.us-east-2.amazonaws.com/JC-201909-citibike-tripdata.csv.zip"

p = 0.05  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df_aws_percentage = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)

df_aws_percentage.shape

(2465, 15)

##### read THE full dataset from aws and randomly pull 0.01% of it

In [10]:
filename = "https://schwinning.s3.us-east-2.amazonaws.com/combined.csv"

p = 0.0001  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df_aws_combined_percentage = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)##### read an example full dataset from aws

df_aws_combined_percentage.shape

### Method B: Local SQLite Server

#### load necessary libraries and map connection to the local sqlite server

In [12]:
# !pip install ipython-sql
%load_ext sql
%sql sqlite:////Users/michaellink/Desktop/__NYCDSA/_Projects/Capstone/data/citibike/sqlite/citi_bike.db

#### run standard sql query

In [19]:
%sql SELECT * FROM yr_2019 LIMIT 2;

 * sqlite:////Users/michaellink/Desktop/__NYCDSA/_Projects/Capstone/data/citibike/sqlite/citi_bike.db
Done.


tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender
20,2019-01-01 00:01:47.4010,2019-01-01 00:07:07.5810,3160,Central Park West & W 76 St,40.77896784,-73.97374737,3283,W 89 St & Columbus Ave,40.7882213,-73.97041561,15839,Subscriber,1971,1
16,2019-01-01 00:04:43.7360,2019-01-01 00:10:00.6080,519,Pershing Square North,40.751873,-73.977706,518,E 39 St & 2 Ave,40.74780373,-73.9734419,32723,Subscriber,1964,1


#### save standard sql query as pandas dataframe

In [20]:
import pandas as pd

result = %sql SELECT * FROM yr_2019 LIMIT 5;
df = result.DataFrame()
print(type(df))
df

 * sqlite:////Users/michaellink/Desktop/__NYCDSA/_Projects/Capstone/data/citibike/sqlite/citi_bike.db
Done.
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender
0,20,2019-01-01 00:01:47.4010,2019-01-01 00:07:07.5810,3160,Central Park West & W 76 St,40.778968,-73.973747,3283,W 89 St & Columbus Ave,40.788221,-73.970416,15839,Subscriber,1971,1
1,16,2019-01-01 00:04:43.7360,2019-01-01 00:10:00.6080,519,Pershing Square North,40.751873,-73.977706,518,E 39 St & 2 Ave,40.747804,-73.973442,32723,Subscriber,1964,1
2,91,2019-01-01 00:06:03.9970,2019-01-01 00:15:55.4380,3171,Amsterdam Ave & W 82 St,40.785247,-73.976673,3154,E 77 St & 3 Ave,40.773142,-73.958562,27451,Subscriber,1987,1
3,719,2019-01-01 00:07:03.5450,2019-01-01 00:52:22.6500,504,1 Ave & E 16 St,40.732219,-73.981656,3709,W 15 St & 6 Ave,40.738046,-73.99643,21579,Subscriber,1990,1
4,3,2019-01-01 00:07:35.9450,2019-01-01 00:12:39.5020,229,Great Jones St,40.727434,-73.99379,503,E 20 St & Park Ave,40.738274,-73.98752,35379,Subscriber,1979,1


#### demonstration of randomly sampling from table need to speed up

In [38]:
# Approximately 20 seconds. Not a big problem if we run once and save as csv
results = %sql select * from yr_2019 order by random() limit 10;
df_random = result.DataFrame()
df_random.head(5)

 * sqlite:////Users/michaellink/Desktop/__NYCDSA/_Projects/Capstone/data/citibike/sqlite/citi_bike.db
(sqlite3.OperationalError) wrong number of arguments to function random()
[SQL: select random(1,2) from yr_2019 limit 10;]
(Background on this error at: http://sqlalche.me/e/e3q8)


Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender
0,20,2019-01-01 00:01:47.4010,2019-01-01 00:07:07.5810,3160,Central Park West & W 76 St,40.778968,-73.973747,3283,W 89 St & Columbus Ave,40.788221,-73.970416,15839,Subscriber,1971,1
1,16,2019-01-01 00:04:43.7360,2019-01-01 00:10:00.6080,519,Pershing Square North,40.751873,-73.977706,518,E 39 St & 2 Ave,40.747804,-73.973442,32723,Subscriber,1964,1
2,91,2019-01-01 00:06:03.9970,2019-01-01 00:15:55.4380,3171,Amsterdam Ave & W 82 St,40.785247,-73.976673,3154,E 77 St & 3 Ave,40.773142,-73.958562,27451,Subscriber,1987,1
3,719,2019-01-01 00:07:03.5450,2019-01-01 00:52:22.6500,504,1 Ave & E 16 St,40.732219,-73.981656,3709,W 15 St & 6 Ave,40.738046,-73.99643,21579,Subscriber,1990,1
4,3,2019-01-01 00:07:35.9450,2019-01-01 00:12:39.5020,229,Great Jones St,40.727434,-73.99379,503,E 20 St & Park Ave,40.738274,-73.98752,35379,Subscriber,1979,1


In [21]:
results = %sql SELECT start_station_name from yr_2019;


 * sqlite:////Users/michaellink/Desktop/__NYCDSA/_Projects/Capstone/data/citibike/sqlite/citi_bike.db
Done.


In [24]:
df_2019 = results.DataFrame()

In [25]:
df_2019.head(2)

Unnamed: 0,start_station_name
0,Central Park West & W 76 St
1,Pershing Square North


In [26]:
# df_2019.start_station_name.value_counts()
df_2019.start_station_name.nunique()

992

In [30]:
results = %sql SELECT start_station_name from yr_2014;
df_2014 = results.DataFrame()

 * sqlite:////Users/michaellink/Desktop/__NYCDSA/_Projects/Capstone/data/citibike/sqlite/citi_bike.db
Done.


In [32]:
# df_2019.start_station_name.value_counts()
df_2014.start_station_name.nunique()

344

In [33]:
results = %sql SELECT start_station_name from yr_2015;
df_2015 = results.DataFrame()

 * sqlite:////Users/michaellink/Desktop/__NYCDSA/_Projects/Capstone/data/citibike/sqlite/citi_bike.db
Done.


In [35]:
# df_2019.start_station_name.value_counts()
df_2015.start_station_name.nunique()

532

### Future Direction
- pick global random sampling method
    - count unique imbalanced features and calculate ratio 
        - (dock station A is 24% of global population, dock station B is 13% of global population, etc.)
    - run multiple times with various subsample ratio (1% subsample, 5% subsample, 10% subsample)
        - (dock station A is 22% of subsample population, dock station B is 16% of subsample population, etc.)
    - plot histogram of subsample 