## Compile 2016 through 2019 Citi Bike ride data for Year-Over-Year (YOY) analysis
## Script 1 of 2

> Many of the questions posed by Citi Bike management require trend analysis over multiple years. Because the source data is so large, I confined the analysis to the busiest ride season: July, August, and September. These months enable study of peak rider participation for both members and riders purchasing one-day or three-day passes.

> **The YOY data is prepared using two different Jupyter Notebook scripts. This first script** reads in the source data for Jersey City and New York, which are stored in separate files on the Citi Bike web site. Next, the start and stop time columns are converted to the datetime data type. Then, the Jersey City columns are renamed to match the New York columns (necessary for the 2016 months only), and the Jersey City and New York records are combined into a single dataframe representing the month. The script then assigns each record a random integer and takes it modulo 10 to split the records randomly into 10 representative groups. Finally, one of those groups (about 10% of the month's records) is kept, which reduces the sample to a size that is more manageable for Tableau.

>NOTE: the **downloaded source data** is *not* cloned to GitHub due to its size. Files with sample source records for 201607 are included in the 'Resources' folder on GitHub. Also, the **output files** generated by this script are *not* cloned to GitHub due to size constraints. An output file with sample records for 201607 is included in the 'Resources' folder on GitHub.

In [1]:
# import libraries
import pandas as pd
import numpy as np

## 2016

### July 2016

#### read in the source csv files containing the Jersey City (JC) and New York (NY) ride data

In [2]:
file_path = ('Resources/YearOverYear/JC-201607-citibike-tripdata.csv')
JC_1607_df = pd.read_csv(file_path)

In [3]:
file_path = ('Resources/YearOverYear/201607-citibike-tripdata.csv')
NY_1607_df = pd.read_csv(file_path)

#### convert start and stop columns to datetime data type

In [4]:
JC_1607_df['Start Time'] = pd.to_datetime(JC_1607_df['Start Time'])
JC_1607_df['Stop Time'] = pd.to_datetime(JC_1607_df['Stop Time'])
NY_1607_df['starttime'] = pd.to_datetime(NY_1607_df['starttime'])
NY_1607_df['stoptime'] = pd.to_datetime(NY_1607_df['stoptime'])

#### rename the Jersey City columns to match the column names for New York (necessary for 2016 months only)

In [5]:
JC_1607_df = JC_1607_df.rename(columns={"Trip Duration":"tripduration", 
                                          "Start Time":"starttime",
                                          "Stop Time":"stoptime",
                                          "Start Station ID":"start station id",
                                          "Start Station Name":"start station name",
                                          "Start Station Latitude":"start station latitude",
                                          "Start Station Longitude":"start station longitude",
                                          "End Station ID":"end station id",
                                          "End Station Name":"end station name",
                                          "End Station Latitude":"end station latitude",
                                          "End Station Longitude":"end station longitude",
                                          "Bike ID":"bikeid",
                                          "User Type":"usertype",
                                          "Birth Year":"birth year",
                                          "Gender":"gender"})
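Because the combine step relies on the two files sharing column names, a quick check after the rename can catch a missed or misspelled column. The following is a minimal sketch on toy data, not part of the original notebook; the frames `jc_df` and `ny_df`, their values, and the shortened `rename_map` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the Jersey City and New York frames (hypothetical values).
jc_df = pd.DataFrame({"Trip Duration": [300],
                      "Start Time": ["2016-07-01 00:01:00"],
                      "Bike ID": [24000]})
ny_df = pd.DataFrame({"tripduration": [450],
                      "starttime": ["2016-07-01 00:00:11"],
                      "bikeid": [17000]})

# Shortened version of the rename mapping used above.
rename_map = {"Trip Duration": "tripduration",
              "Start Time": "starttime",
              "Bike ID": "bikeid"}
jc_df = jc_df.rename(columns=rename_map)

# Columns present in one frame but not the other; both sets should be empty.
only_in_jc = set(jc_df.columns) - set(ny_df.columns)
only_in_ny = set(ny_df.columns) - set(jc_df.columns)
print("Only in JC:", only_in_jc)
print("Only in NY:", only_in_ny)
```

If either set is non-empty, combining the frames would produce extra columns filled with missing values rather than raising an error, so the mismatch is easy to overlook.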

#### combine the New York and Jersey City ride data into a single dataframe representing the month

In [6]:
citi_201607_df = pd.concat([NY_1607_df, JC_1607_df], ignore_index=True)

#### assign a random integer and perform a modulus operation to assign the data to 10 representative groups

In [7]:
citi_201607_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201607_df),1))

In [8]:
citi_201607_df['mod10'] = citi_201607_df['Random_integer'] % 10

In [9]:
# check the randomly created groups
citi_201607_df['mod10'].value_counts()

2    141336
4    141131
9    140886
3    140712
1    140406
6    140055
0    140026
8    140016
7    139990
5    139988
Name: mod10, dtype: int64
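The group counts above change on every run because the random integers are not seeded. As a hedged sketch (an alternative, not what this notebook does), a seeded NumPy generator makes the split repeatable, and `DataFrame.sample` can draw a 10% sample directly. The toy frame `rides_df` and the seed value 42 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a combined monthly dataframe (hypothetical data).
rides_df = pd.DataFrame({"tripduration": np.arange(1000)})

# A seeded generator makes the group assignment repeatable across runs.
rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary assumption
rides_df["mod10"] = rng.integers(0, 100000, size=len(rides_df)) % 10

# Keep one of the ten groups, then drop the helper column.
sample_df = rides_df.loc[rides_df["mod10"] == 7].drop(columns="mod10")

# pandas can also draw a 10% sample directly, without helper columns.
direct_sample_df = rides_df.drop(columns="mod10").sample(frac=0.1, random_state=42)

print(len(sample_df), len(direct_sample_df))
```

The modulus approach yields approximately 10% of the rows, while `sample(frac=0.1)` yields exactly 10% (rounded); either is a simple random sample of the month.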

#### create a new sample dataframe containing 10% of the month's ride data; here, using records with 'mod10' = 7

In [10]:
sample_201607_df = citi_201607_df.loc[citi_201607_df["mod10"] == 7]

#### drop 'mod10' and 'Random_integer' columns to reduce file size

In [12]:
sample_201607_df = sample_201607_df.drop('mod10', axis=1)
sample_201607_df = sample_201607_df.drop('Random_integer', axis=1)

#### output the sample dataframe to csv for further processing in subsequent scripts

In [14]:
sample_201607_df.to_csv("Resources/YOY_citi_201607.csv", index=False)

## Documentation Note for Following Cells

> The following cells repeat the July 2016 steps for each of the remaining months selected for the YOY analysis. The step descriptions are the same as those shown above for July 2016 and are not repeated. The column-renaming step applies to the 2016 months only; from 2017 on, the Jersey City files already use the New York column names.
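Since the same steps repeat for every month, they could also be wrapped in one helper function. The sketch below is one possible way to do that under stated assumptions, not the notebook's actual code: the function name `build_monthly_sample`, its parameters, and the toy frames are all hypothetical.

```python
import numpy as np
import pandas as pd

def build_monthly_sample(ny_df, jc_df, jc_renames=None, keep_group=7, seed=None):
    """Combine NY and JC rides for one month and keep roughly 10% of the rows."""
    # Rename JC columns when a mapping is given (needed for the 2016 files only).
    jc_df = jc_df.rename(columns=jc_renames or {})
    ny_df = ny_df.copy()  # avoid modifying the caller's frame

    # Convert start and stop columns to datetime before combining.
    for frame in (ny_df, jc_df):
        for col in ("starttime", "stoptime"):
            frame[col] = pd.to_datetime(frame[col])

    combined = pd.concat([ny_df, jc_df], ignore_index=True)

    # Assign each row to one of 10 random groups and keep a single group.
    rng = np.random.default_rng(seed)
    groups = rng.integers(0, 100000, size=len(combined)) % 10
    return combined.loc[groups == keep_group].reset_index(drop=True)

# Toy usage with hypothetical data in place of the downloaded csv files.
ny_toy = pd.DataFrame({"starttime": ["2017-07-01 00:00:00"] * 50,
                       "stoptime": ["2017-07-01 00:10:00"] * 50})
jc_toy = pd.DataFrame({"Start Time": ["2017-07-01 08:00:00"] * 50,
                       "Stop Time": ["2017-07-01 08:15:00"] * 50})
sample_df = build_monthly_sample(
    ny_toy, jc_toy,
    jc_renames={"Start Time": "starttime", "Stop Time": "stoptime"},
    seed=0,
)
print(sample_df.dtypes)
```

In the notebook, the two input frames would come from `pd.read_csv` on the monthly files, and the returned frame would be written out with `to_csv` as in the cells above.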

### August 2016

In [15]:
file_path = ('Resources/YearOverYear/JC-201608-citibike-tripdata.csv')
JC_1608_df = pd.read_csv(file_path)

In [16]:
file_path = ('Resources/YearOverYear/201608-citibike-tripdata.csv')
NY_1608_df = pd.read_csv(file_path)

In [17]:
JC_1608_df['Start Time'] = pd.to_datetime(JC_1608_df['Start Time'])
JC_1608_df['Stop Time'] = pd.to_datetime(JC_1608_df['Stop Time'])
NY_1608_df['starttime'] = pd.to_datetime(NY_1608_df['starttime'])
NY_1608_df['stoptime'] = pd.to_datetime(NY_1608_df['stoptime'])

In [18]:
JC_1608_df = JC_1608_df.rename(columns={"Trip Duration":"tripduration", 
                                          "Start Time":"starttime",
                                          "Stop Time":"stoptime",
                                          "Start Station ID":"start station id",
                                          "Start Station Name":"start station name",
                                          "Start Station Latitude":"start station latitude",
                                          "Start Station Longitude":"start station longitude",
                                          "End Station ID":"end station id",
                                          "End Station Name":"end station name",
                                          "End Station Latitude":"end station latitude",
                                          "End Station Longitude":"end station longitude",
                                          "Bike ID":"bikeid",
                                          "User Type":"usertype",
                                          "Birth Year":"birth year",
                                          "Gender":"gender"})

In [19]:
citi_201608_df = pd.concat([NY_1608_df, JC_1608_df], ignore_index=True)

In [20]:
citi_201608_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201608_df),1))

In [21]:
citi_201608_df['mod10'] = citi_201608_df['Random_integer'] % 10

In [22]:
citi_201608_df['mod10'].value_counts()

3    159542
7    159497
9    159457
2    159256
0    159151
4    159147
1    159029
8    159028
5    158973
6    158732
Name: mod10, dtype: int64

In [23]:
sample_201608_df = citi_201608_df.loc[citi_201608_df["mod10"] == 7]

In [24]:
sample_201608_df = sample_201608_df.drop('mod10', axis=1)
sample_201608_df = sample_201608_df.drop('Random_integer', axis=1)

In [25]:
sample_201608_df.to_csv("Resources/YOY_citi_201608.csv", index=False)

### September 2016

In [26]:
file_path = ('Resources/YearOverYear/JC-201609-citibike-tripdata.csv')
JC_1609_df = pd.read_csv(file_path)

In [27]:
file_path = ('Resources/YearOverYear/201609-citibike-tripdata.csv')
NY_1609_df = pd.read_csv(file_path)

In [28]:
JC_1609_df['Start Time'] = pd.to_datetime(JC_1609_df['Start Time'])
JC_1609_df['Stop Time'] = pd.to_datetime(JC_1609_df['Stop Time'])
NY_1609_df['starttime'] = pd.to_datetime(NY_1609_df['starttime'])
NY_1609_df['stoptime'] = pd.to_datetime(NY_1609_df['stoptime'])

In [29]:
JC_1609_df = JC_1609_df.rename(columns={"Trip Duration":"tripduration", 
                                          "Start Time":"starttime",
                                          "Stop Time":"stoptime",
                                          "Start Station ID":"start station id",
                                          "Start Station Name":"start station name",
                                          "Start Station Latitude":"start station latitude",
                                          "Start Station Longitude":"start station longitude",
                                          "End Station ID":"end station id",
                                          "End Station Name":"end station name",
                                          "End Station Latitude":"end station latitude",
                                          "End Station Longitude":"end station longitude",
                                          "Bike ID":"bikeid",
                                          "User Type":"usertype",
                                          "Birth Year":"birth year",
                                          "Gender":"gender"})

In [30]:
citi_201609_df = pd.concat([NY_1609_df, JC_1609_df], ignore_index=True)

In [31]:
citi_201609_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201609_df),1))

In [32]:
citi_201609_df['mod10'] = citi_201609_df['Random_integer'] % 10

In [33]:
citi_201609_df['mod10'].value_counts()

4    168714
5    168489
7    168310
3    168265
1    168228
2    168225
8    168188
0    168186
9    167898
6    167778
Name: mod10, dtype: int64

In [34]:
sample_201609_df = citi_201609_df.loc[citi_201609_df["mod10"] == 7]

In [35]:
sample_201609_df = sample_201609_df.drop('mod10', axis=1)
sample_201609_df = sample_201609_df.drop('Random_integer', axis=1)

In [36]:
sample_201609_df.to_csv("Resources/YOY_citi_201609.csv", index=False)

## 2017

### July 2017

In [37]:
file_path = ('Resources/YearOverYear/JC-201707-citibike-tripdata.csv')
JC_1707_df = pd.read_csv(file_path)

In [38]:
file_path = ('Resources/YearOverYear/201707-citibike-tripdata.csv')
NY_1707_df = pd.read_csv(file_path)

In [39]:
JC_1707_df['starttime'] = pd.to_datetime(JC_1707_df['starttime'])
JC_1707_df['stoptime'] = pd.to_datetime(JC_1707_df['stoptime'])
NY_1707_df['starttime'] = pd.to_datetime(NY_1707_df['starttime'])
NY_1707_df['stoptime'] = pd.to_datetime(NY_1707_df['stoptime'])

In [40]:
citi_201707_df = pd.concat([NY_1707_df, JC_1707_df], ignore_index=True)

In [41]:
citi_201707_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201707_df),1))

In [42]:
citi_201707_df['mod10'] = citi_201707_df['Random_integer'] % 10

In [43]:
citi_201707_df['mod10'].value_counts()

8    177422
2    177384
3    177280
1    177066
5    177000
0    177000
4    176956
7    176930
9    176155
6    175979
Name: mod10, dtype: int64

In [44]:
sample_201707_df = citi_201707_df.loc[citi_201707_df["mod10"] == 7]

In [45]:
sample_201707_df = sample_201707_df.drop('mod10', axis=1)
sample_201707_df = sample_201707_df.drop('Random_integer', axis=1)

In [46]:
sample_201707_df.to_csv("Resources/YOY_citi_201707.csv", index=False)

### August 2017

In [47]:
file_path = ('Resources/YearOverYear/JC-201708-citibike-tripdata.csv')
JC_1708_df = pd.read_csv(file_path)

In [48]:
file_path = ('Resources/YearOverYear/201708-citibike-tripdata.csv')
NY_1708_df = pd.read_csv(file_path)

In [49]:
JC_1708_df['starttime'] = pd.to_datetime(JC_1708_df['starttime'])
JC_1708_df['stoptime'] = pd.to_datetime(JC_1708_df['stoptime'])
NY_1708_df['starttime'] = pd.to_datetime(NY_1708_df['starttime'])
NY_1708_df['stoptime'] = pd.to_datetime(NY_1708_df['stoptime'])

In [50]:
citi_201708_df = pd.concat([NY_1708_df, JC_1708_df], ignore_index=True)

In [51]:
citi_201708_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201708_df),1))

In [52]:
citi_201708_df['mod10'] = citi_201708_df['Random_integer'] % 10

In [53]:
citi_201708_df['mod10'].value_counts()

6    185838
0    185417
4    185288
1    185224
3    185222
7    185105
5    185078
8    185050
2    184929
9    184819
Name: mod10, dtype: int64

In [54]:
sample_201708_df = citi_201708_df.loc[citi_201708_df["mod10"] == 7]

In [55]:
sample_201708_df = sample_201708_df.drop('mod10', axis=1)
sample_201708_df = sample_201708_df.drop('Random_integer', axis=1)

In [56]:
sample_201708_df.to_csv("Resources/YOY_citi_201708.csv", index=False)

### September 2017

In [57]:
file_path = ('Resources/YearOverYear/JC-201709-citibike-tripdata.csv')
JC_1709_df = pd.read_csv(file_path)

In [58]:
file_path = ('Resources/YearOverYear/201709-citibike-tripdata.csv')
NY_1709_df = pd.read_csv(file_path)

In [59]:
JC_1709_df['starttime'] = pd.to_datetime(JC_1709_df['starttime'])
JC_1709_df['stoptime'] = pd.to_datetime(JC_1709_df['stoptime'])
NY_1709_df['starttime'] = pd.to_datetime(NY_1709_df['starttime'])
NY_1709_df['stoptime'] = pd.to_datetime(NY_1709_df['stoptime'])

In [60]:
citi_201709_df = pd.concat([NY_1709_df, JC_1709_df], ignore_index=True)

In [61]:
citi_201709_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201709_df),1))

In [62]:
citi_201709_df['mod10'] = citi_201709_df['Random_integer'] % 10

In [63]:
citi_201709_df['mod10'].value_counts()

5    191352
3    191310
1    191236
6    191194
9    191157
2    191138
4    191066
8    191033
7    191005
0    190726
Name: mod10, dtype: int64

In [64]:
sample_201709_df = citi_201709_df.loc[citi_201709_df["mod10"] == 7]

In [65]:
sample_201709_df = sample_201709_df.drop('mod10', axis=1)
sample_201709_df = sample_201709_df.drop('Random_integer', axis=1)

In [66]:
sample_201709_df.to_csv("Resources/YOY_citi_201709.csv", index=False)

## 2018

### July 2018

In [67]:
file_path = ('Resources/YearOverYear/JC-201807-citibike-tripdata.csv')
JC_1807_df = pd.read_csv(file_path)

In [68]:
file_path = ('Resources/YearOverYear/201807-citibike-tripdata.csv')
NY_1807_df = pd.read_csv(file_path)

In [69]:
JC_1807_df['starttime'] = pd.to_datetime(JC_1807_df['starttime'])
JC_1807_df['stoptime'] = pd.to_datetime(JC_1807_df['stoptime'])
NY_1807_df['starttime'] = pd.to_datetime(NY_1807_df['starttime'])
NY_1807_df['stoptime'] = pd.to_datetime(NY_1807_df['stoptime'])

In [70]:
citi_201807_df = pd.concat([NY_1807_df, JC_1807_df], ignore_index=True)

In [71]:
citi_201807_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201807_df),1))

In [72]:
citi_201807_df['mod10'] = citi_201807_df['Random_integer'] % 10

In [73]:
citi_201807_df['mod10'].value_counts()

0    196546
4    196404
5    196018
2    195974
3    195789
7    195462
9    195068
1    195066
6    194833
8    194733
Name: mod10, dtype: int64

In [74]:
sample_201807_df = citi_201807_df.loc[citi_201807_df["mod10"] == 7]

In [75]:
sample_201807_df = sample_201807_df.drop('mod10', axis=1)
sample_201807_df = sample_201807_df.drop('Random_integer', axis=1)

In [76]:
sample_201807_df.to_csv("Resources/YOY_citi_201807.csv", index=False)

### August 2018

In [77]:
file_path = ('Resources/YearOverYear/JC-201808-citibike-tripdata.csv')
JC_1808_df = pd.read_csv(file_path)

In [78]:
file_path = ('Resources/YearOverYear/201808-citibike-tripdata.csv')
NY_1808_df = pd.read_csv(file_path)

In [79]:
JC_1808_df['starttime'] = pd.to_datetime(JC_1808_df['starttime'])
JC_1808_df['stoptime'] = pd.to_datetime(JC_1808_df['stoptime'])
NY_1808_df['starttime'] = pd.to_datetime(NY_1808_df['starttime'])
NY_1808_df['stoptime'] = pd.to_datetime(NY_1808_df['stoptime'])

In [80]:
citi_201808_df = pd.concat([NY_1808_df, JC_1808_df], ignore_index=True)

In [81]:
citi_201808_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201808_df),1))

In [82]:
citi_201808_df['mod10'] = citi_201808_df['Random_integer'] % 10

In [83]:
citi_201808_df['mod10'].value_counts()

9    203144
8    202602
4    202423
6    202396
2    202352
7    202277
5    201817
0    201817
3    201596
1    201185
Name: mod10, dtype: int64

In [84]:
sample_201808_df = citi_201808_df.loc[citi_201808_df["mod10"] == 7]

In [85]:
sample_201808_df = sample_201808_df.drop('mod10', axis=1)
sample_201808_df = sample_201808_df.drop('Random_integer', axis=1)

In [86]:
sample_201808_df.to_csv("Resources/YOY_citi_201808.csv", index=False)

### September 2018

In [87]:
file_path = ('Resources/YearOverYear/JC-201809-citibike-tripdata.csv')
JC_1809_df = pd.read_csv(file_path)

In [88]:
file_path = ('Resources/YearOverYear/201809-citibike-tripdata.csv')
NY_1809_df = pd.read_csv(file_path)

In [89]:
JC_1809_df['starttime'] = pd.to_datetime(JC_1809_df['starttime'])
JC_1809_df['stoptime'] = pd.to_datetime(JC_1809_df['stoptime'])
NY_1809_df['starttime'] = pd.to_datetime(NY_1809_df['starttime'])
NY_1809_df['stoptime'] = pd.to_datetime(NY_1809_df['stoptime'])

In [90]:
citi_201809_df = pd.concat([NY_1809_df, JC_1809_df], ignore_index=True)

In [91]:
citi_201809_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201809_df),1))

In [92]:
citi_201809_df['mod10'] = citi_201809_df['Random_integer'] % 10

In [93]:
citi_201809_df['mod10'].value_counts()

0    192185
3    191976
5    191926
1    191926
6    191887
2    191740
7    191632
4    191443
8    191121
9    191070
Name: mod10, dtype: int64

In [94]:
sample_201809_df = citi_201809_df.loc[citi_201809_df["mod10"] == 7]

In [95]:
sample_201809_df = sample_201809_df.drop('mod10', axis=1)
sample_201809_df = sample_201809_df.drop('Random_integer', axis=1)

In [96]:
sample_201809_df.to_csv("Resources/YOY_citi_201809.csv", index=False)

## 2019

### July 2019

In [97]:
file_path = ('Resources/YearOverYear/JC-201907-citibike-tripdata.csv')
JC_1907_df = pd.read_csv(file_path)

In [98]:
file_path = ('Resources/YearOverYear/201907-citibike-tripdata.csv')
NY_1907_df = pd.read_csv(file_path)

In [99]:
JC_1907_df['starttime'] = pd.to_datetime(JC_1907_df['starttime'])
JC_1907_df['stoptime'] = pd.to_datetime(JC_1907_df['stoptime'])
NY_1907_df['starttime'] = pd.to_datetime(NY_1907_df['starttime'])
NY_1907_df['stoptime'] = pd.to_datetime(NY_1907_df['stoptime'])

In [100]:
citi_201907_df = pd.concat([NY_1907_df, JC_1907_df], ignore_index=True)

In [101]:
citi_201907_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201907_df),1))

In [102]:
citi_201907_df['mod10'] = citi_201907_df['Random_integer'] % 10

In [103]:
citi_201907_df['mod10'].value_counts()

8    223382
7    222909
0    222520
4    222489
2    222459
3    222386
1    222332
6    222288
9    222089
5    221956
Name: mod10, dtype: int64

In [104]:
sample_201907_df = citi_201907_df.loc[citi_201907_df["mod10"] == 7]

In [105]:
sample_201907_df = sample_201907_df.drop('mod10', axis=1)
sample_201907_df = sample_201907_df.drop('Random_integer', axis=1)

In [106]:
sample_201907_df.to_csv("Resources/YOY_citi_201907.csv", index=False)

### August 2019

In [107]:
file_path = ('Resources/YearOverYear/JC-201908-citibike-tripdata.csv')
JC_1908_df = pd.read_csv(file_path)

In [108]:
file_path = ('Resources/YearOverYear/201908-citibike-tripdata.csv')
NY_1908_df = pd.read_csv(file_path)

In [109]:
JC_1908_df['starttime'] = pd.to_datetime(JC_1908_df['starttime'])
JC_1908_df['stoptime'] = pd.to_datetime(JC_1908_df['stoptime'])
NY_1908_df['starttime'] = pd.to_datetime(NY_1908_df['starttime'])
NY_1908_df['stoptime'] = pd.to_datetime(NY_1908_df['stoptime'])

In [110]:
citi_201908_df = pd.concat([NY_1908_df, JC_1908_df], ignore_index=True)

In [111]:
citi_201908_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201908_df),1))

In [112]:
citi_201908_df['mod10'] = citi_201908_df['Random_integer'] % 10

In [113]:
citi_201908_df['mod10'].value_counts()

9    240198
0    239681
2    239527
3    239507
4    239430
1    239333
8    239105
7    239063
5    238609
6    238482
Name: mod10, dtype: int64

In [114]:
sample_201908_df = citi_201908_df.loc[citi_201908_df["mod10"] == 7]

In [115]:
sample_201908_df = sample_201908_df.drop('mod10', axis=1)
sample_201908_df = sample_201908_df.drop('Random_integer', axis=1)

In [116]:
sample_201908_df.to_csv("Resources/YOY_citi_201908.csv", index=False)

### September 2019

In [117]:
file_path = ('Resources/YearOverYear/JC-201909-citibike-tripdata.csv')
JC_1909_df = pd.read_csv(file_path)

In [118]:
file_path = ('Resources/YearOverYear/201909-citibike-tripdata.csv')
NY_1909_df = pd.read_csv(file_path)

In [119]:
JC_1909_df['starttime'] = pd.to_datetime(JC_1909_df['starttime'])
JC_1909_df['stoptime'] = pd.to_datetime(JC_1909_df['stoptime'])
NY_1909_df['starttime'] = pd.to_datetime(NY_1909_df['starttime'])
NY_1909_df['stoptime'] = pd.to_datetime(NY_1909_df['stoptime'])

In [120]:
citi_201909_df = pd.concat([NY_1909_df, JC_1909_df], ignore_index=True)

In [121]:
citi_201909_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201909_df),1))

In [122]:
citi_201909_df['mod10'] = citi_201909_df['Random_integer'] % 10

In [123]:
citi_201909_df['mod10'].value_counts()

9    250418
1    250026
5    249696
0    249537
4    249383
3    249278
7    249173
6    249107
2    249041
8    248485
Name: mod10, dtype: int64

In [124]:
sample_201909_df = citi_201909_df.loc[citi_201909_df["mod10"] == 7]

In [125]:
sample_201909_df = sample_201909_df.drop('mod10', axis=1)
sample_201909_df = sample_201909_df.drop('Random_integer', axis=1)

In [126]:
sample_201909_df.to_csv("Resources/YOY_citi_201909.csv", index=False)