## Compile complete 12-month ride data for 2019
## Script 1 of 2

> Some of the questions posed by Citi Bike concern seasonality and other perspectives that require a full 12 months of ride activity. I elected to use ride data from all of 2019, which at the time of the analysis was the most recent full year available. The source data was previously downloaded from the Citi Bike website into a local 'Resources' work folder.

>**The 12-month 2019 data is prepared using two Jupyter Notebook scripts. This first script** combines the New York and Jersey City data for each month of 2019, since the Citi Bike website stores New York ride data and Jersey City ride data in separate files. The script then assigns each record a random integer and applies a modulus operation (`% 10`) to split the records randomly into 10 representative groups. Keeping just one of those groups reduces each month to roughly 10% of its records, a sample size that is more manageable for Tableau.

>NOTE: the **downloaded source data** is *not* cloned to GitHub due to its size. Files with sample source records for 201901 are included in the 'Resources' folder on GitHub. Also, the **output files** generated by this script are *not* cloned to GitHub due to size constraints. An output file with sample records for 201901 is included in the 'Resources' folder on GitHub.

In [4]:
# import libraries
import pandas as pd
import numpy as np

## January

#### read in the source csv files containing the Jersey City (JC) and New York (NY) ride data

In [5]:
file_path = ('Resources/2019/JC-201901-citibike-tripdata.csv')
JC_01_df = pd.read_csv(file_path)

In [6]:
file_path = ('Resources/2019/201901-citibike-tripdata.csv')
NY_01_df = pd.read_csv(file_path)

#### combine the New York and Jersey City ride data into a single file representing the month

In [7]:
citi_201901_df = pd.concat([NY_01_df, JC_01_df], ignore_index=True)
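
A minimal sketch of the combine step using `pd.concat`, which is the supported way to stack DataFrames in pandas 2.x (`DataFrame.append` was removed in pandas 2.0). The two small frames and the `station_id` column are hypothetical stand-ins for the real monthly files; the point is only to show that `ignore_index=True` gives the combined frame a fresh, continuous index.

```python
import pandas as pd

# Hypothetical stand-ins for the New York and Jersey City monthly dataframes
ny_df = pd.DataFrame({"station_id": [101, 102]})
jc_df = pd.DataFrame({"station_id": [201]})

# Stack the two frames; ignore_index=True renumbers the rows 0..n-1
combined_df = pd.concat([ny_df, jc_df], ignore_index=True)

print(combined_df.index.tolist())  # [0, 1, 2]
print(len(combined_df))            # 3
```

Without `ignore_index=True`, the Jersey City rows would keep their original index values and the combined frame would contain duplicate index labels.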

#### assign a random integer and perform a modulus operation to assign the data to 10 representative groups

In [8]:
citi_201901_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201901_df),1))

In [9]:
citi_201901_df['mod10'] = citi_201901_df['Random_integer'] % 10

In [10]:
# check the randomly created groups
citi_201901_df['mod10'].value_counts()

9    99150
1    99056
0    98821
4    98691
5    98659
6    98649
3    98648
8    98538
7    98440
2    98311
Name: mod10, dtype: int64
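
Because no seed is set, the random integers (and therefore the group counts above) differ on every run. Below is a minimal sketch of a repeatable variant, assuming NumPy's `default_rng` generator is an acceptable substitute for `np.random.randint`. The small `rides_df` frame and the seed value 2019 are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a month of combined ride data
rides_df = pd.DataFrame({"tripduration": np.arange(1000)})

# A seeded generator makes the group assignment repeatable between runs
rng = np.random.default_rng(seed=2019)  # arbitrary, assumed seed value
rides_df["Random_integer"] = rng.integers(0, 100000, size=len(rides_df))

# Same grouping as in the notebook: the last digit selects one of 10 groups
rides_df["mod10"] = rides_df["Random_integer"] % 10

# Group sizes, ordered by group label rather than by count
group_counts = rides_df["mod10"].value_counts().sort_index()
print(group_counts)
```

With a fixed seed, rerunning the cell selects the same records for the sample, which makes the downstream Tableau extracts reproducible.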

#### create a new sample dataframe containing 10% of the month's ride data; here, using records with 'mod10' = 7

In [11]:
sample_201901_df = citi_201901_df.loc[citi_201901_df["mod10"] == 7]
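
As a point of comparison, pandas can also draw a 10% sample directly with `DataFrame.sample`. This is a different technique from the random-integer-and-modulus approach used in this script, shown here only as a sketch; `combined_df` and the `random_state` value are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the combined monthly dataframe
combined_df = pd.DataFrame({"tripduration": np.arange(1000)})

# Draw 10% of the rows at random; random_state is an arbitrary, assumed seed
sample_df = combined_df.sample(frac=0.1, random_state=2019)

print(len(sample_df))
```

One difference worth noting: `sample(frac=0.1)` returns a fixed fraction of the rows, while filtering on one `mod10` group returns approximately 10%, as the group counts above show.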

#### output the sample dataframe to csv for further processing in subsequent scripts

In [12]:
sample_201901_df.to_csv("Resources/sample_citi_201901.csv", index=False)

## Documentation Note for Following Cells

> The following cells repeat, for each remaining month of 2019, the steps performed above on the January data. The documentation for those steps is the same as shown for January and is not repeated, for brevity.
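
Since the same steps are repeated for every month, they could also be wrapped in a function and run in a loop. The sketch below is one possible arrangement, not the approach this notebook takes: the function names `sample_month` and `sample_year` and the `keep_group` parameter are hypothetical, while the file naming follows the paths used in the January cells.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def sample_month(ny_path, jc_path, out_path, keep_group=7):
    """Combine one month of NY and JC rides and write the ~10% sample."""
    ny_df = pd.read_csv(ny_path)
    jc_df = pd.read_csv(jc_path)
    combined_df = pd.concat([ny_df, jc_df], ignore_index=True)

    # Same grouping idea as the January cells: random integer, then modulus 10
    combined_df["Random_integer"] = np.random.randint(0, 100000, size=len(combined_df))
    combined_df["mod10"] = combined_df["Random_integer"] % 10

    # Keep only the records that landed in the chosen group
    sample_df = combined_df.loc[combined_df["mod10"] == keep_group]
    sample_df.to_csv(out_path, index=False)
    return sample_df


def sample_year(resources_dir="Resources", year=2019):
    """Run sample_month for each of the 12 months of the given year."""
    resources = Path(resources_dir)
    for month in range(1, 13):
        stamp = f"{year}{month:02d}"  # e.g. "201901"
        sample_month(
            ny_path=resources / str(year) / f"{stamp}-citibike-tripdata.csv",
            jc_path=resources / str(year) / f"JC-{stamp}-citibike-tripdata.csv",
            out_path=resources / f"sample_citi_{stamp}.csv",
        )
```

Calling `sample_year()` from the notebook's working folder would then read the 24 source files and write the 12 monthly sample files. Keeping each month in its own cell, as this notebook does, has the advantage of showing the group counts for every month.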

## February

In [13]:
file_path = ('Resources/2019/JC-201902-citibike-tripdata.csv')
JC_02_df = pd.read_csv(file_path)

In [14]:
file_path = ('Resources/2019/201902-citibike-tripdata.csv')
NY_02_df = pd.read_csv(file_path)

In [15]:
citi_201902_df = pd.concat([NY_02_df, JC_02_df], ignore_index=True)

In [16]:
citi_201902_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201902_df),1))

In [17]:
citi_201902_df['mod10'] = citi_201902_df['Random_integer'] % 10

In [18]:
citi_201902_df['mod10'].value_counts()

0    96545
1    96470
6    96465
2    96389
8    96269
9    96207
7    96127
5    96053
4    95965
3    95819
Name: mod10, dtype: int64

In [19]:
sample_201902_df = citi_201902_df.loc[citi_201902_df["mod10"] == 7]

In [20]:
sample_201902_df.to_csv("Resources/sample_citi_201902.csv", index=False)

## March

In [21]:
file_path = ('Resources/2019/JC-201903-citibike-tripdata.csv')
JC_03_df = pd.read_csv(file_path)

In [22]:
file_path = ('Resources/2019/201903-citibike-tripdata.csv')
NY_03_df = pd.read_csv(file_path)

In [23]:
citi_201903_df = pd.concat([NY_03_df, JC_03_df], ignore_index=True)

In [24]:
citi_201903_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201903_df),1))

In [25]:
citi_201903_df['mod10'] = citi_201903_df['Random_integer'] % 10

In [26]:
citi_201903_df['mod10'].value_counts()

2    135633
0    135566
7    135348
5    135219
4    135105
1    135023
8    135014
3    134921
6    134879
9    134858
Name: mod10, dtype: int64

In [27]:
sample_201903_df = citi_201903_df.loc[citi_201903_df["mod10"] == 7]

In [28]:
sample_201903_df.to_csv("Resources/sample_citi_201903.csv", index=False)

## April

In [29]:
file_path = ('Resources/2019/JC-201904-citibike-tripdata.csv')
JC_04_df = pd.read_csv(file_path)

In [30]:
file_path = ('Resources/2019/201904-citibike-tripdata.csv')
NY_04_df = pd.read_csv(file_path)

In [31]:
citi_201904_df = pd.concat([NY_04_df, JC_04_df], ignore_index=True)

In [32]:
citi_201904_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201904_df),1))

In [33]:
citi_201904_df['mod10'] = citi_201904_df['Random_integer'] % 10

In [34]:
citi_201904_df['mod10'].value_counts()

9    180433
8    180206
1    180106
3    179982
0    179946
5    179899
4    179794
2    179654
7    179605
6    179525
Name: mod10, dtype: int64

In [35]:
sample_201904_df = citi_201904_df.loc[citi_201904_df["mod10"] == 7]

In [36]:
sample_201904_df.to_csv("Resources/sample_citi_201904.csv", index=False)

## May

In [37]:
file_path = ('Resources/2019/JC-201905-citibike-tripdata.csv')
JC_05_df = pd.read_csv(file_path)

In [38]:
file_path = ('Resources/2019/201905-citibike-tripdata.csv')
NY_05_df = pd.read_csv(file_path)

In [39]:
citi_201905_df = pd.concat([NY_05_df, JC_05_df], ignore_index=True)

In [40]:
citi_201905_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201905_df),1))

In [41]:
citi_201905_df['mod10'] = citi_201905_df['Random_integer'] % 10

In [42]:
citi_201905_df['mod10'].value_counts()

4    197256
0    196663
3    196519
6    196385
1    196149
2    195954
9    195884
5    195658
7    195348
8    194882
Name: mod10, dtype: int64

In [43]:
sample_201905_df = citi_201905_df.loc[citi_201905_df["mod10"] == 7]

In [44]:
sample_201905_df.to_csv("Resources/sample_citi_201905.csv", index=False)

## June

In [45]:
file_path = ('Resources/2019/JC-201906-citibike-tripdata.csv')
JC_06_df = pd.read_csv(file_path)

In [46]:
file_path = ('Resources/2019/201906-citibike-tripdata.csv')
NY_06_df = pd.read_csv(file_path)

In [47]:
citi_201906_df = pd.concat([NY_06_df, JC_06_df], ignore_index=True)

In [48]:
citi_201906_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201906_df),1))

In [49]:
citi_201906_df['mod10'] = citi_201906_df['Random_integer'] % 10

In [50]:
citi_201906_df['mod10'].value_counts()

4    217242
3    216975
2    216922
0    216863
7    216436
8    216238
9    216128
1    216117
5    216089
6    215790
Name: mod10, dtype: int64

In [51]:
sample_201906_df = citi_201906_df.loc[citi_201906_df["mod10"] == 7]

In [52]:
sample_201906_df.to_csv("Resources/sample_citi_201906.csv", index=False)

## July

In [53]:
file_path = ('Resources/2019/JC-201907-citibike-tripdata.csv')
JC_07_df = pd.read_csv(file_path)

In [54]:
file_path = ('Resources/2019/201907-citibike-tripdata.csv')
NY_07_df = pd.read_csv(file_path)

In [55]:
citi_201907_df = pd.concat([NY_07_df, JC_07_df], ignore_index=True)

In [56]:
citi_201907_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201907_df),1))

In [57]:
citi_201907_df['mod10'] = citi_201907_df['Random_integer'] % 10

In [58]:
citi_201907_df['mod10'].value_counts()

5    223178
7    223006
4    222835
3    222765
6    222722
2    222655
1    222199
9    221972
8    221952
0    221526
Name: mod10, dtype: int64

In [59]:
sample_201907_df = citi_201907_df.loc[citi_201907_df["mod10"] == 7]

In [60]:
sample_201907_df.to_csv("Resources/sample_citi_201907.csv", index=False)

## August

In [61]:
file_path = ('Resources/2019/JC-201908-citibike-tripdata.csv')
JC_08_df = pd.read_csv(file_path)

In [62]:
file_path = ('Resources/2019/201908-citibike-tripdata.csv')
NY_08_df = pd.read_csv(file_path)

In [63]:
citi_201908_df = pd.concat([NY_08_df, JC_08_df], ignore_index=True)

In [64]:
citi_201908_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201908_df),1))

In [65]:
citi_201908_df['mod10'] = citi_201908_df['Random_integer'] % 10

In [66]:
citi_201908_df['mod10'].value_counts()

7    239652
2    239476
6    239472
3    239324
0    239305
8    239283
1    239186
9    239146
5    239140
4    238951
Name: mod10, dtype: int64

In [67]:
sample_201908_df = citi_201908_df.loc[citi_201908_df["mod10"] == 7]

In [68]:
sample_201908_df.to_csv("Resources/sample_citi_201908.csv", index=False)

## September

In [69]:
file_path = ('Resources/2019/JC-201909-citibike-tripdata.csv')
JC_09_df = pd.read_csv(file_path)

In [70]:
file_path = ('Resources/2019/201909-citibike-tripdata.csv')
NY_09_df = pd.read_csv(file_path)

In [71]:
citi_201909_df = pd.concat([NY_09_df, JC_09_df], ignore_index=True)

In [72]:
citi_201909_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201909_df),1))

In [73]:
citi_201909_df['mod10'] = citi_201909_df['Random_integer'] % 10

In [74]:
citi_201909_df['mod10'].value_counts()

1    250170
6    250111
0    249789
4    249476
3    249450
9    249373
7    249357
2    248968
8    248859
5    248591
Name: mod10, dtype: int64

In [75]:
sample_201909_df = citi_201909_df.loc[citi_201909_df["mod10"] == 7]

In [76]:
sample_201909_df.to_csv("Resources/sample_citi_201909.csv", index=False)

## October

In [77]:
file_path = ('Resources/2019/JC-201910-citibike-tripdata.csv')
JC_10_df = pd.read_csv(file_path)

In [78]:
file_path = ('Resources/2019/201910-citibike-tripdata.csv')
NY_10_df = pd.read_csv(file_path)

In [79]:
citi_201910_df = pd.concat([NY_10_df, JC_10_df], ignore_index=True)

In [80]:
citi_201910_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201910_df),1))

In [81]:
citi_201910_df['mod10'] = citi_201910_df['Random_integer'] % 10

In [82]:
citi_201910_df['mod10'].value_counts()

0    213987
3    213791
2    213705
9    213679
8    213643
5    213509
7    213243
4    213242
1    213087
6    212940
Name: mod10, dtype: int64

In [83]:
sample_201910_df = citi_201910_df.loc[citi_201910_df["mod10"] == 7]

In [84]:
sample_201910_df.to_csv("Resources/sample_citi_201910.csv", index=False)

## November

In [85]:
file_path = ('Resources/2019/JC-201911-citibike-tripdata.csv')
JC_11_df = pd.read_csv(file_path)

In [86]:
file_path = ('Resources/2019/201911-citibike-tripdata.csv')
NY_11_df = pd.read_csv(file_path)

In [87]:
citi_201911_df = pd.concat([NY_11_df, JC_11_df], ignore_index=True)

In [88]:
citi_201911_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201911_df),1))

In [89]:
citi_201911_df['mod10'] = citi_201911_df['Random_integer'] % 10

In [90]:
citi_201911_df['mod10'].value_counts()

2    151387
5    151259
8    151096
0    151034
4    151008
3    151006
7    150979
1    150698
9    150587
6    150451
Name: mod10, dtype: int64

In [91]:
sample_201911_df = citi_201911_df.loc[citi_201911_df["mod10"] == 7]

In [92]:
sample_201911_df.to_csv("Resources/sample_citi_201911.csv", index=False)

## December

In [93]:
file_path = ('Resources/2019/JC-201912-citibike-tripdata.csv')
JC_12_df = pd.read_csv(file_path)

In [94]:
file_path = ('Resources/2019/201912-citibike-tripdata.csv')
NY_12_df = pd.read_csv(file_path)

In [95]:
citi_201912_df = pd.concat([NY_12_df, JC_12_df], ignore_index=True)

In [96]:
citi_201912_df['Random_integer'] = np.random.randint(0,100000,size=(len(citi_201912_df),1))

In [97]:
citi_201912_df['mod10'] = citi_201912_df['Random_integer'] % 10

In [98]:
citi_201912_df['mod10'].value_counts()

0    98330
6    97667
1    97566
4    97503
3    97480
5    97313
8    97311
2    97287
9    97250
7    97231
Name: mod10, dtype: int64

In [99]:
sample_201912_df = citi_201912_df.loc[citi_201912_df["mod10"] == 7]

In [100]:
sample_201912_df.to_csv("Resources/sample_citi_201912.csv", index=False)