## Compile complete 12-month ride data for 2019
## Script 2 of 2

> Some of the questions posed by Citi Bike, such as those about seasonality, require a full 12 months of ride activity. I used ride data from all of 2019, which was the most recent full year available at the time of the analysis.

>**The 12-month 2019 data is prepared using two Jupyter Notebook scripts.** The first script combines the separate New York and Jersey City files for each month of 2019 (see ). **This second script** combines the monthly data into a single dataframe, then cleans the data and creates additional variables (columns).

>NOTE: the **downloaded source data** is *not* stored on GitHub due to its size. Files with sample source records for 201901 are included in the 'Resources' folder on GitHub. Likewise, the **output files for Tableau** in the 'Output' folder contain only a sample of records because of file size constraints.

In [4]:
# import libraries
import pandas as pd
import numpy as np

#### read the csv files with the monthly ride data

In [5]:
file_path = ('Resources/sample_citi_201901.csv')
citi_01_df = pd.read_csv(file_path)

In [6]:
file_path = ('Resources/sample_citi_201902.csv')
citi_02_df = pd.read_csv(file_path)

In [7]:
file_path = ('Resources/sample_citi_201903.csv')
citi_03_df = pd.read_csv(file_path)

In [8]:
file_path = ('Resources/sample_citi_201904.csv')
citi_04_df = pd.read_csv(file_path)

In [9]:
file_path = ('Resources/sample_citi_201905.csv')
citi_05_df = pd.read_csv(file_path)

In [10]:
file_path = ('Resources/sample_citi_201906.csv')
citi_06_df = pd.read_csv(file_path)

In [11]:
file_path = ('Resources/sample_citi_201907.csv')
citi_07_df = pd.read_csv(file_path)

In [12]:
file_path = ('Resources/sample_citi_201908.csv')
citi_08_df = pd.read_csv(file_path)

In [13]:
file_path = ('Resources/sample_citi_201909.csv')
citi_09_df = pd.read_csv(file_path)

In [14]:
file_path = ('Resources/sample_citi_201910.csv')
citi_10_df = pd.read_csv(file_path)

In [15]:
file_path = ('Resources/sample_citi_201911.csv')
citi_11_df = pd.read_csv(file_path)

In [16]:
file_path = ('Resources/sample_citi_201912.csv')
citi_12_df = pd.read_csv(file_path)

#### append each monthly file to successive dataframes that accumulate all 12 months of detailed ride data into a single dataframe called 'citi_2019_df'

In [17]:
# NOTE: DataFrame.append was removed in pandas 2.0; pd.concat gives the same result on old and new versions
citi_1902_df = pd.concat([citi_01_df, citi_02_df], ignore_index=True)

In [18]:
citi_1903_df = pd.concat([citi_1902_df, citi_03_df], ignore_index=True)

In [19]:
citi_1904_df = pd.concat([citi_1903_df, citi_04_df], ignore_index=True)

In [20]:
citi_1905_df = pd.concat([citi_1904_df, citi_05_df], ignore_index=True)

In [21]:
citi_1906_df = pd.concat([citi_1905_df, citi_06_df], ignore_index=True)

In [22]:
citi_1907_df = pd.concat([citi_1906_df, citi_07_df], ignore_index=True)

In [23]:
citi_1908_df = pd.concat([citi_1907_df, citi_08_df], ignore_index=True)

In [24]:
citi_1909_df = pd.concat([citi_1908_df, citi_09_df], ignore_index=True)

In [25]:
citi_1910_df = pd.concat([citi_1909_df, citi_10_df], ignore_index=True)

In [26]:
citi_1911_df = pd.concat([citi_1910_df, citi_11_df], ignore_index=True)

In [27]:
citi_2019_df = pd.concat([citi_1911_df, citi_12_df], ignore_index=True)
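
The twelve read cells and the eleven cells that stack the monthly dataframes all follow one pattern, so they could be collapsed into a loop and a single `pd.concat` call. The following is only a sketch: `load_year` is a hypothetical helper, it assumes the `sample_citi_YYYYMM.csv` naming used above, and the demo writes tiny made-up files to a temporary folder so that it runs without the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_year(folder, year=2019):
    """Read the 12 monthly sample files and stack them into one dataframe.

    Hypothetical helper; assumes files named sample_citi_YYYYMM.csv.
    """
    monthly = [
        pd.read_csv(Path(folder) / f"sample_citi_{year}{month:02d}.csv")
        for month in range(1, 13)
    ]
    # ignore_index=True renumbers the rows 0..n-1 in the combined dataframe
    return pd.concat(monthly, ignore_index=True)


# Demo on made-up two-row files written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for month in range(1, 13):
        demo = pd.DataFrame({"tripduration": [100, 200], "month_no": [month, month]})
        demo.to_csv(Path(tmp) / f"sample_citi_2019{month:02d}.csv", index=False)
    demo_2019_df = load_year(tmp)

print(demo_2019_df.shape)
```

With the real files, the call would presumably be `citi_2019_df = load_year('Resources')`. Concatenating once also avoids copying the growing dataframe eleven times.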

#### inspect data types, sample table records, value counts 

In [28]:
citi_2019_df.dtypes

tripduration                 int64
starttime                   object
stoptime                    object
start station id           float64
start station name          object
start station latitude     float64
start station longitude    float64
end station id             float64
end station name            object
end station latitude       float64
end station longitude      float64
bikeid                       int64
usertype                    object
birth year                   int64
gender                       int64
Random_integer               int64
mod10                        int64
dtype: object

In [29]:
citi_2019_df.head(10)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,Random_integer,mod10
0,3494,2019-01-01 00:18:57.5640,2019-01-01 01:17:11.5700,3171.0,Amsterdam Ave & W 82 St,40.785247,-73.976673,3164.0,Columbus Ave & W 72 St,40.777057,-73.978985,35785,Subscriber,1954,1,75477,7
1,1145,2019-01-01 00:25:00.0880,2019-01-01 00:44:05.4970,3163.0,Central Park West & W 68 St,40.773407,-73.977825,490.0,8 Ave & W 33 St,40.751551,-73.993934,24945,Customer,1969,0,73107,7
2,607,2019-01-01 00:26:16.4100,2019-01-01 00:36:23.8570,285.0,Broadway & E 14 St,40.734546,-73.990741,301.0,E 2 St & Avenue B,40.722174,-73.983688,33456,Subscriber,1999,1,45887,7
3,813,2019-01-01 00:27:12.5070,2019-01-01 00:40:45.5940,3165.0,Central Park West & W 72 St,40.775794,-73.976206,450.0,W 49 St & 8 Ave,40.762272,-73.987882,32899,Subscriber,1970,1,77387,7
4,837,2019-01-01 00:35:00.3960,2019-01-01 00:48:57.7890,3115.0,India St & Manhattan Ave,40.732322,-73.955086,3067.0,Broadway & Whipple St,40.701666,-73.94373,21528,Subscriber,1989,1,59297,7
5,2919,2019-01-01 00:42:01.2580,2019-01-01 01:30:40.5560,3285.0,W 87 St & Amsterdam Ave,40.78839,-73.9747,3162.0,W 78 St & Broadway,40.7834,-73.980931,18737,Subscriber,1972,1,93767,7
6,1437,2019-01-01 00:43:32.3160,2019-01-01 01:07:29.5040,3140.0,1 Ave & E 78 St,40.771404,-73.953517,3553.0,Frederick Douglass Blvd & W 112 St,40.801694,-73.957145,32496,Customer,1981,2,23557,7
7,100,2019-01-01 00:45:32.6070,2019-01-01 00:47:12.7270,3286.0,E 89 St & 3 Ave,40.780628,-73.952167,3305.0,E 91 St & 2 Ave,40.781122,-73.949656,27766,Subscriber,1999,1,2257,7
8,282,2019-01-01 00:47:58.6860,2019-01-01 00:52:41.5200,504.0,1 Ave & E 16 St,40.732219,-73.981656,487.0,E 20 St & FDR Drive,40.733143,-73.975739,17370,Subscriber,1990,1,14227,7
9,468,2019-01-01 00:49:05.1410,2019-01-01 00:56:53.2090,195.0,Liberty St & Broadway,40.709056,-74.010434,195.0,Liberty St & Broadway,40.709056,-74.010434,20120,Subscriber,1980,1,30657,7


In [30]:
citi_2019_df["birth year"].value_counts()

1969    174239
1989     87197
1990     86685
1991     83554
1988     82340
1992     81853
1993     75289
1987     74329
1986     68496
1985     67300
1994     66974
1984     62491
1983     57760
1995     56732
1982     53286
1981     50298
1980     47887
1979     43863
1978     37828
1977     37201
1996     36575
1976     36528
1970     35667
1975     35071
1971     32858
1974     32754
1972     32466
1973     31845
1968     29917
1967     28920
         ...  
1921        32
1917        30
1928        29
1885        26
1899        25
1893        25
1886        24
1887        23
1912        23
1910        19
1911        15
1919        14
1918        11
1929        11
1895        10
1931        10
1915        10
1909         9
1916         9
1913         5
1920         4
1922         4
1907         3
1923         3
1904         3
1894         2
1874         1
1891         1
1926         1
1857         1
Name: birth year, Length: 110, dtype: int64

In [31]:
citi_2019_df["gender"].value_counts()

1    1433344
2     502998
0     158430
Name: gender, dtype: int64

#### assign random integers to each record; the random integers will be used to help reduce the size of the data

In [32]:
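# draw one integer in the range 0..99999 per row (a flat, one-dimensional array matches a single column)
citi_2019_df['Random_integer'] = np.random.randint(0, 100000, size=len(citi_2019_df))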
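
Because the random integers feed later steps, the results change on every run unless the generator is seeded. A minimal sketch of a reproducible draw with NumPy's `Generator` API follows; the seed value `2019` and the stand-in dataframe `demo_df` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for citi_2019_df; only the row count matters here
demo_df = pd.DataFrame({"tripduration": [3494, 1145, 607, 813, 837]})

# The seed value is an arbitrary assumption; any fixed integer works
rng = np.random.default_rng(2019)

# integers() excludes the upper bound, so values fall in 0..99999
demo_df["Random_integer"] = rng.integers(0, 100000, size=len(demo_df))

print(demo_df)
```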

In [33]:
# citi_2019_df['mod3'] = citi_2019_df['Random_integer'] % 3

In [34]:
# citi_2019_df['mod3'].value_counts()

#### create a new column containing the modulo-2 value (0 or 1) of the Random_integer column

In [35]:
citi_2019_df['mod2'] = citi_2019_df['Random_integer'] % 2

In [36]:
# check for size of the two random samples created using the modulus operation
citi_2019_df["mod2"].value_counts()

0    1047464
1    1047308
Name: mod2, dtype: int64

#### convert data types

In [37]:
# convert ids to strings
citi_2019_df['start station id'] = citi_2019_df['start station id'].apply(str)
citi_2019_df['end station id'] = citi_2019_df['end station id'].apply(str)
citi_2019_df['bikeid'] = citi_2019_df['bikeid'].apply(str)
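
Because the station id columns are read as `float64`, `apply(str)` produces text such as `'3171.0'`, and a missing id becomes the string `'nan'`. If bare integer text is preferred, one possible approach is to pass through pandas' nullable integer type first. This is a sketch on made-up values that assumes every non-missing id is a whole number.

```python
import numpy as np
import pandas as pd

# Made-up ids, including one missing value
demo_ids = pd.Series([3171.0, np.nan, 285.0], name="start station id")

# float -> nullable integer -> string; missing ids stay missing (<NA>)
clean_ids = demo_ids.astype("Int64").astype("string")

print(clean_ids.tolist())
```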

In [38]:
# convert 'starttime' to datetime and store it in a new column called 'startride'
citi_2019_df['startride'] = pd.to_datetime(citi_2019_df['starttime'])

In [39]:
# convert 'stoptime' to datetime and store it in a new column called 'stopride'
citi_2019_df['stopride'] = pd.to_datetime(citi_2019_df['stoptime'])
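
On a file of this size, passing an explicit `format` to `pd.to_datetime` usually avoids per-value format inference, and the parsed columns make it possible to cross-check `tripduration`. The sketch below uses the first two sample rows shown earlier and assumes every timestamp carries fractional seconds, as those rows do; `elapsed_sec` is a hypothetical column name.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "tripduration": [3494, 1145],
    "starttime": ["2019-01-01 00:18:57.5640", "2019-01-01 00:25:00.0880"],
    "stoptime": ["2019-01-01 01:17:11.5700", "2019-01-01 00:44:05.4970"],
})

# %f accepts one to six fractional-second digits
fmt = "%Y-%m-%d %H:%M:%S.%f"
demo_df["startride"] = pd.to_datetime(demo_df["starttime"], format=fmt)
demo_df["stopride"] = pd.to_datetime(demo_df["stoptime"], format=fmt)

# Elapsed time in whole seconds, for comparison with 'tripduration'
elapsed = (demo_df["stopride"] - demo_df["startride"]).dt.total_seconds()
demo_df["elapsed_sec"] = elapsed.astype(int)

print(demo_df[["tripduration", "elapsed_sec"]])
```

For these two rows the truncated elapsed time equals `tripduration`; whether that holds for every record in the full data is not something this sketch establishes.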

In [40]:
# inspect data for counts, min/max range, mean
citi_2019_df.describe()

Unnamed: 0,tripduration,start station latitude,start station longitude,end station latitude,end station longitude,birth year,gender,Random_integer,mod10,mod2
count,2094772.0,2094772.0,2094772.0,2094772.0,2094772.0,2094772.0,2094772.0,2094772.0,2094772.0,2094772.0
mean,971.4454,40.73704,-73.98322,40.73672,-73.98343,1980.226,1.16449,50022.75,7.0,0.4999628
std,10733.18,0.03021881,0.021502,0.03007355,0.02154243,12.1142,0.5373036,28882.5,0.0,0.5000001
min,61.0,40.6554,-74.08364,40.6554,-74.08364,1857.0,0.0,0.0,7.0,0.0
25%,357.0,40.71755,-73.99601,40.71755,-73.99662,1970.0,1.0,25012.0,7.0,0.0
50%,609.0,40.73705,-73.98584,40.7365,-73.98627,1983.0,1.0,50042.0,7.0,0.0
75%,1072.0,40.75715,-73.97188,40.75651,-73.97208,1990.0,1.0,75050.0,7.0,1.0
max,3245687.0,40.866,-73.884,40.866,-73.881,2003.0,2.0,99999.0,7.0,1.0


#### convert trip length from seconds to hours for easier comprehension in Tableau visuals

In [41]:
citi_2019_df['trip_len_hrs'] = citi_2019_df['tripduration'] / 3600
    

#### create new variables for time periods (year, month, quarter, season)

In [42]:
# extract trip year from 'starttime' and store in a new column called 'trip_year' 
citi_2019_df['trip_year'] = citi_2019_df['starttime'].str[:4]

In [43]:
# convert the 'trip_year' data type to numeric
citi_2019_df['trip_year'] = pd.to_numeric(citi_2019_df['trip_year'])

In [44]:
# extract trip month from 'starttime' and store in a new column called 'month'
citi_2019_df['month'] = citi_2019_df['starttime'].str[5:7]

In [45]:
# convert trip 'month' data type to numeric
citi_2019_df['month'] = pd.to_numeric(citi_2019_df['month'])
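
Since the start time has already been parsed into the `startride` datetime column, the year and month could also be read through the `.dt` accessor instead of slicing the text and converting it. A minimal sketch on made-up rows:

```python
import pandas as pd

# Two made-up timestamps in the same layout as the source data
demo_df = pd.DataFrame({
    "starttime": ["2019-01-01 00:18:57.5640", "2019-12-31 23:59:59.0000"],
})
demo_df["startride"] = pd.to_datetime(demo_df["starttime"])

# .dt exposes the datetime components as integer columns
demo_df["trip_year"] = demo_df["startride"].dt.year
demo_df["month"] = demo_df["startride"].dt.month

print(demo_df[["trip_year", "month"]])
```

This does not depend on the exact character positions of the year and month in the text.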

In [46]:
# function that returns the starttime's annual quarter time period
def quarter_cat(row):

    if row['month'] >= 1 and row['month'] <= 3:
        qtr_val = 'Q1'
    elif row['month'] >= 4 and row['month'] <= 6:
        qtr_val = 'Q2'
    elif row['month'] >= 7 and row['month'] <= 9:
        qtr_val = 'Q3'
    elif row['month'] >= 10 and row['month'] <= 12:
        qtr_val = 'Q4'
    else:
        qtr_val = 'NaN'
    return qtr_val

In [47]:
# use the quarter category function to return the quarter value to a new column called 'quarter'
citi_2019_df['quarter'] = citi_2019_df.apply(quarter_cat, axis=1)
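
Row-wise `apply` calls a Python function once per record, which can be slow on roughly two million rows. Assuming `month` always holds an integer from 1 to 12, the same quarter labels could be computed for the whole column with integer arithmetic, as in this sketch on made-up months:

```python
import pandas as pd

demo_df = pd.DataFrame({"month": [1, 3, 4, 6, 7, 9, 10, 12]})

# Months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
quarter_no = (demo_df["month"] - 1) // 3 + 1
demo_df["quarter"] = "Q" + quarter_no.astype(str)

print(demo_df)
```

Unlike `quarter_cat`, this version has no fallback label for out-of-range months, so it relies on the 1 to 12 assumption.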

In [48]:
# function that returns the starttime's season
# NOTE: months 3, 6, 11 and 12 match none of the ranges below and are labeled 'other'
def season_cat(row):

    if row['month'] >= 1 and row['month'] <= 2:
        season_val = 'winter'
    elif row['month'] >= 4 and row['month'] <= 5:
        season_val = 'spring'
    elif row['month'] >= 7 and row['month'] <= 8:
        season_val = 'summer'
    elif row['month'] >= 9 and row['month'] <= 10:
        season_val = 'fall'
    else:
        season_val = 'other'
    return season_val

In [49]:
# use the season category function to return the season value to a new column called 'season'
citi_2019_df['season'] = citi_2019_df.apply(season_cat, axis=1)
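
The month-to-season rules in `season_cat` can also be written as a lookup table, which makes the labeled months easy to see at a glance. This sketch reproduces the same mapping with `Series.map`; `season_by_month` is a hypothetical name.

```python
import pandas as pd

# Same ranges as season_cat; months not listed fall through to 'other'
season_by_month = {
    1: "winter", 2: "winter",
    4: "spring", 5: "spring",
    7: "summer", 8: "summer",
    9: "fall", 10: "fall",
}

demo_df = pd.DataFrame({"month": list(range(1, 13))})

# map() leaves unmatched months as NaN, which fillna() turns into 'other'
demo_df["season"] = demo_df["month"].map(season_by_month).fillna("other")

print(demo_df)
```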

#### create new variables for rider sex and age

In [50]:
# function that returns the sex of the rider
def sex_label(row):

    if row['gender'] == 2:
        sex_val = 'female'
    elif row['gender'] == 1:
        sex_val = 'male'
    else:
        sex_val = 'na'
    return sex_val

In [51]:
# use the sex category function to return the rider's sex to a new column called 'sex'
citi_2019_df['sex'] = citi_2019_df.apply(sex_label, axis=1)

In [52]:
# function that returns the approximate age of the rider
# NOTE: the source data has no birth month, so the random modulo-2 value (0 or 1) is added to the
#       year difference to place the rider's birth in the first or second half of the year
# NOTE: riders with a birth year before 1930 get the string 'na', so the 'age' column holds mixed types

def age_calc(row):

    if row['birth year'] >= 1930:
        # year difference plus the random 0/1 offset
        age_val = row['trip_year'] - row['birth year'] + row['mod2']
    else:
        age_val = 'na'
    return age_val

In [53]:
# use the age category function to return the rider's age to a new column called 'age'
citi_2019_df['age'] = citi_2019_df.apply(age_calc, axis=1)
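
Because `age_calc` returns the string `'na'` for birth years before 1930, the `age` column ends up holding a mix of integers and text. If a purely numeric column is preferred, one option is a column-wise calculation that leaves those rows missing. The sketch below keeps the same arithmetic and uses made-up rows; the nullable `Int64` dtype is a choice made here, not something taken from the notebook.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "birth year": [1954, 1969, 1885, 2001],
    "trip_year": [2019, 2019, 2019, 2019],
    "mod2": [0, 1, 1, 0],
})

# Same arithmetic as age_calc: year difference plus the random 0/1 offset
raw_age = demo_df["trip_year"] - demo_df["birth year"] + demo_df["mod2"]

# Blank out birth years before 1930, then store as nullable integers
demo_df["age"] = raw_age.where(demo_df["birth year"] >= 1930).astype("Int64")

print(demo_df)
```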

In [54]:
# drop the 'mod2' and 'Random_integer' columns to reduce the size of the file imported into Tableau
citi_2019_df = citi_2019_df.drop(columns=['mod2', 'Random_integer'])

#### check data types, sample table records, value counts 

In [55]:
citi_2019_df.dtypes

tripduration                        int64
starttime                          object
stoptime                           object
start station id                   object
start station name                 object
start station latitude            float64
start station longitude           float64
end station id                     object
end station name                   object
end station latitude              float64
end station longitude             float64
bikeid                             object
usertype                           object
birth year                          int64
gender                              int64
mod10                               int64
startride                  datetime64[ns]
stopride                   datetime64[ns]
trip_len_hrs                      float64
trip_year                           int64
month                               int64
quarter                            object
season                             object
sex                                object
age                                object
dtype: object

In [56]:
citi_2019_df.head(10)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,...,mod10,startride,stopride,trip_len_hrs,trip_year,month,quarter,season,sex,age
0,3494,2019-01-01 00:18:57.5640,2019-01-01 01:17:11.5700,3171.0,Amsterdam Ave & W 82 St,40.785247,-73.976673,3164.0,Columbus Ave & W 72 St,40.777057,...,7,2019-01-01 00:18:57.564,2019-01-01 01:17:11.570,0.970556,2019,1,Q1,winter,male,65
1,1145,2019-01-01 00:25:00.0880,2019-01-01 00:44:05.4970,3163.0,Central Park West & W 68 St,40.773407,-73.977825,490.0,8 Ave & W 33 St,40.751551,...,7,2019-01-01 00:25:00.088,2019-01-01 00:44:05.497,0.318056,2019,1,Q1,winter,na,51
2,607,2019-01-01 00:26:16.4100,2019-01-01 00:36:23.8570,285.0,Broadway & E 14 St,40.734546,-73.990741,301.0,E 2 St & Avenue B,40.722174,...,7,2019-01-01 00:26:16.410,2019-01-01 00:36:23.857,0.168611,2019,1,Q1,winter,male,21
3,813,2019-01-01 00:27:12.5070,2019-01-01 00:40:45.5940,3165.0,Central Park West & W 72 St,40.775794,-73.976206,450.0,W 49 St & 8 Ave,40.762272,...,7,2019-01-01 00:27:12.507,2019-01-01 00:40:45.594,0.225833,2019,1,Q1,winter,male,49
4,837,2019-01-01 00:35:00.3960,2019-01-01 00:48:57.7890,3115.0,India St & Manhattan Ave,40.732322,-73.955086,3067.0,Broadway & Whipple St,40.701666,...,7,2019-01-01 00:35:00.396,2019-01-01 00:48:57.789,0.2325,2019,1,Q1,winter,male,31
5,2919,2019-01-01 00:42:01.2580,2019-01-01 01:30:40.5560,3285.0,W 87 St & Amsterdam Ave,40.78839,-73.9747,3162.0,W 78 St & Broadway,40.7834,...,7,2019-01-01 00:42:01.258,2019-01-01 01:30:40.556,0.810833,2019,1,Q1,winter,male,47
6,1437,2019-01-01 00:43:32.3160,2019-01-01 01:07:29.5040,3140.0,1 Ave & E 78 St,40.771404,-73.953517,3553.0,Frederick Douglass Blvd & W 112 St,40.801694,...,7,2019-01-01 00:43:32.316,2019-01-01 01:07:29.504,0.399167,2019,1,Q1,winter,female,39
7,100,2019-01-01 00:45:32.6070,2019-01-01 00:47:12.7270,3286.0,E 89 St & 3 Ave,40.780628,-73.952167,3305.0,E 91 St & 2 Ave,40.781122,...,7,2019-01-01 00:45:32.607,2019-01-01 00:47:12.727,0.027778,2019,1,Q1,winter,male,21
8,282,2019-01-01 00:47:58.6860,2019-01-01 00:52:41.5200,504.0,1 Ave & E 16 St,40.732219,-73.981656,487.0,E 20 St & FDR Drive,40.733143,...,7,2019-01-01 00:47:58.686,2019-01-01 00:52:41.520,0.078333,2019,1,Q1,winter,male,29
9,468,2019-01-01 00:49:05.1410,2019-01-01 00:56:53.2090,195.0,Liberty St & Broadway,40.709056,-74.010434,195.0,Liberty St & Broadway,40.709056,...,7,2019-01-01 00:49:05.141,2019-01-01 00:56:53.209,0.13,2019,1,Q1,winter,male,39


#### output detailed 2019 ride data to csv for importing into Tableau

In [57]:
citi_2019_df.to_csv("Output/citi_sample_2019.csv", index=False)