## 1) Write some documentation


Name: Jersey City Citibike Tripdata for January 2020

Link to Data: [CitibikeNYC](https://www.citibikenyc.com/system-data) [Actual Dataset](https://s3.amazonaws.com/tripdata/JC-202001-citibike-tripdata.csv.zip)

Source / Origin: Trip data for Jersey City for the month of January 2020, from Citibike NYC's official website. The data is provided by Bikeshare and appears to be hosted on Amazon Web Services (the download link points to an S3 address).

Format: csv

Types of each column:

- tripduration: int64
- starttime: object
- stoptime: object
- start station id: int64
- start station name: object
- start station latitude: float64
- start station longitude: float64
- end station id: int64
- end station name: object
- end station latitude: float64
- end station longitude: float64
- bikeid: int64
- usertype: object
- birth year: int64
- gender: int64
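
Since starttime and stoptime load as object (plain strings), it can help to parse them as timestamps at read time. The following is a minimal sketch, not part of the original notebook: it assumes a small in-memory sample (the `sample` text, copied from the first two rows of the notebook's output and cut down to three columns) in place of the real file, and the names `sample`, `trips`, and `elapsed_seconds` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for JC-202001-citibike-tripdata.csv
sample = io.StringIO(
    "tripduration,starttime,stoptime\n"
    "226,2020-01-01 00:04:50.1920,2020-01-01 00:08:37.0370\n"
    "377,2020-01-01 00:16:01.6700,2020-01-01 00:22:19.0800\n"
)

# parse_dates asks pandas to convert the listed columns to datetime64
trips = pd.read_csv(sample, parse_dates=["starttime", "stoptime"])

# With real timestamps, elapsed time can be computed directly
trips["elapsed_seconds"] = (trips["stoptime"] - trips["starttime"]).dt.total_seconds()

print(trips.dtypes)
print(trips[["tripduration", "elapsed_seconds"]])
```

In this sample the computed elapsed time, truncated to whole seconds, agrees with tripduration, which suggests tripduration is recorded in seconds.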

## 2) Retrieve the data, create a DataFrame


In [1]:
import numpy as np 
import pandas as pd

df = pd.read_csv("JC-202001-citibike-tripdata.csv")
print(df)
print(df.dtypes)


tripduration                 starttime                  stoptime  \
0               226  2020-01-01 00:04:50.1920  2020-01-01 00:08:37.0370   
1               377  2020-01-01 00:16:01.6700  2020-01-01 00:22:19.0800   
2               288  2020-01-01 00:17:33.8770  2020-01-01 00:22:22.4420   
3               435  2020-01-01 00:32:05.9020  2020-01-01 00:39:21.0660   
4               231  2020-01-01 00:46:19.6780  2020-01-01 00:50:11.3440   
...             ...                       ...                       ...   
26015           544  2020-01-31 23:29:29.3910  2020-01-31 23:38:33.6910   
26016           122  2020-01-31 23:30:59.3670  2020-01-31 23:33:01.6870   
26017           201  2020-01-31 23:42:34.8460  2020-01-31 23:45:55.8780   
26018           300  2020-01-31 23:45:00.6800  2020-01-31 23:50:00.8740   
26019           721  2020-01-31 23:48:35.1700  2020-02-01 00:00:36.4060   

       start station id          start station name  start station latitude  \
0                  3186    

## 3) Using the Data


Motivation: I was curious about transportation statistics in general and wanted to clean up the data for further analysis.

Documentation:

1) Transform a column

2) Create a new calculated column

3) Calculate summary statistics

4) Calculate value counts

5) Visualization


In [2]:
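# 1) Map the numeric gender codes to labels (0 = unknown, 1 = male, 2 = female)
def get_gender(n):
    if n == 1:
        return 'male'
    elif n == 2:
        return 'female'
    else:
        return 'unknown'    # code 0, plus any unexpected value

df['gender'] = df['gender'].transform(get_gender)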
df


Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,226,2020-01-01 00:04:50.1920,2020-01-01 00:08:37.0370,3186,Grove St PATH,40.719586,-74.043117,3211,Newark Ave,40.721525,-74.046305,29444,Subscriber,1984,female
1,377,2020-01-01 00:16:01.6700,2020-01-01 00:22:19.0800,3186,Grove St PATH,40.719586,-74.043117,3269,Brunswick & 6th,40.726012,-74.050389,26305,Subscriber,1989,female
2,288,2020-01-01 00:17:33.8770,2020-01-01 00:22:22.4420,3186,Grove St PATH,40.719586,-74.043117,3269,Brunswick & 6th,40.726012,-74.050389,29268,Customer,1989,male
3,435,2020-01-01 00:32:05.9020,2020-01-01 00:39:21.0660,3195,Sip Ave,40.730897,-74.063913,3280,Astor Place,40.719282,-74.071262,29278,Customer,1969,unknown
4,231,2020-01-01 00:46:19.6780,2020-01-01 00:50:11.3440,3186,Grove St PATH,40.719586,-74.043117,3276,Marin Light Rail,40.714584,-74.042817,29276,Subscriber,1983,female
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26015,544,2020-01-31 23:29:29.3910,2020-01-31 23:38:33.6910,3213,Van Vorst Park,40.718489,-74.047727,3194,McGinley Square,40.725340,-74.067622,29659,Subscriber,1989,male
26016,122,2020-01-31 23:30:59.3670,2020-01-31 23:33:01.6870,3792,Columbus Dr at Exchange Pl,40.716870,-74.032810,3639,Harborside,40.719252,-74.034234,42361,Subscriber,1991,male
26017,201,2020-01-31 23:42:34.8460,2020-01-31 23:45:55.8780,3273,Manila & 1st,40.721651,-74.042884,3209,Brunswick St,40.724176,-74.050656,42368,Subscriber,1988,male
26018,300,2020-01-31 23:45:00.6800,2020-01-31 23:50:00.8740,3185,City Hall,40.717733,-74.043845,3267,Morris Canal,40.712419,-74.038526,42257,Subscriber,1981,female


In [3]:
# 2) Algorithm referenced from: https://kite.com/python/answers/how-to-find-the-distance-between-two-lat-long-coordinates-in-python
import math

R = 6373.0      # approximate radius of the Earth, in kilometers

def get_distance(lat1, lon1, lat2, lon2):
    # Distance in kilometers between two points given in degrees, using the Haversine formula
    lat1 = math.radians(lat1)
    lon1 = math.radians(lon1)
    lat2 = math.radians(lat2)
    lon2 = math.radians(lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Haversine formula
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c

# Create a new column, in kilometers, from the start and end station coordinates
df['direct distance traveled'] = df.apply(
    lambda row: get_distance(
        row['start station latitude'], row['start station longitude'],
        row['end station latitude'], row['end station longitude'],
    ),
    axis=1,
)

print(df)


tripduration                 starttime                  stoptime  \
0               226  2020-01-01 00:04:50.1920  2020-01-01 00:08:37.0370   
1               377  2020-01-01 00:16:01.6700  2020-01-01 00:22:19.0800   
2               288  2020-01-01 00:17:33.8770  2020-01-01 00:22:22.4420   
3               435  2020-01-01 00:32:05.9020  2020-01-01 00:39:21.0660   
4               231  2020-01-01 00:46:19.6780  2020-01-01 00:50:11.3440   
...             ...                       ...                       ...   
26015           544  2020-01-31 23:29:29.3910  2020-01-31 23:38:33.6910   
26016           122  2020-01-31 23:30:59.3670  2020-01-31 23:33:01.6870   
26017           201  2020-01-31 23:42:34.8460  2020-01-31 23:45:55.8780   
26018           300  2020-01-31 23:45:00.6800  2020-01-31 23:50:00.8740   
26019           721  2020-01-31 23:48:35.1700  2020-02-01 00:00:36.4060   

       start station id          start station name  start station latitude  \
0                  3186    

In [4]:
# 3)
print(df['birth year'].describe())


count    26020.000000
mean      1981.163605
std         10.310239
min       1888.000000
25%       1976.000000
50%       1983.000000
75%       1989.000000
max       2002.000000
Name: birth year, dtype: float64


In [5]:
# 4)
print(df['start station name'].value_counts())
print(df['end station name'].value_counts())


Grove St PATH                 3100
Sip Ave                       1493
Hamilton Park                 1327
Columbus Dr at Exchange Pl    1152
Harborside                    1091
Newport PATH                   989
Marin Light Rail               904
Brunswick & 6th                776
City Hall                      745
Newark Ave                     720
Newport Pkwy                   641
Manila & 1st                   632
Warren St                      628
Jersey & 3rd                   616
Monmouth and 6th               613
Jersey & 6th St                584
Washington St                  575
Columbus Drive                 556
Morris Canal                   542
McGinley Square                538
Brunswick St                   529
Van Vorst Park                 528
Dixon Mills                    511
Paulus Hook                    493
Liberty Light Rail             474
Essex Light Rail               398
JC Medical Center              398
Journal Square                 393
Grand St            

In [6]:
# 5) 
df.plot(kind='scatter', x='direct distance traveled', y='tripduration')


<matplotlib.axes._subplots.AxesSubplot at 0x120d9c990>

1) 

- Created a function called get_gender that takes the numeric gender code used in the dataset and returns the corresponding label as a string.
- Used transform to convert the gender column from its numeric codes to those string labels.
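
As an illustration of the same conversion, here is a minimal sketch that uses a dictionary lookup with `Series.map` instead of a function. It is an alternative, not what the notebook does; `GENDER_LABELS`, `codes`, and `labels` are hypothetical names, and the toy Series stands in for `df['gender']`.

```python
import pandas as pd

# Hypothetical lookup table: 0 = unknown, 1 = male, 2 = female
GENDER_LABELS = {0: "unknown", 1: "male", 2: "female"}

# Toy stand-in for df['gender'], including one unexpected code (5)
codes = pd.Series([2, 1, 0, 5], name="gender")

# map looks each code up in the dictionary; codes missing from it become NaN,
# so fillna gives them an explicit label
labels = codes.map(GENDER_LABELS).fillna("unknown")

print(labels.tolist())
```

A dictionary keeps the code-to-label table in one place, and unexpected codes are labeled unknown rather than silently falling into one of the real categories.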

2)

- Looked for and found an algorithm (the Haversine formula) that computes the distance between two latitude/longitude points.
- Created a function called get_distance that uses the algorithm to return the direct (great-circle) distance, in kilometers, between the start and end stations.
- Created a new column called direct distance traveled by applying the new function to each row.
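
The row-by-row apply works, but the same formula can also be evaluated on whole columns at once. Below is a minimal sketch of a vectorized variant using NumPy; it is an assumption about how one might speed this up, not the notebook's method. `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, and the coordinates are those of the first two trips shown in the output above.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same approximate radius the notebook uses

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or equal-length sequences in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return EARTH_RADIUS_KM * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Start and end coordinates of the first two trips
start_lat = [40.719586, 40.719586]
start_lon = [-74.043117, -74.043117]
end_lat = [40.721525, 40.726012]
end_lon = [-74.046305, -74.050389]

distances = haversine_km(start_lat, start_lon, end_lat, end_lon)
print(distances)
```

Passing the four DataFrame columns in place of the lists should return one distance per row, which could then be assigned to the new column directly.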

3) 

- Called describe on birth year to get its summary statistics.
- The latest birth year is 2002, while the earliest is 1888. A rider born in 1888 would have been over 130 years old in 2020, so that value is most likely an error.
- The mean birth year is about 1981, which puts the average user at around 39 years old in 2020.
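
The age arithmetic and the check for suspicious birth years can be sketched as follows. This is illustrative only: `REFERENCE_YEAR`, `MAX_PLAUSIBLE_AGE`, and the toy `birth_years` Series are assumptions, and the cutoff of 100 years is an arbitrary choice rather than anything defined by the dataset.

```python
import pandas as pd

REFERENCE_YEAR = 2020     # year the trips were taken
MAX_PLAUSIBLE_AGE = 100   # hypothetical cutoff, chosen only for illustration

# Toy stand-in for df['birth year'], including the suspicious minimum of 1888
birth_years = pd.Series([1984, 1989, 1969, 1888, 2002], name="birth year")

# Age in 2020, ignoring month and day of birth
ages = REFERENCE_YEAR - birth_years

# Flag rows whose implied age exceeds the cutoff
implausible = ages > MAX_PLAUSIBLE_AGE

print(ages[~implausible].mean())   # average age over plausible rows only
print(birth_years[implausible])    # rows worth a closer look
```

Flagging rather than dropping the rows keeps the decision about how to treat them visible in later analysis.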

4) 

- 