# Analysis of NYC Bike Share Citi Bike Data
__<font size="4">by Minjian Wu</font>__

NYC Bike Share operates the Citi Bike program and generates data about it, including trip records, a real-time feed of station status, and monthly reports. The Citi Bike program data is generated exclusively by the operator, NYC Bike Share, a limited liability corporation solely owned by Motivate.

The data includes (for each month of a year):
- Trip Duration in seconds
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
- Gender (0 = unknown; 1 = male; 2 = female)
- Year of Birth

This data has been processed to remove trips taken by staff as they service and inspect the system, trips to or from any of the “test” stations, and trips shorter than 60 seconds (potentially false starts or users re-docking a bike to ensure it is secure).

## Preliminary Wrangling

Looking at the data, we are motivated by a few questions: Where do Citi Bikers ride? How far do they go? When are most trips taken, in terms of time of day, day of the week, or month of the year? Which stations are most popular? How long does the average trip take? Do any of these depend on whether a user is a subscriber or a customer, male or female, or of a certain age?

We will answer these questions by exploring the downloaded dataset, using Python's visualisation libraries to support the analysis.

In [18]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

We load the dataset and describe its properties through the screening questions at the end of this section.
Those questions also motivate our exploration goals.

In [2]:
from bs4 import BeautifulSoup as bs
import os
import requests
from zipfile import ZipFile
from io import BytesIO

# # parse the domain html
# domain='https://s3.amazonaws.com/tripdata'
# soup = bs(requests.get(domain).content, 'html.parser')

# folder_name = 'NYC_bike_data'
# # Make directory if it doesn't already exist
# if not os.path.exists(folder_name):
#     os.makedirs(folder_name)

# # download files and unzip to folder
# for link in soup.find_all('key'):
#     if link.text.startswith('2020'):  # only need 2020 data
#         r = requests.get(domain + '/' + link.text)
#         f = ZipFile(BytesIO(r.content))
#         f.extractall(os.path.join(folder_name))

In [3]:
# # concatenate the 12 dataframes (1 from each month of 2020)
# df_2020_months = {
#     'df_{:02d}'.format(i):
#     pd.read_csv('NYC_bike_data/2020' + '{:02d}'.format(i) +
#                 '-citibike-tripdata.csv')
#     for i in range(1, 13)
# }

# df_2020 = pd.concat(df_2020_months.values(), ignore_index=True)

In [4]:
# # write the concatenated dataframe to a csv for re-use
# df_2020.to_csv('NYC_bike_2020.csv')
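
A side note on the write step: `to_csv` writes the row index by default, which is what reappears as an `Unnamed: 0` column when the file is read back. The following is a minimal sketch, using a small made-up frame (`toy`) and in-memory buffers rather than the real file, of how passing `index=False` would avoid that column. If this were adopted for the real file, the later drop of `Unnamed: 0` would no longer be needed.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for df_2020
toy = pd.DataFrame({
    "tripduration": [789, 1541],
    "usertype": ["Subscriber", "Customer"],
})

# Default behaviour: the index is written as an unnamed first column
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# With index=False only the data columns are written
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'tripduration', 'usertype']
print(list(reloaded_no_index.columns))    # ['tripduration', 'usertype']
```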

### Data inspection

In [5]:
df_2020 = pd.read_csv('NYC_bike_2020.csv', index_col=False)
df_2020.drop('Unnamed: 0', axis=1, inplace=True)

In [6]:
df_2020.head(1)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,789,2020-01-01 00:00:55.3900,2020-01-01 00:14:05.1470,504,1 Ave & E 16 St,40.732219,-73.981656,307,Canal St & Rutgers St,40.714275,-73.9899,30326,Subscriber,1992,1


In [7]:
df_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19506857 entries, 0 to 19506856
Data columns (total 15 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   tripduration             int64  
 1   starttime                object 
 2   stoptime                 object 
 3   start station id         int64  
 4   start station name       object 
 5   start station latitude   float64
 6   start station longitude  float64
 7   end station id           int64  
 8   end station name         object 
 9   end station latitude     float64
 10  end station longitude    float64
 11  bikeid                   int64  
 12  usertype                 object 
 13  birth year               int64  
 14  gender                   int64  
dtypes: float64(4), int64(6), object(5)
memory usage: 2.2+ GB
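
The `+` in the reported memory usage indicates that the `object` (string) columns are not fully measured, so the true footprint is larger. As a hedged sketch only, and not something done in this notebook, low-cardinality columns such as `usertype` and `gender` could be stored more compactly. The toy frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame with the same kind of low-cardinality columns
toy = pd.DataFrame({
    "usertype": ["Subscriber", "Customer"] * 5000,
    "gender": [1, 2] * 5000,
})

# deep=True measures the actual size of the Python strings
before = toy.memory_usage(deep=True).sum()

slim = toy.assign(
    usertype=toy["usertype"].astype("category"),  # two distinct labels
    gender=toy["gender"].astype("int8"),          # values 0, 1, 2 fit in int8
)
after = slim.memory_usage(deep=True).sum()

print(before, after)
```

Whether the saving justifies the extra conversion step on the full dataset is a judgment call; the values themselves are unchanged.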


In [8]:
# total number of duplicated rows
df_2020.duplicated().sum()

0

In [9]:
# total number of NaN/null values
df_2020.isna().sum().sum()

0

In [10]:
df_2020['gender'].value_counts()  # 0 = unknown gender; not usable for gender comparisons

1    11798407
2     5551873
0     2156577
Name: gender, dtype: int64

In [11]:
df_2020['usertype'].value_counts()

Subscriber    14955766
Customer       4551091
Name: usertype, dtype: int64

### Data Cleaning

In [12]:
# make a copy of the original dataframe
df_2020_cl = df_2020.copy()

In [13]:
# remove unknown genders
df_2020_cl = df_2020_cl[df_2020_cl['gender'] != 0]
df_2020_cl.reset_index(inplace=True, drop=True)
df_2020_cl['gender'].value_counts()

1    11798407
2     5551873
Name: gender, dtype: int64

In [14]:
# convert time columns to datetime objects
df_2020_cl['starttime'] = pd.to_datetime(df_2020_cl['starttime'])
df_2020_cl['stoptime'] = pd.to_datetime(df_2020_cl['stoptime'])

In [15]:
# convert birth year to (approximate) age in 2020
df_2020_cl['age'] = 2020 - df_2020_cl['birth year']
df_2020_cl.drop('birth year', axis=1, inplace=True)

In [19]:
# Calculate the straight-line (great-circle) distance in km between the
# start and end stations from their Lat/Long
def line_dist(row):
    # extract the coordinates (decimal degrees)
    lat1, long1, lat2, long2 = row[[
        'start station latitude', 'start station longitude',
        'end station latitude', 'end station longitude'
    ]]
    # convert decimal degrees to radians
    lat1, long1, lat2, long2 = map(np.radians, [lat1, long1, lat2, long2])
    # haversine formula
    d_long = long2 - long1
    d_lat = lat2 - lat1
    a = np.sin(
        d_lat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(d_long / 2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    dist = 6371 * c  # earth radius is approx. 6371km
    return dist

# make a new column with unit in km
df_2020_cl['dist stations'] = df_2020_cl.apply(line_dist, axis=1)
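
Applying a Python function row by row over roughly 17 million rows can be slow. The haversine formula above only uses NumPy operations, so it can also be evaluated on whole columns at once. The sketch below is an alternative, not the code that produced the results in this notebook; the function name `line_dist_vectorized` and the toy frame are introduced here for illustration, with coordinates taken from the first two trips of the dataset.

```python
import numpy as np
import pandas as pd

def line_dist_vectorized(lat1, long1, lat2, long2):
    """Haversine distance in km for coordinates given in decimal degrees.

    Accepts scalars, NumPy arrays, or pandas Series.
    """
    # convert decimal degrees to radians
    lat1, long1, lat2, long2 = map(np.radians, [lat1, long1, lat2, long2])
    d_lat = lat2 - lat1
    d_long = long2 - long1
    a = np.sin(d_lat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(d_long / 2)**2
    return 6371 * 2 * np.arcsin(np.sqrt(a))  # earth radius is approx. 6371 km

# Toy frame using the coordinates of the first two trips
toy = pd.DataFrame({
    'start station latitude': [40.732219, 40.661063],
    'start station longitude': [-73.981656, -73.979453],
    'end station latitude': [40.714275, 40.665147],
    'end station longitude': [-73.989900, -73.976376],
})

toy['dist stations'] = line_dist_vectorized(
    toy['start station latitude'], toy['start station longitude'],
    toy['end station latitude'], toy['end station longitude'])

print(toy['dist stations'])
```

The values should agree with the `dist stations` column computed row by row, since the formula is identical; only the evaluation strategy differs.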

In [20]:
# check out the cleaned dataframe
df_2020_cl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17350280 entries, 0 to 17350279
Data columns (total 16 columns):
 #   Column                   Dtype         
---  ------                   -----         
 0   tripduration             int64         
 1   starttime                datetime64[ns]
 2   stoptime                 datetime64[ns]
 3   start station id         int64         
 4   start station name       object        
 5   start station latitude   float64       
 6   start station longitude  float64       
 7   end station id           int64         
 8   end station name         object        
 9   end station latitude     float64       
 10  end station longitude    float64       
 11  bikeid                   int64         
 12  usertype                 object        
 13  gender                   int64         
 14  age                      int64         
 15  dist stations            float64       
dtypes: datetime64[ns](2), float64(5), int64(6), object(3)
memory usage: 2.

In [21]:
df_2020_cl

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,gender,age,dist stations
0,789,2020-01-01 00:00:55.390,2020-01-01 00:14:05.147,504,1 Ave & E 16 St,40.732219,-73.981656,307,Canal St & Rutgers St,40.714275,-73.989900,30326,Subscriber,1,28,2.112754
1,1541,2020-01-01 00:01:08.102,2020-01-01 00:26:49.178,3423,West Drive & Prospect Park West,40.661063,-73.979453,3300,Prospect Park West & 8 St,40.665147,-73.976376,17105,Customer,1,51,0.522978
2,1464,2020-01-01 00:01:42.140,2020-01-01 00:26:07.011,3687,E 33 St & 1 Ave,40.743227,-73.974498,259,South St & Whitehall St,40.701221,-74.012342,40177,Subscriber,1,57,5.655762
3,592,2020-01-01 00:01:45.561,2020-01-01 00:11:38.155,346,Bank St & Hudson St,40.736529,-74.006180,490,8 Ave & W 33 St,40.751551,-73.993934,27690,Subscriber,1,40,1.963301
4,702,2020-01-01 00:01:45.788,2020-01-01 00:13:28.240,372,Franklin Ave & Myrtle Ave,40.694546,-73.958014,3637,Fulton St & Waverly Ave,40.683239,-73.965996,32583,Subscriber,1,38,1.426126
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17350275,1178,2020-12-31 23:58:14.100,2021-01-01 00:17:52.338,3335,Union St & 4 Ave,40.677274,-73.982820,3860,Wilson Ave & Troutman St,40.701660,-73.927540,47991,Customer,1,32,5.392248
17350276,1344,2020-12-31 23:58:17.480,2021-01-01 00:20:41.607,456,E 53 St & Madison Ave,40.759711,-73.974023,468,Broadway & W 56 St,40.765265,-73.981923,36946,Subscriber,1,26,0.907857
17350277,589,2020-12-31 23:58:21.262,2021-01-01 00:08:10.922,3999,Adam Clayton Powell Blvd & W 138 St,40.816960,-73.942296,3518,Lenox Ave & W 126 St,40.808442,-73.945209,48973,Subscriber,1,23,0.978335
17350278,2045,2020-12-31 23:58:21.704,2021-01-01 00:32:27.157,526,E 33 St & 5 Ave,40.747659,-73.984907,3614,Crescent St & 30 Ave,40.768692,-73.924957,36467,Subscriber,1,26,5.564695


### What is the structure of our dataset?

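Twelve months of NYC bike share data for 2020 are concatenated into a single dataframe. After cleaning (dropping trips with unknown gender, replacing `birth year` with `age`, and adding `dist stations`), it has 17,350,280 rows and 16 columns.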

### What is/are the main feature(s) of interest in our dataset? What features in the dataset do you think will help support your investigation into your feature(s) of interest?

- Where do Citi Bikers ride (in terms of destination, as the route taken is unknown)?
  <br>$end\ station\ id$
  <br>$start\ \&\ end\ station\ Lat/Long$ $(possibly\ shown\ on\ a\ map)$
- How far do they go (in terms of straight-line distance, as the route taken is unknown)?
  <br>$dist\ stations$
- When are most trips taken in terms of time of day, day of the week, or month of the year?
  <br>$starttime,\ stoptime$
- Which stations are most popular?
  <br>$start\ \&\ end\ station\ id$
- How long does the average trip take?
  <br>$tripduration$
- Do any of the above depend on whether a user is a subscriber or a customer, male or female, or of a certain age?
  <br>$usertype,\ gender,\ age$

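The time-of-day, day-of-week and month questions need features derived from `starttime`. The following is a minimal sketch of one way to derive them with the pandas `.dt` accessor; the new column names (`start hour`, `start weekday`, `start month`, `duration min`) are hypothetical, and the toy frame uses the start times and durations of the first and one of the last trips shown above.

```python
import pandas as pd

# Toy frame standing in for df_2020_cl
toy = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2020-01-01 00:00:55.390",
        "2020-12-31 23:58:14.100",
    ]),
    "tripduration": [789, 1178],  # seconds
})

# Hypothetical derived columns for the "when" questions
toy["start hour"] = toy["starttime"].dt.hour
toy["start weekday"] = toy["starttime"].dt.day_name()
toy["start month"] = toy["starttime"].dt.month

# Trip duration in minutes is often easier to read than seconds
toy["duration min"] = toy["tripduration"] / 60

print(toy)
```

Grouping by any of these columns together with `usertype`, `gender` or `age` would then give the breakdowns asked about in the last question.
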

## Univariate Exploration

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!