# Verifying Integrity

We count trips and bikes in two different ways. The checks below confirm that the two counts agree and that the totals look reasonable.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import requests
import io
import zipfile
from tqdm import tqdm

In [2]:
stations = pd.read_csv("../data/final/june_22_station_metadata.csv", index_col=0)

In [3]:
june_22 = pd.read_csv("../data/final/all_june_22_citibike_trips.csv", index_col=0)

In [4]:
stations.head(1)

| station id | latitude | longitude | station name | incoming trips | outgoing trips | all trips | kind | bikes outbound | outbound trips | bikes inbound | inbound trips | delta bikes | delta trips |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72 | 40.767272 | -73.993929 | W 52 St & 11 Ave | 133 | 147 | 280 | active | 3 | 25 | 13 | 128 | 10 | 103 |


**Check 1**: The outbound bike count is reasonable.

In [5]:
sum(stations['bikes outbound'])

6493

There are reportedly 8000 bikes in the system, including bikes out for maintenance and bikes in New Jersey (not yet included in the data), so this is a completely reasonable number.

**Check 2**: The number of outbound bikes and inbound bikes match.

In [6]:
sum(stations['bikes outbound']) == sum(stations['bikes inbound'])

True

**Check 3**: The outbound bike count matches the bike count in the raw data.

In [7]:
len(set(june_22['bikeid'].values))

6493

In [8]:
len(set(june_22['bikeid'].values)) == sum(stations['bikes outbound'])

True
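As a side note, the distinct-bike count in Check 3 can also be written with `Series.nunique()`. The sketch below uses a made-up toy series (not the real trip data) to show that the two spellings agree when there are no missing IDs; `nunique()` skips `NaN` by default, so the results could differ on a column with gaps.

```python
import pandas as pd

# Toy bike IDs for illustration only; not taken from the June 22 data.
bike_ids = pd.Series([10, 10, 11, 12, 11], name='bikeid')

via_set = len(set(bike_ids.values))   # the approach used in the cells above
via_nunique = bike_ids.nunique()      # pandas built-in distinct count

print(via_set, via_nunique)  # 3 3
```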

**Check 4**: The incoming trip count *almost* matches the outgoing trip count.

(Only almost, because we removed a number of trip-*emitting* but non-*receiving* depots from the dataset.)

In [9]:
sum(stations['incoming trips'])

56759

In [10]:
sum(stations['outgoing trips'])

56749

**Check 5**: The all trips counter is *almost* twice the length of our base dataset (twice because we count comings and goings separately, almost because of the additional depot-emitted trips).

In [11]:
sum(stations['all trips']) / 2

56754.0

In [12]:
len(june_22)

56759
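The arithmetic behind Checks 4 and 5 can be spelled out with the totals printed above. This sketch assumes that `all trips` is the sum of `incoming trips` and `outgoing trips` for each station, which holds for the sample row (133 + 147 = 280) but is not stated explicitly in the notebook.

```python
# Totals copied from the outputs of cells 9, 10, and 12.
incoming = 56759
outgoing = 56749
base_len = 56759

# Outgoing trips missing from the station table (Check 4).
missing_outgoing = incoming - outgoing

# Under the assumption above, half of "all trips" is the mean of the two totals.
all_trips_half = (incoming + outgoing) / 2

print(missing_outgoing)           # 10
print(all_trips_half)             # 56754.0
print(base_len - all_trips_half)  # 5.0, i.e. half of the 10 missing trips
```

Under that assumption, the gap of 5 in Check 5 is the same discrepancy as the gap of 10 in Check 4, halved.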

**Check 6**: Delta bikes zeroes out.

In [13]:
sum(stations['delta bikes'])

0

**Check 7**: Delta trips zeroes out.

In [14]:
sum(stations['delta trips'])

0

**Check 8**: Outbound trips match inbound trips (and both match the length of the core dataset).

In [15]:
sum(stations['inbound trips'])

56759

In [16]:
sum(stations['inbound trips']) == sum(stations['outbound trips'])

True
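The exact-match checks above could be gathered into a single helper so they can be rerun whenever the data is regenerated. The following is only a sketch: `integrity_report` is a hypothetical function name, and the two small frames are toy data built to satisfy the invariants, not the real `stations` and `june_22` frames. Checks 4 and 5 are left out because they are only expected to match approximately.

```python
import pandas as pd


def integrity_report(stations, trips):
    """Return a dict mapping each check to True/False (hypothetical helper)."""
    n_bikes = trips['bikeid'].nunique()  # distinct bikes in the raw trips
    n_trips = len(trips)                 # one row per trip
    total = stations.sum(numeric_only=True)
    return {
        'bikes outbound == bikes inbound':
            bool(total['bikes outbound'] == total['bikes inbound']),
        'bikes outbound == distinct bike ids':
            bool(total['bikes outbound'] == n_bikes),
        'delta bikes sum to zero': bool(total['delta bikes'] == 0),
        'delta trips sum to zero': bool(total['delta trips'] == 0),
        'outbound trips == inbound trips':
            bool(total['outbound trips'] == total['inbound trips']),
        'inbound trips == number of trips':
            bool(total['inbound trips'] == n_trips),
    }


# Toy inputs for illustration only.
toy_stations = pd.DataFrame(
    {
        'bikes outbound': [2, 1],
        'bikes inbound': [1, 2],
        'outbound trips': [3, 1],
        'inbound trips': [1, 3],
        'delta bikes': [-1, 1],
        'delta trips': [-2, 2],
    },
    index=pd.Index([1, 2], name='station id'),
)
toy_trips = pd.DataFrame({'bikeid': [10, 10, 11, 12]})

report = integrity_report(toy_stations, toy_trips)
print(all(report.values()))  # True
```

With the real data, the call would presumably be `integrity_report(stations, june_22)`, and any `False` entry would point at the check that needs a closer look.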