# Helsinki City Bikes data analysis, challenge - using numpy only


<img src="cover.png">


We'll be using the Helsinki City Bikes data (available [here](https://www.hsl.fi/en/opendata)) for our NumPy analysis. The data covers the whole period from April to October 2019 and can be downloaded month by month. More general information about the Helsinki City Bikes project can be found [here](https://kaupunkipyorat.hsl.fi/en).

The data contains the details of every bike ride: departure date, return date, departure station id, departure station name, return station id, return station name, covered distance (m), and duration (sec.).

Before loading the CSV data, we need to rename two stations that have a comma in their name (to prevent errors caused by a differing number of columns). We replaced the commas inside the names of these stations with semicolons in all our files. Another thing to be aware of is that Finnish characters such as ä and ö are not part of ASCII, so we need to read the files with UTF-8 encoding (8-bit Unicode Transformation Format) to have all the station names displayed correctly.





In [1]:
import numpy as np
# april_data = np.genfromtxt('2019-04.csv', delimiter="," , encoding='utf-8',
#                            dtype=[('<U18'),('<U18'),('int64'),('<U18'),('int64'),('<U18'),('int64'),('int64')], skip_header=1)
# may_data = np.genfromtxt('2019-05.csv', delimiter="," , encoding='utf-8',
#                            dtype=[('<U18'),('<U18'),('int64'),('<U18'),('int64'),('<U18'),('int64'),('int64')], skip_header=1)
# june_data = np.genfromtxt('2019-06.csv', delimiter="," , encoding='utf-8',
#                            dtype=[('<U18'),('<U18'),('int64'),('<U18'),('int64'),('<U18'),('int64'),('int64')], skip_header=1)
# july_data = np.genfromtxt('2019-07.csv', delimiter="," , encoding='utf-8',
#                            dtype=[('<U18'),('<U18'),('int64'),('<U18'),('int64'),('<U18'),('int64'),('int64')], skip_header=1)
# august_data = np.genfromtxt('2019-08.csv', delimiter="," , encoding='utf-8',
#                            dtype=[('<U18'),('<U18'),('int64'),('<U18'),('int64'),('<U18'),('int64'),('int64')], skip_header=1)
# september_data = np.genfromtxt('2019-09.csv', delimiter="," , encoding='utf-8',
#                            dtype=[('<U18'),('<U18'),('int64'),('<U18'),('int64'),('<U18'),('int64'),('int64')], skip_header=1)
# october_data = np.genfromtxt('2019-10.csv', delimiter="," , encoding='utf-8',
#                            dtype=[('<U18'),('<U18'),('int64'),('<U18'),('int64'),('<U18'),('int64'),('int64')], skip_header=1)

april_data = np.genfromtxt('2019-04.csv', delimiter="," , encoding='utf-8',dtype='str', skip_header=1)
may_data = np.genfromtxt('2019-05.csv', delimiter="," , encoding='utf-8', dtype='str', skip_header=1)
june_data = np.genfromtxt('2019-06.csv', delimiter="," , encoding='utf-8',dtype='str', skip_header=1)
july_data = np.genfromtxt('2019-07.csv', delimiter="," , encoding='utf-8', dtype='str', skip_header=1)
august_data = np.genfromtxt('2019-08.csv', delimiter="," , encoding='utf-8',dtype='str', skip_header=1)
september_data = np.genfromtxt('2019-09.csv', delimiter="," , encoding='utf-8',dtype='str', skip_header=1)
october_data = np.genfromtxt('2019-10.csv', delimiter="," , encoding='utf-8',dtype='str', skip_header=1)
print(september_data.shape)
september_data[:5]

(450056, 8)


array([['2019-09-30T23:59:29', '2019-10-01T00:10:02', '7', 'Designmuseo',
        '64', 'Tyynenmerenkatu', '2250', '628'],
       ['2019-09-30T23:59:11', '2019-10-01T00:07:37', '111',
        'Esterinportti', '86', 'Kuusitie', '1749', '501'],
       ['2019-09-30T23:54:58', '2019-09-30T23:59:22', '161',
        'Eteläesplanadi', '24', 'Mannerheimintie', '535', '260'],
       ['2019-09-30T23:51:09', '2019-09-30T23:58:59', '517',
        'Länsituuli', '900', 'Orionintie', '2140', '466'],
       ['2019-09-30T23:50:09', '2019-10-01T00:13:52', '115',
        'Venttiilikuja', '12', 'Kanavaranta', '4792', '1422']],
      dtype='<U37')

We read all of our data with a string dtype because we want to stack the months in the next step. If we gave each column its own dtype, NumPy would return a one-dimensional structured array of records, and we would lose the column dimension.
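The difference between the two loading strategies can be seen on a tiny in-memory sample. The following is a minimal sketch, not part of the original notebook: the two rows are copied from the September output above, `io.StringIO` stands in for the CSV file, and `dtype=None` is used here only to let `genfromtxt` infer one type per column.

```python
import io
import numpy as np

# Two rows copied from the September output above; StringIO stands in for a file.
sample = (
    "2019-09-30T23:59:29,2019-10-01T00:10:02,7,Designmuseo,64,Tyynenmerenkatu,2250,628\n"
    "2019-09-30T23:59:11,2019-10-01T00:07:37,111,Esterinportti,86,Kuusitie,1749,501\n"
)

# Everything as strings: a regular 2-D array, one row per ride, one column per field.
as_str = np.genfromtxt(io.StringIO(sample), delimiter=",", dtype=str, encoding="utf-8")

# Per-column types (inferred with dtype=None): a 1-D structured array of records.
as_records = np.genfromtxt(io.StringIO(sample), delimiter=",", dtype=None, encoding="utf-8")

print(as_str.shape)      # (2, 8)
print(as_records.shape)  # (2,)
```

The 2-D string array can be sliced by column (`as_str[:, 3]`), which is what the rest of the analysis relies on.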

After that we'll stack all of our monthly datasets vertically.

In [17]:
final_data = np.vstack((april_data,may_data,june_data,july_data,august_data,september_data,october_data))

In [18]:
np.info(final_data)

class:  ndarray
shape:  (3787948, 8)
strides:  (1184, 148)
itemsize:  148
aligned:  True
contiguous:  True
fortran:  False
data pointer: 0x199d9b3c040
byteorder:  little
byteswap:  False
type: <U37


Our final data has two dimensions and shape (3787948, 8).

Next we will check for missing values. Since every trip is tracked, even those with a distance of 0 meters, we expect no missing values, but we will have to deal with some invalid inputs instead.

In [19]:
np.where(final_data=='nan')

(array([], dtype=int64), array([], dtype=int64))

In [20]:
np.where(final_data==' ')

(array([], dtype=int64), array([], dtype=int64))

In [None]:
# np.isnan does not accept string arrays, so we test for empty strings instead
(final_data == '').any()

As predicted, no missing values were detected, either as 'nan' or as ' ' (a single space) in our string values. We should now check for possible invalid inputs in our data.

In the rest of this analysis we work either with string data or with numerical data (the distance and time of the rides). We will therefore create a dataset with only the numerical values and use the two datasets side by side.

In [21]:
#the last two columns hold distance and time; the result is a 2-D array of strings
dst_time = final_data[:,-2:]
print(dst_time.shape)
dst_time[:5]

(3787948, 2)


array([['2', '57'],
       ['2196', '569'],
       ['0', '20'],
       ['2121', '596'],
       ['2460', '1127']], dtype='<U37')

We need to convert our data from strings to integers. Some inputs have decimal values, so we first convert to floats and then to integers.
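Why the detour through floats is needed can be shown on a toy array. This is a minimal sketch with made-up values (the string `'69908.33'` mimics the decimal inputs mentioned above); it is not the notebook's data.

```python
import numpy as np

# Toy distances as strings; '69908.33' mimics the decimal inputs mentioned above.
raw = np.array(["2196", "69908.33", "0"])

try:
    raw.astype("int64")            # direct conversion fails on the decimal string
    direct_ok = True
except ValueError:
    direct_ok = False

as_float = raw.astype("float64")   # parse the text as floating-point numbers
as_int = as_float.astype("int64")  # then drop the fractional part (truncation)

print(direct_ok)        # False
print(as_int.tolist())  # [2196, 69908, 0]
```

Note that the second step truncates rather than rounds, so 69908.33 m becomes 69908 m.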

In [22]:
#first to float64
dst_time = dst_time.astype('float64')
dst_time[0]  ## first row: distance and time

array([ 2., 57.])

In [23]:
#then to integer
dst_time = dst_time.astype('int64')
dst_time[0]  ## first row: distance and time

array([ 2, 57], dtype=int64)

#### Possible work with the data

1) DATA Screening and specifying the invalid inputs
   - returning to the same station (with ride time less than 60 seconds)
   - checking the min and max of the distance and time
   - frequency of bike malfunctions (too short rides, too long rides, mobile app internet problems)

2) DATA Analysis
   - Total distance covered over the period and std deviation
   - Total hours cycled over the period and std deviation
   - Average number of rides by month
   - Correlation between distance and time
   - The differences between months according to the ride distance and time
   - Most popular departure and return stations
   - Finding the busiest hours of the day
   - Most frequent routes used
   - Normalizing the data
   - Moving average

### Checking those rides that returned the bike to the same station

A ride like this could be due to a malfunction, a mobile app error or a mistake. If a person just borrowed the bike for a couple of seconds and returned it, it could negatively affect our analysis.

In [24]:
#checking for indexes of those rides
same_stations_inds = np.where(final_data[:,3] == final_data[:,5])
same_stations_inds

(array([      0,       2,      15, ..., 3787911, 3787918, 3787919],
       dtype=int64),)

In [25]:
#finding total number of those rides
same_stations_totals = same_stations_inds[0].size
print('\nThe number of rides with same departure and return station {:,}.'.format(same_stations_totals),'\n')
print('Percentage of rides from same station rides is {:.2f}% from the whole dataset.'.format(same_stations_totals/(len(final_data))*100))


The number of rides with same departure and return station 228,383. 

Percentage of rides from same station rides is 6.03% from the whole dataset.


In total there were 228,383 rides that returned the bike to the same station. That alone doesn't make a ride invalid, so we need to check the duration of these rides.

We consider a ride invalid if someone borrowed the bike and returned it to the same station within 60 seconds.
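The two conditions can also be combined directly as boolean masks. The following is a small sketch on hypothetical toy data (the arrays `dep`, `ret` and `duration` are invented for illustration), not the notebook's own cell.

```python
import numpy as np

# Hypothetical toy data: station names and ride durations in seconds.
dep = np.array(["Ooppera", "Maunula", "Kanavaranta", "Ooppera"])
ret = np.array(["Ooppera", "Maunula", "Ooppera", "Ooppera"])
duration = np.array([57, 900, 444, 60])

same_station = dep == ret              # bike came back to where it started
short = duration <= 60                 # ride lasted 60 seconds or less
invalid = same_station & short         # both conditions hold

print(np.count_nonzero(same_station))  # 3
print(np.count_nonzero(invalid))       # 2
```

Counting `True` values with `np.count_nonzero` gives the totals without building index arrays first.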

In [26]:
invalid_rides_ind = np.where(dst_time[same_stations_inds,1]<= 60)
inv_values, total_inv_rides = np.unique(invalid_rides_ind[1], return_counts=True)
total_inv_rides = total_inv_rides.sum()
print('\nThe total number of invalid rides is {:,}.'.format(total_inv_rides),'\n')
print('Percentage of invalid rides from those with same departure and return station rides is {:.2f}%, which is {:.2f}% of all rides.'.format(total_inv_rides/same_stations_totals*100,total_inv_rides/(len(final_data))*100))


The total number of invalid rides is 112,474. 

Percentage of invalid rides from those with same departure and return station rides is 49.25%, which is 2.97% of all rides.


So the total number of invalid rides, meaning those that started and finished at the same station and lasted 60 seconds or less, was 112,474, which was 49.25% of all the rides that had the same departure and return station.

We want to remove these rides from our final data.
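There are several equivalent ways to drop rows by index in NumPy. The sketch below uses a hypothetical toy table (`rides` and `drop` are invented names) to compare three of them; it is an illustration, not a statement about which one the notebook must use.

```python
import numpy as np

# Hypothetical toy table: 5 rides, 2 columns; rows 1 and 3 are to be dropped.
rides = np.arange(10).reshape(5, 2)
drop = np.array([1, 3])

# Option 1: np.delete along axis 0 (rows); axis=1 would remove columns instead.
kept_delete = np.delete(rides, drop, axis=0)

# Option 2: keep the complement of the unwanted indexes.
kept_setdiff = rides[np.setdiff1d(np.arange(len(rides)), drop)]

# Option 3: a boolean mask that is False for the unwanted rows.
mask = np.ones(len(rides), dtype=bool)
mask[drop] = False
kept_mask = rides[mask]

print(kept_delete.shape)  # (3, 2)
```

All three return a new array and leave `rides` unchanged, so the result has to be assigned back to a variable.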

In [27]:
last_col = final_data[:,7]
last_col = last_col.astype('float64')
last_col = last_col.astype('int64')

In [28]:
temp_inds = np.where(last_col <= 60)
common_elements = np.intersect1d(same_stations_inds[0],temp_inds[0])

#np.delete(final_data, common_elements, axis=0) would also remove these rows;
#the earlier attempt failed because it misspelled the variable name and used axis=1 (columns)

In [30]:
#keep only the rows whose index is not among the invalid rides
arr = np.arange(len(final_data))
final_data = final_data[np.setdiff1d(arr,common_elements)]
print(final_data.shape)
print(final_data[:10])

(3675474, 8)
[['2019-04-30T23:59:31' '2019-05-01T00:09:00' '140' 'Verkatehtaanpuisto'
  '134' 'Haukilahdenkatu' '2196' '569']
 ['2019-04-30T23:59:21' '2019-05-01T00:09:20' '39' 'Ooppera' '44'
  'Sörnäisten metroasema' '2121' '596']
 ['2019-04-30T23:59:19' '2019-05-01T00:18:12' '57'
  'Lauttasaaren ostoskeskus' '63' 'Jätkäsaarenlaituri' '2460' '1127']
 ['2019-04-30T23:59:19' '2019-05-01T00:16:32' '505' 'Westendinasema'
  '593' 'Toppelundintie' '2058' '1028']
 ['2019-04-30T23:59:14' '2019-05-01T00:06:15' '647' 'Lystimäki' '623'
  'Nelikkotie' '1429' '416']
 ['2019-04-30T23:59:13' '2019-05-01T00:14:11' '36' 'Apollonkatu' '120'
  'Mäkelänkatu' '3413' '898']
 ['2019-04-30T23:59:09' '2019-05-01T00:07:16' '225' 'Maunula' '230'
  'Mäkitorpantie' '1882' '483']
 ['2019-04-30T23:59:04' '2019-05-01T00:06:34' '12' 'Kanavaranta' '22'
  'Rautatientori / länsi' '1217' '444']
 ['2019-04-30T23:59:01' '2019-05-01T00:03:28' '647' 'Lystimäki' '581'
  'Niittykumpu (M)' '1180' '266']
 ['2019-04-30T23:58:48' 

Having removed the rides that were invalid because they were not really rides, we should also look at the rides that were far too long and could distort our descriptive statistics.

Let's look at the longest rides distance-wise.

### Removing too long rides (distance)

In [29]:
len_col = final_data[:,6]
len_col = len_col.astype('float64')
sort_len = np.argsort(len_col) #indexes that sort the distances in ascending order
print(len_col[sort_len[-20:]]) #the 20 longest distances (m)

[  69405.     69908.33   74425.     83175.    118841.67  122115.
  129742.    157656.    163633.33  190216.67  241519.    347575.
  407383.33  421708.33  433104.    556125.    951100.   1004058.33
 2106675.   3589426.  ]


These rides are very long given the 30-minute limit set for City Bikes (you have to return the bike to a station within this time, otherwise you pay for exceeding it). Nevertheless, we believe that rides of up to 200 km are possible, even though they end up being quite pricey. We decided to remove the 10 longest rides, because they were all over 240 km long.
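The idea of dropping the longest rides by their sorted position can be sketched on a few made-up distances. The values and the variable `k` below are hypothetical; the 240 km threshold is taken from the text above.

```python
import numpy as np

# Hypothetical distances in meters; the two largest play the role of outliers.
dist = np.array([1500.0, 3589426.0, 2196.0, 241519.0, 980.0])
k = 2  # number of longest rides to drop (10 in the notebook)

order = np.argsort(dist)   # indexes from shortest to longest ride
longest = order[-k:]       # indexes of the k longest rides
keep = np.setdiff1d(np.arange(len(dist)), longest)

# The same selection expressed as a threshold of 240 km (240,000 m).
by_threshold = dist[dist <= 240_000]

print(dist[keep].tolist())    # [1500.0, 2196.0, 980.0]
print(by_threshold.tolist())  # [1500.0, 2196.0, 980.0]
```

A threshold states the rule explicitly, while a fixed count of rides depends on how many extreme values the data happens to contain.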

In [30]:
#removing longest 10 rides from the dataset
longest = sort_len[-10:]
arr1 = np.arange(len(final_data))
final_data = final_data[np.setdiff1d(arr1,longest)]
print(final_data.shape)
print(final_data[:5])

(3787938, 8)
[['2019-04-30T23:59:35' '2019-05-01T00:00:36' '43' 'Karhupuisto' '43'
  'Karhupuisto' '2' '57']
 ['2019-04-30T23:59:31' '2019-05-01T00:09:00' '140' 'Verkatehtaanpuisto'
  '134' 'Haukilahdenkatu' '2196' '569']
 ['2019-04-30T23:59:25' '2019-04-30T23:59:46' '121' 'Vilhonvuorenkatu'
  '121' 'Vilhonvuorenkatu' '0' '20']
 ['2019-04-30T23:59:21' '2019-05-01T00:09:20' '39' 'Ooppera' '44'
  'Sörnäisten metroasema' '2121' '596']
 ['2019-04-30T23:59:19' '2019-05-01T00:18:12' '57'
  'Lauttasaaren ostoskeskus' '63' 'Jätkäsaarenlaituri' '2460' '1127']]


### Removing too short rides (distance)

In [31]:
#distance column as floats
len_col = final_data[:,6].astype('float64')
#indexes of rides with zero or negative distance
too_short = np.where(len_col <= 0)[0]
print(len(too_short))

100235


We see that some rides still have a distance of 0 meters, even though we already removed those that started and finished at the same station within one minute. This is clearly another kind of invalid ride, so we will remove all rides with a distance of 0 meters as well, including the strange ride with a negative distance.

Let's remove the rides of 5 meters or less.

In [58]:
#removing too short rides from data
final_data = final_data[final_data[:,6].astype("float")>5] ##keep rides longer than 5 m
print(final_data.shape)

(3653803, 8)


### Removing too short rides (time)

Even though we removed the rides that have the same departure and return station and last a minute or less, we still have some very short rides, 0 seconds included. We decided to remove the rides that lasted 10 seconds or less from our dataset.

In [60]:
final_data = final_data[final_data[:,7].astype("float")>10] #keep rides longer than 10 s
print(final_data.shape)

(3653803, 8)


In [34]:
#checking for our negative distance, which is no longer there
print(final_data[:,6])
np.where(final_data[:,6] == '-4290436')

['2' '2196' '2121' ... '3477' '3531' '1468']


(array([], dtype=int64),)

In [63]:
#final number of rows
print('\nNumber of rides in our final dataset is {:,}.'.format(len(final_data)))


Number of rides in our final dataset is 3,653,803.


## Data analysis

This section covers the basic descriptive analysis of all the valid rides.

In [64]:
#setting up the time column
time_col = final_data[:,7]
time_col = time_col.astype('float64')
time_col = time_col.astype('int64')

#setting up the distance column
dist_col = final_data[:,6]
dist_col = dist_col.astype('float64')
dist_col = dist_col.astype('int64')

### Distance of rides

In [65]:
#total distance over the whole period
total_distance = np.sum(dist_col)/1000
print('\nTotal distance covered by all the rides was {:,.0f} km.'.format(total_distance))


Total distance covered by all the rides was 7,890,152 km.


In [66]:
tothemoon = total_distance/384400
tothemoon

20.525889547346512

#### To the Moon and back over 10 times!
It looks like people cycled a lot with Helsinki City Bikes, but this number is too big to give us a concrete idea. The distance from the Earth to the Moon is 384,400 km, so the total corresponds to about 20.5 one-way trips. Within 7 months, people in Helsinki cycled the equivalent of going to the Moon and back more than 10 times.
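The arithmetic behind the headline can be checked in a few lines. The sketch uses the rounded total printed above, so the last decimals may differ slightly from the notebook's unrounded value; the constant name is chosen here for illustration.

```python
EARTH_MOON_KM = 384_400        # Earth-Moon distance used in the notebook
total_distance_km = 7_890_152  # rounded total printed above

one_way_trips = total_distance_km / EARTH_MOON_KM
round_trips = one_way_trips / 2  # there and back counts as two one-way trips

print(round(one_way_trips, 2))  # 20.53
print(round(round_trips, 2))    # 10.26
```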

In [67]:
#average bike ride
average_dist = np.average(dist_col)/1000
print('\nPeople on average rode {:.2f} km per ride.'.format(average_dist))


People on average rode 2.16 km per ride.


In [68]:
#Standard deviation distance
std_dist = np.std(dist_col)/1000
print('std: {:.2f} km'.format(std_dist))

std: 1.64 km


In [69]:
#longest bike ride
longest_dist = np.max(dist_col)/1000
print('\nThe longest valid ride was {:,.2f} km long.'.format(longest_dist))


The longest valid ride was 190.22 km long.


In [70]:
#the shortest distance
shortest_dist = np.min(dist_col)
print('\nThe shortest valid ride was {:.2f} m long.'.format(shortest_dist))


The shortest valid ride was 6.00 m long.


### Time of rides

In [71]:
max_time = (np.max(time_col)/3600)/24
print('Maximum time of one single ride: {:.2f} day'.format(max_time))

Maximum time of one single ride: 62.52 day


In [72]:
min_time = np.min(time_col)/60
print('Minimum time of one single ride: {} min'.format(min_time))

Minimum time of one single ride: 0.2 min


In [73]:
#average ride time
average_time = np.average(time_col)/60
print('\nAverage time:{:.2f} min'.format(average_time))


Average time:15.83 min


In [74]:
#Standard deviation time
std_time = np.std(time_col)/60
print('std: {:,.2f} min'.format(std_time))

std: 136.32 min


### Is there a correlation between time and distance of the rides?

In [75]:
correlation = np.corrcoef(dist_col,time_col)
print(correlation)

[[1.         0.09738007]
 [0.09738007 1.        ]]


Even though a correlation between the distance and the time of the rides would seem rather intuitive, the coefficient of about 0.10 shows only a very weak linear relationship between these two variables. Some people covered longer distances in a longer time, while others were just going slowly and enjoying the city. The extremely long rentals that remain in the data (up to 62.52 days) probably also pull the coefficient down.
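How strongly a single extreme duration can affect `np.corrcoef` is easy to demonstrate on synthetic data. The sketch below does not use the Helsinki data: the distances, the constant speed and the outlier value are all invented for illustration.

```python
import numpy as np

# Synthetic rides (not the Helsinki data): time grows linearly with distance.
dist = np.arange(1, 101) * 100.0   # 100 m ... 10 km
time = dist * 0.3                  # seconds, i.e. a constant speed

r_clean = np.corrcoef(dist, time)[0, 1]

# Turn one short ride into a multi-week rental, as if a bike was never returned.
time_outlier = time.copy()
time_outlier[0] = 5_000_000.0      # roughly 58 days in seconds

r_outlier = np.corrcoef(dist, time_outlier)[0, 1]

print(round(r_clean, 3))           # 1.0
print(r_outlier < r_clean)         # True
```

One way to follow this up on the real dataset would be to recompute the coefficient after also capping the ride duration, in the same way the distance was capped earlier.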

### Most popular departure and return stations

Which stations were used the most and which the least?

<img src="map.png">

#### Most popular departure station

In [76]:
# Popular Departure stations
(popular_dep_st,popular_dep_count) = np.unique(final_data[:,3],return_counts=True)

In [77]:
max_index_row = np.argmax(popular_dep_count)
print(popular_dep_st[max_index_row])
print(popular_dep_count[max_index_row])

Töölönlahdenkatu
81984
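The same `np.unique(..., return_counts=True)` result can be used to rank every station rather than only the top one. This is a minimal sketch on a hypothetical departure column (`dep` is invented for illustration).

```python
import numpy as np

# Hypothetical departure-station column.
dep = np.array(["Ooppera", "Maunula", "Ooppera", "Kanavaranta", "Ooppera", "Maunula"])

stations, counts = np.unique(dep, return_counts=True)

# Sort by count, descending, to rank every station rather than only the top one.
order = np.argsort(counts)[::-1]
for name, n in zip(stations[order], counts[order]):
    print(name, n)
```

Slicing `order[:5]` would give the five busiest stations, and `order[-5:]` the five least used.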


#### Most popular return station

In [78]:
# Popular return stations
(popular_ret_st, popular_ret_count) = np.unique(final_data[:,5],return_counts=True)

In [79]:
max_index_row = np.argmax(popular_ret_count)
print(popular_ret_st[max_index_row])
print(popular_ret_count[max_index_row])

Töölönlahdenkatu
81811


#### The least popular departure station

In [80]:
# Least popular departure stations
min_index_row = np.argmin(popular_dep_count)
print(popular_dep_st[min_index_row])
print(popular_dep_count[min_index_row])

Ruomelantie***
1


#### The least popular return station

In [81]:
# Least popular return stations
min_index_row = np.argmin(popular_ret_count)
print(popular_ret_st[min_index_row])
print(popular_ret_count[min_index_row])

Ruomelantie***
3
