# 0.3 Station Data Cleaning

In this notebook, the bike-sharing trip data is linked to station data, which makes it possible to analyse the geographical distribution of the trips. Before that, the station data is examined separately.

## Import of needed packages for geospatial analysis

In [1]:
import pandas as pd
import numpy as np
import geopandas as gpd
import geoplot as gplt
import geoplot.crs as gcrs
import matplotlib.pyplot as plt
import seaborn as sns
import folium as fo
from folium import Map
from folium.plugins import HeatMap

## Import of data

In [2]:
df_Trips = pd.read_csv('data/boston_2017.csv')
df_stations_2017 = pd.read_csv('data/previous_Hubway_Stations_as_of_July_2017.csv')

## Overview of the station data

In [13]:
df_stations_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 281 entries, 0 to 280
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Station ID       281 non-null    object 
 1   Station          281 non-null    object 
 2   Latitude         281 non-null    float64
 3   Longitude        281 non-null    float64
 4   Municipality     281 non-null    object 
 5   publiclyExposed  281 non-null    int64  
 6   # of Docks       281 non-null    int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 15.5+ KB


In [14]:
df_stations_2017.head(10)

Unnamed: 0,Station ID,Station,Latitude,Longitude,Municipality,publiclyExposed,# of Docks
0,A32019,175 N Harvard St,42.363796,-71.129164,Boston,1,18
1,S32035,191 Beacon St,42.380323,-71.108786,Somerville,1,15
2,S32023,30 Dane St.,42.381001,-71.104025,Somerville,1,15
3,M32026,359 Broadway - Broadway at Fayette Street,42.370803,-71.104412,Cambridge,1,23
4,M32054,699 Mt Auburn St,42.375002,-71.148716,Cambridge,1,25
5,M32060,700 Huron Ave,42.380788,-71.154129,Cambridge,1,19
6,M32058,84 Cambridgepark Dr,42.3936,-71.143941,Cambridge,1,25
7,A32032,Airport T Stop - Bremen St at Brooks St,42.374113,-71.032775,Boston,1,16
8,M32046,Alewife MBTA at Steel Place,42.395588,-71.142606,Cambridge,1,19
9,M32033,Alewife Station at Russell Field,42.396105,-71.139459,Cambridge,1,23


The data lists 281 stations.

As our goal is to visualize the trips, only the geospatial information is important.
Nonetheless, the data also contains each station's municipality, whether it is publicly exposed, and the number of bikes that can be stored there (`# of Docks`).

In the last step of the station data description, the distribution of the stations in Boston is plotted on an interactive map.

In [15]:
m = Map(location=[42.353089, -71.066170], zoom_start=12, )

# mark each station as a point
for index, row in df_stations_2017.iterrows():
    fo.CircleMarker([row['Latitude'], row['Longitude']],
                        radius=1
                       ).add_to(m)
# convert to a list of [lat, lon] pairs for the heatmap
stationArr = df_stations_2017[['Latitude', 'Longitude']].values.tolist()

# plot heatmap
m.add_child(fo.plugins.HeatMap(stationArr, radius=20))
m

## Examination of the trip data

In [16]:
StatCount = df_Trips.groupby('start_station_name')['start_time'].agg(len)

StatCount
#print(StatCount.to_string())

df_Trips[df_Trips['start_station_name'].str.contains("Curtis Hall")].iloc[3:5]

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type
1464,2017-01-03 08:14:44,2017-01-03 08:41:45,124,33,Curtis Hall at South Street,Kenmore Sq / Comm Ave,741,Subscriber
63315,2017-03-29 08:19:21,2017-03-29 08:42:20,124,36,Curtis Hall - South St at Centre St,Copley Square - Dartmouth St at Boylston St,1558,Subscriber


Further examination reveals that station names in the trip data are not standardised: different spellings of the same station (same ID) occur. The goal is therefore to standardise them so that as many trips as possible can be linked to stations in the later geospatial analysis.

To solve this, we add the station ID used in the trip data to the station data (matched by station name) and then merge the station data into the trip data on that ID.
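To illustrate the naming problem, the following minimal sketch uses a small toy dataframe (the two Curtis Hall spellings come from the rows above; the row counts are made up) to count spellings per station ID and pick one canonical name. Choosing the most frequent spelling is an assumption for illustration, not the rule used in this notebook.

```python
import pandas as pd

# Toy trips: station id 124 appears under two different spellings.
trips = pd.DataFrame({
    "start_station_id": [124, 124, 124, 36],
    "start_station_name": [
        "Curtis Hall at South Street",
        "Curtis Hall - South St at Centre St",
        "Curtis Hall - South St at Centre St",
        "Boston Public Library - 700 Boylston St.",
    ],
})

# Number of distinct spellings per station id (values > 1 signal a problem).
spellings_per_id = trips.groupby("start_station_id")["start_station_name"].nunique()

# Assumed rule: take the most frequent spelling as the canonical name.
canonical = (
    trips.groupby("start_station_id")["start_station_name"]
    .agg(lambda names: names.value_counts().idxmax())
    .rename("canonical_name")
)

# Attach the canonical name to every trip via the station id.
trips = trips.join(canonical, on="start_station_id")
print(trips[["start_station_id", "canonical_name"]].drop_duplicates())
```

Because the ID is stable across spellings, joining on the ID rather than on the name avoids losing trips to spelling differences.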

In [17]:
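mergeData = df_Trips[['start_station_id', 'start_station_name']].drop_duplicates()
# attach the trip-data station ID to each station by matching on the station name;
# stations whose name never appears in the trip data are dropped
StationDataMod = df_stations_2017.merge(mergeData,
                                        left_on='Station',
                                        right_on='start_station_name',
                                        how='left'
                                       )[['start_station_id', 'Latitude', 'Longitude']].dropna()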

start_station_location = StationDataMod.rename(columns={"Latitude": "start_latitude",  "Longitude": "start_longitude"})
end_station_location = StationDataMod.rename(columns={"start_station_id" : "end_station_id","Latitude": "end_latitude",  "Longitude": "end_longitude"})

end_station_location

Unnamed: 0,end_station_id,end_latitude,end_longitude
0,149.0,42.363796,-71.129164
3,116.0,42.370803,-71.104412
7,214.0,42.374113,-71.032775
8,183.0,42.395588,-71.142606
9,142.0,42.396105,-71.139459
...,...,...,...
274,39.0,42.338515,-71.074041
276,26.0,42.341522,-71.068922
277,218.0,42.351586,-71.045693
278,160.0,42.337586,-71.096271


In [18]:
StationDataMod

Unnamed: 0,start_station_id,Latitude,Longitude
0,149.0,42.363796,-71.129164
3,116.0,42.370803,-71.104412
7,214.0,42.374113,-71.032775
8,183.0,42.395588,-71.142606
9,142.0,42.396105,-71.139459
...,...,...,...
274,39.0,42.338515,-71.074041
276,26.0,42.341522,-71.068922
277,218.0,42.351586,-71.045693
278,160.0,42.337586,-71.096271


In [19]:
df_Trips_Start_Coord = df_Trips.merge(start_station_location, left_on ='start_station_id', right_on='start_station_id', how ='left')
df_Trips_Coord = df_Trips_Start_Coord.merge(end_station_location, left_on ='end_station_id', right_on='end_station_id', how = 'left')
df_Trips_Coord

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type,start_latitude,start_longitude,end_latitude,end_longitude
0,2017-01-01 00:06:58,2017-01-01 00:12:49,67,139,MIT at Mass Ave / Amherst St,Dana Park,644,Subscriber,42.358100,-71.093198,42.361780,-71.108100
1,2017-01-01 00:13:16,2017-01-01 00:28:07,36,10,Boston Public Library - 700 Boylston St.,B.U. Central - 725 Comm. Ave.,230,Subscriber,42.349935,-71.077386,42.350406,-71.108279
2,2017-01-01 00:16:17,2017-01-01 00:44:10,36,9,Boston Public Library - 700 Boylston St.,Agganis Arena - 925 Comm Ave.,980,Customer,42.349935,-71.077386,,
3,2017-01-01 00:21:22,2017-01-01 00:33:50,46,19,Christian Science Plaza,Buswell St. at Park Dr.,1834,Subscriber,42.343666,-71.085824,42.347241,-71.105301
4,2017-01-01 00:30:06,2017-01-01 00:40:28,10,8,B.U. Central - 725 Comm. Ave.,Union Square - Brighton Ave. at Cambridge St.,230,Subscriber,42.350406,-71.108279,42.353334,-71.137313
...,...,...,...,...,...,...,...,...,...,...,...,...
1313769,2017-12-31 23:46:18,2017-12-31 23:50:27,117,141,Binney St / Sixth St,Kendall Street,1846,Subscriber,42.366162,-71.086883,42.363560,-71.082168
1313770,2017-12-29 16:11:56,2017-12-29 16:16:18,54,42,Tremont St at West St,Boylston St at Arlington St TEMPORARY WINTER L...,2,Subscriber,42.354979,-71.063348,42.352567,-71.067705
1313771,2017-12-30 08:09:44,2017-12-30 08:26:08,54,58,Tremont St at West St,Beacon St at Arlington St,1534,Subscriber,42.354979,-71.063348,,
1313772,2017-12-30 12:20:01,2017-12-30 12:49:12,54,46,Tremont St at West St,Christian Science Plaza - Massachusetts Ave at...,1978,Subscriber,42.354979,-71.063348,42.343666,-71.085824


In [20]:
# count trips without coordinates directly from the latitude columns
MissingStart = df_Trips_Coord['start_latitude'].isnull().sum()
MissingEnd = df_Trips_Coord['end_latitude'].isnull().sum()

print("For %d trips the Start Location could not be matched" % MissingStart)
print("For %d trips the End Location could not be matched" % MissingEnd)

For 73293 trips the Start Location could not be matched
For 72267 trips the End Location could not be matched
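
An alternative way to count unmatched trips is to let pandas flag them during the merge. The sketch below uses toy data (station ID 999 is hypothetical and deliberately has no match) with the `indicator` and `validate` options of `DataFrame.merge`. It is an illustration under these assumptions, not the computation used above.

```python
import pandas as pd

# Toy trips; id 999 is a hypothetical station without coordinates.
trips = pd.DataFrame({
    "start_station_id": [67, 36, 999],
    "bike_id": [644, 230, 1],
})
stations = pd.DataFrame({
    "start_station_id": [67, 36],
    "start_latitude": [42.358100, 42.349935],
    "start_longitude": [-71.093198, -71.077386],
})

merged = trips.merge(
    stations,
    on="start_station_id",
    how="left",
    validate="many_to_one",  # raises if a station id occurs twice in `stations`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows marked 'left_only' found no matching station.
missing_start = int((merged["_merge"] == "left_only").sum())
print("For %d trips the Start Location could not be matched" % missing_start)
```

The `validate` check also guards against duplicated station IDs silently multiplying trip rows in the merge.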


In [21]:
df_Trips_Coord.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313774 entries, 0 to 1313773
Data columns (total 12 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   start_time          1313774 non-null  object 
 1   end_time            1313774 non-null  object 
 2   start_station_id    1313774 non-null  int64  
 3   end_station_id      1313774 non-null  int64  
 4   start_station_name  1313774 non-null  object 
 5   end_station_name    1313774 non-null  object 
 6   bike_id             1313774 non-null  int64  
 7   user_type           1313774 non-null  object 
 8   start_latitude      1240481 non-null  float64
 9   start_longitude     1240481 non-null  float64
 10  end_latitude        1241507 non-null  float64
 11  end_longitude       1241507 non-null  float64
dtypes: float64(4), int64(3), object(5)
memory usage: 130.3+ MB


## Advanced Heatmap

Now that the trips have coordinates for their start locations, their distribution across the city of Boston can be plotted.

In [22]:
# https://alysivji.github.io/getting-started-with-folium.html
def map_points(df, lat_col='latitude', lon_col='longitude', zoom_start=12, \
                plot_points=False, pt_radius=15, \
                draw_heatmap=False, heat_map_weights_col=None, \
                heat_map_weights_normalize=True, heat_map_radius=15):
    """Creates a map given a dataframe of points. Can also produce a heatmap overlay.

    Args:
        df: dataframe containing points to map
        lat_col: Column containing latitude (string)
        lon_col: Column containing longitude (string)
        zoom_start: Integer representing the initial zoom of the map
        plot_points: Add points to map (boolean)
        pt_radius: Size of each point
        draw_heatmap: Add heatmap to map (boolean)
        heat_map_weights_col: Column containing heatmap weights
        heat_map_weights_normalize: Normalize heatmap weights (boolean)
        heat_map_radius: Size of heatmap point

    Returns:
        folium map object
    """

    # center the map on the median of the points
    middle_lat = df[lat_col].median()
    middle_lon = df[lon_col].median()

    curr_map = fo.Map(location=[middle_lat, middle_lon],
                          zoom_start=zoom_start)

    # add points to map
    if plot_points:
        for _, row in df.iterrows():
            fo.CircleMarker([row[lat_col], row[lon_col]],
                                radius=pt_radius,
                               # popup=row['name'],
                                fill_color="#3db7e4", # divvy color
                               ).add_to(curr_map)

    # add heatmap
    if draw_heatmap:
        # convert to (n, 2) or (n, 3) matrix format
        if heat_map_weights_col is None:
            cols_to_pull = [lat_col, lon_col]
        else:
            # if we have to normalize
            if heat_map_weights_normalize:
                df[heat_map_weights_col] = \
                    df[heat_map_weights_col] / df[heat_map_weights_col].sum()

            cols_to_pull = [lat_col, lon_col, heat_map_weights_col]

        stations = df[cols_to_pull]
        curr_map.add_child(fo.plugins.HeatMap(stations, radius=heat_map_radius))


    # return the map even when no heatmap is drawn
    return curr_map

In [23]:
StationCount = df_Trips_Coord.groupby(['start_latitude', 'start_longitude']).size().reset_index(name='counts')
StationCount.head()

Unnamed: 0,start_latitude,start_longitude,counts
0,42.303469,-71.085347,381
1,42.304128,-71.079295,4
2,42.307852,-71.065122,481
3,42.30791,-71.080952,287
4,42.309054,-71.11543,2273


In [24]:
heatmap_trips = map_points(StationCount, \
                   lat_col='start_latitude', \
                   lon_col='start_longitude', \
                   plot_points=True, \
                   draw_heatmap=True, \
                   heat_map_weights_normalize=True,\
                   pt_radius = 1, \
                   heat_map_radius = 20,\
                   heat_map_weights_col='counts') 

heatmap_trips
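
Note that `map_points` normalises the weight column by overwriting it in the dataframe it receives, so `StationCount['counts']` no longer holds raw counts after the call. A minimal sketch of a non-mutating normalisation, using the first three rows shown above as toy data (the names `station_count`, `heat_data` and `heat_rows` are illustrative):

```python
import pandas as pd

# Toy data: first rows of the per-station trip counts shown above.
station_count = pd.DataFrame({
    "start_latitude": [42.303469, 42.304128, 42.307852],
    "start_longitude": [-71.085347, -71.079295, -71.065122],
    "counts": [381, 4, 481],
})

# Compute the normalised weights as a separate Series (no write-back).
weights = station_count["counts"] / station_count["counts"].sum()

# Build an (n, 3) list of [lat, lon, weight] rows for a heatmap layer.
heat_data = station_count[["start_latitude", "start_longitude"]].assign(weight=weights)
heat_rows = heat_data.values.tolist()

# The original counts are untouched.
print(station_count["counts"].tolist())
```

The resulting `heat_rows` could then be passed to a heatmap layer in place of the dataframe slice.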

In [25]:
# Function that maps a number to a color
def colorCall(number):
  colorCount = int(number)
  return colors[colorCount]

# Random number standing in for the cluster number
df_stations_2017['Cluster'] = np.random.randint(0,4, size=len(df_stations_2017))

colors = [
    'red',
    'blue',
    'gray',
    'orange',
    'beige',
    'green',
    'purple',
    'cadetblue',
    'black',
    'pink'
]
 
# Create the map
stationClusterMap = Map(location=[42.353089, -71.066170], zoom_start=11.5, tiles ='Stamen Terrain')

# Mark the stations, colored by 'Cluster'
for index, row in df_stations_2017.iterrows():
    fo.CircleMarker([row['Latitude'], row['Longitude']],
                        radius=2,
                        fill_color = colorCall(row['Cluster']), color = False, fill_opacity=1
                       ).add_to(stationClusterMap)
    
stationClusterMap