# Goal: For every trip in 2022, identify the community area where the trip started and where it ended, and add this information to the trips data.

How?
 1. For every trip, take "start_lat" and "start_lng" and create a point geometry.
 2. Do the same with "end_lat" and "end_lng".
 3. Use the community district polygon data (see notebook 01_get_geodata_districts_chicago.ipynb) and perform a spatial join to match the points (start/end) with the polygons (districts).
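Conceptually, step 3 boils down to a point-in-polygon test for every start and end point. The sketch below illustrates that idea in plain Python with a ray casting test. It is only an illustration: geopandas delegates the geometric check to shapely/GEOS, and the square "districts" and the helper names `point_in_polygon` and `find_area` are made up for this example.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count crossings of a horizontal ray starting at (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# hypothetical toy districts as lists of (lng, lat) vertices
districts = {
    1: [(0, 0), (1, 0), (1, 1), (0, 1)],
    2: [(1, 0), (2, 0), (2, 1), (1, 1)],
}


def find_area(lng, lat, districts):
    """Return the number of the first district containing the point, else None."""
    for area_number, polygon in districts.items():
        if point_in_polygon(lng, lat, polygon):
            return area_number
    return None


start_area = find_area(0.5, 0.5, districts)
end_area = find_area(1.5, 0.5, districts)
outside = find_area(5, 5, districts)
```

A point that lies in no district yields `None` here; in the spatial join used in this notebook, the same situation shows up as NaN in the joined columns.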

## Importing Packages

In [1]:
# pandas
import pandas as pd

# additional import of the geopandas package
import geopandas as gpd

# numpy
import numpy as np

# matplotlib for plotting
import matplotlib.pyplot as plt

# shapely: wkt converts WKT strings to geometries; shapely.geometry is useful for checking whether a point is inside a polygon
from shapely import wkt
from shapely.geometry import Polygon, LineString, Point, MultiLineString

# importing self-made functions from the sql_functions script
import sql_functions as sf

## Loading Trips 2022 (see notebook 01_get_trip_data.ipynb) from SQL Database:

In [2]:
# constants:
path = "data/"
schema = "capstone_divvy_bikeshare"
engine = sf.get_engine()

In [3]:
# Loading from the SQL database. Let's just load a random sample of 500,000 trips from the trips_2022_v2 table
df_22 = sf.get_dataframe(f"SELECT * FROM {schema}.trips_2022_v2 ORDER BY RANDOM() LIMIT 500000")

In [4]:
df_22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   ride_id                  500000 non-null  object        
 1   rideable_type            500000 non-null  object        
 2   starttime                500000 non-null  datetime64[ns]
 3   stoptime                 500000 non-null  datetime64[ns]
 4   from_station_id          500000 non-null  object        
 5   from_station_name        427663 non-null  object        
 6   to_station_id            500000 non-null  object        
 7   to_station_name          423118 non-null  object        
 8   start_lat                500000 non-null  float64       
 9   start_lng                500000 non-null  float64       
 10  end_lat                  499488 non-null  float64       
 11  end_lng                  499488 non-null  float64       
 12  member_casual   

In [5]:
df_22.head()

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,time_difference_seconds,tripduration_in_min,trip_value
0,6E70D9613415920B,classic_bike,2022-07-02 23:12:16,2022-07-02 23:34:49,13430,LaSalle St & Illinois St,TA1307000038,Sedgwick St & North Ave,41.890762,-87.631697,41.911386,-87.638677,casual,1353,22.55,4.8335
1,BCCC831F5A02BCF0,electric_bike,2022-10-23 06:39:05,2022-10-23 06:44:30,0,,18003,Fairbanks St & Superior St,41.89,-87.63,41.895748,-87.620104,member,325,5.416667,0.920833
2,29A0386C1A150B87,electric_bike,2022-04-08 11:22:48,2022-04-08 11:40:03,13084,California Ave & Milwaukee Ave,0,,41.922637,-87.697089,41.92,-87.76,casual,1035,17.25,8.245
3,C88D44706B5D3AD0,classic_bike,2022-04-15 07:15:35,2022-04-15 07:31:09,TA1306000006,Orleans St & Elm St,TA1305000006,Dearborn St & Monroe St,41.902924,-87.637715,41.88132,-87.629521,member,934,15.566667,0.0
4,B39728DFDD3698C6,classic_bike,2022-02-13 12:59:18,2022-02-13 13:16:28,SL-005,Indiana Ave & Roosevelt Rd,15539,Desplaines St & Jackson Blvd,41.867888,-87.623041,41.878119,-87.643948,casual,1030,17.166667,3.918333


## 1./2. For every trip, take the start and end coordinates ("start_lat"/"start_lng", "end_lat"/"end_lng") and create point geometries

#### Creating a GeoDataFrame:
    * use latitude and longitude to create a POINT geometry
    * for this, use the gpd.points_from_xy() function
    help: https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html
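One detail that is easy to get wrong with `gpd.points_from_xy()` is the argument order: x is the longitude and y is the latitude. A minimal sketch with a single made-up trip row (the DataFrame `trips` is illustrative only, and the CRS is written as "EPSG:4326", the EPSG code for WGS 84):

```python
import geopandas as gpd
import pandas as pd

# one illustrative trip; column names follow the trips table
trips = pd.DataFrame({"start_lat": [41.890762], "start_lng": [-87.631697]})

# x = longitude, y = latitude
points = gpd.points_from_xy(trips["start_lng"], trips["start_lat"], crs="EPSG:4326")
gdf = gpd.GeoDataFrame(trips, geometry=points)

first_point = gdf.geometry.iloc[0]
```

Swapping the two arguments would still produce valid points, just far away from Chicago, so a quick look at `first_point.x` and `first_point.y` is a cheap sanity check.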

### Dealing with NaN values in the start_lat, start_lng, end_lat and end_lng columns:

In [6]:
# NaN values in end_lat:
df_22["end_lat"].isna().value_counts()

False    499488
True        512
Name: end_lat, dtype: int64

In [7]:
# NaN values in end_lng:
df_22["end_lng"].isna().value_counts()

False    499488
True        512
Name: end_lng, dtype: int64

In [8]:
# NaN values in start_lat:
df_22["start_lat"].isna().value_counts()

False    500000
Name: start_lat, dtype: int64

In [9]:
# NaN values in start_lng:
df_22["start_lng"].isna().value_counts()

False    500000
Name: start_lng, dtype: int64

Let's drop all rows with NaN values in the end_lat and end_lng columns:

In [10]:
# Dropping all rows where "end_lat" or "end_lng" is NaN (subset passed as a list, which works across pandas versions):
df_22.dropna(axis=0, subset=["end_lat", "end_lng"], inplace=True)

As we can see, there are no NaN values left in the end_lat column; the 512 affected rows have been deleted:

In [11]:
df_22["end_lat"].isna().value_counts()

False    499488
Name: end_lat, dtype: int64
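The four separate NaN checks and the drop can also be condensed. The following is a small sketch on made-up data (the toy DataFrame `trips` is not part of the project data) that counts missing values for all coordinate columns at once and then drops incomplete rows:

```python
import numpy as np
import pandas as pd

# toy trips; ride "b" has no end coordinates
trips = pd.DataFrame({
    "ride_id": ["a", "b", "c"],
    "start_lat": [41.89, 41.92, 41.87],
    "start_lng": [-87.63, -87.70, -87.62],
    "end_lat": [41.91, np.nan, 41.88],
    "end_lng": [-87.64, np.nan, -87.64],
})

coord_cols = ["start_lat", "start_lng", "end_lat", "end_lng"]

# number of NaN values per coordinate column, in one call
nan_counts = trips[coord_cols].isna().sum()

# drop every row that misses an end coordinate
n_before = len(trips)
trips = trips.dropna(subset=["end_lat", "end_lng"])
n_dropped = n_before - len(trips)

remaining_nans = int(trips[coord_cols].isna().sum().sum())
```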

Now that all trips have start lat/lng and end lat/lng, let's create the point geometries:

In [12]:
# creating a GeoDataFrame by combining start_lng and start_lat into a "start_point" geometry column, crs = WGS 84
gdf_22_start = gpd.GeoDataFrame(df_22, crs="WGS 84", geometry=gpd.points_from_xy(df_22["start_lng"], df_22["start_lat"])).rename(columns={"geometry": "start_point"})
# now use the gdf_22_start GeoDataFrame and add the end points as an "end_point" geometry column
gdf_22 = gpd.GeoDataFrame(gdf_22_start, crs="WGS 84", geometry=gpd.points_from_xy(gdf_22_start["end_lng"], gdf_22_start["end_lat"])).rename(columns={"geometry": "end_point"})

In [13]:
gdf_22["start_point"].sample(10000, random_state=42).explore()

As we can see, we added two new geometry columns, "start_point" and "end_point", to the gdf_22 GeoDataFrame:

In [14]:
gdf_22.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 499488 entries, 0 to 499999
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   ride_id                  499488 non-null  object        
 1   rideable_type            499488 non-null  object        
 2   starttime                499488 non-null  datetime64[ns]
 3   stoptime                 499488 non-null  datetime64[ns]
 4   from_station_id          499488 non-null  object        
 5   from_station_name        427151 non-null  object        
 6   to_station_id            499488 non-null  object        
 7   to_station_name          423118 non-null  object        
 8   start_lat                499488 non-null  float64       
 9   start_lng                499488 non-null  float64       
 10  end_lat                  499488 non-null  float64       
 11  end_lng                  499488 non-null  float64       
 12  member_c

## 3. Let's use the table we created in notebook 01_get_geodata_districts_chicago.ipynb

In [15]:
# loading table with the area geometries created in Notebook 01_get_geodata_districts_chicago.ipynb
df_areas = pd.read_csv("data/clean_areas.csv")
df_areas.head(2)

Unnamed: 0.1,Unnamed: 0,area_number,community_name,shape_area,shape_len,new_geometry
0,9,1,Rogers Park,51259900.0,34052.397576,MULTIPOLYGON (((-87.65455590025104 41.99816614...
1,19,2,West Ridge,98429090.0,43020.689458,MULTIPOLYGON (((-87.6846530946559 42.019484772...


In [16]:
# converting to a GeoDataFrame by using a self-made function
gdf_areas = sf.to_gdf(df_areas,geometry_column="new_geometry")

In [17]:
gdf_areas.head()

Unnamed: 0.1,Unnamed: 0,area_number,community_name,shape_area,shape_len,new_geometry
0,9,1,Rogers Park,51259900.0,34052.397576,"MULTIPOLYGON (((-87.65456 41.99817, -87.65574 ..."
1,19,2,West Ridge,98429090.0,43020.689458,"MULTIPOLYGON (((-87.68465 42.01948, -87.68464 ..."
2,30,3,Uptown,65095640.0,46972.794555,"MULTIPOLYGON (((-87.64102 41.95480, -87.64400 ..."
3,5,4,Lincoln Square,71352330.0,36624.603085,"MULTIPOLYGON (((-87.67441 41.97610, -87.67440 ..."
4,47,5,North Center,57054170.0,31391.669754,"MULTIPOLYGON (((-87.67336 41.93234, -87.67342 ..."


#### Now let's add the columns start_area_number and end_area_number for every trip by performing a spatial join:

#### Spatial Join on Start Points

In [18]:
# setting start_point as the active geometry, so that the start points are checked (whether they are inside of a district)
gdf_22.set_geometry("start_point", inplace=True)

In [19]:
# Perform spatial join to match start_points and polygons
gdf_22_merge1 = gpd.tools.sjoin(gdf_22, gdf_areas[["area_number","community_name","new_geometry"]], predicate="within", how='left')
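To see what this left join with `predicate="within"` returns, here is a minimal sketch on toy data. The two square "areas" and the three points are made up; only the join call mirrors the cell above (written as `gpd.sjoin`, the top-level name of the same function).

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# two hypothetical square areas next to each other
areas = gpd.GeoDataFrame(
    {"area_number": [1, 2], "community_name": ["Area A", "Area B"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

# three hypothetical trip start points; the last one lies outside both areas
points = gpd.GeoDataFrame(
    {"ride_id": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# how="left" keeps every point; points without a matching polygon get NaN
joined = gpd.sjoin(points, areas, how="left", predicate="within")
by_ride = joined.set_index("ride_id")["area_number"]
```

Because of the NaN for unmatched points, `area_number` comes back as a float column, which is why the area numbers appear as 8.0, 22.0 and so on in the joined trips table.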

Renaming the columns "area_number" and "community_name" to "start_area_number" and "start_community_name":

In [20]:
gdf_22_merge1.rename(columns={"area_number":"start_area_number", "community_name":"start_community_name"},inplace=True)

In [21]:
gdf_22_merge1.head()

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,...,end_lng,member_casual,time_difference_seconds,tripduration_in_min,trip_value,start_point,end_point,index_right,start_area_number,start_community_name
0,6E70D9613415920B,classic_bike,2022-07-02 23:12:16,2022-07-02 23:34:49,13430,LaSalle St & Illinois St,TA1307000038,Sedgwick St & North Ave,41.890762,-87.631697,...,-87.638677,casual,1353,22.55,4.8335,POINT (-87.63170 41.89076),POINT (-87.63868 41.91139),7.0,8.0,Near North Side
1,BCCC831F5A02BCF0,electric_bike,2022-10-23 06:39:05,2022-10-23 06:44:30,0,,18003,Fairbanks St & Superior St,41.89,-87.63,...,-87.620104,member,325,5.416667,0.920833,POINT (-87.63000 41.89000),POINT (-87.62010 41.89575),7.0,8.0,Near North Side
2,29A0386C1A150B87,electric_bike,2022-04-08 11:22:48,2022-04-08 11:40:03,13084,California Ave & Milwaukee Ave,0,,41.922637,-87.697089,...,-87.76,casual,1035,17.25,8.245,POINT (-87.69709 41.92264),POINT (-87.76000 41.92000),21.0,22.0,Logan Square
3,C88D44706B5D3AD0,classic_bike,2022-04-15 07:15:35,2022-04-15 07:31:09,TA1306000006,Orleans St & Elm St,TA1305000006,Dearborn St & Monroe St,41.902924,-87.637715,...,-87.629521,member,934,15.566667,0.0,POINT (-87.63772 41.90292),POINT (-87.62952 41.88132),7.0,8.0,Near North Side
4,B39728DFDD3698C6,classic_bike,2022-02-13 12:59:18,2022-02-13 13:16:28,SL-005,Indiana Ave & Roosevelt Rd,15539,Desplaines St & Jackson Blvd,41.867888,-87.623041,...,-87.643948,casual,1030,17.166667,3.918333,POINT (-87.62304 41.86789),POINT (-87.64395 41.87812),31.0,32.0,Loop


In [22]:
# dropping the "index_right" column
gdf_22_merge1.drop(columns="index_right",inplace=True)

In [23]:
# setting end_point as the active geometry, so that the end points are checked (whether they are inside of a district polygon)
gdf_22_merge1.set_geometry("end_point", inplace=True)

In [24]:
# Perform spatial join to match points (this time end points)  and polygons (districts)
gdf_22_merge2 = gpd.tools.sjoin(gdf_22_merge1, gdf_areas[["area_number","community_name","new_geometry"]], predicate="within", how='left')

In [25]:
gdf_22_merge2.head()

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,...,time_difference_seconds,tripduration_in_min,trip_value,start_point,end_point,start_area_number,start_community_name,index_right,area_number,community_name
0,6E70D9613415920B,classic_bike,2022-07-02 23:12:16,2022-07-02 23:34:49,13430,LaSalle St & Illinois St,TA1307000038,Sedgwick St & North Ave,41.890762,-87.631697,...,1353,22.55,4.8335,POINT (-87.63170 41.89076),POINT (-87.63868 41.91139),8.0,Near North Side,6.0,7.0,Lincoln Park
1,BCCC831F5A02BCF0,electric_bike,2022-10-23 06:39:05,2022-10-23 06:44:30,0,,18003,Fairbanks St & Superior St,41.89,-87.63,...,325,5.416667,0.920833,POINT (-87.63000 41.89000),POINT (-87.62010 41.89575),8.0,Near North Side,7.0,8.0,Near North Side
2,29A0386C1A150B87,electric_bike,2022-04-08 11:22:48,2022-04-08 11:40:03,13084,California Ave & Milwaukee Ave,0,,41.922637,-87.697089,...,1035,17.25,8.245,POINT (-87.69709 41.92264),POINT (-87.76000 41.92000),22.0,Logan Square,18.0,19.0,Belmont Cragin
3,C88D44706B5D3AD0,classic_bike,2022-04-15 07:15:35,2022-04-15 07:31:09,TA1306000006,Orleans St & Elm St,TA1305000006,Dearborn St & Monroe St,41.902924,-87.637715,...,934,15.566667,0.0,POINT (-87.63772 41.90292),POINT (-87.62952 41.88132),8.0,Near North Side,31.0,32.0,Loop
4,B39728DFDD3698C6,classic_bike,2022-02-13 12:59:18,2022-02-13 13:16:28,SL-005,Indiana Ave & Roosevelt Rd,15539,Desplaines St & Jackson Blvd,41.867888,-87.623041,...,1030,17.166667,3.918333,POINT (-87.62304 41.86789),POINT (-87.64395 41.87812),32.0,Loop,27.0,28.0,Near West Side


Renaming the columns "area_number" and "community_name" to "end_area_number" and "end_community_name":

In [26]:
# renaming columns:
gdf_22_merge2.rename(columns={"area_number":"end_area_number", "community_name":"end_community_name"},inplace=True)
gdf_22_merge2.head(4)

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,...,time_difference_seconds,tripduration_in_min,trip_value,start_point,end_point,start_area_number,start_community_name,index_right,end_area_number,end_community_name
0,6E70D9613415920B,classic_bike,2022-07-02 23:12:16,2022-07-02 23:34:49,13430,LaSalle St & Illinois St,TA1307000038,Sedgwick St & North Ave,41.890762,-87.631697,...,1353,22.55,4.8335,POINT (-87.63170 41.89076),POINT (-87.63868 41.91139),8.0,Near North Side,6.0,7.0,Lincoln Park
1,BCCC831F5A02BCF0,electric_bike,2022-10-23 06:39:05,2022-10-23 06:44:30,0,,18003,Fairbanks St & Superior St,41.89,-87.63,...,325,5.416667,0.920833,POINT (-87.63000 41.89000),POINT (-87.62010 41.89575),8.0,Near North Side,7.0,8.0,Near North Side
2,29A0386C1A150B87,electric_bike,2022-04-08 11:22:48,2022-04-08 11:40:03,13084,California Ave & Milwaukee Ave,0,,41.922637,-87.697089,...,1035,17.25,8.245,POINT (-87.69709 41.92264),POINT (-87.76000 41.92000),22.0,Logan Square,18.0,19.0,Belmont Cragin
3,C88D44706B5D3AD0,classic_bike,2022-04-15 07:15:35,2022-04-15 07:31:09,TA1306000006,Orleans St & Elm St,TA1305000006,Dearborn St & Monroe St,41.902924,-87.637715,...,934,15.566667,0.0,POINT (-87.63772 41.90292),POINT (-87.62952 41.88132),8.0,Near North Side,31.0,32.0,Loop


In [27]:
# dropping the "index_right" column
gdf_22_merge2.drop(columns="index_right",inplace=True)

#### Dealing with NaN values:
    - Problem: for some trips, the start or end community area number and name are NaN. Let's look at them:

In [28]:
# NaN Values in start_area_number:
gdf_22_merge2["start_area_number"].isna().value_counts()

False    496835
True       2653
Name: start_area_number, dtype: int64

In [29]:
gdf_22_merge2["end_area_number"].isna().value_counts()

False    496489
True       2999
Name: end_area_number, dtype: int64

#### Since we made sure that all trips have a start and an end point, NaN values in start_area_number/end_area_number mean that these trips started/ended outside of the community area polygons. Let's deal with the NaN values in the following way:
    - start_area_number = NaN       ---> 999
    - start_community_name = NaN    ---> "not in districts"
    - end_area_number = NaN         ---> 999
    - end_community_name = NaN      ---> "not in districts"

In [30]:
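# replacing NaN values (assigning the result back, since in-place fillna on a selected column is unreliable in newer pandas versions):
gdf_22_merge2["start_area_number"] = gdf_22_merge2["start_area_number"].fillna(999)
gdf_22_merge2["end_area_number"] = gdf_22_merge2["end_area_number"].fillna(999)
gdf_22_merge2["start_community_name"] = gdf_22_merge2["start_community_name"].fillna("not in districts")
gdf_22_merge2["end_community_name"] = gdf_22_merge2["end_community_name"].fillna("not in districts")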

As we can see, there are no more NaN values in those columns:

In [31]:
gdf_22_merge2["end_area_number"].isna().value_counts()

False    499488
Name: end_area_number, dtype: int64

In [34]:
# converting start_area_number and end_area_number from float to integer:
gdf_22_merge2["start_area_number"] = gdf_22_merge2["start_area_number"].astype(int)
gdf_22_merge2["end_area_number"] = gdf_22_merge2["end_area_number"].astype(int)
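The area numbers were floats because a NumPy integer column cannot hold NaN, so pandas stores such a column as float. The sketch below, on a made-up three-row frame, shows the sentinel approach used in this notebook next to a possible alternative: pandas' nullable "Int64" dtype, which keeps missing values as `<NA>` without a sentinel. The alternative is only a suggestion and is not used in the rest of the notebook.

```python
import numpy as np
import pandas as pd

# toy result of a left spatial join; the second trip ended outside all areas
joined = pd.DataFrame({
    "end_area_number": [7.0, np.nan, 32.0],
    "end_community_name": ["Lincoln Park", np.nan, "Loop"],
})

# approach used here: sentinel values, then a plain integer column
sentinel = joined.copy()
sentinel["end_area_number"] = sentinel["end_area_number"].fillna(999).astype(int)
sentinel["end_community_name"] = sentinel["end_community_name"].fillna("not in districts")

# alternative: nullable integer dtype, missing values stay missing
nullable = joined["end_area_number"].astype("Int64")
```

The sentinel keeps downstream filtering simple (`== 999`), at the cost of 999 looking like a real area number in aggregations; the nullable dtype avoids that but needs NA-aware handling later.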

In [35]:
gdf_22_merge2.head(1)

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,...,member_casual,time_difference_seconds,tripduration_in_min,trip_value,start_point,end_point,start_area_number,start_community_name,end_area_number,end_community_name
0,6E70D9613415920B,classic_bike,2022-07-02 23:12:16,2022-07-02 23:34:49,13430,LaSalle St & Illinois St,TA1307000038,Sedgwick St & North Ave,41.890762,-87.631697,...,casual,1353,22.55,4.8335,POINT (-87.63170 41.89076),POINT (-87.63868 41.91139),8,Near North Side,7,Lincoln Park


#### All end locations of trips that ended outside of the Chicago districts:


In [37]:
gdf_22_merge2[gdf_22_merge2["end_area_number"]==999]["end_point"].explore()