# Goal: For every trip in 2022, I want to identify the community area in which the trip started and the one in which it ended. Then I want to add this information to the trips table.

How?
 1. For every trip, take "start_lat" and "start_lng" and create a geometric point.
 2. Do the same with "end_lat" and "end_lng".
 3. Use the community district polygon data (see notebook 01_get_geodata_districts_chicago.ipynb).
 4. Perform a spatial join to match the points (start/end) with the polygons (districts).
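
To make step 4 more concrete, here is a minimal pure-Python sketch of the point-in-polygon test that a "within" spatial join answers for every point/polygon pair. It uses ray casting, which is an illustration only and not necessarily how geopandas implements the join. The function name `point_in_polygon` and the rectangular `toy_area` are made up for this example.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon given by `ring`?

    `ring` is a list of (lon, lat) vertices; the last vertex is implicitly
    connected back to the first.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "district" (not a real community area boundary)
toy_area = [(-87.66, 41.91), (-87.62, 41.91), (-87.62, 41.94), (-87.66, 41.94)]

start_inside = point_in_polygon(-87.639104, 41.925656, toy_area)
start_outside = point_in_polygon(-87.75, 41.94, toy_area)
print(start_inside, start_outside)
```

A point that falls inside no polygon at all is what later shows up as a NaN area number after the join.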

## Importing Packages

In [1]:
# pandas
import pandas as pd

# additional import of the geopandas package
import geopandas as gpd

# numpy
import numpy as np

# matplotlib.pyplot for plotting
import matplotlib.pyplot as plt

# shapely: useful for checking whether a point is inside a polygon and for converting WKT strings to geometries
from shapely import wkt
from shapely.geometry import Polygon, LineString, Point, MultiLineString

# importing self made functions from sql_functions script
import sql_functions as sf

## Loading Trips 2022 (see notebook 01_get_trip_data.ipynb) from SQL Database:

In [2]:
# constants:
path = "data/"
schema = "capstone_divvy_bikeshare"
engine = sf.get_engine()

In [3]:
# Loading from the SQL database. Let's load a random sample of 500,000 trips.
df_22 = sf.get_dataframe(f"SELECT * From {schema}.trips_2022_v2 ORDER BY RANDOM() LIMIT 500000")
#df_22 = sf.get_dataframe(f"SELECT * From {schema}.trips_2022_v2")

In [4]:
df_22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   ride_id                  500000 non-null  object        
 1   rideable_type            500000 non-null  object        
 2   starttime                500000 non-null  datetime64[ns]
 3   stoptime                 500000 non-null  datetime64[ns]
 4   from_station_id          500000 non-null  object        
 5   from_station_name        428179 non-null  object        
 6   to_station_id            500000 non-null  object        
 7   to_station_name          423340 non-null  object        
 8   start_lat                500000 non-null  float64       
 9   start_lng                500000 non-null  float64       
 10  end_lat                  499470 non-null  float64       
 11  end_lng                  499470 non-null  float64       
 12  member_casual   

In [5]:
df_22.head()

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,time_difference_seconds,tripduration_in_min,trip_value
0,9B632EBA1CC3202B,electric_bike,2022-12-07 07:34:57,2022-12-07 07:39:48,TA1309000019,Lakeview Ave & Fullerton Pkwy,0,,41.925656,-87.639104,41.92,-87.64,member,291,4.85,0.8245
1,B20C4D0B45751662,electric_bike,2022-04-27 17:09:01,2022-04-27 17:20:53,316,Lamon Ave & Belmont Ave,0,,41.94,-87.75,41.93,-87.72,casual,712,11.866667,5.984
2,F90780236FB6EC9E,classic_bike,2022-06-03 15:31:03,2022-06-03 15:36:42,13001,Michigan Ave & Washington St,13008,Millennium Park,41.883984,-87.624684,41.881032,-87.624084,casual,339,5.65,1.9605
3,98B2B077FE0AFE33,electric_bike,2022-08-29 17:15:10,2022-08-29 17:36:03,13409,Sangamon St & Washington Blvd,20247,W Washington Blvd & N Peoria St,41.882971,-87.650839,41.88,-87.65,member,1253,20.883333,3.550167
4,D396D97F3616E948,electric_bike,2022-04-20 07:46:45,2022-04-20 07:50:55,SL-010,Financial Pl & Ida B Wells Dr,TA1309000007,Franklin St & Monroe St,41.875046,-87.633116,41.880317,-87.635185,casual,250,4.166667,2.75


## 1./2. For every trip, take the start and end coordinates and create geometric points

#### Creating a GeoDataFrame:
    * use latitude and longitude to create a POINT geometry
    * for this, use the gpd.points_from_xy() function
    * help: https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html

### Dealing with NaN values in the start_lat, start_lng, end_lat and end_lng columns:

In [6]:
# NaN values in end_lat:
df_22["end_lat"].isna().value_counts()

False    499470
True        530
Name: end_lat, dtype: int64

In [7]:
# NaN values in end lng:
df_22["end_lng"].isna().value_counts()

False    499470
True        530
Name: end_lng, dtype: int64

In [8]:
# NaN values in start_lat:
df_22["start_lat"].isna().value_counts()

False    500000
Name: start_lat, dtype: int64

In [9]:
# NaN values in start_lng:
df_22["start_lng"].isna().value_counts()

False    500000
Name: start_lng, dtype: int64
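
The four checks above can also be done in one call. The following is a minimal sketch on a made-up toy frame (`toy` is a stand-in for `df_22`, and its values are invented), counting missing values per coordinate column and then dropping every trip that lacks any coordinate.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_22 with one missing end coordinate (values are made up)
toy = pd.DataFrame(
    {
        "start_lat": [41.92, 41.94, 41.88],
        "start_lng": [-87.64, -87.75, -87.62],
        "end_lat": [41.92, np.nan, 41.88],
        "end_lng": [-87.64, np.nan, -87.62],
    }
)

coord_cols = ["start_lat", "start_lng", "end_lat", "end_lng"]

# Number of missing values per coordinate column, in one call
nan_counts = toy[coord_cols].isna().sum()
print(nan_counts)

# Keep only trips where every coordinate is present
toy_clean = toy.dropna(subset=coord_cols)
print(len(toy_clean))
```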

Let's drop all rows with NaN values in the end_lat and end_lng columns:

In [10]:
# Dropping all rows where "end_lat" or "end_lng" is NaN:
df_22.dropna(axis=0, subset=["end_lat", "end_lng"], inplace=True)

As we can see, the rows have been deleted and there are no NaN values left in the end_lat column:

In [11]:
df_22["end_lat"].isna().value_counts()

False    499470
Name: end_lat, dtype: int64

Now that all trips have start and end coordinates, let's create the point geometries:

In [12]:
# create a GeoDataFrame by combining start_lng and start_lat into a "start_point" geometry column (CRS: WGS 84)
gdf_22_start = gpd.GeoDataFrame(df_22, crs="WGS 84", geometry=gpd.points_from_xy(df_22["start_lng"], df_22["start_lat"])).rename(columns={"geometry": "start_point"})
# now take gdf_22_start and add the end point in the same way, as an "end_point" geometry column
gdf_22 = gpd.GeoDataFrame(gdf_22_start, crs="WGS 84", geometry=gpd.points_from_xy(gdf_22_start["end_lng"], gdf_22_start["end_lat"])).rename(columns={"geometry": "end_point"})
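
As an alternative way to read the cell above, here is a small sketch on toy data that builds the same two geometry columns. It uses `rename_geometry` and a separate `GeoSeries` instead of renaming the column twice; `toy` and `toy_gdf` are names made up for this example, and the coordinates are taken from the `df_22.head()` output. Note that `points_from_xy` expects x (longitude) first and y (latitude) second.

```python
import geopandas as gpd
import pandas as pd

# Toy trips; coordinates taken from the df_22.head() output
toy = pd.DataFrame(
    {
        "start_lat": [41.925656, 41.94],
        "start_lng": [-87.639104, -87.75],
        "end_lat": [41.92, 41.93],
        "end_lng": [-87.64, -87.72],
    }
)

# points_from_xy expects x (longitude) first, then y (latitude)
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["start_lng"], toy["start_lat"]),
    crs="EPSG:4326",  # WGS 84
).rename_geometry("start_point")

# A second geometry column can live next to the active one
toy_gdf["end_point"] = gpd.GeoSeries(
    gpd.points_from_xy(toy["end_lng"], toy["end_lat"]),
    index=toy.index,
    crs="EPSG:4326",
)

print(toy_gdf.geometry.name)  # name of the active geometry column
print(toy_gdf.dtypes)
```

Only one geometry column is active at a time, which is why the notebook later calls `set_geometry` before each spatial join.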

In [13]:
gdf_22["start_point"].sample(10000, random_state=42).explore()

As we can see, we added two new geometry columns, "start_point" and "end_point", to the gdf_22 GeoDataFrame:

In [14]:
gdf_22.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 499470 entries, 0 to 499999
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   ride_id                  499470 non-null  object        
 1   rideable_type            499470 non-null  object        
 2   starttime                499470 non-null  datetime64[ns]
 3   stoptime                 499470 non-null  datetime64[ns]
 4   from_station_id          499470 non-null  object        
 5   from_station_name        427649 non-null  object        
 6   to_station_id            499470 non-null  object        
 7   to_station_name          423340 non-null  object        
 8   start_lat                499470 non-null  float64       
 9   start_lng                499470 non-null  float64       
 10  end_lat                  499470 non-null  float64       
 11  end_lng                  499470 non-null  float64       
 12  member_c

## 3. Let's use the table we created in notebook 01_get_geodata_districts_chicago.ipynb

In [15]:
# loading table with the area geometries created in Notebook 01_get_geodata_districts_chicago.ipynb
df_areas = pd.read_csv("data/clean_areas.csv")
df_areas.head(2)

Unnamed: 0.1,Unnamed: 0,area_number,community_name,shape_area,shape_len,new_geometry
0,9,1,Rogers Park,51259900.0,34052.397576,MULTIPOLYGON (((-87.65455590025104 41.99816614...
1,19,2,West Ridge,98429090.0,43020.689458,MULTIPOLYGON (((-87.6846530946559 42.019484772...


In [16]:
# converting to GeoDataFrame by using self made function
gdf_areas = sf.to_gdf(df_areas,geometry_column="new_geometry")
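
The body of `sf.to_gdf` is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical stand-in (`wkt_frame_to_gdf` is a made-up name) that parses a WKT text column with `shapely.wkt.loads` and wraps the result in a GeoDataFrame. The CRS default of EPSG:4326 is an assumption.

```python
import geopandas as gpd
import pandas as pd
from shapely import wkt


def wkt_frame_to_gdf(df, geometry_column, crs="EPSG:4326"):
    """Hypothetical stand-in for sf.to_gdf: parse a WKT text column into geometries."""
    out = df.copy()
    out[geometry_column] = out[geometry_column].apply(wkt.loads)
    return gpd.GeoDataFrame(out, geometry=geometry_column, crs=crs)


# Toy area table with a made-up square instead of a real boundary
toy_areas = pd.DataFrame(
    {
        "area_number": [1],
        "community_name": ["Toy Area"],
        "new_geometry": ["POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))"],
    }
)

toy_gdf_areas = wkt_frame_to_gdf(toy_areas, "new_geometry")
print(toy_gdf_areas.geom_type)
```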

In [17]:
gdf_areas.head()

Unnamed: 0.1,Unnamed: 0,area_number,community_name,shape_area,shape_len,new_geometry
0,9,1,Rogers Park,51259900.0,34052.397576,"MULTIPOLYGON (((-87.65456 41.99817, -87.65574 ..."
1,19,2,West Ridge,98429090.0,43020.689458,"MULTIPOLYGON (((-87.68465 42.01948, -87.68464 ..."
2,30,3,Uptown,65095640.0,46972.794555,"MULTIPOLYGON (((-87.64102 41.95480, -87.64400 ..."
3,5,4,Lincoln Square,71352330.0,36624.603085,"MULTIPOLYGON (((-87.67441 41.97610, -87.67440 ..."
4,47,5,North Center,57054170.0,31391.669754,"MULTIPOLYGON (((-87.67336 41.93234, -87.67342 ..."


## 4. Perform a spatial join, to match points (start/end)  and polygons (districts).

#### Spatial Join on Start Points

In [18]:
# setting start_point as the active geometry, so that the start points are checked (whether they are inside a district)
gdf_22.set_geometry("start_point", inplace=True)

In [19]:
# Perform spatial join to match start_points and polygons
gdf_22_merge1 = gpd.tools.sjoin(gdf_22, gdf_areas[["area_number","community_name","new_geometry"]], predicate="within", how='left')

Renaming the columns "area_number" and "community_name" to "start_area_number" and "start_community_name":

In [20]:
gdf_22_merge1.rename(columns={"area_number":"start_area_number", "community_name":"start_community_name"},inplace=True)

In [21]:
gdf_22_merge1.head()

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,...,end_lng,member_casual,time_difference_seconds,tripduration_in_min,trip_value,start_point,end_point,index_right,start_area_number,start_community_name
0,9B632EBA1CC3202B,electric_bike,2022-12-07 07:34:57,2022-12-07 07:39:48,TA1309000019,Lakeview Ave & Fullerton Pkwy,0,,41.925656,-87.639104,...,-87.64,member,291,4.85,0.8245,POINT (-87.63910 41.92566),POINT (-87.64000 41.92000),6.0,7.0,Lincoln Park
1,B20C4D0B45751662,electric_bike,2022-04-27 17:09:01,2022-04-27 17:20:53,316,Lamon Ave & Belmont Ave,0,,41.94,-87.75,...,-87.72,casual,712,11.866667,5.984,POINT (-87.75000 41.94000),POINT (-87.72000 41.93000),14.0,15.0,Portage Park
2,F90780236FB6EC9E,classic_bike,2022-06-03 15:31:03,2022-06-03 15:36:42,13001,Michigan Ave & Washington St,13008,Millennium Park,41.883984,-87.624684,...,-87.624084,casual,339,5.65,1.9605,POINT (-87.62468 41.88398),POINT (-87.62408 41.88103),31.0,32.0,Loop
3,98B2B077FE0AFE33,electric_bike,2022-08-29 17:15:10,2022-08-29 17:36:03,13409,Sangamon St & Washington Blvd,20247,W Washington Blvd & N Peoria St,41.882971,-87.650839,...,-87.65,member,1253,20.883333,3.550167,POINT (-87.65084 41.88297),POINT (-87.65000 41.88000),27.0,28.0,Near West Side
4,D396D97F3616E948,electric_bike,2022-04-20 07:46:45,2022-04-20 07:50:55,SL-010,Financial Pl & Ida B Wells Dr,TA1309000007,Franklin St & Monroe St,41.875046,-87.633116,...,-87.635185,casual,250,4.166667,2.75,POINT (-87.63312 41.87505),POINT (-87.63519 41.88032),31.0,32.0,Loop


In [22]:
# dropping the "index_right" column
gdf_22_merge1.drop(columns="index_right",inplace=True)

In [23]:
# setting end_point as the active geometry, so that the end points are checked (whether they are inside a district polygon)
gdf_22_merge1.set_geometry("end_point", inplace=True)

In [24]:
# Perform spatial join to match points (this time end points)  and polygons (districts)
gdf_22_merge2 = gpd.tools.sjoin(gdf_22_merge1, gdf_areas[["area_number","community_name","new_geometry"]], predicate="within", how='left')

In [25]:
gdf_22_merge2.head()

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,...,time_difference_seconds,tripduration_in_min,trip_value,start_point,end_point,start_area_number,start_community_name,index_right,area_number,community_name
0,9B632EBA1CC3202B,electric_bike,2022-12-07 07:34:57,2022-12-07 07:39:48,TA1309000019,Lakeview Ave & Fullerton Pkwy,0,,41.925656,-87.639104,...,291,4.85,0.8245,POINT (-87.63910 41.92566),POINT (-87.64000 41.92000),7.0,Lincoln Park,6.0,7.0,Lincoln Park
1,B20C4D0B45751662,electric_bike,2022-04-27 17:09:01,2022-04-27 17:20:53,316,Lamon Ave & Belmont Ave,0,,41.94,-87.75,...,712,11.866667,5.984,POINT (-87.75000 41.94000),POINT (-87.72000 41.93000),15.0,Portage Park,21.0,22.0,Logan Square
2,F90780236FB6EC9E,classic_bike,2022-06-03 15:31:03,2022-06-03 15:36:42,13001,Michigan Ave & Washington St,13008,Millennium Park,41.883984,-87.624684,...,339,5.65,1.9605,POINT (-87.62468 41.88398),POINT (-87.62408 41.88103),32.0,Loop,31.0,32.0,Loop
3,98B2B077FE0AFE33,electric_bike,2022-08-29 17:15:10,2022-08-29 17:36:03,13409,Sangamon St & Washington Blvd,20247,W Washington Blvd & N Peoria St,41.882971,-87.650839,...,1253,20.883333,3.550167,POINT (-87.65084 41.88297),POINT (-87.65000 41.88000),28.0,Near West Side,27.0,28.0,Near West Side
4,D396D97F3616E948,electric_bike,2022-04-20 07:46:45,2022-04-20 07:50:55,SL-010,Financial Pl & Ida B Wells Dr,TA1309000007,Franklin St & Monroe St,41.875046,-87.633116,...,250,4.166667,2.75,POINT (-87.63312 41.87505),POINT (-87.63519 41.88032),32.0,Loop,31.0,32.0,Loop


Renaming the columns "area_number" and "community_name" to "end_area_number" and "end_community_name":

In [26]:
# renaming columns:
gdf_22_merge2.rename(columns={"area_number":"end_area_number", "community_name":"end_community_name"},inplace=True)
gdf_22_merge2.head(4)

Unnamed: 0,ride_id,rideable_type,starttime,stoptime,from_station_id,from_station_name,to_station_id,to_station_name,start_lat,start_lng,...,time_difference_seconds,tripduration_in_min,trip_value,start_point,end_point,start_area_number,start_community_name,index_right,end_area_number,end_community_name
0,9B632EBA1CC3202B,electric_bike,2022-12-07 07:34:57,2022-12-07 07:39:48,TA1309000019,Lakeview Ave & Fullerton Pkwy,0,,41.925656,-87.639104,...,291,4.85,0.8245,POINT (-87.63910 41.92566),POINT (-87.64000 41.92000),7.0,Lincoln Park,6.0,7.0,Lincoln Park
1,B20C4D0B45751662,electric_bike,2022-04-27 17:09:01,2022-04-27 17:20:53,316,Lamon Ave & Belmont Ave,0,,41.94,-87.75,...,712,11.866667,5.984,POINT (-87.75000 41.94000),POINT (-87.72000 41.93000),15.0,Portage Park,21.0,22.0,Logan Square
2,F90780236FB6EC9E,classic_bike,2022-06-03 15:31:03,2022-06-03 15:36:42,13001,Michigan Ave & Washington St,13008,Millennium Park,41.883984,-87.624684,...,339,5.65,1.9605,POINT (-87.62468 41.88398),POINT (-87.62408 41.88103),32.0,Loop,31.0,32.0,Loop
3,98B2B077FE0AFE33,electric_bike,2022-08-29 17:15:10,2022-08-29 17:36:03,13409,Sangamon St & Washington Blvd,20247,W Washington Blvd & N Peoria St,41.882971,-87.650839,...,1253,20.883333,3.550167,POINT (-87.65084 41.88297),POINT (-87.65000 41.88000),28.0,Near West Side,27.0,28.0,Near West Side


In [27]:
# dropping the "index_right" column
gdf_22_merge2.drop(columns="index_right",inplace=True)
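
The start and end joins above follow the same four steps (set geometry, join, drop "index_right", rename). One possible way to avoid repeating them is a small helper; the sketch below uses a made-up function name `add_area_columns` and toy data with an invented square area, so it only illustrates the pattern.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon


def add_area_columns(trips, areas, point_col, prefix):
    """Join trips to areas on `point_col` and prefix the resulting area columns."""
    joined = gpd.sjoin(
        trips.set_geometry(point_col),
        areas[["area_number", "community_name", areas.geometry.name]],
        predicate="within",
        how="left",
    )
    return joined.drop(columns="index_right").rename(
        columns={
            "area_number": f"{prefix}_area_number",
            "community_name": f"{prefix}_community_name",
        }
    )


# Toy data: one made-up square area and two trips
toy_areas = gpd.GeoDataFrame(
    {"area_number": [1], "community_name": ["Toy Area"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
    crs="EPSG:4326",
)
toy_trips = gpd.GeoDataFrame(
    {"ride_id": ["a", "b"]},
    geometry=[Point(0.5, 0.5), Point(0.2, 0.2)],
    crs="EPSG:4326",
).rename_geometry("start_point")
# Trip "b" ends outside the toy area
toy_trips["end_point"] = gpd.GeoSeries([Point(0.8, 0.8), Point(5, 5)], crs="EPSG:4326")

result = add_area_columns(toy_trips, toy_areas, "start_point", "start")
result = add_area_columns(result, toy_areas, "end_point", "end")
print(result[["ride_id", "start_community_name", "end_community_name"]])
```

Because the join uses `how="left"`, trips whose point lies in no area are kept and simply get NaN in the area columns.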

### Dealing with NaN values:
    - Problem: for some trips, the start or end community area number and name are NaN. Let's look at them:

In [28]:
# NaN Values in start_area_number:
gdf_22_merge2["start_area_number"].isna().value_counts()

False    496821
True       2649
Name: start_area_number, dtype: int64

In [29]:
gdf_22_merge2["end_area_number"].isna().value_counts()

False    496499
True       2971
Name: end_area_number, dtype: int64

#### Since we made sure that all trips have a start and an end point, NaN values in start_area_number/end_area_number mean that these trips started/ended outside of the Chicago community areas. Let's deal with the NaN values in the following way:
    - start_area_number = NaN       ---> 999
    - start_community_name = NaN    ---> "not in districts"
    - end_area_number = NaN         ---> 999
    - end_community_name = NaN      ---> "not in districts"
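
The replacement rules above can be written as one mapping from column name to fill value. This is a minimal sketch on a toy frame (the rows are invented; `OUTSIDE_NUMBER` and `OUTSIDE_NAME` are names chosen for this example), using the dictionary form of `DataFrame.fillna`.

```python
import numpy as np
import pandas as pd

# Toy join result: the second trip's start point matched no community area
toy = pd.DataFrame(
    {
        "start_area_number": [7.0, np.nan],
        "start_community_name": ["Lincoln Park", None],
        "end_area_number": [7.0, 22.0],
        "end_community_name": ["Lincoln Park", "Logan Square"],
    }
)

OUTSIDE_NUMBER = 999
OUTSIDE_NAME = "not in districts"

# One fill value per column, applied in a single call
toy = toy.fillna(
    {
        "start_area_number": OUTSIDE_NUMBER,
        "end_area_number": OUTSIDE_NUMBER,
        "start_community_name": OUTSIDE_NAME,
        "end_community_name": OUTSIDE_NAME,
    }
)

# With no NaN left, the area numbers can be stored as integers
toy = toy.astype({"start_area_number": int, "end_area_number": int})
print(toy)
```

The area numbers come out of the join as floats because NaN cannot be stored in a plain integer column, which is why the integer conversion has to wait until after the fill.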

In [30]:
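# replacing NaN values (assigning the result back avoids chained in-place modification):
gdf_22_merge2["start_area_number"] = gdf_22_merge2["start_area_number"].fillna(999)
gdf_22_merge2["end_area_number"] = gdf_22_merge2["end_area_number"].fillna(999)
gdf_22_merge2["start_community_name"] = gdf_22_merge2["start_community_name"].fillna("not in districts")
gdf_22_merge2["end_community_name"] = gdf_22_merge2["end_community_name"].fillna("not in districts")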

As we can see, there are no more NaN values in those columns:

In [31]:
gdf_22_merge2["end_area_number"].isna().value_counts()

False    499470
Name: end_area_number, dtype: int64

In [32]:
# converting start_area_number and end_area_number from float to integer:
gdf_22_merge2["start_area_number"] = gdf_22_merge2["start_area_number"].astype(int)
gdf_22_merge2["end_area_number"] = gdf_22_merge2["end_area_number"].astype(int)

Store a copy of gdf_22_merge2:

In [33]:
gdf_22_final = gdf_22_merge2.copy()

## 5. Upload to SQL Database

In [None]:
# # Push DataFrame to SQL database:
# table_name = 'trips_20xx_v3'

# gdf_22_final.to_sql(name=table_name, # Name of SQL table
#                     con=engine, # Engine or connection
#                     if_exists='replace', # Drop the table before inserting new values 
#                     schema=schema, # Use schema that was defined earlier
#                     index=False, # Do not write the DataFrame index as a column
#                     chunksize=5000, # Specify the number of rows in each batch to be written at a time
#                     method='multi') # Pass multiple values in a single INSERT clause
# print(f"The {table_name} table was imported successfully.")
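
One thing to watch for when uploading: the two point columns hold shapely objects, which a plain `to_sql` call may not be able to store, depending on the database driver. One possible workaround, sketched here on toy data, is to encode the geometry columns as WKT text first with `GeoDataFrame.to_wkt()`. The in-memory SQLite connection and the table name `toy_trips` are stand-ins for illustration only, not the project's engine or schema.

```python
import sqlite3

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Toy result table with one geometry column (values are made up)
toy_gdf = gpd.GeoDataFrame(
    {"ride_id": ["a", "b"], "end_area_number": [7, 999]},
    geometry=[Point(-87.64, 41.92), Point(-87.9, 42.1)],
    crs="EPSG:4326",
).rename_geometry("end_point")

# Encode every geometry column as WKT text so the table holds only plain types
toy_plain = toy_gdf.to_wkt()

# In-memory SQLite is only a stand-in for the project's database engine
con = sqlite3.connect(":memory:")
toy_plain.to_sql("toy_trips", con=con, if_exists="replace", index=False)
stored = pd.read_sql("SELECT * FROM toy_trips", con)
con.close()

print(stored)
```

The WKT strings can be turned back into geometries after loading, in the same way the area table is converted earlier in this notebook.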

#### All end locations of trips that ended outside of the Chicago districts:


In [34]:
gdf_22_final[gdf_22_final["end_area_number"]==999]["end_point"].explore()