In [1]:
import pandas as pd
import random
from keplergl import KeplerGl
import geopandas as gpd

In [2]:
columns_to_read = ['start_station_name','end_station_name',"start_lat","start_lng","end_lat","end_lng"]
df1= pd.read_csv("Combined_DataWeatherFinal.csv", usecols=columns_to_read)
df1['new_column'] = 1
aggregated_df = df1.groupby(['start_station_name', 'end_station_name']).size().reset_index(name='Trip Count')
aggregated_df.columns = ['Starting Station', 'Ending Station', 'Trip Count',]

**Task Solution**

The task asks us to initialize a Kepler.gl map instance and customize it with arcs connecting the data points. Reading the full dataset took around 20 minutes and used so much system RAM that it almost crashed, so we changed the approach and worked with a sample instead. We randomly selected 10,000 records from each month and concatenated them, which gives a dataset of 120,000 data points. This sample was small enough to handle efficiently and still produced satisfactory results.
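The memory pressure described above comes from loading the whole file before sampling. As an alternative sketch (not the approach used in this notebook), the file can be streamed row by row while keeping a fixed-size reservoir per month, so only the sampled rows are ever held in memory. The function name `sample_per_month` and the demo rows are made up for illustration, and the sketch assumes `started_at` is in an ISO-like format such as `2022-01-27 14:21:01`.

```python
import csv
import io
import random
from collections import defaultdict
from datetime import datetime


def sample_per_month(rows, n, seed=42, time_column="started_at"):
    """Keep at most n uniformly sampled rows per calendar month in one pass."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # month number -> sampled rows
    seen = defaultdict(int)         # month number -> rows seen so far
    for row in rows:
        month = datetime.fromisoformat(row[time_column]).month
        seen[month] += 1
        bucket = reservoirs[month]
        if len(bucket) < n:
            # Reservoir not full yet: keep every row.
            bucket.append(row)
        else:
            # Reservoir full: keep the new row with probability n / seen[month].
            j = rng.randrange(seen[month])
            if j < n:
                bucket[j] = row
    return dict(reservoirs)


# Tiny in-memory stand-in for the real CSV (made-up rows, hypothetical stations).
demo_csv = io.StringIO(
    "started_at,start_station_name\n"
    "2022-01-27 14:21:01,A\n"
    "2022-01-10 06:42:18,B\n"
    "2022-01-11 09:00:00,C\n"
    "2022-01-12 10:30:00,D\n"
    "2022-01-13 18:05:00,E\n"
    "2022-02-01 08:15:00,F\n"
)
sampled = sample_per_month(csv.DictReader(demo_csv), n=2)
```

On the real data the same function could be fed `csv.DictReader(open("Combined_DataWeatherFinal.csv", newline=""))` with `n=10000`. Months with fewer than `n` rows simply keep all of their rows.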


In [3]:
columns_to_read = ['start_station_name','end_station_name','started_at',"start_lat","start_lng","end_lat","end_lng"]
main_data= pd.read_csv("Combined_DataWeatherFinal.csv", usecols=columns_to_read)
main_data['started_at'] = pd.to_datetime(main_data['started_at'])

selected_records = pd.DataFrame()

grouped_data = main_data.groupby(main_data['started_at'].dt.month)

def select_random_records(group):
    return group.sample(n=10000, random_state=42)

selected_records = grouped_data.apply(select_random_records)
selected_records.reset_index(drop=True, inplace=True)
selected_records.head(2)


Unnamed: 0,started_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng
0,2022-01-27 14:21:01,5 Ave & E 63 St,Cathedral Pkwy & Broadway,40.766368,-73.971518,40.804213,-73.966991
1,2022-01-10 06:42:18,W 63 St & Broadway,E 53 St & Madison Ave,40.771639,-73.982614,40.759711,-73.974023


In [4]:
columns_to_read = ['start_station_name','end_station_name',"start_lat","start_lng","end_lat","end_lng"] 
# Plain CSV with no geometry column, so pandas reads it directly; usecols is a pandas option
df = pd.read_csv("Kepler_12mnth_data.csv", usecols=columns_to_read)
# Make sure the coordinate columns are numeric before building the map
for col in ["start_lat", "start_lng", "end_lat", "end_lng"]:
    df[col] = pd.to_numeric(df[col])


In [5]:
# Initialize Kepler.gl map instance
map_instance = KeplerGl(height=600,width=800)

gdf=gpd.GeoDataFrame(df,geometry=gpd.points_from_xy(df.start_lng,df.start_lat))
map_instance.add_data(data=gdf,name="City Bike Trip Data")

# Display the Kepler.gl map
map_instance

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


Out of range float values are not JSON compliant
Supporting this message is deprecated in jupyter-client 7, please make sure your message is JSON-compliant
  content = self.pack(content)


KeplerGl(data={'City Bike Trip Data': {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
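The "Out of range float values are not JSON compliant" warning in the output above is what Jupyter reports when the data sent to the widget contains `NaN` or infinite floats. A likely cause here, stated as an assumption, is trips with missing coordinates. The sketch below drops such rows before the data is added to the map. The helper name `drop_invalid_coordinates` and the demo values are illustrative only.

```python
import numpy as np
import pandas as pd

COORD_COLUMNS = ["start_lat", "start_lng", "end_lat", "end_lng"]


def drop_invalid_coordinates(frame, columns=COORD_COLUMNS):
    """Return a copy without rows whose coordinates are missing or non-finite."""
    columns = list(columns)
    cleaned = frame.copy()
    for col in columns:
        # Unparseable values become NaN instead of raising an error.
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    # A row is kept only if all of its coordinates are finite numbers.
    finite = np.isfinite(cleaned[columns]).all(axis=1)
    return cleaned[finite].reset_index(drop=True)


# Demo frame: row 1 has a missing end_lat, row 2 has an infinite end_lng.
demo = pd.DataFrame({
    "start_lat": ["40.766368", "40.771639", "40.7"],
    "start_lng": [-73.971518, -73.982614, -73.9],
    "end_lat": [40.804213, None, 40.8],
    "end_lng": [-73.966991, -73.974023, float("inf")],
})
valid = drop_invalid_coordinates(demo)
```

Applying the helper to `df` before `add_data` should leave only rows that serialize to valid JSON.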

**Task Solution**


To visualize the bike trips between two points, representing the starting and ending locations, we utilized the "arc" layer option in the map settings. By specifying the start and end latitude and longitude coordinates, we created arcs that connect these points on the map. This allowed us to represent the flow of bike trips between different stations.

Furthermore, we took advantage of the customization options, which enabled us to adjust the colors of the arcs according to our preference. Once the map was tailored to our liking, we saved the customized version, complete with the arcs connecting the points, as an HTML file for convenient sharing and viewing.
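The arc layer was configured by hand in the map's side panel. A rough sketch of the same setup as a configuration dictionary follows. The key layout (`visState`, `layers`, and the `lat0`/`lng0`/`lat1`/`lng1` column mapping) reflects the structure Kepler.gl exports through `map_instance.config` as far as I know, so compare it against your own exported config before relying on it. The layer id, label, and color are placeholders.

```python
def build_arc_config(data_id, label="Bike trips", color=(255, 153, 31)):
    """Build a minimal Kepler.gl config with one arc layer from start to end point."""
    return {
        "version": "v1",
        "config": {
            "visState": {
                "layers": [
                    {
                        "id": "trip_arcs",  # placeholder layer id
                        "type": "arc",
                        "config": {
                            # Must match the name passed to add_data()
                            "dataId": data_id,
                            "label": label,
                            "color": list(color),
                            # Source and target coordinates of each arc
                            "columns": {
                                "lat0": "start_lat",
                                "lng0": "start_lng",
                                "lat1": "end_lat",
                                "lng1": "end_lng",
                            },
                            "isVisible": True,
                        },
                    }
                ]
            }
        },
    }


arc_config = build_arc_config("City Bike Trip Data")
```

The dictionary can be assigned with `map_instance.config = arc_config`, and the customized map exported with `map_instance.save_to_html(file_name="bike_trips_arcs.html")`, where the file name is only an example.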

**Task Solution**

To find the most common trips in New York City, we used the filter option in the Kepler.gl map and entered the names of the stations with the highest number of trips.

In [6]:
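# Trip counts per station pair in the Kepler_12mnth_data.csv data (computed for reference)
Commontrips = df.groupby(['start_station_name', 'end_station_name']).size().reset_index(name='Trip Count')
Commontrips.columns = ['Starting Station', 'Ending Station', 'Trip Count']
Commontrips.head(2)
# The ranking uses aggregated_df from In [2], i.e. the counts from the full dataset
commontripsorted_df = aggregated_df.sort_values(by='Trip Count', ascending=False)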


In [8]:
# Retrieve the top 20 records
commmontriptop_20_records = commontripsorted_df.head(20)
commmontriptop_20_records.head(2)


Unnamed: 0,Starting Station,Ending Station,Trip Count
294975,Central Park S & 6 Ave,Central Park S & 6 Ave,14071
147755,7 Ave & Central Park South,7 Ave & Central Park South,10342
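The map filter mentioned in the task solution needs the station names that appear in the top trips. A minimal sketch of collecting them from a trip-count table follows. The helper name `stations_in_top_trips` is made up, and the third demo row is invented, while the first two rows repeat the output above.

```python
import pandas as pd


def stations_in_top_trips(trips, n=20):
    """Return the n most frequent trips and the sorted station names they involve."""
    top = trips.sort_values("Trip Count", ascending=False).head(n)
    # Flatten both station columns and de-duplicate the names.
    names = pd.unique(top[["Starting Station", "Ending Station"]].to_numpy().ravel())
    return top, sorted(names)


demo_trips = pd.DataFrame({
    "Starting Station": ["Central Park S & 6 Ave", "7 Ave & Central Park South", "Station A"],
    "Ending Station": ["Central Park S & 6 Ave", "7 Ave & Central Park South", "Station B"],
    "Trip Count": [14071, 10342, 5],  # last row is made up
})
top_trips, station_names = stations_in_top_trips(demo_trips, n=2)
```

The resulting `station_names` list can then be typed into the station-name filter in the Kepler.gl side panel.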


**Inferences**

We have identified the top 20 bike trips along with their trip counts. These are the most commonly used routes and are likely the busiest. A common theme among them is their proximity to Central Park, a prominent landmark and recreational area in Manhattan. The two most frequent trips shown above also start and end at the same station, which points to round trips rather than point-to-point travel. Manhattan stands out among the boroughs of New York City for its many tourist attractions, including Times Square, Central Park, the Empire State Building, and Broadway. The high frequency of bike trips in this area is therefore likely driven by tourists visiting these destinations.