# Accident Propensity Index Calculation
## Overall strategy
1. Define a function that calculates the distance between a point (an accident) and a line segment (a road segment). The Haversine formula gives the great-circle distance between two points, taking the curvature of the earth's surface into account. To get a point-to-segment distance, first find the point on the segment that is closest to the accident, then apply the Haversine formula to that pair. You can implement the formula yourself, or you can use a library like geopy, which provides great-circle and geodesic distance functions.
2. Define a function that takes the start and end points of a route and queries the database for accidents within a certain distance from the route. This function can use the distance function from step 1 to determine which accidents are close enough to be considered.
3. If your database doesn't have any indexes, you can create them for the columns that you will be querying frequently. In this case, you would want to create an index on the longitude and latitude columns to speed up spatial queries.
4. Use a library like pandas to read the data from the database into a dataframe. This will allow you to manipulate the data more easily.
5. Iterate over the road segments in your route and call the function from step 2 to find accidents that are close to each segment.
6. Combine the results from step 5 to get a list of all accidents that are on or close to the route.
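
Steps 5 and 6 can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the route is a list of `(latitude, longitude)` waypoints and the accidents are plain dictionaries with `id`, `latitude`, and `longitude` keys. The names `haversine_km`, `point_to_segment_km`, and `accidents_near_route`, as well as the middle waypoint in the sample route, are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def point_to_segment_km(p, a, b):
    """Approximate distance in km from point p to segment a-b (short segments)."""
    # Scale longitude by cos(latitude) so both axes use comparable units
    k = math.cos(math.radians((a[0] + b[0]) / 2))
    dx, dy = (b[1] - a[1]) * k, b[0] - a[0]
    if dx == 0 and dy == 0:
        return haversine_km(p, a)
    t = ((p[1] - a[1]) * k * dx + (p[0] - a[0]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection to the segment
    closest = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
    return haversine_km(p, closest)

def accidents_near_route(accidents, route, max_km):
    """Return accidents within max_km of any segment of the polyline `route`."""
    found = {}
    # Consecutive waypoints form the road segments (step 5)
    for start, end in zip(route, route[1:]):
        for acc in accidents:
            if acc["id"] in found:
                continue  # already matched by an earlier segment
            p = (acc["latitude"], acc["longitude"])
            if point_to_segment_km(p, start, end) <= max_km:
                found[acc["id"]] = acc
    # Each accident appears once, even if it is close to several segments (step 6)
    return list(found.values())

sample_accidents = [
    {"id": 1, "latitude": 33.781549, "longitude": -84.403342},
    {"id": 2, "latitude": 33.786186, "longitude": -84.401266},
    {"id": 3, "latitude": 33.781534, "longitude": -84.401825},
]
# The middle waypoint is made up for illustration
sample_route = [(33.781544, -84.403656), (33.781541, -84.400900), (33.781538, -84.398188)]
nearby = accidents_near_route(sample_accidents, sample_route, 0.05)
print([acc["id"] for acc in nearby])
```

Keying the results by accident `id` avoids counting an accident twice when it lies near the junction of two segments.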

## Speeding up the process
1. Reduce the search space: Depending on the size of your dataset, you may not need to check every accident against every road segment. One way to reduce the search space is to first query the database for accidents that are roughly near the route, for example inside a bounding box around the route padded by the maximum distance. You can then check only these accidents against the road segments.
2. Parallel processing: You can speed up the process by using parallel processing. You can split the road segments into several parts and process each part in parallel using multiple CPU cores. Python has several libraries for parallel processing, including multiprocessing and concurrent.futures.
3. Use a spatial index: You can use a spatial index, such as a quadtree or an R-tree, to speed up the process of finding accidents that are close to a road segment. A spatial index partitions the space into smaller regions, allowing you to quickly find points that are within a certain region. Several Python libraries support spatial indexing, including rtree and geopandas.
4. Simplify the road network: If your road network is too complex, you can simplify it by reducing the number of vertices or removing small segments. This can make the processing faster without significantly affecting the accuracy of the results.
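5. Use a faster distance calculation: The Haversine formula is accurate, but it can be slow for large datasets. If you need a faster calculation, you can use an approximation, such as the Euclidean distance or the Manhattan distance, computed on projected coordinates. Raw latitude and longitude degrees do not cover equal distances on the ground, so scale longitude differences by the cosine of the latitude first. These approximations are less accurate than the Haversine formula over long distances, but they may be good enough at the scale of a road segment.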
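
As a rough sketch of the search-space reduction in item 1, the route can be wrapped in a latitude/longitude bounding box padded by the maximum distance, and only accidents inside that box are passed on to the exact distance check. The function names `route_bounding_box` and `prefilter` are hypothetical, accidents are assumed to be dictionaries with `latitude` and `longitude` keys, and routes that cross the antimeridian are not handled.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_DEG_LAT = EARTH_RADIUS_KM * math.pi / 180  # about 111.2 km per degree

def route_bounding_box(route, margin_km):
    """Return (min_lat, max_lat, min_lon, max_lon) around `route`, padded by margin_km."""
    lats = [lat for lat, _ in route]
    lons = [lon for _, lon in route]
    dlat = margin_km / KM_PER_DEG_LAT
    # A degree of longitude shrinks with cos(latitude). Use the latitude
    # farthest from the equator so the padding is never too small.
    max_abs_lat = min(max(abs(lat) for lat in lats) + dlat, 89.0)
    dlon = margin_km / (KM_PER_DEG_LAT * math.cos(math.radians(max_abs_lat)))
    return (min(lats) - dlat, max(lats) + dlat, min(lons) - dlon, max(lons) + dlon)

def prefilter(accidents, box):
    """Keep only the accidents that fall inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return [acc for acc in accidents
            if min_lat <= acc["latitude"] <= max_lat
            and min_lon <= acc["longitude"] <= max_lon]

sample_accidents = [
    {"id": 1, "latitude": 33.781549, "longitude": -84.403342},
    {"id": 2, "latitude": 33.786186, "longitude": -84.401266},
    {"id": 3, "latitude": 33.781534, "longitude": -84.401825},
]
sample_route = [(33.781544, -84.403656), (33.781538, -84.398188)]
box = route_bounding_box(sample_route, margin_km=0.05)
candidates = prefilter(sample_accidents, box)
print([acc["id"] for acc in candidates])
```

The box only discards accidents that cannot be within the margin, so the exact point-to-segment check still has to run on the remaining candidates. The same four bounds could also be used as range conditions in the database query, which is where an index on the latitude and longitude columns would help.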

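For the faster distance calculation in item 5, one common option is the equirectangular approximation, which treats a small patch of the earth as flat after scaling longitude by the cosine of the latitude. The sketch below is an assumption about how such a function could look. The name `equirectangular_km` is hypothetical, and it takes `(latitude, longitude)` tuples in degrees like the Haversine function in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0

def equirectangular_km(p, q):
    """Planar approximation of the distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    # East-west offset, shrunk by the cosine of the mean latitude
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    # North-south offset
    y = lat2 - lat1
    return EARTH_RADIUS_KM * math.hypot(x, y)

route_length_km = equirectangular_km((33.781544, -84.403656), (33.781538, -84.398188))
print(round(route_length_km, 3))
```

It needs a single cosine instead of several trigonometric calls per pair of points, and the error stays small as long as the two points are close together and far from the poles.
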
In [1]:
import pandas as pd
import math

# Define a function to calculate the Haversine (great-circle) distance in km
# between two (latitude, longitude) points given in degrees
def distance(point1, point2):
    lat1, lon1 = point1
    lat2, lon2 = point2
    R = 6371 # radius of the earth in km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat/2) * math.sin(dlat/2) +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon/2) * math.sin(dlon/2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = R * c
    return d

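# Define a function to calculate the distance between a point and a line segment.
# All points are (latitude, longitude) in degrees; the result is in kilometers.
def distance_to_segment(point, segment_start, segment_end):
    plat, plon = point
    lat1, lon1 = segment_start
    lat2, lon2 = segment_end
    # Scale longitude differences by cos(latitude) so that both axes use
    # comparable units before projecting (reasonable for short segments)
    k = math.cos(math.radians((lat1 + lat2) / 2))
    dx = (lon2 - lon1) * k
    dy = lat2 - lat1
    if dx == 0 and dy == 0:
        # Degenerate segment: start and end coincide
        return distance(point, segment_start)
    # Fraction along the segment at which the closest point lies
    t = ((plon - lon1) * k * dx + (plat - lat1) * dy) / (dx * dx + dy * dy)
    if t < 0:
        return distance(point, segment_start)
    elif t > 1:
        return distance(point, segment_end)
    else:
        # Interpolate the closest point in (latitude, longitude)
        lat = lat1 + t * (lat2 - lat1)
        lon = lon1 + t * (lon2 - lon1)
        return distance(point, (lat, lon))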

# Define a function to find accidents within a maximum distance (km) of a route.
# The route is treated as a single straight segment from start_point to end_point.
# Note: this reads the global accident DataFrame `data`, which must exist before the call.
def find_accidents_on_route(start_point, end_point, max_distance):
    # Distance in km from each accident to the route segment
    distances = data.apply(
        lambda row: distance_to_segment(
            (row['latitude'], row['longitude']), start_point, end_point),
        axis=1)

    # Return the accidents that lie within the maximum distance
    return data.loc[distances <= max_distance]

In [2]:
# Accident data frame
data_dict = {'id': [1, 2, 3],
             'latitude': [33.781549, 33.786186, 33.781534],
             'longitude': [-84.403342, -84.401266, -84.401825],
             'severity_score': [2, 3, 3]}
data = pd.DataFrame(data_dict)

# Route start and end point
start_point = (33.781544, -84.403656)
end_point = (33.781538, -84.398188)

# Maximum distance of accidents from the route, in kilometers
max_distance = 0.05

In [3]:
# Run after entering the accident, route, and distance data
accidents = find_accidents_on_route(start_point, end_point, max_distance)
print(accidents)

   id   latitude  longitude  severity_score
0   1  33.781549 -84.403342               2
2   3  33.781534 -84.401825               3
