<a href="https://colab.research.google.com/github/JLin-NCE/ArcPy-QC-Tool/blob/main/Jordan_Lin's_Modified_Jupyter_Notebook_Road_Shapefile_Cleanup_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Road Shapefile Cleanup and Segmentation

This Jupyter notebook processes road shapefiles, cleans up the data, merges segments based on street names, and then splits them according to a predefined section list.

## Configuration:

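1. Create a directory named "County Centerline" in your working environment. The code cells read from `/content/County Centerline`, which is the default working directory on Google Colab.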

2. Upload the contents of this folder into the "County Centerline" directory:
   https://drive.google.com/drive/folders/1B50TNQ4cgvvveT16-W6xEzlB6zx7GC9U?usp=sharing

## Usage:

1. Run the notebook cells in order.

2. The script will create several output directories:
   - Output/Merged Output
   - Output/Split Output
   - Output/Split Output/Split Shapefile
   - Output/Split Output/Split Shapefile/Split Output
   - Output/Final Output

3. The script processes the input shapefile, merges segments, splits them based on the section list, and generates various output files including shapefiles, Excel files, and KMZ files.

4. The final output will be compressed into an "Output.zip" file.

5. The resulting shapefile is located in the "Final Output" directory within the zip file.
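
Once the notebook has finished, the archive from steps 4 and 5 can be inspected without extracting it. The following is a minimal sketch using only the standard library. The helper name `find_final_shapefiles` is hypothetical, and the sketch assumes the archive keeps the "Final Output" folder name listed above.

```python
import zipfile

def find_final_shapefiles(zip_path, folder="Final Output"):
    """List .shp members stored under `folder` inside the archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name for name in archive.namelist()
            if name.startswith(folder + "/") and name.lower().endswith(".shp")
        ]
```

On Colab this could be called as `find_final_shapefiles("/content/Output.zip")` after the last cell has run.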

## Output:

- Merged segments (shapefile and Excel)
- Split segments (shapefile and KMZ)
- Skipped rows (Excel)
- Final output shapefile and KMZ

## Notes:

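- The script uses fuzzy matching for street names (default score threshold of 70). Segments whose street names cannot be matched are skipped.
- Intersections are used to split road segments. When no intersection point is found for the "From" or "To" street, the full, unsplit line is used instead.
- Review "skipped_rows.xlsx" (saved in "Output/Split Output/Split Shapefile") to see which segments were not processed and why. Warnings and errors are also written to "processing_log.txt".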
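
To illustrate the idea behind the fuzzy matching note, here is a minimal sketch that scores a query against candidate street names and rejects weak matches. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `thefuzz`, so the scores are not comparable to the notebook's. The function name `match_street` and the 0.7 cutoff are assumptions made for this example.

```python
from difflib import SequenceMatcher

def match_street(query, choices, threshold=0.7):
    """Return the most similar choice, or None if nothing reaches the threshold."""
    query = query.upper().strip()
    best_name, best_score = None, 0.0
    for choice in choices:
        # ratio() is a similarity score between 0.0 and 1.0
        score = SequenceMatcher(None, query, choice.upper()).ratio()
        if score > best_score:
            best_name, best_score = choice, score
    return best_name if best_score >= threshold else None

streets = ["CARSON ST", "NORWALK BLVD", "PIONEER BLVD"]
print(match_street("Carsen St", streets))  # a close misspelling
print(match_street("Main St", streets))    # no similar name in the list
```

A name that returns `None` here corresponds to a row that would end up in the skipped list.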

For any issues or improvements, please refer to the code comments or contact the script maintainer.

In [None]:
!pip install pandas geopandas pyproj openpyxl xmltodict opencv-python tqdm simplekml haversine geopy shapely numpy thefuzz



In [None]:

import ast
import multiprocessing
import os
import re
from cmath import inf
from multiprocessing import Pool

import cv2
import geopandas as gpd
import geopy
import geopy.distance
import numpy as np
import pandas as pd
import simplekml
import xmltodict
from haversine import haversine, Unit
from shapely import wkt
from shapely.geometry import Polygon
from tqdm import tqdm


In [None]:
INPUT_SHAPE_FILE = r"/content/County Centerline/HawaiianGardens-CAMS-0619224.shp"
OUTPUT_SHAPEFILE_DATA_EXCEL_PATH = r"/content/Output/coordinates.xlsx"
OUTPUT_EXCEL_MERGED_SEGMENTS = r"/content/Output/merged_segments.xlsx"
OUTPUT_SHAPEFILE_MERGED_SEGMENTS = r"/content/Output/merged_segments.shp"
OUTPUT_SHAPE_FILE_WITH_COORDINATES = r"/content/Output/roads_with_coordinates.shp"


In [None]:
# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs("Output/Merged Output", exist_ok=True)
os.makedirs("Output/Split Output", exist_ok=True)
os.makedirs("Output/Split Output/Split Shapefile", exist_ok=True)
os.makedirs("Output/Split Output/Split Shapefile/Split Output", exist_ok=True)
os.makedirs("Output/Final Output", exist_ok=True)

In [None]:
import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString
from pyproj import Transformer

# Function to convert EPSG:2229 coordinates to EPSG:4326 (WGS84)
def transform_coordinates(line, transformer):
    return LineString([transformer.transform(x, y) for x, y in line.coords])

def read_shapefile_to_dataframe(shapefile_path):
    gdf = gpd.read_file(shapefile_path)

    transformer = Transformer.from_crs("EPSG:2229", "EPSG:4326", always_xy=True)
    gdf['geometry'] = gdf['geometry'].apply(lambda geom: transform_coordinates(geom, transformer))

    return gdf

# Function to extract required information from a LineString
def extract_line_info(line):
    begin_latitude, begin_longitude = line.coords[0][1], line.coords[0][0]
    end_latitude, end_longitude = line.coords[-1][1], line.coords[-1][0]
    middle_points = [(y, x) for x, y in line.coords[1:-1]]
    middle_points_str = "; ".join([f"{lat},{lon}" for lat, lon in middle_points])

    return begin_latitude, begin_longitude, end_latitude, end_longitude, middle_points_str


gdf = read_shapefile_to_dataframe(INPUT_SHAPE_FILE)
gdf['BEGIN LATITUDE'], gdf['BEGIN LONGITUDE'], gdf['END LATITUDE'], gdf['END LONGITUDE'], gdf['MIDDLE POINTS'] = zip(*gdf['geometry'].apply(extract_line_info))
df_shapefile = pd.DataFrame(gdf)
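
The cell above flattens each line into begin and end columns plus a "MIDDLE POINTS" string of `lat,lon` pairs separated by `; `. A minimal sketch of that layout follows. It uses plain `(lon, lat)` tuples instead of a shapely `LineString` so that it runs without geospatial packages. The name `flatten_line` and the sample coordinates are made up for illustration.

```python
def flatten_line(coords):
    """coords: list of (lon, lat) tuples in line order, with at least two points."""
    begin_lon, begin_lat = coords[0]
    end_lon, end_lat = coords[-1]
    # Interior vertices are written latitude-first, matching the notebook's columns
    middle = "; ".join(f"{lat},{lon}" for lon, lat in coords[1:-1])
    return begin_lat, begin_lon, end_lat, end_lon, middle

sample = [(-118.08, 33.83), (-118.079, 33.831), (-118.078, 33.832)]
print(flatten_line(sample))
```

Note the order swap: geometries store longitude first, while the flattened columns store latitude first.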

In [None]:
df_shapefile.to_excel(OUTPUT_SHAPEFILE_DATA_EXCEL_PATH, index=False)

print(f"DataFrame has been saved to {OUTPUT_SHAPEFILE_DATA_EXCEL_PATH}")


# Convert DataFrame back to GeoDataFrame
gdf_shapefile = gpd.GeoDataFrame(df_shapefile, geometry='geometry', crs='EPSG:4326')

gdf_shapefile.to_file(OUTPUT_SHAPE_FILE_WITH_COORDINATES)

print(f"Final shapefile has been saved to {OUTPUT_SHAPE_FILE_WITH_COORDINATES}")



DataFrame has been saved to /content/Output/coordinates.xlsx
Final shapefile has been saved to /content/Output/roads_with_coordinates.shp


### Merge Segments Based on Street Names

In [None]:
def get_distance(current_location, target_location):
    return int(geopy.distance.geodesic(current_location, target_location).feet)

def sort_coordinates(ref_point, Coordinates):
    # Parse the "lat,lon; lat,lon; ..." string into a list of (lat, lon) tuples
    Coordinates = [tuple(map(float, coord.split(','))) for coord in Coordinates.split(';')]

    DistFromBegin = haversine(Coordinates[0], ref_point, unit=Unit.FEET)
    DistFromEnd = haversine(Coordinates[-1], ref_point, unit=Unit.FEET)

    # Reverse the list when its end is the side closer to the reference point
    MinDist = min(DistFromBegin, DistFromEnd)
    if MinDist == DistFromEnd:
        Coordinates.reverse()
    # The last point of the oriented list becomes the new end point
    End_Lat, End_Long = Coordinates[-1][0], Coordinates[-1][1]

    Coordinates = Coordinates[0:-1]

    # Serialize the remaining points back to "lat,lon; lat,lon; ..."
    L1_mid = '; '.join([f"{lat},{lon}" for lat, lon in Coordinates])
    return L1_mid, End_Lat, End_Long

def sort_df(dup_id):
    """Order same-name segments by distance from row 0: farthest first, row 0 last."""
    dup_id = dup_id.reset_index(drop=True)
    distance_dict = {}

    P1BegCoords = (float(dup_id.loc[0]['BEGIN LATITUDE']), float(dup_id.loc[0]['BEGIN LONGITUDE']))
    P1EndCoords = (float(dup_id.loc[0]['END LATITUDE']), float(dup_id.loc[0]['END LONGITUDE']))
    for _index, row in dup_id.iterrows():
        if _index == 0:
            continue
        dist = []
        P2BegCoords = (float(row['BEGIN LATITUDE']), float(row['BEGIN LONGITUDE']))
        P2EndCoords = (float(row['END LATITUDE']), float(row['END LONGITUDE']))
        dist.append(int(haversine(P2BegCoords, P1BegCoords, unit=Unit.FEET)))
        dist.append(int(haversine(P2BegCoords, P1EndCoords, unit=Unit.FEET)))
        dist.append(int(haversine(P2EndCoords, P1BegCoords, unit=Unit.FEET)))
        dist.append(int(haversine(P2EndCoords, P1EndCoords, unit=Unit.FEET)))
        # Keep the smallest endpoint-to-endpoint distance, in whole feet
        distance_dict[_index] = min(dist)

    # Sort the row indexes by distance in descending order
    sorted_dict = dict(sorted(distance_dict.items(), key=lambda item: item[1], reverse=True))
    new_index = list(sorted_dict.keys())
    # The reference row goes last
    new_index.append(0)

    dup_id = dup_id.reindex(new_index)
    dup_id = dup_id.reset_index(drop=True)
    return dup_id

def merge_segments(df):
    """Merge rows that share a 'FullName' into one row per street name (df is modified in place)."""
    try:
        dup_df = pd.concat(g for _, g in df.groupby("FullName") if len(g) > 1)
    except ValueError:
        # pd.concat raises ValueError when no street name occurs more than once
        return df
    Repeat_ids = dup_df['FullName'].unique()
    for street_name in Repeat_ids:
        dup_id = dup_df.loc[dup_df['FullName']==street_name]
        original_indexes = dup_df.loc[dup_df['FullName']==street_name]

        dup_id = dup_id.reset_index(drop=True)
        for i in range(len(dup_id)-1,0,-1):
            if len(dup_id) > 2:
                dup_id = sort_df(dup_id)
            # Naming used below:
            #   L0B / L0E = begin / end point of line 0 (row i)
            #   L1B / L1E = begin / end point of line 1 (row i-1)
            # The loop runs from the bottom of the DataFrame upward,
            # so line 0 sits below line 1 in the DataFrame.
            L0B = tuple(float(x) for x in dup_id.iloc[i][['BEGIN LATITUDE', 'BEGIN LONGITUDE']])
            L0E = tuple(float(x) for x in dup_id.iloc[i][['END LATITUDE', 'END LONGITUDE']])
            L1B = tuple(float(x) for x in dup_id.iloc[i-1][['BEGIN LATITUDE', 'BEGIN LONGITUDE']])
            L1E = tuple(float(x) for x in dup_id.iloc[i-1][['END LATITUDE', 'END LONGITUDE']])

            D_L0B_L1B = get_distance(L0B,L1B)
            D_L0B_L1E = get_distance(L0B,L1E)
            D_L0E_L1B = get_distance(L0E,L1B)
            D_L0E_L1E = get_distance(L0E,L1E)

            min_dist = min(D_L0B_L1B,D_L0B_L1E,D_L0E_L1B,D_L0E_L1E)

            if D_L0E_L1E == min_dist:
                Beg_Lat, Beg_Long = L0B
                """
                Combine the Middle points
                """

                ref_point = L0E

                L1_coords = str(L1B[0])+","+str(L1B[1])+"; "
                if dup_id.iloc[i-1]['MIDDLE POINTS'] !="":
                    L1_coords += dup_id.iloc[i-1]['MIDDLE POINTS']+"; "
                L1_coords += str(L1E[0])+","+str(L1E[1])

                L1_mid, End_Lat, End_Long = sort_coordinates(ref_point,L1_coords)

                if dup_id.iloc[i]['MIDDLE POINTS'] != "":
                    Mid_points = dup_id.iloc[i]['MIDDLE POINTS'] + "; "
                else:
                    Mid_points = ""

                Mid_points += str(L0E[0]) + "," + str(L0E[1])
                if L1_mid !="":
                    Mid_points +="; " + L1_mid

            elif D_L0E_L1B == min_dist:
                Beg_Lat, Beg_Long = L0B

                """
                Combine the Middle points
                """
                ref_point = L0E
                L1_coords = str(L1B[0])+","+str(L1B[1])+"; "
                if dup_id.iloc[i-1]['MIDDLE POINTS'] !="":
                    L1_coords += dup_id.iloc[i-1]['MIDDLE POINTS']+"; "
                L1_coords += str(L1E[0])+","+str(L1E[1])

                L1_mid, End_Lat, End_Long = sort_coordinates(ref_point,L1_coords)

                if dup_id.iloc[i]['MIDDLE POINTS'] != "":
                    Mid_points = dup_id.iloc[i]['MIDDLE POINTS'] + "; "
                else:
                    Mid_points = ""

                Mid_points += str(L0E[0]) + "," + str(L0E[1])

                if L1_mid !="":
                    Mid_points +="; " + L1_mid


            elif D_L0B_L1E == min_dist:
                Beg_Lat, Beg_Long = L0E

                """
                Combine the Middle points
                """
                if dup_id.iloc[i]['MIDDLE POINTS'] !="":
                    L0_mid = dup_id.iloc[i]['MIDDLE POINTS']
                    L0_mid = "; ".join(L0_mid.split("; ")[::-1])
                    Mid_points =  L0_mid  + "; "
                else:
                    Mid_points = ""
                ref_point = L0B

                L1_coords = str(L1B[0])+","+str(L1B[1])+"; "
                if dup_id.iloc[i-1]['MIDDLE POINTS'] !="":
                    L1_coords += dup_id.iloc[i-1]['MIDDLE POINTS']+"; "
                L1_coords += str(L1E[0])+","+str(L1E[1])

                L1_mid, End_Lat, End_Long = sort_coordinates(ref_point,L1_coords)

                Mid_points += str(L0B[0]) + "," + str(L0B[1])

                if L1_mid !="":
                    Mid_points +="; " + L1_mid


            elif D_L0B_L1B == min_dist:
                Beg_Lat, Beg_Long = L0E

                """
                Combine the Middle points
                """
                if  dup_id.iloc[i]['MIDDLE POINTS'] !="":
                    L0_mid = dup_id.iloc[i]['MIDDLE POINTS']
                    L0_mid = "; ".join(L0_mid.split("; ")[::-1])
                    Mid_points =  L0_mid  + "; "
                else:
                    Mid_points = ""

                ref_point = L0B

                L1_coords = str(L1B[0])+","+str(L1B[1])+"; "
                if dup_id.iloc[i-1]['MIDDLE POINTS'] !="":
                    L1_coords += dup_id.iloc[i-1]['MIDDLE POINTS']+"; "
                L1_coords += str(L1E[0])+","+str(L1E[1])

                L1_mid, End_Lat, End_Long = sort_coordinates(ref_point,L1_coords)

                Mid_points += str(L0B[0]) + "," + str(L0B[1])

                if L1_mid !="":
                    Mid_points +="; " + L1_mid

            #update duplicate id Dataframe
            dup_id.loc[i-1,'BEGIN LATITUDE'] = Beg_Lat
            dup_id.loc[i-1,'BEGIN LONGITUDE'] = Beg_Long
            dup_id.loc[i-1,'END LATITUDE'] = End_Lat
            dup_id.loc[i-1,'END LONGITUDE'] = End_Long
            dup_id.loc[i-1,'MIDDLE POINTS'] = Mid_points

            df.drop(original_indexes.index[i],inplace=True)
            dup_id.drop(i,inplace=True)

        df.loc[original_indexes.index[0],'BEGIN LATITUDE'] = Beg_Lat
        df.loc[original_indexes.index[0],'BEGIN LONGITUDE'] = Beg_Long
        df.loc[original_indexes.index[0],'END LATITUDE'] = End_Lat
        df.loc[original_indexes.index[0],'END LONGITUDE'] = End_Long
        df.loc[original_indexes.index[0],'MIDDLE POINTS'] = Mid_points

        if Mid_points != "":
            all_coords = str(Beg_Lat)+","+str(Beg_Long)+"; "+Mid_points+"; "+str(End_Lat)+","+str(End_Long)
        else:
            all_coords = str(Beg_Lat)+","+str(Beg_Long)+"; "+str(End_Lat)+","+str(End_Long)
        coordinates = [tuple(map(float, coord.split(','))) for coord in all_coords.split("; ")]
        # Total length of the merged line in feet (computed but not stored on the row)
        Seg_len = 0
        for i in range(1, len(coordinates)):
            Seg_len += get_distance(coordinates[i-1], coordinates[i])
    return df
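
The core idea of `merge_segments` is to compare the four endpoint-to-endpoint distances between two segments and join them where they are closest. The following is a simplified sketch of that idea only, not the notebook's exact bookkeeping. It works on plain lists of `(lat, lon)` tuples and computes the haversine formula directly. The names `haversine_ft` and `join_segments` are hypothetical, and the Earth radius is an approximation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_FT = 20_902_231  # mean Earth radius of about 6371 km, in feet (approximate)

def haversine_ft(p, q):
    """Great-circle distance in feet between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_FT * asin(sqrt(a))

def join_segments(seg_a, seg_b):
    """Orient two polylines so their closest endpoints meet, then concatenate them."""
    options = {
        ("end", "begin"): haversine_ft(seg_a[-1], seg_b[0]),
        ("end", "end"): haversine_ft(seg_a[-1], seg_b[-1]),
        ("begin", "begin"): haversine_ft(seg_a[0], seg_b[0]),
        ("begin", "end"): haversine_ft(seg_a[0], seg_b[-1]),
    }
    a_side, b_side = min(options, key=options.get)
    if a_side == "begin":   # flip A so its touching endpoint comes last
        seg_a = seg_a[::-1]
    if b_side == "end":     # flip B so its touching endpoint comes first
        seg_b = seg_b[::-1]
    return seg_a + seg_b

# Two pieces of one street; the second piece is digitized in the opposite direction.
piece_a = [(33.830, -118.080), (33.831, -118.080)]
piece_b = [(33.833, -118.080), (33.832, -118.080)]
print(join_segments(piece_a, piece_b))
```

The joined list runs continuously from one end of the street to the other, regardless of the direction in which each piece was originally drawn.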

In [None]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
import simplekml

# Input and output paths
INPUT_SHAPE_FILE = r"/content/County Centerline/HawaiianGardens-CAMS-0619224.shp"
OUTPUT_EXCEL_MERGED_SEGMENTS = r"/content/Output/Merged Output/output_excel_merged_segments.xlsx"
OUTPUT_SHAPEFILE_MERGED_SEGMENTS = r"/content/Output/Merged Output/Merged Outputoutput_shapefile_merged_segments.shp"
OUTPUT_KMZ_MERGED_SEGMENTS = r"/content/Output/Merged Output/output_merged_segments.kmz"

# Merge segments (df_shapefile and merge_segments are defined in the cells above)
merged_seg = merge_segments(df_shapefile)

# Rename the column
merged_seg = merged_seg.rename(columns={'FullName': 'STREET NAME'})

# Convert STREET NAME to uppercase
merged_seg['STREET NAME'] = merged_seg['STREET NAME'].str.upper()

# Save merged segments to Excel
merged_seg.to_excel(OUTPUT_EXCEL_MERGED_SEGMENTS, index=False)
print(f"Saved merged segments to {OUTPUT_EXCEL_MERGED_SEGMENTS}")

# Rebuild each geometry from the merged coordinate columns. The 'geometry' column
# carried over from the input still holds a single original segment per street.
def build_merged_line(row):
    points = [(row['BEGIN LONGITUDE'], row['BEGIN LATITUDE'])]
    if row['MIDDLE POINTS']:
        for point in row['MIDDLE POINTS'].split('; '):
            lat, lon = map(float, point.split(','))
            points.append((lon, lat))
    points.append((row['END LONGITUDE'], row['END LATITUDE']))
    return LineString(points)

merged_seg['geometry'] = merged_seg.apply(build_merged_line, axis=1)

# Convert DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(merged_seg, geometry='geometry', crs='EPSG:4326')

# Save the GeoDataFrame as a shapefile
gdf.to_file(OUTPUT_SHAPEFILE_MERGED_SEGMENTS, driver='ESRI Shapefile')
print(f"Saved GeoDataFrame to shapefile at {OUTPUT_SHAPEFILE_MERGED_SEGMENTS}")

# Save the GeoDataFrame as a KMZ file
kml = simplekml.Kml()

for index, row in gdf.iterrows():
    # Use every vertex of the merged line, in (lon, lat) order, not just its endpoints
    line = kml.newlinestring(name=row['STREET NAME'],
                             coords=list(row['geometry'].coords))
    line.style.linestyle.color = simplekml.Color.red  # Change color as needed
    line.style.linestyle.width = 2  # Change width as needed

    # Add extended data to the placemark
    for column_name, value in row.items():
        if column_name != 'geometry':
            line.extendeddata.schemadata.newsimpledata(column_name, str(value))

kml.savekmz(OUTPUT_KMZ_MERGED_SEGMENTS)
print(f"KMZ file has been saved to {OUTPUT_KMZ_MERGED_SEGMENTS}")




Saved merged segments to /content/Output/Merged Output/output_excel_merged_segments.xlsx
Saved GeoDataFrame to shapefile at /content/Output/Merged Output/Merged Outputoutput_shapefile_merged_segments.shp
KMZ file has been saved to /content/Output/Merged Output/output_merged_segments.kmz


### Split the segments

In [None]:
import pandas as pd

# Path to the Excel file
SECTION_XLSX = r"/content/County Centerline/Hawaiian Gardens Section List.xlsx"

# Read the Excel file
df_section = pd.read_excel(SECTION_XLSX)

# Output the DataFrame to a new Excel file to inspect it
output_path = r"/content/Output/Split Output/df_section_inspect.xlsx"
df_section.to_excel(output_path, index=False)

print(f"df_section has been saved to {output_path}")


df_section has been saved to /content/Output/Split Output/df_section_inspect.xlsx
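
The split step that follows reads a fixed set of columns from every section-list row, so a quick check of the header can catch a mismatched spreadsheet early. This is a minimal sketch. The helper `missing_columns` is hypothetical, and the column list is taken from the names the split loop accesses.

```python
# Column names the split loop reads from each section-list row
REQUIRED_COLUMNS = [
    "StreetID", "SectionID", "Name", "From", "To",
    "Functional Class", "Surface Type", "Lanes", "Length", "Width", "Area",
]

def missing_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = {str(col).strip() for col in columns}
    return [name for name in REQUIRED_COLUMNS if name not in present]

# Example with a header row that lacks "Width" and "Area"
header = ["StreetID", "SectionID", "Name", "From", "To",
          "Functional Class", "Surface Type", "Lanes", "Length"]
print(missing_columns(header))
```

In the notebook this could be called as `missing_columns(df_section.columns)` before the split loop starts.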


In [None]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString, Point, MultiPoint
import numpy as np
from thefuzz import process
import geopy.distance
import simplekml
import os
import logging
import shutil

# Set up logging
logging.basicConfig(filename='processing_log.txt', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

def preprocess_street_name(name):
    if name is None:
        return ''
    name = str(name).upper()
    name = name.replace('STREET', 'ST').replace('AVENUE', 'AVE').replace('BOULEVARD', 'BLVD')
    return ' '.join(word for word in name.split() if word not in ['THE', 'OF', 'AND'])

def normalize_street_name(name):
    if name is None:
        return ''
    return ''.join([c for c in str(name) if not c.isdigit()]).strip()

def get_partial_match(query, choices, threshold=70):
    if not query:
        return None
    query = preprocess_street_name(normalize_street_name(query))
    processed_choices = {preprocess_street_name(normalize_street_name(k)): k for k in choices if k is not None}

    direction_suffixes = ['NB', 'SB', 'EB', 'WB']
    query_base = query
    for suffix in direction_suffixes:
        if query.endswith(suffix):
            query_base = query[:-len(suffix)].strip()
            break

    if query in processed_choices:
        return processed_choices[query]

    for processed, original in processed_choices.items():
        if query_base in processed or processed in query_base:
            return original

    matches = process.extractBests(query, processed_choices.keys(), score_cutoff=threshold)
    if matches:
        return processed_choices[matches[0][0]]

    return None

def create_linestring(row):
    points = [(row['BEGIN LONGITUDE'], row['BEGIN LATITUDE'])]
    if row['MIDDLE POINTS']:
        for point in row['MIDDLE POINTS'].split('; '):
            lat, lon = map(float, point.split(','))
            points.append((lon, lat))
    points.append((row['END LONGITUDE'], row['END LATITUDE']))
    return LineString(points)

def get_point_distance(point1, point2):
    # Geodesic distance in feet between two shapely Points stored as (lon, lat).
    # Named differently from the tuple-based get_distance() that merge_segments() calls.
    point1_latlon = (point1.y, point1.x)
    point2_latlon = (point2.y, point2.x)
    return geopy.distance.geodesic(point1_latlon, point2_latlon).feet

def find_intersection(line1, line2, buffer_distance=1e-4):
    # Buffer both lines (distance in degrees) so near misses still overlap,
    # then use the centroid of the overlap as the intersection point
    if line1.is_empty or line2.is_empty:
        return None
    buffered_line1 = line1.buffer(buffer_distance)
    buffered_line2 = line2.buffer(buffer_distance)
    intersection = buffered_line1.intersection(buffered_line2)
    if intersection.is_empty:
        return None
    return intersection.centroid

def split_line_between_points(line, from_point, to_point):
    if line.is_empty:
        return None
    if line.project(from_point) > line.project(to_point):
        from_point, to_point = to_point, from_point
    coords = list(line.coords)
    # Snap both points to their nearest vertices; sorting keeps the slice from being empty
    from_index, to_index = sorted((find_nearest_index(coords, from_point),
                                   find_nearest_index(coords, to_point)))
    if to_index - from_index < 2:
        logging.warning(f"Not enough coordinates to split line: from {from_point} to {to_point}")
        return line
    split_coords = coords[from_index:to_index + 1]
    return LineString(split_coords)

def find_nearest_index(coords, point):
    # Index of the vertex in coords that lies closest to point
    distances = np.array([get_point_distance(Point(lon, lat), point) for lon, lat in coords])
    return int(np.argmin(distances))

# Manual corrections dictionary
manual_corrections = {
    "211TH ST": "211TH", "212TH ST": "212TH", "213TH ST": "213TH", "214TH ST": "214TH",
    "215TH ST": "215TH", "216TH ST": "216TH", "221ST ST": "221ST", "222ND ST": "222ND",
    "223RD ST": "223RD", "224TH ST": "224TH", "226TH ST": "226TH",
    "ALLEY E/ NORWALK BLVD": "ALY010", "ALLEY S/ CARSON ST": "ALY030", "ALLEY W/ NORWALK BLVD": "ALY020",
    "ALLEY S/ 214TH ST": "ALY060", "ALLEY S/ 215TH ST": "ALY050", "ALLEY S/ 216TH ST": "ALY040",
    "ARLINE AVE": "ARLIN", "BELSHIRE AVE": "BELSH", "BLOOMFIELD AVE": "BLOOM",
    "BRITTAIN ST": "BRITT", "CANADA DR": "CANAD", "CARSON ST": "CARSN",
    "CIVIC CENTER DR": "CIVIC", "CLARETTA AVE": "CLARE", "CLARKDALE AVE": "CLARK",
    "CORTNER AVE": "CORTN", "DEVLIN AVE": "DEVLI", "ELAINE AVE": "ELAIN",
    "FARLOW ST": "FARLO", "FUNSTON AVE": "FUNST", "HAWAIIAN AVE": "HAWAI",
    "HORST AVE": "HORST", "IBEX AVE": "IBEX", "JOLIET AVE": "JOLIE",
    "JUAN AVE": "JUAN", "NORWALK BLVD": "NORWLK", "PIONEER BLVD": "PIONE",
    "SCHULTZE DR": "SCHUL", "SEINE AVE": "SEINE", "TILBURY ST": "TILBU",
    "VERNE AVE": "VERNE", "VIOLETA AVE": "VIOLE", "WARDHAM AVE": "WARDH",
    "NORWALK CHANNEL": "NORWLK", "CITY LIMIT": "CITY LIMITS", "END": "END",
    "CENTRALIA": "CENTRALIA ST"
}

# Read input files
input_shapefile = r"/content/Output/roads_with_coordinates.shp"
input_gdf = gpd.read_file(input_shapefile)
input_crs = input_gdf.crs

# Process and merge segments
merged_seg = merge_segments(df_shapefile)
merged_seg = merged_seg.rename(columns={'FullName': 'STREET NAME'})
merged_seg['STREET NAME'] = merged_seg['STREET NAME'].str.upper()
merged_seg['geometry'] = merged_seg.apply(create_linestring, axis=1)
gdf_lines = gpd.GeoDataFrame(merged_seg, geometry='geometry', crs=input_crs)

street_lines = {row['STREET NAME']: row['geometry'] for idx, row in gdf_lines.iterrows() if row['STREET NAME'] is not None}

# Initialize DataFrame for split lines
all_split_lines_df = pd.DataFrame(columns=['StreetID', 'SectionID', 'STREET NAME', 'FROM', 'TO',
                                           'Functional Class', 'Surface Type', 'Lanes', 'Length', 'Width', 'Area', 'geometry'])

skipped_rows = []

# Initialize counters
total_rows = len(df_section)
processed_rows = 0
skipped_rows_count = 0

# Process each section
for index, row in df_section.iterrows():
    logging.info(f"Processing row {index}: {row['Name']} from {row['From']} to {row['To']}")

    main_line_name = manual_corrections.get(row["Name"], row["Name"])
    from_street = manual_corrections.get(row['From'], row['From'])
    to_street = manual_corrections.get(row['To'], row['To'])

    direction_suffix = ''
    for suffix in ['NB', 'SB', 'EB', 'WB']:
        if main_line_name.endswith(suffix):
            direction_suffix = suffix
            main_line_name = main_line_name[:-len(suffix)].strip()
            break

    main_line_geometry = get_partial_match(main_line_name, street_lines.keys())
    from_line_geometry = get_partial_match(from_street, street_lines.keys())
    to_line_geometry = get_partial_match(to_street, street_lines.keys())

    if main_line_geometry and from_line_geometry and to_line_geometry:
        main_line_geometry = street_lines[main_line_geometry]
        from_line_geometry = street_lines[from_line_geometry]
        to_line_geometry = street_lines[to_line_geometry]

        if main_line_geometry.is_empty or from_line_geometry.is_empty or to_line_geometry.is_empty:
            logging.warning(f"Empty geometry found for {main_line_name}{direction_suffix}, using full line.")
            temp_split_lines_df = pd.DataFrame({
                'StreetID': [row['StreetID']],
                'SectionID': [row['SectionID']],
                'STREET NAME': [f"{main_line_name}{direction_suffix}"],
                'FROM': [from_street],
                'TO': [to_street],
                'Functional Class': [row['Functional Class']],
                'Surface Type': [row['Surface Type']],
                'Lanes': [row['Lanes']],
                'Length': [row['Length']],
                'Width': [row['Width']],
                'Area': [row['Area']],
                'geometry': [main_line_geometry]
            })
            all_split_lines_df = pd.concat([all_split_lines_df, temp_split_lines_df], ignore_index=True)
            processed_rows += 1
            continue

        from_intersection = find_intersection(main_line_geometry, from_line_geometry)
        to_intersection = find_intersection(main_line_geometry, to_line_geometry)

        logging.info(f"From Intersection: {from_intersection}")
        logging.info(f"To Intersection: {to_intersection}")

        if from_intersection and to_intersection:
            try:
                split_line = split_line_between_points(main_line_geometry, from_intersection, to_intersection)

                temp_split_lines_df = pd.DataFrame({
                    'StreetID': [row['StreetID']],
                    'SectionID': [row['SectionID']],
                    'STREET NAME': [f"{main_line_name}{direction_suffix}"],
                    'FROM': [from_street],
                    'TO': [to_street],
                    'Functional Class': [row['Functional Class']],
                    'Surface Type': [row['Surface Type']],
                    'Lanes': [row['Lanes']],
                    'Length': [row['Length']],
                    'Width': [row['Width']],
                    'Area': [row['Area']],
                    'geometry': [split_line]
                })
                all_split_lines_df = pd.concat([all_split_lines_df, temp_split_lines_df], ignore_index=True)
                processed_rows += 1
            except Exception as e:
                logging.error(f"Error splitting line for {main_line_name}{direction_suffix}: {str(e)}")
                skipped_rows_count += 1
                skipped_rows.append({'STREET NAME': f"{main_line_name}{direction_suffix}", 'FROM': from_street, 'TO': to_street, 'Reason': f'Error splitting line: {str(e)}'})
        else:
            logging.warning(f"Not enough intersection points found for {main_line_name}{direction_suffix}, using full line.")
            temp_split_lines_df = pd.DataFrame({
                'StreetID': [row['StreetID']],
                'SectionID': [row['SectionID']],
                'STREET NAME': [f"{main_line_name}{direction_suffix}"],
                'FROM': [from_street],
                'TO': [to_street],
                'Functional Class': [row['Functional Class']],
                'Surface Type': [row['Surface Type']],
                'Lanes': [row['Lanes']],
                'Length': [row['Length']],
                'Width': [row['Width']],
                'Area': [row['Area']],
                'geometry': [main_line_geometry]
            })
            all_split_lines_df = pd.concat([all_split_lines_df, temp_split_lines_df], ignore_index=True)
            processed_rows += 1
    else:
        logging.error(f"Unrecognized street name(s): {main_line_name}{direction_suffix}, {from_street}, {to_street}. Skipping.")
        skipped_rows_count += 1
        skipped_rows.append({'STREET NAME': f"{main_line_name}{direction_suffix}", 'FROM': from_street, 'TO': to_street, 'Reason': 'Unrecognized street name'})

# Log the results after processing
logging.info(f"Total rows: {total_rows}")
logging.info(f"Processed rows: {processed_rows}")
logging.info(f"Skipped rows: {skipped_rows_count}")

# Log the DataFrame information at various stages
logging.info(f"Rows in all_split_lines_df: {len(all_split_lines_df)}")

# Create 'StreetID - SectionID' column
all_split_lines_df['StreetID - SectionID'] = all_split_lines_df['StreetID'].astype(str) + ' - ' + all_split_lines_df['SectionID'].astype(str)

# Reorder columns
column_order = ['StreetID', 'SectionID', 'StreetID - SectionID', 'STREET NAME', 'FROM', 'TO',
                'Functional Class', 'Surface Type', 'Lanes', 'Length', 'Width', 'Area', 'geometry']
all_split_lines_df = all_split_lines_df[column_order]

# Create GeoDataFrame and save as shapefile
output_directory = r"/content/Output/Split Output/Split Shapefile"
os.makedirs(output_directory, exist_ok=True)
OUTPUT_MERGED_SHAPEFILE = os.path.join(output_directory, "output_shapefile_with_attributes.shp")
all_split_lines_gdf = gpd.GeoDataFrame(all_split_lines_df, geometry='geometry', crs=input_crs)
all_split_lines_gdf.to_file(OUTPUT_MERGED_SHAPEFILE, driver='ESRI Shapefile')
logging.info(f"Shapefile with reordered columns has been saved to {OUTPUT_MERGED_SHAPEFILE}")

# Create KMZ file
OUTPUT_KMZ_MERGED_SEGMENTS = os.path.join(output_directory, "output_merged_segments.kmz")
kml = simplekml.Kml()

for idx, row in all_split_lines_gdf.iterrows():
    if row['geometry'] is None or row['geometry'].is_empty:
        logging.warning(f"Skipping row {idx} with None or empty geometry.")
        skipped_rows.append({'STREET NAME': row['STREET NAME'], 'FROM': row['FROM'], 'TO': row['TO'], 'Reason': 'None or empty geometry'})
        continue

    linestring = kml.newlinestring(name=row['STREET NAME'])
    linestring.coords = list(row['geometry'].coords)

    for column_name, value in row.items():
        if column_name != 'geometry':
            linestring.extendeddata.schemadata.newsimpledata(column_name, str(value))

kml.savekmz(OUTPUT_KMZ_MERGED_SEGMENTS)
logging.info(f"KMZ file has been saved to {OUTPUT_KMZ_MERGED_SEGMENTS}")

# Save skipped rows
skipped_rows_df = pd.DataFrame(skipped_rows)
OUTPUT_SKIPPED_ROWS_EXCEL = os.path.join(output_directory, "skipped_rows.xlsx")
skipped_rows_df.to_excel(OUTPUT_SKIPPED_ROWS_EXCEL, index=False)
logging.info(f"Skipped rows have been saved to {OUTPUT_SKIPPED_ROWS_EXCEL}")

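# Clean the output: drop rows with missing geometry
non_null_df = all_split_lines_df[all_split_lines_df['geometry'].notna()]
logging.info(f"Rows after removing null geometries: {len(non_null_df)}")

# Drop rows whose geometry is empty (tested with is_empty on each geometry)
is_empty_geometry = non_null_df['geometry'].apply(lambda geom: geom.is_empty).astype(bool)
final_df = non_null_df[~is_empty_geometry]
logging.info(f"Rows in final_df: {len(final_df)}")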

# Save final output
final_output_directory = "/content/Output/Final Output"
os.makedirs(final_output_directory, exist_ok=True)
output_shapefile_path = os.path.join(final_output_directory, "output_shape.shp")
output_kmz_path = os.path.join(final_output_directory, "output_shape.kmz")

# Save the final GeoDataFrame as a shapefile
final_gdf = gpd.GeoDataFrame(final_df, geometry='geometry', crs='EPSG:4326')
logging.info(f"Rows in final GeoDataFrame: {len(final_gdf)}")

final_gdf.to_file(output_shapefile_path)
logging.info(f"Final shapefile saved with {len(final_gdf)} rows")

# Additional check for duplicate geometries
duplicate_geometries = final_gdf[final_gdf.duplicated(subset='geometry', keep=False)]
logging.info(f"Number of rows with duplicate geometries: {len(duplicate_geometries)}")
if len(duplicate_geometries) > 0:
    logging.warning("Duplicate geometries found. This may result in fewer rows in the final shapefile.")
    logging.warning(duplicate_geometries[['STREET NAME', 'FROM', 'TO']].to_string())

# Create a final KMZ file
final_kml = simplekml.Kml()

for idx, row in final_gdf.iterrows():
    if row['geometry'] is None or row['geometry'].is_empty:
        logging.warning(f"Skipping row {idx} in final output with None or empty geometry.")
        continue

    linestring = final_kml.newlinestring(name=row['STREET NAME'])
    linestring.coords = list(row['geometry'].coords)

    for column_name, value in row.items():
        if column_name != 'geometry':
            linestring.extendeddata.schemadata.newsimpledata(column_name, str(value))

# Save the final KMZ file
final_kml.savekmz(output_kmz_path)
logging.info(f"Final KMZ file has been saved to {output_kmz_path}")

# Create a zip file of the entire Output folder
# make_archive returns the path of the archive it wrote ("/content/Output.zip")
output_zip_path = shutil.make_archive("/content/Output", 'zip', "/content/Output")
logging.info(f"All output has been zipped to {output_zip_path}")

# Print summary
logging.info("\nSummary:")
logging.info(f"Total rows in input: {total_rows}")
logging.info(f"Successfully processed rows: {len(final_df)}")
logging.info(f"Skipped rows: {len(skipped_rows)}")
logging.info(f"\nCheck {OUTPUT_SKIPPED_ROWS_EXCEL} for details on skipped rows.")
logging.info(f"\nFinal output is available in {final_output_directory}")
logging.info(f"All output files are zipped in {output_zip_path}")

print("\nProcessing complete. Check processing_log.txt for detailed information.")

  return lib.line_locate_point(line, other)
ERROR:root:Unrecognized street name(s): 215TH, BELSHIRE, CITY LIMITS. Skipping.
ERROR:root:Unrecognized street name(s): ALY010, BRITTAIN, 224TH. Skipping.
ERROR:root:Unrecognized street name(s): ALY010, 224TH, 223RD. Skipping.
ERROR:root:Unrecognized street name(s): ALY010, 223RD, 222ND. Skipping.
ERROR:root:Unrecognized street name(s): ALY010, 222ND, 221ST. Skipping.
ERROR:root:Unrecognized street name(s): ALY010, TILBURY, 216TH. Skipping.
ERROR:root:Unrecognized street name(s): ALY010, 216TH, 215TH. Skipping.
ERROR:root:Unrecognized street name(s): ALY060, HORST, ALLEY W/ NORWALK. Skipping.
ERROR:root:Unrecognized street name(s): ALY050, HORST, ALLEY W/ NORWALK. Skipping.
ERROR:root:Unrecognized street name(s): ALY040, HORST, ALLEY W/ NORWALK. Skipping.
ERROR:root:Unrecognized street name(s): ALY030, ARLINE, CLARKDALE. Skipping.
ERROR:root:Unrecognized street name(s): ALY030, DEVLIN, ELAINE. Skipping.
ERROR:root:Unrecognized street name(s):


Processing complete. Check processing_log.txt for detailed information.
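
A quick way to see why rows were skipped, without opening `skipped_rows.xlsx`, is to tally the `Reason` field of the `skipped_rows` records. The sketch below assumes each record is a dict with a `Reason` key, as in the KMZ loop above. The helper name `tally_skip_reasons`, the sample records, and the second reason string are hypothetical and only illustrate the shape of the data.

```python
from collections import Counter


def tally_skip_reasons(skipped_rows):
    """Count skipped rows per 'Reason' value, most frequent first."""
    counts = Counter(row.get('Reason', 'Unknown') for row in skipped_rows)
    return counts.most_common()


# Illustrative records only (not real data), shaped like the dicts
# appended to skipped_rows in the processing cell.
example_skipped = [
    {'STREET NAME': 'EXAMPLE A', 'FROM': 'X', 'TO': 'Y', 'Reason': 'None or empty geometry'},
    {'STREET NAME': 'EXAMPLE B', 'FROM': 'X', 'TO': 'Y', 'Reason': 'Unrecognized street name'},  # hypothetical reason text
    {'STREET NAME': 'EXAMPLE C', 'FROM': 'X', 'TO': 'Y', 'Reason': 'None or empty geometry'},
]

for reason, count in tally_skip_reasons(example_skipped):
    print(f"{count:>4}  {reason}")
```

In the notebook itself, the call would be `tally_skip_reasons(skipped_rows)` after the processing cell has run.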


In [None]:
# Clean The Output
none_values = all_split_lines_df[all_split_lines_df['geometry'].notna()]
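# Drop empty geometries; comparing geometry objects to the string
# 'GEOMETRYCOLLECTION EMPTY' does not filter anything out.
final_df = none_values[~none_values['geometry'].apply(lambda geom: geom.is_empty)]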

### Add Other Related Attributes

### Save to Shapefile and KML
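
The ESRI Shapefile format stores attributes in a dBASE table whose field names are limited to 10 characters, so longer column names such as `BEGIN LONGITUDE` or `STREET NAME` are truncated when the shapefile is written. A minimal sketch of choosing the short names explicitly before saving is shown below. The helper `shorten_field_names` is hypothetical and not part of the notebook, and the underscore-and-suffix scheme is just one possible convention.

```python
def shorten_field_names(columns, max_len=10):
    """Map each column name to a unique name of at most max_len characters."""
    mapping = {}
    used = set()
    for name in columns:
        # Replace spaces so the short name is a single token, then cut to length.
        base = name.replace(' ', '_')[:max_len]
        candidate = base
        suffix = 1
        # On a collision, overwrite the tail with a running number.
        while candidate in used:
            tail = str(suffix)
            candidate = base[:max_len - len(tail)] + tail
            suffix += 1
        used.add(candidate)
        mapping[name] = candidate
    return mapping


# Column names taken from the notebook's own code.
columns = ['STREET NAME', 'FROM', 'TO',
           'BEGIN LONGITUDE', 'BEGIN LATITUDE',
           'END LONGITUDE', 'END LATITUDE']
mapping = shorten_field_names(columns)
for original, short in mapping.items():
    print(f"{original!r} -> {short!r}")
```

The mapping could then be applied with `gdf.rename(columns=mapping)` before `gdf.to_file(...)`, keeping a copy of the mapping so the short names can be traced back to the originals.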

In [None]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
import simplekml
import os

# final_df comes from the cleanup cell above. Copy it so that adding a column
# does not write into a slice of all_split_lines_df.
final_df = final_df.copy()

# If no geometry column exists, build a straight line from the begin/end coordinates
if 'geometry' not in final_df.columns:
    final_df['geometry'] = final_df.apply(lambda row: LineString([(row['BEGIN LONGITUDE'], row['BEGIN LATITUDE']),
                                                                  (row['END LONGITUDE'], row['END LATITUDE'])]), axis=1)

gdf = gpd.GeoDataFrame(final_df, geometry='geometry', crs='EPSG:4326')

# Define the output directory and ensure it exists
output_directory = "/content/Output/Final Output"
os.makedirs(output_directory, exist_ok=True)
output_shapefile_path = os.path.join(output_directory, "output_shape.shp")
output_kmz_path = os.path.join(output_directory, "output_shape.kmz")

# Save the GeoDataFrame as a shapefile
gdf.to_file(output_shapefile_path)
print(f"Shapefile has been saved to {output_shapefile_path}")

# Create a KMZ file
kml = simplekml.Kml()

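for idx, row in gdf.iterrows():
    # Skip rows without usable geometry, as the processing cell does
    if row['geometry'] is None or row['geometry'].is_empty:
        continue
    linestring = kml.newlinestring(name=row['STREET NAME'])
    linestring.coords = list(row['geometry'].coords)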

    # Add extended data to the placemark
    for column_name, value in row.items():
        if column_name != 'geometry':
            linestring.extendeddata.schemadata.newsimpledata(column_name, str(value))

# Save the KMZ file
kml.savekmz(output_kmz_path)
print(f"KMZ file has been saved to {output_kmz_path}")


  gdf.to_file(output_shapefile_path)


Shapefile has been saved to /content/Output/Final Output/output_shape.shp
KMZ file has been saved to /content/Output/Final Output/output_shape.kmz
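
The `zip -r` call in the next cell is given an absolute path, so the archive members are stored under `content/Output/...`. If paths starting at `Output/` are preferred, one option is `shutil.make_archive` with `root_dir` and `base_dir`. The sketch below runs against a temporary directory so it can be tried safely. The helper `zip_folder` and the placeholder file are hypothetical.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(folder_path, archive_base):
    """Zip folder_path so that member paths start at the folder's own name."""
    folder_path = os.path.abspath(folder_path)
    parent_dir, folder_name = os.path.split(folder_path)
    # root_dir is the directory the archive paths are relative to;
    # base_dir is the folder inside root_dir that gets archived.
    return shutil.make_archive(archive_base, 'zip',
                               root_dir=parent_dir, base_dir=folder_name)


# Demonstration in a temporary directory standing in for /content
with tempfile.TemporaryDirectory() as tmp:
    final_dir = os.path.join(tmp, "Output", "Final Output")
    os.makedirs(final_dir)
    with open(os.path.join(final_dir, "placeholder.txt"), "w") as f:
        f.write("demo")

    archive_path = zip_folder(os.path.join(tmp, "Output"),
                              os.path.join(tmp, "Output"))
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

    print(archive_path)
    print(names)
```

In the notebook, the equivalent call would be `zip_folder("/content/Output", "/content/Output")`, which writes `/content/Output.zip` next to the folder rather than inside it.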


In [None]:
!zip -r /content/Output.zip /content/Output

  adding: content/Output/ (stored 0%)
  adding: content/Output/roads_with_coordinates.cpg (stored 0%)
  adding: content/Output/Split Output/ (stored 0%)
  adding: content/Output/Split Output/Split Shapefile/ (stored 0%)
  adding: content/Output/Split Output/Split Shapefile/output_shapefile_with_attributes.cpg (stored 0%)
  adding: content/Output/Split Output/Split Shapefile/output_shapefile_with_attributes.dbf (deflated 97%)
  adding: content/Output/Split Output/Split Shapefile/skipped_rows.xlsx (deflated 10%)
  adding: content/Output/Split Output/Split Shapefile/Split Output/ (stored 0%)
  adding: content/Output/Split Output/Split Shapefile/output_shapefile_with_attributes.shx (deflated 53%)
  adding: content/Output/Split Output/Split Shapefile/output_shapefile_with_attributes.prj (deflated 17%)
  adding: content/Output/Split Output/Split Shapefile/output_merged_segments.kmz (deflated 1%)
  adding: content/Output/Split Output/Split Shapefile/output_shapefile_with_attributes.shp (defla