## Road Surface Classification: StreetSurfaceVis

Notes from the dataset documentation:

### Image Preprocessing
Apply the recommended cropping to focus on the lower and middle half of the image, as this is where the labeled road surface is located.
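
One way to read "lower and middle half" is: keep the lower half of the image height and the middle half of the image width. The sketch below computes a crop box under that assumption. The function name `lower_middle_crop_box` is hypothetical, and the exact crop geometry should be checked against the dataset documentation before use.

```python
def lower_middle_crop_box(width, height):
    """Return a (left, upper, right, lower) box for the assumed crop.

    Assumption: "lower and middle half" means the lower half of the
    height and the middle half of the width.
    """
    left = width // 4            # skip the left quarter
    right = left + width // 2    # keep the middle half of the width
    upper = height // 2          # skip the upper half
    lower = height               # keep everything down to the bottom edge
    return (left, upper, right, lower)


# Example for a 1024 x 768 image
box = lower_middle_crop_box(1024, 768)
print(box)  # (256, 384, 768, 768)

# The tuple follows Pillow's box convention, so with Pillow installed it
# can be passed directly to Image.crop, e.g.:
#   from PIL import Image
#   cropped = Image.open(image_path).crop(box)
```

Keeping the geometry in a small pure function makes it easy to reuse the same crop for training and inference.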

### Train-Test Split
Respect the original split provided in the dataset (`train=True` for training data and `train=False` for test data) to avoid data leakage and to measure how well the model generalizes to unseen regions.
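
A minimal sketch of honoring that flag with pandas is shown here. It assumes the `train` column is boolean (or can be cast to boolean). The helper name `split_by_train_flag` and the `image_id` column in the toy data are hypothetical, used only for illustration.

```python
import pandas as pd


def split_by_train_flag(frame):
    """Split a DataFrame into (train, test) parts using its `train` column."""
    is_train = frame["train"].astype(bool)
    train_part = frame[is_train].reset_index(drop=True)
    test_part = frame[~is_train].reset_index(drop=True)
    return train_part, test_part


# Toy data standing in for the real CSV
toy_split = pd.DataFrame({
    "image_id": [101, 102, 103, 104],   # hypothetical identifier column
    "train": [True, True, False, True],
})
train_df, test_df = split_by_train_flag(toy_split)

# Sanity check: no image may appear in both parts
shared_ids = set(train_df["image_id"]) & set(test_df["image_id"])
print(len(train_df), len(test_df), len(shared_ids))  # 3 1 0
```

Re-splitting the rows at random would discard the predefined separation, so any validation subset is best carved out of the training part only.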

### Data Exploration
Verify that the training and test data are geospatially distinct, as described in the dataset documentation.

In [None]:
import pandas as pd
import plotly.express as px
import os

In [None]:

# Load the dataset
data_dir = "data/StreetSurfaceVis_1024"
csv_path = os.path.join(data_dir, "streetSurfaceVis_v1_0.csv")
df = pd.read_csv(csv_path)


In [None]:

# Add a column to distinguish train and test data
df["split"] = df["train"].apply(lambda x: "Train" if x else "Test")


In [None]:

# Plot geographical distribution
fig = px.scatter_geo(df, 
                     lat="latitude", 
                     lon="longitude", 
                     color="split", 
                     title="Geographical Distribution of Train and Test Data",
                     scope="world",
                     hover_name="surface_type",
                     hover_data=["surface_quality", "captured_at"])
fig.update_layout(legend_title_text="Data Split")
fig.show()
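
The map gives a visual impression of the separation; a rough numeric check can complement it. The sketch below rounds coordinates to a coarse grid and reports grid cells that contain both training and test images. The helper `shared_grid_cells` and the grid resolution (`decimals=1`, roughly 11 km in latitude) are assumptions chosen for illustration, not part of the dataset.

```python
import pandas as pd


def shared_grid_cells(frame, decimals=1):
    """Return rounded (lat, lon) cells that contain both train and test rows."""
    is_train = frame["train"].astype(bool)
    cells = pd.DataFrame({
        "lat": frame["latitude"].round(decimals),
        "lon": frame["longitude"].round(decimals),
    })
    train_cells = set(map(tuple, cells[is_train].to_numpy()))
    test_cells = set(map(tuple, cells[~is_train].to_numpy()))
    return train_cells & test_cells


# Toy coordinates standing in for the real data
toy_geo = pd.DataFrame({
    "latitude": [52.52, 52.53, 48.14, 50.11],
    "longitude": [13.40, 13.41, 11.58, 8.68],
    "train": [True, True, False, False],
})
overlap = shared_grid_cells(toy_geo)
print(len(overlap))  # 0
```

An empty result at a given resolution is consistent with a geospatially distinct split, although it does not prove it; a non-empty result points to cells worth inspecting on the map.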

In [None]:
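import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of surface types
print("Surface Type Distribution:")
print(df["surface_type"].value_counts())

# Distribution of surface quality
print("\nSurface Quality Distribution:")
print(df["surface_quality"].value_counts())

# Relationship between surface type and quality.
# Both columns are assumed to hold categorical labels, so their
# co-occurrence counts are shown as an annotated heatmap.
type_quality_counts = pd.crosstab(df["surface_type"], df["surface_quality"])
plt.figure(figsize=(10, 6))
sns.heatmap(type_quality_counts, annot=True, fmt="d", cmap="Blues")
plt.title("Surface Quality by Surface Type")
plt.xlabel("Surface Quality")
plt.ylabel("Surface Type")
plt.show()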