<img 
    style="position: absolute; 
           left: 60%; 
           top: 0; /* Added to ensure proper positioning */
           height: 900px; 
           width: 40%; /* Maintain the original width */
           object-fit: cover; /* Adjust if necessary */
           clip-path: inset(0px 50px 0px 50px round 10px);" 
    src="_data\imagedb\image_14570_s_a.png" 
/>
</figure>


<h1 style="width: 60%; color: #EC6842; font-size: 55pt;">
    <Strong>
        Streetscapes
        <h2 style="color: #EC6842; font-size: 40pt;"> 
            <Strong>
                Urban classification using street-level images and their embeddings
            </Strong>
        </h2>
    </Strong>
</h1>

<h3 id="Background"><B>
    Rationale for the project<a class="anchor-link" href="#Background">&#182;</a>
    </B>
</h3>
<p style="text-align: justify; width: 60%; font-weight: normal;">
     Structures that make optimal use of the material they are made of reduce the cost and environmental impact of their construction, because less material is required. 
</p>


## <strong> X | Imports</strong>

In [1]:
import os
import h5py
import numpy as np
import pandas as pd
from collections import Counter

import streetscapes.models as sm
import streetscapes.processing as sp

## <strong> 0 | Import the data</strong>

Image features have already been extracted with `feature_extraction.py`. The approach uses a `ViT` pretrained on `ImageNet` and takes the final feature vector, before the classification takes place.

From here onwards, the code concerns the analysis of these features.

In [2]:
# Open Panoid GeoJson
directory = '_data\\geo_json\\panoids'
gpd_df = {}

for root, _, files in os.walk(directory):
    for filename in files:
        if filename.endswith('.geojson'):
            gpd_df[filename] = sp.geojson(root, filename)

# Open panoid features
features_file = '_data\\processed\\features.h5'
labels_file = '_data\\processed\\labels.txt'

with h5py.File(features_file, 'r') as f:
    features_np = f['features'][:]
with open(labels_file, 'r') as f:
    labels = [line.strip() for line in f]

## <strong> 2 | Dimensionality reduction</strong>

Not all features extracted from the images are useful, and the data set is large, so the non-relevant data is purged. Two methods have been used for this: PCA and an encoder-only VAE. The following code shows the PCA, which has been programmed to select the number of latent dimensions that capture 80% of the feature variance.

In [8]:
pca_model = sm.PCA(n_components=2) # Select an arbitrary number of components
optimal_n, variance_ratios = pca_model.find_optimal_components(features_np, threshold=0.8)
pca_model.n_components = optimal_n
print(f"Optimal number of components: {optimal_n}")
print(f"Explained variance with {optimal_n} components: {sum(variance_ratios[:optimal_n]):.3f}")

# Apply PCA with the optimal number of components
data = pca_model.fit_transform(features_np)

Optimal number of components: 119
Explained variance with 119 components: 0.801
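
The selection step can be illustrated on its own. The sketch below is not the implementation of `find_optimal_components`, whose internals are not shown here; it only assumes that the method returns per-component explained variance ratios sorted in descending order. The function name `smallest_n_for_threshold` and the toy ratios are made up for the example.

```python
from itertools import accumulate


def smallest_n_for_threshold(variance_ratios, threshold):
    """Return the smallest n whose first n ratios sum to at least `threshold`."""
    for n, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= threshold:
            return n
    # The threshold is never reached: keep every component
    return len(variance_ratios)


# Hypothetical explained variance ratios, sorted in descending order
ratios = [0.5, 0.2, 0.15, 0.1, 0.05]
print(smallest_n_for_threshold(ratios, threshold=0.8))  # 3
```

With a threshold of 0.8, the first two components reach only 0.7, so the third one is needed.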


## <strong> 3 | Classification </strong>

With the reduced set of important features, an unsupervised classification is now performed. Two methods have been used: 

1. K Means 
2. Gaussian Mixture

Both methods have been set up to find the optimal number of clusters by evaluating the gradient of the number of changed panoids. Once the gradient plateaus, it is assumed that the optimal number of clusters has been found. This has been capped at a maximum of 5 clusters to facilitate the town-wise classification. Only the KMeans classification is shown below.

In [12]:
km = sm.KM()
k_range = range(2, 10)  # candidate numbers of clusters
optimal_k = km.find_optimal_k(data, k_range)
km.plot_elbo_scores(k_range)

# Fit the final model with a manually chosen number of clusters
km.fit(data, 6)
cluster_labels = km.get_cluster_assignments(data)
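
The plateau rule described above can be sketched in a few lines. This is a minimal illustration, not the implementation of `km.find_optimal_k`, whose internals are not shown here. The function name `pick_k_at_plateau`, the tolerance `tol`, and the toy scores are assumptions made for the example; the cap of 5 follows the text.

```python
def pick_k_at_plateau(k_values, scores, tol=0.1, max_k=5):
    """Pick the k at which the score curve flattens, capped at max_k.

    The curve counts as flat once the change between two consecutive
    scores drops below `tol` times the first change.
    """
    changes = [abs(b - a) for a, b in zip(scores, scores[1:])]
    chosen = k_values[-1]  # fall back to the largest k if no plateau is found
    for k, change in zip(k_values, changes):
        if change < tol * changes[0]:
            chosen = k
            break
    return min(chosen, max_k)


# Hypothetical scores for k = 2..7 (for example, a per-k change count)
k_values = list(range(2, 8))
scores = [100.0, 60.0, 40.0, 38.0, 37.0, 36.5]
print(pick_k_at_plateau(k_values, scores))  # 4
```

In the toy data the score drops by 40 and then 20, but only by 2 after k = 4, so k = 4 is selected.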

## <strong> 4 | Data Processing </strong>

In [13]:

# Drop the metadata columns that are not needed for the analysis
STREETSCAPES_df = gpd_df['panoids.geojson'].drop(columns=['year', 'month', 'owner', 'ask_lng', 'ask_lat', 'consulted', 'url_side_a', 'url_front', 'url_side_b', 'url_back'])
cols = ['im_side_a', 'im_front', 'im_side_b', 'im_back']
new_cols = [f'{col}_cluster' for col in cols]
# Add one empty cluster column per image direction
for new_col in new_cols:
    STREETSCAPES_df[new_col] = None

label_to_cluster = dict(zip(labels, cluster_labels))

for col, new_col in zip(cols, new_cols):
    for idx, row in STREETSCAPES_df.iterrows():
        image_label = row[col]
        if image_label in label_to_cluster:
            STREETSCAPES_df.at[idx, new_col] = label_to_cluster[image_label]

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    R = 6371000  # mean Earth radius in metres
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    return R * 2 * np.arcsin(np.sqrt(a))

coords = STREETSCAPES_df[['lat', 'lng']].values
threshold_distance = 20
cluster_cols = ['im_side_a_cluster', 'im_front_cluster', 'im_side_b_cluster', 'im_back_cluster']
used = set()
groups = []

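# Group each panoid with the consecutive panoids that lie within
# threshold_distance metres of it (only consecutive rows are compared)
for i in range(len(coords)):
    if i in used:
        continue
    group = [i]
    used.add(i)
    for j in range(i + 1, len(coords)):
        distance = haversine_distance(coords[i][0], coords[i][1], coords[j][0], coords[j][1])
        if distance < threshold_distance:
            group.append(j)
            used.add(j)
        else:
            break  # stop at the first panoid outside the threshold
    # Keep only groups with more than one panoid
    if len(group) > 1:
        groups.append(group)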

STREETSCAPES_df['grouped_cluster'] = None
STREETSCAPES_df['grouped_labels'] = None

for group in groups:   
    all_clusters = [c for idx in group for c in STREETSCAPES_df.loc[STREETSCAPES_df.index[idx], cluster_cols].dropna().values]
    if all_clusters:
        majority_cluster = Counter(all_clusters).most_common(1)[0][0]
        all_labels = [l for idx in group for l in STREETSCAPES_df.loc[STREETSCAPES_df.index[idx], ['im_side_a', 'im_front', 'im_side_b', 'im_back']].dropna().values]
        first_idx = STREETSCAPES_df.index[group[0]]
        STREETSCAPES_df.at[first_idx, 'grouped_cluster'] = majority_cluster
        STREETSCAPES_df.at[first_idx, 'grouped_labels'] = ','.join(all_labels)

indices_to_drop = [idx for group in groups for idx in group[1:]]
STREETSCAPES_df = STREETSCAPES_df.drop(STREETSCAPES_df.index[indices_to_drop]).drop(columns=cluster_cols)
STREETSCAPES_df = STREETSCAPES_df[STREETSCAPES_df['grouped_cluster'].notnull()]
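
The grouping and majority vote in the cell above can be checked on a small example without the data set. The sketch below is a standalone illustration using only the standard library. The names `haversine_m`, `group_consecutive`, and `majority_per_group`, as well as the toy points and cluster ids, are hypothetical and not part of `streetscapes`.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres, as in the cell above


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def group_consecutive(points, threshold_m):
    """Group each anchor point with the consecutive points within threshold_m of it.

    Groups with a single member are discarded, mirroring the cell above.
    """
    groups, i = [], 0
    while i < len(points):
        j = i + 1
        while j < len(points) and haversine_m(*points[i], *points[j]) < threshold_m:
            j += 1
        if j - i > 1:
            groups.append(list(range(i, j)))
        i = j
    return groups


def majority_per_group(groups, votes_per_point):
    """Map the first index of each group to the most common non-missing vote."""
    result = {}
    for group in groups:
        votes = [v for idx in group for v in votes_per_point[idx] if v is not None]
        if votes:
            result[group[0]] = Counter(votes).most_common(1)[0][0]
    return result


# Hypothetical (lat, lng) points on one meridian; 0.0001 degrees of latitude is about 11 m
points = [(52.0000, 4.0), (52.0001, 4.0), (52.0003, 4.0), (52.0010, 4.0), (52.00105, 4.0)]
# Hypothetical cluster ids for the four images of each point (None = no image)
votes = [[0, 0, 1, 2], [0, 1, 1, 0], [2, 2, 2, 2], [1, None, 1, 3], [3, 1, None, 1]]

groups = group_consecutive(points, threshold_m=20)
print(groups)                             # [[0, 1], [3, 4]]
print(majority_per_group(groups, votes))  # {0: 0, 3: 1}
```

The third point lies about 33 m from the first one and about 78 m from the fourth, so it forms a group of one and is dropped, just as ungrouped panoids are removed by the final filter on `grouped_cluster`.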

## <strong> 5 | Plotting the results </strong>

In [14]:
sp.plot_feature_classes_kmeans(STREETSCAPES_df['lat'].values,
                               STREETSCAPES_df['lng'].values,
                               STREETSCAPES_df['grouped_cluster'].values,
                               STREETSCAPES_df['grouped_labels'].values)