### Neighbourhood Similarity Map

The **Neighbourhood Similarity Map** trains a neural embedding model (Word2Vec) on real user movement patterns to learn how Warsaw’s districts relate to each other.
Each user’s anonymized, time-ordered sequence of visited districts is treated like a sentence, so districts that often appear close together in people’s daily routes end up with similar vectors, revealing hidden connections between residential, business, and recreational areas.
If many users frequently move between Ursynów and Mokotów, for example, the model places these districts close together in the embedding space, which we interpret as a sign of shared lifestyle and functional character.
This data-driven proximity allows the system to suggest: *“If you like Ursynów, you’ll probably enjoy living in Mokotów or Wilanów.”*
It’s a lightweight neural approach that captures the city’s behavioral geography through how people actually move.
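
To make the "districts that appear together" idea concrete, here is a minimal sketch of how a mobility sequence turns into (target, context) pairs under a fixed window. It is a simplification: the actual Word2Vec training also samples the effective window size and draws negative examples. The function `skipgram_pairs` and the `routes` data are hypothetical and only serve as illustration.

```python
from collections import Counter


def skipgram_pairs(sequence, window=3):
    """Return (target, context) pairs for each token and its neighbours within `window` steps."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs


# Hypothetical routes for two users (illustrative only, not taken from the dataset).
routes = [
    ["ursynow", "mokotow", "srodmiescie", "mokotow", "ursynow"],
    ["wilanow", "mokotow", "srodmiescie"],
]

# Count how often each (target, context) pair occurs across all routes.
pair_counts = Counter(p for route in routes for p in skipgram_pairs(route, window=3))
for (target, context), n in pair_counts.most_common(5):
    print(f"{target:>12} -> {context:<12} x{n}")
```

Districts that share many of the same context districts are the ones the model ends up placing close together.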

In [3]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from unidecode import unidecode

In [4]:
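# Load only the columns needed to build per-user district sequences.
df = pd.read_csv("./data/hackplay_warszawa_with_districts.csv", usecols=["user_id","district","start_dttm"])
# Unparseable timestamps become NaT and are dropped together with other incomplete rows.
df["start_dttm"] = pd.to_datetime(df["start_dttm"], errors="coerce")
# Sort so that each user's visits are in chronological order.
df = df.dropna(subset=["start_dttm","district","user_id"]).sort_values(["user_id","start_dttm"])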

In [5]:
def norm(s):
    """Normalize a district name to an ASCII token, e.g. 'Praga Północ' -> 'praga_polnoc'."""
    return unidecode(str(s).strip().lower().replace(" ", "_"))

df["district_tok"] = df["district"].map(norm)

In [6]:
# Collapse consecutive visits to the same district (per user) into a single step.
df["prev_district"] = df.groupby("user_id")["district_tok"].shift(1)
df = df[df["district_tok"] != df["prev_district"]].drop(columns="prev_district")
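
The shift-and-compare filter above keeps only the first record of every run of identical consecutive districts within a user's timeline. As a rough pure-Python view of the same idea for a single user, here is a small sketch; `collapse_repeats` and the `visits` list are hypothetical names used only for illustration.

```python
from itertools import groupby


def collapse_repeats(tokens):
    """Keep the first token of every run of identical consecutive tokens."""
    return [tok for tok, _run in groupby(tokens)]


# Hypothetical visit log for one user, already sorted by time.
visits = ["ursynow", "ursynow", "mokotow", "mokotow", "mokotow", "ursynow"]
print(collapse_repeats(visits))  # ['ursynow', 'mokotow', 'ursynow']
```

Without this step, a user who stays in one district for many records would contribute mostly self-pairs, which say nothing about movement between districts.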

In [10]:
sequences = df.groupby("user_id")["district_tok"].apply(list).tolist()

In [9]:
sequences = [seq for seq in sequences if len(seq) >= 2]

In [12]:
model = Word2Vec(
    sentences=sequences,
    vector_size=16,
    window=3,
    min_count=1,
    sg=1,
    negative=10,
    epochs=80,
    seed=42,
    workers=1  # a fixed seed only gives repeatable vectors with a single worker thread
)
wv: KeyedVectors = model.wv
district_vocab = sorted(wv.key_to_index.keys())


In [13]:
def recommend_districts(district_name: str, topn: int = 5):
    """Return the `topn` districts most similar to `district_name` as (name, cosine similarity) pairs."""
    key = norm(district_name)
    if key not in wv:
        raise ValueError(f"Unknown district token: {district_name} -> {key}")
    sim = wv.most_similar(key, topn=topn)

    # Turn tokens back into readable names (diacritics are not restored).
    def prettify(tok):
        return tok.replace("_", " ").title()

    return [(prettify(t), round(score, 3)) for t, score in sim]

In [18]:
print("Similar to 'Ursynów':", recommend_districts("Ursynow"))


Similar to 'Ursynów': [('Praga Polnoc', 0.914), ('Praga Poludnie', 0.893), ('Mokotow', 0.851), ('Wawer', 0.823), ('Rembertow', 0.8)]
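
The scores above are cosine similarities between district vectors, which is what `most_similar` ranks by. The following sketch shows the same ranking logic in plain Python; the 3-dimensional `toy_vectors` are made up for illustration and are not the trained embeddings, and `cosine` and `rank_by_cosine` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query, vectors, topn=5):
    """Rank every other key by its cosine similarity to `query`, highest first."""
    scores = [
        (name, cosine(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Made-up vectors, for illustration only.
toy_vectors = {
    "ursynow": [1.0, 0.2, 0.0],
    "mokotow": [0.9, 0.3, 0.1],
    "bialoleka": [0.0, 1.0, 0.8],
}

for name, score in rank_by_cosine("ursynow", toy_vectors):
    print(name, round(score, 3))
```

Because cosine similarity ignores vector length, a score near 1 means two districts point in nearly the same direction in the embedding space, regardless of how often each one appears in the data.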


In [19]:
emb = np.vstack([wv[t] for t in district_vocab])
emb_df = pd.DataFrame(emb, columns=[f"emb_{i}" for i in range(emb.shape[1])])
emb_df.insert(0, "district_tok", district_vocab)

In [20]:
emb_df["district"] = emb_df["district_tok"].str.replace("_"," ").str.title()
emb_df.to_csv("./data/district_embeddings.csv", index=False)


In [21]:
# Pairwise cosine similarity between all districts; rows and columns follow district_vocab.
sim_mat = np.zeros((len(district_vocab), len(district_vocab)))
for i, a in enumerate(district_vocab):
    for j, b in enumerate(district_vocab):
        sim_mat[i, j] = wv.similarity(a, b)
sim_df = pd.DataFrame(sim_mat, index=district_vocab, columns=district_vocab)
sim_df.to_csv("./data/district_similarity_matrix.csv")