WHAT THIS NOTEBOOK DOES:
For each ego-user, we load their ego-network (a graph of friends and friend-to-friend links) and their circle memberships (which friends belong to which groups). We label a pair of users 1 if they appear together in at least one circle and 0 otherwise; candidate pairs are drawn only from users who belong to at least one circle, and negatives are subsampled to at most negative_ratio (default 2) per positive. For each labeled pair we compute graph-based structural features (common neighbors, Jaccard similarity, Adamic–Adar score, resource allocation, preferential attachment, shortest path length, and degree statistics) that capture how strongly the two users are connected in the network. After repeating this for every ego with circle data, we concatenate the results into a single DataFrame and save it as pair_graph_features_only.csv, the feature matrix used to train and analyze prediction models.

In [21]:
from pathlib import Path
import networkx as nx
from itertools import combinations
import random
import pandas as pd

Step 1: Build undirected graph

In [4]:
BASE_DIR = Path("/Users/siyakamboj/Downloads/CSE_158/datasets/learning-social-circles/")  # path to dataset directory
EGONET_DIR = BASE_DIR / "egonets"

In [5]:
def load_egonet(ego_id: int, egonet_dir: Path = EGONET_DIR) -> nx.Graph:
    """
    Load a single ego-network as an undirected NetworkX graph.

    Nodes:
        ego_id + all friends that appear in the file (left of ':' or in neighbor lists).
    Edges:
        - ego_id -- friend  (for every friend listed)
        - friend -- neighbor  (for every ID on that friend's line)

    Parameters
    ----------
    ego_id : int
        The ego user ID (e.g., 0, 239).
    egonet_dir : Path
        Directory containing '<ego_id>.egonet' files.

    Returns
    -------
    G : nx.Graph
        Undirected graph for this ego-network.
    """
    ego_file = egonet_dir / f"{ego_id}.egonet"
    if not ego_file.exists():
        raise FileNotFoundError(f"Egonet file not found: {ego_file}")

    G = nx.Graph()

    # Add ego node explicitly
    G.add_node(ego_id, is_ego=True)

    with ego_file.open("r") as f:
        for raw_line in f:
            line = raw_line.strip()
            if not line:
                continue  # skip empty lines

            # Example line: "1: 146 189 229 201 ..."
            if ":" not in line:
                # in case of weird malformed lines
                continue

            left, right = line.split(":", 1)
            u = int(left.strip())

            # mark as non-ego friend
            if u not in G:
                G.add_node(u, is_ego=False)

            # ego -- friend edge
            G.add_edge(ego_id, u)

            neighbors_str = right.strip()
            if neighbors_str:
                for tok in neighbors_str.split():
                    v = int(tok)
                    if v not in G:
                        G.add_node(v, is_ego=False)
                    G.add_edge(u, v)

    return G


In [18]:
# sanity check on ego 0
G0 = load_egonet(0)

print("Nodes:", G0.number_of_nodes())
print("Edges:", G0.number_of_edges())
print("First 10 nodes:", list(G0.nodes())[:10])

print("Ego node present?", 0 in G0)
print("Degree of ego 0:", G0.degree[0])

# Look at one friend from the file, e.g. node 1
print("Neighbors of 1:", list(G0.neighbors(1))[:20])


Nodes: 239
Edges: 4448
First 10 nodes: [0, 1, 146, 189, 229, 201, 204, 60, 215, 35]
Ego node present? True
Degree of ego 0: 238
Neighbors of 1: [0, 146, 189, 229, 201, 204, 60, 215, 35, 91, 238, 88, 28, 166, 218, 156, 6, 2, 8, 231]
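
To make the file format concrete, here is a minimal sketch of the same parsing logic using only the standard library, run on a hypothetical three-friend ego-network held in memory. The helper name parse_egonet_lines and the toy lines are assumptions for illustration, not part of the notebook's pipeline.

```python
from collections import defaultdict


def parse_egonet_lines(ego_id, lines):
    """Build an undirected adjacency map from '<friend>: <neighbor> ...' lines."""
    adj = defaultdict(set)
    adj[ego_id]  # touch the key so the ego exists even for an empty file
    for raw in lines:
        line = raw.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        left, right = line.split(":", 1)
        u = int(left)
        # ego -- friend edge, stored in both directions
        adj[ego_id].add(u)
        adj[u].add(ego_id)
        # friend -- neighbor edges, stored in both directions
        for tok in right.split():
            v = int(tok)
            adj[u].add(v)
            adj[v].add(u)
    return adj


# Hypothetical toy ego-network for ego 0 (not taken from the dataset)
toy_lines = ["1: 2 3", "2: 1", "3: 1", ""]
adj = parse_egonet_lines(0, toy_lines)

num_nodes = len(adj)
# every undirected edge is stored twice, once per endpoint
num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
print("Nodes:", num_nodes, "Edges:", num_edges)
```

Edges listed from both sides ("1: 2" and "2: 1") collapse into one undirected edge. The ego is linked to every friend that starts a line, which matches the sanity check above, where ego 0 has degree 238 in a 239-node graph.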


Step 2: Parse the .circles files and build labels for pairs (u, v)

In [9]:
CIRCLES_DIR = BASE_DIR / "Training"

In [10]:
def load_circles(ego_id: int, circles_dir: Path = CIRCLES_DIR):
    """
    Load circles for a given ego.

    Returns
    -------
    circles : dict[str, set[int]]
        Mapping from circle name (e.g. 'circle17') to set of member IDs.
    pair_labels : dict[frozenset, int]
        Mapping from unordered pair {u,v} to 1 if they share >=1 circle.
        (We only store positives here; negatives will be generated later.)
    """
    circle_file = circles_dir / f"{ego_id}.circles"
    if not circle_file.exists():
        raise FileNotFoundError(f"Circle file not found: {circle_file}")

    circles: dict[str, set[int]] = {}
    pair_labels: dict[frozenset[int], int] = {}

    with circle_file.open("r") as f:
        for raw_line in f:
            line = raw_line.strip()
            if not line or ":" not in line:
                continue  # skip blank or malformed lines (split below needs a ':')

            name_part, members_part = line.split(":", 1)
            circle_name = name_part.strip()      # e.g. 'circle17'
            members_str = members_part.strip()
            if not members_str:
                continue

            members = {int(tok) for tok in members_str.split()}
            circles[circle_name] = members

            # mark all pairs inside this circle as positive
            for u, v in combinations(sorted(members), 2):
                pair = frozenset((u, v))
                pair_labels[pair] = 1   # if pair appears in multiple circles, still 1

    return circles, pair_labels

In [17]:
# sanity check with ego 239
circles_239, pos_pairs_239 = load_circles(239)

print("Num circles for ego 239:", len(circles_239))
for cname, members in circles_239.items():
    print(cname, "size:", len(members))

print("Num positive pairs:", len(pos_pairs_239))
print("Example positive pairs:", list(pos_pairs_239.items())[:10])


Num circles for ego 239: 4
circle17 size: 27
circle16 size: 46
circle19 size: 13
circle18 size: 19
Num positive pairs: 1634
Example positive pairs: [(frozenset({248, 247}), 1), (frozenset({249, 247}), 1), (frozenset({250, 247}), 1), (frozenset({255, 247}), 1), (frozenset({265, 247}), 1), (frozenset({281, 247}), 1), (frozenset({287, 247}), 1), (frozenset({289, 247}), 1), (frozenset({290, 247}), 1), (frozenset({296, 247}), 1)]
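
The positive-pair count can be checked by hand from the circle sizes printed above. A circle with n members contributes n(n-1)/2 pairs, and a pair that sits in k circles is counted k times in that sum but only once in pair_labels. The short sketch below does the arithmetic; the sizes and the 1634 figure are copied from the output above.

```python
from math import comb

# Circle sizes and unique-pair count as printed for ego 239
circle_sizes = {"circle17": 27, "circle16": 46, "circle19": 13, "circle18": 19}
observed_unique_pairs = 1634

# Pairs contributed by each circle: n choose 2
pairs_per_circle = {name: comb(n, 2) for name, n in circle_sizes.items()}
total_with_repeats = sum(pairs_per_circle.values())

# Surplus = number of extra times some pair was counted across circles
surplus = total_with_repeats - observed_unique_pairs

print(pairs_per_circle)
print("Sum over circles:", total_with_repeats, "Surplus:", surplus)
```

The sum over circles is 1635, one more than the 1634 unique pairs, which is what we would expect if exactly one pair of users shares two of the four circles.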


Step 3: Build the labeled pair dataset for one ego

In [15]:
# GENERATE ALL CANDIDATE PAIRS AND SAMPLE NEGATIVES

def make_labeled_pairs_for_ego(
    ego_id: int,
    circles: dict,
    pos_pairs: dict,
    negative_ratio: int = 2,
    seed: int = 42,
):
    """
    For a given ego, construct a list of (ego_id, u, v, label) rows.

    - Positives: all pairs that co-occur in >=1 circle (from pos_pairs).
    - Negatives: sampled from pairs of users that appear in at least one circle
      but are NOT in pos_pairs.

    negative_ratio: how many negatives per positive (e.g. 2x).
    """
    rng = random.Random(seed)

    # 1) all users that are in any circle for this ego
    members = set()
    for mset in circles.values():
        members.update(mset)
    members = sorted(members)

    # 2) generate all unordered pairs among these members
    all_pairs = [frozenset((u, v)) for u, v in combinations(members, 2)]

    pos_set = set(pos_pairs.keys())

    # 3) separate positives and negatives
    positive_pairs = [p for p in all_pairs if p in pos_set]
    negative_pairs = [p for p in all_pairs if p not in pos_set]

    # 4) sample negatives
    max_neg = negative_ratio * len(positive_pairs)
    if len(negative_pairs) > max_neg:
        negative_pairs = rng.sample(negative_pairs, max_neg)

    # 5) build rows
    rows = []
    for p in positive_pairs:
        u, v = sorted(p)
        rows.append((ego_id, u, v, 1))
    for p in negative_pairs:
        u, v = sorted(p)
        rows.append((ego_id, u, v, 0))

    return rows


In [None]:
# sanity check for ego 239
rows_239 = make_labeled_pairs_for_ego(239, circles_239, pos_pairs_239, negative_ratio=2)

print("Total rows:", len(rows_239))
print("First 5 rows:", rows_239[:5])

# Check label balance
num_pos = sum(1 for _, _, _, y in rows_239 if y == 1)
num_neg = sum(1 for _, _, _, y in rows_239 if y == 0)
print("Positives:", num_pos, "Negatives:", num_neg)


Total rows: 4902
First 5 rows: [(239, 240, 241, 1), (239, 240, 242, 1), (239, 240, 243, 1), (239, 240, 251, 1), (239, 240, 253, 1)]
Positives: 1634 Negatives: 3268
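
Note that negative_ratio is an upper bound, not a guarantee: negatives are subsampled only when there are more of them than negative_ratio times the positives. The sketch below replays the same logic on two hypothetical toy circles (the circle names and member IDs are made up) where too few negatives exist to reach the cap.

```python
import random
from itertools import combinations

# Hypothetical toy circles, not taken from the dataset
circles = {"a": {1, 2, 3}, "b": {3, 4}}

# Positives: every unordered pair that shares at least one circle
positives = {
    frozenset(pair)
    for members in circles.values()
    for pair in combinations(sorted(members), 2)
}

# Candidates: every unordered pair among users that appear in any circle
all_members = sorted(set().union(*circles.values()))
candidates = [frozenset(pair) for pair in combinations(all_members, 2)]
negatives = [p for p in candidates if p not in positives]

# Cap negatives at negative_ratio x positives; sample only if over the cap
negative_ratio = 2
max_neg = negative_ratio * len(positives)
rng = random.Random(42)
if len(negatives) > max_neg:
    negatives = rng.sample(negatives, max_neg)

print("Positives:", len(positives), "Negatives:", len(negatives))
```

Here there are 4 positives but only 2 possible negatives, so all negatives are kept and the classes end up positive-heavy. Ego 239 above hits the cap exactly (3268 = 2 x 1634), whereas egos whose circles cover most of their member pairs will fall short of it.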


Step 4: Graph structural features

In [22]:
def build_graph_features_for_ego(G: nx.Graph, rows):
    """
    G: ego's graph (networkx.Graph)
    rows: list of (ego_id, u, v, label)

    Returns
    -------
    df : pandas.DataFrame
        Columns: ego_id, u, v, label, and graph-based features.
    """
    # 1) Extract (u, v) pair list
    pair_list = [(u, v) for _, u, v, _ in rows]

    # 2) Precompute degree dict for speed
    deg = dict(G.degree())

    # 3) Precompute similarity metrics using NetworkX generators
    jaccard = {(u, v): score for u, v, score in nx.jaccard_coefficient(G, pair_list)}
    adamic = {(u, v): score for u, v, score in nx.adamic_adar_index(G, pair_list)}
    res_alloc = {(u, v): score for u, v, score in nx.resource_allocation_index(G, pair_list)}
    pref_attach = {(u, v): score for u, v, score in nx.preferential_attachment(G, pair_list)}

    # 4) Build feature rows
    records = []
    for (ego_id, u, v, label) in rows:
        # make sure key ordering matches what we used in the dicts
        key = (u, v)

        # common neighbors
        cn = len(list(nx.common_neighbors(G, u, v)))

        # path length (handle disconnected case)
        try:
            spl = nx.shortest_path_length(G, u, v)
            same_comp = 1
        except nx.NetworkXNoPath:
            spl = -1   # or some large number like 999
            same_comp = 0

        is_edge = int(G.has_edge(u, v))

        du, dv = deg.get(u, 0), deg.get(v, 0)
        deg_min = min(du, dv)
        deg_max = max(du, dv)
        deg_diff = abs(du - dv)

        records.append({
            "ego_id": ego_id,
            "u": u,
            "v": v,
            "label": label,
            "common_neighbors": cn,
            "jaccard": jaccard.get(key, 0.0),
            "adamic_adar": adamic.get(key, 0.0),
            "resource_allocation": res_alloc.get(key, 0.0),
            "preferential_attachment": pref_attach.get(key, 0.0),
            "shortest_path_len": spl,
            "same_component": same_comp,
            "is_edge": is_edge,
            "deg_u": du,
            "deg_v": dv,
            "deg_min": deg_min,
            "deg_max": deg_max,
            "deg_diff": deg_diff,
        })

    df = pd.DataFrame.from_records(records)
    return df
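
The shortest-path feature uses -1 as a "no path" sentinel. As a rough illustration of what that column encodes, here is a standard-library breadth-first search on a hypothetical toy graph; bfs_distance and toy_adj are illustrative names, and the notebook itself relies on nx.shortest_path_length.

```python
from collections import deque


def bfs_distance(adj, source, target):
    """Return the hop count from source to target, or -1 if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return -1  # same sentinel as shortest_path_len in the feature table


# Hypothetical graph: node 0 plays the ego; nodes 3 and 4 form a separate component
toy_adj = {0: {1, 2}, 1: {0}, 2: {0}, 3: {4}, 4: {3}}

d_adjacent = bfs_distance(toy_adj, 0, 1)
d_via_hub = bfs_distance(toy_adj, 1, 2)
d_unreachable = bfs_distance(toy_adj, 1, 3)
print(d_adjacent, d_via_hub, d_unreachable)
```

Two caveats follow from this. First, -1 sorts below every real distance, so a model that treats the column as ordinal could read "unreachable" as "closer than adjacent"; the same_component flag is there to disambiguate. Second, because load_egonet links the ego to every friend that starts a line in the file, any two such friends are at most two hops apart through the ego, so this feature may carry little more information than is_edge unless the ego node is removed before computing it.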


In [23]:
# run for ego 239: sanity check
G239 = load_egonet(239)
graph_df_239 = build_graph_features_for_ego(G239, rows_239)
print(graph_df_239.head())
print(graph_df_239.describe())


   ego_id    u    v  label  common_neighbors   jaccard  adamic_adar  \
0     239  240  241      1                 6  0.375000     2.828988   
1     239  240  242      1                 4  0.285714     1.607905   
2     239  240  243      1                 1  0.045455     0.214871   
3     239  240  251      1                 1  0.041667     0.214871   
4     239  240  253      1                 1  0.058824     0.214871   

   resource_allocation  preferential_attachment  shortest_path_len  \
0             0.746825                      105                  1   
1             0.359524                       77                  2   
2             0.009524                      112                  2   
3             0.009524                      126                  2   
4             0.009524                       77                  2   

   same_component  is_edge  deg_u  deg_v  deg_min  deg_max  deg_diff  
0               1        1      7     15        7       15         8  
1         
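
For reference, the neighborhood-based scores can be written out directly from neighbor sets. The sketch below computes them by hand on a hypothetical four-node graph using the usual definitions; pair_features and the toy adjacency map are assumptions for illustration, and the notebook obtains these values from the NetworkX link-prediction functions.

```python
from math import log

# Hypothetical undirected graph as a symmetric adjacency map
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}


def pair_features(adj, u, v):
    """Neighborhood-overlap scores for the pair (u, v)."""
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        # |N(u) & N(v)| / |N(u) | N(v)|
        "jaccard": len(common) / len(union) if union else 0.0,
        # sum over shared neighbors w of 1 / log(deg(w))
        "adamic_adar": sum(1.0 / log(len(adj[w])) for w in common if len(adj[w]) > 1),
        # sum over shared neighbors w of 1 / deg(w)
        "resource_allocation": sum(1.0 / len(adj[w]) for w in common),
        # deg(u) * deg(v)
        "preferential_attachment": len(nu) * len(nv),
    }


feats = pair_features(adj, "b", "d")
print(feats)
```

The first row printed above is consistent with these definitions: with deg_u = 7, deg_v = 15 and 6 common neighbors, the union has 7 + 15 - 6 = 16 nodes, so jaccard = 6 / 16 = 0.375, and preferential_attachment = 7 x 15 = 105.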

Step 5: Run all egos & combine datasets

In [24]:
def get_ego_ids_with_circles(circles_dir: Path = CIRCLES_DIR):
    ego_ids = []
    for path in circles_dir.glob("*.circles"):
        # filename like '239.circles' → '239'
        ego_str = path.stem
        try:
            ego_ids.append(int(ego_str))
        except ValueError:
            # skip weird files if any
            continue
    ego_ids = sorted(ego_ids)
    return ego_ids

In [25]:
ego_ids = get_ego_ids_with_circles()
print("Egos with circles:", ego_ids)
print("Num egos:", len(ego_ids))

Egos with circles: [239, 345, 611, 1357, 1839, 1968, 2255, 2365, 2738, 2790, 2895, 3059, 3735, 4406, 4829, 5212, 5494, 5881, 6413, 6726, 7667, 8100, 8239, 8553, 8777, 9103, 9642, 9846, 9947, 10395, 10929, 11014, 11186, 11364, 11410, 12800, 13353, 13789, 15672, 16203, 16378, 16642, 16869, 17951, 18005, 18543, 19129, 19788, 22650, 22824, 23157, 23299, 24758, 24857, 25159, 25568, 25773, 26321, 26492, 27022]
Num egos: 60


In [26]:
def build_full_graph_feature_dataset(
    ego_ids,
    negative_ratio: int = 2,
):
    """
    Build a single DataFrame with graph-structural features
    for all egos that have circle labels.

    Returns
    -------
    full_df : pandas.DataFrame
    """
    all_dfs = []

    for ego_id in ego_ids:
        print(f"Processing ego {ego_id} ...")

        # 1) load graph
        G = load_egonet(ego_id)

        # 2) load circles + positive pairs
        circles, pos_pairs = load_circles(ego_id)

        # 3) build labeled pairs (pos + sampled neg)
        rows = make_labeled_pairs_for_ego(
            ego_id, circles, pos_pairs,
            negative_ratio=negative_ratio,
            # optional: different seed per ego if you want
            # seed=42 + ego_id
        )

        # 4) compute structural features
        df_ego = build_graph_features_for_ego(G, rows)

        print(f"  rows: {len(df_ego)}, positives: {df_ego['label'].sum()}")

        all_dfs.append(df_ego)

    # 5) concatenate
    full_df = pd.concat(all_dfs, ignore_index=True)
    return full_df

In [27]:
full_graph_df = build_full_graph_feature_dataset(ego_ids, negative_ratio=2)

print("Total rows across all egos:", len(full_graph_df))
print(full_graph_df.head())

Processing ego 239 ...
  rows: 4902, positives: 1634
Processing ego 345 ...
  rows: 31375, positives: 12410
Processing ego 611 ...
  rows: 24405, positives: 8135
Processing ego 1357 ...
  rows: 8128, positives: 2996
Processing ego 1839 ...
  rows: 6903, positives: 2710
Processing ego 1968 ...
  rows: 25032, positives: 8344
Processing ego 2255 ...
  rows: 4851, positives: 4293
Processing ego 2365 ...
  rows: 4035, positives: 1345
Processing ego 2738 ...
  rows: 201, positives: 67
Processing ego 2790 ...
  rows: 2080, positives: 1505
Processing ego 2895 ...
  rows: 2278, positives: 921
Processing ego 3059 ...
  rows: 12561, positives: 4490
Processing ego 3735 ...
  rows: 203589, positives: 67863
Processing ego 4406 ...
  rows: 88831, positives: 60860
Processing ego 4829 ...
  rows: 31626, positives: 12373
Processing ego 5212 ...
  rows: 47007, positives: 15669
Processing ego 5494 ...
  rows: 27495, positives: 9603
Processing ego 5881 ...
  rows: 12939, positives: 4313
Processing ego 6413

Step 6: Profile similarity functions & attach to DataFrame

Last step: Save as CSV to be used by data modelers

In [37]:
OUT_PATH = "pair_graph_features_only.csv"
full_graph_df.to_csv(OUT_PATH, index=False)
print("Saved:", OUT_PATH, "with shape", full_graph_df.shape)

Saved: pair_graph_features_only.csv with shape (1394586, 17)
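
Anyone consuming the file may want to confirm the schema before training. The sketch below is one possible check using only the standard library: it compares a CSV header against the 17 columns produced in Step 4. The helper name check_header is an assumption, and the in-memory sample reuses the first row printed in Step 4 instead of reading the real file.

```python
import csv
import io

# The 4 identifier/label columns plus the 13 graph features from Step 4
EXPECTED_COLUMNS = [
    "ego_id", "u", "v", "label",
    "common_neighbors", "jaccard", "adamic_adar", "resource_allocation",
    "preferential_attachment", "shortest_path_len", "same_component",
    "is_edge", "deg_u", "deg_v", "deg_min", "deg_max", "deg_diff",
]


def check_header(handle):
    """Return (missing, extra) column names relative to EXPECTED_COLUMNS."""
    header = next(csv.reader(handle))
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    extra = [c for c in header if c not in EXPECTED_COLUMNS]
    return missing, extra


# In-memory stand-in for the saved file: header plus the first row shown in Step 4
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "239,240,241,1,6,0.375,2.828988,0.746825,105,1,1,1,7,15,7,15,8\n"
)
missing, extra = check_header(sample)
print("Missing:", missing, "Extra:", extra)
```

To run the same check on the saved file, open it with open("pair_graph_features_only.csv", newline="") and pass the handle to check_header.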
