# PageRank

PageRank is computed with the power method:

```text
p^(k+1) = T * p^(k)
```

Model rules (from the problem statement):

- T[i, j] = probability of moving from page j to page i (each column corresponds to the starting page)
- If a page has no outgoing link (a sink), the next page is chosen uniformly (probability 1/N for each page)
- Stop when:

```text
||T p^(k) - p^(k)||_1 / ||p^(k)||_1 <= eps
```
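To make the rules above concrete, here is a minimal sketch on a hypothetical 3-page graph. The names `n_toy`, `toy_edges`, `T_toy` and `p_toy` are illustrative and not part of the notebook. It builds the dense matrix explicitly (only reasonable for tiny graphs), applies the sink rule, and iterates until the relative L1 criterion is met.

```python
import numpy as np

# Hypothetical toy graph (0-based): 0 -> 1, 0 -> 2, 1 -> 2; page 2 is a sink.
n_toy = 3
toy_edges = [(0, 1), (0, 2), (1, 2)]  # (from j, to i)

# Dense matrix with T_toy[i, j] = probability of moving from page j to page i.
T_toy = np.zeros((n_toy, n_toy))
for j, i in toy_edges:
    T_toy[i, j] += 1.0
col_sums = T_toy.sum(axis=0)          # out-degree of each page
for j in range(n_toy):
    if col_sums[j] == 0:
        T_toy[:, j] = 1.0 / n_toy     # sink: uniform jump to any page
    else:
        T_toy[:, j] /= col_sums[j]    # normalise the column

# Power iteration with the relative L1 stopping rule.
eps_toy = 1e-10
p_toy = np.full(n_toy, 1.0 / n_toy)
while True:
    p_next = T_toy @ p_toy
    err = np.abs(p_next - p_toy).sum() / np.abs(p_toy).sum()
    p_toy = p_next
    if err <= eps_toy:
        break

print(T_toy.sum(axis=0))  # every column sums to 1
print(p_toy)              # close to [2/11, 3/11, 6/11]
```

Solving T p = p by hand for this graph gives p = (2/11, 3/11, 6/11), which is a convenient check on the iteration.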


In [1]:
import numpy as np
import pandas as pd


## 1) Read the CSV files
- names.csv: column Name
- edges.csv: columns FromNode, ToNode (1-based IDs)


In [None]:
names = pd.read_csv("names.csv")     # Name
edges = pd.read_csv("edges.csv")       # FromNode, ToNode

N = len(names)

src = edges["FromNode"].to_numpy() - 1   # j: source page, converted to 0-based
dst = edges["ToNode"].to_numpy() - 1     # i: target page, converted to 0-based

N, len(edges)


(199903, 10722190)

## 2) Out-degrees and sinks

In [3]:
outdeg = np.bincount(src, minlength=N)
sink = (outdeg == 0)

int(sink.sum())


0
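The `minlength=N` argument matters whenever the highest-numbered pages have no outgoing link. The sketch below uses a hypothetical 4-page graph (the names `n_pages`, `toy_src`, `toy_outdeg` and `toy_sink` are illustrative) to show the difference.

```python
import numpy as np

# Hypothetical 4-page graph, 0-based sources; pages 2 and 3 have no outgoing link.
n_pages = 4
toy_src = np.array([0, 0, 1])

# Without minlength, bincount stops at the largest index it has seen.
short = np.bincount(toy_src)
# With minlength, pages that never appear as a source get an explicit 0.
toy_outdeg = np.bincount(toy_src, minlength=n_pages)
toy_sink = toy_outdeg == 0

print(short)       # [2 1]
print(toy_outdeg)  # [2 1 0 0]
print(toy_sink)    # [False False  True  True]
```

In the dataset above the sink count is 0, so the sink branch never contributes, but keeping `minlength` guarantees that `outdeg` always has length N.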

## 3) Compute T @ p without building T

- For each edge j -> i: add p[j] / outdeg[j] to p_next[i]
- Spread the sink mass uniformly: add (sum_{j sink} p[j]) / N to every entry of p_next


In [None]:
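def T_times_p(p):
    """Return T @ p using only the edge list (src, dst) and outdeg."""
    p_next = np.zeros(N, dtype=float)

    # links: each edge j -> i sends p[j] / outdeg[j] to page i
    contrib = p[src] / outdeg[src]
    np.add.at(p_next, dst, contrib)   # unbuffered add, so repeated targets accumulate

    # sinks: their total mass is spread uniformly over all pages
    p_next += p[sink].sum() / N

    return p_next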
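As a possible alternative to `np.add.at`, the same accumulation can be written with `np.bincount` and its `weights` argument, which is often faster on large edge lists (worth timing on your own data). The sketch below is self-contained and uses a hypothetical toy edge list; `t_times_p_bincount`, `t_times_p_add_at` and the `toy_*` names are illustrative, not part of the notebook.

```python
import numpy as np

# Hypothetical toy edge list (0-based): 0 -> 1, 0 -> 2, 1 -> 2; page 2 is a sink.
n_pages = 3
toy_src = np.array([0, 0, 1])
toy_dst = np.array([1, 2, 2])
toy_outdeg = np.bincount(toy_src, minlength=n_pages)
toy_sink = toy_outdeg == 0

def t_times_p_bincount(p):
    """Edge-list product T @ p, accumulated with a weighted bincount."""
    contrib = p[toy_src] / toy_outdeg[toy_src]
    p_next = np.bincount(toy_dst, weights=contrib, minlength=n_pages)
    return p_next + p[toy_sink].sum() / n_pages

def t_times_p_add_at(p):
    """Reference version that accumulates with np.add.at."""
    p_next = np.zeros(n_pages)
    np.add.at(p_next, toy_dst, p[toy_src] / toy_outdeg[toy_src])
    return p_next + p[toy_sink].sum() / n_pages

p0 = np.full(n_pages, 1.0 / n_pages)
a = t_times_p_bincount(p0)
b = t_times_p_add_at(p0)
print(np.allclose(a, b), a.sum())  # both versions agree, and the mass is preserved
```

Both versions preserve the total probability mass, because every non-sink page distributes all of its mass along its edges and the sink mass is redistributed in full.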


## 4) Power method and stopping criterion

In [5]:
eps = 1e-8

p = np.ones(N) / N
k = 0

while True:
    p_next = T_times_p(p)
    err = np.abs(p_next - p).sum() / np.abs(p).sum()

    p = p_next
    k += 1

    if err <= eps:
        break

k, err, p.sum()


(113, np.float64(9.177365089017451e-09), np.float64(1.0))
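The `while True` loop above only ends when the criterion is met. On a graph where the iteration does not converge (a periodic chain, for example), it would run forever. A minimal sketch of a guarded variant follows; `power_iteration`, `apply_T` and `max_iter` are hypothetical names introduced for illustration, and the 3-page matrix is a toy example.

```python
import numpy as np

def power_iteration(apply_T, n, eps=1e-8, max_iter=1000):
    """Iterate p <- T p from the uniform vector until the relative L1 change is <= eps.

    Returns (p, number of iterations, last error).
    Raises RuntimeError if max_iter is reached before the criterion is met.
    """
    p = np.full(n, 1.0 / n)
    for k in range(1, max_iter + 1):
        p_next = apply_T(p)
        err = np.abs(p_next - p).sum() / np.abs(p).sum()
        p = p_next
        if err <= eps:
            return p, k, err
    raise RuntimeError(f"no convergence after {max_iter} iterations (err={err:.3e})")

# Toy check: dense matrix of the graph 0 -> 1, 0 -> 2, 1 -> 2, with page 2 as a sink.
T_toy = np.array([[0.0, 0.0, 1 / 3],
                  [0.5, 0.0, 1 / 3],
                  [0.5, 1.0, 1 / 3]])
p_toy, n_iter, last_err = power_iteration(lambda v: T_toy @ v, 3)
print(n_iter, last_err, p_toy.sum())
```

In the notebook, the same function could be called as `power_iteration(T_times_p, N)`, assuming `T_times_p` and `N` are defined as in the cells above.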

## 5) Top pages

In [6]:
top_k = 20
idx = np.argsort(-p)[:top_k]

pd.DataFrame({
    "rank": np.arange(1, top_k + 1),
    "node_id": idx + 1,
    "name": names["Name"].iloc[idx].to_numpy(),
    "pagerank": p[idx],
})


| | rank | node_id | name | pagerank |
|---|---|---|---|---|
| 0 | 1 | 112356 | United States | 0.002491 |
| 1 | 2 | 168241 | United Kingdom | 0.00139 |
| 2 | 3 | 138128 | World War II | 0.001131 |
| 3 | 4 | 184958 | Latin | 0.001084 |
| 4 | 5 | 60041 | France | 0.001077 |
| 5 | 6 | 138420 | Germany | 0.000919 |
| 6 | 7 | 49148 | English language | 0.000839 |
| 7 | 8 | 149853 | China | 0.000797 |
| 8 | 9 | 151511 | Canada | 0.000791 |
| 9 | 10 | 145591 | India | 0.000789 |


## 6) Basic search (title contains the keyword)

In [7]:
def search(query, k=10):
    q = query.lower()
    mask = names["Name"].str.lower().str.contains(q, na=False).to_numpy()
    idx = np.where(mask)[0]
    idx = idx[np.argsort(-p[idx])][:k]

    return pd.DataFrame({
        "rank": np.arange(1, len(idx) + 1),
        "node_id": idx + 1,
        "name": names["Name"].iloc[idx].to_numpy(),
        "pagerank": p[idx],
    })

search("python", 10)


| | rank | node_id | name | pagerank |
|---|---|---|---|---|
| 0 | 1 | 3918 | Python (programming language) | 6e-05 |
| 1 | 2 | 112275 | Monty Python | 1e-05 |
| 2 | 3 | 112276 | Monty Python's Flying Circus | 9e-06 |
| 3 | 4 | 185388 | Pythonidae | 5e-06 |
| 4 | 5 | 113410 | Monty Python and the Holy Grail | 3e-06 |
| 5 | 6 | 113411 | Monty Python's Life of Brian | 2e-06 |
| 6 | 7 | 187085 | Burmese Python | 2e-06 |
| 7 | 8 | 171945 | Python (mythology) | 2e-06 |
| 8 | 9 | 4103 | CPython | 2e-06 |
| 9 | 10 | 187079 | Python reticulatus | 1e-06 |
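One caveat with `search`: `Series.str.contains` treats its argument as a regular expression by default. A query containing characters such as `(` or `+` is therefore either rejected or matched as a pattern rather than as literal text. A minimal sketch of a literal-match variant follows, passing `regex=False`; `search_literal`, `toy_names` and `toy_p` are hypothetical stand-ins for the notebook's `search`, `names` and `p`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's `names` and `p`.
toy_names = pd.DataFrame({"Name": ["C++", "C (programming language)", "Python", None]})
toy_p = np.array([0.4, 0.3, 0.2, 0.1])

def search_literal(query, k=10):
    """Case-insensitive literal substring match on titles, ranked by descending score."""
    q = query.lower()
    # regex=False: the query is matched as plain text, not as a pattern.
    mask = toy_names["Name"].str.lower().str.contains(q, regex=False, na=False).to_numpy()
    idx = np.where(mask)[0]
    idx = idx[np.argsort(-toy_p[idx])][:k]
    return pd.DataFrame({
        "rank": np.arange(1, len(idx) + 1),
        "node_id": idx + 1,
        "name": toy_names["Name"].iloc[idx].to_numpy(),
        "pagerank": toy_p[idx],
    })

print(search_literal("c++"))
print(search_literal("(prog"))
```

The trade-off is that users lose the ability to pass an actual pattern. If both behaviors are wanted, a boolean parameter forwarded to `regex=` would be one way to expose the choice.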
