# Notebook 3: Association Rules (Apriori)
## AI Project: World Cup (World Cups + FIFA Ranking)
**Group:** G02  
**Authors:** António Ferreira (no. 9657), Mafalda Barão (no. 20446), Ruben Dias (no. 23033), Gonçalo Gomes (no. 23039), João Morais (no. 23041)  
**Instructor:** Rui Fernandes  
**Date:** 2025-12-28

Objective: discover patterns of the form **IF (antecedent) THEN (consequent)** in transactions, where each transaction is one (team, year) record.


### Data and sources

This project uses **only public** football data:

- **World Cups / Matches** (`WorldCups.csv`, `WorldCupMatches.csv`), Kaggle: https://www.kaggle.com/datasets/abecklas/fifa-world-cup
- **FIFA World Ranking** (`fifa_ranking-2024-06-20.csv`), Kaggle: https://www.kaggle.com/datasets/cashncarry/fifaworldranking

> Note: the CSV files are in the repository's `data/` folder, so the notebook runs locally without changing any paths.


In [None]:
# Imports
from itertools import combinations

import numpy as np
import pandas as pd
from IPython.display import display

# Fixed seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)


### 1) Load and build the (team, year) dataset
We reuse the same logic as Notebook 2 to get one record per team and year.

In [None]:
DATA_DIR = "data"

matches = pd.read_csv(f"{DATA_DIR}/WorldCupMatches.csv")
cups = pd.read_csv(f"{DATA_DIR}/WorldCups.csv")
rank = pd.read_csv(f"{DATA_DIR}/fifa_ranking-2024-06-20.csv")

rank["rank_date"] = pd.to_datetime(rank["rank_date"], errors="coerce")

def norm_team(x):
    if pd.isna(x):
        return x
    return str(x).strip()

matches["Home Team Name"] = matches["Home Team Name"].map(norm_team)
matches["Away Team Name"] = matches["Away Team Name"].map(norm_team)

cups["Winner"] = cups["Winner"].map(norm_team)

rank["country_full"] = rank["country_full"].map(norm_team)
rank = rank.dropna(subset=["rank_date"]).copy()
rank = rank.sort_values(["country_full", "rank_date"]).rename(columns={"country_full": "team"})

home = matches[["Year","Stage","Home Team Name","Home Team Goals","Away Team Goals"]].copy()
home.columns = ["Year","Stage","team","goals_for","goals_against"]

away = matches[["Year","Stage","Away Team Name","Away Team Goals","Home Team Goals"]].copy()
away.columns = ["Year","Stage","team","goals_for","goals_against"]

long = pd.concat([home, away], ignore_index=True)

long = long.dropna(subset=["Year","team","goals_for","goals_against"]).copy()
long["Year"] = pd.to_numeric(long["Year"], errors="coerce").astype("Int64")
long["goals_for"] = pd.to_numeric(long["goals_for"], errors="coerce")
long["goals_against"] = pd.to_numeric(long["goals_against"], errors="coerce")
long = long.dropna(subset=["Year","goals_for","goals_against"]).copy()
long["Year"] = long["Year"].astype(int)

long["goal_diff"] = long["goals_for"] - long["goals_against"]
long["result"] = np.where(long["goal_diff"] > 0, "W", np.where(long["goal_diff"] < 0, "L", "D"))
long["points"] = np.select(
    [long["result"] == "W", long["result"] == "D", long["result"] == "L"],
    [3, 1, 0],
    default=0
)

def stage_level(stage):
    if pd.isna(stage):
        return np.nan
    s = str(stage).strip().lower()
    if s.startswith("group"):
        return 1
    if s in ["first round", "preliminary round"]:
        return 1
    if s == "round of 16":
        return 2
    if "quarter" in s:
        return 3
    if "semi" in s:
        return 4
    if "third" in s:  # also matches "match for third place" and "play-off for third place"
        return 5
    if s == "final":
        return 6
    return np.nan

def stage_label(level):
    if pd.isna(level):
        return "Unknown"
    level = int(level)
    return {
        1: "Group/1st round",
        2: "Round of 16",
        3: "Quarter-finals",
        4: "Semi-finals",
        5: "Third-place match",
        6: "Final",
    }.get(level, "Unknown")

long["stage_level"] = long["Stage"].map(stage_level)

team_year = long.groupby(["Year","team"], as_index=False).agg(
    games_played=("result","size"),
    wins=("result", lambda x: (x=="W").sum()),
    draws=("result", lambda x: (x=="D").sum()),
    losses=("result", lambda x: (x=="L").sum()),
    goals_for=("goals_for","sum"),
    goals_against=("goals_against","sum"),
    goal_diff=("goal_diff","sum"),
    points_earned=("points","sum"),
    max_stage_level=("stage_level","max"),
)
team_year["stage_reached"] = team_year["max_stage_level"].map(stage_label)

# Ranking snapshot: use the latest ranking published on or before 1 June of the tournament year
team_year["cutoff_date"] = pd.to_datetime(team_year["Year"].astype(str) + "-06-01")
team_year = team_year.sort_values(["cutoff_date","team"]).reset_index(drop=True)
rank_sorted = rank.sort_values(["rank_date","team"]).reset_index(drop=True)

team_year = pd.merge_asof(
    team_year,
    rank_sorted,
    left_on="cutoff_date",
    right_on="rank_date",
    by="team",
    direction="backward"
)

winners = cups[["Year","Winner"]].dropna().copy()
winners["Year"] = winners["Year"].astype(int)

def champion_before(team, year):
    return int(((winners["Year"] < year) & (winners["Winner"] == team)).any())

team_year["champion_before"] = team_year.apply(lambda r: champion_before(r["team"], r["Year"]), axis=1)

# Keep tournaments from 1994 onwards and drop rows that found no matching ranking entry
team_year = team_year[team_year["Year"] >= 1994].copy()
team_year = team_year.dropna(subset=["rank","total_points","confederation"]).copy()

print("team_year:", team_year.shape)
display(team_year[["Year","team","rank","total_points","confederation","stage_reached","wins","goal_diff","champion_before"]].head(10))


team_year: (174, 21)


| | Year | team | rank | total_points | confederation | stage_reached | wins | goal_diff | champion_before |
|---|---|---|---|---|---|---|---|---|---|
| 242 | 1994 | Argentina | 6.0 | 55.0 | CONMEBOL | Round of 16 | 2 | 2.0 | 1 |
| 243 | 1994 | Belgium | 34.0 | 40.0 | UEFA | Round of 16 | 2 | 0.0 | 0 |
| 244 | 1994 | Bolivia | 43.0 | 35.0 | CONMEBOL | Group/1st round | 0 | -3.0 | 0 |
| 245 | 1994 | Brazil | 1.0 | 59.0 | CONMEBOL | Final | 5 | 8.0 | 1 |
| 246 | 1994 | Bulgaria | 29.0 | 44.0 | UEFA | Third-place match | 3 | -1.0 | 0 |
| 247 | 1994 | Cameroon | 24.0 | 46.0 | CAF | Group/1st round | 0 | -8.0 | 0 |
| 248 | 1994 | Colombia | 18.0 | 50.0 | CONMEBOL | Group/1st round | 1 | -1.0 | 0 |
| 249 | 1994 | Germany | 2.0 | 59.0 | UEFA | Quarter-finals | 3 | 2.0 | 0 |
| 250 | 1994 | Greece | 32.0 | 41.0 | UEFA | Group/1st round | 0 | -10.0 | 0 |
| 251 | 1994 | Italy | 16.0 | 50.0 | UEFA | Final | 4 | 3.0 | 1 |
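
The ranking join in this cell relies on `pd.merge_asof` with `direction="backward"`, which attaches to each (team, year) row the most recent ranking published on or before the cutoff date. The sketch below isolates that behaviour on a tiny example; the dates and ranks in `toy_rank` are made up for illustration and are not taken from the Kaggle files.

```python
import pandas as pd

# Toy ranking history (hypothetical values, for illustration only)
toy_rank = pd.DataFrame({
    "team": ["Brazil", "Brazil", "Brazil", "Italy"],
    "rank_date": pd.to_datetime(["1994-04-19", "1994-05-17", "1994-07-21", "1994-05-17"]),
    "rank": [2, 1, 1, 16],
}).sort_values("rank_date")  # merge_asof needs the right key sorted

# One row per team, with the cutoff used in this notebook (1 June of the tournament year)
toy_team_year = pd.DataFrame({
    "team": ["Brazil", "Italy"],
    "cutoff_date": pd.to_datetime(["1994-06-01", "1994-06-01"]),
}).sort_values("cutoff_date")  # and the left key sorted as well

snapshot = pd.merge_asof(
    toy_team_year,
    toy_rank,
    left_on="cutoff_date",
    right_on="rank_date",
    by="team",             # match rankings only within the same team
    direction="backward",  # latest rank_date <= cutoff_date
)
print(snapshot[["team", "rank_date", "rank"]])
```

In this toy case the July entry for Brazil is ignored because it falls after the cutoff. A team with no ranking entry on or before the cutoff, or whose name is spelled differently in the two files, comes back with missing values and is removed by the later `dropna`.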


### 2) Build transactions
Apriori needs categorical items, so we **discretize** (bin) the numeric variables and create items such as `RankTier=Top10`.


In [None]:
def rank_tier(r):
    r = int(r)
    if r <= 10:
        return "Top10"
    if r <= 25:
        return "11-25"
    if r <= 50:
        return "26-50"
    return "51+"

df = team_year.copy()
df["RankTier"] = df["rank"].map(rank_tier)
df["PointsTier"] = pd.qcut(df["total_points"], q=4, labels=["Q1_low","Q2","Q3","Q4_high"])
df["GoalDiffSign"] = np.where(df["goal_diff"] > 0, "Pos", np.where(df["goal_diff"] < 0, "Neg", "Zero"))
df["Champion"] = np.where(df["champion_before"] == 1, "Yes", "No")
df["WinsTier"] = pd.cut(df["wins"], bins=[-1, 1, 3, 10], labels=["Low","Mid","High"])

transactions = []
for _, r in df.iterrows():
    t = [
        f"Confed={r['confederation']}",
        f"RankTier={r['RankTier']}",
        f"PointsTier={r['PointsTier']}",
        f"Stage={r['stage_reached']}",
        f"ChampionBefore={r['Champion']}",
        f"GoalDiff={r['GoalDiffSign']}",
        f"WinsTier={r['WinsTier']}",
    ]
    transactions.append(t)

print("N transactions:", len(transactions))
print("Example:", transactions[0])


N transactions: 174
Example: ['Confed=CONMEBOL', 'RankTier=Top10', 'PointsTier=Q1_low', 'Stage=Round of 16', 'ChampionBefore=Yes', 'GoalDiff=Pos', 'WinsTier=Mid']
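
To make the bin boundaries concrete, the sketch below applies the same `pd.cut` edges used for `WinsTier` to a few hand-picked win counts (illustrative inputs, not rows from the dataset). Because `pd.cut` uses right-closed intervals by default, the edges `[-1, 1, 3, 10]` correspond to 0-1 wins (`Low`), 2-3 wins (`Mid`) and 4-10 wins (`High`).

```python
import pandas as pd

# Hand-picked win counts, chosen to hit every bin edge
wins = pd.Series([0, 1, 2, 3, 4, 7])

# Same edges and labels as the WinsTier column in this notebook
tiers = pd.cut(wins, bins=[-1, 1, 3, 10], labels=["Low", "Mid", "High"])

for w, t in zip(wins, tiers):
    print(w, "->", t)
```

`PointsTier` works differently: `pd.qcut` derives its edges from the quartiles of `total_points` over all years pooled together. That appears to be why the example transaction pairs `RankTier=Top10` with `PointsTier=Q1_low`: Argentina's 1994 `total_points` of 55 falls in the lowest pooled quartile despite a rank of 6.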


### 3) Apriori (simple implementation)
If you have `mlxtend` installed, you can use it directly. Here we use a simple implementation, which is sufficient for a dataset of this size.
We then generate rules `A → B` (single-item consequent) and evaluate them with **support**, **confidence** and **lift**.
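
As a quick reference for the three metrics: support is the fraction of transactions containing every item of `A ∪ B`, confidence is `support(A ∪ B) / support(A)`, and lift is `confidence / support(B)`. The sketch below works them out by hand on five made-up transactions; the `toy` data and the `support` helper are hypothetical and exist only for this illustration.

```python
# Toy transactions (hypothetical, only to illustrate the three metrics)
toy = [
    {"RankTier=Top10", "GoalDiff=Pos"},
    {"RankTier=Top10", "GoalDiff=Pos"},
    {"RankTier=Top10", "GoalDiff=Neg"},
    {"RankTier=26-50", "GoalDiff=Neg"},
    {"RankTier=26-50", "GoalDiff=Pos"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

A = {"RankTier=Top10"}  # antecedent
B = {"GoalDiff=Pos"}    # consequent

supp_AB = support(A | B, toy)            # 2 of 5 transactions contain both
confidence = supp_AB / support(A, toy)   # (2/5) / (3/5)
lift = confidence / support(B, toy)      # confidence relative to B's base rate (3/5)

print(f"support={supp_AB:.3f} confidence={confidence:.3f} lift={lift:.3f}")
# support=0.400 confidence=0.667 lift=1.111
```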


In [None]:
def apriori_frequent_itemsets(transactions, min_support=0.1, max_len=3):
    """Return a dict {itemset (sorted tuple): support}."""
    n = len(transactions)

    # 1-itemsets
    item_counts = {}
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1

    freq = {}
    L1 = {(item,): c / n for item, c in item_counts.items() if (c / n) >= min_support}
    freq.update(L1)

    prev = set(L1.keys())
    k = 2

    while prev and k <= max_len:
        prev_list = sorted(prev)
        candidates = set()

        # join step
        for i in range(len(prev_list)):
            for j in range(i + 1, len(prev_list)):
                a = prev_list[i]
                b = prev_list[j]
                if a[:-1] == b[:-1]:
                    cand = tuple(sorted(set(a).union(b)))
                    if len(cand) == k:
                        # prune: every (k-1)-subset must itself be frequent
                        if all(tuple(sorted(sub)) in prev for sub in combinations(cand, k - 1)):
                            candidates.add(cand)

        # count
        cand_counts = {c: 0 for c in candidates}
        for t in transactions:
            tset = set(t)
            for c in candidates:
                if set(c).issubset(tset):
                    cand_counts[c] += 1

        Lk = {c: cnt / n for c, cnt in cand_counts.items() if (cnt / n) >= min_support}
        freq.update(Lk)
        prev = set(Lk.keys())
        k += 1

    return freq

def generate_rules(freq_itemsets, min_confidence=0.6):
    """Generate rules A -> B (B has a single item)."""
    rules = []
    for itemset, supp in freq_itemsets.items():
        if len(itemset) < 2:
            continue
        items = set(itemset)
        for conseq in itemset:
            A = tuple(sorted(items - {conseq}))
            B = (conseq,)
            if A in freq_itemsets:
                conf = supp / freq_itemsets[A]
                if conf >= min_confidence:
                    lift = conf / freq_itemsets[B] if B in freq_itemsets else np.nan
                    rules.append({
                        "antecedent": A,
                        "consequent": B,
                        "support": supp,
                        "confidence": conf,
                        "lift": lift
                    })
    rules = sorted(rules, key=lambda r: (-r["lift"], -r["confidence"], -r["support"]))
    return rules
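
The join and prune steps inside `apriori_frequent_itemsets` can be seen in isolation with the sketch below. It starts from a hypothetical set of frequent 2-itemsets over abstract items `A` to `D` (not items from this dataset) and builds the 3-item candidates the same way the loop above does.

```python
from itertools import combinations

# Frequent 2-itemsets from a hypothetical previous pass (sorted tuples)
prev = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
k = 3

candidates = set()
prev_list = sorted(prev)
for i in range(len(prev_list)):
    for j in range(i + 1, len(prev_list)):
        a, b = prev_list[i], prev_list[j]
        if a[:-1] == b[:-1]:  # join: both itemsets share the same (k-2)-prefix
            cand = tuple(sorted(set(a) | set(b)))
            # prune: keep the candidate only if every (k-1)-subset is frequent
            if len(cand) == k and all(sub in prev for sub in combinations(cand, k - 1)):
                candidates.add(cand)

print(candidates)
# {('A', 'B', 'C')}
```

Here `("B", "C", "D")` is produced by the join but discarded by the prune, because its subset `("C", "D")` is not frequent. Only the surviving candidates need to be counted against the transactions.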


### 4) Parameter tuning
We test several `min_support` values and check how many itemsets and rules each one yields. We then pick a compromise that produces interpretable rules.


In [None]:
support_grid = [0.05, 0.08, 0.10, 0.12]
min_conf = 0.65

rows = []
for ms in support_grid:
    freq = apriori_frequent_itemsets(transactions, min_support=ms, max_len=3)
    rules = generate_rules(freq, min_confidence=min_conf)
    rows.append({
        "min_support": ms,
        "min_confidence": min_conf,
        "n_itemsets": len(freq),
        "n_rules": len(rules)
    })

tuning = pd.DataFrame(rows)
display(tuning)


| | min_support | min_confidence | n_itemsets | n_rules |
|---|---|---|---|---|
| 0 | 0.05 | 0.65 | 376 | 286 |
| 1 | 0.08 | 0.65 | 216 | 187 |
| 2 | 0.10 | 0.65 | 157 | 139 |
| 3 | 0.12 | 0.65 | 117 | 100 |


### 5) Final rules (top by lift)
Recommended choice: `min_support=0.10` and `min_confidence=0.65` (adjust if you end up with too many or too few rules).

In [None]:
MIN_SUPPORT = 0.10
MIN_CONFIDENCE = 0.65

freq = apriori_frequent_itemsets(transactions, min_support=MIN_SUPPORT, max_len=3)
rules = generate_rules(freq, min_confidence=MIN_CONFIDENCE)

rules_df = pd.DataFrame(rules)
rules_df = rules_df.sort_values(["lift","confidence","support"], ascending=False)

# Keep only the "stronger" rules (lift >= 1.2) to ease interpretation
rules_strong = rules_df[rules_df["lift"] >= 1.2].head(20).copy()

# Pretty formatting: join item tuples into readable strings
rules_strong["antecedent"] = rules_strong["antecedent"].apply(lambda x: ", ".join(x))
rules_strong["consequent"] = rules_strong["consequent"].apply(lambda x: ", ".join(x))
rules_strong[["support","confidence","lift"]] = rules_strong[["support","confidence","lift"]].round(3)

display(rules_strong)


| | antecedent | consequent | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | Stage=Quarter-finals | WinsTier=Mid | 0.103 | 0.75 | 2.9 |
| 1 | WinsTier=High | GoalDiff=Pos | 0.132 | 0.958 | 2.647 |
| 2 | Stage=Quarter-finals | GoalDiff=Pos | 0.126 | 0.917 | 2.532 |
| 3 | ChampionBefore=Yes | RankTier=Top10 | 0.132 | 0.742 | 2.436 |
| 4 | Confed=UEFA, WinsTier=Mid | GoalDiff=Pos | 0.103 | 0.783 | 2.161 |
| 5 | WinsTier=Mid | GoalDiff=Pos | 0.19 | 0.733 | 2.025 |
| 6 | Confed=CAF, WinsTier=Low | Stage=Group/1st round | 0.109 | 0.905 | 1.968 |
| 7 | Confed=AFC, WinsTier=Low | GoalDiff=Neg | 0.103 | 1.0 | 1.955 |
| 8 | ChampionBefore=No, WinsTier=Mid | GoalDiff=Pos | 0.138 | 0.686 | 1.894 |
| 9 | PointsTier=Q1_low, WinsTier=Low | Stage=Group/1st round | 0.132 | 0.852 | 1.853 |


### 6) Export results


In [None]:
# --------------------------
# (Optional) Export results
# --------------------------
# Frequent itemsets (convert each itemset tuple to a comma-separated string)
itemsets_df = pd.DataFrame([
    {"itemset": ", ".join(sorted(k)), "support": v}
    for k, v in freq.items()
]).sort_values("support", ascending=False)

itemsets_df.to_csv("apriori_frequent_itemsets.csv", index=False)
rules_df.to_csv("apriori_rules_all.csv", index=False)
rules_strong.to_csv("apriori_rules_top.csv", index=False)

print("Saved: apriori_frequent_itemsets.csv, apriori_rules_all.csv, apriori_rules_top.csv")


### 7) Discussion

The rules obtained are consistent with football logic and help identify patterns linking performance (wins / goal difference) to the stage reached.

**Strongest rules observed:**
- `Stage=Quarter-finals → WinsTier=Mid` (confidence ≈ 0.75, lift ≈ 2.90): reaching the **quarter-finals** is associated with a **mid-range** number of wins (2 or 3, given the `WinsTier` bins), which makes sense because a team has to win matches to get there, especially in the group stage and/or the knockout rounds.
- `WinsTier=High → GoalDiff=Pos` (confidence ≈ 0.96, lift ≈ 2.65) and `Stage=Quarter-finals → GoalDiff=Pos` (confidence ≈ 0.92, lift ≈ 2.53): teams with many wins, and teams that go further, tend to finish with a **positive goal difference**.
- `ChampionBefore=Yes → RankTier=Top10` (confidence ≈ 0.74, lift ≈ 2.44): teams that have already been champions appear more often in the **Top 10 of the ranking**, which suggests historically consistent quality.

**Patterns linked to poor performance:**
- `Stage=Group/1st round → WinsTier=Low` (confidence = 1.00, lift ≈ 1.66; this rule ranks below the rows shown in the table above) and similar rules involving `GoalDiff=Neg` and `RankTier=26-50`: teams eliminated early tend to have **few wins** and a **negative goal difference**, which is also expected.

**Interpreting the metrics:**
- Rules with **lift > 1** indicate a positive association: antecedent and consequent occur together more often than would be expected if they were independent.  
  In this set, several rules have lift **> 2**, which suggests strong and informative relationships.
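
Since `generate_rules` computes lift as `confidence / support(consequent)`, the base rate of each consequent can be recovered from the rules table, which makes the lift values easier to read. The sketch below does this for three of the displayed rules, using the rounded values from the table and the 174 transactions reported earlier, so the results are approximate.

```python
N = 174  # number of transactions reported earlier in this notebook

# (antecedent, consequent, support, confidence, lift): rounded values from the rules table
rows = [
    ("Stage=Quarter-finals", "WinsTier=Mid", 0.103, 0.750, 2.900),
    ("WinsTier=High", "GoalDiff=Pos", 0.132, 0.958, 2.647),
    ("ChampionBefore=Yes", "RankTier=Top10", 0.132, 0.742, 2.436),
]

summary = []
for ante, cons, supp, conf, lift in rows:
    baseline = conf / lift    # implied support of the consequent on its own
    n_rule = round(supp * N)  # approximate number of (team, year) rows behind the rule
    summary.append((cons, baseline, n_rule))
    print(f"{ante} -> {cons}: baseline={baseline:.3f}, confidence={conf:.3f}, ~{n_rule} rows")
```

Read this way, a lift of about 2.9 means that roughly 75% of quarter-finalists fall in `WinsTier=Mid`, against roughly 26% of all teams. Each of these rules rests on only about 18 to 23 rows, which is worth keeping in mind when judging how stable they are.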

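**Parameter tuning:**
- Raising `min_support` (e.g. to 0.10-0.12) reduces the number of itemsets and rules, leaving only the most frequent and stable relationships. In the tuning table, going from 0.05 to 0.12 cuts the itemsets from 376 to 117 and the rules from 286 to 100.
- To obtain more "interesting" and less obvious rules, one option is to keep a moderate support and add filters such as `lift ≥ 1.5` and/or `confidence ≥ 0.75`. This trims the weaker rules, but note that the trivial rule `Stage=Group/1st round → WinsTier=Low` (confidence = 1.00, lift ≈ 1.66) would still pass both thresholds, so removing it requires a higher lift cutoff or explicitly excluding such near-definitional item pairs.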
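
The stricter filters mentioned above can be sketched as follows. To keep the example self-contained, `rules_demo` holds a handful of rows copied from the rules table (rounded values); in the notebook itself the same boolean mask would presumably be applied to `rules_df`. The names `rules_demo`, `MIN_LIFT` and `MIN_CONF` are introduced here only for illustration.

```python
import pandas as pd

# A few rows copied from the rules table in section 5 (rounded values)
rules_demo = pd.DataFrame(
    [
        ("Stage=Quarter-finals", "WinsTier=Mid", 0.103, 0.750, 2.900),
        ("WinsTier=High", "GoalDiff=Pos", 0.132, 0.958, 2.647),
        ("ChampionBefore=Yes", "RankTier=Top10", 0.132, 0.742, 2.436),
        ("WinsTier=Mid", "GoalDiff=Pos", 0.190, 0.733, 2.025),
        ("ChampionBefore=No, WinsTier=Mid", "GoalDiff=Pos", 0.138, 0.686, 1.894),
    ],
    columns=["antecedent", "consequent", "support", "confidence", "lift"],
)

MIN_LIFT = 1.5   # thresholds suggested in the discussion
MIN_CONF = 0.75

# Keep only rules that satisfy both thresholds
selected = rules_demo[
    (rules_demo["lift"] >= MIN_LIFT) & (rules_demo["confidence"] >= MIN_CONF)
]
print(selected.to_string(index=False))
```

On these five rows only the first two survive: the confidence threshold is the binding one, because every rule in the sample already has a lift well above 1.5.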

### Lessons learned

- In Apriori, the way continuous variables (rank, points, wins, goal difference) are **discretized** strongly shapes the rules that are found.
- There is a clear trade-off between **min_support/min_confidence** and the usefulness of the rules: low thresholds generate many rules (including redundant ones), while high thresholds can eliminate rare but relevant patterns.
- **Lift** helps filter for genuinely informative associations; even so, rules are **associations** (not causation) and should be validated with domain knowledge.