# FIFA 2026 Winner Prediction 
**By: Sahar Karimi** | CS 401 - Software Engineering

This notebook predicts the 2026 FIFA World Cup winner using only recent international match results (2022–2025).
The input data files are `recent_wc_matches.csv` (match results) and `worldcup_predictor_teams.csv` (qualified teams).
When the notebook runs, it produces `predictions_2026.csv` (filtered ranking used for prediction) and `predictions_2026_full.csv` (full unfiltered ranking).

For every team, we count how many games they won and how many games they played in total (both home and away). Then we calculate their win rate (wins ÷ total games).
We ignore teams that played fewer than 5 games, because a win rate computed from so few games is unreliable and can be misleading.
Among the remaining teams, we choose the one with the highest win rate and call it our prediction for the best team.
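
Written as a formula, the win rate described above is as follows. Note that, in the code in Step 2, only strict wins count, so a draw lowers the win rate in the same way a loss does.

$$
\text{win rate} = \frac{\text{home wins} + \text{away wins}}{\text{home games} + \text{away games}} \times 100\%
$$

For example, England appears in the Step 2 results with 9 wins in 11 games, which gives $9 / 11 \times 100\% \approx 81.8\%$.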

In [1]:
# We only need pandas 
# pandas will read CSVs and handle tables (DataFrames)
import pandas as pd


## Step 1: Load data and show samples

We load recent games and the list of qualified teams so we only make predictions about teams that are actually in the tournament.

In [2]:
# Read the CSV files into pandas DataFrames
matches = pd.read_csv('recent_wc_matches.csv')
teams_df = pd.read_csv('worldcup_predictor_teams.csv')

# Pick the team-name column ('team' if it exists, otherwise the first column)
# and build a set of qualified team names with surrounding whitespace removed
team_col = 'team' if 'team' in teams_df.columns else teams_df.columns[0]
qualified_teams = set(teams_df[team_col].astype(str).str.strip())

# Print info so we can verify data loaded correctly
print(f'Loaded {len(matches)} recent matches (rows in recent_wc_matches.csv)')
print(f'Found {len(qualified_teams)} qualified teams (from column: {team_col})')
print('\nSample rows from matches:')
display(matches[['date','home_team','away_team','home_score','away_score']].head())

Loaded 930 recent matches (rows in recent_wc_matches.csv)
Found 32 qualified teams (from column: Team)

Sample rows from matches:


| | date | home_team | away_team | home_score | away_score |
|---|---|---|---|---|---|
| 0 | 2022-01-27 | Jamaica | Mexico | 1 | 2 |
| 1 | 2022-01-27 | United States | El Salvador | 1 | 0 |
| 2 | 2022-01-27 | Honduras | Canada | 0 | 2 |
| 3 | 2022-01-27 | Costa Rica | Panama | 1 | 0 |
| 4 | 2022-01-27 | Lebanon | South Korea | 0 | 1 |
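
Because the two files are joined purely by team name, a team spelled differently in each file would silently end up with zero matches. The sketch below shows one possible sanity check. The helper `check_inputs` and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (made-up rows, not the real data)
matches = pd.DataFrame({
    'date': ['2022-01-27', '2022-01-27', '2022-02-01'],
    'home_team': ['Jamaica', 'United States', 'Mexico'],
    'away_team': ['Mexico', 'El Salvador', 'United States'],
    'home_score': [1, 1, 0],
    'away_score': [2, 0, 0],
})
qualified_teams = {'Mexico', 'United States', 'Korea Republic'}

def check_inputs(matches, qualified_teams):
    """Hypothetical helper: return (missing_columns, teams_without_matches)."""
    required = {'home_team', 'away_team', 'home_score', 'away_score'}
    missing_columns = sorted(required - set(matches.columns))
    if missing_columns:
        # Without the team columns we cannot check names, so stop here
        return missing_columns, []
    # Every team name that appears on either side of any match
    seen = set(matches['home_team']) | set(matches['away_team'])
    teams_without_matches = sorted(set(qualified_teams) - seen)
    return missing_columns, teams_without_matches

missing_columns, teams_without_matches = check_inputs(matches, qualified_teams)
print('Missing columns:', missing_columns)
print('Qualified teams with no matches:', teams_without_matches)
```

In this toy data, 'Korea Republic' is reported because the matches table never uses that spelling. A non-empty list is a hint to compare team names across the two files before trusting the win rates.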


## Step 2: Compute wins, matches, and win rates 

For each qualified team, we count how many games they played, how many games they won, and what percentage of their games were wins.

In [3]:
# Build a list of dictionaries with statistics for each qualified team
stats = []

for team in sorted(qualified_teams):
    # Select matches where this team was the home team and where it was the away team
    home = matches[matches['home_team'] == team]
    away = matches[matches['away_team'] == team]

    # Count wins at home and away separately, then add them
    home_wins = (home['home_score'] > home['away_score']).sum()
    away_wins = (away['away_score'] > away['home_score']).sum()
    total_matches = len(home) + len(away)
    total_wins = int(home_wins + away_wins)

    # Avoid division by zero when a team has no matches in the dataset
    win_rate = (total_wins / total_matches * 100) if total_matches > 0 else 0
    stats.append({'Team': team, 'Matches': total_matches, 'Wins': total_wins, 'Win_Rate': round(win_rate,1)})

# Convert to a DataFrame and sort by win rate (highest first)
stats = pd.DataFrame(stats).sort_values('Win_Rate', ascending=False).reset_index(drop=True)

print("Top qualified teams by win rate:")
display(stats.head(15))



Top qualified teams by win rate:


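| | Team | Matches | Wins | Win_Rate |
|---|---|---|---|---|
| 0 | Norway | 6 | 6 | 100.0 |
| 1 | England | 11 | 9 | 81.8 |
| 2 | Japan | 23 | 17 | 73.9 |
| 3 | France | 11 | 8 | 72.7 |
| 4 | Netherlands | 11 | 8 | 72.7 |
| 5 | Portugal | 11 | 8 | 72.7 |
| 6 | Italy | 7 | 5 | 71.4 |
| 7 | Austria | 7 | 5 | 71.4 |
| 8 | Morocco | 17 | 12 | 70.6 |
| 9 | Argentina | 29 | 19 | 65.5 |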
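
The per-team loop in this step can also be written with grouped pandas operations. The sketch below is an alternative formulation, not the notebook's own method. The function name `win_rates` and the toy teams A to D are made up for illustration, and ties are broken alphabetically, which is an added assumption.

```python
import pandas as pd

# Made-up matches for illustration only
matches = pd.DataFrame({
    'home_team': ['A', 'B', 'A', 'C'],
    'away_team': ['B', 'A', 'C', 'B'],
    'home_score': [2, 1, 0, 3],
    'away_score': [1, 1, 1, 0],
})
qualified_teams = {'A', 'B', 'D'}

def win_rates(matches, qualified_teams):
    """Hypothetical helper: matches, wins, and win rate per qualified team."""
    # Stack the home and away perspectives: one row per (team, match)
    home = pd.DataFrame({
        'Team': matches['home_team'],
        'Won': matches['home_score'] > matches['away_score'],
    })
    away = pd.DataFrame({
        'Team': matches['away_team'],
        'Won': matches['away_score'] > matches['home_score'],
    })
    long = pd.concat([home, away], ignore_index=True)

    # Count games and wins per team
    table = long.groupby('Team')['Won'].agg(Matches='count', Wins='sum')
    # Keep only qualified teams; teams with no matches get zeros
    table = table.reindex(sorted(qualified_teams), fill_value=0)
    # 0 / 0 yields NaN in pandas, which we map to a win rate of 0
    table['Win_Rate'] = (table['Wins'] / table['Matches'] * 100).fillna(0).round(1)
    table = table.rename_axis('Team').reset_index()
    # Highest win rate first; ties broken alphabetically (an assumption)
    return table.sort_values(['Win_Rate', 'Team'],
                             ascending=[False, True]).reset_index(drop=True)

result = win_rates(matches, qualified_teams)
print(result)
```

Team C plays in the toy matches but is not in `qualified_teams`, so it is dropped, while team D is qualified but has no matches and appears with zeros, mirroring the division-by-zero guard in the loop.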


## Step 3: Filter small samples and choose a predicted winner 

Teams that only played 1–2 games can end up with a misleadingly high (or low) win rate, so we set a minimum number of games required. After filtering out teams with too few games, we pick the team with the highest win rate.
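
To see how much the threshold can change the outcome, here is a small sketch on made-up numbers. The teams, their results, and the helper `pick_winner` are hypothetical; only the filtering rule and the fallback mirror what this step describes.

```python
import pandas as pd

# Made-up numbers for illustration only (not the notebook's real results)
stats = pd.DataFrame({
    'Team': ['Alpha', 'Beta', 'Gamma'],
    'Matches': [2, 6, 20],
    'Wins': [2, 5, 15],
})
stats['Win_Rate'] = (stats['Wins'] / stats['Matches'] * 100).round(1)

def pick_winner(stats, min_matches):
    """Hypothetical helper: top team by win rate among teams with enough matches."""
    candidates = stats[stats['Matches'] >= min_matches]
    if candidates.empty:
        # Fall back to the unfiltered table so there is always a pick
        candidates = stats
    return candidates.sort_values('Win_Rate', ascending=False).iloc[0]['Team']

for threshold in (1, 5, 10):
    print(f'MIN_MATCHES = {threshold}: {pick_winner(stats, threshold)}')
```

With these toy numbers, each threshold selects a different team, which shows that the prediction depends on the chosen minimum as well as on the match results.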

In [4]:
# Set a minimum matches filter
MIN_MATCHES = 5  # default: require at least 5 recent matches to be considered
print(f'Using MIN_MATCHES = {MIN_MATCHES}')

# Filter out teams with too few matches.
if MIN_MATCHES > 0:
    candidates = stats[stats['Matches'] >= MIN_MATCHES].reset_index(drop=True)
else:
    candidates = stats.copy()
 
# If the filter removes every team then we use the original list so we still have a team to pick
if len(candidates) == 0:
    print('No qualified teams meet the minimum-match threshold; falling back to unfiltered ranking')
    candidates = stats.copy()

print('\nTop candidate teams used for prediction (after filtering):')
display(candidates.head(10))

# Prediction: choose the top ranked team by win rate
if len(candidates) > 0:
    top = candidates.iloc[0]
    print('\n' + '='*40)
    print('üèÜ Predicted 2026 World Cup Winner')
    print('='*40)
    print(f"Predicted winner: {top['Team']}")
    print(f"Win Rate: {top['Win_Rate']}% (from {int(top['Matches'])} matches)")
else:
    print('No candidate available to predict.')


Using MIN_MATCHES = 5

Top candidate teams used for prediction (after filtering):


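| | Team | Matches | Wins | Win_Rate |
|---|---|---|---|---|
| 0 | Norway | 6 | 6 | 100.0 |
| 1 | England | 11 | 9 | 81.8 |
| 2 | Japan | 23 | 17 | 73.9 |
| 3 | France | 11 | 8 | 72.7 |
| 4 | Netherlands | 11 | 8 | 72.7 |
| 5 | Portugal | 11 | 8 | 72.7 |
| 6 | Italy | 7 | 5 | 71.4 |
| 7 | Austria | 7 | 5 | 71.4 |
| 8 | Morocco | 17 | 12 | 70.6 |
| 9 | Argentina | 29 | 19 | 65.5 |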



🏆 Predicted 2026 World Cup Winner
Predicted winner: Norway
Win Rate: 100.0% (from 6 matches)
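
The introduction names two output files, `predictions_2026.csv` and `predictions_2026_full.csv`, but the cell that writes them is not shown here. The sketch below is one plausible way to save them, assuming `DataFrame.to_csv` with `index=False`. The toy `stats` table and the temporary output folder are assumptions made so the example can run on its own.

```python
import os
import tempfile

import pandas as pd

# Made-up ranking standing in for the notebook's `stats` table
stats = pd.DataFrame({
    'Team': ['Alpha', 'Beta', 'Gamma'],
    'Matches': [2, 6, 20],
    'Wins': [2, 5, 15],
    'Win_Rate': [100.0, 83.3, 75.0],
})
MIN_MATCHES = 5
candidates = stats[stats['Matches'] >= MIN_MATCHES].reset_index(drop=True)

# Write into a temporary folder so the example does not overwrite real files
out_dir = tempfile.mkdtemp()
filtered_path = os.path.join(out_dir, 'predictions_2026.csv')
full_path = os.path.join(out_dir, 'predictions_2026_full.csv')

# index=False keeps the row numbers out of the CSV files
candidates.to_csv(filtered_path, index=False)
stats.to_csv(full_path, index=False)

print('Filtered ranking rows:', len(pd.read_csv(filtered_path)))
print('Full ranking rows:', len(pd.read_csv(full_path)))
```

In the notebook itself, the two paths would presumably point to the working directory instead of a temporary folder.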
