# Advantages in International Football

## Project Purpose

The purpose of this project is to explore how **FIFA ranking**, **population size**, and **match location** (Home / Away / Neutral ground) affect the outcome of international football matches.  
We aim to investigate whether there are measurable advantages associated with:

- Playing at home
- Having a larger population
- Holding a higher FIFA ranking

We will analyze how these factors influence:

- Win percentage
- Average points per match
- Goal difference

---

## Data Overview

Our analysis is based on **three datasets** that we have combined:

- **International matches** dataset (since 1992)
- **Country population** data
- **FIFA ranking** data at the time of each match

We have created **four versions** of the combined dataset:

1. **Original data** (full dataset)
2. **Without friendlies** (only competitive matches)
3. **Without neutral matches** (only matches played at home or away)
4. **Without friendlies and neutral matches** (pure competitive home/away games)

---

## Course of Action

1. **Data Cleaning**  
   Remove rows with missing population or FIFA ranking values so that every match has complete inputs.

2. **Data Splitting**  
   Create the four dataset versions for comparison purposes.

3. **Unsupervised Learning**  
   Use dimensionality reduction (PCA) and clustering (K-Means) to identify outliers and gain a better structural understanding of the data.

4. **Statistical Analysis**  
   Analyze basic statistics to assess:
   - Whether a home advantage exists
   - If the advantage varies between datasets

5. **Supervised Learning Models**  
   Build predictive models to forecast match outcomes based on match location, population, and FIFA ranking:
   - **Linear Regression** (predict points earned)
   - **Random Forest Regression** (predict points earned)
   - **Logistic Regression** (predict win vs non-win)

6. **Conclusions**  
   Summarize findings, focusing on:
   - The existence and strength of home advantage
   - The predictive power of population and ranking differences

---


While working on this project we used an AI tool (ChatGPT) to help improve our code base, write comments, and fix bugs, and as a partner for discussing ideas and results.  

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import os
import statsmodels.api as sm
from IPython.display import display

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error,
    r2_score,
    accuracy_score,
    confusion_matrix,
    classification_report
)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


populationCsvPath = 'CsvFilesExam/world_population.csv'
resultsCsvPath = 'CsvFilesExam/results.csv'
rankingCsvPath = 'CsvFilesExam/fifa_mens_rank.csv'
mergedDataCsvPath = 'CsvFilesExam/merged_data.csv'

populationData = pd.read_csv(populationCsvPath)
resultsData = pd.read_csv(resultsCsvPath)
rankingData = pd.read_csv(rankingCsvPath)



In [None]:
def load_or_create_merged_data(results_path, population_path, ranking_path, merged_output_path):
    """
    Loads merged data if it exists, otherwise creates it and saves to CSV.
    """
    if os.path.exists(merged_output_path):
        print(f"Loading existing merged dataset from {merged_output_path}...")
        mergedData = pd.read_csv(merged_output_path)
        
        mergedData['date'] = pd.to_datetime(mergedData['date'])
        
    else:
        print("Merged dataset not found, creating merged data...")
        
        # Load original data
        resultsData = pd.read_csv(results_path)
        populationData = pd.read_csv(population_path)
        rankingData = pd.read_csv(ranking_path)
        
        # Merge
        mergedData = add_population_and_ranking_to_results(resultsData, populationData, rankingData)
        
        # Save merged dataset
        mergedData.to_csv(merged_output_path, index=False)
        print(f"Merged dataset created and saved to {merged_output_path}")
    
    return mergedData

def add_population_and_ranking_to_results(results: pd.DataFrame, 
                                          world_population: pd.DataFrame, 
                                          ranking_data: pd.DataFrame) -> pd.DataFrame:
    # Parse dates
    results['date'] = pd.to_datetime(results['date'])
    ranking_data['date'] = pd.to_datetime(ranking_data['date'])

    # Extract year from match date
    results['year'] = results['date'].dt.year

    # Rename country column for easier matching
    world_population = world_population.rename(columns={'Country/Territory': 'Country'})

    # Define available years in world_population
    available_years = [2022, 2020, 2015, 2010, 2000, 1990, 1980, 1970]

    # Helper: latest available population year not after the match year
    # (falls back to the earliest available year for older matches)
    def closest_year(match_year):
        for y in sorted(available_years, reverse=True):
            if match_year >= y:
                return y
        return min(available_years)

    # Apply closest year
    results['population_year'] = results['year'].apply(closest_year)

    # Prepare population data
    pop_data = {}
    for _, row in world_population.iterrows():
        pop_data[row['Country']] = {year: row.get(f'{year} Population') for year in available_years}

    # Function to get population
    def get_population(team, year):
        country_data = pop_data.get(team)
        if country_data:
            return country_data.get(year)
        return None

    results['home_population'] = results.apply(lambda row: get_population(row['home_team'], row['population_year']), axis=1)
    results['away_population'] = results.apply(lambda row: get_population(row['away_team'], row['population_year']), axis=1)

    # ---------------------------------------
    # Add home_ranking and away_ranking
    # ---------------------------------------

    # Prepare: for easier lookup, sort ranking_data
    ranking_data = ranking_data.sort_values('date')

    def find_latest_ranking(team, match_date):
        team_rankings = ranking_data[ranking_data['team'] == team]
        team_rankings = team_rankings[team_rankings['date'] <= match_date]
        if not team_rankings.empty:
            return team_rankings.iloc[-1]['rank']
        else:
            return None

    # Apply for home and away teams
    results['home_ranking'] = results.apply(lambda row: find_latest_ranking(row['home_team'], row['date']), axis=1)
    results['away_ranking'] = results.apply(lambda row: find_latest_ranking(row['away_team'], row['date']), axis=1)

    # Clean up helper columns
    results = results.drop(columns=['year', 'population_year'])

    return results


def remove_nan_population_rows(results: pd.DataFrame) -> pd.DataFrame:
    # Remove rows where home_population or away_population is NaN
    results = results.dropna(subset=['home_population', 'away_population'])
    return results

def remove_nan_fifa_ranking_rows(results: pd.DataFrame) -> pd.DataFrame:
    # Remove rows where home_ranking or away_ranking is NaN
    results = results.dropna(subset=['home_ranking', 'away_ranking'])
    return results    
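
The row-wise `find_latest_ranking` lookup above filters the full ranking table once per match, which is likely why building the merged dataset is slow. One possible alternative, shown here only as a sketch on made-up toy rows, is `pandas.merge_asof`, which joins each match to the most recent ranking row on or before the match date. The toy frames and rank values below are illustrative assumptions, not values from our CSV files.

```python
import pandas as pd

# Toy stand-ins for the results and ranking tables (illustrative values only)
matches = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-10', '2020-03-01', '2020-03-05']),
    'home_team': ['Denmark', 'England', 'Denmark'],
})
rankings = pd.DataFrame({
    'date': pd.to_datetime(['2019-12-01', '2020-02-01', '2020-02-01']),
    'team': ['Denmark', 'Denmark', 'England'],
    'rank': [16, 15, 4],
})

# merge_asof requires both frames to be sorted by the 'on' key
matches = matches.sort_values('date')
rankings = rankings.sort_values('date')

# For each match, take the latest ranking of the home team dated <= match date
with_rank = pd.merge_asof(
    matches,
    rankings.rename(columns={'team': 'home_team', 'rank': 'home_ranking'}),
    on='date',
    by='home_team',
    direction='backward',
)
print(with_rank)
```

Repeating the call with `away_team` in place of `home_team` would give `away_ranking` in the same way.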

  

In [None]:
# Building the merged dataset from scratch is slow (row-wise ranking lookups);
# later runs load the cached CSV instead.
print('Creating merged data')
mergedData = load_or_create_merged_data(resultsCsvPath, populationCsvPath, rankingCsvPath, mergedDataCsvPath)
print(mergedData.describe())

In [None]:
# ─────────────── PCA on mergeddata  ───────────────

# 1. Set up the scaler and the PCA model
scaler = StandardScaler()
pca    = PCA(n_components=2, random_state=42)

# 2. Clean NaNs
mergedData = remove_nan_population_rows(mergedData).copy()
mergedData = remove_nan_fifa_ranking_rows(mergedData).copy()

# 3. Difference features
mergedData['population_difference'] = mergedData['home_population'] - mergedData['away_population']
mergedData['ranking_difference']    = mergedData['away_ranking']  - mergedData['home_ranking']

# 4. Feature matrix & scaling
feature_cols = [
    'home_score', 'away_score',
    'home_population', 'away_population',
    'home_ranking', 'away_ranking',
    'population_difference', 'ranking_difference'
]
X = mergedData[feature_cols]
X_scaled = scaler.fit_transform(X)  # standardize each feature to zero mean and unit variance

# 5. PCA and plot
X_pca = pca.fit_transform(X_scaled)  # project onto the first two principal components
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5)
plt.title("PCA on mergeddata (cleaned & with difference features)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

# 6. Explained variance
print("Explained variance ratio:", pca.explained_variance_ratio_)
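
The scatter plot and explained variance ratio say how much variation the two components capture, but not which features drive them. A minimal sketch of inspecting the component loadings is shown below. It runs on randomly generated stand-in data, so the column values and the resulting numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the match features (random numbers, illustration only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'home_ranking': rng.integers(1, 200, size=200),
    'away_ranking': rng.integers(1, 200, size=200),
    'home_score': rng.integers(0, 5, size=200),
    'away_score': rng.integers(0, 5, size=200),
})
toy['ranking_difference'] = toy['away_ranking'] - toy['home_ranking']

toy_scaled = StandardScaler().fit_transform(toy)
toy_pca = PCA(n_components=2, random_state=42).fit(toy_scaled)

# Rows are principal components, columns are the original features;
# large absolute values mark the features that dominate a component
loadings = pd.DataFrame(toy_pca.components_, columns=toy.columns, index=['PC1', 'PC2'])
print(loadings.round(2))
print("Cumulative explained variance:", toy_pca.explained_variance_ratio_.cumsum())
```

On the real data the same table could be built from `pca.components_` and `feature_cols`.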


In [None]:
# ─────────────── K-Means with Elbow & Silhouette Analysis ───────────────
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# 1. Assemble feature matrix & scale
feature_cols = [
    'home_score', 'away_score',
    'home_population', 'away_population',
    'home_ranking', 'away_ranking',
    'population_difference', 'ranking_difference'
]
X = mergedData[feature_cols]
X_scaled = scaler.transform(X)

# 2. Compute inertia and silhouette for k = 2…10
Ks = range(2, 11)
inertias = []
sil_scores = []
for k_ in Ks:
    km_ = KMeans(n_clusters=k_, random_state=42).fit(X_scaled)
    inertias.append(km_.inertia_)
    sil_scores.append(silhouette_score(X_scaled, km_.labels_))

# 3. Plot both metrics
fig, ax1 = plt.subplots(figsize=(8, 4))
ax1.plot(Ks, inertias, '-o', label='Inertia')
ax1.set_xlabel('Number of clusters k')
ax1.set_ylabel('Inertia', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.plot(Ks, sil_scores, '-o', color='tab:orange', label='Silhouette Score')
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

ax1.set_xticks(Ks)
fig.suptitle('Elbow Method & Silhouette Analysis')
fig.tight_layout()
plt.show()

# 4. Pick your best k (e.g. where the “elbow” occurs and/or silhouette peaks)
best_k = 5

# 5. Fit final K-Means and plot clusters
km = KMeans(n_clusters=best_k, random_state=42).fit(X_scaled)
labels = km.labels_
X_pca = pca.transform(X_scaled)

plt.figure(figsize=(8,6))
scatter = plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap='tab10', alpha=0.6)
plt.title(f"K-Means (k={best_k}) on PCA embedding")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.legend(*scatter.legend_elements(), title="Cluster")
plt.show()

# 6. Print the average silhouette score for the chosen k
print(f"Average silhouette score (k={best_k}):",
      silhouette_score(X_scaled, labels))
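
In the cell above `best_k` is set by hand after reading the plot. One way to make that choice reproducible is to take the k with the highest silhouette score. The sketch below does this on three well-separated synthetic blobs (toy data, not the match data). Note that the silhouette peak does not always agree with the elbow, so this is a starting point rather than a rule.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (toy data, not the match data)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Silhouette score for each candidate k
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels_k)

# Pick the k whose clustering has the highest average silhouette
best_k_auto = max(scores, key=scores.get)
print("Silhouette per k:", {k: round(s, 3) for k, s in scores.items()})
print("k with the highest silhouette:", best_k_auto)
```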


In [None]:

# 1. Remove Friendlies
mergedData_no_friendlies = mergedData[mergedData['tournament'] != 'Friendly'].copy()

# 2. Remove Neutral Matches
mergedData_no_neutral = mergedData[mergedData['neutral'] == False].copy()

# 3. Remove Friendlies AND Neutral Matches
mergedData_no_friendlies_no_neutral = mergedData[
    (mergedData['tournament'] != 'Friendly') & (mergedData['neutral'] == False)].copy()

# Optional: print quick summaries
print(f"Original dataset size: {mergedData.shape[0]} matches")
print(f"Without Friendlies: {mergedData_no_friendlies.shape[0]} matches")
print(f"Without Neutral Matches: {mergedData_no_neutral.shape[0]} matches")
print(f"Without Friendlies AND Neutral Matches: {mergedData_no_friendlies_no_neutral.shape[0]} matches")


In [None]:
def determine_result(row):
    if row['home_score'] > row['away_score']:
        return 'home_win'
    elif row['home_score'] < row['away_score']:
        return 'away_win'
    else:
        return 'draw'
        
def calculate_points(results: pd.DataFrame) -> pd.DataFrame:
    def home_points(row):
        if row['outcome'] == 'home_win':
            return 3
        elif row['outcome'] == 'draw':
            return 1
        else:
            return 0

    def away_points(row):
        if row['outcome'] == 'away_win':
            return 3
        elif row['outcome'] == 'draw':
            return 1
        else:
            return 0

    results['home_points'] = results.apply(home_points, axis=1)
    results['away_points'] = results.apply(away_points, axis=1)

    average_home_points = results['home_points'].mean()
    average_away_points = results['away_points'].mean()

    print(f"Average points earned by home teams: {average_home_points:.2f} per match")
    print(f"Average points earned by away teams: {average_away_points:.2f} per match")

    return results

def plot_home_advantage(results: pd.DataFrame) -> None:
    """
    Plots a pie chart of match outcomes and a histogram of goal differences
    for home teams based on the provided results DataFrame.
    """
    # Pie chart of match outcomes
    results['outcome'].value_counts().plot(kind='pie', autopct='%1.1f%%')
    plt.title('Home Team Match Outcomes')
    plt.ylabel('')  # Remove y-axis label for cleaner pie chart
    plt.show()

    # Histogram of home team goal difference
    plt.hist(results['goal_difference'], bins=30, edgecolor='black')
    plt.title('Distribution of Home Team Goal Difference')
    plt.xlabel('Goal Difference (Home - Away)')
    plt.ylabel('Number of Matches')
    plt.show()
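
`determine_result` and `calculate_points` work row by row through `DataFrame.apply`, which is easy to read but slow on large tables. A vectorized sketch of the same labelling and 3/1/0 points logic is shown below on three made-up matches. It is an alternative formulation, not the code used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Three made-up matches: a home win, an away win and a draw
toy = pd.DataFrame({'home_score': [2, 0, 1], 'away_score': [1, 3, 1]})

goal_diff = toy['home_score'] - toy['away_score']

# Label the outcome from the sign of the goal difference
toy['outcome'] = np.select(
    [goal_diff > 0, goal_diff < 0],
    ['home_win', 'away_win'],
    default='draw',
)

# 3 points for a win, 1 for a draw, 0 for a loss
toy['home_points'] = toy['outcome'].map({'home_win': 3, 'draw': 1, 'away_win': 0})
toy['away_points'] = toy['outcome'].map({'away_win': 3, 'draw': 1, 'home_win': 0})
print(toy)
```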
    

In [None]:
print("--- Original mergedData ---")
mergedData['outcome'] = mergedData.apply(determine_result, axis=1)
outcome_counts = mergedData['outcome'].value_counts(normalize=True) * 100
print(outcome_counts)

print("\n--- No Friendlies ---")
mergedData_no_friendlies['outcome'] = mergedData_no_friendlies.apply(determine_result, axis=1)
outcome_counts_no_friendlies = mergedData_no_friendlies['outcome'].value_counts(normalize=True) * 100
print(outcome_counts_no_friendlies)

print("\n--- No Neutral Matches ---")
mergedData_no_neutral['outcome'] = mergedData_no_neutral.apply(determine_result, axis=1)
outcome_counts_no_neutral = mergedData_no_neutral['outcome'].value_counts(normalize=True) * 100
print(outcome_counts_no_neutral)

print("\n--- No Friendlies and No Neutral Matches ---")
mergedData_no_friendlies_no_neutral['outcome'] = mergedData_no_friendlies_no_neutral.apply(determine_result, axis=1)
outcome_counts_no_friendlies_no_neutral = mergedData_no_friendlies_no_neutral['outcome'].value_counts(normalize=True) * 100
print(outcome_counts_no_friendlies_no_neutral)


In [None]:
print("--- Original mergedData ---")
mergedData['goal_difference'] = mergedData['home_score'] - mergedData['away_score']
average_goal_difference = mergedData['goal_difference'].mean()
print(f"Average Home Goal Difference: {average_goal_difference:.2f}")

print("\n--- No Friendlies ---")
mergedData_no_friendlies['goal_difference'] = mergedData_no_friendlies['home_score'] - mergedData_no_friendlies['away_score']
average_goal_difference_no_friendlies = mergedData_no_friendlies['goal_difference'].mean()
print(f"Average Home Goal Difference (No Friendlies): {average_goal_difference_no_friendlies:.2f}")

print("\n--- No Neutral Matches ---")
mergedData_no_neutral['goal_difference'] = mergedData_no_neutral['home_score'] - mergedData_no_neutral['away_score']
average_goal_difference_no_neutral = mergedData_no_neutral['goal_difference'].mean()
print(f"Average Home Goal Difference (No Neutral Matches): {average_goal_difference_no_neutral:.2f}")

print("\n--- No Friendlies and No Neutral Matches ---")
mergedData_no_friendlies_no_neutral['goal_difference'] = mergedData_no_friendlies_no_neutral['home_score'] - mergedData_no_friendlies_no_neutral['away_score']
average_goal_difference_no_friendlies_no_neutral = mergedData_no_friendlies_no_neutral['goal_difference'].mean()
print(f"Average Home Goal Difference (No Friendlies & No Neutral Matches): {average_goal_difference_no_friendlies_no_neutral:.2f}")


In [None]:
# Apply determine_result and calculate_points properly for all datasets

print("--- Original mergedData ---")
mergedData['outcome'] = mergedData.apply(determine_result, axis=1)
mergedData = calculate_points(mergedData)

print("\n--- No Friendlies ---")
mergedData_no_friendlies['outcome'] = mergedData_no_friendlies.apply(determine_result, axis=1)
mergedData_no_friendlies = calculate_points(mergedData_no_friendlies)

print("\n--- No Neutral Matches ---")
mergedData_no_neutral['outcome'] = mergedData_no_neutral.apply(determine_result, axis=1)
mergedData_no_neutral = calculate_points(mergedData_no_neutral)

print("\n--- No Friendlies and No Neutral Matches ---")
mergedData_no_friendlies_no_neutral['outcome'] = mergedData_no_friendlies_no_neutral.apply(determine_result, axis=1)
mergedData_no_friendlies_no_neutral = calculate_points(mergedData_no_friendlies_no_neutral)


In [None]:


# Plotting
print("--- Original mergedData ---")
plot_home_advantage(mergedData)

print("\n--- No Friendlies ---")
plot_home_advantage(mergedData_no_friendlies)

print("\n--- No Neutral Matches ---")
plot_home_advantage(mergedData_no_neutral)

print("\n--- No Friendlies and No Neutral Matches ---")
plot_home_advantage(mergedData_no_friendlies_no_neutral)


The data suggest the existence of a home advantage, which becomes more pronounced in competitive matches played on non-neutral grounds.
In such matches, there is an increase in home wins, points accumulated, and goal difference in favor of the home team.
Additionally, the likelihood of a draw appears to decrease slightly when the match is competitive.
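
The statements above rest on descriptive comparisons. One way to check whether the mean home goal difference differs from zero is a one-sample t-test. The sketch below uses made-up goal differences purely to show the call. On the real data one would pass a column such as `mergedData_no_friendlies_no_neutral['goal_difference']`. The test assumes roughly independent matches, which is a simplification.

```python
import numpy as np
from scipy import stats

# Toy goal differences (home minus away): made-up numbers for illustration
toy_goal_difference = np.array([1, 0, 2, -1, 3, 1, 0, 2])

# H0: the mean goal difference is 0 (no home advantage)
t_stat, p_value = stats.ttest_1samp(toy_goal_difference, popmean=0)
print(f"mean = {toy_goal_difference.mean():.2f}, t = {t_stat:.3f}, p = {p_value:.4f}")
```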

In [None]:
def analyze_population_impact(results: pd.DataFrame, scale_population: bool = True):
    """
    Analyzes the impact of population difference on home points.
    """

    # 1. Remove NaN populations
    results = remove_nan_population_rows(results.copy())

    # 2. Create population difference
    results['population_difference'] = results['home_population'] - results['away_population']
    if scale_population:
        results['population_difference'] = results['population_difference'] / 1_000_000

    # 3. Prepare data
    X = results[['population_difference']]
    y = results['home_points']

    # 4. Train model
    model = LinearRegression()
    model.fit(X, y)

    # 5. Statsmodels for p-value
    X_with_const = sm.add_constant(X)
    sm_model = sm.OLS(y, X_with_const).fit()

    # 6. Print model info
    print("Linear Regression Model:")
    print(f"Home Points = {model.coef_[0]:.6f} * Population Difference + {model.intercept_:.6f}")
    print(f"R² Score: {model.score(X, y):.4f}")
    print(f"P-value for Population Difference: {sm_model.pvalues['population_difference']:.6f}")

    # 7. Impact conclusion
    print("\nImpact Conclusion:")
    if scale_population:
        print(f"For every 1 million more people at home vs away, home teams earn {model.coef_[0]:.8f} additional points on average.")
    else:
        print(f"For every 1 person more, home teams earn {model.coef_[0]:.8f} additional points on average.")

    # 8. Plot
    predictions = model.predict(X)
    plt.scatter(X, y, alpha=0.3, label="Actual matches")
    plt.plot(X, predictions, color='red', label="Regression Line")
    plt.xlabel('Population Difference (Home - Away) [Millions]' if scale_population else 'Population Difference')
    plt.ylabel('Home Points')
    plt.title('Impact of Population Difference on Home Points')
    plt.legend()
    plt.show()

    return model

In [None]:
print("--- Original mergedData ---")
modelPop = analyze_population_impact(mergedData)

print("\n--- No Friendlies ---")
modelPop_no_friendlies = analyze_population_impact(mergedData_no_friendlies)

print("\n--- No Neutral Matches ---")
modelPop_no_neutral = analyze_population_impact(mergedData_no_neutral)

print("\n--- No Friendlies and No Neutral Matches ---")
modelPop_no_friendlies_no_neutral = analyze_population_impact(mergedData_no_friendlies_no_neutral)


In [None]:
def analyze_ranking_impact(results: pd.DataFrame):
    """
    Analyzes the impact of ranking difference on home points.
    """

    # 1. Remove NaN rankings
    results = remove_nan_fifa_ranking_rows(results.copy())

    # 2. Create ranking difference
    results['ranking_difference'] = results['away_ranking'] - results['home_ranking']
    # Note: a smaller rank number means a better team, so away - home is
    # positive when the home team is the better-ranked side.

    # 3. Prepare data
    X = results[['ranking_difference']]
    y = results['home_points']

    # 4. Train model
    model = LinearRegression()
    model.fit(X, y)

    # 5. Predictions
    predictions = model.predict(X)

    # 6. Print model info
    print("Linear Regression Model:")
    print(f"Home Points = {model.coef_[0]:.6f} * Ranking Difference + {model.intercept_:.6f}")
    print(f"R² Score: {model.score(X, y):.4f}")

    # 7. Impact explanation
    print("\nImpact:")
    print(f"For every 1 place better home ranking vs away, home teams earn {model.coef_[0]:.6f} additional points on average.")

    # 8. Plot
    plt.scatter(X, y, alpha=0.3, label="Actual matches")
    plt.plot(X, predictions, color='red', label="Regression Line")
    plt.xlabel('Ranking Difference (Away Rank - Home Rank)')
    plt.ylabel('Home Points')
    plt.title('Impact of Ranking Difference on Home Points')
    plt.legend()
    plt.show()

    return model


In [None]:
print("--- Original mergedData ---")
modelRank = analyze_ranking_impact(mergedData)

print("\n--- No Friendlies ---")
modelRank_no_friendlies = analyze_ranking_impact(mergedData_no_friendlies)

print("\n--- No Neutral Matches ---")
modelRank_no_neutral = analyze_ranking_impact(mergedData_no_neutral)

print("\n--- No Friendlies and No Neutral Matches ---")
modelRank_no_friendlies_no_neutral = analyze_ranking_impact(mergedData_no_friendlies_no_neutral)


In [None]:
def analyze_combined_impact(results: pd.DataFrame, scale_population: bool = True):
    """
    Analyzes the combined impact of population difference and ranking difference on home points.
    """

    # 1. Remove NaN
    results = results.dropna(subset=['home_population', 'away_population', 'home_ranking', 'away_ranking']).copy()

    # 2. Create features
    results['population_difference'] = results['home_population'] - results['away_population']
    results['ranking_difference'] = results['away_ranking'] - results['home_ranking']

    if scale_population:
        results['population_difference'] = results['population_difference'] / 1_000_000

    # 3. Prepare data
    X = results[['population_difference', 'ranking_difference']]
    y = results['home_points']

    # 4. Train model
    model = LinearRegression()
    model.fit(X, y)

    # 5. Statsmodels for p-value
    X_with_const = sm.add_constant(X)
    sm_model = sm.OLS(y, X_with_const).fit()

    # 6. Print model info
    print("Multiple Linear Regression Model:")
    print(f"Home Points = ({model.coef_[0]:.6f} * Population Difference) + ({model.coef_[1]:.6f} * Ranking Difference) + {model.intercept_:.6f}")
    print(f"R² Score: {model.score(X, y):.4f}")
    print(f"P-value for Population Difference: {sm_model.pvalues['population_difference']:.6f}")
    print(f"P-value for Ranking Difference: {sm_model.pvalues['ranking_difference']:.6f}")

    # 7. Impact conclusion
    print("\nImpact Conclusion:")
    if scale_population:
        print(f"For every 1 million more people (home vs away), home teams earn {model.coef_[0]:.8f} additional points.")
    else:
        print(f"For every 1 person more (home vs away), home teams earn {model.coef_[0]:.8f} additional points.")
    
    print(f"For every 1 place better home ranking vs away, home teams earn {model.coef_[1]:.6f} additional points.")

    # 8. 3D Plot
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(results['population_difference'], results['ranking_difference'], y, alpha=0.3, label='Actual matches')

    x_surf, y_surf = np.meshgrid(
        np.linspace(results['population_difference'].min(), results['population_difference'].max(), 100),
        np.linspace(results['ranking_difference'].min(), results['ranking_difference'].max(), 100)
    )
    z_surf = model.intercept_ + model.coef_[0] * x_surf + model.coef_[1] * y_surf

    ax.plot_surface(x_surf, y_surf, z_surf, color='red', alpha=0.5)
    ax.set_xlabel('Population Difference (Millions)' if scale_population else 'Population Difference')
    ax.set_ylabel('Ranking Difference')
    ax.set_zlabel('Home Points')
    ax.set_title('3D Impact of Population and Ranking Differences')
    plt.show()

    return model

In [None]:
print("--- Original mergedData ---")
modelCombined = analyze_combined_impact(mergedData)

print("\n--- No Friendlies ---")
modelCombined_no_friendlies = analyze_combined_impact(mergedData_no_friendlies)

print("\n--- No Neutral Matches ---")
modelCombined_no_neutral = analyze_combined_impact(mergedData_no_neutral)

print("\n--- No Friendlies and No Neutral Matches ---")
modelCombined_no_friendlies_no_neutral = analyze_combined_impact(mergedData_no_friendlies_no_neutral)


In [None]:
def prepare_classification_target(results: pd.DataFrame) -> pd.DataFrame:
    
    results = results.copy()
    results['home_win'] = results['outcome'].apply(lambda x: 1 if x == 'home_win' else 0)
    return results

    
def run_logistic_regression(results: pd.DataFrame):
    """
    Runs Logistic Regression to predict home win based on population and ranking differences.
    Includes a probability distribution plot.
    """

    # Prepare dataset
    results = results.dropna(subset=['population_difference', 'ranking_difference'])
    results = prepare_classification_target(results)
    X = results[['population_difference', 'ranking_difference']]
    y = results['home_win']

    # Train model
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)

    # Predict and evaluate
    predictions = model.predict(X)

    acc = accuracy_score(y, predictions)
    cm = confusion_matrix(y, predictions)
    report = classification_report(y, predictions)

    print("=== Logistic Regression Results ===")
    print(f"Accuracy: {acc:.4f}")
    print("Confusion Matrix:")
    print(cm)
    print("\nClassification Report:")
    print(report)

    # Predicted probabilities
    probs = model.predict_proba(X)[:, 1]  # probability of home win

    plt.figure(figsize=(8,6))
    plt.hist(probs, bins=30, edgecolor='black')
    plt.xlabel('Predicted Probability of Home Win')
    plt.ylabel('Number of Matches')
    plt.title('Logistic Regression: Predicted Probabilities')
    plt.grid()
    plt.show()

    return model
    

def run_random_forest_regression(results: pd.DataFrame):
    """
    Runs Random Forest Regression to predict home points based on population and ranking differences.
    Includes scatter plot of true vs predicted points and a visualization of a decision tree.
    """
    from sklearn.tree import plot_tree

    # Prepare dataset
    results = results.dropna(subset=['population_difference', 'ranking_difference'])
    X = results[['population_difference', 'ranking_difference']]
    y = results['home_points']

    # Train model
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Predict and evaluate
    predictions = model.predict(X)

    mse = mean_squared_error(y, predictions)
    r2 = r2_score(y, predictions)

    print("=== Random Forest Regression Results ===")
    print(f"Mean Squared Error: {mse:.4f}")
    print(f"R² Score: {r2:.4f}")

    # Feature importances
    feature_importances = model.feature_importances_
    for feat, importance in zip(X.columns, feature_importances):
        print(f"{feat}: {importance:.4f}")

    # --- Scatter plot: Actual vs Predicted ---
    plt.figure(figsize=(8,6))
    plt.scatter(y, predictions, alpha=0.5)
    plt.plot([0, 3], [0, 3], 'r--')  # ideal line
    plt.xlabel('Actual Home Points')
    plt.ylabel('Predicted Home Points')
    plt.title('Random Forest: Actual vs Predicted Home Points')
    plt.grid()
    plt.show()

    # --- Visualize one Decision Tree inside the Forest ---
    estimator = model.estimators_[0]  # Pick the first tree for visualization

    plt.figure(figsize=(20, 10))
    plot_tree(
        estimator,
        feature_names=X.columns,
        filled=True,
        rounded=True,
        max_depth=3,  # Only show first 3 levels to keep it readable
        fontsize=10
    )
    plt.title('Visualization of one Decision Tree inside Random Forest')
    plt.show()

    return model




In [None]:
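# --- Step 1: Make sure difference columns are created for all datasets ---

datasets = {
    'Original mergedData': mergedData,
    'No Friendlies': mergedData_no_friendlies,
    'No Neutral Matches': mergedData_no_neutral,
    'No Friendlies and No Neutral Matches': mergedData_no_friendlies_no_neutral
}

# Express population_difference in millions, matching analyze_combined_impact
# and the prediction inputs used below (the earlier PCA cell stored raw counts).
for data in datasets.values():
    data['population_difference'] = (data['home_population'] - data['away_population']) / 1_000_000
    data['ranking_difference'] = data['away_ranking'] - data['home_ranking']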



# --- Step 2: Run models for all datasets and store results ---

logistic_models = {}
random_forest_models = {}
logistic_accuracies = {}
random_forest_r2_scores = {}

for name, data in datasets.items():
    print(f"\n--- Logistic Regression: {name} ---")
    logistic_model = run_logistic_regression(data)
    logistic_models[name] = logistic_model

    print(f"\n--- Random Forest Regression: {name} ---")
    random_forest_model = run_random_forest_regression(data)
    random_forest_models[name] = random_forest_model
    
    # Recompute in-sample scores on the same rows the models were trained on
    X_log = data[['population_difference', 'ranking_difference']].dropna()
    y_log = prepare_classification_target(data.dropna(subset=['population_difference', 'ranking_difference']))['home_win']
    logistic_accuracies[name] = accuracy_score(y_log, logistic_model.predict(X_log))

    X_rf = data[['population_difference', 'ranking_difference']].dropna()
    y_rf = data.dropna(subset=['population_difference', 'ranking_difference'])['home_points']
    random_forest_r2_scores[name] = r2_score(y_rf, random_forest_model.predict(X_rf))


summary = pd.DataFrame({
    'Logistic Regression Accuracy': logistic_accuracies,
    'Random Forest Regression R² Score': random_forest_r2_scores
})

print("\n=== Summary Table ===")
print(summary)
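
The accuracy and R² values in the summary table are computed on the same rows the models were fitted on, so they may overstate how well the models generalize (a random forest in particular can fit its training data very closely). A minimal sketch of a held-out evaluation with `train_test_split` is shown below. The data are synthetic and the coefficient 0.03 is an arbitrary assumption used only to generate labels.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: home wins become more likely as ranking_difference grows
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    'population_difference': rng.normal(0, 50, size=n),
    'ranking_difference': rng.normal(0, 40, size=n),
})
win_probability = 1 / (1 + np.exp(-0.03 * toy['ranking_difference']))
toy['home_win'] = (rng.random(n) < win_probability).astype(int)

# Hold out 20% of the matches; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    toy[['population_difference', 'ranking_difference']],
    toy['home_win'],
    test_size=0.2,
    random_state=42,
    stratify=toy['home_win'],
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Train accuracy: {train_acc:.3f}  |  Test accuracy: {test_acc:.3f}")
```

The same split could be applied inside `run_logistic_regression` and `run_random_forest_regression` so that the summary table reports test-set scores.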


In [None]:
# Define the new match input
new_data = pd.DataFrame({
    'population_difference': [100],
    'ranking_difference': [25]
})

# Pull all models properly
model_sets = {
    'Original mergedData': (logistic_models['Original mergedData'], random_forest_models['Original mergedData'], modelCombined),
    'No Friendlies': (logistic_models['No Friendlies'], random_forest_models['No Friendlies'], modelCombined_no_friendlies),
    'No Neutral Matches': (logistic_models['No Neutral Matches'], random_forest_models['No Neutral Matches'], modelCombined_no_neutral),
    'No Friendlies and No Neutral Matches': (logistic_models['No Friendlies and No Neutral Matches'], random_forest_models['No Friendlies and No Neutral Matches'], modelCombined_no_friendlies_no_neutral)
}

# Run predictions for each model set
for name, (logistic_model, rf_model, combined_model) in model_sets.items():
    print(f"\n=== Predictions using model: {name} ===")
    
    # Logistic Regression prediction (Home Win)
    predicted_home_win = logistic_model.predict(new_data)
    print(f"Predicted outcome (1 = Home win, 0 = Not Home win): {predicted_home_win[0]}")
    
    # Random Forest Regression prediction (Home Points)
    predicted_home_points_rf = rf_model.predict(new_data)
    print(f"Predicted home points (Random Forest): {predicted_home_points_rf[0]:.4f}")
    
    # Combined Linear Regression prediction (Home Points)
    predicted_home_points_combined = combined_model.predict(new_data)
    print(f"Predicted home points (Combined Linear Regression): {predicted_home_points_combined[0]:.4f}")


In [None]:
# Define the match data
matches = {
    'Denmark vs England': pd.DataFrame({
        'population_difference': [-52.057728],
        'ranking_difference': [-18]
    }),
    'England vs Denmark': pd.DataFrame({
        'population_difference': [52.057728],
        'ranking_difference': [18]
    }),
    'Spain vs Germany': pd.DataFrame({
        'population_difference': [47.889958-84.075075],
        'ranking_difference': [8]
    }),
    'Germany vs Spain': pd.DataFrame({
        'population_difference': [84.075075-47.889958],
        'ranking_difference': [-8]
    }),
}

# Store results
prediction_results = []

# Loop over matches and model sets
for match_name, match_data in matches.items():
    for dataset_name, (logistic_model, rf_model, combined_model) in model_sets.items():
        predicted_home_win = logistic_model.predict(match_data)[0]
        predicted_home_points_rf = rf_model.predict(match_data)[0]
        predicted_home_points_linear = combined_model.predict(match_data)[0]

        prediction_results.append({
            'Match': match_name,
            'Dataset': dataset_name,
            'Predicted Home Win (1=Yes) (Logistic model)': int(predicted_home_win),
            'Home Points (Random Forest)': round(predicted_home_points_rf, 2),
            'Home Points (Linear Regression)': round(predicted_home_points_linear, 2)
        })

# Create DataFrame
summary_df = pd.DataFrame(prediction_results)

# Reorder columns a bit
summary_df = summary_df[['Match', 'Dataset', 
                         'Predicted Home Win (1=Yes) (Logistic model)', 
                         'Home Points (Random Forest)', 
                         'Home Points (Linear Regression)']]

# Show it nicely in Jupyter
display(summary_df)
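
The match inputs above are typed in by hand, which makes it easy to flip a sign. A small helper can build the feature row from the two teams' populations and ranks using the same conventions as the rest of the notebook (population: home minus away, in millions; ranking: away minus home). `build_match_features` is a hypothetical helper introduced here for illustration, and the ranks 8 and 16 are placeholders chosen only so that the difference equals 8.

```python
import pandas as pd

def build_match_features(home_population_millions, away_population_millions,
                         home_rank, away_rank):
    """Return a one-row feature frame using the notebook's sign conventions."""
    return pd.DataFrame({
        # Home minus away, in millions
        'population_difference': [home_population_millions - away_population_millions],
        # Away minus home, so a better-ranked home team gives a positive value
        'ranking_difference': [away_rank - home_rank],
    })

# Populations are the values used above; the ranks are placeholders (assumption)
spain_vs_germany = build_match_features(47.889958, 84.075075, home_rank=8, away_rank=16)
germany_vs_spain = build_match_features(84.075075, 47.889958, home_rank=16, away_rank=8)
print(spain_vs_germany)
print(germany_vs_spain)
```

The resulting frames have the same two columns as the entries in `matches`, so they could be passed to the fitted models' `predict` methods directly.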