# Premier League Match Prediction Analysis

This notebook demonstrates the complete machine learning pipeline for predicting Premier League match outcomes using our custom classes and sample data.

## Project Overview
- **Data Source**: Sample Premier League match data
- **Target**: Predict match outcomes (Home Win, Draw, Away Win)
- **Models**: Random Forest (trained in this notebook); XGBoost and LightGBM are candidates for later experiments
- **Evaluation**: Accuracy, Precision, Recall, F1-Score

## 1. Import Required Libraries

Import essential libraries for data analysis, machine learning, and visualization.

In [None]:
# Standard library imports
import sys
import os
from pathlib import Path

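# Add the project root to the path so that the `src` package is importable
# (assumes the notebook runs from a subdirectory such as notebooks/)
sys.path.append(str(Path.cwd().parent))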

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Machine learning
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Custom modules
from src.data_preprocessing.data_loader import DataLoader
from src.model_training.trainer import ModelTrainer
from src.evaluation.evaluator import ModelEvaluator

# MLflow for experiment tracking
import mlflow

# Configure plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette("husl")

print("All libraries imported successfully!")

## 2. Load and Explore Data

Load the Premier League match data and perform exploratory data analysis.

In [None]:
# Initialize data loader
data_loader = DataLoader("../data/")

# Load raw data
raw_data = data_loader.load_raw_data()

# Display basic information about the dataset
print("Dataset Shape:", raw_data.shape)
print("\nColumn Names:")
print(raw_data.columns.tolist())
print("\nFirst 5 rows:")
print(raw_data.head())
print("\nDataset Info:")
raw_data.info()  # info() prints its summary directly and returns None
print("\nMissing Values:")
print(raw_data.isnull().sum())

In [None]:
# Basic statistics
print("Descriptive Statistics:")
print(raw_data.describe())

# Score distribution
if 'home_score' in raw_data.columns and 'away_score' in raw_data.columns:
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Home score distribution
    axes[0].hist(raw_data['home_score'], bins=range(0, 8), alpha=0.7, color='skyblue', edgecolor='black')
    axes[0].set_title('Home Team Score Distribution')
    axes[0].set_xlabel('Goals Scored')
    axes[0].set_ylabel('Frequency')
    
    # Away score distribution
    axes[1].hist(raw_data['away_score'], bins=range(0, 8), alpha=0.7, color='lightcoral', edgecolor='black')
    axes[1].set_title('Away Team Score Distribution')
    axes[1].set_xlabel('Goals Scored')
    axes[1].set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()

# Team performance analysis
teams = set(raw_data['home_team'].unique()) | set(raw_data['away_team'].unique())
print(f"\nNumber of teams in dataset: {len(teams)}")
print("Teams:", sorted(teams))

## 3. Data Preprocessing

Clean and preprocess the data using our custom DataLoader class.

In [None]:
# Preprocess the data
processed_data = data_loader.preprocess_data(raw_data)

print("Processed Data Shape:", processed_data.shape)
print("\nNew columns added:")
new_columns = set(processed_data.columns) - set(raw_data.columns)
print(new_columns)

print("\nProcessed Data Sample:")
print(processed_data.head())

# Check for any missing values after preprocessing
print("\nMissing values after preprocessing:")
print(processed_data.isnull().sum())

# Analyze the target variable (match results)
if 'result' in processed_data.columns:
    result_counts = processed_data['result'].value_counts()
    print("\nMatch Results Distribution:")
    print(result_counts)
    
    # Visualize result distribution
    plt.figure(figsize=(8, 6))
    colors = ['lightcoral', 'lightblue', 'lightgreen']
    bars = plt.bar(result_counts.index, result_counts.values, color=colors)
    plt.title('Match Results Distribution')
    plt.xlabel('Match Result')
    plt.ylabel('Count')
    
    # Add value labels on bars
    for bar, count in zip(bars, result_counts.values):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                str(count), ha='center', va='bottom')
    
    plt.show()
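
The `DataLoader` implementation is not shown in this notebook. As a rough sketch of what a `preprocess_data` step could compute, the hypothetical helper below derives the columns referenced later (`goal_difference`, `total_goals`, `month`, `result`) from the raw scores and date. The function name `derive_match_features`, the `H`/`D`/`A` label codes, and the column definitions are assumptions for illustration, not the project's actual code.

```python
import numpy as np
import pandas as pd


def derive_match_features(matches: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for DataLoader.preprocess_data (assumed behavior)."""
    out = matches.copy()
    out["date"] = pd.to_datetime(out["date"])

    # Score-based columns, from the home team's point of view
    out["goal_difference"] = out["home_score"] - out["away_score"]
    out["total_goals"] = out["home_score"] + out["away_score"]

    # Calendar month of the match (1-12)
    out["month"] = out["date"].dt.month

    # Assumed label codes: H = home win, D = draw, A = away win
    out["result"] = np.select(
        [out["goal_difference"] > 0, out["goal_difference"] < 0],
        ["H", "A"],
        default="D",
    )
    return out


# Made-up matches, only to show the call pattern
demo = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Liverpool"],
    "away_team": ["Manchester United", "Manchester City", "Tottenham"],
    "home_score": [2, 1, 0],
    "away_score": [1, 2, 0],
    "date": ["2024-01-01", "2024-02-02", "2024-03-03"],
})
demo_features = derive_match_features(demo)
print(demo_features[["goal_difference", "total_goals", "month", "result"]])
```

Because every derived column except `month` depends on the final score, a helper like this is suited to labelling historical matches rather than to building inputs for matches that have not been played.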

## 4. Feature Engineering

Analyze the engineered features and their relationships with match outcomes.

In [None]:
# Analyze engineered features
numeric_features = ['goal_difference', 'total_goals', 'month']
available_features = [col for col in numeric_features if col in processed_data.columns]

if available_features:
    # Feature correlation analysis
    corr_matrix = processed_data[available_features].corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, square=True)
    plt.title('Feature Correlation Matrix')
    plt.show()
    
    # Feature distributions by match result
    if 'result' in processed_data.columns:
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()
        
        for i, feature in enumerate(available_features):
            if i < len(axes):
                for result in processed_data['result'].unique():
                    subset = processed_data[processed_data['result'] == result][feature]
                    axes[i].hist(subset, alpha=0.7, label=f'Result: {result}', bins=10)
                
                axes[i].set_title(f'{feature} Distribution by Match Result')
                axes[i].set_xlabel(feature)
                axes[i].set_ylabel('Frequency')
                axes[i].legend()
        
        # Remove empty subplot
        if len(available_features) < len(axes):
            fig.delaxes(axes[-1])
        
        plt.tight_layout()
        plt.show()

# Goal difference analysis
if 'goal_difference' in processed_data.columns and 'result' in processed_data.columns:
    goal_diff_result = processed_data.groupby('result')['goal_difference'].agg(['mean', 'std', 'count'])
    print("\nGoal Difference Statistics by Result:")
    print(goal_diff_result)
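
The per-result statistics above hint at a caveat: if `result` is derived from the final score, the sign of `goal_difference` determines it exactly. A minimal check on a small made-up table, assuming the `H`/`D`/`A` label codes used in the inference section of this notebook:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the project's dataset
toy = pd.DataFrame({
    "goal_difference": [2, -1, 0, 1, -3, 0],
    "result": ["H", "A", "D", "H", "A", "D"],
})

# Rebuild the label from the sign of the goal difference alone
sign_to_result = {1: "H", 0: "D", -1: "A"}
rebuilt = np.sign(toy["goal_difference"]).map(sign_to_result)

# Share of rows where the rebuilt label equals the stored label
leak_rate = (rebuilt == toy["result"]).mean()
print(f"Result recovered from goal_difference alone: {leak_rate:.0%}")
```

When this rate is 100%, a model that receives `goal_difference` as an input is reading the answer rather than predicting it, so validation scores obtained that way say little about pre-match performance.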

## 5. Model Training

Train machine learning models using our custom ModelTrainer class.

In [None]:
# Split data into training and validation sets
train_data, val_data = data_loader.load_and_split(test_size=0.3)

print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")

# Initialize and train the model
print("\nTraining Random Forest model...")
trainer = ModelTrainer(model_type="random_forest")

# Start MLflow experiment
mlflow.set_experiment("premier_league_prediction_notebook")

# Train the model
try:
    model = trainer.train(train_data, val_data)
    print("Model training completed successfully!")
    
    if model is not None:
        # Display model information
        print(f"\nModel type: {type(model).__name__}")
        print(f"Number of estimators: {model.n_estimators}")
        print(f"Max depth: {model.max_depth}")
        
        # Feature importance (if available)
        if hasattr(model, 'feature_importances_'):
            # Assumed feature order; it must match the columns the trainer actually used
            feature_names = ['home_team', 'away_team', 'month', 'goal_difference', 'total_goals']
            available_features = [name for name in feature_names if name in train_data.columns]
            importances = model.feature_importances_
            
            # Build the table only when names and importances line up one-to-one
            if len(available_features) == len(importances):
                feature_importance_df = pd.DataFrame({
                    'feature': available_features,
                    'importance': importances
                }).sort_values('importance', ascending=False)
                
                print("\nFeature Importance:")
                print(feature_importance_df)
                
                # Plot feature importance
                plt.figure(figsize=(10, 6))
                sns.barplot(data=feature_importance_df, x='importance', y='feature')
                plt.title('Feature Importance')
                plt.xlabel('Importance')
                plt.tight_layout()
                plt.show()
    
except Exception as e:
    print(f"Error during training: {e}")
    model = None
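
`ModelTrainer` is a project class whose internals are not shown here. The sketch below illustrates one way a `train` step for `model_type="random_forest"` could look with scikit-learn: one-hot encode the team names, pass the month through, and fit a `RandomForestClassifier`. The helper name `fit_random_forest`, the feature choice, and the hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def fit_random_forest(train_df: pd.DataFrame) -> Pipeline:
    """Hypothetical trainer: encode categorical columns, then fit a forest."""
    categorical = ["home_team", "away_team"]
    numeric = ["month"]

    # One-hot encode team names; teams unseen during training are ignored
    encode = ColumnTransformer(
        [("teams", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )
    pipeline = Pipeline([
        ("encode", encode),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ])
    pipeline.fit(train_df[categorical + numeric], train_df["result"])
    return pipeline


# Tiny made-up training table, only to show the call pattern
toy_train = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Liverpool", "Arsenal", "Chelsea", "Liverpool"],
    "away_team": ["Chelsea", "Liverpool", "Arsenal", "Liverpool", "Arsenal", "Chelsea"],
    "month": [8, 9, 10, 11, 12, 1],
    "result": ["H", "D", "A", "H", "A", "D"],
})
toy_model = fit_random_forest(toy_train)
toy_predictions = toy_model.predict(toy_train[["home_team", "away_team", "month"]])
print(toy_predictions)
```

Only columns known before kickoff (`home_team`, `away_team`, `month`) are used as inputs here; that is a choice of this sketch, made so the example does not depend on the final score.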

## 6. Model Evaluation

Evaluate the trained model using our custom ModelEvaluator class.

In [None]:
# Evaluate the model
if model is not None and len(val_data) > 0:
    print("Evaluating model performance...")
    
    # Initialize evaluator
    evaluator = ModelEvaluator()
    
    # Evaluate model
    try:
        metrics = evaluator.evaluate(trainer, val_data)
        
        print("\nModel Performance Metrics:")
        print("-" * 40)
        for metric_name, value in metrics.items():
            print(f"{metric_name}: {value:.4f}")
        
        # Make predictions for detailed analysis
        predictions = trainer.predict(val_data)
        actual_results = val_data['result'] if 'result' in val_data.columns else []
        
        if len(predictions) > 0 and len(actual_results) > 0:
            # Create detailed results DataFrame
            results_df = pd.DataFrame({
                'Actual': actual_results,
                'Predicted': predictions,
                'Correct': actual_results == predictions
            })
            
            print("\nPrediction Results:")
            print(f"Total predictions: {len(predictions)}")
            print(f"Correct predictions: {sum(results_df['Correct'])}")
            print(f"Accuracy: {sum(results_df['Correct']) / len(predictions):.4f}")
            
            # Display some sample predictions
            print("\nSample Predictions:")
            sample_size = min(10, len(results_df))
            print(results_df.head(sample_size))
            
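            # Confusion matrix with an explicit label order so ticks match rows and columns
            result_map = {'A': 'Away Win', 'D': 'Draw', 'H': 'Home Win'}
            labels = sorted(set(actual_results) | set(predictions))
            tick_labels = [result_map.get(label, label) for label in labels]
            conf_matrix = confusion_matrix(actual_results, predictions, labels=labels)
            
            plt.figure(figsize=(8, 6))
            sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
                       xticklabels=tick_labels,
                       yticklabels=tick_labels)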
            plt.title('Confusion Matrix')
            plt.ylabel('Actual Result')
            plt.xlabel('Predicted Result')
            plt.show()
            
    except Exception as e:
        print(f"Error during evaluation: {e}")
        
else:
    print("No model available for evaluation or no validation data.")
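
The metrics returned by `ModelEvaluator.evaluate` depend on the project code. As a reference point, the four metrics named in the overview can be computed directly with scikit-learn. The helper name `summarize_metrics` and the choice of macro averaging are assumptions, and the labels below are made up.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def summarize_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
    }


# Made-up labels: H = home win, D = draw, A = away win
y_true = ["H", "H", "D", "A", "A", "H"]
y_pred = ["H", "D", "D", "A", "H", "H"]

toy_metrics = summarize_metrics(y_true, y_pred)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

Macro averaging weights the three outcomes equally, which keeps draws from being drowned out when they are the rarest class. Weighted averaging is a reasonable alternative when overall volume matters more.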

## 7. Model Inference

Use the trained model to make predictions on new data.

In [None]:
# Create sample new data for prediction
sample_new_data = pd.DataFrame({
    'home_team': ['Arsenal', 'Chelsea', 'Liverpool'],
    'away_team': ['Manchester United', 'Manchester City', 'Tottenham'],
    # Final scores are unknown before kickoff; features derived from them
    # (goal_difference, total_goals) would reveal the outcome if passed to the model
    'home_score': [2, 1, 3],
    'away_score': [1, 2, 1],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
    'season': ['2023-24', '2023-24', '2023-24']
})

print("Sample new data for prediction:")
print(sample_new_data)

# Preprocess the new data
processed_new_data = data_loader.preprocess_data(sample_new_data)
print("\nProcessed new data:")
print(processed_new_data)

# Make predictions
if model is not None:
    try:
        new_predictions = trainer.predict(processed_new_data)
        
        print("\nPredictions for new matches:")
        print("-" * 40)
        
        # Map label codes to readable names; unknown codes are printed unchanged
        result_map = {'H': 'Home Win', 'A': 'Away Win', 'D': 'Draw'}
        
        # Pair each input row with its prediction (zip stops at the shorter sequence)
        for (_, match), prediction in zip(sample_new_data.iterrows(), new_predictions):
            readable_prediction = result_map.get(prediction, prediction)
            print(f"{match['home_team']} vs {match['away_team']}: {readable_prediction}")
        
        # Create visualization of predictions
        if len(new_predictions) > 0:
            pred_counts = pd.Series(new_predictions).value_counts()
            
            plt.figure(figsize=(8, 6))
            colors = ['lightcoral', 'lightblue', 'lightgreen']
            bars = plt.bar(pred_counts.index, pred_counts.values, color=colors[:len(pred_counts)])
            plt.title('Predicted Results for New Matches')
            plt.xlabel('Predicted Result')
            plt.ylabel('Count')
            
            # Add value labels on bars
            for bar, count in zip(bars, pred_counts.values):
                plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                        str(count), ha='center', va='bottom')
            
            plt.show()
            
    except Exception as e:
        print(f"Error making predictions: {e}")
        
else:
    print("No trained model available for making predictions.")
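
Hard labels hide how confident the model is. Assuming the underlying estimator is a scikit-learn classifier (as the `n_estimators` and `feature_importances_` attributes used during training suggest), class probabilities are available through `predict_proba`. The sketch trains a throwaway forest on made-up data; the feature names `home_form` and `away_form` are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up numeric features that would be known before kickoff (assumed names)
X_train = pd.DataFrame({
    "home_form": [2.0, 0.5, 1.0, 2.5, 0.0, 1.5],
    "away_form": [0.5, 2.0, 1.0, 0.0, 2.5, 1.5],
})
y_train = ["H", "A", "D", "H", "A", "D"]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

X_new = pd.DataFrame({"home_form": [2.2], "away_form": [0.3]})

# Columns of predict_proba follow the order of forest.classes_
probabilities = pd.DataFrame(forest.predict_proba(X_new), columns=forest.classes_)
print(probabilities.round(2))
```

Reading the column order from `classes_` rather than hard-coding it avoids mislabelling probabilities when the label encoding changes.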

## Conclusion

This notebook demonstrates the complete machine learning pipeline for Premier League match prediction:

1. **Data Loading**: Successfully loaded and explored the sample match data
2. **Data Preprocessing**: Applied feature engineering including goal difference, total goals, and temporal features
3. **Model Training**: Trained a Random Forest classifier using our custom ModelTrainer class
4. **Model Evaluation**: Evaluated performance using multiple metrics and visualization
5. **Model Inference**: Made predictions on a small hand-built sample of new matches

### Key Takeaways:
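- A Random Forest can be trained and evaluated end to end on the sample match data
- Goal difference and total goals are computed from the final score, so they are useful for describing finished matches but are not available before kickoff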
- The modular design allows easy experimentation with different models
- MLflow integration enables experiment tracking and reproducibility

### Next Steps:
1. Collect more comprehensive data (player stats, team form, etc.)
2. Experiment with different algorithms (XGBoost, LightGBM, Neural Networks)
3. Implement more sophisticated features (rolling averages, team strength ratings)
4. Deploy the model as a web service for real-time predictions
5. Set up monitoring and retraining pipelines
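
For the rolling-average features mentioned in item 3 of the list above, one leakage-safe pattern is to shift each team's history by one match before averaging, so that a row only sees results from earlier games. A minimal sketch on made-up data; the column names and the window of 3 are assumptions:

```python
import pandas as pd

# Made-up goals scored by two teams across successive matches
history = pd.DataFrame({
    "team": ["Arsenal"] * 4 + ["Chelsea"] * 4,
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22"] * 2
    ),
    "goals_scored": [2, 0, 3, 1, 1, 1, 0, 4],
}).sort_values(["team", "date"])

# shift(1) drops the current match, so the average uses earlier matches only
history["goals_form"] = history.groupby("team")["goals_scored"].transform(
    lambda goals: goals.shift(1).rolling(window=3, min_periods=1).mean()
)
print(history)
```

Each team's first match has no history, so its `goals_form` is missing; how to fill that gap (league average, previous season, or dropping the row) is a modelling decision left open here.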

With these pieces in place, the pipeline can be extended step by step toward a production-ready ML system.