# Revised Quiniela Model
This notebook improves the prediction model for the 'Quiniela' game in three ways: richer feature engineering (recent team performance metrics and head-to-head data), more advanced models, and a more thorough evaluation process that includes model interpretation.

Let's dive into the implementation steps.

## Step 1: Data Preprocessing
We'll start by loading the dataset and performing initial data preprocessing. This includes handling missing values, encoding categorical data, and preparing features for modeling.

In [None]:

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
# Assuming dataset.csv is the data file (replace with actual data file path)
data = pd.read_csv("dataset.csv")

# Check for missing values and handle them
data = data.ffill()  # Forward fill as a basic approach (fillna(method='ffill') is deprecated in recent pandas); customize as needed

# Encode categorical features if any
# Example: data = pd.get_dummies(data, columns=['Category1', 'Category2'])

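# Define features (X) and target (y)
# Keep numeric/boolean columns only: StandardScaler cannot handle raw strings such as team names
X = data.drop(columns=['Match_Result']).select_dtypes(include=['number', 'bool'])  # Assuming 'Match_Result' is the target column
# Encode the target as integers 0..n_classes-1, which XGBoost requires;
# label_encoder.classes_ maps the codes back to the original names
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['Match_Result'])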

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Data Preprocessing completed.")
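
The categorical encoding step is only a commented placeholder in the cell above. A minimal sketch of one way to handle it, assuming a toy frame whose values are made up for illustration: a `ColumnTransformer` one-hot encodes the team columns and scales the numeric column, and it is fitted on the training rows only so that no test-set statistics leak into the scaler.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; the values are illustrative assumptions, not real match data
toy = pd.DataFrame({
    "Home_Team": ["A", "B", "A", "C"],
    "Away_Team": ["B", "C", "C", "A"],
    "Goals_Scored": [2.0, 0.0, 1.0, 3.0],
})

preprocess = ColumnTransformer(
    transformers=[
        # Teams unseen during fitting are encoded as all zeros instead of raising
        ("teams", OneHotEncoder(handle_unknown="ignore"), ["Home_Team", "Away_Team"]),
        ("numeric", StandardScaler(), ["Goals_Scored"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the first three rows only, then apply the same transform to the held-out row
train, test = toy.iloc[:3], toy.iloc[3:]
train_features = preprocess.fit_transform(train)
test_features = preprocess.transform(test)

print(train_features.shape, test_features.shape)
```

The same transformer could be placed in front of a classifier in a scikit-learn `Pipeline`, so that cross-validation refits the encoder and scaler inside each fold.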


## Step 2: Feature Engineering
We'll introduce new features to improve model performance, including recent team performance metrics, head-to-head stats, and home/away-specific metrics.

In [None]:

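# Adding new features (example placeholders)
# Note: run this cell before X and y are defined in Step 1; otherwise the new columns never reach the models.
# The rolling windows below assume the rows are sorted chronologically.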

# Recent team performance: average goals scored/conceded over the previous 5 matches
# shift(1) excludes the current match, so the feature does not leak the result being predicted
data['Avg_Goals_Scored_Last_5'] = data.groupby('Team')['Goals_Scored'].transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
data['Avg_Goals_Conceded_Last_5'] = data.groupby('Team')['Goals_Conceded'].transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())

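# Head-to-head performance: home-win rate over the previous 5 games between the two teams
# Rolling windows need numeric input, so convert the result to a 0/1 indicator first
home_win = (data['Match_Result'] == 'Home Win').astype(float)
data['Head_to_Head_Last_5'] = home_win.groupby([data['Home_Team'], data['Away_Team']]).transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())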

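# Home/Away specific performance metrics, computed per team over its previous 5 home (or away) matches
# Rows where the team did not play at home (or away) are left as NaN
is_home = data['Home_Team'] == data['Team']
is_away = data['Away_Team'] == data['Team']
data['Home_Avg_Goals'] = data[is_home].groupby('Team')['Goals_Scored'].transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
data['Away_Avg_Goals'] = data[is_away].groupby('Team')['Goals_Scored'].transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())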

print("Feature Engineering completed.")
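
A rolling average that includes the current row also includes the match being predicted. The sketch below contrasts the two variants on a small made-up frame (a window of 2 instead of 5 keeps the numbers easy to check by hand); the column names `avg_with_current` and `avg_prior_only` are illustrative.

```python
import pandas as pd

# Toy data, assumed to be sorted chronologically within each team
toy = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B"],
    "Goals_Scored": [1, 3, 2, 0, 4],
})

# Window ends at the current match, so the current goals are part of the average
toy["avg_with_current"] = toy.groupby("Team")["Goals_Scored"].transform(
    lambda s: s.rolling(2, min_periods=1).mean()
)

# shift(1) moves every value one match later, so the window only sees earlier matches
toy["avg_prior_only"] = toy.groupby("Team")["Goals_Scored"].transform(
    lambda s: s.shift(1).rolling(2, min_periods=1).mean()
)

print(toy)
```

Each team's first match has no history, so `avg_prior_only` is NaN there; those rows need to be dropped or imputed before training.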


## Step 3: Model Training and Hyperparameter Tuning
We'll test models like Random Forest and XGBoost and tune hyperparameters for optimal performance.

In [None]:

# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Define model and parameters for tuning
rf = RandomForestClassifier(random_state=42)
# use_label_encoder is deprecated and ignored by recent XGBoost releases, so it is omitted here
xgb = XGBClassifier(eval_metric='mlogloss', random_state=42)

# Parameter grids
param_grid_rf = {'n_estimators': [50, 100, 150], 'max_depth': [10, 20, 30]}
param_grid_xgb = {'n_estimators': [50, 100], 'max_depth': [3, 5], 'learning_rate': [0.01, 0.1]}

# Grid search for Random Forest
grid_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train, y_train)
best_rf = grid_rf.best_estimator_

# Grid search for XGBoost
grid_xgb = GridSearchCV(estimator=xgb, param_grid=param_grid_xgb, cv=5, scoring='accuracy', n_jobs=-1)
grid_xgb.fit(X_train, y_train)
best_xgb = grid_xgb.best_estimator_

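print(f"Best Random Forest params: {grid_rf.best_params_} (CV accuracy: {grid_rf.best_score_:.3f})")
print(f"Best XGBoost params: {grid_xgb.best_params_} (CV accuracy: {grid_xgb.best_score_:.3f})")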
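
To see what a grid search records beyond the winning estimator, here is a small sketch on synthetic data from `make_classification`. The data, the reduced grid, and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration; only scikit-learn is used so that the example runs without XGBoost installed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class problem standing in for home win / draw / away win
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Deliberately small grid so the search finishes quickly
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# best_params_ and best_score_ summarize the winning combination;
# cv_results_ holds the mean cross-validated score for every combination tried
print(search.best_params_, round(search.best_score_, 3))
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
```

Looking at the full `cv_results_` table shows whether the best combination wins clearly or whether several settings perform about the same, in which case the simpler model may be preferable.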


## Step 4: Model Evaluation and Interpretation
We'll evaluate the models using accuracy, a classification report, and a confusion matrix. Feature importance and, where available, SHAP values can then be used for model interpretation.

In [None]:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Predictions using the best models
y_pred_rf = best_rf.predict(X_test)
y_pred_xgb = best_xgb.predict(X_test)

# Evaluation for Random Forest
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred_rf))

# Evaluation for XGBoost
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("XGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb))

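# Confusion matrix for the best model
# Take the tick labels from the fitted model so they match the sorted class order used by confusion_matrix;
# a hard-coded list such as ['Home Win', 'Draw', 'Away Win'] would mislabel the axes.
# If the target was integer-encoded, map the codes back to class names for display.
class_labels = best_xgb.classes_
conf_matrix = confusion_matrix(y_test, y_pred_xgb, labels=class_labels)  # using XGBoost as example
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)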
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for XGBoost')
plt.show()
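
The step description mentions feature importance, which the cell above does not compute. As a rough sketch, the block below compares a random forest's built-in impurity-based importances with scikit-learn's `permutation_importance` on synthetic data. This is a stand-in for illustration, not SHAP; the data and the `feature_i` names are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic three-class data; the feature names are hypothetical placeholders
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances come from the training data and sum to 1
impurity_importance = pd.Series(model.feature_importances_, index=feature_names)

# Permutation importance: drop in held-out accuracy when one feature is shuffled
perm = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
permutation_scores = pd.Series(perm.importances_mean, index=feature_names)

print(impurity_importance.sort_values(ascending=False))
print(permutation_scores.sort_values(ascending=False))
```

In the notebook itself, `X_train` is a NumPy array after scaling, so the column names would have to be taken from `X.columns` before the scaler is applied.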
