<a href="https://www.kaggle.com/code/dascient/sports-bets-winning-algorithm?scriptVersionId=226563445" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Sports Bets Winning Algorithm

Authors: [Marcus](https://www.linkedin.com/in/marcus-szczerbacki-220b0760/), [Corey](https://www.linkedin.com/in/coreyslocum/), [Cole](https://www.linkedin.com/in/william-c-wright-1614bb279/), [Don](https://www.linkedin.com/in/dontadaya/)

A [DaScient Proprietary Analytic Intelligence Artifact](https://www.dascient.com).

In [None]:
"""############################################################
#               MODEL & DATA QUALITY MONITORING            #
#           CONTINUOUS MONITORING & ALERTING GUIDE         #
############################################################

INTRODUCTION
============
This document details a strategy for continuous monitoring and alerting 
of model performance and data quality in a production environment. The 
goal is to keep your predictive system reliable and performant by 
detecting issues such as model drift, data anomalies, and system errors 
in near real time.

ARCHITECTURE OVERVIEW
=====================
1. Monitoring Infrastructure:
   - Metrics Collection: Use Prometheus to collect application, system,
     and custom metrics.
   - Visualization: Deploy Grafana dashboards for real-time visualization 
     of performance and data quality metrics.
   - Log Aggregation: Utilize the ELK stack (Elasticsearch, Logstash, Kibana) 
     or similar platforms (DataDog, Splunk) for centralized log analysis.
   - Distributed Tracing: Integrate Jaeger or Zipkin to track service latency 
     and dependencies.

2. Data Quality Monitoring:
   - Implement automated data validation using tools like Great Expectations 
     or Deequ.
   - Track key data metrics: means, medians, standard deviations, missing 
     value ratios, and distribution shifts.
   - Perform schema validation to ensure data integrity at every ingestion point.

3. Model Performance Monitoring:
   - Continuously compute key performance indicators (KPIs) such as:
       • Accuracy, Precision, Recall, F1 Score, ROC-AUC.
       • Confusion Matrix trends.
   - Monitor prediction latency, throughput, and error rates.
   - Detect model drift using distribution-distance measures (e.g., PSI, 
     KL divergence) and drift detection frameworks (e.g., Evidently AI).

4. Alerting Mechanisms:
   - Define threshold-based alerts for both model performance and data quality:
       • Trigger alerts on significant drops in model accuracy or surges in error rates.
       • Alert on data anomalies like sudden increases in missing values or 
         unexpected shifts in feature distributions.
   - Integrate alerting with systems like PagerDuty, OpsGenie, or Slack 
     for immediate notifications.
   - Use Prometheus Alertmanager and Grafana alerting for seamless integration.

DETAILED IMPLEMENTATION STEPS
=============================

Step 1: Set Up the Monitoring Infrastructure
--------------------------------------------
- Install and configure Prometheus to scrape metrics from your application.
- Deploy Grafana and create dashboards for:
    • System metrics (CPU, memory, network).
    • Custom model metrics (accuracy, prediction latency, error rates).
    • Data quality metrics (feature distribution, missing values).
- Configure a centralized logging system (ELK stack or similar) to collect 
  logs from all components.

Step 2: Instrument the Model and Data Pipelines
------------------------------------------------
- Add instrumentation to your production code using a Prometheus client library 
  (e.g., for Python, use `prometheus_client`).
- Expose endpoints for model metrics (e.g., /metrics) to be scraped by Prometheus.
- Integrate detailed logging (e.g., using Loguru or Python’s logging module) 
  to capture key events, errors, and data statistics.
- Implement data quality checks using Great Expectations:
    • Define expectations for data schema, value ranges, and distribution.
    • Automate periodic validation runs on incoming data batches.

Step 3: Establish Data Quality Checks and Alerts
--------------------------------------------------
- Develop automated tests for data quality using Great Expectations or Deequ.
- Monitor key data metrics such as:
    • Mean, median, standard deviation per feature.
    • Missing value ratios and outlier detection.
- Set up alerts in Prometheus/Grafana for:
    • Sudden deviations in feature distributions.
    • Schema violations or high missing value rates.
- Schedule these tests to run at regular intervals (e.g., via cron or Airflow DAGs).

Step 4: Monitor Model Performance Continuously
------------------------------------------------
- Periodically calculate performance metrics (accuracy, ROC-AUC, etc.) on 
  incoming data or shadow deployments.
- Use statistical tests to compare current performance against historical baselines.
- Implement drift detection:
    • Compute the Population Stability Index (PSI) on feature distributions.
    • Set thresholds to trigger alerts if PSI exceeds a set limit.
- Maintain a holdout validation set for regular performance benchmarking.

Step 5: Configure Alerting Systems
----------------------------------
- Define alert rules in Prometheus Alertmanager:
    • Example: Alert if model accuracy drops below a threshold (e.g., 80%).
    • Example: Alert if the rate of 500 errors exceeds a certain level.
- Integrate Alertmanager with your communication channels (Slack, PagerDuty, Email).
- Configure Grafana alerts to notify on abnormal trends in dashboards.
- Test alerts by simulating failure scenarios to verify notifications.

Step 6: Automate and Orchestrate Monitoring and Retraining
-----------------------------------------------------------
- Use orchestration tools such as Apache Airflow or Kubeflow Pipelines to:
    • Schedule data quality checks.
    • Trigger model retraining when drift is detected.
    • Update dashboards and alerting rules automatically.
- Containerize monitoring components using Docker and manage with Kubernetes.
- Use CI/CD pipelines to deploy updated monitoring configurations and models.

Step 7: Documentation and Incident Response
---------------------------------------------
- Maintain comprehensive documentation (this file, runbooks, and dashboards) 
  for all monitoring and alerting configurations.
- Create an incident response playbook:
    • Define roles and responsibilities.
    • Outline steps for troubleshooting and recovery.
    • Include escalation procedures and contact information.
- Regularly review incident logs and adjust monitoring thresholds as necessary.

Step 8: Continuous Improvement and Feedback Loop
--------------------------------------------------
- Periodically review monitoring dashboards and alert logs.
- Solicit feedback from operations and data science teams to refine metrics.
- Update model and data quality monitoring strategies based on new insights.
- Invest in automated anomaly detection and ML-based monitoring systems.

DEPLOYMENT & SCALABILITY CONSIDERATIONS
========================================
- Use Docker and Kubernetes for scalable deployment of monitoring services.
- Leverage Helm charts to manage Prometheus, Grafana, and Alertmanager deployments.
- Ensure secure endpoints with TLS encryption and proper authentication.
- Regularly back up monitoring configurations and logs.

SECURITY & COMPLIANCE
=====================
- Secure all monitoring endpoints with access controls.
- Encrypt sensitive data both at rest and in transit.
- Ensure compliance with data protection regulations (GDPR, HIPAA, etc.).
- Audit and log access to monitoring systems.

CONCLUSION
==========
By following these steps, you can build a robust, continuously monitored, and 
automated system for tracking model performance and data quality. This approach 
will help detect issues early, maintain high service quality, and facilitate 
rapid response to any operational anomalies.

For further improvements, consider integrating advanced analytics and 
machine-learning-driven anomaly detection on your monitoring data.

############################################################
#                    END OF INSTRUCTIONS                 #
############################################################"""
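
Step 4 of the guide names the Population Stability Index (PSI) as a drift signal. The sketch below is one way to compute it for a single numeric feature, assuming quantile bins taken from the baseline sample. The function name `population_stability_index`, the bin count, and the synthetic data are illustrative assumptions, not part of the original project.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample and a current sample of one feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from baseline quantiles, so each baseline bin holds similar mass.
    edges = np.unique(np.quantile(expected, np.linspace(0.0, 1.0, bins + 1)))
    # Use interior edges only: values outside the baseline range fall in the end bins.
    inner = edges[1:-1]
    n_bins = len(inner) + 1
    exp_counts = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_bins)
    # Clip to a small floor so empty bins do not produce log(0).
    exp_frac = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_frac = np.clip(act_counts / act_counts.sum(), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(1.0, 1.0, 5000)

psi_same = population_stability_index(baseline, same)
psi_shifted = population_stability_index(baseline, shifted)
print(f"PSI (same distribution):    {psi_same:.4f}")
print(f"PSI (mean shifted by 1 sd): {psi_shifted:.4f}")
```

Values around 0.1 and 0.25 are often quoted as rule-of-thumb cut-offs for moderate and large shifts; treat any such limit as an assumption to tune against your own data.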

# Syntax

In [None]:
#!/usr/bin/env python3
"""
Enhanced Continuous Monitoring & Alerting Script
-------------------------------------------------

This script sets up:
  - A Flask API with two endpoints:
      /predict : Simulated model prediction endpoint.
      /metrics : Exposes Prometheus metrics.
  - Background jobs that:
      • Evaluate model performance on a holdout set.
      • Check data quality using simulated logic (replace with real tests if needed).
  - Prometheus metrics collection using the prometheus_client.
  - Robust logging and error handling.
  - Automated scheduling via APScheduler.

Prerequisites:
--------------
pip install flask prometheus_client apscheduler

Usage:
------
python monitor.py
"""

import logging
import random
import time
from flask import Flask, Response, jsonify
from prometheus_client import start_http_server, Gauge, Counter, generate_latest, CONTENT_TYPE_LATEST
from apscheduler.schedulers.background import BackgroundScheduler

# Set up logging for detailed debug output.
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s: %(message)s')

# Define Prometheus metrics
MODEL_ACCURACY = Gauge('model_accuracy', 'Accuracy of the predictive model')
MODEL_LATENCY = Gauge('model_latency_ms', 'Prediction latency in milliseconds')
DATA_QUALITY_ISSUES = Gauge('data_quality_issues', 'Number of data quality issues detected')
TOTAL_PREDICTIONS = Counter('total_predictions', 'Total number of predictions made')

# Initialize Flask app
app = Flask(__name__)

@app.route('/metrics')
def metrics():
    """Expose Prometheus metrics."""
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/predict', methods=['POST'])
def predict():
    """
    Dummy prediction endpoint.
    Simulates prediction latency and returns a random binary prediction.
    """
    start_time = time.time()
    # Simulate prediction latency between 50 and 150 milliseconds.
    simulated_latency = random.uniform(0.05, 0.15)
    time.sleep(simulated_latency)
    prediction = random.choice([0, 1])
    TOTAL_PREDICTIONS.inc()
    elapsed_ms = (time.time() - start_time) * 1000
    MODEL_LATENCY.set(elapsed_ms)
    logging.info(f"Prediction made with latency {elapsed_ms:.2f} ms, prediction: {prediction}")
    return jsonify({'prediction': prediction})

def run_model_evaluation():
    """
    Simulated function to evaluate model performance on a holdout set.
    Replace this simulation with your actual model evaluation logic.
    """
    try:
        # Simulate model evaluation with a random accuracy between 80% and 95%
        simulated_accuracy = random.uniform(0.80, 0.95)
        MODEL_ACCURACY.set(simulated_accuracy)
        logging.info(f"Model evaluation complete: Accuracy = {simulated_accuracy:.4f}")
    except Exception as e:
        logging.error(f"Error during model evaluation: {e}")

def run_data_quality_check():
    """
    Simulated function to check data quality.
    Replace the simulation with real data quality tests (e.g., using Great Expectations).
    """
    try:
        # Simulate data quality check: random number of issues (0 to 5)
        simulated_issues = random.randint(0, 5)
        DATA_QUALITY_ISSUES.set(simulated_issues)
        if simulated_issues > 0:
            logging.warning(f"Data quality check detected {simulated_issues} issues.")
        else:
            logging.info("Data quality check passed with no issues.")
    except Exception as e:
        logging.error(f"Error during data quality check: {e}")

def start_scheduler():
    """
    Start background scheduler to periodically run model evaluation and data quality checks.
    """
    scheduler = BackgroundScheduler()
    scheduler.add_job(run_model_evaluation, 'interval', seconds=60, id='model_evaluation_job')
    scheduler.add_job(run_data_quality_check, 'interval', seconds=60, id='data_quality_job')
    scheduler.start()
    logging.info("Scheduler started with model evaluation and data quality check jobs.")

def main():
    # Start the Prometheus HTTP metrics server on port 8000.
    start_http_server(8000)
    logging.info("Prometheus metrics server started on port 8000.")
    
    # Start background jobs for continuous monitoring.
    start_scheduler()
    
    # Run the Flask application to serve prediction endpoint and metrics.
    app.run(host='0.0.0.0', port=5000)

if __name__ == '__main__':
    main()
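
The script above only exports metrics; the alert thresholds from Step 5 (for example, accuracy below 80%) would normally live in a Prometheus alert rule. As a rough in-process analogue, the sketch below fires only after several consecutive readings breach a threshold, which avoids alerting on a single noisy evaluation. The class name `ConsecutiveBreachAlert`, the `patience` parameter, and the readings are hypothetical. Note that the simulated accuracy in `run_model_evaluation` is drawn from 0.80 to 0.95, so a "below 0.80" rule would never fire against the simulation itself.

```python
from collections import deque

class ConsecutiveBreachAlert:
    """Fire only after `patience` consecutive readings fall below `threshold`."""

    def __init__(self, threshold=0.80, patience=3):
        self.threshold = threshold
        # Keep just the most recent `patience` breach flags.
        self.recent = deque(maxlen=patience)

    def observe(self, value):
        """Record one reading; return True when an alert should fire."""
        self.recent.append(value < self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

# Hypothetical accuracy readings from successive evaluation runs.
alert = ConsecutiveBreachAlert(threshold=0.80, patience=3)
readings = [0.91, 0.78, 0.85, 0.79, 0.77, 0.76]
fired = [alert.observe(r) for r in readings]
print(fired)
```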

In [2]:
# Project by Marcus, Corey, Cole, & Don.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/collegefootballpoll-com/weekly-picks-and-scores_fbs_101524-3.pdf
/kaggle/input/nfl-scores-and-betting-data/nfl_stadiums.csv
/kaggle/input/nfl-scores-and-betting-data/nfl_teams.csv
/kaggle/input/nfl-scores-and-betting-data/spreadspoke_scores.csv
/kaggle/input/nfl-scores-and-betting-data/spreadspoke.R


In [6]:
from IPython.display import clear_output

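# `import tabula` with read_pdf is provided by the tabula-py package (requires Java).
!pip install tabula-py
# pdfminer.six is the maintained fork that ships the pdfminer.* modules imported below.
!pip install pdfminer.six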
clear_output()

import pandas as pd
import seaborn as sns
import numpy as np
import re
import os
import matplotlib.pyplot as plt
import pdfminer as pdf2txt
import io
from pdfminer.pdfinterp import PDFResourceManager,PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

In [9]:
import tabula 
csv = "/kaggle/input/nfl-scores-and-betting-data/spreadspoke_scores.csv"
pdf = "/kaggle/input/collegefootballpoll-com/weekly-picks-and-scores_fbs_101524-3.pdf"

#output = tabula.convert_into(file, "converted.csv", output_format="csv", lattice=True, stream=False,  pages="all" )

In [None]:
#!/usr/bin/env python3
"""
Enhanced Sports Data Pipeline and Predictive Modeling Script
=============================================================

This script performs the following operations:
  1. Ingests NFL scores and betting data from a CSV file.
  2. Ingests College Football weekly picks and scores from a PDF.
  3. Cleans and augments both datasets with feature engineering.
     - For NFL data, it creates outcome and margin features.
     - For College Football data, it renames duplicate SCORE columns,
       converts dates/scores, and creates an outcome (favorite win) feature.
  4. Performs exploratory data analysis (EDA) with summary statistics and plots.
  5. Trains a basic predictive model (logistic regression) on NFL data.
  6. Logs detailed progress and errors for debugging and monitoring.

Data Sources:
  - NFL CSV: "/kaggle/input/nfl-scores-and-betting-data/spreadspoke_scores.csv"
  - College Football PDF: "/kaggle/input/collegefootballpoll-com/weekly-picks-and-scores_fbs_101524-3.pdf"
    (Expected columns: DATE, SCORE, FAVORITE, LINE, COMP, UNDERDOG, SCORE)

Prerequisites:
  pip install pandas numpy matplotlib seaborn scikit-learn tabula-py

Usage:
  python enhanced_pipeline.py
"""

import os
import sys
import time
import logging
import pickle
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For PDF extraction (ensure Java is installed for tabula)
import tabula

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Configure logging for detailed debug output.
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s: %(message)s')

# Define file paths (adjust if needed)
NFL_CSV_PATH = "/kaggle/input/nfl-scores-and-betting-data/spreadspoke_scores.csv"
CF_PDF_PATH = "/kaggle/input/collegefootballpoll-com/weekly-picks-and-scores_fbs_101524-3.pdf"

def process_nfl_csv(csv_path):
    """
    Process NFL scores and betting data from CSV.
    Returns a cleaned pandas DataFrame.
    """
    try:
        df = pd.read_csv(csv_path)
        logging.info(f"NFL CSV loaded with shape: {df.shape}")
        
        # Parse the game date if present. Column names follow the ones used
        # elsewhere in this notebook (schedule_date, score_home, score_away).
        if 'schedule_date' in df.columns:
            df['schedule_date'] = pd.to_datetime(df['schedule_date'], errors='coerce')
        else:
            logging.warning("No 'schedule_date' column found in NFL CSV.")
        
        # Create outcome and margin features if score columns exist.
        if 'score_home' in df.columns and 'score_away' in df.columns:
            # Basic cleaning: drop only rows without a final score. A blanket
            # dropna() would also discard every game with any other missing field.
            df = df.dropna(subset=['score_home', 'score_away']).copy()
            df['home_win'] = (df['score_home'] > df['score_away']).astype(int)
            df['margin'] = df['score_home'] - df['score_away']
            logging.info("Outcome and margin features created for NFL data.")
        else:
            logging.warning("Score columns not found in NFL CSV; skipping outcome creation.")
        
        return df
    except Exception as e:
        logging.error(f"Error processing NFL CSV: {e}")
        sys.exit(1)

def process_cf_pdf(pdf_path):
    """
    Process College Football picks and scores from a PDF.
    Returns a cleaned pandas DataFrame.
    
    Expected PDF columns (in order):
      DATE, SCORE, FAVORITE, LINE, COMP, UNDERDOG, SCORE
    We'll rename the duplicate SCORE columns to 'favorite_score' and 'underdog_score'.
    """
    try:
        # Extract all tables from the PDF (assumes the PDF is tabular in nature)
        dfs = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)
        if not dfs:
            logging.error("No tables found in College Football PDF.")
            sys.exit(1)
        # Concatenate tables if multiple are found.
        df = pd.concat(dfs, ignore_index=True)
        logging.info(f"College Football PDF loaded with shape: {df.shape}")
        
        # Check the number of columns and rename appropriately.
        if len(df.columns) == 7:
            # Rename columns assuming the order:
            # DATE, favorite_score, FAVORITE, LINE, COMP, UNDERDOG, underdog_score
            df.columns = ['DATE', 'favorite_score', 'FAVORITE', 'LINE', 'COMP', 'UNDERDOG', 'underdog_score']
            logging.info("PDF columns renamed to standardized names.")
        else:
            # Assigning seven names to any other column count would raise, and
            # the conversions below depend on the standardized names.
            logging.error(f"Expected 7 columns in PDF table, found {len(df.columns)}.")
            sys.exit(1)
        
        # Convert DATE column to datetime.
        df['DATE'] = pd.to_datetime(df['DATE'], errors='coerce')
        
        # Convert score columns to numeric.
        df['favorite_score'] = pd.to_numeric(df['favorite_score'], errors='coerce')
        df['underdog_score'] = pd.to_numeric(df['underdog_score'], errors='coerce')
        
        # Remove rows with missing crucial values.
        df.dropna(subset=['DATE', 'favorite_score', 'underdog_score'], inplace=True)
        
        # Create outcome column: 1 if favorite wins, 0 otherwise.
        df['favorite_win'] = (df['favorite_score'] > df['underdog_score']).astype(int)
        df['margin'] = df['favorite_score'] - df['underdog_score']
        logging.info("Outcome and margin features created for College Football data.")
        return df
    except Exception as e:
        logging.error(f"Error processing College Football PDF: {e}")
        sys.exit(1)

def eda_nfl(df):
    """
    Perform exploratory data analysis on the NFL data.
    """
    logging.info("Performing EDA on NFL data.")
    print("NFL Data (First 5 Rows):")
    print(df.head())
    print("\nNFL Data Summary:")
    print(df.describe())
    
    # Plot distribution of score margins if available.
    if 'margin' in df.columns:
        plt.figure(figsize=(8,6))
        sns.histplot(df['margin'], bins=20, kde=True)
        plt.title("Distribution of Score Margin (NFL)")
        plt.xlabel("Margin")
        plt.ylabel("Frequency")
        plt.show()

def eda_cf(df):
    """
    Perform exploratory data analysis on the College Football data.
    """
    logging.info("Performing EDA on College Football data.")
    print("College Football Data (First 5 Rows):")
    print(df.head())
    print("\nCollege Football Data Summary:")
    print(df.describe())
    
    # Plot distribution of score margins.
    if 'margin' in df.columns:
        plt.figure(figsize=(8,6))
        sns.histplot(df['margin'], bins=20, kde=True)
        plt.title("Distribution of Score Margin (College Football)")
        plt.xlabel("Margin")
        plt.ylabel("Frequency")
        plt.show()

def train_model_nfl(df):
    """
    Train a predictive model on NFL data using logistic regression.
    Assumes that the CSV contains a target column 'home_win' and at least one numerical feature.
    Here we attempt to use betting lines as features (e.g., column 'LINE' or 'spread').
    If not available, a dummy feature based on margin is used.
    """
    logging.info("Training predictive model on NFL data.")
    
    # Determine feature columns based on available columns.
    feature_cols = []
    if 'LINE' in df.columns:
        feature_cols.append('LINE')
    elif 'spread' in df.columns:
        feature_cols.append('spread')
    
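    if not feature_cols:
        # Caution: 'home_win' is defined as margin > 0, so this fallback leaks the
        # target into the features. It only proves the pipeline runs end to end.
        logging.warning("No explicit betting feature found; using 'margin' as a dummy feature. "
                        "This leaks the target, so the reported accuracy is not meaningful.")
        if 'margin' not in df.columns:
            logging.error("No feature available for model training.")
            sys.exit(1)
        df['dummy_feature'] = df['margin']
        feature_cols = ['dummy_feature']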
    
    if 'home_win' not in df.columns:
        logging.error("Target column 'home_win' not found in NFL data.")
        sys.exit(1)
    
    X = df[feature_cols]
    y = df['home_win']
    
    # Split data into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    logging.info(f"Model trained. Test accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, preds))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, preds))
    
    return model

def main():
    # Process and analyze NFL CSV data.
    logging.info("Starting NFL CSV processing...")
    nfl_df = process_nfl_csv(NFL_CSV_PATH)
    eda_nfl(nfl_df)
    
    # Process and analyze College Football PDF data.
    logging.info("Starting College Football PDF processing...")
    cf_df = process_cf_pdf(CF_PDF_PATH)
    eda_cf(cf_df)
    
    # Train a predictive model on NFL data.
    model = train_model_nfl(nfl_df)
    
    # Save the trained model to disk for future use.
    model_filename = "nfl_model.pkl"
    with open(model_filename, "wb") as f:
        pickle.dump(model, f)
    logging.info(f"Predictive model saved to {model_filename}")
    
    logging.info("Data processing and model training complete.")

if __name__ == '__main__':
    main()
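
`train_model_nfl` falls back to `margin` when neither a `LINE` nor a `spread` column exists. Because `home_win` is derived from the same final score, that fallback tells the model the answer. A feature known before kickoff, such as the point spread seen from the home team's side, avoids this. The sketch below uses toy rows; `team_home_id`, `team_favorite_id`, and `spread_favorite` are assumed column names and should be checked against the actual CSV before use.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; the column names are assumptions, not confirmed by the notebook.
games = pd.DataFrame({
    "team_home_id":     ["ATL", "ATL", "NO"],
    "team_favorite_id": ["ATL", "GB",  "NO"],
    "spread_favorite":  [-3.5,  -7.0,  -1.0],
    "score_home":       [24,    17,    20],
    "score_away":       [20,    27,    23],
})

# Spread from the home team's point of view: negative when the home team is favored,
# positive when the visiting team is favored.
home_is_favorite = games["team_home_id"] == games["team_favorite_id"]
games["home_spread"] = np.where(
    home_is_favorite, games["spread_favorite"], -games["spread_favorite"]
)

# The target still comes from the final score, but it is no longer used as a feature.
games["home_win"] = (games["score_home"] > games["score_away"]).astype(int)
print(games[["home_spread", "home_win"]])
```

With a column like `home_spread` in place, it could be passed as the single entry of `feature_cols` instead of the `margin` fallback.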

In [18]:
!pip install PyPDF2
!pip install pdfplumber
clear_output()

In [19]:
from PyPDF2 import PdfReader

reader = PdfReader(pdf)
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

In [None]:
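def pdf_to_text(path):
    """Extract the text of every page in the PDF at `path` using pdfminer."""
    manager = PDFResourceManager()
    retstr = io.StringIO()
    layout = LAParams(all_texts=True)
    # Pass the layout parameters to the converter; otherwise they are never used.
    device = TextConverter(manager, retstr, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)
    # The context manager closes the file even if a page fails to parse.
    with open(path, 'rb') as filepath:
        for page in PDFPage.get_pages(filepath, caching=True, check_extractable=True):
            interpreter.process_page(page)
    # Read the buffer once, after all pages; this also covers a PDF with no pages.
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text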

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        fn=os.path.join(dirname, filename)
        print(fn)
        #if fn.endswith('.pdf'):
        #   pdf2txt.append(fn)

In [None]:
stadiums = pd.read_csv("/kaggle/input/nfl-scores-and-betting-data/nfl_stadiums.csv",header=0,encoding='unicode_escape')
stadiums

In [None]:
stadiums = pd.read_csv("/kaggle/input/nfl-scores-and-betting-data/nfl_stadiums.csv",header=0,encoding='unicode_escape')
teams = pd.read_csv("/kaggle/input/nfl-scores-and-betting-data/nfl_teams.csv",header=0)
scores = pd.read_csv("/kaggle/input/nfl-scores-and-betting-data/spreadspoke_scores.csv",header=0)


#stadiums['stadium_weather_station_code'] = stadiums['stadium_weather_station_code'].astype('float')
#stadiums['stadium_capacity'] = stadiums['stadium_capacity'].astype('float')
stadiums['LATITUDE'] = stadiums['LATITUDE'].astype('float')
stadiums['LONGITUDE'] = stadiums['LONGITUDE'].astype('float')
stadiums['ELEVATION'] = stadiums['ELEVATION'].astype('float')


scores['schedule_date'] = scores['schedule_date'].astype('datetime64[ns]')
# schedule_season is assumed to hold the season year as a number. Casting it to
# datetime64[ns] would read the value as nanoseconds since 1970 and turn every
# season into 1970, so convert it straight to float instead.
scores['schedule_season'] = scores['schedule_season'].astype('float')
scores['score_home'] = scores['score_home'].astype('float')
scores['score_away'] = scores['score_away'].astype('float')
scores['weather_temperature'] = scores['weather_temperature'].astype('float')
scores['weather_wind_mph'] = scores['weather_wind_mph'].astype('float')
scores['weather_humidity'] = scores['weather_humidity'].astype('float')

# Determine the winner of each game.
# Rows with equal scores, or with missing scores, are labelled 'Tie'.
scores['winner'] = np.select(
    [scores.score_home > scores.score_away, scores.score_home < scores.score_away],
    [scores.team_home, scores.team_away],
    default='Tie',
)
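
A note on converting a season column: when integers are cast to `datetime64[ns]`, pandas treats them as nanoseconds after 1970-01-01 rather than as years. The small check below illustrates this with two made-up season values, alongside one way to parse them as years. It assumes the season is stored as an integer year.

```python
import pandas as pd

# Two illustrative season values.
seasons = pd.Series([1966, 2020])

# Integers cast to datetime64[ns] are read as nanoseconds after 1970-01-01.
as_nanoseconds = seasons.astype("datetime64[ns]").dt.year
# Parsing the values as strings with an explicit year format keeps the intended year.
as_years = pd.to_datetime(seasons.astype(str), format="%Y").dt.year

print(as_nanoseconds.tolist())
print(as_years.tolist())
```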

# What does our data look like?

In [None]:
# show sample of dataset
scores.sample(25).sort_values('schedule_date',ascending=False).reset_index(drop=True)

# And we have a winner!

In [None]:
scores[['team_home','score_home','score_away','team_away','winner']].sample(25)

# Atlanta Falcons - Wins
Correlation map of numeric factors across all Falcons home games.

In [None]:
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

atlanta = scores[scores.team_home == 'Atlanta Falcons'].sort_values('schedule_date',ascending=False).reset_index(drop=True)

sns.set_theme(style="white")


# Compute the correlation matrix
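# numeric_only=True skips text columns such as team names; recent pandas versions raise without it.
corr = atlanta.corr(numeric_only=True)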

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.xticks(rotation=45)
plt.show()

### Correlation map of factors when Falcons win.

In [None]:
# When do the Falcons win?

atlanta_wins = atlanta[atlanta.score_home > atlanta.score_away]

from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")


# Compute the correlation matrix
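# numeric_only=True skips text columns such as team names; recent pandas versions raise without it.
corr = atlanta_wins.corr(numeric_only=True)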

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.xticks(rotation=45)
plt.show()
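
The heatmaps above hide the upper triangle because a correlation matrix is symmetric, so each pair would otherwise appear twice. The sketch below shows what the mask built with `np.triu` looks like on a small made-up matrix: cells where the mask is `True` (the diagonal and everything above it) are the ones seaborn leaves blank.

```python
import numpy as np

# Illustrative 3x3 correlation matrix; the values are made up.
corr = np.array([[ 1.0, 0.2, -0.4],
                 [ 0.2, 1.0,  0.1],
                 [-0.4, 0.1,  1.0]])

# True on the diagonal and above it.
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# The cells that remain visible are the strictly lower triangle.
visible = corr[~mask]
print(visible)
```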


# Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["xtick.labelsize"] = 6
# Plot a sample of up to 100 Falcons home wins: home score by game date, colored by opponent.
# A smaller sample may be needed; the x-axis is crowded with this many dates.
sns.catplot(data=atlanta_wins.sample(min(100, len(atlanta_wins))), x="schedule_date", 
            y="score_home", 
            hue="team_away", 
            kind="swarm", 
            height=10, 
            aspect=2, 
            #size='score_away',
            #size_max=20
           )
plt.xticks(rotation=45)
plt.ylim(bottom=0)
plt.show()

In [None]:
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

fig = px.scatter_3d(scores, z='team_home', y='score_home', x='stadium',
              color='stadium',
              size = 'score_home',
              symbol = 'winner',
              #hover_name = 'shape',
              hover_data=['winner','schedule_date','weather_humidity','weather_detail','score_away','team_away'],
              opacity=0.7,
              size_max=25
                    
                   )
fig.show()

In [None]:
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

fig = px.scatter_3d(atlanta_wins, z='schedule_date', y='score_home', x='stadium',
              color='stadium',
              size = 'score_home',
              #symbol = 'schedule_playoff',
              #hover_name = 'shape',
              hover_data=['winner','schedule_date','weather_humidity','weather_detail','score_away','team_away'],
              opacity=0.7,
              size_max=25
                    
                   )
fig.show()

In [None]:
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

fig = px.scatter_3d(scores, z='schedule_date', y='team_home', x='stadium',
              color='team_away',
              size = 'score_home',
              #symbol = 'stadium',
              hover_data=['winner','schedule_date','weather_humidity','weather_detail','score_away','team_away'],
              opacity=0.7,
              size_max=25
                    
                   )
fig.show()

# Predictive Modeling - Who will win?

In [None]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

# encoding
from sklearn.preprocessing import LabelEncoder

def encode(df):
    lb_make = LabelEncoder()
    columns = df.columns.values.tolist()
    df_encoded = df[columns].copy()

    # categorize/encode
    for i in columns:
        df_encoded[i] = lb_make.fit_transform(df[i])

    # encoded
    return df_encoded

# create X,y variables for ML
from sklearn.model_selection import train_test_split
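def X_y_sets(df, target):
    """Return ((X_train, X_test, y_train, y_test), X, y) after dropping rows with missing values."""
    clean = df.dropna()
    X = clean.drop(columns=[target]).copy()
    # to_numpy() replaces Series.ravel(), which newer pandas versions deprecate.
    y = clean[target].to_numpy().copy()
    
    return train_test_split(X, y, test_size=0.33, random_state=42), X, y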


# encoded variable re-mapping
def encoding_remap(df, df_encoded, target):
    """Return a lookup table from encoded `target` values to their original labels,
    built from the rows that fall in the test split."""
    # train_test_split returns (X_train, X_test, y_train, y_test); index 1 is X_test.
    X_test = X_y_sets(df, target)[0][1]
    
    remap = pd.merge(df_encoded.loc[df_encoded.index.isin(X_test.index.values)][target].reset_index(),
              df.loc[df.index.isin(X_test.index.values)][[target]].reset_index(),on=['index'])
    
    # After the merge, `{target}_x` is the encoded value and `{target}_y` the original label.
    remap[target] = remap[f'{target}_y'].astype(str)
    remap['index'] = remap[f'{target}_x'].astype(int)
    remap=remap[[target,'index']]
    remap = remap.set_index('index').drop_duplicates().sort_values('index')
    
    return remap


# pairplot
import seaborn as sns
def pairplot(df, target):
    return sns.pairplot(df,hue=target)
    
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier

# classifier iteration
def classification_feat_importance(df_encoded):
    
    # collect one row per (classifier, target) pair across the whole run
    results = []
    test_matrix = None
    
    # iterate through each column variable as classification targets
    for target in df_encoded.columns.values:
        X = df_encoded.drop(columns=[target]).copy()
        y = df_encoded[target].to_numpy().copy()

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
        
    
        # classifiers
        #clf1 = GradientBoostingClassifier(criterion="friedman_mse", init=None, learning_rate=0.3338, loss='deviance', max_depth=19, max_features=None, max_leaf_nodes=None, min_samples_leaf=6, min_samples_split=120, min_weight_fraction_leaf=0.0, n_estimators=500, random_state=42, subsample=1.0, verbose=1, warm_start=False).fit(X_train, y_train)
        #clf2 = GradientBoostingClassifier(criterion="squared_error", init=None, learning_rate=0.2222, loss='deviance', max_depth=19, max_features=None, max_leaf_nodes=None, min_samples_leaf=6, min_samples_split=120, min_weight_fraction_leaf=0.0, n_estimators=500, random_state=42, subsample=1.0, verbose=1, warm_start=False).fit(X_train, y_train)
        clf3 = RandomForestClassifier(max_depth=5, n_estimators=1000, random_state=42).fit(X_train, y_train)
        clf4 = ExtraTreesClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
        clf5 = AdaBoostClassifier(n_estimators=8000, random_state=42).fit(X_train, y_train)
        clf6 = MLPClassifier(alpha=1, max_iter=500).fit(X_train, y_train)
        clf7 = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)
        classifiers = [
                       #clf1, 
                       #clf2, 
                       clf3, 
                       clf4, 
                       clf5,
                       clf6,
                       #clf7
                      ]

        # score every classifier on the held-out split and keep all results
        for classifier in classifiers:
            name = type(classifier).__name__
            score = classifier.score(X_test, y_test)
            results.append({"classifier":name,"target":target,"test_score":score})
            print("\nClassifier:",name,"\nTarget:",target,"\nScore:",score)
        
        # confusion matrix of the last classifier in the list, for the current target
        test_matrix = confusion_matrix(y_test, classifiers[-1].predict(X_test))
        
    return pd.DataFrame(results),test_matrix

print("To analyze which target-classifier would yield the best results: \nUncomment (#) the code below.")
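
The helper above scores each classifier on one fixed train/test split. As a complementary sketch, cross-validation reports a mean and a spread per model instead of a single number. The synthetic data from `make_classification`, the reduced `n_estimators`, and the `candidates` dict are assumptions for illustration, not the notebook's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded dataframe (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Smaller ensembles than in the notebook, so the sketch runs quickly.
candidates = {
    "RandomForestClassifier": RandomForestClassifier(max_depth=5, n_estimators=50, random_state=42),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=50, random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=9),
}

rows = []
for name, model in candidates.items():
    # 5-fold cross-validation; classifiers are scored on accuracy by default
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    rows.append({"classifier": name,
                 "mean_accuracy": fold_scores.mean(),
                 "std": fold_scores.std()})

cv_summary = pd.DataFrame(rows).sort_values("mean_accuracy", ascending=False)
print(cv_summary)
```

A large `std` relative to the gap between two models suggests that the ranking from a single split may not be stable.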

In [None]:
# is scaling necessary?
# construction of ML dataframes
target = 'winner'

# copy
a = scores.copy()

# for the sake of computational efficiency, subsample here if needed (currently a no-op)
a = a
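
On the scaling question: tree ensembles are largely insensitive to the scale of their inputs, while `MLPClassifier` and `KNeighborsClassifier` generally are sensitive to it. A minimal sketch of wrapping the MLP in a pipeline with `StandardScaler`, assuming synthetic data with one deliberately inflated feature (the data and the names `X_demo`, `scaled_mlp` are illustrative, not part of the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one column is inflated to mimic features with very different ranges.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_demo[:, 0] *= 1000.0

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# The scaler is fit on the training fold only and reused at prediction time.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(alpha=1, max_iter=500, random_state=42),
)
scaled_mlp.fit(Xtr, ytr)

scaled_accuracy = scaled_mlp.score(Xte, yte)
print("Scaled MLP accuracy:", scaled_accuracy)
```

Keeping the scaler inside the pipeline means the test rows never influence the scaling statistics.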

In [None]:
# pick a random row & save its index label for defining an encoded use-case
from random import randrange
idx = a.index[randrange(len(a))]

# print random configuration item
print("\nThis is a randomly chosen subject we will try to predict.")
b = pd.DataFrame(a.loc[idx]).T
print(f"\nTarget:'{target}' value is: ",b.reset_index()[target][0],"\n")

# store sol'n
solution = str(b.reset_index()[target][0])

# print data point
b
# if this cell fails, re-run the notebook from step 1: the sampled row hit a null value (to be fixed)


In [None]:
# categorize/encode entire dataframe(a)
c = encode(a)
print("\nOriginal dataframe encoded into something we can run a classifier against.\n")
c.sample(10).reset_index(drop=True).style.background_gradient(cmap ='Pastel1').set_properties(**{'font-size': '10px'})
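
Because `encode` reuses a single `LabelEncoder` for every column, only the last column's mapping survives on the encoder object. A minimal sketch of keeping one encoder per column, so that any encoded value can be decoded later with `inverse_transform`. The toy frame and its placeholder team names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with placeholder values (not taken from the notebook's dataset).
toy = pd.DataFrame({
    "team_home": ["Bears", "Lions", "Bears", "Packers"],
    "winner": ["Bears", "Lions", "Packers", "Packers"],
})

# One fitted encoder per column, kept in a dict for later decoding.
encoders = {}
toy_encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy_encoded[col] = encoders[col].fit_transform(toy[col])

# Decode the target column back to its original labels.
decoded = encoders["winner"].inverse_transform(toy_encoded["winner"])

print(toy_encoded)
print(list(decoded))
```

With the encoders retained, a predicted class index can be decoded directly, without rebuilding a lookup table from the data.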

In [None]:
# pairwise KDE plots of the encoded features, colored by the target
sns.pairplot(c.copy(),
             hue=target,
             kind="kde",
             corner=True,
             palette="Paired"
            )

In [None]:
# encoded item w/out the target or the final scores
use_case = pd.DataFrame(c.loc[idx]).T.drop(columns=[target,'score_home','score_away']) 

#c

# encoded dataframe w/out target & score info
data = c.drop(columns=[target,'score_home','score_away']) 

print("\nThis is what our encoded 'use-case' looks like - number form, just the way the machine likes it.\n")

use_case.style.background_gradient(cmap ='twilight').set_properties(**{'font-size': '10px'})

## Here we train the machine on past games, using the weather, stadium, & over_under_lines variables.
### However, we remove the score_home & score_away variables from both the training data & our use-case: they would give the result away, and we don't want the machine to be omniscient.

In [None]:
# create X,y variables for ML
# save trainer (every row except the use-case, w/out the final scores)
print("\nResetting train data...\nCreating X-matrix & y-vector (target) for classification.")
trainer = c.loc[c.index!=idx].drop(columns=['score_home','score_away']).copy()
X, y =  trainer.drop(columns=[target]), trainer[target].to_numpy()
X_train, X_test, y_train, y_test = X_y_sets(trainer, target)[0]
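
Why the final scores should stay out of the features: if the outcome is a direct function of them, a model that sees them has nothing left to predict. A minimal sketch on synthetic data, with random scores and one unrelated hypothetical feature, `humidity` (not the notebook's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 600
games = pd.DataFrame({
    "score_home": rng.integers(0, 50, n),
    "score_away": rng.integers(0, 50, n),
    "humidity": rng.integers(20, 100, n),  # hypothetical feature, unrelated to the result
})
# The label is derived from the scores, which is exactly what makes them leaky.
games["home_win"] = (games["score_home"] > games["score_away"]).astype(int)

def holdout_accuracy(feature_cols):
    """Train on the given columns and return accuracy on a held-out split."""
    Xtr, Xte, ytr, yte = train_test_split(
        games[feature_cols], games["home_win"], test_size=0.33, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)
    return model.score(Xte, yte)

leaky = holdout_accuracy(["score_home", "score_away", "humidity"])
honest = holdout_accuracy(["humidity"])
print(f"with scores: {leaky:.2f}  without scores: {honest:.2f}")
```

The gap between the two numbers is the part of the score that comes from leakage rather than from anything knowable before kickoff.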

In [None]:
# attach the labels by position: y_train is a plain array aligned with X_train's rows
X_train = X_train.assign(target=y_train)
X_train.head().reset_index(drop=True).style.background_gradient(cmap ='twilight').set_properties(**{'font-size': '10px'})
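
A note on attaching labels to a shuffled feature frame: `train_test_split` returns `X_train` with its original, shuffled index, while a bare `pd.Series` built from an array gets a fresh 0..n-1 index. Assigning that Series aligns on index labels and can leave NaNs, whereas assigning the array itself goes by position. A small sketch with a toy frame (all names here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy features with a shuffled index, as after a train/test split.
features = pd.DataFrame({"x": [10, 20, 30]}, index=[7, 2, 5])
labels = np.array([1, 0, 1])

# A Series carries its own index (0, 1, 2), so pandas aligns on index labels.
by_series = features.assign(target=pd.Series(labels))

# A plain array has no index, so values are assigned row by row.
by_array = features.assign(target=labels)

print(by_series)
print(by_array)
```

In `by_series` only the row whose index label happens to exist in the Series receives a value; the others become NaN.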

In [None]:
# rebuild the train/test splits so X_train carries no 'target' column from the layman's illustration above
X_train, X_test, y_train, y_test = X_y_sets(trainer, target)[0]

In [None]:
# encoded variable re-mapping
# specific to our current target choice
d = encoding_remap(a, c, target)
print("\nDecoding our encoded dataframe to correlate with the initial randomly chosen subject.\n")

In [None]:
print("\n-Live prediction-\nThinking...\n")

# choose classifier
#clf = GradientBoostingClassifier(criterion="friedman_mse", init=None, learning_rate=0.3338, loss='deviance', max_depth=19, max_features=None, max_leaf_nodes=None, min_samples_leaf=6, min_samples_split=120, min_weight_fraction_leaf=0.0, n_estimators=500, random_state=42, subsample=1.0, verbose=1, warm_start=False).fit(X_train, y_train)
#clf = GradientBoostingClassifier(criterion="squared_error", init=None, learning_rate=0.2222, loss='deviance', max_depth=19, max_features=None, max_leaf_nodes=None, min_samples_leaf=6, min_samples_split=120, min_weight_fraction_leaf=0.0, n_estimators=500, random_state=42, subsample=1.0, verbose=1, warm_start=False).fit(X_train, y_train)

# these ones run just a little more efficiently for now
#clf = RandomForestClassifier(max_depth=5, n_estimators=1000, random_state=42).fit(X_train, y_train)
#clf = ExtraTreesClassifier(n_estimators=1000, random_state=42).fit(X_train, y_train)
#clf = AdaBoostClassifier(n_estimators=1500, random_state=42).fit(X_train, y_train)
clf = MLPClassifier(alpha=0.666, max_iter=666).fit(X_train, y_train)
#clf = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)

print()
print(f"Test accuracy: {clf.score(X_test, y_test)*100:.2f}%")
print()
prediction = clf.predict(use_case)[0]
print(f"Prediction {target} index:",prediction)

# print decoded prediction
print("\nPrediction Decoded")
e = d[d.index == prediction]
e
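
The test score printed above measures accuracy over the whole held-out set; it is not the model's confidence in this one game. For a per-prediction estimate, scikit-learn classifiers such as `MLPClassifier` expose `predict_proba`. A minimal sketch on synthetic data (the data and the names `model`, `one_row` are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the encoded training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

model = MLPClassifier(alpha=0.666, max_iter=666, random_state=42).fit(Xtr, ytr)

one_row = Xte[:1]                         # a single unseen example
probs = model.predict_proba(one_row)[0]   # one probability per class, in model.classes_ order
predicted_class = model.classes_[probs.argmax()]

print("Held-out accuracy:", model.score(Xte, yte))
print("Class probabilities for this row:", dict(zip(model.classes_, probs.round(3))))
print("Predicted class:", predicted_class)
```

These probabilities are the model's own estimates and are not guaranteed to be well calibrated.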

# Was our prediction model successful?

In [None]:
solved = str(e[target].iloc[0])
if solution == solved:
    print(f"\nYUP!\n\nThe machine correctly predicted the '{target}'!\n")
else:
    print("\nNOPE!\nThe machine's prediction was incorrect :(")
    
print()
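
A single randomly chosen game is a noisy check: one hit or miss says little about the model. A sketch of judging a classifier over the whole held-out set, against a majority-class baseline and with a confusion matrix. Synthetic data stands in for the encoded frame, and every name below is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataframe.
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
model = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=42).fit(Xtr, ytr)

baseline_acc = accuracy_score(yte, baseline.predict(Xte))
model_acc = accuracy_score(yte, model.predict(Xte))

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(yte, model.predict(Xte))

print("baseline:", baseline_acc, "model:", model_acc)
print(matrix)
```

A model is only interesting to the extent that it beats the baseline on games it has not seen.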

# Done for now!