# Introduction

We are now moving to the final part of the workshop, which involves formulating business recommendations. Our tasks are:
- Determining the global betting odds,
- Dividing the dataset into categories: A, B, C, D, where A is the best group and D is the weakest group,
- Determining the risk of odds based on accepted parameters for each category.

As the last task, in a discussion format, we must consider the fact that we are a new betting company. When formulating our recommendations, we need to identify the risks that may affect our operations. We will perform this task together in a brainstorming session.

# Notebook Configuration

## Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

import warnings
warnings.filterwarnings("ignore")

plt.rcParams["figure.figsize"] = (8, 5)

print("Imports OK")

Imports OK


## Loading data into the workspace

> Remember to correctly specify the column separator
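
To see why the separator matters, here is a minimal sketch using a tiny in-memory CSV. The sample rows and the names `sample_csv`, `wrong`, and `right` are made up for illustration; they are not the workshop file.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics a semicolon-separated file
sample_csv = "team;season;victories\nBoston Bruins;1990;44\nBuffalo Sabres;1990;31\n"

# With the default separator (a comma), every line is read as one wide column
wrong = pd.read_csv(io.StringIO(sample_csv))

# With sep=";" the header and the values are split into separate columns
right = pd.read_csv(io.StringIO(sample_csv), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

A quick look at `df.shape` or `df.columns` right after loading is usually enough to catch a wrong separator.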

In [2]:
file_path = Path(r"C:\Users\mjemelka\Desktop\Python\Workshop_-_files\data\processed\hockey_teams.csv")

df = pd.read_csv(file_path, sep=";")

print(f"Loaded {len(df)} rows and {df.shape[1]} columns.\n")
print("Data preview:")
display(df.head())

Loaded 607 rows and 10 columns.

Data preview:


Unnamed: 0,team,season,victories,defeats,overtime_defeats,victory_percentage,scored_goals,received_goals,goal_difference,goals_ratio
0,Boston Bruins,1990,44,24,0,0.55,299,264,35,1.132576
1,Buffalo Sabres,1990,31,30,0,0.388,292,278,14,1.05036
2,Calgary Flames,1990,46,26,0,0.575,344,263,81,1.307985
3,Chicago Blackhawks,1990,49,23,0,0.613,284,211,73,1.345972
4,Detroit Red Wings,1990,34,38,0,0.425,273,298,-25,0.916107


### Checking data loading accuracy

In [3]:
print("Basic information about the dataset:\n")
df.info()

print("\nMissing-value check:\n")
print(df.isna().sum())

print("\nStatistical summary of numeric columns:\n")
display(df.describe())

print("\nPreview of the first 3 rows:")
display(df.head(3))

Basic information about the dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   team                607 non-null    object 
 1   season              607 non-null    int64  
 2   victories           607 non-null    int64  
 3   defeats             607 non-null    int64  
 4   overtime_defeats    607 non-null    int64  
 5   victory_percentage  607 non-null    float64
 6   scored_goals        607 non-null    int64  
 7   received_goals      607 non-null    int64  
 8   goal_difference     607 non-null    int64  
 9   goals_ratio         607 non-null    float64
dtypes: float64(2), int64(7), object(1)
memory usage: 47.6+ KB

Missing-value check:

team                  0
season                0
victories             0
defeats               0
overtime_defeats      0
victory_percentage    0
scored_goals          0
received_goals        0
goal_difference       0
goals_ratio           0
dtype: int64

Statistical summary of numeric columns:

Unnamed: 0,season,victories,defeats,overtime_defeats,victory_percentage,scored_goals,received_goals,goal_difference,goals_ratio
count,607.0,607.0,607.0,607.0,607.0,607.0,607.0,607.0,607.0
mean,2000.46458,36.841845,32.443163,4.400329,0.457389,235.818781,235.823723,-0.004942,1.018843
std,6.557311,8.902925,8.383721,4.600854,0.102289,41.128312,42.893416,45.449152,0.187374
min,1990.0,9.0,11.0,0.0,0.119,115.0,115.0,-196.0,0.506297
25%,1995.0,31.0,27.0,0.0,0.39,212.0,208.0,-27.0,0.885107
50%,2000.0,38.0,32.0,4.0,0.463,234.0,234.0,3.0,1.012987
75%,2006.0,43.0,37.0,8.0,0.5245,257.0,262.0,31.5,1.138553
max,2011.0,62.0,71.0,18.0,0.756,369.0,414.0,144.0,1.79558



Preview of the first 3 rows:


Unnamed: 0,team,season,victories,defeats,overtime_defeats,victory_percentage,scored_goals,received_goals,goal_difference,goals_ratio
0,Boston Bruins,1990,44,24,0,0.55,299,264,35,1.132576
1,Buffalo Sabres,1990,31,30,0,0.388,292,278,14,1.05036
2,Calgary Flames,1990,46,26,0,0.575,344,263,81,1.307985


# Determining Betting Odds

Let's review [this page](https://trustbet.pl/kursy-bukmacherskie/), which describes methods for determining betting odds. First, we will determine the global odds, which will be the starting point for our analysis (the so-called _baseline scenario_). At this point, we ignore the margin and calculate decimal odds, that is, the reciprocal of the event's probability.

Here is the list of steps to be performed to obtain the desired value:
- we will complete the definition of the `get_betting_odds` function, which will take `probability` of a given event as a parameter. We will use it multiple times, so it is worth preparing its implementation now
- then we need to appropriately aggregate the set and determine the **global** probability of the team's victory.

## Implementation of the `get_betting_odds` function

In [10]:
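def get_betting_odds(probability):
    """Return fair decimal odds (no margin) for an event with the given probability."""
    # Works for scalars and pandas Series; a plain Python 0 raises ZeroDivisionError.
    return 1 / probability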

### Some tests to check the correctness of the implementation

In [9]:
def test_get_betting_odds():
    assert get_betting_odds(1) == 1, "Expected 1"
    assert get_betting_odds(0.5) == 2, "Expected 2"
    assert get_betting_odds(0.25) == 4, "Expected 4"
    assert get_betting_odds(0.1) == 10, "Expected 10"
    try:
        get_betting_odds(0)
    except ZeroDivisionError:
        pass
    else:
        assert False, "Expected ZeroDivisionError"

    print("All tests passed!")

test_get_betting_odds()

All tests passed!


### Determining the global odds

Here, determine the probability of winning for each team, averaged over all of its seasons.

In [12]:
team_win_prob = (
    df.groupby("team", as_index=False)["victory_percentage"]
      .mean()
      .rename(columns={"victory_percentage": "global_win_probability"})
)

team_win_prob["global_fair_odds"] = get_betting_odds(team_win_prob["global_win_probability"])

team_win_prob = team_win_prob.sort_values("global_fair_odds").reset_index(drop=True)

print(f"Computed the global win probability for {len(team_win_prob)} teams.\n")
print("Top 10 teams by fair odds:\n")
display(team_win_prob.head(10).round(3))

Computed the global win probability for 35 teams.

Top 10 teams by fair odds:



Unnamed: 0,team,global_win_probability,global_fair_odds
0,Detroit Red Wings,0.579,1.728
1,New Jersey Devils,0.528,1.893
2,Anaheim Ducks,0.522,1.914
3,Dallas Stars,0.517,1.935
4,Colorado Avalanche,0.516,1.938
5,Pittsburgh Penguins,0.499,2.002
6,Philadelphia Flyers,0.493,2.028
7,St. Louis Blues,0.487,2.052
8,Boston Bruins,0.486,2.057
9,Washington Capitals,0.477,2.099


Determine the global odds here using the `get_betting_odds` function. Round the result to two decimal places.

In [13]:
p_global = df["victory_percentage"].mean()

global_rate = round(get_betting_odds(p_global), 2)

print(f"Global fair betting rate: {global_rate}")

Global fair betting rate: 2.19


# Team Categorization

Let's discuss how we can classify teams into _leagues_. We want to establish 4 leagues:
- A - league consisting of the best teams,
- B - league consisting of good teams,
- C - league consisting of average teams,
- D - league consisting of the weakest teams.

The above terms are quite subjective, so for the purpose of this exercise, we will adopt the following assumptions:
- A - the top 5% of teams,
- B - teams performing better than 70% of the group but worse than league A,
- C - teams performing better than 20% of the group but worse than league B,
- D - the remaining teams.

To accomplish this task, we will additionally implement the function `assign_team_to_league`.

> Note: This task looks unassuming, but it is difficult. Remember that during the class, you have access to the instructor, and later to a mentor.

## Determination of cutoff points for individual leagues

In [14]:
def assign_team_to_league(x):
    pass

In [20]:
q20 = team_win_prob["global_win_probability"].quantile(0.20)
q70 = team_win_prob["global_win_probability"].quantile(0.70)
q95 = team_win_prob["global_win_probability"].quantile(0.95)

def assign_team_to_league(prob, q20, q70, q95):
    """
    Assign a team to league A/B/C/D based on percentile thresholds.
    - A: prob >= q95
    - B: q70 <= prob < q95
    - C: q20 <= prob < q70
    - D: prob  < q20
    """
    if prob >= q95:
        return "A"
    if prob >= q70:
        return "B"
    if prob >= q20:
        return "C"
    return "D"

team_win_prob["league"] = team_win_prob["global_win_probability"].apply(
    lambda p: assign_team_to_league(p, q20, q70, q95)
)

summary = team_win_prob["league"].value_counts().sort_index()
# Relabel the index A-D as 1-4 (1 = league A, ..., 4 = league D)
summary.index = range(1, len(summary) + 1)
print("Number of teams in leagues A-D:\n", summary.to_string())

print("\nSample of teams by league (best first):")
preview = (
    team_win_prob[["team", "global_win_probability", "global_fair_odds", "league"]]
    .sort_values(["league", "global_fair_odds"])
    .round({"global_win_probability": 3, "global_fair_odds": 2})
    .reset_index(drop=True)
)
preview.index = preview.index + 1

display(preview.head(15))

Number of teams in leagues A-D:
 1     2
2     9
3    17
4     7

Sample of teams by league (best first):


Unnamed: 0,team,global_win_probability,global_fair_odds,league
1,Detroit Red Wings,0.579,1.73,A
2,New Jersey Devils,0.528,1.89,A
3,Anaheim Ducks,0.522,1.91,B
4,Dallas Stars,0.517,1.93,B
5,Colorado Avalanche,0.516,1.94,B
6,Pittsburgh Penguins,0.499,2.0,B
7,Philadelphia Flyers,0.493,2.03,B
8,St. Louis Blues,0.487,2.05,B
9,Boston Bruins,0.486,2.06,B
10,Washington Capitals,0.477,2.1,B
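
The same thresholds can also be applied without a row-by-row `apply`. The sketch below is one possible vectorized alternative based on `pd.cut`; the ten probabilities in `probs` are hypothetical values chosen for illustration, not the workshop results.

```python
import numpy as np
import pandas as pd

# Hypothetical win probabilities for ten teams (not the workshop data)
probs = pd.Series([0.30, 0.35, 0.40, 0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.60])

# Percentile thresholds, computed the same way as q20, q70 and q95 above
q20, q70, q95 = probs.quantile([0.20, 0.70, 0.95])

# right=False makes every bin closed on the left: [q20, q70), [q70, q95), ...
# which matches the ">=" comparisons in assign_team_to_league
leagues = pd.cut(
    probs,
    bins=[-np.inf, q20, q70, q95, np.inf],
    labels=["D", "C", "B", "A"],
    right=False,
)

print(pd.DataFrame({"prob": probs, "league": leagues}))
```

Note that `pd.cut` requires strictly increasing bin edges, so this variant assumes the three quantiles are distinct.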


## Determination of odds per league

Here we set the betting odds for each league, which will allow us to draw final conclusions and establish the basic odds for individual teams.

> Remember: After generating the results, it is worth checking if they are reasonable.

In [21]:
import numpy as np

margin_by_league = {
    "A": 0.04,  # 4 %
    "B": 0.06,  # 6 %
    "C": 0.08,  # 8 %
    "D": 0.10,  # 10 %
}

def odds_with_margin(prob, league):
    """
    Convert a probability into odds with a league-specific margin.
    """
    p = float(np.clip(prob, 1e-9, 1.0))
    m = margin_by_league[league]
    return (1 - m) / p

team_win_prob["league_margin"] = team_win_prob["league"].map(margin_by_league)
team_win_prob["odds_with_margin"] = team_win_prob.apply(
    lambda r: odds_with_margin(r["global_win_probability"], r["league"]), axis=1
)

out_cols = [
    "team", "league", "league_margin",
    "global_win_probability", "global_fair_odds", "odds_with_margin"
]
preview = (
    team_win_prob[out_cols]
    .sort_values(["league", "odds_with_margin"])
    .round({"league_margin": 2, "global_win_probability": 3, "global_fair_odds": 2, "odds_with_margin": 2})
    .reset_index(drop=True)         
)
preview.index = preview.index + 1   

print("Odds with margin by league have been computed.")
display(preview.head(35))

Odds with margin by league have been computed.


Unnamed: 0,team,league,league_margin,global_win_probability,global_fair_odds,odds_with_margin
1,Detroit Red Wings,A,0.04,0.579,1.73,1.66
2,New Jersey Devils,A,0.04,0.528,1.89,1.82
3,Anaheim Ducks,B,0.06,0.522,1.91,1.8
4,Dallas Stars,B,0.06,0.517,1.93,1.82
5,Colorado Avalanche,B,0.06,0.516,1.94,1.82
6,Pittsburgh Penguins,B,0.06,0.499,2.0,1.88
7,Philadelphia Flyers,B,0.06,0.493,2.03,1.91
8,St. Louis Blues,B,0.06,0.487,2.05,1.93
9,Boston Bruins,B,0.06,0.486,2.06,1.93
10,Washington Capitals,B,0.06,0.477,2.1,1.97
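
The effect of the margin can be checked by hand. The sketch below assumes that the estimated probability equals the true one; `priced_odds` is a hypothetical helper that mirrors the `(1 - m) / p` formula of `odds_with_margin`, and the value `p = 0.50` is illustrative.

```python
def priced_odds(prob, margin):
    """Decimal odds after reducing the fair payout by `margin` (hypothetical helper)."""
    return (1 - margin) / prob

p = 0.50       # assumed true win probability
margin = 0.06  # the league B margin used in this notebook

fair = 1 / p                       # fair decimal odds, no margin
offered = priced_odds(p, margin)   # odds shown to the customer
implied = 1 / offered              # probability implied by the offered odds
expected_payout = p * offered      # average payout per unit stake
hold = 1 - expected_payout         # bookmaker's expected share of each stake

print(f"fair={fair:.2f} offered={offered:.2f} implied={implied:.3f} hold={hold:.2%}")
```

Under that assumption the expected hold equals the margin for any probability, because `p * (1 - m) / p` simplifies to `1 - m`. If the true probability is higher than the estimate, the hold shrinks and can turn negative.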


In [19]:
summary = (
    team_win_prob.groupby("league")["odds_with_margin"]
    .agg(["count", "min", "median", "mean", "max"])
    .round(2)
)

summary.index = range(1, len(summary) + 1)

print("\nSummary of odds with margin by league:")
display(summary)

violations = (team_win_prob["odds_with_margin"] > team_win_prob["global_fair_odds"]).sum()
print(f"\nSanity check: violations (odds_with_margin > fair_odds): {violations}")


Summary of odds with margin by league:


Unnamed: 0,count,min,median,mean,max
1,2,1.66,1.74,1.74,1.82
2,9,1.8,1.91,1.89,1.98
3,17,1.95,2.05,2.09,2.39
4,7,2.36,2.37,2.39,2.45



Sanity check: violations (odds_with_margin > fair_odds): 0


In [24]:
from reportlab.lib.pagesizes import A4, landscape
from reportlab.lib.units import cm
from reportlab.lib import colors
from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, TableStyle, Spacer
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from datetime import datetime
import pandas as pd

try:
    pdfmetrics.registerFont(TTFont("DejaVuSans", "DejaVuSans.ttf"))
    base_font = "DejaVuSans"
except Exception:
    base_font = "Helvetica"

out_path = r"C:\Users\mjemelka\Desktop\Python\Workshop_-_files\vysledky.pdf"

styles = getSampleStyleSheet()
styles.add(ParagraphStyle(name="BodyCZ", parent=styles["BodyText"], fontName=base_font, fontSize=9, leading=12))
styles.add(ParagraphStyle(name="HeaderCZ", parent=styles["Heading1"], fontName=base_font, fontSize=16, leading=20, spaceAfter=6))
styles.add(ParagraphStyle(name="SmallCZ", parent=styles["BodyText"], fontName=base_font, fontSize=8, textColor=colors.grey))

# Work on a separate copy so the dataset loaded into `df` is not overwritten
report_df = preview.copy().astype(str)

def para(text):
    # Escape XML special characters, which Paragraph would otherwise parse as markup
    return Paragraph(text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;"), styles["BodyCZ"])

header = [Paragraph(col, styles["BodyCZ"]) for col in report_df.columns.tolist()]
data_rows = [[para(x) for x in row] for row in report_df.values.tolist()]
table_data = [header] + data_rows

max_chars = [max(len(str(c)), *(len(str(v)) for v in report_df.iloc[:, i])) for i, c in enumerate(report_df.columns)]
min_w, max_w = 2.0*cm, 6.5*cm
raw_widths = [min(max(len_ * 0.18*cm, min_w), max_w) for len_ in max_chars]

page_w, page_h = landscape(A4)
left_margin = right_margin = 1.5*cm
available_w = page_w - left_margin - right_margin
scale = available_w / sum(raw_widths)
col_widths = [w * scale for w in raw_widths]

tbl = Table(table_data, colWidths=col_widths, repeatRows=1)
tbl.setStyle(TableStyle([
    ("FONTNAME", (0, 0), (-1, -1), base_font),
    ("FONTSIZE", (0, 0), (-1, -1), 9),
    ("ALIGN", (0, 0), (-1, 0), "CENTER"),
    ("VALIGN", (0, 0), (-1, -1), "MIDDLE"),
    ("BACKGROUND", (0, 0), (-1, 0), colors.whitesmoke),
    ("TEXTCOLOR", (0, 0), (-1, 0), colors.black),
    ("GRID", (0, 0), (-1, -1), 0.25, colors.grey),
    ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.transparent, colors.Color(0.97,0.97,0.97)]),
    ("BOTTOMPADDING", (0, 0), (-1, -1), 6),
    ("TOPPADDING", (0, 0), (-1, -1), 6),
]))

doc = SimpleDocTemplate(
    out_path,
    pagesize=landscape(A4),
    leftMargin=left_margin,
    rightMargin=right_margin,
    topMargin=1.2*cm,
    bottomMargin=1.2*cm,
)

title = Paragraph("Odds with margin by league", styles["HeaderCZ"])
subtitle = Paragraph(f"Generated: {datetime.now():%d.%m.%Y %H:%M}", styles["SmallCZ"])

story = [title, subtitle, Spacer(1, 0.5*cm), tbl]

doc.build(story)
print(f"PDF created: {out_path}")

PDF created: C:\Users\mjemelka\Desktop\Python\Workshop_-_files\vysledky.pdf


# Discussion

We have obtained odds values for each league, but how does this translate into real business? The whole task was about determining the starting values from which a bookmaker can begin operations. Setting these values correctly is critical for attracting customers to place bets with us, while setting them poorly may lead to financial losses in the first days of operation.

For this reason, before translating the results and recommendations into business objectives, the analysis is subjected to discussion. Therefore, we will now take on a review role and would like to verify the steps. To that end, we will collectively discuss and critique our work by answering the following questions together:
- What elements of the analysis were simplified? What was omitted in the analysis?
- Are there any inconsistencies in the estimated odds? What are they?
- How can we improve the odds estimates?
- How can we enrich our initial dataset to make the estimates more accurate and less risky?
- How can we simulate the outcomes of our analysis to verify that they do not lead to financial losses?

This is a discussion panel, and every idea is valuable here.
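
As a starting point for the simulation question, here is a minimal Monte Carlo sketch. It assumes unit stakes, a single two-way outcome, and independent bets; the function `simulate_bookmaker_profit` and all parameter values are hypothetical and only illustrate the idea.

```python
import numpy as np

def simulate_bookmaker_profit(p_true, p_estimated, margin, n_bets=200_000, seed=0):
    """Average bookmaker profit per unit stake over simulated bets."""
    rng = np.random.default_rng(seed)
    # Odds are priced from the estimated probability, as in odds_with_margin
    offered_odds = (1 - margin) / p_estimated
    # True where the bettor wins, drawn with the true probability
    wins = rng.random(n_bets) < p_true
    # The bookmaker collects each 1-unit stake and pays the odds on winning bets
    profit = 1.0 - wins * offered_odds
    return profit.mean()

# Estimate matches reality: the average profit should be close to the margin
well_priced = simulate_bookmaker_profit(p_true=0.50, p_estimated=0.50, margin=0.06)

# The team is stronger than estimated: the margin no longer covers the payouts
mispriced = simulate_bookmaker_profit(p_true=0.55, p_estimated=0.50, margin=0.06)

print(f"well priced: {well_priced:+.3f} per unit stake")
print(f"mispriced:   {mispriced:+.3f} per unit stake")
```

A fuller simulation would also vary stake sizes and how bettors distribute themselves across teams, neither of which is modeled here.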


# Review (feedback)

## What works well

- The code has a clear structure and calculation logic: from importing the data, through computing probabilities and odds, to generating the final PDF report. This is an ideal flow for a business/analytical notebook.
- The functions are modular and clean:
  - `get_betting_odds()`: the basic conversion, correctly tested with asserts.
  - `assign_team_to_league()`: a clear docstring, readable conditions, and sensible use of percentiles.
  - `odds_with_margin()`: using `np.clip` is a nice detail that guards against division by zero.
- Using `Path` instead of plain strings for file paths (in the earlier cells) is good practice: it is more portable and easier to read.
- Everything is read from and written to logically separated folders (processed data, PDF outputs).
- PDF generation with reportlab:
  - Registering the font (DejaVuSans with a fallback to Helvetica) shows experience with international character sets.
  - Using `landscape(A4)` and computing `colWidths` dynamically from the table content is precise.
  - The `ParagraphStyle` definitions are clear and handle Czech diacritics; professionally done.
- The matplotlib configuration (figure size) and the suppressed warnings keep the notebook clean, without distracting messages.

Overall, the notebook reads like a production analytical pipeline, from data analysis to final reporting. A solution structured this way would hold up in a corporate environment.


## What to improve

### Test function

The test `test_get_betting_odds()` expects a `ZeroDivisionError`, but `get_betting_odds()` never raises it explicitly: for an input of 0 it simply fails on the division, and other invalid values (negative or greater than 1) are not rejected at all.
I recommend adding a guard:
def get_betting_odds(probability):
    if probability <= 0 or probability > 1:
        raise ValueError("Probability must be in range (0, 1].")
    return 1 / probability
and updating the test:
try:
    get_betting_odds(0)
except ValueError:
    pass
else:
    assert False, "Expected ValueError"
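
One caveat: the guard above uses plain `if` comparisons, which work for scalars but raise an error for a pandas Series (its truth value is ambiguous), and the notebook also calls the function on the whole `global_win_probability` column. The sketch below is one possible variant that validates both kinds of input; the name `get_betting_odds_checked` is hypothetical.

```python
import numpy as np
import pandas as pd

def get_betting_odds_checked(probability):
    """Fair decimal odds with input validation for scalars and array-likes."""
    values = np.asarray(probability, dtype=float)
    # np.all(...) also rejects NaN, because comparisons with NaN are False
    if not np.all((values > 0) & (values <= 1)):
        raise ValueError("Probability must be in range (0, 1].")
    return 1 / probability

# Scalar input returns a float, Series input returns a Series
single = get_betting_odds_checked(0.5)
column = get_betting_odds_checked(pd.Series([0.5, 0.25]))

print(single)
print(column.tolist())
```

The element-wise check keeps the function usable both in the unit test and in the vectorized call on a DataFrame column.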
### Redundant imports

`import pandas as pd` is repeated in the PDF generation cell. This is not an error, but the repetition is unnecessary and looks disorganized.