# Data and Analysis: Detecting Match-Fixing Patterns In Tennis

The Python code below runs the anonymized implementation of the [methodology described here](../README.md) that was used in "[The Tennis Racket](http://www.buzzfeed.com/heidiblake/the-tennis-racket)". The methodology contains many important details. Please read it before continuing here.

## Importing The Data

In [1]:
import pandas as pd
import random

In [2]:
betting_data = pd.read_csv("../data/anonymous_betting_data.csv")

## Match Selection

The code below excludes opening odds that implied probabilities more than 10 percentage points higher or lower than the median of all bookmakers’ opening odds for the match. (Otherwise the return of these odds toward the consensus could be mistaken for a sign of suspicious betting.) The code also excludes matches that were noted as "canceled" — typically a result of pre-match withdrawals — or "walkover" on OddsPortal.

In [3]:
def get_outlier_openings(match_books):
    median = match_books["implied_prob_winner_open"].median()
    return match_books[
        (match_books["implied_prob_winner_open"] - median).abs() > 0.1
    ]

In [4]:
outlier_openings = betting_data\
    .groupby("match_uid").apply(get_outlier_openings)

In [5]:
selected_betting_data = betting_data[
    ~betting_data["match_book_uid"].isin(outlier_openings["match_book_uid"]) &
    ~betting_data["is_cancelled_or_walkover"]
].copy()
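
As a quick illustration of the outlier rule described above, the sketch below applies the same 10-percentage-point test to a made-up match with four bookmakers. The toy values and the `is_outlier_open` column are assumptions for demonstration only; they are not part of the dataset.

```python
import pandas as pd

# Toy data: one match, four bookmakers (all values are invented for illustration).
toy = pd.DataFrame({
    "match_uid": ["m1"] * 4,
    "match_book_uid": ["m1-a", "m1-b", "m1-c", "m1-d"],
    "implied_prob_winner_open": [0.50, 0.52, 0.55, 0.70],
})

# Per-match median of the opening implied probabilities (0.535 here).
match_median = toy.groupby("match_uid")["implied_prob_winner_open"].transform("median")

# Flag books whose opening line sits more than 10 points from that median.
toy["is_outlier_open"] = (toy["implied_prob_winner_open"] - match_median).abs() > 0.1

print(toy[["match_book_uid", "is_outlier_open"]])
```

Only the last bookmaker, whose opening line is 16.5 points above the median, would be dropped under this rule.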

In [6]:
print("The selected data removes {0} matches."\
      .format(betting_data["match_uid"].nunique() - selected_betting_data["match_uid"].nunique()))

The selected data removes 539 matches.


In [7]:
print("There are {0:,} unique matches with odds in the dataset from {1:.0f} to {2:.0f}"\
      .format(selected_betting_data["match_uid"].nunique(), selected_betting_data["year"].min(), selected_betting_data["year"].max()))

There are 25,993 unique matches with odds in the dataset from 2009 to 2015


## Odds-Movement Calculation

The code below finds the odds movement for a bookmaker in a given match by calculating the difference between each player’s chance of winning implied by the opening and final odds.

In [8]:
selected_betting_data["winner_movement"] = selected_betting_data["implied_prob_winner_close"] - selected_betting_data["implied_prob_winner_open"]
selected_betting_data["loser_movement"] = selected_betting_data["implied_prob_loser_close"] - selected_betting_data["implied_prob_loser_open"]
selected_betting_data["abs_winner_movement"] = selected_betting_data["winner_movement"].abs()
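
To make the sign convention concrete, here is a small sketch on invented numbers. A positive `winner_movement` means the market moved toward the eventual winner between the opening and final odds. The `is_high_move` column is a hypothetical helper, not a column in the dataset.

```python
import pandas as pd

# Two bookmakers' lines for the same hypothetical match (values are invented).
toy = pd.DataFrame({
    "match_book_uid": ["m1-a", "m1-b"],
    "implied_prob_winner_open": [0.40, 0.40],
    "implied_prob_winner_close": [0.55, 0.45],
})

# Movement is the closing implied probability minus the opening one.
toy["winner_movement"] = toy["implied_prob_winner_close"] - toy["implied_prob_winner_open"]
toy["abs_winner_movement"] = toy["winner_movement"].abs()

# Hypothetical flag: did this book move more than 10 percentage points?
toy["is_high_move"] = toy["abs_winner_movement"] > 0.10

print(toy[["match_book_uid", "winner_movement", "is_high_move"]])
```

The first book moved about 15 points toward the winner and is flagged; the second moved about 5 points and is not.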

## Player Selection

The code below selects only matches where, in at least one book, the odds moved more than 10 percentage points. The 10-percentage-point cutoff is based on discussions with sports-betting investigators, who said that movement above this threshold was what prompted them to give greater scrutiny to a match.

Players who lost more than 10 such “high-movement” matches are selected for analysis.

In [9]:
high_move_matches = selected_betting_data[(selected_betting_data["abs_winner_movement"] > 0.10)]\
    .sort_values("abs_winner_movement")\
    .drop_duplicates(subset="match_uid")\
    .copy()

In [10]:
print("There was movement greater than 10 percentage points in {0:.2f}% of matches."\
      .format(round(100.0 * len(high_move_matches) / selected_betting_data["match_uid"].nunique(), 2)))

There was movement greater than 10 percentage points in 10.76% of matches.


In [11]:
def find_high_movement_matches_for_player(name):
    high_move_matches = selected_betting_data[
            (((selected_betting_data["winner_movement"] > 0.10) & 
              (selected_betting_data["loser"] == name)) |
             ((selected_betting_data["loser_movement"] > 0.10) &
              (selected_betting_data["winner"] == name)))]\
            .sort_values("abs_winner_movement")\
            .drop_duplicates(subset="match_uid")\
            .copy()
    return pd.Series([name, len(high_move_matches), len(high_move_matches[high_move_matches["loser"] == name])])

In [12]:
all_players = pd.DataFrame(selected_betting_data["loser"].unique()).rename(columns={0: "name"})

In [13]:
player_high_move_counts = all_players["name"].apply(find_high_movement_matches_for_player)\
    .rename(columns={0: "name", 1: "high_move_matches", 2: "high_move_losses"})

In [14]:
selected_players = player_high_move_counts[(player_high_move_counts["high_move_losses"] > 10)].copy()

In [15]:
print("There are {0} players with more than 10 losses in high-move matches.".format(len(selected_players)))

There are 39 players with more than 10 losses in high-move matches.
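
The per-player loss count can also be sketched without a per-player function call. The example below is an alternative formulation on toy data, not the notebook's own method: it keeps rows where the odds moved more than 10 points toward the eventual winner, collapses multiple bookmakers to one row per match, and counts matches per loser.

```python
import pandas as pd

# Toy rows: match m1 appears twice because two bookmakers quoted it.
toy = pd.DataFrame({
    "match_uid": ["m1", "m1", "m2", "m3"],
    "winner": ["A", "A", "B", "A"],
    "loser": ["B", "B", "A", "B"],
    "winner_movement": [0.12, 0.15, 0.02, 0.11],
})

high_move_losses = (
    toy[toy["winner_movement"] > 0.10]      # market moved toward the winner
    .drop_duplicates(subset="match_uid")    # one row per match
    .groupby("loser")
    .size()                                 # number of such losses per player
)

print(high_move_losses)
```

Here player `B` has two high-movement losses (`m1` and `m3`), while `A` has none because the movement in `m2` was only 2 points.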


## Simulation

The code below runs a series of simulations to estimate how unlikely each player’s outcomes were. Each simulation uses the player’s implied chance of winning, based on each match’s opening odds, to generate a set of outcomes for that player’s string of high-movement matches. BuzzFeed News ran the simulation 1 million times per player. The result is the estimated chance that the player would have lost as many (or more) high-movement matches as the player did, if the chances implied by the opening odds were correct.

In [16]:
class Player(object):
    def __init__(self, player_name):
        self.name = player_name
        self.matches = self.get_matches()
        self.wins = len(self.matches[self.matches["winner"] == self.name])

    def get_matches(self):
        player_matches = selected_betting_data[
            (((selected_betting_data["winner_movement"] > 0.10) & 
              (selected_betting_data["loser"] == self.name)) |
             ((selected_betting_data["loser_movement"] > 0.10) &
              (selected_betting_data["winner"] == self.name)))]\
            .sort_values("abs_winner_movement", ascending=False)\
            .drop_duplicates(subset="match_uid")\
            .copy()
        player_matches["player_odds_open"] = player_matches\
            .apply(lambda x: x["implied_prob_winner_open"] if x["winner"] == self.name else x["implied_prob_loser_open"],axis=1)
        player_matches["player_odds_close"] = player_matches\
            .apply(lambda x: x["implied_prob_winner_close"] if x["winner"] == self.name else x["implied_prob_loser_close"],axis=1)
        return player_matches

    def sim_once(self, odds_type="open"):
        wins = 0
        for i, m in self.matches.iterrows():
            if m["player_odds_"+odds_type] > random.random():
                wins += 1
        return wins
    
    def sim_x_times(self, x, odds_type="open"):
        return [ self.sim_once(odds_type) for n in range(x) ]
    
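    def pct_sims_with_more_than_x(self, x_times, odds_type="open"):
        # Despite the method name, this returns the share of simulations in which
        # the player won as few as (or fewer than) their actual number of wins.
        sims = self.sim_x_times(x_times, odds_type)
        return float(len([x for x in sims if x <= self.wins])) / x_times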

In [17]:
N_SIMULATIONS = 1000000

def get_likelihood(player_name):
    player = Player(player_name)
    return player.pct_sims_with_more_than_x(N_SIMULATIONS, "open")

In [18]:
selected_players["likelihood_open"] = selected_players["name"].apply(get_likelihood)
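
The simulation above estimates a probability that can also be computed exactly, under the same assumption that matches are independent. The sketch below is not the method used in the analysis; it is a cross-check that builds the distribution of the number of wins one match at a time (a Poisson binomial distribution). The function name `prob_at_most_k_wins` and the example probabilities are hypothetical.

```python
def prob_at_most_k_wins(win_probs, k):
    """Exact P(wins <= k) for independent matches with the given win probabilities."""
    # dist[j] holds the probability of exactly j wins among the matches seen so far.
    dist = [1.0]
    for p in win_probs:
        new = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            new[j] += mass * (1.0 - p)   # player loses this match
            new[j + 1] += mass * p       # player wins this match
        dist = new
    return sum(dist[: k + 1])


# Hypothetical opening win probabilities for three matches.
example_probs = [0.5, 0.5, 0.5]
print(prob_at_most_k_wins(example_probs, 0))  # 0.125
```

For a real player, `win_probs` would be the `player_odds_open` values and `k` the player's actual number of wins; the simulated likelihood should land close to this exact value.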

## Classify Likelihood

*Note on reading the `likelihood_level_open` column:*

- Players who have *Bonferroni* likelihood below 5%: \*\*\*\*
- Players who have an overall likelihood below 0.1%: \*\*\*
- Players who have an overall likelihood below 1%: \*\*
- Players who have an overall likelihood below 5%: \*

In [19]:
def classify_likelihood(likelihood):
    if likelihood < (0.05 / len(selected_players)): return "****"
    elif likelihood < 0.001: return "***"
    elif likelihood < 0.01: return "**"
    elif likelihood < 0.05: return "*"
    return ""
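
For reference, the Bonferroni cutoff in the first branch works out as follows, assuming the 39 selected players reported earlier. The constant names are hypothetical.

```python
ALPHA = 0.05
N_PLAYERS = 39  # selected players, from the count printed earlier

# Bonferroni correction: divide the significance level by the number of tests.
bonferroni_threshold = ALPHA / N_PLAYERS
print(round(bonferroni_threshold, 6))  # 0.001282

# Two likelihoods taken from the results table in this notebook.
for likelihood in (0.00041, 0.002259):
    print(likelihood, likelihood < bonferroni_threshold)
```

Because this threshold (about 0.00128) is slightly above 0.001, any likelihood below 0.001 is already caught by the first branch, so with 39 players the `"***"` level is never returned.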

In [20]:
selected_players["likelihood_level_open"] = selected_players["likelihood_open"].apply(classify_likelihood)

In [21]:
selected_players[
    selected_players["likelihood_open"] < 0.05
].sort_values("likelihood_open")

| | name | high_move_matches | high_move_losses | likelihood_open | likelihood_level_open |
|---|---|---|---|---|---|
| 58 | f16cc81d239ad735c51cc71442cda44c4d1a9323eb4101... | 16 | 15 | 9.6e-05 | \*\*\*\* |
| 235 | 33367d214715ab5f5e335cd67dbc90e62983b98e5278a4... | 16 | 15 | 0.000178 | \*\*\*\* |
| 293 | 6702a5de750846f45a3d977f50023c1b20156c61949f2f... | 12 | 12 | 0.000195 | \*\*\*\* |
| 82 | 9c92af8ca1b57024bd0a39b73db8be44b25bcde4115549... | 15 | 14 | 0.00041 | \*\*\*\* |
| 0 | 0ffe23c8b80916f6b2c23a52e08018374d68d12f49b261... | 18 | 15 | 0.002259 | \*\* |
| 86 | 05f3190e5053090035664800d1f52203b40a826cf7f065... | 15 | 13 | 0.002737 | \*\* |
| 304 | 573dad2e08250afa99aa704c7ea888b421bcf06bd00aab... | 14 | 13 | 0.005258 | \*\* |
| 13 | dd83d749567ad7c7f4e89656b08d4791acefd60724cc84... | 19 | 15 | 0.005684 | \*\* |
| 3 | 79784720fab57e7cc611e07c258cf49f484b9cee01bf47... | 14 | 11 | 0.005984 | \*\* |
| 69 | 4f7f8e1b43947b2fb123afb92263b4a863daa87a4de44c... | 19 | 13 | 0.01637 | \* |


In some simulations an additional player received an estimated likelihood just barely under 0.05. To be conservative we are not including that player among our totals. 

## How Many Questionable Matches Have Players On Investigators' List Lost?

The strings below represent the anonymized names of the 28 players flagged in a 2008 report by investigators for the Association of Tennis Professionals. Each anonymized name is the SHA256 hash of the name plus a randomly generated salt.

In [22]:
report_players = [
     'f5cecec5a7714e86cf761e7cda278f144d82eac78d15c7f67aecf9ba186e7830',
     'e39d12f03f441a3e8eb207fb12eced70fdf2c06cbaf27e123d457d1780447baf',
     'fa4319726a465ed7c72f125332082b1e1afdef2d8164c4dfff237d78aed2e39e',
     '0ffe23c8b80916f6b2c23a52e08018374d68d12f49b261ccb36fecd52927cc0a',
     'b5c0e84eda074671d6a3d7edf59e65242d080e26d35fa158b11f74c9891355e4',
     '11411268e0ea9e1527a49193485d117e35b0645a17f4b0b40da262300e8d4430',
     '02a755e7afd8581feadcfd369d8a62fc7fec476ce4e0c55de5fc03c0da0f3c81',
     '47f8d9fb7d7156217c15e7aea9127cf8a7ffcabdd3e97fc16c533dc807430308',
     '2ed14b47b1c58532b757d76404dcf1a114b712e50193f0b0a5a05f52e3067134',
     '6840fadf79442f1fa10569f210305a669242159fd31abc2eaa94d158a7e3b301',
     '91066973c924f6a41cef067cb3ebdb8f6d6c6a0cdd85933bb84965c25d377c18',
     'd489880f3981ace1f6c03616fe169a0b5e513ccd5da3547ce971dde26b3bde43',
     '30b4b70b6ed9adb822559be9d7f74747e73af99a33c0649d87dd21cadedb9681',
     '5b94678362f659bd7058eba695e963a2039567f3830d502665808303c27771c4',
     'c06ec5c640acfd2a94350a468185475f73e1d614f497540cf4e05f2a905a8fac',
     '7a46553d6c2a135edb7d6a4e3408be7eb5f41953f442fb108a7b6e587ecee038',
     'dd83d749567ad7c7f4e89656b08d4791acefd60724cc848697903d2aa13731c7',
     'aa2bd77955c425c8da69a09584beaccf24a2dc15b903beecc7e9069d4c520c21',
     '55c14ebb1ec4efa5c6e3dd272c747896d2647c883ca6861ebc6f83d382075c69',
     '694668c73710b80adb51764ae06a1413fb93e7d10e0d329a63c83a14b77c3fd2',
     'dcb744cbd79602f5ad05227acabb3be17729b2b5bda60595f5b62c0f0145843f',
     '51c4b3f11032d72af378075926b7ed628360fd3ec605a9298a00e076ef797f4a',
     'd5e122c7e9bd24d1295d3bbcf29455c21676e09ff8f69255dd387c0240544d20',
     '614c2049880f015352fb695961ec2763194439ce9fbb11ece98e2264eb1942df',
     '061a49265f4f3b6970b8943181aa93431bbfcc6cc96f5a6b23590c2785fddc5a',
     '73f6d26367e4793ebd7dfe1e1ef17cb64455e41c9e30cc78fb7ef7277268b546',
     'cd4a092bde2eba04a8adcb2f241c638b560ee56b9c537f78bd4808937f1b73e2',
     'c9d4889baca9908d2ca2f8515d02f164fcd84642bee5e73cbf3544b26a8315a6'
]
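
For readers unfamiliar with salted hashing, the sketch below shows the general idea behind identifiers like those above. It is an illustration only: the function name `anonymize`, the order in which name and salt are combined, the text encoding, and the salt itself are all assumptions, so it will not reproduce the hashes in this notebook.

```python
import hashlib
import secrets


def anonymize(name, salt):
    # Assumption: the salt is appended to the name and encoded as UTF-8.
    # The actual concatenation order and encoding are not documented here.
    return hashlib.sha256((name + salt).encode("utf-8")).hexdigest()


# A random salt, kept secret so hashes cannot be reversed by guessing names.
salt = secrets.token_hex(16)

digest = anonymize("Example Player", salt)
print(len(digest))  # 64 hexadecimal characters
```

The same name and salt always produce the same 64-character string, which is what allows the hashed names in `report_players` to be matched against the `winner` and `loser` columns.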

In [23]:
atp_report_high_move_losses = selected_betting_data[
    (selected_betting_data["winner_movement"] > 0.10) & 
    (selected_betting_data["loser"].isin(report_players))
].drop_duplicates(subset="match_uid")

In [24]:
print("Players in the 2008 ATP report have lost {0} matches with large pre-match movements in the data we analyzed."\
      .format(len(atp_report_high_move_losses)))

Players in the 2008 ATP report have lost 112 matches with large pre-match movements in the data we analyzed.

