# Project 4: Football matches as mobility networks
During a football match, players move on the field to attack and defend. This generates a series of movements that can be analyzed to understand the players' behavior.<br><br>
The student should use the `Wyscout open dataset`, describing the “events” in all matches of seven competitions (e.g., passes, shots, tackles etc.), to analyze pass chains and the mobility of football players. A player’s movement is defined by consecutive events made by that player in the match.<br><br>
- Investigate the distances traveled by players during their matches and their distributions. Discuss how similar these distributions are to those of the mobility trajectories seen during the course.<br><br>
- Relate the pass chains made by teams to the probability of making a shot, scoring a goal, and winning a match. Are long chains more likely to lead to a shot/goal? Are short pass chains more successful?<br><br>
- Quantify the predictability of pass chains based on some division of the football field (tessellation). To what extent can we predict the next tile (field zone) where the ball will be? Use a next-location predictor to quantify how accurately the next zone of the ball can be predicted.
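
One possible way to approach the last task is sketched below, as a starting point rather than the required solution. It assumes a regular 5x4 grid as the tessellation and a first-order Markov model as the next-location predictor; both choices, and the names `zone_of`, `fit_transitions`, `predict_next` and `accuracy`, are hypothetical and not part of the dataset or the course material. Positions are taken as percentages of the pitch (0 to 100), as in the Wyscout events.

```python
from collections import Counter, defaultdict

def zone_of(x, y, n_cols=5, n_rows=4):
    """Map a position in percentage coordinates (0-100) to a tile index of a regular grid."""
    col = min(int(x / 100 * n_cols), n_cols - 1)  # clamp x == 100 into the last column
    row = min(int(y / 100 * n_rows), n_rows - 1)  # clamp y == 100 into the last row
    return row * n_cols + col

def fit_transitions(sequences):
    """Count how often each zone is followed by each other zone."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next(transitions, zone):
    """Return the most frequent successor of `zone`, or None if the zone was never seen."""
    if zone not in transitions:
        return None
    return transitions[zone].most_common(1)[0][0]

def accuracy(transitions, sequences):
    """Fraction of transitions in `sequences` whose next zone is predicted correctly."""
    hits = total = 0
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            hits += predict_next(transitions, current) == nxt
            total += 1
    return hits / total if total else 0.0

# toy zone sequences standing in for pass chains (made-up data)
train_chains = [[0, 1, 2], [0, 1, 3], [0, 1, 2]]
test_chains = [[0, 1, 2], [1, 3]]
model = fit_transitions(train_chains)
test_accuracy = accuracy(model, test_chains)
print(test_accuracy)
```

In a real analysis the chains would be split into training and test matches, and the accuracy compared against a baseline such as always predicting the most visited zone.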

## Functions

In [1]:
# returns the id of all the players in a team
def retrieve_players_in_a_team(team_code):
    players = []
    for match in teams_matches[team_code]:
        for player in match['teamsData'][team_code]['formation']['lineup']:
            if player['playerId'] not in players:
                players.append(player['playerId'])
        for player in match['teamsData'][team_code]['formation']['bench']:
            if player['playerId'] not in players:
                players.append(player['playerId'])
    return players

In [2]:
#searches for the player in the players list
def search_player(player_id, players):
    for player in players:
        if player['wyId'] == player_id:
            return player
    return 'No player found'

In [3]:
# returns the information of a player given his id
def retrieve_player_info(player_id):
    for player in players:
        if player['wyId'] == player_id:
            return player

In [4]:
# distance per player per match (in Wyscout coordinate units, i.e. percentages of the pitch)
def players_distances(players_id, team_matches, events):
    team_matches.sort(key=lambda x: x['gameweek'])
    players_distance = {}
    for match in team_matches:
        for player in players_id:
            player_events = []
            distance = 0
            # collect the events of this player in this match
            for event in events:
                if event['playerId'] == player and event['matchId'] == match['wyId']:
                    player_events.append(event)
            # eventSec restarts from 0 in each half, so sort by period first
            player_events.sort(key=lambda x: (x['matchPeriod'], x['eventSec']))
            # sum the Euclidean distances between consecutive event positions
            for i in range(len(player_events)-1):
                distance += math.dist([player_events[i]['positions'][0]['x'], player_events[i]['positions'][0]['y']], [player_events[i+1]['positions'][0]['x'], player_events[i+1]['positions'][0]['y']])
            if player not in players_distance:
                players_distance[player] = [[match['gameweek'], distance]]
            else:
                players_distance[player].append([match['gameweek'], distance])
    return players_distance

## Code

Note: the events are split into first half (`1H`) and second half (`2H`), and `eventSec` restarts from 0 at the start of each half.
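
Because of this, sorting by `eventSec` alone interleaves the two halves. A minimal sketch of a chronological sort key follows; the period labels in `PERIOD_ORDER` are assumed to be the ones used in the Wyscout events (`1H`, `2H`, plus `E1`, `E2`, `P` for extra time and penalties), and `chronological_key` is a hypothetical helper name.

```python
# Order of the match periods (assumed Wyscout labels)
PERIOD_ORDER = {'1H': 0, '2H': 1, 'E1': 2, 'E2': 3, 'P': 4}

def chronological_key(event):
    """Sort key that places every first-half event before any second-half event."""
    return (PERIOD_ORDER[event['matchPeriod']], event['eventSec'])

# toy events: the second-half event has a smaller eventSec than a first-half one
toy_events = [
    {'matchPeriod': '2H', 'eventSec': 12.0, 'eventName': 'Pass'},
    {'matchPeriod': '1H', 'eventSec': 2700.5, 'eventName': 'Shot'},
    {'matchPeriod': '1H', 'eventSec': 3.2, 'eventName': 'Pass'},
]
ordered = sorted(toy_events, key=chronological_key)
print([(e['matchPeriod'], e['eventSec']) for e in ordered])
```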

In [5]:
import json
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from utils import *
from collections import Counter
import operator

In [6]:
# loading the events data
events={}
nations = ['Italy','England','Germany','France','Spain','European_Championship','World_Cup']
for nation in nations:
    with open('./data/events/events_%s.json' %nation) as json_data:
        events[nation] = json.load(json_data)
        
# loading the match data
matches={}
nations = ['Italy','England','Germany','France','Spain','European_Championship','World_Cup']
for nation in nations:
    with open('./data/matches/matches_%s.json' %nation) as json_data:
        matches[nation] = json.load(json_data)

# loading the players data
players={}
with open('./data/players.json') as json_data:
    players = json.load(json_data)

# loading the competitions data
competitions={}
with open('./data/competitions.json') as json_data:
    competitions = json.load(json_data)

# loading the teams data
teams={}
with open('./data/teams.json') as json_data:
    teams = json.load(json_data)

## Italian division analysis
There are about 47k events (46,787 out of 647,372) with `playerId` = 0; these are events such as interruptions or duels.

### Teams loading and formatting

In [7]:
italian_teams = []

In [8]:
for team in teams:
    if team['area']['name'] == 'Italy' and team['type'] == 'club':
        italian_teams.append(team)

In [9]:
italian_teams_codes = [] #list of italian teams with their codes
for team in italian_teams:
    italian_teams_codes.append([team['name'], str(team['wyId'])])
italian_teams = italian_teams_codes

In [10]:
italian_teams

[['SPAL', '3204'],
 ['Milan', '3157'],
 ['Juventus', '3159'],
 ['Roma', '3158'],
 ['Sassuolo', '3315'],
 ['Bologna', '3166'],
 ['Sampdoria', '3164'],
 ['Chievo', '3165'],
 ['Lazio', '3162'],
 ['Udinese', '3163'],
 ['Internazionale', '3161'],
 ['Benevento', '3219'],
 ['Cagliari', '3173'],
 ['Atalanta', '3172'],
 ['Fiorentina', '3176'],
 ['Torino', '3185'],
 ['Napoli', '3187'],
 ['Crotone', '3197'],
 ['Hellas Verona', '3194'],
 ['Genoa', '3193']]

In [11]:
teams_codes = []
for team in italian_teams:
    teams_codes.append(team[1])

In [12]:
italian_matches = matches['Italy']

In [13]:
# creates a dictionary where the key is the team code and the value is a list of all the matches of that team
teams_matches = {}
for match in italian_matches:
    # teamsData has one key per team playing the match
    for team_code in match['teamsData']:
        teams_matches.setdefault(team_code, []).append(match)

In [14]:
for team in teams_matches:
    teams_matches[team].sort(key=lambda x: x['wyId'])

In [15]:
italian_events = events['Italy']

In [16]:
len(italian_events)

647372

In [17]:
italian_events.sort(key=lambda x: (x['matchId'], x['matchPeriod'], x['playerId'], x['eventSec']))

In [18]:
italian_events = [event for event in italian_events if event['playerId'] != 0]

In [19]:
len(italian_events)

600585

In [20]:
647372 - 600585

46787

46,787 records removed (they are corner kicks, fouls, etc.)

#### DICTIONARY OF ALL THE PLAYERS WHO PLAYED IN A TEAM
<b>Info: </b>
<ul>
    <li>
        <b>Key: </b> team_id
    </li>
    <li>
        <b>Value: </b> all the ids of players who played for a team
    </li>
</ul>
Only players who actually played are included (players who always stayed on the bench are left out).

In [21]:
teams_players = {}

In [22]:
for match in italian_matches:
    keys = list(match['teamsData'].keys())
    team1 = match['teamsData'][keys[0]]['formation']
    players_id_list = [player['playerId'] for player in team1['lineup']] + [sub['playerIn'] for sub in team1['substitutions']]
    if keys[0] not in teams_players:
        teams_players[keys[0]] = players_id_list
    else:
        teams_players[keys[0]] += players_id_list
    
    team2 = match['teamsData'][keys[1]]['formation']
    players_id_list = [player['playerId'] for player in team2['lineup']] + [sub['playerIn'] for sub in team2['substitutions']]
    if keys[1] not in teams_players:
        teams_players[keys[1]] = players_id_list
    else:
        teams_players[keys[1]] += players_id_list

In [23]:
teams_players = {k: list(set(v)) for k, v in teams_players.items()}

In [24]:
for team in teams_players:
    teams_players[team].sort()

In [25]:
for team in teams_players:
    if 0 in teams_players[team]:
        teams_players[team].remove(0)

In [26]:
for team in teams_players:
    print(team, teams_players[team])

3162 [130, 3484, 4792, 7926, 7965, 20460, 20550, 20561, 20575, 20972, 21350, 21384, 37745, 40806, 41368, 101635, 166534, 208865, 228928, 260250, 265865, 346908, 364640, 376362]
3161 [3344, 3431, 3543, 7982, 14812, 20517, 20519, 20556, 20571, 20626, 21094, 69968, 70965, 86785, 116349, 135903, 138408, 206314, 241676, 298212, 352993, 405608]
3158 [114, 3463, 3475, 3795, 8306, 8327, 20418, 20438, 20518, 20879, 22566, 23149, 25405, 40787, 44251, 92966, 99430, 137298, 214220, 234359, 239290, 289122, 328333, 340019, 347026, 350032, 405602]
3315 [3710, 20478, 20635, 20771, 20832, 20842, 21162, 21282, 21861, 22152, 22162, 22163, 22383, 22699, 208696, 209400, 221047, 246059, 246063, 267185, 292310, 292991, 302798, 347024, 354123, 403449, 415348, 417231]
3173 [45, 20472, 20636, 20850, 20874, 20875, 21299, 21370, 21639, 21865, 21959, 22732, 22933, 23314, 40538, 50073, 69610, 92900, 116171, 134413, 220359, 263591, 283832, 286223, 335634, 402898, 404209, 434142, 472938]
3172 [625, 20404, 20820, 2084

In [27]:
type(italian_matches)

list

In [28]:
events_set = set()
for event in italian_events:
    events_set.add((event['eventId'], event['eventName']))
#sorts the set based on the first component
events_set = sorted(events_set, key=lambda x: x[0])
for elem in events_set:
    print(elem)

(1, 'Duel')
(2, 'Foul')
(3, 'Free Kick')
(4, 'Goalkeeper leaving line')
(5, 'Interruption')
(6, 'Offside')
(7, 'Others on the ball')
(8, 'Pass')
(9, 'Save attempt')
(10, 'Shot')


In [64]:
distance_per_player = {}

In [65]:
for team in italian_teams:
    distance_per_player[team[1]]= {}

In [66]:
distance_per_player

{'3204': {},
 '3157': {},
 '3159': {},
 '3158': {},
 '3315': {},
 '3166': {},
 '3164': {},
 '3165': {},
 '3162': {},
 '3163': {},
 '3161': {},
 '3219': {},
 '3173': {},
 '3172': {},
 '3176': {},
 '3185': {},
 '3187': {},
 '3197': {},
 '3194': {},
 '3193': {}}

In [67]:
type(distance_per_player)

dict

In [68]:
for team in teams_codes:
    if 0 in teams_players[team]:
        print(team)

In [69]:
italian_teams

[['SPAL', '3204'],
 ['Milan', '3157'],
 ['Juventus', '3159'],
 ['Roma', '3158'],
 ['Sassuolo', '3315'],
 ['Bologna', '3166'],
 ['Sampdoria', '3164'],
 ['Chievo', '3165'],
 ['Lazio', '3162'],
 ['Udinese', '3163'],
 ['Internazionale', '3161'],
 ['Benevento', '3219'],
 ['Cagliari', '3173'],
 ['Atalanta', '3172'],
 ['Fiorentina', '3176'],
 ['Torino', '3185'],
 ['Napoli', '3187'],
 ['Crotone', '3197'],
 ['Hellas Verona', '3194'],
 ['Genoa', '3193']]

In [70]:
for match in teams_matches['3193']:
    print(match['wyId'])

2575967
2575973
2575988
2575992
2576003
2576013
2576022
2576030
2576043
2576054
2576066
2576074
2576080
2576093
2576103
2576111
2576123
2576131
2576148
2576157
2576163
2576178
2576182
2576193
2576203
2576212
2576220
2576233
2576244
2576256
2576264
2576270
2576283
2576293
2576301
2576313
2576321
2576338


In [71]:
for team in italian_teams:
    print(team)
    team_code = team[1]
    team_players = teams_players[team_code]
    team_matches = teams_matches[team_code]

    for match in team_matches:
        for player in team_players:
            distance = 0
            events_list = []
            for event in italian_events:
                if event['playerId'] == player and event['matchId'] == match['wyId']:
                    events_list.append(event)
            
            for i in range(len(events_list)-1):
                distance += math.dist([events_list[i]['positions'][0]['x'], events_list[i]['positions'][0]['y']], [events_list[i+1]['positions'][0]['x'], events_list[i+1]['positions'][0]['y']])
            
            if(distance != 0):
                if(player not in distance_per_player[team_code].keys()):
                    distance_per_player[team_code][player] = [[match['gameweek'], distance]]
                else:
                    distance_per_player[team_code][player].append([match['gameweek'], distance])

['SPAL', '3204']
{20448: [[1, 2728.384886556065], [2, 2537.6357696183854], [3, 2232.9529411689664], [4, 2536.357599803158], [5, 2491.6732153246594], [6, 2127.610471809638], [7, 3635.6622630278384], [8, 2513.875018966593], [9, 2618.3528363372257], [10, 2550.4761318770625], [11, 2205.994977358939], [12, 3207.459848170796], [13, 1789.8455859601777], [14, 0], [15, 1467.131558081453], [16, 2886.2064078260755], [17, 2497.991591119162], [18, 2847.0774307202632], [19, 2308.254736608399], [20, 2236.8293087393686], [21, 2838.8911598404993], [22, 3056.426772124964], [23, 2074.1573367853885], [24, 2572.1947990397857], [25, 1996.9510686055291], [26, 0], [27, 2918.957158016662], [28, 0], [29, 0], [30, 0], [31, 2449.755061386646], [32, 2423.476508860665], [33, 2639.3678766743274], [34, 0], [35, 0], [36, 0], [37, 0], [38, 2263.7046300551424]], 20476: [[1, 876.6053212857672], [2, 1426.0652881895705], [3, 1612.7476140831063], [4, 1866.8780612994565], [5, 645.3228519079248], [6, 1238.034899013819], [7, 2

KeyboardInterrupt: 
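
The nested loop above scans the whole event list once for every (match, player) pair, which is why the cell had to be interrupted. A possible alternative, sketched here under the assumption that every event carries `matchId`, `playerId`, `matchPeriod`, `eventSec` and `positions` as in the cells above, groups the events in a single pass first. The function name `distances_by_match_and_player` is hypothetical. Unlike the loop above, it also returns a distance of 0 for players with a single event.

```python
import math
from collections import defaultdict

def distances_by_match_and_player(events):
    """Return {(matchId, playerId): summed distance between consecutive events}."""
    grouped = defaultdict(list)
    for event in events:  # single pass over the event list
        if event['playerId'] != 0 and event['positions']:
            grouped[(event['matchId'], event['playerId'])].append(event)

    totals = {}
    for key, player_events in grouped.items():
        # eventSec restarts in each half, so order by period first
        player_events.sort(key=lambda e: (e['matchPeriod'], e['eventSec']))
        points = [(e['positions'][0]['x'], e['positions'][0]['y']) for e in player_events]
        totals[key] = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return totals

# toy events (made-up data) for one match
toy_events = [
    {'matchId': 1, 'playerId': 7, 'matchPeriod': '2H', 'eventSec': 5.0, 'positions': [{'x': 3, 'y': 10}]},
    {'matchId': 1, 'playerId': 7, 'matchPeriod': '1H', 'eventSec': 10.0, 'positions': [{'x': 0, 'y': 0}]},
    {'matchId': 1, 'playerId': 7, 'matchPeriod': '1H', 'eventSec': 20.0, 'positions': [{'x': 3, 'y': 4}]},
    {'matchId': 1, 'playerId': 9, 'matchPeriod': '1H', 'eventSec': 15.0, 'positions': [{'x': 50, 'y': 50}]},
    {'matchId': 1, 'playerId': 0, 'matchPeriod': '1H', 'eventSec': 30.0, 'positions': [{'x': 1, 'y': 1}]},
]
toy_totals = distances_by_match_and_player(toy_events)
print(toy_totals)
```

The result can then be regrouped by team and gameweek with a lookup from `matchId` to the match record.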

In [None]:
import pickle

# save dictionary to serie_a_distances.pkl file
with open('serie_a_distances.pkl', 'wb') as fp:
    pickle.dump(distance_per_player, fp)
    print('dictionary saved successfully to file')

In [None]:
with open('serie_a_distances.pkl', 'rb') as fp:
    teams_distances = pickle.load(fp)

I need to understand why more than one record is saved per player; the data should be saved by splitting the players by team and by gameweek played. If a player is not in a match, that gameweek should not be considered.

In [None]:
teams_distances

In [None]:
team_matches = []
for match in italian_matches:
    if '3162' in match['teamsData'].keys():
        team_matches.append(match)

In [None]:
match_events = []
for event in italian_events:
    if event['matchId'] == 2576335:
        match_events.append(event)

In [None]:
len(match_events)

In [None]:
lazio_players = teams_players['3162']

In [None]:
distance_per_player = []
for player in lazio_players:
    player_events = []
    for event in match_events:
        if event['playerId'] == player:
            player_events.append(event)
    # eventSec restarts from 0 in each half, so sort by period first
    player_events.sort(key=lambda x: (x['matchPeriod'], x['eventSec']))
    distance = 0
    for i in range(len(player_events)-1):
        distance += math.dist([player_events[i]['positions'][0]['x'], player_events[i]['positions'][0]['y']], [player_events[i+1]['positions'][0]['x'], player_events[i+1]['positions'][0]['y']])
    distance_per_player.append({'playerId': player, 'match_distance': [[2576335, distance]]})

In [None]:
# use a separate name so the global `players` list is not overwritten
n_players = 0
for team in teams_players:
    n_players += len(teams_players[team])

n_players

In [None]:
distance_per_player

Investigate the distances traveled by players during their matches and their distributions. Discuss how similar these distributions are to those of the mobility trajectories seen during the course.<br>

Giacomino said that the analysis should be done by player role, looking at the distances covered by the players, and that the same should also be done for the teams.

In [None]:
players_distances_tot_matches = {}
for team in italian_teams:
    print(team[0])
    # separate name so the global `matches` dictionary is not overwritten
    current_team_matches = teams_matches[team[1]]
    team_players = retrieve_players_in_a_team(team[1])
    players_distances_tot_matches[team[1]] = players_distances(team_players, current_team_matches, italian_events)


In [None]:
import pickle

# save dictionary to serie_a_distances.pkl file
with open('serie_a_distances.pkl', 'wb') as fp:
    pickle.dump(players_distances_tot_matches, fp)
    print('dictionary saved successfully to file')

In [None]:
with open('serie_a_distances.pkl', 'rb') as fp:
    teams_distances = pickle.load(fp)

In [None]:
teams_distances[list(teams_distances.keys())[2]]

### Plots for each team

In [None]:
player

#### Code 3162 (SS Lazio)

In [None]:
retrieve_player_info(364640)

In [None]:
# function to retrieve only players who play in a specific role in a specific team
def retrieve_players_per_role_and_per_team(team_players, role):
    players_per_role = []
    for player in players:
        if player['role']['name'] == role and player['wyId'] in team_players:
            players_per_role.append(player['wyId'])
    return players_per_role

In [None]:
# '3162' is the Lazio team code ('3161' is Internazionale)
fw_lazio_players = retrieve_players_per_role_and_per_team(retrieve_players_in_a_team('3162'), 'Forward')

In [None]:
fw_lazio_players

In [None]:
lazio_players

In [None]:
retrieve_players_per_role_and_per_team(lazio_players, 'Forward')

In [None]:
'3162' in teams_distances.keys()

In [None]:
lazio_info = teams_distances['3162']

In [None]:
len(lazio_info)

In [None]:
lazio_info[player_id][i][1]

In [None]:
lazio_info

In [None]:
len(lazio_info.keys())

In [None]:
distances = []
# collect every per-match distance, without assuming 38 records per player
for records in lazio_info.values():
    for gameweek, dist in records:
        distances.append(dist)

In [None]:
distances.sort()

In [None]:
distances

In [None]:
distances[-1]

In [None]:
distances[-2]

In [None]:
distances_dictionary = {}
for i in range (0, 4300, 100):
    distances_dictionary[i] = 0

In [None]:
# count how many per-match distances fall in each 100-unit bin, keyed by the bin start
for dist in distances:
    bin_start = int(dist // 100) * 100
    distances_dictionary[bin_start] = distances_dictionary.get(bin_start, 0) + 1

In [None]:
len(distances)

In [None]:
print(distances)

In [None]:
distances_dictionary

In [None]:
del distances_dictionary[0]

In [None]:
keys = list(distances_dictionary.keys())
values = list(distances_dictionary.values())

plt.bar(keys, values, width=100, align='edge')  # keys are bin starts, so align bars on their left edge
plt.xlabel('Distance (pitch-percentage units)')
plt.ylabel('# of matches')
plt.show()
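
As an alternative to binning by hand, `np.histogram` can produce the same kind of counts. The sketch below assumes a non-empty list of distances and a bin width of 100 coordinate units; `bin_distances` is a hypothetical helper name.

```python
import numpy as np

def bin_distances(distances, bin_width=100):
    """Count distances in consecutive bins of `bin_width`, starting at 0."""
    distances = np.asarray(distances, dtype=float)
    # upper edge strictly above the maximum, so the largest value gets its own bin
    upper = (np.floor(distances.max() / bin_width) + 1) * bin_width
    edges = np.arange(0, upper + bin_width, bin_width)
    counts, edges = np.histogram(distances, bins=edges)
    return counts, edges

# toy distances (made-up data)
toy_counts, toy_edges = bin_distances([0, 50, 150, 420])
print(toy_counts, toy_edges)
```

The counts can be plotted with `plt.bar(toy_edges[:-1], toy_counts, width=100, align='edge')`.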

In [None]:
len(lazio_players)

In [None]:
hello = retrieve_player_info(130)
hello

### Code 3162 (SS Lazio)

In [None]:
lazio_players = retrieve_players_in_a_team('3162')

In [None]:
lazio_matches = teams_matches['3162']

In [None]:
lazio_players_distances = players_distances(lazio_players, lazio_matches, italian_events)

In [None]:
list(lazio_players_distances.keys())[0]

- Take all the events of the player
- Sort them chronologically
- Take the first tuple of each "positions" and compute the distance to the immediately following event
- Sum all these distances
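
The steps above can be sketched for a single player as follows. Wyscout positions are percentages of the pitch (0 to 100), so the sketch also rescales them to meters; the pitch size of 105 x 68 m is an assumption (actual pitch dimensions are not in the dataset), and `player_distance_m` is a hypothetical helper name.

```python
import math

PITCH_LENGTH_M = 105.0  # assumed pitch length, not provided by the dataset
PITCH_WIDTH_M = 68.0    # assumed pitch width, not provided by the dataset

def player_distance_m(player_events):
    """Sum the distances (in meters) between consecutive events of one player in one match."""
    # sort chronologically: period first, because eventSec restarts in each half
    ordered = sorted(player_events, key=lambda e: (e['matchPeriod'], e['eventSec']))
    # first entry of "positions" is where the event starts; rescale percentages to meters
    points = [(e['positions'][0]['x'] / 100 * PITCH_LENGTH_M,
               e['positions'][0]['y'] / 100 * PITCH_WIDTH_M) for e in ordered]
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

# toy events (made-up data): corner to corner along the touchline, then across the goal line
toy_player_events = [
    {'matchPeriod': '1H', 'eventSec': 20.0, 'positions': [{'x': 100, 'y': 0}]},
    {'matchPeriod': '1H', 'eventSec': 10.0, 'positions': [{'x': 0, 'y': 0}]},
    {'matchPeriod': '2H', 'eventSec': 5.0, 'positions': [{'x': 100, 'y': 100}]},
]
toy_distance = player_distance_m(toy_player_events)
print(toy_distance)
```

Note that this measures the path between recorded events only, so it underestimates the distance a player actually runs.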

In [None]:
match_id = 2576335
a_match = []
for nation in nations:
    for ev in events[nation]:
        if ev['matchId'] == match_id:
            a_match.append(ev)
            
for nation in nations:
    for match in matches[nation]:
        if match['wyId'] == match_id:
            match_f = match
            
df_a_match = pd.DataFrame(a_match)
team_1, team_2 = np.unique(df_a_match['teamId']) #takes the wyId of the two teams
df_a_match['x'] = [x[0]['x'] for x in df_a_match['positions']]
df_a_match['y'] = [x[0]['y'] for x in df_a_match['positions']]
df_team_1 = df_a_match[df_a_match['teamId'] == team_1]
df_team_2 = df_a_match[df_a_match['teamId'] == team_2]

f = draw_pitch("#195905", "#faf0e6", "h", "full")
plt.scatter(df_team_1['x'], df_team_1['y'], c='cyan', edgecolors="k", zorder=12)
plt.scatter(df_team_2['x'], df_team_2['y'], marker='s', c='k', edgecolors="w", linewidth=0.25, zorder=12)
plt.title(match_f['label'], fontsize=20)
plt.show()

In [None]:
for team in teams:
    if team['name']=='Napoli':
        print(team)

In [None]:
type(competitions)

In [None]:
competitions[0]

In [None]:
type(matches)

In [None]:
matches.keys()

In [None]:
type(matches['Italy'])

In [None]:
matches['Italy'].reverse()

In [None]:
matches['Italy'][0]

In [None]:
matches['Italy'][0].keys()

In [None]:
list(matches['Italy'][0]['teamsData'].keys())[0]

In [None]:
for player in matches['Italy'][0]['teamsData'][list(matches['Italy'][0]['teamsData'].keys())[0]]['formation']['lineup']:
    print(player['playerId'])

In [None]:
players[0]

In [None]:
for player in players:
    if(player['wyId'] == 20470):
        print(player)

In [None]:
type(events)

In [None]:
events.keys()

In [None]:
events['Italy'][0].keys()

In [None]:
events_marchisio=[]

In [None]:
for event in events['Italy']:
    if(event['playerId'] == 20470 and event['matchId'] == 2575964):
        events_marchisio.append(event)

In [None]:
events_marchisio[0]

In [None]:
ev_all_nations = []
for nation in nations:
    for i in range(len(events[nation])):
        ev_all_nations.append(events[nation][i]['eventName'])

count = Counter(ev_all_nations)
counter = {}
for i,v in zip(count.keys(),count.values()):
    counter[i] = int(float(v)/len(ev_all_nations)*100)
sorted_d = np.array(sorted(counter.items(), key=operator.itemgetter(1), reverse=False))

#bar plot
f,ax = plt.subplots(figsize=(8,6))
plt.barh(list(sorted_d[:,0]),[int(x) for x in list(sorted_d[:,1])])
plt.xticks(rotation=90)
plt.xticks(fontsize=18)
plt.xlabel('events (%)', fontsize=25)
plt.yticks(fontsize=18)
plt.grid(alpha=0.3)
f.tight_layout()
plt.show()

In [None]:
events_marchisio[0]

In [None]:
events_marchisio[4]

In [None]:
len(events_marchisio)

In [None]:
# eventSec restarts from 0 in each half, so sort by period first
sorted_list_marchisio = sorted(events_marchisio, key=lambda x: (x['matchPeriod'], x['eventSec']))

In [None]:
for event in sorted_list_marchisio:
    print(event['eventName'], event['eventSec'])

In [None]:
2630/60

In [None]:
#draws edges between consecutive events
f = draw_pitch("#195905", "#faf0e6", "h", "full")
for i in range(len(sorted_list_marchisio)-1):
    x1, y1 = [sorted_list_marchisio[i]['positions'][0]['x'], sorted_list_marchisio[i+1]['positions'][0]['x']], [sorted_list_marchisio[i]['positions'][0]['y'], sorted_list_marchisio[i+1]['positions'][0]['y']]
    plt.plot(x1, y1, marker = 's', c='k', linewidth=0.25, zorder=12)

In [None]:
f = draw_pitch("#195905", "#faf0e6", "h", "full")
for event in events_marchisio:
    plt.scatter(event['positions'][0]['x'], event['positions'][0]['y'], marker='s', c='k', edgecolors="w", linewidth=0.25, zorder=12)

In [None]:
napoli_matches = []

In [None]:
for match in matches['Italy']:
    if '3187' in list(match['teamsData']):
        napoli_matches.append(match)
    

In [None]:
len(napoli_matches)

In [None]:
napoli_matches[0]['wyId']

In [None]:
for player in players:
    if(player['wyId'] == 99452):
        print(player)

In [None]:
milik_events = []

In [None]:
for event in events['Italy']:
    if(event['playerId'] == 99452 and event['matchId'] == 2575962):
        milik_events.append(event)

In [None]:
milik_events

In [None]:
# eventSec restarts from 0 in each half, so sort by period first
sorted_list_milik = sorted(milik_events, key=lambda x: (x['matchPeriod'], x['eventSec']))

In [None]:
#draws edges between consecutive events
f = draw_pitch("#195905", "#faf0e6", "h", "full")
for i in range(len(sorted_list_milik)-1):
    x1, y1 = [sorted_list_milik[i]['positions'][0]['x'], sorted_list_milik[i+1]['positions'][0]['x']], [sorted_list_milik[i]['positions'][0]['y'], sorted_list_milik[i+1]['positions'][0]['y']]
    plt.plot(x1, y1, marker = 's', c='b', linewidth=0.25, zorder=12)