## Project Motivation and Idea
Considering the enormous amounts of money in today's football market, building a well-functioning roster of players who work well together is a high-stakes task. Any insight into how a team performs together, which players are its pillars, and which players could be replaced to improve performance could therefore be worth millions.

Network analysis adds something new to sports analytics for professional football teams. Conventional sports analytics often relies on metrics that evaluate each player's performance without considering the interaction between teammates. Network analysis, by contrast, may be able to extract patterns of play, show the interaction between particular sets of players, and give us an idea of how certain teams set up their offense.

There have been different approaches to using network analysis on football teams; however, they often fall short.
Generally, adjacency matrices are built from all observed possessions. As a result, the defensive and central midfielders almost always show up as the most important players of each team. This might make sense, since they are often at the center of play, but a method that names the central midfielder the most important player of every team is neither interesting nor trustworthy.
In this project, I take a different approach by looking only at possessions that ended in a shot. Doing this takes care of two possible problems:

i) We only look at possessions in which the team actually tried to score. Short possessions in a team's own half are not considered vital to the overall game. Players who often have the ball in those possessions (defenders, central midfielders and even goalkeepers) will not show up as central if they do not contribute to offensive play.

ii) We can assign a value to each possession using an expected goals (xG) model. This gives us an idea of how big the chance was, and the value can be used to credit each player according to their contribution. The main idea is: if you create a lot of good chances, you are a good (offensive) player.

In [1]:
# reading in all necessary libraries
import networkx as nx
import os
import pandas as pd
os.chdir('/home/Nickfis/Documents/Projects/football_network_analysis')
import numpy as np
%matplotlib inline

## Reading in data

The dataframes loaded here were derived earlier from the overall event data for the season,
plus the qualifiers that describe each event more precisely.
This means that the possession data as well as the xg_features data rest on multiple assumptions made by me.
The possession dataframe contains each offensive possession of the season that ended in a shot. An offensive possession
is defined as a sequence of passes by a single team that ended in a shot.
xg_features is a dataframe with about 16,000 rows of shots, with features that describe each
shot more closely in order to obtain the probability of that shot going in. In this project, we only need
that probability, since it gives us a measure of how dangerous the offensive
possession was.

In [2]:
# event data: all events that occurred in every game of LaLiga over the season 2017/2018
event = pd.read_csv('event2017.csv')
# possession dataframes. All offensive possessions that ended in a shot
possession = pd.read_csv('possession2017.csv')
# expected goals 
xg_features = pd.read_csv('xg_features2017.csv')
# team and playerspecific information
teams = pd.read_csv('teams.csv')
allplayers = pd.read_csv('player_frame.csv')


In [3]:
# the possession dataframe needs a little preprocessing
# drop the duplicates
possession = possession[~possession.duplicated(keep='first')]
# leave out na's because not part of possession and not a player
possession = possession[~possession['player_id'].isna()]
# only look at shots and passes
shotsandpass = [1,13,14,15,16]
possession = possession[possession['event_type'].isin(shotsandpass)]
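# index by possession id (plain values, so the index does not share a name with the column)
possession.index = possession['possession'].to_numpy()
# deleting all possessions with fewer than 3 actions
action_counts = possession['possession'].map(possession['possession'].value_counts())
possession = possession[action_counts > 2]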



As we will see, both the x and y coordinates range from 0 to 100. In reality, though, the pitch is not a square. I therefore divide the y-coordinates by the ratio of pitch length to pitch width, so that both axes are measured on the same scale.

In [4]:
# check for the y-coordinate
possession['y'].describe()

count    19778.000000
mean        49.471489
std         27.407908
min          0.000000
25%         27.300000
50%         49.400000
75%         71.500000
max        100.000000
Name: y, dtype: float64

In [5]:
ratio=105/69.5

possession['y'] = possession['y']/ratio
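As a sanity check on the rescaling, the following minimal sketch applies the same division to a few toy y-values. It assumes, as the cell above does, pitch dimensions of 105 by 69.5; `rescale_y` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# assumed pitch dimensions, taken from the ratio used in the cell above
PITCH_LENGTH = 105
PITCH_WIDTH = 69.5


def rescale_y(y, length=PITCH_LENGTH, width=PITCH_WIDTH):
    """Express y (0-100 across the width) in the same units as x (0-100 along the length)."""
    return y / (length / width)


# toy y-coordinates: one touchline, the middle of the pitch, the other touchline
toy_y = pd.Series([0.0, 50.0, 100.0])
scaled = rescale_y(toy_y)
print(scaled.round(2).tolist())
```

After the rescaling, y runs from 0 to roughly 66.2 while x still runs from 0 to 100, so distances along both axes become comparable.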

## Building Adjacency Matrices according to amount of passes for each team

In the following block of code, the "normal" adjacency matrices are built. For every pass in an offensive possession, one is added to the cell whose row belongs to the passing player and whose column belongs to the receiving player.
We end up with a player-by-player matrix of everyone involved in possessions that led to a shot: a player's row shows the passes they played, and their column shows the passes they received.

Furthermore, the loop creates another dataframe for each team with the average position of the players involved, in order to draw the offensive graph later on.

In [6]:
#create adjacency matrix for each team
for team in possession['team_id'].unique():
    team_pos = possession[possession['team_id']==team]
    # create adjacency matrix for that team
    team_player = team_pos['player_id'].unique()
    # player x player matrix
    adj_matrix = pd.DataFrame(np.zeros((len(team_player),len(team_player))))
    # name indices
    adj_matrix.columns = team_player
    adj_matrix.index = team_player
    # loop through every possession for the team and add received / played passes for each player
    for pos in team_pos['possession'].unique():
        one_pos = team_pos[team_pos['possession']==pos]
        #print(pos)
        # now go back more and more in the dataframe till the beginning
        # first pass
        for k in range((one_pos.shape[0]-1),0,-1):
            #print(k)
            # player that passes the ball
            i = one_pos.iloc[k]['player_id']
            # player that receives the pass
            j = one_pos.iloc[k-1]['player_id']
            # filling up the values
            adj_matrix.loc[i,j]+=1

    # ---------- building the position dataframe ----------
    # one row per player: mean x / y position and number of actions
    team_positions = pd.DataFrame(team_player)
    team_positions.index = team_player
    # initialise positions as floats, since they will hold means
    team_positions['x'] = 0.0
    team_positions['y'] = 0.0
    # number of actions
    team_positions['actions'] = 0

    for i in team_player:
        #player_pos = team_pos[team_pos['player_id']==i]
        team_positions.loc[i,'x'] = team_pos[team_pos['player_id']==i]['x'].mean()
        team_positions.loc[i,'y'] = team_pos[team_pos['player_id']==i]['y'].mean()
        team_positions.loc[i, 'actions'] = len(team_pos[team_pos['player_id']==i]['possession'])


    team_positions.to_csv('positions_'+str(team)+'.csv')
    adj_matrix.to_csv('adj_matrix_'+str(team)+'.csv')
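The pass-counting loop above can also be sketched without explicit loops. The example below is a minimal illustration on toy data, not the notebook's actual method: the player ids and possessions are made up, and it follows the loop's convention that the player in row k is the passer and the player in row k-1 of the same possession is the receiver.

```python
import pandas as pd

# toy data: two possessions of one hypothetical team, rows ordered as in the notebook
toy = pd.DataFrame({
    'possession': [1, 1, 1, 2, 2, 2],
    'player_id': [10, 20, 30, 20, 30, 10],
})

# within each possession, pair every row with the row before it
# (row k = passer, row k-1 = receiver, as in the loop above)
toy['receiver'] = toy.groupby('possession')['player_id'].shift(1)
# the first row of each possession has no previous row, so it is dropped
passes = toy.dropna(subset=['receiver']).astype({'receiver': int})

# count passer/receiver pairs and expand to a full player x player matrix
players = toy['player_id'].unique()
adj = (pd.crosstab(passes['player_id'], passes['receiver'])
       .reindex(index=players, columns=players, fill_value=0))
print(adj)
```

Pairing rows with `shift` inside a `groupby` guarantees that no link is created across the boundary between two possessions, which the loop achieves by slicing one possession at a time.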

## Building Adjacency Matrices according to xG value for each team
Although the previously created adjacency matrices already give a better picture of what is going on in an offensive network, we want to be more precise.
Using a previously built xG model, whose features help to evaluate whether a chance has a high or low probability of going in, we assign each possession a value.
This value lies between 0 and 1, since it is the probability of the team scoring from the shot.
Among the features of the model are the distance, the type of shot (header/foot), whether the shot was set up by a teammate, the pass that led to the shot, and more. Overall, the model achieves an area under the curve of over 80%.
After assigning the probability of each shot to its possession, the players are rewarded uniformly. Of course these weights can be changed (to exponential weights, to weights from a pass probability model, or to any number of other schemes), but I keep it simple for now.
The general idea behind a uniform distribution of the reward is: if a player is often involved in great chances, he should show up as a great contributor, regardless of position. There is a reason why certain defenders initiate dangerous possessions more frequently than others.
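To make the uniform reward concrete, here is a small worked example with made-up numbers. It assumes a hypothetical possession of three passes followed by a shot with an xG of 0.30, and it mirrors the cell that follows in dividing the xG by the number of actions in the possession.

```python
# hypothetical possession: 3 passes followed by a shot, so 4 actions in total
goal_prob = 0.30   # assumed xG of the final shot
n_actions = 4

# the xG is divided by the number of actions in the possession
value = goal_prob / n_actions

# each of the n_actions - 1 consecutive pairs of actions adds `value` to one cell
n_links = n_actions - 1
total_credit = value * n_links
print(value, total_credit)
```

Each link receives 0.075, and the possession adds about 0.225 to the matrix in total. This is slightly less than the shot's xG, because the value is split over the actions rather than over the links between them.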

In [None]:
## add the goal probability to the possession dataframe
goal_prob = pd.merge(possession[['unique_event_id', 'possession']], xg_features[['unique_event_id', 'goal_prob']], how='right', on='unique_event_id')
possession = pd.merge(possession, goal_prob.dropna(subset=['goal_prob'])[['possession', 'goal_prob']], on='possession', how='left')
possession = possession.drop_duplicates(subset='unique_event_id')

for team in possession['team_id'].unique():
    #print(team)
    team_pos = possession[possession['team_id']==team]
    # create adjacency matrix for that team
    team_player = team_pos['player_id'].unique()
    # player x player matrix
    adj_matrix = pd.DataFrame(np.zeros((len(team_player),len(team_player))))
    # name indices
    adj_matrix.columns = team_player
    adj_matrix.index = team_player
    # loop through every possession for the team and add received / played passes for each player
    for pos in team_pos['possession'].unique():
        one_pos = team_pos[team_pos['possession']==pos]
        value = one_pos['goal_prob'].iloc[0]/one_pos.shape[0]
        #print(pos)
        # now go back more and more in the dataframe till the beginning
        # first pass
        for k in range((one_pos.shape[0]-1),0,-1):
            #print(k)
            # player that passes the ball
            i = one_pos.iloc[k]['player_id']
            #print(i)
            # player that receives the pass
            j = one_pos.iloc[k-1]['player_id']
            # filling up the values
            adj_matrix.loc[i,j]+=value

    adj_matrix.to_csv('xg_adj_matrix_'+str(team)+'.csv')
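One possible next step, sketched here under assumptions, is to load such a matrix into networkx as a directed, weighted graph. The matrix below is toy data with hypothetical player ids; the weighted in- and out-degree is only one of several measures that could be used and is not necessarily the one the project settles on.

```python
import networkx as nx
import pandas as pd

# toy xG-weighted adjacency matrix for three hypothetical player ids
players = [10, 20, 30]
adj = pd.DataFrame(
    [[0.0, 0.0, 0.1],
     [0.2, 0.0, 0.0],
     [0.0, 0.3, 0.0]],
    index=players, columns=players,
)

# rows are the source of each edge, columns the target;
# non-zero cells become edges with a 'weight' attribute
G = nx.from_pandas_adjacency(adj, create_using=nx.DiGraph)

# weighted out-degree: total value on edges leaving each player
out_strength = dict(G.out_degree(weight='weight'))
# weighted in-degree: total value on edges arriving at each player
in_strength = dict(G.in_degree(weight='weight'))
print(out_strength)
print(in_strength)
```

When one of the saved CSV files is read back with `pd.read_csv(..., index_col=0)`, the column labels arrive as strings and would need to be converted to match the index before calling `from_pandas_adjacency`.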

