# Passing analysis

Great, I'm happy that you are still interested in learning about soccer analytics! If you haven't done so yet, I recommend having a look at the notebook on goal kicks before starting this one, as we are going to assume that you are already familiar with some of the things you learned there.

In this notebook we are going to look into passing in general and make a deep-dive into the "German clasico". The notebook is set up in the following way: 

1. [Descriptive analysis on passes on team level](#pass_analysis)
2. Deep-dive into the German clasico:
    - [Team statistics to get a general understanding](#team_statistics)
    - [Player statistics to see how each player did](#player_statistics)
    - [Passing lines between players to get a feeling for player tandems](#passing_lines)
    - [Passing zones for starting to understand the match setup](#passing_zones)
3. [Gini coefficient of passes to get an understanding of the team compactness](#gini)

Again, we are going to use a lot of helper functions and learn, amongst other things, to:
- Compute statistics efficiently with the *compute_statistics* function
- Use the position plot to indicate players' positions and passing lines
- Draw pass polar plots to read the game setup
- Use the Gini coefficient to measure the pass balance within a team

Ready for looking into passes?! Let's go!

In [6]:
# import packages
import os
import pandas as pd
import numpy as np
import plotly.express as px

if os.getcwd().split(os.sep)[-1] == "notebooks":
    os.chdir("../")

import helper.io as io
import helper.event_data as ed_help
import helper.plotly as py_help
import helper.general as gen_help

# this is very useful as it makes sure that always all columns and rows of a data frame are displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Read all required data

In [7]:
league = "germany"
df_events = io.read_event_data(league)
df_matches = io.read_match_data(league)
df_teams = io.read_team_data(league)
df_formations = io.read_formation_data(league)

<a id="pass_analysis"></a>

## Passing per team

Let's start by looking at the number of passes as well as the accuracy of the passes for each team. Notice that in order to do so, we use the helper function *compute_statistics*, which can be used to compute pre-defined KPIs in only one line of code. Quite neat, isn't it? If it is not completely clear yet how to use this function, don't worry. We will use it several times below and by the end of this notebook it should definitely be clear :-)

In [3]:
# compute total passes and accuracy for each team
df_passes_teams = ed_help.compute_statistics(df_events, group_col="team", 
                                             keep_kpis=["totalPasses", "shareAccuratePasses"])
# add the team name
df_passes_teams = pd.merge(df_passes_teams, df_teams[["teamId", "teamName", "position"]], how="left", on="teamId")

# look at passes per match rather than total passes
df_passes_teams["totalPassesPerMatch"] = np.round(df_passes_teams["totalPasses"] / df_passes_teams["nbMatches"],0)
df_passes_teams.sort_values("totalPassesPerMatch", inplace=True)
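
If you are curious what a helper like *compute_statistics* might do under the hood, here is a minimal sketch in plain pandas. It is not the helper's actual implementation: the toy events and the aggregation logic are assumptions for illustration, only the column names (*teamId*, *matchId*, *eventName*, *accurate*) are taken from the event data.

```python
import pandas as pd

# toy events with hypothetical values; column names follow the event data
events = pd.DataFrame({
    "matchId":   [1, 1, 1, 1, 2, 2, 2],
    "teamId":    [10, 10, 20, 20, 10, 10, 20],
    "eventName": ["Pass", "Pass", "Pass", "Shot", "Pass", "Duel", "Pass"],
    "accurate":  [1, 0, 1, 0, 1, 0, 1],
})

# only passes are relevant for the passing KPIs
passes = events[events["eventName"] == "Pass"]

# one row per team: matches with at least one pass, passes and accurate passes
stats = (passes.groupby("teamId")
               .agg(nbMatches=("matchId", "nunique"),
                    totalPasses=("accurate", "count"),
                    totalAccuratePasses=("accurate", "sum"))
               .reset_index())

# share of accurate passes in percent and passes per match
stats["shareAccuratePasses"] = (stats["totalAccuratePasses"] / stats["totalPasses"] * 100).round(2)
stats["totalPassesPerMatch"] = stats["totalPasses"] / stats["nbMatches"]
print(stats)
```

The real helper returns more KPIs, but the idea of "filter, group, aggregate" is probably similar.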

We now plot the number of passes together with the pass accuracy. We also add the league position of each team to the chart to see whether there is a correlation between teams doing well and the number of passes.

In [5]:
fig = px.bar(df_passes_teams, 
             x="teamName", 
             y="totalPassesPerMatch", 
             color="shareAccuratePasses",
             text="position",
             labels={"totalPassesPerMatch": "Passes per match",
                     "shareAccuratePasses": "Pass accuracy (%)",
                     "position": "League position",
                     "teamName": "Team"},
             title="Number of passes vs. pass accuracy")
fig.show()

Ok, Bayern having by far the most passes was definitely expected. And there is definitely some correlation between how good a team is, its number of passes and its accuracy. Nevertheless, some teams are quite surprising. Take Augsburg, for example. They ended up 12th in the league but were second worst in number of passes and worst in accuracy...

Maybe it has something to do with the length of the passes. That is, Augsburg may be doing relatively badly in number of passes and accuracy because they play longer passes on average.

Let's quickly look into this. Again we can use the *compute_statistics* function, this time adding *passLength* to the KPIs we want returned (the resulting column is called *meanPassLength*).

In [6]:
df_passes_teams = ed_help.compute_statistics(df_events, group_col="team", 
                                             keep_kpis=["totalPasses", "shareAccuratePasses", "passLength"])

# add the team name
df_passes_teams = pd.merge(df_passes_teams, df_teams[["teamId", "teamName", "position"]], how="left", on="teamId")

# look at passes per match rather than total passes
df_passes_teams["totalPassesPerMatch"] = np.round(df_passes_teams["totalPasses"] / df_passes_teams["nbMatches"],0)
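
In case you wonder how a pass length can be derived from the event data: a natural choice is the straight-line distance between the start and the end position of the pass. The sketch below assumes exactly that (I have not checked that the helper computes it this way) and uses the coordinates of the first two passes of the Dortmund vs. Bayern match that is printed later in this notebook.

```python
import numpy as np
import pandas as pd

# start and end coordinates (in metres) of two passes from the event data
passes = pd.DataFrame({
    "posBeforeXMeters": [40.95, 35.70],
    "posBeforeYMeters": [23.12, 10.88],
    "posAfterXMeters":  [35.70, 30.45],
    "posAfterYMeters":  [10.88, 22.44],
})

# displacement of the ball in x and y direction
dx = passes["posAfterXMeters"] - passes["posBeforeXMeters"]
dy = passes["posAfterYMeters"] - passes["posBeforeYMeters"]

# Euclidean distance between start and end point (assumed definition of pass length)
passes["passLength"] = np.hypot(dx, dy)
print(passes["passLength"].round(2))
```

Averaging this column per team would then give a mean pass length in metres.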

Instead of using a bar chart again, let's use a scatter plot in which we plot the accuracy vs. the pass length. This shows us quite nicely the negative correlation between pass length and accuracy.

In [7]:
# notice that the scatter plot works almost identically to the bar chart
fig = px.scatter(df_passes_teams, 
                 x="shareAccuratePasses", 
                 y="meanPassLength", 
                 color="totalPassesPerMatch",
                 text="teamName",
                 labels={"totalPassesPerMatch": "Passes per match",
                         "shareAccuratePasses": "Pass accuracy (%)",
                         "meanPassLength": "Avg. pass length (in m)",
                         "teamName": "Team"},
                 title="Pass accuracy vs. pass length")

# increase the size of the marker and lower their opacity
fig.update_traces(marker=dict(size=20, 
                              opacity=0.5))

fig.show()
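
If you prefer a number over eyeballing the scatter plot, the Pearson correlation coefficient is a quick check. The values below are made up for illustration and are not the league data; with the real table you would use the *shareAccuratePasses* and *meanPassLength* columns of *df_passes_teams* in the same way.

```python
import pandas as pd

# hypothetical team-level values, for illustration only
df_toy = pd.DataFrame({
    "shareAccuratePasses": [88.0, 84.5, 80.0, 76.5, 73.0],
    "meanPassLength":      [18.5, 19.0, 20.5, 21.0, 22.5],
})

# Pearson correlation: -1 = perfect negative, 0 = none, +1 = perfect positive
corr = df_toy["shareAccuratePasses"].corr(df_toy["meanPassLength"])
print(f"Pearson correlation: {corr:.2f}")
```

A value close to -1 means that longer passes go hand in hand with lower accuracy. Keep in mind that with only 18 teams such a coefficient is a rough indication at best.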

Nice, Augsburg does indeed have a higher average pass length than the other teams. Frankfurt, which performs relatively poorly on number of passes and accuracy compared to their league performance (see table above), also plays relatively long passes on average. Nevertheless, there are also counter-examples such as Schalke, which finished second in the league but is neither extremely good in accuracy nor plays a lot of passes...

One cool thing I want to show you about the plotly plots: Notice that Hertha, Köln, Stuttgart and Freiburg are quite close together and kind of hard to distinguish. As plotly is interactive, just draw a rectangle around the markers and plotly automatically zooms in. Isn't that cool?! :-) To get back to the full plot, just double-click.

As always, there are a lot of things that one could now analyse further, such as the distribution of the pass length or a heatmap of where most of the passes were played. Given what we already learned in this and the previous notebook, I bet this is something you could now easily do by yourself, couldn't you? :-)

Let us instead continue by looking at an individual match, namely the German clasico.

## German clasico

In the following we are going to look at the German clasico, i.e. the home game of Borussia Dortmund against Bayern München. In case you don't remember, Bayern won 3:1.

Let's first briefly look at some team statistics before we dive into the individual players.

In [8]:
# Set the home team and the away team
home_team = "Borussia Dortmund"
away_team = "Bayern München"

Below we retrieve the matchId as well as all the events belonging to the match.

In [9]:
# compute teamIds - not necessary if we set *home_team* and *away_team* by their id
home_team_id = ed_help.get_team_id(df_teams, home_team)
away_team_id = ed_help.get_team_id(df_teams, away_team)

# compute the matchId based on the home and away team
match_id = ed_help.get_match_id(df_matches, home_team_id, away_team_id)

# retrieve all events of the match
df_match_events = df_events[(df_events["matchId"] == match_id)]
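
To make the lookups above less of a black box, here is a small sketch of how such ID lookups could be written with pandas. The functions *lookup_team_id* and *lookup_match_id* are hypothetical stand-ins, not the actual helpers, and I assume that the match table has the columns *matchId*, *homeTeamId* and *awayTeamId*. The second match (ID 999) is invented to show that the order of home and away team matters.

```python
import pandas as pd

# toy stand-ins for df_teams and df_matches (column layout of the match table is an assumption)
teams = pd.DataFrame({"teamId": [2444, 2447],
                      "teamName": ["Bayern München", "Borussia Dortmund"]})
matches = pd.DataFrame({"matchId": [2516829, 999],
                        "homeTeamId": [2447, 2444],
                        "awayTeamId": [2444, 2447]})


def lookup_team_id(df_teams, team_name):
    """Return the teamId for an exact team name (hypothetical helper)."""
    ids = df_teams.loc[df_teams["teamName"] == team_name, "teamId"]
    if len(ids) != 1:
        raise ValueError(f"Expected exactly one team named {team_name!r}, found {len(ids)}")
    return int(ids.iloc[0])


def lookup_match_id(df_matches, home_id, away_id):
    """Return the matchId of the game home_id vs. away_id (hypothetical helper)."""
    mask = (df_matches["homeTeamId"] == home_id) & (df_matches["awayTeamId"] == away_id)
    ids = df_matches.loc[mask, "matchId"]
    if len(ids) != 1:
        raise ValueError(f"Expected exactly one match, found {len(ids)}")
    return int(ids.iloc[0])


home_id = lookup_team_id(teams, "Borussia Dortmund")
away_id = lookup_team_id(teams, "Bayern München")
print(lookup_match_id(matches, home_id, away_id))
```

Raising an error when there is not exactly one hit is a design choice: it avoids silently analysing the wrong match after a typo in the team name.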

<a id="team_statistics"></a>

### Team statistics

Let's now compute the team statistics by using the *compute_statistics* function again. Before we do so, let's have a quick look at the first events of the match.

In [10]:
df_match_events.head()

Unnamed: 0,id,matchId,matchPeriod,eventSec,eventName,subEventName,teamId,posBeforeXMeters,posBeforeYMeters,posAfterXMeters,posAfterYMeters,playerId,playerName,playerPosition,playerStrongFoot,teamPossession,homeTeamId,awayTeamId,accurate,notAccurate,goal,ownGoal,assist,keyPass,counterAttack,leftFoot,rightFoot,head/body,direct,indirect,dangerousBallLost
149416,202840066,2516829,1H,0.613509,Pass,Simple pass,2444,40.95,23.12,35.7,10.88,3345,Thiago Alcântara,MD,right,2444,2447,2444,1,0,0,0,0,0,0,0,0,0,0,0,0
149417,202840067,2516829,1H,2.466024,Pass,Simple pass,2444,35.7,10.88,30.45,22.44,14724,D. Alaba,DF,left,2444,2447,2444,1,0,0,0,0,0,0,0,0,0,0,0,0
149418,202840069,2516829,1H,4.915162,Pass,Simple pass,2444,30.45,22.44,33.6,53.04,14795,M. Hummels,DF,right,2444,2447,2444,1,0,0,0,0,0,0,0,0,0,0,0,0
149419,202840070,2516829,1H,6.852296,Pass,Simple pass,2444,33.6,53.04,10.5,43.52,134383,N. Süle,DF,right,2444,2447,2444,1,0,0,0,0,0,0,0,0,0,0,0,0
149420,202840071,2516829,1H,10.526331,Pass,Simple pass,2444,10.5,43.52,24.15,56.44,14736,S. Ulreich,GK,right,2444,2447,2444,1,0,0,0,0,0,0,0,0,0,0,0,0


In [41]:
# get the match statistics by team
df_team_stats = ed_help.compute_statistics(df_match_events, 
                                            group_col="team", 
                                            drop_kpis=["centroid", "totalAccuratePasses"])
# add the team name
df_team_stats = pd.merge(df_team_stats, df_teams[["teamId", "teamName"]], how="left", on="teamId")
df_team_stats

Unnamed: 0,teamId,nbMatches,totalPasses,shareAccuratePasses,meanPassLength,totalShots,totalGoals,totalDuels,teamName
0,2444,1,485,87.42,20.68,10,3,220,Bayern München
1,2447,1,465,87.53,19.34,14,1,220,Borussia Dortmund


Nice! With only a couple of lines of code we get the most important statistics on the two teams. It's very interesting that on all passing statistics the two teams are basically even. Nevertheless, Bayern won 3:1 despite having fewer shots than Dortmund.

We will dive a little bit deeper into the passes below. But before we do that, let's look at some of the player statistics.

<a id="player_statistics"></a>

### Player statistics

In [42]:
# get the match statistics by player - by passing the *df_formations* table we will also get the
# minutes each player played
df_stats = ed_help.compute_statistics(df_match_events,
                                      group_col="player",
                                      drop_kpis=["totalAccuratePasses"],
                                      df_formations=df_formations)
df_stats.head()

Unnamed: 0,playerId,playerName,playerPosition,teamId,nbMatches,totalPasses,shareAccuratePasses,meanPassLength,totalShots,totalGoals,totalDuels,lineup,substituteIn,substituteOut,minutesPlayed,centroidX,centroidY
0,3335,Bartra,DF,2447,1,46,82.61,24.55,1.0,1.0,13.0,1,0,0,90.0,46.257534,50.776438
1,3345,Thiago Alcântara,MD,2444,1,53,94.34,18.71,1.0,0.0,16.0,1,0,0,90.0,51.836842,21.706316
2,3416,Javi Martínez,MD,2444,1,31,93.55,18.03,0.0,0.0,13.0,1,0,1,81.0,42.888462,35.111538
3,14687,S. Papastathopoulos,DF,2447,1,13,100.0,28.2,0.0,0.0,11.0,1,0,1,42.0,32.753226,45.143226
4,14718,Rafinha,DF,2444,1,12,75.0,17.74,0.0,0.0,3.0,0,1,0,16.0,42.061765,10.36


Wow, that is quite a lot of statistics we get for each of the players! I guess most of the columns are self-explanatory, but let me quickly explain the last two columns - *centroidX* and *centroidY*:

It would definitely be interesting to have some idea of where each player is "on average" during the match, i.e. is he more on the left side, on the right side, in the back or in the front. So is there any chance to get this information?

Unfortunately, we do not have any tracking data of the players. Having it would allow us to identify where exactly each player is at each point in time and things would be rather simple. Nevertheless, we can build a pretty good proxy. What we do have for each player is the position at every event the player is part of. So what we do is to take all of the player's positions we have and average out their x and y coordinates. Now, the *centroidX* is nothing else than the average over the x-values and the *centroidY* the average over the y-values. Makes sense?
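
Here is what this averaging looks like in code. It is a sketch on toy data, assuming that the start position of each event (*posBeforeXMeters*, *posBeforeYMeters*) is what gets averaged; the helper may treat some event types differently.

```python
import pandas as pd

# toy events with hypothetical positions for two players
events = pd.DataFrame({
    "playerId":         [1, 1, 1, 2, 2],
    "posBeforeXMeters": [10.0, 20.0, 30.0, 60.0, 80.0],
    "posBeforeYMeters": [5.0, 10.0, 15.0, 40.0, 50.0],
})

# the centroid of a player is the mean of the x and y coordinates of all of their events
centroids = (events.groupby("playerId")[["posBeforeXMeters", "posBeforeYMeters"]]
                   .mean()
                   .rename(columns={"posBeforeXMeters": "centroidX",
                                    "posBeforeYMeters": "centroidY"})
                   .reset_index())
print(centroids)
```

Keep in mind that this proxy is biased towards where a player touches the ball, not where they stand when the ball is elsewhere.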

Given that information, let's plot the average positions on a soccer field.

In [43]:
# get all Dortmund players that were in the lineup
df_stats_dortmund = df_stats[(df_stats["teamId"] == home_team_id) & (df_stats["lineup"] == 1)].copy()

# get the title for the plot
plot_title = py_help.get_match_title(match_id, df_matches, df_teams)

# create the position plot and show it
fig = py_help.create_position_plot(df_stats_dortmund, title=plot_title, dict_info=False)
fig.show()

It looks a little bit as if Dortmund's game was leaning towards the left, doesn't it? And Yarmolenko on the very right maybe looking to open up some spaces... 

However, now that we already have all the statistics computed, isn't there a chance to integrate them into this plot? You can probably guess the answer, so let's do it.

In [44]:
# we can add the default statistics as hover information by not setting dict_info to False :-)
fig = py_help.create_position_plot(df_stats_dortmund, title=plot_title)
fig.show()

Indeed, when hovering over the different players, it does seem like Toprak and Schmelzer on the left hand side do have the ball quite often. Hovering over all the players, however, is quite tedious. So let's change the marker colour depending on the number of total passes each player played.

In [45]:
# colour the markers by the total number of passes via the colour_kpi argument
fig = py_help.create_position_plot(df_stats_dortmund, 
                           title=plot_title, 
                           colour_kpi="totalPasses")
fig.show()

This is now way easier to see, isn't it? Notice that we could have plugged in any other KPI as well (e.g. the passing accuracy) and have the markers coloured according to that KPI.

When looking at the above graph, it does indeed look as if there is some triangle between Toprak, Schmelzer and Castro on the left side, as all of them have quite a lot of passes and they play next to each other. The problem with the above chart is that we cannot really tell whether they all have the ball quite often but pass it somewhere else, or whether they play a lot between each other. Luckily, there is a helper function that lets us visualize this quite easily :-)

<a id="passing_lines"></a>

### Passing lines

In [46]:
# aggregate the passes between each two players
df_passes_dortmund = py_help.prepare_passes_for_position_plot(df_match_events, df_stats_dortmund)

# draw the passing lines by handing the aggregated passes to the plotting function
fig = py_help.create_position_plot(df_stats_dortmund, 
                                   title=plot_title, 
                                   df_passes = df_passes_dortmund,
                                   colour_kpi="totalPasses")
fig.show()

Wow, that looks very interesting but also a little bit overwhelming with all the lines. Let's do a small trick and only plot the most important ones. We do this by plotting only as many lines as are needed to represent 70% of all passes played (notice the *show_top_k_percent* parameter in the preparation function below).

In [47]:
# aggregate the passes between each two players
df_passes_dortmund = py_help.prepare_passes_for_position_plot(df_match_events, df_stats_dortmund, show_top_k_percent=70)

# draw only the passing lines that were kept by the preparation step above
fig = py_help.create_position_plot(df_stats_dortmund, 
                                   title=plot_title, 
                                   df_passes = df_passes_dortmund,
                                   colour_kpi="totalPasses")
fig.show()
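
How could such a "top 70%" filter work? One simple way, sketched below, is to sort the player pairs by number of passes and keep adding pairs until their cumulative share reaches the threshold. The function *keep_top_share* and the pass counts are hypothetical; I have not verified that *show_top_k_percent* is implemented exactly like this.

```python
import pandas as pd

# hypothetical pass counts between player pairs (A-E are placeholder names)
pair_counts = pd.DataFrame({
    "player1":     ["A", "A", "B", "C", "D"],
    "player2":     ["B", "C", "C", "D", "E"],
    "totalPasses": [40, 25, 15, 12, 8],
})


def keep_top_share(df, share_percent, col="totalPasses"):
    """Keep the largest rows until their cumulative share reaches share_percent."""
    df = df.sort_values(col, ascending=False).reset_index(drop=True)
    cum_share = df[col].cumsum() / df[col].sum() * 100
    # share that is already covered before each row is added
    covered_before = cum_share.shift(fill_value=0)
    # a row is still needed as long as the rows before it do not reach the target
    return df[covered_before < share_percent]


top_pairs = keep_top_share(pair_counts, 70)
print(top_pairs)
```

In this toy example the three strongest connections are kept, and the two weakest ones are dropped from the plot.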

This makes things much clearer, doesn't it? So what do we notice:
1. There is indeed a strong triangle between Toprak, Schmelzer and Castro
2. Aubameyang seems to be completely out of the match 
3. Pulisic, Kagawa and Yarmolenko are some kind of dead-ends
4. Toprak and Weigl are the central points with connections to 6 players 
5. Papastathopoulos also seems to be out of the match. Notice, however, that he only played for 42 minutes and the comparison is therefore not really fair. If you want to, you can go ahead and correct for this by e.g. scaling to 90 minutes. For now, I'll just leave everything as is and keep in mind to be careful with Papa

Nice, we already learned quite a bit about how Dortmund set up their game. To put things into perspective, let's now plot the same graph for Bayern.

In [48]:
# get all Bayern players that were in the lineup
df_stats_bayern = df_stats[(df_stats["teamId"] == away_team_id) & (df_stats["lineup"] == 1)].copy()

# get the title for the plot; by setting the perspective to "away" we will indicate in the title that
# Bayern played in Dortmund and not at home
plot_title_bayern = py_help.get_match_title(match_id, df_matches, df_teams, perspective="away")

# aggregate the passes between each two players
df_passes_bayern = py_help.prepare_passes_for_position_plot(df_match_events, df_stats_bayern, show_top_k_percent=70)

# create the position plot with passing lines for Bayern
fig = py_help.create_position_plot(df_stats_bayern, 
                                   title=plot_title_bayern, 
                                   df_passes = df_passes_bayern,
                                   colour_kpi="totalPasses")
fig.show() 

Ok, this does look way different, doesn't it? Before we go into the comparison, let's fix one thing. Compare the colouring between the two charts: the scale for Dortmund goes from ~15 to ~70, while for Bayern it runs from ~20 to ~60. We can put both charts on the same scale by passing one additional argument to the plotting function.

In [50]:
# plot Dortmund
fig = py_help.create_position_plot(df_stats_dortmund, 
                                   title=plot_title, 
                                   df_passes = df_passes_dortmund,
                                   colour_kpi="totalPasses",
                                   colour_scale=(15,70))

fig.show()


# plot Bayern
fig = py_help.create_position_plot(df_stats_bayern, 
                                   title=plot_title_bayern, 
                                   df_passes = df_passes_bayern,
                                   colour_kpi="totalPasses",
                                   colour_scale=(15,70),
                                   size=1)
fig.show() 

You remember when we looked at the team statistics above and said that the two teams basically looked the same? Not so much any more, do they? So what do we notice when comparing the two plots?

1. The number of passes is way more evenly distributed among the Bayern players
2. For Bayern each (!) field player has at least 3 connections to other players; for Dortmund we have Aubameyang without any connection and 3 players with only 2 connections (we do not consider Papa here)
3. Yarmolenko is playing very much on the right, while Robben is positioned a little bit more centrally
4. While Dortmund tends to play a lot over their left hand side, Bayern seems to be more balanced

Cool, we have learned quite a lot about how to create position plots. While we could obviously go even deeper on those, let's switch the perspective slightly, move away from looking at specific players and start looking more into how each team played in the different zones of the field.

<a id="passing_zones"></a>

### Passing in zones

In the next section we want to understand how each team passes in each zone, i.e. if Dortmund has for example the ball in left midfield, where do they usually pass the ball. 

In order to do so, let's first quickly understand how many passes each team has in which zone. You remember the heatmap functions we learned about in the notebook on goal kicks? We can now use exactly these functions to quickly plot how many passes each team has in each zone.
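
Before using the helper, it may help to see how assigning passes to zones can work in principle. The sketch below splits the pitch into the same 4 x 3 grid that is used in the next cell. The pitch size of 105 m x 68 m and the toy start positions are assumptions on my side, and the helper may handle edge cases differently.

```python
import numpy as np

FIELD_LENGTH, FIELD_WIDTH = 105.0, 68.0  # assumed pitch size in metres
NB_ZONES_X, NB_ZONES_Y = 4, 3            # 4 zones along the pitch, 3 across

# start positions of a few passes in metres (hypothetical values)
start_x = np.array([10.0, 30.0, 40.0, 60.0, 100.0])
start_y = np.array([5.0, 30.0, 34.0, 60.0, 65.0])

# equally spaced zone borders along both axes
x_edges = np.linspace(0, FIELD_LENGTH, NB_ZONES_X + 1)
y_edges = np.linspace(0, FIELD_WIDTH, NB_ZONES_Y + 1)

# zone index per pass; clip so that points exactly on the far border stay inside the grid
x_zone = np.clip(np.digitize(start_x, x_edges) - 1, 0, NB_ZONES_X - 1)
y_zone = np.clip(np.digitize(start_y, y_edges) - 1, 0, NB_ZONES_Y - 1)

# count the passes per zone (rows = y zones, columns = x zones)
counts = np.zeros((NB_ZONES_Y, NB_ZONES_X), dtype=int)
np.add.at(counts, (y_zone, x_zone), 1)

# share of passes per zone in percent
share = counts / counts.sum() * 100
print(counts)
```

The zone indices start at 0, so zone (1,1) is the second zone along the pitch and the middle one across it.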

In [51]:
# prepare plot for Dortmund
###########################
df_passes_dortmund = df_match_events[(df_match_events["teamId"] == home_team_id) & 
                                     (df_match_events["eventName"] == "Pass")]

nb_passes_dortmund, x, y, df_passes_dortmund = py_help.prepare_heatmap(df_passes_dortmund, 
                                                                       "posBeforeXMeters", "posBeforeYMeters", 
                                                                        4, 3, return_df=True)

share_passes_dortmund = nb_passes_dortmund / nb_passes_dortmund.sum() * 100

# prepare plot for Bayern
###########################
df_passes_bayern = df_match_events[(df_match_events["teamId"] == away_team_id) & 
                                   (df_match_events["eventName"] == "Pass")]

nb_passes_bayern, x, y, df_passes_bayern = py_help.prepare_heatmap(df_passes_bayern, 
                                                                   "posBeforeXMeters", "posBeforeYMeters", 
                                                                   4, 3, return_df=True)

share_passes_bayern = nb_passes_bayern / nb_passes_bayern.sum() * 100

# prepare values needed for both plots
##########################

# compute x and y index
x_index = np.tile(np.arange(len(x)),(len(y),1))
y_index = np.transpose(np.tile(np.arange(len(y)),(len(x),1)))

# make sure the colour scale is the same
colour_scale = (0, max(share_passes_dortmund.max(), share_passes_bayern.max()))


# plot Dortmund
###############
# Define what is shown when hovering over a zone
dict_info = {"Total passes": {"values": nb_passes_dortmund, "display_type": ".0f"},
             "Share passes (in %)": {"values": share_passes_dortmund, "display_type": ".1f"},
             "Index x": {"values": x_index, "display_type": ".0f"},
             "Index y": {"values": y_index, "display_type": ".0f"}}

field = py_help.create_heatmap(x, y, share_passes_dortmund, dict_info, 
                               title_name="<b>Dortmund</b> - % of passes starting in each zone",
                               colour_scale=colour_scale,
                               legend_name="% of passes")
field.show()

# plot Bayern
###############
# Define what is shown when hovering over a zone
dict_info = {"Total passes": {"values": nb_passes_bayern, "display_type": ".0f"},
             "Share passes (in %)": {"values": share_passes_bayern, "display_type": ".1f"},
             "Index x": {"values": x_index, "display_type": ".0f"},
             "Index y": {"values": y_index, "display_type": ".0f"}}

field = py_help.create_heatmap(x, y, share_passes_bayern, dict_info, 
                               title_name="<b>Bayern</b> - % of passes starting in each zone",
                               colour_scale=colour_scale,
                               legend_name="% of passes")

field.show()

What we had already assumed by looking at the individual players can again be seen in the graphs above. While Dortmund has quite a lot of passes on their left side and significantly fewer on the right, Bayern is way better balanced. Notice also that Bayern plays way less through the middle than Dortmund does...

Instead of only checking where the passes started, let's now look into the end points of the passes played when Dortmund had the ball in the defensive midfield. To be more precise: When hovering over the heatmap of Dortmund you can see that 62 passes were made in the "defensive midfield zone" (zone (1,1)). Let's check where those passes end up...

In [52]:
# only get the passes in the defensive midfield zone
df_passes_zone = df_passes_dortmund[(df_passes_dortmund["posBeforeXMetersZone"] == 1) & 
                                    (df_passes_dortmund["posBeforeYMetersZone"] == 1)]

# Make sure it is indeed 62 passes
print(f"Number of passes: {len(df_passes_zone)}")

# get the end zone for each of the passes
nb_passes_zone, x, y, df_passes_zone = py_help.prepare_heatmap(df_passes_zone, 
                                                                "posAfterXMeters", "posAfterYMeters", 
                                                                 4, 3, return_df=True)

# compute the share
share_passes_zone = nb_passes_zone / nb_passes_zone.sum() * 100

# define what is shown when hovering over a zone
dict_info = {"Total passes": {"values": nb_passes_zone, "display_type": ".0f"},
             "Share passes (in %)": {"values": share_passes_zone, "display_type": ".1f"},}

field = py_help.create_heatmap(x, y, share_passes_zone, dict_info, 
                               title_name="<b>Dortmund</b> - End zones of passes from defensive midfield",
                               legend_name="% of passes")

field.show()

Number of passes: 62


Voila, now we can tell what happened when Dortmund had the ball in the defensive midfield. Not surprisingly, there were quite a lot of passes to the left full-back. What I find more interesting, however, is that when playing the ball to the right, it is usually not played to the right full-back but rather to a midfield player...
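
If you want the same information as numbers for all start zones at once, a cross table of start zone vs. end zone does the job. The sketch below uses made-up passes and made-up column names (*startZone*, *endZone*); with the real data you would first combine the zone columns created by the heatmap preparation into such labels.

```python
import pandas as pd

# hypothetical passes with already assigned start and end zones, labelled as (x zone, y zone)
df_toy = pd.DataFrame({
    "startZone": ["(1,1)", "(1,1)", "(1,1)", "(0,1)", "(0,1)"],
    "endZone":   ["(1,0)", "(1,0)", "(2,1)", "(1,1)", "(0,0)"],
})

# rows = start zone, columns = end zone, values = share of the start zone's passes in percent
transitions = pd.crosstab(df_toy["startZone"], df_toy["endZone"], normalize="index") * 100
print(transitions.round(1))
```

Each row sums to 100%, so a row can be read as "when the ball is in this zone, where does it go next?".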

Even though I kind of like the chart above, there are two things that I don't like about it:
1. I can only look at one start zone at a time, i.e. I would have to look at 12 pictures per team
2. Even if I highlighted the start zone better, I would still find it confusing that the graph displays the end zones while I have to keep the start zone in mind

So, let's try to get rid of those issues and come up with a cooler chart, a pass polar chart :-) 

In [53]:
# as usual, set the hover information to be shown 
dict_info = {"Total passes": {"values": "totalPasses", "display_type": ".0f"},
             "Accurate passes (in %)": {"values": "shareAccuratePasses", "display_type": ".1f"}}

# prepare the pass polar plot
df_pass_polar = py_help.prepare_pass_polar_plot(df = df_passes_dortmund, 
                                                # we want to have one polar bar per zone
                                                group_cols = ["posBeforeXMetersZone", "posBeforeYMetersZone"], 
                                                # length of the triangles should depend on total passes
                                                length_scale_col="totalPasses", 
                                                # colour of the triangle should depend on pass accuracy
                                                colour_col="shareAccuratePasses", 
                                                # pass accuracy of <= 50% should be white and then interpolated between 50% and 100%
                                                colour_scale=(50,100),
                                                # centroids of the zones
                                                centroids_xy=(x,y))

# create the figure
fig_dortmund = py_help.create_pass_polar_plot(df = df_pass_polar, 
                                              dict_info=dict_info, 
                                              title_name="<b>Dortmund</b> - Passes and accuracy per zone")

fig_dortmund.show()

Isn't that a cool looking chart? But how do we interpret it? There are basically two things to know:

1. The length of a triangle tells me the number of passes that were played in the respective direction. We can see, for example, that there were basically no passes back to the goalkeeper from the defensive midfield.
2. The colour of a triangle gives me the accuracy of the passes in the respective direction. Red means high accuracy and white low accuracy. Crosses from the right hand side, for example, have a rather low accuracy

You can also hover over the triangles to see the information. That should make it completely clear.
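
To get a feeling for how a pass ends up in one of the triangles, here is a sketch of how pass directions can be binned. It assumes 8 direction sectors of 45° each and that sector 0 points along the positive x-axis; the function *pass_direction_sector* is hypothetical and the helper may use a different number of sectors or a different orientation.

```python
import numpy as np


def pass_direction_sector(dx, dy, nb_sectors=8):
    """Map pass vectors to direction sectors; sector 0 is centred on the positive x-axis."""
    # angle of the pass in degrees, in the range [0, 360)
    angle = np.degrees(np.arctan2(dy, dx)) % 360
    width = 360 / nb_sectors
    # shift by half a sector so that sector 0 covers -width/2 to +width/2
    return (np.floor((angle + width / 2) / width) % nb_sectors).astype(int)


# displacement of five toy passes: along +x, along +y, along -x, along -y, diagonal
dx = np.array([10.0, 0.0, -10.0, 0.0, 5.0])
dy = np.array([0.0, 10.0, 0.0, -10.0, 5.0])

sectors = pass_direction_sector(dx, dy)
print(sectors)
```

Counting the passes and averaging the *accurate* flag per zone and sector would then give the length and the colour of each triangle.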

Let's quickly draw the same figure for Bayern and compare the two.

In [54]:
# prepare the pass polar plot
df_pass_polar = py_help.prepare_pass_polar_plot(df = df_passes_bayern, 
                                                # we want to have one polar bar per zone
                                                group_cols = ["posBeforeXMetersZone", "posBeforeYMetersZone"], 
                                                # length of the triangles should depend on total passes
                                                length_scale_col="totalPasses", 
                                                # colour of the triangle should depend on pass accuracy
                                                colour_col="shareAccuratePasses", 
                                                # pass accuracy of <= 50% should be white and then interpolated between 50% and 100%
                                                colour_scale=(50,100),
                                                # centroids of the zones
                                                centroids_xy=(x,y))

# create the figure
fig_bayern = py_help.create_pass_polar_plot(df = df_pass_polar, 
                                            dict_info=dict_info, 
                                            title_name="<b>Bayern</b> - Passes and accuracy per zone")

fig_dortmund.show()
fig_bayern.show()

Great! I bet you can directly see some interesting differences in the game setup between the two teams. Let's point some of them out:
1. For Bayern, the goalie is much more involved with quite a lot of passes going back to him; also the full-back areas (I do not mean the players themselves but the areas on the field) are used more often by Bayern
2. In the defensive midfield, Dortmund plays more vertically while Bayern plays most often horizontally
3. Overall, Dortmund has way more passes on their left hand side than on the right (this we had seen before already)
4. However, when having the ball in the offensive midfield, Dortmund plays towards the right more often than to the left; this is different for Bayern


Notice that while we used the pass polar plot above for only one match (which honestly makes some of the statistics somewhat shaky as we do not have a lot of observations ;-)), you can obviously do the same thing for all the matches. Or you can decide to check how Dortmund likes to play against the top teams etc. This is something you should keep in mind in general: In this notebook we use the helpers in a specific way to show you that they exist and to give you an idea of how to use them. However, you can obviously tweak the use cases and also use those functions for other scenarios!

<a id="gini"></a>

## Gini coefficient of passes

Uff, we have already learned quite a lot in this notebook, haven't we? There is one more thing, however, I would like to look into: the [Gini coefficient](https://en.wikipedia.org/wiki/Gini_coefficient) for passing. Yes, you heard right: what you might know from economics classes as a measure of inequality, I want to use to measure the "compactness" of a team w.r.t. passing.

You remember when we looked at the position plots above and said that Bayern's passing numbers were way more equally distributed between the players than Dortmund's? In the following section I want to look a little bit deeper into this and we are also going to come up with a KPI on how to measure the equality of the passing distribution. 
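
As a quick refresher, here is one common way to compute the Gini coefficient for a list of values such as passes per player. With the values sorted in ascending order, it can be written as $G = \frac{2 \sum_i i \, x_i}{n \sum_i x_i} - \frac{n+1}{n}$. The function below is my own minimal sketch of this formula, not a helper of this repository, and the pass counts are toy numbers.

```python
import numpy as np


def gini(values):
    """Gini coefficient of non-negative values: 0 means all values are equal."""
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = len(x)
    if n == 0 or x.sum() == 0:
        raise ValueError("need at least one positive value")
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n


# four players with the same number of passes: perfectly balanced
print(gini([10, 10, 10, 10]))

# one player plays all the passes: maximally unbalanced for four players
print(gini([0, 0, 0, 40]))
```

For n players the maximum value is (n-1)/n, so for small groups like a football team the coefficient never quite reaches 1.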

Let's start by bringing the numbers back to mind.

In [22]:
df_stats_dortmund[["playerName", "playerPosition", "totalPasses", "minutesPlayed"]]

Unnamed: 0,playerName,playerPosition,totalPasses,minutesPlayed
0,Bartra,DF,46,90.0
3,S. Papastathopoulos,DF,13,42.0
8,G. Castro,MD,57,90.0
9,Ö. Toprak,DF,68,90.0
11,M. Schmelzer,DF,66,90.0
13,S. Kagawa,MD,26,68.0
17,P. Aubameyang,FW,13,90.0
18,R. Bürki,GK,28,90.0
20,A. Yarmolenko,FW,33,80.0
24,J. Weigl,MD,55,90.0


Ok, two things that we should consider before measuring any compactness:
1. Not all players played for 90 minutes; we should correct for that by rescaling the passes to 90 minutes
2. The goalie has a special role; we should not consider him

In [23]:
# get rid of goalie
df_pass_dortmund = df_stats_dortmund[df_stats_dortmund["playerPosition"] != "GK"].copy()

# rescale passes to 90 minutes
df_pass_dortmund["totalPasses90Min"] = df_pass_dortmund["totalPasses"] * 90 / df_pass_dortmund["minutesPlayed"]

df_pass_dortmund.sort_values("totalPasses90Min", ascending=False, inplace=True)

# plot the number of passes per player
fig = px.bar(df_pass_dortmund, 
             x="playerName", 
             y="totalPasses90Min", 
             text="playerPosition",
             labels={"totalPasses90Min": "Passes per 90min",
                     "playerPosition": "Position",
                     "playerName": "Player"},
             title="<b>Dortmund</b> - Number of passes scaled to 90 minutes")
fig.show()

This again shows the inequality. While Toprak has 68 passes in 90 minutes, Aubameyang has only 13... Let's look at the data in a slightly different way by considering cumulative passes. We will first produce the graph, and then I will elaborate a little on what I mean by that...

In [24]:
df_pass_dortmund["cumPasses"] = df_pass_dortmund["totalPasses90Min"].cumsum()
df_pass_dortmund["cumSharePasses"] = df_pass_dortmund["cumPasses"] / df_pass_dortmund["totalPasses90Min"].sum() * 100
df_pass_dortmund["playerNumber"] = np.arange(1,len(df_pass_dortmund)+1)

fig = px.line(df_pass_dortmund, 
              x="playerNumber", 
              y="cumSharePasses", 
              labels={"cumSharePasses": "Cumulative passes (in %)",
                      "playerNumber": "Number of players"},
              title="<b>Dortmund</b> - Cumulative passes of players")
fig.update_yaxes(range=[0, 100])
fig.show()

So what do we see here? On the x-axis we have all the field players lined up, starting with the player who had the most passes, then the one with the second most, etc. On the y-axis we have the share of passes that were played by the top x players.

If you look at the first player, the y-value is ~16%. This means that the player with the most passes had ~16% of all passes of the team (excl. goalie). Now, if you go to the next player, you see that the first two players together had ~31% of all passes, etc. This way we can see that the 4 players with the most passes had ~57% of the passes and the 7 players with the most passes ~85%(!).
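To make the reading of the curve concrete, here is a tiny sketch with made-up numbers (four hypothetical players, not taken from the match) that reproduces the computation by hand:

```python
import numpy as np

# made-up passes per 90 minutes for four hypothetical field players
passes_per_90 = np.array([20.0, 80.0, 40.0, 60.0])

# line the players up from most to fewest passes
sorted_desc = np.sort(passes_per_90)[::-1]

# cumulative share (in %) of the team's passes played by the top x players
cum_share = sorted_desc.cumsum() * 100 / sorted_desc.sum()
print(cum_share)
```

With these numbers the top player accounts for 40% of the passes, the top two for 70%, the top three for 90% and all four for 100%. If all four players had the same number of passes, the values would be 25%, 50%, 75% and 100% instead, i.e. the curve would be a straight line.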

So now what does that tell us? Let's draw in Bayern as a comparison.

In [25]:
# get rid of goalie
df_pass_bayern = df_stats_bayern[df_stats_bayern["playerPosition"] != "GK"].copy()

# rescale passes to 90 minutes
df_pass_bayern["totalPasses90Min"] = df_pass_bayern["totalPasses"] * 90 / df_pass_bayern["minutesPlayed"]
df_pass_bayern.sort_values("totalPasses90Min", ascending=False, inplace=True)

# compute the share of passes per player
df_pass_bayern["cumPasses"] = df_pass_bayern["totalPasses90Min"].cumsum()
df_pass_bayern["cumSharePasses"] = df_pass_bayern["cumPasses"] / df_pass_bayern["totalPasses90Min"].sum() * 100
df_pass_bayern["playerNumber"] = np.arange(1,len(df_pass_bayern)+1)
df_pass_bayern["team"] = "Bayern"

df_pass_dortmund["team"] = "Dortmund"

# combine the data of Dortmund and Bayern
df_pass_all = pd.concat([df_pass_dortmund, df_pass_bayern])

# draw the chart
fig = px.line(df_pass_all, 
              x="playerNumber", 
              y="cumSharePasses",
              color="team",
              labels={"cumSharePasses": "Cumulative passes (in %)",
                      "playerNumber": "Number of players",
                      "team": "Team"},
              title="Cumulative passes of players - Dortmund vs. Bayern")
fig.update_yaxes(range=[0, 100])
fig.show()

This shows us very clearly what we had already suspected when looking at the position plots: the passes are distributed more evenly among the Bayern players. How can I tell? You see that the red line for Bayern is always below the blue line for Dortmund? That means that, for any x, the x players with the most passes account for a smaller share of the team's total passes. Look, for example, at the 4 players with the most passes: for Dortmund those players made ~57% of all passes, while for Bayern they made only ~48%.

So, the lower the curve, the more evenly the passes are spread across the players of the team. The question is whether we can capture this observation in a single number... This is where the Gini coefficient comes in. In a nutshell, the Gini coefficient measures how well balanced the number of passes is between the players: a Gini coefficient of 0 indicates that all players have the same number of passes, and a Gini coefficient of (close to) 1 indicates that a single player made every single pass. In case you are interested, you can find more information here: [Gini coefficient](https://en.wikipedia.org/wiki/Gini_coefficient)
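The notebook relies on the helper *compute_gini*, whose implementation is not shown here. As a rough idea of what such a function could look like, here is a minimal sketch of one common textbook formula for the Gini coefficient. The name `gini_sketch` is made up for illustration, and the helper may well be implemented differently (for example with a small-sample correction):

```python
import numpy as np


def gini_sketch(values):
    """Gini coefficient of non-negative values (illustrative, not the helper's code)."""
    x = np.sort(np.asarray(values, dtype=float))  # sort ascending
    n = len(x)
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0  # convention chosen here: no passes means no inequality
    ranks = np.arange(1, n + 1)
    # closed form of the mean-absolute-difference definition for sorted data
    return 2 * np.sum(ranks * x) / (n * total) - (n + 1) / n


equal = gini_sketch([50, 50, 50, 50])      # everyone passes equally often
one_player = gini_sketch([0, 0, 0, 200])   # a single player makes every pass
print(equal, one_player)
```

In this uncorrected form, perfectly equal passing gives 0, while four players with a single passer give 0.75 rather than exactly 1; the maximum is (n-1)/n, which only approaches 1 for larger groups.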

So, let's compute the coefficient for both teams:

In [26]:
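# note: the coefficient is computed on the raw pass counts (totalPasses), not on the values rescaled to 90 minutes
print(f"Gini coefficient Bayern: {gen_help.compute_gini(df_pass_bayern['totalPasses'])*100:.1f}%")
print(f"Gini coefficient Dortmund: {gen_help.compute_gini(df_pass_dortmund['totalPasses'])*100:.1f}%")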

Gini coefficient Bayern: 13.4%
Gini coefficient Dortmund: 28.0%


Good, this is exactly what we had expected... Let's now go one step further and compute the Gini coefficient for all teams and all matches. Notice how we can use the *compute_statistics* function again to do this efficiently :-)

In [27]:
# compute passes and minutes played for each player in each match
df_stats = ed_help.compute_statistics(df_events, 
                                      group_col="player_match", 
                                      df_formations=df_formations,
                                      keep_kpis=["totalPasses","minutesPlayed"])

df_stats.head()

Unnamed: 0,playerId,matchId,playerName,playerPosition,teamId,nbMatches,totalPasses,lineup,substituteIn,substituteOut,minutesPlayed
0,77,2516788,N. Moisander,DF,2443,1,78.0,1,0,0,90.0
1,77,2516800,N. Moisander,DF,2443,1,25.0,1,0,0,90.0
2,77,2516806,N. Moisander,DF,2443,1,55.0,1,0,0,90.0
3,77,2516812,N. Moisander,DF,2443,1,48.0,1,0,0,90.0
4,77,2516823,N. Moisander,DF,2443,1,58.0,1,0,0,90.0


Let's keep only the field players who started the match and rescale their passes to 90 minutes.

In [28]:
# Only take field players that were in the lineup
df_stats = df_stats[(df_stats["lineup"] == 1) & 
                    (df_stats["playerPosition"] != "GK")].copy()
# scale passes to 90 minutes
df_stats["totalPasses90Min"] = df_stats["totalPasses"] * 90 / df_stats["minutesPlayed"]

Good, we are now ready to compute the Gini coefficient for each team and match and also add the number of points (i.e. did the team win, lose or draw).

In [29]:
# compute the Gini coefficient for each team and match
# (note: this uses the raw pass counts in totalPasses, not totalPasses90Min)
df_gini = df_stats.groupby(["matchId","teamId"])["totalPasses"].agg(lambda x: gen_help.compute_gini(x)).reset_index()
df_gini.rename(columns={"totalPasses":"giniCoeff"}, inplace=True)

# add the number of points the team made in the match
df_gini = pd.merge(df_gini, df_matches[["matchId","teamId","points"]])

# for each match, flag the team that had the lower Gini coefficient
df_gini_min = df_gini.groupby("matchId").agg(minGini=("giniCoeff","min")).reset_index()
df_gini = pd.merge(df_gini, df_gini_min)
df_gini["lowerGini"] = 1*(df_gini["giniCoeff"] == df_gini["minGini"])

df_gini.head(4)

Unnamed: 0,matchId,teamId,giniCoeff,points,minGini,lowerGini
0,2516739,2444,0.168246,3,0.168246,1
1,2516739,2446,0.192737,0,0.168246,0
2,2516740,2443,0.145638,0,0.145638,1
3,2516740,2482,0.252311,3,0.145638,0


Hm, in the first match it was indeed the team with the lower Gini coefficient (i.e. with the better balance in passing) that won the match. In the second match, however, it was the complete opposite. Let's see what happens when we aggregate over all 306 matches...

In [30]:
df_gini.groupby("lowerGini").agg(meanPoints=("points","mean")).reset_index()

Unnamed: 0,lowerGini,meanPoints
0,0,0.96732
1,1,1.761438


Wow! The team with the lower Gini coefficient, and therefore a more even distribution of passes among its field players, made on average 1.76 points/game, while the team with the higher Gini coefficient made only 0.97 points/game.
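An average number of points mixes wins and draws, so it can help to also look at how often each outcome occurs. The sketch below shows one way to do that on a made-up toy frame that only mimics the shape of `df_gini`; the numbers are invented, and the usual 3/1/0 points for win/draw/loss are assumed:

```python
import pandas as pd

# made-up rows in the same shape as df_gini (two rows per match)
df_toy = pd.DataFrame({
    "matchId":   [1, 1, 2, 2, 3, 3, 4, 4],
    "lowerGini": [1, 0, 1, 0, 1, 0, 1, 0],
    "points":    [3, 0, 1, 1, 3, 0, 0, 3],
})

# translate points into a readable outcome label
df_toy["outcome"] = df_toy["points"].map({3: "win", 1: "draw", 0: "loss"})

# share of each outcome per group; every row sums to 1
outcome_share = pd.crosstab(df_toy["lowerGini"], df_toy["outcome"], normalize="index")

# mean points per group, same aggregation as in the cell above
mean_points = df_toy.groupby("lowerGini")["points"].mean()

print(outcome_share)
print(mean_points)
```

Replacing `df_toy` with the real `df_gini` should give the corresponding breakdown for the whole season.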

Let us think about this again: just by knowing the <b>distribution</b> of the passes between the field players, we get a pretty good idea of who wins a match. And we do not need to know anything about the total number of passes, the number of shots, the number of free kicks, etc. That's amazing! And you want to know what is even more amazing...

Let's check out the average number of points for the team that had more shots in the match:

In [31]:
# get the number of shots for each team in each match
df_shots_team = ed_help.compute_statistics(df_events, 
                                            group_col="team_match", 
                                            keep_kpis=["totalShots"])

# get the total number of shots for each game
df_shots_match = ed_help.compute_statistics(df_events, 
                                            group_col="match", 
                                            keep_kpis=["totalShots"])
df_shots_match.rename(columns={"totalShots":"totalShotsMatch"}, inplace=True)

# combine the two and compute the share of shots that each team had
df_shots_team = pd.merge(df_shots_team, df_shots_match, how="left")
df_shots_team["shareShots"] = df_shots_team["totalShots"] / df_shots_team["totalShotsMatch"]

# add the number of points each team made
df_shots_team = pd.merge(df_shots_team, df_matches[["matchId","teamId","points"]])

# only keep the teams that had more shots than the opponent
df_more_shots = df_shots_team[df_shots_team["shareShots"] > 0.5]

print(f"Avg. number of points of the team with more shots: {df_more_shots['points'].mean():.2f}")

Avg. number of points of the team with more shots: 1.67


Do you see what I am seeing?! Knowing which team had the lower Gini coefficient seems to be a better indicator of who will win the match (1.76 points/game) than knowing which team had more shots (1.67 points/game)!

## Summary

Nice, you made it through the notebook about passes! Again, I hope you had some fun and learned something on the way! Let me quickly summarize what we learned:
- Usage of the *compute_statistics* function to compute all sorts of statistics
- Drawing player positions and passes between players using the *create_position_plot* function
- Usage of *create_pass_polar_plot* to get a good visualization of passes in different zones of the field
- Measuring how well balanced a team is w.r.t. the number of passes of each player by using the Gini coefficient
- Hopefully much more about Python when going through the notebook and potentially through some of the helper functions :-)