# Abstract

For this project, we analyze football transfers. The data used to build our network is scraped from [`transfermarkt.com`](https://www.transfermarkt.com/), a website specialized in football. This website records transfers between clubs all around the world, from major leagues to less popular ones. The data covers not only first-division leagues but also second and lower divisions. Because of the sheer volume of data stored on this website, our analysis only takes into account transfers from 1 January 2015 to 31 December 2016.

Our network is composed of football clubs. Each node represents a club that participated in at least one transfer during the two years of interest. A transfer between two clubs is represented as an edge.

A first step in this project is to analyze the differences between the three major types of transfers: free transfers, loans, and monetary transfers. Each type of transfer has its own specificities regarding the type of clubs involved or the characteristics of the players. In a second phase, we look more deeply into the monetary transfers network and the way money flows in this market.

Our project shows that the difference between bla bla bla **TODO**

> **Tip**: For a better reading experience, we advise you, dear reader, to open this notebook with [nbviewer](https://nbviewer.jupyter.org/github/MGT-416/Team1FinalProject/blob/master/Project%20Report.ipynb#Introduction).

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Abstract" data-toc-modified-id="Abstract-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Abstract</a></span></li><li><span><a href="#Introduction" data-toc-modified-id="Introduction-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Data-Acquisition-and-Preparation" data-toc-modified-id="Data-Acquisition-and-Preparation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Acquisition and Preparation</a></span></li><li><span><a href="#Overview-of-Analysis" data-toc-modified-id="Overview-of-Analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Overview of Analysis</a></span></li><li><span><a href="#Analysis" data-toc-modified-id="Analysis-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Analysis</a></span><ul class="toc-item"><li><span><a href="#Centralities-analysis" data-toc-modified-id="Centralities-analysis-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Centralities analysis</a></span><ul class="toc-item"><li><span><a href="#Centralities-analysis---Club-Level" data-toc-modified-id="Centralities-analysis---Club-Level-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Centralities analysis - Club Level</a></span></li><li><span><a href="#Centralities-analysis---League-Level" data-toc-modified-id="Centralities-analysis---League-Level-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>Centralities analysis - League Level</a></span></li></ul></li><li><span><a href="#Communities" data-toc-modified-id="Communities-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Communities</a></span></li><li><span><a href="#ARTEM" data-toc-modified-id="ARTEM-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>ARTEM</a></span></li><li><span><a href="#HUGO" data-toc-modified-id="HUGO-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>HUGO</a></span></li></ul></li><li><span><a href="#Discussion" 
data-toc-modified-id="Discussion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Discussion</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Appendix:-Project-Structure" data-toc-modified-id="Appendix:-Project-Structure-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Appendix: Project Structure</a></span></li></ul></div>

# Introduction

Football is probably the most popular sport in the world. It is also one of the sports with the most money flowing around it. With all the financial stakes involved, the transfer market is key to a club's sporting and marketing success. In the football transfer market, each club can hire players. If a player already has a contract with another club, both clubs can reach a financial agreement for the player to leave one club, the selling club, and join a new team, the buying club. This is what we call a **monetary transfer** in this project. Another possibility for a club to sign a new player is to sign him for free, when the player has no current contract. This is referred to as a **free transfer**. The last type of transfer is the **loan**, when a player under contract with a club joins a new team for a pre-defined amount of time, negotiated between the two clubs.

In this project, we model the football transfer market as a network. Each node represents a club. All nodes have an attribute: the league the club is part of. A transfer between two clubs is represented as an edge. As for the nodes, each edge stores information about its transfer, such as player or club characteristics. When a player leaves club A and signs a new contract with club B, this is represented as a directed edge from club A to club B. Note that two players might leave club A and join club B: this is represented with two directed edges from the club A node to the club B node. Thus, this project deals with a **Multi Directed Network**.

The questions we want to answer with this data are: Do monetary transfers have the same characteristics as loans or free transfers? What kinds of clubs participate in each type of transfer? What are the differences between clubs doing a lot of monetary transfers and clubs doing loans? What are the interactions between clubs of the same country? Of the same league division? We will also look at the players and try to build a player profile for each type of transfer. Finally, we will look exclusively at monetary transfers and try to understand how money flows between clubs: which are the *key* clubs in this network, and which clubs have the most effect on the amount of money involved in transfers?

To answer these questions, we mainly rely on centrality measures, such as degree centrality or PageRank centrality. Diffusion models will also be a key part of our analysis of monetary transfers.

# Data Acquisition and Preparation

> The notebook with the code detailed in this section is **`1. Buid Data`** in our GitHub repository.

Our analysis is based on the data available on [Transfermarkt](https://www.transfermarkt.com/), one of the most complete open platforms for football data. This website is often used by journalists to estimate the monetary value of a player in the transfer market. For our project and the analysis we wish to do, the website's most important feature is its transfer history. Almost all transfers happening in the football world, from those involving many millions to those between two small clubs playing in a lower division in Eastern Europe, are recorded and categorized.

For example, if someone wants to know all transfers that were made on the 12th of August 2015, the website has this information accessible at [https://www.transfermarkt.com/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0/leihe//datum/2015-08-12/page/1](https://www.transfermarkt.com/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0/leihe//datum/2015-08-12/page/1). Within this URL, three things need to be stated:

- Once the page is open, we can see that all transfers are organized into a table, with information about the player, the selling club, the buying club and the transfer type, among other information. We can also notice that the player and the clubs each carry a *hyperlink*, giving us the URL with all other information about a player or a club.
- The date appears just after **datum** in the URL, in the format *year-month-day*. To get the data for transfers between 2015 and 2016, we use this URL with the appropriate date.
- For a given day, there might have been a lot of transfers. The website only shows 25 transfers per HTML page, but we can iterate through all pages with the last component of the URL.
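The URL structure described above can be sketched as a small helper. The pattern is copied from the example URL; the function name `build_day_url` is a hypothetical name chosen for illustration, not the one used in the project notebook.

```python
from datetime import date

# URL pattern copied from the example above; only the date and the page vary.
BASE_URL = (
    "https://www.transfermarkt.com/transfers/transfertagedetail/statistik/top/"
    "land_id_zu/0/land_id_ab/0/leihe//datum/{day}/page/{page}"
)


def build_day_url(day, page=1):
    """Return the listing URL for one day and one result page (hypothetical helper)."""
    # date.isoformat() yields the year-month-day format expected after "datum"
    return BASE_URL.format(day=day.isoformat(), page=page)


print(build_day_url(date(2015, 8, 12)))
print(build_day_url(date(2015, 8, 12), page=2))
```

Incrementing `page` until a page comes back without transfers is one possible way to walk through all the results of a busy day.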

For each transfer, the website stores a lot of information, from the player's name to the selling club's director. Only a subset of those records is of interest for our project:

- Player attributes:
    - **Player Name**: Name of the player
    - **Player Link**: *Transfermarkt* URL for the player's profile
    - **Player position**: Position of the player
    - **Age**: Age of the player at the time of the transfer


- Transfer money:
    - **Fee**: Monetary value, if any, of the transfer
    - **Market value**: Theoretical value of the player, computed by Transfermarkt.com

- Clubs
    - **From club**: Club/Team that the player leaves
    - **To club**: Club/Team that the player joins.
    - **From manager**: Manager of the club that the player leaves.
    - **To manager**: Manager of the club that the player joins.
    - **From manager link**: *Transfermarkt* URL for the manager of the club that the player leaves.
    - **To manager link**: *Transfermarkt* URL for the manager of the club that the player joins.
    
    
- Competitions
    - **From competition**: Competition/League in which the `from club` participates
    - **To competition**: Competition/League in which the `to club` participates

**Web scraping strategy**:
- `Transfermarkt.com` has a URL listing the transfers occurring on a specific date.
- The transfers happening on a specific day can be spread across multiple pages.
- For each transfer, a detailed version - containing the information we are interested in - is available through a link.
- Create one CSV file per day. At the end, merge all CSV files into one (so if an error occurs, there is no need to start everything again).
- All transfers happening in **2015** and **2016** will be retrieved.
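The "one file per day, then merge" strategy can be sketched as follows. This is a minimal illustration, not the project's scraper: `fetch_day` is a hypothetical callable standing in for the actual scraping code, and the function names and file layout are assumptions.

```python
import csv
from datetime import date, timedelta
from pathlib import Path


def iter_days(start, end):
    """Yield every date from start to end, both included."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def scrape_period(start, end, fetch_day, out_dir, fieldnames):
    """Write one CSV per day; days already on disk are skipped.

    fetch_day is a hypothetical callable (date -> list of row dicts) standing
    in for the real scraping code, which is not shown here.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for day in iter_days(start, end):
        target = out_dir / f"{day.isoformat()}.csv"
        if target.exists():
            continue  # scraped in a previous run, nothing to redo
        rows = fetch_day(day)
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)


def merge_days(out_dir, merged_path, fieldnames):
    """Concatenate all daily CSV files, in date order, into a single file.

    merged_path should live outside out_dir so it is not picked up by the glob.
    """
    with Path(merged_path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for daily in sorted(Path(out_dir).glob("*.csv")):
            with daily.open(newline="", encoding="utf-8") as source:
                writer.writerows(csv.DictReader(source))
```

Because a day is skipped as soon as its file exists, an interrupted run can simply be restarted with the same arguments. Covering 1 January 2015 to 31 December 2016 means iterating over 731 days.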

Once retrieved, all *.csv* files are merged into a single one and fed to **Pandas** to build a DataFrame. The data looks like:

In [4]:
import pandas as pd

df = pd.read_csv("data/data.csv", index_col=0)
df.sample(5)

Unnamed: 0,Player Name,Player Link,Player position,From club,To club,From competition,To competition,From manager,From manager link,To manager,To manager link,Market value,Fee,Age,From manager agent,To manager agent,Player Agent
17432,chris-kettings,/chris-kettings/profil/spieler/177953,Keeper,Crystal Palace,Stevenage FC,Premier League,League Two,Alan Pardew,/alan-pardew/profil/trainer/1988,Teddy Sheringham,/teddy-sheringham/profil/trainer/37240,250 Th. €,Loan,22 years 09 months 20 days,,,
18784,yonathan-del-valle,/yonathan-del-valle/profil/spieler/72643,Right Wing,Rio Ave FC,Kasimpasa,Primeira Liga,Süper Lig,Pedro Martins,/pedro-martins/profil/trainer/16027,Riza Calimbay,/riza-calimbay/profil/trainer/789,"1,80 Mill. €",Loan,25 years 02 months 06 days,,,"TGC - Sports Management, Events & Trading"
3666,kevin-van-veen,/kevin-van-veen/profil/spieler/159894,Centre-Forward,FC Oss,Scunthorpe Utd.,Jupiler League,League One,Wil Boessen,/wil-boessen/profil/trainer/2199,Mark Robins,/mark-robins/profil/trainer/1569,250 Th. €,300 Th. €,23 years 07 months 29 days,,,Futuralis Football Group
4297,kosuke-nakamura,/kosuke-nakamura/profil/spieler/165794,Keeper,Kashiwa Reysol,Avispa Fukuoka,J. League Division 1 – Second Stage,J. League Division 2,Tatsuma Yoshida,/tatsuma-yoshida/profil/trainer/38838,Masami Ihara,/masami-ihara/profil/trainer/10924,50 Th. €,Loan,19 years 10 months 11 days,,,
33464,alex-sirri,/alex-sirri/profil/spieler/132688,Centre-Back,Alessandria,Arezzo,Prima Divisione - A,Prima Divisione - A,Angelo Gregucci,/angelo-gregucci/profil/trainer/2130,Stefano Sottili,/stefano-sottili/profil/trainer/19861,125 Th. €,?,24 years 09 months 19 days,,,Italian Managers Group s.r.l.


This data needs some cleaning before it can be used in our analyses:
- Create the **transfer type**: As stated above, there is more than one type of transfer in the football market. The website stores the type of a transfer, but always in the same column and in several different formats. Based on all possible entries for the **transfer fee** column in *Transfermarkt.com*, we create **four types** of transfers: **free**, **loan**, **swap** and monetary **transfer**. Note that the *swap* transfers will be discarded because there are so few of them.
- For the monetary transfers and the loans, some money might have been involved in the transfer. This information is also stored in the **transfer fee** column in *Transfermarkt.com*. The fee specified is extracted and stored as an *integer* in a new column of our data.
- During the web scraping phase, whitespace was added around the **player position** data. It needs to be removed.
- The age of the player is stored as a string, in multiple formats: sometimes with the year, month and day information (*21 years 09 months 04 days*) and sometimes only with the years and months (*31 years and 01 months*). The age is converted into a single float variable, stored in a new column.
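Two of the cleaning steps above, the fee extraction and the age conversion, can be sketched with regular expressions. This is a minimal illustration based only on the formats visible in the sample (`300 Th. €`, `1,80 Mill. €`, `Loan`, `?`) and in the list above; the helper names are hypothetical, other fee formats on the website are not covered, and approximating a month as 1/12 and a day as 1/365 of a year is an assumption.

```python
import re

# Multipliers for the two abbreviations seen in the sample data
_MULTIPLIERS = {"Th.": 1_000, "Mill.": 1_000_000}


def parse_amount(text):
    """Return the amount in euros as an int, or None when no amount is present.

    '300 Th. €' -> 300000, '1,80 Mill. €' -> 1800000, 'Loan' or '?' -> None.
    """
    match = re.search(r"(\d+(?:,\d+)?)\s*(Th\.|Mill\.)", text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", "."))  # decimal comma -> point
    return int(round(number * _MULTIPLIERS[match.group(2)]))


def parse_age(text):
    """Convert an age string to a float number of years, or None if unparsable."""
    years = re.search(r"(\d+)\s+years?", text)
    if years is None:
        return None
    months = re.search(r"(\d+)\s+months?", text)
    days = re.search(r"(\d+)\s+days?", text)
    age = float(years.group(1))
    if months:
        age += int(months.group(1)) / 12   # assumption: 12 equal months
    if days:
        age += int(days.group(1)) / 365    # assumption: 365-day year
    return age


print(parse_amount("1,80 Mill. €"), parse_age("31 years and 01 months"))
```

With Pandas, such helpers could be applied column-wise, for instance with `Series.map`, and the stray whitespace around the position removed with `Series.str.strip()`.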

Finally, the data is ready to be used for creating networks. **Five networks** are created:
- All transfers
- Only monetary transfers
- Only loans
- Only free transfers
- Only swap transfers

All networks are **multi directed graphs** (`MultiDiGraph`). Below is the output of NetworkX's *info* function:

> Name: loan<br/>
> Type: MultiDiGraph<br/>
> Number of nodes: 2664<br/>
> Number of edges: 10773<br/>
> Average in degree:   4.0439<br/>
> Average out degree:   4.0439<br/>

> Name: swap<br/>
> Type: MultiDiGraph<br/>
> Number of nodes: 76<br/>
> Number of edges: 87<br/>
> Average in degree:   1.1447<br/>
> Average out degree:   1.1447<br/>

> Name: transfer<br/>
> Type: MultiDiGraph<br/>
> Number of nodes: 1124<br/>
> Number of edges: 2913<br/>
> Average in degree:   2.5916<br/>
> Average out degree:   2.5916<br/>

> Name: free<br/>
> Type: MultiDiGraph<br/>
> Number of nodes: 3851<br/>
> Number of edges: 30994<br/>
> Average in degree:   8.0483<br/>
> Average out degree:   8.0483<br/>
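As an illustration of how such a network can be assembled, here is a minimal sketch on toy rows. The row keys and club names are placeholders invented for the example; only the `competition` node attribute matches the name used later in this report. The average in-degree reported above is simply the number of edges divided by the number of nodes, which the last lines reproduce.

```python
import networkx as nx

# Toy rows (placeholders, not real transfers): two players go from A to B
toy_transfers = [
    {"from": "Club A", "to": "Club B", "from_league": "League 1", "to_league": "League 2", "player": "p1"},
    {"from": "Club A", "to": "Club B", "from_league": "League 1", "to_league": "League 2", "player": "p2"},
    {"from": "Club B", "to": "Club C", "from_league": "League 2", "to_league": "League 1", "player": "p3"},
]


def build_network(rows, name):
    """Build a MultiDiGraph with one node per club and one edge per transfer."""
    G = nx.MultiDiGraph(name=name)
    for row in rows:
        G.add_node(row["from"], competition=row["from_league"])
        G.add_node(row["to"], competition=row["to_league"])
        # A MultiDiGraph keeps parallel edges, so each transfer stays distinct
        G.add_edge(row["from"], row["to"], player=row["player"])
    return G


G_toy = build_network(toy_transfers, "toy")
avg_in_degree = G_toy.number_of_edges() / G_toy.number_of_nodes()
print(G_toy.number_of_nodes(), G_toy.number_of_edges(), avg_in_degree)
```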

# Overview of Analysis

# Analysis

In [11]:
import networkx as nx

In [12]:
G_monetary  = nx.read_gml("networks/transfers_transfer_network.gml")
G_loans     = nx.read_gml("networks/transfers_loan_network.gml")
G_free      = nx.read_gml("networks/transfers_free_network.gml")

## Centralities analysis

As a first contact with the data, we compute some centrality measures and look at how the three types of transfers differ on each measure.

Please refer to the notebook **`Transfers vs Loan vs Free`** for the complete code. Below, only the findings are presented.
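For orientation, rankings of this kind can be obtained along the following lines. This is a minimal sketch on a toy graph, not the code of the referenced notebook; the helper `top_clubs` is a hypothetical name.

```python
import networkx as nx


def top_clubs(G, centrality, k=10):
    """Return the k highest-scoring nodes for a NetworkX centrality function."""
    scores = centrality(G)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy network: B receives three players, A sells two
G_example = nx.MultiDiGraph()
G_example.add_edges_from([("A", "B"), ("A", "B"), ("C", "B"), ("B", "D")])

measures = [
    ("in-degree", nx.in_degree_centrality),
    ("out-degree", nx.out_degree_centrality),
    ("closeness", nx.closeness_centrality),
    ("betweenness", nx.betweenness_centrality),
]
for label, measure in measures:
    print(label, top_clubs(G_example, measure, k=3))
```

Note that in a multigraph, parallel edges count towards the degree, so degree centralities can exceed 1.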

### Centralities analysis - Club Level

**In-degree**

The **in-degree** centrality deals with the number of new players clubs acquire. In the **monetary transfers** network, the top-ranked clubs come from primary leagues, with centrality values close to one another. In the **free** and **loan** networks, the top-ranked clubs are less popular ones from lower divisions. This is a first step in confirming our assumption: popular clubs with money, which are fewer, participate in fewer transfers, the vast majority of which are monetary ones.

**Out-degree**

- We can see again that popular clubs are the ones *selling* the most players.
- The **free** ranking is a bit different compared to the *in-degree* version: it contains almost only clubs from primary divisions. One assumption could be that players in the final phase of their careers leave a mid-table club for free and return to their home country's league. It will be interesting to compare those findings with the ones taking the players' age into account.
- The **loan** version contains clubs exclusively from primary divisions. This makes sense, as top clubs tend to buy young and promising players and directly "send" them to less stressful teams in order to gain experience. This ranking contains many Italian clubs and is led by **Juventus**, the previous winner of the Italian primary division. We can note that in the **in-degree** version, the *loan* ranking also contained many Italian clubs, but this time from lower divisions. One possible explanation is that clubs from the primary division "send" their young players to teams playing in the secondary and tertiary Italian divisions.

> **Interesting fact**: In the **free** ranking, the club **Parma** appears first, with a centrality value 60% higher than the second one. Why? Because this football club had financial troubles and had to declare bankruptcy in 2015. Thus a lot of players left the club for free.
    
> **Interesting fact #2**: As pointed out previously, there are a lot of Italian clubs in the **loan** rankings, in both the in-degree and out-degree versions. This might be a cultural thing: in Italy, some clubs have a reputation for being **farm clubs**, taking young players on loan from bigger teams. Why mainly in Italy? Because the Italian rules allow Italian clubs [to *co-own* players](http://www.bbc.com/sport/football/34125476).

**Closeness**

As previously, only top clubs make up the monetary transfers ranking. Those clubs are all European, mainly from England and Italy. It is interesting to note that Watford, the ranking leader, was managed in 2016/2017 by an Italian manager and is now managed by a Portuguese one. The closeness centrality values are quite close to one another.

The loan ranking is interesting, as it contains some Portuguese clubs. Portuguese clubs are known for making transfers with South American players, mostly Brazilian ones, who wish to have a career in Europe. Why Portugal? Because of the cultural and linguistic proximity. Those players are then sold to higher-value European clubs, or go back home in case of failure. Thus, finding Portuguese clubs in this ranking isn't surprising.

We can notice that in these centrality rankings, almost all clubs are from first-division leagues.


**Betweenness**

The betweenness rankings are a bit more difficult to make sense of. As before, we can notice that each of them has Italian clubs among the top ones.

**PageRank**

In the **transfer** version, weighted by the number of transfers, almost all clubs are from England, with primary and secondary divisions mixed. But in the **free** ranking, there is no club from a first division, only *small* clubs. Interestingly, the first two clubs are from Switzerland. We note the same behavior in the **loan** ranking, mainly composed of clubs from lower divisions. This supports our previous findings: clubs from the primary divisions are mainly important in the monetary transfers, but not so much in the loan and free versions.

In the transfer version weighted by transfer fees, most of the *famous* clubs are present. They are mainly clubs from England, with a lot of financial power, or clubs recently bought by new investors, like Paris SG and Valencia.
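The two weightings discussed above can be sketched by first collapsing parallel edges into a single weighted edge. This is one possible approach, not necessarily the one used in the project notebook: the helper names and the `fee` edge attribute are assumptions, and the edges keep the direction of the player (from selling club to buying club).

```python
import networkx as nx


def collapse(G, fee_attr=None):
    """Merge the parallel edges of a MultiDiGraph into one weighted edge.

    The weight is the number of transfers when fee_attr is None, otherwise
    the sum of the (hypothetical) numeric edge attribute fee_attr.
    """
    H = nx.DiGraph()
    H.add_nodes_from(G.nodes(data=True))
    for u, v, data in G.edges(data=True):
        w = 1 if fee_attr is None else data.get(fee_attr, 0)
        if H.has_edge(u, v):
            H[u][v]["weight"] += w
        else:
            H.add_edge(u, v, weight=w)
    return H


def weighted_pagerank(G, fee_attr=None):
    """PageRank on the collapsed graph, using the aggregated edge weights."""
    return nx.pagerank(collapse(G, fee_attr), weight="weight")
```

Under these assumptions, `weighted_pagerank(G_monetary)` would weight each pair of clubs by its number of transfers, while `weighted_pagerank(G_monetary, fee_attr="fee")` would weight it by the total fee, provided the edges carry such an attribute.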

### Centralities analysis - League Level

We perform the same centrality analysis, but this time at the league level. All clubs playing in the same league are merged together.
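Merging clubs into leagues can be sketched by relabelling each edge endpoint with the `competition` attribute of its club. This is a minimal illustration; the helper name is hypothetical, and keeping transfers between two clubs of the same league as self-loops is an assumption controlled by `keep_internal`.

```python
import networkx as nx


def to_league_level(G, attr="competition", keep_internal=True):
    """Build a league-level MultiDiGraph: one node per league, one edge per transfer."""
    L = nx.MultiDiGraph(name=G.name)
    for u, v in G.edges():
        league_u, league_v = G.nodes[u][attr], G.nodes[v][attr]
        if league_u == league_v and not keep_internal:
            continue  # optionally drop transfers inside a single league
        L.add_edge(league_u, league_v)
    return L


# Toy club-level network (placeholder names)
G_clubs = nx.MultiDiGraph(name="toy")
G_clubs.add_node("Club A", competition="League 1")
G_clubs.add_node("Club B", competition="League 2")
G_clubs.add_node("Club C", competition="League 1")
G_clubs.add_edges_from([("Club A", "Club B"), ("Club C", "Club B"), ("Club A", "Club C")])

G_leagues = to_league_level(G_clubs)
print(sorted(G_leagues.nodes()), G_leagues.number_of_edges())
```

The same centrality functions can then be applied to the league-level graph.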

**In-degree**

We first see that there is much more variation in the centrality values compared to the club-level version.

In the **transfer** ranking, the first two leagues are the primary leagues of **China** and **Turkey**. This is interesting since no clubs from those leagues were present in the club-level ranking. It means that Chinese and Turkish clubs are quite homogeneous in terms of transfers: no club makes more transfers than the others, and they all follow the same transfer strategy. Clubs from the Belgian **Jupiler Pro League** show the same type of behavior.

Even though the leagues in this ranking mainly come from primary divisions, two secondary-division leagues are present: the **Championship** is the English second division and one of the "richest" leagues in Europe, and **Serie B** is from Italy and confirms a fact already observed with clubs: Italian clubs were highly active in monetary transfers in recent years.

> Note that the network doesn't take the transfer value into consideration.

Among the top 3 leagues in the **loan** ranking, the primary and secondary divisions of **Spain** are present. Once again, there was no Spanish club in the same ranking at club level. Following the same reasoning as for Chinese and Turkish clubs in the monetary transfers, this means that Spanish clubs have a *football loan* culture, as in Italy. The difference is that in Italy only a subset of clubs participates in this type of transfer, whereas in Spain many more clubs appear to follow the same "loan strategy". Similar behavior can be observed with **Portuguese** clubs (*Primeira Liga* and *Segunda Liga*).

**Out-degree**

Comparing the in-degree and out-degree monetary transfers rankings, we can notice that the Chinese and Turkish leagues don't appear in the second one: in 2015-2016, football players went to those countries more often than they left them. This is expected, as both countries have shown increased interest in football recently. Another expected fact is finding the Brazilian primary league in this out-degree ranking. The Brazilian league is present in the top-10 ranking for monetary transfers, and is in second position in the loan ranking. This is probably linked directly to the Spanish and Portuguese clubs in the in-degree ranking.

The German league (1. Bundesliga) is present in the top 5 of both versions of the monetary transfers ranking: German top clubs seem to have had less stable teams in recent years.

The free versions of the in-degree and out-degree centralities are the only ones to contain leagues from Eastern Europe, where there is less money flowing. But the money flowing in a league isn't the key differentiator, because the English league is present everywhere. Compared to the two other rankings, the free ones show a lot of overlap between the in-degree and out-degree versions.

**Closeness**

First, we can notice that the closeness centralities are all high, almost at 0.5.

The monetary ranking is consistent with the in-degree ranking, with the Turkish and Chinese leagues at the top. Once more, the Championship (England) is the only secondary-division league present in the monetary ranking. This clearly demonstrates the importance of this league in the transfer market. But this centrality measure also gives results that are more complicated to analyse: the Belgian and Swiss leagues appear in this ranking, quite surprisingly. Otherwise, the big 5 are always present: Germany, Italy, England, France and Spain.

In the free transfers ranking, we can directly notice that all centrality values are quite close to one another. There is no league with a lot of money flowing in and out, and the ranking is mainly composed of lower divisions. Having such close centrality values also means that there is no outstanding or central league: they are close from a network point of view.

The loan ranking also has close centrality values. We can notice, as before, the presence of the primary and secondary divisions of Spain and Portugal.

**Betweenness**

All centrality values are small, but we can see as before that the **top five** plus the Swiss and Belgian leagues are present in the **monetary transfers** ranking. Western Europe is really the central place for monetary transfers.

As for the *closeness centrality*, the rankings for **free** and **loan** transfers are very different from the in-degree and out-degree ones, with a mix of lower divisions and some primary ones. We can note that the **Premier League** is present everywhere.

**PageRank**

The PageRank centrality for the transfer version is interesting, with the **Chinese** league being first, with a centrality value bigger than the second one. This is due to the massive offensive of Chinese clubs in the transfer market over the last three years. More surprisingly, the Turkish and Belgian leagues appear in this top 5. Further down the ranking, the centrality values are very close to one another, but we can see that a small Spanish division seems important. The free and loan rankings are mainly composed of lower divisions, once more supporting our assumption that popular and powerful clubs do business almost entirely through monetary transfers.

## Communities

In [16]:
# import the Louvain algorithm (python-louvain package, exposed as the `community` module)
import community as community
import operator

We take only a subset of the graph: all edges where at least one of the two nodes belongs to one of the main European leagues.

In [17]:
leagues = ['Premier League', 'Championship', 'Serie A', 'Ligue 1', '1.Bundesliga', 'Primera División', 'Primeira Liga', ]

In [18]:
for G in [G_monetary, G_free, G_loans]:
    G_new = nx.MultiDiGraph()

    print("================================================", G.name,"================================================")
    print()
    
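    # Keep every transfer where at least one of the two clubs plays in a main league
    for n1, n2 in G.edges():
        # G.nodes replaces G.node, which was removed in NetworkX 2.4
        if G.nodes[n1]['competition'] in leagues or G.nodes[n2]['competition'] in leagues:
            G_new.add_edge(n1, n2)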

    # compute the best partition
    partition = community.best_partition(G_new.to_undirected())

    size = len(set(partition.values()))
    print('The number of communities: ', size)

    # For each community, count the clubs per league
    nbrToPrint = 5

    for i in range(size):
        # Retrieve the nodes (clubs) inside the community
        community_i = [node for node in partition.keys() if partition[node] == i]

        # League frequency dict: league name -> number of clubs in this community
        league_counts = {}
        for n in community_i:
            league = G.nodes[n]['competition']
            league_counts[league] = league_counts.get(league, 0) + 1

        tot = sum(league_counts.values())
        # Sort the leagues by decreasing number of clubs
        league_counts = sorted(league_counts.items(), key=operator.itemgetter(1), reverse=True)

        print("Community",i)
        for en,c in enumerate(league_counts[:nbrToPrint]):
            print("\t{}.   {:<35}{:<5}({:<7.2f}%)".format(en+1, c[0], c[1], 100*c[1]/tot))
        print()


The number of communities:  13
Community 0
	1.   1.Bundesliga                       17   (22.37  %)
	2.   2.Bundesliga                       13   (17.11  %)
	3.   Raiffeisen Super League            6    (7.89   %)
	4.   3.Liga                             6    (7.89   %)
	5.   Bundesliga                         3    (3.95   %)

Community 1
	1.   Ligue 1                            5    (13.16  %)
	2.   Campeonato Brasileiro Série A      4    (10.53  %)
	3.   Premier League                     3    (7.89   %)
	4.   Primera División                   3    (7.89   %)
	5.   Chinese Super League               3    (7.89   %)

Community 2
	1.   Championship                       23   (22.12  %)
	2.   Premier League                     12   (11.54  %)
	3.   League One                         12   (11.54  %)
	4.   League Two                         10   (9.62   %)
	5.   Eredivisie                         6    (5.77   %)

Community 3
	1.   Primera División                   13   (22.81  %)
	2.  

In the **loan transfers** and **free transfers** networks, the communities are clearly organized by country. Primary, secondary and tertiary divisions always cluster together. There are also some more hybrid communities, like the ones composed of leagues from Portugal and Brazil, or from Spain and South American countries. Language plays a crucial role in these types of transfers. As with Portugal-Brazil or the Spanish-speaking countries, the French and Belgian primary divisions are clustered together.

> **Interesting fact**: In the free transfers network, the community containing French clubs also contains clubs from the **Qatar Stars League**. This league has never shown up in the analysis before.

The **monetary transfers** communities are also tied to this notion of country, but less strongly. Within this network, there are some communities composed mainly of clubs from one country (Germany, England, France), but also communities composed of clubs from more diverse countries, like *Community 3* with clubs from Portugal, Spain, France, Brazil and China. One conclusion is that when it comes to loans and free transfers, clubs prefer to deal with clubs that are close in terms of country and language. This "restriction" is less obvious when money is involved. This observation was expected: the better a player is, the more his transfer will cost. Good players tend to join big European clubs, thus leaving their home countries in most cases.

## ARTEM

## HUGO

# Discussion

# Conclusion

# Appendix: Project Structure