<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Web-Scrap---Transfers" data-toc-modified-id="Web-Scrap---Transfers-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Web Scrap - Transfers</a></span></li><li><span><a href="#Open-data" data-toc-modified-id="Open-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Open data</a></span></li><li><span><a href="#Web-Scrap-Agents" data-toc-modified-id="Web-Scrap-Agents-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Web Scrap Agents</a></span><ul class="toc-item"><li><span><a href="#Player-Agents" data-toc-modified-id="Player-Agents-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Player Agents</a></span></li><li><span><a href="#Manager-Agents" data-toc-modified-id="Manager-Agents-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Manager Agents</a></span></li></ul></li><li><span><a href="#Save-data" data-toc-modified-id="Save-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Save data</a></span></li><li><span><a href="#Transform-data" data-toc-modified-id="Transform-data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Transform data</a></span><ul class="toc-item"><li><span><a href="#Create-Transfer-type" data-toc-modified-id="Create-Transfer-type-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Create Transfer type</a></span></li><li><span><a href="#Transfer-fee---integer-amount" data-toc-modified-id="Transfer-fee---integer-amount-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Transfer fee - integer amount</a></span></li><li><span><a href="#Players-positions" data-toc-modified-id="Players-positions-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Players positions</a></span></li><li><span><a href="#Age-to-number" data-toc-modified-id="Age-to-number-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Age to 
number</a></span></li></ul></li><li><span><a href="#Build-Network" data-toc-modified-id="Build-Network-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Build Network</a></span><ul class="toc-item"><li><span><a href="#Nodes" data-toc-modified-id="Nodes-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Nodes</a></span></li><li><span><a href="#Edges" data-toc-modified-id="Edges-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Edges</a></span></li><li><span><a href="#Network" data-toc-modified-id="Network-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Network</a></span></li><li><span><a href="#Export" data-toc-modified-id="Export-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Export</a></span></li><li><span><a href="#Build-Network-for-each-transfer-type" data-toc-modified-id="Build-Network-for-each-transfer-type-7.5"><span class="toc-item-num">7.5&nbsp;&nbsp;</span>Build Network for each transfer type</a></span></li></ul></li></ul></div>

# Imports

In [1]:
# Beautiful Soup for web scraping data
from bs4 import BeautifulSoup
import requests

# Save into csv file
import csv

# Pandas to reimport csv data
import pandas as pd

# Numpy for array manipulation
import numpy as np

# NetworkX
import networkx as nx

The `log_progress` helper below displays a progress bar in the notebook (source: https://github.com/alexanderkuk/log-progress).

In [2]:
def log_progress(sequence, every=None, size=None, name='Items', delete=False):
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)     # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )
        if delete:
            box.close()

# Web Scrap - Transfers

For this project, we want to analyze football transfers. The data used to build our network is scraped from `transfermarkt.com`, a website specialized in football. It records transfers between clubs all around the world, from major leagues to less popular ones, and covers not only first-level leagues but also second and lower divisions.

For each transfer, the website stores a lot of information, from the player's name to the selling club's director. Only a subset of these fields is of interest for our project:

- Player attributes:
    - **Player Name**: Name of the player
    - **Player Link**: Transfermarkt.com URL of the player's profile
    - **Player position**: Position of the player
    - **Age**: Age of the player at the time of the transfer

- Transfer money:
    - **Fee**: Monetary value, if any, of the transfer
    - **Market value**: Theoretical value of the player, computed by Transfermarkt.com

- Clubs:
    - **From club**: Club/Team that the player leaves
    - **To club**: Club/Team that the player joins
    - **From manager**: Manager of the club that the player leaves
    - **To manager**: Manager of the club that the player joins
    - **From manager link**: Transfermarkt.com URL of the manager of the club that the player leaves
    - **To manager link**: Transfermarkt.com URL of the manager of the club that the player joins

- Competitions:
    - **From competition**: Competition/League in which the `from club` participates
    - **To competition**: Competition/League in which the `to club` participates

Web scraping strategy:
- `Transfermarkt.com` has a URL listing the transfers that occurred on a specific date.
- The transfers of a given day can be spread across multiple pages.
- For each transfer, a detailed page containing the information we are interested in is available through a link.
- Create one csv file per day. At the end, merge all csv files into one (so if an error occurs, there is no need to restart from scratch).
- All transfers that happened in **2015** and **2016** are retrieved.
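The day-by-day, page-by-page strategy above can be sketched as follows. This is only an illustration: `iter_days` and `day_url` are hypothetical helper names that do not appear in the notebook, and the URL template is copied from the scraping cell. Iterating over real calendar dates also covers 29 February 2016.

```python
from datetime import date, timedelta

# Copy of the URL template used in this notebook
BASE_URL = "https://www.transfermarkt.com"
TRANSFER_URL = BASE_URL + "/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0/leihe//datum/{0}/page/{1}"


def iter_days(first_year, last_year):
    """Yield every calendar day from 1 January first_year to 31 December last_year."""
    current = date(first_year, 1, 1)
    end = date(last_year, 12, 31)
    while current <= end:
        yield current
        current += timedelta(days=1)


def day_url(day, page=1):
    """Build the URL of one result page for one day (date written as YYYY-M-D, no zero padding)."""
    day_str = "{0}-{1}-{2}".format(day.year, day.month, day.day)
    return TRANSFER_URL.format(day_str, page)


days = list(iter_days(2015, 2016))
print(len(days))          # 731 days: 365 in 2015 + 366 in 2016
print(day_url(days[0]))   # first page of 2015-1-1
```

Each URL would then be fetched and parsed in the same way as in the scraping cell.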

In [None]:
# URL information
base_url = "https://www.transfermarkt.com"
transfer_url = base_url + '/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0/leihe//datum/{0}/page/{1}'

In [None]:
# Web Scraping helper
headers = {'User-Agent': 'Mozilla/5.00'}

In [None]:
# Keep track of all possible problems
nbr_failure = 0 

# Dates to retrieve
YearRange = range(2015,2017) # 2015 - 2016
Months = range(1,13) # 12 Months in a year
DaysInMonths = [0,31,28,31,30,31,30,31,31,30,31,30,31] # Days per month (index 0 unused); February is fixed at 28, so 2016-02-29 is not retrieved


for year in YearRange:
    for month in log_progress(Months, name="Month"):
        for day in log_progress(range(1,DaysInMonths[month]+1), name="Day", delete=True):
            date = str(year)+'-'+str(month)+'-'+str(day)

            # create csv file for the current day
            csvfile = open('data/data-transfer-day/data-test'+date+'.csv', 'w')
            
            # Retrieve the first page of transfers for the current day
            url = transfer_url.format(date, 1)
            r  = requests.get(url, headers=headers)
            tranfer_data_page = BeautifulSoup(r.text, 'html.parser')
            
            # Retrieve the number of pages containing transfers for the current day
            try:
                nbr_pages = int(tranfer_data_page.find('li', attrs={"class":"letzte-seite"}).find('a')['href'].split('page/')[1])
            except:
                nbr_pages = 1

            # Retrieve every transfer on each page
            try:
                for page in log_progress(range(1,nbr_pages+1), every=1, name="Date: "+date+"    - Page", delete=True): #Progress bar for each page for a given day
                    url = transfer_url.format(date, page)
                    r  = requests.get(url, headers=headers)
                    tranfer_data_page = BeautifulSoup(r.text, 'html.parser')


                    links = tranfer_data_page.find_all('a', href=True)
                    for a in log_progress([l for l in links if "/jumplist/transfers/" in l['href']], every=1, name="Page "+str(page)+"   - Player", delete=True): #Progress bar for each transfer in a page

                        try:
                            # Player page - Transfer details
                            p_transfer_url = base_url + a['href']
                            r2  = requests.get(p_transfer_url, headers=headers)
                            soup = BeautifulSoup(r2.text, 'html.parser')

                            # Player info
                            player_link = soup.find("li", attrs={"id":"Profile"}).find("a")['href']
                            player_name = player_link.split('/')[1]
                            player_position = soup.find("span", text="Position:").find_next().get_text()
                            player_position = player_position.replace("\r", "").replace("\t", "").replace("\n", "")

                            # Clubs
                            clubs = soup.find("td", attrs={"class":"zentriert hauptlink no-border-rechts no-border-links"})
                            from_club = clubs.find_previous('td').find_all('a')[-1].text
                            to_club   = clubs.find_next('td').find_all('a')[-1].text

                            # Competition
                            competitions = soup.find("td", text="Competition")
                            from_competition = competitions.find_previous('td').find_all('a')[-1].text
                            to_competition   = competitions.find_next('td').find_all('a')[-1].text

                            # Managers
                            managers = soup.find("td", text="Manager(s)")
                            from_manager = managers.find_previous('td').find_all('a')[-1].text
                            from_manager_link = managers.find_previous('td').find('a')['href']
                            to_manager   = managers.find_next('td').find_all('a')[-1].text
                            to_manager_link = managers.find_next('td').find('a')['href']

                            # Market Value
                            market_value = soup.find("b", text="Market value at time of change").find_parent().text.split("Market value at time of change")[-1]

                            # Fee
                            fee = soup.find("b", text="Transfer fee").find_next().find_next().text

                            # Age
                            age = soup.find("b", text="Age at the time of the transfer").find_parent().text.split("Age at the time of the transfer")[-1]


                            # Write to CSV file
                            csvfile.write("|".join([player_name, player_link, player_position, from_club, to_club, from_competition, to_competition, from_manager, from_manager_link, to_manager, to_manager_link, market_value, fee, age])+"\n")

                        except:
                            nbr_failure += 1
            except:
                print("Error with", page)
                pass


            # Close CSV file
            csvfile.close()

All transfers are stored in several csv files, one per day. Merge those files with the shell command `cat`.

In [None]:
# Merge everything into a single file
!cat data/data-transfer-day/*.csv > data/data-transfer-day/entire_data.csv
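The same merge can also be done in Python, which may be convenient where `cat` is not available. The sketch below is an assumption, not part of the original notebook: `merge_csv_files` is a hypothetical helper, and it deliberately skips the output file when that file sits in the same directory as the daily files.

```python
import glob
import os


def merge_csv_files(input_dir, output_path):
    """Concatenate every .csv file of input_dir into output_path, in sorted name order."""
    output_abs = os.path.abspath(output_path)
    paths = sorted(glob.glob(os.path.join(input_dir, "*.csv")))
    # Skip the output file itself in case it lives in the same directory
    paths = [p for p in paths if os.path.abspath(p) != output_abs]
    with open(output_path, "w", encoding="utf-8") as merged:
        for path in paths:
            with open(path, encoding="utf-8") as daily:
                merged.write(daily.read())
    return len(paths)


# Example call with the paths used in this notebook:
# merge_csv_files("data/data-transfer-day", "data/data-transfer-day/entire_data.csv")
```

The function returns the number of daily files that were merged, which gives a quick sanity check against the number of days scraped.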

# Open data

Load the merged csv file with all transfers into a pandas DataFrame.

In [None]:
# Data of interest for each transfer
transfer_infos = ["Player Name", "Player Link", "Player position", "From club", "To club", "From competition", "To competition", "From manager", "From manager link", "To manager", "To manager link", "Market value", "Fee", "Age"]

In [None]:
df = pd.read_csv("data/data-transfer-day/entire_data.csv", sep='|', names=transfer_infos)
df.head(2)

In [None]:
df.describe()

# Web Scrap Agents

As explained above, the website stores information about the agents of players and managers. We also want to retrieve it, which is why we kept the profile link of each player and manager.

## Player Agents

In [None]:
df['Player Agent'] = None

for pl in log_progress(df['Player Link'].unique()):
    r = requests.get(base_url+pl, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    try:
        agent = soup.find('th', text="Player agents:").find_next().text.replace("\n","")
        df.loc[df['Player Link'] == pl,'Player Agent'] = agent
    except:
        pass

## Manager Agents

In [None]:
fml = df['From manager link'].unique()
print("Nbr of 'From managers':    ",len(fml))
      
tml = df['To manager link'].unique()
print("Nbr of 'To managers':      ",len(tml))

print("Nbr of unique managers:    ", len(list(set(np.concatenate((fml,tml))))))

In [None]:
df_manager = pd.DataFrame(list(set(np.concatenate((fml,tml)))), columns=["ManagerLink"])

df['From manager agent'] = None
df['To manager agent'] = None


for pl in log_progress(df_manager['ManagerLink'].unique(), every=1):
    r = requests.get(base_url+pl, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    try:
        agent = soup.find('span', text="Agents:").find_next().text.replace("\n","")
        df.loc[df['From manager link'] == pl,'From manager agent'] = agent
        df.loc[df['To manager link'] == pl,'To manager agent'] = agent
    except:
        pass

# Save data

Save the data, completed with the agent information, from the pandas DataFrame to a csv file.

In [None]:
df.to_csv("data/data.csv")

# Transform data

One thing remains before building the network of transfers: transfer fees. `Transfermarkt.com` sorts transfers into several categories:
- **"Normal" transfers**: Clubs exchange a player for money.
- **Loans**: The player will come back to his 'original' team.
- **Free**: The player had no contract; the club spends no money to sign him.
- **Swap**: Two clubs exchange two players.

Unfortunately, the website stores the transfer type inside the `Fee` field. We need to split `Fee` into two columns: **Transfer Type** and **Transfer Value** (the amount of money the buying club, *to club*, spends on the player).

Note that for loans, `Transfermarkt.com` stores two transfer activities: the beginning and the end of the loan. We did not retrieve the end-of-loan transfers.

**Start by loading the data**

In [3]:
df = pd.read_csv("data/data.csv", index_col=0)
df.sample(5)

Unnamed: 0,Player Name,Player Link,Player position,From club,To club,From competition,To competition,From manager,From manager link,To manager,To manager link,Market value,Fee,Age,From manager agent,To manager agent,Player Agent
22138,zdenek-linhart,/zdenek-linhart/profil/spieler/186536,Centre-Forward,C. Budejovice,Slavia Prag,Druha Liga,Gambrinus Liga,David Horejs,/david-horejs/profil/trainer/29877,Dušan Uhrin Jr.,/du-scaron-an-uhrin-jr-/profil/trainer/4798,-,?,21 years 09 months 27 days,,,Nehoda Sport
39411,igor-voronkov,/igor-voronkov/profil/spieler/120541,Defensive Midfield,FK Slutsk,Krumkachi,Wyschejschaja Liha,Wyschejschaja Liha,Vyacheslav Grigorov,/vyacheslav-grigorov/profil/trainer/41986,Oleg Dulub,/oleg-dulub/profil/trainer/17235,50 Th. €,Free transfer,35 years 02 months 12 days,,,
33150,joel-sami,/joel-sami/profil/spieler/45166,Centre-Back,Zulte Waregem,US Orléans,Jupiler Pro League,Relegation Ligue 2,Francky Dury,/francky-dury/profil/trainer/1848,Olivier Frapolli,/olivier-frapolli/profil/trainer/24620,800 Th. €,?,31 years 07 months 18 days,,Kemari,no agent
4443,florind-bardulla,/florind-bardulla/profil/spieler/209321,Central Midfield,KF Teuta,KF Vllaznia,Kategoria Superiore,Kategoria Superiore,Gentian Begeja,/gentian-begeja/profil/trainer/25931,Baldassare Raineri,/baldassare-raineri/profil/trainer/5891,175 Th. €,?,22 years 01 month 20 days,,,
22568,milos-spasic,/milos-spasic/profil/spieler/307689,Left-Back,AKA Admira U18,FC Admira II,Jugendliga U18,Regionalliga Ost,Michael Gruber,/michael-gruber/profil/trainer/5298,Rolf Martin Landerl,/rolf-martin-landerl/profil/trainer/26874,-,-,17 years 11 months 03 days,,,Igor Gluscevic


In [4]:
df.describe()

Unnamed: 0,Player Name,Player Link,Player position,From club,To club,From competition,To competition,From manager,From manager link,To manager,To manager link,Market value,Fee,Age,From manager agent,To manager agent,Player Agent
count,44778,44778,44778,44778,44778,44778,44778,44778,44778,44778,44778,44778,44767,44778,6046,5469,22993
unique,31221,31793,16,3600,3474,327,316,5210,5220,5104,5113,141,613,7066,244,233,2015
top,paulinho,/william-palacios/profil/spieler/265982,Centre-Forward,Juventus,Monza,Serie A,Primera División,Massimiliano Allegri,/massimiliano-allegri/profil/trainer/7671,Hüseyin Kalpar,/huseyin-kalpar/profil/trainer/12053,-,Free transfer,23 years 03 months 13 days,no agent,no agent,no agent
freq,18,6,7653,151,55,1599,1121,151,151,50,50,7234,19743,26,593,493,2275


## Create Transfer type

In [5]:
transfer = ['Mill. €', 'Th. €', '1 €']
loan = ['Loan fee:', 'Loan']
free = ['gratuito', 'Gratuito', 'free transfer', 'Draft', 'draft', 'Free transfer', 'nan', 'Libre para traspaso', '-', '?', '0', 'free', 'frei', 'svincolato', 'bez odstępnego', 'a']
swap = ['Swap deal', 'Trade', 'trade', 'Tausch', 'Spielertausch']

In [6]:
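df['Transfer Type'] = None

# Map each transfer type to its keyword list (avoids calling eval on a variable name)
fee_keywords = {"loan": loan, "swap": swap, "transfer": transfer, "free": free}

# Order matters: a fee is only tested against a type if no earlier type matched it
for transfer_type in ["loan", "swap", "transfer", "free"]:
    unclassified = df['Transfer Type'].isnull()
    for fee in log_progress(df[unclassified]['Fee'].unique(), every=1):
        if str(fee) == 'nan':
            continue  # missing fee: the transfer type stays empty

        for type_fee in fee_keywords[transfer_type]:
            if type_fee in fee:
                df.loc[df['Fee']==fee, 'Transfer Type'] = transfer_type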

In [7]:
df['Transfer Type'].value_counts()

free        30994
loan        10773
transfer     2913
swap           87
Name: Transfer Type, dtype: int64

In [8]:
df[df.Fee.isnull()]

Unnamed: 0,Player Name,Player Link,Player position,From club,To club,From competition,To competition,From manager,From manager link,To manager,To manager link,Market value,Fee,Age,From manager agent,To manager agent,Player Agent,Transfer Type
2572,afran-izmailov,/afran-izmailov/profil/spieler/121842,Right Wing,Xäzär Länkäran,Inter Baku,Premyer Liqasi,Premyer Liqasi,Elbeus Mammadov,/elbeus-mammadov/profil/trainer/38668,Kakhaber Tskhadadze,/kakhaber-tskhadadze/profil/trainer/10419,400 Th. €,,26 years 02 months 25 days,,,Cosmosport,
7081,aly-hassan,/aly-hassan/profil/spieler/220993,Centre-Forward,Fort Lauderdale,Ottawa Fury,NASL Fall Championship,NASL Fall Championship,Iván Guerrero,/ivan-guerrero/profil/trainer/40766,Marc Dos Santos,/marc-dos-santos/profil/trainer/13259,50 Th. €,,26 years 01 month 11 days,,Gold World Stars,,
12780,alexander-langen,/alexander-langen/profil/spieler/303311,Right-Back,TSV 1860 U17,Ingolstadt U19,B-Jgd. BL Süd/Südwest,A-Jgd. BL Süd/Südwest,Josef Albersinger,/josef-albersinger/profil/trainer/9904,Roberto Pätzold,/roberto-patzold/profil/trainer/35740,-,,17 years 05 months 02 days,,,BoaVista Consulting GmbH,
24305,dejan-gavric,/dejan-gavric/profil/spieler/330108,Right Midfield,FK Derventa,FK Krupa,Prva Liga RS,Prva Liga RS,Bojan Magazin,/bojan-magazin/profil/trainer/47459,Slobodan Starcevic,/slobodan-starcevic/profil/trainer/25027,25 Th. €,,18 years 06 months 03 days,,,,
26536,dae-ho-kim,/dae-ho-kim/profil/spieler/134743,Keeper,Asan Mugunghwa,Jeonnam Dragons,K-League Challenge,K-League Classic,Heung-sil Lee,/heung-sil-lee/profil/trainer/23161,Sang-rae No,/sang-rae-no/profil/trainer/37358,150 Th. €,,30 years 08 months 16 days,,,,
27310,sinan-jakupovic,/sinan-jakupovic/profil/spieler/284989,Right Midfield,HNK Capljina,Velez Mostar,Prva liga FBiH,Premijer Liga,Damir Borovac,/damir-borovac/profil/trainer/29604,Zijo Tojaga,/zijo-tojaga/profil/trainer/45696,50 Th. €,,21 years 05 months 05 days,,,no agent,
28235,kevin-oliveira,/kevin-oliveira/profil/spieler/312744,Attacking Midfield,Benfica B,Swope Park,Segunda Liga,USL Pro,Hélder Cristóvão,/helder-cristovao/profil/trainer/11055,Marc Dos Santos,/marc-dos-santos/profil/trainer/13259,-,,19 years 07 months 26 days,,Gold World Stars,Soccer Features Limited,
28555,seong-nam-ahn,/seong-nam-ahn/profil/spieler/113532,Attacking Midfield,Gwangju FC,Gyeongnam FC,K-League Classic,K-League Challenge,Ki-Il Nam,/ki-il-nam/profil/trainer/34529,Jong-bu Kim,/jong-bu-kim/profil/trainer/54825,600 Th. €,,31 years 10 months 13 days,,,,
36912,chang-kyun-im,/chang-kyun-im/profil/spieler/263882,Attacking Midfield,Gyeongnam FC,Suwon FC,K-League Challenge,K-League Classic,Jong-bu Kim,/jong-bu-kim/profil/trainer/54825,Deok-Je Cho,/deok-je-cho/profil/trainer/28144,50 Th. €,,26 years 02 months 29 days,,,,
38465,harallamb-qaqi,/harallamb-qaqi/profil/spieler/189690,Centre-Back,Hellas Verona,KF Laçi,Serie B,Kategoria Superiore,Fabio Pecchia,/fabio-pecchia/profil/trainer/19635,Marcello Troisi,/marcello-troisi/profil/trainer/41047,25 Th. €,,22 years 10 months 10 days,,,,


## Transfer fee - integer amount

In [9]:
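df["Transfer Value"] = 0

for fee in df[df['Transfer Type'] == 'transfer']['Fee'].unique():
    value = 0  # reset for each fee so a value from a previous fee is never reused

    if " Mill" in fee:
        # Amount in millions (decimal comma converted to a point)
        value = float(fee.split(" Mill")[0].replace(',','.'))*1000000
    elif " Th." in fee:
        # Amount in thousands
        value = float(fee.split(" Th")[0].replace(',','.'))*1000
    elif "1 €" == fee:
        value = 1

    df.loc[df['Fee']==fee, 'Transfer Value'] = value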
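The two steps above (type first, then value) can also be sketched as one pure function that is easy to test on individual strings. This is an illustration under assumptions: `parse_fee` is a hypothetical name, the keyword lists are shortened versions of the ones defined earlier, and the sample fee strings in the usage lines are made up to exercise each branch.

```python
# Keyword lists shortened for illustration; the notebook's full lists are longer
LOAN = ["Loan fee:", "Loan"]
SWAP = ["Swap deal", "Trade", "Tausch"]
TRANSFER = ["Mill. €", "Th. €", "1 €"]
FREE = ["Free transfer", "free", "-", "?"]


def parse_fee(fee):
    """Return (transfer_type, transfer_value) for one raw Fee string."""
    # Same priority order as the notebook: loan, swap, transfer, free
    for transfer_type, keywords in [("loan", LOAN), ("swap", SWAP),
                                    ("transfer", TRANSFER), ("free", FREE)]:
        if any(keyword in fee for keyword in keywords):
            break
    else:
        return None, 0  # no keyword matched

    if transfer_type != "transfer":
        return transfer_type, 0  # only "normal" transfers get a value

    if " Mill" in fee:
        value = float(fee.split(" Mill")[0].replace(",", ".")) * 1000000
    elif " Th." in fee:
        value = float(fee.split(" Th")[0].replace(",", ".")) * 1000
    elif fee == "1 €":
        value = 1
    else:
        value = 0
    return transfer_type, value


print(parse_fee("500 Th. €"))            # ('transfer', 500000.0)
print(parse_fee("Loan fee: 500 Th. €"))  # ('loan', 0)
print(parse_fee("Free transfer"))        # ('free', 0)
```

Because loans are tested before normal transfers, a loan fee that contains an amount is still classified as a loan, as in the notebook.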

## Players positions

Remove the trailing white space at the end of each `Player position`. This is done with the pandas string method **str.rstrip()**.

In [10]:
positions = df['Player position'].unique()
positions

array(['Central Midfield                            ',
       'Centre-Forward                            ',
       'Right Wing                            ',
       'Keeper                            ',
       'Attacking Midfield                            ',
       'Defensive Midfield                            ',
       'Centre-Back                            ',
       'Left-Back                            ',
       'Right-Back                            ',
       'Left Wing                            ',
       'Left Midfield                            ',
       'Secondary Striker                            ',
       'Right Midfield                            ',
       'Striker                            ',
       'Midfield                            ',
       'Defence                            '], dtype=object)

In [11]:
df['Player position'] = df['Player position'].str.rstrip()

In [12]:
positions_2 = df['Player position'].unique()
positions_2

array(['Central Midfield', 'Centre-Forward', 'Right Wing', 'Keeper',
       'Attacking Midfield', 'Defensive Midfield', 'Centre-Back',
       'Left-Back', 'Right-Back', 'Left Wing', 'Left Midfield',
       'Secondary Striker', 'Right Midfield', 'Striker', 'Midfield',
       'Defence'], dtype=object)

## Age to number

The age of each player is stored as a string. In order to use it in our analysis, we convert it to a float value.

The string is divided into three segments: **years**, **months** and **days**. All strings hold the *years* information, but not all hold the *months* and *days*.

We decided to represent the age as a float number: **year.month**. If the *month* information is missing, the age value is a whole number (`year.00`). Note that `ageAsFloat` divides the month by 12 and then by 10, so the decimal part is `month / 120` rather than a fraction of a year: `25 years 02 months` becomes `25.016667`.

In [13]:
df['Age'].sample(5)

16309     20 years 01 month 22 days 
44615    25 years 02 months 24 days 
994      24 years 08 months 09 days 
27418    19 years 08 months 24 days 
4394     25 years 10 months 06 days 
Name: Age, dtype: object

In [14]:
def ageAsFloat(age):
    year =  int(age.split(' year')[0])
    month = int(age.split('years ')[1].split(' month')[0] if ' month' in age else '0')/12
    return year + (month/10)

In [15]:
df['Age'] = df['Age'].apply(ageAsFloat)

In [16]:
df['Age'].sample(5)

31230    25.016667
40680    28.033333
3492     28.058333
6342     21.075000
37785    23.066667
Name: Age, dtype: float64
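If the age is needed as a true fraction of a year, a variant such as the one below could be used instead. It is a sketch, not the conversion applied in this notebook: `age_in_years` and `AGE_PATTERN` are hypothetical names, days are ignored, and the values it returns differ from the `Age` column shown above.

```python
import re

# "25 years 02 months 24 days": the months part is optional, the days part is ignored
AGE_PATTERN = re.compile(r"(\d+) years?(?: (\d+) months?)?")


def age_in_years(age):
    """Convert an age string to years plus months / 12."""
    match = AGE_PATTERN.match(age.strip())
    if match is None:
        raise ValueError("Unrecognized age string: {!r}".format(age))
    years = int(match.group(1))
    months = int(match.group(2) or 0)
    return years + months / 12


print(age_in_years("25 years 06 months 24 days "))  # 25.5
print(age_in_years("30 years"))                      # 30.0
```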

# Build Network

Now that we have built our dataset, we can create the network of transfers.

The network will be composed of:

- **Nodes**: Each club is represented by a node. Each node stores the `Competition` in which its club is registered.
- **Edges**: Each transfer is represented by an edge. Edges are directed: an edge goes from the selling club to the buying club (`From club` to `To club`). Each edge holds information about the transfer, such as the player name, the managers, the transfer type and the transfer value. Note that there may be multiple edges between two nodes, so we work with a **directed multigraph** (`MultiDiGraph`).
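A toy example may help to see why a `MultiDiGraph` is needed. The club names, competitions and players below are made up for the illustration and do not come from the dataset.

```python
import networkx as nx

demo = nx.MultiDiGraph()

# Made-up clubs, each storing its competition as a node attribute
demo.add_node("Club A", competition="League 1")
demo.add_node("Club B", competition="League 2")

# Two transfers in the same direction create two parallel edges
demo.add_edge("Club A", "Club B", player="player-one", transferType="loan")
demo.add_edge("Club A", "Club B", player="player-two", transferType="free")
# A transfer in the opposite direction is a separate, reversed edge
demo.add_edge("Club B", "Club A", player="player-three", transferType="transfer")

print(demo.number_of_edges("Club A", "Club B"))             # 2
print(demo.out_degree("Club A"), demo.in_degree("Club A"))  # 2 1
```

With a plain `DiGraph`, the second transfer from Club A to Club B would overwrite the attributes of the first instead of adding a new edge.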

In [17]:
G = nx.MultiDiGraph()

## Nodes

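Nodes are clubs. The `competition` attribute of a club is taken from its first transfer as a selling club or, if it never appears as a seller, from its first transfer as a buying club.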

In [18]:
fc = df["From club"].unique()
tc = df["To club"].unique()
all_clubs = list(set(np.concatenate((fc,tc))))

In [19]:
for c in log_progress(all_clubs, every=1):
    clubs = df[df["From club"]==c]
    
    if len(clubs)==0:
        clubs = df[df["To club"]==c]
        competition = clubs[:1]["To competition"].values[0]
    else:
        competition = clubs[:1]["From competition"].values[0]
    
    G.add_node(c, competition=competition)

## Edges

In [20]:
for _, data in log_progress(df.iterrows(), every=10):
    from_club = data["From club"]
    to_club = data["To club"]
    
    player = data["Player Name"]
    player_position = data["Player position"]
    player_age = data["Age"]
    player_agent = data["Player Agent"]
    
    transfer_type = data["Transfer Type"]
    transfer_value = data["Transfer Value"]
    
    from_manager = data["From manager"]
    from_manager_agent = data["From manager agent"]
    to_manager = data["To manager"]
    to_manager_agent = data["To manager agent"]
    
    G.add_edge(from_club, str(to_club),
                  player = str(player),
                  playerPosition = str(player_position),
                  playerAge = str(player_age),
                  transferType = str(transfer_type),
                  transferValue = str(transfer_value),
                  fromManager = str(from_manager),
                  fromManagerAgent = str(from_manager_agent),
                  toManager = str(to_manager),
                  toManagerAgent = str(to_manager_agent)
              )

## Network

In [21]:
print(nx.info(G))

Name: 
Type: MultiDiGraph
Number of nodes: 3904
Number of edges: 44778
Average in degree:  11.4698
Average out degree:  11.4698


## Export

In [22]:
nx.write_gml(G, "networks/all_transfers_network.gml")

## Build Network for each transfer type

In [23]:
for transferType in ["loan", "swap", "transfer", "free"]:

    new_g = nx.MultiDiGraph()
    new_g.name = transferType
    
    for n1,n2,e in G.edges(data=True):
        if e["transferType"] == transferType:
            new_g.add_nodes_from(G.subgraph([n1,n2]).nodes(data=True))
            new_g.add_edge(n1, n2, 
                          player = e['player'],
                          playerPosition = e['playerPosition'],
                          playerAge = e['playerAge'],
                          transferType = e['transferType'],
                          transferValue = e['transferValue'],
                          fromManager = e['fromManager'],
                          fromManagerAgent = e['fromManagerAgent'],
                          toManager = e['toManager'],
                          toManagerAgent = e['toManagerAgent'])
    
    print(nx.info(new_g))
    nx.write_gml(new_g, "networks/transfers_{0}_network.gml".format(transferType))
    print()

Name: loan
Type: MultiDiGraph
Number of nodes: 2664
Number of edges: 10773
Average in degree:   4.0439
Average out degree:   4.0439

Name: swap
Type: MultiDiGraph
Number of nodes: 76
Number of edges: 87
Average in degree:   1.1447
Average out degree:   1.1447

Name: transfer
Type: MultiDiGraph
Number of nodes: 1124
Number of edges: 2913
Average in degree:   2.5916
Average out degree:   2.5916

Name: free
Type: MultiDiGraph
Number of nodes: 3851
Number of edges: 30994
Average in degree:   8.0483
Average out degree:   8.0483
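In each summary, the average in degree equals the average out degree because every directed edge adds one to each; both are simply the number of edges divided by the number of nodes. The short check below reuses the node and edge counts printed in this notebook (the dictionary name `networks` is only for the example).

```python
# (nodes, edges) as printed by nx.info in this notebook
networks = {
    "all": (3904, 44778),
    "loan": (2664, 10773),
    "swap": (76, 87),
    "transfer": (1124, 2913),
    "free": (3851, 30994),
}

# Average in (or out) degree = edges / nodes
average_degree = {name: round(edges / nodes, 4)
                  for name, (nodes, edges) in networks.items()}
for name, value in average_degree.items():
    print(name, value)

# Edges that received a transfer type
typed_edges = sum(edges for name, (nodes, edges) in networks.items() if name != "all")
print(typed_edges)  # 44767
```

The four typed networks hold 44767 edges in total, 11 fewer than the full network; this matches the `count` of non-missing `Fee` values in the `describe()` output.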

