# EDA for Football Transfers

### Tasks
- Clean the dataset
- Transform data
- Data statistics
- Data visualizations
- Relationships extraction, variable (feature) selection

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from general_functions import value_counts, missing_values

**Import the selected libraries (see the README for the installation process)**

In [2]:
df = pd.read_csv("data/fotbal_prestupy_2000_2019.csv")

**Read the data from the CSV file. In this case there is no need to change the encoding (UTF-8) or to define a delimiter.**
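
If the file had used a different separator or encoding, the same call would need explicit arguments. A minimal sketch on an in-memory sample rather than the real file (the semicolon separator and the two sample rows are assumptions made for illustration):

```python
import io
import pandas as pd

# Hypothetical sample: semicolon-separated text standing in for a CSV file
raw = "Jméno;Věk\nLuís Figo;27\nHernán Crespo;25\n"

# sep and encoding are spelled out here; for a comma-separated UTF-8 file
# the pandas defaults already match, so both arguments can be omitted
sample = pd.read_csv(io.BytesIO(raw.encode("utf-8")), sep=";", encoding="utf-8")
print(sample)
```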

In [3]:
df.head()

Unnamed: 0,Jméno,Pozice,Věk,Původní tým,Původní liga,Nový tým,Nová Liga,Sezóna,Odhadovaná hodnota,Přestupová částka
0,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,,60000000
1,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,,56810000
2,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,,40000000
3,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,,36150000
4,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,,34500000


In [4]:
df.shape

(4700, 10)

### First, I check with basic operations whether the import was correct and what the data looks like
1. **Was the data imported correctly?**
    - Yes
2. **Does the shape of the data (dimensions) match the assignment?**
    - Yes -> 4700 x 10
3. **Are the column names aligned with their descriptions?**
    - Name - Jméno
    - Position - Pozice
    - Age - Věk
    - Original team - Původní tým (the team the player was sold from)
    - Original league - Původní liga (the league that team plays in)
    - New team - Nový tým (the team the player is sold to)
    - New league - Nová liga (the league the player is sold to)
    - Season - Sezóna (when the transfer took place)
    - Estimated value - Odhadovaná hodnota (the player's estimated market value, in EUR)
    - Actual value - Přestupová částka (the actual transfer fee, in EUR)
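
The same checks can be written down as code so they are repeatable. The following is only a sketch: `check_import` is a hypothetical helper, not part of `general_functions`, and it is shown on a one-row toy frame instead of the real dataset.

```python
import pandas as pd

def check_import(frame, expected_shape, expected_columns):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []
    if frame.shape != expected_shape:
        problems.append(f"shape is {frame.shape}, expected {expected_shape}")
    missing = [c for c in expected_columns if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    return problems

# Toy frame standing in for the imported data
demo = pd.DataFrame({"Jméno": ["Luís Figo"], "Věk": [27]})
print(check_import(demo, (1, 2), ["Jméno", "Věk", "Pozice"]))
```

For the real data the call would use `(4700, 10)` and the ten Czech column names listed above.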

In [5]:
df.columns

Index(['Jméno', 'Pozice', 'Věk', 'Původní tým', 'Původní liga', 'Nový tým',
       'Nová  Liga', 'Sezóna', 'Odhadovaná hodnota', 'Přestupová částka'],
      dtype='object')

In [6]:
df = df.rename(columns={'Jméno': 'Name', 'Pozice': 'Position', 'Věk': 'Age', 'Původní tým': 'Original Team', 'Původní liga': 'Original League', 'Nový tým': 'New Team', 'Nová  Liga': 'New League', 'Sezóna': 'Season', 'Odhadovaná hodnota': 'Estimated Value', 'Přestupová částka': 'Actual Value'})

In [7]:
df

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value
0,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,,60000000
1,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,,56810000
2,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,,40000000
3,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,,36150000
4,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,,34500000
...,...,...,...,...,...,...,...,...,...,...
4695,Jasmin Kurtic,Attacking Midfield,29,Atalanta,Serie A,SPAL,Serie A,2018-2019,5000000.0,4800000
4696,Tchê Tchê,Central Midfield,25,Palmeiras,Série A,Dynamo Kyiv,Premier Liga,2018-2019,3000000.0,4800000
4697,Silvan Widmer,Right-Back,25,Udinese Calcio,Serie A,FC Basel,Super League,2018-2019,8500000.0,4500000
4698,Yuya Osako,Second Striker,28,1. FC Köln,2.Bundesliga,Werder Bremen,1.Bundesliga,2018-2019,4500000.0,4500000


**I renamed the columns to their English translations.** I also noticed a typo in the column name "Nová  Liga": it contains a double space.
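
Stray whitespace like this can also be normalized before renaming, so the mapping does not have to reproduce the typo. A small sketch on a toy frame with just two of the columns (the toy frame is an assumption; the string methods are standard pandas):

```python
import pandas as pd

# Toy frame reproducing the double space seen in the df.columns output
demo = pd.DataFrame(columns=["Jméno", "Nová  Liga"])

# Collapse runs of whitespace and trim both ends of every column name
demo.columns = demo.columns.str.replace(r"\s+", " ", regex=True).str.strip()

# The mapping can now use the clean spelling
demo = demo.rename(columns={"Jméno": "Name", "Nová Liga": "New League"})
print(list(demo.columns))
```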

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4700 entries, 0 to 4699
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             4700 non-null   object 
 1   Position         4700 non-null   object 
 2   Age              4700 non-null   int64  
 3   Original Team    4700 non-null   object 
 4   Original League  4700 non-null   object 
 5   New Team         4700 non-null   object 
 6   New League       4700 non-null   object 
 7   Season           4700 non-null   object 
 8   Estimated Value  3440 non-null   float64
 9   Actual Value     4700 non-null   int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 367.3+ KB


**`df.info()` gives a quick look at the data types, which seem to be correct. I might check the Estimated Value column, which is stored as float (likely because it contains missing values).**

## Cleaning dataset

In [9]:
missing_values(df)

Name - 0.0%
Position - 0.0%
Age - 0.0%
Original Team - 0.0%
Original League - 0.0%
New Team - 0.0%
New League - 0.0%
Season - 0.0%
Estimated Value - 26.80851063829787%
Actual Value - 0.0%


**The data is relatively clean. Only one column has missing data, "Estimated Value", where almost 27% of the values are missing.**
- Question: is there a reason why this data is missing?
  - Option A: Data for this column was not collected in older seasons such as 2000-2001
  - Option B: It was not important/expected to collect these values (estimated player values)
- This will require further data exploration
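
One way to test Option A is to compute the share of missing estimated values per season. The sketch below uses a four-row toy frame with invented values, so it only illustrates the technique and says nothing about the real distribution:

```python
import numpy as np
import pandas as pd

# Toy data (invented): two seasons with different amounts of missing values
demo = pd.DataFrame({
    "Season": ["2000-2001", "2000-2001", "2018-2019", "2018-2019"],
    "Estimated Value": [np.nan, np.nan, 5_000_000.0, np.nan],
})

# Share of missing estimated values within each season
missing_by_season = demo["Estimated Value"].isna().groupby(demo["Season"]).mean()
print(missing_by_season)
```

If the missing values cluster in the earliest seasons of the real data, that would support Option A.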

In [10]:
value_counts(df)

Value counts for column 'Name':
Name
Alex                  8
Fernando              7
Peter Crouch          7
Craig Bellamy         6
Paulinho              6
                     ..
Marius Wolf           1
Benedikt Höwedes      1
Sergi Gómez           1
Adam Masina           1
Christophe Hérelle    1
Name: count, Length: 3104, dtype: int64


Value counts for column 'Position':
Position
Centre-Forward        1218
Centre-Back            714
Central Midfield       487
Attacking Midfield     426
Defensive Midfield     411
Right Winger           305
Left Winger            267
Left-Back              225
Right-Back             181
Goalkeeper             180
Second Striker         130
Left Midfield           87
Right Midfield          63
Forward                  3
Sweeper                  1
Defender                 1
Midfielder               1
Name: count, dtype: int64


Value counts for column 'Age':
Age
24    536
25    524
23    519
26    481
22    461
27    404
21    371
28    327
20    302


#### Value counts also provide important insights:
- **Name**:
    - Some names consist of a single part (only a first name or only a surname). Such players might be identified with an ID or by combining the name with league and team.
    - Repeated names can also mean that a player made multiple transfers. It is a problem if the name is the same but the other variables differ, so this is worth checking.
- **Position**:
    - Centre-Forward is the position with the most transfers (1218 of 4700). Players in this position are traded often.
    - Midfielder, Defender and Sweeper each appear only once, so these positions were almost never traded.
- **Age**:
    - Most players are traded between the ages of 22 and 27
    - There was a player traded at age 15
    - There was a player with age 0, which is probably a mistake or missing age data. Worth checking
- **Original Team, Original League, New Team, New League**:
    - The data seems to be okay. Nothing appears to be missing or out of pattern
- **Season**:
    - Count of seasons, nothing specific to notice
- **Estimated Value, Actual Value**:
    - Estimated Value is stored as float
    - Nothing specific at first sight
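
The two "worth checking" items can be turned into filters. This is a sketch on invented toy rows; the age bounds of 15 and 45 are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy rows (invented) covering both cases: a shared name and an age of 0
demo = pd.DataFrame({
    "Name": ["Alex", "Alex", "Peter Crouch", "Unknown Player"],
    "Position": ["Attacking Midfield", "Centre-Back", "Centre-Forward", "Goalkeeper"],
    "Age": [23, 22, 21, 0],
})

# Rows whose age falls outside a plausible range (bounds are assumptions)
suspicious_age = demo[(demo["Age"] < 15) | (demo["Age"] > 45)]

# Names that occur with more than one position: possibly different players
positions_per_name = demo.groupby("Name")["Position"].nunique()
ambiguous_names = positions_per_name[positions_per_name > 1].index.tolist()

print(suspicious_age)
print(ambiguous_names)
```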

In [18]:
name_check = df.loc[df["Name"] == "Alex"]

In [14]:
name_check

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value
321,Alex,Attacking Midfield,23,Cruzeiro,Brazil,Parma,Serie A,2001-2002,,8000000
357,Alex,Attacking Midfield,24,Parma,Serie A,Cruzeiro,Brazil,2001-2002,,6000000
1007,Alex,Centre-Back,22,Santos FC,Brazil,Chelsea,Premier League,2004-2005,,11500000
1092,Alex,Attacking Midfield,26,Cruzeiro,Brazil,Fenerbahce,Süper Lig,2004-2005,,4000000
2139,Alex,Attacking Midfield,26,Internacional,Série A,Spartak Moscow,Premier Liga,2008-2009,7500000.0,5000000
2584,Alex,Attacking Midfield,29,Spartak Moscow,Premier Liga,Corinthians,Série A,2010-2011,8000000.0,6000000
2880,Alex,Centre-Back,29,Chelsea,Premier League,Paris SG,Ligue 1,2011-2012,13000000.0,5000000
3099,Alex,Attacking Midfield,30,Corinthians,Série A,Al Gharafa,Stars League,2012-2013,3500000.0,6000000


**This is important. The first two rows are the same player, but the rows after that include other players with the same name. I need to create a new column that identifies players so I can separate them.**
 - First idea -> a unique identifier built from name + position (should not change) + season + age

In [50]:
player_id_map = {}  # (name, position, age, season) -> player ID
current_id = 1

def generate_player_id(row):
    """Assign a player ID, reusing the ID of a matching row from the previous season."""
    global current_id

    key = (row["Name"], row["Position"], row["Age"], row["Season"])

    if key not in player_id_map:
        # Same player one season earlier: age - 1 and both season years shifted back by one
        start_year, end_year = (int(year) for year in row["Season"].split("-"))
        prev_key = (row["Name"], row["Position"], row["Age"] - 1, f"{start_year - 1}-{end_year - 1}")

        if prev_key in player_id_map:
            player_id_map[key] = player_id_map[prev_key]
        else:
            player_id_map[key] = f"Player_{current_id}"
            current_id += 1
    return player_id_map[key]

In [54]:
df["Player ID"] = df.apply(generate_player_id, axis=1)


In [55]:
name_check = df.loc[df["Name"] == "Alex"]
name_check

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value,Player ID
321,Alex,Attacking Midfield,23,Cruzeiro,Brazil,Parma,Serie A,2001-2002,,8000000,Player_318
357,Alex,Attacking Midfield,24,Parma,Serie A,Cruzeiro,Brazil,2001-2002,,6000000,Player_351
1007,Alex,Centre-Back,22,Santos FC,Brazil,Chelsea,Premier League,2004-2005,,11500000,Player_964
1092,Alex,Attacking Midfield,26,Cruzeiro,Brazil,Fenerbahce,Süper Lig,2004-2005,,4000000,Player_1043
2139,Alex,Attacking Midfield,26,Internacional,Série A,Spartak Moscow,Premier Liga,2008-2009,7500000.0,5000000,Player_2007
2584,Alex,Attacking Midfield,29,Spartak Moscow,Premier Liga,Corinthians,Série A,2010-2011,8000000.0,6000000,Player_2416
2880,Alex,Centre-Back,29,Chelsea,Premier League,Paris SG,Ligue 1,2011-2012,13000000.0,5000000,Player_2692
3099,Alex,Attacking Midfield,30,Corinthians,Série A,Al Gharafa,Stars League,2012-2013,3500000.0,6000000,Player_2892


In [47]:
df["Player ID"].value_counts()

Player ID
Player_521     3
Player_3214    3
Player_2919    3
Player_3058    3
Player_2101    3
              ..
Player_15      1
Player_14      1
Player_13      1
Player_12      1
Player_11      1
Name: count, Length: 4310, dtype: int64

In [48]:
player_check = df.loc[df["Player ID"] == "Player_521"]
player_check

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value,Player ID
546,Émerson,Defensive Midfield,30,Dep. La Coruña,LaLiga,Atlético Madrid,LaLiga,2002-2003,,6500000,Player_521
1488,Émerson,Defensive Midfield,30,Juventus,Serie B,Real Madrid,LaLiga,2006-2007,26000000.0,16000000,Player_521
1894,Émerson,Defensive Midfield,31,Real Madrid,LaLiga,AC Milan,Serie A,2007-2008,16000000.0,5000000,Player_521
