# EDA for Football Transfers

### Tasks
- Clean the dataset
- Transform data
- Data statistics
- Data visualizations
- Relationships extraction, variable (feature) selection

In [86]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from general_functions import value_counts, missing_values

**Import the selected libraries (see the README for the installation process)**
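The two helpers imported from `general_functions` are not shown in this notebook. Judging only from the output they print further down, they might look roughly like the sketch below. The bodies are an assumption, and `missing_percentages` is a hypothetical extra function added so the numbers can be inspected without printing.

```python
import pandas as pd


def missing_percentages(frame: pd.DataFrame) -> dict:
    """Return the percentage of missing values per column (hypothetical helper)."""
    return {column: float(frame[column].isna().mean() * 100) for column in frame.columns}


def missing_values(frame: pd.DataFrame) -> None:
    """Print one 'column - percent%' line per column, as in the output below."""
    for column, share in missing_percentages(frame).items():
        print(f"{column} - {share}%")


def value_counts(frame: pd.DataFrame) -> None:
    """Print the value counts of every column, one block per column."""
    for column in frame.columns:
        print(f"Value counts for column '{column}':")
        print(frame[column].value_counts())
        print()


# Tiny made-up frame, only to show the call
demo = pd.DataFrame({"Name": ["A", "B"], "Estimated Value": [1.0, None]})
missing_values(demo)
```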

In [87]:
df = pd.read_csv("data/fotbal_prestupy_2000_2019.csv")

**Read the data from the CSV file. In this case there is no need to change the encoding (UTF-8) or to define a delimiter.**

In [88]:
df.head()

Unnamed: 0,Jméno,Pozice,Věk,Původní tým,Původní liga,Nový tým,Nová Liga,Sezóna,Odhadovaná hodnota,Přestupová částka
0,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,,60000000
1,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,,56810000
2,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,,40000000
3,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,,36150000
4,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,,34500000


In [89]:
df.shape

(4700, 10)

### First, I check with basic operations whether the import worked and what the data looks like
1. **Was the data imported correctly?**
    - Yes
2. **Does the shape of the data (dimensions) match the assignment?**
    - Yes -> 4700 x 10
3. **Do the column names align with their descriptions?**
    - Name - Jméno
    - Position - Pozice
    - Age - Věk
    - Original team - Původní tým (the team the player was sold from)
    - Original league - Původní liga (the league the original team plays in)
    - New team - Nový tým (the team the player is sold to)
    - New league - Nová liga (the league the player is sold to)
    - Season - Sezóna (when the transfer took place)
    - Estimated value - Odhadovaná hodnota (the player's estimated market value, EUR)
    - Actual value - Přestupová částka (the actual transfer fee, EUR)

In [90]:
df.columns

Index(['Jméno', 'Pozice', 'Věk', 'Původní tým', 'Původní liga', 'Nový tým',
       'Nová  Liga', 'Sezóna', 'Odhadovaná hodnota', 'Přestupová částka'],
      dtype='object')

In [91]:
df = df.rename(columns={'Jméno': 'Name', 'Pozice': 'Position', 'Věk': 'Age', 'Původní tým': 'Original Team', 'Původní liga': 'Original League', 'Nový tým': 'New Team', 'Nová  Liga': 'New League', 'Sezóna': 'Season', 'Odhadovaná hodnota': 'Estimated Value', 'Přestupová částka': 'Actual Value'})

In [92]:
df

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value
0,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,,60000000
1,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,,56810000
2,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,,40000000
3,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,,36150000
4,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,,34500000
...,...,...,...,...,...,...,...,...,...,...
4695,Jasmin Kurtic,Attacking Midfield,29,Atalanta,Serie A,SPAL,Serie A,2018-2019,5000000.0,4800000
4696,Tchê Tchê,Central Midfield,25,Palmeiras,Série A,Dynamo Kyiv,Premier Liga,2018-2019,3000000.0,4800000
4697,Silvan Widmer,Right-Back,25,Udinese Calcio,Serie A,FC Basel,Super League,2018-2019,8500000.0,4500000
4698,Yuya Osako,Second Striker,28,1. FC Köln,2.Bundesliga,Werder Bremen,1.Bundesliga,2018-2019,4500000.0,4500000


**I renamed the columns to their English translations.** I also noticed a typo in the column name "Nová  Liga": it contains a double space, so the rename key has to include it.
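A typo like the double space is easy to miss when typing the rename mapping by hand. One way to guard against it, sketched here under the assumption that stray whitespace is the only problem, is to normalize the column names before renaming. `normalize_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names are trimmed and have single spaces only."""
    cleaned = frame.columns.str.strip().str.replace(r"\s+", " ", regex=True)
    result = frame.copy()
    result.columns = cleaned
    return result


# Empty demo frame with the problematic header from this dataset
demo = pd.DataFrame(columns=["Nová  Liga", " Sezóna "])
print(list(normalize_columns(demo).columns))
```

After this step the mapping could use the key 'Nová Liga' with a single space.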

In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4700 entries, 0 to 4699
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             4700 non-null   object 
 1   Position         4700 non-null   object 
 2   Age              4700 non-null   int64  
 3   Original Team    4700 non-null   object 
 4   Original League  4700 non-null   object 
 5   New Team         4700 non-null   object 
 6   New League       4700 non-null   object 
 7   Season           4700 non-null   object 
 8   Estimated Value  3440 non-null   float64
 9   Actual Value     4700 non-null   int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 367.3+ KB


**`info()` gives a quick overview of the data types, which seem to be correct. I may check whether the Estimated Value column actually contains fractional values, since it is stored as float.**
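pandas stores an integer column that contains missing values as `float64`, so the float type alone does not mean there are fractional amounts. A minimal sketch of the check mentioned above, using a made-up series in place of the real column (`is_whole_number_column` is a hypothetical name):

```python
import pandas as pd


def is_whole_number_column(series: pd.Series) -> bool:
    """True if every non-missing value has no fractional part."""
    present = series.dropna()
    return bool((present % 1 == 0).all())


# Made-up values in the same format as "Estimated Value"
demo = pd.Series([5000000.0, None, 8500000.0], name="Estimated Value")

if is_whole_number_column(demo):
    # Nullable integer dtype keeps the missing values as <NA>
    demo = demo.astype("Int64")

print(demo.dtype)
```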

## Cleaning dataset

In [94]:
missing_values(df)

Name - 0.0%
Position - 0.0%
Age - 0.0%
Original Team - 0.0%
Original League - 0.0%
New Team - 0.0%
New League - 0.0%
Season - 0.0%
Estimated Value - 26.80851063829787%
Actual Value - 0.0%


**The data is relatively clean. Only one column, "Estimated Value", has missing data: almost 27% of its values are missing.**
- Question: is there a reason why this data is missing?
  - Option A: The players were not valued
  - Option B: Collecting these values (the players' estimated values) was not considered important or expected
- This will require further data exploration
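One way to start that exploration is to look at the share of missing estimates per season, which would show whether the gaps cluster in particular years. The sketch below uses a small made-up frame, and `missing_share_by` is a hypothetical helper.

```python
import pandas as pd


def missing_share_by(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.Series:
    """Percentage of missing values in value_col for each group in group_col."""
    is_missing = frame[value_col].isna()
    return is_missing.groupby(frame[group_col]).mean().mul(100).round(1)


# Made-up rows, only to show the shape of the result
demo = pd.DataFrame({
    "Season": ["2000-2001", "2000-2001", "2018-2019", "2018-2019"],
    "Estimated Value": [None, None, 5_000_000.0, None],
})
share = missing_share_by(demo, "Season", "Estimated Value")
print(share)
```

On the real data the call would be `missing_share_by(df, "Season", "Estimated Value")`.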

In [134]:
value_counts(df)

Value counts for column 'Name':
Name
Alex            8
Peter Crouch    7
Fernando        7
Éder            6
Sokratis        6
               ..
Sidcley         1
Fabrício        1
Johan Mojica    1
Lucas           1
Gerard López    1
Name: count, Length: 3103, dtype: int64


Value counts for column 'Position':
Position
Centre-Forward        1217
Centre-Back            714
Central Midfield       487
Attacking Midfield     426
Defensive Midfield     411
Right Winger           305
Left Winger            267
Left-Back              225
Right-Back             181
Goalkeeper             180
Second Striker         130
Left Midfield           87
Right Midfield          63
Forward                  3
Sweeper                  1
Defender                 1
Midfielder               1
Name: count, dtype: int64


Value counts for column 'Age':
Age
24    536
25    524
23    519
26    481
22    461
27    404
21    371
28    327
20    302
29    223
19    165
30    157
18     82
31     59
32     30
17    

#### Value counts also provide important insights:
- **Name**:
    - Some names are incomplete (only a first name or only a surname). Such players could be identified with an ID or with a combination of league and team.
    - Repeated names can also mean that a player made multiple transfers. It becomes a problem if the name is the same but the other variables differ. Worth checking
- **Position**:
    - Centre-Forward is the position with the most transfers. Players in this position are transferred often.
    - Midfielder, Defender, and Sweeper almost never appear (one transfer each).
- **Age**:
    - Most players are transferred between the ages of 22 and 27
    - There is a player transferred at age 15
    - There is a player with age 0, which is probably a mistake or missing age data. Worth checking
- **Original Team, Original League, New Team, New League**:
    - The data seems to be okay. Nothing appears to be missing or out of pattern
-  **Season**:
    - Counts per season, nothing specific to notice
- **Estimated Value, Actual Value**:
    - Estimated Value is stored as float
    - At first sight, nothing specific

### Assigning unique Id

In [95]:
name_check = df.loc[df["Name"] == "Alex"]

In [96]:
name_check

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value
321,Alex,Attacking Midfield,23,Cruzeiro,Brazil,Parma,Serie A,2001-2002,,8000000
357,Alex,Attacking Midfield,24,Parma,Serie A,Cruzeiro,Brazil,2001-2002,,6000000
1007,Alex,Centre-Back,22,Santos FC,Brazil,Chelsea,Premier League,2004-2005,,11500000
1092,Alex,Attacking Midfield,26,Cruzeiro,Brazil,Fenerbahce,Süper Lig,2004-2005,,4000000
2139,Alex,Attacking Midfield,26,Internacional,Série A,Spartak Moscow,Premier Liga,2008-2009,7500000.0,5000000
2584,Alex,Attacking Midfield,29,Spartak Moscow,Premier Liga,Corinthians,Série A,2010-2011,8000000.0,6000000
2880,Alex,Centre-Back,29,Chelsea,Premier League,Paris SG,Ligue 1,2011-2012,13000000.0,5000000
3099,Alex,Attacking Midfield,30,Corinthians,Série A,Al Gharafa,Stars League,2012-2013,3500000.0,6000000


**This is important. I can see that the first two rows are the same player, but the third row is a different player. I need to create a new column that identifies players so I can separate them.**
 - First idea -> the unique identifier is a combination of name + position (should not change) + season + age
     - *This was not correct. Age can change within a season, so the same player would not be recognized.*
     - *I also realised that a team's league can change between seasons. A player may join a team in, let's say, Serie A, but next season the team could be in Serie B because it was relegated after a poor season.*
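The problem with the first idea can be reproduced with the two "Alex" rows from the output above (index 321 and 357). Both belong to the same player and season, but the age differs, so a key built from name, position, season, and age splits them. The variable names are illustrative.

```python
import pandas as pd

# The two rows are copied from the "Alex" output above
transfers = pd.DataFrame({
    "Name": ["Alex", "Alex"],
    "Position": ["Attacking Midfield", "Attacking Midfield"],
    "Season": ["2001-2002", "2001-2002"],
    "Age": [23, 24],
})

# Naive composite key: name + position + season + age
naive_key = (
    transfers["Name"] + "|" + transfers["Position"] + "|"
    + transfers["Season"] + "|" + transfers["Age"].astype(str)
)

# Two distinct keys for what is one player
print(naive_key.nunique())
```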

In [97]:
player_id_map = {}        # player ID -> {"name", "position"}
latest_transfer_map = {}  # player ID -> team the player most recently moved to
current_id = 1


def assign_player_id(row):
    """Return the ID of a known player whose latest team matches, else a new ID."""
    global current_id

    name, position, original_team = row["Name"], row["Position"], row["Original Team"]
    new_team = row["New Team"]

    # Same name and position, and the player's last known team is the selling team:
    # treat this row as the next transfer of that player.
    for player_id, info in player_id_map.items():
        if info["name"] == name and info["position"] == position:
            if latest_transfer_map[player_id] == original_team:
                latest_transfer_map[player_id] = new_team
                return player_id

    # No match was found, so create a new ID
    new_player_id = "Player_" + str(current_id)
    current_id += 1

    player_id_map[new_player_id] = {"name": name, "position": position}
    latest_transfer_map[new_player_id] = new_team

    return new_player_id

In [98]:
df["Player ID"] = df.apply(assign_player_id, axis=1)


In [99]:
name_check = df.loc[(df["Name"] == "Alex")]
name_check

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value,Player ID
321,Alex,Attacking Midfield,23,Cruzeiro,Brazil,Parma,Serie A,2001-2002,,8000000,Player_316
357,Alex,Attacking Midfield,24,Parma,Serie A,Cruzeiro,Brazil,2001-2002,,6000000,Player_316
1007,Alex,Centre-Back,22,Santos FC,Brazil,Chelsea,Premier League,2004-2005,,11500000,Player_904
1092,Alex,Attacking Midfield,26,Cruzeiro,Brazil,Fenerbahce,Süper Lig,2004-2005,,4000000,Player_316
2139,Alex,Attacking Midfield,26,Internacional,Série A,Spartak Moscow,Premier Liga,2008-2009,7500000.0,5000000,Player_1649
2584,Alex,Attacking Midfield,29,Spartak Moscow,Premier Liga,Corinthians,Série A,2010-2011,8000000.0,6000000,Player_1649
2880,Alex,Centre-Back,29,Chelsea,Premier League,Paris SG,Ligue 1,2011-2012,13000000.0,5000000,Player_904
3099,Alex,Attacking Midfield,30,Corinthians,Série A,Al Gharafa,Stars League,2012-2013,3500000.0,6000000,Player_1649


 - **Second idea -> the unique identifier is a combination of name and position. If they match, compare the original team of the current row with the new team from the player's previous occurrence.**
     - I changed the approach and looked at how I myself judged whether two rows are the same player: I compared the original team with the new team from the previous occurrence.
     - Simply put, if the name and position match an earlier record, I compare the row's original team with the new team stored for that player in latest_transfer_map. If they match, it is the same player, and the stored team is updated to the row's new team.
     - If no match is found, I assign the player a new ID
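The matching rule described above scans every known player for each row. A possible alternative, sketched here as an assumption and not as the notebook's method, indexes players by (name, position, current team) so each row needs only one dictionary lookup. `assign_player_ids` and `waiting` are hypothetical names. Ties between players with an identical name, position, and team are resolved by arrival order, which may differ from the loop version, and both versions rely on the rows being in roughly chronological order.

```python
from collections import defaultdict, deque

import pandas as pd


def assign_player_ids(frame: pd.DataFrame) -> pd.Series:
    """Assign IDs by following each player from team to team."""
    # (name, position, current team) -> IDs of players currently at that team
    waiting = defaultdict(deque)
    ids = []
    next_id = 1
    for name, position, old_team, new_team in zip(
        frame["Name"], frame["Position"], frame["Original Team"], frame["New Team"]
    ):
        queue = waiting[(name, position, old_team)]
        if queue:
            # A known player is leaving the team we last saw them join
            player_id = queue.popleft()
        else:
            player_id = f"Player_{next_id}"
            next_id += 1
        waiting[(name, position, new_team)].append(player_id)
        ids.append(player_id)
    return pd.Series(ids, index=frame.index, name="Player ID")


# First five "Alex" rows from the output above
demo = pd.DataFrame({
    "Name": ["Alex"] * 5,
    "Position": ["Attacking Midfield", "Attacking Midfield", "Centre-Back",
                 "Attacking Midfield", "Attacking Midfield"],
    "Original Team": ["Cruzeiro", "Parma", "Santos FC", "Cruzeiro", "Internacional"],
    "New Team": ["Parma", "Cruzeiro", "Chelsea", "Fenerbahce", "Spartak Moscow"],
})
demo_ids = assign_player_ids(demo)
print(demo_ids.tolist())
```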

In [100]:
df["Player ID"].value_counts()

Player ID
Player_55      6
Player_900     6
Player_329     6
Player_394     6
Player_321     6
              ..
Player_615     1
Player_616     1
Player_19      1
Player_617     1
Player_3301    1
Name: count, Length: 3301, dtype: int64

In [101]:
player_check = df.loc[df["Player ID"] == "Player_394"]
player_check

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value,Player ID
408,Alberto Gilardino,Centre-Forward,19,Piacenza,Serie A,Hellas Verona,Serie A,2001-2002,,3870000,Player_394
516,Alberto Gilardino,Centre-Forward,20,Hellas Verona,Serie B,Parma,Serie A,2002-2003,,12000000,Player_394
1236,Alberto Gilardino,Centre-Forward,23,Parma,Serie A,AC Milan,Serie A,2005-2006,20000000.0,25000000,Player_394
2006,Alberto Gilardino,Centre-Forward,26,AC Milan,Serie A,Fiorentina,Serie A,2008-2009,17000000.0,14000000,Player_394
2806,Alberto Gilardino,Centre-Forward,29,Fiorentina,Serie A,Genoa,Serie A,2011-2012,16000000.0,8000000,Player_394
3643,Alberto Gilardino,Centre-Forward,32,Genoa,Serie A,GZ Evergrande,Super League,2014-2015,5000000.0,5500000,Player_394


**I did a few random spot checks to confirm that the rows assigned to one ID belong to the same player.**
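Spot checks can be complemented with a simple automated sanity check: within one Player ID, the age should never decrease from one season to a later one. The sketch below is one such check under that assumption; `ids_with_decreasing_age` is a hypothetical helper. Sorting by the "Season" string works here because the values have the form "YYYY-YYYY".

```python
import pandas as pd


def ids_with_decreasing_age(frame: pd.DataFrame) -> list:
    """Return Player IDs whose age drops between consecutive transfers."""
    ordered = frame.sort_values("Season", kind="stable")
    age_drop = ordered.groupby("Player ID")["Age"].diff() < 0
    return sorted(ordered.loc[age_drop, "Player ID"].unique())


# Player_394 rows are taken from the output above; Player_X is a made-up bad case
demo = pd.DataFrame({
    "Player ID": ["Player_394", "Player_394", "Player_X", "Player_X"],
    "Season": ["2001-2002", "2002-2003", "2004-2005", "2008-2009"],
    "Age": [19, 20, 26, 22],
})
suspicious = ids_with_decreasing_age(demo)
print(suspicious)
```

An ID flagged by this check most likely merges two different players with the same name and position.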

In [102]:
df["Estimated Value"].value_counts()

Estimated Value
5000000.0      225
6000000.0      194
4000000.0      168
10000000.0     161
3000000.0      160
              ... 
6250000.0        1
70000000.0       1
1150000.0        1
90000000.0       1
120000000.0      1
Name: count, Length: 180, dtype: int64

In [103]:
nan_rows = df[df["Estimated Value"].isna()]

In [104]:
nan_rows

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value,Player ID
0,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,,60000000,Player_1
1,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,,56810000,Player_2
2,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,,40000000,Player_3
3,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,,36150000,Player_4
4,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,,34500000,Player_5
...,...,...,...,...,...,...,...,...,...,...,...
4355,Douglas Luiz,Central Midfield,19,Vasco da Gama,Série A,Man City,Premier League,2017-2018,,12000000,Player_3041
4427,Jadon Sancho,Left Winger,17,Man City U18,U18 Premier League,Bor. Dortmund,1.Bundesliga,2017-2018,,7840000,Player_3097
4618,Davide Bettella,Centre-Back,18,Inter,Serie A,Atalanta,Serie A,2018-2019,,7000000,Player_3230
4648,William Bianda,Centre-Back,18,Lens,Ligue 2,AS Roma,Serie A,2018-2019,,6000000,Player_3256


### Remove faulty rows

In [105]:
low_age_rows = df[df["Age"] < 15]
low_age_rows

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value,Player ID
236,Marzouq Al-Otaibi,Centre-Forward,0,Shabab,Saudi Arabia,Ittihad,Saudi Arabia,2000-2001,,2000000,Player_236


In [106]:
df = df.drop(low_age_rows.index)

**Removing the row with incorrect data**
- The youngest player transferred was 15 years old, which still seems plausible, so the condition is set to under 15.
- The player with age 0 is probably a mistake made when the record was entered. There are two ways to deal with this problem:
    - Fix the player's age -> A quick search finds the player profile https://www.transfermarkt.com/marzouq-al-otaibi/transfers/spieler/28152/transfer_id/765037 . Born Nov. 1975 and transferred Jul. 2000, so his age at the time of the transfer was 24.
    - Drop the incorrect row -> Assume that I cannot determine the player's age. For example, a transaction dataset in a bank must be precise: if I don't have correct data, I don't include it.
- **In this case I decided to remove the row; it has no meaningful impact on the overall statistics**
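For comparison, the first option (fixing the age and keeping the row) could look like the sketch below. It uses the birth and transfer months quoted above, works at month precision only, and `age_at` is a hypothetical helper.

```python
import pandas as pd


def age_at(birth_year: int, birth_month: int, event_year: int, event_month: int) -> int:
    """Age in whole years at the event, using month precision only."""
    had_birthday = event_month >= birth_month
    return event_year - birth_year - (0 if had_birthday else 1)


# One-row stand-in for the faulty record
demo = pd.DataFrame({"Name": ["Marzouq Al-Otaibi"], "Age": [0]})

# Born Nov. 1975, transferred Jul. 2000: the birthday had not yet passed
fixed_age = age_at(1975, 11, 2000, 7)
demo.loc[demo["Name"] == "Marzouq Al-Otaibi", "Age"] = fixed_age
print(demo)
```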

In [133]:
filtered_rows = df[
    (
        df["Original League"].str.contains(r"série a", case=False) &
        df["New League"].str.contains(r"serie a", case=False)
    ) |
    (
        df["Original League"].str.contains(r"serie a", case=False) &
        df["New League"].str.contains(r"série a", case=False)
    )
]

In [132]:
filtered_rows.head(30)

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value,Player ID
1158,Júlio César,Goalkeeper,25,Flamengo,Série A,Inter,Serie A,2004-2005,2000000.0,2450000,Player_1008
1738,Alexandre Pato,Centre-Forward,17,Internacional,Série A,AC Milan,Serie A,2007-2008,4000000.0,24000000,Player_1402
2033,Thiago Silva,Centre-Back,24,Fluminense,Série A,AC Milan,Serie A,2008-2009,7500000.0,10000000,Player_1570
2507,Hernanes,Attacking Midfield,25,São Paulo,Série A,Lazio,Serie A,2010-2011,9000000.0,13500000,Player_1892
2655,Mario Bolatti,Defensive Midfield,25,Fiorentina,Serie A,Internacional,Série A,2010-2011,5500000.0,4000000,Player_1872
2704,Neto,Goalkeeper,21,Atlético-PR,Série A,Fiorentina,Serie A,2010-2011,1800000.0,3500000,Player_2027
2717,Ronaldinho,Attacking Midfield,30,AC Milan,Serie A,Flamengo,Série A,2010-2011,27500000.0,3000000,Player_366
2893,Jonathan,Right-Back,25,Santos FC,Série A,Inter,Serie A,2011-2012,4500000.0,5000000,Player_2135
2921,Zé Love,Second Striker,23,Santos FC,Série A,Genoa,Serie A,2011-2012,3000000.0,4500000,Player_2153
2951,Juan Jesus,Centre-Back,20,Internacional,Série A,Inter,Serie A,2011-2012,3000000.0,3800000,Player_2176


**Cleaning values**
- Here I was concerned that e.g. "Série A" and "Serie A" might be the same value entered with a typo or by a different input method, so I wrote a condition to compare them.
    - Apparently "Série A" and "Serie A" are different leagues. The pattern is that "Série A" is always associated with teams based in Brazil, while "Serie A" teams are based in Italy.
    - The data is OK
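The same concern applies to any pair of league names that differ only by accents, case, or spacing. A minimal sketch for finding such candidates automatically, so they can then be reviewed by hand as above (`fold` and `near_duplicates` are hypothetical helpers):

```python
import unicodedata
from collections import defaultdict


def fold(text: str) -> str:
    """Strip accents, lowercase, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def near_duplicates(values) -> dict:
    """Group distinct values that become identical after folding."""
    groups = defaultdict(set)
    for value in values:
        groups[fold(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# League names that appear in the outputs above
leagues = ["Serie A", "Série A", "LaLiga", "Ligue 1", "Premier League"]
candidates = near_duplicates(leagues)
print(candidates)
```

A hit only marks a pair for review. As the Brazilian and Italian leagues show, two similar names can both be correct.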

In [161]:
df["Actual Value (mil)"] = df["Actual Value"] / 1_000_000

In [162]:
df

Unnamed: 0,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value,Player ID,Actual Value (mil)
0,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,,60000000,Player_1,60.00
1,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,,56810000,Player_2,56.81
2,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,,40000000,Player_3,40.00
3,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,,36150000,Player_4,36.15
4,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,,34500000,Player_5,34.50
...,...,...,...,...,...,...,...,...,...,...,...,...
4695,Jasmin Kurtic,Attacking Midfield,29,Atalanta,Serie A,SPAL,Serie A,2018-2019,5000000.0,4800000,Player_3297,4.80
4696,Tchê Tchê,Central Midfield,25,Palmeiras,Série A,Dynamo Kyiv,Premier Liga,2018-2019,3000000.0,4800000,Player_3298,4.80
4697,Silvan Widmer,Right-Back,25,Udinese Calcio,Serie A,FC Basel,Super League,2018-2019,8500000.0,4500000,Player_3299,4.50
4698,Yuya Osako,Second Striker,28,1. FC Köln,2.Bundesliga,Werder Bremen,1.Bundesliga,2018-2019,4500000.0,4500000,Player_3300,4.50


In [195]:
df.columns

Index(['Name', 'Position', 'Age', 'Original Team', 'Original League',
       'New Team', 'New League', 'Season', 'Estimated Value', 'Actual Value',
       'Player ID', 'Actual Value (mil)'],
      dtype='object')

In [200]:
df_columns_new = ['Player ID', 'Name', 'Position', 'Age', 'Original Team', 'Original League',
       'New Team', 'New League', 'Season', 'Estimated Value', 'Actual Value (mil)']

In [201]:
df_cleaned = df[df_columns_new]

In [206]:
df_cleaned

Unnamed: 0,Player ID,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value (mil)
0,Player_1,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,,60.00
1,Player_2,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,,56.81
2,Player_3,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,,40.00
3,Player_4,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,,36.15
4,Player_5,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,,34.50
...,...,...,...,...,...,...,...,...,...,...,...
4695,Player_3297,Jasmin Kurtic,Attacking Midfield,29,Atalanta,Serie A,SPAL,Serie A,2018-2019,5000000.0,4.80
4696,Player_3298,Tchê Tchê,Central Midfield,25,Palmeiras,Série A,Dynamo Kyiv,Premier Liga,2018-2019,3000000.0,4.80
4697,Player_3299,Silvan Widmer,Right-Back,25,Udinese Calcio,Serie A,FC Basel,Super League,2018-2019,8500000.0,4.50
4698,Player_3300,Yuya Osako,Second Striker,28,1. FC Köln,2.Bundesliga,Werder Bremen,1.Bundesliga,2018-2019,4500000.0,4.50


In [211]:
df_cleaned_not_na = df_cleaned.copy()

In [212]:
df_cleaned_not_na["Estimated Value"] = df_cleaned_not_na["Estimated Value"].fillna(0)

In [213]:
df_cleaned_not_na

Unnamed: 0,Player ID,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value (mil)
0,Player_1,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,0.0,60.00
1,Player_2,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,0.0,56.81
2,Player_3,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,0.0,40.00
3,Player_4,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,0.0,36.15
4,Player_5,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,0.0,34.50
...,...,...,...,...,...,...,...,...,...,...,...
4695,Player_3297,Jasmin Kurtic,Attacking Midfield,29,Atalanta,Serie A,SPAL,Serie A,2018-2019,5000000.0,4.80
4696,Player_3298,Tchê Tchê,Central Midfield,25,Palmeiras,Série A,Dynamo Kyiv,Premier Liga,2018-2019,3000000.0,4.80
4697,Player_3299,Silvan Widmer,Right-Back,25,Udinese Calcio,Serie A,FC Basel,Super League,2018-2019,8500000.0,4.50
4698,Player_3300,Yuya Osako,Second Striker,28,1. FC Köln,2.Bundesliga,Werder Bremen,1.Bundesliga,2018-2019,4500000.0,4.50


In [214]:
df_cleaned

Unnamed: 0,Player ID,Name,Position,Age,Original Team,Original League,New Team,New League,Season,Estimated Value,Actual Value (mil)
0,Player_1,Luís Figo,Right Winger,27,FC Barcelona,LaLiga,Real Madrid,LaLiga,2000-2001,,60.00
1,Player_2,Hernán Crespo,Centre-Forward,25,Parma,Serie A,Lazio,Serie A,2000-2001,,56.81
2,Player_3,Marc Overmars,Left Winger,27,Arsenal,Premier League,FC Barcelona,LaLiga,2000-2001,,40.00
3,Player_4,Gabriel Batistuta,Centre-Forward,31,Fiorentina,Serie A,AS Roma,Serie A,2000-2001,,36.15
4,Player_5,Nicolas Anelka,Centre-Forward,21,Real Madrid,LaLiga,Paris SG,Ligue 1,2000-2001,,34.50
...,...,...,...,...,...,...,...,...,...,...,...
4695,Player_3297,Jasmin Kurtic,Attacking Midfield,29,Atalanta,Serie A,SPAL,Serie A,2018-2019,5000000.0,4.80
4696,Player_3298,Tchê Tchê,Central Midfield,25,Palmeiras,Série A,Dynamo Kyiv,Premier Liga,2018-2019,3000000.0,4.80
4697,Player_3299,Silvan Widmer,Right-Back,25,Udinese Calcio,Serie A,FC Basel,Super League,2018-2019,8500000.0,4.50
4698,Player_3300,Yuya Osako,Second Striker,28,1. FC Köln,2.Bundesliga,Werder Bremen,1.Bundesliga,2018-2019,4500000.0,4.50


In [215]:
df_cleaned_not_na.describe()

Unnamed: 0,Age,Estimated Value,Actual Value (mil)
count,4699.0,4699.0,4699.0
mean,24.343903,6312257.0,9.449171
std,3.211578,8438651.0,10.438264
min,15.0,0.0,0.825
25%,22.0,0.0,4.0
50%,24.0,4000000.0,6.5
75%,27.0,8100000.0,10.82
max,35.0,120000000.0,222.0


In [216]:
df_cleaned_not_na.to_csv("df_cleaned_not_na.csv", index=False)
df_cleaned.to_csv("df_cleaned.csv", index=False)