# Merging and Joining
In this notebook, we build the master dataframe that the rest of the analysis relies on.

It merges the transfer data from Transfermarkt with the country-wise club data from FBref.com, which allows us to do a country-wise analysis.

### Standard Python + R setup and imports

This notebook runs both Python and R so that I can test visualisations in R as well.
It also imports fuzzy_pandas for the fuzzy joins later on.

In [17]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import fuzzy_pandas as fpd

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [18]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [19]:
%%R

# My commonly used R imports
require('tidyverse')

## Clean up

Add the country of the club involved in each transfer. How do we add that country for clubs from over 30 countries?

In [53]:
# This is not the original df; I have overwritten it
prem_df = pd.read_csv('data/premier-league.csv')
prem_df.sample(5)

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
3479,Wimbledon FC,Morten Bakke,30.0,Goalkeeper,Molde,loan transfer,in,Winter,,Premier League,1998,1998/1999
1739,Middlesbrough FC,Branco,31.0,Left-Back,Corinthians,?,in,Winter,,Premier League,1995,1995/1996
3861,Middlesbrough FC,Clayton Blackmore,34.0,Left-Back,Barnsley FC,?,out,Summer,,Premier League,1999,1999/2000
5156,Middlesbrough FC,David Murphy,17.0,Left-Back,Boro U18,-,in,Summer,,Premier League,2001,2001/2002
10641,Reading FC,John Halls,25.0,Defensive Midfield,Crystal Palace,"End of loanFeb 1, 2008",in,Winter,,Premier League,2007,2007/2008


In [54]:
# Search for rows where player_name contains 'Enzo' (case-sensitive)
prem_df[prem_df['player_name'].str.contains('Enzo')]

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
1788,Bolton Wanderers,Enzo Gambaro,29.0,Left-Back,Milan,free transfer,in,Winter,0.0,Premier League,1995,1995/1996
1791,Bolton Wanderers,Enzo Gambaro,30.0,Left-Back,Grimsby Town,?,out,Winter,,Premier League,1995,1995/1996
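The search above is case-sensitive and would fail on missing names. A minimal sketch of a more forgiving lookup, using a toy frame in place of `prem_df` (every name except Enzo Gambaro is made up for illustration):

```python
import pandas as pd

# Toy stand-in for prem_df; only "Enzo Gambaro" comes from the real data.
players = pd.DataFrame(
    {"player_name": ["Enzo Gambaro", "Lorenzo Sample", "enzo lowercase", "Branco", None]}
)

# case=False ignores capitalisation; na=False treats missing names as non-matches
mask = players["player_name"].str.contains("enzo", case=False, na=False)
enzo_rows = players[mask]
print(enzo_rows)
```

Note that a substring search also picks up names such as "Lorenzo", so the result still needs a quick look by eye.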


In [55]:
#Let's read in the rest of the leagues
bundesliga_df = pd.read_csv('data/1-bundesliga.csv')
championship_df = pd.read_csv('data/championship.csv')
laliga_df = pd.read_csv('data/primera-division.csv')
ligue1_df = pd.read_csv('data/ligue-1.csv')
seriea_df = pd.read_csv('data/serie-a.csv')
liganos_df = pd.read_csv('data/liga-nos.csv')
eredivisie_df = pd.read_csv('data/eredivisie.csv')
russian_league_df = pd.read_csv('data/premier-liga.csv')

In [56]:
# Sum fee_cleaned per club for prem_df and show the ten largest totals
# (note: this is not filtered for transfer_movement == 'in', so it counts fees in both directions)
prem_df.groupby('club_name')['fee_cleaned'].sum().sort_values(ascending=False).head(10) 

club_name
Chelsea FC           4253.312
Manchester City      3427.667
Manchester United    3175.695
Liverpool FC         3031.811
Tottenham Hotspur    2663.286
Arsenal FC           2435.985
Everton FC           1921.142
Newcastle United     1702.146
West Ham United      1648.246
Aston Villa          1469.990
Name: fee_cleaned, dtype: float64
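The totals above add up fees for transfers in both directions. If the goal is spending only, the rows can be filtered to incoming transfers first. A small sketch on made-up data (the club names and fees are placeholders, not values from the dataset):

```python
import pandas as pd

# Hypothetical rows with the same column names as prem_df
transfers = pd.DataFrame({
    "club_name": ["Club A", "Club A", "Club B", "Club B"],
    "transfer_movement": ["in", "out", "in", "in"],
    "fee_cleaned": [10.0, 4.0, 3.0, None],
})

# Keep incoming transfers only, then total the fees per club
incoming = transfers[transfers["transfer_movement"] == "in"]
spend_by_club = (
    incoming.groupby("club_name")["fee_cleaned"]
    .sum()
    .sort_values(ascending=False)
)
print(spend_by_club)
```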

In [57]:
#Let's join all the leagues into one dataframe
leagues_df = pd.concat([prem_df, bundesliga_df, championship_df, laliga_df, ligue1_df, seriea_df, liganos_df, eredivisie_df, russian_league_df], ignore_index=True)

# Keep only the incoming transfers (transfer_movement == 'in') across all leagues
leagues_df_in = leagues_df[leagues_df['transfer_movement'] == 'in']

leagues_df_in.head(5)

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,Middlesbrough FC,Tommy Wright,26.0,Left Winger,Leicester,€910Th.,in,Summer,0.91,Premier League,1992,1992/1993
1,Middlesbrough FC,Jonathan Gittens,28.0,defence,Southampton,€250Th.,in,Summer,0.25,Premier League,1992,1992/1993
2,Middlesbrough FC,Chris Morris,28.0,Right-Back,Celtic,?,in,Summer,,Premier League,1992,1992/1993
3,Middlesbrough FC,Ben Roberts,17.0,Goalkeeper,Boro U18,-,in,Summer,,Premier League,1992,1992/1993
4,Middlesbrough FC,Andy Todd,17.0,Centre-Back,Boro U18,-,in,Summer,,Premier League,1992,1992/1993


In [58]:
# Make a new dataframe: group by league_name, year and transfer_period, then sum fee_cleaned
leagues_df_in_grouped = leagues_df_in.groupby(['league_name', 'year', 'transfer_period'])['fee_cleaned'].sum().reset_index()
leagues_df_in_grouped.head(5)

Unnamed: 0,league_name,year,transfer_period,fee_cleaned
0,1 Bundesliga,1992,Summer,23.22
1,1 Bundesliga,1992,Winter,6.175
2,1 Bundesliga,1993,Summer,35.639
3,1 Bundesliga,1993,Winter,2.25
4,1 Bundesliga,1994,Summer,51.727
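One thing to keep in mind with these sums: pandas skips missing fees, so a group where every `fee_cleaned` is NaN comes out as 0 rather than as missing. A sketch of how `min_count` makes the difference visible, on made-up league names:

```python
import pandas as pd

# Hypothetical data: League Y has no known fees at all
fees = pd.DataFrame({
    "league_name": ["League X", "League X", "League Y"],
    "fee_cleaned": [1.5, None, None],
})

# Default behaviour: an all-NaN group sums to 0.0
default_sum = fees.groupby("league_name")["fee_cleaned"].sum()

# min_count=1 requires at least one real value, otherwise the result is NaN
strict_sum = fees.groupby("league_name")["fee_cleaned"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

Whether 0 or NaN is the better answer depends on the question being asked; the sketch only shows that the two cases can be told apart.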


In [59]:
#year_wise = leagues_df_in.groupby(['league_name', 'year'])['fee_cleaned'].sum().unstack().sort_values(2018, ascending=False)
# Export the grouped totals to year_wise.csv
leagues_df_in_grouped.to_csv('year_wise.csv')
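Writing a CSV without `index=False` stores the row index as an unnamed first column, which is likely where the `Unnamed: 0` columns in these tables come from. A sketch of the round trip using an in-memory buffer instead of a file (the frame is a one-row stand-in for the grouped data):

```python
import io
import pandas as pd

grouped = pd.DataFrame(
    {"league_name": ["League X"], "year": [1992], "fee_cleaned": [23.22]}
)

# Default: the index is written out and comes back as "Unnamed: 0"
with_index = io.StringIO()
grouped.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
grouped.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```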


In [52]:
#Show me 2022 for league_name Premier League
leagues_df_in_grouped[(leagues_df_in_grouped['league_name'] == 'Premier League') & (leagues_df_in_grouped['year'] == 2022)]


Unnamed: 0,league_name,year,transfer_period,fee_cleaned
341,Premier League,2022,summer,2247.57


In [1]:
#read in the year_wise df
year_wise = pd.read_csv('year_wise.csv')
year_wise


NameError: name 'pd' is not defined


## Doing a join with the country databases
We'll have to use fuzzy pandas for the join and assess the damage once the country data is merged in.
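To get a feel for why a fuzzy join is needed, here is a rough sketch of name matching with the standard library's `difflib`. This is a stand-in for illustration, not how fuzzy_pandas works internally; the `best_match` helper and the 0.6 threshold are assumptions.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.6):
    """Return (candidate, score) for the closest candidate, or (None, score) if below threshold."""
    scored = [
        (c, SequenceMatcher(None, name.lower(), c.lower()).ratio())
        for c in candidates
    ]
    candidate, score = max(scored, key=lambda pair: pair[1])
    return (candidate, score) if score >= threshold else (None, score)

# Transfermarkt-style short names versus longer squad names
squads = ["Leicester City", "Crystal Palace", "Grimsby Town"]
match, score = best_match("Leicester", squads)
no_match, low_score = best_match("Corinthians", squads)

print(match, round(score, 2))
print(no_match, round(low_score, 2))
```

A short name like "Leicester" finds its longer counterpart, while a club that is absent from the list falls below the threshold instead of being forced onto the nearest name.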

In [8]:
# Read each country's club list and add the country name as a column, so each club keeps its country after concatenation
spain_df = pd.read_csv('data/spain_clubs.csv')
spain_df['country'] = 'Spain'

germany_df = pd.read_csv('data/germany_clubs.csv')
germany_df['country'] = 'Germany'

italy_df = pd.read_csv('data/italy_clubs.csv')
italy_df['country'] = 'Italy'

english_df = pd.read_csv('data/english_clubs.csv')
english_df['country'] = 'England'

france_df = pd.read_csv('data/france_clubs.csv')
france_df['country'] = 'France'

scotland_df = pd.read_csv('data/scotland_clubs.csv')
scotland_df['country'] = 'Scotland'

belgium_df = pd.read_csv('data/belgium_clubs.csv')
belgium_df['country'] = 'Belgium'

turkey_df = pd.read_csv('data/turkey_clubs.csv')
turkey_df['country'] = 'Turkey'

korea_df = pd.read_csv('data/korea_clubs.csv')
korea_df['country'] = 'Korea'

japan_df = pd.read_csv('data/japan_clubs.csv')
japan_df['country'] = 'Japan'

netherlands_df = pd.read_csv('data/netherlands_clubs.csv')
netherlands_df['country'] = 'Netherlands'

brazil_df = pd.read_csv('data/brazil_clubs.csv')
brazil_df['country'] = 'Brazil'

portugal_df = pd.read_csv('data/portugal_clubs.csv')
portugal_df['country'] = 'Portugal'

ukraine_df = pd.read_csv('data/ukraine_clubs.csv')
ukraine_df['country'] = 'Ukraine'

denmark_df = pd.read_csv('data/denmark_clubs.csv')
denmark_df['country'] = 'Denmark'

russia_df = pd.read_csv('data/russia_clubs.csv')
russia_df['country'] = 'Russia'

sweden_df = pd.read_csv('data/sweden_clubs.csv')
sweden_df['country'] = 'Sweden'

austria_df = pd.read_csv('data/austria_clubs.csv')
austria_df['country'] = 'Austria'

croatia_df = pd.read_csv('data/croatia_clubs.csv')
croatia_df['country'] = 'Croatia'
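The read-and-tag pattern above repeats once per country, so it could also be driven by a single mapping. A sketch under the assumption that every file is named `<stem>_clubs.csv`; the `load_country_clubs` helper is hypothetical, and the demo writes two throwaway files so it runs without the real `data/` directory:

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_country_clubs(data_dir, countries):
    """Read `<stem>_clubs.csv` for each (stem, label) pair and tag rows with the label."""
    frames = []
    for stem, label in countries.items():
        df = pd.read_csv(Path(data_dir) / f"{stem}_clubs.csv")
        df["country"] = label
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Demo with two temporary files containing placeholder club names
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Squad": ["Club A"]}).to_csv(Path(tmp) / "spain_clubs.csv", index=False)
    pd.DataFrame({"Squad": ["Club B", "Club C"]}).to_csv(
        Path(tmp) / "english_clubs.csv", index=False
    )
    demo_df = load_country_clubs(tmp, {"spain": "Spain", "english": "England"})

print(demo_df)
```

Because loading and concatenating happen in one place, a country that is read in cannot be left out of the combined frame by accident.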


In [9]:
# Concatenate the country dataframes (note: croatia_df is loaded above but is not included in this list)
country_df = pd.concat([austria_df, english_df, russia_df, sweden_df, spain_df, denmark_df, ukraine_df, germany_df, italy_df, france_df, scotland_df, belgium_df, turkey_df, korea_df, japan_df, netherlands_df, brazil_df, portugal_df], ignore_index=True)
country_df.sample(5)



Unnamed: 0,Squad,Gender,Comp,From,To,Comps,Champs,Other Names,country
3060,PSV Vrouwen,F,Eredivisie Vrouwen,2018-2019,2022-2023,5,0.0,,Netherlands
2385,FC Échirolles,M,,2014-2015,2015-2016,0,0.0,,France
162,Bishop's Stortford FC,M,,2017-2018,2022-2023,0,0.0,,England
2617,SAS Épinal,M,,2014-2015,2022-2023,0,0.0,,France
569,Liskeard Athletic FC,M,,2022-2023,2022-2023,0,,,England


Let's use fuzzy pandas to join the transfers to the country data on club name.

In [10]:
matches = fpd.fuzzy_merge(leagues_df, country_df, left_on=['club_involved_name'], right_on=['Squad'])
len(matches)


KeyboardInterrupt: 

36483
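Part of assessing the damage is checking whether one transfer matched several rows in the club list (for example, a men's and a women's side with the same name), since that would double-count fees. A sketch using a plain pandas merge as a stand-in for the fuzzy merge, on made-up data:

```python
import pandas as pd

transfers = pd.DataFrame({
    "player_name": ["Player 1", "Player 2"],
    "club_involved_name": ["Club A", "Club B"],
    "fee_cleaned": [5.0, 2.0],
})
# "Club A" appears twice on the right-hand side
clubs = pd.DataFrame({
    "Squad": ["Club A", "Club A", "Club B"],
    "Gender": ["M", "F", "M"],
    "country": ["Spain", "Spain", "Italy"],
})

# Carry the original row number through the merge so duplicates can be counted
left = transfers.reset_index().rename(columns={"index": "left_row"})
merged = left.merge(clubs, left_on="club_involved_name", right_on="Squad")

matches_per_row = merged.groupby("left_row").size()
inflated_total = merged["fee_cleaned"].sum()
true_total = transfers["fee_cleaned"].sum()

print(matches_per_row)
print(inflated_total, true_total)
```

Any row with more than one match inflates the fee totals computed from the merged frame, so those rows need deduplicating before summing.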

In [24]:
results = fpd.fuzzy_merge(leagues_df, country_df,
            left_on=['club_involved_name'],
            right_on=['Squad'],
            keep_left=['club_name','player_name', 'club_involved_name', 'year', 'transfer_movement', 'transfer_period', 'fee_cleaned', 'league_name'],
            keep_right=['country','Comp'])


NameError: name 'country_df' is not defined

In [23]:
results.head(5)

NameError: name 'results' is not defined

In [None]:
# From the merged data, we want to see how many transfers involve each country
# and how much money La Liga spent on transfers.
# In matches, show the rows where the country is NaN
matches[matches['country'].isnull()]

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,...,season,Squad,Gender,Comp,From,To,Comps,Champs,Other Names,country
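An inner-style merge drops transfers with no matching club, so filtering the merged frame for a missing country can come back empty even when many clubs failed to match. Whether that explains the empty result here depends on the join mode the fuzzy merge used, which is worth checking. A sketch of how unmatched rows can be surfaced with a plain pandas left merge (toy data; "Boro U18" is borrowed from the transfers above as an example of a club unlikely to be in the country lists):

```python
import pandas as pd

transfers = pd.DataFrame({
    "club_involved_name": ["Club A", "Boro U18"],
    "fee_cleaned": [5.0, None],
})
clubs = pd.DataFrame({"Squad": ["Club A"], "country": ["Spain"]})

# how="left" keeps every transfer; indicator=True adds a _merge column
checked = transfers.merge(
    clubs,
    how="left",
    left_on="club_involved_name",
    right_on="Squad",
    indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```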


In [12]:
# Group by country and sum fee_cleaned over all years
results.groupby(['country'])['fee_cleaned'].sum().sort_values(ascending=False)


country
Italy          11439.888
Spain           8077.205
Germany         7322.478
Portugal        4004.914
France          2091.082
Belgium         1720.448
Netherlands     1697.532
Brazil           775.239
Denmark          375.750
Turkey           295.942
England          177.071
Sweden           129.157
Japan             91.088
Scotland          65.782
Russia            51.425
Austria           22.425
Korea              8.600
Name: fee_cleaned, dtype: float64
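The totals above cover all years at once. For a per-year breakdown, `year` can be added to the grouping and unstacked into columns. A sketch on made-up numbers with the same column names as `results`:

```python
import pandas as pd

results_demo = pd.DataFrame({
    "country": ["Italy", "Italy", "Spain", "Spain"],
    "year": [1992, 1993, 1992, 1992],
    "fee_cleaned": [1.0, 2.0, 3.0, 4.0],
})

# One row per country, one column per year; missing combinations become 0
by_country_year = (
    results_demo.groupby(["country", "year"])["fee_cleaned"]
    .sum()
    .unstack(fill_value=0)
)
print(by_country_year)
```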

In [13]:
# Keep incoming transfers, then sum fee_cleaned by club_name, largest first
results[results['transfer_movement'] == 'in'].groupby(['club_name'])['fee_cleaned'].sum().sort_values(ascending=False)

club_name
FC Barcelona                   905.600
Juventus FC                    792.340
Real Madrid                    768.410
Paris Saint-Germain            757.820
Liverpool FC                   743.625
                                ...   
FC Dordrecht                     0.000
SC Beira-Mar                     0.000
Rotor Volgograd                  0.000
Rotherham United                 0.000
Энергия-Текстильщик Камышин      0.000
Name: fee_cleaned, Length: 387, dtype: float64

In [14]:
prem_transfer = results[results['league_name'] == 'Premier League']

In [21]:
# Sum fee_cleaned on incoming transfers per club in prem_transfer, largest first
prem_transfer[prem_transfer['transfer_movement'] == 'in'].groupby(['club_name'])['fee_cleaned'].sum().sort_values(ascending=False)

club_name
Liverpool FC               743.625
Chelsea FC                 717.990
Manchester City            654.350
Manchester United          465.580
Tottenham Hotspur          420.890
Wolverhampton Wanderers    352.375
Newcastle United           314.050
Arsenal FC                 274.830
Leicester City             272.660
Watford FC                 139.850
Aston Villa                131.980
Fulham FC                  130.860
Southampton FC             117.580
Everton FC                 113.460
Brighton & Hove Albion      94.760
Leeds United                90.200
Swansea City                68.370
West Bromwich Albion        63.880
Burnley FC                  56.900
Stoke City                  55.045
Sunderland AFC              53.975
Norwich City                52.900
West Ham United             40.866
Cardiff City                39.050
Blackburn Rovers            39.025
Nottingham Forest           27.300
Sheffield United            23.000
Brentford FC                21.000
Wigan Athl

In [19]:
# Make a new df for the clubs in La Liga
la_liga_transfer = results[results['league_name'] == 'Primera Division']

In [20]:
# Sum fee_cleaned on incoming transfers per club in la_liga_transfer, largest first
la_liga_transfer[la_liga_transfer['transfer_movement'] == 'in'].groupby(['club_name'])['fee_cleaned'].sum().sort_values(ascending=False)


club_name
FC Barcelona               905.600
Real Madrid                768.410
Atlético de Madrid         546.235
Valencia CF                308.720
Sevilla FC                 290.385
Villarreal CF              270.500
Deportivo de La Coruña     209.440
Real Betis Balompié        181.468
Athletic Bilbao            120.450
Celta de Vigo              108.683
Real Sociedad               95.530
RCD Espanyol Barcelona      94.775
Getafe CF                   53.000
Granada CF                  43.640
Real Zaragoza               37.740
Málaga CF                   35.350
Levante UD                  33.000
RCD Mallorca                29.330
Real Valladolid CF          24.580
UD Almería                  24.100
CA Osasuna                  23.470
Racing Santander            21.892
CD Tenerife                 19.270
CD Leganés                  19.200
SD Eibar                    16.600
Deportivo Alavés            14.050
Rayo Vallecano              14.050
UD Las Palmas               12.900
Recreativo
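The Premier League and La Liga cells follow the same filter, group and sort steps, so they could share one small helper. A sketch, where `incoming_spend_by_club` is a hypothetical name and the demo frame uses placeholder clubs and fees:

```python
import pandas as pd

def incoming_spend_by_club(df, league):
    """Total fee_cleaned on incoming transfers per club for one league, largest first."""
    mask = (df["league_name"] == league) & (df["transfer_movement"] == "in")
    return (
        df[mask]
        .groupby("club_name")["fee_cleaned"]
        .sum()
        .sort_values(ascending=False)
    )

demo = pd.DataFrame({
    "league_name": ["Premier League", "Premier League", "Primera Division", "Primera Division"],
    "club_name": ["Club A", "Club A", "Club B", "Club C"],
    "transfer_movement": ["in", "out", "in", "in"],
    "fee_cleaned": [10.0, 99.0, 3.0, 8.0],
})

prem_spend = incoming_spend_by_club(demo, "Premier League")
la_liga_spend = incoming_spend_by_club(demo, "Primera Division")
print(prem_spend)
print(la_liga_spend)
```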

In [18]:
laliga_df.head(5)

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,Real Sociedad,Alberto López,23.0,Goalkeeper,Real Sociedad B,-,in,Summer,,Primera Division,1992,1992/1993
1,Real Sociedad,Iñigo Idiakez,18.0,Attacking Midfield,R. Sociedad U19,-,in,Summer,,Primera Division,1992,1992/1993
2,Real Sociedad,José González,27.0,Goalkeeper,Valencia,?,out,Summer,,Primera Division,1992,1992/1993
3,Cádiz CF,Igor Stimac,24.0,Centre-Back,Hajduk Split,?,in,Summer,,Primera Division,1992,1992/1993
4,Cádiz CF,Quino,21.0,attack,CD Málaga,?,in,Summer,,Primera Division,1992,1992/1993
