# Merging and Joining
In this notebook, we build the master dataframe that the rest of the analysis runs on.

It merges the transfer data from Transfermarkt with the country-level club data from FBref.com, which lets us do a country-by-country analysis.

### Standard Python + R setup and imports

This notebook loads both Python and R so that I can test visualisations in R as well.
It also imports fuzzy_pandas for the fuzzy joins.

In [59]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import fuzzy_pandas as fpd

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [23]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [24]:
%%R

# My commonly used R imports
require('tidyverse')

## Clean up

We want to add the country of the club involved in each transfer. How do we do that for clubs from over 30 countries?

In [72]:
# This is not the original df; I have overwritten it
prem_df = pd.read_csv('data/premier-league.csv')
prem_df.sample(5)

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
22540,Liverpool FC,Ben Woodburn,22.0,Attacking Midfield,Heart of Midl.,"End of loanMay 31, 2022",in,Summer,,Premier League,2021,2021/2022
2231,Newcastle United,Paul Kitson,26.0,Centre-Forward,West Ham,€3.50m,out,Winter,3.5,Premier League,1996,1996/1997
11164,Tottenham Hotspur,Younès Kaboul,22.0,Centre-Back,Portsmouth,€6.25m,out,Summer,6.25,Premier League,2008,2008/2009
7301,Arsenal FC,Arturo Lupoli,17.0,Centre-Forward,Parma FC Youth,€300Th.,in,Summer,0.3,Premier League,2004,2004/2005
16373,Norwich City,Jacob Murphy,18.0,Right Winger,Swindon Town,loan transfer,out,Winter,,Premier League,2013,2013/2014


In [79]:
#Let's read in the rest of the leagues
bundesliga_df = pd.read_csv('data/1-bundesliga.csv')
championship_df = pd.read_csv('data/championship.csv')
laliga_df = pd.read_csv('data/primera-division.csv')
ligue1_df = pd.read_csv('data/ligue-1.csv')
seriea_df = pd.read_csv('data/serie-a.csv')
liganos_df = pd.read_csv('data/liga-nos.csv')
eredivisie_df = pd.read_csv('data/eredivisie.csv')
russian_league_df = pd.read_csv('data/premier-liga.csv')

In [92]:
#Let's join all the leagues into one dataframe
leagues_df = pd.concat([prem_df, bundesliga_df, championship_df, laliga_df, ligue1_df, seriea_df, liganos_df, eredivisie_df, russian_league_df], ignore_index=True)
leagues_df.sample(5)
len(leagues_df)

175832

## Doing a join with the country databases
We'll have to use fuzzy_pandas and assess the damage once the matches are in.

In [96]:
# Also add the country name as a column so that we can read it back after the join
spain_df = pd.read_csv('data/spain_clubs.csv')
spain_df['country'] = 'Spain'

germany_df = pd.read_csv('data/germany_clubs.csv')
germany_df['country'] = 'Germany'

italy_df = pd.read_csv('data/italy_clubs.csv')
italy_df['country'] = 'Italy'

english_df = pd.read_csv('data/english_clubs.csv')
english_df['country'] = 'England'

france_df = pd.read_csv('data/france_clubs.csv')
france_df['country'] = 'France'

scotland_df = pd.read_csv('data/scotland_clubs.csv')
scotland_df['country'] = 'Scotland'

belgium_df = pd.read_csv('data/belgium_clubs.csv')
belgium_df['country'] = 'Belgium'

turkey_df = pd.read_csv('data/turkey_clubs.csv')
turkey_df['country'] = 'Turkey'

korea_df = pd.read_csv('data/korea_clubs.csv')
korea_df['country'] = 'Korea'

japan_df = pd.read_csv('data/japan_clubs.csv')
japan_df['country'] = 'Japan'

netherlands_df = pd.read_csv('data/netherlands_clubs.csv')
netherlands_df['country'] = 'Netherlands'

brazil_df = pd.read_csv('data/brazil_clubs.csv')
brazil_df['country'] = 'Brazil'

portugal_df = pd.read_csv('data/portugal_clubs.csv')
portugal_df['country'] = 'Portugal'

ukraine_df = pd.read_csv('data/ukraine_clubs.csv')
ukraine_df['country'] = 'Ukraine'

denmark_df = pd.read_csv('data/denmark_clubs.csv')
denmark_df['country'] = 'Denmark'

russia_df = pd.read_csv('data/russia_clubs.csv')
russia_df['country'] = 'Russia'

sweden_df = pd.read_csv('data/sweden_clubs.csv')
sweden_df['country'] = 'Sweden'

austria_df = pd.read_csv('data/austria_clubs.csv')
austria_df['country'] = 'Austria'

croatia_df = pd.read_csv('data/croatia_clubs.csv')
croatia_df['country'] = 'Croatia'
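
The read-and-label pattern above repeats once per country, so it can also be written as a loop. The following is only a sketch: the helper name `load_country_clubs` and the slug-to-country mapping are assumptions for illustration, and the demo uses throwaway files rather than the real `data/` directory. Building the list in one place means every file that is read also ends up in the combined frame.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_country_clubs(data_dir, countries):
    """Read `<data_dir>/<slug>_clubs.csv` for each slug and tag rows with the country name.

    `countries` maps a file slug (e.g. "spain") to the label to store
    in the `country` column (e.g. "Spain"). Hypothetical helper.
    """
    frames = []
    for slug, country in countries.items():
        df = pd.read_csv(Path(data_dir) / f"{slug}_clubs.csv")
        df["country"] = country
        frames.append(df)
    # One concat at the end, with a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Tiny demo with throwaway files (not the real FBref exports)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "spain_clubs.csv").write_text("Squad\nReal Zaragoza\n")
    Path(tmp, "english_clubs.csv").write_text("Squad\nRochdale AFC\n")
    demo = load_country_clubs(tmp, {"spain": "Spain", "english": "England"})

print(demo)
```

With the real files, the call would look something like `load_country_clubs('data', {'spain': 'Spain', 'english': 'England', ...})`, with one entry per CSV.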


In [97]:
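# Concatenate the country dataframes
# Note: croatia_df is loaded above but is not in this list, so Croatian clubs are missing from country_df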
country_df = pd.concat([austria_df, english_df, russia_df, sweden_df, spain_df, denmark_df, ukraine_df, germany_df, italy_df, france_df, scotland_df, belgium_df, turkey_df, korea_df, japan_df, netherlands_df, brazil_df, portugal_df], ignore_index=True)
country_df.sample(5)



Unnamed: 0,Squad,Gender,Comp,From,To,Comps,Champs,Other Names,country
465,Hashtag United FC,M,,2020-2021,2022-2023,0,,,England
1271,CF Illueca,M,,2019-2020,2019-2020,0,0.0,,Spain
298,Colne FC,M,,2017-2018,2022-2023,0,,,England
2861,OH Leuven,M,Belgian First Division A,2011-2012,2022-2023,12,0.0,,Belgium
755,Rochdale AFC,M,EFL League Two,2002-2003,2022-2023,21,0.0,,England


Let's use fuzzy pandas and do a join 

In [98]:
matches = fpd.fuzzy_merge(prem_df, country_df, left_on=['club_involved_name'], right_on=['Squad'])
len(matches) 

2882
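
To get a feel for why the club names need fuzzy matching at all, here is a small sketch that uses the standard library's `difflib` instead of fuzzy_pandas. It is not how fuzzy_pandas works internally; the helper `best_squad_match`, the 0.6 cutoff, and the three squad names are assumptions for illustration. The idea is that a short Transfermarkt-style name such as "Portsmouth" should still find a longer FBref-style name such as "Portsmouth FC".

```python
from difflib import get_close_matches


def best_squad_match(name, squads, cutoff=0.6):
    """Return the squad most similar to `name`, or None if nothing clears the cutoff.

    Comparison is case-insensitive; the original spelling is returned.
    Hypothetical helper, not part of fuzzy_pandas.
    """
    lookup = {s.lower(): s for s in squads}
    hits = get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Illustrative squad names only
squads = ["Portsmouth FC", "Swindon Town FC", "West Ham United FC"]

for club in ["Portsmouth", "Swindon Town", "Qqqq"]:
    print(club, "->", best_squad_match(club, squads))
```

A higher cutoff gives fewer but safer matches. A lower one catches more abbreviations at the cost of more false positives, which is the damage we need to assess afterwards.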

In [100]:
results = fpd.fuzzy_merge(leagues_df, country_df,
            left_on=['club_involved_name'],
            right_on=['Squad'],
            keep_left=['club_name','player_name', 'club_involved_name', 'year', 'transfer_movement', 'transfer_period', 'fee_cleaned', 'league_name'],
            keep_right=['country','Comp'])
results.sample(5)

Unnamed: 0,club_name,player_name,club_involved_name,year,transfer_movement,transfer_period,fee_cleaned,league_name,country,Comp
20561,Angers SCO,Yoane Wissa,AC Ajaccio,2017,out,Summer,,Ligue 1,France,Ligue 1
9221,Reading FC,Dominic Samuel,Gillingham FC,2015,out,Winter,,Championship,England,EFL League Two
4745,VfL Wolfsburg,Andrés D'Alessandro,Real Zaragoza,2006,out,Summer,0.8,1 Bundesliga,Spain,Segunda División
32120,Willem II Tilburg,Geert den Ouden,ADO Den Haag,2006,in,Summer,,Eredivisie,Netherlands,Eerste Divisie
11339,CA Osasuna,Luis Pérez,Real Unión,2000,out,Summer,,Primera Division,Spain,

In [102]:
len(results)

36483


In [101]:
# From the results df, we want to see how many transfers there are in each country
# and how much money La Liga spent on transfers.
# First, show the rows in results where the country is NaN
results[results['country'].isnull()]

Unnamed: 0,club_name,player_name,club_involved_name,year,transfer_movement,transfer_period,fee_cleaned,league_name,country,Comp
