This file is divided into two parts. In the first part, the data frame obtained by the scraper is cleaned and prepared for analysis. The analysis itself is carried out in the second part.

### Part 1 - Cleaning

In [1]:
import pandas as pd

After importing the necessary packages, we load the CSV file that was created by the class ScraperURL.

In [2]:
table = pd.read_csv('C:/Users/jancv/Player_stats.csv')

In [3]:
table.head()

Unnamed: 0.1,Unnamed: 0,player,nationality,position,squad,age,birth_year,games,games_starts,minutes,...,assists,pens_made,pens_att,cards_yellow,cards_red,goals_per90,assists_per90,goals_assists_per90,goals_pens_per90,goals_assists_pens_per90
0,0,Marko Alvir,hr CRO,MF,Viktoria Plzeň,26-255,1994.0,9.0,0.0,116,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Emmanuel Antwi,gh GHA,DF,Příbram,24-239,1996.0,8.0,2.0,273,...,0.0,0.0,0.0,0.0,0.0,0.66,0.0,0.66,0.66,0.66
2,2,Oleksandr Azatskyi,ua UKR,DF,Baník Ostrava,26-352,1994.0,3.0,0.0,26,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Pavol Bajza,sk SVK,GK,Slovácko,29-117,1991.0,6.0,6.0,540,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Jakub Barac,cz CZE,MF,Slovan Liberec,24-148,1996.0,3.0,1.0,80,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


1) A first look at the data reveals that the first column can be dropped, as it only stores the row indexes.

2) Next, we want to find out whether our data frame contains NaN values. If it does, the affected observations are dropped.

3) Moreover, the column 'nationality' needs to be edited, because we are only interested in the 3-letter abbreviation. The same holds for 'age': the number of days is not very informative, so only the age in years is kept.


In [4]:
table = table.iloc[:,1:]

In [5]:
table.isnull().any().any()

True

In [6]:
table = table.dropna()
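Before dropping rows, it can help to see how much data is affected. The following is a minimal sketch on a small made-up frame (the rows and values are hypothetical, only the column names `player`, `goals` and `assists` come from the table above); it counts missing values per column and the number of rows that `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, not taken from the scraped file.
sample = pd.DataFrame({
    "player": ["A", "B", "C"],
    "goals": [1.0, np.nan, 0.0],
    "assists": [0.0, 2.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = int(sample.isnull().any(axis=1).sum())

# dropna() removes exactly those rows.
cleaned = sample.dropna()

print(missing_per_column)
print(f"{rows_with_missing} of {len(sample)} rows would be dropped.")
```

If a large share of rows would be lost, dropping only on the columns that the analysis really needs (the `subset` argument of `dropna`) may be a gentler alternative.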

In [7]:
nationality = []
for nat in table['nationality']:
    nationality.append(nat[3:])
table['nationality'] = nationality

In [8]:
age = []
for ag in table['age']:
    age.append(int(ag[:2]))
table['age'] = age
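The two loops above can also be written with pandas string methods. This is a sketch on a few sample values copied from the table head, not a replacement tested on the full file. It assumes that the nationality always ends with the 3-letter code after a space and that the age always has the form `years-days`. Splitting on the separator instead of slicing at a fixed position would also cope with a country prefix that is not exactly two letters long.

```python
import pandas as pd

# Sample values in the same shape as the scraped columns.
sample = pd.DataFrame({
    "nationality": ["hr CRO", "gh GHA", "cz CZE"],
    "age": ["26-255", "24-239", "24-148"],
})

# Keep the last whitespace-separated token, i.e. the 3-letter code.
sample["nationality"] = sample["nationality"].str.split().str[-1]

# Keep the part before the hyphen (the years) and convert it to an integer.
sample["age"] = sample["age"].str.split("-").str[0].astype(int)

print(sample)
```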

At this point, every column that is expected to hold integers or floats does so, with the exception of 'minutes'. Its values are strings that use a comma as the thousands separator, so we have to convert the column from strings to integers.

In [9]:
minutes = []
for minute in table['minutes']:
    minute = minute.replace(",","")
    minutes.append(int(minute))

table['minutes'] = minutes  
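The same conversion can be expressed without an explicit loop. A minimal sketch, assuming the values are strings that use a comma as the thousands separator (the value `"1,080"` is a made-up example of such a string):

```python
import pandas as pd

# Sample values; "1,080" is a hypothetical entry with a thousands separator.
minutes_raw = pd.Series(["116", "273", "1,080"], name="minutes")

# Remove the literal comma, then cast the whole column to integers.
minutes_clean = minutes_raw.str.replace(",", "", regex=False).astype(int)

print(minutes_clean)
```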

In [10]:
table.head()

Unnamed: 0,player,nationality,position,squad,age,birth_year,games,games_starts,minutes,goals,assists,pens_made,pens_att,cards_yellow,cards_red,goals_per90,assists_per90,goals_assists_per90,goals_pens_per90,goals_assists_pens_per90
0,Marko Alvir,CRO,MF,Viktoria Plzeň,26,1994.0,9.0,0.0,116,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Emmanuel Antwi,GHA,DF,Příbram,24,1996.0,8.0,2.0,273,2.0,0.0,0.0,0.0,0.0,0.0,0.66,0.0,0.66,0.66,0.66
2,Oleksandr Azatskyi,UKR,DF,Baník Ostrava,26,1994.0,3.0,0.0,26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Pavol Bajza,SVK,GK,Slovácko,29,1991.0,6.0,6.0,540,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Jakub Barac,CZE,MF,Slovan Liberec,24,1996.0,3.0,1.0,80,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
table.to_csv('Final_Player_Table.csv')

Our data frame is now ready to be analyzed.

### Part 2 - Analysis

In [30]:
import matplotlib.pyplot as plt
import numpy as np

In [78]:
table.head()

Unnamed: 0,player,nationality,position,squad,age,birth_year,games,games_starts,minutes,goals,assists,pens_made,pens_att,cards_yellow,cards_red,goals_per90,assists_per90,goals_assists_per90,goals_pens_per90,goals_assists_pens_per90
0,Marko Alvir,CRO,MF,Viktoria Plzeň,26,1994.0,9.0,0.0,116,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Emmanuel Antwi,GHA,DF,Příbram,24,1996.0,8.0,2.0,273,2.0,0.0,0.0,0.0,0.0,0.0,0.66,0.0,0.66,0.66,0.66
2,Oleksandr Azatskyi,UKR,DF,Baník Ostrava,26,1994.0,3.0,0.0,26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Pavol Bajza,SVK,GK,Slovácko,29,1991.0,6.0,6.0,540,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Jakub Barac,CZE,MF,Slovan Liberec,24,1996.0,3.0,1.0,80,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Our statistics cover the current 2020/21 season. There are 18 teams in the Czech Football League this season, which is unusual, since normally there are only 16. Because the coronavirus pandemic caused many games to be cancelled, the relegation part of last season was not finished, so no team was relegated. The two best clubs from the second tier, however, were still promoted. There have been 14 matchdays this season so far, but due to the Covid pandemic some games could not be played. Thus, the number of games played by each team varies from 12 to 14, and statistics that are simple totals are biased by this unequal number of games.
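One way to compare players despite the unequal number of games is to use the per-90-minutes columns. The sketch below assumes that `goals_per90` is defined as goals divided by minutes played, times 90. This definition is an assumption, but it reproduces the value 0.66 shown for Emmanuel Antwi in the table head. The column name `goals_per90_check` is introduced here for illustration only.

```python
import pandas as pd

# Two rows copied from the table head above.
sample = pd.DataFrame({
    "player": ["Emmanuel Antwi", "Pavol Bajza"],
    "goals": [2.0, 0.0],
    "minutes": [273, 540],
})

# Assumed definition: goals scaled to a full 90-minute game.
sample["goals_per90_check"] = (sample["goals"] / sample["minutes"] * 90).round(2)

print(sample)
```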

Firstly, we are interested in nationality.

In [135]:
table.groupby('nationality').count()['player']

nationality
ALB      1
AUT      1
BHR      1
BIH      4
BRA      7
BUL      1
CIV      4
CMR      2
COD      1
COL      1
CRO      4
CUW      1
CZE    342
ESP      3
FRA      6
GAM      1
GEO      1
GER      1
GHA      2
GRE      1
GUI      2
KOR      1
LBR      1
LVA      1
MLI      1
MNE      1
NED      3
NGA      4
NOR      1
POL      2
ROU      2
RUS      4
SEN      1
SRB      6
SVK     34
SWE      1
UKR      3
USA      1
Name: player, dtype: int64

In [136]:
CZ = int(table[table['nationality']=='CZE'].count()['player'])
FO = int(table[table['nationality']!='CZE'].count()['player'])

print(f" There are {CZ} Czech players.")
print(f" There are {FO} foreign players.")
print(f" The ratio of Czech players to all players is {round(CZ/(CZ+FO)*100,1)}%.")

 There are 342 Czech players.
 There are 112 foreign players.
 The ratio of Czech players to all players is 75.3%.
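The same three numbers can be obtained from a single boolean mask. This is a minimal sketch on a made-up list of nationalities, so the printed values differ from the real output above.

```python
import pandas as pd

# Hypothetical nationalities, not the real data.
nationality = pd.Series(["CZE", "CZE", "SVK", "CZE", "BRA"])

# True for Czech players, False for everyone else.
is_czech = nationality.eq("CZE")

n_czech = int(is_czech.sum())        # number of True values
n_foreign = int((~is_czech).sum())   # number of False values

# The mean of a boolean mask is the share of True values.
share_czech = round(is_czech.mean() * 100, 1)

print(f"There are {n_czech} Czech players.")
print(f"There are {n_foreign} foreign players.")
print(f"The share of Czech players is {share_czech}%.")
```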


In [77]:
Nat = pd.DataFrame(index=sorted(table['squad'].unique()),columns=['Ratio','Foreigners','Czechs'])

Nat['Czechs'] = table[table['nationality']=='CZE'].groupby('squad').count()['player']
Nat['Foreigners'] = table[table['nationality']!='CZE'].groupby('squad').count()['player']
Nat['Ratio'] = round((Nat['Foreigners']/(Nat['Foreigners']+Nat['Czechs']))*100,1)

Nat.sort_values('Ratio',ascending = False)

Unnamed: 0,Ratio,Foreigners,Czechs
Karviná,47.8,11,12
České Budĕjov.,43.5,10,13
Fastav Zlín,42.9,9,12
Slovan Liberec,37.9,11,18
Příbram,29.6,8,19
Sparta Prague,28.0,7,18
Slavia Prague,26.1,6,17
Baník Ostrava,24.0,6,19
Opava,21.9,7,25
Viktoria Plzeň,21.7,5,18


Karviná has almost 50% of its squad from abroad, while in the team of Jablonec the vast majority of players are Czech.
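When the two groups are counted separately, a squad without any foreign player ends up with NaN instead of 0 in the 'Foreigners' column, and its ratio becomes NaN as well. The sketch below shows one possible alternative based on `pd.crosstab`, which fills empty combinations with zero. The squads "A" and "B" and their players are hypothetical.

```python
import pandas as pd

# Hypothetical squads: "B" has no foreign players at all.
sample = pd.DataFrame({
    "squad": ["A", "A", "A", "B", "B"],
    "nationality": ["CZE", "SVK", "CZE", "CZE", "CZE"],
})

# Label every player as Czech or foreign.
origin = sample["nationality"].eq("CZE").map({True: "Czechs", False: "Foreigners"})

# Count players per squad and origin; missing combinations become 0.
counts = pd.crosstab(sample["squad"], origin)
counts = counts.reindex(columns=["Foreigners", "Czechs"], fill_value=0)

# Share of foreigners in percent (the row sum is taken before 'Ratio' exists).
counts["Ratio"] = (counts["Foreigners"] / counts.sum(axis=1) * 100).round(1)

print(counts.sort_values("Ratio", ascending=False))
```

The same pattern works for any other grouping column, for example `position` instead of `squad`.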

#### Position

In [133]:
PoS = pd.DataFrame(index=sorted(table['position'].unique()),columns=['Ratio','Foreigners','Czechs'])

PoS['Czechs'] = table[table['nationality']=='CZE'].groupby('position').count()['player']
PoS['Foreigners'] = table[table['nationality']!='CZE'].groupby('position').count()['player']
PoS['Ratio'] = round((PoS['Foreigners']/(PoS['Foreigners']+PoS['Czechs']))*100,1)

PoS.sort_values('Ratio',ascending = False)

Unnamed: 0,Ratio,Foreigners,Czechs
"DF,MF",50.0,6.0,6
"FW,MF",47.4,9.0,10
FW,27.5,22.0,58
MF,23.8,43.0,138
DF,22.4,28.0,97
GK,11.1,4.0,32
"DF,FW",,,1


Considering only the players who did not change their position (those listed under a single position), we can see that the highest share of foreigners is among forwards, at 27.5%. In contrast, 32 of the 36 goalkeepers are Czech, which is almost 90%.
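Restricting the table to single-position players can be done by filtering out every position label that contains a comma. A minimal sketch on made-up rows, assuming that multiple positions are always written as a comma-separated string such as "DF,MF":

```python
import pandas as pd

# Hypothetical players; two of them are listed under two positions.
sample = pd.DataFrame({
    "position": ["GK", "DF,MF", "FW", "FW", "FW,MF"],
    "nationality": ["CZE", "SVK", "CZE", "BRA", "CZE"],
})

# Keep only rows whose position label contains no comma.
single = sample[~sample["position"].str.contains(",", regex=False)]

# Share of foreigners per position, in percent.
is_foreign = single["nationality"].ne("CZE")
foreign_share = (is_foreign.groupby(single["position"]).mean() * 100).round(1)

print(foreign_share.sort_values(ascending=False))
```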

This notebook is still in progress...