# Capology Player Web Scraping
##### Notebook to scrape raw data from [Capology](https://www.capology.com/) using [Beautiful Soup](https://pypi.org/project/beautifulsoup4/) and [Selenium](https://www.selenium.dev/)


___

<a id='sectionintro'></a>

## <a id='import_libraries'>Introduction</a>
This notebook scrapes player salary data from [Capology](https://www.capology.com/), using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames, and [Selenium](https://www.selenium.dev/) and [Beautiful Soup](https://pypi.org/project/beautifulsoup4/) for web scraping.


___

<a id='sectioncontents'></a>

## <a id='notebook_contents'>Notebook Contents</a>
1.    [Notebook Dependencies](#section1)<br>
2.    [Project Brief](#section2)<br>
3.    [Data Scraping](#section3)<br>
4.    [Data Clean](#section4)<br>
5.    [Export Data](#section5)<br>
6.    [Summary](#section6)<br>
7.    [Next Steps](#section7)<br>

___

<a id='section1'></a>

## <a id='#section1'>1. Notebook Dependencies</a>

This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
*    [`Jupyter notebooks`](https://jupyter.org/) for the notebook environment in which this project is presented;
*    [`NumPy`](http://www.numpy.org/) for multidimensional array computing;
*    [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation;
*    [`Beautiful Soup`](https://pypi.org/project/beautifulsoup4/) and [`Selenium`](https://www.selenium.dev/) for web scraping.

All packages used in this notebook except [`Beautiful Soup`](https://pypi.org/project/beautifulsoup4/) and [`Selenium`](https://www.selenium.dev/) can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all major platforms (Windows, Linux and macOS). Step-by-step guides to installing Anaconda are available for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation [here](https://docs.anaconda.com/anaconda/install/).

### Import Libraries and Modules

In [1]:
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv

# Render matplotlib figures inline in the notebook
%matplotlib inline

# Math Operations
import numpy as np
from math import pi

# Datetime
import datetime
from datetime import date
import time

# Data Preprocessing
import pandas as pd
import os
import re
import random
import glob
from io import BytesIO
from pathlib import Path

# Working with JSON
import json
from pandas import json_normalize

# Web Scraping
from selenium import webdriver
from bs4 import BeautifulSoup
import requests

# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Progress Bar
from tqdm import tqdm

# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML

# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

print('Setup Complete')

Setup Complete


In [2]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))

Python: 3.13.2
NumPy: 2.2.4
pandas: 2.2.3
matplotlib: 3.10.1


### Defined Variables and Lists

##### Today's Date 

In [3]:
# Define today's date as a DDMMYYYY string (no separators)
todays_date = datetime.datetime.now().strftime('%d%m%Y')

##### Season

In [4]:
# Define variables and lists

## Define season
season = '2020'    # '2020' for the 20/21 season

# Create 'Full Season' and 'Short Season' strings

## Full season
full_season_string = str(int(season)) + '/' + str(int(season) + 1)

## Short season, e.g. '2021' for the 20/21 season
short_season_string = str(int(season))[-2:] + str(int(season) + 1)[-2:]
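
As a quick check of the string handling above, the sketch below wraps the same expressions in a small helper and shows what they produce for a starting year. The function name `season_strings` is hypothetical and is not part of the original notebook.

```python
def season_strings(season):
    """Return (full, short) season labels for a starting year such as '2020'."""
    start = int(season)
    # Full label joins the two calendar years, e.g. '2020/2021'
    full = f'{start}/{start + 1}'
    # Short label joins the last two digits of each year, e.g. '2021'
    short = f'{str(start)[-2:]}{str(start + 1)[-2:]}'
    return full, short


full_season_string, short_season_string = season_strings('2020')
print(full_season_string, short_season_string)
```

Note that the short label for the 20/21 season is `'2021'`, which matches the suffix used in list names such as `lst_teams_pl_2021`.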

##### Scraping Variables

In [5]:
options = webdriver.ChromeOptions()

In [6]:
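# Run Chrome without a visible window; the two extra flags are commonly needed in containers and CI environments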
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

##### Teams and Leagues

In [7]:
# Premier League

## 2013-2014 PL


## 2014-2015 PL


## 2015-2016 PL


## 2016-2017 PL
lst_teams_pl_1617 = ['arsenal', 'bournemouth', 'burnley', 'chelsea', 'crystal-palace', 'everton',
             'hull-city', 'leicester', 'liverpool', 'manchester-city', 'manchester-united',
             'middlesbrough', 'southampton', 'stoke-city', 'sunderland', 'swansea', 'tottenham',
             'watford', 'west-bromwich', 'west-ham']

## 2017-2018 PL
lst_teams_pl_1718 = ['arsenal', 'bournemouth', 'brighton', 'burnley', 'chelsea', 'crystal-palace', 'everton',
             'huddersfield', 'leicester', 'liverpool', 'manchester-city', 'manchester-united',
             'newcastle', 'southampton', 'stoke-city', 'swansea', 'tottenham',
             'watford', 'west-bromwich', 'west-ham']

## 2018-2019 PL
lst_teams_pl_1819 = ['arsenal', 'bournemouth', 'brighton', 'burnley', 'cardiff', 'chelsea',
             'crystal-palace', 'everton', 'fulham', 'huddersfield', 'leicester',
             'liverpool', 'manchester-city', 'manchester-united', 'newcastle',
             'southampton', 'tottenham', 'watford', 'west-ham', 'wolverhampton']

## 2019-2020 PL
lst_teams_pl_1920 = ['arsenal', 'aston-villa', 'bournemouth', 'brighton', 'burnley', 'chelsea',
             'crystal-palace', 'everton', 'leicester',
             'liverpool', 'manchester-city', 'manchester-united', 'newcastle',
             'norwich', 'sheffield-united', 'southampton', 'tottenham', 'watford',
             'west-ham', 'wolverhampton']

## 2020-2021 PL
lst_teams_pl_2021 = ['arsenal', 'aston-villa', 'brighton', 'burnley', 'chelsea',
             'crystal-palace', 'everton', 'fulham', 'leeds', 'leicester',
             'liverpool', 'manchester-city', 'manchester-united', 'newcastle',
             'sheffield-united', 'southampton', 'tottenham', 'west-bromwich',
             'west-ham', 'wolverhampton']

## 2021-2022 PL
lst_teams_pl_2122 = ['arsenal', 'aston-villa', 'brentford', 'brighton', 'burnley', 'chelsea',
             'crystal-palace', 'everton', 'leeds', 'leicester',
             'liverpool', 'manchester-city', 'manchester-united', 'newcastle', 'norwich',
             'southampton', 'tottenham', 'watford', 'west-ham', 'wolverhampton']

In [8]:
# Serie A

## 2013-2014 Serie A
#lst_teams_sa_1314 = ['']

## 2014-2015 Serie A

## 2015-2016 Serie A
lst_teams_sa_1516 = ['ac-milan', 'atalanta', 'bologna', 'carpi', 'chievo-verona', 'empoli', 'fiorentina', 'frosinone',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'napoli', 'palermo', 'roma',
                     'sampdoria', 'sassuolo', 'torino', 'udinese']

## 2016-2017 Serie A
lst_teams_sa_1617 = ['ac-milan', 'atalanta', 'bologna', 'cagliari', 'chievo-verona', 'crotone', 'empoli', 'fiorentina',
                     'genoa', 'inter-milan', 'juventus', 'lazio', 'napoli', 'palermo', 'pescara', 'roma',
                     'sampdoria', 'sassuolo', 'torino', 'udinese']

## 2017-2018 Serie A
lst_teams_sa_1718 = ['ac-milan', 'atalanta', 'benevento', 'bologna', 'cagliari', 'chievo-verona', 'crotone', 'fiorentina',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'napoli', 'roma',
                     'sampdoria', 'sassuolo', 'spal', 'torino', 'udinese']

## 2018-2019 Serie A
lst_teams_sa_1819 = ['ac-milan', 'atalanta', 'bologna', 'cagliari', 'chievo-verona', 'empoli', 'fiorentina',
                     'frosinone', 'genoa', 'inter-milan', 'juventus', 'lazio', 'napoli', 'parma', 'roma',
                     'sampdoria', 'sassuolo', 'spal', 'torino', 'udinese']

## 2019-2020 Serie A
lst_teams_sa_1920 = ['ac-milan', 'atalanta', 'bologna', 'brescia', 'cagliari', 'fiorentina',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'lecce', 'napoli', 'parma', 'roma',
                     'sampdoria', 'sassuolo', 'spal', 'torino', 'udinese']

## 2020-2021 Serie A
lst_teams_sa_2021 = ['ac-milan', 'atalanta', 'benevento', 'bologna', 'cagliari', 'crotone', 'fiorentina',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'napoli', 'parma', 'roma',
                     'sampdoria', 'sassuolo', 'spezia', 'torino', 'udinese']

## 2021-2022 Serie A
lst_teams_sa_2122 = ['ac-milan', 'atalanta', 'bologna', 'cagliari', 'empoli', 'fiorentina',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'napoli', 'roma', 'salernitana',
                     'sampdoria', 'sassuolo', 'spezia', 'torino', 'udinese', 'venezia']

In [9]:
# La Liga

## 2013-2014 La Liga


## 2014-2015 La Liga


## 2015-2016 La Liga


## 2016-2017 La Liga
lst_teams_ll_1617 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'celta-vigo', 'deportivo', 'eibar', 'espanyol',
                     'granada', 'las-palmas', 'malaga', 'osasuna', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'sporting-gijon', 'valencia', 'villarreal']

## 2017-2018 La Liga
lst_teams_ll_1718 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'celta-vigo', 'deportivo', 'eibar', 'espanyol',
                     'getafe', 'girona', 'las-palmas', 'levante', 'malaga', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia', 'villarreal']

## 2018-2019 La Liga
lst_teams_ll_1819 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'celta-vigo', 'eibar', 'espanyol',
                     'getafe', 'girona', 'huesca', 'leganes', 'levante', 'rayo-vallecano', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia', 'valladolid']

## 2019-2020 La Liga
lst_teams_ll_1920 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'celta-vigo', 'eibar', 'espanyol',
                     'getafe', 'granada', 'leganes', 'levante', 'mallorca', 'osasuna', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia', 'valladolid']

## 2020-2021 La Liga
lst_teams_ll_2021 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'cadiz', 'celta-vigo', 'eibar',
                     'elche', 'getafe', 'granada', 'huesca', 'levante', 'osasuna', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia', 'valladolid']

## 2021-2022 La Liga
lst_teams_ll_2122 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'cadiz', 'celta-vigo', 'elche',
                     'espanyol', 'getafe', 'granada', 'levante', 'mallorca', 'osasuna', 'rayo-vallecano', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia']

In [10]:
# Bundesliga

## 2013-2014 Bundesliga


## 2014-2015 Bundesliga


## 2015-2016 Bundesliga


## 2016-2017 Bundesliga
lst_teams_b_1617 = ['augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund', 'darmstadt',
                    'eintracht-frankfurt', 'freiburg', 'hamburg', 'hertha-berlin', 'hoffenheim',
                    'ingolstadt', 'koln', 'leipzig', 'mainz', 'monchengladbach', 'schalke-04', 'werder-bremen', 
                    'wolfsburg']

## 2017-2018 Bundesliga
lst_teams_b_1718 = ['augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund', 'eintracht-frankfurt',
                    'freiburg', 'hamburg', 'hannover', 'hertha-berlin', 'hoffenheim', 'koln',
                    'leipzig', 'mainz', 'monchengladbach', 'schalke-04', 'stuttgart', 'werder-bremen', 
                    'wolfsburg']

## 2018-2019 Bundesliga
lst_teams_b_1819 = ['augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund', 'dusseldorf',
                    'eintracht-frankfurt', 'freiburg', 'hannover', 'hertha-berlin', 'hoffenheim',
                    'leipzig', 'mainz', 'monchengladbach', 'nurnberg', 'schalke-04', 'stuttgart', 'werder-bremen', 
                    'wolfsburg']

## 2019-2020 Bundesliga
lst_teams_b_1920 = ['augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund', 'dusseldorf',
                    'eintracht-frankfurt', 'freiburg', 'hertha-berlin', 'hoffenheim', 'koln',
                    'leipzig', 'mainz', 'monchengladbach', 'paderborn', 'schalke-04', 'union-berlin', 'werder-bremen', 
                    'wolfsburg']

## 2020-2021 Bundesliga
lst_teams_b_2021 = ['arminia-bielefeld', 'augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund',
                    'eintracht-frankfurt', 'freiburg', 'hertha-berlin', 'hoffenheim', 'leipzig', 'mainz', 'monchengladbach',
                    'schalke-04', 'stuttgart', 'union-berlin', 'werder-bremen', 'wolfsburg']

## 2021-2022 Bundesliga
lst_teams_b_2122 = ['arminia-bielefeld', 'augsburg', 'bayer-leverkusen', 'bayern-munich', 'bochum', 'borussia-dortmund',
                    'eintracht-frankfurt', 'freiburg', 'furth', 'hertha-berlin', 'hoffenheim', 'koln', 'leipzig', 'mainz', 'monchengladbach',
                    'stuttgart', 'union-berlin', 'wolfsburg']

In [11]:
# 2.Bundesliga

## 2013-2014 2.Bundesliga


## 2014-2015 2.Bundesliga


## 2015-2016 2.Bundesliga


## 2016-2017 2.Bundesliga

## 2017-2018 2.Bundesliga

## 2018-2019 2.Bundesliga
lst_teams_b2_1819 = ['arminia-bielefeld', 'bochum', 'darmstadt', 'dynamo-dresden',
                     'erzgebirge-aue', 'furth', 'hamburg', 'hannover',
                     'heidenheim', 'holstein-kiel', 'jahn',
                     'karlsruher', 'nurnberg', 'osnabruck', 'sandhausen',
                     'st-pauli', 'stuttgart', 'wehen'
                    ]

## 2019-2020 2.Bundesliga
lst_teams_b2_1920 = ['bochum', 'braunschweiger', 'darmstadt', 'dusseldorf',
                     'erzgebirge-aue', 'furth', 'hamburg', 'hannover',
                     'heidenheim', 'holstein-kiel', 'jahn',
                     'karlsruher', 'nurnberg', 'osnabruck', 'paderborn', 'sandhausen',
                     'st-pauli', 'werder-bremen'
                    ]

## 2020-2021 2.Bundesliga
lst_teams_b2_2021 = ['darmstadt', 'dusseldorf', 'dynamo-dresden',
                     'erzgebirge-aue', 'hamburg', 'hannover', 'hansa-rostock',
                     'heidenheim', 'holstein-kiel', 'ingolstadt', 'jahn',
                     'karlsruher', 'nurnberg', 'paderborn', 'sandhausen',
                     'schalke-04', 'st-pauli', 'werder-bremen'
                    ]

In [12]:
# Ligue 1

## 2013-2014 


## 2014-2015


## 2015-2016


## 2016-2017 
lst_teams_l1_1617 = ['angers', 'bastia', 'bordeaux', 'caen', 'dijon', 'guingamp', 'lille', 'lorient',
                     'lyon', 'marseille', 'metz', 'monaco', 'montpellier', 'nancy', 'nantes', 'nice', 'psg', 'rennes', 
                     'st-etienne', 'toulouse'
                    ]

## 2017-2018 
lst_teams_l1_1718 = ['amiens', 'angers', 'bordeaux', 'caen', 'dijon', 'guingamp', 'lille', 'lyon', 'marseille',
                     'metz', 'monaco', 'montpellier', 'nantes', 'nice', 'psg', 'rennes', 
                     'st-etienne', 'strasbourg', 'toulouse', 'troyes'
                    ]

## 2018-2019 
lst_teams_l1_1819 = ['amiens', 'angers', 'bordeaux', 'caen', 'dijon', 'guingamp', 'lille', 'lyon', 'marseille',
                     'monaco', 'montpellier', 'nantes', 'nice', 'nimes', 'psg', 'reims', 'rennes', 
                     'st-etienne', 'strasbourg', 'toulouse'
                    ]

## 2019-2020
lst_teams_l1_1920 = ['amiens', 'angers', 'bordeaux', 'brest', 'dijon', 'lille', 'lyon', 'marseille',
                     'metz', 'monaco', 'montpellier', 'nantes', 'nice', 'nimes', 'psg', 'reims', 'rennes', 
                     'st-etienne', 'strasbourg', 'toulouse'
                    ]

## 2020-2021 
lst_teams_l1_2021 = ['angers', 'bordeaux', 'brest', 'dijon', 'lens', 'lille', 'lorient', 'lyon', 'marseille',
                     'metz', 'monaco', 'montpellier', 'nantes', 'nice', 'nimes', 'psg', 'reims', 'rennes', 
                     'st-etienne', 'strasbourg'
                    ]

## 2021-2022 
lst_teams_l1_2122 = ['angers', 'bordeaux', 'brest', 'clermont', 'lens', 'lille', 'lorient', 'lyon', 'marseille',
                     'metz', 'monaco', 'montpellier', 'nantes', 'nice', 'psg', 'reims', 'rennes', 
                     'st-etienne', 'strasbourg', 'troyes'
                    ]

In [13]:
# MLS

## 2013


## 2014


## 2015
lst_teams_mls_15 = ['chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-dallas', 'houston-dynamo', 'la-galaxy', 
                    'montreal-impact', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]


## 2016
lst_teams_mls_16 = ['chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-dallas', 'houston-dynamo', 'la-galaxy', 
                    'montreal-impact', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2017
lst_teams_mls_17 = ['atlanta-united', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-dallas', 'houston-dynamo', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2018
lst_teams_mls_18 = ['atlanta-united', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-dallas', 'houston-dynamo', 'la-fc', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2019 
lst_teams_mls_19 = ['atlanta-united', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-cincinnati', 'fc-dallas', 'houston-dynamo', 'inter-miami', 'la-fc', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'nashville-sc', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2020
lst_teams_mls_20 = ['atlanta-united', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-cincinnati', 'fc-dallas', 'houston-dynamo', 'inter-miami', 'la-fc', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'nashville-sc', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2021 
lst_teams_mls_21 = ['atlanta-united', 'austin', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-cincinnati', 'fc-dallas', 'houston-dynamo', 'inter-miami', 'la-fc', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'nashville-sc', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

In [14]:
# Belgian First Division A

## 2013-2014
lst_teams_belgian_1314 = ['anderlecht', 'cercle-brugges', 'charleroi', 
                          'club-brugges', 'genk',
                          'gent', 'kortrijk', 'leuven', 'lierse', 'lokeren', 'mechelen', 
                          'mons', 'oostende', 'standard-liege',
                          'waasland-beveren', 'zulte-waregem'
                         ]

## 2014-2015
lst_teams_belgian_1415 = ['anderlecht', 'charleroi', 
                          'club-brugges', 'excel-mouscron', 'genk',
                          'gent', 'kortrijk', 'lokeren', 'mechelen', 'oostende',
                          'sint-truidense', 'standard-liege',
                          'waasland-beveren', 'westerlo', 'zulte-waregem'
                         ]

## 2015-2016
lst_teams_belgian_1516 = ['anderlecht', 'charleroi', 
                          'club-brugges', 'excel-mouscron', 'genk',
                          'gent', 'kortrijk', 'lokeren', 'mechelen', 'oostende',
                          'sint-truidense', 'standard-liege',
                          'waasland-beveren', 'westerlo', 'zulte-waregem'
                         ]

## 2016-2017
lst_teams_belgian_1617 = ['anderlecht', 'charleroi', 
                          'club-brugges', 'eupen', 'excel-mouscron', 'genk',
                          'gent', 'kortrijk', 'lokeren', 'mechelen', 'oostende',
                          'sint-truidense', 'standard-liege',
                          'waasland-beveren', 'westerlo', 'zulte-waregem'
                         ]

## 2017-2018
lst_teams_belgian_1718 = ['anderlecht', 'charleroi', 
                          'club-brugges', 'eupen', 'excel-mouscron', 'genk',
                          'gent', 'kortrijk', 'lokeren', 'mechelen', 'oostende',
                          'royal-antwerp', 'sint-truidense', 'standard-liege',
                          'waasland-beveren', 'zulte-waregem'
                         ]

## 2018-2019
lst_teams_belgian_1819 = ['anderlecht', 'cercle-brugges', 'charleroi', 
                          'club-brugges', 'eupen', 'excel-mouscron', 'genk',
                          'gent', 'kortrijk', 'oostende',
                          'royal-antwerp', 'sint-truidense', 'standard-liege',
                          'waasland-beveren', 'zulte-waregem'
                         ]

## 2019-2020
lst_teams_belgian_1920 = ['anderlecht', 'cercle-brugges', 'charleroi', 
                          'club-brugges', 'eupen', 'excel-mouscron', 'genk',
                          'gent', 'kortrijk', 'mechelen', 'oostende',
                          'royal-antwerp', 'sint-truidense', 'standard-liege',
                          'waasland-beveren', 'zulte-waregem'
                         ]

## 2020-2021
lst_teams_belgian_2021 = ['anderlecht', 'beerschot-va', 'cercle-brugges',
                          'charleroi', 'club-brugges', 'eupen', 'excel-mouscron', 'genk',
                          'gent', 'kortrijk', 'leuven', 'mechelen', 'oostende',
                          'royal-antwerp', 'sint-truidense', 'standard-liege',
                          'waasland-beveren', 'zulte-waregem'
                         ]

## 2021-2022
lst_teams_belgian_2122 = ['anderlecht', 'beerschot-va', 'cercle-brugges',
                          'charleroi', 'club-brugges', 'eupen', 'genk',
                          'gent', 'kortrijk', 'leuven', 'mechelen', 'oostende',
                          'royal-antwerp', 'seraing', 'sint-truidense', 'standard-liege',
                          'union-sg', 'zulte-waregem'
                         ]

In [15]:
# SPL

## 2013-2014


## 2014-2015


## 2015-2016


## 2016-2017


## 2017-2018


## 2018-2019


## 2019-2020
lst_teams_spl_1920 = ['aberdeen', 'celtic', 'hamilton', 'hearts', 'hibernian',
                      'kilmarnock', 'livingston', 'motherwell', 'rangers', 'ross-county',
                      'st-johnstone', 'st-mirren'
                     ]

## 2020-2021
lst_teams_spl_2021 = ['aberdeen', 'celtic', 'dundee-united', 'hamilton', 'hibernian',
                      'kilmarnock', 'livingston', 'motherwell', 'rangers', 'ross-county',
                      'st-johnstone', 'st-mirren'
                     ]

## 2021-2022
lst_teams_spl_2122 = ['aberdeen', 'celtic', 'dundee', 'dundee-united', 'hearts', 'hibernian',
                      'livingston', 'motherwell', 'rangers', 'ross-county',
                      'st-johnstone', 'st-mirren'
                     ]

##### Seasons

In [16]:
lst_seasons = ['2016-2017', '2017-2018', '2018-2019', '2019-2020', '2020-2021']

### Defined Filepaths

In [17]:
base_dir = r'C:\Users\Arnau Climent\OneDrive\Documentos\1_MASTER\PORTFOLIO\scrap_football\capology'

# Set up initial paths to subfolders


data_dir = os.path.join(base_dir, 'data')
data_dir_capology = os.path.join(base_dir)
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')

#### Previous season scraper

In [None]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import os
from datetime import date
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from io import StringIO

# Relies on `data_dir_capology`, defined in the 'Defined Filepaths' cell above

# Define function for scraping a defined season of Capology data
def scrape_capology_season_prev(lst_teams, season, comp):
    """
    Scrapes Capology data for a given season and competition, handling existing files.

    Args:
        lst_teams (list): List of team names.
        season (str): The season to scrape (e.g., '2020-2021').
        comp (str): The competition (e.g., 'premier-league').

    Returns:
        pd.DataFrame: A unified DataFrame containing the scraped data for all teams.
                      Returns an empty DataFrame on error.
    """
    ### Print statement
    print(f'Scraping for {comp} for the {season} season has now started...')

    ## Create empty list for DataFrames
    dfs_players = []

    for team in lst_teams:
        # Construct the filepath
        filepath = os.path.join(data_dir_capology, 'raw', comp, season, f'{team}_{comp}_{season}.csv')

        if not os.path.exists(filepath):
            # Configure headless Chrome (note: this runs once per team, inside the loop)
            options = Options()
            options.add_argument("--headless")
            options.add_argument("--disable-gpu")
            options.add_argument("--no-sandbox")

            url = f'https://www.capology.com/club/{team}/salaries/{season}/'
            print(f'Scraping {team} for the {season} season')
            try:
                wd = webdriver.Chrome(options=options)  # Pass options here
                wd.get(url)
                # Wait for the table to load.  Adjust the timeout and locator as needed.
                try:
                    WebDriverWait(wd, 10).until(
                        EC.presence_of_element_located((By.XPATH, '//table'))  # Find the table
                    )
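                except Exception:  # e.g. Selenium's TimeoutException when no table appears in time
                    print(f"Error: Table not found on Capology for {team} after 10 seconds.")
                    dfs_players.append(pd.DataFrame())
                    continue  # the outer `finally` block closes the browser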

                html = wd.page_source
                dfs = pd.read_html(StringIO(html), header=None)  # Read all tables, no header

                if len(dfs) > 0:
                    df = dfs[0]  # default to the first table
                    print(f"Read table with index 0 for {team}")
                else:
                    print(f"Error: No tables found on Capology for {team}.")
                    dfs_players.append(pd.DataFrame())
                    continue  # the outer `finally` block closes the browser

                ### Data Engineering
                # Check if the DataFrame has at least 2 rows before proceeding.
                if len(df) > 1:
                    try:
                        print(f"DataFrame shape before slicing: {df.shape}")
                        print(df.head())
                        # NOTE: `df.iloc[:, :]` selects every row and column, so nothing is dropped here
                        df = df.iloc[:, :]
                        print(f"DataFrame shape after column selection (all columns kept): {df.shape}")
                        print(df.head())
                        df = df[:-1]
                        print(f"DataFrame shape after dropping last row: {df.shape}")
                        print(df.head())
                        df = df.reset_index(drop=True)
                        print(f"DataFrame shape after resetting index: {df.shape}")
                        print(df.head())
                        df = df.drop(['Rank'], axis=1, errors='ignore')
                        print(f"DataFrame shape after dropping 'Rank' column: {df.shape}")
                        print(df.head())
                    except IndexError as e:
                        print(f"Data Engineering Error for {team}: {e}")
                        print(f"DataFrame shape before error: {df.shape}")
                        print(df.head())
                        df = pd.DataFrame()  # Assign an empty DataFrame to df
                elif len(df) == 1:
                    try:
                        print(f"DataFrame shape (len 1) before slicing: {df.shape}")
                        print(df.head())
                        df = df.iloc[0:0] # return an empty dataframe with the same columns
                        print(f"DataFrame shape after slicing (len 1): {df.shape}")
                        print(df.head())
                        df = df.iloc[:, 1:] #remove first column
                        print(f"DataFrame shape after dropping first column (len 1): {df.shape}")
                        print(df.head())
                        df = df.reset_index(drop=True)
                        print(f"DataFrame shape after reset index (len 1): {df.shape}")
                        print(df.head())
                        df = df.drop(['Rank'], axis=1, errors='ignore')
                        print(f"DataFrame shape after dropping Rank (len 1): {df.shape}")
                        print(df.head())
                    except IndexError as e:
                        print(f"Data Engineering Error for {team} (len 1): {e}")
                        print(f"DataFrame shape before error (len 1): {df.shape}")
                        print(df.head())
                        df = pd.DataFrame()
                else:
                    print(f"Error:  No data rows found for {team}.")
                    df = pd.DataFrame() # assign empty dataframe

                df['Team'] = team
                df['Team'] = df['Team'].str.replace('-', ' ').str.title().str.replace('Fc', 'FC').str.replace('Ac', 'AC')
                df['League'] = comp
                df['League'] = df['League'].str.replace('-', ' ').str.title()
                df['Season'] = season
                print(f'Saving DataFrame of {team} for the {season} season')

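                ### Save to CSV (create the season folder first; to_csv does not create missing directories)
                os.makedirs(os.path.dirname(filepath), exist_ok=True)
                df.to_csv(filepath)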

                ### Append to joint DataFrame
                dfs_players.append(df)
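            except Exception as e:
                print(f"Error scraping {team}: {e}")
                dfs_players.append(pd.DataFrame())
                continue
            finally:
                # Runs on success, on error and after `continue`, so the browser is closed here.
                # If Chrome itself failed to start, `wd` may never have been assigned, hence the guard.
                try:
                    wd.quit()
                except NameError:
                    pass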
        else:
            df = pd.read_csv(filepath, index_col=None, header=0)
            print(f'{team} already scraped and saved for the {season} season')
            dfs_players.append(df)

    ## Concatenate DataFrames to one DataFrame
    if dfs_players:
        df_players_all = pd.concat(dfs_players, ignore_index=True)
    else:
        df_players_all = pd.DataFrame()

    ## Engineer unified data
    if not df_players_all.empty:
        df_players_all['Team'] = df_players_all['Team'].str.replace('-', ' ').str.title().str.replace(r'\bFc\b', 'FC', regex=True).str.replace(r'\bAc\b', 'AC', regex=True)
        df_players_all['League'] = df_players_all['League'].str.replace('-', ' ').str.title()

    ## Save to CSV
    if not df_players_all.empty:
        df_players_all.to_csv(os.path.join(data_dir_capology, 'raw', comp, season, f'all_{comp}_{season}.csv'))

    ### Print statement
    print(f'Scraping for {comp} for the {season} season is now complete')

    ## Return unified season dataset
    return df_players_all
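
The `Team` and `League` columns above are derived from URL slugs by replacing hyphens, title-casing, and upper-casing `Fc` and `Ac`. The following is a minimal standalone sketch of that idea, not the notebook's own helper: `normalize_slug` is a hypothetical name, and it assumes the upper-casing should apply only to whole words.

```python
import re


def normalize_slug(slug: str) -> str:
    """Turn a URL slug such as 'ac-milan' into a display name such as 'AC Milan'."""
    name = slug.replace('-', ' ').title()
    # \b matches a word boundary, so only the standalone words 'Fc' and 'Ac' are upper-cased
    return re.sub(r'\b(Fc|Ac)\b', lambda m: m.group(1).upper(), name)


print(normalize_slug('ac-milan'))
print(normalize_slug('fc-cincinnati'))
print(normalize_slug('academica'))
```

Restricting the match to whole words means a name that merely begins with `Ac` is left as title-cased text.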


### Create Directory Structure

In [19]:
# Make the directory structure
for folder in ['bundesliga', 'championship', 'la-liga', 'ligue-1', 'mls', 'premier-league', 'serie-a']:
    path = os.path.join(data_dir_capology, 'raw', folder)
    # makedirs also creates missing parent folders and does not fail if the path already exists
    os.makedirs(path, exist_ok=True)
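
The combined season CSV is written to `raw/<comp>/<season>/`, while the loop above creates only the league-level folders. The sketch below shows one way to create the season level as well. `make_season_dirs` is a hypothetical helper, and the folder layout is assumed from the paths used in the scraper.

```python
import os


def make_season_dirs(base_dir, leagues, seasons):
    """Create base_dir/raw/<league>/<season> for every combination and return the paths."""
    created = []
    for league in leagues:
        for season in seasons:
            path = os.path.join(base_dir, 'raw', league, season)
            # exist_ok=True makes the call safe to repeat; missing parents are created too
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

In this notebook it could be called as `make_season_dirs(data_dir_capology, ['premier-league', 'serie-a'], ['2020-2021'])`.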

### Notebook Settings

In [20]:
# Display all columns of displayed pandas DataFrames
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None

## <a id='#section2'>2. Project Brief</a>
This Jupyter notebook is part of a series of notebooks that scrape, parse, engineer, and unify datasets for modeling purposes.

This particular notebook is one of several **web scraping** notebooks. It takes player salary data from [Capology](https://www.capology.com/), scrapes it using [Selenium](https://www.selenium.dev/) and [Beautifulsoup](https://pypi.org/project/beautifulsoup4/), and manipulates it as DataFrames using [pandas](http://pandas.pydata.org/).


---

<a id='section3'></a>

## <a id='#section3'>3. Data Scraping</a>

### <a id='#section3.1'>3.1. Introduction</a>
This notebook uses two different scrapers:
1. Previous seasons (`scrape_capology_season_prev`)
2. Current season (`scrape_capology_season_current`), which needs a separate function because the webpage structure is slightly different

### <a id='#section3.2'>3.2. Scrape data by League and Season</a>
The scraper currently iterates through manually written lists of teams per league and season, with each function call downloading one league/season. Ideally, a single loop would scrape all teams, leagues, and seasons in one command, but this requires a little more work in Selenium, which I may work on at a later date. As it stands, running the notebook scrapes data for the 17/18-20/21 seasons for the 'Big 5' European leagues + MLS.
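
As a rough illustration of the single loop described above, the sketch below runs one scraper function over a dictionary of league/season jobs and stacks the results. `scrape_all`, `jobs`, and `fake_scraper` are hypothetical names introduced only for this example. The stand-in scraper takes its arguments in the same order as `scrape_capology_season_prev` is called in this notebook (teams, season, competition), but it does not open a browser.

```python
import pandas as pd


def scrape_all(jobs, scrape_fn):
    """Run scrape_fn for every (competition, season) job and stack the results.

    jobs maps (competition, season) -> list of team slugs.
    scrape_fn is called as scrape_fn(teams, season, competition).
    """
    frames = []
    for (comp, season), teams in jobs.items():
        df = scrape_fn(teams, season, comp)
        if not df.empty:
            frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Stand-in scraper so the sketch runs without Selenium (hypothetical)
def fake_scraper(teams, season, comp):
    return pd.DataFrame({'Team': teams, 'League': comp, 'Season': season})


jobs = {
    ('premier-league', '2020-2021'): ['arsenal', 'aston-villa'],
    ('serie-a', '2020-2021'): ['ac-milan'],
}
df_all = scrape_all(jobs, fake_scraper)
print(df_all.shape)
```

Passing the scraper in as an argument keeps the loop independent of Selenium, so the same driver could call either the previous-season or the current-season function.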

#### <a id='#section3.2.1'>3.2.1. Premier League</a>

In [24]:
# Create DataFrame using the 'scrape_capology_season_prev' function, passing - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
#df_players_all = scrape_capology_season_prev(lst_teams_pl_2021, '2020-2021', 'premier-league')

## Display DataFrame
#df_players_all.head()
# Only the first team in the list
df_players_first = scrape_capology_season_prev([lst_teams_pl_2021[0]], '2020-2021', 'premier-league')


Scraping for premier-league for the 2020-2021 season has now started...
Scraping arsenal for the 2020-2021 season
Read table with index 0 for arsenal
DataFrame shape before slicing: (35, 7)
          Unnamed: 0_level_0  \
                      Player   
0                 Mesut Özil   
1  Pierre-Emerick Aubameyang   
2              Thomas Partey   
3        Alexandre Lacazette   
4                    Willian   

  Est. Base Salary  All salary figures are estimates and do not represent official figures.  \
                                                                            Gross P/W (GBP)   
0                                          £ 350,000                                          
1                                          £ 250,000                                          
2                                          £ 200,000                                          
3                                          £ 182,115                                          
4              

In [25]:
import os
import pandas as pd


# Path to the folder with the CSV files
folder_path = r'C:\Users\Arnau Climent\OneDrive\Documentos\1_MASTER\PORTFOLIO\scrap_football\capology\raw\premier-league\2020-2021'

# Desired names for the valid columns
new_column_names = [
    "ID", "Player", "Salario bruto por semana (GBP)", "Salario bruto por ano (GBP)", "Inflation (GBP)",
    "Pos", "Age", "Country", "Team", "League", "Season"
]

# Note: this cell overwrites the files in place and is not idempotent;
# re-running it on an already processed file drops that file's first data row.
for filename in os.listdir(folder_path):
    if filename.endswith(".csv"):
        file_path = os.path.join(folder_path, filename)

        # Read without treating any row as the header
        df = pd.read_csv(file_path, header=None)

        # Drop the two original header rows (column group row + column name row)
        df = df.iloc[2:].reset_index(drop=True)

        # Keep only the first 11 columns
        df = df.iloc[:, :11]

        # Rename columns
        df.columns = new_column_names

        # Save the file, overwriting the original
        df.to_csv(file_path, index=False)

        print(f"✔ Processed and renamed: {filename}")



✔ Processed and renamed: all_premier-league_2020-2021.csv
✔ Processed and renamed: arsenal_premier-league_2020-2021.csv
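
The header cleanup above rewrites each file in place. A non-destructive alternative, sketched here under assumptions, is to skip the two header rows and assign the column names at read time. The CSV text and the three column names below are made-up stand-ins for a scraped file, not real Capology data.

```python
import io

import pandas as pd

# Tiny stand-in for a scraped file: two header rows, then the data (made-up values)
raw_csv = (
    ",Unnamed: 0_level_0,Est. Base Salary\n"
    ",Player,Gross P/W (GBP)\n"
    "0,Player A,\"£ 1,000\"\n"
    "1,Player B,\"£ 2,000\"\n"
)

names = ['ID', 'Player', 'Weekly gross (GBP)']

# skiprows=2 drops both header rows; header=None plus names= supplies the new labels
df = pd.read_csv(io.StringIO(raw_csv), skiprows=2, header=None, names=names)
print(df)
```

Because the file on disk is never modified, this read can be repeated without losing rows.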


In [None]:
# Create DataFrame using the 'scrape_capology_season_prev' function, passing - 1) List of teams (e.g. lst_teams_pl_1819), 2) Season (e.g. 2018-2019), 3) Competition (e.g. premier-league)
#df_players_all = scrape_capology_season_prev(lst_teams_pl_1819, '2018-2019', 'premier-league')

## Display DataFrame
#df_players_all.head()

#### <a id='#section3.2.2'>3.2.2. Serie A</a>

In [47]:
# Create DataFrame using the 'scrape_capology_season_prev' function, passing - 1) List of teams (e.g. lst_teams_sa_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. serie-a)
df_players_all = scrape_capology_season_prev(lst_teams_sa_2021, '2020-2021', 'serie-a')

## Display DataFrame
df_players_all.head()

Scraping for serie-a for the 2020-2021 season has now started...
ac-milan already scraped and saved for the 2020-2021 season
atalanta already scraped and saved for the 2020-2021 season
benevento already scraped and saved for the 2020-2021 season
bologna already scraped and saved for the 2020-2021 season
cagliari already scraped and saved for the 2020-2021 season
crotone already scraped and saved for the 2020-2021 season
fiorentina already scraped and saved for the 2020-2021 season
genoa already scraped and saved for the 2020-2021 season
hellas-verona already scraped and saved for the 2020-2021 season
inter-milan already scraped and saved for the 2020-2021 season
juventus already scraped and saved for the 2020-2021 season
lazio already scraped and saved for the 2020-2021 season
napoli already scraped and saved for the 2020-2021 season
parma already scraped and saved for the 2020-2021 season
roma already scraped and saved for the 2020-2021 season
sampdoria already scraped and saved for t

| Unnamed: 0.1 | Unnamed: 0 | Player | Weekly GrossBase Salary(IN EUR) | Annual GrossBase Salary(IN EUR) | Adj. GrossBase Salary(2021, IN EUR) | Pos. | Age | Country | Team | League | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Gianluigi Donnarumma | € 213,654 | € 11,110,000 | € 11,110,000 | K | 21 | Italy | Ac Milan | Serie A | 2020-2021 |
| 1 | 1 | Zlatan Ibrahimovic | € 172,500 | € 8,970,000 | € 8,970,000 | F | 39 | Sweden | Ac Milan | Serie A | 2020-2021 |
| 2 | 2 | Alessio Romagnoli | € 124,615 | € 6,480,000 | € 6,480,000 | D | 25 | Italy | Ac Milan | Serie A | 2020-2021 |
| 3 | 3 | Hakan Calhanoglu | € 89,038 | € 4,630,000 | € 4,630,000 | F | 26 | Turkey | Ac Milan | Serie A | 2020-2021 |
| 4 | 4 | Ante Rebic | € 86,346 | € 4,490,000 | € 4,490,000 | F | 27 | Croatia | Ac Milan | Serie A | 2020-2021 |


#### <a id='#section3.2.3'>3.2.3. La Liga</a>

In [48]:
# Create DataFrame using the 'scrape_capology_season_prev' function, passing - 1) List of teams (e.g. lst_teams_ll_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. la-liga)
df_players_all = scrape_capology_season_prev(lst_teams_ll_2021, '2020-2021', 'la-liga')

## Display DataFrame
df_players_all.head()

Scraping for la-liga for the 2020-2021 season has now started...
alaves already scraped and saved for the 2020-2021 season
athletic-club already scraped and saved for the 2020-2021 season
atletico-madrid already scraped and saved for the 2020-2021 season
barcelona already scraped and saved for the 2020-2021 season
cadiz already scraped and saved for the 2020-2021 season
celta-vigo already scraped and saved for the 2020-2021 season
eibar already scraped and saved for the 2020-2021 season
elche already scraped and saved for the 2020-2021 season
getafe already scraped and saved for the 2020-2021 season
granada already scraped and saved for the 2020-2021 season
huesca already scraped and saved for the 2020-2021 season
levante already scraped and saved for the 2020-2021 season
osasuna already scraped and saved for the 2020-2021 season
real-betis already scraped and saved for the 2020-2021 season
real-madrid already scraped and saved for the 2020-2021 season
real-sociedad already scraped and

| Unnamed: 0.1 | Unnamed: 0 | Player | Weekly GrossBase Salary(IN EUR) | Annual GrossBase Salary(IN EUR) | Adj. GrossBase Salary(2021, IN EUR) | Pos. | Age | Country | Team | League | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Rodrigo Battaglia | € 52,885 | € 2,750,000 | € 2,750,000 | M | 29 | Argentina | Alaves | La Liga | 2020-2021 |
| 1 | 1 | Iñigo Córdoba | € 51,154 | € 2,660,000 | € 2,660,000 | F | 23 | Spain | Alaves | La Liga | 2020-2021 |
| 2 | 2 | Jota Peleteiro | € 34,808 | € 1,810,000 | € 1,810,000 | F | 29 | Spain | Alaves | La Liga | 2020-2021 |
| 3 | 3 | Florian Lejeune | € 28,846 | € 1,500,000 | € 1,500,000 | D | 29 | France | Alaves | La Liga | 2020-2021 |
| 4 | 4 | Fernando Pacheco | € 28,269 | € 1,470,000 | € 1,470,000 | K | 28 | Spain | Alaves | La Liga | 2020-2021 |


#### <a id='#section3.2.4'>3.2.4. Bundesliga</a>

In [49]:
# Create DataFrame using the 'scrape_capology_season_prev' function, passing - 1) List of teams (e.g. lst_teams_b_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. bundesliga)
df_players_all = scrape_capology_season_prev(lst_teams_b_2021, '2020-2021', 'bundesliga')

## Display DataFrame
df_players_all.head()

Scraping for bundesliga for the 2020-2021 season has now started...
arminia-bielefeld already scraped and saved for the 2020-2021 season
augsburg already scraped and saved for the 2020-2021 season
bayer-leverkusen already scraped and saved for the 2020-2021 season
bayern-munich already scraped and saved for the 2020-2021 season
borussia-dortmund already scraped and saved for the 2020-2021 season
eintracht-frankfurt already scraped and saved for the 2020-2021 season
freiburg already scraped and saved for the 2020-2021 season
hertha-berlin already scraped and saved for the 2020-2021 season
hoffenheim already scraped and saved for the 2020-2021 season
leipzig already scraped and saved for the 2020-2021 season
mainz already scraped and saved for the 2020-2021 season
monchengladbach already scraped and saved for the 2020-2021 season
schalke-04 already scraped and saved for the 2020-2021 season
stuttgart already scraped and saved for the 2020-2021 season
union-berlin already scraped and save

| Unnamed: 0.1 | Unnamed: 0 | Player | Weekly GrossBase Salary(IN EUR) | Annual GrossBase Salary(IN EUR) | Adj. GrossBase Salary(2021, IN EUR) | Pos. | Age | Country | Team | League | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Arne Maier | € 26,923 | € 1,400,000 | € 1,400,000 | M | 21 | Germany | Arminia Bielefeld | Bundesliga | 2020-2021 |
| 1 | 1 | Michel Vlap | € 19,231 | € 1,000,000 | € 1,000,000 | F | 23 | Netherlands | Arminia Bielefeld | Bundesliga | 2020-2021 |
| 2 | 2 | Mike van der Hoorn | € 14,231 | € 740,000 | € 740,000 | D | 28 | Netherlands | Arminia Bielefeld | Bundesliga | 2020-2021 |
| 3 | 3 | Joakim Nilsson | € 8,846 | € 460,000 | € 460,000 | D | 26 | Sweden | Arminia Bielefeld | Bundesliga | 2020-2021 |
| 4 | 4 | Marcel Hartel | € 8,269 | € 430,000 | € 430,000 | F | 24 | Germany | Arminia Bielefeld | Bundesliga | 2020-2021 |


#### <a id='#section3.2.6'>3.2.6. Ligue 1</a>

In [50]:
# Create DataFrame using the 'scrape_capology_season_prev' function, passing - 1) List of teams (e.g. lst_teams_l1_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. ligue-1)
df_players_all = scrape_capology_season_prev(lst_teams_l1_2021, '2020-2021', 'ligue-1')

## Display DataFrame
df_players_all.head()

Scraping for ligue-1 for the 2020-2021 season has now started...
angers already scraped and saved for the 2020-2021 season
bordeaux already scraped and saved for the 2020-2021 season
brest already scraped and saved for the 2020-2021 season
dijon already scraped and saved for the 2020-2021 season
lens already scraped and saved for the 2020-2021 season
lille already scraped and saved for the 2020-2021 season
lorient already scraped and saved for the 2020-2021 season
lyon already scraped and saved for the 2020-2021 season
marseille already scraped and saved for the 2020-2021 season
metz already scraped and saved for the 2020-2021 season
monaco already scraped and saved for the 2020-2021 season
montpellier already scraped and saved for the 2020-2021 season
nantes already scraped and saved for the 2020-2021 season
nice already scraped and saved for the 2020-2021 season
nimes already scraped and saved for the 2020-2021 season
psg already scraped and saved for the 2020-2021 season
reims alrea

| Unnamed: 0.1 | Unnamed: 0 | Player | Weekly GrossBase Salary(IN EUR) | Annual GrossBase Salary(IN EUR) | Adj. GrossBase Salary(2021, IN EUR) | Pos. | Age | Country | Team | League | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Sofiane Boufal | € 32,308 | € 1,680,000 | € 1,680,000 | F | 27 | Morocco | Angers | Ligue 1 | 2020-2021 |
| 1 | 1 | Ibrahim Amadou | € 23,462 | € 1,220,000 | € 1,220,000 | M | 27 | France | Angers | Ligue 1 | 2020-2021 |
| 2 | 2 | Loïs Diony | € 17,885 | € 930,000 | € 930,000 | F | 27 | France | Angers | Ligue 1 | 2020-2021 |
| 3 | 3 | Stéphane Bahoken | € 16,154 | € 840,000 | € 840,000 | F | 28 | Cameroon | Angers | Ligue 1 | 2020-2021 |
| 4 | 4 | Ismaël Traoré | € 15,000 | € 780,000 | € 780,000 | D | 34 | Cote d'Ivoire | Angers | Ligue 1 | 2020-2021 |


#### <a id='#section3.2.7'>3.2.7. MLS</a>

In [51]:
# Create DataFrame using the 'scrape_capology_season_prev' function, passing - 1) List of teams (e.g. lst_teams_mls_20), 2) Season (e.g. 2020), 3) Competition (e.g. mls)
df_players_all = scrape_capology_season_prev(lst_teams_mls_20, '2020', 'mls')

## Display DataFrame
df_players_all.head()

Scraping for mls for the 2020 season has now started...
atlanta-united already scraped and saved for the 2020 season
chicago-fire already scraped and saved for the 2020 season
colorado-rapids already scraped and saved for the 2020 season
columbus-crew already scraped and saved for the 2020 season
dc-united already scraped and saved for the 2020 season
fc-cincinnati already scraped and saved for the 2020 season
fc-dallas already scraped and saved for the 2020 season
houston-dynamo already scraped and saved for the 2020 season
inter-miami already scraped and saved for the 2020 season
la-fc already scraped and saved for the 2020 season
la-galaxy already scraped and saved for the 2020 season
minnesota-united already scraped and saved for the 2020 season
montreal-impact already scraped and saved for the 2020 season
nashville-sc already scraped and saved for the 2020 season
ne-revolution already scraped and saved for the 2020 season
nyc-fc already scraped and saved for the 2020 season
ny-red

| Unnamed: 0.1 | Unnamed: 0 | Player | Weekly GrossBase Salary(IN USD) | Annual GrossBase Salary(IN USD) | Adj. GrossBase Salary(2021, IN USD) | Pos. | Age | Country | Team | League | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Josef Martínez | $ 58,808 | $ 3,058,000 | $ 3,058,000 | F | 26 | Venezuela | Atlanta United | Mls | 2020 |
| 1 | 1 | Ezequiel Barco | $ 27,404 | $ 1,425,000 | $ 1,425,000 | F | 20 | Argentina | Atlanta United | Mls | 2020 |
| 2 | 2 | Gonzalo Martínez | $ 17,308 | $ 900,000 | $ 900,000 | F | 26 | Argentina | Atlanta United | Mls | 2020 |
| 3 | 3 | Brad Guzan | $ 13,077 | $ 680,004 | $ 680,004 | K | 35 | United States | Atlanta United | Mls | 2020 |
| 4 | 4 | Matheus Rossetto | $ 8,000 | $ 416,000 | $ 416,000 | M | 23 | Brazil | Atlanta United | Mls | 2020 |


#### <a id='#section3.2.8'>3.2.8. Belgian First Division A</a>

In [None]:
# Create DataFrame using the 'scrape_capology_season_prev' function, passing - 1) List of teams (e.g. lst_teams_belgian_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. belgian-first-division-a)
df_players_all = scrape_capology_season_prev(lst_teams_belgian_2021, '2020-2021', 'belgian-first-division-a')

## Display DataFrame
df_players_all.head()

#### <a id='#section3.2.9'>3.2.9. Scottish Premiership</a>

In [None]:
# TO ADD CODE HERE

#### <a id='#section3.2.10'>3.2.10. Championship</a>

In [None]:
# TO ADD CODE HERE

---

<a id='section4'></a>

## <a id='#section4'>4. Data Clean</a>

In [39]:

# Import data as a pandas DataFrame, df_capology_raw
df_capology_raw = pd.read_csv(r"C:\Users\Arnau Climent\OneDrive\Documentos\1_MASTER\PORTFOLIO\scrap_football\capology\raw\premier-league\2020-2021\arsenal_premier-league_2020-2021.csv", encoding='utf-8')


# If that does not work, try 'latin1' or 'iso-8859-1'
# df = pd.read_csv('file.csv', encoding='latin1')

print(df_capology_raw.head())


   ID                     Player  Salario bruto por semana (GBP)  \
0   0                 Mesut Özil                        350000.0   
1   1  Pierre-Emerick Aubameyang                        250000.0   
2   2              Thomas Partey                        200000.0   
3   3        Alexandre Lacazette                        182115.0   
4   4                    Willian                        138462.0   

   Salario bruto por ano (GBP)  Inflation (GBP) Pos   Age  Country     Team  \
0                   18200000.0       20976271.0   F  32.0  Germany  Arsenal   
1                   13000000.0       14983051.0   F  31.0    Gabon  Arsenal   
2                   10400000.0       11986441.0   M  27.0    Ghana  Arsenal   
3                    9470000.0       10914576.0   F  29.0   France  Arsenal   
4                    7200000.0        8298305.0   F  32.0   Brazil  Arsenal   

           League     Season  
0  Premier League  2020-2021  
1  Premier League  2020-2021  
2  Premier League  2020

In [40]:
df_capology_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ID                              34 non-null     int64  
 1   Player                          34 non-null     object 
 2   Salario bruto por semana (GBP)  34 non-null     float64
 3   Salario bruto por ano (GBP)     34 non-null     float64
 4   Inflation (GBP)                 34 non-null     float64
 5   Pos                             34 non-null     object 
 6   Age                             34 non-null     float64
 7   Country                         34 non-null     object 
 8   Team                            34 non-null     object 
 9   League                          34 non-null     object 
 10  Season                          34 non-null     object 
dtypes: float64(4), int64(1), object(6)
memory usage: 3.1+ KB


In [37]:
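df_capology_raw['ID'] = df_capology_raw['ID'].astype('int64')
cols_to_convert = ['Salario bruto por semana (GBP)', 'Salario bruto por ano (GBP)', 'Inflation (GBP)']
for col in cols_to_convert:
    # These columns were already read in as float64 (see the info() output above), so calling
    # .str on them directly raised "AttributeError: Can only use .str accessor with string values!".
    # Casting to str first handles both raw strings such as '£ 350,000' and numeric values.
    cleaned = df_capology_raw[col].astype(str).str.replace('£', '', regex=False).str.replace(',', '', regex=False).str.strip()
    df_capology_raw[col] = cleaned.astype('float64')

print(df_capology_raw.dtypes)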
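
Salary columns can arrive either as strings such as `£ 350,000` (the raw scrape output) or as floats (after a CSV has been cleaned and re-read), and the `.str` accessor only works on the former. The sketch below is one possible helper that accepts both. `parse_money` is a hypothetical name, and it assumes every non-numeric character (currency symbol, thousands separator, spaces) can simply be discarded.

```python
import pandas as pd


def parse_money(series: pd.Series) -> pd.Series:
    """Convert values like '£ 350,000' or '€ 11,110,000' to floats.

    Already-numeric input passes through unchanged, so the function can be re-run safely.
    """
    if pd.api.types.is_numeric_dtype(series):
        return series.astype('float64')
    # Keep digits, the decimal point and a minus sign; drop everything else
    cleaned = series.astype(str).str.replace(r'[^0-9.\-]', '', regex=True)
    # Anything that still cannot be parsed becomes NaN rather than raising
    return pd.to_numeric(cleaned, errors='coerce')


print(parse_money(pd.Series(['£ 350,000', '€ 11,110,000', '$ 680,004'])).tolist())
print(parse_money(pd.Series([350000.0, 250000.0])).tolist())
```

Because the pattern strips any currency symbol, the same helper would cover the GBP, EUR, and USD columns seen in the league outputs.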

---

<a id='section5'></a>

## <a id='#section5'>5. Export Data</a>

In [43]:
# Export DataFrames
df_capology_raw.to_csv(r"C:\Users\Arnau Climent\OneDrive\Documentos\1_MASTER\PORTFOLIO\scrap_football\capology\raw\premier-league\2020-2021\arsenal_premier-league_2020-2021.csv", index=False, header=True, encoding='utf-8-sig')
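
The `Unnamed: 0` and `Unnamed: 0.1` columns in the league outputs of section 3.2 are the usual sign of a DataFrame being written with its index and then read back, once per round trip. The small self-contained example below uses made-up data and in-memory buffers instead of files to show the difference that `index=False` makes.

```python
import io

import pandas as pd

df = pd.DataFrame({'Player': ['Player A', 'Player B'], 'Age': [21, 25]})

# Default to_csv() writes the index as an unnamed first column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False leaves the index out, so the round trip preserves the columns
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))     # ['Unnamed: 0', 'Player', 'Age']
print(list(without_index.columns))  # ['Player', 'Age']
```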

---

<a id='section6'></a>

## <a id='#section6'>6. Summary</a>
This notebook scrapes player statistics data from [Capology](https://www.capology.com/), using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames, and [Selenium](https://www.selenium.dev/) and [Beautifulsoup](https://pypi.org/project/beautifulsoup4/) for web scraping.

___

<a id='section7'></a>

## <a id='#section7'>7. Next Steps</a>
This data is now ready to be engineered before being matched to other datasets, such as data from [FBref](https://fbref.com/) and [TransferMarkt](https://www.transfermarkt.co.uk/).
