# Final Project: Money in European Soccer

<img src='assets/euro.jpg'>

&emsp; As long as there is money to be made from professional sports, money will play a role in an individual's or team's success. In soccer, club owners invest in a variety of ways in the hope of improving the team, which leads to higher-quality play and a growing fanbase. These investments can fund improved training and medical facilities, stadium renovations, better coaches, data analysts and scientists, and more. All of these improvements can have a big impact on a team's success, but each game is still won and lost on the field, so having the right players to make the most of these advantages is critical.

&emsp; For this reason, the biggest investments from owners often come in the form of an increased transfer budget so the team can pay a transfer fee to acquire a desirable player currently under contract with another team. Through this process of buying and selling players in the 'transfer market', the best players are brought in for large fees to the best clubs year after year. In this project, I aim to examine how the money spent on transfers affects the final league position of a team in their domestic league.

## Part 1: Web Scraping

&emsp; I chose to scrape the data from the popular soccer website <a href='https://www.transfermarkt.us'>Transfermarkt.us</a>. The site holds information on soccer players, clubs, and leagues, with a focus on transfer news and records, so I knew it would have everything I needed.
<br><br>

<img src='assets/transfermarkt.jpg'>

<br>
&emsp; From the <a href='https://www.transfermarkt.us/wettbewerbe/europa'>European leagues and cups</a> page, I created a dictionary of the 15 most valuable European leagues, each mapped to the link for its own page on the site, and I was off to the races.

In [3]:
# Imports for Scraping.

# Libraries for obtaining and examining html.
import requests
import time
from bs4 import BeautifulSoup as bs

# Library for saving scraped data.
import pickle

# Library for marking missing values (np.nan) in the scraped data.
import numpy as np

### Scraping Challenges

#### Headers and Delay
&emsp; Unfortunately, 'Transfermarkt.us' was particular about the requests it accepted, so I had to disguise my requests with a browser-like 'User-Agent' passed through the 'headers' parameter and add a delay between requests with Python's 'time' library to look like a normal web surfer.
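
&emsp; The headers-plus-delay pattern can be wrapped in a small helper so that every request is sent the same way. The sketch below is only an illustration: `PoliteClient` and its parameters are hypothetical names, not part of this project, and it assumes the wrapped object exposes a `get(url, headers=...)` method the way a `requests.Session` does.

```python
import time


class PoliteClient:
    """Wrap a session-like object, sending fixed headers and pausing between calls."""

    def __init__(self, session, headers, delay_seconds=2.0, sleep=time.sleep):
        self.session = session            # anything with a .get(url, headers=...) method
        self.headers = headers            # e.g. a browser-like User-Agent
        self.delay_seconds = delay_seconds
        self._sleep = sleep               # injectable so the pause can be replaced in tests
        self._first_call = True

    def get(self, url):
        # Pause before every request except the very first one.
        if not self._first_call:
            self._sleep(self.delay_seconds)
        self._first_call = False
        return self.session.get(url, headers=self.headers)
```

With `requests` installed, something like `PoliteClient(requests.Session(), headers)` could stand in for the repeated `time.sleep(2)` and `requests.get(...)` pairs.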

#### Top Ten Teams
&emsp; I elected to scrape the results of the top ten teams (based on current players' market value) of each league instead of every team, because promotion and relegation mean that the other teams are more likely to have moved in and out of the 'First Tier'.

#### Starting Rows
&emsp; In my initial scraping attempts (note that the pickle file is 'v3'), I assumed that every team's page would be similarly structured. It turned out that most teams' transfer activity was tracked from the following season (24/25), but teams that already had a deal to sign a player for the 25/26 season started there instead. This led to mismatched season and transfer information, so I had to be more specific about where to start scraping from.

<img src='assets/transfers.jpg'>

After this mistake I made sure to confirm I was starting from the correct row for the 'transfer balance' data collection.
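
&emsp; The idea of locating the starting row by its season label, rather than assuming it is the first row, can be shown without any HTML. This is a minimal sketch: `find_start_index` is a hypothetical helper and the heading strings are made up for illustration, not copied from the site.

```python
def find_start_index(season_labels, season_suffix):
    """Return the index of the first label ending in season_suffix, or None."""
    for index, label in enumerate(season_labels):
        # Some rows may have no heading at all, so skip missing labels.
        if label is not None and label.strip().endswith(season_suffix):
            return index
    return None


# Made-up headings: future seasons are listed first, so the target is not row 0.
labels = ["Transfers 25/26", "Transfers 24/25", "Transfers 23/24", "Transfers 22/23"]
start = find_start_index(labels, "23")
rows_to_scrape = labels[start:] if start is not None else []
```

Returning `None` when nothing matches makes a missing season explicit instead of silently reusing an index from an earlier page.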

#### League Position Information
&emsp; With the transfer information secured, I also needed each team's final league position for each season. Fortunately, there was a page on the site that held that information as well as a bit of extra information about each season. The downside of using a transfer-focused website to collect all my data was that information not pertaining to transfers was not as carefully archived.

<img src='assets/league.jpg'>

&emsp; For this team there is no record of the seasons between 95/96 and 00/01. Had I not noticed this problem, the data for the 95/96 season would have been erroneously assigned to the 99/00 season, 94/95 to 98/99, and so on. To solve this, I only assigned the info from a 'League rankings' row if its season matched the season of the transfer information. As a result, some rows are missing league information for that season but still have the correct transfer information and league position. (If a row was missing, it was because the team was in a lower league at the time, so its position could be marked as '≤10'.)
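
&emsp; The matching rule can be sketched on plain lists: walk the transfer seasons in order and only consume a ranking row when its season label matches. The function name `align_rankings` and the sample values are hypothetical and chosen only to illustrate the gap-handling described above.

```python
def align_rankings(transfer_seasons, ranking_rows, fallback="≤10"):
    """Pair each transfer season with its ranking row, or a fallback when missing.

    transfer_seasons: season labels, newest first (e.g. '01', '00', ...).
    ranking_rows: (season_label, position) tuples, newest first, possibly with gaps.
    """
    aligned = {}
    pointer = 0  # index of the next unused ranking row
    for season in transfer_seasons:
        if pointer < len(ranking_rows) and ranking_rows[pointer][0] == season:
            aligned[season] = ranking_rows[pointer][1]
            pointer += 1  # advance only when the row was actually used
        else:
            aligned[season] = fallback
    return aligned


# Made-up example: rankings exist for '01' and '96' but not for the seasons between.
example = align_rankings(["01", "00", "99", "96"], [("01", "5"), ("96", "1")])
```

Because the pointer only moves on a match, the row for '96' waits until the transfer loop reaches '96' instead of being assigned to '00'.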

In [None]:
# Request European leagues page.

# Headers for disguising server request as a human.
headers = {'User-Agent': 
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}

url_prefix = "https://www.transfermarkt.us"

# Requesting page from Transfermarkt containing links to European leagues.
r_europe = requests.get(url_prefix + '/wettbewerbe/europa', headers=headers)

if r_europe.status_code == 200:
    europe = bs(r_europe.content)
else:
    # Nothing below can run without this page, so stop instead of continuing.
    raise RuntimeError(f'Error scraping Europe (status code {r_europe.status_code}).')

# Extract links for examined leagues.

table = europe.find('tbody')
rows = table.find_all("tr")

league_links = {}

# Every row in the table had an extra row so the top 30 rows gave us the top 15 leagues.
for row in rows[1:30]:
    flag = row.find(class_="flaggenrahmen")
    if flag:
        league_links[flag['alt']] = url_prefix + row.find('a')['href']

In [None]:
# Extract transfer data for top ten teams in examined leagues and store them in one big dictionary.
all_data = {}
for league in league_links:
    
    # League page request.
    time.sleep(2)
    r_league = requests.get(league_links[league], headers=headers)
    if (r_league.status_code == 200):
        c_league = bs(r_league.content)
    else:
        print(f'Error scraping {league}.')
        continue
    
    # League transfers page request.
    time.sleep(2)
    r_balance = requests.get(league_links[league].replace('startseite', 'transferbilanz'), headers=headers)
    if (r_balance.status_code == 200):
        c_balance = bs(r_balance.content)
    else:
        print(f'Error scraping {league} balance.')
        continue
              
    # Finding league transfer balance table and the correct row to start from.
    all_rows = c_balance.find(class_="items").find("tbody").find_all("tr")
    start_row = None
    for i, row in enumerate(all_rows):
        head = row.find("a")
        # Rows without a link cannot carry a season label, so skip them.
        if head and head.text.strip()[-2:] == '23':
            start_row = i
            break
    if start_row is None:
        print(f'No season ending in 23 found for {league} balance.')
        continue
    balance_rows = all_rows[start_row:]
    
    # Track progress while scraping by seeing which league is being scraped. 
    print(league)
    
    # Create a dictionary for each league in all data. 
    all_data[league] = {}
    
    # Finding team information from league table.
    table = c_league.find(class_='responsive-table')
    tbody = table.find('tbody')
    rows = tbody.find_all("tr")
    
    for i, row in enumerate(rows[:10]):
        link = row.find('a')
        team = link['title']
        all_data[league][team] = {}
               
        # Transfer and league rankings page's urls are similar to team overview page url.
        time.sleep(2)
        r_team = requests.get(url_prefix + link['href'][:-14].replace('startseite', 'alletransfers'), headers=headers)
        
        time.sleep(2)
        r_position = requests.get(url_prefix + link['href'][:-14].replace('startseite', 'platzierungen'), headers=headers)

        if r_team.status_code == 200 and r_position.status_code == 200:
            c_team = bs(r_team.content)
            c_position = bs(r_position.content)
        else:
            print(f'Error scraping {team}.')
            continue
        
        # Isolate useful information from webpages.
        season_rows = c_position.select("tbody tr")[2:]
        
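        # Find the correct row to start from.
        all_rows = c_team.find_all(class_='row')
        start_row = None
        for j, team_row in enumerate(all_rows):
            head = team_row.find("h2")
            if head and head.text.strip()[-2:] == '23':
                start_row = j
                break
        # Without this check a stale start_row from an earlier page would be reused.
        if start_row is None:
            print(f'No season ending in 23 found for {team}.')
            continue
        transfer_rows = all_rows[start_row:]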
                
        # Season row information to handle missing 'League rankings' rows.
        season_counter = 0
        season_max = len(season_rows)
        
        # League and transfer data going back 30 seasons.
        for i, season in enumerate(transfer_rows[:30]):
            
            # Year determined by the last two chars (E.g. '22/23')
            year = season.find("h2").text.strip()[-2:]
            transfer_tables = season.find_all(class_='box')
            
            # If there was no data for the revenue or spend of a season, that value will be set to 0. 
            transfer_revenue = transfer_tables[1].select('tfoot td')
            revenue = '0' 
            if len(transfer_revenue) > 0:
                revenue = transfer_revenue[0].text
                
            transfer_spend = transfer_tables[0].select('tfoot td')
            spend = '0' 
            if len(transfer_spend) > 0:
                spend = transfer_spend[0].text
            
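            # Once every 'League rankings' row has been used, reset the counter to avoid an
            # IndexError. The newest row cannot match an older season, so the remaining
            # seasons fall through to the 'else' branch below.
            if season_counter == season_max:
                season_counter = 0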
            
            # Missing 'League ranking' rows handled here.
            season_info = season_rows[season_counter].find_all('td')
            
            # If the years match, set the variables to the information found in the row.
            if year == season_info[0].text[-2:]:
                season_counter += 1
                goals = season_info[7].text
                competition = season_info[3].text
                position = season_info[10].text
                wins = season_info[4].text
                ties = season_info[5].text
                losses = season_info[6].text
            
            # If the years don't match, set all variables except 'competition' and 'position' to NaN.
            else:
                goals = np.nan
                competition = 'Not First'
                position = '≤10'
                wins = np.nan
                ties = np.nan
                losses = np.nan
                
            # Create dictionary for each season that holds all data for that season.
            all_data[league][team][year] = {}
            
            # Store variables in dict.
            all_data[league][team][year]['revenue'] = revenue
            all_data[league][team][year]['spent'] = spend
            
            all_data[league][team][year]['goals'] = goals
            all_data[league][team][year]['competition'] = competition 
            all_data[league][team][year]['position'] = position
            all_data[league][team][year]['wins'] = wins
            all_data[league][team][year]['ties'] = ties
            all_data[league][team][year]['losses'] = losses
            
            # Total league transfer spend this season.
            all_data[league][team][year]['league_spent'] = balance_rows[i].find_all("td")[1].text

In [None]:
# Save scraped data dictionary as pickle file.
with open('all_datav3.pickle', 'wb') as handle:
    pickle.dump(all_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
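
&emsp; For later analysis, the nested league → team → season dictionary is usually easier to work with as flat records. The sketch below is an assumption about how that could be done, not the method used later in this project: `flatten_seasons` is a hypothetical helper, the sample values are made up, and an in-memory buffer stands in for the pickle file.

```python
import io
import pickle


def flatten_seasons(all_data):
    """Turn the nested league -> team -> season dict into a list of flat records."""
    records = []
    for league, teams in all_data.items():
        for team, seasons in teams.items():
            for year, stats in seasons.items():
                records.append({"league": league, "team": team, "year": year, **stats})
    return records


# Tiny made-up sample with the same nesting as the scraped dictionary.
sample = {"England": {"Example FC": {"23": {"spent": "0", "revenue": "0"}}}}

# Round-trip through pickle in memory, standing in for the file on disk.
buffer = io.BytesIO()
pickle.dump(sample, buffer, protocol=pickle.HIGHEST_PROTOCOL)
buffer.seek(0)
records = flatten_seasons(pickle.load(buffer))
```

A list of records like this can be passed directly to a table library if one is used in a later part.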

### Final Thoughts 
&emsp; A data analysis is only as good as its data. If I were to do it over, I would source more information about each season from other websites in addition to the transfer information from 'Transfermarkt.us'. Including more leagues, teams, and seasons from each country, instead of just the top 10, would've been challenging but might've provided a more complete picture.