# 2. Crawling AFL Brownlow Votes from afltables.com for Data Cross Reference Validation #
## For Brownlow Predictor Project ##

Scrapes Brownlow votes data from a second website (afltables.com) to cross-validate the votes already collected.

*FootyWire's Brownlow data had some votes wrongly allocated. afltables.com also records each player's full name rather than a semi-initialised form, so it also resolves the cases where two players on the field share the same semi-initialised name and were both allocated votes.*

**Author: `Lang (Ron) Chen` 2021.12-2022.1**

---

**0. Import Libraries**

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time
import random
from urllib.parse import urljoin
import os

**1. Data Processing functions**

In [2]:
def TransfName(name):
    """ Function which transforms the name format of afltables.com into the style our current files use (i.e. the style of FootyWire.com) """

    # Note: only the second word is kept as the surname, so any further words are dropped
    name_split = name.split(' ')
    firstname = name_split[0]
    lastname = name_split[1]

    if '-' in lastname:
        lastname_split = lastname.split('-')
        lastname = f'{lastname_split[0][0]}-{lastname_split[1]}'

    return f'{firstname} {lastname}'
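
To make the transformation concrete, here is a minimal sketch that re-declares the function under the hypothetical name `transf_name_demo` so it runs on its own. The sample names are made up for illustration and do not come from either website.

```python
def transf_name_demo(name):
    """Standalone copy of TransfName, renamed for this illustration."""
    name_split = name.split(' ')
    firstname = name_split[0]
    lastname = name_split[1]

    # Hyphenated surnames keep only the first letter of the first part
    if '-' in lastname:
        lastname_split = lastname.split('-')
        lastname = f'{lastname_split[0][0]}-{lastname_split[1]}'

    return f'{firstname} {lastname}'


# Hypothetical names: plain, hyphenated, and a three-word name
examples = ['Alex Smith', 'Alex Brown-Jones', 'Alex van Berg']
transformed = [transf_name_demo(n) for n in examples]
print(transformed)
```

The third example shows that only the second word survives as the surname, which is worth keeping in mind when a name on the page has more than two words.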

In [3]:
def Teamname_transf(team):
    """ Function which transforms the team name format of AFLTables.com into the names our current data is stored under (i.e. the style of FootyWire.com) """

    if team == 'Gold':
        team = 'GoldCoast'

    elif team == 'North':
        team = 'NorthMelbourne'

    elif team == 'Port':
        team = 'PortAdelaide'

    elif team == 'St':
        team = 'StKilda'

    elif team == 'West':
        team = 'WestCoast'

    elif team == 'Western':
        team = 'WesternBulldogs'

    elif team == 'Greater':
        team = 'GWS'

    return team
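
The same mapping can also be written as a dictionary lookup. This is only an alternative sketch, not the notebook's own code; `TEAM_ALIASES` and `teamname_lookup` are hypothetical names, while the keys and values are copied from the branches above.

```python
# First word of the afltables.com team name -> name used in our stored files
TEAM_ALIASES = {
    'Gold': 'GoldCoast',
    'North': 'NorthMelbourne',
    'Port': 'PortAdelaide',
    'St': 'StKilda',
    'West': 'WestCoast',
    'Western': 'WesternBulldogs',
    'Greater': 'GWS',
}


def teamname_lookup(team):
    """Return the stored team name, or the input unchanged if no alias exists."""
    return TEAM_ALIASES.get(team, team)


print(teamname_lookup('Gold'), teamname_lookup('Geelong'))
```

A dictionary keeps every alias in one place, so adding a team means adding one entry rather than another branch.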

In [4]:
def validate_stats(year, rd, team1, team2, new_brownlowdict):
    """ Function which opens up a game file and checks whether the correct number of votes were given. 
    Warning: Written to accommodate the structure of the scraped data!! """

    try:
        df = pd.read_csv(
            f'../data/raw/OriginalData/{year} Round {rd} {team1} v {team2} (O).csv')
    except FileNotFoundError:
        # the file may be stored with the two teams in the opposite order
        df = pd.read_csv(
            f'../data/raw/OriginalData/{year} Round {rd} {team2} v {team1} (O).csv')
    players = list(df['Player'])
    votes = list(df['Brownlow Votes'])

    issue = False

    old_brownlowdict = {}

    # initialise a new list to store votes
    new_votes = [0 for i in range(len(votes))]

    # Players whose names differ between our data and afltables.com
    for i in range(len(players)):
        if players[i] == 'Josh P. Kennedy' or players[i] == 'Joshua Kennedy':
            player = 'Josh Kennedy'
        elif players[i] == "Jaeger O'Meara":
            player = 'Jaeger OMeara'
        elif players[i] == 'Edward Curnow':
            player = 'Ed Curnow'
        elif players[i] == 'Zachary Merrett':
            player = 'Zach Merrett'
        elif players[i] == 'Joshua Kelly':
            player = 'Josh Kelly'
        elif players[i] == 'Jordan De Goey':
            player = 'Jordan de Goey'
        elif players[i] == 'Zachary Williams':
            player = 'Zac Williams'
        elif players[i] == 'Matthew De Boer':
            player = 'Matt de Boer'
        elif players[i] == 'Jackson Macrae':
            player = 'Jack Macrae'
        elif players[i] == 'Daniel Butler':
            player = 'Dan Butler'
        else:
            player = players[i]

        # if a matching player is found, check their votes; if the votes don't match, raise the alarm
        if player in new_brownlowdict:
            new_votes[i] = new_brownlowdict[player]
            if new_brownlowdict[player] != votes[i]:
                issue = True

    # if the stored votes do not total 6 (e.g. extra votes given because of a shared semi-initialised name), also raise the alarm
    if not issue and sum(votes) != 6:
        issue = True

    # only if there's an issue: find out who holds the votes in the current file, replace the votes column
    # with new_votes, and print diagnostic information for the game.
    # also update the two other types of files.
    if issue:
        for i in range(len(votes)):
            if votes[i]:
                old_brownlowdict[players[i]] = votes[i]

        print(f'{year} Rd {rd} {team1} v {team2}: {sum(votes)}')
        print({sum(new_votes)})
        print(f'New: {new_brownlowdict}')
        print(f'Old: {old_brownlowdict}')
        print('\n')

        if not os.path.exists('../data (validation fix)/OriginalData'):
            os.makedirs('../data (validation fix)/OriginalData')

        df['Brownlow Votes'] = new_votes
        df.to_csv(
            f'../data (validation fix)/OriginalData/{year} Round {rd} {team1} v {team2} (O).csv', index=False)

As a safeguard, the crawler is not allowed to overwrite the files crawled from FootyWire. Corrected files are written to a new folder instead; after human validation they can be cut and pasted into the original data folder, replacing the old files with wrongly allocated Brownlow votes.
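
The manual cut-and-paste step could also be scripted once the corrected files have been checked by hand. The sketch below is an assumption about how that might look, not part of the original workflow; `promote_validated` is a hypothetical helper, and both folder paths are passed in by the caller.

```python
import shutil
from pathlib import Path


def promote_validated(fix_dir, original_dir):
    """Copy every reviewed CSV in fix_dir over the same-named file in original_dir.

    Returns the names of the files that were copied.
    """
    fix_dir, original_dir = Path(fix_dir), Path(original_dir)
    copied = []
    for src in sorted(fix_dir.glob('*.csv')):
        # copy2 overwrites an existing file with the same name
        shutil.copy2(src, original_dir / src.name)
        copied.append(src.name)
    return copied
```

It would be called as `promote_validated('../data (validation fix)/OriginalData', '../data/raw/OriginalData')`. One caveat: `validate_stats` always names the corrected file `{team1} v {team2}`, so a game originally stored as `{team2} v {team1}` would land beside the old file rather than replace it. The file names are worth checking during review.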

## 2. Crawl and Scrape ##
*(This is like the overall 'Main' function in this entire notebook)*

In [5]:
years = range(2015, 2024)

for year in years:

    page = requests.get(
        f'https://afltables.com/afl/brownlow/brownlow{year}rbr.html')
    soup = BeautifulSoup(page.text, 'html.parser')

    # just happened to work... found out through experimentation
    section = soup.find('h1')

    results = section.findNext('table')
    rows = results.findAll('tr')

    # Code followed the specific format of the data: every 4 rows made up a game in the particular round
    result2 = soup.findAll('table')
    for rd in range(len(result2)):
        for row in rows:
            if row.findAll('a'):
                a = row.findAll('a')
                desired_object = re.findall(r'>.+<', str(a[0]))[0].strip('<>')
                if ' v ' in desired_object:
                    brownlowdict = {}
                    count = 0
                    # split on ' v ' (with spaces) so a 'v' inside a team name cannot break the split
                    team1_tmp = desired_object.split(' v ')[0].split()[0]
                    team2_tmp = desired_object.split(' v ')[1].split()[0]

                    team1 = Teamname_transf(team1_tmp)
                    team2 = Teamname_transf(team2_tmp)

                else:
                    if count == 1:
                        brownlowdict[TransfName(desired_object)] = 3

                    elif count == 2:
                        brownlowdict[TransfName(desired_object)] = 2

                    elif count == 3:
                        brownlowdict[TransfName(desired_object)] = 1
                        validate_stats(year, rd, team1, team2, brownlowdict)

                count += 1

        if rd + 1 < len(result2):
            results = results.findNext('table')
            rows = results.findAll('tr')

    time.sleep(random.uniform(0.5, 5))
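
The counting logic inside the loop can be tried without any network access. The sketch below assumes the link texts have already been pulled out of the table rows in page order: a `'<team> v <team>'` header followed by the 3-vote, 2-vote and 1-vote players. `group_votes` and `VOTES_BY_POSITION` are hypothetical names, and the sample texts reuse the names from the diagnostic output at the end of this notebook, with the full team names assumed.

```python
# Position after the game header -> votes awarded
VOTES_BY_POSITION = {1: 3, 2: 2, 3: 1}


def group_votes(link_texts):
    """Group a flat list of link texts into games with their 3-2-1 votes."""
    games = []
    current = None
    count = 0
    for text in link_texts:
        if ' v ' in text:
            # a header row starts a new game and resets the counter
            current = {'teams': text.split(' v '), 'votes': {}}
            count = 0
        elif current is not None and count in VOTES_BY_POSITION:
            current['votes'][text] = VOTES_BY_POSITION[count]
            if count == 3:
                # the third player completes the game
                games.append(current)
        count += 1
    return games


sample = [
    'West Coast v Sydney', 'Luke Shuey', 'Josh Kennedy', 'Jamie Cripps',
    'Gold Coast v Adelaide', 'Taylor Walker', 'Tom Lynch', 'Daniel Talia',
]
games = group_votes(sample)
print(games)
```

Unlike the notebook loop, this version skips any player link that appears before the first header instead of relying on `count` already being defined.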

## Logic behind the code: ##
Seeks to solve three problems: too many votes being assigned (because the original scraper identified vote receivers by their semi-initialised names), wrong votes being recorded, and the 'Josh P. Kennedy' problem.


For each game, open the stored file and walk through its player list. Whenever a name appears in `new_brownlowdict`, compare the stored votes with the afltables.com votes; on a mismatch, raise the alarm and write a new votes list to replace the old one. A game whose stored votes do not sum to 6 is flagged as well.
The diagnostics printed for each 'problem game' show the original allocation, which helps diagnose further problems in the original data and shows whether this validation script has fixed them.
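
The comparison can be reduced to a few lines on plain lists. This is a simplified sketch of the check in `validate_stats`, with lists standing in for the DataFrame columns and the name-alias step left out; `compare_votes` is a hypothetical name and the players are placeholders.

```python
def compare_votes(players, votes, new_brownlowdict):
    """Return (issue, new_votes) for one game.

    issue is True when a stored vote disagrees with new_brownlowdict,
    or when the stored votes do not add up to 6.
    """
    new_votes = [0] * len(players)
    issue = False

    for i, player in enumerate(players):
        if player in new_brownlowdict:
            new_votes[i] = new_brownlowdict[player]
            if new_votes[i] != votes[i]:
                issue = True

    # 3 + 2 + 1 votes are awarded per game
    if not issue and sum(votes) != 6:
        issue = True

    return issue, new_votes


# Placeholder game: the stored file gave 2 votes to the wrong player
players = ['Player A', 'Player B', 'Player C', 'Player D']
stored = [3, 0, 2, 1]
reference = {'Player A': 3, 'Player B': 2, 'Player D': 1}

issue, new_votes = compare_votes(players, stored, reference)
print(issue, new_votes)
```

Here the stored total is still 6, so only the per-player comparison catches the misallocation; the sum check covers the opposite case, where every listed player matches but extra votes were handed out.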

After this validation, only 2 games still show a data mismatch and require a manual update:

2016 Rd 9 GoldCoast v Adelaide: 8
{8}
New: {'Taylor Walker': 3, 'Tom Lynch': 2, 'Daniel Talia': 1}
Old: {'Tom Lynch': 2, 'Daniel Talia': 1, 'Taylor Walker': 3}


2017 Rd 4 WestCoast v Sydney: 8
{8}
New: {'Luke Shuey': 3, 'Josh Kennedy': 2, 'Jamie Cripps': 1}
Old: {'Luke Shuey': 3, 'Jamie Cripps': 1, 'Joshua Kennedy': 2, 'Josh P. Kennedy': 2}
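
The 2017 case can be traced by hand. In `validate_stats` both 'Joshua Kennedy' and 'Josh P. Kennedy' are mapped to 'Josh Kennedy', so both rows receive the 2 votes and the corrected total stays at 8. The sketch below reproduces that using only the names and votes printed above; `ALIASES` repeats the relevant branch of the function, and the other variable names are hypothetical.

```python
from collections import defaultdict

# The alias branch from validate_stats that affects this game
ALIASES = {'Josh P. Kennedy': 'Josh Kennedy', 'Joshua Kennedy': 'Josh Kennedy'}

# Values copied from the diagnostic output for 2017 Rd 4
old = {'Luke Shuey': 3, 'Jamie Cripps': 1, 'Joshua Kennedy': 2, 'Josh P. Kennedy': 2}
new = {'Luke Shuey': 3, 'Josh Kennedy': 2, 'Jamie Cripps': 1}

# Group the stored names by the name they are looked up under
rows_by_name = defaultdict(list)
for stored_name in old:
    rows_by_name[ALIASES.get(stored_name, stored_name)].append(stored_name)

# A lookup name shared by more than one stored row cannot be resolved automatically
ambiguous = {name: rows for name, rows in rows_by_name.items() if len(rows) > 1}

# Total written back when every stored row takes the votes of its lookup name
new_total = sum(new[ALIASES.get(name, name)] for name in old)

print(ambiguous, new_total)
```

Only the players shown in the printed output are included here. A check like this flags which stored rows need a human decision, which is why the game is listed for manual update.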