# First test scraping the Understat EPL table

## [1] Attempt to scrape the table the regular way
Ensure that beautifulsoup is installed
Fetch web page

Locate table

*problem* there are two tables without classes, how to differentiate?
*resolved issue* container has an id 'league-chemp'
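
A minimal sketch of picking the right table through its container id, run on a made-up HTML snippet rather than the live page. The second container id (`league-players`) and the cell contents are placeholders I invented for illustration; only `league-chemp` comes from the real page.

```python
from bs4 import BeautifulSoup

# Two class-less tables, as on the page; only the container ids differ.
# "league-players" is a hypothetical id used just for this example.
html = """
<div id="league-chemp"><table><tr><td>Liverpool</td></tr></table></div>
<div id="league-players"><table><tr><td>Some Player</td></tr></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the container first, then look for the table inside it.
container = soup.find("div", id="league-chemp")
table = container.find("table") if container else None

first_cell = table.find("td").get_text(strip=True)
print(first_cell)
```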

*new problem* bs4 finds the div but it's empty
The table appears to be rendered with JavaScript, so we need to try Selenium to retrieve it.

*resolved issue* Selenium loads the content of the page, and then we're able to extract the table like we need to.
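
One way to make the "wait until the table exists" step explicit is sketched below. This is not the cell I actually ran; it assumes that a `td` inside `#league-chemp` is a good enough signal that the JavaScript has finished, and the helper names `fetch_rendered_html` and `table_is_rendered` are made up for this example.

```python
from bs4 import BeautifulSoup

# Assumed signal that rendering is done: a data cell inside the container.
TABLE_CELL = "#league-chemp table td"


def fetch_rendered_html(url, timeout=10):
    """Load url in headless Chrome and return the HTML once the table has cells."""
    # Imported here so table_is_rendered() below works without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one table cell is present, or raise on timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, TABLE_CELL))
        )
        return driver.page_source
    finally:
        driver.quit()


def table_is_rendered(html):
    """True if the container already holds a table with at least one cell."""
    return BeautifulSoup(html, "html.parser").select_one(TABLE_CELL) is not None


# The plain-requests response looked like the second case: div present, no table.
print(table_is_rendered('<div id="league-chemp"><table><tr><td>x</td></tr></table></div>'))
print(table_is_rendered('<div id="league-chemp"></div>'))
```

`fetch_rendered_html("https://understat.com/league/EPL")` would need Chrome and network access, so it is not called here.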

Extract the data from the rows as needed

*Question:* what is creating the empty list at index 0? The first row holds the header cells, which are tagged 'th' rather than 'td', so searching that row for 'td' returns nothing.
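
The effect can be reproduced on a tiny hand-written table (the HTML below is made up for illustration, not copied from Understat). Asking for both tag names in one `find_all` call is one possible alternative to the `td`-then-`th` fallback.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Team</th><th>PTS</th></tr>
  <tr><td>Liverpool</td><td>28</td></tr>
  <tr><td>Manchester City</td><td>23</td></tr>
</table>
"""

rows = BeautifulSoup(html, "html.parser").find_all("tr")

# Searching only for <td> leaves the header row empty.
td_only = [[c.get_text(strip=True) for c in r.find_all("td")] for r in rows]

# Searching for both tag names picks up the header cells as well.
td_or_th = [[c.get_text(strip=True) for c in r.find_all(["td", "th"])] for r in rows]

print(td_only[0])
print(td_or_th[0])
```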

Create a data frame with the table data

## [2] Attempt to sort the table alphabetically

This is easiest to do after scraping the table into a DataFrame. When I was copying and pasting the data into the odds calculator, I sorted it with the 'sort' button on the Understat website; sorting in pandas instead means we don't have to drive Understat through Selenium any more than necessary.

```python
sorted_df = df.sort_values(by="Team", ascending=True)
```
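
Sorting by "Team" works on the raw scrape, but every scraped cell is a string, so sorting by a numeric column would compare text ("9" sorts above "16"). The sketch below uses a few rows from the home table output and converts them first. It also splits the fused xG cell; the second number appears to be xG minus goals scored (for Brentford, 15.41 - 18 = -2.59), but that reading and the `xG_diff` column name are my assumptions.

```python
import pandas as pd

# A few rows from the scraped home table; every cell is a string.
df = pd.DataFrame(
    [
        ["Brentford", "6", "16", "15.41-2.59"],
        ["Liverpool", "6", "15", "13.04+2.04"],
        ["Chelsea", "6", "9", "12.83+3.83"],
        ["Aston Villa", "5", "8", "9.36+2.36"],
    ],
    columns=["Team", "M", "PTS", "xG"],
)

clean = df.copy()

# Convert the count columns from text to numbers.
for col in ["M", "PTS"]:
    clean[col] = pd.to_numeric(clean[col])

# Split "15.41-2.59" into the xG value and the signed difference after it.
parts = clean["xG"].str.extract(r"^(\d+\.\d+)([+-]\d+\.\d+)$")
clean["xG"] = parts[0].astype(float)
clean["xG_diff"] = parts[1].astype(float)

as_text = df.sort_values(by="PTS", ascending=False)["Team"].tolist()
as_numbers = clean.sort_values(by="PTS", ascending=False)["Team"].tolist()

print(as_text)
print(as_numbers)
```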

## [3] Attempt to sort the table by AWAY results and HOME results



## [4] Write tests to confirm it's working correctly
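
A first sketch of what such checks could look like. The expected columns are only the ones visible in the printed output, the 20-team count assumes a full EPL table, and `check_league_table` is a hypothetical helper name, not something that exists in the notebook yet.

```python
import pandas as pd

# Columns visible in the printed output; the real table has more to the right.
EXPECTED_COLUMNS = ["№", "Team", "M", "W", "D", "L", "G", "GA", "PTS", "xG"]


def check_league_table(df, expected_teams=20):
    """Return a list of problems found in df; an empty list means it looks fine."""
    problems = []

    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    if len(df) != expected_teams:
        problems.append(f"expected {expected_teams} rows, got {len(df)}")

    if "Team" in df.columns and df["Team"].duplicated().any():
        problems.append("duplicate team names")

    # Wins + draws + losses should add up to matches played.
    if all(c in df.columns for c in ["M", "W", "D", "L"]):
        nums = df[["M", "W", "D", "L"]].apply(pd.to_numeric, errors="coerce")
        if (nums["W"] + nums["D"] + nums["L"] != nums["M"]).any():
            problems.append("W + D + L does not equal M for some rows")

    return problems


sample = pd.DataFrame(
    [
        ["1", "Liverpool", "11", "9", "1", "1", "21", "6", "28", "24.62+3.62"],
        ["2", "Manchester City", "11", "7", "2", "2", "22", "13", "23", "23.88+1.88"],
    ],
    columns=EXPECTED_COLUMNS,
)
print(check_league_table(sample, expected_teams=2))
```

In a pytest file this could become `assert check_league_table(df) == []` once the scraped `df` is available.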


In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up WebDriver

driver = webdriver.Chrome()
driver.get("https://understat.com/league/EPL")

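# wait for dynamic content to load: the implicit wait only applies to element
# lookups, so look the table up once before reading page_source
driver.implicitly_wait(10)
driver.find_elements(By.CSS_SELECTOR, "#league-chemp table")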

# extract page after JS loads
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# fetch page
# url = "https://understat.com/league/EPL"
# response = requests.get(url)
# if response.status_code == 200:
#     soup = BeautifulSoup(response.content, "html.parser")
#     print('success')
# else:
#     print(f"Failed with status code {response.status_code}")

# locate table
div = soup.find("div", {
    "id": "league-chemp"
})

if div:
    table = div.find("table")
    if table:
        print("found the table")
        # extract the rows
        rows = table.find_all("tr")
        
        data = []
        for row in rows:
            cells = row.find_all("td")
            if len(cells) == 0:
                cells = row.find_all("th")
            data.append([cell.get_text(strip=True) for cell in cells])
        # print("DATA: ", data)
        if data:
            df = pd.DataFrame(data[1:], columns=data[0])
            print(df)
        else:
            print("No DATA")
    else:
        print("failed to find table")
else:
    print("div not found")


driver.quit()




found the table
     №                     Team   M  W  D  L   G  GA PTS          xG  \
0    1                Liverpool  11  9  1  1  21   6  28  24.62+3.62   
1    2          Manchester City  11  7  2  2  22  13  23  23.88+1.88   
2    3                  Chelsea  11  5  4  2  21  13  19  20.74-0.26   
3    4                  Arsenal  11  5  4  2  18  12  19  21.87+3.87   
4    5        Nottingham Forest  11  5  4  2  15  10  19  17.72+2.72   
5    6                 Brighton  11  5  4  2  19  15  19  19.33+0.33   
6    7                   Fulham  11  5  3  3  16  13  18  21.67+5.67   
7    8         Newcastle United  11  5  3  3  13  11  18  16.04+3.04   
8    9              Aston Villa  11  5  3  3  17  17  18  20.92+3.92   
9   10                Tottenham  11  5  1  5  23  13  16  23.15+0.15   
10  11                Brentford  11  5  1  5  22  22  16  19.44-2.56   
11  12              Bournemouth  11  4  3  4  15  15  15  21.25+6.25   
12  13        Manchester United  11  4  3  4  12

[2] 
Trying to recreate the above scraping for understanding, then incorporate the clicking functionality with Selenium to return the home/away table as a data frame (the cell below clicks the home filter).

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

# ~ OPTIONS ~
#
# Run Chrome headless so no browser window opens while the table is read
# The code also works without these options
#

chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in headless mode
chrome_options.add_argument("--disable-gpu")  # Disable GPU (optional, improves performance on some systems)
chrome_options.add_argument("--no-sandbox")  # Bypass OS security model (useful on servers)

# Access url with Chrome webdriver
driver = webdriver.Chrome(options=chrome_options)
url = "https://understat.com/league/EPL"
driver.get(url)

# click the home button for home results
try:
    print('Trying to locate element...')
    home_button = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "home-away2"))
    )

    print('Located home button:', home_button)
    print("Home button HTML:", home_button.get_attribute('outerHTML'))
    
    '''
    NOTE:

    TO RUN .click() ON home_button:
    
    driver.execute_script('< scriptToExecute >', arguments)

        (1) Pass in home_button as arguments 
        (2) Access it with arguments[0]
        (3) Run .click() on it to simulate click
    
    This clicks the home_button element through JavaScript. It had to be
    written this way because home_button.click() raised
    "Message: element not interactable"
    
    '''
    
    # home_button.click() # first try
    driver.execute_script("arguments[0].click();", home_button) # corrected
    print("clicked the button")
    

except Exception as e:
    print(f"Error clicking the 'home' button broh: {e}" )
    driver.quit()
    exit()

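# wait for table update
# NOTE: the league-chemp div is already present before the click, so this only
# confirms the container exists; it does not prove the rows were re-rendered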
try:
    print('trying (2)...')
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "league-chemp"))
    )
    print("Table updated")
except Exception as e:
    print(f"Error waiting for table update broh: {e}")
    driver.quit()
    exit()

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", {
    "id": "league-chemp"
})

if div:
    table = div.find("table")
    if table:
        print("found table")
        rows = table.find_all("tr")
        data = []
        for row in rows:
            cells = row.find_all("td")
            if len(cells) == 0:
                cells = row.find_all("th")
            data.append([cell.get_text(strip=True) for cell in cells])
        if data:
            df = pd.DataFrame(data[1:], columns=data[0])
            print(df)
        else:
            print("no data")
            
    else:
        print("table not found")
else:
    print("div not found")


driver.quit()

Error sending stats to Plausible: error sending request for url (https://plausible.io/api/event)


Trying to locate element...
Located home button: <selenium.webdriver.remote.webelement.WebElement (session="c9810d1a26173285840e10ef13a02e6a", element="f.38FC7E63F6CE8C7D2F95D59824BA29F7.d.854A441A26B81BA09F7C6422B5DB11C3.e.12")>
Home button HTML: <input id="home-away2" type="radio" name="home-away" value="h">
clicked the button
trying (2)...
Table updated
found table
     №                     Team  M  W  D  L   G  GA PTS          xG  \
0    1                Brentford  6  5  1  0  18  11  16  15.41-2.59   
1    2                Liverpool  6  5  0  1  11   3  15  13.04+2.04   
2    3          Manchester City  5  4  1  0  12   6  13  11.92-0.08   
3    4                Tottenham  6  4  0  2  16   6  12  12.59-3.41   
4    5                 Brighton  6  3  3  0  11   8  12  11.82+0.82   
5    6                  Arsenal  5  3  2  0  12   6  11  14.44+2.44   
6    7              Bournemouth  5  3  1  1   8   4  10  10.15+2.15   
7    8                   Fulham  5  3  1  1   9   7  10  12.1

In [5]:
sorted_df = df.sort_values(by="Team", ascending=True)
print(sorted_df)

     №                     Team  M  W  D  L   G  GA PTS          xG  \
5    6                  Arsenal  5  3  2  0  12   6  11  14.44+2.44   
11  12              Aston Villa  5  2  2  1   7   6   8   9.36+2.36   
6    7              Bournemouth  5  3  1  1   8   4  10  10.15+2.15   
0    1                Brentford  6  5  1  0  18  11  16  15.41-2.59   
4    5                 Brighton  6  3  3  0  11   8  12  11.82+0.82   
10  11                  Chelsea  6  2  3  1   9   8   9  12.83+3.83   
16  17           Crystal Palace  6  1  2  3   3   7   5   9.48+6.48   
15  16                  Everton  5  1  2  2   5   8   5   5.65+0.65   
7    8                   Fulham  5  3  1  1   9   7  10  12.12+3.12   
19  20                  Ipswich  5  0  3  2   4   8   3   5.41+1.41   
14  15                Leicester  5  1  2  2   5   7   5   4.88-0.12   
1    2                Liverpool  6  5  0  1  11   3  15  13.04+2.04   
2    3          Manchester City  5  4  1  0  12   6  13  11.92-0.08   
9   10