# Example Code for Scraping a Webpage

First, we import the necessary packages.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
#from get_gecko_driver import GetGeckoDriver
import pickle

In this example, we'll work with a single URL. If you need data from several pages, you may need to repeat these steps once per URL.

In [2]:
url = 'https://crossroadsleague.com/sports/bsb/2024-25/teams/gracein?view=lineup'
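If several team pages are needed, one option is to build each URL from a team identifier. The sketch below assumes that other teams follow the same URL pattern as the one above; `build_lineup_url` is a hypothetical helper, and `gracein` is the only team slug taken from this notebook.

```python
# Assumed URL pattern, copied from the single URL used in this notebook.
LINEUP_URL_TEMPLATE = (
    "https://crossroadsleague.com/sports/bsb/2024-25/teams/{team}?view=lineup"
)


def build_lineup_url(team):
    """Return the lineup URL for a team slug (hypothetical helper)."""
    return LINEUP_URL_TEMPLATE.format(team=team)


# Only 'gracein' is known from this notebook; add other slugs after checking the site.
teams = ["gracein"]
urls = [build_lineup_url(team) for team in teams]
```

Each URL in `urls` could then go through the same download and parsing steps shown in the rest of this notebook.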

## Getting the html of the webpage

Now we'll use the `selenium` `webdriver` to get the html of the webpage.

This code will open Firefox on our computer.

In [3]:
driver = webdriver.Firefox()

And then we open the specific url:

In [4]:
driver.get(url)

Then we create the string variable `page`, which contains the HTML source of the page currently loaded in the browser.

In [33]:
page = driver.page_source
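The browser window stays open until the driver is closed. As a minimal sketch, the load and read steps could be wrapped so that the driver is always shut down with `quit()`, even if loading fails. `fetch_page_source` is a hypothetical helper name, not part of `selenium`; it only relies on the driver's `get`, `page_source`, and `quit`.

```python
def fetch_page_source(driver, url):
    """Load url in an existing WebDriver and return the page's HTML source.

    The driver is always closed afterwards, even if loading the page fails.
    """
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Intended usage with the names from this notebook (not executed here):
#     page = fetch_page_source(webdriver.Firefox(), url)
```

Because the helper closes the driver, a new `webdriver.Firefox()` would be needed for each call.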

It is good practice to save this string as an html file in the `raw-data` folder.

It would be more professional to split this task into two scripts: one that downloads the html, and another that reads from the file in the `raw-data` folder. This way, if the webpage later changes in a way that breaks our code, we will still have the original html files, which our parsing code can still handle.

In [88]:
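# Write with an explicit encoding so non-ASCII characters are saved the same way on every platform
with open('../raw-data/first_try.html', "w", encoding="utf-8") as a_file:
   a_file.write(page)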
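The two-script split described above could look roughly like this. It is only a sketch: `save_html` and `load_html` are hypothetical helper names, and the path in the usage comment is the one used in this notebook.

```python
from pathlib import Path


def save_html(page, path):
    """Downloading script: write the raw html string to disk."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    path.write_text(page, encoding="utf-8")


def load_html(path):
    """Parsing script: read the saved html back as a string."""
    return Path(path).read_text(encoding="utf-8")


# Intended usage (not executed here):
#     save_html(page, '../raw-data/first_try.html')   # in the downloading script
#     page = load_html('../raw-data/first_try.html')  # in the parsing script
```

With this split, the parsing script can be re-run as often as needed without opening the browser again.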

## Parsing the HTML file

Now we'll use `BeautifulSoup` to parse the html.

In [95]:
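soup = BeautifulSoup(page, 'html.parser') #Name the parser explicitly so the result does not depend on which parsers are installed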

Having run `print(soup.prettify())`, we know that the table we want is the third table on the page.

In [96]:
t = soup.find_all('table')[2] #t is the third table
t.colgroup.extract() #Delete the colgroup element
t.thead.unwrap() #Unwrap the thead element
t.tbody.unwrap() #Unwrap the tbody element
#print(t.prettify())

<tbody></tbody>
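Selecting the table by position works as long as the page layout stays the same. As a hedged alternative, the table could be located by the column headers it contains. `find_table_with_headers` is a hypothetical helper, and the small html string below is made up for illustration.

```python
from bs4 import BeautifulSoup


def find_table_with_headers(soup, required_headers):
    """Return the first table whose <th> cells include all required headers, else None."""
    required = {h.lower() for h in required_headers}
    for table in soup.find_all("table"):
        headers = {th.get_text(strip=True).lower() for th in table.find_all("th")}
        if required <= headers:  # every required header is present
            return table
    return None


# Made-up html with two tables; only the second has the headers we ask for.
sample_html = """
<table><tr><th>Date</th><th>Opponent</th></tr></table>
<table>
  <thead><tr><th>Name</th><th>ab</th><th>h</th></tr></thead>
  <tbody><tr><td>Player A</td><td>10</td><td>3</td></tr></tbody>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
stats_table = find_table_with_headers(sample_soup, ["Name", "ab", "h"])
```

On the real page, the headers to look for would be ones seen in the `prettify()` output, such as `Name` and `ab`.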

Below is a function that converts a cell's text to an integer if possible, otherwise to a float, and otherwise keeps it as a string with excess whitespace removed.

In [98]:
def nice_data(d):
    """Convert cell text to an int or float when possible, else a cleaned-up string."""
    d = d.strip()
    #If we can turn it into an integer, then do that
    try:
        return int(d)
    except ValueError:
        pass
    #If it's not an integer, it might be a float
    try:
        return float(d)
    except ValueError:
        #If it's neither an integer nor a float, it is probably a string
        return ' '.join(d.split()) #remove excess spaces
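One caveat with this approach: Python's `float()` also accepts the strings `inf`, `infinity`, and `nan` in any letter case. A text label such as `inf` in a position column would therefore be converted to floating-point infinity instead of being kept as text. Whether that matters depends on the text the site actually uses. The variant below is a sketch that keeps such values as strings; `nice_data_finite` is a hypothetical name.

```python
import math


def nice_data_finite(d):
    """Like nice_data, but keep 'inf' and 'nan' style labels as text."""
    d = d.strip()
    try:
        return int(d)
    except ValueError:
        pass
    try:
        value = float(d)
    except ValueError:
        return " ".join(d.split())  # not a number: collapse excess whitespace
    # float() parsed it, but only accept ordinary finite numbers as numeric data
    return value if math.isfinite(value) else " ".join(d.split())
```

For example, `nice_data_finite("0.436")` gives a float, while `nice_data_finite("INF")` stays the string `"INF"`.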

Now we use the function above to put each cell into a list of lists, one inner list per table row, and then turn that into a Pandas DataFrame. The first row supplies the column names.

In [99]:
data = [[nice_data(d.text) for d in row.find_all(['td', 'th'])] for row in t.find_all('tr')]
df = pd.DataFrame(columns = data[0], data = data[1:])
df

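|   | # | Name | Yr | Pos | gp | ab | h | rbi | bb | 2b | ... | obp | slg | hbp | sf | sh | hdp | go | fo | go/fo | pa |
|---|---|------|----|-----|----|----|---|-----|----|----|-----|-----|-----|-----|----|----|-----|----|----|-------|----|
| 0 | 4 | Ryan Stark | Sr | inf | 47 | 152 | 54 | 44 | 19 | 12 | ... | 0.436 | 0.763 | 6 | 4 | 2 | - | 17 | 20 | 0.85 | 183 |
| 1 | 7 | Jose Ayala | Jr | OF | 51 | 165 | 54 | 42 | 18 | 19 | ... | 0.411 | 0.515 | 7 | 2 | 1 | - | 12 | 19 | 0.63 | 193 |
| 2 | 11 | Maximo De leon | Sr | OF | 51 | 186 | 58 | 35 | 28 | 13 | ... | 0.427 | 0.484 | 10 | 1 | 1 | 1 | 24 | 25 | 0.96 | 226 |
| 3 | 23 | Josh Cosme | Sr | C | 42 | 115 | 34 | 30 | 24 | 9 | ... | 0.443 | 0.47 | 8 | 2 | 5 | - | 12 | 22 | 0.55 | 154 |
| 4 | 1 | Dalton Cody | Sr | inf | 47 | 138 | 40 | 26 | 13 | 12 | ... | 0.354 | 0.464 | 3 | 4 | 10 | - | 21 | 33 | 0.64 | 168 |
| 5 | 22 | Grant Hartley | Sr | 1B/RHP | 46 | 118 | 33 | 31 | 22 | 5 | ... | 0.41 | 0.475 | 4 | 0 | 1 | 1 | 10 | 9 | 1.11 | 145 |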


## Writing the data to CSV

Write the DataFrame as a CSV file in our `data` folder.

In [100]:
df.to_csv('../data/first_try.csv')
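By default, `to_csv` also writes the DataFrame's row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. If the index carries no information, passing `index=False` leaves it out. The sketch below uses made-up rows and in-memory buffers instead of real files to show the difference.

```python
import io

import pandas as pd

# Made-up rows, for illustration only
df_small = pd.DataFrame({"Name": ["Player A", "Player B"], "ab": [10, 12]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df_small.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df_small.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Name', 'ab']
print(cols_without_index)  # ['Name', 'ab']
```

In this notebook, that would correspond to `df.to_csv('../data/first_try.csv', index=False)`.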