# Introduction

The first step is to obtain the data needed for analysis. Our company is at an early stage of development and does not yet have its own database, so we will use publicly available resources.

For this purpose, we have been pointed to the website [Scrape This Site](https://www.scrapethissite.com/pages/forms/). Before downloading any data, carefully read the [FAQ](https://www.scrapethissite.com/faq/) section of the site. Pay particular attention to the limits on the number of requests, because they constrain how the scraper can be written.

After the code in this notebook has been executed, the `data/raw/` folder is expected to contain the downloaded data, which will serve as the source for the next stage of the project.

## Importing Required Libraries

In [6]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import os
import json

In [18]:
# Specify the path to the folder
folder_path = '../drivers'
# List all files and directories in the folder
content = os.listdir(folder_path)
# Print the content
print(content)
# chromedriver file path
file_path = '../drivers/chromedriver.exe'

['chromedriver.exe']


## Driver and Selenium Configuration

In [20]:
# Set up Service and Chrome options
service = Service(file_path)
chrome_options = Options()

# Create the WebDriver instance with service and options
driver = webdriver.Chrome(service=service, options=chrome_options)

In [None]:
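# Load the page first; otherwise page_source holds only the browser's blank start page
driver.get("https://www.scrapethissite.com/pages/forms/")
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")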

# Fetching Website Content

This section of the notebook contains the code that fetches the website content. To complete the task properly, follow these steps:  
- Check whether the site has additional data pages, so that all available data is fetched.  
- Locate the data of interest on the page using `html` inspection tools.  
- Navigate between consecutive data pages using browser mechanisms or by analyzing the `url` structure.  

> Remember to respect the request limits specified in the `FAQ`!  

Save the fetched pages in the `data/raw/` folder as `hockey_teams_page_{page_number}.html`. At this stage we only retrieve the data without processing it; analysis will be performed later.  

To fetch the `html` content of a page, you can use `driver.page_source`, where `driver` is the Selenium WebDriver created above. Make sure the browser tool (e.g., Selenium) is configured and ready for use.  

> (Optional) If there are multiple pages to fetch, use the [zfill](https://www.programiz.com/python-programming/methods/string/zfill) string method to keep the file names in order by adding leading zeros to the page numbers.
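
The saving step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `page_filename` and `save_pages`, the injected `fetch` callable, the two-digit padding, and the one-second delay are illustrative and are not taken from the site or its FAQ.

```python
import os
import time


def page_filename(page_number, width=2):
    """Build a zero-padded file name, e.g. page 3 -> hockey_teams_page_03.html."""
    return f"hockey_teams_page_{str(page_number).zfill(width)}.html"


def save_pages(fetch, urls, out_dir, delay=1.0):
    """Fetch each URL with `fetch` and write the raw HTML into `out_dir`.

    `fetch` is any callable that takes a URL and returns HTML text
    (a hypothetical hook; with requests it could be
    `lambda url: requests.get(url).text`).
    `delay` is an assumed pause in seconds between consecutive requests.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for page_number, url in enumerate(urls, start=1):
        html = fetch(url)
        path = os.path.join(out_dir, page_filename(page_number))
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        saved.append(path)
        if page_number < len(urls):
            time.sleep(delay)  # pause before the next request
    return saved
```

With Selenium, `fetch` could instead call `driver.get(url)` and return `driver.page_source`. Zero-padding keeps the files in page order when they are sorted by name: `hockey_teams_page_02.html` sorts before `hockey_teams_page_10.html`, which would not hold for `page_2` and `page_10`.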



***Getting links***

In [22]:
request = requests.get("https://www.scrapethissite.com/pages/forms/?page_num=1")
request

<Response [200]>

In [24]:
result = BeautifulSoup(request.text, 'html.parser')

In [38]:
# Collect the href of every link inside the pagination list
page_links = result.find_all("ul", class_="pagination")
links = []
for ul in page_links:
    a_tags = ul.find_all("a")
    for a in a_tags:
        href = a.attrs['href']
        links.append(href)

In [46]:
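# Drop the last pagination entry (the trailing "next" link), keeping only the numbered pages.
# Run this cell only once: each extra run removes another link.
del links[-1]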

In [62]:
# Turn the relative hrefs into absolute URLs
full_links = ["https://www.scrapethissite.com" + link for link in links]
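
As an alternative to string concatenation, the standard library's `urllib.parse.urljoin` can resolve relative links against the page URL, and duplicates can be dropped while keeping their order. This is a minimal sketch, assuming hrefs shaped like those in the pagination list; the `absolute_links` helper and the sample hrefs are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.scrapethissite.com/pages/forms/"


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    resolved = (urljoin(base_url, href) for href in hrefs)
    return list(dict.fromkeys(resolved))


sample_hrefs = [
    "/pages/forms/?page_num=1",
    "/pages/forms/?page_num=2",
    "/pages/forms/?page_num=2",  # e.g. a "next" link repeating a numbered page
]
print(absolute_links(sample_hrefs))
```

Because duplicates are removed by value, this approach does not depend on where a repeated link sits in the list.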

In [64]:
full_links

['https://www.scrapethissite.com/pages/forms/?page_num=1',
 'https://www.scrapethissite.com/pages/forms/?page_num=2',
 'https://www.scrapethissite.com/pages/forms/?page_num=3',
 'https://www.scrapethissite.com/pages/forms/?page_num=4',
 'https://www.scrapethissite.com/pages/forms/?page_num=5',
 'https://www.scrapethissite.com/pages/forms/?page_num=6',
 'https://www.scrapethissite.com/pages/forms/?page_num=7',
 'https://www.scrapethissite.com/pages/forms/?page_num=8',
 'https://www.scrapethissite.com/pages/forms/?page_num=9',
 'https://www.scrapethissite.com/pages/forms/?page_num=10',
 'https://www.scrapethissite.com/pages/forms/?page_num=11',
 'https://www.scrapethissite.com/pages/forms/?page_num=12',
 'https://www.scrapethissite.com/pages/forms/?page_num=13',
 'https://www.scrapethissite.com/pages/forms/?page_num=14',
 'https://www.scrapethissite.com/pages/forms/?page_num=15',
 'https://www.scrapethissite.com/pages/forms/?page_num=16',
 'https://www.scrapethissite.com/pages/forms/?pag

***Getting teams data from one page***

In [80]:
rows = result.find("table", class_="table").find_all('tr', class_='team')

In [113]:
teams = []

In [115]:
# Collect the text of every <td> cell, one list per team row
for row in rows:
    team_data = [td.get_text(strip=True) for td in row.find_all('td')]
    teams.append(team_data)

In [117]:
teams

[['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35'],
 ['Buffalo Sabres', '1990', '31', '30', '', '0.388', '292', '278', '14'],
 ['Calgary Flames', '1990', '46', '26', '', '0.575', '344', '263', '81'],
 ['Chicago Blackhawks', '1990', '49', '23', '', '0.613', '284', '211', '73'],
 ['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25'],
 ['Edmonton Oilers', '1990', '37', '37', '', '0.463', '272', '272', '0'],
 ['Hartford Whalers', '1990', '31', '38', '', '0.388', '238', '276', '-38'],
 ['Los Angeles Kings', '1990', '46', '24', '', '0.575', '340', '254', '86'],
 ['Minnesota North Stars',
  '1990',
  '27',
  '39',
  '',
  '0.338',
  '256',
  '266',
  '-10'],
 ['Montreal Canadiens', '1990', '39', '30', '', '0.487', '273', '249', '24'],
 ['New Jersey Devils', '1990', '32', '33', '', '0.4', '272', '264', '8'],
 ['New York Islanders', '1990', '25', '45', '', '0.312', '223', '290', '-67'],
 ['New York Rangers', '1990', '36', '31', '', '0.45', '297', '265',

***Getting teams data from all pages***

In [127]:
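import time

# Repeat the single-page extraction for every page
teams = []
for link in full_links:
    request = requests.get(link)
    result = BeautifulSoup(request.text, 'html.parser')
    rows = result.find("table", class_="table").find_all('tr', class_='team')
    for row in rows:
        team_data = []
        teams_facts = row.find_all('td')
        for team_fact in teams_facts:
            data = team_fact.get_text(strip=True)
            team_data.append(data)
        teams.append(team_data)
    # Pause between requests to respect the limits described in the FAQ
    # (the one-second value is an assumption, not taken from the FAQ)
    time.sleep(1)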

In [129]:
teams

[['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35'],
 ['Buffalo Sabres', '1990', '31', '30', '', '0.388', '292', '278', '14'],
 ['Calgary Flames', '1990', '46', '26', '', '0.575', '344', '263', '81'],
 ['Chicago Blackhawks', '1990', '49', '23', '', '0.613', '284', '211', '73'],
 ['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25'],
 ['Edmonton Oilers', '1990', '37', '37', '', '0.463', '272', '272', '0'],
 ['Hartford Whalers', '1990', '31', '38', '', '0.388', '238', '276', '-38'],
 ['Los Angeles Kings', '1990', '46', '24', '', '0.575', '340', '254', '86'],
 ['Minnesota North Stars',
  '1990',
  '27',
  '39',
  '',
  '0.338',
  '256',
  '266',
  '-10'],
 ['Montreal Canadiens', '1990', '39', '30', '', '0.487', '273', '249', '24'],
 ['New Jersey Devils', '1990', '32', '33', '', '0.4', '272', '264', '8'],
 ['New York Islanders', '1990', '25', '45', '', '0.312', '223', '290', '-67'],
 ['New York Rangers', '1990', '36', '31', '', '0.45', '297', '265',

***Data to JSON conversion***

In [132]:
columns = ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %", "Goals For (GF)", "Goals Against (GA)", "+ / -"]

In [138]:
teams_json = []
for team in teams:
    team_dict = dict(zip(columns, team))
    teams_json.append(team_dict)
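
Note that `zip` stops at the shorter of its two inputs, so a row with a missing or extra cell would be converted without any warning. A stricter variant could look like the sketch below; the `rows_to_dicts` helper and the choice to raise `ValueError` are assumptions made for illustration.

```python
def rows_to_dicts(columns, rows):
    """Pair each row with the column names, rejecting rows of the wrong length."""
    records = []
    for index, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(
                f"row {index} has {len(row)} values, expected {len(columns)}"
            )
        records.append(dict(zip(columns, row)))
    return records


example_columns = ["Team Name", "Year", "Wins"]
example_rows = [["Boston Bruins", "1990", "44"]]
print(rows_to_dicts(example_columns, example_rows))
```

Calling `rows_to_dicts(columns, teams)` would then either return the same list as the loop above or point to the first malformed row.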

In [140]:
teams_json 

[{'Team Name': 'Boston Bruins',
  'Year': '1990',
  'Wins': '44',
  'Losses': '24',
  'OT Losses': '',
  'Win %': '0.55',
  'Goals For (GF)': '299',
  'Goals Against (GA)': '264',
  '+ / -': '35'},
 {'Team Name': 'Buffalo Sabres',
  'Year': '1990',
  'Wins': '31',
  'Losses': '30',
  'OT Losses': '',
  'Win %': '0.388',
  'Goals For (GF)': '292',
  'Goals Against (GA)': '278',
  '+ / -': '14'},
 {'Team Name': 'Calgary Flames',
  'Year': '1990',
  'Wins': '46',
  'Losses': '26',
  'OT Losses': '',
  'Win %': '0.575',
  'Goals For (GF)': '344',
  'Goals Against (GA)': '263',
  '+ / -': '81'},
 {'Team Name': 'Chicago Blackhawks',
  'Year': '1990',
  'Wins': '49',
  'Losses': '23',
  'OT Losses': '',
  'Win %': '0.613',
  'Goals For (GF)': '284',
  'Goals Against (GA)': '211',
  '+ / -': '73'},
 {'Team Name': 'Detroit Red Wings',
  'Year': '1990',
  'Wins': '34',
  'Losses': '38',
  'OT Losses': '',
  'Win %': '0.425',
  'Goals For (GF)': '273',
  'Goals Against (GA)': '298',
  '+ / -': '-

In [146]:
with open('teams_data.json', 'w') as json_file:
    json.dump(teams_json, json_file, indent=4)
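
To check that the file was written correctly, it can be read back and compared with the data in memory. This is a minimal sketch: the `save_and_verify` helper is hypothetical, and the output path is supplied by the caller, so it could be `teams_data.json` or a file under `data/raw/`.

```python
import json
import os


def save_and_verify(records, path):
    """Write records as JSON to path, then reload the file and compare."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # create the target folder if needed
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(records, json_file, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as json_file:
        reloaded = json.load(json_file)
    return reloaded == records
```

The function returns `True` when the reloaded content matches the original records, for example `save_and_verify(teams_json, 'teams_data.json')`.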