# Introduction

The purpose of this notebook is to extract information about hockey teams and save it in JSON format. The input is the set of HTML files in the `data/raw` directory, which were generated in the previous stage. The following fields are scraped and saved:  

- Team Name (`Team Name`),  
- Year (`Year`),  
- Number of wins (`Wins`),  
- Number of losses (`Losses`),  
- Number of overtime losses (`Ot Losses`),  
- Win percentage (`Win %`),  
- Number of goals scored (`Goals For (GF)`),  
- Number of goals conceded (`Goals Against (GA)`),  
- Goal differential (`+ / -`).  

Each collected record will be organized into a dictionary with the structure shown below and then added to the results list:  

```python  
{  
    'Team Name': 'Boston Bruins',  
    'Year': '1990',  
    'Wins': '44',  
    'Losses': '24',  
    'Ot Losses': '',  
    'Win %': '0.55',  
    'Goals For (GF)': '299',  
    'Goals Against (GA)': '264',  
    '+ / -': '35'  
}  
```



The resulting data will be saved to a file named `hockey_teams.json` in the `data/interim/` folder. This file will serve as the data source for further analysis.

# Notebook Configuration

## Import Required Libraries

In [30]:
from bs4 import BeautifulSoup
import glob
import json

# Scraping

## List of HTML files

The `glob` module is used to find all `html` files in the `data/raw` folder.

In [7]:
source_list = glob.glob(r"..\data\raw\*.html")

['..\\data\\raw\\hockey_teams_page_1.html', '..\\data\\raw\\hockey_teams_page_10.html', '..\\data\\raw\\hockey_teams_page_11.html', '..\\data\\raw\\hockey_teams_page_12.html', '..\\data\\raw\\hockey_teams_page_13.html', '..\\data\\raw\\hockey_teams_page_14.html', '..\\data\\raw\\hockey_teams_page_15.html', '..\\data\\raw\\hockey_teams_page_16.html', '..\\data\\raw\\hockey_teams_page_17.html', '..\\data\\raw\\hockey_teams_page_18.html', '..\\data\\raw\\hockey_teams_page_19.html', '..\\data\\raw\\hockey_teams_page_2.html', '..\\data\\raw\\hockey_teams_page_20.html', '..\\data\\raw\\hockey_teams_page_21.html', '..\\data\\raw\\hockey_teams_page_22.html', '..\\data\\raw\\hockey_teams_page_23.html', '..\\data\\raw\\hockey_teams_page_24.html', '..\\data\\raw\\hockey_teams_page_3.html', '..\\data\\raw\\hockey_teams_page_4.html', '..\\data\\raw\\hockey_teams_page_5.html', '..\\data\\raw\\hockey_teams_page_6.html', '..\\data\\raw\\hockey_teams_page_7.html', '..\\data\\raw\\hockey_teams_page_8.ht
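In the listing above, `page_10` appears before `page_2`, because the names are ordered as plain strings. This does not change which records are scraped, only the order in which they end up in the results list. If page order matters, a sketch along the following lines could sort the paths by page number. The helper name `page_number` and the assumption that every file name ends in `_<number>.html` are illustrative and not part of the original notebook.

```python
import re

def page_number(path):
    """Return the trailing page number of a name such as 'hockey_teams_page_12.html'."""
    match = re.search(r"_(\d+)\.html$", str(path))
    if match is None:
        raise ValueError(f"no page number in {path!r}")
    return int(match.group(1))

# Sample paths in the same shape as the glob output above.
sample_list = [
    r"..\data\raw\hockey_teams_page_1.html",
    r"..\data\raw\hockey_teams_page_10.html",
    r"..\data\raw\hockey_teams_page_2.html",
]

# Sort numerically by page instead of alphabetically by name.
sorted_list = sorted(sample_list, key=page_number)
print(sorted_list)
```

In the notebook itself, the same `key` could be applied to `source_list` right after the `glob.glob` call.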

## Scraping

The data is extracted from the `html` files, preserving the expected structure of a single record:


In [46]:
for_conversion = []

keys = ["Team Name", "Year", "Wins", "Losses", "Ot Losses", "Win %", "Goals For (GF)", "Goals Against (GA)", "+ / -"]
expected_length = len(keys)

for source in source_list:
    with open(source, "r", encoding="utf-8") as opened:
        opened_source = opened.read()

    soup = BeautifulSoup(opened_source, "html.parser")

    # Each element with class "team" holds the values of one record.
    for data in soup.find_all(class_="team"):
        # strip=True drops empty cells, so a missing value shortens the list.
        new_data = data.get_text(strip=True, separator="/")
        values = new_data.split("/")
        if len(values) < expected_length:
            # Pad the missing overtime-losses value to keep the columns aligned.
            values.insert(4, "")

        data_dict = dict(zip(keys, values))
        for_conversion.append(data_dict)

[{'Team Name': 'Boston Bruins', 'Year': '1990', 'Wins': '44', 'Losses': '24', 'Ot Losses': '', 'Win %': '0.55', 'Goals For (GF)': '299', 'Goals Against (GA)': '264', '+ / -': '35'}, {'Team Name': 'Buffalo Sabres', 'Year': '1990', 'Wins': '31', 'Losses': '30', 'Ot Losses': '', 'Win %': '0.388', 'Goals For (GF)': '292', 'Goals Against (GA)': '278', '+ / -': '14'}, {'Team Name': 'Calgary Flames', 'Year': '1990', 'Wins': '46', 'Losses': '26', 'Ot Losses': '', 'Win %': '0.575', 'Goals For (GF)': '344', 'Goals Against (GA)': '263', '+ / -': '81'}, {'Team Name': 'Chicago Blackhawks', 'Year': '1990', 'Wins': '49', 'Losses': '23', 'Ot Losses': '', 'Win %': '0.613', 'Goals For (GF)': '284', 'Goals Against (GA)': '211', '+ / -': '73'}, {'Team Name': 'Detroit Red Wings', 'Year': '1990', 'Wins': '34', 'Losses': '38', 'Ot Losses': '', 'Win %': '0.425', 'Goals For (GF)': '273', 'Goals Against (GA)': '298', '+ / -': '-25'}, {'Team Name': 'Edmonton Oilers', 'Year': '1990', 'Wins': '37', 'Losses': '37',
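The padding step in the loop assumes that a short row is missing the overtime-losses value and nothing else. A minimal sketch of a stricter variant is shown below: it pads only rows that are exactly one value short and rejects any other length. The function name `build_record` and the choice to raise `ValueError` are assumptions made for illustration and are not part of the original notebook.

```python
KEYS = ["Team Name", "Year", "Wins", "Losses", "Ot Losses",
        "Win %", "Goals For (GF)", "Goals Against (GA)", "+ / -"]
OT_INDEX = KEYS.index("Ot Losses")  # position 4, the index used in the loop above

def build_record(values):
    """Map a list of cell texts onto KEYS, padding a missing 'Ot Losses' value."""
    values = list(values)  # copy so the caller's list is not modified
    if len(values) == len(KEYS) - 1:
        # Assumption: the only value that can be absent is 'Ot Losses'.
        values.insert(OT_INDEX, "")
    if len(values) != len(KEYS):
        raise ValueError(f"expected {len(KEYS)} values, got {len(values)}: {values}")
    return dict(zip(KEYS, values))

# A row without an overtime-losses value, as in the 1990 records.
record = build_record(["Boston Bruins", "1990", "44", "24", "0.55", "299", "264", "35"])
print(record)
```

Failing loudly on an unexpected row length makes a misaligned record visible at scraping time rather than later during analysis.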

In [47]:
json_for_export = json.dumps(for_conversion, indent=4)

[
    {
        "Team Name": "Boston Bruins",
        "Year": "1990",
        "Wins": "44",
        "Losses": "24",
        "Ot Losses": "",
        "Win %": "0.55",
        "Goals For (GF)": "299",
        "Goals Against (GA)": "264",
        "+ / -": "35"
    },
    {
        "Team Name": "Buffalo Sabres",
        "Year": "1990",
        "Wins": "31",
        "Losses": "30",
        "Ot Losses": "",
        "Win %": "0.388",
        "Goals For (GF)": "292",
        "Goals Against (GA)": "278",
        "+ / -": "14"
    },
    {
        "Team Name": "Calgary Flames",
        "Year": "1990",
        "Wins": "46",
        "Losses": "26",
        "Ot Losses": "",
        "Win %": "0.575",
        "Goals For (GF)": "344",
        "Goals Against (GA)": "263",
        "+ / -": "81"
    },
    {
        "Team Name": "Chicago Blackhawks",
        "Year": "1990",
        "Wins": "49",
        "Losses": "23",
        "Ot Losses": "",
        "Win %": "0.613",
        "Goals For (GF)": "284",
  

# Summary

After extracting the relevant information, the final step in preparation for analysis is to save the data.

## Saving the file
The data is saved to the `data/interim/` folder under the name `hockey_teams.json`.

In [48]:
with open(r"..\data\interim\hockey_teams.json", "w", encoding="utf-8") as file:
    file.write(json_for_export)
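As an alternative to building the string first and writing it afterwards, `json.dump` can serialize straight into the open file, and reading the file back gives a quick check that it contains valid JSON. The sketch below writes to a temporary directory so that it runs anywhere. The sample record and the variable names are illustrative; in the notebook the target would remain `data/interim/hockey_teams.json`.

```python
import json
import tempfile
from pathlib import Path

# One sample record in the same shape as the scraped data.
sample_records = [
    {"Team Name": "Boston Bruins", "Year": "1990", "Wins": "44", "Losses": "24",
     "Ot Losses": "", "Win %": "0.55", "Goals For (GF)": "299",
     "Goals Against (GA)": "264", "+ / -": "35"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "hockey_teams.json"

    # Serialize directly into the file instead of via an intermediate string.
    with open(target, "w", encoding="utf-8") as file:
        json.dump(sample_records, file, indent=4)

    # Read the file back to confirm the round trip preserves the records.
    with open(target, "r", encoding="utf-8") as file:
        loaded = json.load(file)

print(loaded == sample_records)
```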