# Introduction

The goal of this task is to extract information about hockey teams from the HTML files in the `data/raw` directory, which were generated in the previous stage, and save it in JSON format. The information to scrape and save includes:  

- Team Name (`Team Name`),  
- Year (`Year`),  
- Number of wins (`Wins`),  
- Number of losses (`Losses`),  
- Number of overtime losses (`OT Losses` - Overtime Losses),  
- Win percentage (`Win %`),  
- Number of goals scored (`Goals For (GF)`),  
- Number of goals conceded (`Goals Against (GA)`),  
- Goal differential (`+ / -`).  

Each collected record should be organized into a dictionary with the structure shown below and then added to the results list:  

```python  
{  
    'Team Name': 'Boston Bruins',  
    'Year': '1990',  
    'Wins': '44',  
    'Losses': '24',  
    'OT Losses': '',  
    'Win %': '0.55',  
    'Goals For (GF)': '299',  
    'Goals Against (GA)': '264',  
    '+ / -': '35'  
}  
```
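
As a minimal sketch of this step, the snippet below builds one record from a list of cell values and appends it to a results list. The names `FIELDS`, `make_record`, and `results` are illustrative assumptions and do not come from the workshop materials.

```python
# Field names in the order the values are expected (taken from the list above).
FIELDS = [
    "Team Name", "Year", "Wins", "Losses", "OT Losses",
    "Win %", "Goals For (GF)", "Goals Against (GA)", "+ / -",
]


def make_record(values):
    """Pair each field name with its value (hypothetical helper)."""
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


results = []
results.append(
    make_record(["Boston Bruins", "1990", "44", "24", "", "0.55", "299", "264", "35"])
)
print(results[0]["Team Name"])
```

Keeping the field names in one place makes it harder for a key to be misspelled in one record and not in another.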


Save the resulting data in a file named `hockey_teams.json` in the `data/interim/` folder. This file will serve as the data source for further analysis in the next part of the workshop.

> At this point, converting HTML to JSON may seem complex and unnecessary, but the exercise is meant to reinforce your knowledge of this data format, which is universal and widespread not only in data analysis but in IT in general.


# Notebook Configuration

## Import Required Libraries

In [2]:
import json
from bs4 import BeautifulSoup
import glob



# Scraping

To scrape the required information from the saved files, follow these steps:

1. Find all HTML files in the `data/raw` folder using the `glob` module.
2. For each HTML file, use `BeautifulSoup` to scrape the page and extract the needed data.
3. Save the extracted data, as partially processed data, in the `hockey_teams.json` file located in the `data/interim/` folder.

These steps will allow for efficient processing of data from HTML files and prepare them for further analysis.
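
The three steps above can be sketched as a single function. This is only an outline under stated assumptions: `run_pipeline` and its `parse_file` argument are hypothetical names, and `parse_file` stands for whatever code turns one HTML file into a list of record dictionaries.

```python
import glob
import json
import os


def run_pipeline(raw_dir, out_path, parse_file):
    """Collect records from every HTML file in raw_dir and save them as JSON.

    parse_file is a hypothetical callable: it takes a file path and
    returns a list of record dictionaries.
    """
    records = []
    # Step 1: find the HTML files (sorted for a stable processing order).
    for path in sorted(glob.glob(os.path.join(raw_dir, "*.html"))):
        # Step 2: extract the records from one file.
        records.extend(parse_file(path))
    # Step 3: write everything to a single JSON file.
    out_dir = os.path.dirname(out_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    return records
```

In this workshop it could be called as `run_pipeline("data/raw", "data/interim/hockey_teams.json", parse_file)`, with `parse_file` supplied by the scraping code.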

## List of HTML files

Using the `glob` module, find all `html` files in the `data/raw` folder.

In [5]:
# Find all HTML files in the data/raw folder
html_files = glob.glob("data/raw/*.html")
for file in html_files:
    print(file)



data/raw\hockey_teams_all_pages.html


## Scraping

Extract data from `html` files, making sure to maintain the expected structure of a single record:

```python
{
    'Team Name': 'Boston Bruins',
    'Year': '1990',
    'Wins': '44',
    'Losses': '24',
    'OT Losses': '',
    'Win %': '0.55',
    'Goals For (GF)': '299',
    'Goals Against (GA)': '264',
    '+ / -': '35'
}
```
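
One way to keep the keys aligned with this structure is to drive the extraction from a mapping between the CSS class of each cell and the expected key. The sketch below assumes the cells carry the classes used elsewhere in this notebook (`name`, `year`, `wins`, and so on). `CLASS_TO_KEY`, `row_to_record`, and `SAMPLE_HTML` are hypothetical names, and the HTML is a hand-written sample rather than content from the real files.

```python
from bs4 import BeautifulSoup

# Assumed mapping from the CSS class of each cell to the expected key.
CLASS_TO_KEY = {
    "name": "Team Name",
    "year": "Year",
    "wins": "Wins",
    "losses": "Losses",
    "ot-losses": "OT Losses",
    "pct": "Win %",
    "gf": "Goals For (GF)",
    "ga": "Goals Against (GA)",
    "diff": "+ / -",
}

# Hand-written sample row for illustration only.
SAMPLE_HTML = """
<table>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year">1990</td>
    <td class="wins">44</td>
    <td class="losses">24</td>
    <td class="ot-losses"></td>
    <td class="pct">0.55</td>
    <td class="gf">299</td>
    <td class="ga">264</td>
    <td class="diff">35</td>
  </tr>
</table>
"""


def row_to_record(row):
    """Turn one <tr class="team"> element into a record dictionary."""
    record = {}
    for css_class, key in CLASS_TO_KEY.items():
        cell = row.select_one(f"td.{css_class}")
        # A missing cell becomes an empty string instead of raising an error.
        record[key] = cell.get_text(strip=True) if cell else ""
    return record


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [row_to_record(row) for row in soup.select("tr.team")]
print(records[0])
```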

In [7]:

# Prepare an empty list, team_data, to collect one record per team
team_data = []

for file in glob.glob("data/raw/*.html"):
    with open(file, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
        # Each team is stored in a table row with the class "team"
        teams = soup.select("tr.team")
        for team in teams:
            # Use the key names required by the expected record structure
            team_info = {
                "Team Name": team.select_one("td.name").text.strip(),
                "Year": team.select_one("td.year").text.strip(),
                "Wins": team.select_one("td.wins").text.strip(),
                "Losses": team.select_one("td.losses").text.strip(),
                "OT Losses": team.select_one("td.ot-losses").text.strip(),
                "Win %": team.select_one("td.pct").text.strip(),
                "Goals For (GF)": team.select_one("td.gf").text.strip(),
                "Goals Against (GA)": team.select_one("td.ga").text.strip(),
                "+ / -": team.select_one("td.diff").text.strip(),
            }
            team_data.append(team_info)

In [8]:
team_data

[{'Team Name': 'Boston Bruins',
  'Year': '1990',
  'Wins': '44',
  'Losses': '24',
  'OT Losses': '',
  'Win %': '0.55',
  'Goals For (GF)': '299',
  'Goals Against (GA)': '264',
  '+ / -': '35'},
 {'Team Name': 'Buffalo Sabres',
  'Year': '1990',
  'Wins': '31',
  'Losses': '30',
  'OT Losses': '',
  'Win %': '0.388',
  'Goals For (GF)': '292',
  'Goals Against (GA)': '278',
  '+ / -': '14'},
 {'Team Name': 'Calgary Flames',
  'Year': '1990',
  'Wins': '46',
  'Losses': '26',
  'OT Losses': '',
  'Win %': '0.575',
  'Goals For (GF)': '344',
  'Goals Against (GA)': '263',
  '+ / -': '81'},
 {'Team Name': 'Chicago Blackhawks',
  'Year': '1990',
  'Wins': '49',
  'Losses': '23',
  'OT Losses': '',
  'Win %': '0.613',
  'Goals For (GF)': '284',
  'Goals Against (GA)': '211',
  '+ / -': '73'},
 {'Team Name': 'Detroit Red Wings',
  'Year': '1990',
  'Wins': '34',
  'Losses': '38',
  'OT Losses': '',
  'Win %': '0.425',
  'Goals For (GF)': '273',
  'Goals Against (GA)': '298',
  '+ / -': '-25'},
 {'Team Name': 'Edmonton Oilers',
  'Year': '1990',
  'Wins': '37',
  'Losses': '37',
  'OT Losses': '',
  'Win %': '0.463',
  'Goals For (GF)': '272',
  'Goals Against (GA)': '272',

# Summary

After extracting the relevant information, the final step in preparation for analysis is to save the data to disk.

## Saving the file
Save the data to the `data/interim/` folder and name the file `hockey_teams.json`.

> Note: Remember to import the appropriate library for handling the JSON format beforehand.

In [12]:
import json

# Write to the location required by the task: data/interim/hockey_teams.json
with open("data/interim/hockey_teams.json", "w", encoding="utf-8") as f:
    json.dump(team_data, f, indent=2, ensure_ascii=False)

print("Data was saved as JSON.")

Data was saved as JSON.
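
A quick way to confirm that the file was written correctly is to load it back and compare it with the list in memory. The sketch below is self-contained under two assumptions: it uses a shortened sample record and a temporary folder in place of `data/interim/`. The variable names `loaded` and `round_trip_ok` are illustrative.

```python
import json
import os
import tempfile

# Shortened sample record; the real list comes from the scraping step.
team_data = [{"Team Name": "Boston Bruins", "Year": "1990"}]

# A temporary folder stands in for data/interim/ so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "hockey_teams.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(team_data, f, indent=2, ensure_ascii=False)

    # Load the file again and compare it with the in-memory list.
    with open(out_path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

round_trip_ok = loaded == team_data
print(round_trip_ok)
```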
