## Objectives
- In this project, I use web scraping to collect data on hockey teams.
- Once all the necessary data is collected, I run an Extract, Transform, and Load (ETL) process on it.

### Import Libraries
I am using several Python libraries for this project:
- pandas
- requests
- BeautifulSoup
- html5lib

In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Extract Data Using Web Scraping
The website https://www.scrapethissite.com/ provides a list of hockey teams along with their Team Name, Year, Wins, Losses, and other statistics. We will scrape the data for all teams in the list and store it in a CSV file.

#### Webpage Contents
Fetch the contents of the webpage as text using the `requests` library and assign the result to the variable `html_data`.
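The fetch step can be sketched as follows, assuming the paginated URL pattern used later in this notebook. The helper names `build_page_url` and `fetch_html` are hypothetical, and the 10-second timeout is an arbitrary choice, not something the site requires.

```python
import requests

BASE_URL = "https://www.scrapethissite.com/pages/forms/"


def build_page_url(page_num):
    """Return the URL of one results page (hypothetical helper)."""
    return f"{BASE_URL}?page_num={page_num}"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


# Example (performs a network request, so it is left commented out):
# html_data = fetch_html(build_page_url(1))
```

Calling `raise_for_status()` makes a failed request stop the scrape early, which is usually easier to debug than a missing table further down.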

#### Scraping the Data
Using the page contents and BeautifulSoup, load the data from the webpage into a pandas DataFrame.

First, parse the contents of the webpage with BeautifulSoup.
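To make the parsing step concrete, here is a small sketch that runs against an inline HTML snippet instead of the live site. The sample table and its values are made up for illustration. It collects the rows in a list and builds the DataFrame once at the end, which avoids growing a DataFrame row by row.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration; the values are not real site data.
sample_html = """
<table>
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr><td> Team A </td><td> 1990 </td><td> 44 </td></tr>
  <tr><td> Team B </td><td> 1990 </td><td> 31 </td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # the header row holds <th> cells only, so skip it
        continue
    rows.append([cell.text.strip() for cell in cells])

sample_df = pd.DataFrame(rows, columns=["Team Name", "Year", "Wins"])
print(sample_df)
```

Skipping rows without `<td>` cells is an alternative to removing the header row from the parsed tree. Both approaches leave only the data rows.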

In [None]:
# Create the hockey_teams_data DataFrame that will store the scraped data, using the columns listed below
hockey_teams_data = pd.DataFrame(columns=["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %", "Goals For (GF)", "Goals Againts (GA)", "+/-"])

# Loop over every results page we want to scrape (pages 1 through 24)
for i in range(1, 25):
    url = 'https://www.scrapethissite.com/pages/forms/?page_num='+str(i)
    html_data = requests.get(url).text

    soup = BeautifulSoup(html_data, "html.parser")
    table = soup.find('table')

    # Remove the header row: it is not needed here because the column names are already set on the DataFrame created above
    remove_head = table.find('tr') #<---- find only the first 'tr' element in the table
    remove_head.decompose() #<---- remove that element

    # Loop over every table row and read its cells
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        team_name = cols[0].text.strip()
        year = cols[1].text.strip()
        wins = cols[2].text.strip()
        losses = cols[3].text.strip()
        ot_losses = cols[4].text.strip()
        win_rate = cols[5].text.strip()
        gf = cols[6].text.strip()
        ga = cols[7].text.strip()
        diff = cols[8].text.strip()
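        # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat instead
        new_row = pd.DataFrame([{"Team Name": team_name, "Year": year, "Wins": wins, "Losses": losses, "OT Losses": ot_losses, "Win %": win_rate, "Goals For (GF)": gf, "Goals Againts (GA)": ga, "+/-": diff}])
        hockey_teams_data = pd.concat([hockey_teams_data, new_row], ignore_index=True)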

In [None]:
hockey_teams_data

#### Export Data
Load the `pandas` dataframe created above into a CSV file named `hockey_teams_data.csv` using the `to_csv()` function.

In [5]:
hockey_teams_data.to_csv("hockey_teams_data.csv", index=False)
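Every scraped value is still a string at this point. Reading the CSV back lets pandas infer numeric types, but the conversion can also be done explicitly before exporting. The following is a minimal sketch on made-up rows, not a step the notebook performs.

```python
import pandas as pd

# Made-up rows for illustration; an empty string stands for a blank table cell.
sample = pd.DataFrame({
    "Team Name": ["Team A", "Team B"],
    "Year": ["1990", "2011"],
    "OT Losses": ["", "8"],
})

numeric_cols = ["Year", "OT Losses"]
# With errors="coerce", blank or unparseable values become NaN instead of raising
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(sample.dtypes)
```

A column that contains NaN is stored as floating point, which is why values such as `8.0` appear in `OT Losses` in the outputs of the ETL section.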

### ETL
- Read CSV file
- Extract data
- Transform data
- Save the transformed data

In [4]:
import glob
from datetime import datetime

#### Set Paths

In [5]:
tmpfile = "tmpfile.tmp"                     # file used to store all extracted data
logfile = "logfile.txt"                     # all event logs will be stored in this file
targetfile = "transformed_data.csv"         # file where transformed data is stored

#### Extract

In [6]:
def extract_from_csv(file_to_process):
    dataframe = pd.read_csv(file_to_process)
    return dataframe

#### Extract Function

In [7]:
def extract():
    columns = ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %", "Goals For (GF)", "Goals Againts (GA)", "+/-"]
    frames = [extract_from_csv(csvfile) for csvfile in glob.glob("*.csv")] # read every CSV file in the working directory
    # DataFrame.append was removed in pandas 2.0, so combine the frames with a single pd.concat
    extracted_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=columns)
    return extracted_data
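One thing to watch with `glob.glob("*.csv")` is that it also matches any CSV written by a previous run, such as the target file. Here is a hedged sketch of an extract step that skips named files. The function `extract_from_dir` and its `exclude` parameter are hypothetical and not part of the notebook.

```python
import glob
import os
import tempfile

import pandas as pd


def extract_from_dir(directory, exclude=()):
    """Concatenate every CSV in `directory`, skipping file names listed in `exclude`."""
    frames = []
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        if os.path.basename(path) in exclude:
            continue  # e.g. skip output left over from a previous run
        frames.append(pd.read_csv(path))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Demonstration in a throwaway directory with made-up data
with tempfile.TemporaryDirectory() as tmpdir:
    pd.DataFrame({"Year": [1990]}).to_csv(os.path.join(tmpdir, "source.csv"), index=False)
    pd.DataFrame({"Year": [2000]}).to_csv(os.path.join(tmpdir, "transformed_data.csv"), index=False)
    demo = extract_from_dir(tmpdir, exclude=("transformed_data.csv",))

print(demo)
```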

#### Transform
The users made a small mistake when entering the data, and it needs to be corrected before it causes problems later on.
- The `Year` column should start at 2000, so every year is shifted forward by 10.

In [8]:
def transform(data):
    data['Year'] = data['Year'] + 10
    return data
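Note that assigning to `data['Year']` modifies the DataFrame that was passed in, so the extracted data changes as well. If the original values should stay available, a variant that works on a copy could look like this. The name `transform_copy` and the `year_offset` parameter are assumptions made for illustration.

```python
import pandas as pd


def transform_copy(data, year_offset=10):
    """Return a shifted copy and leave the input DataFrame untouched."""
    result = data.copy()
    result["Year"] = result["Year"] + year_offset
    return result


# Made-up rows for illustration
before = pd.DataFrame({"Team Name": ["Team A", "Team B"], "Year": [1990, 2011]})
after = transform_copy(before)
print(before["Year"].tolist(), after["Year"].tolist())
```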

#### Loading

In [9]:
def load(targetfile, data_to_load):
    # index=False keeps the row index out of the file, matching the export step above
    data_to_load.to_csv(targetfile, index=False)

#### Logging

In [10]:
def log(message):
    timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-MonthName-Day-Hour:Minute:Second (%b is the portable form of %h)
    now = datetime.now() # get the current timestamp
    timestamp = now.strftime(timestamp_format)
    with open(logfile, "a") as f: # logfile is defined under "Set Paths"
        f.write(timestamp + ',' + message + '\n')
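To see the log format in isolation, here is a small sketch that writes to a temporary file instead of the real log. The function `log_to` is a hypothetical variant that takes the file path as a parameter.

```python
import os
import tempfile
from datetime import datetime


def log_to(path, message):
    """Append one 'timestamp,message' line to the file at `path`."""
    timestamp = datetime.now().strftime("%Y-%b-%d-%H:%M:%S")
    with open(path, "a") as f:
        f.write(f"{timestamp},{message}\n")


# Demonstration against a temporary file
with tempfile.TemporaryDirectory() as tmpdir:
    demo_log = os.path.join(tmpdir, "demo_log.txt")
    log_to(demo_log, "ETL Job Started")
    log_to(demo_log, "ETL Job Ended")
    with open(demo_log) as f:
        log_lines = f.read().splitlines()

print(log_lines)
```

Opening the file in append mode (`"a"`) keeps earlier entries, so each run adds to the existing log.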

#### Running ETL Process

In [11]:
log("ETL Job Started")

In [12]:
log("Extract phase Started")
extracted_data = extract()
log("Extract phase Ended")
extracted_data



| | Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Againts (GA) | +/- |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston Bruins | 1990 | 44 | 24 | | 0.550 | 299 | 264 | 35 |
| 1 | Buffalo Sabres | 1990 | 31 | 30 | | 0.388 | 292 | 278 | 14 |
| 2 | Calgary Flames | 1990 | 46 | 26 | | 0.575 | 344 | 263 | 81 |
| 3 | Chicago Blackhawks | 1990 | 49 | 23 | | 0.613 | 284 | 211 | 73 |
| 4 | Detroit Red Wings | 1990 | 34 | 38 | | 0.425 | 273 | 298 | -25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 577 | Tampa Bay Lightning | 2011 | 38 | 36 | 8.0 | 0.463 | 235 | 281 | -46 |
| 578 | Toronto Maple Leafs | 2011 | 35 | 37 | 10.0 | 0.427 | 231 | 264 | -33 |
| 579 | Vancouver Canucks | 2011 | 51 | 22 | 9.0 | 0.622 | 249 | 198 | 51 |
| 580 | Washington Capitals | 2011 | 42 | 32 | 8.0 | 0.512 | 222 | 230 | -8 |


In [13]:
log("Transform phase Started")
transformed_data = transform(extracted_data)
log("Transform phase Ended")
transformed_data 

| | Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Againts (GA) | +/- |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston Bruins | 2000 | 44 | 24 | | 0.550 | 299 | 264 | 35 |
| 1 | Buffalo Sabres | 2000 | 31 | 30 | | 0.388 | 292 | 278 | 14 |
| 2 | Calgary Flames | 2000 | 46 | 26 | | 0.575 | 344 | 263 | 81 |
| 3 | Chicago Blackhawks | 2000 | 49 | 23 | | 0.613 | 284 | 211 | 73 |
| 4 | Detroit Red Wings | 2000 | 34 | 38 | | 0.425 | 273 | 298 | -25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 577 | Tampa Bay Lightning | 2021 | 38 | 36 | 8.0 | 0.463 | 235 | 281 | -46 |
| 578 | Toronto Maple Leafs | 2021 | 35 | 37 | 10.0 | 0.427 | 231 | 264 | -33 |
| 579 | Vancouver Canucks | 2021 | 51 | 22 | 9.0 | 0.622 | 249 | 198 | 51 |
| 580 | Washington Capitals | 2021 | 42 | 32 | 8.0 | 0.512 | 222 | 230 | -8 |


In [14]:
log("Load phase Started")
load(targetfile,transformed_data)
log("Load phase Ended")

In [15]:
log("ETL Job Ended")
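
The start and end messages around each phase follow the same pattern, so they could be wrapped in a small helper. This is only a sketch: `run_phase` is a hypothetical function, and by default it prints instead of writing to the log file, so it runs on its own.

```python
def run_phase(name, func, *args, log=print):
    """Log the start and end of a phase, and log failures before re-raising."""
    log(f"{name} phase Started")
    try:
        result = func(*args)
    except Exception as exc:
        log(f"{name} phase Failed: {exc}")
        raise
    log(f"{name} phase Ended")
    return result


# Demonstration with a list collecting the messages instead of a log file
messages = []
doubled = run_phase("Transform", lambda x: x * 2, 21, log=messages.append)
print(doubled, messages)
```

In the notebook, the same helper could be called as `run_phase("Extract", extract, log=log)`, assuming the `log` and `extract` functions defined above.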