## Objectives
- Use web scraping to gather a list of hockey teams and their information

### Import Libraries
I use several Python libraries for this project:
- pandas
- requests
- BeautifulSoup
- html5lib
- lxml
- urllib

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Extract Data Using Web Scraping
The hockey teams page on https://www.scrapethissite.com/ lists hockey teams together with their Team Name, Year, Wins, Losses, and other statistics. We will scrape the data for all teams in the list and store it in a CSV file.

### Webpage Contents
Use the `requests` library to download the contents of the webpage as text and assign the result to the variable `html_data`.
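As a minimal sketch of this step, the helper below downloads one results page and returns its HTML as text. The function name `fetch_page`, the optional `session` parameter, and the 10-second timeout are assumptions made for illustration; the `page_num` query parameter is the one used by the scraping loop in this notebook.

```python
import requests

BASE_URL = "https://www.scrapethissite.com/pages/forms/"


def fetch_page(page_num, session=None, timeout=10):
    """Return the HTML text of one results page (hypothetical helper)."""
    # Use the caller's session if one is given, otherwise the requests module itself
    http = session if session is not None else requests
    response = http.get(BASE_URL, params={"page_num": page_num}, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling `html_data = fetch_page(1)` would then give the text of the first page. Passing a timeout and checking the status code are optional, but they make a failed download visible right away.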

### Scraping the data
Parse the contents of the webpage with `BeautifulSoup`, then load the table data into a `pandas` DataFrame.
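The parsing step can be sketched on a small, hand-written HTML table before running it against the live site. The `SAMPLE_HTML` string and the `table_to_rows` helper are assumptions for illustration only (the real page markup may differ); the two data rows reuse values shown in the output further down.

```python
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %",
           "Goals For (GF)", "Goals Against (GA)", "+ / -"]

# Hand-written stand-in for one page of results (an assumption, not the real markup)
SAMPLE_HTML = """
<table>
  <tr><th>Team Name</th><th>Year</th><th>Wins</th><th>Losses</th><th>OT Losses</th>
      <th>Win %</th><th>Goals For (GF)</th><th>Goals Against (GA)</th><th>+ / -</th></tr>
  <tr><td> Boston Bruins </td><td>1990</td><td>44</td><td>24</td><td></td>
      <td>0.55</td><td>299</td><td>264</td><td>35</td></tr>
  <tr><td> Buffalo Sabres </td><td>1990</td><td>31</td><td>30</td><td></td>
      <td>0.388</td><td>292</td><td>278</td><td>14</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the text of every <td> cell, one list per table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            # The header row only contains <th> cells, so skip it
            continue
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


# Build the DataFrame once from the collected rows
sample_df = pd.DataFrame(table_to_rows(SAMPLE_HTML), columns=COLUMNS)
print(sample_df)
```

Collecting the rows in a plain list and creating the DataFrame once at the end is one possible design; it avoids growing the DataFrame row by row.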

In [9]:
# The hockey_teams_data DataFrame will store the scraped data, using the columns listed below
hockey_teams_data = pd.DataFrame(columns=["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %", "Goals For (GF)", "Goals Against (GA)", "+ / -"])

# Loop over every results page we want to scrape
for i in range(1, 25):
    url = 'https://www.scrapethissite.com/pages/forms/?page_num='+str(i)
    html_data = requests.get(url).text
    
    soup = BeautifulSoup(html_data, "html.parser")
    table = soup.find('table')

    # Check whether the loop works by printing the link found on each page
    # pagination = soup.find('ul', class_="pagination")
    # np = pagination.find('a').get("href")
    # full_np = "https://www.scrapethissite.com" + np
    # print(full_np)    



    # If needed, run this code to replace every "th" element with "td" so that all table rows can be scraped the same way
    # new_tags_string = ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %", "Goals For (GF)", "Goals Against (GA)", "+ / -"]
    # for replace in table.find_all('th'):
    #     new_tags = soup.new_tag('td')
    #     replace.replace_with(new_tags)
    #     for n in new_tags_string:
    #         new_tags.string = n
    #         break
    #     del(new_tags_string[0])


    # Remove the table header row since it is not needed here; the column names are provided by the pandas DataFrame instead
    remove_tag = table.find('tr') #<---- find only the first element 'tr' in table
    remove_tag.decompose() #<---- remove that element

    for row in table.find_all('tr'):
        cols = row.find_all('td')
        team_name = cols[0].text.strip()
        year = cols[1].text.strip()
        wins = cols[2].text.strip()
        losses = cols[3].text.strip()
        ot_losses = cols[4].text.strip()
        win_rate = cols[5].text.strip()
        gf = cols[6].text.strip()
        ga = cols[7].text.strip()
        diff = cols[8].text.strip()
        # DataFrame.append was removed in pandas 2.0, so add the row by label instead (values follow the column order)
        hockey_teams_data.loc[len(hockey_teams_data)] = [team_name, year, wins, losses, ot_losses, win_rate, gf, ga, diff]


In [10]:
hockey_teams_data

| | Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Against (GA) | + / - |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston Bruins | 1990 | 44 | 24 | | 0.55 | 299 | 264 | 35 |
| 1 | Buffalo Sabres | 1990 | 31 | 30 | | 0.388 | 292 | 278 | 14 |
| 2 | Calgary Flames | 1990 | 46 | 26 | | 0.575 | 344 | 263 | 81 |
| 3 | Chicago Blackhawks | 1990 | 49 | 23 | | 0.613 | 284 | 211 | 73 |
| 4 | Detroit Red Wings | 1990 | 34 | 38 | | 0.425 | 273 | 298 | -25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 577 | Tampa Bay Lightning | 2011 | 38 | 36 | 8 | 0.463 | 235 | 281 | -46 |
| 578 | Toronto Maple Leafs | 2011 | 35 | 37 | 10 | 0.427 | 231 | 264 | -33 |
| 579 | Vancouver Canucks | 2011 | 51 | 22 | 9 | 0.622 | 249 | 198 | 51 |
| 580 | Washington Capitals | 2011 | 42 | 32 | 8 | 0.512 | 222 | 230 | -8 |


In [11]:
hockey_teams_data.to_csv("hockey_teams_data.csv", index=False)
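
The scraped values are stored as text, and OT Losses is empty for some seasons. As an optional sketch, the numeric columns could be converted before saving. The helper name `to_numeric_columns` and the one-row `raw` DataFrame are assumptions for illustration, and the example writes to an in-memory buffer rather than to `hockey_teams_data.csv`.

```python
import io

import pandas as pd


def to_numeric_columns(df, text_columns=("Team Name",)):
    """Return a copy of df with every non-text column converted to numbers."""
    out = df.copy()
    for col in out.columns:
        if col not in text_columns:
            # errors="coerce" turns values that cannot be parsed into NaN
            out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# One row of scraped text values, taken from the output shown above
raw = pd.DataFrame(
    [["Boston Bruins", "1990", "44", "24", "", "0.55", "299", "264", "35"]],
    columns=["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %",
             "Goals For (GF)", "Goals Against (GA)", "+ / -"],
)
clean = to_numeric_columns(raw)

# Write to an in-memory buffer here; pass a file name instead to save to disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With numeric columns, later steps such as sorting by wins or averaging goals work on numbers instead of strings.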