## Objectives
- Use web scraping to gather a list of hockey teams and their information

### Import Libraries
I use several Python libraries for this project:
- pandas
- requests
- BeautifulSoup
- html5lib
- lxml
- urllib

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import html5lib
import lxml
from urllib.parse import urljoin

### Extract Data Using Web Scraping
The website https://www.scrapethissite.com/ provides a page listing hockey teams along with their Team Name, Year, Wins, Losses, and other statistics. We will scrape the data for all teams in the list and store it in CSV files.

### Webpage Contents
Fetch the contents of the webpage as text using the `requests` library and assign it to the variable `data`.

In [2]:
url = 'https://www.scrapethissite.com/pages/forms/?page_num=1'
data = requests.get(url).text
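
The request above can be made a little more defensive. The following is a minimal sketch, not the method used in this notebook: `build_page_url` and `fetch_html` are hypothetical helper names, and the 10-second timeout is an assumed value. It builds the page URL from the `page_num` query parameter seen above and raises an error on a failed HTTP response instead of silently parsing an error page.

```python
from urllib.parse import urlencode, urljoin

import requests

BASE_URL = "https://www.scrapethissite.com/"


def build_page_url(page_num: int) -> str:
    """Return the URL of one results page (hypothetical helper)."""
    # urljoin resolves the relative path against the site root
    return urljoin(BASE_URL, "pages/forms/") + "?" + urlencode({"page_num": page_num})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # timeout value is an assumption
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response.text


first_page_url = build_page_url(1)
```

Calling `fetch_html(first_page_url)` should then return the same text as `requests.get(url).text` for the first page.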

In [4]:
# print(data)

### Scraping the data
Using the page contents and `BeautifulSoup`, load the data from the webpage into a `pandas` DataFrame.

First, parse the contents of the webpage with BeautifulSoup.

In [3]:
soup = BeautifulSoup(data, "html.parser")

In [4]:
table = soup.find('table')
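
`soup.find` returns `None` when nothing matches, so it can help to check the result before using it. Here is a small sketch on a hypothetical, heavily shortened stand-in for the page's HTML (the `sample_*` names are illustrative only). It also shows that the `team` and `name` classes visible in the table markup can be used to select rows and cells directly.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page
sample_html = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th></tr>
  <tr class="team"><td class="name"> Boston Bruins </td><td class="year"> 1990 </td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns None if no matching element exists
sample_table = sample_soup.find("table", class_="table")
if sample_table is None:
    raise ValueError("No <table class='table'> found in the page")

# Select only the data rows, skipping the header row
team_rows = sample_table.find_all("tr", class_="team")
first_name = team_rows[0].find("td", class_="name").get_text(strip=True)
```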

In [5]:
table

<table class="table">
<tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>
<tr class="team">
<td class="name">
                            Boston Bruins
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            44
                        </td>
<td class="losses">
                            2

In [6]:
# Replace every "th" element with a "td" element so the header row can be scraped the same way as the data rows
new_tags_string = ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %", "Goals For (GF)", "Goals Against (GA)", "+ / -"]

# Pair each header cell with its clean label, in order
for th, label in zip(table.find_all('th'), new_tags_string):
    new_tag = soup.new_tag('td')
    new_tag.string = label
    th.replace_with(new_tag)
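
As an alternative to hardcoding the labels, they could be read from the header cells themselves. This is a hedged sketch on a hypothetical two-column snippet, not the approach used above: `get_text(strip=True)` removes the surrounding whitespace, and passing a list of tag names to `find_all` matches both `th` and `td` cells without replacing any tags.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the whitespace-padded header cells
sample_html = """<table class="table"><tr>
<th>
    Team Name
</th>
<th>
    Goals For (GF)
</th>
</tr></table>"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# Read the labels straight from the header cells
header_labels = [th.get_text(strip=True) for th in sample_table.find_all("th")]

# find_all accepts a list of tag names, so th and td cells can be read together
first_row = sample_table.find("tr")
cells = [cell.get_text(strip=True) for cell in first_row.find_all(["th", "td"])]
```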

In [11]:
table.prettify()

'<table class="table">\n <tr>\n  <td>\n   Team Name\n  </td>\n  <td>\n   Year\n  </td>\n  <td>\n   Wins\n  </td>\n  <td>\n   Losses\n  </td>\n  <td>\n   OT Losses\n  </td>\n  <td>\n   Win %\n  </td>\n  <td>\n   Goals For (GF)\n  </td>\n  <td>\n   Goals Against (GA)\n  </td>\n  <td>\n   + / -\n  </td>\n </tr>\n <tr class="team">\n  <td class="name">\n   Boston Bruins\n  </td>\n  <td class="year">\n   1990\n  </td>\n  <td class="wins">\n   44\n  </td>\n  <td class="losses">\n   24\n  </td>\n  <td class="ot-losses">\n  </td>\n  <td class="pct text-success">\n   0.55\n  </td>\n  <td class="gf">\n   299\n  </td>\n  <td class="ga">\n   264\n  </td>\n  <td class="diff text-success">\n   35\n  </td>\n </tr>\n <tr class="team">\n  <td class="name">\n   Buffalo Sabres\n  </td>\n  <td class="year">\n   1990\n  </td>\n  <td class="wins">\n   31\n  </td>\n  <td class="losses">\n   30\n  </td>\n  <td class="ot-losses">\n  </td>\n  <td class="pct text-danger">\n   0.388\n  </td>\n  <td class="gf">\n 

In [10]:
columns = ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %", "Goals For (GF)", "Goals Against (GA)", "+ / -"]
rows = []  # collect one dict per table row; DataFrame.append was removed in pandas 2.0

for row in table.find_all('tr'):
    cols = row.find_all('td')
    team_name = cols[0].string
    year = cols[1].string
    wins = cols[2].string
    losses = cols[3].string
    ot_losses = cols[4].string
    win_rate = cols[5].string
    gf = cols[6].string
    ga = cols[7].string
    diff = cols[8].string
    rows.append({"Team Name": team_name, "Year": year, "Wins": wins, "Losses": losses, "OT Losses": ot_losses, "Win %": win_rate, "Goals For (GF)": gf, "Goals Against (GA)": ga, "+ / -": diff})

# Build the DataFrame once from all collected rows
hockey_teams_data = pd.DataFrame(rows, columns=columns)

Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Against (GA) | + / -
--- | --- | --- | --- | --- | --- | --- | --- | ---
Boston Bruins | 1990 | 44 | 24 |  | 0.55 | 299 | 264 | 35
Buffalo Sabres | 1990 | 31 | 30 |  | 0.388 | 292 | 278
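
The scraped cells are strings padded with whitespace, and the goal stated earlier is to store the data in CSV files. The sketch below shows one possible cleanup and export step. It uses a few hypothetical rows shaped like the scraped values rather than the real DataFrame, and the choice of which columns to convert to numbers is an assumption.

```python
import io

import pandas as pd

# Hypothetical rows shaped like the scraped strings (whitespace included)
raw = pd.DataFrame(
    {
        "Team Name": ["\n  Boston Bruins\n ", "\n  Buffalo Sabres\n "],
        "Year": ["\n  1990\n ", "\n  1990\n "],
        "Wins": ["\n  44\n ", "\n  31\n "],
        "OT Losses": ["\n ", "\n "],
    }
)

# Strip the surrounding whitespace from every column
clean = raw.apply(lambda col: col.str.strip())

# Convert numeric columns; empty cells become NaN instead of raising an error
numeric_cols = ["Year", "Wins", "OT Losses"]
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Write CSV text to an in-memory buffer so the sketch leaves no file behind
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

To write an actual file, a path such as `"hockey_teams.csv"` (a hypothetical filename) can be passed to `to_csv` in place of the buffer.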
 