## Introduction to Web Scraping

To install a module for a specific Python version from PowerShell, use the `py` launcher with the `-m pip` option:
```bash 
py -3.10 -m pip install bs4
```

In [49]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Let's download the HTML code of a web page. As an example, we'll use [Giannis Antetokounmpo's Wikipedia page](https://en.wikipedia.org/wiki/Giannis_Antetokounmpo).

In [10]:
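response = requests.get('https://en.wikipedia.org/wiki/Giannis_Antetokounmpo')  # HTTP response object, not a URL
response.status_code  # 200 means success (a notebook only displays this when it is the last line of a cell)
content = BeautifulSoup(response.content, 'html.parser')  # parse the raw HTML with Python's built-in parser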
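
A more defensive variant of the download step sets a timeout and raises an exception on HTTP errors before parsing. The following is a minimal sketch; the helper names `make_soup` and `fetch_soup` and the 10-second timeout are illustrative assumptions, not part of the original notebook. The last lines exercise only the parsing step on hand-written HTML, so they run without network access.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html):
    """Parse raw HTML (str or bytes) with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10.0):
    """Download a page and parse it, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return make_soup(response.content)


# Offline check of the parsing step on a tiny hand-written document.
sample = make_soup("<html><body><h1>Example page</h1></body></html>")
print(sample.find("h1").text)
```

With such a helper, `content = fetch_soup('https://en.wikipedia.org/wiki/Giannis_Antetokounmpo')` would give the same kind of `content` object as the cell above.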

As an example, let's extract the page title:

In [13]:
title = content.find('h1').text
print(title)

Giannis Antetokounmpo
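
Note that `find` returns `None` when no matching tag exists, so calling `.text` on the result raises an `AttributeError` for a page without an `h1`. Below is a small sketch of a guard, using a hypothetical helper `first_text` and hand-written HTML rather than the live page.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag, default=""):
    """Return the stripped text of the first `tag`, or `default` if absent."""
    element = soup.find(tag)
    if element is None:
        return default
    return element.get_text(strip=True)


page = BeautifulSoup("<h1> Giannis Antetokounmpo </h1>", "html.parser")
print(first_text(page, "h1"))
print(first_text(page, "h2", default="(no h2 found)"))
```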


Let's say we want to extract Giannis' stats table.

In [22]:
stats_table = content.find('table', {'class':'wikitable'})
print(stats_table)

<table class="wikitable sortable plainrowheaders" style="text-align:right;">
<caption>
</caption>
<tbody><tr>
<th scope="col">Year
</th>
<th scope="col">Team
</th>
<th scope="col"><abbr title="Games played">GP</abbr>
</th>
<th scope="col"><abbr title="Games started">GS</abbr>
</th>
<th scope="col"><abbr title="Minutes per game">MPG</abbr>
</th>
<th scope="col"><abbr title="Field goal percentage">FG%</abbr>
</th>
<th scope="col"><abbr title="3-point field-goal percentage">3P%</abbr>
</th>
<th scope="col"><abbr title="Free-throw percentage">FT%</abbr>
</th>
<th scope="col"><abbr title="Rebounds per game">RPG</abbr>
</th>
<th scope="col"><abbr title="Assists per game">APG</abbr>
</th>
<th scope="col"><abbr title="Steals per game">SPG</abbr>
</th>
<th scope="col"><abbr title="Blocks per game">BPG</abbr>
</th>
<th scope="col"><abbr title="Points per game">PPG</abbr>
</th></tr>
<tr>
<td style="text-align:left;"><a href="/wiki/2013%E2%80%9314_NBA_season" title="2013–14 NBA season">2013–14</a>
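
The `class` attribute of this table holds several classes, and `find` matches as soon as one of them equals `wikitable`. The sketch below illustrates this on a cut-down, hand-written table, and shows `select_one` with a CSS selector as a possible alternative when a page contains several tables. Which selector is appropriate for the live page is an assumption to verify.

```python
from bs4 import BeautifulSoup

# Hand-written HTML: one unrelated table, then a table shaped like the stats table.
html = """
<table class="infobox"><tr><td>Not the stats</td></tr></table>
<table class="wikitable sortable plainrowheaders">
  <tr><th scope="col">Year</th><th scope="col">Team</th></tr>
  <tr><td>2013–14</td><td>Milwaukee</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches any table whose class list contains "wikitable".
by_class = soup.find("table", {"class": "wikitable"})
# CSS selector requiring both classes to be present.
by_css = soup.select_one("table.wikitable.sortable")

print(by_class["class"])
print(by_css["class"])
```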

We need only the data rows of the table, without the header row, and we need all of them.

We can use the `find_all` method to extract all the rows of the table. A row is represented by the `tr` tag.
For each row, we can then extract its cells, again with `find_all`. A data cell is represented by the `td` tag.

We will use `.text` to get the text of each tag and the string method `strip()` to remove leading and trailing whitespace.
We will then store the data in a list of lists (one inner list per row) and later convert it to a DataFrame.
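
The row and cell extraction described here can be tried offline first. This is a sketch on hand-written HTML that mimics the structure of the stats table; the function name `extract_rows` is illustrative.

```python
from bs4 import BeautifulSoup

# Small stand-in for the stats table, including the stray whitespace
# and line breaks found in the real markup.
html = """
<table class="wikitable">
  <tr><th>Year</th><th>Team</th><th>GP</th></tr>
  <tr><td> 2013–14
  </td><td>Milwaukee
  </td><td>77
  </td></tr>
  <tr><td>2014–15</td><td>Milwaukee</td><td>81</td></tr>
</table>
"""


def extract_rows(table):
    """Return one list of stripped cell texts per row that has td cells."""
    data = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if cells:  # the header row contains only th cells, so it is skipped
            data.append([cell.text.strip() for cell in cells])
    return data


table = BeautifulSoup(html, "html.parser").find("table")
print(extract_rows(table))
```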


In [52]:
rows = stats_table.find_all('tr')
stats = []
for row in rows:
    columns = row.find_all('td')
    if columns:
        stats.append([column.text.strip() for column in columns])  # a list comprehension, as covered in a previous course
print(stats)

[['2013–14', 'Milwaukee', '77', '23', '24.6', '.414', '.347', '.683', '4.4', '1.9', '.8', '.8', '6.8'], ['2014–15', 'Milwaukee', '81', '71', '31.4', '.491', '.159', '.741', '6.7', '2.6', '.9', '1.0', '12.7'], ['2015–16', 'Milwaukee', '80', '79', '35.3', '.506', '.257', '.724', '7.7', '4.3', '1.2', '1.4', '16.9'], ['2016–17', 'Milwaukee', '80', '80', '35.6', '.522', '.272', '.770', '8.7', '5.4', '1.6', '1.9', '22.9'], ['2017–18', 'Milwaukee', '75', '75', '36.7', '.529', '.307', '.760', '10.0', '4.8', '1.5', '1.4', '26.9'], ['2018–19', 'Milwaukee', '72', '72', '32.8', '.578', '.256', '.729', '12.5', '5.9', '1.3', '1.5', '27.7'], ['2019–20', 'Milwaukee', '63', '63', '30.4', '.553', '.304', '.633', '13.6', '5.6', '1.0', '1.0', '29.5'], ['2020–21†', 'Milwaukee', '61', '61', '33.0', '.569', '.303', '.685', '11.0', '5.9', '1.2', '1.2', '28.1'], ['2021–22', 'Milwaukee', '67', '67', '32.9', '.553', '.293', '.722', '11.6', '5.8', '1.1', '1.4', '29.9'], ['2022–23', 'Milwaukee', '63', '63', '32.1'

We also want the column names. Header cells are represented by the `th` tag and sit in the first row of the table, so we can call `find_all('th')` on `rows[0]`.

In [53]:
headers = [header.text.strip() for header in rows[0].find_all('th')]
print(headers)

['Year', 'Team', 'GP', 'GS', 'MPG', 'FG%', '3P%', 'FT%', 'RPG', 'APG', 'SPG', 'BPG', 'PPG']
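
Each abbreviated header carries its full name in the `title` attribute of an `abbr` tag, as the printed HTML of the table shows. The sketch below builds a lookup from abbreviation to description on a hand-copied fragment of that header row; the variable name `legend` is illustrative.

```python
from bs4 import BeautifulSoup

# Fragment copied from the header row of the printed table.
header_html = """
<table><tr>
<th scope="col">Year
</th>
<th scope="col"><abbr title="Games played">GP</abbr>
</th>
<th scope="col"><abbr title="Points per game">PPG</abbr>
</th></tr></table>
"""
header_row = BeautifulSoup(header_html, "html.parser").find("tr")

legend = {}
for th in header_row.find_all("th"):
    abbr = th.find("abbr")
    # Headers such as "Year" have no abbr tag and are left out.
    if abbr is not None and abbr.has_attr("title"):
        legend[abbr.text.strip()] = abbr["title"]

print(legend)
```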


Let's store the stats in a DataFrame, with the column names extracted from the headers of the table.

In [54]:
statsdf = pd.DataFrame(stats, columns=headers)
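
Building the DataFrame this way assumes every row has one value per header. If a table contains a row with a different number of cells (for example a summary row whose cell spans several columns; whether this table has one is not visible in the truncated output above), it can help to set such rows aside first. A sketch with shortened rows, the last of which is made up for illustration:

```python
import pandas as pd

headers = ["Year", "Team", "GP"]
stats = [
    ["2013–14", "Milwaukee", "77"],
    ["2014–15", "Milwaukee", "81"],
    ["Career", "158"],  # hypothetical row where one cell spans two columns
]

# Keep only rows that line up with the headers; collect the rest for inspection.
complete = [row for row in stats if len(row) == len(headers)]
irregular = [row for row in stats if len(row) != len(headers)]

statsdf = pd.DataFrame(complete, columns=headers)
print(statsdf.shape)
print(irregular)
```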

In [56]:
statsdf.head(14)

| | Year | Team | GP | GS | MPG | FG% | 3P% | FT% | RPG | APG | SPG | BPG | PPG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013–14 | Milwaukee | 77 | 23.0 | 24.6 | 0.414 | 0.347 | 0.683 | 4.4 | 1.9 | 0.8 | 0.8 | 6.8 |
| 1 | 2014–15 | Milwaukee | 81 | 71.0 | 31.4 | 0.491 | 0.159 | 0.741 | 6.7 | 2.6 | 0.9 | 1.0 | 12.7 |
| 2 | 2015–16 | Milwaukee | 80 | 79.0 | 35.3 | 0.506 | 0.257 | 0.724 | 7.7 | 4.3 | 1.2 | 1.4 | 16.9 |
| 3 | 2016–17 | Milwaukee | 80 | 80.0 | 35.6 | 0.522 | 0.272 | 0.77 | 8.7 | 5.4 | 1.6 | 1.9 | 22.9 |
| 4 | 2017–18 | Milwaukee | 75 | 75.0 | 36.7 | 0.529 | 0.307 | 0.76 | 10.0 | 4.8 | 1.5 | 1.4 | 26.9 |
| 5 | 2018–19 | Milwaukee | 72 | 72.0 | 32.8 | 0.578 | 0.256 | 0.729 | 12.5 | 5.9 | 1.3 | 1.5 | 27.7 |
| 6 | 2019–20 | Milwaukee | 63 | 63.0 | 30.4 | 0.553 | 0.304 | 0.633 | 13.6 | 5.6 | 1.0 | 1.0 | 29.5 |
| 7 | 2020–21† | Milwaukee | 61 | 61.0 | 33.0 | 0.569 | 0.303 | 0.685 | 11.0 | 5.9 | 1.2 | 1.2 | 28.1 |
| 8 | 2021–22 | Milwaukee | 67 | 67.0 | 32.9 | 0.553 | 0.293 | 0.722 | 11.6 | 5.8 | 1.1 | 1.4 | 29.9 |
| 9 | 2022–23 | Milwaukee | 63 | 63.0 | 32.1 | 0.553 | 0.275 | 0.645 | 11.8 | 5.7 | 0.8 | 0.8 | 31.1 |
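
The values scraped from the page are strings, and the 2020–21 season carries a `†` marker. Here is a sketch of stripping the marker and converting columns to numbers with `pd.to_numeric`, using three rows copied from the scraped list above with only a few columns kept for brevity. The choice of which columns to convert is an assumption.

```python
import pandas as pd

headers = ["Year", "Team", "GP", "FG%", "PPG"]
stats = [
    ["2013–14", "Milwaukee", "77", ".414", "6.8"],
    ["2019–20", "Milwaukee", "63", ".553", "29.5"],
    ["2020–21†", "Milwaukee", "61", ".569", "28.1"],
]
statsdf = pd.DataFrame(stats, columns=headers)

# Remove the footnote marker from the season label.
statsdf["Year"] = statsdf["Year"].str.replace("†", "", regex=False)

# Convert the string columns that hold numbers into numeric dtypes.
numeric_cols = ["GP", "FG%", "PPG"]
statsdf[numeric_cols] = statsdf[numeric_cols].apply(pd.to_numeric)

print(statsdf.dtypes)
print(statsdf["PPG"].max())
```

Once the columns are numeric, aggregations such as `max()` or `mean()` operate on numbers rather than on text.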
