### How Does Web Scraping Work?
When we scrape the web, we write code that sends a request to the server hosting the page we specified. Our code then downloads that page’s source code, just as a browser would. But instead of displaying the page visually, it searches the page for the HTML elements we’ve specified and extracts whatever content we’ve instructed it to extract.

For example, if we wanted to get all of the titles inside H2 tags from a website, we could write some code to do that. Our code would request the site’s content from its server and download it. Then it would go through the page’s HTML looking for the H2 tags. Whenever it found an H2 tag, it would copy whatever text is inside the tag, and output it in whatever format we specified.
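The H2 example above can be sketched as follows. This is a minimal illustration, not a complete scraper: the HTML string is a made-up stand-in for a downloaded page, and `extract_h2_titles` is a hypothetical helper name. It uses Python's built-in `html.parser` backend so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page source standing in for a page downloaded from a server.
html = """
<html><body>
  <h1>Site name</h1>
  <h2>First headline</h2>
  <p>Some text.</p>
  <h2>Second headline</h2>
</body></html>
"""

def extract_h2_titles(page_source):
    """Return the text of every <h2> tag in the given HTML string."""
    soup = BeautifulSoup(page_source, "html.parser")
    # find_all returns every matching tag; get_text drops the markup.
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

titles = extract_h2_titles(html)
print(titles)
```

Returning a plain list keeps the output format open: the same titles could be printed, written to a CSV file, or loaded into a DataFrame.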

### The Components of a Web Page
HTML: contains the main content of the page.

CSS: adds styling to make the page look nicer.

JS: JavaScript files add interactivity to web pages.

Images: image formats such as JPG and PNG allow web pages to show pictures.
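To make the four components concrete, here is a small sketch that parses a made-up page and lists the CSS, JavaScript, and image files it references. The page content and file names (`style.css`, `app.js`, and so on) are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the HTML carries the content and points at the other files.
page = """
<html>
<head>
  <link rel="stylesheet" href="style.css">
  <script src="app.js"></script>
</head>
<body>
  <p>Main content lives in the HTML itself.</p>
  <img src="photo.jpg" alt="A photo">
  <img src="logo.png" alt="Logo">
</body>
</html>
"""

soup = BeautifulSoup(page, "html.parser")

# CSS: <link> tags whose rel attribute includes "stylesheet".
stylesheets = [
    link["href"]
    for link in soup.find_all("link")
    if "stylesheet" in (link.get("rel") or [])
]
# JS: <script> tags that load an external file via src.
scripts = [script["src"] for script in soup.find_all("script", src=True)]
# Images: every <img> tag that has a src attribute.
images = [img["src"] for img in soup.find_all("img", src=True)]

print(stylesheets, scripts, images)
```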

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
from requests import get

In [2]:
import bs4 as bs
import urllib.request

### Specify the URL to Scrape

In [3]:
src = urllib.request.urlopen('https://www.nytimes.com/').read()

In [4]:
bsoup = bs.BeautifulSoup(src, 'lxml')

In [6]:
print(bsoup)

<!DOCTYPE html>
<html class="nytapp-vi-homepage" lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<meta charset="utf-8"/>
<title data-rh="true">The New York Times - Breaking News, US News, World News and Videos</title>
<meta content="Live news, investigations, opinion, photos and video by the journalists of The New York Times from more than 150 countries around the world. Subscribe for coverage of U.S. and international news, politics, business, technology, science, health, arts, sports and more." data-rh="true" name="description"/><meta content="https://www.nytimes.com" data-rh="true" property="og:url"/><meta content="website" data-rh="true" property="og:type"/><meta content="The New York Times - Breaking News, US News, World News and Videos" data-rh="true" property="og:title"/><meta content="Live news, investigations, opinion, photos and video by the journalists of The New York Times from more than 150 countries around the world. Subscribe for coverage of U.S. and in

In [7]:
# .text returns only the text inside the tag (here, <title>), without the markup
print(bsoup.title.text)

The New York Times - Breaking News, US News, World News and Videos


In [8]:
for link in bsoup.find_all('a'):
    print(link.get('href'))

#site-content
#site-index
#after-dfp-ad-top
/
/
/international/
/ca/
https://www.nytimes.com/es/
https://cn.nytimes.com
https://www.nytimes.com/section/todayspaper
/
https://www.nytimes.com/section/us
https://www.nytimes.com/section/us
https://www.nytimes.com/section/politics
https://www.nytimes.com/section/nyregion
https://www.nytimes.com/column/california-today
https://www.nytimes.com/section/education
https://www.nytimes.com/section/health
https://www.nytimes.com/section/obituaries
https://www.nytimes.com/section/science
https://www.nytimes.com/section/climate
https://www.nytimes.com/section/sports
https://www.nytimes.com/section/business
https://www.nytimes.com/section/technology
https://www.nytimes.com/section/upshot
https://www.nytimes.com/section/magazine
https://www.nytimes.com/news-event/2024-election
https://www.nytimes.com/topic/organization/us-supreme-court
https://www.nytimes.com/topic/organization/us-congress
https://www.nytimes.com/spotlight/joe-biden
https://www.nytimes
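The output above mixes in-page anchors (`#site-content`), relative paths (`/international/`), and full URLs. One possible way to normalize them is `urllib.parse.urljoin` from the standard library. The sketch below works on a short hand-copied sample of the hrefs rather than the live page, and `absolute_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

base_url = "https://www.nytimes.com/"

# A small sample of values like those printed above; link.get('href')
# returns None for <a> tags that have no href attribute.
hrefs = ["#site-content", "/", "/international/", "https://cn.nytimes.com", None]

def absolute_links(base, raw_hrefs):
    """Drop missing hrefs and in-page anchors, and resolve the rest against base."""
    resolved = []
    for href in raw_hrefs:
        if not href or href.startswith("#"):
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        resolved.append(urljoin(base, href))
    return resolved

links = absolute_links(base_url, hrefs)
print(links)
```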

In [9]:
print(bsoup.p)

<p>Make sense of the day’s news and ideas.</p>


### Find All Elements With a Particular Tag

In [10]:
print(bsoup.find_all('p'))

[<p>Make sense of the day’s news and ideas.</p>, <p>Analysis that explains politics, policy and everyday life.</p>, <p class="css-1druyfb">See all newsletters</p>, <p>The biggest stories of our time, in 20 minutes a day.</p>, <p>On the campaign trail with Astead Herndon.</p>, <p class="css-1druyfb">See all podcasts</p>, <p>Get what you need to know to start your day.</p>, <p>Original analysis on the week’s biggest global stories.</p>, <p>News, features and opinion for readers in the region.</p>, <p>Backstories and analysis from our Canadian correspondents.</p>, <p class="css-1druyfb">See all newsletters</p>, <p>The most crucial business and policy news you need to know.</p>, <p class="css-1druyfb">See all newsletters</p>, <p>Our tech journalists help you make sense of the rapidly changing tech world.</p>, <p class="css-1druyfb">See all podcasts</p>, <p>Book recommendations from our critics.</p>, <p>Streaming TV and movie recommendations.</p>, <p class="css-1druyfb">See all newsletters<
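Some of the paragraphs above carry a `class` attribute (`css-1druyfb`) and others do not. `find_all` accepts a `class_` keyword to filter on it. The sketch below runs on a few paragraphs copied from the output instead of the live page. Generated class names like this one may change whenever the site is redeployed, so treat the value as an example only.

```python
from bs4 import BeautifulSoup

# A few paragraphs in the same shape as the output above.
html = """
<p>Make sense of the day's news and ideas.</p>
<p class="css-1druyfb">See all newsletters</p>
<p>The biggest stories of our time, in 20 minutes a day.</p>
<p class="css-1druyfb">See all podcasts</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the <p> tags that have the given CSS class.
by_class = [p.get_text() for p in soup.find_all("p", class_="css-1druyfb")]
# Only the <p> tags that have no class attribute at all.
without_class = [p.get_text() for p in soup.find_all("p") if not p.has_attr("class")]

print(by_class)
print(without_class)
```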

In [11]:
print(bsoup.find('p').get_text())

Make sense of the day’s news and ideas.


In [12]:
ptags = bsoup.find_all('p')
for p in ptags:
    print(p.text)

Make sense of the day’s news and ideas.
Analysis that explains politics, policy and everyday life.
See all newsletters
The biggest stories of our time, in 20 minutes a day.
On the campaign trail with Astead Herndon.
See all podcasts
Get what you need to know to start your day.
Original analysis on the week’s biggest global stories.
News, features and opinion for readers in the region.
Backstories and analysis from our Canadian correspondents.
See all newsletters
The most crucial business and policy news you need to know.
See all newsletters
Our tech journalists help you make sense of the rapidly changing tech world.
See all podcasts
Book recommendations from our critics.
Streaming TV and movie recommendations.
See all newsletters
The podcast that takes you inside the literary world.
Pop music news, new songs and albums, and artists of note.
See all podcasts
The latest news on what we wear, by our chief fashion critic.
Real stories of relationship highs, lows and woes.
See all newslette

In [13]:
src = urllib.request.urlopen('http://www.espn.com/nba/statistics/player/_/stat/assists/sort/avgAssists/').read()
bsoup = bs.BeautifulSoup(src, 'lxml')
tbl = bsoup.find('table')

In [14]:
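tbl_rows = tbl.find_all('tr')
for tr in tbl_rows:
    # Collect the text of every data cell (<td>) in this row;
    # a row with no <td> cells prints as an empty list
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    print(row)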

[]
['1', 'Joel EmbiidPHI']
['2', 'Luka DoncicDAL']
['3', 'Giannis AntetokounmpoMIL']
['4', 'Shai Gilgeous-AlexanderOKC']
['5', 'Kevin DurantPHX']
['6', 'Donovan MitchellCLE']
['7', 'Stephen CurryGS']
['8', 'Devin BookerPHX']
['9', 'Trae YoungATL']
['10', "De'Aaron FoxSAC"]
['11', 'Jalen BrunsonNY']
['12', 'Jayson TatumBOS']
['13', 'Nikola JokicDEN']
['14', 'Tyrese MaxeyPHI']
['15', 'Anthony EdwardsMIN']
['16', 'Damian LillardMIL']
['17', 'LeBron JamesLAL']
['18', 'Anthony DavisLAL']
['19', 'Desmond BaneMEM']
['20', 'Kawhi LeonardLAC']
['21', 'Julius RandleNY']
['22', 'Lauri MarkkanenUTAH']
['23', 'Paolo BancheroORL']
['24', 'Tyrese HaliburtonIND']
['25', 'Paul GeorgeLAC']
['26', 'Karl-Anthony TownsMIN']
['27', 'Cade CunninghamDET']
['28', 'Jaylen BrownBOS']
['29', 'Zion WilliamsonNO']
['30', 'Jaren Jackson Jr.MEM']
['31', 'Pascal SiakamIND/TOR']
['32', 'DeMar DeRozanCHI']
['33', 'Brandon IngramNO']
['34', 'Kyle KuzmaWSH']
['35', 'Mikal BridgesBKN']
['36', 'Alperen SengunHOU']
['37', 'C

In [15]:
type(row)

list
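Because each row is a plain list, the rows can be collected instead of printed and then turned into a DataFrame. The sketch below uses a tiny hand-written table in place of the ESPN page. It assumes the header row uses `<th>` cells, which would be consistent with the empty first row printed above but is not confirmed by the output.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table with the same two columns as the scraped one.
html = """
<table>
  <tr><th>RK</th><th>Name</th></tr>
  <tr><td>1</td><td>Joel EmbiidPHI</td></tr>
  <tr><td>2</td><td>Luka DoncicDAL</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# Column names come from the header cells (assumed to be <th>).
header = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip rows that contain no <td> cells, such as the header
        rows.append(cells)

df = pd.DataFrame(rows, columns=header)
print(df)
```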

In [16]:
import pandas as pd
# read_html returns a list of DataFrames, one per <table> found on the page
data = pd.read_html("http://www.espn.com/nba/statistics/player/_/stat/assists/sort/avgAssists/")
for df in data:
    print(df)

    RK                        Name
0    1              Joel EmbiidPHI
1    2              Luka DoncicDAL
2    3    Giannis AntetokounmpoMIL
3    4  Shai Gilgeous-AlexanderOKC
4    5             Kevin DurantPHX
5    6         Donovan MitchellCLE
6    7             Stephen CurryGS
7    8             Devin BookerPHX
8    9               Trae YoungATL
9   10             De'Aaron FoxSAC
10  11             Jalen BrunsonNY
11  12             Jayson TatumBOS
12  13             Nikola JokicDEN
13  14             Tyrese MaxeyPHI
14  15          Anthony EdwardsMIN
15  16           Damian LillardMIL
16  17             LeBron JamesLAL
17  18            Anthony DavisLAL
18  19             Desmond BaneMEM
19  20            Kawhi LeonardLAC
20  21             Julius RandleNY
21  22         Lauri MarkkanenUTAH
22  23           Paolo BancheroORL
23  24        Tyrese HaliburtonIND
24  25              Paul GeorgeLAC
25  26       Karl-Anthony TownsMIN
26  27          Cade CunninghamDET
27  28             J
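In both outputs the player name and team abbreviation are fused into one string (for example `Joel EmbiidPHI`). One rough way to separate them is a regular expression that treats a trailing run of two to four capital letters, optionally repeated after a slash, as the team. This is a heuristic inferred only from the rows shown above, and `split_name_team` is a hypothetical helper. A name that itself ends in capitals (such as a suffix like "III") would be split incorrectly.

```python
import re

# Heuristic: the team is a trailing block of 2-4 capitals, possibly several
# blocks joined by "/" (e.g. "IND/TOR"); everything before it is the name.
PLAYER_TEAM = re.compile(r"^(?P<name>.*?)(?P<team>[A-Z]{2,4}(?:/[A-Z]{2,4})*)$")

def split_name_team(cell):
    """Split a fused 'NameTEAM' string into (name, team); team is None if no match."""
    match = PLAYER_TEAM.match(cell)
    if match is None:
        return cell, None
    return match.group("name").strip(), match.group("team")

for cell in ["Joel EmbiidPHI", "Stephen CurryGS", "Pascal SiakamIND/TOR"]:
    print(split_name_team(cell))
```

If the page's markup keeps the team in its own nested tag, reading that tag directly would be more reliable than pattern matching on the text.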