<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> 
<br>Scraping static websites<br>
Tiago Ventura </font></center></h1>

---

# Learning Goals

In today's class, we will focus on:

- Understanding different strategies to acquire digital data
- Understanding html structure to look up content on a website
- Scraping content from a static website
- Building a scraper to systematically draw content from similarly organized webpages.

# The Digital information age

We started our first lecture by looking at this graph. It shows two things:

- In the past few years, we have produced and stored an enormous amount of data.
- Most of this data is produced and stored in digital environments.

<div>
<img src="http://media3.washingtonpost.com/wp-dyn/content/graphic/2011/02/11/GR2011021100614.jpg" width="60%"/>
</div>


Not all of this data is available in digital spaces (like websites, social media apps, and digital archives), but some of it is. As data scientists, a primary skill expected from you is the ability to acquire, process, store, and analyze this data. Today, we will focus on **acquiring data in the digital information era.**

There are three primary techniques through which you can acquire digital data: 

- Scrape data from self-contained (static) websites
- Scrape data from dynamic (javascript powered) websites
- Access data through Application Programming Interfaces

## What is scraping? 

**Scraping** consists of automatically collecting data available on websites. In theory, you can collect website data by hand, or by asking a couple of friends to help you. However, in a world of abundant data, this is likely not feasible, and in general, collecting by hand becomes hard to justify once you have learned to collect data automatically.

Let me give you some **examples of websites** I have already scraped:

- Electoral data from many different countries;
- Composition of elites around the world;
- Wikipedia; 
- Toutiao, a news aggregator from China;
- Political Manifestos in Brazil 
- Fact-Checking News
- Facebook and Youtube Live Chats. 
- Property Prices from Zillow. 

Scraping can be summarized as:

- leveraging the structure of a website to **grab its contents**

- using a programming environment (such as R, Python, Java, etc.) to **systematically extract** that content.

- accomplishing the above in an "unobtrusive" and **legal** way.



## Scraping vs APIs


An API is a set of rules and protocols that allows software applications to communicate with each other. APIs provide a front door for a developer to interact with a website.

APIs are used for many different types of online communication and information sharing, among those, many **APIs have been developed to provide an easy and official way for developers and data scientists to access data**. 

As these APIs are developed by data owners, they are often secure, practical, and more organized than acquiring data through scraping.

Scraping is a back door for when there’s no API or when we need content beyond the structured fields the API returns.

**If you can use the API to access a dataset, that's where you will want to go.**

## Ethical Challenges with Scraping

Webscraping is legal **as long as the scraped data is publicly available and the scraping activity does not harm the website being scraped**. These are two hugely relevant conditions. For this reason, before we start coding, it is important to carefully understand what each entails.

Each call to a web server takes time, server cycles, and memory. Most servers can handle significant traffic, but they can't necessarily handle the strain induced by massive automated requests. Your code can overload the site, taking it offline, or causing the site administrator to ban your IP. See [Denial-of-service attack (DoS)](https://en.wikipedia.org/wiki/Denial-of-service_attack).

We do not want to compromise the functioning of a website just because of our research. First, this overload can crash a server and prevent other users from accessing the site. Second, servers and hosters can, and do, implement countermeasures (e.g. blocking access from our IP).

In addition, make it a best practice to collect only public information. Think about Facebook. In my personal view, it is okay to collect public posts or data from public groups. If you somehow manage to get into private groups, where members have an expectation of privacy, it is not okay to collect their data.

Here is a list of good practices for scraping:

- Respect robots.txt
- Don't hit servers too often
- Slow down your code to the speed a human would browse manually
- Find trusted source sites
- Do not scrape during peak hours
- Improve your code speed
- Use data responsibly (As academics often do)
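
The first item on this list can be checked from code. Below is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt rules, the `example.com` urls, and the `my-course-scraper` user-agent name are all made up for illustration, and the rules are parsed from a string so the example runs without touching any server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not copied from any real website)
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (made-up) user agent may fetch each path
allowed_public = parser.can_fetch("my-course-scraper", "https://example.com/news/story.html")
allowed_private = parser.can_fetch("my-course-scraper", "https://example.com/private/data.html")

# Read the delay (in seconds) the site asks crawlers to respect, if any
delay = parser.crawl_delay("my-course-scraper")

print(allowed_public, allowed_private, delay)
```

For a real site you would point the parser at the site's own robots.txt (with `set_url()` and `read()`) instead of a string, and use the answer to decide which pages to skip.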

## Scraping Routine

Scraping often involves the following routine: 

- **Step 1:** Find a website with information you want to collect
- **Step 2:** Understand the website
- **Step 3:** Write code to collect one realization of the data
- **Step 4:** Build a scraper -- generalize your code into a function.

And repeat!

## Step 1: Find a Website... but what is a website? 

A website in general is a combination of **HTML, CSS, XML, PHP, and Javascript**. We will care mostly about HTML and CSS.


### Static vs Dynamic Websites

HTML forms what we call **static websites**: everything you see is there in the source behind the website. Javascript produces **dynamic websites**: ones where you browse and click and the url doesn't change. These sites are typically powered by a database deep within the programming.

Today we will deal with static websites using the Python library `Beautiful Soup`. For dynamic websites, we will learn next class about working with `selenium` in Python. 

### HTML Website

HTML stands for **HyperText Markup Language**. As the name makes explicit, it is a markup language used to create web pages and is a cornerstone technology of the internet. It is not a programming language like Python, R, or Java. Web browsers read HTML documents and render them into visible or audible web pages.

See an example of an html file: 


```
<html>
<head>
  <title> Michael Cohen's Email </title>
  <script>
    var foot = bar;
  </script>
</head>
<body>
  <div id="payments">
    <h2>Second heading</h2>
    <p class='slick'>information about <br/><i>payments</i></p>
    <p>Just <a href="http://www.google.com">google it!</a></p>
  </div>
</body>
</html>
```

HTML code is structured using tags, and information is organized hierarchically (like a nested list) from top to bottom.

Some of the most important tags we will use for scraping are: 


- **p** – paragraphs
- **a href** – links
- **div** – divisions
- **h** – headings
- **table** – tables

See [here for more about html tags](https://betterprogramming.pub/understanding-html-basics-for-web-scraping-ae351ee0b3f9)

<div class="alert alert-block alert-danger" style="font-size: 20px;">
Scraping is all about finding tags and collecting the data associated with them
</div>

### What else exists on HTML beyond tags?

The tags are the target. The information we need from html usually comes from the text and the attributes of the tag. Very often your work will consist of finding the tag and then capturing the information you need. The figure below summarizes this difference well.

<div>
<img src="https://static.semrush.com/blog/uploads/media/59/fc/59fc528eecc00e43b1a3ed5d9b9933ee/4YA3vCJ_Hw6DucoVZ40FbKFRppAReJVOkLKHcZlDkO-9geydLO6tw9uzFJFZf5nam3QcT7p0hRdpFyL2uPhoDISD8CPZwfPE5GTqgpH53q9M99QWgDVhjgQrCMOlQI9fA1T2dCxJ5T2goCV3k1wo-Jc.webp" width="60%"/>
</div>

Source:[https://www.semrush.com/blog/html-anchor/](https://www.semrush.com/blog/html-anchor/)
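
To make the difference between tags, attributes, and text concrete, here is a small sketch that walks through one line of the html example above and records each piece separately. It uses only the standard library's `html.parser` so the three pieces are visible one by one; `PieceCollector` is a hypothetical helper name, and the rest of this lecture uses `BeautifulSoup` for the same job.

```python
from html.parser import HTMLParser

class PieceCollector(HTMLParser):
    """Hypothetical helper: records tags, attributes, and text as they appear."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.pieces.append(("tag", tag, dict(attrs)))

    def handle_data(self, data):
        # data is the text sitting between an opening and a closing tag
        if data.strip():
            self.pieces.append(("text", data.strip()))

snippet = '<p>Just <a href="http://www.google.com">google it!</a></p>'

collector = PieceCollector()
collector.feed(snippet)

for piece in collector.pieces:
    print(piece)
```

The `<a>` tag is the target, `href` is its attribute, and `google it!` is its text.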

## Step 2: Understand the website

As you might anticipate, a huge part of the scraping work is to understand your website and find the tags/information you are interested in. There are two ways to go about it:

- ### Inspect the website: `command` + `option` + `i` on a Mac (`ctrl` + `shift` + `i` on Windows or Linux), or select an element, right-click, and choose Inspect.

    - See [an example in practice](https://storage.googleapis.com/lds-media/documents/css_selector_vs_xpath.mp4) (*Source: Ultimate Guide to Web Scraping with Python by Brenda Marting*) 

<br>

- ### Use selector gadget: selector gadget is a tool that allows us to use CSS selectors to scrape websites.
    - See the [documentation](https://selectorgadget.com/) and a [tutorial here](https://www.youtube.com/watch?v=YdIWI6K64zo).

## Break : Install selector gadget

Here: https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb

## Step 3: Collect a realization of the data

To do webscraping, we will use two main libraries: 

- `requests`: to make a `get()` request and access the html behind the pages
- `BeautifulSoup`: to parse the html


See this [nice tutorial here](https://realpython.com/python-web-scraping-practical-introduction/) to understand the difference between parsing html with `BeautifulSoup` and using text mining methods to scrape a website.

In [1]:
# install libraries - Take the # out if this is the first time you are installing these packages. 
#!pip install requests
#!pip install beautifulsoup4

### Scraping: CNN Politics

Let's scrape our first website. We will start by scraping some news from CNN.

In [2]:
# setup
import pandas as pd
import requests # For downloading the website
from bs4 import BeautifulSoup # For parsing the website
import time # To put the system to sleep
import random # for random numbers

#### Get Request to collect html data

In [3]:
# Get access to the website
url = "https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc"
page = requests.get(url)

In [4]:
# check object type - object class requests
type(page)

requests.models.Response

In [5]:
# check if you got a connection
page.status_code # 200 == Connection

200
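
`200` is only one of many codes a server can return. Here is a small sketch that uses the standard library's `http.HTTPStatus` to translate a code into its standard phrase. `describe_status` is a hypothetical helper name, and the codes below are typed in by hand rather than coming from a live request.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code}: not a standard HTTP status code"
    return f"{code}: {status.phrase}"

# A few codes you are likely to meet while scraping
for code in [200, 403, 404, 429, 503]:
    print(describe_status(code))
```

In the cells above, `page.status_code` is the integer you would pass in. A `429` (Too Many Requests) is a strong hint that you should slow your code down.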

In [6]:
# See the content. 
# notice we downloaded the entire website.
# Do inspect to make sure of this in the web browser
page.content[0:1000]

b'  <!DOCTYPE html>\n<html lang="en" data-uri="cms.cnn.com/_pages/cmh7tyh9n004d27qk0cma6iti@published" data-layout-uri="cms.cnn.com/_layouts/layout-article-elevate/instances/politics-article-elevate_small-v1@published">\n  <head>\n<link rel="dns-prefetch" href="//tpc.googlesyndication.com">\n\n<link rel="preconnect" href="//tpc.googlesyndication.com">\n\n<link rel="dns-prefetch" href="//pagead2.googlesyndication.com">\n\n<link rel="preconnect" href="//pagead2.googlesyndication.com">\n\n<link rel="dns-prefetch" href="//www.googletagservices.com">\n\n<link rel="preconnect" href="//www.googletagservices.com">\n\n<link rel="dns-prefetch" href="//www.google.com">\n\n<link rel="preconnect" href="//www.google.com">\n\n<link rel="dns-prefetch" href="//c.amazon-adsystem.com">\n\n<link rel="preconnect" href="//c.amazon-adsystem.com">\n\n<link rel="dns-prefetch" href="//ib.adnxs.com">\n\n<link rel="preconnect" href="//ib.adnxs.com">\n\n<link rel="dns-prefetch" href="//cdn.adsafeprotected.com">\n\

#### Saving and Loading a HTML locally

After we make a request and retrieve a web page's content, we can store that content locally with Python's `open()` function. Saving the html source lets you avoid hitting the website multiple times.

In [7]:
# save html locally
with open("cnn_news1", 'wb') as f:
    f.write(page.content)

And here is how to open:

In [8]:
# open a locally saved html
with open("cnn_news1", 'rb') as f:
    html = f.read()
# see it
print(html[0:1000])

b'  <!DOCTYPE html>\n<html lang="en" data-uri="cms.cnn.com/_pages/cmh7tyh9n004d27qk0cma6iti@published" data-layout-uri="cms.cnn.com/_layouts/layout-article-elevate/instances/politics-article-elevate_small-v1@published">\n  <head>\n<link rel="dns-prefetch" href="//tpc.googlesyndication.com">\n\n<link rel="preconnect" href="//tpc.googlesyndication.com">\n\n<link rel="dns-prefetch" href="//pagead2.googlesyndication.com">\n\n<link rel="preconnect" href="//pagead2.googlesyndication.com">\n\n<link rel="dns-prefetch" href="//www.googletagservices.com">\n\n<link rel="preconnect" href="//www.googletagservices.com">\n\n<link rel="dns-prefetch" href="//www.google.com">\n\n<link rel="preconnect" href="//www.google.com">\n\n<link rel="dns-prefetch" href="//c.amazon-adsystem.com">\n\n<link rel="preconnect" href="//c.amazon-adsystem.com">\n\n<link rel="dns-prefetch" href="//ib.adnxs.com">\n\n<link rel="preconnect" href="//ib.adnxs.com">\n\n<link rel="dns-prefetch" href="//cdn.adsafeprotected.com">\n\
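
The save and load steps above can be combined into a simple cache: read the file if it exists, and only download when it does not. This is a sketch under assumptions. `load_or_fetch` is a hypothetical helper, and `fetch` stands for any function that returns the page as bytes (for example `lambda: requests.get(url).content`). Here a fake fetch function is used so the example runs offline.

```python
import tempfile
from pathlib import Path

def load_or_fetch(path, fetch):
    """Return the bytes stored at `path`; call `fetch()` and save its result only if the file is missing."""
    path = Path(path)
    if path.exists():
        return path.read_bytes()
    html = fetch()          # the only place where the website would be hit
    path.write_bytes(html)
    return html

# Stand-in for a real download; it counts how many times it is called
calls = []
def fake_fetch():
    calls.append(1)
    return b"<html><body>hello</body></html>"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "page.html"
    first = load_or_fetch(target, fake_fetch)    # file missing: fetch is called
    second = load_or_fetch(target, fake_fetch)   # file present: read from disk

print(len(calls))
```

However many times you rerun the cell, the server is contacted once per file.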

### Here comes BeautifulSoup

Next, you will create a `BeautifulSoup` object. A `BeautifulSoup` object is just a parser. It allows us to easily access elements from the raw html.

In [9]:
# create an bs object.
# input 1: request content; input 2: tell you need an html parser
soup = BeautifulSoup(page.content, 'html.parser') 

# Let's look at the raw code of the downloaded website
print(soup.prettify()[:1000]) 

<!DOCTYPE html>
<html data-layout-uri="cms.cnn.com/_layouts/layout-article-elevate/instances/politics-article-elevate_small-v1@published" data-uri="cms.cnn.com/_pages/cmh7tyh9n004d27qk0cma6iti@published" lang="en">
 <head>
  <link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//tpc.googlesyndication.com" rel="preconnect"/>
  <link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//pagead2.googlesyndication.com" rel="preconnect"/>
  <link href="//www.googletagservices.com" rel="dns-prefetch"/>
  <link href="//www.googletagservices.com" rel="preconnect"/>
  <link href="//www.google.com" rel="dns-prefetch"/>
  <link href="//www.google.com" rel="preconnect"/>
  <link href="//c.amazon-adsystem.com" rel="dns-prefetch"/>
  <link href="//c.amazon-adsystem.com" rel="preconnect"/>
  <link href="//ib.adnxs.com" rel="dns-prefetch"/>
  <link href="//ib.adnxs.com" rel="preconnect"/>
  <link href="//cdn.adsafeprotected.com" rel="dns-prefetch"/>
  <link

With the parser, we can start looking at the data. The functions we will use the most are:

- `.find_all()`: to find tags by their names
- `.select()`: to select tags by using the CSS selector 
- `.get_text()`: to access the text in the tag
- `["attr"]`: to access attributes of a tag
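
A minimal sketch of these four tools on a tiny, made-up page, small enough to check the results by eye. The html and the variable name `demo` are illustrative only and are not taken from the CNN page.

```python
from bs4 import BeautifulSoup

# A made-up page with three paragraphs, one of which is a footer
html = """
<html><body>
  <h1 id="headline">Example headline</h1>
  <p class="story">First paragraph.</p>
  <p class="story">Second paragraph with a <a href="/more">link</a>.</p>
  <p class="footer">Footer text.</p>
</body></html>
"""
demo = BeautifulSoup(html, "html.parser")

all_paragraphs = demo.find_all("p")           # every <p> tag, found by name
story_only = demo.select("p.story")           # only <p> tags with class "story"
first_text = story_only[0].get_text()         # the text between the tags
link_target = demo.find_all("a")[0]["href"]   # the value of the href attribute

print(len(all_paragraphs), len(story_only), first_text, link_target)
```

Note that `.find_all()` and `.select()` both return lists of tags, while `.get_text()` and `["attr"]` are applied to a single tag.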

Let's start by trying to grab all the text of the news story. Text usually sits under the `<p>` tag, for paragraph.

In [10]:
## find paragraph
cnn_par = soup.find_all('p')

In [11]:
cnn_par

[<p class="paragraph-elevate inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/cmh84863600033b6ne2m78hw0@published">
     </p>,
 <p class="paragraph-elevate inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/cmh8i7w0h001v3b6nhbmetwum@published">
             “While Donald Trump’s donor billionaires think they have the money to buy this election, we have a movement of the masses,” Mamdani said to a roaring crowd.
     </p>,
 <p class="paragraph-elevate inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/cmh8i7w0h001w3b6nzsewqmym@published">
             Mamdani took the stage at a raucous rally alongside Sen. Bernie Sanders and Rep. Ale

In [12]:
## let's see how many paragraphs we collected
len(cnn_par)

30

In [13]:
## let's print one
cnn_par[3]

<p class="paragraph-elevate inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/cmh8i7usv001t3b6nonqtrfrd@published">
            The rally was part closing argument, part rallying of the troops ahead of the November 4 election, with Mamdani casting the race as a choice between democracy and oligarchy and Sanders and Ocasio Cortez touting Mamdani’s campaign as the vanguard of a progressive movement itching to push back on the second Trump administration.
    </p>

As you can see, you just parsed the full tag for every paragraph of the text. Let's remove all html tags using the `.get_text()` method.

In [14]:
# get the text. 
# This is what is in between the tags <p> TEXT </p>
cnn_par[0].get_text()



In [15]:
# use our friend list comprehension to parse all
all_par = [par.get_text() for par in cnn_par]
all_par

 '\n            “While Donald Trump’s donor billionaires think they have the money to buy this election, we have a movement of the masses,” Mamdani said to a roaring crowd.\n    ',
 '\n            Mamdani took the stage at a raucous rally alongside Sen. Bernie Sanders and Rep. Alexandria Ocasio-Cortez at Forest Hills Stadium in Queens, where thousands of people chanted Mamdani’s name and repeated in unison his signature proposals to freeze the rent, make buses fast and free, and provide universal child care.\n    ',
 '\n            The rally was part closing argument, part rallying of the troops ahead of the November 4 election, with Mamdani casting the race as a choice between democracy and oligarchy and Sanders and Ocasio Cortez touting Mamdani’s campaign as the vanguard of a progressive movement itching to push back on the second Trump administration.\n    ',
 '\n            “I’m talking to you, Donald Trump,” Ocasio-Cortez declared, saying that “in nine short days we will work our 

Notice, nevertheless, that you collected some junk that is not the paragraph information you are looking for.

This happens because the `<p>` tag is used in multiple places on the page, not only for the body of the story. For example, if you look at the last element of the `all_par` list, you will see your scraper is collecting the footer of the webpage.

### Solution

**Be more specific. Work with a CSS selector.**

A CSS selector is a pattern used to select and style one or more elements in an HTML document. It lets you chain together tag names, classes, ids, and other attributes to pin down specific elements of an html file.

Another way to do this is with XPath, which can be super useful to learn, but is a bit more complicated for beginners.

See this tutorial [here](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors) to understand the concept of a css selector
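
Here is a short sketch of the most common selector patterns, run with `.select()` on a made-up snippet. The class names and ids below are invented for illustration and do not come from any real website.

```python
from bs4 import BeautifulSoup

# A made-up page: one article block and one footer block
html = """
<div class="article">
  <h1 id="maintitle">A headline</h1>
  <p class="body-text">Story paragraph one.</p>
  <p class="body-text">Story paragraph two.</p>
</div>
<div class="footer">
  <p>Terms of use.</p>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

by_tag = demo.select("p")               # tag name: all three paragraphs
by_class = demo.select(".body-text")    # .class: only the story paragraphs
by_id = demo.select("#maintitle")       # #id: the single headline
nested = demo.select("div.footer p")    # descendant: <p> inside <div class="footer">

print(len(by_tag), len(by_class), len(by_id), nested[0].get_text())
```

Selecting by class drops the footer paragraph that selecting by tag name picks up, which is exactly the junk problem described above.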

Let's use the selector gadget tool ([see tutorial](https://www.youtube.com/watch?v=YdIWI6K64zo)) to get a css selector for all the paragraphs. 

In [16]:
# open your web browser and use the selector gadget
# website: https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc

In [17]:
# Use a css selector to target specific content
cnn_par = soup.select(".vossi-paragraph")

In [18]:
cnn_par[-1]

<p class="paragraph-elevate inline-placeholder vossi-paragraph" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/cmh8gcpa4000l3b6nrhgkn2n1@published">
<em>CNN’s Ethan Cohen contributed to this report. </em>
</p>

In [19]:
story_content = [i.get_text() for i in cnn_par]
story_content 

 '\n            “While Donald Trump’s donor billionaires think they have the money to buy this election, we have a movement of the masses,” Mamdani said to a roaring crowd.\n    ',
 '\n            Mamdani took the stage at a raucous rally alongside Sen. Bernie Sanders and Rep. Alexandria Ocasio-Cortez at Forest Hills Stadium in Queens, where thousands of people chanted Mamdani’s name and repeated in unison his signature proposals to freeze the rent, make buses fast and free, and provide universal child care.\n    ',
 '\n            The rally was part closing argument, part rallying of the troops ahead of the November 4 election, with Mamdani casting the race as a choice between democracy and oligarchy and Sanders and Ocasio Cortez touting Mamdani’s campaign as the vanguard of a progressive movement itching to push back on the second Trump administration.\n    ',
 '\n            “I’m talking to you, Donald Trump,” Ocasio-Cortez declared, saying that “in nine short days we will work our 

In [20]:
story_content[0].strip()



In [21]:
# Clean and join together with string methods
story_text = "\n".join([i.strip() for i in story_content])
print(story_text)

“While Donald Trump’s donor billionaires think they have the money to buy this election, we have a movement of the masses,” Mamdani said to a roaring crowd.
Mamdani took the stage at a raucous rally alongside Sen. Bernie Sanders and Rep. Alexandria Ocasio-Cortez at Forest Hills Stadium in Queens, where thousands of people chanted Mamdani’s name and repeated in unison his signature proposals to freeze the rent, make buses fast and free, and provide universal child care.
The rally was part closing argument, part rallying of the troops ahead of the November 4 election, with Mamdani casting the race as a choice between democracy and oligarchy and Sanders and Ocasio Cortez touting Mamdani’s campaign as the vanguard of a progressive movement itching to push back on the second Trump administration.
“I’m talking to you, Donald Trump,” Ocasio-Cortez declared, saying that “in nine short days we will work our hearts out to elect Zohran Kwame Mamdani as the next mayor of the great city of New York

### What else can we collect from this news? 

- title
- author
- date

Let's do it. 

In [22]:
# title
css_loc = "#maincontent"
story_title = soup.select(css_loc)
story_title[0]

<h1 class="headline__text inline-placeholder vossi-headline-text" data-editable="headlineText" id="maincontent">
      Mamdani rallies with Sanders and Ocasio-Cortez as Democrats close ranks around NYC mayoral nominee
    </h1>

In [23]:
story_title = story_title[0].get_text()
print(story_title)


      Mamdani rallies with Sanders and Ocasio-Cortez as Democrats close ranks around NYC mayoral nominee
    


In [24]:
# story date
story_date = soup.select(".vossi-timestamp")[0].get_text()
print(story_date)





Updated Oct 26, 2025, 10:33 PM ET
Published Oct 26, 2025, 12:03 PM ET





Updated Oct 26, 2025, 10:33 PM ET
PUBLISHED Oct 26, 2025, 12:03 PM ET




In [25]:
# story authors
story_author = soup.select(".byline__name")[0].get_text()
print(story_author)

Gloria Pazmino


In [26]:
# let's nest all in a list
entry = [url, story_title.strip(),story_date.strip(),story_text]
entry

['https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc',
 'Mamdani rallies with Sanders and Ocasio-Cortez as Democrats close ranks around NYC mayoral nominee',
 'Updated Oct 26, 2025, 10:33 PM ET\nPublished Oct 26, 2025, 12:03 PM ET\n\n\n\n\n\nUpdated Oct 26, 2025, 10:33 PM ET\nPUBLISHED Oct 26, 2025, 12:03 PM ET',
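
The `story_date` string above still carries both stamps, each of them twice. One possible way to clean it is sketched below with a regular expression and `datetime.strptime`. `parse_cnn_timestamp` is a hypothetical helper, it assumes three-letter English month abbreviations as in the output above (and an English locale for `strptime`), and it drops the `ET` time zone label rather than converting it.

```python
import re
from datetime import datetime

def parse_cnn_timestamp(raw, label="Published"):
    """Pull the first '<label> Mon DD, YYYY, H:MM AM/PM' stamp out of `raw`.

    Returns a naive datetime (the trailing 'ET' is ignored) or None if nothing matches.
    """
    pattern = rf"{label}\s+([A-Za-z]{{3}} \d{{1,2}}, \d{{4}}, \d{{1,2}}:\d{{2}} [AP]M)"
    match = re.search(pattern, raw, flags=re.IGNORECASE)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%b %d, %Y, %I:%M %p")

# The stripped story_date string from the entry above
raw_date = ("Updated Oct 26, 2025, 10:33 PM ET\nPublished Oct 26, 2025, 12:03 PM ET"
            "\n\n\n\n\nUpdated Oct 26, 2025, 10:33 PM ET\nPUBLISHED Oct 26, 2025, 12:03 PM ET")

published = parse_cnn_timestamp(raw_date)
updated = parse_cnn_timestamp(raw_date, label="Updated")
print(published, updated)
```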

## Practice with Latin News Website

Your task: Look at this website here: https://www.latinnews.com/latinnews-country-database.html?country=2156

Do the following tasks:

- Select a single link for you to explore. 
- Write code to scrape the link. 
- Collect the following information:
    - text of the news
    - title of the news
    - url associated with the news
    - date of the post
- Return all of them as a pandas dataframe.     

In [27]:
# write your solution here

## Step 4: Build a scraper

After you have your scraper working for a single case, you will generalize your work. As we learned before, we do this by creating a function (or a full class with different methods).

Notice that here it is also important to add, inside our functions, the good practices that try to imitate human behavior.

Let's start with a function to scrape news from CNN.

In [28]:
# Building a scraper
# The idea here is to just wrap the above in a function.
# Input: url
# Output: relevant content

def cnn_scraper(url=None):
    '''
    Scrape relevant content from a CNN article.
    input: str, url of a CNN article
    output: dict with url, title, date and text; None if the request fails
    '''

    # Get access to the website
    page = requests.get(url)

    # check if you got a connection; stop early otherwise
    if page.status_code != 200:
        return None

    # create a BeautifulSoup object.
    # input 1: request content; input 2: tell you need an html parser
    soup = BeautifulSoup(page.content, 'html.parser')

    # parse text
    cnn_par = soup.select(".vossi-paragraph")
    story_content = [i.get_text() for i in cnn_par]
    story_text = "\n".join([i.strip() for i in story_content])

    # parse title
    css_loc = "#maincontent"
    story_title = soup.select(css_loc)[0].get_text()

    # story date
    story_date = soup.select(".vossi-timestamp")[0].get_text()

    # story authors (collected here, but not stored in the entry below)
    story_author = soup.select(".byline__name")[0].get_text()

    # let's nest all in a dictionary
    entry = {"url":url, "story_title":story_title.strip(),
              "story_date":story_date.strip(),"text":story_text}

    # return
    return entry
   

Let's see if our function works:

In [29]:
# Test on the same case
url = "https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc"
scrap_news = cnn_scraper(url=url)
scrap_news

{'url': 'https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc',
 'story_title': 'Mamdani rallies with Sanders and Ocasio-Cortez as Democrats close ranks around NYC mayoral nominee',
 'story_date': 'Updated Oct 26, 2025, 10:33 PM ET\nPublished Oct 26, 2025, 12:03 PM ET\n\n\n\n\n\nUpdated Oct 26, 2025, 10:33 PM ET\nPUBLISHED Oct 26, 2025, 12:03 PM ET',
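
One fragile point in `cnn_scraper` is the `[0]` indexing: if a page has no byline or no timestamp, `soup.select(...)[0]` raises an `IndexError` and the whole call fails. A possible guard is sketched below with `.select_one()`, which returns `None` when nothing matches. `first_text` is a hypothetical helper, and the one-line html is made up to mimic a page without a byline.

```python
from bs4 import BeautifulSoup

def first_text(soup, selector):
    """Return the stripped text of the first match for `selector`, or None if nothing matches."""
    tag = soup.select_one(selector)
    if tag is None:
        return None
    return tag.get_text().strip()

# A made-up page that has a headline but no byline
demo = BeautifulSoup('<h1 id="maincontent">  A headline  </h1>', "html.parser")

title = first_text(demo, "#maincontent")     # found: text comes back stripped
author = first_text(demo, ".byline__name")   # missing: None instead of an error

print(title, author)
```

Inside the scraper, each `soup.select(...)[0].get_text()` line could be swapped for a call like this, so a missing field becomes a missing value in the final data frame instead of a crash.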

### beautiful!

Let's now assume you actually have a list of urls. We will iterate our scraper through this list.



In [30]:
# create a list of urls
urls = ["https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc", 
        "https://www.cnn.com/2025/10/26/politics/jessica-tisch-zohran-mamdani-election",
       "https://www.cnn.com/2025/10/24/politics/fact-check-ballroom-press-secretary", 
       "https://www.cnn.com/2025/09/26/politics/cnn-independents-poll-methodology"]

In [31]:
# Then just loop through and collect
scraped_data = []

for url in urls:

    # Scrape the content
    scraped_data.append(cnn_scraper(url))

    # Put the system to sleep for a random draw of time (be kind)
    time.sleep(random.uniform(.5,1))
    
    print(url)

https://www.cnn.com/2025/10/26/politics/mamdani-sanders-aoc-rally-nyc
https://www.cnn.com/2025/10/26/politics/jessica-tisch-zohran-mamdani-election
https://www.cnn.com/2025/10/24/politics/fact-check-ballroom-press-secretary
https://www.cnn.com/2025/09/26/politics/cnn-independents-poll-methodology
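
The loop above assumes every url works. With a longer list, some pages will fail, and one failure should not cost you the pages already collected. Below is a sketch of a more defensive loop. `scrape_all` is a hypothetical helper that takes the scraper as an argument, and `fake_scraper` with its `example.com` urls is a stand-in so the example runs offline (with the waiting time set to zero).

```python
import random
import time

def scrape_all(urls, scraper, min_wait=0.5, max_wait=1.0):
    """Apply `scraper` to each url, pausing between requests and skipping failures."""
    results, failed = [], []
    for url in urls:
        try:
            entry = scraper(url)
        except Exception as err:      # e.g. a selector that matched nothing
            print(f"failed: {url} ({err})")
            entry = None
        if entry is None:
            failed.append(url)        # keep track so you can retry later
        else:
            results.append(entry)
        time.sleep(random.uniform(min_wait, max_wait))   # be kind to the server
    return results, failed

# Stand-in scraper so the sketch runs offline; swap in cnn_scraper for real use
def fake_scraper(url):
    if "broken" in url:
        raise IndexError("list index out of range")
    return {"url": url, "text": "some text"}

results, failed = scrape_all(
    ["https://example.com/a", "https://example.com/broken"],
    fake_scraper, min_wait=0, max_wait=0,
)
print(len(results), failed)
```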


In [32]:
# Look at the data object
scraped_data[3]

{'url': 'https://www.cnn.com/2025/09/26/politics/cnn-independents-poll-methodology',
 'story_title': 'How CNN conducted its deep dive into independent US adults',
 'story_date': 'Updated Sep 26, 2025, 5:00 AM ET\nPublished Sep 26, 2025, 5:00 AM ET\n\n\n\n\n\nPUBLISHED Sep 26, 2025, 5:00 AM ET',
 'text': 'The CNN poll conducted by SSRS was designed to offer a deep-dive look at the different types of independents that make up this critical group in American politics.\nThe survey was conducted both online and by telephone from August 21 through September 1. The 2,077 adults who took the poll were selected from two sample sources – the SSRS Opinion Panel and a list of registered voters, including some who had previously taken a CNN survey conducted in April. The survey included 1,006 political independents, defined as people who said they were independents or did not identify with either major political party. The sample was designed to include a larger than usual number of political indep

In [33]:
# Organize as a pandas data frame
dat = pd.DataFrame(scraped_data)
dat.head()

| | url | story_title | story_date | text |
|---|---|---|---|---|
| 0 | https://www.cnn.com/2025/10/26/politics/mamdan... | Mamdani rallies with Sanders and Ocasio-Cortez... | Updated Oct 26, 2025, 10:33 PM ET\nPublished O... | As New Yorkers cast their ballots in the city’... |
| 1 | https://www.cnn.com/2025/10/26/politics/jessic... | Mamdani takes a risk, courts NYPD Commissioner... | Updated Oct 26, 2025, 1:56 PM ET\nPublished Oc... | One of the most high-profile job interviews in... |
| 2 | https://www.cnn.com/2025/10/24/politics/fact-c... | Fact check: Democratic leaders misleadingly sn... | Updated Oct 24, 2025, 2:08 PM ET\nPublished Oc... | Top Democrats in the Senate and House of Repre... |
| 3 | https://www.cnn.com/2025/09/26/politics/cnn-in... | How CNN conducted its deep dive into independe... | Updated Sep 26, 2025, 5:00 AM ET\nPublished Se... | The CNN poll conducted by SSRS was designed to... |


This completes all the steps of scraping: 
    
- Step 1: Find a website with information you want to collect
- Step 2: Understand the website
- Step 3: Write code to collect one realization of the data
- Step 4: Build a scraper -- generalize your code into a function.
- Step 5: Save


### Collecting multiple urls

It is unlikely you will ever have a complete list of urls you want to scrape. Most likely, collecting the full list of sources will be a step in your scraping task. Remember, urls usually come embedded as tag attributes. So let's write code to collect multiple urls from the CNN website, following all our pre-determined steps.

In [34]:
# Step 1: Find a website with information you want to collect
## let's get links on cnn politics
url = "https://www.cnn.com/politics"

In [35]:
url

'https://www.cnn.com/politics'

In [36]:
# Step 2: Understand the website
# links are embedded across multiple titles.
# these titles sit under the following css selector: .container__headline span

In [37]:
# Step 3: Write code to collect one realization of the data
# Get access to the website
page = requests.get(url)
# create a BeautifulSoup object.
soup = BeautifulSoup(page.content, 'html.parser') # input 1: request content; input 2: tell you need an html parser

In [38]:
# Step 3: Write code to collect one realization of the data
# with a css selector
links = soup.select(".container_lead-plus-headlines__item--type-section")
#links = soup.select(".container_lead-plus-headlines__headline")
links

[<li class="card container__item container__item--type-media-image container__item--type-section container_lead-plus-headlines__item container_lead-plus-headlines__item--type-section container_lead-plus-headlines__selected" data-component-name="card" data-created-updated-by="true" data-open-link="/2025/10/21/politics/va-therapists-treatment-sessions-limited" data-page="cms.cnn.com/_pages/cmgzg04d1004326p2g8ur5dmw@published" data-unselectable="true" data-uri="cms.cnn.com/_components/card/instances/clbdmol44002m3d6euos0g5zg_fill_1@published" data-word-count="1741">
 <a class="container__link container__link--type-article container_lead-plus-headlines__link" data-link-type="article" href="/2025/10/21/politics/va-therapists-treatment-sessions-limited">
 <div class="container__item-media-wrapper container_lead-plus-headlines__item-media-wrapper" data-breakpoints='{"card--media-large": 525, "card--media-extra-large": 660, "card--media-card-label-show": 200}'>
 <div class="container__item-med

In [39]:
links[0]

<li class="card container__item container__item--type-media-image container__item--type-section container_lead-plus-headlines__item container_lead-plus-headlines__item--type-section container_lead-plus-headlines__selected" data-component-name="card" data-created-updated-by="true" data-open-link="/2025/10/21/politics/va-therapists-treatment-sessions-limited" data-page="cms.cnn.com/_pages/cmgzg04d1004326p2g8ur5dmw@published" data-unselectable="true" data-uri="cms.cnn.com/_components/card/instances/clbdmol44002m3d6euos0g5zg_fill_1@published" data-word-count="1741">
<a class="container__link container__link--type-article container_lead-plus-headlines__link" data-link-type="article" href="/2025/10/21/politics/va-therapists-treatment-sessions-limited">
<div class="container__item-media-wrapper container_lead-plus-headlines__item-media-wrapper" data-breakpoints='{"card--media-large": 525, "card--media-extra-large": 660, "card--media-card-label-show": 200}'>
<div class="container__item-media c

In [40]:
links[0].attrs["data-open-link"]

'/2025/10/21/politics/va-therapists-treatment-sessions-limited'

In [41]:
# grab links
links_from_cnn = []

# iterate
for link in links:
    links_from_cnn.append(link["data-open-link"])
    
# print
print(links_from_cnn)

['/2025/10/21/politics/va-therapists-treatment-sessions-limited', '/2025/10/27/politics/joe-biden-kennedy-award', '/2025/10/27/politics/donald-trump-mri-health-walter-reed', '/2025/10/27/politics/mike-braun-indiana-redistricting', '/2025/10/27/politics/nuclear-weapons-shutdown-delay-trump-nnsa', '/2025/10/27/politics/government-shutdown-snap-food-stamps', '/2025/10/27/politics/sean-duffy-nasa-transportation', '/2025/10/26/politics/navy-aircraft-crash-south-china-sea', '/2025/10/27/politics/trump-asia-latin-america', '/2025/10/25/politics/national-guard-troops-cities-grover-cleveland-explainer', '/2025/10/25/politics/east-wing-demolition-white-house-trump-analysis', '/2025/10/23/politics/trump-potential-230-million-doj-payment-analysis', '/2025/10/21/politics/white-house-renovations-history', '/2025/10/22/politics/trump-israel-gaza-vance-ukraine-russia-putin-analysis', '/2025/10/21/politics/government-shutdown-congress-federal-workers-analysis', '/2025/10/20/politics/trump-no-kings-prot
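
The loop above indexes each tag with `link["data-open-link"]`, which raises a `KeyError` if any matched tag lacks that attribute. Below is a minimal sketch of a more forgiving version, run on toy HTML rather than the real CNN page. The markup and the names `soup_demo`, `cards`, and `open_links` are assumptions made up for illustration.

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the CNN page (hypothetical markup):
# the second <li> has no data-open-link attribute
html = """
<ul>
  <li class="card" data-open-link="/2025/10/21/politics/story-a">A</li>
  <li class="card">B</li>
  <li class="card" data-open-link="/2025/10/27/politics/story-b">C</li>
</ul>
"""
soup_demo = BeautifulSoup(html, "html.parser")
cards = soup_demo.find_all("li", class_="card")

# tag.get() returns None when the attribute is absent, instead of raising KeyError
open_links = [card.get("data-open-link") for card in cards]

# drop the tags that had no link
open_links = [link for link in open_links if link is not None]
print(open_links)
```

Whether you need this depends on how `links` was selected: if your selector already guarantees the attribute is present, plain indexing is fine.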

In [42]:
## another way to do this is to use the href attribute of each <a> tag
links_from_cnn = []

# Collect the href of every <a> tag (no filtering or de-duplication yet)
for tag in soup.find_all("a"):
    href = tag.attrs.get("href")
    links_from_cnn.append(href)

# much more extensive set of links
print(links_from_cnn)

['https://www.cnn.com', 'https://www.cnn.com/politics', 'https://www.cnn.com/politics/president-donald-trump-47', 'https://www.cnn.com/politics/fact-check', 'https://www.cnn.com/polling', 'https://www.cnn.com/election/2025', None, 'https://www.cnn.com/politics/president-donald-trump-47', 'https://www.cnn.com/politics/fact-check', 'https://www.cnn.com/polling', 'https://www.cnn.com/election/2025', 'https://www.cnn.com/video', 'https://www.cnn.com/audio', 'https://www.cnn.com/live-tv', '/account/settings', '/newsletters', '/follow?iid=fw_var-nav', '#', '#', '/account/settings', '/newsletters', '/follow?iid=fw_var-nav', '#', '#', 'https://www.cnn.com/live-tv', 'https://www.cnn.com/audio', 'https://www.cnn.com/video', 'https://us.cnn.com?hpt=header_edition-picker', 'https://edition.cnn.com?hpt=header_edition-picker', 'https://arabic.cnn.com?hpt=header_edition-picker', 'https://cnnespanol.cnn.com/?hpt=header_edition-picker', 'https://us.cnn.com?hpt=header_edition-picker', 'https://edition.c

In [43]:
import re
## clean the output.
# Keep only story links, whose paths start with "/" followed by four digits (the year)
# and prepend the base url
links_from_cnn_reduced = ["https://www.cnn.com" + l for l in links_from_cnn if re.match(r'^/(\d{4})', str(l))]
links_from_cnn_reduced

['https://www.cnn.com/2025/10/21/politics/va-therapists-treatment-sessions-limited',
 'https://www.cnn.com/2025/10/21/politics/va-therapists-treatment-sessions-limited',
 'https://www.cnn.com/2025/10/27/politics/joe-biden-kennedy-award',
 'https://www.cnn.com/2025/10/27/politics/donald-trump-mri-health-walter-reed',
 'https://www.cnn.com/2025/10/27/politics/mike-braun-indiana-redistricting',
 'https://www.cnn.com/2025/10/27/politics/nuclear-weapons-shutdown-delay-trump-nnsa',
 'https://www.cnn.com/2025/10/27/politics/government-shutdown-snap-food-stamps',
 'https://www.cnn.com/2025/10/27/politics/sean-duffy-nasa-transportation',
 'https://www.cnn.com/2025/10/26/politics/navy-aircraft-crash-south-china-sea',
 'https://www.cnn.com/2025/10/27/politics/trump-asia-latin-america',
 'https://www.cnn.com/2025/10/27/politics/trump-asia-latin-america',
 'https://www.cnn.com/2025/10/25/politics/national-guard-troops-cities-grover-cleveland-explainer',
 'https://www.cnn.com/2025/10/25/politics/eas
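
Notice that the cleaned list still contains repeated stories, because the same article can be linked from more than one `<a>` tag. Here is a minimal sketch of one way to drop the repeats while keeping the original order. The helper name `clean_links` and the sample `hrefs` are hypothetical; the regular expression is the same one used above.

```python
import re
from urllib.parse import urljoin

def clean_links(hrefs, base_url="https://www.cnn.com"):
    """Keep story paths (start with '/' plus four digits), make them absolute, drop repeats."""
    pattern = re.compile(r"^/\d{4}")
    # skip None and anything that does not look like a dated story path
    stories = [urljoin(base_url, h) for h in hrefs if h and pattern.match(h)]
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(stories))

# hypothetical sample mimicking the raw href list
hrefs = [
    "/2025/10/21/politics/story-a",
    None,
    "#",
    "/2025/10/21/politics/story-a",
    "https://www.cnn.com/video",
    "/2025/10/27/politics/story-b",
]
unique_links = clean_links(hrefs)
print(unique_links)
```

Using `set()` would also remove the repeats, but it does not preserve the order in which the stories appear on the page.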

In [44]:
## Step 4: Build a scraper -- generalize your code into a function.

# Let's write the above as a single function
def collect_links_cnn(url=None):
    """Collect story links from a single CNN page.

    Args:
        url (str): URL of a valid CNN page from which to collect links.
    Returns:
        list: full story URLs (strings); duplicates are not removed.
    """
    
    # Get access to the website
    page = requests.get(url)
    # create a bs object.
    soup = BeautifulSoup(page.content, 'html.parser') # input 1: request content; input 2: tell it you need an html parser

    # Collect the href of every <a> tag
    links_from_cnn = []
    for tag in soup.find_all("a"):
        href = tag.attrs.get("href")
        links_from_cnn.append(href)
        
    ## clean the output.
    # Keep only story links, whose paths start with "/" followed by four digits (the year)
    # and prepend the base url
    links_from_cnn_reduced = ["https://www.cnn.com" + l for l in links_from_cnn if re.match(r'^/(\d{4})', str(l))]

    return links_from_cnn_reduced

In [45]:
links_cnn = collect_links_cnn("https://www.cnn.com/politics")
links_cnn[:10]

['https://www.cnn.com/2025/10/21/politics/va-therapists-treatment-sessions-limited',
 'https://www.cnn.com/2025/10/21/politics/va-therapists-treatment-sessions-limited',
 'https://www.cnn.com/2025/10/27/politics/joe-biden-kennedy-award',
 'https://www.cnn.com/2025/10/27/politics/donald-trump-mri-health-walter-reed',
 'https://www.cnn.com/2025/10/27/politics/mike-braun-indiana-redistricting',
 'https://www.cnn.com/2025/10/27/politics/nuclear-weapons-shutdown-delay-trump-nnsa',
 'https://www.cnn.com/2025/10/27/politics/government-shutdown-snap-food-stamps',
 'https://www.cnn.com/2025/10/27/politics/sean-duffy-nasa-transportation',
 'https://www.cnn.com/2025/10/26/politics/navy-aircraft-crash-south-china-sea',
 'https://www.cnn.com/2025/10/27/politics/trump-asia-latin-america']

With this list, you can apply your scraper function to multiple links:

In [46]:
len(links_cnn)

66

In [48]:
# let's get some cases
links_cnn_ = links_cnn[10:15]

# Then just loop through and collect
scraped_data = []

for url in links_cnn_:

    # check what is going on
    print(url)
    
    # Scrape the content
    scraped_data.append(cnn_scraper(url))

    # Put the system to sleep for a random draw of time (be kind)
    time.sleep(random.uniform(.5,3))

# Organize as a pandas data frame
dat = pd.DataFrame(scraped_data)
dat.head()

https://www.cnn.com/2025/10/27/politics/trump-asia-latin-america
https://www.cnn.com/2025/10/25/politics/national-guard-troops-cities-grover-cleveland-explainer
https://www.cnn.com/2025/10/25/politics/east-wing-demolition-white-house-trump-analysis
https://www.cnn.com/2025/10/23/politics/trump-potential-230-million-doj-payment-analysis
https://www.cnn.com/2025/10/21/politics/white-house-renovations-history


| | url | story_title | story_date | text |
|---|---|---|---|---|
| 0 | https://www.cnn.com/2025/10/27/politics/trump-... | MAGA hero hands big win to Trump’s Latin Ameri... | Updated Oct 27, 2025, 9:20 AM ET\nPublished Oc... | It doesn’t take America’s most lethal aircraft... |
| 1 | https://www.cnn.com/2025/10/25/politics/nation... | Why one of Trump’s favorite presidents sent tr... | Updated Oct 25, 2025, 7:00 AM ET\nPublished Oc... | If there’s one former president whom Donald Tr... |
| 2 | https://www.cnn.com/2025/10/25/politics/east-w... | The East Wing demolition speaks to Trump’s wre... | Updated Oct 25, 2025, 4:30 AM ET\nPublished Oc... | Everyone says the same thing the first time th... |
| 3 | https://www.cnn.com/2025/10/23/politics/trump-... | Trump’s potential $230 million DOJ payment wou... | Updated Oct 23, 2025, 12:34 AM ET\nPublished O... | The idea of a president convincing his own Jus... |
| 4 | https://www.cnn.com/2025/10/21/politics/white-... | The new East Wing: Caesars Palace meets the Pa... | Updated Oct 22, 2025, 7:56 AM ET\nPublished Oc... | The ballroom President Donald Trump is demolis... |


In [78]:
# add a URL from another site, which cnn_scraper cannot parse, to trigger an error
links_cnn_ = links_cnn[10:15]
links_cnn_.append("https://www.latinnews.com/component/k2/item/107755.html?period=2025&archive=3&Itemid=6&cat_id=837700:in-brief-brazil-s-current-account-deficit-widens-in-september")

# run the loop in a safe setup: try/except records failures instead of stopping
# Then just loop through and collect
scraped_data = []
list_of_errors = []

for url in links_cnn_:

    # check what is going on
    print(url)
    
    # Scrape the content
    try: 
        scraped_data.append(cnn_scraper(url))

        # Put the system to sleep for a random draw of time (be kind)
        time.sleep(random.uniform(.5,3))
    except Exception as e:
        list_of_errors.append([url, e])

https://www.cnn.com/2025/10/27/politics/trump-asia-latin-america
https://www.cnn.com/2025/10/25/politics/national-guard-troops-cities-grover-cleveland-explainer
https://www.cnn.com/2025/10/25/politics/east-wing-demolition-white-house-trump-analysis
https://www.cnn.com/2025/10/23/politics/trump-potential-230-million-doj-payment-analysis
https://www.cnn.com/2025/10/21/politics/white-house-renovations-history
https://www.latinnews.com/component/k2/item/107755.html?period=2025&archive=3&Itemid=6&cat_id=837700:in-brief-brazil-s-current-account-deficit-widens-in-september


In [81]:
dat = pd.DataFrame(scraped_data)
dat.head()

| | url | story_title | story_date | text |
|---|---|---|---|---|
| 0 | https://www.cnn.com/2025/10/27/politics/trump-... | MAGA hero hands big win to Trump’s Latin Ameri... | Updated Oct 27, 2025, 9:20 AM ET\nPublished Oc... | It doesn’t take America’s most lethal aircraft... |
| 1 | https://www.cnn.com/2025/10/25/politics/nation... | Why one of Trump’s favorite presidents sent tr... | Updated Oct 25, 2025, 7:00 AM ET\nPublished Oc... | If there’s one former president whom Donald Tr... |
| 2 | https://www.cnn.com/2025/10/25/politics/east-w... | The East Wing demolition speaks to Trump’s wre... | Updated Oct 25, 2025, 4:30 AM ET\nPublished Oc... | Everyone says the same thing the first time th... |
| 3 | https://www.cnn.com/2025/10/23/politics/trump-... | Trump’s potential $230 million DOJ payment wou... | Updated Oct 23, 2025, 12:34 AM ET\nPublished O... | The idea of a president convincing his own Jus... |
| 4 | https://www.cnn.com/2025/10/21/politics/white-... | The new East Wing: Caesars Palace meets the Pa... | Updated Oct 22, 2025, 7:56 AM ET\nPublished Oc... | The ballroom President Donald Trump is demolis... |


In [83]:
list_of_errors

[['https://www.latinnews.com/component/k2/item/107755.html?period=2025&archive=3&Itemid=6&cat_id=837700:in-brief-brazil-s-current-account-deficit-widens-in-september',
  IndexError('list index out of range')]]
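
The try/except pattern above can be wrapped in a small reusable helper. This is a minimal sketch, not the only way to do it: the names `scrape_many` and `fake_scraper` are hypothetical, and `fake_scraper` is a stand-in for `cnn_scraper` so the example runs without touching the network. Unlike the loop above, this version also sleeps after a failed request, which is one possible design choice.

```python
import random
import time

def scrape_many(urls, scraper, sleep_range=(0.5, 3)):
    """Apply `scraper` to each URL; return (results, errors) instead of stopping on failure."""
    results, errors = [], []
    for url in urls:
        try:
            results.append(scraper(url))
        except Exception as e:
            # store the failure in a form that is easy to turn into a data frame later
            errors.append({"url": url, "error_type": type(e).__name__, "message": str(e)})
        finally:
            # be kind to the server, whether or not the request worked
            time.sleep(random.uniform(*sleep_range))
    return results, errors

# Hypothetical stand-in for cnn_scraper: fails on any non-CNN URL
def fake_scraper(url):
    if "cnn.com" not in url:
        raise IndexError("list index out of range")
    return {"url": url, "story_title": "placeholder"}

results, errors = scrape_many(
    ["https://www.cnn.com/a", "https://example.org/b"],
    fake_scraper,
    sleep_range=(0, 0),  # no waiting in this toy run
)
print(errors)
```

With the real scraper you would call something like `scrape_many(links_cnn_, cnn_scraper)` and then inspect `errors` to decide which URLs to retry or drop.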

## Practice II with Latin News Website

Now, return to https://www.latinnews.com/latinnews-country-database.html?country=2156

Do the following: 

- Build a function encapsulating your code from Practice I
- Build another function to collect all links for Brazil from Latin News
- Scrape all news for a single country using these two functions

In [None]:
# setup
import pandas as pd
import requests # For downloading the website
from bs4 import BeautifulSoup # For parsing the website
import time # To put the system to sleep
import random # for random numbers
import re # regular expressions

