
# **Class 2: Collecting and Cleaning Data**

**SOC 2070 - Text as Data for Social Science Research**
Tiziano Rotesi, Brown University 

This notebook provides an overview of methods for collecting data from the internet, building a dataset, and performing the preliminary tasks that prepare the dataset for analysis.


## Web Scraping

- **Opening Web Pages and Retrieving Information**: Techniques for accessing and extracting data from HTML pages.
- **Using Selenium**: A tool for automating web browsers, enabling dynamic interactions with web pages.

### Static pages
We start with static webpages, the simplest case in web scraping. A static webpage displays the same content to every user and changes only when someone edits its source. Such pages are typically written in plain HTML and do not depend on user interaction or on information pulled from a database. Because the content stays consistent, it is easier to extract data without worrying about dynamic changes.

A typical use case arises if the website you're interested in uses predictable URLs. For example, suppose you want to list all events that Wikipedia considers "notable." You might discover that this information can be accessed by visiting pages with URLs following the pattern:

https://en.wikipedia.org/wiki/{month}_{day}

In this scenario, you could write code that systematically opens each page conforming to this pattern for every day of the year.
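
As a minimal sketch of this idea, the snippet below builds one URL per calendar day. It assumes that the `{month}_{day}` pattern uses English month names and unpadded day numbers (for example `January_1`). It uses the leap year 2024 only so that `February_29` is included. The helper name `wikipedia_day_urls` is ours and not part of any library.

```python
import calendar

# English month names are written out so the result does not depend on the system locale
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def wikipedia_day_urls(year=2024):
    """Return one Wikipedia URL per day of `year` (2024 is a leap year, so February 29 is included)."""
    base = "https://en.wikipedia.org/wiki/{month}_{day}"
    urls = []
    for month_number, month_name in enumerate(MONTHS, start=1):
        # monthrange returns (weekday of the first day, number of days in the month)
        days_in_month = calendar.monthrange(year, month_number)[1]
        for day in range(1, days_in_month + 1):
            urls.append(base.format(month=month_name, day=day))
    return urls

urls = wikipedia_day_urls()
print(len(urls))
print(urls[:3])
```

Each URL could then be downloaded and parsed with the tools introduced in the rest of this section, ideally with a short pause between requests so as not to overload the server.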

We will use the package `beautifulsoup4` for parsing HTML and XML documents. More details can be found at [beautifulsoup4 on PyPI](https://pypi.org/project/beautifulsoup4/).

To install this package, run the following command in your terminal:
```bash
conda install anaconda::beautifulsoup4
```
We will proceed by opening the Wikipedia page on `Natural Language Processing`.

In [6]:
import requests
from bs4 import BeautifulSoup
import re

# URL of the page you want to scrape
url = "https://en.wikipedia.org/wiki/Natural_language_processing"

response = requests.get(url)

We just created a new object called `response` that contains the content available on the web page. 

In [7]:
print(type(response))

<class 'requests.models.Response'>


`requests.models.Response` is the class that the `requests` library uses to represent the server's reply to an HTTP request. Among other things, it stores the status code, the headers, and the body of the page.

We can examine the content in `response`. To understand it better, compare it with what you observe when you:
1. Open the page using Google Chrome.
2. Right-click on the page and select "Inspect".

In [8]:
response.content

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Natural language processing - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-f

In [9]:
# Parse the content of the response using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Print the prettified version of the response content
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Natural language processing - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpre

Here we see a "prettified" version of the content, which makes the structure of the document easier to read. We can use `bs4` to find elements in it. For example, notice the `<title>` tag near the top of the page.

In [10]:
title = soup.find('title')
print(title.get_text())

Natural language processing - Wikipedia


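Looking at the page, we may want to select the section headings to get a first idea of how the text is organized. Class names such as the one used below are specific to the site and can change when the site updates its markup, so check the current HTML if the cell returns nothing.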

In [11]:
headlines = soup.find_all(class_="mw-headline")
for headline in headlines:
    print(headline.get_text())

History
Symbolic NLP (1950s – early 1990s)
Statistical NLP (1990s–2010s)
Neural NLP (present)
Approaches: Symbolic, statistical, neural networks
Statistical approach
Neural networks
Common NLP tasks
Text and speech processing
Morphological analysis
Syntactic analysis
Lexical semantics (of individual words in context)
Relational semantics (semantics of individual sentences)
Discourse (semantics beyond individual sentences)
Higher-level NLP applications
General tendencies and (possible) future directions
Cognition
See also
References
Further reading
External links


How do you figure out which selector to use? The best way is usually to open the page in a web browser and identify the elements that interest you, such as specific text, images, or links.

Once you've located these elements, right-click on them and select the **Inspect** option (on Chrome). This action will open the browser's developer tools and highlight the corresponding HTML code.

Pay close attention to the highlighted HTML segment. Look for distinctive features that uniquely identify the elements you're interested in. These features may include:

- **Class Attributes**: Classes are common identifiers. They are often used to style elements with CSS and can be useful for pinpointing specific elements. For example, `<div class="content-section">`.

- **HTML Tags**: The type of tag (`<div>`, `<span>`, `<p>`, `<a>`, etc.) can sometimes be a straightforward identifier, particularly if it's less common, like `<article>` or `<aside>`.

- **ID Attributes**: IDs are unique identifiers for elements on a page (`<div id="main-content">`). They are very specific, but not as common as classes.

- **Other Attributes**: Look for other attributes like `name`, `href`, `src`, etc., which can provide additional specificity.

- **Styling Details**: Sometimes, visual cues like font type or color, which correspond to specific CSS properties, can help identify elements, though these are less reliable as unique identifiers.

- **Element Hierarchy**: The position of an element within the HTML structure can also be a guide. Nested elements or the relationship of an element with its siblings or parent can help in locating it.

Once you've identified the distinguishing features of your desired elements, use them to formulate your selectors in BeautifulSoup. For example, `soup.find_all('div', class_='content-section')` would find all `div` elements with a class of `content-section`.
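
The features listed above map onto BeautifulSoup calls roughly as in the sketch below. The HTML fragment is made up for illustration, and the variable is called `soup_demo` so that it does not overwrite the `soup` object created earlier.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment, used only to illustrate the selectors
html = """
<div id="main-content">
  <div class="content-section"><p>First section</p></div>
  <div class="content-section"><p>Second section</p>
    <a href="https://example.org/more">Read more</a>
  </div>
  <aside>Side note</aside>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# By class attribute: all <div> elements with class "content-section"
sections = soup_demo.find_all("div", class_="content-section")

# By ID attribute: the single element whose id is "main-content"
main = soup_demo.find(id="main-content")

# By tag name: the first <aside> element
aside_text = soup_demo.find("aside").get_text()

# By another attribute: every <a> tag that has an href
links = [a["href"] for a in soup_demo.find_all("a", href=True)]

# By hierarchy, using a CSS selector: <p> tags that are direct children of a .content-section div
paragraphs = [p.get_text() for p in soup_demo.select("div.content-section > p")]

print(len(sections), main.name, aside_text, links, paragraphs)
```

When several of these options identify the same element, the one least likely to change with a redesign of the page (often an ID or a descriptive class) tends to be the safer choice.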

### Dynamic pages

Dynamic webpages are designed to respond interactively to user actions, such as text input in fields, selection from drop-down menus, or clicks on buttons. Unlike static pages, where content remains unchanged unless the HTML code is manually updated, dynamic pages adapt and display new information based on user interactions.

In such cases, merely constructing the URL and examining the HTML code may not suffice to understand or scrape the webpage's content. 

A prime example of this is the Patent Public Search Basic page at 

https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html 

On this site, users input search parameters, and though the page's URL remains constant, the displayed data and links change dynamically in response to the inputs. This makes scraping more complex, often requiring more advanced techniques.


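For this task we will use Selenium. Installation instructions are available at https://selenium-python.readthedocs.io/installation.html. The code below also imports `webdriver_manager`, a separate package that downloads a ChromeDriver matching your browser.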

In [14]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select  

import time
# Set up WebDriver with the new syntax
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Open the webpage
driver.get("https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html")

# Pause for a few seconds so the page can finish loading before we interact with it
time.sleep(5)
# Locate the dropdown for 'Publication date' and select it
dropdown = Select(driver.find_element(By.ID, 'searchField1'))
dropdown.select_by_visible_text('Publication date')

# Locate the input field for 'For' and input the date
input_field = driver.find_element(By.ID, 'searchText1')
input_field.send_keys('20221201')

# Click the search button
button = driver.find_element(By.ID, 'basicSearchBtn')
button.click()

# Pause briefly so the results table can load after the search
time.sleep(2)
# Locate the results table
table = driver.find_element(By.ID, 'searchResults')

# Find all <tr> elements in the table
rows = table.find_elements(By.TAG_NAME, 'tr')

list_ids=[]
# Loop through each row and get the text from the second column
for row in rows:
    # Get all columns/cells in the row
    cols = row.find_elements(By.TAG_NAME, 'td')

    # Check if columns exist and then access the second column
    if len(cols) > 1:
        second_column_text = cols[1].text
        list_ids.append(second_column_text)


# Keep the browser open for a few more seconds so the results can be inspected visually
time.sleep(5)
# Close the driver when done
driver.quit()

In [13]:
print(list_ids)

['US-20220386518-P1', 'US-20220383414-A1', 'US-20220379036-A1', 'US-20220385993-A1', 'US-20220380362-A1', 'US-20220379032-A1', 'US-20220385965-A1', 'US-20220380423-A1', 'US-20220385062-A1', 'US-20220381767-A1', 'US-20220378898-A1', 'US-20220386213-A1', 'US-20220378750-A1', 'US-20220382085-A1', 'US-20220385658-A1', 'US-20220379063-A1', 'US-20220384736-A1', 'US-20220385763-A1', 'US-20220381307-A1', 'US-20220382822-A1', 'US-20220378797-A1', 'US-20220378847-A1', 'US-20220378932-A1', 'US-20220378344-A1', 'US-20220380891-A1', 'US-20220385742-A1', 'US-20220384699-A1', 'US-20220386017-A1', 'US-20220379037-A1', 'US-20220378419-A1', 'US-20220378906-A1', 'US-20220378559-A1', 'US-20220380827-A1', 'US-20220378103-A1', 'US-20220386113-A1', 'US-20220386344-A1', 'US-20220382643-A1', 'US-20220386218-A1', 'US-20220380938-A1', 'US-20220384622-A1', 'US-20220379113-A1', 'US-20220380775-A1', 'US-20220385416-A1', 'US-20220385720-A1', 'US-20220381691-A1', 'US-20220381692-A1', 'US-20220381795-A1', 'US-20220383

In the previous code, we used Selenium to open a web page in a browser, select "Publication date" as the search criterion, and enter a specific date (`20221201`, that is, December 1, 2022). After clicking the search button, we gathered the first 50 IDs presented in the results. To determine which buttons to click and which fields to fill in, we inspected the page as described in the previous section. The difference is that Selenium lets us interact with the page rather than only read it.

## REST APIs 

Another popular method of data retrieval is through REST APIs. These are sets of rules defined by the data owner for interacting with their database. Most REST APIs share several common features:
- Interaction is facilitated via a URL.
- Authentication may or may not be required.
- The system tracks the number of requests made by each user and may impose limits.
- A successful request typically generates a response in JSON or XML format.

As an example, let's consider the `Crossref Unified Resource API`. This API enables users to retrieve metadata for publications, identified by their DOI (Digital Object Identifier) addresses. For more information about DOIs, visit [Digital Object Identifier on Wikipedia](https://en.wikipedia.org/wiki/Digital_object_identifier). For details regarding the API, see the [Crossref API documentation](https://api.crossref.org/swagger-ui/index.html), which also includes examples of how to use the API.

In this case, the main way of interacting with the database is to build a URL of the form https://api.crossref.org/works/{doi}, where {doi} is the DOI of the paper we are interested in.
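
A small helper can build this URL from a DOI. The sketch below is only an illustration: the function name `crossref_url` is ours, and the percent-encoding step is a precaution for DOIs that contain spaces or other unusual characters, not something the source requires.

```python
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"

def crossref_url(doi):
    """Build the Crossref works URL for a DOI, percent-encoding unsafe characters."""
    # "/" is left as-is because a DOI contains a slash between its prefix and suffix
    return CROSSREF_WORKS + quote(doi.strip(), safe="/")

print(crossref_url("10.1038/s41586-023-06747-5"))
# https://api.crossref.org/works/10.1038/s41586-023-06747-5
```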

Let's download the metadata for the paper: Trinh, T.H., Wu, Y., Le, Q.V. et al. _Solving Olympiad Geometry without Human Demonstrations_. Nature 625, 476–482 (2024). [DOI: 10.1038/s41586-023-06747-5](https://doi.org/10.1038/s41586-023-06747-5)


Notice that Crossref also provides data that can be downloaded in bulk, without using the API. Depending on your needs, this may be the better option. [Link](https://www.crossref.org/blog/new-public-data-file-120-million-metadata-records/)

In [16]:
# Since we need to open a URL, we use the requests package, as we did above
# I include my email in the User-Agent because the documentation asks users to make "polite" requests
headers = {
    'User-Agent': 'Tiziano Rotesi (tiziano_rotesi@brown.edu)'
}

response = requests.get("https://api.crossref.org/works/10.1038/s41586-023-06747-5", headers=headers)
data = response.json()
print(data)

{'status': 'ok', 'message-type': 'work', 'message-version': '1.0.0', 'message': {'indexed': {'date-parts': [[2024, 1, 18]], 'date-time': '2024-01-18T00:29:06Z', 'timestamp': 1705537746780}, 'reference-count': 69, 'publisher': 'Springer Science and Business Media LLC', 'issue': '7995', 'license': [{'start': {'date-parts': [[2024, 1, 17]], 'date-time': '2024-01-17T00:00:00Z', 'timestamp': 1705449600000}, 'content-version': 'tdm', 'delay-in-days': 0, 'URL': 'https://creativecommons.org/licenses/by/4.0'}, {'start': {'date-parts': [[2024, 1, 17]], 'date-time': '2024-01-17T00:00:00Z', 'timestamp': 1705449600000}, 'content-version': 'vor', 'delay-in-days': 0, 'URL': 'https://creativecommons.org/licenses/by/4.0'}], 'content-domain': {'domain': ['link.springer.com'], 'crossmark-restriction': False}, 'short-container-title': ['Nature'], 'published-print': {'date-parts': [[2024, 1, 18]]}, 'abstract': '<jats:title>Abstract</jats:title><jats:p>Proving mathematical theorems at the olympiad level rep

The API returns information as JSON, a text format whose structure closely resembles Python dictionaries and lists; `response.json()` parses it into a Python dictionary. Let's examine some of the details stored within this dictionary.

In [17]:
data.keys()

dict_keys(['status', 'message-type', 'message-version', 'message'])

In [18]:
print(data['message'])

{'indexed': {'date-parts': [[2024, 1, 18]], 'date-time': '2024-01-18T00:29:06Z', 'timestamp': 1705537746780}, 'reference-count': 69, 'publisher': 'Springer Science and Business Media LLC', 'issue': '7995', 'license': [{'start': {'date-parts': [[2024, 1, 17]], 'date-time': '2024-01-17T00:00:00Z', 'timestamp': 1705449600000}, 'content-version': 'tdm', 'delay-in-days': 0, 'URL': 'https://creativecommons.org/licenses/by/4.0'}, {'start': {'date-parts': [[2024, 1, 17]], 'date-time': '2024-01-17T00:00:00Z', 'timestamp': 1705449600000}, 'content-version': 'vor', 'delay-in-days': 0, 'URL': 'https://creativecommons.org/licenses/by/4.0'}], 'content-domain': {'domain': ['link.springer.com'], 'crossmark-restriction': False}, 'short-container-title': ['Nature'], 'published-print': {'date-parts': [[2024, 1, 18]]}, 'abstract': '<jats:title>Abstract</jats:title><jats:p>Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning<jats:sup>1–4</jat

Most of the information is stored in `data['message']`, so we need to examine it more closely. As we did with `bs4`, we can "prettify" the dictionary, this time using the `json` module.

In [19]:
import json
pretty_json = json.dumps(data['message'], indent=4)
print(pretty_json)

{
    "indexed": {
        "date-parts": [
            [
                2024,
                1,
                18
            ]
        ],
        "date-time": "2024-01-18T00:29:06Z",
        "timestamp": 1705537746780
    },
    "reference-count": 69,
    "publisher": "Springer Science and Business Media LLC",
    "issue": "7995",
    "license": [
        {
            "start": {
                "date-parts": [
                    [
                        2024,
                        1,
                        17
                    ]
                ],
                "date-time": "2024-01-17T00:00:00Z",
                "timestamp": 1705449600000
            },
            "content-version": "tdm",
            "delay-in-days": 0,
            "URL": "https://creativecommons.org/licenses/by/4.0"
        },
        {
            "start": {
                "date-parts": [
                    [
                        2024,
                        1,
                        17
      

Notice that the structure resembles a tree, with branches containing smaller dictionaries that hold specific information.

In [35]:
data['message'].keys()

dict_keys(['indexed', 'reference-count', 'publisher', 'issue', 'license', 'content-domain', 'short-container-title', 'published-print', 'abstract', 'DOI', 'type', 'created', 'page', 'update-policy', 'source', 'is-referenced-by-count', 'title', 'prefix', 'volume', 'author', 'member', 'published-online', 'reference', 'container-title', 'original-title', 'language', 'link', 'deposited', 'score', 'resource', 'subtitle', 'short-title', 'issued', 'references-count', 'journal-issue', 'alternative-id', 'URL', 'relation', 'ISSN', 'issn-type', 'subject', 'published', 'assertion'])
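
With the list of keys in hand, nested fields can be reached by chaining lookups. The sketch below uses a small made-up record rather than the real response. It assumes that `title` is a list, that each entry of `author` has `given` and `family` fields, and that `issued` holds a `date-parts` list shaped like the one printed above. The helper name `summarize` is hypothetical.

```python
# A made-up record that mimics the shape of a Crossref 'message' (all values are placeholders)
sample_message = {
    "title": ["An Example Title"],
    "container-title": ["Journal of Examples"],
    "author": [
        {"given": "Ada", "family": "Lovelace"},
        {"given": "Alan", "family": "Turing"},
    ],
    "issued": {"date-parts": [[2024, 1, 18]]},
}

def summarize(message):
    """Pull a few fields out of a Crossref-style record, tolerating missing keys."""
    titles = message.get("title") or [""]
    authors = [
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in message.get("author", [])
    ]
    # date-parts is a list of [year, month, day] lists; we keep only the first year
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    return {"title": titles[0], "authors": authors, "year": date_parts[0][0]}

print(summarize(sample_message))
```

Using `.get` with a default instead of square brackets avoids a `KeyError` when a record lacks a field, which matters once the same code runs over many DOIs.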

For example, let's see the abstract of the paper.

In [21]:
abstract = data['message']['abstract']
print(abstract)

<jats:title>Abstract</jats:title><jats:p>Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning<jats:sup>1–4</jats:sup>, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges<jats:sup>1,5</jats:sup>, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through in

We can observe that this text contains some elements that we may want to remove:

- Parts like `<jats:title>` are tags that mark up specific parts of the text. They come from JATS (Journal Article Tag Suite), an XML format for journal articles (you can ask ChatGPT for more detail). These tags could be handled with Beautiful Soup, as we did for HTML.

- Near the end of the output, we notice the string `\xa0`, which may look strange. It is the Unicode non-breaking space, and in general we want to remove or replace special characters of this kind.

We will address both of these cleaning tasks using regular expressions, a method for identifying specific patterns in strings. We will use the `re` package ([documentation](https://docs.python.org/3/library/re.html)). This approach is flexible and is often used for cleaning tasks or data extraction. We will use regular expressions again in later classes; for a deeper explanation, see [Chapter 2.1](https://web.stanford.edu/~jurafsky/slp3/2.pdf) of Jurafsky and Martin's book. Regular expressions are very useful but not always intuitive. Fortunately, ChatGPT is good at translating natural-language descriptions into regex patterns.
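
One detail worth seeing before the cleaning code is the difference between the greedy pattern `.*` and the non-greedy pattern `.*?`. The toy string below is made up for illustration.

```python
import re

text = "<b>first</b> and <b>second</b>"

# Greedy: .* grabs as much as possible, so the match runs to the LAST </b>
greedy = re.sub(r"<b>.*</b>", "", text)

# Non-greedy: .*? stops at the FIRST </b>, so each tagged span is removed separately
non_greedy = re.sub(r"<b>.*?</b>", "", text)

# findall returns every non-overlapping match; the parentheses capture only the inner text
inner = re.findall(r"<b>(.*?)</b>", text)

print(repr(greedy))      # ''
print(repr(non_greedy))  # ' and '
print(inner)             # ['first', 'second']
```

The non-greedy form is usually what you want when removing tags, because it prevents a single match from swallowing everything between the first opening tag and the last closing tag.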


In [22]:
import re 
abstract = re.sub(r'<jats:title>.*?</jats:title>', '', abstract, flags=re.DOTALL)
print(abstract)

<jats:p>Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning<jats:sup>1–4</jats:sup>, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges<jats:sup>1,5</jats:sup>, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challe

In [48]:
abstract = re.sub(r'[^\x00-\x7F]+', ' ', abstract)  # Replace each run of non-ASCII characters with a single space
print(abstract)

<jats:p>Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning<jats:sup>14</jats:sup>, owing to their reputed difficulty among the worlds best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges<jats:sup>1,5</jats:sup>, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challeng
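
Removing every non-ASCII character also affects legitimate punctuation, such as the apostrophe in "world’s" and the dash in "1–4". A gentler alternative is sketched below. It is only one possible approach: it assumes that the `<jats:sup>` elements hold citation markers that can be dropped, and it is shown on a short made-up fragment rather than on the full abstract. The function name `clean_abstract` is ours.

```python
import re
import unicodedata

def clean_abstract(raw):
    """Strip JATS tags and tidy whitespace while keeping legitimate non-ASCII characters."""
    # Drop the title element together with its text ("Abstract")
    text = re.sub(r"<jats:title>.*?</jats:title>", "", raw, flags=re.DOTALL)
    # Assumption: superscripts are citation markers, so drop them with their content
    text = re.sub(r"<jats:sup>.*?</jats:sup>", "", text, flags=re.DOTALL)
    # Replace any remaining tag, such as <jats:p>, with a space and keep the enclosed text
    text = re.sub(r"<[^>]+>", " ", text)
    # NFKC normalization maps the non-breaking space (\xa0) to a regular space
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

# Made-up fragment: \u2019 is a typographic apostrophe, \u2013 an en dash, \xa0 a non-breaking space
sample = (
    "<jats:title>Abstract</jats:title>"
    "<jats:p>The world\u2019s best\xa0talents<jats:sup>1\u20134</jats:sup>, and more.</jats:p>"
)
print(clean_abstract(sample))
```

NFKC normalization also rewrites other compatibility characters (for example ligatures and superscript digits), so it is worth checking a few cleaned texts by eye before applying it to a whole corpus.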