
# Case Study: UNESCO World Heritage Sites in Danger

## Introduction

In this case study, we explore how to scrape and analyze data from Wikipedia using **R**.  
The goal is to examine the geographic distribution of UNESCO World Heritage Sites that are currently listed as *in danger*.

The data source is the Wikipedia page:

http://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger

The table on this page contains:
- Site name  
- Location (including coordinates)  
- Type of heritage (cultural or natural)  
- Year of inscription  
- Year of endangerment  

We will:

1. Load required packages.
2. Scrape the HTML table from Wikipedia.
3. Clean and prepare the data.
4. Extract latitude and longitude using regular expressions.
5. Plot the sites on a world map.

This example demonstrates a core principle:

> **Data are abundant: retrieve them, prepare them, use them.**

---

## Load Required Libraries

We use:
- `stringr` for text manipulation
- `XML` for HTML parsing
- `maps` for visualization

---

## Scrape and Clean the Data

We:
- Parse the HTML page
- Extract all tables
- Select the relevant table
- Rename variables
- Convert years to numeric
- Extract coordinates using regular expressions
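
The coordinate step can be tried in isolation before touching the live page. The sketch below applies the two patterns to made-up location strings that only imitate the `/ latitude; longitude` fragment found in the table cells. The strings and the name `locn_sample` are illustrative assumptions, not real table content.

```r
library(stringr)

# Hypothetical location cells; only the "/ lat; lon" fragment mimics the real table
locn_sample <- c(
  "Sample Town, Country A / 30.84167; 29.66389 (Site A)",
  "Sample Region, Country B / -2.5; -78.5 (Site B)"
)

# Latitude sits between "/" and ";", longitude follows the ";"
reg_y <- "[/][ -]*[[:digit:]]*[.]*[[:digit:]]*[;]"
reg_x <- "[;][ -]*[[:digit:]]*[.]*[[:digit:]]*"

# Drop the leading delimiter and space (and the trailing ";" for latitude)
lat <- as.numeric(str_sub(str_extract(locn_sample, reg_y), 3, -2))
lon <- as.numeric(str_sub(str_extract(locn_sample, reg_x), 3, -1))

data.frame(lat = lat, lon = lon)
```

Testing the patterns on a handful of known strings first makes it easier to spot rows where the real page deviates from the expected format, since those rows come back as `NA`.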

---

## Visualize the Sites

We:
- Plot a world map
- Add points for endangered sites
- Use different symbols for cultural vs natural sites

Cultural sites → triangles  
Natural sites → dots  

This allows us to visually inspect geographic clustering.

---

## Key Observations

- Many endangered sites are located in Africa, the Middle East, and Southwest Asia.
- Cultural sites appear clustered in the Middle East and Southwest Asia.
- Natural sites are more prominent in Africa.
- Many sites were listed as endangered shortly after inscription.

This raises interesting political and institutional questions about UNESCO's designation process.

---

Now let's implement everything in R.


In [None]:
# Install packages if necessary
# install.packages(c("stringr", "XML", "maps"))

library(stringr)
library(XML)
library(maps)

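# Scrape the Wikipedia page
# Wikipedia serves pages over HTTPS, which XML::htmlParse() cannot fetch directly,
# so download the HTML with base R first and parse it from text.
url <- "https://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger"
heritage_html <- paste(readLines(url, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
heritage_parsed <- htmlParse(heritage_html, asText = TRUE, encoding = "UTF-8")
tables <- readHTMLTable(heritage_parsed, stringsAsFactors = FALSE)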

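# Select the second table (current endangered sites)
# The table index can change when the page is edited; check names(tables) if this fails
danger_table <- tables[[2]]

# Keep name, location, criteria, inscription year, and endangerment year.
# Positions 1, 3, 4, 6, 7 assume the column layout
# Name, Image, Location, Criteria, Area, Year (WHS), Endangered;
# verify with names(danger_table) and adjust if Wikipedia's structure has changed.
danger_table <- danger_table[, c(1, 3, 4, 6, 7)]
colnames(danger_table) <- c("name", "locn", "crit", "y_ins", "y_end")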

# Recode cultural/natural
danger_table$crit <- ifelse(grepl("Natural", danger_table$crit), "nat", "cult")

# Convert inscription year to numeric
danger_table$y_ins <- as.numeric(str_extract(danger_table$y_ins, "[[:digit:]]{4}"))

# Extract last 4-digit year from endangerment column
danger_table$y_end <- as.numeric(str_extract(danger_table$y_end, "[[:digit:]]{4}$"))

# Regular expressions for coordinates
reg_y <- "[/][ -]*[[:digit:]]*[.]*[[:digit:]]*[;]"
reg_x <- "[;][ -]*[[:digit:]]*[.]*[[:digit:]]*"

# Extract latitude
y_coords <- str_extract(danger_table$locn, reg_y)
y_coords <- as.numeric(str_sub(y_coords, 3, -2))
danger_table$y_coords <- y_coords

# Extract longitude
x_coords <- str_extract(danger_table$locn, reg_x)
x_coords <- as.numeric(str_sub(x_coords, 3, -1))
danger_table$x_coords <- x_coords

# Remove messy location column
danger_table$locn <- NULL

# Basic inspection
dim(danger_table)
head(danger_table)

# Plot map
pch_vals <- ifelse(danger_table$crit == "nat", 19, 2)

map("world", col = "darkgrey", lwd = 0.5,
    mar = c(0.1, 0.1, 0.1, 0.1))

points(danger_table$x_coords,
       danger_table$y_coords,
       pch = pch_vals)

box()


# 1.2 Some Remarks on Web Data Quality

## Introduction

The previous example demonstrated how easily web data can be scraped and visualized.  
However, before collecting large amounts of online data, it is essential to ask:

- What type of data best answers the research question?
- Is the data quality sufficient?
- Could the information be systematically biased or flawed?

This section highlights key considerations when working with web data.

---

## Origins of Web Data

Web data may be:

- **Firsthand data** (e.g., tweets, forum posts, reviews)
- **Secondhand data** (copied from offline sources)
- **Scraped data** (collected from other online platforms)

Sometimes the original source cannot be traced.  
Even so, web data can still be useful, provided we evaluate them critically.

For example, Wikipedia's accuracy has been widely debated.  
Some studies suggest it is comparable to traditional encyclopedias,  
while others report inconsistencies.  

The lesson:

> Cross-validation is essential for any secondary data source.

Reputation alone does not prevent errors.
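
Cross-checking two sources can be as simple as joining them on a shared key and flagging disagreements. The sketch below uses two small hypothetical sources of capital cities; `source_a`, `source_b`, and the deliberately wrong entry in `source_b` are assumptions made up for illustration.

```r
# Two hypothetical secondary sources listing capital cities;
# source_b contains a deliberately wrong entry for Australia
source_a <- data.frame(
  country = c("Australia", "Canada", "Japan"),
  capital = c("Canberra", "Ottawa", "Tokyo"),
  stringsAsFactors = FALSE
)
source_b <- data.frame(
  country = c("Australia", "Canada", "Japan"),
  capital = c("Sydney", "Ottawa", "Tokyo"),
  stringsAsFactors = FALSE
)

# Join on the shared key and flag rows where the sources disagree
combined <- merge(source_a, source_b, by = "country", suffixes = c("_a", "_b"))
combined$agree <- combined$capital_a == combined$capital_b

# Rows that need manual checking against an authoritative reference
disputed <- combined[!combined$agree, ]
disputed
```

Agreement between two sources does not prove correctness, since both may have copied the same error, but disagreement reliably shows where closer inspection is needed.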

---

## Data Quality Depends on Purpose

Data quality is not absolute; it depends on the intended use.

Example:

- A random sample of tweets may be suitable for analyzing hashtag usage.
- The same sample may be biased for predicting election outcomes if collected during a political convention.

Thus, representativeness matters depending on the research objective.

For factual data (e.g., capital cities, wildlife populations),  
there are clearer standards for validation.

---

## Web Data vs Traditional Data Collection

Consider measuring the popularity of a new phone.

Traditional method:
- Conduct a survey
- Ask respondents about preferences

Potential issues:
- Sampling bias
- Poor question wording
- Non-response bias

Web-based alternative:
- Analyze online sales rankings
- Use product reviews as proxies

Advantages:
- Larger coverage
- Behavioral data instead of self-reported preferences
- Lower cost

Challenges:
- Platform bias
- Coverage limitations
- Comparability across product generations

Choosing data sources often involves trade-offs:

- Accuracy vs completeness
- Coverage vs validity
- Cost vs precision

---

## Five-Step Guide for Web Data Collection

1. **Define the exact information needed.**
   - Be specific where possible.

2. **Identify potential web sources.**
   - Direct or indirect indicators.
   - Consider official sites, social media, commercial platforms.

3. **Understand the data generation process.**
   - Who created the data?
   - When and why?
   - Are there systematic gaps?

4. **Balance advantages and disadvantages.**
   - Availability and legality
   - Collection cost
   - Compatibility with existing research
   - Possibility of validation

5. **Make a documented decision.**
   - Choose the most suitable source.
   - If feasible, collect from multiple sources for validation.

---

## Key Takeaway

Web data are not inherently of lower quality than traditional data.  
However, they require:

- Careful validation
- Awareness of biases
- Clear alignment between research question and data source

Ultimately:

> Data quality depends on the user's purpose.


In [None]:
# Simple Framework for Evaluating Web Data Quality

# Step 1: Define research question
research_question <- "What is the popularity of a new smartphone model?"

# Step 2: Identify potential data sources
data_sources <- c(
  "Twitter posts",
  "Online sales rankings",
  "Customer reviews",
  "Survey data"
)

# Step 3: Evaluate each source
evaluation <- data.frame(
  Source = data_sources,
  Representativeness = NA,
  Coverage = NA,
  Potential_Bias = NA,
  Validation_Possible = NA,
  stringsAsFactors = FALSE
)

evaluation

# Step 4: Manually document reasoning after inspection
evaluation$Representativeness <- c(
  "Low (event-driven bias)",
  "Medium (platform-specific users)",
  "Medium (self-selection bias)",
  "High (if properly sampled)"
)

evaluation$Coverage <- c(
  "High volume but noisy",
  "Limited to platform",
  "Limited to buyers",
  "Depends on sample size"
)

evaluation$Potential_Bias <- c(
  "Political/event bias",
  "Platform sales bias",
  "Extreme opinions overrepresented",
  "Response bias"
)

evaluation$Validation_Possible <- c(
  "Yes (cross-platform comparison)",
  "Yes (compare multiple retailers)",
  "Yes (compare with surveys)",
  "Yes (replication)"
)

evaluation


# 1.3 Technologies for Disseminating, Extracting, and Storing Web Data

Collecting web data is not always as simple as scraping an HTML table.
Modern websites use complex structures, dynamic content, and multiple data formats.
To effectively scrape and process web data in R, we need a basic understanding of three major technological pillars:

1. Technologies for disseminating content  
2. Technologies for information extraction  
3. Technologies for data storage  

This section provides a structured overview of each.

---

# 1️⃣ Technologies for Disseminating Content

These technologies define **how data are delivered on the Web**.

## HTML (Hypertext Markup Language)
- Structures how information is displayed in browsers.
- Data appear in tables, lists, text, links.
- Scrapers must understand how data are stored in the underlying HTML code.
- Parsed using HTML parsers.

## XML (Extensible Markup Language)
- Designed for storing and exchanging structured data.
- Uses user-defined tags.
- More flexible than HTML.
- Requires XML parsers.

## JSON (JavaScript Object Notation)
- Lightweight data exchange format.
- Frequently used by APIs (e.g., Twitter API).
- Easy to parse in R.
- Language-independent standard.
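
API responses usually arrive as an array of records rather than a single object. As a minimal sketch, assuming the `jsonlite` package and a made-up payload (the field names `product`, `price`, and `in_stock` are illustrative), such an array maps directly onto a data frame:

```r
library(jsonlite)

# Made-up payload shaped like a typical API response: an array of flat records
json_text <- '[
  {"product": "phone",  "price": 799, "in_stock": true},
  {"product": "tablet", "price": 499, "in_stock": false}
]'

# fromJSON() simplifies an array of flat objects into a data frame
products <- fromJSON(json_text)
str(products)

# toJSON() performs the reverse conversion
toJSON(products, pretty = TRUE)
```

Nested objects do not flatten this cleanly and typically come back as list columns or nested data frames, so inspecting the result with `str()` is a sensible first step.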

## AJAX
- Enables asynchronous loading of content.
- Dynamically updates webpages without reloading.
- Complicates scraping because data may not appear in static HTML.
- Often requires browser tools or Selenium.

## Plain Text
- Unstructured data.
- Requires pattern recognition techniques.
- Processed using regular expressions or text mining.

## HTTP (Hypertext Transfer Protocol)
- The communication standard between client and server.
- Most web scraping relies on HTTP requests.
- Advanced scraping may require custom HTTP requests.
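
A custom request typically means setting query parameters and headers explicitly. The following sketch assumes the `httr` package and the public test service httpbin.org; the query values and the user-agent string are placeholders. The network call is wrapped in `tryCatch()` so the example degrades gracefully when offline.

```r
library(httr)

# Build the request URL from a base URL and named query parameters
request_url <- modify_url(
  "https://httpbin.org/get",
  query = list(q = "heritage", page = 1)
)
request_url

# Send the request with an identifying user agent, an Accept header, and a timeout
response <- tryCatch(
  GET(
    request_url,
    user_agent("example-scraper (contact: you@example.org)"),  # placeholder identity
    add_headers(Accept = "application/json"),
    timeout(10)
  ),
  error = function(e) NULL  # e.g., no network connection
)

# Inspect the result only if the request succeeded
if (!is.null(response) && !http_error(response)) {
  print(status_code(response))
  cat(content(response, as = "text", encoding = "UTF-8"))
}
```

Identifying the scraper in the user agent and setting a timeout are courtesy and robustness measures rather than requirements of the protocol.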

---

# 2️⃣ Technologies for Information Extraction

Once documents are retrieved, we must extract relevant information.

## XPath
- Query language for navigating HTML/XML.
- Selects specific nodes or elements.
- Powerful for structured documents.
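
To make node selection concrete, here is a small sketch using the `xml2` package on an inline document. The element and attribute names (`sites`, `site`, `type`, `name`) are invented for this example; the site names serve only as sample values.

```r
library(xml2)

# Small inline XML document with an invented structure
doc <- read_xml('<sites>
  <site type="natural"><name>Virunga National Park</name></site>
  <site type="cultural"><name>Old City of Jerusalem and its Walls</name></site>
  <site type="natural"><name>Everglades National Park</name></site>
</sites>')

# XPath: every <name> whose parent <site> has type="natural"
natural_nodes <- xml_find_all(doc, "//site[@type='natural']/name")
natural_names <- xml_text(natural_nodes)
natural_names

# XPath: all <site> elements, regardless of type
n_sites <- length(xml_find_all(doc, "//site"))
n_sites
```

The same expressions work on HTML parsed with `read_html()`, which is why XPath is useful for scraping pages as well as for XML feeds.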

## JSON Parsers
- Automatically decode JSON objects into R structures.
- No query language required.

## Selenium
- Browser automation framework.
- Handles dynamic (AJAX-heavy) websites.
- Simulates clicks and inputs.

## Regular Expressions
- Pattern-matching tools for extracting structured text.
- Useful for numbers, names, coordinates.
- Helpful when markup structure cannot be exploited.

## Text Mining
- Extracts latent patterns from unstructured text.
- Enables classification and clustering.
- Used for sentiment analysis, topic modeling, etc.
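
Most text mining methods start from term counts. The sketch below builds term frequencies and a small document-term matrix with base R only; the review snippets are made up, and real projects would typically add stop-word removal and stemming, which are omitted here.

```r
# Made-up review snippets standing in for scraped text
reviews <- c(
  "Great phone, great battery",
  "Battery died fast",
  "Great camera"
)

# Lowercase, split on anything that is not a letter, and drop empty tokens
tokens_by_doc <- lapply(
  strsplit(tolower(reviews), "[^a-z]+"),
  function(x) x[x != ""]
)

# Corpus-wide term frequencies, most frequent first
term_freq <- sort(table(unlist(tokens_by_doc)), decreasing = TRUE)
term_freq

# Document-term matrix: one row per review, one column per term
vocab <- names(term_freq)
dtm <- t(sapply(tokens_by_doc, function(tok) table(factor(tok, levels = vocab))))
dtm
```

A matrix like `dtm` is the usual input for the classification and clustering methods mentioned above.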

---

# 3️⃣ Technologies for Data Storage

After extraction, data must be stored efficiently.

## Databases & SQL
- Reliable, scalable storage.
- Support multi-user access.
- Fast querying for large datasets.
- Useful for large-scale scraping projects.

## R Native Storage
- CSV files
- RDS files
- Binary formats
- Suitable for small to medium projects.
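
As a minimal sketch of file-based storage, the example below writes a toy data frame to both CSV and RDS and reads it back. The data values are placeholders, and temporary files are used so nothing is left behind in the working directory.

```r
# Toy data frame with placeholder values
sites <- data.frame(
  name  = c("Site A", "Site B"),
  crit  = c("nat", "cult"),
  y_ins = c(1979, 1994),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV: plain text, readable by other tools; column types are re-guessed on import
write.csv(sites, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# RDS: R's serialization format for a single object
saveRDS(sites, rds_path)
from_rds <- readRDS(rds_path)

str(from_csv)
identical(from_rds, sites)
```

CSV is the better choice when data must be shared outside R, while RDS is convenient for intermediate results that only R will read again.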

---

# Key Insight

Web scraping requires understanding:

- How data are delivered (HTML, JSON, AJAX)
- How to extract them (XPath, regex, parsers)
- How to store them (R files, databases)

You do not need to be an expert in all of these technologies,
but you must understand the basics to build effective scrapers.

> Web scraping is not just about downloading data;
> it is about understanding the full data pipeline.


In [None]:
# Install packages if necessary
# install.packages(c("httr", "xml2", "rvest", "jsonlite", "DBI", "RSQLite", "stringr"))

library(httr)      # HTTP communication
library(xml2)      # XML/HTML parsing
library(rvest)     # Web scraping tools
library(jsonlite)  # JSON parsing
library(stringr)   # Regular expressions
library(DBI)       # Database interface
library(RSQLite)   # SQLite database

# ------------------------------
# 1️⃣ Dissemination: HTTP Request
# ------------------------------
response <- GET("https://httpbin.org/get")
status_code(response)

# ------------------------------
# 2️⃣ Extraction: HTML Parsing
# ------------------------------
html_page <- read_html("https://example.com")
title_node <- html_element(html_page, "title")
html_text(title_node)

# ------------------------------
# 3️⃣ Extraction: JSON Parsing
# ------------------------------
json_data <- fromJSON('{"product":"phone","price":799}')
json_data

# ------------------------------
# 4️⃣ Extraction: Regular Expression
# ------------------------------
text_sample <- "The price is $799."
str_extract(text_sample, "[0-9]+")

# ------------------------------
# 5️⃣ Storage: Database Example
# ------------------------------
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "products", data.frame(name="phone", price=799))
dbReadTable(con, "products")

dbDisconnect(con)


# 1.4 Structure of the Book

This book is written for readers with diverse backgrounds and goals.  
Depending on your experience with R and web technologies, you may read it sequentially or selectively.

---

## Who Should Read What?

### If you have basic R knowledge but little web experience:
Read the book sequentially to build a strong foundation.

### If you already have text data:
Start with:
- Chapter 8: Regular Expressions and String Functions
- Chapter 10: Statistical Text Processing

### If you are mainly interested in web scraping:
Focus on the scraping chapters.
You may skip Chapter 10 (text mining),  
but Chapter 8 (text manipulation basics) is strongly recommended.

### If you are a teacher:
- Exercises are provided after most chapters in Parts I and II.
- Partial solutions are available on the book's website.
- Exercises can be used for homework or exams.

---

# Overview of the Three Parts

---

# 📘 Part I: A Primer on Web and Data Technologies

This section introduces foundational technologies:

- HTTP
- HTML
- XML
- JSON
- AJAX
- SQL
- XPath
- Regular Expressions

Goal:
- Understand how the Web works.
- Learn how data are structured and transmitted.
- Build core technical skills needed for scraping.

Includes:
- Concept explanations
- Practical exercises

---

# 🛠 Part II: A Practical Toolbox for Web Scraping and Text Mining

This section focuses on implementation.

Core topics include:

## Web Scraping Techniques
- Regular expressions
- XPath
- APIs
- Source-specific scraping methods
- Legal and ethical considerations

## Statistical Text Processing
- Supervised text classification
- Unsupervised methods
- Extracting latent information

## Data Project Management in R
- File system organization
- Efficient coding with loops
- Automating scraping tasks
- Scheduling recurring data collection

Goal:
Turn foundational knowledge into applied skills.

---

# 📊 Part III: Case Studies

This section provides real-world applications.

Each case study includes:
- A research motivation
- Data collection procedures
- Cleaning and preprocessing
- Analysis
- Discussion of pitfalls

Additionally:
- Summary tables of techniques used
- Key R packages and functions
- Practical workflow insights

---

# Key Takeaway

The book moves from:

1. Understanding web technologies  
2. Applying scraping and text mining tools  
3. Executing full real-world projects  

You may follow the full journey or jump directly to the sections most relevant to your goals.

> The structure supports both learning and practical application.


In [None]:
# Representation of the Book Structure in R

book_structure <- list(

  Part_I = list(
    Title = "Primer on Web and Data Technologies",
    Topics = c("HTTP", "HTML", "XML", "JSON", "AJAX", "SQL",
               "XPath", "Regular Expressions"),
    Goal = "Understand foundational web technologies"
  ),

  Part_II = list(
    Title = "Web Scraping and Text Mining Toolbox",
    Topics = c("Scraping Techniques",
               "APIs",
               "Legal/Ethical Issues",
               "Supervised Text Classification",
               "Unsupervised Text Mining",
               "Workflow & Automation"),
    Goal = "Apply scraping and text mining in practice"
  ),

  Part_III = list(
    Title = "Case Studies",
    Topics = c("Real-world scraping applications",
               "Data cleaning",
               "Workflow management",
               "Common pitfalls"),
    Goal = "Integrate techniques into complete projects"
  )
)

book_structure
