<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> Parsing Unstructured Data: Scraping Part I</font><br><br>
Tiago Ventura</center></h1>

---

# Learning Goals

In class today, we will focus on:

- Understanding different strategies to acquire digital data
- Understanding HTML structure to look up content on a website
- Scraping content from a static website
- Building a scraper to systematically draw content from similarly organized webpages

# The Digital Information Age

We started our first lecture by looking at this graph. It shows two things: 

- In the past few years we have produced and stored an enormous amount of data.
- Most of this data is produced and stored in digital environments. 

<div>
<img src="http://media3.washingtonpost.com/wp-dyn/content/graphic/2011/02/11/GR2011021100614.jpg" width="60%"/>
</div>


Not all of this data is available in digital spaces (like websites, social media apps, and digital archives). But some of it is. And as data scientists, a primary skill expected from you is the ability to acquire, process, store, and analyze this data. Today, we will focus on **acquiring data in the digital information era.** 

There are three primary techniques through which you can acquire digital data: 

- Scrape data from self-contained (static) websites
- Scrape data from dynamic (JavaScript-powered) websites
- Access data through Application Programming Interfaces

## What is scraping? 

**Scraping** consists of automatically collecting data available on websites. In theory, you can collect website data by hand, or by asking a couple of friends to help you. However, in a world of abundant data, this is rarely feasible, and collecting by hand only feels more burdensome once you have learned to do it automatically.

Let me give you some **examples of websites** I have already scraped: 

- Electoral data from many different countries
- Composition of elites around the world
- Wikipedia
- Toutiao, a news aggregator from China
- Political manifestos in Brazil
- Fact-checking news
- Facebook and YouTube live chats
- Property prices from Zillow

Scraping can be summarized as: 

- leveraging the structure of a website to **grab its contents**

- using a programming environment (such as R, Python, Java, etc.) to **systematically extract** that content.

- accomplishing the above in an "unobtrusive" and **legal** way.



## Scraping vs APIs


An API is a set of rules and protocols that allows software applications to communicate with each other. APIs provide a front door for a developer to interact with a website. 

APIs are used for many different types of online communication and information sharing, among those, many **APIs have been developed to provide an easy and official way for developers and data scientists to access data**. 

Because these APIs are developed by the data owners, they are often more secure, practical, and organized than acquiring data through scraping. 

Scraping is a back door for when there's no API or when we need content beyond the structured fields the API returns.

**If you can use an API to access a dataset, that's where you will want to go.**
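
To make the contrast concrete, here is a minimal sketch comparing the two routes. The JSON string and the HTML snippet are made-up stand-ins for what a hypothetical API and a hypothetical webpage might return; no real service is contacted. The HTML side uses Python's built-in `html.parser` only so the example runs without installing anything.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: the data owner already structured the fields.
api_response = '{"candidate": "Ana Silva", "votes": 1520}'
record_from_api = json.loads(api_response)

# Hypothetical webpage showing the same information, wrapped in layout tags.
page_html = """
<div class="result">
  <span class="candidate">Ana Silva</span>
  <span class="votes">1520</span>
</div>
"""


class SpanCollector(HTMLParser):
    """Collect the text of every <span>, keyed by its class attribute."""

    def __init__(self):
        super().__init__()
        self.current_class = None  # class of the <span> we are inside, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_class = None

    def handle_data(self, data):
        if self.current_class is not None:
            self.fields[self.current_class] = data.strip()


collector = SpanCollector()
collector.feed(page_html)
record_from_page = collector.fields
# Scraped values arrive as text, so types must be fixed by hand.
record_from_page["votes"] = int(record_from_page["votes"])

print(record_from_api)
print(record_from_page)
```

Both routes end with the same record, but the API route takes one line of parsing while the scraping route depends on the page's layout staying the same.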

## Ethical Challenges with Scraping

Web scraping is generally considered legal **as long as the scraped data is publicly available and the scraping activity does not harm the website being scraped**. These are two hugely relevant conditions. So before we begin coding, it is important to consider these issues.  

Each call to a web server takes time, server cycles, and memory. Most servers can handle significant traffic, but they can't necessarily handle the strain induced by massive automated requests. Your code can overload the site, taking it offline, or causing the site administrator to ban your IP. 

We do not want to compromise the functioning of a website just because of our research. First, this overload can crash a server and prevent other users from accessing the site. Second, servers and hosts can, and do, implement countermeasures (e.g., blocking our IP address). 

In addition, only collect public information. Think about Facebook. It is okay to collect public posts, or data from public groups. If you somehow manage to get into private groups, and group members have an expectation of privacy, it is not okay to collect their data. 

Here is a list of good practices for scraping:

- Respect robots.txt
- Don't hit servers too often
- Slow your code down to roughly the pace of a human browsing manually
- Find trusted source sites
- Do not scrape during peak hours
- Improve your code's efficiency
- Use data responsibly (as academics often do)
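
As a rough illustration of the first three practices, the sketch below checks a `robots.txt` file and spaces out requests. The `robots.txt` content and the `example.com` URLs are invented for the example, and `polite_targets` and `scrape_slowly` are hypothetical helper names, not part of any library. Only `urllib.robotparser` and `time` from the standard library are used.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from
# the site's /robots.txt path before requesting anything else.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())


def polite_targets(urls, user_agent="*"):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def scrape_slowly(urls, fetch, delay_seconds):
    """Call `fetch` on each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # do not hit the server too often
        results.append(fetch(url))
    return results


candidate_urls = [
    "https://example.com/elections/2020",
    "https://example.com/private/admin",
    "https://example.com/elections/2022",
]
allowed = polite_targets(candidate_urls)

# Use the site's requested delay when it states one, else default to 1 second.
delay = rules.crawl_delay("*") or 1

print(allowed)
print(delay)
```

In a real scraper, `scrape_slowly(allowed, fetch, delay)` would be called with a `fetch` function that actually downloads each page.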

## Scraping Routine

Scraping often involves the following routine: 

- **Step 1:** Find a website with information you want to collect
- **Step 2:** Understand the website
- **Step 3:** Write code to collect one realization of the data
- **Step 4:** Build a scraper -- generalize your code into a function.
- **Step 5:** Save

And repeat!
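
The routine above can be sketched end to end as follows. Everything here is illustrative: `fake_site` is an invented in-memory stand-in for a real website, `fetch` replaces a real HTTP request, and the regular expression is a shortcut that only works because these toy pages are so simple. A real scraper would use an HTML parser such as Beautiful Soup.

```python
import csv
import io
import re

# Steps 1-2 (assumed done): we found a site and learned that every page keeps
# its headline inside an <h2> tag. These pages are made up for illustration.
fake_site = {
    "https://example.com/news/1": "<html><body><h2>Budget approved</h2></body></html>",
    "https://example.com/news/2": "<html><body><h2>New mayor elected</h2></body></html>",
}


def fetch(url):
    """Stand-in for a real HTTP request; returns the page's HTML as text."""
    return fake_site[url]


# Step 3: collect one realization of the data.
def extract_headline(html):
    """Return the text of the first <h2>, or None if the page has none."""
    match = re.search(r"<h2>(.*?)</h2>", html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Step 4: generalize the code into a function that handles any page.
def scrape(url):
    return {"url": url, "headline": extract_headline(fetch(url))}


# And repeat: apply the scraper to every similarly organized page.
rows = [scrape(url) for url in fake_site]

# Step 5: save. An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "headline"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```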

## Step 1: Find a Website... but what is a website? 

A website in general is a combination of **HTML, CSS, XML, PHP, and JavaScript**. We will care mostly about HTML and CSS. 


### Static vs Dynamic Websites

HTML forms what we call **static websites**: everything you see is already there in the page's source code. JavaScript produces **dynamic websites**: the content changes as you browse and click while the URL often stays the same, and these sites are typically powered by a database behind the scenes. 

Today we will deal with static websites using the Python library `Beautiful Soup`. For dynamic websites, we will learn next class how to work with `selenium` in Python. 
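
One practical way to see the difference is to check whether the content you want is already in the downloaded HTML. The sketch below uses two invented pages that would look the same in a browser; `content_in_source` is a hypothetical helper, and the `/api/price` endpoint inside the script is made up.

```python
# Two made-up pages that would look the same in a browser.
static_page = """
<html><body>
  <p id="price">Price: $350,000</p>
</body></html>
"""

dynamic_page = """
<html><body>
  <p id="price"></p>
  <script>
    // The browser runs this after loading; a plain HTML download does not.
    fetch("/api/price").then(r => r.text()).then(t => {
      document.getElementById("price").textContent = t;
    });
  </script>
</body></html>
"""


def content_in_source(html, text):
    """Check whether `text` is already present in the downloaded HTML."""
    return text in html


print(content_in_source(static_page, "$350,000"))   # found in the source
print(content_in_source(dynamic_page, "$350,000"))  # filled in later by JavaScript
```

When the second case applies, downloading the HTML alone is not enough, because the data only appears after a browser runs the script.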

### HTML Website

HTML stands for **HyperText Markup Language**. As the name makes explicit, it is a markup language used to create web pages and is a cornerstone technology of the internet. It is not a programming language like Python, R, or Java. Web browsers read HTML documents and render them into visible or audible web pages.

Here is an example of an HTML file: 


```
<html>
<head>
  <title> Michael Cohen's Email </title>
  <script>
    var foot = bar;
  </script>
</head>
<body>
  <div id="payments">
    <h2>Second heading</h2>
    <p class='slick'>information about <br/><i>payments</i></p>
    <p>Just <a href="http://www.google.com">google it!</a></p>
  </div>
</body>
</html>
```

HTML code is structured using tags, and information is organized hierarchically from top to bottom, with tags nested inside other tags (like a nested list). 
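
To see this hierarchy, the sketch below walks through a small page and records how deeply each tag is nested. It is a simplified illustration using Python's built-in `html.parser` rather than the Beautiful Soup library used in this class; the `TreePrinter` class and the page content are written for this example only.

```python
from html.parser import HTMLParser

html_doc = """
<html>
<head>
  <title>Example page</title>
</head>
<body>
  <div id="payments">
    <h2>Second heading</h2>
    <p class="slick">information about <i>payments</i></p>
    <p>Just <a href="http://www.google.com">google it!</a></p>
  </div>
</body>
</html>
"""


class TreePrinter(HTMLParser):
    """Record each opening tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs, in document order

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TreePrinter()
printer.feed(html_doc)
for depth, tag in printer.outline:
    print("  " * depth + tag)
```

The indentation of the printed outline mirrors the nesting: `a` sits inside `p`, which sits inside `div`, which sits inside `body`.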

Some of the most important tags we will use for scraping are: 


- **p** – paragraphs
- **a** – links (the `href` attribute holds the URL)
- **div** – divisions
- **h1** to **h6** – headings
- **table** – tables

See [here for more about html tags](https://betterprogramming.pub/understanding-html-basics-for-web-scraping-ae351ee0b3f9)

### Scraping is all about finding tags and collecting the data associated with them
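
As a minimal sketch of that idea, the code below finds every `a` and `p` tag in a small snippet and collects the link targets and the paragraph text. It uses Python's built-in `html.parser` so it runs without extra installs; with Beautiful Soup the same task takes fewer lines. The `TagCollector` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser

html_doc = """
<body>
  <h2>Second heading</h2>
  <p class="slick">information about <i>payments</i></p>
  <p>Just <a href="http://www.google.com">google it!</a></p>
</body>
"""


class TagCollector(HTMLParser):
    """Collect link targets from <a> tags and the text inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.paragraphs = []
        self.inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)
        elif tag == "p":
            self.inside_p = True
            self.paragraphs.append("")  # start a new, empty paragraph

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside_p = False

    def handle_data(self, data):
        if self.inside_p:
            # Text of nested tags (<i>, <a>) also belongs to the paragraph.
            self.paragraphs[-1] += data


collector = TagCollector()
collector.feed(html_doc)
print(collector.links)
print(collector.paragraphs)
```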

## Step 2: Understand the website

## Step 3: Collect one realization of the data

Let's 
