# Data Science: Web Scraping
#### By: Javier Orduz
<!--
<img
src="https://jaorduz.github.io/images/Javier%20Orduz_01.jpg" width="50" align="center">
-->

[license-badge]: https://img.shields.io/badge/License-CC-orange
[license]: https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en

[![CC License][license-badge]][license]  [![DS](https://img.shields.io/badge/downloads-DS-green)](https://github.com/Earlham-College/DS_Fall_2022)  [![Github](https://img.shields.io/badge/jaorduz-repos-blue)](https://github.com/jaorduz/)  ![Follow @jaorduc](https://img.shields.io/twitter/follow/jaorduc?label=follow&logo=twitter&logoColor=lkj&style=plastic)

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#about_WS">About Web Scraping</a></li>
<!--         <ol>
            <li><a href="about_dataset"></a> Libraries, modules, and data set</li>
        </ol> -->
        <li><a href="#theWebsite">The Website</a></li>
        <li><a href="#exercies">Exercises</a></li>
<!--         <li><a href="#practice">Practice</a></li> -->
    </ol>
</div>
<br>
<hr>

In [None]:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
%matplotlib inline 
import matplotlib.pyplot as plt
import requests
import json

## Web Scraping

Web scraping is the process of systematically extracting data from websites for analysis using custom software tools. Python has become one of the predominant programming languages used for web scraping due to its accessibility, flexibility, and abundance of specialized libraries dedicated to the task.

### Goals of this notebook

- Provide an overview of web scraping.
- Use a real website to explore the information it contains.

### Advantages of Python

- **Simplicity:** Python is relatively easy to learn, even for beginners. This makes it a good choice for those who are new to web scraping.
- **Flexibility:** Python has a large number of libraries and tools available for web scraping. This makes it possible to scrape data from a wide variety of websites.
- **Efficiency:** Python is usually fast enough for scraping, because most of the running time tends to be spent waiting for the server to respond rather than on computation. This matters when scraping large amounts of data.

### Remarks

- It can be time-consuming to write web scraping scripts.
- Web scraping can be fragile. Websites can change their HTML code at any time, which can break your scraping scripts.
- Web scraping can violate the terms of service of a website.

- Web scrapers built solely for performance may opt for languages like **Java, C#,** or **Go** over Python. Though Python scrapers can be highly performant, compiled languages exceed Python's efficiency for particularly resource-intensive jobs. 
- **Legality and ethics** are also persistent issues for web scraping in general, regardless of language choice. Scrapers introduce scalability concerns, can fail without warning, and may violate a website's terms of service if used carelessly or excessively.

Overall, web scraping with Python is a powerful tool that can be used to collect data from a wide variety of websites. However, it is important to be aware of the potential risks involved.
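One common courtesy related to the terms-of-service remark above is to check a site's `robots.txt` before scraping it. The following is a minimal sketch using the standard-library `urllib.robotparser`; the rules, the `my-scraper` user agent, and the `example.org` URLs are made up for illustration and are not taken from any real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not from a real site).
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(example_rules)  # parse rules directly, no network access needed

# Ask whether a (hypothetical) scraper may fetch each URL.
allowed = parser.can_fetch("my-scraper", "https://example.org/prizes/")
blocked = parser.can_fetch("my-scraper", "https://example.org/private/data")
print(allowed, blocked)
```

For a real site, one would typically call `parser.set_url(...)` and `parser.read()` instead of passing the rules by hand.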

---
---
# The website

Turns out that https://www.nobelprize.org/prizes/lists/all-nobel-prizes/ has the data we want. 

Let's take a look at the [website](https://www.nobelprize.org/prizes/lists/all-nobel-prizes/) and at the HTML under the hood.
<!---
: right-click and click on `inspect` . Try to find structure in the tree-structured HTML.
--->

Keep in mind that the `nobelprize.org` server can be a little slow at times.

<!---
Fortunately, the Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's "Wayback Machine" (great name!).
We'll just give you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API.
--->

#### Review the meaning
What does `Response [200]` mean? Let's google: [`response 200 meaning`](https://www.google.com/search?q=response+200+meaning&oq=response+%5B200%5D+m&aqs=chrome.1.69i57j0l5.6184j0j7&sourceid=chrome&ie=UTF-8). All possible status codes are listed [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
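As a quick offline reference, Python's standard library also knows the standard phrase for each status code. The snippet below is a small sketch using `http.HTTPStatus`; the particular codes chosen are only examples.

```python
from http import HTTPStatus

# Map a few example status codes to their standard reason phrases.
phrases = {code: HTTPStatus(code).phrase for code in (200, 301, 404, 500)}

for code, phrase in phrases.items():
    print(code, phrase)
```

Codes in the 2xx range signal success, 3xx redirection, 4xx a client-side error, and 5xx a server-side error.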

In [None]:
snapshot_url = 'https://www.nobelprize.org/prizes/lists/all-nobel-prizes/'

In [None]:
snapshot = requests.get(snapshot_url)
snapshot

In [None]:
type(snapshot)

Now try to request "www.xoogle.be". What happens?

In [None]:
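# Request an archived copy of a site that probably does not exist
snapshot_url2 = 'http://web.archive.org/web/20180820111639/https://www.xoogle.be'
snapshot = requests.get(snapshot_url2)
snapshot  # displays the response object, including its status code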
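A request can fail either with an error status code or with an exception (for example, a timeout on a slow server). The sketch below shows one possible way to handle both; the helper name `safe_get` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests

def safe_get(url, timeout=10):
    """Return the Response on success, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        return response
    except requests.exceptions.RequestException as error:
        # Covers malformed URLs, connection errors, timeouts, and HTTP errors.
        print(f"Request failed: {error}")
        return None

# A malformed URL fails before any network traffic is sent.
result = safe_get("not-a-url")
```

With this helper, a call such as `safe_get(snapshot_url2)` would return either a usable response or `None`, so later cells can check the result before parsing it.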

## Important information
Always remember “not to be evil” when scraping with requests! If you download multiple pages (as you will on HW1), always put a delay between requests (e.g., `time.sleep(1)`, with the `time` library) so you don’t unwittingly hammer someone’s web server and/or get blocked.
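The delay described above could be sketched as follows. The helper name `fetch_politely` and its parameters are hypothetical; the `fetch` argument is any function that takes a URL, so in this notebook one would pass `requests.get`.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Offline demonstration with a stand-in fetch function (str.upper).
demo_results = fetch_politely(["page-a", "page-b"], str.upper, delay_seconds=0.1)
print(demo_results)
```

A call such as `fetch_politely(list_of_urls, requests.get)` would then download the pages one second apart.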

In [None]:
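# Request the Nobel Prize page again and keep its HTML as one string
snapshot = requests.get(snapshot_url)
raw_html = snapshot.text
print(raw_html[:560])  # preview the first 560 characters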

## Regular Expressions
You can find specific patterns or strings in text by using regular expressions, a pattern-matching mechanism used throughout computer science and programming (it is not specific to Python). Some great resources that we recommend if you are interested (they could be very useful for a homework problem):
- https://docs.python.org/3.3/library/re.html
- https://regexone.com
- https://docs.python.org/3/howto/regex.html.

You describe the sequence you are looking for with the help of regex special characters. Some examples:
- ```\S``` : Matches any character which is not a Unicode whitespace character
- ```\d``` : Matches any Unicode decimal digit 
- ```*``` : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
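As a small illustration of these special characters, here is a sketch that runs on a short sample string rather than on the scraped page; the sample sentence is made up for this example.

```python
import re

# Made-up sample text, used instead of the scraped HTML.
sample = "Marie Curie won in 1903 and 1911"

# \d matches one digit, so four of them in a row match a year.
years = re.findall(r'\d\d\d\d', sample)

# \S* matches zero or more non-whitespace characters after 'Marie '.
full_names = re.findall(r'Marie \S*', sample)

print(years)
print(full_names)
```

Because `*` is greedy, `\S*` keeps consuming characters until it reaches whitespace or the end of the string.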

**Let's find all the occurrences of 'Marie' in our raw_html:**

In [None]:
import re 
from bs4 import BeautifulSoup

In [None]:
re.findall(r'Marie',raw_html)

**Using ```\S``` to match 'Marie' + ' ' + 'any character which is not a Unicode whitespace character':**

In [None]:
re.findall(r'Marie \S',raw_html)

Now we have all our data in the notebook. Unfortunately, it is in the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in.

## Parse the HTML with BeautifulSoup

In [None]:
soup = BeautifulSoup(raw_html, 'html.parser')

Key BeautifulSoup functions we’ll be using in this section:
- **`tag.prettify()`**: Returns a cleaned-up, indented version of the raw HTML, useful for printing
- **`tag.select(selector)`**: Return a list of nodes matching a [CSS selector](https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Simple_selectors)
- **`tag.select_one(selector)`**: Return the first node matching a CSS selector
- **`tag.text/soup.get_text()`**: Returns visible text of a node (e.g.,"`<p>Some text</p>`" -> "Some text")
- **`tag.contents`**: A list of the immediate children of this node

You can also use these functions to find nodes.
- **`tag.find_all(tag_name, attrs=attributes_dict)`**: Returns a list of matching nodes
- **`tag.find(tag_name, attrs=attributes_dict)`**: Returns first matching node

BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
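To see several of these functions side by side, here is a minimal sketch that parses a tiny hand-written HTML snippet. The snippet, its class names, and its `id` values are invented for illustration and do not reflect the structure of the Nobel Prize page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML (an assumption, not the real page structure).
demo_html = """
<html><head><title>Laureates</title></head>
<body>
  <div class="prize" id="physics-1903">
    <p class="name">Marie Curie</p>
    <a href="/physics/1903">Physics 1903</a>
  </div>
  <div class="prize" id="chemistry-1911">
    <p class="name">Marie Curie</p>
    <a href="/chemistry/1911">Chemistry 1911</a>
  </div>
</body></html>
"""
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# select: every node matching a CSS selector
names = [p.text for p in demo_soup.select('div.prize p.name')]

# select_one: only the first match
first_link = demo_soup.select_one('div.prize a')

# find / find_all: search by tag name and attributes
chemistry = demo_soup.find('div', attrs={'id': 'chemistry-1911'})
hrefs = [a.get('href') for a in demo_soup.find_all('a')]

print(names, first_link['href'], chemistry.a.text, hrefs)
```

The same calls work on the `soup` object built from the real page, once you have inspected its HTML to learn which tags and attributes to target.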

### Let's practice some BeautifulSoup commands...

**Print a cleaned-up version of the raw HTML**

In [None]:
print(soup.prettify())

**Find the first “title” object** 

In [None]:
soup.title

**Extract the tag name and the text of the first “title” object**

In [None]:
soup.title.name

In [None]:
soup.title.string

In [None]:
soup.title.parent.name

In [None]:
soup.p

In [None]:
soup.a

In [None]:
soup.find_all('a')

In [None]:
soup.find(id="search-mobile-trigger-js")

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
print(soup.get_text())

# Exercises


1. Run this entire notebook with the previous link and review the results. Then do the same with this link
`http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/` and figure out why there are differences, if any.

1. Find a different website and run the same commands on it.

## References
[1] https://www.crummy.com/software/BeautifulSoup/bs4/doc/