## Web Scraping

Web scraping, also known as web data extraction, is the process of obtaining and structuring data from the web using automation. It can be used to retrieve hundreds, millions, or even billions of data points from the internet's seemingly endless frontier.

Two kinds of programs are often associated with scraping.
- Web crawlers
    - A web crawler, generally referred to as a 'spider', is a program that browses the internet to index and search for content by following links and exploring, like a person with too much time on their hands. This is basically what every search engine, like Google, does. They send these spider bots out to explore, inspect and categorize web pages.
- Web scrapers
    - A web scraper is the program that actually extracts the data from a web page and may be specifically designed with a certain website in mind. Web scrapers vary widely in design and complexity, depending on the project.

Crawling is not a necessary step for data extraction and is often employed in big projects only. Some crawlers, moreover, are not related to data extraction at all.
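
To make the crawler/scraper distinction concrete, here is a minimal sketch using only Python's standard library. The page content, the `example.com` URLs and the class names `LinkCollector` and `TitleScraper` are assumptions made up for illustration. A real program would download the page first and would likely use a dedicated parsing library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Crawler side: collect the links a page points to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))


class TitleScraper(HTMLParser):
    """Scraper side: extract one specific piece of data (the <h1> text)."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data


# Hypothetical page content, hard-coded so the example runs offline.
page = '<h1>Movie reviews</h1><a href="/top">Top</a> <a href="https://example.com/new">New</a>'

crawler = LinkCollector("https://example.com/index.html")
crawler.feed(page)
scraper = TitleScraper()
scraper.feed(page)

print(crawler.links)   # where a crawler could go next
print(scraper.title)   # the data a scraper is after
```

The crawler only cares about where to go next, while the scraper only cares about the data on the current page. That is why small projects targeting a single known page can skip crawling entirely.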

Sometimes scraping is not the most efficient way of obtaining data. Data collection, or better, data exchange, is fundamental to the normal functioning of the modern internet economy. That's why numerous companies provide the data we may be interested in, in a clean and concise way, through APIs.

(rottentomatoes.com/robots.txt)

### APIs
To facilitate data extraction, some data providers maintain APIs.

API - Application Programming Interface

- Basically, an API specifies how software components should interact. You may think of it as a contract between a client and a server. If the client makes a request in a specific format, the server will always respond in a documented format or initiate a defined action.

- Examples of web-based APIs include 
    - Currency exchange rates
    - Job boards
    - Weather forecast
   
- All APIs should have some form of documentation: a file or a web page explaining exactly how to use the API, its response format and so on.



### HTTP

HTTP - HyperText Transfer Protocol
- HTTP specifies how requests and responses are to be formatted and transmitted. We will explore how information is exchanged on the web by looking at HTTP requests.

The two most common request types are
- GET
    - It is primarily used to obtain data from a server.
    - It remains in the browser history and server logs, and can be bookmarked.
    - Sometimes, in order to receive a more specific response, parameters are added directly to the URL.
    - Since the URL is visible, this request should not be used for sensitive information.

- POST
    - It is used to alter state on the server or to send confidential information.
    - Parameters are carried in a separate body rather than in the URL, which keeps them out of the browser history and bookmarks. The body is only protected from prying eyes in transit when the connection uses HTTPS.
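
The difference in where the parameters travel can be sketched without sending anything over the network. The example below builds (but does not send) one request of each type using the standard library. The endpoint `api.example.com` and the parameter names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, used only for illustration.
endpoint = "https://api.example.com/v1/latest"
params = {"base": "EUR", "symbols": "USD,GBP"}

# GET: the parameters become part of the URL (the query string).
get_request = Request(endpoint + "?" + urlencode(params))

# POST: the same parameters travel in the request body instead.
post_request = Request(endpoint, data=urlencode(params).encode("utf-8"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

With the 'requests' library, the same idea is usually written as `requests.get(url, params=params)` and `requests.post(url, data=params)`, which take care of the encoding.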

To indicate whether a request was successful, the response from the server contains a status code. Every status code has a given meaning: for example, 200 means the request succeeded and 404 means the requested resource was not found.
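
Status codes are grouped by their first digit, and Python's standard library ships the standard reason phrases. The helper `describe_status` below is a hypothetical name introduced for this sketch. It looks up the phrase and the broad class of a code.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase and the broad class of a status code."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase} ({classes[code // 100]})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```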

### JSON

JSON - JavaScript Object Notation, so named because it was derived from the JavaScript programming language.

Having a predictable, standard format in which we exchange information with servers is obviously beneficial for everyone involved. One such standard is the JSON format.

The JSON format relies on 3 key concepts:
- It should be easy for humans to read and write;
- It should be easy for programs to process and generate, regardless of the programming language;
- It is exchanged as plain text.

It achieves that by using conventions familiar to almost all programmers, building upon 2 structures.
- Dictionaries (called objects in JSON)
    - A dictionary is a data structure that contains key-value pairs, surrounded by curly brackets. Each unique key is followed by a colon and its value, and the pairs are separated by commas.
- Lists (called arrays in JSON)
    - A list is an ordered collection of items, contained inside square brackets and separated by commas.
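
As a small illustration of the two structures, the hand-written JSON text below (not real API output; the keys and values are made up) nests a list and a dictionary inside an outer dictionary. Parsing it in Python shows how each structure maps to a native type.

```python
import json

# A small hand-written JSON document (illustrative, not real API output).
raw = '{"base": "EUR", "symbols": ["USD", "GBP"], "rates": {"USD": 1.12, "GBP": 0.84}}'

data = json.loads(raw)

# Curly brackets become a dict, square brackets become a list.
print(type(data).__name__)
print(type(data["symbols"]).__name__)

# Nested structures are reached one key (or index) at a time.
for symbol in data["symbols"]:
    print(symbol, data["rates"][symbol])
```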

link - exchangeratesapi.io

link for API data - http://api.exchangeratesapi.io/v1/latest?access_key=ccec7a1a517787e63c7893f1c236d952

Access key - ccec7a1a517787e63c7893f1c236d952

#### Pulling data from public APIs - GET request

In [16]:
base_url = 'http://api.exchangeratesapi.io/v1/latest?access_key=ccec7a1a517787e63c7893f1c236d952'


#### Extracting data on currency exchange rates

#### Send a GET request

This is most easily achieved in Python with the help of the 'requests' library.
After importing the package we have access to its 'get' method.

In [2]:
import requests

The '.get()' method submits a GET request to the indicated URL, and returns the response from the server.

In [3]:
# We have just made a GET request
response = requests.get('http://api.exchangeratesapi.io/v1/latest?access_key=ccec7a1a517787e63c7893f1c236d952')

#### Investigating the response
The requests package provides a couple of very useful attributes for this purpose.
- We can check if the request went through without problems with the '.ok' attribute.
- We can access the status code directly using the '.status_code' attribute.
- The body of the response can be obtained with the '.text' attribute, which returns it as a regular string.
- Alternatively, we can use '.content', which returns it in bytes format.
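
The relationship between the bytes and string forms of a body can be sketched without a live request. Below, a hard-coded bytes value stands in for '.content' (the "symbol" key is a made-up example), and decoding it gives the equivalent of '.text'. This sketch assumes UTF-8; 'requests' itself picks the encoding from the response headers or guesses it.

```python
# Stand-in for response.content: the raw bytes received from the server.
content = b'{"base":"EUR","symbol":"\xe2\x82\xac"}'

# Decoding the bytes gives the equivalent of response.text.
text = content.decode("utf-8")

print(type(content).__name__, len(content))
print(type(text).__name__, len(text))
print(text)
```

The euro sign takes three bytes in UTF-8 but is a single character in the string, so the two lengths differ. The bytes form is mainly useful for non-text payloads such as images.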

In [4]:
response.ok

True

In [5]:
response.status_code

200

In [6]:
response.text

'{"success":true,"timestamp":1637645343,"base":"EUR","date":"2021-11-23","rates":{"AED":4.127998,"AFN":105.372718,"ALL":121.384651,"AMD":537.854917,"ANG":2.032645,"AOA":657.473483,"ARS":112.839339,"AUD":1.556197,"AWG":2.023275,"AZN":1.91859,"BAM":1.956066,"BBD":2.277175,"BDT":96.760979,"BGN":1.954164,"BHD":0.423772,"BIF":2245.421015,"BMD":1.123886,"BND":1.535985,"BOB":7.776339,"BRL":6.279153,"BSD":1.127836,"BTC":1.9881327e-5,"BTN":83.954433,"BWP":13.160088,"BYN":2.822941,"BYR":22028.158732,"BZD":2.273374,"CAD":1.427773,"CDF":2255.086767,"CHF":1.047697,"CLF":0.033082,"CLP":912.83125,"CNY":7.177024,"COP":4412.094089,"CRC":721.877128,"CUC":1.123886,"CUP":29.78297,"CVE":110.27831,"CZK":25.455789,"DJF":200.784234,"DKK":7.436523,"DOP":63.845706,"DZD":156.55166,"EGP":17.663105,"ERN":16.858634,"ETB":54.290353,"EUR":1,"FJD":2.374152,"FKP":0.837783,"GBP":0.83884,"GEL":3.51208,"GGP":0.837783,"GHS":6.910734,"GIP":0.837783,"GMD":58.781261,"GNF":10657.886374,"GTQ":8.725353,"GYD":235.958479,"HKD":8.7

In [7]:
response.content

b'{"success":true,"timestamp":1637645343,"base":"EUR","date":"2021-11-23","rates":{"AED":4.127998,"AFN":105.372718,"ALL":121.384651,"AMD":537.854917,"ANG":2.032645,"AOA":657.473483,"ARS":112.839339,"AUD":1.556197,"AWG":2.023275,"AZN":1.91859,"BAM":1.956066,"BBD":2.277175,"BDT":96.760979,"BGN":1.954164,"BHD":0.423772,"BIF":2245.421015,"BMD":1.123886,"BND":1.535985,"BOB":7.776339,"BRL":6.279153,"BSD":1.127836,"BTC":1.9881327e-5,"BTN":83.954433,"BWP":13.160088,"BYN":2.822941,"BYR":22028.158732,"BZD":2.273374,"CAD":1.427773,"CDF":2255.086767,"CHF":1.047697,"CLF":0.033082,"CLP":912.83125,"CNY":7.177024,"COP":4412.094089,"CRC":721.877128,"CUC":1.123886,"CUP":29.78297,"CVE":110.27831,"CZK":25.455789,"DJF":200.784234,"DKK":7.436523,"DOP":63.845706,"DZD":156.55166,"EGP":17.663105,"ERN":16.858634,"ETB":54.290353,"EUR":1,"FJD":2.374152,"FKP":0.837783,"GBP":0.83884,"GEL":3.51208,"GGP":0.837783,"GHS":6.910734,"GIP":0.837783,"GMD":58.781261,"GNF":10657.886374,"GTQ":8.725353,"GYD":235.958479,"HKD":8.

#### Handling the JSON
The 'requests' library provides us with the '.json()' method, which converts a JSON formatted response to a native Python object.
- The Python 'json' package provides methods for JSON manipulation. Two of the main ones are 'loads' and 'dumps'.
    - loads(string) : converts a JSON formatted string to a Python object
    - dumps(object) : converts a Python object to a JSON formatted string, with options to make it a prettier string.
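
A round trip through 'loads' and 'dumps' can be tried offline. The string below is a shortened stand-in for the response text, keeping only two of the rates. For a well-formed JSON body, calling '.json()' on the response should give the same result as passing its text to 'json.loads'.

```python
import json

# Stand-in for response.text, shortened to two rates.
text = '{"success":true,"base":"EUR","date":"2021-11-23","rates":{"AED":4.127998,"AFN":105.372718}}'

parsed = json.loads(text)       # JSON string -> Python dict
compact = json.dumps(parsed)    # Python dict -> JSON string on one line
pretty = json.dumps(parsed, indent=4, sort_keys=True)  # indented, keys sorted

print(parsed["success"])  # JSON true has become Python True
print(compact)
print(pretty)
```

The 'indent' and 'sort_keys' options only change how the string looks; parsing either version gives back the same dictionary.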

In [8]:
response.json()

{'success': True,
 'timestamp': 1637645343,
 'base': 'EUR',
 'date': '2021-11-23',
 'rates': {'AED': 4.127998,
  'AFN': 105.372718,
  'ALL': 121.384651,
  'AMD': 537.854917,
  'ANG': 2.032645,
  'AOA': 657.473483,
  'ARS': 112.839339,
  'AUD': 1.556197,
  'AWG': 2.023275,
  'AZN': 1.91859,
  'BAM': 1.956066,
  'BBD': 2.277175,
  'BDT': 96.760979,
  'BGN': 1.954164,
  'BHD': 0.423772,
  'BIF': 2245.421015,
  'BMD': 1.123886,
  'BND': 1.535985,
  'BOB': 7.776339,
  'BRL': 6.279153,
  'BSD': 1.127836,
  'BTC': 1.9881327e-05,
  'BTN': 83.954433,
  'BWP': 13.160088,
  'BYN': 2.822941,
  'BYR': 22028.158732,
  'BZD': 2.273374,
  'CAD': 1.427773,
  'CDF': 2255.086767,
  'CHF': 1.047697,
  'CLF': 0.033082,
  'CLP': 912.83125,
  'CNY': 7.177024,
  'COP': 4412.094089,
  'CRC': 721.877128,
  'CUC': 1.123886,
  'CUP': 29.78297,
  'CVE': 110.27831,
  'CZK': 25.455789,
  'DJF': 200.784234,
  'DKK': 7.436523,
  'DOP': 63.845706,
  'DZD': 156.55166,
  'EGP': 17.663105,
  'ERN': 16.858634,
  'ETB': 54.

In [9]:
type(response.json())

dict

In [10]:
import json

In [11]:
# The indent parameter sets the number of spaces used per indentation level.
json.dumps(response.json(), indent=4)

'{\n    "success": true,\n    "timestamp": 1637645343,\n    "base": "EUR",\n    "date": "2021-11-23",\n    "rates": {\n        "AED": 4.127998,\n        "AFN": 105.372718,\n        "ALL": 121.384651,\n        "AMD": 537.854917,\n        "ANG": 2.032645,\n        "AOA": 657.473483,\n        "ARS": 112.839339,\n        "AUD": 1.556197,\n        "AWG": 2.023275,\n        "AZN": 1.91859,\n        "BAM": 1.956066,\n        "BBD": 2.277175,\n        "BDT": 96.760979,\n        "BGN": 1.954164,\n        "BHD": 0.423772,\n        "BIF": 2245.421015,\n        "BMD": 1.123886,\n        "BND": 1.535985,\n        "BOB": 7.776339,\n        "BRL": 6.279153,\n        "BSD": 1.127836,\n        "BTC": 1.9881327e-05,\n        "BTN": 83.954433,\n        "BWP": 13.160088,\n        "BYN": 2.822941,\n        "BYR": 22028.158732,\n        "BZD": 2.273374,\n        "CAD": 1.427773,\n        "CDF": 2255.086767,\n        "CHF": 1.047697,\n        "CLF": 0.033082,\n        "CLP": 912.83125,\n        "CNY": 7.1770

In [13]:
# To see the line breaks and indentation rendered, we need to print the string
print(json.dumps(response.json(), indent=4))

{
    "success": true,
    "timestamp": 1637645343,
    "base": "EUR",
    "date": "2021-11-23",
    "rates": {
        "AED": 4.127998,
        "AFN": 105.372718,
        "ALL": 121.384651,
        "AMD": 537.854917,
        "ANG": 2.032645,
        "AOA": 657.473483,
        "ARS": 112.839339,
        "AUD": 1.556197,
        "AWG": 2.023275,
        "AZN": 1.91859,
        "BAM": 1.956066,
        "BBD": 2.277175,
        "BDT": 96.760979,
        "BGN": 1.954164,
        "BHD": 0.423772,
        "BIF": 2245.421015,
        "BMD": 1.123886,
        "BND": 1.535985,
        "BOB": 7.776339,
        "BRL": 6.279153,
        "BSD": 1.127836,
        "BTC": 1.9881327e-05,
        "BTN": 83.954433,
        "BWP": 13.160088,
        "BYN": 2.822941,
        "BYR": 22028.158732,
        "BZD": 2.273374,
        "CAD": 1.427773,
        "CDF": 2255.086767,
        "CHF": 1.047697,
        "CLF": 0.033082,
        "CLP": 912.83125,
        "CNY": 7.177024,
        "COP": 4412.094089,
       

In [14]:
# Get the top-level keys of the response dictionary
response.json().keys()

dict_keys(['success', 'timestamp', 'base', 'date', 'rates'])
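
Once the keys are known, the nested 'rates' dictionary can be used directly. The sketch below hard-codes a few values copied from the response shown above so that it runs offline. The function name `convert` is a hypothetical helper, and it assumes amounts are given in the base currency (EUR here).

```python
# A few values copied from the response shown above (base currency: EUR).
data = {
    "base": "EUR",
    "date": "2021-11-23",
    "rates": {"AUD": 1.556197, "CAD": 1.427773, "GBP": 0.83884, "EUR": 1},
}


def convert(amount, target, payload):
    """Convert an amount in the base currency into the target currency."""
    rates = payload["rates"]
    if target not in rates:
        raise KeyError(f"No rate available for {target!r}")
    return amount * rates[target]


print(sorted(data["rates"]))                 # currency codes available
print(round(convert(100, "GBP", data), 2))   # 100 EUR expressed in GBP
```

With a live response, `data` would simply be `response.json()`.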