# Chapter 1: Web scraping and APIs

In this chapter we go deeper into scraping methodology. We first return to Beautiful Soup with a more complex example, then discuss the advantages of Scrapy and Selenium, and finally move on to the API part of the course.

Structure:
- [Web development tools in the web browser](#WB)
- [Beautiful Soup](#BS)
- [Scrapy](#Scrapy)
- [Selenium](#Selenium)
- [APIs](#APIs)
- [Rules of good conduct](#rules)
- [TODO](#TODO)

<a name="WB"></a>
## Web development tools in the web browser

In the introduction we saw how to download HTML with requests, parse it with Beautiful Soup (BS), and read it with prettify. In practice it can be hard to find what you want this way. Another option is to use the web development tools available in every browser (Google Chrome, Mozilla Firefox, ...). Here is a quick introduction to what they offer and how they can help you (note: I'll be using Firefox; shortcuts may differ in other browsers).

- Ctrl+U -> View the source code of the page (what we get with requests.get())
- Ctrl+Shift+C -> Inspector: hover over an element to see where it sits in the HTML
- Shift+F7 -> Style editor: check the CSS the page is using
- Ctrl+Shift+E -> Network: see what is loaded when opening a page (important for JS)

<a name="BS"></a>
## Beautiful Soup

Now that we have reviewed the basic usage of BS from last year, let's move on to a bigger project. You are probably familiar with the six degrees of separation: the idea that the number of "steps" (friends of friends) between any two individuals is six or fewer. Our goal is to scrape a Wikipedia page, collect all its hrefs, and repeat the process up to six layers deep. (This idea comes from Mitchell R., Web Scraping with Python.)

In [1]:
import requests 
from bs4 import BeautifulSoup
import re # regular expressions
import tqdm.notebook as tq # time loop in notebook

In [None]:
# Starting from the wikipedia page of Kevin Bacon
starting_url = "https://en.wikipedia.org/wiki/Kevin_Bacon"

# Get html content
response = requests.get(starting_url)
result = response.content

# Parse html with BS
soup = BeautifulSoup(result, 'html.parser')

# In the body content, find all hrefs matching the regex (start with /wiki/ and contain no ':' afterwards, to avoid artifacts like image files)
for link in soup.find("div",attrs={'id':'bodyContent'}).find_all("a",href = re.compile("^(/wiki/)((?!:).)*$")):
    print(link.get("href"))
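
To see what the regular expression keeps and drops, it can be tried on a handful of hrefs without any network access. This is a minimal sketch: the sample hrefs are made up for illustration, only the pattern comes from the cell above.

```python
import re

# Same pattern as in the cell above: starts with /wiki/ and has no ':' afterwards
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical hrefs, similar to what a Wikipedia page contains
sample_hrefs = [
    "/wiki/Footloose_(1984_film)",     # article: kept
    "/wiki/Kyra_Sedgwick",             # article: kept
    "/wiki/File:Kevin_Bacon.jpg",      # namespace page (contains ':'): dropped
    "/wiki/Category:American_actors",  # namespace page: dropped
    "#cite_note-1",                    # in-page anchor: dropped
    "https://www.imdb.com/",           # external link: dropped
]

article_hrefs = [href for href in sample_hrefs if ARTICLE_LINK.match(href)]
print(article_hrefs)
```

The negative lookahead `(?!:)` is checked before every character, so a single colon anywhere after `/wiki/` is enough to reject the link.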

In [2]:
# Using function so that it is cleaner


def Get_hrefs(url):
    # Request url and create bs object.
    response = requests.get(url)
    result = response.content    
    soup = BeautifulSoup(result, 'html.parser')
    
    # init the list with all href
    hrefs = []
    for link in soup.find("div",attrs={'id':'bodyContent'}).find_all("a",href = re.compile("^(/wiki/)((?!:).)*$")):
        if "href" in link.attrs:
            if link.get("href") not in hrefs:
                hrefs.append(link.get("href"))
    return(hrefs)
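
The hrefs returned by this function are relative paths such as `/wiki/...`, so they have to be turned into full URLs before being requested. The cells below do this by string concatenation; a possible alternative, sketched here with the standard library, is `urllib.parse.urljoin`, which also leaves already absolute links untouched. The helper name `to_absolute` and the sample hrefs are assumptions made for the example.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Kevin_Bacon"

def to_absolute(href, base=BASE_URL):
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(base, href)

relative = to_absolute("/wiki/Kyra_Sedgwick")
external = to_absolute("https://www.imdb.com/")
print(relative)
print(external)
```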


In [3]:
# depth = number of times we get the hrefs of the hrefs.
# We limit it to 2 so as not to overload Wikipedia, but in theory a depth of 6 could reach every person.
depth = 2

# hrefs_checked = keeping track of href already visited
hrefs_checked = []

for i in tq.tqdm(range(depth)):
    # First iteration start from Kevin Bacon
    if i == 0:
        starting_url = "https://en.wikipedia.org/wiki/Kevin_Bacon"
        hrefs = Get_hrefs(starting_url)
        hrefs_checked.append(starting_url)
    else:
        hrefs_temp = []
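        for starting_url in tq.tqdm(hrefs):
            url = "https://en.wikipedia.org" + starting_url
            # Skip URLs already visited. A list lookup could become inefficient
            if url not in hrefs_checked:
                hrefs_temp += Get_hrefs(url)
                # Remember the page so that it is not requested again
                hrefs_checked.append(url)
        # hrefs are relative paths while hrefs_checked holds full URLs
        hrefs = [href for href in hrefs_temp if "https://en.wikipedia.org" + href not in hrefs_checked]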

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/273 [00:00<?, ?it/s]
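
The layer-by-layer loop above is a breadth-first traversal of the link graph. The sketch below shows the same idea with a queue and a dictionary of visited pages, which makes the "already visited?" check fast. To keep it runnable offline it uses a tiny made-up link graph (`TOY_LINKS`) instead of real requests; `crawl` and `get_links` are hypothetical names, and in practice `get_links` would call something like `Get_hrefs`.

```python
from collections import deque

# Hypothetical link graph standing in for Wikipedia: page -> pages it links to
TOY_LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/D"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": ["/wiki/E"],
    "/wiki/E": [],
}

def crawl(start, max_depth, get_links):
    """Breadth-first crawl; returns {page: depth at which it was first seen}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = seen[page]
        if depth == max_depth:
            continue  # do not expand pages that sit at the depth limit
        for link in get_links(page):
            if link not in seen:  # dictionary lookup, no list scan
                seen[link] = depth + 1
                queue.append(link)
    return seen

depths = crawl("/wiki/A", max_depth=2, get_links=lambda page: TOY_LINKS.get(page, []))
print(depths)
```

With `max_depth=2`, page `/wiki/E` is never reached because it is three links away from the start.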

In [26]:
# A step further we want to process text and save it in MongoDB
# Also short intro into classes
import pymongo 

class crawler:
    def __init__(self,starting_url, depth, mongo_uri, db_name, collection_name ):
        self.starting_url = starting_url
        self.depth = depth
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.hrefs_checked = []
        self.n_processed = 0
        
    def Get_hrefs(self,url):

        hrefs = []
        for link in self.soup.find("div",attrs={'id':'bodyContent'}).find_all("a",href = re.compile("^(/wiki/)((?!:).)*$")):
            if "href" in link.attrs:
                if link.get("href") not in hrefs:
                    hrefs.append(link.get("href"))
        return(hrefs)
    
    def parse_url(self): 
        full_text = ""
        for para in self.soup.find_all("p"):
            full_text += para.text + " "
        return(full_text)
    
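    def save2mongo(self):
        # insert_many() rejects an empty list, so skip when there is nothing to save
        if not self.list_of_insertion:
            return
        Client = pymongo.MongoClient(self.mongo_uri)
        db = Client[self.db_name]
        collection = db[self.collection_name]
        
        collection.insert_many(self.list_of_insertion)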
        
    def run_analysis(self):
        
        self.list_of_insertion = []
        
        for i in tq.tqdm(range(self.depth)):
            # First iteration start from Kevin Bacon
            if i == 0:
                response = requests.get(self.starting_url)
                result = response.content    
                self.soup = BeautifulSoup(result, 'html.parser')
                hrefs = self.Get_hrefs(self.starting_url)
                text = self.parse_url()
                self.hrefs_checked.append(self.starting_url)
                self.n_processed += 1
                self.list_of_insertion.append({"id":self.n_processed, "text" : text, "href":self.starting_url})
            else:
                hrefs_temp = []
                for starting_url in tq.tqdm(hrefs):
                    url = "https://en.wikipedia.org" + starting_url
                    # Skip pages that were already crawled
                    if url in self.hrefs_checked:
                        continue
                    response = requests.get(url)
                    result = response.content    
                    self.soup = BeautifulSoup(result, 'html.parser')
                    hrefs_temp += self.Get_hrefs(url)
                    text = self.parse_url()
                    self.hrefs_checked.append(url)
                    self.n_processed += 1
                    self.list_of_insertion.append({"id":self.n_processed, "text" : text, "href":url})
                    if len(self.list_of_insertion) % 200 == 0:
                        self.save2mongo()
                        self.list_of_insertion = []
                # hrefs are relative paths while self.hrefs_checked holds full URLs
                hrefs = [href for href in hrefs_temp if "https://en.wikipedia.org" + href not in self.hrefs_checked]
        self.save2mongo()


In [None]:
crawl = crawler(starting_url="https://en.wikipedia.org/wiki/Kevin_Bacon", depth = 3, mongo_uri = 'mongodb://localhost:27017', db_name = "M2", collection_name="BS")
crawl.run_analysis()

<a name="Scrapy"></a>
## Scrapy

Although BS works well on small examples, larger projects require extra work to keep the code well structured. This overhead can be avoided with Scrapy, another Python web scraping library. BS also does not support XPath expressions, which are often a cleaner way to find elements. The entry cost of Scrapy is high, but once mastered it will help you a lot in your scraping work. We will reproduce the BS Wikipedia code using Scrapy. As always, installation is straightforward:

```console
pip install scrapy
```

Scrapy works by first creating a project. Go to the folder that will contain the project and run the following in a terminal/command prompt:

```console
scrapy startproject scrapy_wiki
```

For the moment, don't look too closely at the folder that was created. We first want to create a script, called a "spider" in Scrapy terminology, which will be your main script at the beginning:

```console
cd scrapy_wiki
scrapy genspider spider_wikipedia wikipedia.org
```

At the end you should have the following structure

```
scrapy.cfg
scrapy_wiki
│   __init__.py
│   items.py
│   middlewares.py
│   pipelines.py
│   settings.py
│
└───spiders
    │   __init__.py
    │   spider_wikipedia.py
```

At any point while writing code you can use the Scrapy shell. It lets you try small examples and test selectors without running the whole spider.

```console
scrapy shell
fetch("https://en.wikipedia.org/wiki/Kevin_Bacon")
view(response)
hrefs = response.xpath("//div[@id='bodyContent']//a[@href[re:test(.,'^(/wiki/)((?!:).)*$')]]/@href").getall()
print(hrefs)
```

There's a lot to go through, so let's start with spider_wikipedia.py. It should look like this:


In [None]:
# spider_wikipedia.py 

import scrapy

# A class that inherits from scrapy.Spider. We will see in Chap. 2 what inheritance is; for the moment just know that we "inherit" methods from the class scrapy.Spider.
# This means that some functions and features are already implemented and usable.
class SpiderWikipediaSpider(scrapy.Spider):
    # the name we gave when creating the spider
    name = 'spider_wikipedia'
    # If you try to scrape a URL outside of allowed_domains, it won't work
    allowed_domains = ['wikipedia.org']
    # The first URL you will parse
    start_urls = ['http://wikipedia.org/']

    # What you do with the first URL; response = what we would get with requests.get()
    def parse(self, response):
        pass


Let's start small: how do we change this code to get the hrefs and iterate the process?

In [None]:
# spider_wikipedia.py 

import scrapy
import time
# A class that inherits from scrapy.Spider. We will see in Chap. 2 what inheritance is; for the moment just know that we "inherit" methods from the class scrapy.Spider.
# This means that some functions and features are already implemented and usable.
class SpiderWikipediaSpider(scrapy.Spider):
    # the name we gave when creating the spider
    name = 'spider_wikipedia'
    # If you try to scrape a URL outside of allowed_domains, it won't work
    allowed_domains = ['wikipedia.org']
    # The first URL you will parse
    start_urls = ["https://en.wikipedia.org/wiki/Kevin_Bacon"]

    # What you do with the first URL; response = what we would get with requests.get()
    def parse(self, response):
        # Pause between pages because we are nice. Note that time.sleep() blocks Scrapy
        # entirely; the DOWNLOAD_DELAY setting is the usual way to slow a spider down.
        time.sleep(10)
        # Same XPath as in the Scrapy shell example
        hrefs = response.xpath("//div[@id='bodyContent']//a[@href[re:test(.,'^(/wiki/)((?!:).)*$')]]/@href").getall()
        full_text = ""
        for para in response.xpath("//p/text()").getall():
            full_text += para + " "
        
        # Scrapy works based on scrapy request, the most important argument being callback
        # Basically you get an url and call a function to work on this url (in this case the same as for starting_url: parse())
        for url in hrefs:
            yield(scrapy.Request(url="https://en.wikipedia.org" + url, callback=self.parse))

To run it, go to the spiders folder and run in a console:

```console
scrapy runspider spider_wikipedia.py
```

At this point if you run the spider it won't give you any results and your terminal should look like this:

![robots](img/robots.png)

There seems to be an issue with the following URL: https://fr.wikipedia.org/robots.txt. 
It turns out websites don't like robots scraping them, and Scrapy's default behavior is to respect their rules. Indeed, if you look at scrapy_wiki/settings.py you'll find the following:

```
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
```

robots.txt files contain crucial information, but for the sake of the tutorial, and since our goal is not to overload Wikipedia's servers, we will set this to False:

```
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```
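
Even with ROBOTSTXT_OBEY turned off, it is worth knowing what a robots.txt file allows. The sketch below uses Python's standard `urllib.robotparser` on a small made-up robots.txt (not Wikipedia's real file) to check whether a given user agent may fetch a URL; the agent name `my-crawler` and the `example.org` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not the real Wikipedia file
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_article = parser.can_fetch("my-crawler", "https://example.org/wiki/Kevin_Bacon")
allowed_private = parser.can_fetch("my-crawler", "https://example.org/private/page")
delay = parser.crawl_delay("my-crawler")
print(allowed_article, allowed_private, delay)
```

To check a live site instead, `parser.set_url(...)` followed by `parser.read()` downloads the file before calling `can_fetch`.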

If you run the spider now it should work, but since we did not add a stopping condition it will run forever, so don't do it! 
Now that we have introduced scrapy_wiki/settings.py, let's look at the settings before going into more detail on the spider. The file contains many commented-out lines: features you can enable or disable to avoid writing complex code yourself.


In [None]:
# Scrapy settings for scrapy_wiki project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_wiki'

SPIDER_MODULES = ['scrapy_wiki.spiders']
NEWSPIDER_MODULE = 'scrapy_wiki.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_wiki (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_wiki.middlewares.ScrapyWikiSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy_wiki.middlewares.ScrapyWikiDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapy_wiki.pipelines.ScrapyWikiPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
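
Two concerns raised so far, stopping the crawl and slowing it down, can be handled from settings.py alone. The excerpt below is a sketch of what could be added to the file: the setting names exist in Scrapy, but the values are arbitrary choices for this tutorial and should be adapted.

```python
# settings.py (excerpt, values chosen for illustration)

# Do not follow links more than 2 hops away from the start URL
DEPTH_LIMIT = 2

# Close the spider once this many responses have been crawled
CLOSESPIDER_PAGECOUNT = 500

# Wait this many seconds between two requests to the same website
DOWNLOAD_DELAY = 3

# Let Scrapy adjust the delay according to how fast the server responds
AUTOTHROTTLE_ENABLED = True
```

With DEPTH_LIMIT set, the spider stops on its own, and DOWNLOAD_DELAY makes a manual pause inside parse() unnecessary.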

Settings are not the only place where you can find nice features. You have already seen that allowed_domains keeps the spider from wandering off to unrelated websites; another example is the duplicate filter, which skips URLs that were already requested. Adding dont_filter=True to scrapy.Request() bypasses this filtering for that request. See https://doc.scrapy.org/en/latest/topics/request-response.html#request-objects for more scrapy.Request() arguments; you will probably find some that are useful for you.

We have talked about settings.py and spiders, but there are still several files left. Why are they here? As said above, Scrapy is meant to give you more structured code rather than a single file with every operation in it.

- items.py is made to handle and restrict the data retrieved from your request.
- pipelines.py will process items (clean, saving in mongo, ...).
- middlewares.py will process request and response.

Let's start with items.py

In [None]:
# items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyWikiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


We want to create an item that stores the text, an id and the href.

In [None]:
# items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyWikiItem(scrapy.Item):
    # scrapy.Field() is the basic field object (see the documentation)
    id_ = scrapy.Field()
    href = scrapy.Field()
    text = scrapy.Field()


We can then modify the spider spider_wikipedia.py:

In [None]:
# spider_wikipedia.py 

import scrapy
from scrapy_wiki.items import ScrapyWikiItem


# A class that inherits from scrapy.Spider. We will see in Chap. 2 what inheritance is; for the moment just know that we "inherit" methods from the class scrapy.Spider.
# This means that some functions and features are already implemented and usable.

class SpiderWikipediaSpider(scrapy.Spider):
    # the name we gave when creating the spider
    name = 'spider_wikipedia'
    # If you try to scrape a URL outside of allowed_domains, it won't work
    allowed_domains = ['wikipedia.org']
    # The first URL you will parse
    start_urls = ["https://en.wikipedia.org/wiki/Kevin_Bacon"]
    # create a counter
    n_processed = 0

    # What you do with the first URL; response = what we would get with requests.get()
    def parse(self, response):

        hrefs = response.xpath("//div[@id='bodyContent']//a[@href[re:test(.,'^(/wiki/)((?!:).)*$')]]/@href").getall()
        # update counter
        self.n_processed += 1
        # create instance of item
        item = ScrapyWikiItem()
        item["href"] = response.url
        item["text"] = response.xpath("//p/text()").getall()
        item["id_"] = self.n_processed
        
        # Scrapy works based on scrapy request, the most important argument being callback
        # Basically you get an url and call a function to work on this url (in this case the same as for starting_url: parse())
        for url in hrefs:
            # meta if you want to update item as you go along, in this case not needed
            yield scrapy.Request(url="https://en.wikipedia.org" + url, callback=self.parse,meta={'item': item})

Notice that we do not process the text yet; we will use pipelines.py for that. For the moment the pipeline simply returns the item (see below).

In [None]:
# pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ScrapyWikiPipeline(object):
    def process_item(self, item, spider):
        return item


Now we want to clean the text and store it in MongoDB. We start from the code given in the documentation (https://docs.scrapy.org/en/latest/topics/item-pipeline.html) and just add a clean_text function.

In [None]:
# pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


import re
import pymongo
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MongoPipeline:

    collection_name = 'scrapy'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        item["text"] = self.clean_text(item["text"])
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item
    
    
    def clean_text(self,text):
        full_text = re.sub("\n",""," ".join(text))
        return full_text
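
The clean_text method above joins the text fragments and deletes newlines. A possible variant, sketched below as a standalone function so that it can be tried without Scrapy or MongoDB, collapses every run of whitespace into a single space instead, which avoids gluing two words together when a newline was the only separator. The sample fragments are made up.

```python
import re

def clean_text(fragments):
    """Join text fragments and normalise whitespace (illustrative variant)."""
    joined = " ".join(fragments)
    # Collapse newlines, tabs and repeated spaces into a single space
    return re.sub(r"\s+", " ", joined).strip()

# Hypothetical output of response.xpath("//p/text()").getall()
sample = ["Kevin Bacon is an American actor.\n", "He starred in", "  Footloose.\n"]
cleaned = clean_text(sample)
print(cleaned)
```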

You also need to enable the pipeline and give the database name and URI in the settings:

In [None]:
# settings.py

ITEM_PIPELINES = {
    'scrapy_wiki.pipelines.MongoPipeline': 300,
}

MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "M2"

Finally, you need to add a yield to the spider so that the item is sent to the pipeline:

In [None]:
# spider_wikipedia.py 

import scrapy
from scrapy_wiki.items import ScrapyWikiItem


# A class that inherits from scrapy.Spider. We will see in Chap. 2 what inheritance is; for the moment just know that we "inherit" methods from the class scrapy.Spider.
# This means that some functions and features are already implemented and usable.

class SpiderWikipediaSpider(scrapy.Spider):
    # the name we gave when creating the spider
    name = 'spider_wikipedia'
    # If you try to scrape a URL outside of allowed_domains, it won't work
    allowed_domains = ['wikipedia.org']
    # The first URL you will parse
    start_urls = ["https://en.wikipedia.org/wiki/Kevin_Bacon"]
    # create a counter
    n_processed = 0

    # What you do with the first URL; response = what we would get with requests.get()
    def parse(self, response):

        hrefs = response.xpath("//div[@id='bodyContent']//a[@href[re:test(.,'^(/wiki/)((?!:).)*$')]]/@href").getall()
        # update counter
        self.n_processed += 1
        # create instance of item
        item = ScrapyWikiItem()
        item["href"] = response.url
        item["text"] = response.xpath("//p/text()").getall()
        item["id_"] = self.n_processed
        
        yield item
        # Scrapy works based on scrapy request, the most important argument being callback
        # Basically you get an url and call a function to work on this url (in this case the same as for starting_url: parse())
        for url in hrefs:
            # meta if you want to update item as you go along, in this case not needed
            yield scrapy.Request(url="https://en.wikipedia.org" + url, callback=self.parse,meta={'item': item})

And that's it! You have your first Scrapy project! There is of course much more to see, and we still haven't talked about middlewares.py, but we stop here for the moment.
So now you might think you have all the tools to scrape websites. Well, think again! Let's use the Scrapy shell to see what you get when scraping Twitch, for example:


```console
scrapy shell
fetch("https://www.twitch.tv/")
view(response)
```

![twitch](img/twitch.png)


It seems the page does not load. This is due to JavaScript. At one time the web was only HTML and CSS, and scraping was easier. Today almost every website has some JavaScript running in the background to make it dynamic, which makes it hard for BS and Scrapy to find what they are looking for. This is where another library comes in: Selenium.

<a name="Selenium"></a>
## Selenium

Selenium is meant to act as if a human were using a web browser. This means you need a web browser for it to work (we will use Firefox, but Chrome or others are fine too) and a driver specific to that browser (geckodriver for Firefox). Download geckodriver here: https://github.com/mozilla/geckodriver/releases. Let's start again with Twitch. When you start Selenium code, a browser window should open (default behavior that can be changed); this window is called the "marionette".

In [5]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start up the marionette
driver = webdriver.Firefox()
# go to this page
driver.get("https://www.twitch.tv/directory")

# Get infos
publications_href = driver.find_elements(By.XPATH, "//a[@class='ScCoreLink-sc-udwpw5-0 lpnppF tw-link']")
urls = [ref.get_attribute('href') for ref in publications_href]
print(urls)
# Close marionette
#driver.close()



[]


In [6]:
publications_href = driver.find_elements(By.XPATH, "//a[@class='ScCoreLink-sc-udwpw5-0 lpnppF tw-link']")
urls = [ref.get_attribute('href') for ref in publications_href]
print(urls)

['https://www.twitch.tv/directory/game/Just%20Chatting', 'https://www.twitch.tv/directory/game/Just%20Chatting', 'https://www.twitch.tv/directory/game/Fortnite', 'https://www.twitch.tv/directory/game/Fortnite', 'https://www.twitch.tv/directory/game/League%20of%20Legends', 'https://www.twitch.tv/directory/game/League%20of%20Legends', 'https://www.twitch.tv/directory/game/Rocket%20League', 'https://www.twitch.tv/directory/game/Rocket%20League', 'https://www.twitch.tv/directory/game/Grand%20Theft%20Auto%20V', 'https://www.twitch.tv/directory/game/Grand%20Theft%20Auto%20V', 'https://www.twitch.tv/directory/game/Minecraft', 'https://www.twitch.tv/directory/game/Minecraft', 'https://www.twitch.tv/directory/game/Call%20of%20Duty%3A%20Warzone', 'https://www.twitch.tv/directory/game/Call%20of%20Duty%3A%20Warzone', 'https://www.twitch.tv/directory/game/VALORANT', 'https://www.twitch.tv/directory/game/VALORANT', 'https://www.twitch.tv/directory/game/Apex%20Legends', 'https://www.twitch.tv/directo

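The first cell printed an empty list because the links had not been loaded yet when we queried the page; running the same query a moment later returned them. From this short example you can already see two problems that come from simulating real behavior: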

- The loading time
- Scrolling to load

The loading time can easily be handled by adding a wait condition:

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait # New import

# Start up the marionette
driver = webdriver.Firefox()

# go to this page
driver.get("https://www.twitch.tv/directory")
# Condition: wait for the element; if it is not found after 10 seconds, a TimeoutException is raised
WebDriverWait(driver, 10).until(lambda driver: driver.find_elements(By.XPATH, "//a[@class='ScCoreLink-sc-udwpw5-0 lpnppF tw-link']"))

# Get infos
publications_href = driver.find_elements(By.XPATH, "//a[@class='ScCoreLink-sc-udwpw5-0 lpnppF tw-link']")
urls = [ref.get_attribute('href') for ref in publications_href]
print(urls)
# Close marionette
#driver.close()
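
Conceptually, the wait condition above keeps calling the function until it returns something truthy or the timeout expires. The sketch below reproduces that polling idea in plain Python so it can be run without a browser. It is a simplification for illustration, not Selenium's actual implementation; `wait_until` and `fake_find_elements` are hypothetical names.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met after {timeout} seconds")
        time.sleep(poll_interval)

# Stand-in for a page whose links only show up on the third look
attempts = {"count": 0}

def fake_find_elements():
    attempts["count"] += 1
    return ["link-1", "link-2"] if attempts["count"] >= 3 else []

links = wait_until(fake_find_elements, timeout=2.0, poll_interval=0.01)
print(links, attempts["count"])
```

An empty list is falsy in Python, which is why passing `find_elements` directly as the condition works: the wait ends as soon as at least one element is found.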

Scrolling takes a bit more code to deal with:

In [7]:
#%% scrolling function example

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import time

def scrolldown(driver, bottom=False, n=0):
    # bottom=True: scroll until the page stops growing; otherwise scroll n times
    SCROLL_PAUSE_TIME = 2
    if bottom:
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            # Scroll down to bottom
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # Wait to load page
            time.sleep(SCROLL_PAUSE_TIME)
            # Calculate new scroll height and compare with last scroll height
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
    else:
        for i in range(n):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(SCROLL_PAUSE_TIME)

driver = webdriver.Firefox()
driver.get("https://twitter.com/anacondainc")
scrolldown(driver,bottom=False,n=10)
driver.close()
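
The "scroll to the bottom" branch stops when the page height no longer changes after a scroll. The sketch below isolates that stopping rule and adds a safety cap on the number of scrolls, since some pages never stop loading. It takes two callables instead of a driver so it can be tested without a browser; `scroll_until_stable` and `FakePage` are hypothetical names introduced for this example.

```python
import time

def scroll_until_stable(get_height, scroll_to_bottom, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = get_height()
    for n_scrolls in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # give the page time to load new content
        new_height = get_height()
        if new_height == last_height:
            return n_scrolls
        last_height = new_height
    return max_rounds

class FakePage:
    """Stand-in for a page that loads more content on the first 3 scrolls."""
    def __init__(self):
        self.height = 1000
        self.loads_left = 3

    def scroll(self):
        if self.loads_left > 0:
            self.height += 500
            self.loads_left -= 1

page = FakePage()
n_scrolls = scroll_until_stable(lambda: page.height, page.scroll, pause=0)
print(n_scrolls, page.height)
```

With a real driver, `get_height` would wrap `driver.execute_script("return document.body.scrollHeight")` and `scroll_to_bottom` would wrap the `window.scrollTo` call used above.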


Although this function is fairly general, it does not work in some specific cases and you will need to adapt. ActionChains might come in handy:

In [3]:
#%% this scrolling does not work in all case:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait    
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Firefox()
driver.get("https://www.twitch.tv/directory")
WebDriverWait(driver, 10).until(lambda driver: driver.find_elements(By.XPATH, "//a[@class='ScCoreLink-sc-udwpw5-0 lpnppF tw-link']"))

def scroll_twitch(driver,n=0):
    # find_element_by_xpath() was removed in Selenium 4; use find_element(By.XPATH, ...)
    element = driver.find_element(By.XPATH, "//div[@class='Layout-sc-nxg1ff-0 imInLb']//h1[@class='CoreText-sc-cpl358-0 ScTitleText-sc-1gsen4-0 ipNmNI tw-title']")
    element.click()
    for i in range(n):
        try:
            action_chains = ActionChains(driver)
            action_chains.send_keys(Keys.PAGE_DOWN).perform()
            time.sleep(2)
        except Exception as e:
            print(str(e))
            

scroll_twitch(driver,10)



# every action chains http://www.allselenium.info/python-selenium-all-mouse-actions-using-actionchains/#clickandhold(onelement=None)
# Problem : the bar is longer so you can scroll down the same amount: Hands-on, decay over time

The last thing to see is how to log in using Selenium. Some websites require authentication to perform certain actions.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import pickle

driver = webdriver.Firefox()
driver.get("https://www.twitch.tv/login")

user_name = "username"
password = "password"

# find_element_by_id() was removed in Selenium 4; use find_element(By.ID, ...)
element = driver.find_element(By.ID, "login-username")
element.send_keys(user_name)

element = driver.find_element(By.ID, "password-input")
element.send_keys(password)

sign_in = driver.find_element(By.XPATH, "//button[@data-a-target='passport-login-button']")
sign_in.click()

# cookies: save them to disk so that a later session can reuse the login
with open("data/cookies_twitch.pkl", "wb") as f:
    pickle.dump(driver.get_cookies(), f)

# Reload the cookies and add them to the driver
with open("data/cookies_twitch.pkl", "rb") as f:
    cookies = pickle.load(f)
for cookie in cookies:
    driver.add_cookie(cookie)

# A cookie can also be added by hand:
# driver.add_cookie({'name': 'twitch.lohp.countryCode',
#   'value': 'GE',
#   'path': '/',
#   'domain': '.twitch.tv',
#   'secure': False,
#   'httpOnly': False,
#   'expiry': 1912091189})

You might have noticed that the way we locate elements is similar to BS (plus XPath).
It would be nice to have the Scrapy structure with a Selenium backend. To do this we will use Scrapy's middlewares.py:

In [None]:
#middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

from scrapy.http import HtmlResponse

class ScrapyWikiSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapyWikiDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


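The part we want to modify is the process_request() method of the downloader middleware. Instead of letting Scrapy download the page itself (the equivalent of requests.get()), we use Selenium to load the page and send back its rendered source wrapped in an HtmlResponse.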

In [None]:
#middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
import time


driver = webdriver.Firefox()

class ScrapyWikiSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapyWikiDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        driver.get(request.url)
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
    
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


The last thing to do is to enable the middleware in settings.py:

In [None]:
DOWNLOADER_MIDDLEWARES = {
    'ScrapyWiki.middlewares.ScrapyWikiDownloaderMiddleware': 543,
}

Of course this method is not foolproof. Sometimes you will want access to the driver itself and not just the response, for example.

<a name="API"></a>
## APIs

We have seen how to scrape information directly from a website. Although this seems like a safe method, it is not the best for the server. Most of the time you don't need the whole page, only some specific information on it. Developers who are OK with you collecting their data have probably implemented some kind of Application Programming Interface (API). An API reduces the overhead of your query and gives you only the data you are interested in. When documentation is available, querying an API is usually as easy as calling requests.get(). Let's see some small examples:

(Side note: "API" means nothing and everything at the same time. When you use your phone to send a message you use an API, when you use a Python library you use an API, when you open a web browser you use an API... Be wary when you use this acronym!)
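
Before sending a request it can help to see how the query URL is put together. The following is only a small sketch using the standard library; `build_query_url` is a hypothetical helper name, and the parameters are the ones visible in the arXiv URL used in this chapter.

```python
from urllib.parse import urlencode

# Base endpoint taken from the arXiv request used in this chapter
BASE_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=10):
    """Return the full URL for an arXiv API query (hypothetical helper)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    # safe=":" keeps the "all:electron" syntax readable in the final URL
    return BASE_URL + "?" + urlencode(params, safe=":")


url = build_query_url("all:electron")
print(url)
```

With requests you can also pass the same dictionary through the `params` argument of requests.get() and let the library build the URL for you.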

In [None]:
#%% Arxiv

# First objective: find out whether there is an API and where its documentation is
# https://arxiv.org/help/api/tou

import requests
import feedparser


response = requests.get('http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=10')
feed = feedparser.parse(response.content)
feed

results = {}
for entry in feed.entries:
    print(entry)
    results[entry.id] = {"title": entry.title,
                         "abstract":entry.summary}

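The arXiv API answers with an Atom feed, which is just XML. If feedparser is not available, the same information can probably be extracted with the standard library. This is a sketch only: `SAMPLE_FEED` is a made-up, heavily shortened feed and `parse_feed` is a hypothetical helper, so a real response will contain many more fields.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily shortened Atom feed used only for illustration
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://example.org/abs/0001</id>
    <title>First sample paper</title>
    <summary>Abstract of the first paper.</summary>
  </entry>
  <entry>
    <id>http://example.org/abs/0002</id>
    <title>Second sample paper</title>
    <summary>Abstract of the second paper.</summary>
  </entry>
</feed>"""

# Atom elements live in this XML namespace
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def parse_feed(xml_text):
    """Build the same {id: {title, abstract}} dictionary as the cell above."""
    root = ET.fromstring(xml_text)
    results = {}
    for entry in root.findall("atom:entry", ATOM):
        entry_id = entry.findtext("atom:id", namespaces=ATOM)
        results[entry_id] = {
            "title": entry.findtext("atom:title", namespaces=ATOM),
            "abstract": entry.findtext("atom:summary", namespaces=ATOM),
        }
    return results


results = parse_feed(SAMPLE_FEED)
print(results)
```
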
In [None]:
#%% Twitch

# https://dev.twitch.tv/docs/authentication
# https://dev.twitch.tv/docs/api

import requests
import json
import pymongo
import time
import tqdm

client = pymongo.MongoClient('localhost',27017)

mydb = client["api_twitch"]
collection = mydb["top_games"]


Client_ID = "zuxz59ow9v8zncx3ljdyo5jaj1sqdz"
secret = "julw385uytn4dhzkublz2wa644l3he"

access_token = requests.post("https://id.twitch.tv/oauth2/token?client_id={}&client_secret={}&grant_type=client_credentials".format(Client_ID,secret))
access_token = json.loads(access_token.content)["access_token"]
#scope = "analytics:read:games"
headers = {"Client-ID": Client_ID, "Authorization": "Bearer " + access_token,}

n_games = 40
limit = 20
n_iteration = int(n_games/limit)

for i in tqdm.tqdm(range(n_iteration)):
    if i == 0:
        response_category = requests.get("https://api.twitch.tv/helix/games/top",headers = headers)
    else:
        response_category = requests.get("https://api.twitch.tv/helix/games/top?after={}".format(json.loads(response_category.content)["pagination"]["cursor"]),headers = headers)
    for category in json.loads(response_category.content)["data"]:
        response = requests.get('https://api.twitch.tv/helix/streams?game_id={}'.format(category["id"]), headers=headers)
        streamers = {}
        for streamer in json.loads(response.content)["data"]:
            streamers[streamer["user_id"]] = {"user_name":streamer["user_name"],
                                              "title":streamer["title"],
                                              "viewer_count":streamer["viewer_count"],
                                              "started_at":streamer["started_at"],
                                              "language":streamer["language"],} 
        
        # Follow the pagination cursor until the API stops returning one
        while True:
            cursor = json.loads(response.content).get("pagination", {}).get("cursor")
            if not cursor:
                break
            response = requests.get('https://api.twitch.tv/helix/streams?game_id={}&after={}'.format(category["id"], cursor), headers=headers)
            for streamer in json.loads(response.content).get("data", []):
                streamers[streamer["user_id"]] = {"user_name":streamer["user_name"],
                                                  "title":streamer["title"],
                                                  "viewer_count":streamer["viewer_count"],
                                                  "started_at":streamer["started_at"],
                                                  "language":streamer["language"],}
            time.sleep(1)
        
        post = {"_id": category["id"],
                "game": category["name"],
                "streamers": streamers,
                }
        try:
            collection.insert_one(post)
        except Exception as e:
            print(str(e))
            
    print(response.headers)
    time.sleep(1)
    


cursor = collection.find({"game":"World of Warcraft"})
for document in cursor:
    print(document)
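
The Twitch cell relies on cursor-based pagination: each answer carries a cursor that you send back to get the next page. The idea can be isolated in a small generator. This is a sketch under assumptions: `iterate_pages` and `fake_fetch` are hypothetical names, and the fake in-memory pages only imitate the `data` / `pagination` shape used above instead of calling the real API.

```python
def iterate_pages(fetch_page):
    """Yield every item from a cursor-paginated endpoint.

    fetch_page(cursor) must return a dict shaped like
    {"data": [...], "pagination": {"cursor": "..."}}; the cursor is
    assumed to be absent on the last page.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("data", [])
        cursor = page.get("pagination", {}).get("cursor")
        if not cursor:
            break


# Fake in-memory "API" standing in for the real HTTP calls
FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "pagination": {"cursor": "abc"}},
    "abc": {"data": [{"id": "3"}], "pagination": {}},
}


def fake_fetch(cursor):
    return FAKE_PAGES[cursor]


ids = [item["id"] for item in iterate_pages(fake_fetch)]
print(ids)
```

In real code `fetch_page` would wrap the requests.get() call and add the `after` parameter when a cursor is given, which keeps the looping logic separate from the HTTP details.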


Depending on the API, you will have to work with different data formats. The most popular is JSON:

In [None]:
import json
import requests

response = requests.get("http://ip-api.com/json/50.78.253.58")
response.content  # raw bytes returned by the server
data = json.loads(response.content)  # parsed into a Python dict
data["city"]
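
To see what json.loads() gives you without depending on the network, here is a minimal offline sketch. The payload is made up and only loosely shaped like a geolocation answer; the real fields of the service may differ.

```python
import json

# Made-up payload, loosely shaped like a geolocation API answer
raw = '{"status": "success", "city": "Strasbourg", "lat": 48.58, "tags": ["a", "b"]}'

data = json.loads(raw)            # JSON object -> Python dict
city = data["city"]               # direct access, raises KeyError if missing
missing = data.get("zip", "unknown")  # safer access with a default value
round_trip = json.dumps(data, sort_keys=True)  # back to a JSON string

print(type(data).__name__, city, missing)
```

JSON objects become dictionaries, arrays become lists, and numbers become int or float, so once parsed you work with ordinary Python structures.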

The other, less popular, format is XML. The next cell shows how to parse it.

In [None]:
# %% XML
# XML (Extensible Markup Language) is another common format for API responses. It has a tree-like structure.
# Several Python packages can parse XML: lxml, xml.dom.minidom, xml.etree.ElementTree

import xml.etree.ElementTree as ET
import xml.dom.minidom

# NOTE: xml_file must hold the path to a local XML file; it is not defined in this chapter

# ET
tree = ET.parse(xml_file)
root = tree.getroot()

[(elem.tag, elem.text) for elem in root.iter()]
[(elem.tag, elem.text) for elem in root.iter() if elem.tag =="source"]
#list(list(children[1])[0])[0].text


#lxml

from lxml import etree

root = etree.parse(xml_file)
abstract = root.xpath("//abstract//text()")
body = root.xpath("//body//text()")
title = root.xpath("//title-group//text()")
figures = root.xpath("//fig//text()")

aff = root.xpath("//aff/text()")
aff = [i for i in aff if not i.startswith((' ', '\t'))]
aff_label = root.xpath("//aff/label/text()")

mails = root.xpath("//author-notes/corresp")[0]
list(mails)  # child elements (getchildren() is deprecated)

# Map each affiliation label to its affiliation text
xref = {}
for affiliation, label in zip(aff, aff_label):
    xref[label] = affiliation

authors = root.xpath("//contrib")
authors = [list(i) for i in authors]
for author in authors:
    names = [list(i) for i in author if i.tag == "name"][0]
    surname = [i.text for i in names if i.tag == "surname"]
    name = [i.text for i in names if i.tag == "given-names"]
    xrefs = [i.text for i in author if i.tag == "xref"]

# minidom
doc = xml.dom.minidom.parse(xml_file) 

abstract = doc.getElementsByTagName("abstract")
body = doc.getElementsByTagName("body")
title = doc.getElementsByTagName("title-group")
figures = doc.getElementsByTagName("fig")

abstract[0].childNodes[0].childNodes[0].nodeValue
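
Since the cell above needs an XML file on disk, here is a self-contained sketch that can be run as is. `SAMPLE_XML` is a made-up document that only reuses a few of the tag names queried above (title-group, contrib, abstract); a real article file is much richer.

```python
import xml.etree.ElementTree as ET

# Made-up document reusing some of the tag names from the cell above
SAMPLE_XML = """
<article>
  <front>
    <title-group><article-title>A sample title</article-title></title-group>
    <contrib-group>
      <contrib><name><surname>Doe</surname><given-names>Jane</given-names></name></contrib>
      <contrib><name><surname>Roe</surname><given-names>Rick</given-names></name></contrib>
    </contrib-group>
    <abstract><p>Short abstract.</p></abstract>
  </front>
</article>
""".strip()

root = ET.fromstring(SAMPLE_XML)

# itertext() walks through all nested text, similar to //text() in XPath
title = "".join(root.find(".//title-group").itertext()).strip()
abstract = "".join(root.find(".//abstract").itertext()).strip()

# One (given names, surname) tuple per <contrib> element
authors = [
    (contrib.findtext("name/given-names"), contrib.findtext("name/surname"))
    for contrib in root.iter("contrib")
]

print(title, abstract, authors)
```

ElementTree only supports a subset of XPath, so for more complex queries lxml is probably the better choice.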

<a name="FTP"></a>
## FTP server

Sometimes a website has no API but offers an FTP server from which you can download data in bulk. It is unlikely that you will run into an FTP server soon, but just in case, here is how to use one.

In [2]:
import ftplib
import re
import tarfile
import os
#import shutil

# callback for ftp.retrbinary
def file_write(data):
   local_file.write(data) 

# connect to ftp_server
email = "email@unistra.fr"
ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
ftp.login(user="anonymous",passwd=email)
ftp.cwd("pub/pmc/oa_package")

# Find tar.gz path

tar_gz = False
while not tar_gz:
    list_files = ftp.nlst()
    # A dot in the last entry means we reached files rather than sub-directories
    if re.search(r"\.", list_files[-1]):
        tar_gz = True
    else:
        ftp.cwd(list_files[-1])

# In this case we only take one file, download it and uncompress it

link = ftp.pwd() + "/" + list_files[0]
os.makedirs("data", exist_ok=True)  # make sure the local folder exists
path = 'data/{}'.format(list_files[0])
with open(path, "wb") as local_file:
    ftp.retrbinary("RETR " + link, file_write, blocksize=16384)

# Extract into a folder named after the archive, not into the archive path itself
path_extracted = re.sub(r"\.tar\.gz$", "", path)
with tarfile.open(path) as my_tar:
    my_tar.extractall(path_extracted)
os.unlink(path)

ftp.quit()
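
The extraction step is easy to get wrong, because the target folder must be different from the archive file itself. Here is a small offline sketch of that step only. `extract_archive` is a hypothetical helper, and the archive is built on the fly in a temporary folder, so nothing is downloaded.

```python
import os
import re
import tarfile
import tempfile


def extract_archive(archive_path):
    """Extract a .tar.gz into a folder named after it and return that folder."""
    target_dir = re.sub(r"\.tar\.gz$", "", archive_path)
    with tarfile.open(archive_path) as tar:
        tar.extractall(target_dir)
    return target_dir


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny archive so the example does not need the FTP server
    source = os.path.join(tmp, "note.txt")
    with open(source, "w") as handle:
        handle.write("hello")
    archive = os.path.join(tmp, "sample.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="note.txt")

    extracted_dir = extract_archive(archive)
    extracted_files = os.listdir(extracted_dir)
    print(extracted_files)
```

Only extract archives from sources you trust, since a tar file can contain unexpected paths.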

<a name="rules"></a>
## Rules of good conduct

# TODO

Code review:
- https://github.com/matthpn2/Web-Scraping-with-Beautiful-Soup
- https://github.com/SoumitraAgarwal/Fifa-Ratings
- https://github.com/RainrainWu/finance_scraper