* [Web Scraping For Beginners with Python](https://medium.com/@durgaswaroop/web-scraping-with-python-introduction-7b3c0bbb6053)
* [Web Scraping in Python](https://medium.com/dreidev/web-scraping-in-python-e07fba0a1663)
* [Web Scraping](https://medium.com/tag/web-scraping)
* [Web Scraping Tutorial with Python: Tips and Tricks](https://hackernoon.com/web-scraping-tutorial-with-python-tips-and-tricks-db070e70e071)
* [SQLite Python tutorial](http://zetcode.com/db/sqlitepythontutorial/)

# Part 1: [Web Scraping For Beginners with Python](https://medium.com/@durgaswaroop/web-scraping-with-python-introduction-7b3c0bbb6053)

In [1]:
# Check installation of beautifulsoup 
#!pip list | grep beautifulsoup

In [27]:
from bs4 import BeautifulSoup as bs
import urllib.request as ureq
import pandas as pd

In [176]:
freblogg_url = 'http://freblogg.com'
website = ureq.urlopen(freblogg_url).read()
#print(website)

Here we call `urlopen()` with our website URL. It returns an object of class `http.client.HTTPResponse`, and calling `read()` on that object gives us the raw HTML of the page as `bytes`.

Uncomment the `print(website)` line above to see all the HTML of the homepage of the site you’re scraping.

Technically, at this point you are done as you have the html of the website. But until you use this html to get some information, it is useless. So, let’s do that.

Say, I want to extract titles of all the posts on my website’s homepage. At the time of writing this post, the first three articles are:

* `How to recover from 'git reset --hard' | Git`
* `Functions in C Programming | Part 1`
* `Matrix Multiplication | C Programming`

You should be able to see these as well if you visit Freblogg now, below this article. So, my goal is to extract this information from the html we just got. We will use Beautifulsoup for this very task.

### Using Beautifulsoup
To tell beautifulsoup to read our html, we do:

In [23]:
soup = bs(website, "html.parser")
type(soup)

bs4.BeautifulSoup

`website` is the HTML we fetched from the site. We pass it to BeautifulSoup along with the argument `"html.parser"`, which selects the HTML parser from Python’s standard library. If another parser such as `lxml` is installed, we can name it here instead. Leaving the argument out also works, i.e. `soup = bs(website)`, but then BeautifulSoup picks the best parser it can find on the system and emits a warning, so results can differ between machines. It never hurts to be specific.

Now that we have our `soup` object, we can use that to get what we want. In my case, I need to get the titles of all the articles on my homepage.

To do this, we have to look at the site we are scraping and see what identifies the things we want to extract.

In my case, the article titles are all `h2` headers, so we start by collecting every `<h2>` tag:

In [61]:
h2 = soup.find_all('h2')
df_data = pd.DataFrame({'h2':h2})
df_data.head(5)

Unnamed: 0,h2
0,"<h2 class=""descriptionheader""> </h2>"
1,"<h2 class=""date-header""><span>January 12, 2018..."
2,"<h2 class=""post-title entry-title""><span class..."
3,"<h2 class=""date-header""><span>January 10, 2018..."
4,"<h2 class=""post-title entry-title""><span class..."


This gives us a list of all the `<h2>` tags in the page. Similarly, if we want all the `<div>` tags, we can do:

In [62]:
div = soup.find_all('div')
df_data['div'] = pd.Series(div)
df_data.head(5)

Unnamed: 0,h2,div
0,"<h2 class=""descriptionheader""> </h2>","<div class=""main-container""> <!-- begin header..."
1,"<h2 class=""date-header""><span>January 12, 2018...","<div class=""header_container_fixed""> <!-- begi..."
2,"<h2 class=""post-title entry-title""><span class...","<div class=""headersec section"" id=""headersec"">..."
3,"<h2 class=""date-header""><span>January 10, 2018...","<div class=""widget Header"" data-version=""1"" id..."
4,"<h2 class=""post-title entry-title""><span class...","<div id=""header-inner""> <div class=""titlewrapp..."


In [177]:
for i in headers[:3]:
    print('-------------------------------------')
    print(i)

-------------------------------------
<h2 class="post-title entry-title"><span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2018/01/webapp-with-flask-1.html">Build A Web Application With Flask In Python Part I</a>
</h2>
-------------------------------------
<h2 class="post-title entry-title"><span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2018/01/json-parsing-with-python.html">Json Parsing With Python</a>
</h2>
-------------------------------------
<h2 class="post-title entry-title"><span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2018/01/apache-spark-datasets-3.html">Datasets In Apache Spark - Part 3 | Writing Datasets to Disk</a>
</h2>


I have taken out some of the output to keep it short, but as you can see, the list of `<h2>` tags contains more than just the article titles. The ones I actually need are the `<h2>` tags with `class="post-title entry-title"`. The other tags in the output, like `<h2>★ Labels</h2>` or `<h2>★ Trending</h2>`, are things that we don’t want.

From this output we can see that it is `class="post-title entry-title"`, together with the `<h2>` tag, that identifies an article title. So, we will use that by adding an attribute dictionary as follows:

In [178]:
headers = soup.find_all('h2', attrs = {'class':'post-title entry-title'})
df_data['headers'] = pd.Series(headers)
df_data.head()

Unnamed: 0,h2,div,headers
0,"<h2 class=""descriptionheader""> </h2>","<div class=""main-container""> <!-- begin header...","<h2 class=""post-title entry-title""><span class..."
1,"<h2 class=""date-header""><span>January 12, 2018...","<div class=""header_container_fixed""> <!-- begi...","<h2 class=""post-title entry-title""><span class..."
2,"<h2 class=""post-title entry-title""><span class...","<div class=""headersec section"" id=""headersec"">...","<h2 class=""post-title entry-title""><span class..."
3,"<h2 class=""date-header""><span>January 10, 2018...","<div class=""widget Header"" data-version=""1"" id...","<h2 class=""post-title entry-title""><span class..."
4,"<h2 class=""post-title entry-title""><span class...","<div id=""header-inner""> <div class=""titlewrapp...","<h2 class=""post-title entry-title""><span class..."


Now it is just the article headers, which is better than what we had before. From here we have just one more thing to do before we actually get the titles.

Before that, let me show you a couple more cases of parsing with `beautifulsoup`. So far we have used `find_all` to get all the tags we want. Instead, let’s use `find`, which returns only the first matching element rather than a list of all the headers.

Using the Python REPL, we get:

In [138]:
soup.find('h2', attrs = {'class':'post-title entry-title'})

<h2 class="post-title entry-title"><span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2018/01/webapp-with-flask-1.html">Build A Web Application With Flask In Python Part I</a>
</h2>

In [179]:
soup.find('h2', {'class':'post-title entry-title'}).find('span')

<span class="post_title_icon"></span>

In [180]:
help(soup.find)

Help on method find in module bs4.element:

find(name=None, attrs={}, recursive=True, text=None, **kwargs) method of bs4.BeautifulSoup instance
    Return only the first child of this Tag matching the given
    criteria.



In [181]:
soup.find('h2', {'class':'post-title entry-title'}).find('a')

<a href="http://www.freblogg.com/2018/01/webapp-with-flask-1.html">Build A Web Application With Flask In Python Part I</a>

So, using `find()` in series like this, we can drill down into nested tags and fetch the innermost values as needed.

To get the text and the link of an `<a>` tag:

In [184]:
anchor = soup.find('h2', {'class':'post-title entry-title'}).find('a')
#help(anchor)
#anchor.get_text()
#anchor.getText()
anchor.text.strip()

'Build A Web Application With Flask In Python Part I'

In [98]:
anchor['href']

'http://www.freblogg.com/2018/01/webapp-with-flask-1.html'

Now we want the titles of all the articles. Each element in the `headers` list is a tag that, in simplified form, looks something like this:

`<h2><a>How to recover from 'git reset --hard' | Git</a></h2>`

Even though the title sits under `<a>`, it is also inside `<h2>`, which means we can call `.text` on the `<h2>` tag directly and get the title.

Finally, we put it together:

In [186]:
titles = list(map(lambda h: h.text.strip(), headers))
#titles = list(map(lambda h: h.getText(), headers))
titles

['Build A Web Application With Flask In Python Part I',
 'Json Parsing With Python',
 'Datasets In Apache Spark - Part 3 | Writing Datasets to Disk',
 'Remove Duplicate Elements From An Array',
 'Reduce Image Size With Python And Tinypng',
 'Datasets In Apache Spark | Part 2',
 'My Almost Fully Automated Blogging Workflow']

In [201]:
titles_links = dict(map(lambda h: (h.text.strip(),h.find('a')['href']), headers))
pd.DataFrame(list(titles_links.items()), columns=['title', 'link'])

Unnamed: 0,title,link
0,Build A Web Application With Flask In Python P...,http://www.freblogg.com/2018/01/webapp-with-fl...
1,Json Parsing With Python,http://www.freblogg.com/2018/01/json-parsing-w...
2,Datasets In Apache Spark - Part 3 | Writing Da...,http://www.freblogg.com/2018/01/apache-spark-d...
3,Remove Duplicate Elements From An Array,http://www.freblogg.com/2018/01/remove-duplica...
4,Reduce Image Size With Python And Tinypng,http://www.freblogg.com/2018/01/resize-compres...
5,Datasets In Apache Spark | Part 2,http://www.freblogg.com/2018/01/apache-spark-d...
6,My Almost Fully Automated Blogging Workflow,http://www.freblogg.com/2017/12/my-automated-b...


# Part 2: [Web Scraping Tutorial with Python: Tips and Tricks](https://hackernoon.com/web-scraping-tutorial-with-python-tips-and-tricks-db070e70e071)
* [What are the differences between the urllib, urllib2, and requests module?
](https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-and-requests-module)

There is a stand-alone, ready-to-use data extraction framework called Scrapy. Apart from extracting HTML, the package offers lots of functionality such as exporting data in several formats, logging, etc. It is also highly customisable: you can run different spiders in different processes, disable cookies and set download delays. It can also be used to extract data through an API. However, the learning curve is not smooth for new programmers: you need to read tutorials and examples to get started.

For my use case it was too much ‘out of the box’: I just wanted to extract the links from all pages, access each link and extract information out of it.

In [2]:
import requests
url = 'https://www.malaysiakini.com/en/latest/news'
r = requests.get(url)
r.url

'https://www.malaysiakini.com/en/latest/news'

#### Passing Parameters In URLs

In [3]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get(url, params=payload)
r.url

'https://www.malaysiakini.com/en/latest/news?key1=value1&key2=value2'
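The `params` dictionary ends up URL-encoded after a `?` in the final URL. The same transformation can be sketched with only the standard library; this illustrates the idea and is not a description of how Requests implements it internally:

```python
from urllib.parse import urlencode

base_url = 'https://www.malaysiakini.com/en/latest/news'
payload = {'key1': 'value1', 'key2': 'value2'}

# urlencode turns the dict into 'key1=value1&key2=value2',
# percent-escaping any characters that are not URL-safe
full_url = base_url + '?' + urlencode(payload)
print(full_url)
```

The printed URL matches the `r.url` value shown above.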

#### Response Content
We can read the content of the server’s response. Consider the GitHub timeline again:

In [7]:
import requests

r = requests.get(url)
help(r)

Help on Response in module requests.models object:

class Response(builtins.object)
 |  The :class:`Response <Response>` object, which contains a
 |  server's response to an HTTP request.
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, *args)
 |  
 |  __getstate__(self)
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Allows you to use a response as an iterator.
 |  
 |  __nonzero__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if

In [132]:
import requests

r = requests.get('https://api.github.com/events')
#r.text

#### Implementation

In [170]:
from bs4 import BeautifulSoup as BS
import requests
STBusiness_link ='https://www.straitstimes.com/business'

**Time-out**: You can tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter. Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely.

Timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.

In [171]:
# fetch the content from url
page_response = requests.get(STBusiness_link, timeout=5)
# parse html
page_content = BS(page_response.content, "html.parser")

In [174]:
# extract all <div> elements with class 'view-content'
#prices = page_content.find_all(class_='block-link')
divs = page_content.find_all('div', attrs = {'class':'view-content'})
anchor = list()

# Part 3: [A web scraper to retrieve stock indices automatically](https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe)

* [Scraping javascript page with PyQt5 and QWebEngineView
](https://stackoverflow.com/questions/45265143/scraping-javascript-page-with-pyqt5-and-qwebengineview)

### HTML Tags

```HTML
<!DOCTYPE html>  
<html>  
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>
```

This is the basic syntax of an HTML webpage. Every `<tag>` serves a block inside the webpage:
1. `<!DOCTYPE html>`: HTML documents must start with a type declaration.
2. The HTML document is contained between `<html>` and `</html>`.
3. The meta and script declaration of the HTML document is between `<head>` and `</head>`.
4. The visible part of the HTML document is between `<body>` and `</body>` tags.
5. Title headings are defined with the `<h1>` through `<h6>` tags.
6. Paragraphs are defined with the `<p>` tag.

Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for table cells.

Also, HTML tags sometimes come with id or class attributes. The id attribute specifies a unique id for an HTML tag and the value must be unique within the HTML document. The class attribute is used to define equal styles for HTML tags with the same class. We can make use of these ids and classes to help us locate the data we want.
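As a small sketch of locating data by id and by class with BeautifulSoup, consider the snippet below. The HTML, the id `listing-42` and the class names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the id and class names are invented for this example
html_doc = """
<div id="listing-42" class="listing">
    <p class="price">100</p>
    <p class="price">250</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# An id is unique within the document, so find() is enough
listing = soup.find(id="listing-42")

# A class can be shared by many tags, so find_all() returns a list
prices = [p.text.strip() for p in soup.find_all("p", class_="price")]

print(listing["class"])  # class is multi-valued, so it comes back as a list
print(prices)
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so BeautifulSoup uses `class_` as the keyword argument.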

**Scraping Rules**

* You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
* Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
* The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
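One way to keep to the "one request per second" rule is a small throttle that sleeps only as long as needed between requests. This is a minimal sketch; the class name `Throttle` and its parameters are hypothetical, and the clock and sleep functions are injectable only to make the behaviour easy to test:

```python
import time

class Throttle:
    """Ensures at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too early: pause for the rest of the interval
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Typical use (the URL list and the fetch step are placeholders):
# throttle = Throttle(min_interval=1.0)
# for url in urls:
#     throttle.wait()
#     ...fetch and parse url...
```

Because the throttle measures elapsed time, slow parsing between requests counts toward the interval and no extra delay is added on top of it.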

In [1]:
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEngineView

class Render(QWebEngineView):
    def __init__(self, url):
        self.html = None
        
        if not QApplication.instance():
            self.app = QApplication(sys.argv)
        else:
            self.app = QApplication.instance() 

        #self.app = QApplication(sys.argv)
        
        QWebEngineView.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        #self.setHtml(html)
        self.load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        # This is an async call, you need to wait for this
        # to be called before closing the app
        self.page().toHtml(self._callable)

    def _callable(self, data):
        self.html = data
        # Data has been stored, it's safe to quit the app
        self.app.quit()

In [2]:
import bs4 as bs
from lxml import html
url = 'https://pythonprogramming.net/parsememcparseface/' 
client_source = Render(url).html

soup = bs.BeautifulSoup(client_source, 'lxml')
js_test = soup.find('p', attrs = {'class': 'jstest'})
js_test.text.strip()

'Look at you shinin!'

In [None]:
import pandas as pd
import requests

page_link = 'https://www.bloomberg.com/quote/SPX:IND'
source = Render(page_link).html

page_content = bs.BeautifulSoup(source, "html.parser")

div = page_content.find_all('div', 
                            attrs = {'class':'overviewRow__0956421f'})
div

In [3]:
import json

r = {'is_claimed': 'True', 'rating': 3.5}
r = json.dumps(r)
loaded_r = json.loads(r)
type(r)

str

In [1]:
loaded_r['rating'] #Output 3.5

3.5

In [None]:
type(r) #Output str

In [None]:
type(loaded_r) #Output dict

In [None]:
#!pip install feedparser

WARNING! You are attempting to install newspaper's python2 repository on python3. PLEASE RUN `$ pip3 install newspaper3k` for python3 or `$ pip install newspaper` for python2

[Newspaper3k: Article scraping & curation](https://newspaper.readthedocs.io/en/latest/)

In [161]:
import feedparser as fp
import json
import newspaper
from newspaper import Article
from time import mktime
from datetime import datetime

In [162]:
# Set the limit for number of articles to download
LIMIT = 50

data = {}
data['newspapers'] = {}

In [163]:
'''
NewsPapers = {"malaysiakini": {"link": "https://www.malaysiakini.com/en/latest/news"},
              "bbc": {"rss": "http://feeds.bbci.co.uk/news/rss.xml", "link": "http://www.bbc.com/"}}

with open('NewsPapers.json', 'w') as in_file:
    companies = json.dump(NewsPapers, in_file)
'''


In [164]:
# Loads the JSON files with news sites
with open('NewsPapers.json') as data_file:
    companies = json.load(data_file)
companies

{'malaysiakini': {'link': 'https://www.malaysiakini.com/en/latest/news'},
 'bbc': {'rss': 'http://feeds.bbci.co.uk/news/rss.xml',
  'link': 'http://www.bbc.com/'}}

In [166]:
# Iterate through each news company
for company, value in companies.items():
    if 'rss' in value:
        rss = value['rss']
        d = fp.parse(rss)
        print('###############################################################')
        print(rss)
        #print(d.entries[0].keys())
        print("Downloading articles from ", company)
        newsPaper = {
            "rss": value['rss'],
            "link": value['link'],
            "articles": []
        }
        
        count = 1
        
        #every entry is a news
        for entry in d.entries:
            #print(entry.keys())
            if hasattr(entry, 'published'):
            #if 'published' in entry.keys():
                if count > LIMIT:
                    break
                article = {}
                article['link'] = entry.link
                date = entry.published_parsed
                #print(entry.link)
                #print(date)
                article['published'] = datetime.fromtimestamp(mktime(date)).isoformat()
                
                try:
                    content = Article(entry.link)
                    content.download()
                    content.parse()
                except Exception as e:
                    print(e)
                    print("continuing...")
                    continue
                  
                article['title'] = content.title
                article['text'] = content.text[:10]
                print(article)

                
                newsPaper['articles'].append(article)
                #print(count, "articles downloaded from", company, ", url: ", entry.link)
                count = count + 1        

    else:
        # This is the fallback method if a RSS-feed link is not provided.
        # It uses the python newspaper library to extract articles
        print("Building site for ", company)
        paper = newspaper.build(value['link'], memoize_articles=False, language='en')
        #print(help(paper))
        print(len(paper.articles))
        
        newsPaper = {
            "link": value['link'],
            "articles": []
        }
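        noneTypeCount = 0
        # count must be initialised here too, otherwise this branch
        # depends on a value left over from the RSS branch
        count = 1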
        
        #each content is an article
        for content in paper.articles:
            #if count > LIMIT:
             #   break
            if content.title == 'en':
                continue
            try:
                content.download()
                content.parse()
            except Exception as e:
                print(e)
                print("continuing...")
                continue
                
            # Again, for consistency, if there is no found publish date the article will be skipped.
            # After 10 downloaded articles from the same newspaper without publish date, the company will be skipped.
            if content.publish_date is None:
                print(count, " Article has date of type None...")
                noneTypeCount = noneTypeCount + 1
                if noneTypeCount > 10:
                    print("Too many noneType dates, aborting...")
                    noneTypeCount = 0
                    break
                count = count + 1
                continue
                
            article = {}
            article['title'] = content.title
            article['text'] = content.text
            article['link'] = content.url
            article['published'] = content.publish_date.isoformat()
            
            newsPaper['articles'].append(article)
            #print(count, "articles downloaded from", company, " using newspaper, url: ", content.url)
            print(count, content.url, content.publish_date, content.title[:10])
            count = count + 1
            noneTypeCount = 0

Building site for  malaysiakini
118
36  Article has date of type None...
37 https://www.malaysiakini.com/news/439936 2018-08-22 12:38:00+08:00 Guan Eng r
38 https://www.malaysiakini.com/news/439931 2018-08-22 12:31:00+08:00 MCA must c
39 https://www.malaysiakini.com/news/439926 2018-08-22 10:16:00+08:00 Keeping Ha
40 https://www.malaysiakini.com/news/439922 2018-08-22 09:23:00+08:00 Top 5 back
41 https://www.malaysiakini.com/news/439928 2018-08-22 11:00:00+08:00 Wan Azizah
42 https://www.malaysiakini.com/news/439929 2018-08-22 11:19:00+08:00 Report: Pr
43 https://www.malaysiakini.com/news/439918 2018-08-22 07:54:00+08:00 Seri Setia
44 https://www.malaysiakini.com/news/439913 2018-08-22 07:01:00+08:00 New rules 
45 https://www.malaysiakini.com/news/439916 2018-08-22 07:38:00+08:00 Yoursay: D
46 https://www.malaysiakini.com/news/439915 2018-08-22 07:22:00+08:00 Three Chin
47 https://www.malaysiakini.com/news/439914 2018-08-22 07:16:00+08:00 Selamat Ha
48 https://www.malaysiakini.com/news

137 https://www.malaysiakini.com/news/439851 2018-08-21 16:04:00+08:00 “刘特佐当年权势才大
138 https://www.malaysiakini.com/news/439837 2018-08-21 14:36:00+08:00 已获中国谅解不反对，


KeyboardInterrupt: 

# Part 4: Pitfalls

**3.1 Check robots.txt**

The scraping rules of the websites can be found in the `robots.txt` file. You can find it by writing robots.txt after the main domain, e.g `www.website_to_scrape.com/robots.txt`. These rules identify which parts of the websites are not allowed to be automatically extracted or how frequently a bot is allowed to request a page. Most people don’t care about it, but try to be respectful and at least look at the rules even if you don’t plan to follow them.
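Python’s standard library can read these rules for you through `urllib.robotparser`. The sketch below parses a made-up `robots.txt` inlined as a string so that it runs offline; the user agent name `my-scraper` and the paths are assumptions for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
# For a real site, call set_url('.../robots.txt') followed by read() instead.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("my-scraper", "http://www.website_to_scrape.com/news/1")
blocked = parser.can_fetch("my-scraper", "http://www.website_to_scrape.com/private/x")
delay = parser.crawl_delay("my-scraper")  # seconds between requests, or None

print(allowed, blocked, delay)
```

Checking `can_fetch()` before each request, and sleeping for `crawl_delay()` seconds when it is set, is a simple way to stay within the published rules.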

**3.2 HTML can be evil**

HTML tags can contain id, class or both. HTML id specifies a unique id and HTML class is non-unique. Changes in the class name or element could either break your code or deliver wrong results.

There are two ways to avoid it or at least to be alerted about it:

* Use specific id rather than class since it is less likely to be changed
* Check if the element returns None

```python
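price = page_content.find(id='listings_prices')
# check if the element with such id exists or not
if price is None:
    # NOTIFY! LOG IT, COUNT IT
    print("element with id 'listings_prices' not found")
else:
    # do something with the element, e.g. read its text
    print(price.text.strip())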
```

However, because some fields can be optional (like `discounted_price` in our HTML example), corresponding elements would not appear on each listing. In this case you can count the percentage of how many times this specific element returned `None` to the number of listings. If it is 100%, you might want to check if the element name was changed.
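The percentage check described above can be sketched in a few lines. The helper name `missing_percentage` and the sample values are hypothetical; in practice the list would hold whatever `find()` returned for the optional field on each listing:

```python
def missing_percentage(values):
    """Percentage of entries that are None (0.0 for an empty list)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)

# Hypothetical results of looking up an optional field on five listings
discounted_prices = ['90', None, None, '45', None]

pct = missing_percentage(discounted_prices)
print(f"{pct:.0f}% of listings had no discounted price")
if pct == 100.0:
    print("The element never matched: check whether its name changed")
```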

**3.3 User agent spoofing**

Every time you visit a website, it gets your browser information via the user agent. Some websites won’t show you any content unless you provide a user agent. Also, some sites offer different content to different browsers. Websites do not want to block genuine users, but you would look suspicious if you sent 200 requests/second with the same user agent. A way out might be either to generate an (almost) random user agent or to set one yourself.

```python
# library to generate user agent
from user_agent import generate_user_agent
# generate a user agent
headers = {'User-Agent': generate_user_agent(device_type="desktop", os=('mac', 'linux'))}
#headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.63 Safari/537.36'}
page_response = requests.get(page_link, timeout=5, headers=headers)
```

**3.4 Timeout request**

By default, Requests will keep waiting for a response indefinitely. Therefore, it is advised to set the `timeout` parameter.
```python
# timeout is set to 5 seconds
page_response = requests.get(page_link, timeout=5, headers=headers)
```

**3.5 Did I get blocked?**

Frequent appearance of the status codes like 404 (Not Found), 403 (Forbidden), 408 (Request Timeout) might indicate that you got blocked. You may want to check for those error codes and proceed accordingly.
Also, be ready to handle exceptions from the request.

```python
try:
    page_response = requests.get(page_link, timeout=5)
    if page_response.status_code == 200:
        pass  # extract the data here
    else:
        print(page_response.status_code)
        # notify, try again
except requests.Timeout as e:
    print("It is time to timeout")
    print(str(e))
except requests.RequestException as e:
    # any other error raised by requests (connection errors, bad URLs, ...)
    print(str(e))
```

**3.6 IP Rotation**

Even if you randomize your user agent, all your requests will be from the same IP address. That doesn’t sound abnormal because libraries, universities, and also companies have only a few IP addresses. However, if there are uncommonly many requests coming from a single IP address, a server can detect it. 
Using shared proxies, VPNs or Tor can help you become a ghost :).
```python
proxies = {'http' : 'http://10.10.0.0:0000',  
          'https': 'http://120.10.0.0:0000'}
page_response = requests.get(page_link, proxies=proxies, timeout=5)  

```
By using a shared proxy, the website will see the IP address of the proxy server and not yours. A VPN connects you to another network and the IP address of the VPN provider will be sent to the website.


**3.7 Honeypots**

Honeypots are means to detect crawlers or scrapers.

These can be ‘hidden’ links that are not visible to the users but can be extracted by scrapers/spiders. Such links will have a CSS style set to `display:none`, they can be blended by having the color of the background, or even be moved off of the visible area of the page. Once your crawler visits such a link, your IP address can be flagged for further investigation, or even be instantly blocked.

Another way to spot crawlers is to add links with infinitely deep directory trees. Then one would need to limit the number of retrieved pages or limit the traversal depth.
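Limiting the traversal depth can be as simple as counting path segments before following a link. This is a minimal sketch; the cut-off `MAX_DEPTH` and the helper names are assumptions, and the right threshold depends on the site:

```python
from urllib.parse import urlparse

MAX_DEPTH = 5  # assumed cut-off; tune it to the site you are crawling

def path_depth(url):
    """Number of non-empty path segments in the URL."""
    return len([seg for seg in urlparse(url).path.split('/') if seg])

def should_follow(url, max_depth=MAX_DEPTH):
    """Skip links that sit suspiciously deep in the directory tree."""
    return path_depth(url) <= max_depth

normal = 'http://www.website.com/news/2018/08/article.html'
trap = 'http://www.website.com/a/b/c/d/e/f/g/h/page.html'

print(should_follow(normal), should_follow(trap))
```

A cap on the total number of pages retrieved per site works as a second safety net when a trap generates endless links at a shallow depth.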


**4. Dos and Don’ts**
* Before scraping, check if there is a public API available. Public APIs provide easier and faster (and legal) data retrieval than web scraping. The Twitter API, for example, provides APIs for different purposes.
* In case you scrape lots of data, you might want to consider using a database to be able to analyze or retrieve it fast. The SQLite Python tutorial linked at the top shows how to create a local database with Python.
* Be polite. It is recommended to let people know that you are scraping their website so they can better respond to the problems your bot might cause. Again, do not overload the website by sending hundreds of requests per second.


**5. Speed up — parallelization**
If you decide to parallelize your program, be careful with your implementation so you don’t slam the server, and be sure you have read the Dos and Don’ts section. It also helps to look up the definitions of parallelization vs. concurrency and of processors vs. threads first.

If you extract a huge amount of information from the page and do some preprocessing of the data while scraping, the number of requests per second you send to the page can be relatively low.

For my other project where I scraped apartment rental prices, I did heavy preprocessing of the data while scraping, which resulted in 1 request/second. In order to scrape 4K ads, my program would run for about one hour.

In order to send requests in parallel you might want to use a multiprocessing package.

Let’s say we have 100 pages and we want to assign every processor equal amount of pages to work with. If n is the number of CPUs, you can evenly chunk all pages into the n bins and assign each bin to a processor. Each process will have its own name, target function and the arguments to work with. The name of the process can be used afterwards to enable writing data to a specific file.

I assigned 1K pages to each of my 4 CPUs which yielded 4 requests/second and reduced the scraping time to around 17 mins.
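The run times quoted above follow from simple arithmetic, assuming exactly 4,000 pages and a steady rate of 1 request/second per process:

```python
def scraping_minutes(pages, requests_per_second):
    """Total run time in minutes at a steady request rate."""
    return pages / requests_per_second / 60

sequential = scraping_minutes(4000, 1)  # one process at 1 request/second
parallel = scraping_minutes(4000, 4)    # four processes, 1 request/second each

print(f"sequential: {sequential:.1f} min, parallel: {parallel:.1f} min")
```

This gives roughly 67 minutes for the sequential run and roughly 17 minutes with four processes, in line with the "about one hour" and "around 17 mins" reported here.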

```python
import sys
import numpy as np
import multiprocessing as multi

def chunks(n, page_list):
    """Splits the list into n chunks"""
    return np.array_split(page_list, n)

def perform_extraction(page_ranges):
    """Extracts data, does preprocessing, writes the data"""
    # do requests and BeautifulSoup
    # preprocess the data
    file_name = multi.current_process().name + '.txt'
    # write into current process file

# The guard keeps child processes from re-running this block
# on platforms that start workers by importing the module
if __name__ == '__main__':
    cpus = multi.cpu_count()
    workers = []
    page_list = ['www.website.com/page1.html', 'www.website.com/page2.html',
                 'www.website.com/page3.html', 'www.website.com/page4.html']

    page_bins = chunks(cpus, page_list)

    for cpu in range(cpus):
        sys.stdout.write("CPU " + str(cpu) + "\n")
        # Process that will send corresponding list of pages
        # to the function perform_extraction
        worker = multi.Process(name=str(cpu),
                               target=perform_extraction,
                               args=(page_bins[cpu],))
        worker.start()
        workers.append(worker)

    for worker in workers:
        worker.join()
```
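To see what the chunking step produces without installing NumPy, here is a pure-Python sketch that sizes its bins the same way `np.array_split` does: the first `len(items) % n` bins get one extra item. The helper name `split_evenly` and the page URLs are hypothetical:

```python
def split_evenly(items, n):
    """Split items into n contiguous bins whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional item each
        size = base + (1 if i < extra else 0)
        bins.append(items[start:start + size])
        start += size
    return bins

pages = [f'www.website.com/page{i}.html' for i in range(1, 11)]
bins = split_evenly(pages, 4)
print([len(b) for b in bins])
```

With fewer pages than CPUs, some bins come back empty, so the worker function should cope with an empty list.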