# Web Scraping

## Getting Started

Web scraping means writing a program that downloads web pages and extracts information from them.

Scraping tips:
* Check the website's terms and conditions; there is usually a limit on how often you can scrape and on what you can scrape.
* Don't hammer the website with requests. Space them out so you don't overload the server.
* You can get into legal trouble if you don't follow the tips above.
* Websites change, so if you want your scraper to keep working over time you will have to maintain it.
* Website data can be messy, so you will usually have to clean it.
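
The first two tips can be partly automated. The sketch below uses only the standard library to check a site's robots.txt rules and to pause between requests. A robots.txt file is not the same thing as the terms and conditions, but it is a machine-readable statement of what crawlers may fetch. The user agent string, the delay value, and the helper names (`build_robot_rules`, `allowed_urls`, `polite_visit`) are assumptions made for this illustration.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = 'example-scraper'  # hypothetical name for our scraper
DELAY_SECONDS = 2.0             # assumed pause; pick one that suits the site


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, rules, user_agent=USER_AGENT):
    """Keep only the URLs that the robots.txt rules let us fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def polite_visit(urls, fetch, delay=DELAY_SECONDS):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results


# Example rules, written inline so the sketch runs without network access
rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")
candidates = ['http://example.com/index.html',
              'http://example.com/private/page.html']
print(allowed_urls(candidates, rules))
```

In a real scraper, `fetch` would be whatever function downloads a page, and the robots.txt text would come from the site itself.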

## Preparing to Scrape

You can use Python’s urllib.request module (urllib2 in Python 2) to download the HTML that we need to parse, or you can use the requests library. For this example, we will use requests.
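
For comparison, here is a minimal sketch of the standard-library route. The function name `download_html`, the user agent string, and the timeout are assumptions chosen for this illustration.

```python
from urllib.request import Request, urlopen


def download_html(url, user_agent='example-scraper/0.1', timeout=10):
    """Download a page with the standard library and return it as text."""
    # user_agent is a hypothetical identifier for our scraper
    request = Request(url, headers={'User-Agent': user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response, falling back to UTF-8
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

Calling `download_html('http://www.blog.pythonlibrary.org/')` should return the page source as a string, much like `requests.get(url).text`.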

Using the inspect option in your browser to view a page's HTML can be extremely useful, because it shows which tags and classes hold the data you want.

## BeautifulSoup

BeautifulSoup is a popular HTML parser for Python.

In [1]:
# Install BeautifulSoup
!pip install beautifulsoup4



In [2]:
# Scraping using BeautifulSoup to get article headers

import requests
from bs4 import BeautifulSoup

url = 'http://www.blog.pythonlibrary.org/'

def get_articles():
    """
    Get the articles from the front page of the blog
    """
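    req = requests.get(url, timeout=10)
    req.raise_for_status()  # stop early on HTTP errors
    html = req.text
    soup = BeautifulSoup(html, 'html.parser')
    pages = soup.find_all('h1')  # find_all is the current name for findAll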

    articles = {i.a['href']: i.text.strip()
                for i in pages if i.a}
    
    # In Python 3 the titles are already str, so there is no need to encode them
    for link, title in articles.items():
        print('{title}: {url}'.format(title=title, url=link))

    return articles

if __name__ == '__main__':
    articles = get_articles()
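
To see what the dictionary comprehension does without touching the network, the same parsing step can be run on a small inline document. This is only a sketch: the sample HTML, its URLs, and the function name `extract_articles` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics a blog front page: two linked headers
# and one header without a link
SAMPLE_HTML = """
<html><body>
  <h1><a href="https://example.com/first-post">First Post</a></h1>
  <h1>Archive</h1>
  <h1><a href="https://example.com/second-post"> Second Post </a></h1>
</body></html>
"""


def extract_articles(html):
    """Map each linked h1 header's URL to its title text."""
    soup = BeautifulSoup(html, 'html.parser')
    return {header.a['href']: header.get_text(strip=True)
            for header in soup.find_all('h1')
            if header.a}  # skip headers that contain no link


articles = extract_articles(SAMPLE_HTML)
for link, title in articles.items():
    print('{title}: {url}'.format(title=title, url=link))
```

The `if header.a` filter is what keeps the unlinked "Archive" header out of the result.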

In [3]:
# Scraping Twitter using requests in conjunction with BeautifulSoup

import requests
from bs4 import BeautifulSoup

url = 'https://twitter.com/mousevspython'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
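# These class names reflect the page markup when this was written;
# if the markup has changed, the list will simply be empty
tweets = soup.find_all('li', 'js-stream-item')
# Loop over the tweets directly so an index can never run past the list
for tweet in tweets:
    print(tweet.get_text().strip())
    dt = tweet.find('a', 'tweet-timestamp')
    # Print the text of the timestamp link rather than the whole tag
    timestamp = dt.get_text(strip=True) if dt else 'an unknown date'
    print('This was tweeted on ' + timestamp)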

## Scrapy

Scrapy is a framework that you can use for crawling websites and extracting (i.e. scraping) data. It can also be used to extract data via a website’s API or as a general purpose web crawler. 

In [4]:
# Install Scrapy
!pip install scrapy



In [5]:
!scrapy startproject blog_scraper

New Scrapy project 'blog_scraper', using template directory '/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /Users/miesner.jacob/python-for-programmers-educative/Module 4 - Advanced Concepts in Python/blog_scraper

You can start your first spider with:
    cd blog_scraper
    scrapy genspider example example.com


Here we changed items.py to define what we want to scrape and added a blog.py file that contains our spider.

In [6]:
# Navigate to blog scraper folder
%cd blog_scraper

/Users/miesner.jacob/python-for-programmers-educative/Module 4 - Advanced Concepts in Python/blog_scraper


In [9]:
# Run our crawler
!scrapy crawl mouse

2021-11-26 16:30:55 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: blog_scraper)
2021-11-26 16:30:55 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 (v3.8.6:db455296be, Sep 23 2020, 13:31:39) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 36.0.0, Platform macOS-10.14.6-x86_64-i386-64bit
2021-11-26 16:30:55 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-11-26 16:30:55 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'blog_scraper',
 'NEWSPIDER_MODULE': 'blog_scraper.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['blog_scraper.spiders']}
2021-11-26 16:30:55 [scrapy.extensions.telnet] INFO: Telnet Password: 79ce4922fddddda2
2021-11-26 16:30:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extension

In [10]:
# Run our crawler and export to csv (the format is inferred from the .csv extension;
# the -t option is deprecated and only triggers a warning)
!scrapy crawl mouse -o articles.csv

2021-11-26 16:31:01 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: blog_scraper)
2021-11-26 16:31:01 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 (v3.8.6:db455296be, Sep 23 2020, 13:31:39) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 36.0.0, Platform macOS-10.14.6-x86_64-i386-64bit
2021-11-26 16:31:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-11-26 16:31:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'blog_scraper',
 'NEWSPIDER_MODULE': 'blog_scraper.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['blog_scraper.spiders']}
2021-11-26 16:31:01 [scrapy.extensions.telnet] INFO: Telnet Password: b050cc5382edd303
2021-11-26 16:31:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.t