## A broader conversation on webscraping

#### This week, we'll spend some time talking more about the process of webscraping and adding context on why we would do it. We'll do this by taking a peek around engadget.com.

### 1. The first thing that we want to do is take a look at the robots.txt to understand a little more about what we can and can't do on this page. NOTE: we are doing this without splinter

- https://moz.com/learn/seo/robotstxt

- https://www.engadget.com/robots.txt

- We'll also take a moment to talk about using the robots.txt to make decisions about how your scraper is going to work.

#### Theoretically, you can use the robots.txt to do a lot of things that make your scraper more compliant, as we've talked about before. Take a bit of time to familiarize yourself with how to use these rules.
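One way to act on those rules in code is Python's built-in `urllib.robotparser`. The sketch below is only an illustration: the `ROBOTS_TXT` string is a shortened copy of the Engadget file pasted in by hand so that the example runs offline, and `allowed_paths` is a helper name made up for this demo. Note that the standard-library parser treats each rule as a plain path prefix, so wildcard rules may not be interpreted the way a search engine would interpret them.

```python
from urllib.robotparser import RobotFileParser

# Shortened, hand-copied sample of a robots.txt file (assumption: used offline).
ROBOTS_TXT = """\
User-agent: *
Sitemap: https://www.engadget.com/sitemap.xml
Disallow: /forward
Disallow: /traffic
Disallow: /mm_track
"""


def allowed_paths(robots_text, paths, user_agent="*"):
    """Return only the paths that the robots.txt rules allow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [path for path in paths if parser.can_fetch(user_agent, path)]


candidate_paths = ["/homepage", "/forward", "/traffic"]
print(allowed_paths(ROBOTS_TXT, candidate_paths))
```

Because the rules are parsed rather than compared against a saved copy of the file, this approach keeps working when the site adds or removes a rule.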

- We'll also take a quick look at the sitemap and briefly describe how that is used.

- https://www.engadget.com/sitemap.xml

- https://www.searchenginejournal.com/technical-seo/xml-sitemaps/
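As a rough illustration of how a sitemap is used, the sketch below pulls every `<loc>` entry out of a sitemap document with the standard library. The sample XML and its example.com URLs are made up for the demo, and `sitemap_urls` is a hypothetical helper. A real sitemap may instead be an index that points at further sitemap files, which would need one more round of fetching.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the standard sitemap format (assumption: not Engadget's real data).
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2020-05-27</lastmod></url>
  <url><loc>https://www.example.com/story-one.html</loc></url>
</urlset>"""

# Sitemap elements live in this XML namespace, so lookups must include it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]


print(sitemap_urls(SAMPLE_SITEMAP))
```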

#### We will also very briefly discuss crawling a website.

- This is something that you'll also want to read into. Crawling is a more organized form of scraping, but whenever you do either, make sure that you are being fair to whoever is serving the pages and following their recommendations to the best of your ability.

- https://www.cloudflare.com/learning/bots/what-is-a-web-crawler/
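To make the idea of crawling a little more concrete, here is a minimal sketch of a breadth-first crawl that pauses between pages and stops after a fixed number of them. Everything in it is an assumption for illustration: `crawl` and `FAKE_SITE` are made-up names, and the link lookup is a plain dictionary so that the example runs without touching a real server. In practice `get_links` would fetch a page and return the links found on it.

```python
import time
from collections import deque


def crawl(start, get_links, max_pages=10, delay=1.0):
    """Visit pages breadth-first, never revisiting one, pausing between visits."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        if queue:
            # Be fair to the server: wait before asking for the next page.
            time.sleep(delay)
    return visited


# Stand-in for a website: each page maps to the links found on it.
FAKE_SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}

print(crawl("/", lambda url: FAKE_SITE.get(url, []), delay=0))
```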

In [1]:
import urllib.request  # urlretrieve (used below) lives in the urllib.request submodule

In [2]:
import requests
from bs4 import BeautifulSoup
from splinter import Browser

html_code = requests.get('http://engadget.com/robots.txt')
soup = BeautifulSoup(html_code.content, 'html.parser')
page_list = ['/homepage', '/forward', '/traffic']

# prettify is a method, so it has to be called to get the text back.
print(soup.prettify())

def check_robots(html_input):
    # Exact-match check against the robots.txt we recorded earlier; any change to the live file fails it.
    checktext = 'User-agent: *\r\nSitemap: https://www.engadget.com/sitemap.xml\r\nSitemap: https://www.engadget.com/sitemaps/engadget-sitemap_index_US_en-US.xml.gz\r\nSitemap: https://www.engadget.com/sitemaps/engadget-sitemap_googlenewsindex_US_en-US.xml.gz\r\nSitemap: https://www.engadget.com/sitemaps/engadget-sitemap_googlenews_US_en-US.xml.gz\r\nDisallow: /forward\r\nDisallow: /traffic\r\nDisallow: /mm_track\r\nDisallow: /tag/expire-images*\r\nDisallow: /_uac/adpage.html'
    
    if checktext == html_input:
        # /forward and /traffic are disallowed, so only /homepage is left.
        page_list = ['/homepage']
        return(f'Good news! We can continue scraping, but there are some paths we will avoid! The updated page_list is {page_list}')
    else:
        return("""Make sure that you are able to read in and check out the robots.txt! The current version doesn't appear to match our records""")

    
check_robots(str(soup))


User-agent: *
Sitemap: https://www.engadget.com/sitemap.xml
Sitemap: https://www.engadget.com/sitemaps/engadget-sitemap_index_US_en-US.xml.gz
Sitemap: https://www.engadget.com/sitemaps/engadget-sitemap_googlenewsindex_US_en-US.xml.gz
Sitemap: https://www.engadget.com/sitemaps/engadget-sitemap_googlenews_US_en-US.xml.gz
Disallow: /forward
Disallow: /traffic
Disallow: /mm_track
Disallow: /tag/expire-images*
Disallow: /_uac/adpage.html

"Good news! We can continue scraping, but there are some paths we will avoid! The updated page_list is ['/homepage']"

### 2. The next thing we'll want to do is collect some information from the homepage of Engadget.

- We're going to be looking at some of the main elements on the homepage.

- We'll want to look at the actual soup that we'll be searching through at this point as well.

- First, we will want to get the article names as well as the names of the people writing those articles.

In [3]:
executable_path = {'executable_path' : '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path, headless = False)

browser.visit('http://engadget.com')
html_code = browser.html
soup = BeautifulSoup(html_code, 'html.parser')

In [4]:
soup.find(id = 'engadget-main-dl')

<div class="grid__cell col-12-of-15 col-15-of-15@tl- grid-divider@d" data-ylk="sec:dl;itc:0" id="engadget-main-dl">
<div class="grid-divider-el bc-gray-2 h-85@tl- h-160@d hide@tl-"></div>
<div class="grid flex@tl+">
<!-- primary column -->
<div class="grid__cell col-8-of-12 col-12-of-12@tp-">
<!-- feature-listing-primary-alt -->
<article class="o-hit">
<!-- rating-thumb --><div class="o-rating_thumb c-white">
<div>
<img alt="Lenovo Chromebook Duet review: A surprisingly solid tablet experience" class="stretch-img" src="https://s.yimg.com/os/creatr-uploaded-images/2020-05/c5c21420-9f67-11ea-abfe-b5a8440aeaad"/>
<!-- slideshow-count -->
<svg class="absolute l-0 b-0 slideshow-icon" xmlns="http://www.w3.org/2000/svg"><use xlink:href="#icon-slideshow" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg>
<div class="t-meta absolute l-55 b-20 font-alias">
        15
    </div>
<div class="absolute l-0 b-0 b-55@s"></div>
</div>
<div class="o-rating_thumb__rating b-25@m- z-3"><!-- rating -->

In [5]:
for item in soup.find(id = 'engadget-main-dl').find_all('span'):
    print(item.text)


      Lenovo Chromebook Duet review: A surprisingly solid tablet experience
  

By 
N. Ingraham, 
                 16h ago
By 
N. Ingraham
, 

                 16h ago


share



Philips Hue leaks show new versatility for Lightstrip Plus and Bloom

By 
R. England, 
                 19h ago
By 
R. England
, 

                 19h ago


share



'Pokémon Go' will better blend AR creatures into the real world

By 
J. Fingas, 
                 12h ago
By 
J. Fingas
, 

                 12h ago


share



HBO Max's early sign-up discount ends at 3AM ET

By 
M. DeAngelis, 
                 14h ago
By 
M. DeAngelis
, 

                 14h ago


share



Toxic coast: Cleaning up a century of industrial waste in New Jersey

By 
J. Dinneen, 
                 17h ago
By 
J. Dinneen
, 

                 17h ago


share
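The raw span text above is full of stray newlines and padding. A small, hedged sketch of how it could be tidied is shown below. The `raw_spans` list is a hand-copied sample of the kind of strings printed above, and `normalize` and `SKIP` are names invented for this example.

```python
def normalize(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


# Hand-copied sample of scraped span text (assumption: stands in for item.text values).
raw_spans = [
    "\n      Lenovo Chromebook Duet review: A surprisingly solid tablet experience\n  ",
    "By \nN. Ingraham, \n                 16h ago",
    "\n\nshare\n",
    "",
]

# Widget labels we do not want to keep.
SKIP = {"share"}

cleaned = []
for text in raw_spans:
    tidy = normalize(text)
    if tidy and tidy not in SKIP:
        cleaned.append(tidy)

print(cleaned)
```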





### 3. Next, I'll want to pull some of the photos from the first stories on the page to be downloaded to our local machine

- Although this is a very small example, you can imagine this data being extremely useful if sliced correctly.

- I've had friends of mine build scrapers in the past to get all sorts of valuable data on products online.

- Tagged photos are hard to come by, so if you have a photo with some amount of information attached then it could be exceedingly useful for training models or for a computer vision project.

In [6]:
[print(item) for item in soup.find(id = 'engadget-main-dl').find_all('img')]

<img alt="Lenovo Chromebook Duet review: A surprisingly solid tablet experience" class="stretch-img" src="https://s.yimg.com/os/creatr-uploaded-images/2020-05/c5c21420-9f67-11ea-abfe-b5a8440aeaad"/>
<img class="vc circle-mask absolute l-0 " height="30px" src="https://o.aolcdn.com/images/dims?thumbnail=30%2C30&amp;quality=70&amp;image_uri=https%3A%2F%2Fs.yimg.com%2Fuu%2Fapi%2Fres%2F1.2%2FKDPNKi2quaEStzz2SR78Sw--%7EB%2FaD0xMjAwO3c9MTIwMDthcHBpZD15dGFjaHlvbg--%2Fhttps%3A%2F%2Fs.blogcdn.com%2Fwww.engadget.com%2Fmedia%2F2017%2F08%2Funnamed.jpg&amp;client=amp-blogside-v2&amp;signature=ebba7bd501c6f716e81d996fac118313a4c717aa" width="30px"/>
<img alt="Philips Hue leaks show new versatility for Lightstrip Plus and Bloom" class="stretch-img" src="https://s.yimg.com/os/creatr-uploaded-images/2020-05/a43bb2b0-9f4f-11ea-9f97-16adc26ef4a0"/>
<img class="vc circle-mask absolute l-0 hide@tl" height="30px" src="https://o.aolcdn.com/images/dims?thumbnail=30%2C30&amp;quality=70&amp;image_uri=https%3A%2F

[None, None, None, None, None, None, None, None, None, None, None]

In [7]:
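photoslist = []
for item in soup.find(id = 'engadget-main-dl').find_all('img'):
    # Keep only images whose tag mentions the uploaded-images path.
    if 'creatr-uploaded-images' in str(item):
        photoslist.append(item.get('src'))
        display(item.get('src'))
        
# The last match is a 30x30 thumbnail served through o.aolcdn.com, so drop it.
photoslist = photoslist[:-1]

# print() returns None, which is why display() shows None before the list.
display(print('\n\n\n'), photoslist)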

'https://s.yimg.com/os/creatr-uploaded-images/2020-05/c5c21420-9f67-11ea-abfe-b5a8440aeaad'

'https://s.yimg.com/os/creatr-uploaded-images/2020-05/a43bb2b0-9f4f-11ea-9f97-16adc26ef4a0'

'https://s.yimg.com/os/creatr-uploaded-images/2020-05/92e627a0-9f92-11ea-bf93-60346e9686cf'

'https://s.yimg.com/os/creatr-uploaded-images/2020-05/c0995b40-9f74-11ea-8ee6-af1bdef8069f'

'https://s.yimg.com/os/creatr-uploaded-images/2020-05/c393dff0-9f5b-11ea-bddd-1085af783093'

'https://o.aolcdn.com/images/dims?thumbnail=30%2C30&quality=70&image_uri=https%3A%2F%2Fs.yimg.com%2Fos%2Fcreatr-uploaded-images%2F2020-05%2F5eac5560-9b76-11ea-befb-068b19fc35fd&client=amp-blogside-v2&signature=45a2bcedea7eda99f0a8e7086351500daf182df5'







None

['https://s.yimg.com/os/creatr-uploaded-images/2020-05/c5c21420-9f67-11ea-abfe-b5a8440aeaad',
 'https://s.yimg.com/os/creatr-uploaded-images/2020-05/a43bb2b0-9f4f-11ea-9f97-16adc26ef4a0',
 'https://s.yimg.com/os/creatr-uploaded-images/2020-05/92e627a0-9f92-11ea-bf93-60346e9686cf',
 'https://s.yimg.com/os/creatr-uploaded-images/2020-05/c0995b40-9f74-11ea-8ee6-af1bdef8069f',
 'https://s.yimg.com/os/creatr-uploaded-images/2020-05/c393dff0-9f5b-11ea-bddd-1085af783093']
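Slicing off the last element works for this page, but it depends on the thumbnail always being the final match. A sketch of a slightly sturdier filter is below. It assumes, based only on the output above, that full-size story photos are served directly from `s.yimg.com`; `is_story_photo` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_story_photo(src):
    """True when the image is served directly from the uploaded-images path."""
    parsed = urlparse(src)
    return parsed.netloc == "s.yimg.com" and "/creatr-uploaded-images/" in parsed.path


# Two URLs copied from the cell output above: one full-size photo, one thumbnail.
sources = [
    "https://s.yimg.com/os/creatr-uploaded-images/2020-05/c5c21420-9f67-11ea-abfe-b5a8440aeaad",
    "https://o.aolcdn.com/images/dims?thumbnail=30%2C30&quality=70&image_uri=https%3A%2F%2Fs.yimg.com%2Fos%2Fcreatr-uploaded-images%2F2020-05%2F5eac5560-9b76-11ea-befb-068b19fc35fd&client=amp-blogside-v2&signature=45a2bcedea7eda99f0a8e7086351500daf182df5",
]

story_photos = [src for src in sources if is_story_photo(src)]
print(story_photos)
```

The thumbnail is rejected because its host is `o.aolcdn.com`, even though the uploaded-images path appears in its query string.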

In [8]:
# Number the downloaded files starting from 1.
for counter, item in enumerate(photoslist, start=1):
    urllib.request.urlretrieve(item, f'scraped_picture{counter}.jpg')

In [9]:
! open .

### 4. Now, I'll want to collect the urls of some of the stories that are available on engadget.

- There are tons of reasons why you may want to do this, but the example that we'll be talking about covers the very basics of language processing.

- You could have a stock trading algorithm trained on stock market data as well as news stories. These algorithms are already incredibly prevalent and can be trained on any number of pages at any time. In our example, we'll be "crawling" a small part of Engadget and pulling in all of the information from the articles that are linked to from the pages we'll be crawling.

- We will talk through the code that was run here. We will step through getting all of the stories on a page (and talk about how we could do several pages), and then continue on to get the information from each story in the list of articles.

- https://www.engadget.com/all/page/1/

In [10]:
browser.visit('https://www.engadget.com/all/page/1/')
soup = BeautifulSoup(browser.html, 'html.parser')

In [11]:
str(soup.find(id = 'engadget-the-latest'))[:5000]

'<div class="grid@tl+__cell col-8-of-12@tl col-11-of-15@d flex-1" data-ylk="sec:thelatest;itc:0;" id="engadget-the-latest">\n<div class="">\n<div class="container@m container@tp">\n<!-- latest-listing-featured -->\n<article class="">\n<div class="grid@m+">\n<div class="hide@tp- col-1-of-8@tl+ col-1-of-11@d grid@m+__cell">\n<div class="c-gray-7">\n<svg class="icon vt" xmlns="http://www.w3.org/2000/svg">\n<use xlink:href="#icon-clock" xmlns:xlink="http://www.w3.org/1999/xlink"></use>\n</svg>\n<div class="inline-block tx-meta ml-5 icon-line-height absolute">\n          1h \n        </div>\n</div>\n</div>\n<div class="col-4-of-4@tp col-7-of-8@tl+ col-10-of-11@d grid@m+__cell mb-10@m+">\n<div class="relative c-white">\n<div class="absolute l-0 b-0 b-5@s"></div>\n<div class="o-rating_thumb@m-"><a data-rapid_p="1" data-v9y="1" data-ylk="pos:1;slk:Google%20adds%201440p%20streaming%20resolution%20for%20Stadia%20on%20Chrome;elm:hdln;aid:engadget_479%3Dbsd%3Aac0bbc22-60e2-3ad0-b787-f19e84720d5e;"

In [12]:
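art_name = []
for item in soup.find(id = 'engadget-the-latest').find_all('h2'):
    # Headline <h2> tags on this page carry this exact class string.
    if "mt-10@tp+ t-h4@s t-h3-c@m t-h3-b@tp t-h4@tl t-h3@d" in str(item):
        art_name.append(item.text)
        
# Skip the first headline so the names line up with art_refs below, which also skips the first article.
art_name = art_name[1:]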

In [13]:
len(art_name)

21

In [26]:
[print(item) for item in art_name[:5]]



                  YouTube Kids is now on Apple TV
              



                  LG's first 48-inch 4K OLED TV is starting to roll out
              



                  South Korean cafe uses robotic baristas to comply with social distancing
              



                      Sleep soundly every night with this top-rated app
                    



                  MacOS Catalina update wrings more life from your MacBook's battery
              



[None, None, None, None, None]

In [15]:
soup.find(id = 'engadget-the-latest').find_all('article')[1].find_all('a')[-1].get('href')

'/you-tube-kids-on-apple-tv-043113790.html'

In [16]:
art_refs = []
for item in soup.find(id = 'engadget-the-latest').find_all('article')[1:]:
    art_refs.append(item.find_all('a')[-1].get('href'))

In [17]:
len(art_refs)

21

In [18]:
import pandas as pd
articles = pd.DataFrame({'article_names' : art_name, 'article_urls' : art_refs})

### After getting both the article names and their urls, we will go back through one final time to visit every page and pull the article text out of each one.

In [19]:
articles

| | article_names | article_urls |
|---|---|---|
| 0 | \n\n YouTube Kids is now on A... | /you-tube-kids-on-apple-tv-043113790.html |
| 1 | \n\n LG's first 48-inch 4K OL... | /lg-48cx-4k-oled-022836116.html |
| 2 | \n\n South Korean cafe uses r... | /south-korea-cafe-robotic-baristas-011123262.html |
| 3 | \n\n Sleep soundly every ... | https://beap.gemini.yahoo.com/mbclk?bv=1.0.0&e... |
| 4 | \n\n MacOS Catalina update wr... | /macos-catalina-battery-health-management-upda... |
| 5 | \n\n Twitter fact checks Trum... | /twitter-fact-checks-donald-trump-223653101.html |
| 6 | \n\n Google's work from home ... | /google-work-from-home-allowance-221701178.html |
| 7 | \n\n Microsoft Edge has a cut... | /microsoft-edge-surf-game-214711550.html |
| 8 | \n\n This Ableton and Log... | https://beap.gemini.yahoo.com/mbclk?bv=1.0.0&e... |
| 9 | \n\n Blizzard has canceled th... | /blizzard-blizzcon-event-canceled-covid-19-cor... |


In [20]:
# Sponsored links point off-site with absolute https URLs; keep only the site-relative article paths.
clean_df = articles[~articles.article_urls.str.contains('https')]

final_urls = list(clean_df.article_urls)

top_df = clean_df.head()
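Filtering on the substring `https` does the job here, but it would also drop any on-site link that happened to be written as a full URL. A hedged alternative is sketched below using `urllib.parse`: it resolves every href against the site root and keeps the ones that stay on the same host. The `BASE` constant, the `on_site_urls` helper, and the `ads.example.com` link are assumptions made up for the demo.

```python
from urllib.parse import urljoin, urlparse

# Assumed site root for resolving relative links.
BASE = "https://www.engadget.com"


def on_site_urls(hrefs, base=BASE):
    """Resolve hrefs against base and keep only those on the same host."""
    base_host = urlparse(base).netloc
    result = []
    for href in hrefs:
        absolute = urljoin(base, href)
        if urlparse(absolute).netloc == base_host:
            result.append(absolute)
    return result


hrefs = [
    "/you-tube-kids-on-apple-tv-043113790.html",
    "https://ads.example.com/click?id=1",  # made-up stand-in for a sponsored link
    "/lg-48cx-4k-oled-022836116.html",
]

print(on_site_urls(hrefs))
```

A side benefit is that the result is already a list of absolute URLs, so nothing needs to be glued onto the front of each path later.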

In [21]:
def get_article_text(url_part):
    # Visit the article and parse the rendered HTML.
    browser.visit(f'https://engadget.com{url_part}')
    temp_soup = BeautifulSoup(browser.html, 'html.parser')

    # Join the text of every paragraph in the article body, with no separator.
    paragraphs = temp_soup.find(id = 'page_body').find_all('p')
    return ''.join(item.get_text() for item in paragraphs)

In [22]:
# Take an explicit copy first so pandas does not emit a SettingWithCopyWarning on assignment.
top_df = top_df.copy()
top_df['arts'] = top_df.article_urls.apply(get_article_text)


In [23]:
browser.visit(f'https://engadget.com{final_urls[0]}')

last_soup = BeautifulSoup(browser.html, 'html.parser')
last_soup.find(id = 'page_body').find_all('p')

[<p>YouTube’s dedicated app for kids is now <a data-rapid_p="1" data-v9y="1" href="https://support.google.com/youtubekids/thread/49349707" target="_blank">out for download</a> from the App Store on Apple TV, so long as it’s available in your region. It works on both the 4K and the HD versions of the device, and you can use the Siri Remote to fire it up by saying “Hey Siri, open YouTube Kids.” While you can let your kids use the app without signing in, you’ll have to log in to import your current parental control settings, though you can always change them through the app on a phone or a tablet.</p>,
 <p>YouTube Kids launched as an app for mobile devices in 2015 to provide children a curated selection of age-appropriate content. Since then, YouTube has rolled out the app to various smart TVs and eventually launched a <a data-rapid_p="2" data-v9y="0" href="https://www.engadget.com/2019-08-29-youtube-kids-website.html">desktop version</a>. The platform has been dealing with <a data-rapid_

In [24]:
top_df.arts[0]

'YouTube’s dedicated app for kids is now out for download from the App Store on Apple TV, so long as it’s available in your region. It works on both the 4K and the HD versions of the device, and you can use the Siri Remote to fire it up by saying “Hey Siri, open YouTube Kids.” While you can let your kids use the app without signing in, you’ll have to log in to import your current parental control settings, though you can always change them through the app on a phone or a tablet.YouTube Kids launched as an app for mobile devices in 2015 to provide children a curated selection of age-appropriate content. Since then, YouTube has rolled out the app to various smart TVs and eventually launched a desktop version. The platform has been dealing with child-exploitative videos masquerading as child-friendly content on its website for years. While the YouTube Kids app has had its own share of issues — back in 2018, it suggested conspiracy theory videos when you search for certain keywords — it co
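With the article text in hand, the very basics of language processing mentioned earlier can start with something as simple as counting words. The sketch below is only an illustration under stated assumptions: the `sample` sentence is made up rather than taken from the scraped article, and `top_words` is a hypothetical helper that uses a deliberately crude tokenizer.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most common lowercase words in text with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Made-up sample text standing in for one scraped article.
sample = "YouTube Kids launched as an app. The app is now on Apple TV, and the app works with Siri."

print(top_words(sample, n=2))
```

A real pipeline would usually also remove very common words such as "the" before counting.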

### We'll want to briefly talk about how to move through several pages.

- Again, there are so many different use-cases for scraping that you'll want to think carefully about your goals.

- In this specific use-case, we may want to visit several pages, get all of the article names and urls, and then send another scraper through to pull information from all of the articles.

- Again, you will always want to make sure that you are scraping ethically and in a way that won't cause you any trouble in the future. Much like with APIs, it is best practice to pull a page once, learn the ins and outs of it, and then, once you have built a good parser, run it over several pages of the same style.

- Nearly every website will have pages that repeat, and once you understand how that html scaffold looks, you'll be able to parse it very efficiently.

- Keep in mind that the more obviously non-human the activity of your scraper is, the more likely it is that you'll be caught.

In [25]:
# Visit listing pages 1 through 5.
for num in range(1, 6):
    browser.visit(f'https://www.engadget.com/all/page/{num}/')
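The loop above requests five pages back to back. A hedged sketch of a gentler version is below: it builds the same page URLs and waits a short, slightly varied time between visits so the server is not hit in a burst. The `visit_pages` helper and its delay parameters are assumptions for illustration, and `visit` is passed in as a plain function (for example `browser.visit`) so the example can run without a browser.

```python
import random
import time


def visit_pages(visit, first=1, last=5, min_delay=1.0, max_delay=3.0):
    """Call visit(url) for each listing page, pausing between requests."""
    urls = [f"https://www.engadget.com/all/page/{num}/" for num in range(first, last + 1)]
    for position, url in enumerate(urls):
        visit(url)
        if position < len(urls) - 1:
            # Pause between pages; no need to wait after the last one.
            time.sleep(random.uniform(min_delay, max_delay))
    return urls


# Demo with a stand-in visit function that just records the URLs it was given.
visited = []
visit_pages(visited.append, first=1, last=3, min_delay=0, max_delay=0)
print(visited)
```

With a real browser the call would look like `visit_pages(browser.visit)`, followed by parsing `browser.html` inside the function passed as `visit` if each page needs to be processed.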