# Scraping Web Pages with Beautiful Soup and Python 

### DISCLAIMER: The website crawled in a major part of this notebook contains strong language. Programmer's discretion is advised.

# 1. Working with Web Data Using Requests and Beautiful Soup 
## Collecting a Web Page with Requests

In [1]:
import requests
url = 'https://motherfuckingwebsite.com/'


Next, we assign the result of a request for that page to the variable <code>page_content</code> using the <code>requests.get()</code> method. We pass the page’s URL (stored in the <code>url</code> variable) to that method.

In [4]:
page_content = requests.get(url)

In [5]:
page_content.status_code

200

The returned code of 200 tells us that the page downloaded successfully. Codes that begin with the number 2 generally indicate success, while codes that begin with a 4 or 5 indicate that an error occurred.<br><br>
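As a rough illustration of those status-code ranges, the sketch below labels a code by the range it falls in, using only the standard library. The helper name <code>describe_status</code> is hypothetical and is not part of Requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a coarse label for an HTTP status code based on its range."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "informational or unknown"


for code in (200, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for that code
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

With Requests itself, calling <code>raise_for_status()</code> on the response raises an exception for 4xx and 5xx codes, which saves checking the number by hand.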
In order to work with web data, we’re going to want to access the text-based content of web files. We can read the content of the server’s response with <code>page_content.text</code>:

In [6]:
page_content.text

'<!DOCTYPE html>\n<html>\n<head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    \n    <!-- FOR THE CURIOUS: This site was made by @thebarrytone. Don\'t tell my mom. -->\n    \n    <title>Motherfucking Website</title>\n</head>\n\n<body>\n    <header>\n        <h1>This is a motherfucking website.</h1>\n        <aside>And it\'s fucking perfect.</aside>\n    </header>\n        \n        <h2>Seriously, what the fuck else do you want?</h2>\n        \n        <p>You probably build websites and think your shit is special. You think your 13 megabyte parallax-ative home page is going to get you some fucking Awwward banner you can glue to the top corner of your site. You think your 40-pound jQuery file and 83 polyfills give IE7 a boner because it finally has box-shadow. Wrong, motherfucker. Let me describe your perfect-ass website:</p>\n        \n        <ul>\n            <li>Shit\'s lightweight and loads fast</li>\n            <li>Fits 

We see that the full text of the page was printed out, with all of its HTML tags. However, it is difficult to read because there is not much spacing.<br><br>

### Now we can use the Beautiful Soup module to work with this textual data in a more readable, human-friendly way.

## Stepping through a web page with BeautifulSoup

The Beautiful Soup library creates a parse tree from parsed HTML and XML documents (including documents with non-closed tags or tag soup and other malformed markup). 
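To see what tolerating malformed markup looks like in practice, here is a minimal sketch that parses a deliberately broken snippet. The HTML string is made up for illustration; only <code>BeautifulSoup</code>, <code>find()</code> and <code>get_text()</code> come from the library.

```python
from bs4 import BeautifulSoup

# Made-up snippet: neither the <p> nor the <b> tag is ever closed
broken_html = "<p>An unclosed paragraph with <b>bold text"

soup = BeautifulSoup(broken_html, "html.parser")

# Both tags are still reachable in the parse tree
bold_text = soup.find("b").get_text()
paragraph_text = soup.find("p").get_text()
print(bold_text)
print(paragraph_text)
```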

In [8]:
from bs4 import BeautifulSoup
crawled = BeautifulSoup(page_content.text, 'html.parser')
print(crawled.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- FOR THE CURIOUS: This site was made by @thebarrytone. Don't tell my mom. -->
  <title>
   Motherfucking Website
  </title>
 </head>
 <body>
  <header>
   <h1>
    This is a motherfucking website.
   </h1>
   <aside>
    And it's fucking perfect.
   </aside>
  </header>
  <h2>
   Seriously, what the fuck else do you want?
  </h2>
  <p>
   You probably build websites and think your shit is special. You think your 13 megabyte parallax-ative home page is going to get you some fucking Awwward banner you can glue to the top corner of your site. You think your 40-pound jQuery file and 83 polyfills give IE7 a boner because it finally has box-shadow. Wrong, motherfucker. Let me describe your perfect-ass website:
  </p>
  <ul>
   <li>
    Shit's lightweight and loads fast
   </li>
   <li>
    Fits on all your shitty screens
   </li>
   <li>
    Looks the same in

Now it's literally __Beautiful__ compared with the previous __Soup__...<br>
😂

## Finding and extracting 'a' particular Tag.

We can extract every instance of a particular tag from a page by using Beautiful Soup’s <code>find_all()</code> method. This returns all instances of the given tag within a document.<br><br>Running that method on our object returns each paragraph of the page along with its enclosing <code>p</code> tags and any tags nested inside them:

In [10]:
crawled.find_all('p')

[<p>You probably build websites and think your shit is special. You think your 13 megabyte parallax-ative home page is going to get you some fucking Awwward banner you can glue to the top corner of your site. You think your 40-pound jQuery file and 83 polyfills give IE7 a boner because it finally has box-shadow. Wrong, motherfucker. Let me describe your perfect-ass website:</p>,
 <p>You. Are. Over-designing. Look at this shit. It's a motherfucking website. Why the fuck do you need to animate a fucking trendy-ass banner flag when I hover over that useless piece of shit? You spent hours on it and added 80 kilobytes to your fucking site, and some motherfucker jabbing at it on their iPad with fat sausage fingers will never see that shit. Not to mention blind people will never see that shit, but they don't see any of your shitty shit.</p>,
 <p>You never knew it, but this is your perfect website. Here's why.</p>,
 <p>This entire page weighs less than the gradient-meshed facebook logo on your

The output above shows that the data is contained in square brackets. This means it behaves like a Python list (Beautiful Soup returns a <code>ResultSet</code>, which is a subclass of <code>list</code>).<br>
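Since the result is list-like, the usual list operations should apply to it. The sketch below checks this on a small made-up HTML string rather than on the page crawled above; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two paragraphs
sample_html = """
<body>
  <p>First paragraph.</p>
  <p>Second <em>paragraph</em>.</p>
</body>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
paragraphs = sample_soup.find_all("p")

is_list = isinstance(paragraphs, list)  # the result is a list subclass
count = len(paragraphs)                 # len() works as on any list
first = paragraphs[0]                   # so does indexing
last = paragraphs[-1]                   # including negative indexes

print(is_list, count)
print(first)
print(last)
```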

Because it is a list, we can call a particular item within it and use the <code>get_text()</code> method to extract all the text from inside that tag.<br><br>
So, suppose we want to extract the text from the sixth <code>p</code> tag in the page...we can easily do this by:

In [11]:
crawled.find_all('p')[5].get_text()

"Look at this shit. You can read it ... that is, if you can read, motherfucker. It makes sense. It has motherfucking hierarchy. It's using HTML5 tags so you and your bitch-ass browser know what the fuck's in this fucking site. That's semantics, motherfucker."

### Now the real thing starts,
# _(Actual)_ Data Scraping using BS and Python

## Data

I'll be working with data from the official website of the National Gallery of Art in the United States.<br>
I would like to search the Index of Artists available via the Internet Archive’s Wayback Machine at the following URL:

https://web.archive.org/web/20170131230332/https://www.nga.gov/collection/an.shtm

From https://web.archive.org/web/20121007172927/http://www.nga.gov/collection/anM1.htm, I want to get all artists whose last names begin with 'M'.

### let's start
## Collecting and Parsing a Web Page

In [13]:
import requests
from bs4 import BeautifulSoup

#getting contents of first page
page = requests.get("https://web.archive.org/web/20121007172927/http://www.nga.gov/collection/anM1.htm")
artistic_soup = BeautifulSoup(page.text, 'html.parser')


With our page collected, parsed, and set up as a BeautifulSoup object, we can move on to collecting the data that we would like.

## Pulling Text From a Web Page
I’ll collect artists’ names and the relevant links available on the website. You may want to collect different data, such as the artists’ nationality and dates. Whatever data you would like to collect, you need to find out how it is described by the DOM of the web page.

Inspecting the page's elements in the browser, we see first that the table of names is within div tags where class="BodyText". This is important to note so that we only search for text within this section of the web page. We also notice that the name Maar, Dora is in a link tag, since the name references a web page that describes the artist. So we will want to reference the _a_ tag for links. Each artist’s name is the text of a link.

Let's use Beautiful Soup’s <code>find()</code> and <code>find_all()</code> methods to pull the artists’ links from the BodyText _div_:

In [25]:
# Find the first element with class="BodyText" (the div holding the table of names)
artist_name_list = artistic_soup.find(class_='BodyText')

# Collect every <a> tag inside that div
artist_name_list_items = artist_name_list.find_all('a')
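The difference between the two methods can be sketched on a small made-up page: <code>find()</code> returns the first match (or <code>None</code> when nothing matches), while <code>find_all()</code> returns every match. The markup and the <code>/artist/1</code> style links below are invented stand-ins, not the gallery's real HTML.

```python
from bs4 import BeautifulSoup

# Invented markup that only mimics the layout described above
sample_html = """
<div class="Header"><a href="/home">Home</a></div>
<div class="BodyText">
  <a href="/artist/1">Maar, Dora</a>
  <a href="/artist/2">Mabuse, Jan</a>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() gives one element, or None if the class is absent
body = sample_soup.find(class_="BodyText")
if body is None:
    artist_links = []
else:
    # find_all() on that element only searches inside it
    artist_links = body.find_all("a")

artist_names = [link.get_text() for link in artist_links]
missing = sample_soup.find(class_="NoSuchClass")
print(artist_names)
print(missing)
```

Restricting the search to the BodyText element is what keeps the "Home" link in the header out of the results.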


## Pulling the Contents from a Tag

In order to access only the actual artists’ names, we’ll want to target the contents of the <b>a</b> tags rather than print out the entire link tag.

We can do this with Beautiful Soup’s <code>.contents</code>, which will return the tag’s children as a Python list data type.
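When a link holds only text, the first item of <code>.contents</code> is that text. If the tag also contains nested tags, the list has several children and <code>get_text()</code> may be the more convenient choice. A minimal sketch on an invented link (the <code>i</code> tag and the href are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Invented link whose text is split by a nested <i> tag
sample_soup = BeautifulSoup('<a href="/artist/1">Maar, <i>Dora</i></a>', "html.parser")
link = sample_soup.find("a")

children = link.contents      # a list: one text node and one <i> tag
first_child = children[0]     # only the text before the nested tag
full_text = link.get_text()   # text of the tag and everything nested in it

print(children)
print(first_child)
print(full_text)
```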

In [27]:
# Use .contents to pull out the <a> tag’s children
for artist_name in artist_name_list_items:
    links = 'https://web.archive.org' + artist_name.get('href')
    names = artist_name.contents[0]
    print(links)
    print(names)

https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=10865
Maar, Dora
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=1352
Mabuse, Jan
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=14550
Mac Orlan, Pierre
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=4771
MacArdell, James
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=4772
MacCoy, Guy
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=4773
MacDermott, David
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=4774
MacDermott, Diane Conard
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=14611
Macdonald-Wright, Stanton
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=7796
Mace, Frank J.
https://web.arc

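I added <br><code>links = 'https://web.archive.org' + artist_name.get('href')<br>print(links)</code><br>
because I also wanted to capture the URLs associated with those artists. The URL inside an <code>a</code> tag can be extracted with the tag's <code>get('href')</code> method. The <code>href</code> values on this archived page start at the site root, so the code prefixes them with <code>https://web.archive.org</code>.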
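String concatenation works here because every link on this page starts at the site root. As a hedged alternative, the standard library's <code>urljoin</code> resolves an href against the URL of the page it came from and can also cope with page-relative and already-absolute links. The <code>example.com</code> and <code>example.org</code> addresses below are placeholders, not the real archive URLs.

```python
from urllib.parse import urljoin

# Placeholder URL standing in for the page that was scraped
base_url = "https://example.com/collection/anM1.htm"

# href that starts at the site root
root_relative = urljoin(base_url, "/cgi-bin/tsearch?artistid=10865")

# href relative to the current page's directory
page_relative = urljoin(base_url, "anM2.htm")

# href that is already a full URL is returned unchanged
already_absolute = urljoin(base_url, "https://example.org/about")

print(root_relative)
print(page_relative)
print(already_absolute)
```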

## Writing the Data to a CSV File (cool stuff)

In [33]:
# total code till now...
import requests
import csv
from bs4 import BeautifulSoup

#getting contents of first page
page = requests.get("https://web.archive.org/web/20121007172927/http://www.nga.gov/collection/anM1.htm")
artistic_soup = BeautifulSoup(page.text, 'html.parser')

# Pull the BodyText div that holds the table of names
artist_name_list = artistic_soup.find(class_='BodyText')

# Collect every <a> tag within the BodyText div
artist_name_list_items = artist_name_list.find_all('a')

# Open the CSV in a with-block so the file is flushed and closed when done;
# newline='' is what the csv module expects for files it writes
with open('artist-names.csv', 'w', newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)
    f.writerow(['Name of the Artist', 'Official Link'])

    # Use .contents to pull out the <a> tag’s children
    for artist_name in artist_name_list_items:
        links = 'https://web.archive.org' + artist_name.get('href')
        names = artist_name.contents[0]
        print(names)
        print(links)
        f.writerow([names, links])

Maar, Dora
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=10865
Mabuse, Jan
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=1352
Mac Orlan, Pierre
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=14550
MacArdell, James
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=4771
MacCoy, Guy
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=4772
MacDermott, David
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=4773
MacDermott, Diane Conard
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=4774
Macdonald-Wright, Stanton
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=14611
Mace, Frank J.
https://web.archive.org/web/20121007172927/http://www.nga.gov/cgi-bin/tsearch?artistid=7796
MacEdwards, Bar
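Because every name contains a comma ("Maar, Dora"), it is worth checking that the CSV round-trips cleanly. The sketch below writes two rows to an in-memory buffer instead of a file and reads them back; the <code>example.com</code> link is a placeholder, not a real gallery URL.

```python
import csv
import io

rows = [
    ["Name of the Artist", "Official Link"],
    ["Maar, Dora", "https://example.com/artist/1"],  # placeholder link
]

# Write to an in-memory buffer so no file is created
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back and confirm the name is still a single field
read_back = list(csv.reader(io.StringIO(csv_text, newline="")))
print(read_back[1][0])
```

The writer wraps the name in double quotes in the CSV text, so the comma inside it is not mistaken for a column separator when the data is read back.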