# DataCamp Web Scraping Notes 

## Chapter 1 

### HyperText Markup Language
Most of us have at least heard of HTML, the HyperText Markup Language that web browsers read to render and display website content. For us, this means that when we want to scrape content from a particular website, what we usually have to work with is its HTML code.

### HTML tags
The elements enclosed in angle brackets are called "HTML tags". In well-formatted HTML they usually come in pairs: a starting tag without a forward slash and a stopping tag with a forward slash, such as `<p>` and `</p>`.
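To make the pairing concrete, here is a minimal sketch that uses Python's standard-library `html.parser` to record each starting and stopping tag. The `TagLogger` class and the sample string are illustrative assumptions, not part of the course material.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every starting and stopping tag in the order encountered."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("stop", tag))


logger = TagLogger()
logger.feed("<div><p>Visit DataCamp!</p></div>")
logger.close()

# Each starting tag is matched by a stopping tag, innermost pair first.
for kind, tag in logger.events:
    print(kind, tag)
```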

### The HTML tree
We notice that HTML tags are nested within others: the body tag is nested within the html root tag, a div tag within the body tag, two paragraph tags within the div tag, and so on. This nesting gives rise to a hierarchy in the HTML which can be visualized as a tree structure. The vocabulary we use to describe moving around the tree comes from looking at a family tree. With the tree drawn from left to right, moving left to right means moving forward through generations; moving top to bottom means staying within the same generation, and moving between siblings if the elements come from the same parent element.
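The generations can be made visible with a short sketch that walks the tree and indents each tag by its depth. It uses the standard-library `xml.etree.ElementTree`, which is an XML parser and so only works here because the sample snippet is well-formed; the `walk` helper and the sample string are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

doc = "<html><body><div><p>One</p><p>Two</p></div></body></html>"
root = ET.fromstring(doc)


def walk(element, depth=0):
    """Yield (depth, tag) pairs; depth counts generations below the root."""
    yield depth, element.tag
    for child in element:
        yield from walk(child, depth + 1)


outline = list(walk(root))

# Deeper indentation means a later generation; equal indentation under the
# same parent means siblings.
for depth, tag in outline:
    print("  " * depth + tag)
```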

### Tag, you're it!
To start, let's look at the abstract format of a tag. Many HTML tag types follow the same format; we have already seen three tag names: html, div, and p. Tags can also contain **attributes**, which provide special instructions for the contents of that tag. An HTML attribute name is followed by an equals sign and then the information being passed to that attribute; in well-formatted HTML this information is in quotes.

In [None]:
# HTML code string
html = '''
<html>
  <body>
    <div class="class1" id="div1">
      <p class="class2">Visit DataCamp!</p>
    </div>
    <div class="you-are-classy">
      <p class="class2">Keep up the good work!</p>
    </div>
  </body>
</html>
'''
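# Print out the class of the second div element.
# whats_my_class() is not defined in these notes; it is presumably a helper
# preloaded by the course exercise.
whats_my_class(html)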
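Since `whats_my_class` is not defined in these notes, the following is one possible way to read the same attribute with the standard library. It is a sketch under the assumption that the HTML is well-formed enough to parse as XML, not the course's own implementation.

```python
import xml.etree.ElementTree as ET

html = '''
<html>
  <body>
    <div class="class1" id="div1">
      <p class="class2">Visit DataCamp!</p>
    </div>
    <div class="you-are-classy">
      <p class="class2">Keep up the good work!</p>
    </div>
  </body>
</html>
'''

root = ET.fromstring(html.strip())
first_div, second_div = root.findall("./body/div")

# .attrib holds every attribute of a tag as a dict; .get() reads one of them.
print(first_div.attrib)
print(second_div.get("class"))
```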

### Another Slasher Video?
Jumping right in, a simple XPath string we could write in Python is given here. One nice property of XPath notation is that you might already be familiar with similar syntaxes: it uses a single forward slash in much the same way as when you navigate directories or type a URL into your browser. The single forward slash moves us forward one generation. In fact, if we think of the tag names as "directory" names, then these simple XPaths look very much like navigating between directories. What might seem unfamiliar are the brackets. The brackets help specify which element or elements we want to navigate to. For example, there could be several div elements which are children of the body element (that is, several div siblings), so we can use the brackets to narrow in on the div element we want.

In [None]:
xpath = '/html/body/div[2]/p'
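As a rough check of what this path selects, the sketch below evaluates it with the standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath. ElementTree paths are relative to the element they are called on, so the leading `/html` is replaced by `.` because `root` already is the html element; the sample document is an assumption reused from the earlier exercise.

```python
import xml.etree.ElementTree as ET

html = """
<html>
  <body>
    <div class="class1" id="div1"><p class="class2">Visit DataCamp!</p></div>
    <div class="you-are-classy"><p class="class2">Keep up the good work!</p></div>
  </body>
</html>
"""

root = ET.fromstring(html.strip())

# '/html/body/div[2]/p' rewritten relative to the html root element.
# XPath positions start at 1, so div[2] is the second div child of body.
paragraphs = root.findall("./body/div[2]/p")
texts = [p.text for p in paragraphs]
print(texts)
```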

### Slasher Double Feature?
Another important feature of XPath notation is the double forward slash. The double forward slash says to "look forward to all future generations" (instead of just one generation, like the single forward slash). So, for example, we could navigate to all table elements within an HTML document by simply typing double forward slash table. Or we might want to restrict ourselves to a specific div element (say, the one we learned how to navigate to in the last couple of slides) and navigate to all table elements which are descendants of that div element.

In [None]:
xpath = '//span[@class="span-class"]'
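The three uses of the double forward slash mentioned above can be tried out on a small document. This is a sketch using the standard-library `xml.etree.ElementTree` (a limited XPath subset, with paths written relative to the root element); the document and its `id` values are made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<html>
  <body>
    <div><table id="t1"></table></div>
    <div>
      <span class="span-class">inside</span>
      <table id="t2"></table>
    </div>
    <table id="t3"></table>
  </body>
</html>
"""

root = ET.fromstring(doc.strip())

# '//table': every table in the document, at any depth.
all_tables = [t.get("id") for t in root.findall(".//table")]

# '/html/body/div[2]//table': only tables descended from the second div.
div_tables = [t.get("id") for t in root.findall("./body/div[2]//table")]

# '//span[@class="span-class"]': every span with that exact class value.
spans = [s.text for s in root.findall('.//span[@class="span-class"]')]

print(all_tables, div_tables, spans)
```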

### To Bracket or not to Bracket
The first XPath expression (without brackets) and the second (with brackets) lead us to the same element, the body element, since there is only one html element at the root level and only one body element which is a child of that html element.
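The two expressions themselves are not reproduced in these notes; the sketch below assumes they were `/html/body` and `/html[1]/body[1]`. Because `xml.etree.ElementTree` only accepts relative paths, the html root is attached to a stand-in `document` node (a hypothetical name) so that both paths can be written with a leading `.`.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<html><body><p>Hello</p></body></html>")

# Stand-in for the document node that sits above the html root element.
document = ET.Element("document")
document.append(root)

without_brackets = document.findall("./html/body")
with_brackets = document.findall("./html[1]/body[1]")

# Both lists hold the very same single body element.
print(without_brackets == with_brackets)
```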

### Double Slashing the Brackets
Using a double forward slash, we could have selected all paragraph elements within the HTML document. Adding the bracketed number 1, it turns out we select two elements! Let's take this opportunity to be very careful in describing exactly what adding brackets filled with a number does. When we add brackets filled with the number N to the end of an XPath expression, each of the elements selected before adding the brackets asks: "Am I the Nth of my selected siblings?"; if the answer is "Yes!", then that element is selected. Here, with the brackets filled with the number 1, the top paragraph element is selected because it asks "Am I the first element of my selected siblings?" and answers "Yes!". The bottom paragraph element is also selected: its only sibling is the div element, which was not originally selected, so when the bottom paragraph element asks "Am I the first element of my selected siblings?" the answer is again "Yes!". Honestly, I don't often mix double forward slashes and brackets filled with numbers. We'll see in later slides that there are other, more interesting ways to use brackets to select elements.
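The behavior is easy to verify on a hypothetical document shaped like the one described: a div holding two paragraphs, followed by a paragraph that is a sibling of that div. The sketch uses the standard-library `xml.etree.ElementTree`, whose positional predicate agrees with XPath for this case.

```python
import xml.etree.ElementTree as ET

doc = """
<html>
  <body>
    <div>
      <p>top</p>
      <p>middle</p>
    </div>
    <p>bottom</p>
  </body>
</html>
"""

root = ET.fromstring(doc.strip())

# '//p' selects every paragraph in the document.
all_p = [p.text for p in root.findall(".//p")]

# '//p[1]' keeps each paragraph that is the first p among its own siblings.
first_p = [p.text for p in root.findall(".//p[1]")]

print(all_p)
print(first_p)
```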

### The Wildcard
One final piece of notation we will cover in this lesson is the "wildcard" character, the asterisk. The asterisk indicates that we want to ignore tag type. For example, in this expression, we are directed to both children of the body element, even though one is a div element and one is a paragraph element.
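The expression referred to is not reproduced in these notes; assuming it was `/html/body/*`, a minimal sketch with the standard-library `xml.etree.ElementTree` (paths relative to the html root, sample document made up) looks like this:

```python
import xml.etree.ElementTree as ET

doc = """
<html>
  <body>
    <div><p>inside the div</p></div>
    <p>next to the div</p>
  </body>
</html>
"""

root = ET.fromstring(doc.strip())

# '/html/body/*': every child of body, whatever its tag name.
children = [child.tag for child in root.findall("./body/*")]
print(children)
```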

### (At)tribute
Let's start by pointing out that in XPath notation, the @ symbol is used to mark attributes. For example, if we see @class, @id, or @href in an XPath expression, it refers to a class attribute, id attribute, or href attribute, respectively.

### Brackets and Attributes
We saw before that square brackets can be used in XPath syntax to home in on a specific element or elements based on their order within a given generation. We can also include other information within square brackets to select specific elements. For example, the XPath string `'//p[@class="class-1"]'` will direct to all paragraph elements from //p, and then reduce down to those whose class attribute is equal to "class-1". Note that the value of the class attribute is in quotation marks. My convention is to use single quotes to define the XPath string, and double quotes as needed within the XPath expression itself.
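A short sketch of this attribute filter, again with the standard-library `xml.etree.ElementTree` and a made-up document, following the same quoting convention (single quotes around the path, double quotes inside it):

```python
import xml.etree.ElementTree as ET

doc = """
<html>
  <body>
    <p class="class-1">first</p>
    <p class="class-2">second</p>
    <div><p class="class-1">third</p></div>
  </body>
</html>
"""

root = ET.fromstring(doc.strip())

# '//p[@class="class-1"]': all paragraphs, narrowed to an exact class match.
xpath = './/p[@class="class-1"]'
matches = [p.text for p in root.findall(xpath)]
print(matches)
```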

### Content with Contains
A useful tool we can include within our square-bracketed expression is the "contains" function. It takes two arguments: the left argument is the attribute name (including the @ symbol), and the right argument is the string expression we want to search for within that attribute. It looks at the value of the named attribute on each element and matches those elements where the string expression is a substring of the full attribute value.
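The standard-library `xml.etree.ElementTree` does not implement `contains`, so the sketch below reproduces its substring test in plain Python to contrast it with an exact match. The `select_contains` helper and the sample document are assumptions for illustration; a full XPath engine would accept `//p[contains(@class,"class-1")]` directly.

```python
import xml.etree.ElementTree as ET

doc = """
<body>
  <p class="class-1">first</p>
  <p class="class-12">second</p>
  <p class="my-class-1 extra">third</p>
  <p class="class-2">fourth</p>
  <p>no class</p>
</body>
"""

root = ET.fromstring(doc.strip())


def select_contains(elements, attribute, substring):
    """Mimic [contains(@attribute, "substring")] for a list of elements."""
    # A missing attribute is treated as an empty string, as XPath does.
    return [e for e in elements if substring in e.get(attribute, "")]


exact = [p.text for p in root.findall('.//p[@class="class-1"]')]
partial = [p.text for p in select_contains(root.findall(".//p"), "class", "class-1")]

print(exact)
print(partial)
```

Note that the substring test also matches "class-12", which an exact comparison would reject.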

### Get Classy
Now, let's consider how to direct to the attribute information itself. To do so, we first create an XPath expression to the element or elements we want to pull attribute information from. Say we would like to direct to the class attribute of a particular paragraph element, one we already know how to navigate to. To direct to the attribute itself, we take the XPath to that element, follow it by a forward slash, and follow that by the @ symbol connected to the attribute name of interest, in this case class. As a quick note: if we were instead to use a double forward slash before the @ symbol with the attribute name, we would direct not only to that attribute of the elements selected by the XPath, but also to the same attribute in all of their future generations.

In [None]:
xpath = '//p[@id="p2"]/a/@href'

In [None]:
xpath = '//a[contains(@class,"package-snippet")]/@href'
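The standard-library `xml.etree.ElementTree` can select elements but not attribute nodes, and it lacks `contains`, so this sketch approximates the two expressions above by selecting the `a` elements and then reading `href` in Python. The document and URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<html>
  <body>
    <p id="p1"><a href="https://example.com/one">one</a></p>
    <p id="p2"><a class="package-snippet big" href="https://example.com/two">two</a></p>
    <p id="p3"><a class="package-snippet">no link</a></p>
  </body>
</html>
"""

root = ET.fromstring(doc.strip())

# Approximates '//p[@id="p2"]/a/@href'.
hrefs_p2 = [
    a.get("href")
    for a in root.findall('.//p[@id="p2"]/a')
    if "href" in a.attrib  # '/@href' yields nothing for elements without href
]

# Approximates '//a[contains(@class,"package-snippet")]/@href'.
hrefs_snippet = [
    a.get("href")
    for a in root.iter("a")
    if "package-snippet" in a.get("class", "") and "href" in a.attrib
]

print(hrefs_p2)
print(hrefs_snippet)
```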