# Web Scraping in Python

Common HTML tags and their meanings:

Basic - 

    html = the root element of an HTML document
    head = sets off the title and other information that isn't displayed
    body = sets off the visible portion of the document
    title = puts the name of the document in the title bar

Formatting - 

    p = creates a new paragraph
    div = defines a division or a section in an HTML document
    span = inline container used to mark up part of a text or part of a document

Links - 

    <a href="URL">clickable text</a> = creates a hyperlink to a Uniform Resource Locator (URL)



## 1. Web Scraping Overview

In [2]:
# Importing the module
import os

# Printing the current working directory
print("The Current working directory is: {0}".format(os.getcwd()))

# Changing the current working directory
os.chdir('/Users/jordanstevens/Documents/Python/Web Scraping in Python/Web Scraping in Python - Datacamp')

# Print the current working directory
print("The Current working directory now is: {0}".format(os.getcwd()))

The Current working directory is: /Users/jordanstevens/Documents/Python/Web Scraping in Python/Web Scraping in Python - Datacamp
The Current working directory now is: /Users/jordanstevens/Documents/Python/Web Scraping in Python/Web Scraping in Python - Datacamp


Use Cases

    Comparing prices against competitors

    Gauging customer satisfaction through online reviews of a company's products, and gathering public opinion about the company

    Scraping social media sites to find and contact potential clients

HTML tags are nested within one another: for example, a p tag can be nested in a div tag, which is nested in the body tag, which is nested in the html tag.

    Root tag = <html>
    Body tag = <body> - this defines the body of the html
    Div tag = <div> - defining a section of the body
    p tags = <p> - defining paragraphs within the body
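
The nesting described above can be made visible with a small sketch. It uses only the standard-library html.parser module; the class name NestingTracker and the sample string are hypothetical and chosen just for illustration.

```python
from html.parser import HTMLParser


class NestingTracker(HTMLParser):
    """Record the chain of open tags each time a new tag starts (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.paths = []   # one "html/body/..." path per start tag

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        # Close the tag only if it matches the innermost open tag
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


sample_html = "<html><body><div><p>Hello</p><p>World</p></div></body></html>"

tracker = NestingTracker()
tracker.feed(sample_html)
nesting_paths = tracker.paths

for path in nesting_paths:
    print(path)
```

Each printed path lists the ancestors of a tag from the root down, so both p tags show up as html/body/div/p.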
    

### From Tree to HTML

Here you are given the chance to create your own bit of HTML code (as a python string). More specifically, below is an HTML tree image and you will finish the missing code within the string html which produces this HTML tree.

In [12]:
from IPython.display import Image

Image(url='https://assets.datacamp.com/production/repositories/2560/datasets/7043ab2cf053a078afe4adbda88b669de2a6ab73/html_tree_exercise_resize.png')

In [13]:
html = '''
<html>
  <head>
    <title>Intro HTML</title>
  </head>
  <body>
  <p> Hello World! </p>
  <p> Enjoy DataCamp! </p>
  </body>
</html>
'''

### Keep it classy


Given the following HTML

    <html>
      <head>
        <title>Website Title</title>
        <link rel="stylesheet" type="text/css" href="style.css">
      </head>
      <body>
        <div class="class1" id="div1">
          <p class="class2">
            Visit <a href="http://datacamp.com/">DataCamp</a>!
          </p>
        </div>
        <div class="class1 class3" id="div2">
          Hello World!
        </div>
      </body>
    </html>
Which elements belong to the class class1?

Answer = 1st and 2nd div

Fill in the blank in the HTML code string html to assign a class attribute to the second div element which has the value "you-are-classy".

In [18]:
# HTML code string
html = '''
<html>
  <body>
    <div class="class1" id="div1">
      <p class="class2">Visit DataCamp!</p>
    </div>
    <div class="you-are-classy">
      <p class="class2">Keep up the good work!</p>
    </div>
  </body>
</html>
'''

### Finding HREF

Considering the following HTML:

    <html>
      <head>
        <title>Website Title</title>
        <link rel="stylesheet" type="text/css" href="style.css">
      </head>
      <body>
        <div class="class1" id="div1">
          <p class="class2">
            Visit <a href="http://datacamp.com/">DataCamp</a>!
          </p>
        </div>
        <div class="class1 class3" id="div2">
          <p class="class2">
            Or search for it on <a href="http://www.google.com">Google</a>!
          </p>
        </div>
      </body>
    </html>

Which of the following correctly describes how to get to the URL http://datacamp.com? All three statements below are correct for this HTML.


The URL is the href attribute of the first hyperlink a element.

The URL is the href attribute of the first child of the first paragraph p element.

The URL is the href attribute of the hyperlink a element which is a grandchild of the div element with id div1.

### Where am I? - Xpath Introduction

Consider the HTML code:

    <html>
      <body>
        <div>
          <p>Good Luck!</p>
          <p>Not here...</p>
        </div>
        <div>
          <p>Where am I?</p>
        </div>
      </body>
    </html>
    
Using only single forward-slashes to move between generations, and brackets to select the correct element, assign a string to the variable xpath that directs to the paragraph element containing "Where am I?".

In [22]:
xpath = '/html/body/div[2]/p'  # div[2] selects the 2nd div child of body

Using double forward-slash notation, assign to the variable xpath a simple XPath string navigating to all paragraph p elements within any HTML code.

In [23]:
xpath = '//p'

### A classy span

Although we haven't yet gone deep into XPath, one thing we can do is select elements by their attributes using an XPath. For example, if we want to direct to the div element within the HTML document whose id attribute is "uid", then we could write the XPath string 

    '//div[@id="uid"]'

- The first part of this string, //div, first looks at all div elements in the HTML document.

- Then, using the brackets, we specify that we want only the div element with a specific id attribute (in this case uid).

- @id="uid" in the brackets would be read as "attribute id equals uid".

Assign to the variable xpath an XPath string which will select all span elements whose class attribute equals "span-class"

In [24]:
xpath = '//span[@class="span-class"]'

In [26]:
# Create an XPath string to direct to children of body element
xpath = "/html/body/*"

## 2. XPath navigation

### Counting Elements in the Wild

The wildcard * is equivalent to element(): it selects elements without regard to name.

- The number of elements selected with the XPath string xpath = "/html/body/*" is equal to the number of children of the body element, whereas the number of elements selected with the XPath string xpath = "/html/body//*" is equal to the total number of descendants of the body element.

- The number of elements selected by the XPath string xpath = "/*" is equal to the number of root elements within the HTML document, which is typically the single html root element.

- The number of elements selected by the XPath string xpath = "//*" is equal to the total number of elements in the entire HTML document.


In [27]:
# Create an XPath string to direct to children of body element
xpath = "/html/body/*"

### Using XPATH to access elements in HTML

Consider the HTML code:

    <html>
      <body>
        <div>
          <p>Good Luck!</p>
          <p>Not here...</p>
        </div>
        <div>
          <p>Where am I?</p>
        </div>
      </body>
    </html>

Your job will be to create an XPath string using single forward-slashes and brackets which navigates to the paragraph p element which contains the text "Where am I?".

In [3]:
# Fill in the blank
xpath = '/html/body/div[2]/p'  # a single forward slash '/' moves down one generation
# square brackets '[]' after a tag name pick which of the selected siblings to keep

In the lecture, we learned how to use double forward-slashes to navigate to all future generations. In this exercise, you will select all paragraph p elements within the HTML.

In [4]:
# Fill in the blank
xpath = '//p'  # selects p elements at every generation within the HTML

If we want to direct to the div element within the HTML document whose id attribute is "uid", then we could write the XPath string '//div[@id="uid"]'. The first part of this string, //div, first looks at all div elements in the HTML document. Then, using the brackets, we specify that we want only the div element with a specific id attribute (in this case uid). To note, the phrase @id="uid" in the brackets would be read as "attribute id equals uid".

In this exercise, you will select all span elements whose class attribute equals "span-class". (Note: span is just another possible tag-name).

In [5]:
# Fill in the blank
xpath = '//span[@class="span-class"]'

### Choose DataCamp!


The task is to select the paragraph element containing the text "Choose DataCamp!".

Consider the following HTML:

    <html>
      <body>
        <div>
          <p>Hello World!</p>
          <div>
            <p>Choose DataCamp!</p>
          </div>
        </div>
        <div>
          <p>Thanks for Watching!</p>
        </div>
      </body>
    </html>

Assign to the variable xpath an XPath string to direct to the paragraph element containing the phrase: "Choose DataCamp!".

In [30]:
# Create an XPath string to the desired paragraph element
xpath = "//html/body/div/div/p[1]"

### Where it's @

Begin writing an XPath string that uses attributes to achieve a certain task: selecting the paragraph element containing the text "Thanks for Watching!". We've already created most of the XPath string for you.

Consider the following HTML:

    <html>
      <body>
        <div id="div1" class="class-1">
          <p class="class-1 class-2">Hello World!</p>
          <div id="div2">
            <p id="p2" class="class-2">Choose DataCamp!</p>
          </div>
        </div>
        <div id="div3" class="class-2">
          <p class="class-2">Thanks for Watching!</p>
        </div>
      </body>
    </html>

In [31]:
# Fill in XPath string to select the paragraph element containing the phrase: "Thanks for Watching!".
xpath = '//*[@id="div3"]/p'

### Check your Class


This exercise is to emphasize that when you use an XPath to select an element by its class attribute without using the contains() function, you match the class exactly.

    <html>
      <body>
        <div id="div1" class="class-1">
          <p class="class-1 class-2">Hello World!</p>
          <div id="div2">
            <p id="p2" class="class-2">Choose DataCamp!</p>
          </div>
        </div>
        <div id="div3" class="class-2">
          <p class="class-2">Thanks for Watching!</p>
        </div>
      </body>
    </html>

Fill in the blanks in the xpath below to select the paragraph element containing the phrase: "Hello World!".

In [32]:
# Create an XPath string to select p element by class
xpath = '//p[@class="class-1 class-2"]'

### Hyper(link) Active


One of the most important attributes to extract for "web-crawling" is the hyperlink url (href attribute) within an a tag. 

    <html>
      <body>
        <div id="div1" class="class-1">
          <p class="class-1 class-2">Hello World!</p>
          <div id="div2">
            <p id="p2" class="class-2">Choose 
                <a href="http://datacamp.com">DataCamp!</a>!
            </p>
          </div>
        </div>
        <div id="div3" class="class-2">
          <p class="class-2">Thanks for Watching!</p>
        </div>
      </body>
    </html>

Fill in the blanks to complete the variable xpath below to select the href attribute value from the DataCamp hyperlink.

In [33]:
# Create an xpath to the href attribute
xpath = '//p[@id="p2"]/a/@href' 

### Secret Links


Fill in the blanks below to assign an XPath string to the variable xpath which directs to all href attribute values of the hyperlink a elements whose class attributes contain the string "package-snippet".

In [34]:
# Create an xpath to the href attributes
xpath = '//a[contains(@class,"package-snippet")]/@href'

### Divvy Up This Exercise


In [5]:
html = """
<html>
<body>
<div>Div 1: <p>paragraph 1</p></div>
<div>Div 2: <p>paragraph 2</p> <p>paragraph 3</p> </div>
<div>Div 3: <p>paragraph 4</p> <p>paragraph 5</p> <p>paragraph 6</p></div>
<div>Div 4: <p>paragraph 7</p></div>
<div>Div 5: <p>paragraph 8</p></div>
</body>
</html>
"""

In [6]:
from scrapy import Selector

# Create a Selector selecting html as the HTML document
sel = Selector( text = html )

# Create a SelectorList of all div elements in the HTML document
divs = sel.xpath( '//div' )

Referring to the divs variable created above, which of the following statements is incorrect?

    divs[2].xpath('./*') selects all the children of the 3rd div element
    
    len(divs[2].xpath('./*')) gives the number of children of the 3rd div element in the HTML code
    
    divs[2] is another SelectorList of length 2

The first two statements are correct. The third is the incorrect one: indexing a SelectorList with an integer returns a single Selector, not another SelectorList.

### Requesting a Selector

We have pre-loaded an HTML into the string variable html. In this two part problem you will use this html variable as the HTML document to set up a Selector object with, and create a SelectorList which selects all div elements; then, you will check your understanding of what happens within the SelectorList.

In [7]:
# Import a scrapy Selector
from scrapy import Selector

# Import requests
import requests

# Create the url variable
url = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

# Create the string html containing the HTML source (.text returns a str, .content returns bytes)
html = requests.get( url ).text

# Create the Selector object sel from html
sel = Selector( text = html )

# Print out the number of elements in the HTML document
print( "There are 1020 elements in the HTML document.")
print( "You have found: ", len( sel.xpath('//*') ) )

There are 1020 elements in the HTML document.
You have found:  1020


## 3. CSS Locators

### XPath to CSS Locators

CSS = Cascading Style Sheets, which describe how elements are displayed on the screen. People are often divided on whether XPath or CSS is the better way to go.

To help get you more comfortable going back and forth between XPath and CSS Locator strings, we give you a chance in this exercise to do some direct "translation" between the two.

In [8]:
# Note: the // in XPath corresponds to a space between selectors in CSS
# Create the XPath string equivalent to the CSS Locator 
xpath = '/html/body//div[2]/p[2]'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'html > body div > p:nth-of-type(2)'

In [9]:
# Create the XPath string equivalent to the CSS Locator 
xpath = '/html/body/span[1]//a'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'html > body > span:nth-of-type(1) a'

In [10]:
# Create the XPath string equivalent to the CSS Locator 
xpath = '//div[@id="uid"]/span//h4'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'div#uid > span h4'  # a space matches descendants at any level (like //); > matches direct children only (like /)

### Get an "a" in this Course

In the second part of this problem, we want you to create a CSS Locator string which will select a certain collection of elements as described here: Select the hyperlink (a element) children of all div elements belonging to the class "course-block" (that is, any div element with a class attribute such that "course-block" is one of the classes assigned). The number of such elements is 11, so you can check your solution with how_many_elements if you choose.

In [12]:
from scrapy import Selector

# Create a selector from the html (of a secret website)
sel = Selector( text = html )

# Fill in the blank
css_locator = 'div.course-block > a'

### The CSS Wildcard

You can use the wildcard * in CSS Locators too! In fact, we can use it in a similar way, when we want to ignore the tag type. For example:

    The CSS Locator string '*' selects all elements in the HTML document.
    
    The CSS Locator string '*.class-1' selects all elements which belong to class-1, but this is unnecessary since the string '.class-1' will also do the same job.
    
    The CSS Locator string '*#uid' selects the element with id attribute equal to uid, but this is unnecessary since the string '#uid' will also do the same job.

In [13]:
# Create the CSS Locator to all children of the element whose id is uid
css_locator = '#uid > *'

### You've been hrefed


In a previous exercise, you created a CSS Locator string to select the hyperlink (a element) children of all div elements belonging to the class "course-block". Here we have created a SelectorList called course_as having selected those hyperlink children.

Now, we want you to fill in the blank below to extract the href attribute values from these elements. This is another example of chaining, as we've seen in a previous exercise.

The point here is that we can chain together calls to the methods css and xpath, and combine them! We help nudge you in the correct direction by giving you the solution if we chain with another call to the css method.


In [None]:
from scrapy import Selector

# Create a selector object from a secret website
sel = Selector( text = html )

# Select all hyperlinks of div elements belonging to class "course-block"
course_as = sel.css( 'div.course-block > a' )

# Selecting all href attributes chaining with css
hrefs_from_css = course_as.css( '::attr(href)' )

# Selecting all href attributes chaining with xpath
hrefs_from_xpath = course_as.xpath( './@href' )

### Top Level Text


This exercise will have you write an XPath and CSS Locator string to direct to the text of a specific paragraph p element. The p element in the HTML is uniquely defined by its id attribute, which is "p3". With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable html with a string containing the HTML in which this link belongs, if you want to peruse it.

In this exercise, you will only be selecting the text within the element, which does not include the text in future generations of the element. We have created a function print_results for you to compare which elements your strings direct to.

In [18]:
# Set up the HTML example
html = """
<html>
<body>
<div id="this-div">
<p id="p1" class="class-1">This is not the element you are looking for</p>
<p id="p2" class="class-12">
<a href="https://www.google.com">Google</a> is linked to here, but this isn't the link you are looking for. 
</p>
<p id="p3" class="class-1 class-12">
Here is the <a href="https://www.datacamp.com" id="a-exercise">DataCamp</a> link you want!
</p>
</div>
</body>
</html>
"""

# Assign to the variable xpath an XPath string directing to the text within the paragraph p element with id equal to p3, which does not include the text of future generations of this p element.
xpath = '//p[@id="p3"]/text()'

# Create a CSS Locator string to the desired text.
css_locator = 'p#p3::text'  # the id in the HTML above is p3

### All Level Text


This exercise is similar to the previous, but differs in that you will be selecting text from multiple generations of a given element.

You will write an XPath and CSS Locator strings to direct to the text of a specific paragraph p element. The p element in the HTML is uniquely defined by its id attribute, which is "p3". 

In [19]:
# Assign to the variable xpath an XPath string directing to the text within the paragraph p element with id equal to p3, which includes the text of future generations of this p element.
# Note: //text() also reaches the text of future generations of the p element
xpath = '//p[@id="p3"]//text()'

# In CSS this is done with a space before ::text
# Create a CSS Locator string to the desired text.
css_locator = 'p#p3 ::text'  # the id in the HTML above is p3

### Reveal By Response
 

We have pre-loaded a Response object, named response, with the content from a secret website. Your job is to figure out the URL and the title of the website using the response variable. You learned how to find the URL in the last lesson. To find the website title, what you need to know is:

    The title is the text from the title element
    The title element is a child of the head element, which is a child of the html root element.


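In [28]:
import requests
from scrapy.http import HtmlResponse

url = 'https://www.datacamp.com/courses/all'

# Outside the course environment no Response object is pre-loaded (the original
# run failed with "AttributeError: 'str' object has no attribute 'url'"),
# so build a scrapy response from the downloaded page
page = requests.get( url )
response = HtmlResponse( url = page.url, body = page.content, encoding = 'utf-8' )

# Get the URL to the website loaded in response
this_url = response.url

# Get the title of the website loaded in response
this_title = response.xpath('/html/head/title/text()').extract_first()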