# Web Scraping Introduction

Web scraping is the process of collecting data from public websites and exporting it into a format that is easier to read and analyze.

### What are websites made of?

Think of it this way: before building a house, we must first understand what materials are used in its construction. In the same way, to gather the relevant data from websites, we must first learn how websites are built.

Websites are created using HTML (HyperText Markup Language), along with CSS (Cascading Style Sheets) and JavaScript. We are going to focus on HTML. Put simply, HTML defines how content is structured and displayed in the browser. It lets creators build sections, paragraphs, and links out of elements, tags, and attributes.

- Tags: the starting and ending parts of an HTML element
    - They always begin and end with angle brackets (`<`, `>`).
    - The name written inside the angle brackets identifies the tag.
    - Tags are like keywords with a distinctive meaning.
    - Most tags must be opened and closed in order to function; a few, such as `<br>` and `<img>`, have no closing tag.

#### Example:
    <a> _content_ </a>  
#### Elements: the opening tag, the content in between, and the closing tag together
    <a> THIS IS THE ELEMENT </a> 
#### Attributes: used to define the characteristics of the HTML element in detail
    <a href="https://example.com"> _content_ </a>

When we are scraping, we need to find the tags that have the relevant information between them. 
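
To make the three terms concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to walk through one link and report its tag, attributes, and content. The snippet, its URL, and the `TagLogger` class are invented for illustration; they are not taken from any real page.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start tag, text node, and end tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the opening tag
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text
            self.events.append(("text", text))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# Hypothetical snippet, invented for this illustration
snippet = '<a href="https://example.com/book" class="bookTitle">The Alchemist</a>'

parser = TagLogger()
parser.feed(snippet)
parser.close()
for event in parser.events:
    print(event)
```

The first event carries the tag name `a` and its attributes, the second carries the content between the tags, and the third is the closing tag. Scraping comes down to locating the right opening tag and reading what sits inside it.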

### Tools to webscrape

Several libraries can be used to webscrape, such as `requests`, `lxml`, and `beautifulsoup4`, but today we will focus on `selenium`. It lets us write a script that scrapes multiple pages to build our dataframe. Through `selenium`, the script drives a real browser, so it can interact with pages, scrape them, and parse their content.

To get started, we must choose a browser and its web driver.

* Firefox: GeckoDriver
* Chrome: ChromeDriver
* Safari: SafariDriver

For this exercise, we will be using Firefox and GeckoDriver.

Recent versions of Selenium are fairly easy to configure. All these options can seem overwhelming at first, since Selenium is a powerful package with a learning curve. Below, we've implemented everything you need to get this webdriver started. The options are browser-specific, so the syntax will differ if you want to use Chrome instead. Be sure to check the documentation if you get stuck getting a driver started.

In [None]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import pandas as pd

# Declare all of our options; see https://www.selenium.dev/documentation/webdriver/browsers/firefox/
options = Options()

# Set the profile preference below if you need a particular profile (with unique settings, such as cookie policy)
# options.set_preference('profile', profile_path)
options.set_preference("dom.webdriver.enabled", False)

options.add_argument("-headless")  # headless means no GUI
# Firefox sets the window size with --width/--height. Chrome-only flags such as
# "window-size", "--no-sandbox", or "--disable-blink-features" do not apply here.
options.add_argument("--width=1920")
options.add_argument("--height=1080")

# With no arguments, Service() leaves it to Selenium to locate GeckoDriver
service = Service()

driver = Firefox(service=service, options=options)

Next, we will take a look at the website https://www.goodreads.com/list/show/18645.Best_Books_That_Grow_You

To confirm that our scraping is working, highlight and right-click the first book, titled "The Alchemist". You should see a drop-down menu; go ahead and click `Inspect`.

You will see a new panel, usually at the bottom of the browser window, that shows the HTML code that builds the website.

Here is where we will find all the information that we are looking for.

Does the output of the code below match any of the titles you see on the page? Does it seem like it's scraping the page correctly?

In [None]:
driver.get("https://www.goodreads.com/list/show/18645.Best_Books_That_Grow_You")
my_elements = driver.find_elements("xpath", "//a[@class='bookTitle']/span")
good_reads = []
# Loop over the matched elements, appending each title's text to the end of the list
for element in my_elements:
    good_reads.append(element.text)
# Print the list that we just created
print(good_reads)
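
The XPath `//a[@class='bookTitle']/span` reads as "every `<span>` directly inside an `<a>` whose `class` attribute is `bookTitle`". As a rough sketch of how that matching works, the same pattern can be tried without a browser on a small fragment using the standard library's `xml.etree.ElementTree`. The markup below is a simplified stand-in written for this example, not the real Goodreads page, and the second row is placeholder data. ElementTree supports only a subset of XPath and needs the relative form `.//` instead of `//`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that imitates the structure the XPath expects
sample = """
<table>
  <tr>
    <td>
      <a class="bookTitle" href="/book/1"><span>The Alchemist</span></a>
      <a class="authorName" href="/author/1"><span>Paulo Coelho</span></a>
    </td>
  </tr>
  <tr>
    <td>
      <a class="bookTitle" href="/book/2"><span>A Second Title</span></a>
      <a class="authorName" href="/author/2"><span>A Second Author</span></a>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(sample.strip())

# Select each <span> that is a direct child of a matching <a> element
titles = [span.text for span in root.findall(".//a[@class='bookTitle']/span")]
authors = [span.text for span in root.findall(".//a[@class='authorName']/span")]

print(titles)
print(authors)
```

If the class name in the expression does not match the page's markup exactly, the result is simply an empty list, which is a common reason for a scrape that returns nothing.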

In [None]:
good_reads_2 = []
good_reads_2_authors = []
base_url = "https://www.goodreads.com/list/show/18645.Best_Books_That_Grow_You"
# Visit pages 1 through 11 of the list; the ?page= parameter selects the page
for page in range(1, 12):
    page_url = f"{base_url}?page={page}"
    driver.get(page_url)
    my_elements = driver.find_elements("xpath", "//a[@class='bookTitle']/span")
    my_authors = driver.find_elements("xpath", "//a[@class='authorName']/span")
    # Append this page's titles and authors to the running lists
    for element in my_elements:
        good_reads_2.append(element.text)
    for author in my_authors:
        good_reads_2_authors.append(author.text)

The code above scrapes multiple pages for the book titles and author names.
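
For the loop to land on a different page each time, the URL has to change with the loop variable `page`. The sketch below builds one URL per page, assuming the list is paginated with a `?page=N` query parameter. The helper name `page_urls` is made up for this example, and the page range 1 to 11 simply mirrors `range(1, 12)` in the loop.

```python
BASE_URL = "https://www.goodreads.com/list/show/18645.Best_Books_That_Grow_You"


def page_urls(base_url, first_page, last_page):
    """Return one URL per page, assuming pagination via a ?page=N parameter."""
    return [f"{base_url}?page={page}" for page in range(first_page, last_page + 1)]


urls = page_urls(BASE_URL, 1, 11)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Printing the URLs before handing them to `driver.get` is a cheap way to check that the loop really visits eleven distinct pages and not the same page eleven times.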

Now that we have all of the data, we want to take it and create a CSV file so that it is easy to look at, read, and analyze. 

The first step is to take all of our data and create a neat dataframe.

In [None]:
# Use a name other than `dict` so that Python's built-in dict type is not shadowed
book_data = {'Title': good_reads_2, 'Authors': good_reads_2_authors}
GoodReads2022 = pd.DataFrame(book_data)
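
Building a dataframe from a dictionary of lists only works when every list has the same length; otherwise pandas raises a `ValueError`. Scraped lists can drift apart, for example if a page yields a title without a matching author, so a quick check beforehand can save some confusion. This is a small sketch in plain Python; the helper `check_equal_lengths` and the sample lists are hypothetical.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Hypothetical stand-ins for the scraped lists
titles = ["Title A", "Title B", "Title C"]
authors = ["Author A", "Author B", "Author C"]

print(check_equal_lengths({"Title": titles, "Authors": authors}))
```

With the real data, the same call would take the scraped title and author lists. The error message reports each column's length, which shows which list came up short.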

Once we have created the dataframe, we just need to export it into a `csv` file.

In [None]:
GoodReads2022.to_csv('GoodReads2022.csv')
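
To confirm that the export worked, the file can be read back and inspected. The sketch below uses the standard library's `csv` module on an in-memory sample so that it runs on its own. The sample contents and the helper `read_rows` are invented for illustration, assuming a header row with `Title` and `Authors` columns.

```python
import csv
import io

# Hypothetical file contents standing in for GoodReads2022.csv
sample_csv = (
    "Title,Authors\n"
    "The Alchemist,Paulo Coelho\n"
    "A Second Title,A Second Author\n"
)


def read_rows(handle):
    """Return the CSV rows as a list of dicts keyed by the header names."""
    return list(csv.DictReader(handle))


rows = read_rows(io.StringIO(sample_csv))
print(len(rows))
print(rows[0]["Title"], "|", rows[0]["Authors"])
```

To check the real file, pass an open file handle instead, for example `open("GoodReads2022.csv", newline="", encoding="utf-8")`.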

# You just successfully scraped GoodReads and exported the results to a CSV file! Congrats!