# AI/ML with Python: Web Scraping & Sentiment Analysis
## Data Extraction and Preprocessing through Webscraping

### Step 3 - Fetching Raw HTML Web Content

We start by fetching the raw HTML of a webpage. Decoding it into a readable string makes the later processing and analysis steps possible.

In [1]:
# Package used to work with URLs
from urllib.request import urlopen

# Link to the website
url = "https://quotes.toscrape.com/"

# Opening the URL and reading it
page = urlopen(url)
html_bytes = page.read()

# Taking raw web content in bytes format and converting it into a readable string format using the UTF-8 encoding
html = html_bytes.decode("utf-8")

# Print raw HTML of the webpage
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <sp
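The cell above assumes the page is encoded as UTF-8. As a minimal sketch, the charset can instead be read from the response headers, falling back to UTF-8 when none is declared. The helper names `decode_body` and `fetch_html` are illustrative assumptions and not part of the original notebook.

```python
from urllib.request import urlopen


def decode_body(raw_bytes, charset=None):
    """Decode raw response bytes, falling back to UTF-8 if no charset is given."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return raw_bytes.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Fetch a page and decode it using the charset declared in its headers."""
    with urlopen(url, timeout=timeout) as page:
        # get_content_charset() returns None when the header names no charset
        charset = page.headers.get_content_charset()
        return decode_body(page.read(), charset)


# Offline check of the decoding step (no network access needed)
print(decode_body("“Quotes to Scrape”".encode("utf-8")))
```

Calling `fetch_html("https://quotes.toscrape.com/")` should give the same string as the cell above, provided the site is reachable.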

### Step 4 - Understanding Regular Expression in Webscraping

In this exercise, we'll continue from where we left off, after getting the raw HTML source code of the webpage. We proceed to import `re`, the module that supports regular expressions. We will be extracting the title tag from this webpage and printing it out below, which in this case is "Quotes to Scrape".  

In [2]:
import re

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()

title = re.sub("<.*?>", "", title)

title

'Quotes to Scrape'

We begin by defining the pattern, then use it to search the HTML for the `<title>` element.

At its core, regex uses special symbols to describe patterns within text strings. Here, `<title.*?>` matches the opening `<title>` tag. The `.` matches any character (except a newline, by default) and the `*` repeats it zero or more times. The trailing `?` makes that repetition lazy: it consumes as few characters as possible, so the match stops at the first closing `>` instead of running on to the last one on the line.

`re.search` then returns the first match of this pattern in the HTML. The `re.IGNORECASE` flag makes the search ignore case, so `<TITLE>` would match as well. `match_results.group()` returns the matched text with its tags still attached, and `re.sub("<.*?>", "", title)` strips every tag from it, leaving only the text inside the title.
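To see why the `?` matters, here is a small sketch that compares greedy and lazy matching on a hand-written string (the `sample` string is made up for illustration). It also shows a capture group as one possible alternative to the `re.sub` step; that is an assumption about what you might prefer, not a change to the approach above.

```python
import re

sample = "<head><TITLE>Quotes to Scrape</TITLE></head>"

# Greedy: .* runs as far as possible, so the match ends at the LAST ">"
greedy = re.search("<.*>", sample).group()

# Lazy: .*? stops at the FIRST ">" that lets the pattern succeed
lazy = re.search("<.*?>", sample).group()

# The capture group (...) isolates the title text, so no re.sub step is needed
match = re.search(r"<title.*?>(.*?)</title.*?>", sample, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else None

print(greedy)  # <head><TITLE>Quotes to Scrape</TITLE></head>
print(lazy)    # <head>
print(title)   # Quotes to Scrape
```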

Pretty neat! If you're unsure which pattern to use, a quick search online will help. You can then reuse the rest of the code to extract other information you need.

### Step 5 - Web Scraping with BeautifulSoup

While regex might serve well for basic web scraping tasks, it's often not the ideal choice for more intricate or expansive web scraping projects. 

BeautifulSoup, a widely-used Python library for web scraping, stands out because of its user-friendly and versatile features for parsing and traversing HTML and XML content. 

Key benefits include its adaptability in parsing and its advanced search capabilities. Let's delve into an example to showcase the simplicity of using BeautifulSoup!

In [3]:
# Remember to install BeautifulSoup4 first
# python -m pip install beautifulsoup4 
# try pip3 if pip is not found

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.find_all("span", class_="text")

for obj in text:
    print(obj.get_text())

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


Running this code extracts every quote on the page. BeautifulSoup parses the raw HTML and finds all `<span>` tags whose class is "text". `find_all` returns these span elements in a list, and the `get_text()` method then pulls the text out of each one.
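The same idea extends to pulling several fields out of each quote block. The sketch below runs on a small hand-written HTML snippet (not the live page), and the `records` structure is an illustrative choice. It assumes each block carries a `text` and an `author` class.

```python
from bs4 import BeautifulSoup

# A small hand-written snippet that mimics the structure of a quotes page
sample_html = """
<div class="quote">
  <span class="text">“First sample quote.”</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">“Second sample quote.”</span>
  <small class="author">Author Two</small>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

records = []
for block in soup.find_all("div", class_="quote"):
    # find() returns the first matching descendant, or None if there is none
    text_tag = block.find("span", class_="text")
    author_tag = block.find(class_="author")
    if text_tag is None or author_tag is None:
        continue  # skip blocks that lack either piece
    records.append(
        {"text": text_tag.get_text(strip=True), "author": author_tag.get_text(strip=True)}
    )

for record in records:
    print(f'{record["author"]}: {record["text"]}')
```

Checking for `None` before calling `get_text()` keeps the loop from crashing if a block is missing one of the two tags.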

### Step 6 - Automating Tasks with Selenium

We'll learn how to use Selenium to automate a login process on a dummy website, which you can visit here: https://the-internet.herokuapp.com/login

This simple website lets us learn how to interact with elements, autofill fields, and click buttons to log in without doing so manually! We start by installing the Selenium package.

We then import `webdriver`, `Keys`, and `By`, which are needed to open a new browser and to locate and interact with input fields, as you will see in a bit.

In [10]:
# input 'pip install selenium' or 'pip3 install selenium' into your terminal

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

With that in place, we open the web browser so that we can later fill in the 'username' and 'password' fields. After running the cell below, you should see a new browser window open!

In [11]:
### Automated Tasks with Selenium

link = 'https://the-internet.herokuapp.com/login'

# Initializing webdriver and getting the link
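# Using Chrome as the primary browser. If you use another browser, swap in its
# driver class instead, e.g. webdriver.Edge() or webdriver.Firefox(). Depending on
# your Selenium version, you may also need to download the matching driver
# (for Edge, from the Microsoft Edge WebDriver page).
driver = webdriver.Chrome()
driver.get(link)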

With the browser opened, we can go ahead and begin interacting with the input fields **using code**! 

Before updating the fields, we have to first locate the elements that indicate where the fields are. 

As previously covered, we can either use BeautifulSoup or open the browser's developer tools to study the raw HTML. The code below demonstrates the BeautifulSoup approach.

In [12]:
raw_html = urlopen(link)
login_page = BeautifulSoup(raw_html, "html.parser")
login_page

<!DOCTYPE html>

<!--[if IE 8]>         <html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
<html>
<head>
<script src="/js/vendor/298279967.js"></script>
<meta charset="utf-8"/>
<meta content="width=device-width" name="viewport"/>
<title>The Internet</title>
<link href="/css/app.css" rel="stylesheet"/>
<link href="/css/font-awesome.css" rel="stylesheet"/>
<script src="/js/vendor/jquery-1.11.3.min.js"></script>
<script src="/js/vendor/jquery-ui-1.11.4/jquery-ui.js"></script>
<script src="/js/foundation/foundation.js"></script>
<script src="/js/foundation/foundation.alerts.js"></script>
<script>
      $(document).foundation();
    </script>
</head>
<body>
<div class="row">
<div class="large-12 columns" id="flash-messages">
</div>
</div>
<div class="row">
<a href="https://github.com/tourdedave/the-internet"><img alt="Fork me on GitHub" src="/img/forkme_right_green_007200.png" style="position: absolute; top: 0; right: 0

From the output above, we can see that the username and password fields are both `<input>` tags, with the ids "username" and "password" respectively.

This is relatively easy to find because the page is just an example. On more complex websites, it is better to locate elements via the browser's developer tools.

To do that, we use the `find_element` method together with the `By` class we imported earlier to locate each input field by its id, as shown below.

We then type into each input field with the `send_keys` method. By running the script below, you should see **both fields within the automated browser** that you just opened get **automatically updated**.

In [13]:
# Pre-establishing the password and username 
# Note that if you are doing this for your own project, ensure you are securing your credentials to prevent misuse

user = 'tomsmith'
pw = 'SuperSecretPassword!'

# Get the element of both username and password fields
username_input = driver.find_element(By.ID, "username")
password_input = driver.find_element(By.ID, "password")

# Input the credentials (the credentials for the-internet are usually provided on the page)
username_input.send_keys(user)
password_input.send_keys(pw)
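The cell above hard-codes the demo credentials, which is fine for this public test site. For your own projects, one common option is to read them from environment variables instead. The sketch below assumes two hypothetical variable names, `LOGIN_USER` and `LOGIN_PASSWORD`, and the helper `load_credentials` is illustrative only.

```python
import os


def load_credentials(env=os.environ):
    """Read login details from environment variables (names are hypothetical)."""
    user = env.get("LOGIN_USER")
    password = env.get("LOGIN_PASSWORD")
    if not user or not password:
        raise RuntimeError("Set LOGIN_USER and LOGIN_PASSWORD before running.")
    return user, password


# Demonstration with a plain dict standing in for os.environ
demo_env = {"LOGIN_USER": "tomsmith", "LOGIN_PASSWORD": "SuperSecretPassword!"}
user, pw = load_credentials(demo_env)
print(user)
```

In real use you would call `load_credentials()` with no argument, so the values come from the shell environment and never appear in the notebook.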

Now that we've updated both fields, the next thing we'll get Selenium to do is to **click the login button**. This time, we'll use a CSS selector, because this button does not have an id associated with it.

In short, a CSS selector is a pattern that identifies elements on a web page by their tag name, id, class, or other attributes. It is the same syntax that cascading style sheets (CSS) use to decide which HTML elements a style rule applies to.

Here, we are using an attribute selector, which selects an element based on the presence or value of a given attribute: it selects the `button` element whose `type` is 'submit'.

If you'd like to learn more about CSS selectors, visit https://saucelabs.com/resources/blog/selenium-tips-css-selectors
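CSS selectors can also be tried out without opening a browser, since BeautifulSoup's `select` and `select_one` methods accept the same selector syntax. The sketch below uses a hand-written form (an assumption, not the real login page's markup) to show an attribute, an id, and a class selector side by side.

```python
from bs4 import BeautifulSoup

# Hand-written form, loosely modelled on a typical login page
form_html = """
<form id="login">
  <input type="text" id="username" name="username">
  <input type="password" id="password" name="password">
  <button class="radius" type="submit">Login</button>
  <button type="button">Cancel</button>
</form>
"""

form = BeautifulSoup(form_html, "html.parser")

# Attribute selector: <button> elements whose type attribute equals "submit"
submit = form.select_one("button[type='submit']")

# ID selector: the element whose id is "username"
username = form.select_one("#username")

# Class selector combined with a tag name
radius = form.select("button.radius")

print(submit.get_text())   # Login
print(username["name"])    # username
print(len(radius))         # 1
```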

Run the code below to witness it in action!

In [14]:
login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
login_button.click()

And...you should be logged in! In this short exercise, we have successfully used Selenium to interact with page elements for web scraping.

For websites that rely heavily on JavaScript to load content dynamically, Selenium can help: it drives a real browser, so it can trigger the elements that load the content and then extract the data once it has rendered.

Remember to read and understand a site's terms and conditions so that you don't violate any ethical or legal constraints!

### Step 7 - Let's Ace Your Submissions! Preparing Your Submission!
Follow the instructions under Step 7 to complete this quest. We have provided instructions below to guide you along the way, so please refer to previous steps or check the web if you are uncertain!

In [None]:
# Importing all the necessary packages
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Opening the link using Webdriver

# Navigating to the next page by clicking next

# Get the link of the current page and opening
current_page_url = driver.current_url

# Taking raw web content in bytes format and converting it into a readable string format using the UTF-8 encoding

# Use beautifulSoup package to parse and read them

# Print out each quote

In [15]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from urllib.request import urlopen

# 1. Open the link using Webdriver
driver = webdriver.Chrome()  # Replace with your preferred browser's WebDriver
driver.get("https://quotes.toscrape.com/")

# 2. Navigate to the next page by clicking "Next"
next_button = driver.find_element(By.CSS_SELECTOR, "li.next a")  # Locate the "Next" button
next_button.click()  # Click the button to go to the next page

# 3. Get the link of the current page
current_page_url = driver.current_url  # Get the URL of the page you're now on

# 4. Use BeautifulSoup to scrape and print the quotes
response = urlopen(current_page_url)  # Fetch the HTML content
html_content = response.read().decode("utf-8")  # Decode the content

soup = BeautifulSoup(html_content, "html.parser")  # Parse the HTML
quotes = soup.find_all(class_="quote")  # Find all quote elements

# 5. Print each quote
for quote in quotes:
    text = quote.find(class_="text").get_text()  # Extract the quote text
    author = quote.find(class_="author").get_text()  # Extract the author
    print(f"Quote: {text}\nAuthor: {author}\n")

# 6. Close the browser (optional)
driver.quit()  # Close the browser when you're done

Quote: “This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most impor