# Web Scraping 101

## A method for getting data from the front end of a website

### Methods of getting data from a website
1. Backend Access
1. API
1. Web scraping

## Website Design Basics
HTML, CSS, JavaScript

### HTML - Structure/Content

In [1]:
%%HTML
<!DOCTYPE html>
<html>

<head>
<title>Page Title</title>
</head>

<body>

<h1>Heading 1</h1>
<h2>Heading 2</h2>

<p>This is a paragraph.</p>


</body>

</html> 
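
To make the nesting of the page above visible, here is a minimal sketch that walks the same markup with Python's built-in `html.parser` and records each element's depth. `OutlineParser` and `PAGE` are hypothetical names used for illustration, and the sketch assumes every tag has a closing tag, as in the sample page.

```python
from html.parser import HTMLParser

# The sample page from the cell above, without the blank lines.
PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<p>This is a paragraph.</p>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Record the nesting depth, tag name and text of each element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # entries are [depth, tag, text]
        self._open = []    # indices into outline for elements still open

    def handle_starttag(self, tag, attrs):
        self._open.append(len(self.outline))
        self.outline.append([self.depth, tag, ""])
        self.depth += 1

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()
            self.depth -= 1

    def handle_data(self, data):
        # Attach non-blank text to the innermost open element.
        if self._open and data.strip():
            self.outline[self._open[-1]][2] += data.strip()


outline_parser = OutlineParser()
outline_parser.feed(PAGE)
for depth, tag, text in outline_parser.outline:
    print("  " * depth + tag + (": " + text if text else ""))
```

The indentation of the printed outline mirrors the tree that the browser builds from the HTML: `head` and `body` sit inside `html`, and the headings and paragraph sit inside `body`.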

### CSS - The looks

### JavaScript - The logic

## View the website structure in your browser
1. View Source
1. Inspect Element

### Getting the info from a webpage
Target HTML tags (`h1`, `p`, `div`)  
Target CSS selectors (`#id`, `.class`)  
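
As a rough illustration of what each kind of target picks out, the sketch below matches a tag name, an `#id` or a `.class` against a small HTML snippet using only the standard library. `SelectorMatcher`, `select` and `SAMPLE_HTML` are hypothetical names, the markup is made up for the example, and the matching is deliberately simplified (no nested or combined selectors); it is not how a browser or Selenium implements selectors.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<h1 id="title">Laptops</h1>
<p class="price sale">R 100</p>
<p class="price">R 200</p>
<p>Free delivery</p>
"""


class SelectorMatcher(HTMLParser):
    """Collect the text of elements matching a tag name, '#id' or '.class'."""

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.matches = []
        self._capturing = None  # tag name of the element being captured

    def _is_match(self, tag, attrs):
        attrs = dict(attrs)
        if self.selector.startswith("#"):
            return attrs.get("id") == self.selector[1:]
        if self.selector.startswith("."):
            # An element can carry several space-separated classes.
            return self.selector[1:] in (attrs.get("class") or "").split()
        return tag == self.selector

    def handle_starttag(self, tag, attrs):
        if self._capturing is None and self._is_match(tag, attrs):
            self._capturing = tag
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag == self._capturing:
            self._capturing = None

    def handle_data(self, data):
        if self._capturing is not None:
            self.matches[-1] += data


def select(html, selector):
    matcher = SelectorMatcher(selector)
    matcher.feed(html)
    return matcher.matches


print(select(SAMPLE_HTML, "p"))       # every <p> element
print(select(SAMPLE_HTML, "#title"))  # the element with id="title"
print(select(SAMPLE_HTML, ".price"))  # every element with class "price"
```

A tag name is the broadest target, a class narrows it to elements that share a role, and an id should identify a single element on the page.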

## Scraping Using Python

Beautiful Soup 4  
Scrapy  
Selenium

In [None]:
#!pip install selenium

In [None]:
# https://anaconda.org/conda-forge/selenium

In [2]:
#Some basic Selenium imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import datetime

In [None]:
# Download browser drivers
# https://www.seleniumhq.org/download/

In [3]:
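# Instantiate the driver. executable_path must point to the geckodriver you downloaded.
# (Selenium 3 syntax; Selenium 4 takes the driver path through a Service object instead.)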
driver = webdriver.Firefox(executable_path=r"C:\path\to\geckodriver.exe") # Opens Firefox

In [4]:
# Let's visit a website
driver.get("http://example.com")

### Example 1, Basic Website

In [None]:
# find a single element (returns the first match)
#driver.find_element_by_...
# find multiple elements (returns a list of matches)
#driver.find_elements_by_...
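
On Selenium 4.3 and later the `find_element_by_*` helpers are no longer available; elements are located with `driver.find_element(strategy, value)` and `driver.find_elements(strategy, value)` instead. The sketch below shows how the old helper names map onto strategy strings (the same strings that Selenium's `By` constants hold). `find_one`, `find_all` and `RecordingDriver` are hypothetical names used for illustration, and the stub only stands in for a real WebDriver so that the snippet runs without a browser.

```python
# Suffix of the old helper name -> locator strategy string.
LEGACY_SUFFIX_TO_STRATEGY = {
    "id": "id",
    "name": "name",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
}


def find_one(driver, suffix, value):
    """Stand-in for driver.find_element_by_<suffix>(value)."""
    return driver.find_element(LEGACY_SUFFIX_TO_STRATEGY[suffix], value)


def find_all(driver, suffix, value):
    """Stand-in for driver.find_elements_by_<suffix>(value)."""
    return driver.find_elements(LEGACY_SUFFIX_TO_STRATEGY[suffix], value)


class RecordingDriver:
    """Hypothetical stub that records the calls a real driver would receive."""

    def __init__(self):
        self.calls = []

    def find_element(self, by, value):
        self.calls.append(("find_element", by, value))
        return None

    def find_elements(self, by, value):
        self.calls.append(("find_elements", by, value))
        return []


stub = RecordingDriver()
find_one(stub, "tag_name", "h1")
find_all(stub, "css_selector", "div.p-data.left")
print(stub.calls)
```

With a real driver, `find_one(driver, "tag_name", "h1")` would issue the same request as `driver.find_element_by_tag_name("h1")` does in the cells that follow.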

In [5]:
heading = driver.find_element_by_tag_name("h1")

In [8]:
print(heading.text)

Example Domain


### Example 2, Fill in a form field

In [9]:
driver.get("https://www.google.co.za")

In [12]:
search_box = driver.find_element_by_name('q')

In [13]:
search_box.send_keys('Explore Academy')

In [14]:
search_box.submit()

### Example 3, Logging into a website

In [15]:
driver.get("https://athena2.explore-datascience.net/")

In [16]:
# find username field
uname_field = driver.find_element_by_id('username')

In [17]:
# type username
uname_field.send_keys("my_email@gmail.com")

In [18]:
# find password field
pword_field = driver.find_element_by_id('password')

In [19]:
# type password
pword_field.send_keys("my_password")

In [20]:
# find login button
login_button = driver.find_element_by_tag_name('button')

In [21]:
# click login
login_button.click()

In [None]:
# close the browser (skip this cell if you are continuing to Example 4, which reuses the driver)
driver.close()

### Example 4, Let's go shopping

In [22]:
driver.get("https://www.takealot.com/")

In [23]:
search_bar = driver.find_element_by_id("search")

In [24]:
search_bar.send_keys("dell xps 15")

In [25]:
search_bar.submit()

In [26]:
data = driver.find_element_by_css_selector("div.p-data.left")

In [27]:
data.text

'DELL XPS 15 Core i7-8750H 15.6" FHD Notebook - Black\n8GB | 256 | 4GFX | Win 10 Pro\nR 33,909R 34,999i\neB 339,090Discovery Miles 339,090\nIn Stock\nCPT | \nJHB'
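
Each result comes back as one block of text with the fields separated by newlines. A minimal sketch of splitting such a block into fields is shown below. `parse_listing` is a hypothetical helper, and it assumes the layout seen in the output above: title on the first line, specs on the second, and a line starting with `R` whose first amount is the selling price. Other listings may not follow that layout.

```python
import re


def parse_listing(text):
    """Split one scraped result block into a title, a spec line and a price."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    title = lines[0] if lines else ""
    specs = lines[1] if len(lines) > 1 else ""

    price = None
    for line in lines:
        # Price lines look like "R 33,909R 34,999i"; keep the first amount.
        match = re.match(r"R\s?(\d[\d,]*)", line)
        if match:
            price = int(match.group(1).replace(",", ""))
            break

    return {"title": title, "specs": specs, "price": price}


sample_text = (
    'DELL XPS 15 Core i7-8750H 15.6" FHD Notebook - Black\n'
    "8GB | 256 | 4GFX | Win 10 Pro\n"
    "R 33,909R 34,999i\n"
    "eB 339,090Discovery Miles 339,090\n"
    "In Stock\nCPT | \nJHB"
)
print(parse_listing(sample_text))
```

Splitting the text this way gives separate columns to work with later, instead of one long string per product.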

In [28]:
# let's take everything
all_data = driver.find_elements_by_css_selector("div.p-data.left")

In [29]:
for i in all_data:
    print(i.text)

DELL XPS 15 Core i7-8750H 15.6" FHD Notebook - Black
8GB | 256 | 4GFX | Win 10 Pro
R 33,909R 34,999i
eB 339,090Discovery Miles 339,090
In Stock
CPT | 
JHB
Dell XPS 9570 15" Intel Core i7-8750H 32GB - Notebook
32GB Ram | 1024GB SSD | Win 10 Pro
R 53,499
eB 534,990Discovery Miles 534,990
Shipped in 3 - 5 working days
Sold by Click Tek - Fulfilled by Takealot
Dell XPS 9570 15" Intel Core i7-8750H 16GB - Notebook
16GB Ram | 512GB SSD | Win 10 Pro
R 42,999R 43,999i
eB 429,990Discovery Miles 429,990
Shipped in 3 - 5 working days
DELL XPS 15 Core i5-8300H 15.6" FHD Notebook - Black
8GB | 1TB+128 | 4GFX | Win 10 Pro
R 31,629
eB 316,290Discovery Miles 316,290
In Stock
JHB
Dell XPS 13 9370 Intel Core i7-8550U 13.3" Notebook
8GB | 256GB SSD | Win10 Pro
R 33,999R 35,999i
eB 339,990Discovery Miles 339,990
In Stock
CPT
Dell XPS 15 9570 Core i5-8300H 15.6" Notebook - Silver
8GB | 128GB | Win10
R 28,999R 29,999i
eB 289,990Discovery Miles 289,990
In Stock
CPT
Dell G3 Core i7-8750H 15.6" Gaming Notebook

### Load the info into a DataFrame

In [33]:
import pandas as pd

In [34]:
df = pd.DataFrame(columns=['result'])

In [35]:
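# DataFrame.append was removed in pandas 2.0, so build the new rows and concatenate them
new_rows = pd.DataFrame({'result': [i.text for i in all_data]})
df = pd.concat([df, new_rows], ignore_index=True)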

In [36]:
df.shape

(20, 1)

In [37]:
df.head()

                                              result
0  DELL XPS 15 Core i7-8750H 15.6" FHD Notebook -...
1  Dell XPS 9570 15" Intel Core i7-8750H 32GB - N...
2  Dell XPS 9570 15" Intel Core i7-8750H 16GB - N...
3  DELL XPS 15 Core i5-8300H 15.6" FHD Notebook -...
4  Dell XPS 13 9370 Intel Core i7-8550U 13.3" Not...


In [38]:
# searching multiple pages
next_page = driver.find_element_by_css_selector("a.page-next")

In [39]:
next_page.click()

In [40]:
all_data = driver.find_elements_by_css_selector("div.p-data.left")

In [41]:
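# DataFrame.append was removed in pandas 2.0, so build the new rows and concatenate them
new_rows = pd.DataFrame({'result': [i.text for i in all_data]})
df = pd.concat([df, new_rows], ignore_index=True)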

In [42]:
df.shape

(40, 1)
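
The click-next-and-collect steps above can be repeated in a loop until no next-page link is left. The sketch below shows one way to do that. `collect_all_pages`, `FakeElement` and `FakeDriver` are hypothetical names, the fake classes only stand in for a real WebDriver so that the snippet runs without a browser, and a real site would probably also need a wait after each click so the new page can load.

```python
def collect_all_pages(driver, item_selector, next_selector, max_pages=10):
    """Gather the text of every matching element, following next-page links."""
    results = []
    for _ in range(max_pages):
        items = driver.find_elements("css selector", item_selector)
        results.extend(item.text for item in items)
        next_links = driver.find_elements("css selector", next_selector)
        if not next_links:
            break  # no next-page link, so this was the last page
        next_links[0].click()
    return results


class FakeElement:
    """Hypothetical element with text and an optional click action."""

    def __init__(self, text="", on_click=None):
        self.text = text
        self._on_click = on_click

    def click(self):
        if self._on_click:
            self._on_click()


class FakeDriver:
    """Hypothetical driver that serves a fixed list of result pages."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def _go_next(self):
        self.index += 1

    def find_elements(self, by, value):
        if value == "a.page-next":
            has_next = self.index < len(self.pages) - 1
            return [FakeElement(on_click=self._go_next)] if has_next else []
        return [FakeElement(text) for text in self.pages[self.index]]


driver_stub = FakeDriver([["laptop A", "laptop B"], ["laptop C"]])
rows = collect_all_pages(driver_stub, "div.p-data.left", "a.page-next")
print(rows)
```

The `max_pages` cap is a safety net so the loop cannot run forever if the next-page link never disappears.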

In [43]:
# Export to csv
df.to_csv('xps15_prices.csv')

In [None]:
# https://selenium-python.readthedocs.io/locating-elements.html

In [44]:
# close the browser
driver.close()

# References
https://www.w3schools.com/html/  
https://selenium-python.readthedocs.io/  
https://www.w3schools.com/tags/ref_byfunc.asp   
https://selenium-python.readthedocs.io/locating-elements.html  