## Creating Your First Web Scraper

In [2]:
import requests
urltoget='https://bradfordtuckfield.com/indexarchive20210903.xhtml'
pagecode = requests.get(urltoget)
print(pagecode.text[0:600])

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
<hr>
<address>Apache/2.4.53 (Debian) Server at bradfordtuckfield.com Port 443</address>
</body></html>
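
The output above is the server's 404 page rather than the archived index, so any later parsing would run against an error message. Below is a minimal sketch of a guard, assuming only that the response object exposes `status_code` and `text` the way `requests` responses do. `text_if_ok` is a hypothetical helper name, and `SimpleNamespace` objects stand in for real responses so the example runs without network access.

```python
from types import SimpleNamespace

def text_if_ok(response):
    """Return the page text only when the server reported success (HTTP 200)."""
    if response.status_code != 200:
        # Fail loudly instead of scraping an error page by mistake.
        raise RuntimeError(f"request failed with status {response.status_code}")
    return response.text

# Stand-ins for real responses (hypothetical data, no network needed).
ok = SimpleNamespace(status_code=200, text="<html>real page</html>")
missing = SimpleNamespace(status_code=404, text="<h1>Not Found</h1>")

print(text_if_ok(ok))
try:
    text_if_ok(missing)
except RuntimeError as err:
    print(err)
```

With `requests` itself, calling `pagecode.raise_for_status()` right after `requests.get()` performs a comparable check.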



## Parsing HTML Code

In [5]:
# Scraping an Email Address

urltoget = 'https://bradfordtuckfield.com/'
pagecode = requests.get(urltoget)

mail_beginning=pagecode.text.find('Email:')
print(mail_beginning)

12244


In [12]:
print(pagecode.text[(mail_beginning):(mail_beginning+39)])

Email: contact at bradfordtuckfield.com


In [17]:
# Searching for Addresses Directly

urltoget = 'https://bradfordtuckfield.com/contactscrape.xhtml'
pagecode = requests.get(urltoget)

at_beginning=pagecode.text.find('@')
print(at_beginning)

-1


In [18]:
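# at_beginning is -1 because find() returns -1 when the substring is absent,
# so this slice is not centered on an address and prints nothing useful
print(pagecode.text[(at_beginning-8):(at_beginning+22)])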
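
Because `find()` signals "not found" with -1 rather than an error, slicing around its result can silently return the wrong text. The following is a small sketch of a safer lookup. `text_around` is a hypothetical helper, and the sample strings are made up for illustration.

```python
def text_around(text, marker, before=8, after=22):
    """Return the text surrounding the first occurrence of marker, or None if absent."""
    position = text.find(marker)
    if position == -1:
        return None  # marker not present, so there is nothing to slice
    start = max(position - before, 0)  # keep the start index from going negative
    return text[start:position + after]

print(text_around('Write to abc@def.com for details', '@'))
print(text_around('No address on this page', '@'))
```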




## Performing Searches with Regular Expressions

Regular expressions are special strings that enable advanced, flexible, custom searches of patterns in text.

In [19]:
import re

print(re.search(r'recommend','irrelevant text I recommend irrelevant text').span())
# finds the word in the larger string
# r = raw string
# span() - gives beginning and end locations of substring

(18, 27)


In [21]:
import re
print(re.search('rec+om+end', 'irrelevant text I recommend irrelevant text').span())
# + - metacharacter -> indicates repetition
# ex: c+ -> searches for one or more repetitions of the letter c
# A string that uses a metacharacter like + with a special, logical meaning is called a regular expression

(18, 27)


In [25]:
import re
print(re.search('rec+om+end','irrelevant text I recomend irrelevant text').span())
print(re.search('rec+om+end','irrelevant text I reccommend irrelevant text').span())
print(re.search('rec+om+end','irrelevant text I reommend irrelevant text').span())
# raises AttributeError: there is not one or more c after 're', so search() returns None
print(re.search('rec+om+end','irrelevant text I recomment irrelevant text').span())
# never runs because the line above already raised; it would fail too, bc there is no match for the d at the end

(18, 26)
(18, 28)


AttributeError: 'NoneType' object has no attribute 'span'
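
The `AttributeError` above appears because `re.search()` returns `None` when nothing matches, and `None` has no `span()` method. Here is a minimal sketch that checks for `None` first, so every test string gets evaluated. `find_span` is a hypothetical helper name.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.span() if match else None

for word in ['recomend', 'reccommend', 'reommend', 'recomment']:
    sentence = 'irrelevant text I ' + word + ' irrelevant text'
    print(word, find_span(r'rec+om+end', sentence))
```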

In [26]:
# Using Metacharacters for Flexible Searches

# asterisk (*) specifies preceding character is repeated zero or more times
re.search('10*','My bank balance is 100').span()

(19, 22)

In [28]:
import re
print(re.search('10*','My bank balance is 1').span())
print(re.search('10*','My bank balance is 1000').span())
print(re.search('10*','My bank balance is 1000000').span())
print(re.search('10*','My bank balance is 9000').span())
# error bc the string contains no 1, and the pattern requires a 1 before any 0s

(19, 20)
(19, 23)
(19, 26)


AttributeError: 'NoneType' object has no attribute 'span'

In [29]:
# ? - specifies that the preceding character appears either zero or one times
print(re.search('Clarke?','Please refer questions to Mr. Clark').span())

(30, 35)


In [30]:
# Fine-Tuning Searches with Escape Sequences

re.search('99+12=111','Example addition: 99+12=111').span()
# error bc the unescaped + is a metacharacter (one or more 9s), not a literal plus sign

AttributeError: 'NoneType' object has no attribute 'span'

In [31]:
re.search('99+12=111','Incorrect fact: 999912=111').span()

(16, 26)

In [32]:
re.search(r'99\+12=111','Example addition: 99+12=111').span()

(18, 27)

In [33]:
# backslash (\) - escapes a metacharacter so that it is matched literally
re.search(r'Clarke\?','Is anyone here named Clarke?').span()

(21, 28)

In [34]:
re.search(r'\\',r'The escape character is \\').span()

(24, 25)

In [35]:
re.search(r'\d','The loneliest number is 1').span()
# \d - searches for any digit (numbers 0 to 9)

(24, 25)

The following are other useful escape sequences using non-metacharacters: <br>
\D  Searches for anything that’s not a digit <br>
\s  Searches for whitespace (spaces, tabs, and newlines) <br>
\w  Searches for any word character (letters, digits, or underscores)
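
A short sketch of these escape sequences in action. The sample string is made up for illustration.

```python
import re

sample = 'Room_7 is free'  # hypothetical text

print(re.search(r'\d', sample).span())    # first digit
print(re.search(r'\D', sample).span())    # first character that is not a digit
print(re.search(r'\s', sample).span())    # first whitespace character
print(re.search(r'\w+', sample).group())  # first run of letters, digits, or underscores
```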

In [36]:
re.search('[a-z]','My Twitter is @fake; my email is abc@def.com').span()
# [a-z] - searches for characters that are in the "class" of characters between a and z

(1, 2)

In [37]:
re.search('[A-Z]','My Twitter is @fake; my email is abc@def.com').span()

(0, 1)

In [41]:
re.search('Manchac(a|k)','Lets drive on Manchaca.').span()
# pipe (|) - or logical expression; it goes inside parentheses, bc inside square brackets | is a literal character

(14, 22)

In [42]:
re.search('Manchac(a|k)','Lets drive on Manchack.').span()

(14, 22)
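
Square brackets and the pipe interact in a way that is easy to miss: inside a character class, `|` is an ordinary character rather than an "or". The sketch below compares the three spellings on made-up strings.

```python
import re

# Inside brackets the pipe is literal, so [a|k] also accepts 'Manchac|'.
print(re.search(r'Manchac[a|k]', 'Manchac|') is not None)

# [ak] accepts exactly one of the listed letters and nothing else.
print(re.search(r'Manchac[ak]', 'Manchac|') is not None)

# Outside brackets, | separates alternatives; parentheses limit its scope.
print(re.search(r'Manchac(a|k)', 'Lets drive on Manchack.').span())
```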

In [43]:
# Combining Metacharacters for Advanced Searches

The following are other metacharacters you should know: <br>
$  For the end of a line or string <br>
^  For the beginning of a line or string <br>
.  For a wildcard, meaning any character except the end of a line (\n)

In [45]:
re.search(r'school.*\.pdf$','schoolforgottenname.pdf').span()
# searching for a filename that contains school, then any other characters (possibly none), and ends with .pdf

(0, 23)

In [46]:
import re
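print(re.search(r'school.*\.pdf$','schoolforgottenname.pdf').span())
print(re.search(r'school.*\.pdf$','school.pdf').span())
print(re.search(r'school.*\.pdf$','schoolothername.pdf').span())
print(re.search(r'school.*\.pdf$','othername.pdf').span())
# raises AttributeError: the filename does not contain school, so the two lines below never run
print(re.search(r'school.*\.pdf$','schoolothernamepdf').span())
# would also fail: no literal . before pdf
print(re.search(r'school.*\.pdf$','schoolforgottenname.pdf.exe').span())
# would also fail: the string does not end with .pdf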

(0, 23)
(0, 10)
(0, 19)


AttributeError: 'NoneType' object has no attribute 'span'
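
Since the first failed search stops the cell, the remaining filenames are never tested. This sketch runs all six through the same pattern and reports "no match" instead of raising.

```python
import re

pattern = re.compile(r'school.*\.pdf$')
filenames = [
    'schoolforgottenname.pdf',
    'school.pdf',
    'schoolothername.pdf',
    'othername.pdf',
    'schoolothernamepdf',
    'schoolforgottenname.pdf.exe',
]

for name in filenames:
    match = pattern.search(name)
    # search() returns None on failure, so test the result before calling span()
    print(name, match.span() if match else 'no match')
```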

## Using Regular Expressions to Search for Email Addresses

In [47]:
# <some text>@<some more text>
re.search(r'[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]+',
          'My Twitter is @fake; my email is abc@def.com').span()

(33, 44)
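
The letters-only pattern works for `abc@def.com` but misses addresses that contain digits, dots, or subdomains. The sketch below compares it with a somewhat broader pattern. The broader pattern is an illustrative assumption, not a complete email validator, and the sample address is made up.

```python
import re

narrow = r'[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]+'
# Assumed broader pattern: word characters, dots, plus signs, and hyphens
# before the @, then one or more dot-separated domain labels after it.
broad = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'

text = 'Write to first.last99@mail.example.org.'  # hypothetical text

print(re.search(narrow, text))          # no match: a digit sits right before the @
print(re.search(broad, text).group())   # the full address, without the final period
```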

## Converting Results to Usable Data

In [58]:
import requests
urltoget = 'https://bradfordtuckfield.com/contactscrape2.xhtml'
pagecode = requests.get(urltoget)

In [59]:
allmatches=re.finditer(r'[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]+',pagecode.text)
# finditer() - returns an iterator over all matches, not just the first one

In [60]:
alladdresses = []
for match in allmatches:
    alladdresses.append(match[0])

print(alladdresses)

[]


In [61]:
import pandas as pd
alladdpd=pd.DataFrame(alladdresses)
print(alladdpd)

Empty DataFrame
Columns: []
Index: []


In [63]:
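# sort_values(0) needs a column labeled 0; the DataFrame above is empty and has no columns, so this raises KeyError
alladdpd=alladdpd.sort_values(0,ascending=False)
alladdpd.to_csv('alladdpd20220720.csv')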

KeyError: 0
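
The steps in this section (find every match, collect the matches in a list, sort in descending order, write a CSV) can be sketched on a sample string. This version swaps in the standard-library `csv` module for pandas so that it runs without extra packages, and it writes to an in-memory buffer instead of a file. The page text and the `address` column name are assumptions for illustration. An empty result simply produces a header-only CSV instead of an error.

```python
import csv
import io
import re

# Hypothetical page text standing in for pagecode.text.
sample_page = 'Contact: demo@demo.com, sales@demo.com or abc@def.com.'
pattern = r'[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]+'

# finditer() yields one match object per address; match[0] is the matched text.
addresses = [match[0] for match in re.finditer(pattern, sample_page)]
addresses.sort(reverse=True)  # descending order, like sort_values(0, ascending=False)

buffer = io.StringIO()  # stands in for open('addresses.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['address'])
writer.writerows([address] for address in addresses)
print(buffer.getvalue())
```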

## Using Beautiful Soup

The Beautiful Soup library allows us to search for the contents of particular HTML elements without writing any regular expressions.

In [64]:
import requests
from bs4 import BeautifulSoup

URL = 'https://bradfordtuckfield.com/indexarchive20210903.xhtml'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'lxml') # the 'lxml' parser requires the lxml package to be installed

all_urls = soup.find_all('a')
for each in all_urls:
    print(each['href'])

In [65]:
# Parsing HTML Label Elements

import requests
from bs4 import BeautifulSoup


URL = 'https://bradfordtuckfield.com/contactscrape.xhtml'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'lxml')

email = soup.find('label',{'class':'email'}).text
mobile = soup.find('label',{'class':'mobile'}).text
website = soup.find('a',{'class':'website'}).text

print("Email : {}".format(email))
print("Mobile : {}".format(mobile))
print("Website : {}".format(website))

AttributeError: 'NoneType' object has no attribute 'text'
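
The error above arises because `soup.find()` returns `None` when no element matches, and `None` has no `text` attribute. Below is a minimal sketch of a guard. `text_or_default` is a hypothetical helper, and a `SimpleNamespace` stands in for a matched tag so the example runs without Beautiful Soup or a network connection.

```python
from types import SimpleNamespace

def text_or_default(element, default='(not found)'):
    """Return element.text, or a placeholder when the lookup returned None."""
    return element.text if element is not None else default

found = SimpleNamespace(text='demo@demo.com')  # stand-in for a matched tag (made-up value)

print("Email : {}".format(text_or_default(found)))
print("Mobile : {}".format(text_or_default(None)))
```

In the cell above, the same helper could wrap each `soup.find(...)` call in place of the direct `.text` access.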

In [66]:
# Scraping and Parsing HTML Tables

import requests
from bs4 import BeautifulSoup


URL = 'https://bradfordtuckfield.com/user_detailsscrape.xhtml'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'lxml')

all_user_entries = soup.find_all('tr',{'class':'user-details'})
for each_user in all_user_entries:
    user = each_user.find_all("td")
    print("User Firstname : {}, Lastname : {}, Age: {}".format(
        user[0].text, user[1].text, user[2].text))

## Advanced Scraping

- Dynamic web pages: for pages whose content changes in response to scripts or user input, you may want to use another tool such as the Selenium library, which is designed for scraping them. With Selenium, your script can do things like enter information into website forms and click CAPTCHA-type challenges without requiring direct human input. <br>

- Proxy servers: a website might block your IP address from accessing its data, so you can set up a different server with a different IP address that the website hasn’t blocked. If the website goes on to block your proxy server's IP address as well, you can set up rotating proxies so that you keep getting fresh, unblocked IP addresses to scrape with. <br>

- Some websites allow scraping, and some even set up an application programming interface (API) to facilitate data access. An API allows you to query a website’s data automatically and receive data that’s in a user-friendly format. If you ever need to scrape a website, check whether it has an API that you can access. If a website has an API, the API documentation should indicate the data that the API provides and how you can access it. <br>

- To prevent the target site from crashing or blocking you, you can adjust your scraper so that it works more slowly. One way to slow down your script is to deliberately add pauses. For example, after downloading one row from a table, the script can pause and do nothing (the script can sleep) for 1 second or 2 seconds or 10 seconds, and then download the next row from the table.
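
The pausing idea from the last point can be sketched as follows. `process_slowly` is a hypothetical helper, the rows are stand-ins for whatever the scraper would download, and the `sleep` parameter exists only so the pause can be replaced in tests.

```python
import time

def process_slowly(items, handle, delay_seconds=1.0, sleep=time.sleep):
    """Call handle(item) for each item, pausing between consecutive items."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # pause before every item except the first
        results.append(handle(item))
    return results

rows = ['row 1', 'row 2', 'row 3']  # hypothetical stand-ins for table rows
print(process_slowly(rows, str.upper, delay_seconds=0.1))
```

In a real scraper, `handle` would be the function that downloads and parses one row or page, and `delay_seconds` would be chosen to suit the target site.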