<a href="https://colab.research.google.com/github/AdaChornelia/30-seconds-of-code/blob/master/Copy_of_Introduction_to_web_scraping_and_text_preprocessing_demo1_20240325.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Web Scraping and Text Preprocessing: Demo 1

Welcome to this seminar of the Research Data Academy, organized by HKU Libraries.




# The basic workflow

## Part 0: Prepare Environment

If you are using your local PC, you have to install the required packages in your Python environment before running this IPython notebook.

To install them, uncomment the following command by removing the `#`.

In [25]:
#!pip install bs4 requests

## Part 1: Your First Scraper

**Step 1: import required library**

In [26]:
from urllib.request import urlopen

**Step 2: Initialize a variable to store the url string**

In [27]:
url = "https://lib.hku.hk"

**Step 3: Send the request to the target url**

In [28]:
response = urlopen(url)

You will then get the HTTP response object:

In [29]:
response

<http.client.HTTPResponse at 0x7d2d90172cb0>

**Step 4: read the bytes from the response**

In [30]:
html_bytes = response.read()


<img src="https://upload.wikimedia.org/wikipedia/commons/9/92/Bits_and_Bytes.svg" width="400">


Any file is just a series of bytes stored on disk.
A string is a sequence of characters, which cannot be stored on disk directly. A byte string, on the other hand, is a sequence of bytes, which can be stored on disk. The mapping between the two is called an encoding.

To read text from a file properly, we need to decode the bytes on disk with the correct encoding. Similarly, we need to encode a string to bytes before saving it to disk.

When Python prints a byte string, it adds the prefix `b` and shows every byte in the printable ASCII range as the corresponding ASCII character.
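The relationship between strings and bytes can be tried out on a small made-up string before looking at the real page. This is only a sketch; the sample text and variable names are illustrative and not part of the seminar material.

```python
# A string must be encoded to bytes before it can be stored or sent.
text = "café 圖書館"
utf8_bytes = text.encode("utf-8")

# Bytes outside the ASCII range are printed as \x.. escapes.
print(utf8_bytes)

# The number of characters and the number of bytes differ.
print(len(text), len(utf8_bytes))

# Decoding with the same encoding restores the original string.
restored = utf8_bytes.decode("utf-8")

# Decoding with a different encoding raises no error here,
# but it produces the wrong characters.
garbled = utf8_bytes.decode("latin-1")
print(restored == text, garbled == text)
```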

In [31]:
print(html_bytes)

b'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">\n<head>\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<meta http-equiv="X-UA-Compatible" content="IE=edge"> \n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="shortcut icon" href="/sites/all/themes/bootstrap/hkul_subtheme/favicon.ico" type="image/vnd.microsoft.icon" />\n<title>HKU Libraries</title>\n<style>\n@import url("/modules/system/system.base.css?");\n</style>\n<style media="screen">\n@import url("/sites/all/modules/tipsy/stylesheets/tipsy.css?");\n</style>\n<style media="screen">\n@import url("/sites/all/modules/views_slideshow/views_slideshow.c

**Step 5: Decode to text**

As discussed above, the original HTML file is encoded and sent to the client as bytes, so we need to decode those bytes with the right encoding to turn them back into a string that displays correctly and supports further operations.

Most HTML files declare the document's encoding in a `<meta>` tag within the `<head>` section.

Example:
`<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />`
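Instead of hard-coding the encoding, the declaration can be read from the bytes themselves. The following is a minimal sketch, assuming the declaration appears near the top of the document; `detect_charset` is a hypothetical helper written for this illustration, not a library function.

```python
import re

def detect_charset(html_bytes, default="utf-8"):
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # The declaration itself is plain ASCII, so the raw bytes can be searched directly.
    match = re.search(rb'charset=["\']?([\w-]+)', html_bytes[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default

sample = b'<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>'
encoding = detect_charset(sample)
html_text = sample.decode(encoding)
print(encoding)
```

With the page downloaded earlier, the same idea would read `html_bytes.decode(detect_charset(html_bytes))`.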

In [32]:
html = html_bytes.decode("utf-8")

In [33]:
html

'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">\n<head>\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<meta http-equiv="X-UA-Compatible" content="IE=edge"> \n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="shortcut icon" href="/sites/all/themes/bootstrap/hkul_subtheme/favicon.ico" type="image/vnd.microsoft.icon" />\n<title>HKU Libraries</title>\n<style>\n@import url("/modules/system/system.base.css?");\n</style>\n<style media="screen">\n@import url("/sites/all/modules/tipsy/stylesheets/tipsy.css?");\n</style>\n<style media="screen">\n@import url("/sites/all/modules/views_slideshow/views_slideshow.cs

## Part 2: Extract the title text from HTML with string methods

Call the `find` method on a string variable and pass it the target substring. It returns the index at which the target first occurs in the string, or -1 if the target is not found.

A Python string is a sequence of characters, so you can access a single character by its index, or extract part of the string by an index range.

**Step 1: find the opening `<title>` tag**

In [34]:
title_index = html.find("<title>")
title_index

704

**Step 2: Get the target text starting index by adding the length of the title tag**

In [35]:
start_index = title_index + len("<title>")
start_index

711

**Step 3: find the closing `</title>` tag**

In [36]:
end_index = html.find("</title>")
end_index

724

**Step 4: Extract the title by slicing the html string**

In [37]:
title = html[start_index:end_index]
title

'HKU Libraries'

Part of a Python string can be extracted by giving the start and end index in square brackets.

Example:
```
example_string = 'abcdefg'
extract = example_string[2:4]
```

Result:
cd

The snippet above extracts the part of the string that starts at index 2 and ends before index 4. Remember that string indexes start at 0.

In [38]:
example_string = 'abcdefg'
extract = example_string[2:4]
extract

'cd'

However, string search only works if you know the exact string in the HTML document. For example, if the title tag is written with an extra space, like `< title>`, the search returns -1 because nothing matches, and we end up extracting the wrong information.
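The failure is silent, which makes it easy to miss. The sketch below repeats the steps from this part on a made-up document whose opening tag contains an extra space; `html_variant` is an illustrative string, not the real page.

```python
html_variant = "<html><head>< title>HKU Libraries</title></head></html>"

# There is no exact match for "<title>" in this document, so find() returns -1.
title_index = html_variant.find("<title>")
start_index = title_index + len("<title>")
end_index = html_variant.find("</title>")

# No error is raised; the slice simply starts at the wrong position.
wrong_title = html_variant[start_index:end_index]
print(title_index, repr(wrong_title))
```

Checking for -1 before slicing turns this silent mistake into something the program can report.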

If we have to deal with all these problems in every single HTML document, this is not a reliable way to collect data efficiently.

You may think about using patterns to help with extraction, such as regular expressions, which are a common programming concept. However, it still takes time to write all the rules needed to extract our target information.
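To give a feel for what such a rule looks like, here is a minimal sketch of a regular expression that tolerates extra spaces, attributes, and upper-case tag names. The pattern and the helper name `extract_title` are illustrative assumptions; real pages can still contain variations this pattern does not cover.

```python
import re

# Tolerate whitespace and attributes inside the opening tag, and any letter case.
TITLE_PATTERN = re.compile(
    r"<\s*title[^>]*>(.*?)<\s*/\s*title\s*>",
    re.IGNORECASE | re.DOTALL,
)

def extract_title(html):
    """Return the text of the first title element, or None if there is none."""
    match = TITLE_PATTERN.search(html)
    return match.group(1).strip() if match else None

print(extract_title("<head>< title>HKU Libraries</title></head>"))
print(extract_title("<HEAD><TITLE lang='en'>\n HKU Libraries \n</TITLE></HEAD>"))
print(extract_title("<head></head>"))
```

Even this single rule is already harder to read than the `find` version, and every new tag needs its own pattern.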

## Part 3: Use an HTML Parser (BeautifulSoup)

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

From Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#module-bs4

### Scenario 1


Your boss asks you to build a web scraper to get the table from the Wikipedia page List_of_countries_by_population_(United_Nations).

### Let's define steps


1. Import the necessary libraries
2. Store our url into a variable
3. Send request to the target and get the decoded text
4. Construct the soup object with scraped text
5. Extract all tables to a list
6. Identify the target table for the country list
7. Parse the table into a pandas dataframe

### Step 1: import the library to our environment

In [39]:
from bs4 import BeautifulSoup # this module helps with web scraping
import requests # this module helps us download a web page
import pandas as pd

### Step 2: store our url into a variable

In [40]:
#this url contains the data you need to scrape
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

To request the target URL, we will use the third-party HTTP client library `requests` instead of the built-in `urllib`.

The `requests` library fetches the URL and returns a response object. From that object we can check the status returned by the server and the encoding of the response.

In addition, it provides the attributes `.content`, to access the raw bytes, and `.text`, to access the decoded Unicode text. That means it can give you the result already decoded.
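The attributes mentioned above can be inspected before the text is used. This is a minimal sketch; `fetch_html` and the 10-second timeout are assumptions made for this illustration, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its decoded text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)

    # raise_for_status() raises an HTTPError for 4xx and 5xx responses,
    # so an error page is not mistaken for the real content.
    response.raise_for_status()

    # status_code is the HTTP status; encoding is the charset that
    # requests will use when building response.text.
    print(response.status_code, response.encoding)

    # response.content would give the raw bytes instead.
    return response.text
```

It would be called as `data = fetch_html(url)` with the `url` variable defined in Step 2.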



### Step 3: send request to the target and get the decoded text

In [41]:
data = requests.get(url).text

### Step 4: Construct the soup object with scraped text

<img src="https://upload.wikimedia.org/wikipedia/commons/e/eb/DOM_tree.svg" width="200">

*image from wikimedia*

Constructing the soup object converts the plain text of the source code into a tree of tags that can be used for analysis. We will use the `html5lib` parser, which is installed in the Colab environment by default. If you are using your local PC, please install the library with

`pip install html5lib`

To compare different parsers, please read: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

In [42]:
soup = BeautifulSoup(data, 'html5lib')

### Step 5: Extract all tables to a list

The soup object supports many methods to search for and get elements from the tree, for example `.select()`, `.select_one()`, and `.find_all()`.

`.select()` and `.select_one()` take a CSS selector, while `.find_all()` accepts a string, a regular expression, or `True` as a filter.

 Read more: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree
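The search methods are easier to compare on a tiny document. The sketch below uses made-up HTML and Python's built-in `html.parser` so that it needs no extra install; the class names in `sample_html` are sample data only.

```python
import re
from bs4 import BeautifulSoup

sample_html = """
<table class="wikitable"><tr><th>Location</th></tr><tr><td>India</td></tr></table>
<table class="navbox"><tr><td>Links</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# CSS selectors: every <table>, then the first table with class "wikitable"
all_tables = sample_soup.select("table")
first_wikitable = sample_soup.select_one("table.wikitable")

# find_all with a string filters by tag name
cells = sample_soup.find_all("td")

# find_all with a regular expression matches tag names against the pattern
t_tags = sample_soup.find_all(re.compile("^t[dh]$"))

# find_all with True matches every tag in the document
every_tag = sample_soup.find_all(True)

print(len(all_tables), first_wikitable.th.text, [c.text for c in cells])
```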


In [43]:
tables = soup.select("table")

### Step 6: Identify the target table for the country list

In [44]:
table = tables[0]

In [45]:
## you can confirm by pretty-printing the table
# print(table.prettify())
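Picking the table by position works only as long as the target stays first on the page. As an alternative, a table can be chosen by the text of its header cells. This is a sketch on made-up HTML; `find_table_by_header` is a hypothetical helper, and `html.parser` is used only so the example runs without extra installs.

```python
from bs4 import BeautifulSoup

def find_table_by_header(soup, header_text):
    """Return the first <table> with a <th> containing header_text, else None."""
    for candidate in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in candidate.find_all("th")]
        if any(header_text in header for header in headers):
            return candidate
    return None

demo_soup = BeautifulSoup(
    "<table><tr><th>Note</th></tr></table>"
    "<table><tr><th>Location</th><th>Change</th></tr></table>",
    "html.parser",
)

target = find_table_by_header(demo_soup, "Location")
print(target.find("th").get_text())
```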

### Step 7: Parse the table into a pandas dataframe

In [46]:
rows = []

for index, row in enumerate(table.find("tbody").find_all("tr")):
  # skip the first row, which holds the column headers
  if index > 0:

    col = row.find_all("td")

    # store the text of each cell under its column name
    rows.append({
        'Location': col[0].text,
        'Population_2022': col[1].text,
        'Population_2023': col[2].text,
        'Change': col[3].text,
        'Continental': col[4].text,
    })

# build the dataframe once from the collected rows
df = pd.DataFrame(rows, columns=['Location', 'Population_2022', 'Population_2023','Change','Continental'])

# drop the first row
df.drop(index=df.index[0], axis=0, inplace=True)



### Bonus: Visualization recommendation from Google Colab (Optional)

When you view the dataframe variable in Colab, it will recommend visualizations for the data.

In [47]:
df

Unnamed: 0,Location,Population_2022,Population_2023,Change,Continental
1,India,1417173173,1428627663,+0.81%,Asia
2,China[a],1425887337,1425671352,−0.02%,Asia
3,United States,338289857,339996564,+0.50%,Americas
4,Indonesia,275501339,277534123,+0.74%,Asia
5,Pakistan,235824862,240485658,+1.98%,Asia
...,...,...,...,...,...
234,Falkland Islands (United Kingdom),3780,3791,+0.29%,Americas
235,Niue,1934,1935,+0.05%,Oceania
236,Tokelau (New Zealand),1871,1893,+1.18%,Oceania
237,Vatican City[x],510,518,,Europe
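The scraped values are still strings, and some carry residue from the page, such as the footnote marker in `China[a]` or the Unicode minus sign in `−0.02%`. The following is a minimal preprocessing sketch; the three helper names are hypothetical, and the handling of thousands separators is an assumption about how the numbers may appear on the page.

```python
import re

def clean_location(text):
    """Remove footnote markers such as [a] or [x] and surrounding whitespace."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()

def to_int(text):
    """Convert a population string to int, ignoring thousands separators."""
    return int(text.replace(",", "").strip())

def to_percent(text):
    """Convert a string like '+0.81%' to float, or return None if it is empty."""
    # U+2212 is the typographic minus sign, which float() does not accept.
    cleaned = text.strip().replace("\u2212", "-").rstrip("%")
    return float(cleaned) if cleaned else None

print(clean_location("China[a]"), to_int("1,425,671,352"), to_percent("\u22120.02%"))
```

Each helper can be applied to a column with `map`, for example `df['Location'].map(clean_location)`.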


## Part 4: Store and read the data again

Once Google Drive is mounted in Colab, we can read and write files on it.

In [48]:
df.to_csv("drive/MyDrive/Colab Notebooks/countries_population_united_nations.csv", index=False)

OSError: Cannot save file into a non-existent directory: 'drive/MyDrive/Colab Notebooks'
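The `OSError` above is raised when the target folder does not exist, for example when Google Drive has not been mounted yet. The sketch below mounts the drive when running in Colab and makes sure the folder exists before writing. `drive.mount` is Colab's own helper; `get_output_dir` and the local fallback folder `output` are assumptions made for this illustration.

```python
from pathlib import Path

def get_output_dir():
    """Return a folder that exists and can be written to (hypothetical helper)."""
    try:
        # google.colab can only be imported inside a Colab runtime
        from google.colab import drive
    except ImportError:
        # assumption: outside Colab, fall back to a local folder named "output"
        out_dir = Path("output")
    else:
        # asks for authorization the first time it runs
        drive.mount("/content/drive")
        out_dir = Path("/content/drive/MyDrive/Colab Notebooks")

    # create the folder if it is missing, so that to_csv does not fail
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir

csv_path = get_output_dir() / "countries_population_united_nations.csv"
print(csv_path)
```

The dataframe can then be saved with `df.to_csv(csv_path, index=False)`.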

In [None]:
df2 = pd.read_csv("drive/MyDrive/Colab Notebooks/countries_population_united_nations.csv",index_col=False)

In [None]:
df2

# Beyond static web pages: a demo that simulates human actions

### Scenario 2

Your boss asks you to build a web scraper to get a Harry Potter character's details from Wikipedia by searching for "harry potter" and clicking the character link on the result page.

### Let's define steps

1. Import the necessary libraries from Selenium.
2. Set up the web driver for Chrome, providing the path to the Chrome driver executable.
3. Define the URL and navigate to the Wikipedia website.
4. Find the search bar element, type "harry potter", then press Enter.
5. Check the current URL and display a screenshot captured in base64 format.
6. Find the first link containing "character", then click it.
7. Check the current URL and display a screenshot captured in base64 format again.
8. Find the `<p>` tags and store their text content in a CSV file with pandas.
9. Quit the web driver to close the browser.

## Step 0: Setup web driver before using selenium

In [None]:
%%shell
# Set up for running selenium in Google Colab
# You don't need to run this cell in a local Jupyter notebook or other local Python setup
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
CHROME_DRIVER_VERSION=`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P /tmp/
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/
chmod +x /tmp/chromedriver
mv /tmp/chromedriver /usr/local/bin/chromedriver
pip install selenium

In [None]:
# fix the version compatibility issue by installing chromedriver-autoinstaller
!pip install chromedriver-autoinstaller

# add the driver to the sys path so selenium can find the driver
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

**Test our setup by getting the page title from wikipedia**

In [None]:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import chromedriver_autoinstaller

# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# download a chromedriver that matches the installed Chrome and add it to PATH
chromedriver_autoinstaller.install()

# set the target URL
url = "https://wikipedia.org/"

# set up the webdriver
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
print(driver.title)
driver.quit()

## Step 1: Import the necessary libraries

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import chromedriver_autoinstaller
from IPython import display
from base64 import b64decode
import pandas as pd

## Step 2: Set up the web driver

In [None]:
# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# download a chromedriver that matches the installed Chrome and add it to PATH
chromedriver_autoinstaller.install()

# set up the webdriver
driver = webdriver.Chrome(options=chrome_options)

## Step 3: Define the URL and navigate to the Wikipedia website

In [None]:
# set the target URL
url = "https://www.wikipedia.org/"

# fetch the url
driver.get(url)


In [None]:
# check the webpage title
driver.title

## Step 4: Find the search bar element and search for "harry potter"

In [None]:
search_bar = driver.find_element(By.ID, "searchInput")

In [None]:
search_bar.send_keys("harry potter")
search_bar.send_keys(Keys.RETURN)

In [None]:
# check the current url after sending the new action
print(driver.current_url)

In [None]:
# take a screenshot and display in notebook
screen_1_data = driver.get_screenshot_as_base64()
display.Image(b64decode(screen_1_data))

## Step 7: Find the first link whose text contains "character" and click it

In [None]:
character_link = driver.find_element(By.PARTIAL_LINK_TEXT, "character")
character_link.click()

In [None]:
# check the current url after sending the new action
print(driver.current_url)

In [None]:
# take a screenshot and display in notebook
screen_2_data = driver.get_screenshot_as_base64()
display.Image(b64decode(screen_2_data))

## Step 9: Find the `<p>` tags and store their text content in a CSV file with pandas

In [None]:
paragraphs = driver.find_elements(By.TAG_NAME, "p")

In [None]:
# check the collected text; the text of an element is accessed through the .text attribute
paragraphs[1].text

In [None]:
data = []
for paragraph in paragraphs:
    data.append([paragraph.text])

df = pd.DataFrame(data, columns=["Paragraph"])

df.to_csv("drive/MyDrive/Colab Notebooks/harry_potter_character_paragraphs.csv", index=False, encoding="utf-8")
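Paragraphs scraped this way often include empty entries and citation markers such as `[1]`. A small preprocessing step before saving keeps the CSV cleaner. This is a minimal sketch; `clean_paragraphs` is a hypothetical helper and `sample_texts` is made-up data standing in for the scraped paragraph texts.

```python
import re

def clean_paragraphs(texts):
    """Strip whitespace, remove citation markers like [1], and drop empty entries."""
    cleaned = []
    for text in texts:
        # remove numeric citation markers, then trim surrounding whitespace
        text = re.sub(r"\[\d+\]", "", text).strip()
        if text:
            cleaned.append(text)
    return cleaned

sample_texts = ["", "Harry Potter is a fictional character.[1][2] ", "\n"]
print(clean_paragraphs(sample_texts))
```

With the scraped elements, it would be called as `clean_paragraphs([p.text for p in paragraphs])` before building the dataframe.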

## Step 10: Quit the web driver

In [None]:
driver.quit()