# 1.4: Accessing Web Data with Data Scraping

In [2]:
# Import libraries

import pandas as pd
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import matplotlib.pyplot as plt 
import os
import logging

In [7]:
from bs4 import BeautifulSoup
import requests

In [4]:
# Install the driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

In the Selenium 4 release I am using, the API has changed: the webdriver.Chrome() constructor no longer accepts executable_path directly. Instead, I must pass a Service object.
Running this cell opens a new Chrome window, which will be pointed at the URL of interest in the next step: https://en.wikipedia.org/wiki/Key_events_of_the_20th_century.

In [15]:
# Get the page’s contents

page = requests.get("https://en.wikipedia.org/wiki/Key_events_of_the_20th_century")

In [16]:
# Create soup and get title

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.title)

None


The request isn’t actually returning the article HTML. It looks like Wikipedia rejects “bare” Python requests and serves some other response (possibly a redirect or an error page) that doesn’t contain a <title> tag. The checks below narrow down which one it is.
BeautifulSoup parses that response, finds no <title>, and returns None.

In [18]:
print(page.status_code)

403


In [19]:
print(page.url) 

https://en.wikipedia.org/wiki/Key_events_of_the_20th_century


In [20]:
print(page.text[:500])

Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also https://phabricator.wikimedia.org/T400119.
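The three checks above (status code, final URL, start of the body) can be folded into one guard that runs before parsing. The following is a minimal sketch, not part of the original notebook: extract_title is a hypothetical helper name, and it takes the status code and body as plain arguments so it can be tried without a network call.

```python
from bs4 import BeautifulSoup


def extract_title(status_code, body):
    """Return the page title text, or None if the response is unusable.

    Hypothetical helper: status_code and body would normally come from
    page.status_code and page.text on a requests response.
    """
    if status_code != 200:
        # Show the start of the body, since servers often explain the refusal there
        print(f"Request failed with HTTP {status_code}: {body[:80]!r}")
        return None
    title_tag = BeautifulSoup(body, "html.parser").title
    # soup.title is None when the document has no <title> element
    return title_tag.get_text(strip=True) if title_tag else None


# The 403 body shown above contains no markup, so no title can be found
blocked_body = "Please set a user-agent and respect our robot policy https://w.wiki/4wJS."
print(extract_title(403, blocked_body))
```

In the notebook this would be called as extract_title(page.status_code, page.text), which makes the 403 visible immediately instead of surfacing later as an unexplained None.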



In [21]:
headers = {"User-Agent": "Mozilla/5.0"}
page = requests.get("https://en.wikipedia.org/wiki/Key_events_of_the_20th_century", headers=headers)

soup = BeautifulSoup(page.text, 'html.parser')

print(soup.title)

<title>Key events of the 20th century - Wikipedia</title>


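- The User-Agent header identifies the client making the request. A browser-like value such as "Mozilla/5.0" is enough to get past the block, although the error message itself asks for a user-agent to be set and for the robot policy to be respected.
- With that header set, Wikipedia serves the full article HTML.
- BeautifulSoup then finds the <title> tag correctly.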
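Since the 403 message asks for a user-agent and points to the robot policy, a more descriptive header than "Mozilla/5.0" may be the safer choice. The sketch below is an assumption-laden variant, not what the notebook ran: build_headers and fetch_html are hypothetical helpers, and the application name, version, and contact address are placeholders to replace with real values.

```python
import requests

URL = "https://en.wikipedia.org/wiki/Key_events_of_the_20th_century"


def build_headers(app_name, version, contact):
    """Build request headers with a descriptive User-Agent (hypothetical helper)."""
    return {"User-Agent": f"{app_name}/{version} ({contact})"}


def fetch_html(url, headers, timeout=10):
    """Fetch a page and raise an exception on HTTP errors such as 403."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of parsing an error body
    return response.text


# Placeholder identity values: replace them with your own project and contact details
headers = build_headers("20th-century-scraper", "0.1", "you@example.com")
print(headers["User-Agent"])

# Uncomment to perform the actual request:
# html = fetch_html(URL, headers)
```

Calling raise_for_status() means a blocked request stops the cell with an HTTPError rather than handing an error page to BeautifulSoup.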

In [22]:
print(page.text[:500])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vect


In [23]:
# Create a new object that stores all the text, encode it as UTF-8, and save the file in the working directory

text = soup.get_text()

text = text.encode('utf-8')

with open('Key_events_of_the_20th_century.txt', 'wb') as f:
    f.write(text)
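As an optional variation on the save step, the text can be written in text mode with an explicit encoding, which avoids the separate encode call. This is only a sketch: save_page_text is a hypothetical helper, the sample HTML stands in for the real page, and the separator and strip arguments to get_text are an added choice that drops blank runs between elements.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def save_page_text(html, out_path):
    """Extract visible text from HTML and save it as UTF-8 (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # One stripped string per line; whitespace-only strings are skipped
    text = soup.get_text(separator="\n", strip=True)
    path = Path(out_path)
    path.write_text(text, encoding="utf-8")
    return path


# Small stand-in document so the example runs without a network request
sample_html = "<html><head><title>Demo</title></head><body><p>First</p><p>Second</p></body></html>"
out_file = Path(tempfile.gettempdir()) / "demo_page.txt"
saved = save_page_text(sample_html, out_file)
print(saved.read_text(encoding="utf-8"))
```

With the real page, the call would be save_page_text(page.text, 'Key_events_of_the_20th_century.txt').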