### Automatic data collection

### Wikipedia Web Scraping Task

Choose a topic that interests you on **wikipedia.org** and complete the following tasks:

1. **Capture the page title**
2. **Capture all section headings**
3. **Extract at least one image from the page**

In [1]:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Linus_Torvalds"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

print(soup)

Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.
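
The message above is what comes back when the request carries the default `requests` User-Agent. The sketch below illustrates the two things it asks for: a descriptive User-Agent and a check against the site's robots rules. The tool name, contact address, User-Agent format, and the sample robots rules are all hypothetical placeholders, not Wikipedia's actual `robots.txt` or an official requirement.

```python
from urllib.robotparser import RobotFileParser


def build_user_agent(tool: str, version: str, contact: str) -> str:
    """Build a descriptive User-Agent string (the format is an assumption)."""
    return f"{tool}/{version} ({contact})"


def is_allowed(robots_lines: list[str], user_agent: str, url: str) -> bool:
    """Check a URL against robots rules supplied as a list of text lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory rules, no network access
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

user_agent = build_user_agent("wiki-scraping-exercise", "0.1", "student@example.org")
print(user_agent)
print(is_allowed(SAMPLE_ROBOTS, user_agent, "https://en.wikipedia.org/wiki/Linus_Torvalds"))
```

In a real run, the rules would be loaded from the site itself, for example with `RobotFileParser.set_url(...)` followed by `read()`, and the resulting string would be passed as the `User-Agent` header.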



In [2]:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Linus_Torvalds"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/116.0.0.0 Safari/537.36"
}

response = requests.get(URL, headers=headers)

soup = BeautifulSoup(response.text, "html.parser")

print(soup.prettify()[:1000])  



<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Linus Torvalds - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-

1. **Capture the page title**

In [3]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Linus_Torvalds"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)


Linus Torvalds - Wikipedia
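
The `<title>` element includes the site name as a suffix. If only the article name is wanted, a small helper can strip it. This is a minimal sketch; `article_title` is a hypothetical helper name, and the suffix `" - Wikipedia"` is assumed from the output above.

```python
def article_title(page_title: str, suffix: str = " - Wikipedia") -> str:
    """Return the page title without the trailing site-name suffix, if present."""
    if page_title.endswith(suffix):
        return page_title[: -len(suffix)]
    return page_title


print(article_title("Linus Torvalds - Wikipedia"))
```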


2. **Capture all section headings**

In [None]:
import requests

url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "parse",
    "page": "Linus_Torvalds",
    "format": "json",
    "prop": "sections"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}

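# Equivalent request URL:
# https://en.wikipedia.org/w/api.php?action=parse&page=Linus_Torvalds&format=json&prop=sections
response = requests.get(url, params=params, headers=headers)

# The section list sits under parse -> sections in the JSON response
data = response.json()
sections = data["parse"]["sections"]

# Print each section heading in page order
for section in sections:
    print(section["line"])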


Life and career
Early years
Linux
The Linus/Linux connection
Authority and trademark
Other software
Git
Subsurface
Sparse
Personal life
Awards and achievements
Media recognition
Bibliography
See also
Notes
References
Further reading
External links
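
The flat list above loses the nesting of sections and subsections. Each entry returned by the API also carries a `toclevel` field, which can be used to indent the headings. The sketch below uses a small hand-written sample rather than the live response; the nesting levels in it are illustrative assumptions, and `format_outline` is a hypothetical helper name.

```python
def format_outline(sections: list[dict]) -> list[str]:
    """Indent each heading by two spaces per nesting level below the top."""
    lines = []
    for section in sections:
        depth = int(section.get("toclevel", 1))  # fall back to top level if missing
        indent = "  " * (depth - 1)
        lines.append(f"{indent}{section['line']}")
    return lines


# Hand-written sample shaped like data["parse"]["sections"]; levels are assumed
sample_sections = [
    {"toclevel": 1, "line": "Life and career"},
    {"toclevel": 2, "line": "Early years"},
    {"toclevel": 2, "line": "Linux"},
    {"toclevel": 1, "line": "Personal life"},
]

print("\n".join(format_outline(sample_sections)))
```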


3. **Extract at least one image from the page**

In [None]:
import requests
from bs4 import BeautifulSoup
from IPython.display import Image, display

# Wikipedia page URL
URL = "https://en.wikipedia.org/wiki/Linus_Torvalds"

# Set headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}

# Fetch the page
response = requests.get(URL, headers=headers)

# Parse HTML
soup = BeautifulSoup(response.text, "html.parser")

# Find the infobox table
infobox = soup.find("table", class_="infobox")

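if infobox:
    # Take the first image inside the infobox
    image_tag = infobox.find("img")
    if image_tag and image_tag.get("src"):
        src = image_tag["src"]
        # Prepend the scheme only when the src is protocol-relative ("//upload...")
        image_url = "https:" + src if src.startswith("//") else src
        print("\nMain Image URL:", image_url)

        display(Image(url=image_url))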




Main Image URL: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Lc3_2018_%28263682303%29_%28cropped%29.jpeg/250px-Lc3_2018_%28263682303%29_%28cropped%29.jpeg
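
An `img` tag's `src` attribute can be protocol-relative (`//host/...`), site-relative (`/path/...`), or already absolute. One way to handle all three cases is `urllib.parse.urljoin`, resolved against the page URL. The sketch below uses made-up example paths, and `absolute_image_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Linus_Torvalds"


def absolute_image_url(page_url: str, src: str) -> str:
    """Resolve an img src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)


# Example src values (hypothetical paths)
print(absolute_image_url(PAGE_URL, "//upload.wikimedia.org/example.jpeg"))
print(absolute_image_url(PAGE_URL, "/static/images/example.png"))
print(absolute_image_url(PAGE_URL, "https://upload.wikimedia.org/example.jpeg"))
```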
