# Web scraping

Web scraping is a technique used to extract data from websites. It involves sending HTTP requests to websites, parsing the returned HTML code, and extracting the desired data. Web scraping is a powerful tool for data scientists as it allows them to collect large amounts of data from the web. This data can then be used to train machine learning models, analyse trends, and make informed business decisions.

---
## 1.&nbsp; Import libraries 💾

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

---
## 2.&nbsp; Beautiful Soup 🍲

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library that simplifies the process of web scraping. It provides a user-friendly interface for parsing HTML documents, enabling users to extract specific information from websites. Through Beautiful Soup, you can navigate the HTML tree structure, locate elements based on their tags, attributes, and content, and extract the desired data into a structured format.

To illustrate how to use Beautiful Soup, we'll use the simplified mock website below. This stripped-down version serves as a practical learning tool, as real websites often possess much larger and more complex HTML structures. By starting with this simplified model, you can gradually build your skills and expertise, ensuring a solid understanding of the core concepts before tackling more intricate web scraping tasks.

In [None]:
html_doc = """
<html>
  <head>
    <title>The Dormouse's story
    </title>
  </head>
  <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


Beautiful Soup's HTML parser takes the raw, unruly HTML code and transforms it into a neatly organised tree structure, making the information easily accessible and manageable.

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

We can see the tree structure using Beautiful Soup's `.prettify()` method.

In [None]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



---
## 3.&nbsp; Navigating HTML for beginners 🧭
There are many methods in Beautiful Soup for exploring HTML data. By far the most popular and useful of these is `.find_all()`. So, naturally, this is where we'll start our journey.

### 3.1.&nbsp; `.find_all()`
The `.find_all()` method in Beautiful Soup returns a list of all the elements that match the specified criteria, such as tag name, class name, or attribute values.

#### 3.1.1.&nbsp; Searching by tag

The tag is the name that comes directly after the opening angle bracket. For example, the element below is an `a` tag.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

The `.find_all()` method takes a string argument and returns a list of all matching HTML tags within the current document. If no matching tags exist, an empty list is returned.

In [None]:
soup.find_all("title")

[<title>The Dormouse's story
     </title>]

In [None]:
soup.find_all("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
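As a quick check of this behaviour, the sketch below parses a tiny made-up snippet (`snippet` and `demo_soup` are names invented for this illustration, not part of the mock website) and shows that the result behaves like a list, including the empty case.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document used only for this illustration
snippet = "<ul><li>one</li><li>two</li></ul><p>three</p>"
demo_soup = BeautifulSoup(snippet, "html.parser")

items = demo_soup.find_all("li")        # every <li> tag, in document order
missing = demo_soup.find_all("table")   # there is no <table> in the snippet

print(len(items))   # how many <li> tags were found
print(items[0])     # results can be indexed like a list
print(missing)      # an empty result, no error is raised
```

Because an empty result is not an error, it is worth checking the length before indexing into it.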

#### 3.1.2.&nbsp; Searching by attribute

Attributes are the name/value pairs that follow the tag name inside the opening angle brackets. For example, the element below has `class`, `href`, `id`, and `meta` attributes.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

Attributes provide additional context and functionality to the elements. They can serve various purposes, including CSS selectors for styling, URLs for linking to external resources, metadata for storing relevant data, and a multitude of other information-bearing components. By leveraging these attributes, we can effectively target specific sections of the website.

##### 3.1.2.1.&nbsp; CSS selectors
CSS selectors are used to style particular sections of a website. This makes them very helpful for web scraping, as we can use the same selectors to target those sections.

###### 3.1.2.1.1.&nbsp; Class
Class selectors are used to style **multiple** HTML elements that share a common characteristic or function.
> **Note:** the argument is written `class_`, with a trailing underscore, because `class` is a reserved keyword in Python.

In [None]:
soup.find_all(class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

###### 3.1.2.1.2.&nbsp; ID
ID selectors are used to style a **single** HTML element, since an `id` is meant to be unique within a page.

In [None]:
soup.find_all(id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>]

In [None]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>]

##### 3.1.2.2.&nbsp; Other attributes
HTML elements can also include other attributes, which can be equally useful for identifying and targeting specific data points. To locate these attributes, search for them using the same method as you do for CSS selectors.

In [None]:
soup.find_all(meta="Youngest sister")

[<a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]
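Keyword arguments only work when the attribute name is a valid Python identifier. For names that are not, such as those containing a hyphen, `.find_all()` also accepts an `attrs` dictionary. The following is a minimal sketch; the `data-role` attribute and the snippet are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet: data-role is a hypothetical attribute, not part of the mock website
snippet = '<div data-role="card">A</div><div data-role="banner">B</div>'
demo_soup = BeautifulSoup(snippet, "html.parser")

# data-role contains a hyphen, so it cannot be written as a keyword argument;
# the attrs dictionary accepts any attribute name as a string key
cards = demo_soup.find_all(attrs={"data-role": "card"})
print(cards)
```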

#### 3.1.3.&nbsp; Searching by string
The text (string) is the part between the opening and closing tags; this is what's displayed on the webpage. For example, the element below has `Elsie` as its text.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

Instead of searching for specific tags or attributes, you can also search for this text. To do this, you can use a string or a regular expression to specify the text you're looking for.

In [None]:
soup.find_all(string="Dormouse")

[]

The string "Dormouse" didn't return any results because Beautiful Soup only matches strings that are exactly equal to the one you pass in. A string that merely contains "Dormouse" does not count as a match.

In [None]:
soup.find_all(string="The Dormouse's story")

["The Dormouse's story"]

To search for a substring, the easiest way is to pass a regular expression created with `re.compile()` from Python's `re` module.

In [None]:
import re
soup.find_all(string=re.compile("dormouse", re.IGNORECASE))

["The Dormouse's story\n    ", "The Dormouse's story"]

> **Note:** by default, patterns created with `re.compile()` are case-sensitive, so `re.compile("dormouse")` would not match `Dormouse`. To match regardless of case, pass the `re.IGNORECASE` flag to `re.compile()`.
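To see the effect of the flag side by side, here is a small sketch on a made-up two-paragraph snippet (not the mock website). It only counts the matches, so the difference is easy to spot.

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet with one capitalised and one lower-case occurrence
demo_soup = BeautifulSoup(
    "<p>The Dormouse's story</p><p>A dormouse sleeps</p>", "html.parser"
)

# Without a flag, only the capitalised "Dormouse" matches
case_sensitive = demo_soup.find_all(string=re.compile("Dormouse"))

# With re.IGNORECASE, both spellings match
case_insensitive = demo_soup.find_all(string=re.compile("dormouse", re.IGNORECASE))

print(len(case_sensitive), len(case_insensitive))
```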

### 3.2.&nbsp; Extracting text
There are a few ways to pull information out of an element in Beautiful Soup. Here we'll focus on two of them: extracting the text and extracting attribute values.

#### 3.2.1.&nbsp; `.get_text()`
The `.get_text()` method extracts all the human-readable text from a Beautiful Soup object, returning it as a string.

In [None]:
soup.find_all("title")

[<title>The Dormouse's story
     </title>]

In [None]:
soup.find_all("title").get_text()

AttributeError: ResultSet object has no attribute "get_text". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

> Read the error message and look at the output from the cell above. Can you work out why we got an error?

Answer: `.find_all()` returns a set of elements rather than a single element, so `.get_text()` has no single element to read the text from and raises an error.

In [None]:
# @title Click `show code` to see the solution to the error

# It was a list, read the error messages and notice the square brackets in the original output
# Therefore, we need to select the first and only element of this list
soup.find_all("title")[0].get_text()

"The Dormouse's story\n    "

We can also print out multiple items using our looping skills.

In [None]:
story = soup.find_all("p")
story

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [None]:
for p in story:
  print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
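The line breaks in the output above come straight from the whitespace in the HTML. `.get_text()` accepts a `separator` and a `strip` argument to tidy this up. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Made-up snippet with deliberately messy whitespace
snippet = "<div>\n  <h2> London </h2>\n  <p>Capital of England</p>\n</div>"
card = BeautifulSoup(snippet, "html.parser").find("div")

raw = card.get_text()                                # keeps every space and newline
clean = card.get_text(separator=" | ", strip=True)   # strips each piece, then joins them

print(repr(raw))
print(clean)
```

With `strip=True`, pieces of text that consist only of whitespace are dropped before joining.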


#### 3.2.2.&nbsp; Extracting attributes
HTML elements often store additional information within their attributes. To extract this data using Beautiful Soup, you can append square brackets after the element selector and specify the attribute name within them.

In [None]:
soup.find_all(id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>]

In [None]:
soup.find_all(id="link1")[0]['href']

'http://example.com/elsie'

In [None]:
soup.find_all(id="link1")[0]['meta']

'Eldest sister'
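Square brackets raise a `KeyError` when the attribute is missing. If you are not sure an attribute exists, `.get()` returns `None` instead, and `.attrs` gives you all attributes as a dictionary. A small sketch using a single link similar to the ones above:

```python
from bs4 import BeautifulSoup

link = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser"
).find("a")

print(link["href"])        # square brackets: KeyError if the attribute is missing
print(link.get("title"))   # .get(): None if the attribute is missing
print(link.attrs)          # dictionary of every attribute on the tag
```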

## Challenge 1 😀
Below is new HTML code. Use your scraping skills to answer the questions.

In [32]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [33]:
# Create the "soup"
soup = BeautifulSoup(geography, 'html.parser')

In [34]:
# 1. All the "fun facts"
soup.find_all("p")

[<p>London is the most popular tourist destination in the world.</p>,
 <p>Paris was originally a Roman City called Lutetia.</p>,
 <p>Spain produces 43,8% of all the world's Olive Oil.</p>]

In [37]:
# 2. The names of all the places.
headings = [h2.get_text() for h2 in soup.find_all("h2")]

headings

['London', 'Paris', 'Spain']

In [41]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)
city_texts = [
    {"city": div.find("h2").get_text(strip=True), "description": div.find("p").get_text(strip=True)}
    for div in soup.find_all("div", class_="city")
]
city_texts

[{'city': 'London',
  'description': 'London is the most popular tourist destination in the world.'},
 {'city': 'Paris',
  'description': 'Paris was originally a Roman City called Lutetia.'}]

In [None]:
# 4. The names (not facts!) of all the cities (not countries!)
for city in soup.find_all("div", class_="city"):
  print(city.find("h2").get_text())

London
Paris


---
## 4.&nbsp; Navigating HTML with a few more advanced techniques 🗺️

### 4.1.&nbsp; `.find()`
`.find()` is similar to `.find_all()`, but it returns only the first element that matches the specified criteria, or `None` if nothing matches. This makes it useful when you expect a single match or only need the first one.

In [None]:
soup.find('p')

<p>London is the most popular tourist destination in the world.</p>
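Since `.find()` gives back `None` when nothing matches, calling `.get_text()` directly on the result can fail. One common way to guard against that is sketched below on a made-up snippet; the fallback value `"Unknown"` is just an illustrative choice.

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup("<div><h2>London</h2></div>", "html.parser")

heading = demo_soup.find("h2")
table = demo_soup.find("table")   # nothing matches, so this is None

# Only call .get_text() when a tag was actually found
heading_text = heading.get_text() if heading is not None else "Unknown"
table_text = table.get_text() if table is not None else "Unknown"

print(heading_text, table_text)
```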

### 4.2.&nbsp; `.select()`
`.select()` is similar to `.find_all()`, but there are 2 main differences:
- the way we write our query in the brackets is slightly different
- `.select()` allows you to chain CSS selectors together to navigate through the HTML structure, enabling you to select elements based on their positions within nested elements or patterns. This makes it particularly useful for extracting data from complex HTML structures.

In contrast, `.find_all()` uses a simpler syntax based on tag names and attributes, making it more straightforward for basic element selection.

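Here's how we query with `.find_all()`. Because `soup` currently holds the geography document from Challenge 1, we first parse the mock website from section 2 again and store it under a new name.

In [None]:
story_soup = BeautifulSoup(html_doc, 'html.parser')
story_soup.find_all('a', class_='sister')

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

Here's the same query with `.select()`

In [None]:
story_soup.select('a.sister')

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

To demonstrate the power of `.select()` in navigating through nested elements, let's extract all the `<a>` tags with the id `'link2'` that are within `<p>` tags with the class `'story'`.

In [None]:
story_soup.select('p.story a#link2')

[<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>]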
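Two further selector features are often handy: the `>` combinator, which only matches direct children, and `.select_one()`, which returns the first match (or `None`) instead of a list. The sketch below uses a made-up snippet modelled loosely on the geography document.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
snippet = """
<div class="city"><h2>London</h2><p>Fact about London</p></div>
<div class="country"><h2>Spain</h2><p>Fact about Spain</p></div>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# h2 tags that are direct children of a div with class "city"
city_headings = demo_soup.select("div.city > h2")

# First p tag anywhere inside a div with class "country"
first_fact = demo_soup.select_one("div.country p")

print([h.get_text() for h in city_headings])
print(first_fact.get_text())
```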

### 4.3.&nbsp; Navigating to the Next or Previous Element
In some cases, you may need to access specific elements that are closely related to others, but their HTML structure doesn't provide unique identifiers. To overcome this challenge, you can utilise the `.find_next()` and `.find_previous()` methods to navigate through the HTML structure and reach the desired element.

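In [68]:
# `soup` holds the geography document at this point, so parse the mock website again
story_soup = BeautifulSoup(html_doc, 'html.parser')
last_link = story_soup.find(id='link3')
last_link

<a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>

#### 4.3.1.&nbsp; `.find_next()`
`.find_next()` returns the first tag that appears after the current element in the document.

In [70]:
if last_link is not None:
    next_element = last_link.find_next()
    print(next_element)
else:
    print("No matching element found.")

<p class="story">...</p>


#### 4.3.2.&nbsp; `.find_previous()`
`.find_previous()` returns the closest tag that appears before the current element in the document.

In [69]:
if last_link is not None:
    previous_element = last_link.find_previous()
    print(previous_element)
else:
    print("No matching element found.")

<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>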
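Both methods also accept a tag name, and there is a sibling-only variant, `.find_next_sibling()`, that stays on the same level of the tree. A minimal sketch on a made-up snippet (the `intro` id is invented for the example):

```python
from bs4 import BeautifulSoup

# Made-up snippet: a heading, an "edit" marker, then the paragraph we want
snippet = '<div><h3 id="intro">Intro</h3><span>[edit]</span><p>First paragraph.</p></div>'
demo_soup = BeautifulSoup(snippet, "html.parser")
heading = demo_soup.find(id="intro")

next_tag = heading.find_next()                       # first tag after the heading, whatever it is
next_paragraph = heading.find_next("p")              # first <p> after the heading
sibling_paragraph = heading.find_next_sibling("p")   # first <p> on the same level as the heading
back_to_heading = next_paragraph.find_previous("h3") # closest <h3> before the paragraph

print(next_tag.name, next_paragraph.get_text())
```

Passing a tag name is usually the more robust choice, because it skips over any small helper tags that sit between the two elements.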


---
## 5.&nbsp; Showcasing these skills on a real website 💻
Let's see what information we can get from the Wikipedia page on web scraping.

### Loading the html

In [None]:
url = "https://en.wikipedia.org/wiki/Web_scraping"

response = requests.get(url)

soup_3 = BeautifulSoup(response.content, 'html.parser')

> While we haven't yet looked into the requests library, we'll postpone delving into it today to avoid overwhelming you with too much new information. Instead, we'll explore the requests library when we start gathering weather data later in the project.
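Without going deep into `requests`, one small habit is worth adopting early: check that the request succeeded before parsing. The sketch below wraps the three lines above in a helper. The name `fetch_soup`, the 10-second timeout and the `User-Agent` string are all assumptions made for illustration, not something the course prescribes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object (illustrative helper)."""
    # Hypothetical User-Agent string; some sites reject requests that do not identify themselves
    headers = {"User-Agent": "scraping-tutorial/0.1 (educational example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

It could then be used as `soup_3 = fetch_soup("https://en.wikipedia.org/wiki/Web_scraping")`.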

In [None]:
print(soup_3.prettify())

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Web scraping - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-wid

### Getting the title

In [None]:
soup_3.find("title").get_text()

'Web scraping - Wikipedia'

### Getting the first h1 tag

In [None]:
soup_3.find("h1").get_text()

'Web scraping'

### Getting all the h2 tags

In [None]:
h2_tags = soup_3.find_all("h2")
h2_tags

[<h2 class="vector-pinnable-header-label">Contents</h2>,
 <h2 id="History">History</h2>,
 <h2 id="Techniques">Techniques</h2>,
 <h2 id="Legal_issues">Legal issues</h2>,
 <h2 id="Methods_to_prevent_web_scraping">Methods to prevent web scraping</h2>,
 <h2 id="See_also">See also</h2>,
 <h2 id="References">References</h2>]

As we have multiple tags in the list here, we need to use a loop to print them out.

In [None]:
for h2 in h2_tags:
  print(h2.get_text())

Contents
History
Techniques
Legal issues
Methods to prevent web scraping
See also
References


### Selecting the `Legal Issues` text for only `India`
> **Pro tip:** If you're using Google Chrome, you can navigate to `View > Developer > Inspect elements` to access the built-in web development tools. Here, you can explore the HTML structure of the webpage directly within the browser using your mouse. This interactive approach is often more intuitive than examining the raw HTML code.

By investigating the HTML, we can see that the closest easy-to-access tag is the heading with the `id` of `"India"`.

In [None]:
soup_3.find(id="India")

<h3 id="India">India</h3>

We can then use `.find_next()` to select the text.

In [None]:
soup_3.find(id="India").find_next()

<span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=16" title="Edit section: India"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span>

Looks like the next tag was a `span` tag, so let's specify that we want the next `p` tag.

In [None]:
soup_3.find(id="India").find_next("p")

<p>Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the <a href="/wiki/Information_Technology_Act,_2000#:~:text=From_Wikipedia,_the_free_encyclopedia_The_Information_Technology,in_India_dealing_with_cybercrime_and_electronic_commerce." title="Information Technology Act, 2000">Information Technology Act, 2000</a>, which penalizes unauthorized access to a computer resource or extracting data from a computer resource.
</p>

Now we can simply extract the text, and we have what we need.

In [None]:
soup_3.find(id="India").find_next("p").get_text()

'Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the Information Technology Act, 2000, which penalizes unauthorized access to a computer resource or extracting data from a computer resource.\n'

## Challenge 2 😀

Utilise your web scraping skills to gather information about three German cities – Berlin, Hamburg, and Munich – from Wikipedia. You will start by extracting basic information: the country, the latitude and the longitude of each city and then expand to more dynamic data such as the population.

1. Scraping Basic Information

  1.1. Begin by scraping the country, the latitude and the longitude of each city from their respective Wikipedia pages:

 - Berlin: https://en.wikipedia.org/wiki/Berlin
 - Hamburg: https://en.wikipedia.org/wiki/Hamburg
 - Munich: https://en.wikipedia.org/wiki/Munich

  1.2. Once you have scraped the basic information of each city, reflect on the similarities and patterns in accessing them across the three pages. Also, analyse the URLs to identify any commonalities. Make a loop that executes once and retrieves the country, latitude, and longitude for all three cities.

2. Data Organisation

  2.1 Utilise pandas DataFrame to effectively store the extracted information. This DataFrame should have a row for each city, and columns for each type of information (cityname, country, latitude, longitude). If you feel brave, change latitude and longitude into decimal format.

  2.2 Looking ahead (optional): Create a function from the loop and DataFrame to encapsulate the scraping process. This function can be used repeatedly to fetch updated data whenever necessary. It should return a clean, properly formatted DataFrame.


In [46]:
# We are going to load the soups first
url_1 = "https://en.wikipedia.org/wiki/Berlin"
url_2 = "https://en.wikipedia.org/wiki/Hamburg"
url_3 = "https://en.wikipedia.org/wiki/Munich"

response_1 = requests.get(url_1)
response_2 = requests.get(url_2)
response_3 = requests.get(url_3)

soup_4 = BeautifulSoup(response_1.content, 'html.parser')
soup_5 = BeautifulSoup(response_2.content, 'html.parser')
soup_6 = BeautifulSoup(response_3.content, 'html.parser')

In [65]:
def dms_to_decimal(dms):
  """Convert e.g. '48°08′15″N' to signed decimal degrees (deg + min/60 + sec/3600)."""
  # Split on the degree/minute/second symbols and hemisphere letter, dropping empty pieces
  parts = [float(x) for x in re.split(r"[°′″NSEW]+", dms) if x]
  value = sum(part / 60**i for i, part in enumerate(parts))
  # South latitudes and West longitudes are negative
  return round(-value if dms[-1] in "SW" else value, 6)


def extract_city_coords_dec(soup):
  # Look each element up once instead of repeating the same search
  title = soup.find("span", class_="mw-page-title-main")
  latitude = soup.find("span", class_="latitude")
  longitude = soup.find("span", class_="longitude")

  coords_decimals = {
      "city": title.get_text(strip=True) if title else "Unknown",
      # First table header mentioning "Country"; its value sits in the next <td>
      "country": next((th.find_next("td").get_text(strip=True) for th in soup.find_all("th") if "Country" in th.get_text()), "Unknown"),
      "latitude": dms_to_decimal(latitude.get_text(strip=True)) if latitude else None,
      "longitude": dms_to_decimal(longitude.get_text(strip=True)) if longitude else None,
  }

  return pd.DataFrame([coords_decimals])


coords_decimals = extract_city_coords_dec(soup_6)
print(coords_decimals)

     city  country  latitude  longitude
0  Munich  Germany   48.1375     11.575


In [61]:
coords_berlin = {
    "city": soup_4.find("span", class_="mw-page-title-main").get_text(),
    "country": soup_4.find("td", class_="infobox-data").get_text(),
    "latitude": soup_4.find("span", class_="latitude").get_text(),
    "longitude": soup_4.find("span", class_="longitude").get_text()
}

coords_berlin

{'city': 'Berlin',
 'country': 'Germany',
 'latitude': '52°31′12″N',
 'longitude': '13°24′18″E'}

In [55]:
coords_hamburg = {
    "city": soup_5.find("span", class_="mw-page-title-main").get_text(),
    "country": soup_5.find("td", class_="infobox-data").get_text(),
    "latitude": soup_5.find("span", class_="latitude").get_text(),
    "longitude": soup_5.find("span", class_="longitude").get_text()
}

coords_hamburg

> **Note:** the output recorded for this cell showed Munich's values, which means the cell was run out of order. Re-run it after `soup_5` has been loaded to get Hamburg's data.


In [62]:
coords_munich = {
    "city": soup_6.find("span", class_="mw-page-title-main").get_text(),
    "country": soup_6.find("td", class_="infobox-data").get_text(),
    "latitude": soup_6.find("span", class_="latitude").get_text(),
    "longitude": soup_6.find("span", class_="longitude").get_text()
}

coords_munich

{'city': 'Munich',
 'country': 'Germany',
 'latitude': '48°08′15″N',
 'longitude': '11°34′30″E'}

In [56]:
def extract_city_data(soup):
    coords = {
        "city": soup.find("span", class_="mw-page-title-main").get_text(strip=True) if soup.find("span", class_="mw-page-title-main") else "Unknown",
        "country": next((row.find_next("td").get_text(strip=True) for row in soup.find_all("th") if "Country" in row.get_text()), "Unknown"),
        "latitude": soup.find("span", class_="latitude").get_text(strip=True) if soup.find("span", class_="latitude") else None,
        "longitude": soup.find("span", class_="longitude").get_text(strip=True) if soup.find("span", class_="longitude") else None
    }

    return coords

coords = extract_city_data(soup_6)
print(coords)

{'city': 'Munich', 'country': 'Germany', 'latitude': '48°08′15″N', 'longitude': '11°34′30″E'}
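The challenge also asks for a loop over the cities and a DataFrame with one row per city. One possible shape for that is sketched below. To keep it runnable offline it uses stand-in HTML strings (`fake_pages`, made up from the coordinate values shown above) instead of real requests, and `city_row` is a simplified, hypothetical extractor rather than the function defined earlier.

```python
import pandas as pd
from bs4 import BeautifulSoup


def city_row(soup):
    """Pull the page title and coordinates out of one parsed page (simplified)."""
    def text_of(css_class):
        tag = soup.find("span", class_=css_class)
        return tag.get_text(strip=True) if tag is not None else None

    return {
        "city": text_of("mw-page-title-main"),
        "latitude": text_of("latitude"),
        "longitude": text_of("longitude"),
    }


# Stand-in pages so the sketch runs offline; real pages would come from requests.get(url)
fake_pages = {
    "Berlin": '<span class="mw-page-title-main">Berlin</span>'
              '<span class="latitude">52°31′12″N</span><span class="longitude">13°24′18″E</span>',
    "Munich": '<span class="mw-page-title-main">Munich</span>'
              '<span class="latitude">48°08′15″N</span><span class="longitude">11°34′30″E</span>',
}

rows = []
for city, html in fake_pages.items():
    # The Wikipedia URLs differ only in the city name, so they can be built in the loop
    url = f"https://en.wikipedia.org/wiki/{city}"
    row = city_row(BeautifulSoup(html, "html.parser"))
    row["url"] = url
    rows.append(row)

cities_df = pd.DataFrame(rows)
print(cities_df)
```

Collecting plain dictionaries in a list and building the DataFrame once at the end tends to be simpler than appending to a DataFrame inside the loop.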


## BONUS Challenge 3: Population

  3.1. Expand the scope of your data gathering by extracting the population of a city. This information changes over time, so we might need to add a timestamp.

  3.2. Organise your information in a DataFrame and wrap it in a separate function.

In [20]:
from datetime import datetime

def scrape_city_data(soup):
    """Scrapes city data (name, country, population) from a BeautifulSoup object."""

    # Extract city, country, and population directly
    city = soup.find("span", class_="mw-page-title-main").get_text(strip=True)
    country = next((row.find_next("td").get_text(strip=True) for row in soup.find_all("th") if "Country" in row.get_text()), "Unknown")
    population = next((row.find_next("td").get_text(strip=True).split("[")[0] for row in soup.find_all("th") if "Population" in row.get_text()), "Unknown")

    # Add timestamp
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Create a dictionary and convert it into a DataFrame
    data = {
        "city": city,
        "country": country,
        "population": population,
        "timestamp": timestamp
    }

    return pd.DataFrame([data])

# Example: Use soup_4 directly (already parsed HTML)
berlin_data = scrape_city_data(soup_4)
print(berlin_data)

     city  country population            timestamp
0  Berlin  Germany  3,596,999  2025-03-12 17:17:18
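The population arrives as text, so it cannot be used in calculations yet. A minimal sketch of a conversion step follows, assuming the value uses commas as thousands separators as in the output above; `parse_population` is a hypothetical helper name.

```python
def parse_population(text):
    """Turn a string like '3,596,999' into an int; return None if it is not purely numeric."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


print(parse_population("3,596,999"))
print(parse_population("Unknown"))
```

Returning `None` for anything unexpected keeps a single odd page from crashing the whole scrape.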


## BONUS Challenge 4: Global Data Scraping

  With your robust scraping skills now honed, venture beyond the confines of Germany and explore other cities around the world. While the extraction methodology for German cities may follow a consistent pattern, this may not be the case for cities from different countries. Can you make a function that returns a clean DataFrame of information for cities worldwide?

In [28]:
def scrape_city_data(url):
    """Scrapes relevant city data from a Wikipedia page by URL."""

    # Request the Wikipedia page and parse it with BeautifulSoup
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract city name
    city = soup.find("span", class_="mw-page-title-main").get_text(strip=True)

    # Extract country
    country = next(
        (row.find_next("td").get_text(strip=True) for row in soup.find_all("th") if "Country" in row.get_text()),
        "Unknown"
    )

    # Extract population
    population = next(
        (row.find_next("td").get_text(strip=True).split("[")[0] for row in soup.find_all("th") if "Population" in row.get_text()),
        "Unknown"
    )

    # Extract area (in square kilometers or miles)
    area = next(
        (row.find_next("td").get_text(strip=True) for row in soup.find_all("th") if "Area" in row.get_text()),
        "Unknown"
    )

    # Extract coordinates (latitude and longitude)
    coords = soup.find("span", class_="latitude")
    if coords:
        latitude = coords.get_text(strip=True)
        longitude = soup.find("span", class_="longitude").get_text(strip=True)
    else:
        latitude, longitude = "Unknown", "Unknown"

    # Extract timezone (if available)
    timezone = next(
        (row.find_next("td").get_text(strip=True) for row in soup.find_all("th") if "Timezone" in row.get_text()),
        "Unknown"
    )

    # Add timestamp
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Return all data in a dictionary
    data = {
        "city": city,
        "country": country,
        "population": population,
        "area": area,
        "latitude": latitude,
        "longitude": longitude,
        "timezone": timezone,
        "timestamp": timestamp
    }

    # Return as a DataFrame
    return pd.DataFrame([data])

# Example: Scrape data for Berlin
berlin_data = scrape_city_data("https://en.wikipedia.org/wiki/Berlin")
print(berlin_data)

# Example: Scrape data for New York City
nyc_data = scrape_city_data("https://en.wikipedia.org/wiki/New_York_City")
print(nyc_data)

     city  country population                    area    latitude   longitude  \
0  Berlin  Germany  3,596,999  891.3 km2(344.1 sq mi)  52°31′12″N  13°24′18″E   

  timezone            timestamp  
0  Unknown  2025-03-12 17:21:35  
            city        country population                         area  \
0  New York City  United States  8,804,190  472.43 sq mi (1,223.59 km2)   

     latitude  longitude timezone            timestamp  
0  40°42′46″N  74°0′22″W  Unknown  2025-03-12 17:21:36  
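The two results above are printed as separate one-row DataFrames. To end up with a single table, the rows can be stacked with `pd.concat`. The sketch below uses small hand-written stand-ins for the two results (only three of the columns, with values copied from the output above) so that it runs without any network access.

```python
import pandas as pd

# Stand-ins for the one-row DataFrames returned by the scraping function
berlin_row = pd.DataFrame([{"city": "Berlin", "country": "Germany", "population": "3,596,999"}])
nyc_row = pd.DataFrame([{"city": "New York City", "country": "United States", "population": "8,804,190"}])

# ignore_index=True renumbers the rows 0, 1, ... instead of repeating index 0
cities = pd.concat([berlin_row, nyc_row], ignore_index=True)
print(cities)
```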
