
# Web scraping

Web scraping is a technique used to extract data from websites. It involves sending HTTP requests to websites, parsing the returned HTML code, and extracting the desired data. Web scraping is a powerful tool for data scientists as it allows them to collect large amounts of data from the web. This data can then be used to train machine learning models, analyse trends, and make informed business decisions.

---
## 1.&nbsp; Import libraries 💾

In [90]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

---
## 2.&nbsp; Beautiful Soup 🍲

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library that simplifies the process of web scraping. It provides a user-friendly interface for parsing HTML documents, enabling users to extract specific information from websites. Through Beautiful Soup, you can navigate the HTML tree structure, locate elements based on their tags, attributes, and content, and extract the desired data into a structured format.

To illustrate how to use Beautiful Soup, we'll use the simplified mock website below. This stripped-down version serves as a practical learning tool, as real websites often possess much larger and more complex HTML structures. By starting with this simplified model, you can gradually build your skills and expertise, ensuring a solid understanding of the core concepts before tackling more intricate web scraping tasks.

In [91]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1" meta="Eldest sister">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2" meta="Middle sister">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3" meta="Youngest sister">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


Beautiful Soup's HTML parser takes the raw, unruly HTML code and transforms it into a neatly organised tree structure, making the information easily accessible and manageable.

In [92]:
soup = BeautifulSoup(html_doc, 'html.parser')

We can see the tree structure using Beautiful Soup's `.prettify()` method.

In [93]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



---
## 3.&nbsp; Navigating HTML for beginners 🧭
There are many methods in Beautiful Soup for exploring HTML data. By far the most popular and useful of these is `.find_all()`, so, naturally, this is where we'll start our journey.

### 3.1.&nbsp; `.find_all()`
The `.find_all()` method in Beautiful Soup returns a list of all the elements that match the specified criteria, such as tag name, class name, or attribute values.

#### 3.1.1.&nbsp; Searching by tag

A tag is the letter or word that comes immediately after the opening angle bracket. For example, the element below is an `a` tag.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

The `.find_all()` method takes a string argument and returns a list of all matching HTML tags within the current document. If no matching tags exist, an empty list is returned.

In [94]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [95]:
soup.find_all("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

#### 3.1.2.&nbsp; Searching by attribute

Attributes are the other pieces of information inside the opening tag. For example, the element below has `class`, `href`, `id`, and `meta` attributes.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

Attributes provide additional context and functionality to the elements. They can serve various purposes, including CSS selectors for styling, URLs for linking to external resources, metadata for storing relevant data, and a multitude of other information-bearing components. By leveraging these attributes, we can effectively target specific sections of the website.

##### 3.1.2.1.&nbsp; CSS selectors
CSS selectors are used to style specific sections of a website. This makes them very helpful for web scraping, as we can use them to target those same sections.

###### 3.1.2.1.1.&nbsp; Class
Class selectors are used to style **multiple** HTML elements that share a common characteristic or function.
> **Note:** here `class` has an underscore at the end of the word because `class` is a reserved keyword in Python.

In [96]:
soup.find_all(class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

###### 3.1.2.1.2.&nbsp; ID
ID selectors are used to style a **single** HTML element, since an `id` should be unique within a page.

In [97]:
soup.find_all(id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>]

In [98]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>]

##### 3.1.2.2.&nbsp; Other attributes
HTML elements can also include other attributes, which can be equally useful for identifying and targeting specific data points. To locate these attributes, search for them using the same method as you do for CSS selectors.

In [99]:
soup.find_all(meta="Youngest sister")

[<a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]
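Some attribute names, such as `data-*` attributes containing a hyphen, are not valid Python keyword arguments, so they can't be written as `name=value`. For those cases `.find_all()` also accepts an `attrs` dictionary. The sketch below uses a small made-up snippet: the `data-role` attribute and the variable names are assumptions for illustration, not part of the mock website.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: data-role is an invented attribute for this example
snippet = """
<a href="http://example.com/elsie" data-role="eldest">Elsie</a>
<a href="http://example.com/tillie" data-role="youngest">Tillie</a>
"""
snippet_soup = BeautifulSoup(snippet, "html.parser")

# data-role=... is not a valid keyword argument, so pass a dictionary instead
youngest = snippet_soup.find_all(attrs={"data-role": "youngest"})
youngest_names = [tag.get_text() for tag in youngest]
print(youngest_names)
```

The dictionary form also works for ordinary attributes, so it can be a convenient default when attribute names come from a variable.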

#### 3.1.3.&nbsp; Searching by string
The text (string) is the part between the opening and closing tags; this is what's displayed on the webpage. For example, the element below has `Elsie` as its text.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

Instead of searching for specific tags or attributes, you can also search for this text. To do this, you can use a string or a regular expression to specify the text you're looking for.

In [100]:
soup.find_all(string="Dormouse")

[]

The string "Dormouse" didn't return any results because Beautiful Soup only matches text that is exactly equal to the string you pass in. A partial match, such as a single word from a longer string, does not count.

In [101]:
soup.find_all(string="The Dormouse's story")

["The Dormouse's story", "The Dormouse's story"]
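Note that a string search returns the matching text nodes rather than the tags around them. If you need the enclosing tag, each match has a `.parent`. A minimal sketch, using a shortened copy of the mock page (the names `doc` and `doc_soup` are chosen here for illustration):

```python
from bs4 import BeautifulSoup

doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
doc_soup = BeautifulSoup(doc, "html.parser")

matches = doc_soup.find_all(string="The Dormouse's story")
# Each match is a text node; .parent is the tag that directly contains it
parent_names = [text.parent.name for text in matches]
print(parent_names)
```

This explains why the same text appears twice in the result: once inside `<title>` and once inside `<b>`.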

To search for a substring, the easiest way is to pass a regular expression created with `re.compile()`.

In [102]:
import re
soup.find_all(string=re.compile("dormouse", re.IGNORECASE))

["The Dormouse's story", "The Dormouse's story"]

> **Note:** by default, patterns created with `re.compile()` are case-sensitive, so `re.compile("dormouse")` would not match "Dormouse". To perform case-insensitive matching, pass the `re.IGNORECASE` flag to `re.compile()`.

### 3.2.&nbsp; Extracting text
There are a few ways to extract text in Beautiful Soup; here we'll focus on two of them.

#### 3.2.1.&nbsp; `.get_text()`
The `.get_text()` method extracts all the human-readable text from a Beautiful Soup object, returning it as a string.

In [103]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [104]:
soup.find_all("title").get_text()

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

> Read the error message and look at the output from the cell above. Can you work out why we got an error?

In [105]:
soup.find_all("title")[0].get_text()

"The Dormouse's story"

In [106]:
soup.find("title").get_text()

"The Dormouse's story"

In [107]:
# @title Click `show code` to see the solution to the error

# It was a list, read the error messages and notice the square brackets in the original output
# Therefore, we need to select the first and only element of this list
soup.find_all("title")[0].get_text()

"The Dormouse's story"

We can also print out multiple items using our looping skills.

In [108]:
story = soup.find_all("p")
story

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [109]:
for p in story:
  print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


#### 3.2.2.&nbsp; Extracting attributes
HTML elements often store additional information within their attributes. To extract this data using Beautiful Soup, append square brackets to the element and put the attribute name inside them.

In [110]:
soup.find_all(id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>]

In [111]:
soup.find_all(id="link1")[0]['href']

'http://example.com/elsie'

In [112]:
soup.find_all(id="link1")[0]['meta']

'Eldest sister'
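Square brackets raise a `KeyError` when the attribute is missing. A tag's `.get()` method is a gentler alternative that returns `None` (or a default you choose) instead. The sketch below assumes a single link copied from the mock page, without its `meta` attribute; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

link_html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(link_html, "html.parser").find("a")

href = link.get("href")              # attribute exists: returns its value
missing = link.get("title")          # attribute absent: returns None
fallback = link.get("title", "n/a")  # attribute absent: returns the default
classes = link["class"]              # class can hold several values, so it comes back as a list

print(href, missing, fallback, classes)
```

Using `.get()` is handy inside loops over many elements, where a single element without the attribute would otherwise stop the whole scrape.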

## Challenge 1 😀
Below is new HTML code. Use your scraping skills to answer the questions.

In [113]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [114]:
# Create the "soup"
soup = BeautifulSoup(geography, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [27]:
# 1. All the "fun facts"
# 'p' represents the <p> tags used for paragraphs in HTML.
fun_facts = soup.find_all('p')
fun_facts

[<p>London is the most popular tourist destination in the world.</p>,
 <p>Paris was originally a Roman City called Lutetia.</p>,
 <p>Spain produces 43,8% of all the world's Olive Oil.</p>]

In [28]:
for p in fun_facts:
    print(p.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [29]:
# 2. The names of all the places.
# soup.find_all(class_="city")
all_the_places = soup.find_all('h2')
all_the_places

[<h2>London</h2>, <h2>Paris</h2>, <h2>Spain</h2>]

In [30]:
for n in all_the_places:
  print(n.get_text())

London
Paris
Spain


In [31]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)
all_the_cities = soup.find_all(class_="city")
all_the_cities

[<div class="city">
 <h2>London</h2>
 <p>London is the most popular tourist destination in the world.</p>
 </div>,
 <div class="city">
 <h2>Paris</h2>
 <p>Paris was originally a Roman City called Lutetia.</p>
 </div>]

In [32]:
for n in all_the_cities:
  print(n.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [33]:
# 4. The names (not facts!) of all the cities (not countries!)
contents_of_cities = soup.find_all(class_="city")
contents_of_cities

[<div class="city">
 <h2>London</h2>
 <p>London is the most popular tourist destination in the world.</p>
 </div>,
 <div class="city">
 <h2>Paris</h2>
 <p>Paris was originally a Roman City called Lutetia.</p>
 </div>]

In [34]:
for name in contents_of_cities:
  print(name.h2.get_text())

London
Paris


---
## 4.&nbsp; Navigating HTML with a few more advanced techniques 🗺️

### 4.1.&nbsp; `.find()`
`.find()` is similar to `.find_all()`, but it returns only the first element that matches the specified criteria. This makes it useful when you know exactly where the element you're looking for is located and you only need to retrieve one instance of it.
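One practical difference from `.find_all()`: when nothing matches, `.find()` returns `None` rather than an empty list, and calling a method on `None` raises an `AttributeError`. A small guard avoids that. The helper name `text_or_default` below is hypothetical, written only for this sketch.

```python
from bs4 import BeautifulSoup

page = BeautifulSoup("<div class='city'><h2>London</h2></div>", "html.parser")

def text_or_default(soup, tag_name, default="(not found)"):
    """Return the text of the first matching tag, or a default if there is none."""
    element = soup.find(tag_name)
    if element is None:
        return default
    return element.get_text()

heading = text_or_default(page, "h2")    # an <h2> exists
paragraph = text_or_default(page, "p")   # no <p> in this snippet
print(heading, paragraph)
```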

In [35]:
soup.find('p')

<p>London is the most popular tourist destination in the world.</p>

### 4.2.&nbsp; `.select()`
`.select()` is similar to `.find_all()`, but there are 2 main differences:
- the way we write our query in the brackets is slightly different
- `.select()` allows you to chain CSS selectors together to navigate through the HTML structure, enabling you to select elements based on their positions within nested elements or patterns. This makes it particularly useful for extracting data from complex HTML structures.

In contrast, `.find_all()` uses a simpler syntax based on tag names and attributes, making it more straightforward for basic element selection.
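As a rough illustration of chained selectors, the sketch below uses a shortened stand-in for the geography page from Challenge 1 (the facts are replaced with placeholder text, and the variable names are assumptions). It also shows `.select_one()`, which returns only the first match, much like `.find()`.

```python
from bs4 import BeautifulSoup

places_html = """
<div class="city"><h2>London</h2><p>Fact about London.</p></div>
<div class="city"><h2>Paris</h2><p>Fact about Paris.</p></div>
<div class="country"><h2>Spain</h2><p>Fact about Spain.</p></div>
"""
places = BeautifulSoup(places_html, "html.parser")

# "div.city > h2": h2 tags that are direct children of a div with class "city"
city_names = [h2.get_text() for h2 in places.select("div.city > h2")]

# select_one returns the first matching element instead of a list
first_country = places.select_one("div.country > h2").get_text()

print(city_names, first_country)
```

A single selector string here replaces the two-step approach of finding each `div` and then reading its `h2`.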

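Here's how we query with `.find_all()`. Since `soup` now holds the geography page from Challenge 1, we first parse the Dormouse HTML again.

In [36]:
soup = BeautifulSoup(html_doc, 'html.parser')  # switch back to the Dormouse HTML
soup.find_all('a', class_='sister')

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

Here's the same query with `.select()`

In [37]:
soup.select('a.sister')

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

To demonstrate the power of `.select()` in navigating through nested elements, let's extract all the `<a>` tags with the id `'link2'` that are within `<p>` tags with the class `'story'`.

In [38]:
soup.select('p.story a#link2')

[<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>]

**p.story** selects the `p` elements with the class 'story'.

**a#link2** selects the `a` elements with the ID 'link2'. The space between the two parts means "anywhere inside".

CSS selectors: **.** denotes a class and **#** denotes an ID.

### 4.3.&nbsp; Navigating to the Next or Previous Element
In some cases, you may need to access specific elements that are closely related to others, but their HTML structure doesn't provide unique identifiers. To overcome this challenge, you can utilise the `.find_next()` and `.find_previous()` methods to navigate through the HTML structure and reach the desired element.

In [116]:
last_link = soup.find(id='link3')
last_link

<a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>

#### 4.3.1.&nbsp; `.find_next()`
`.find_next()` returns the first tag that appears after the current element in the document.

In [117]:
last_link.find_next()

<p class="story">...</p>

#### 4.3.2.&nbsp; `.find_previous()`
`.find_previous()` returns the first tag that appears before the current element in the document.

In [42]:
last_link.find_previous()

<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>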

---
## 5.&nbsp; Showcasing these skills on a real website 💻
Let's see what information we can get from the Wikipedia page on web scraping.

### Loading the html

In [43]:
url = "https://en.wikipedia.org/wiki/Web_scraping"

response = requests.get(url)

soup_3 = BeautifulSoup(response.content, 'html.parser')

> While we haven't yet looked into the requests library, we'll postpone delving into it today to avoid overwhelming you with too much new information. Instead, we'll explore the requests library when we start gathering weather data later in the project.

In [44]:
print(soup_3.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Web scraping - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-m

### Getting the title

In [45]:
soup_3.find("title").get_text()

'Web scraping - Wikipedia'

### Getting the first h1 tag

In [46]:
soup_3.find("h1").get_text()

'Web scraping'

### Getting all the h2 tags

In [47]:
h2_tags = soup_3.find_all("h2")
h2_tags

[<h2 class="vector-pinnable-header-label">Contents</h2>,
 <h2><span class="mw-headline" id="History">History</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=1" title="Edit section: History"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></h2>,
 <h2><span class="mw-headline" id="Techniques">Techniques</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=2" title="Edit section: Techniques"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></h2>,
 <h2><span class="mw-headline" id="Software">Software</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=11" title="Edit section: Software"><span>edit</span></a><span class="mw-editsection-bracket">]</span></

As we have multiple tags in the list here, we need to use a loop to print them out.

In [48]:
for h2 in h2_tags:
  print(h2.get_text())

Contents
History[edit]
Techniques[edit]
Software[edit]
Legal issues[edit]
Methods to prevent web scraping[edit]
See also[edit]
References[edit]


### Selecting the `Legal Issues` text for only `India`
> **Pro tip:** If you're using Google Chrome, you can navigate to `View > Developer > Inspect elements` to access the built-in web development tools. Here, you can explore the HTML structure of the webpage directly within the browser using your mouse. This interactive approach is often more intuitive than examining the raw HTML code.

By investigating the HTML, we can see that the closest easily accessible tag is the heading with the `id` of `"India"`.

In [49]:
soup_3.find(id="India")

<span class="mw-headline" id="India">India</span>

The **span** tag is an HTML element used to group inline content so that styles or behaviours can be applied to it.

We can then use `.find_next()` to move from this heading towards the text.

In [50]:
soup_3.find(id="India").find_next()

<span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=16" title="Edit section: India"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span>

Looks like the next tag was a `span` tag, so let's specify that we want the next `p` tag.

In [51]:
soup_3.find(id="India").find_next("p")

<p>Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the <a href="/wiki/Information_Technology_Act,_2000#:~:text=From_Wikipedia,_the_free_encyclopedia_The_Information_Technology,in_India_dealing_with_cybercrime_and_electronic_commerce." title="Information Technology Act, 2000">Information Technology Act, 2000</a>, which penalizes unauthorized access to a computer resource or extracting data from a computer resource.
</p>

Now we can simply extract the text, and we have what we need.

In [52]:
soup_3.find(id="India").find_next("p").get_text()

'Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the Information Technology Act, 2000, which penalizes unauthorized access to a computer resource or extracting data from a computer resource.\n'

## Challenge 2 😀

Utilise your web scraping skills to gather information about three German cities – Berlin, Hamburg, and Munich – from Wikipedia. You will start by extracting the population of each city and then expand the scope of your data gathering to include latitude and longitude, country, and possibly other relevant details.

1. Population Scraping

  1.1. Begin by scraping the population of each city from their respective Wikipedia pages:

 - Berlin: https://en.wikipedia.org/wiki/Berlin
 - Hamburg: https://en.wikipedia.org/wiki/Hamburg
 - Munich: https://en.wikipedia.org/wiki/Munich

  1.2. Once you have scraped the population of each city, reflect on the similarities and patterns in accessing the population data across the three pages. Also, analyse the URLs to identify any commonalities. Then write a single loop that retrieves the population for all three cities.

2. Data Organisation

  Utilise pandas DataFrame to effectively store the extracted population data. Ensure the data is clean and properly formatted. Remove any unnecessary characters or symbols and ensure the column data types are correct.

3. Further Enhancement

  3.1. Expand the scope of your data gathering by extracting other relevant information for each city:

 - Latitude and longitude
 - Country of location

  3.2. Create a function from the loop and DataFrame to encapsulate the scraping process. This function can be used repeatedly to fetch updated data whenever necessary. It should return a clean, properly formatted DataFrame.

4. Global Data Scraping

  With your robust scraping skills now honed, venture beyond the confines of Germany and explore other cities around the world. While the extraction methodology for German cities may follow a consistent pattern, this may not be the case for cities from different countries. Can you make a function that returns a clean DataFrame of information for cities worldwide?

# Solutions--Population:






1.  Web scraping from the city Berlin



In [53]:
url_0 = "https://en.wikipedia.org/wiki/Berlin"
response_0 = requests.get(url_0)
soup_0 = BeautifulSoup(response_0.content, 'html.parser')

In [55]:
# print(soup_0.prettify())

In [54]:
soup_0.find("title").get_text()

'Berlin - Wikipedia'

In [56]:
soup_0.find("h1").get_text()

'Berlin'

In [57]:
h2_tags = soup_0.find_all("h2")
#h2_tags

In [58]:
for h2 in h2_tags:
  print(h2.get_text())

Contents
History[edit]
Geography[edit]
Demographics[edit]
Government and politics[edit]
Economy[edit]
Quality of life[edit]
Transport in Berlin[edit]
Rohrpost[edit]
Energy[edit]
Health[edit]
Telecommunication[edit]
Education and research[edit]
Culture[edit]
Sports[edit]
See also[edit]
References[edit]
External links[edit]


Using `View > Developer > Inspect elements`, we find the following HTML around the population figure:

```
 <th colspan="2" class="infobox-header">Population<div class="ib-settlement-fn"><span class="nowrap">&nbsp;</span>(2022)<sup id="cite_ref-Amt_für_Statistik_Berlin-Brandenburg_4-0" class="reference"><a href="#cite_note-Amt_für_Statistik_Berlin-Brandenburg-4">[4]</a></sup></div></th>

<tr class="mergedrow"><th scope="row" class="infobox-label">&nbsp;•&nbsp;City/State</th><td class="infobox-data">3,755,251</td></tr>

<th scope="row" class="infobox-label">&nbsp;•&nbsp;City/State</th>
<td class="infobox-data">3,755,251</td>
```

Here, `table` is a table, `tr` is a table row, `th` is a table header cell, and `td` is a table data cell.

**Task:** find the table header that reads "Population" and scrape the population value that follows it.

In [59]:
population_0 = soup_0.find(id="cite_ref-Amt_für_Statistik_Berlin-Brandenburg_4-0").find_next("td").get_text()
print(population_0)

3,755,251


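In [60]:
# Inspect the element we anchored on. Note that this reuses the name population_0,
# so the population string from the previous cell is replaced by the <sup> tag.
population_0 = soup_0.find(id="cite_ref-Amt_für_Statistik_Berlin-Brandenburg_4-0")
print(population_0)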

<sup class="reference" id="cite_ref-Amt_für_Statistik_Berlin-Brandenburg_4-0"><a href="#cite_note-Amt_für_Statistik_Berlin-Brandenburg-4">[4]</a></sup>




2.   Population extraction from the city Hamburg




```
<th colspan="2" class="infobox-header">Population<div class="ib-settlement-fn"><span class="nowrap">&nbsp;</span>(2022-12-31)<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></div></th>
```



In [61]:
url_1 = "https://en.wikipedia.org/wiki/Hamburg"
response_1 = requests.get(url_1)
soup_1 = BeautifulSoup(response_1.content, 'html.parser')
population_1 = soup_1.find(id="cite_ref-2").find_next("td").get_text()
print(population_1)

1,945,532




3.   Population extraction from the city Munich




`<th colspan="2" class="infobox-header">Population<div class="ib-settlement-fn"><span class="nowrap">&nbsp;</span>(2022-12-31)<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></div></th>`

In [62]:
url_2 = "https://en.wikipedia.org/wiki/Munich"
response_2 = requests.get(url_2)
soup_2 = BeautifulSoup(response_2.content, 'html.parser')
population_2 = soup_2.find(id="cite_ref-2").find_next("td").get_text()
print(population_2)

1,512,491




4.   Create a DataFrame with the city and population data




In [63]:
cities = ["Berlin", "Hamburg", "Munich"]
population = [population_0, population_1, population_2]
p_df = pd.DataFrame([cities, population], columns = cities, index = ["city","population"])
p_df

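             Berlin    Hamburg     Munich
city         Berlin    Hamburg     Munich
population  [[[4]]]  1,945,532  1,512,491


The Berlin entry shows `[[[4]]]` instead of `3,755,251` because cell `In [60]` reassigned `population_0` to the `<sup>` reference tag. Re-running cell `In [59]` before building the DataFrame restores the population string.

Open question: how can we remove the `,` thousands separators from the population values, for example with a regular expression?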
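One way to strip the thousands separators is sketched below, using the three population strings printed earlier. Plain string replacement is usually enough; the regular-expression variant removes every non-digit character, which also copes with spaces or footnote markers. The variable names are illustrative.

```python
import re

raw_populations = ["3,755,251", "1,945,532", "1,512,491"]

# Option 1: plain string replacement, fine when commas are the only extra character
as_int_simple = [int(value.replace(",", "")) for value in raw_populations]

# Option 2: a regular expression that drops every non-digit character
as_int_regex = [int(re.sub(r"\D", "", value)) for value in raw_populations]

print(as_int_simple)
print(as_int_regex)
```

In a DataFrame column of strings, the same idea can be applied with `.str.replace(",", "")` followed by `.astype(int)`.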

# Solution--Latitude and longitude

Coordinates: 52°31′12″N 13°24′18″E

latitude is 52°31′12″N
longitude is 13°24′18″E

`<td colspan="2" class="infobox-full-data">Coordinates: <span class="geo-inline"><style data-mw-deduplicate="TemplateStyles:r1156832818">.mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-output .geo-nondefault,.mw-parser-output .geo-multi-punct,.mw-parser-output .geo-inline-hidden{display:none}.mw-parser-output .longitude,.mw-parser-output .latitude{white-space:nowrap}</style><span class="plainlinks nourlexpansion load-gadget" data-gadget="WikiMiniAtlas"><span style="white-space: nowrap;"><img src="//upload.wikimedia.org/wikipedia/commons/thumb/5/55/WMA_button2b.png/17px-WMA_button2b.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/55/WMA_button2b.png/17px-WMA_button2b.png 1x, //upload.wikimedia.org/wikipedia/commons/thumb/5/55/WMA_button2b.png/34px-WMA_button2b.png 2x" class="wmamapbutton noprint" title="Show location on an interactive map" alt="" style="padding: 0px 3px 0px 0px; cursor: pointer;"><a class="external text" href="https://geohack.toolforge.org/geohack.php?pagename=Berlin&amp;params=52_31_12_N_13_24_18_E_type:city(3755251)_region:DE" style="white-space: normal;"><span class="geo-default"><span class="geo-dms" title="Maps, aerial photos, and other data for this location"><span class="latitude">52°31′12″N</span> <span class="longitude">13°24′18″E</span></span></span><span class="geo-multi-punct">﻿ / ﻿</span><span class="geo-nondefault"><span class="geo-dec" title="Maps, aerial photos, and other data for this location">52.52000°N 13.40500°E</span><span style="display:none">﻿ / <span class="geo">52.52000; 13.40500</span></span></span></a></span></span></span></td>`

In [64]:
latitude_0 = soup_0.find("span", class_="latitude").get_text()
#print(latitude_0)

In [65]:
longitude_0 = soup_0.find("span", class_="longitude").get_text()
print(longitude_0)

13°24′18″E


In [66]:
latitude_1 = soup_1.find("span", class_="latitude").get_text()
longitude_1 = soup_1.find("span", class_="longitude").get_text()
print(latitude_1, longitude_1)

53°33′N 10°00′E


In [67]:
latitude_2 = soup_2.find("span", class_="latitude").get_text()
longitude_2 = soup_2.find("span", class_="longitude").get_text()
print(latitude_2, longitude_2)

48°08′15″N 11°34′30″E
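The coordinates come back in degrees, minutes and seconds, and Hamburg's value has no seconds at all. If decimal degrees are needed later, a conversion along the following lines could be used. This is a sketch that assumes the exact formats printed above; the function name `dms_to_decimal` is made up for this example.

```python
import re

def dms_to_decimal(dms):
    """Convert a string such as 52°31′12″N or 53°33′N to decimal degrees."""
    match = re.fullmatch(r"(\d+)°(\d+)′(?:(\d+(?:\.\d+)?)″)?([NSEW])", dms.strip())
    if match is None:
        raise ValueError(f"Unrecognised coordinate format: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    # Seconds are optional, so treat a missing value as zero
    value = int(degrees) + int(minutes) / 60 + float(seconds or 0) / 3600
    # South and West are negative by convention
    return -value if hemisphere in "SW" else value

berlin_lat = dms_to_decimal("52°31′12″N")
hamburg_lon = dms_to_decimal("10°00′E")
print(round(berlin_lat, 5), round(hamburg_lon, 5))
```

For Berlin this agrees with the decimal value `52.52000` that appears in the page's own HTML.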


In [68]:
cities = ["Berlin", "Hamburg", "Munich"]
population = [population_0, population_1, population_2]
latitude = [latitude_0, latitude_1, latitude_2]
longitude = [longitude_0, longitude_1, longitude_2]
pll_df = pd.DataFrame([cities, population, latitude, longitude], columns = cities, index = ["city","population", "latitude", "longitude"])
pll_df

                Berlin    Hamburg      Munich
city            Berlin    Hamburg      Munich
population     [[[4]]]  1,945,532   1,512,491
latitude    52°31′12″N    53°33′N  48°08′15″N
longitude   13°24′18″E    10°00′E  11°34′30″E


# Solution: Country of location

country -- Germany

`<tr class="mergedtoprow"><th scope="row" class="infobox-label">Country</th><td class="infobox-data">Germany</td></tr>`

In [69]:
country_0 = soup_0.find("tr", class_="mergedtoprow").find_next("td", class_="infobox-data").get_text()
country_0

'Germany'

In [70]:
country_1 = soup_1.find("tr", class_="mergedtoprow").find_next("td", class_="infobox-data").get_text()
country_2 = soup_2.find("tr", class_="mergedtoprow").find_next("td", class_="infobox-data").get_text()
print(country_1, country_2)

Germany Germany


In [71]:
cities = ["Berlin", "Hamburg", "Munich"]
population = [population_0, population_1, population_2]
latitude = [latitude_0, latitude_1, latitude_2]
longitude = [longitude_0, longitude_1, longitude_2]
country = [country_0, country_1, country_2]
pllc_df = pd.DataFrame([cities, population, latitude, longitude, country], columns = cities, index = ["city","population", "latitude", "longitude", "country"])
pllc_df

                Berlin    Hamburg      Munich
city            Berlin    Hamburg      Munich
population     [[[4]]]  1,945,532   1,512,491
latitude    52°31′12″N    53°33′N  48°08′15″N
longitude   13°24′18″E    10°00′E  11°34′30″E
country        Germany    Germany     Germany


# Loop all the procedures

In [79]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

In [None]:
# Berlin: https://en.wikipedia.org/wiki/Berlin
# Hamburg: https://en.wikipedia.org/wiki/Hamburg
# Munich: https://en.wikipedia.org/wiki/Munich

In [81]:
cities = ["Berlin", "Hamburg", "Munich"]
population = []

for city in cities:
    url = f"https://en.wikipedia.org/wiki/{city}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find the "Population" header text inside the infobox, then read the next data cell
    infobox = soup.find("table", class_="infobox ib-settlement vcard")
    population_value = infobox.find(string="Population").find_next("td").text.strip()
    print(population_value)
    population.append(population_value)

p_df = pd.DataFrame([cities, population], columns=cities, index=["city", "population"])

3,755,251
1,945,532
1,512,491


In [82]:
p_df

               Berlin    Hamburg     Munich
city           Berlin    Hamburg     Munich
population  3,755,251  1,945,532  1,512,491


In [87]:
# List of cities and their corresponding Wikipedia URLs
cities = [
    {'name': 'Berlin', 'url': 'https://en.wikipedia.org/wiki/Berlin'},
    {'name': 'Hamburg', 'url': 'https://en.wikipedia.org/wiki/Hamburg'},
    {'name': 'Munich', 'url': 'https://en.wikipedia.org/wiki/Munich'}
]

# Empty list to store the extracted data
data = []

# Loop through the cities
for city in cities:
    response = requests.get(city['url'])
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract the population, latitude, longitude, and country information
    infobox = soup.find("table", class_="infobox ib-settlement vcard")
    # The "Population" header text is followed by the cell holding the city total
    population_value = infobox.find(string="Population").find_next("td").text.strip()
    latitude = soup.find('span', class_='latitude').text.strip()
    longitude = soup.find('span', class_='longitude').text.strip()
    country = soup.find('th', string='Country').find_next('td').text.strip()
    # Append the extracted data to the list
    data.append({
        'City': city['name'],
        'Population': population_value,
        'Latitude': latitude,
        'Longitude': longitude,
        'Country': country
    })

df = pd.DataFrame(data)
df

      City Population    Latitude   Longitude  Country
0   Berlin  3,755,251  52°31′12″N  13°24′18″E  Germany
1  Hamburg  1,945,532     53°33′N     10°00′E  Germany
2   Munich  1,512,491  48°08′15″N  11°34′30″E  Germany
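For step 3.2 of the challenge, the parsing logic can be wrapped in a function that takes the page HTML and returns one row of data. The sketch below is one possible shape, not a definitive solution: the function name `parse_city` and the `mock_page` string are assumptions, and the mock HTML is a heavily shortened stand-in modelled on the Berlin snippets shown earlier so that the example runs without a network connection.

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_city(name, html):
    """Extract one row of city data from the HTML of a Wikipedia city page."""
    soup = BeautifulSoup(html, "html.parser")
    population = soup.find(string="Population").find_next("td").get_text().strip()
    return {
        "City": name,
        # Remove thousands separators so the column gets a numeric dtype
        "Population": int(population.replace(",", "")),
        "Latitude": soup.find("span", class_="latitude").get_text().strip(),
        "Longitude": soup.find("span", class_="longitude").get_text().strip(),
        "Country": soup.find("th", string="Country").find_next("td").get_text().strip(),
    }

# Shortened stand-in for a downloaded page, modelled on the Berlin snippets above
mock_page = """
<table class="infobox ib-settlement vcard">
<tr><td><span class="latitude">52°31′12″N</span> <span class="longitude">13°24′18″E</span></td></tr>
<tr class="mergedtoprow"><th scope="row" class="infobox-label">Country</th><td class="infobox-data">Germany</td></tr>
<tr><th colspan="2" class="infobox-header">Population<div class="ib-settlement-fn">(2022)</div></th></tr>
<tr class="mergedrow"><th scope="row" class="infobox-label">City/State</th><td class="infobox-data">3,755,251</td></tr>
</table>
"""

row = parse_city("Berlin", mock_page)
city_df = pd.DataFrame([row])
print(city_df)
```

For live pages, the `html` argument would be `requests.get(url).content`, and calling `parse_city` once per city inside a loop gives the list of rows for the DataFrame. Pages for cities in other countries may lay out their infobox differently, so each lookup may need its own fallback.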
