# Web scraping

Web scraping is a technique used to extract data from websites. It involves sending HTTP requests to websites, parsing the returned HTML code, and extracting the desired data. Web scraping is a powerful tool for data scientists as it allows them to collect large amounts of data from the web. This data can then be used to train machine learning models, analyse trends, and make informed business decisions.

---
## 1.&nbsp; Import libraries 💾

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

---
## 2.&nbsp; Beautiful Soup 🍲

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library that simplifies the process of web scraping. It provides a user-friendly interface for parsing HTML documents, enabling users to extract specific information from websites. Through Beautiful Soup, you can navigate the HTML tree structure, locate elements based on their tags, attributes, and content, and extract the desired data into a structured format.

To illustrate how to use Beautiful Soup, we'll use the simplified mock website below. This stripped-down version serves as a practical learning tool, as real websites often possess much larger and more complex HTML structures. By starting with this simplified model, you can gradually build your skills and expertise, ensuring a solid understanding of the core concepts before tackling more intricate web scraping tasks.

In [75]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1" meta="Eldest sister">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2" meta="Middle sister">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3" meta="Youngest sister">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


Beautiful Soup's HTML parser takes the raw, unruly HTML code and transforms it into a neatly organised tree structure, making the information easily accessible and manageable.

In [3]:
soup = BeautifulSoup(html_doc, 'html.parser')

We can see the tree structure using Beautiful Soup's `.prettify()` method.

In [4]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



---
## 3.&nbsp; Navigating HTML for beginners 🧭
Beautiful Soup offers many methods for exploring HTML data. By far the most popular and useful of these is `.find_all()`, so this is where we'll start our journey.

### 3.1.&nbsp; `.find_all()`
The `.find_all()` method in Beautiful Soup returns a list of all the elements that match the specified criteria, such as tag name, class name, or attribute values.

#### 3.1.1.&nbsp; Searching by tag

The tag name is the first letter or word inside the angle brackets. For example, the element below is an `a` tag.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

The `.find_all()` method takes a string argument and returns a list of all matching HTML tags within the current document. If no matching tags exist, an empty list is returned.

In [5]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [6]:
soup.find_all("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
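
Besides a single tag name, `.find_all()` also accepts a list of tag names and a `limit` argument that caps the number of results. The sketch below illustrates both, assuming a small made-up snippet (`demo_html` is a hypothetical example, not the mock website above).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
demo_html = "<div><h1>Title</h1><p>First</p><p>Second</p><p>Third</p></div>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# A list of tag names matches any of them, in document order
headings_and_paragraphs = demo_soup.find_all(["h1", "p"])

# limit stops the search after the given number of matches
first_two_paragraphs = demo_soup.find_all("p", limit=2)

# A tag that does not occur gives an empty list, not an error
no_tables = demo_soup.find_all("table")

print([tag.name for tag in headings_and_paragraphs])
print([tag.get_text() for tag in first_two_paragraphs])
print(len(no_tables))
```

Using `limit` can save time on large pages when only the first few matches are needed.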

#### 3.1.2.&nbsp; Searching by attribute

Attributes are the other information in the angle brackets. For example, below, these brackets have a `class`, `href`, `id`, and `meta` attribute.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

Attributes provide additional context and functionality to the elements. They can serve various purposes, including CSS selectors for styling, URLs for linking to external resources, metadata for storing relevant data, and a multitude of other information-bearing components. By leveraging these attributes, we can effectively target specific sections of the website.

##### 3.1.2.1.&nbsp; CSS selectors
CSS selectors are used to style certain sections of websites. They usually target the `class` and `id` attributes, which makes these attributes very helpful for web scraping: we can use them to target the same regions of the website.

###### 3.1.2.1.1.&nbsp; Class
Class selectors are used to style **multiple** HTML elements that share a common characteristic or function.
> **Note:** the keyword argument is written `class_`, with a trailing underscore, because `class` is a reserved keyword in Python.

In [7]:
soup.find_all(class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]
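
An element can carry several classes at once, and `class_` matches an element if any one of its classes equals the value you pass. The sketch below shows this on a made-up snippet (`demo_html` is hypothetical), along with the equivalent `attrs` dictionary form.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
demo_html = """
<p class="note">plain note</p>
<p class="note important">important note</p>
<p class="warning">warning</p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Matches both paragraphs that have "note" among their classes
notes = demo_soup.find_all(class_="note")

# The same search written with an attrs dictionary
same_notes = demo_soup.find_all(attrs={"class": "note"})

# Matches only the paragraph that also has the "important" class
important = demo_soup.find_all(class_="important")

print([p.get_text() for p in notes])
print([p.get_text() for p in important])
print(notes[1]["class"])  # the class attribute is returned as a list
```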

###### 3.1.2.1.2.&nbsp; ID
ID selectors are used to style **single** HTML elements.

In [8]:
soup.find_all(id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>]

In [9]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>]

##### 3.1.2.2.&nbsp; Other attributes
HTML elements can also include other attributes, which can be equally useful for identifying and targeting specific data points. To locate these attributes, search for them using the same method as you do for CSS selectors.

In [10]:
soup.find_all(meta="Youngest sister")

[<a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]
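
Some attribute names, such as `data-role`, contain a hyphen and so cannot be written as Python keyword arguments. In that case the attribute can be passed in an `attrs` dictionary instead. A minimal sketch on a made-up snippet (`demo_html` and the `data-role` attribute are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
demo_html = (
    '<ul>'
    '<li data-role="eldest">Elsie</li>'
    '<li data-role="youngest">Tillie</li>'
    '<li>Nobody</li>'
    '</ul>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# data-role is not a valid keyword argument name, so use a dictionary
youngest = demo_soup.find_all(attrs={"data-role": "youngest"})

# True matches every element that has the attribute, whatever its value
with_role = demo_soup.find_all(attrs={"data-role": True})

print([li.get_text() for li in youngest])
print([li.get_text() for li in with_role])
```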

#### 3.1.3.&nbsp; Searching by string
The text (string) is the part between the opening and closing tags; this is what's displayed on the webpage. For example, the element below has `Elsie` as its text.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

Instead of searching for specific tags or attributes, you can also search for this text. To do this, you can use a string or a regular expression to specify the text you're looking for.

In [11]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [12]:
soup.find_all(string="Dormouse")

[]

The string "Dormouse" didn't return any results because Beautiful Soup only matches text that is exactly equal to the string you pass in. A string that merely contains "Dormouse" does not count as a match.

In [13]:
soup.find_all(string="The Dormouse's story")

["The Dormouse's story", "The Dormouse's story"]

To search for a substring, the easiest way is to pass a regular expression created with `re.compile()`.

In [14]:
import re
soup.find_all(string=re.compile("dormouse", re.IGNORECASE))

["The Dormouse's story", "The Dormouse's story"]

> **Note:** by default, patterns created with `re.compile()` are case-sensitive, so `re.compile("dormouse")` would not match "Dormouse". To match regardless of case, pass the `re.IGNORECASE` flag to `re.compile()`.
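
To compare the three behaviours side by side, here is a small sketch on a made-up snippet (`demo_html` is hypothetical): an exact string, a case-sensitive pattern, and a case-insensitive pattern.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
demo_html = "<p>The Dormouse's story</p><p>A dormouse sleeps</p><p>Alice</p>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# A plain string must equal the whole text, so nothing matches here
exact = demo_soup.find_all(string="Dormouse")

# A compiled pattern matches anywhere in the text, but respects case
case_sensitive = demo_soup.find_all(string=re.compile("Dormouse"))

# With re.IGNORECASE, both spellings match
case_insensitive = demo_soup.find_all(string=re.compile("dormouse", re.IGNORECASE))

print([str(s) for s in exact])
print([str(s) for s in case_sensitive])
print([str(s) for s in case_insensitive])
```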

### 3.2.&nbsp; Extracting text
There are a few ways to extract information from an element in Beautiful Soup; here we'll focus on two of them.

#### 3.2.1.&nbsp; `.get_text()`
The `.get_text()` method extracts all the human-readable text from a Beautiful Soup object, returning it as a string.

In [15]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [18]:
soup.find_all("title")[0].get_text()

"The Dormouse's story"

> Calling `.get_text()` directly on the result of `.find_all()` raises an error. Look at the output of `soup.find_all("title")` above. Can you work out why?

In [None]:
# @title Click `show code` to see the solution to the error

# It was a list, read the error messages and notice the square brackets in the original output
# Therefore, we need to select the first and only element of this list
soup.find_all("title")[0].get_text()

We can also print out multiple items using our looping skills.

In [19]:
story = soup.find_all("p")
story

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [20]:
for p in story:
  print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


#### 3.2.2.&nbsp; Extracting attributes
HTML elements often store additional information within their attributes. To extract this data using Beautiful Soup, you can append square brackets after the element selector and specify the attribute name within them.

In [21]:
soup.find_all(id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>]

In [24]:
soup.find_all(id="link1")[0]['href']

'http://example.com/elsie'

In [27]:
soup.find_all(id="link1")[0]['meta']

'Eldest sister'
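
Square brackets raise a `KeyError` when the attribute is missing. When an attribute may or may not be present, the `.get()` method is a gentler alternative, and `.attrs` gives all attributes as a dictionary. A minimal sketch on a made-up element (`demo_html` is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical element, used only for this illustration
demo_html = '<a href="http://example.com/elsie" id="link1">Elsie</a>'
link = BeautifulSoup(demo_html, "html.parser").find("a")

href = link["href"]                  # attribute exists
missing = link.get("title")          # attribute absent: returns None
fallback = link.get("title", "n/a")  # attribute absent: returns the default
all_attributes = link.attrs          # every attribute as a dictionary

# Square brackets on a missing attribute raise KeyError
raised = False
try:
    link["title"]
except KeyError:
    raised = True

print(href, missing, fallback, all_attributes, raised)
```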

## Challenge 1 😀
Below is new HTML code. Use your scraping skills to answer the questions.

In [28]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [31]:
# Create the "soup"
soup = BeautifulSoup(geography, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [30]:
# 1. All the "fun facts"
all_p = soup.find_all('p')
for p in all_p:
  print(p.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [32]:
# 2. The names of all the places.
all_h = soup.find_all("h2")
for h in all_h:
  print(h.get_text())

London
Paris
Spain


In [41]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)
cities = soup.find_all(class_="city")
for city in cities:
  print(city.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [71]:
# 4. The names (not facts!) of all the cities (not countries!)
cities = soup.find_all(class_="city")
for i in cities:
    print(i.find_all("h2")[0].get_text())


London
Paris
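
Printing is fine for exploring, but for later analysis it often helps to keep each name together with its fact. The sketch below collects them into a list of dictionaries. It uses a shortened, made-up version of the HTML (`places_html` and its placeholder facts are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the challenge HTML
places_html = """
<div class="city"><h2>London</h2><p>Fact about London.</p></div>
<div class="city"><h2>Paris</h2><p>Fact about Paris.</p></div>
<div class="country"><h2>Spain</h2><p>Fact about Spain.</p></div>
"""
places_soup = BeautifulSoup(places_html, "html.parser")

records = []
for div in places_soup.find_all("div", class_="city"):
    # Search inside each city block, not the whole document
    records.append({
        "name": div.find("h2").get_text(strip=True),
        "fact": div.find("p").get_text(strip=True),
    })

print(records)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame()`.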


---
## 4.&nbsp; Navigating HTML with a few more advanced techniques 🗺️

### 4.1.&nbsp; `.find()`
`.find()` is similar to `.find_all()`, but it returns only the first element that matches the specified criteria, rather than a list. This makes it useful when you expect a single match or only need the first one. If nothing matches, it returns `None`.

In [73]:
soup.find('p').get_text()

'London is the most popular tourist destination in the world.'
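
Because `.find()` returns `None` when nothing matches, calling `.get_text()` on the result can fail on pages that lack the element. A minimal sketch of a guard, on a made-up snippet (the HTML is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
demo_soup = BeautifulSoup("<div><p>Only paragraph</p></div>", "html.parser")

paragraph = demo_soup.find("p")   # a Tag
table = demo_soup.find("table")   # None, there is no table

# Check for None before calling methods on the result
paragraph_text = paragraph.get_text() if paragraph is not None else None
table_text = table.get_text() if table is not None else None

print(paragraph_text)
print(table_text)
```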

### 4.2.&nbsp; `.select()`
`.select()` is similar to `.find_all()`, but there are 2 main differences:
- the query is written as a single CSS selector string, such as `'a.sister'`, instead of separate tag and attribute arguments
- `.select()` allows you to chain CSS selectors together to navigate through the HTML structure, enabling you to select elements based on their positions within nested elements or patterns. This makes it particularly useful for extracting data from complex HTML structures.

In contrast, `.find_all()` uses a simpler syntax based on tag names and attributes, making it more straightforward for basic element selection.

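Here's how we query with `.find_all()`. Because `soup` was overwritten with the geography HTML in Challenge 1, we first rebuild it from `html_doc`.

In [76]:
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a', class_='sister')

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

Here's the same query with `.select()`

In [77]:
soup.select('a.sister')

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]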

To demonstrate the power of `.select()` in navigating through nested elements, let's extract all the `<a>` tags with the id `'link2'` that are within `<p>` tags with the class `'story'`.

In [None]:
soup.select('p.story a#link2')
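
A few more selector patterns, sketched on a made-up snippet (`demo_html` is hypothetical): a descendant selector, `.select_one()` for a single match, and an attribute selector that tests how a value ends.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
demo_html = """
<p class="story">
  <a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
  <a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
</p>
<p class="other"><a id="link9" href="http://example.com/other">Other</a></p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Descendant selector: a.sister elements anywhere inside p.story
sisters = demo_soup.select("p.story a.sister")

# select_one returns the first match (or None) instead of a list
second = demo_soup.select_one("p.story a#link2")

# Attribute selector: href values that end with "lacie"
with_href = demo_soup.select('a[href$="lacie"]')

print([a.get_text() for a in sisters])
print(second["id"])
print([a["id"] for a in with_href])
```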

### 4.3.&nbsp; Navigating to the Next or Previous Element
In some cases, you may need to access specific elements that are closely related to others, but their HTML structure doesn't provide unique identifiers. To overcome this challenge, you can utilise the `.find_next()` and `.find_previous()` methods to navigate through the HTML structure and reach the desired element.

In [None]:
last_link = soup.find(id='link3')
last_link

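#### 4.3.1.&nbsp; `.find_next()`
`.find_next()` moves forward through the document and returns the first element after the current one. You can also pass it a tag name to jump to the next element of that type.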

In [None]:
last_link.find_next()

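#### 4.3.2.&nbsp; `.find_previous()`
`.find_previous()` moves backward through the document and returns the first element before the current one. It accepts the same optional tag name.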

In [None]:
last_link.find_previous()
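
Both methods become more useful with a tag name, because they then skip over anything in between. A minimal sketch on a made-up snippet (`demo_html` and its ids are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
demo_html = (
    "<h2 id='india'>India</h2><span>edit</span><p>First paragraph.</p>"
    "<h2 id='usa'>USA</h2><p>Second paragraph.</p>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

india = demo_soup.find(id="india")
usa = demo_soup.find(id="usa")

# Skips the <span> and lands on the first <p> after the heading
india_text = india.find_next("p").get_text()

# Forward and backward from the second heading
next_paragraph = usa.find_next("p").get_text()
previous_paragraph = usa.find_previous("p").get_text()
previous_heading = usa.find_previous("h2").get_text()

print(india_text, next_paragraph, previous_paragraph, previous_heading)
```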

---
## 5.&nbsp; Showcasing these skills on a real website 💻
Let's see what information we can get from the Wikipedia page on web scraping.

### Loading the html

In [78]:
url = "https://en.wikipedia.org/wiki/Web_scraping"

response = requests.get(url)

soup_3 = BeautifulSoup(response.content, 'html.parser')

> We haven't looked at the `requests` library yet, and we won't go into it today, to avoid introducing too much new information at once. We'll explore it when we start gathering weather data later in the project.

### Getting the title

In [80]:
soup_3.find("title").get_text()

'Web scraping - Wikipedia'

### Getting the first h1 tag

In [81]:
soup_3.find("h1").get_text()

'Web scraping'

### Getting all the h2 tags

In [82]:
h2_tags = soup_3.find_all("h2")
h2_tags

[<h2 class="vector-pinnable-header-label">Contents</h2>,
 <h2><span class="mw-headline" id="History">History</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=1" title="Edit section: History"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></h2>,
 <h2><span class="mw-headline" id="Techniques">Techniques</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=2" title="Edit section: Techniques"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></h2>,
 <h2><span class="mw-headline" id="Software">Software</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=11" title="Edit section: Software"><span>edit</span></a><span class="mw-editsection-bracket">]</span></

As we have multiple tags in the list here, we need to use a loop to print them out.

In [83]:
for h2 in h2_tags:
  print(h2.get_text())

Contents
History[edit]
Techniques[edit]
Software[edit]
Legal issues[edit]
Methods to prevent web scraping[edit]
See also[edit]
References[edit]
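
The `[edit]` suffix comes from the edit link inside each heading. One simple way to drop it, sketched in plain Python on a few of the heading texts shown above (`strip_edit_suffix` is a hypothetical helper name):

```python
# A few of the heading texts from the output above
heading_texts = ["Contents", "History[edit]", "Techniques[edit]", "Software[edit]"]

def strip_edit_suffix(text, suffix="[edit]"):
    """Remove a trailing '[edit]' marker if it is present."""
    if text.endswith(suffix):
        return text[: -len(suffix)]
    return text

clean_headings = [strip_edit_suffix(text) for text in heading_texts]
print(clean_headings)
```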


### Selecting the `Legal Issues` text for only `India`
> **Pro tip:** If you're using Google Chrome, you can navigate to `View > Developer > Inspect elements` to access the built-in web development tools. Here, you can explore the HTML structure of the webpage directly within the browser using your mouse. This interactive approach is often more intuitive than examining the raw HTML code.

By inspecting the HTML, we can see that the closest easy-to-access tag is the heading with the `id` of `"India"`.

In [86]:
soup_3.find(id="India")

<span class="mw-headline" id="India">India</span>

We can then use `.find_next()` to select the text.

In [87]:
soup_3.find(id="India").find_next()

<span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=16" title="Edit section: India"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span>

Looks like the next tag was a `span` tag, so let's specify that we want the next `p` tag.

In [88]:
soup_3.find(id="India").find_next("p")

<p>Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the <a href="/wiki/Information_Technology_Act,_2000#:~:text=From_Wikipedia,_the_free_encyclopedia_The_Information_Technology,in_India_dealing_with_cybercrime_and_electronic_commerce." title="Information Technology Act, 2000">Information Technology Act, 2000</a>, which penalizes unauthorized access to a computer resource or extracting data from a computer resource.
</p>

Now we can simply extract the text, and we have what we need.

In [89]:
soup_3.find(id="India").find_next("p").get_text()

'Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the Information Technology Act, 2000, which penalizes unauthorized access to a computer resource or extracting data from a computer resource.\n'

## Challenge 2 😀

Utilise your web scraping skills to gather information about three German cities – Berlin, Hamburg, and Munich – from Wikipedia. You will start by extracting the population of each city and then expand the scope of your data gathering to include latitude and longitude, country, and possibly other relevant details.

1. Population Scraping

  1.1. Begin by scraping the population of each city from their respective Wikipedia pages:

 - Berlin: https://en.wikipedia.org/wiki/Berlin
 - Hamburg: https://en.wikipedia.org/wiki/Hamburg
 - Munich: https://en.wikipedia.org/wiki/Munich

  1.2. Once you have scraped the population of each city, reflect on the similarities and patterns in accessing the population data across the three pages. Also, analyse the URLs to identify any commonalities. Then write a single loop that retrieves the population for all three cities.

2. Data Organisation

  Utilise pandas DataFrame to effectively store the extracted population data. Ensure the data is clean and properly formatted. Remove any unnecessary characters or symbols and ensure the column data types are accurate.

3. Further Enhancement

  3.1. Expand the scope of your data gathering by extracting other relevant information for each city:

 - Latitude and longitude
 - Country of location

  3.2. Create a function from the loop and DataFrame to encapsulate the scraping process. This function can be used repeatedly to fetch updated data whenever necessary. It should return a clean, properly formatted DataFrame.

4. Global Data Scraping

  With your robust scraping skills now honed, venture beyond the confines of Germany and explore other cities around the world. While the extraction methodology for German cities may follow a consistent pattern, this may not be the case for cities from different countries. Can you make a function that returns a clean DataFrame of information for cities worldwide?

In [136]:
# 1.1 Population scraping
# Request the page and parse it
url = "https://en.wikipedia.org/wiki/Berlin"

response = requests.get(url)

soup_3 = BeautifulSoup(response.content, 'html.parser')

# Find the first string that mentions "population" (case-insensitive),
# then take the text of the next <td> cell
population_berlin = soup_3.find(string=re.compile("population", re.IGNORECASE)).find_next("td").get_text()

# Remove the thousands separators (","), then convert the string to an int
population_berlin_int = int(population_berlin.replace(",", ""))
population_berlin_int

3850809

In [133]:
# Same steps for Hamburg
# Request the page and parse it
url = "https://en.wikipedia.org/wiki/Hamburg"

response = requests.get(url)

soup_4 = BeautifulSoup(response.content, 'html.parser')

# Find the first string that mentions "population" (case-insensitive),
# then take the text of the next <td> cell
population_hamburg = soup_4.find(string=re.compile("population", re.IGNORECASE)).find_next("td").get_text()

# Remove the thousands separators (","), then convert the string to an int
population_hamburg_int = int(population_hamburg.replace(",", ""))
population_hamburg_int

1945532

In [125]:
# Same steps for Munich
# Request the page and parse it
url = "https://en.wikipedia.org/wiki/Munich"

response = requests.get(url)

soup_5 = BeautifulSoup(response.content, 'html.parser')

# Find the "population" string, take the next <td> cell,
# remove the thousands separators and convert to an int, all in one line
population_munich = int(soup_5.find(string=re.compile("population", re.IGNORECASE)).find_next("td").get_text().replace(",", ""))
population_munich


1512491
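
`int(text.replace(",", ""))` works as long as the cell contains nothing but digits and commas. Scraped cells sometimes also carry footnote markers or stray whitespace, so a slightly more defensive parser can help. The sketch below is one possible approach, not the method used in this notebook; `parse_population` is a hypothetical helper, the sample strings are made up, and it assumes the cell holds a whole number rather than text such as "3.8 million".

```python
import re

def parse_population(raw):
    """Turn text such as '3,850,809[1]' into an integer.

    Assumes the text holds one whole number with optional separators.
    """
    without_refs = re.sub(r"\[[^\]]*\]", "", raw)  # drop footnote markers like [1]
    digits = re.sub(r"\D", "", without_refs)       # keep only the digits
    if not digits:
        raise ValueError(f"no digits found in {raw!r}")
    return int(digits)

# Made-up sample strings in the style of the cells above
print(parse_population("3,850,809"))
print(parse_population("1,945,532[2]"))
print(parse_population(" 1 512 491\n"))
```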

In [143]:
# 1.2 Retrieve the population of all three cities in a single loop.
# The three URLs differ only in the city name, so we can build them with an f-string.

cities = ["Berlin", "Hamburg", "Munich"]
population = []
for city in cities:
    url_loop = f"https://en.wikipedia.org/wiki/{city}"

    response = requests.get(url_loop)

    soup_6 = BeautifulSoup(response.content, 'html.parser')

    # Same lookup as before: the <td> cell after the first "population" string
    population.append(int(soup_6.find(string=re.compile("population", re.IGNORECASE)).find_next("td").get_text().replace(",", "")))

print(population)

[3850809, 1945532, 1512491]


In [153]:
# 2. Data organisation
# Store the city names and their populations in a pandas DataFrame.
# The populations were already cleaned and converted to int in the loop above.

information = pd.DataFrame({"city": cities, "population": population})
information

| | city | population |
|---|---|---|
| 0 | Berlin | 3850809 |
| 1 | Hamburg | 1945532 |
| 2 | Munich | 1512491 |


In [166]:
# 3.1 Further enhancement: also collect the latitude and the country of each city

cities = ["Berlin", "Hamburg", "Munich"]
population = []
latitude = []
country = []
for city in cities:
    url_loop = f"https://en.wikipedia.org/wiki/{city}"

    response = requests.get(url_loop)

    soup_6 = BeautifulSoup(response.content, 'html.parser')

    # Population: the <td> cell after the first "population" string, converted to int
    population.append(int(soup_6.find(string=re.compile("population", re.IGNORECASE)).find_next("td").get_text().replace(",", "")))
    # Latitude: the first element with the class "latitude"
    latitude.append(soup_6.find(class_="latitude").get_text())
    # Country: the <td> cell after the first "country" string
    country.append(soup_6.find(string=re.compile("country", re.IGNORECASE)).find_next("td").get_text())

information = pd.DataFrame({"city": cities, "population": population, "latitude": latitude, "country": country})
information

| | city | population | latitude | country |
|---|---|---|---|---|
| 0 | Berlin | 3850809 | 52°31′12″N | Germany |
| 1 | Hamburg | 1945532 | 53°33′N | Germany |
| 2 | Munich | 1512491 | 48°08′15″N | Germany |
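
The latitude is stored here as text in degrees, minutes and seconds. For numeric work, such as plotting or distance calculations, it can be converted to decimal degrees: degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A minimal sketch, assuming the format seen in the table above (`dms_to_decimal` is a hypothetical helper, not part of any library):

```python
import re

def dms_to_decimal(dms):
    """Convert text such as '52°31′12″N' or '53°33′N' to decimal degrees."""
    pattern = r"\s*(\d+)°(?:(\d+)′)?(?:(\d+(?:\.\d+)?)″)?([NSEW])\s*"
    match = re.fullmatch(pattern, dms)
    if match is None:
        raise ValueError(f"unrecognised coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    # Minutes and seconds are optional; treat missing parts as zero
    value = int(degrees) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and west are negative by convention
    return -value if hemisphere in "SW" else value

print(round(dms_to_decimal("52°31′12″N"), 4))
print(round(dms_to_decimal("53°33′N"), 4))
print(round(dms_to_decimal("34°31′S"), 4))
```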


In [161]:
soup_6.find(class_="latitude").get_text()
  

'48°08′15″N'

In [165]:
soup_6.find(string= re.compile("country", re.IGNORECASE)).find_next("td").get_text()
    

'Germany'

In [224]:
# 4. To make the scraping work across many different Wikipedia articles,
# restrict the search to the infobox (the summary table at the side of the page)
# instead of searching the whole page.

# First, check the approach on a single page
url_loop = "https://en.wikipedia.org/wiki/Hamburg"

response = requests.get(url_loop)

original_soup = BeautifulSoup(response.content, 'html.parser')

# .find() returns a single element rather than a list: the infobox table
side_table = original_soup.find("table", class_="infobox ib-settlement vcard")

# Run the search on the infobox only
side_table.find(string=re.compile("Population")).find_next("td").get_text()

# Then apply the same steps to a longer list of cities
cities = ["Berlin", "Hamburg", "Munich", "Paris", "Tokyo", "Barcelona", "Madrid", "Villa_Ballester", "Cologne"]
population = []
latitude = []
country = []
for city in cities:
    url_loop = f"https://en.wikipedia.org/wiki/{city}"

    response = requests.get(url_loop)

    original_soup = BeautifulSoup(response.content, 'html.parser')

    # Keep only the infobox table
    side_table = original_soup.find("table", class_="infobox ib-settlement vcard")

    # Run all searches on the infobox only
    population.append(int(side_table.find(string=re.compile("population", re.IGNORECASE)).find_next("td").get_text().replace(",", "")))
    latitude.append(side_table.find(class_="latitude").get_text())
    country.append(side_table.find(string=re.compile("country", re.IGNORECASE)).find_next("td").get_text())

information = pd.DataFrame({"city": cities, "population": population, "latitude": latitude, "country": country})
information

| | city | population | latitude | country |
|---|---|---|---|---|
| 0 | Berlin | 3850809 | 52°31′12″N | Germany |
| 1 | Hamburg | 1945532 | 53°33′N | Germany |
| 2 | Munich | 1512491 | 48°08′15″N | Germany |
| 3 | Paris | 2102650 | 48°51′24″N | France |
| 4 | Tokyo | 14094034 | 35°41′23″N | Japan |
| 5 | Barcelona | 1620343 | 41°22′58″N | Spain |
| 6 | Madrid | 3223334 | 40°25′01″N | Spain |
| 7 | Villa_Ballester | 35301 | 34°31′S | Argentina |
| 8 | Cologne | 1073096 | 50°56′11″N | Germany |
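
Step 3.2 of the challenge asks for a reusable function. One way to approach it is to separate parsing from downloading, so the parsing logic can be tried on any HTML string. The sketch below follows that idea. `extract_city_info` is a hypothetical helper, and `sample_html` is a made-up, heavily simplified stand-in for a Wikipedia infobox; real pages are more complex and may need extra handling.

```python
import re
from bs4 import BeautifulSoup

def extract_city_info(html):
    """Return population, latitude and country from an infobox, or None if absent."""
    page = BeautifulSoup(html, "html.parser")
    infobox = page.find("table", class_="infobox")
    if infobox is None:
        return None

    def cell_after(label):
        # Text of the first <td> that follows a string containing the label
        label_node = infobox.find(string=re.compile(label, re.IGNORECASE))
        if label_node is None:
            return None
        cell = label_node.find_next("td")
        return cell.get_text(strip=True) if cell is not None else None

    population_digits = re.sub(r"\D", "", cell_after("population") or "")
    latitude_tag = infobox.find(class_="latitude")
    return {
        "population": int(population_digits) if population_digits else None,
        "latitude": latitude_tag.get_text(strip=True) if latitude_tag else None,
        "country": cell_after("country"),
    }

# Made-up, simplified infobox used only to exercise the function
sample_html = """
<table class="infobox ib-settlement vcard">
<tr><th>Country</th><td>Germany</td></tr>
<tr><td><span class="latitude">52°31′12″N</span></td></tr>
<tr><th>Population</th></tr>
<tr><th>City</th><td>3,850,809</td></tr>
</table>
"""
info = extract_city_info(sample_html)
print(info)
```

In the notebook, the function could be called with `response.content` for each city, and the resulting dictionaries collected into a list before building the DataFrame.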
