# Web scraping

In [1]:
url1 = "https://en.wikipedia.org/wiki/World_population"
print(url1)

https://en.wikipedia.org/wiki/World_population


# Get the HTML content from the webpage

In [2]:
import requests

url1 = "https://en.wikipedia.org/wiki/World_population"
response = requests.get(
    url1,
    headers={"User-Agent": "UtkarshBot/1.0 (utkarsh@example.com)"},
    timeout=10,  # seconds; without a timeout, requests can wait indefinitely
)
response

<Response [200]>
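A status of 200 means the request succeeded. As a minimal sketch, the request could be wrapped in a helper so that HTTP errors raise immediately and a slow server cannot hang the cell. The `fetch_html` name and its `session` parameter are hypothetical and not part of the notebook; `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, user_agent, timeout=10.0, session=None):
    """Return the raw HTML bytes for url, raising on HTTP errors.

    `session` is a hypothetical hook: any object with a requests-style
    `.get()` method works, which makes the helper easy to test offline.
    """
    http = session if session is not None else requests
    response = http.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content
```

Calling `fetch_html(url1, "UtkarshBot/1.0 (utkarsh@example.com)")` should return the same bytes as `response.content` in the next cell.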

In [3]:
content = response.content
print(content[0:100])

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la'


# Use BeautifulSoup to fetch particular tags


In [4]:
from bs4 import BeautifulSoup

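# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(content, "html.parser")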

In [5]:
title_tag = soup.find("title")
print(title_tag)

<title>World population - Wikipedia</title>


In [6]:
title_tag.text

'World population - Wikipedia'

# Get h1 tag with a particular class

In [7]:
h1_tag = soup.find("h1", class_="firstHeading")
print(h1_tag)

<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">World population</span></h1>


In [8]:

h1_tag.text

'World population'
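The `<h1>` above carries two classes, and `class_="firstHeading"` matches because BeautifulSoup compares against each class individually. As a small sketch on a trimmed copy of that markup (the `demo_soup` name is only for illustration), the same tag can also be reached with a CSS selector or by its `id`:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <h1> markup printed above; nothing is fetched here.
html = (
    '<h1 class="firstHeading mw-first-heading" id="firstHeading">'
    '<span class="mw-page-title-main">World population</span></h1>'
)
demo_soup = BeautifulSoup(html, "html.parser")

by_class = demo_soup.find("h1", class_="firstHeading")   # keyword filter
by_selector = demo_soup.select_one("h1.firstHeading")    # CSS selector
by_id = demo_soup.find(id="firstHeading")                # id attribute

heading_text = by_class.get_text(strip=True)
all_agree = heading_text == by_selector.get_text(strip=True) == by_id.get_text(strip=True)
print(heading_text, all_agree)
```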


# Show all the subheadings from the url

In [9]:
sub_tags = soup.find_all("div", class_="mw-heading")
sub_tags


[<div class="mw-heading mw-heading2"><h2 id="History">History</h2></div>,
 <div class="mw-heading mw-heading3"><h3 id="Prehistoric_patterns">Prehistoric patterns</h3></div>,
 <div class="mw-heading mw-heading3"><h3 id="Ancient_and_post-classical_history">Ancient and post-classical history</h3></div>,
 <div class="mw-heading mw-heading3"><h3 id="Modern_history">Modern history</h3></div>,
 <div class="mw-heading mw-heading3"><h3 id="20th_century">20th century</h3></div>,
 <div class="mw-heading mw-heading3"><h3 id="Milestones_by_the_billions">Milestones by the billions</h3></div>,
 <div class="mw-heading mw-heading2"><h2 id="Global_demographics">Global demographics</h2></div>,
 <div class="mw-heading mw-heading2"><h2 id="Population_by_region">Population by region</h2></div>,
 <div class="mw-heading mw-heading2"><h2 id="Largest_populations_by_country">Largest populations by country</h2></div>,
 <div class="mw-heading mw-heading3"><h3 id="Ten_most_populous_countries">Ten most populous coun

In [10]:
sub_text = [tag.text for tag in sub_tags]
sub_text

['History',
 'Prehistoric patterns',
 'Ancient and post-classical history',
 'Modern history',
 '20th century',
 'Milestones by the billions',
 'Global demographics',
 'Population by region',
 'Largest populations by country',
 'Ten most populous countries',
 'Most densely populated countries',
 'Fluctuation',
 'Annual population growth',
 'Population growth by region',
 'Past population',
 'Projections',
 'Mathematical approximations',
 'Years for world population to double',
 'Number of humans who have ever lived',
 'Human population as a function of food availability',
 'See also',
 'Explanatory notes',
 'References',
 'Citations',
 'General and cited sources',
 'Further reading',
 'External links']

# Get all the paragraph content and save it in a text file

In [11]:
p_tags = soup.find_all("p")
p_tags[1]

<p>In <a href="/wiki/Demographics_of_the_world" title="Demographics of the world">world demographics</a>, the <b>world population</b> is the total number of <a href="/wiki/Human" title="Human">humans</a> currently alive. It was estimated by the <a href="/wiki/United_Nations" title="United Nations">United Nations</a> to have exceeded eight billion in mid-November 2022. It took around 300,000 years of human <a href="/wiki/Prehistory" title="Prehistory">prehistory</a> and <a href="/wiki/Human_history" title="Human history">history</a> for the human population to reach a billion and only 218 more years to reach 8 billion.
</p>

In [12]:
p_tags[1].text

'In world demographics, the world population is the total number of humans currently alive. It was estimated by the United Nations to have exceeded eight billion in mid-November 2022. It took around 300,000 years of human prehistory and history for the human population to reach a billion and only 218 more years to reach 8\xa0billion.\n'

In [13]:
p_content = [tag.text for tag in p_tags]
p_content[0:3]

['\n',
 'In world demographics, the world population is the total number of humans currently alive. It was estimated by the United Nations to have exceeded eight billion in mid-November 2022. It took around 300,000 years of human prehistory and history for the human population to reach a billion and only 218 more years to reach 8\xa0billion.\n',
 'The human population has experienced continuous growth following the Great Famine of 1315–1317 and the end of the Black Death in 1350, when it was nearly 370,000,000.[2] The highest global population growth rates, with increases of over 1.8% per year, occurred between 1955 and 1975, peaking at 2.1% between 1965 and 1970.[3] The growth rate declined to 1.1% between 2015 and 2020 and is projected to decline further in the 21st century.[4] The global population is still increasing, but there is significant uncertainty about its long-term trajectory due to changing fertility and mortality rates.[5] The UN Department of Economics and Social Affair

# Save the paragraphs above in a text file

In [14]:

# "w" (write) mode creates the file, or overwrites it if it already exists
with open("world.txt", "w", encoding="utf-8") as f:
    f.writelines(p_content)

# Load content from text file

In [15]:
# "r" (read) mode loads the text file
with open("world.txt", "r", encoding="utf-8") as f:
    content = f.read()
    print(content[0:100])


In world demographics, the world population is the total number of humans currently alive. It was e


In [16]:
p_content[1]

'In world demographics, the world population is the total number of humans currently alive. It was estimated by the United Nations to have exceeded eight billion in mid-November 2022. It took around 300,000 years of human prehistory and history for the human population to reach a billion and only 218 more years to reach 8\xa0billion.\n'

In [17]:
p_content[2]

'The human population has experienced continuous growth following the Great Famine of 1315–1317 and the end of the Black Death in 1350, when it was nearly 370,000,000.[2] The highest global population growth rates, with increases of over 1.8% per year, occurred between 1955 and 1975, peaking at 2.1% between 1965 and 1970.[3] The growth rate declined to 1.1% between 2015 and 2020 and is projected to decline further in the 21st century.[4] The global population is still increasing, but there is significant uncertainty about its long-term trajectory due to changing fertility and mortality rates.[5] The UN Department of Economics and Social Affairs projects between 9 and 10\xa0billion people by 2050 and gives an 80% confidence interval of 10–12\xa0billion by the end of the 21st century,[1] with a growth rate by then of zero. Other demographers predict that the human population will begin to decline in the second half of the 21st century.[6]\n'
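The paragraph strings above still contain non-breaking spaces (`\xa0`), citation markers such as `[2]`, and a whitespace-only first entry. A minimal cleaning sketch follows; the `clean_paragraph` helper is hypothetical, and the regular expression assumes citation markers look like `[2]` or `[note 1]`.

```python
import re

# Matches citation markers such as [2] or [note 1] (an assumption about their shape).
CITATION = re.compile(r"\[(?:\d+|note \d+)\]")


def clean_paragraph(text):
    """Replace non-breaking spaces, drop citation markers, trim whitespace."""
    text = text.replace("\xa0", " ")
    text = CITATION.sub("", text)
    return text.strip()


# Short fragments taken from the outputs above.
raw = ["\n", "reach 8\xa0billion.\n", "nearly 370,000,000.[2] The highest"]

# Skip whitespace-only entries, then clean the rest.
cleaned = [clean_paragraph(t) for t in raw if t.strip()]
print(cleaned)
```

Applying the same list comprehension to `p_content` before writing the file would give a cleaner `world.txt`.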

# Fetch the image links from the page

In [18]:
a_tags = soup.find_all("a", class_="mw-file-description")
a_tags[0:4]

[<a class="mw-file-description" href="/wiki/File:World_Population_Prospects.svg"><img class="mw-file-element" data-file-height="676" data-file-width="900" decoding="async" height="263" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/World_Population_Prospects.svg/500px-World_Population_Prospects.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/World_Population_Prospects.svg/525px-World_Population_Prospects.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0e/World_Population_Prospects.svg/700px-World_Population_Prospects.svg.png 2x" width="350"/></a>,
 <a class="mw-file-description" href="/wiki/File:Illustration_of_contemporary_and_past_human_populations_Our_World_in_Data.png"><img class="mw-file-element" data-file-height="7747" data-file-width="5201" decoding="async" height="447" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Illustration_of_contemporary_and_past_human_populations_Our_World_in_Data.png/330px-Illustration_of_contempor

In [19]:

a_tags[0]

<a class="mw-file-description" href="/wiki/File:World_Population_Prospects.svg"><img class="mw-file-element" data-file-height="676" data-file-width="900" decoding="async" height="263" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/World_Population_Prospects.svg/500px-World_Population_Prospects.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/World_Population_Prospects.svg/525px-World_Population_Prospects.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0e/World_Population_Prospects.svg/700px-World_Population_Prospects.svg.png 2x" width="350"/></a>

In [20]:
a_tags[0].get("href")

'/wiki/File:World_Population_Prospects.svg'

In [21]:
home_page = "https://en.wikipedia.org"
home_page + a_tags[0].get("href")

'https://en.wikipedia.org/wiki/File:World_Population_Prospects.svg'
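String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves any relative link against the page URL, including the protocol-relative `src` (starting with `//`) seen in the `<img>` tag above. Note that the `href` points to the file description page, while the `src` points to the thumbnail image itself.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/World_population"

# Root-relative href, as returned by a_tags[0].get("href") above.
file_page = urljoin(page_url, "/wiki/File:World_Population_Prospects.svg")

# Protocol-relative src, copied from the <img> tag printed above.
thumbnail = urljoin(
    page_url,
    "//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/"
    "World_Population_Prospects.svg/500px-World_Population_Prospects.svg.png",
)

print(file_page)
print(thumbnail)
```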

In [22]:
img_links = [home_page + tag.get("href") for tag in a_tags]
img_links

['https://en.wikipedia.org/wiki/File:World_Population_Prospects.svg',
 'https://en.wikipedia.org/wiki/File:Illustration_of_contemporary_and_past_human_populations_Our_World_in_Data.png',
 'https://en.wikipedia.org/wiki/File:2020_1million_cities.jpg',
 'https://en.wikipedia.org/wiki/File:Expectancy_of_life.svg',
 'https://en.wikipedia.org/wiki/File:Population_pyramid_of_the_world_in_continental_groupings_2023.svg',
 'https://en.wikipedia.org/wiki/File:Global_population_cartogram.png',
 'https://en.wikipedia.org/wiki/File:People%27s_-Km%C2%B2_for_all_countries_(and_us_states,_uk_kingdoms).png',
 'https://en.wikipedia.org/wiki/File:Top_5_Country_Population_Graph_1901_to_2021.svg',
 'https://en.wikipedia.org/wiki/File:Population_Density,_v4.11,_2020_(48009093621).jpg',
 'https://en.wikipedia.org/wiki/File:World_population_(UN).svg',
 'https://en.wikipedia.org/wiki/File:Total_Fertility_Rate_Map_by_Country.svg',
 'https://en.wikipedia.org/wiki/File:World_population_counter,_Eureka,_Halifax,_

# Save the image URLs above in a text file

In [23]:
img_str = "\n".join(img_links)
print(img_str)

https://en.wikipedia.org/wiki/File:World_Population_Prospects.svg
https://en.wikipedia.org/wiki/File:Illustration_of_contemporary_and_past_human_populations_Our_World_in_Data.png
https://en.wikipedia.org/wiki/File:2020_1million_cities.jpg
https://en.wikipedia.org/wiki/File:Expectancy_of_life.svg
https://en.wikipedia.org/wiki/File:Population_pyramid_of_the_world_in_continental_groupings_2023.svg
https://en.wikipedia.org/wiki/File:Global_population_cartogram.png
https://en.wikipedia.org/wiki/File:People%27s_-Km%C2%B2_for_all_countries_(and_us_states,_uk_kingdoms).png
https://en.wikipedia.org/wiki/File:Top_5_Country_Population_Graph_1901_to_2021.svg
https://en.wikipedia.org/wiki/File:Population_Density,_v4.11,_2020_(48009093621).jpg
https://en.wikipedia.org/wiki/File:World_population_(UN).svg
https://en.wikipedia.org/wiki/File:Total_Fertility_Rate_Map_by_Country.svg
https://en.wikipedia.org/wiki/File:World_population_counter,_Eureka,_Halifax,_West_Yorkshire_(27th_August_2022)_001.jpg
http

In [24]:

with open("img_urls.txt", "w", encoding="utf-8") as f:
    f.write(img_str)

In [25]:
# load urls again
with open("img_urls.txt", "r", encoding="utf-8") as f:
    urls = f.readlines()
    print(urls)

['https://en.wikipedia.org/wiki/File:World_Population_Prospects.svg\n', 'https://en.wikipedia.org/wiki/File:Illustration_of_contemporary_and_past_human_populations_Our_World_in_Data.png\n', 'https://en.wikipedia.org/wiki/File:2020_1million_cities.jpg\n', 'https://en.wikipedia.org/wiki/File:Expectancy_of_life.svg\n', 'https://en.wikipedia.org/wiki/File:Population_pyramid_of_the_world_in_continental_groupings_2023.svg\n', 'https://en.wikipedia.org/wiki/File:Global_population_cartogram.png\n', 'https://en.wikipedia.org/wiki/File:People%27s_-Km%C2%B2_for_all_countries_(and_us_states,_uk_kingdoms).png\n', 'https://en.wikipedia.org/wiki/File:Top_5_Country_Population_Graph_1901_to_2021.svg\n', 'https://en.wikipedia.org/wiki/File:Population_Density,_v4.11,_2020_(48009093621).jpg\n', 'https://en.wikipedia.org/wiki/File:World_population_(UN).svg\n', 'https://en.wikipedia.org/wiki/File:Total_Fertility_Rate_Map_by_Country.svg\n', 'https://en.wikipedia.org/wiki/File:World_population_counter,_Eureka
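`readlines()` keeps the trailing `\n` on each line, as the output above shows. A small sketch of one way to get the URLs back without it, using `str.splitlines()`; the temporary directory and the two sample links are only there to keep the example self-contained.

```python
import tempfile
from pathlib import Path

links = [
    "https://en.wikipedia.org/wiki/File:World_Population_Prospects.svg",
    "https://en.wikipedia.org/wiki/File:Expectancy_of_life.svg",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "img_urls.txt"
    path.write_text("\n".join(links), encoding="utf-8")

    # splitlines() splits on line breaks and drops them from the result
    loaded = path.read_text(encoding="utf-8").splitlines()

print(loaded)
```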

# Load the tables and save them in CSV format


In [26]:
table_tags = soup.find_all("table", class_="wikitable")
table_tags[0]

<table class="wikitable" style="text-align:center; float:right; clear:right; margin-left:8px; margin-right:0;">
<caption>World population milestones in billions<sup class="reference" id="cite_ref-:6_61-0"><a href="#cite_note-:6-61"><span class="cite-bracket">[</span>61<span class="cite-bracket">]</span></a></sup> (Worldometers estimates)
</caption>
<tbody><tr>
<th scope="row">Population
</th>
<th scope="col">1
</th>
<th scope="col">2
</th>
<th scope="col">3
</th>
<th scope="col">4
</th>
<th scope="col">5
</th>
<th scope="col">6
</th>
<th scope="col">7
</th>
<th scope="col">8
</th>
<th scope="col">9
</th>
<th scope="col">10
</th></tr>
<tr>
<th scope="row">Year
</th>
<td>1804</td>
<td>1927</td>
<td>1960</td>
<td>1974</td>
<td>1987</td>
<td>1999</td>
<td>2011</td>
<td>2022</td>
<td><i>2037</i></td>
<td><i>2057</i>
</td></tr>
<tr>
<th scope="row">Years elapsed
</th>
<td>–</td>
<td>123</td>
<td>33</td>
<td>14</td>
<td>13</td>
<td>12</td>
<td>12</td>
<td>11</td>
<td><i>15</i></td>
<td><i>20<

# Read the table above as a DataFrame

In [28]:
!uv add lxml

Resolved 120 packages in 2ms
Audited 116 packages in 21ms


In [29]:
import pandas as pd
from io import StringIO
table_html = StringIO(str(table_tags[0]))
df = pd.read_html(table_html)[0]
df

Unnamed: 0,Population,1,2,3,4,5,6,7,8,9,10
0,Year,1804,1927,1960,1974,1987,1999,2011,2022,2037,2057
1,Years elapsed,–,123,33,14,13,12,12,11,15,20
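This table is laid out with one column per milestone, which is awkward to filter or plot. As a sketch, transposing it gives one row per milestone. The frame below is typed in by hand from the first three columns of the output above, with values as strings (an assumption about what `read_html` returns for this mixed column).

```python
import pandas as pd

# First three milestone columns, copied by hand from the output above.
wide = pd.DataFrame(
    {
        "Population": ["Year", "Years elapsed"],
        "1": ["1804", "–"],
        "2": ["1927", "123"],
        "3": ["1960", "33"],
    }
)

# Use the row labels as the index, then swap rows and columns.
tidy = wide.set_index("Population").T
tidy.index.name = "Billions"

# "Year" is fully numeric; "Years elapsed" keeps its "–" placeholder as text.
tidy["Year"] = pd.to_numeric(tidy["Year"])
print(tidy)
```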


In [30]:
df.to_csv("pop.csv", index=False)

In [31]:
dfs = []
for i in table_tags:
    table_html = StringIO(str(i))
    d = pd.read_html(table_html)[0]
    dfs.append(d)

In [32]:
dfs[0]

Unnamed: 0,Population,1,2,3,4,5,6,7,8,9,10
0,Year,1804,1927,1960,1974,1987,1999,2011,2022,2037,2057
1,Years elapsed,–,123,33,14,13,12,12,11,15,20


In [33]:
dfs[1]

Unnamed: 0,Region,2022 (percent),2030 (percent),2050 (percent)
0,Sub-Saharan Africa,"1,152 (14.51%)","1,401 (16.46%)","2,094 (21.62%)"
1,Northern Africa and Western Asia,549 (6.91%),617 (7.25%),771 (7.96%)
2,Central Asia and Southern Asia,"2,075 (26.13%)","2,248 (26.41%)","2,575 (26.58%)"
3,Eastern Asia and Southeastern Asia,"2,342 (29.49%)","2,372 (27.87%)","2,317 (23.92%)"
4,Europe and Northern America,"1,120 (14.10%)","1,129 (13.26%)","1,125 (11.61%)"
5,Latin America and the Caribbean,658 (8.29%),695 (8.17%),749 (7.73%)
6,Australia and New Zealand,31 (0.39%),34 (0.40%),38 (0.39%)
7,Oceania,14 (0.18%),15 (0.18%),20 (0.21%)
8,World,7942,8512,9687
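Each cell in this table packs two numbers into one string, such as `1,152 (14.51%)`, so the columns cannot be used for arithmetic as they are. A minimal sketch of splitting one column into numeric parts follows. The sample rows are copied from the output above, the new column names are hypothetical, and the unit (millions) is an assumption suggested by the World total of 7942.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Region": ["Sub-Saharan Africa", "Oceania", "World"],
        "2022 (percent)": ["1,152 (14.51%)", "14 (0.18%)", "7942"],
    }
)

# Group 1: the count with optional thousands separators.
# Group 2: the percentage, which is optional (the World row has none).
parts = sample["2022 (percent)"].str.extract(r"^([\d,]+)(?:\s*\(([\d.]+)%\))?$")

sample["2022 population (millions)"] = pd.to_numeric(
    parts[0].str.replace(",", "", regex=False)
)
sample["2022 share (%)"] = pd.to_numeric(parts[1])  # NaN where no percentage exists
print(sample)
```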


In [34]:
dfs[1].to_csv("Region.csv", index=False)

In [35]:
import pandas as pd
df = pd.read_csv("Region.csv")
df.head()

Unnamed: 0,Region,2022 (percent),2030 (percent),2050 (percent)
0,Sub-Saharan Africa,"1,152 (14.51%)","1,401 (16.46%)","2,094 (21.62%)"
1,Northern Africa and Western Asia,549 (6.91%),617 (7.25%),771 (7.96%)
2,Central Asia and Southern Asia,"2,075 (26.13%)","2,248 (26.41%)","2,575 (26.58%)"
3,Eastern Asia and Southeastern Asia,"2,342 (29.49%)","2,372 (27.87%)","2,317 (23.92%)"
4,Europe and Northern America,"1,120 (14.10%)","1,129 (13.26%)","1,125 (11.61%)"


In [36]:
dfs[2]

Unnamed: 0,Region,Density (inhabitants/km2),Population (millions),Most populous country,Most populous city (metropolitan area)
0,Asia,104.1,4641,"1,439,090,595 – India","13,515,000 – Tokyo Metropolis (37,400,000 – Gr..."
1,Africa,44.4,1340,"0,211,401,000 – Nigeria","09,500,000 – Cairo (20,076,000 – Greater Cairo)"
2,Europe,73.4,747,"0,146,171,000 – Russia, approx. 110 million in...","13,200,000 – Moscow (20,004,000 – Moscow metro..."
3,Latin America,24.1,653,"0,214,103,000 – Brazil","12,252,000 – São Paulo City (21,650,000 – São ..."
4,Northern America[note 1],14.9,368,"0,332,909,000 – United States","08,804,000 – New York City (23,582,649 – New Y..."
5,Oceania,5,42,"0,025,917,000 – Australia","05,367,000 – Sydney"
6,Antarctica,~0,0.004[89],N/A[note 2],"00,001,258 – McMurdo Station"


# Scrape any Wikipedia webpage

In [37]:
import requests
from bs4 import BeautifulSoup
from io import StringIO
import pandas as pd

class WikiScraper:

    def __init__(self, url: str):
        self.url = url
        self.home_page = "https://en.wikipedia.org"
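        response = requests.get(
            self.url,
            headers={
                "User-Agent": "UtkarshBot/1.0 (utkarsh@example.com)"
            },
            timeout=10,  # seconds; without a timeout, requests can wait indefinitely
        )
        response.raise_for_status()
        # Name the parser explicitly so the result does not depend on installed parsers
        self.soup = BeautifulSoup(response.content, "html.parser")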

    def get_title(self):
        title_tag = self.soup.find("title")
        return title_tag.text
    
    def get_h1(self):
        h1_tag = self.soup.find("h1", class_="firstHeading")
        return h1_tag.text
    
    def get_subheadings(self):
        sub_tags = self.soup.find_all("div", class_="mw-heading")
        sub_text = [tag.text for tag in sub_tags]
        return sub_text
    
    def get_paragraphs(self):
        p_tags = self.soup.find_all("p")
        p_text = [tag.text for tag in p_tags]
        return p_text
    
    def get_img_urls(self):
        img_tags = self.soup.find_all("a", class_="mw-file-description")
        img_urls = [self.home_page + tag.get("href") for tag in img_tags]
        return img_urls
    
    def get_tables(self):
        table_tags = self.soup.find_all("table", class_="wikitable")
        dfs = []
        for tag in table_tags:
            tag_html = StringIO(str(tag))
            d = pd.read_html(tag_html)[0]
            dfs.append(d)
        return dfs
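One caveat with the class above: `find` returns `None` when nothing matches, so `get_title` and `get_h1` would raise `AttributeError` on a page that lacks those tags. A minimal sketch of a None-safe lookup follows; `text_or_none` is a hypothetical helper, not part of the class, and the HTML is a made-up page with a `<title>` but no heading.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, name, **kwargs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical page with a <title> but no <h1 class="firstHeading">.
demo_soup = BeautifulSoup(
    "<html><head><title>Example - Wikipedia</title></head><body></body></html>",
    "html.parser",
)

title = text_or_none(demo_soup, "title")
h1 = text_or_none(demo_soup, "h1", class_="firstHeading")
print(title, h1)
```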

In [38]:
scraper1 = WikiScraper(url = "https://en.wikipedia.org/wiki/World_population")
type(scraper1)

__main__.WikiScraper

In [39]:
scraper1.get_title()

'World population - Wikipedia'

In [40]:
scraper1.get_h1()

'World population'

In [41]:

scraper1.get_subheadings()

['History',
 'Prehistoric patterns',
 'Ancient and post-classical history',
 'Modern history',
 '20th century',
 'Milestones by the billions',
 'Global demographics',
 'Population by region',
 'Largest populations by country',
 'Ten most populous countries',
 'Most densely populated countries',
 'Fluctuation',
 'Annual population growth',
 'Population growth by region',
 'Past population',
 'Projections',
 'Mathematical approximations',
 'Years for world population to double',
 'Number of humans who have ever lived',
 'Human population as a function of food availability',
 'See also',
 'Explanatory notes',
 'References',
 'Citations',
 'General and cited sources',
 'Further reading',
 'External links']

In [43]:
paras1 = scraper1.get_paragraphs()
paras1[0:3]


['\n',
 'In world demographics, the world population is the total number of humans currently alive. It was estimated by the United Nations to have exceeded eight billion in mid-November 2022. It took around 300,000 years of human prehistory and history for the human population to reach a billion and only 218 more years to reach 8\xa0billion.\n',
 'The human population has experienced continuous growth following the Great Famine of 1315–1317 and the end of the Black Death in 1350, when it was nearly 370,000,000.[2] The highest global population growth rates, with increases of over 1.8% per year, occurred between 1955 and 1975, peaking at 2.1% between 1965 and 1970.[3] The growth rate declined to 1.1% between 2015 and 2020 and is projected to decline further in the 21st century.[4] The global population is still increasing, but there is significant uncertainty about its long-term trajectory due to changing fertility and mortality rates.[5] The UN Department of Economics and Social Affair

In [44]:
scraper1.get_img_urls()

['https://en.wikipedia.org/wiki/File:World_Population_Prospects.svg',
 'https://en.wikipedia.org/wiki/File:Illustration_of_contemporary_and_past_human_populations_Our_World_in_Data.png',
 'https://en.wikipedia.org/wiki/File:2020_1million_cities.jpg',
 'https://en.wikipedia.org/wiki/File:Expectancy_of_life.svg',
 'https://en.wikipedia.org/wiki/File:Population_pyramid_of_the_world_in_continental_groupings_2023.svg',
 'https://en.wikipedia.org/wiki/File:Global_population_cartogram.png',
 'https://en.wikipedia.org/wiki/File:People%27s_-Km%C2%B2_for_all_countries_(and_us_states,_uk_kingdoms).png',
 'https://en.wikipedia.org/wiki/File:Top_5_Country_Population_Graph_1901_to_2021.svg',
 'https://en.wikipedia.org/wiki/File:Population_Density,_v4.11,_2020_(48009093621).jpg',
 'https://en.wikipedia.org/wiki/File:World_population_(UN).svg',
 'https://en.wikipedia.org/wiki/File:Total_Fertility_Rate_Map_by_Country.svg',
 'https://en.wikipedia.org/wiki/File:World_population_counter,_Eureka,_Halifax,_

In [45]:
dfs = scraper1.get_tables()
dfs[0]

Unnamed: 0,Population,1,2,3,4,5,6,7,8,9,10
0,Year,1804,1927,1960,1974,1987,1999,2011,2022,2037,2057
1,Years elapsed,–,123,33,14,13,12,12,11,15,20


In [46]:
dfs[1]

Unnamed: 0,Region,2022 (percent),2030 (percent),2050 (percent)
0,Sub-Saharan Africa,"1,152 (14.51%)","1,401 (16.46%)","2,094 (21.62%)"
1,Northern Africa and Western Asia,549 (6.91%),617 (7.25%),771 (7.96%)
2,Central Asia and Southern Asia,"2,075 (26.13%)","2,248 (26.41%)","2,575 (26.58%)"
3,Eastern Asia and Southeastern Asia,"2,342 (29.49%)","2,372 (27.87%)","2,317 (23.92%)"
4,Europe and Northern America,"1,120 (14.10%)","1,129 (13.26%)","1,125 (11.61%)"
5,Latin America and the Caribbean,658 (8.29%),695 (8.17%),749 (7.73%)
6,Australia and New Zealand,31 (0.39%),34 (0.40%),38 (0.39%)
7,Oceania,14 (0.18%),15 (0.18%),20 (0.21%)
8,World,7942,8512,9687


In [47]:
dfs[2]

Unnamed: 0,Region,Density (inhabitants/km2),Population (millions),Most populous country,Most populous city (metropolitan area)
0,Asia,104.1,4641,"1,439,090,595 – India","13,515,000 – Tokyo Metropolis (37,400,000 – Gr..."
1,Africa,44.4,1340,"0,211,401,000 – Nigeria","09,500,000 – Cairo (20,076,000 – Greater Cairo)"
2,Europe,73.4,747,"0,146,171,000 – Russia, approx. 110 million in...","13,200,000 – Moscow (20,004,000 – Moscow metro..."
3,Latin America,24.1,653,"0,214,103,000 – Brazil","12,252,000 – São Paulo City (21,650,000 – São ..."
4,Northern America[note 1],14.9,368,"0,332,909,000 – United States","08,804,000 – New York City (23,582,649 – New Y..."
5,Oceania,5,42,"0,025,917,000 – Australia","05,367,000 – Sydney"
6,Antarctica,~0,0.004[89],N/A[note 2],"00,001,258 – McMurdo Station"


In [48]:

dfs[3]

Unnamed: 0,Country / Dependency,Population,% of world,Date,Source (official or from the United Nations)
0,India,1425775850,17.4%,14 Apr 2023,UN projection[92]
1,China,1409670000,17.2%,17 Jan 2024,National annual estimate[93]
2,United States,338560357,4.12%,6 Nov 2025,National population clock[94]
3,Indonesia,278696200,3.39%,1 Jul 2023,National annual estimate[95]
4,Pakistan,229488994,2.80%,1 Jul 2022,UN projection[96]
5,Brazil,219894510,2.68%,6 Nov 2025,National population clock[97]
6,Nigeria,216746934,2.64%,1 Jul 2022,UN projection[96]
7,Bangladesh,168220000,2.05%,1 Jul 2020,Annual Population Estimate[98]
8,Russia,147190000,1.79%,1 Oct 2021,2021 preliminary census results[99]
9,Mexico,128271248,1.56%,31 Mar 2022,


# Article 2 - Data analysis

In [49]:
scraper2 = WikiScraper(url = "https://en.wikipedia.org/wiki/Data_analysis")
type(scraper2)

__main__.WikiScraper

In [50]:
scraper2.get_title()

'Data analysis - Wikipedia'

In [51]:
scraper2.get_h1()

'Data analysis'

In [52]:
scraper2.get_subheadings()

['Data analysis process[edit]',
 'Data requirements[edit]',
 'Data collection[edit]',
 'Data processing[edit]',
 'Data cleaning[edit]',
 'Exploratory data analysis[edit]',
 'Modeling and algorithms[edit]',
 'Data product[edit]',
 'Communication[edit]',
 'Quantitative messages[edit]',
 'Analyzing quantitative data in finance[edit]',
 'Analytical activities of data users[edit]',
 'Barriers to effective analysis[edit]',
 'Confusing fact and opinion[edit]',
 'Cognitive biases[edit]',
 'Innumeracy[edit]',
 'Other applications[edit]',
 'Analytics and business intelligence[edit]',
 'Education[edit]',
 'Practitioner notes[edit]',
 'Initial data analysis[edit]',
 'Quality of data[edit]',
 'Quality of measurements[edit]',
 'Initial transformations[edit]',
 'Did the implementation of the study fulfill the intentions of the research design?[edit]',
 'Characteristics of data sample[edit]',
 'Final stage of the initial data analysis[edit]',
 'Analysis[edit]',
 'Nonlinear analysis[edit]',
 'Main data
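The `[edit]` suffixes in this output most likely come from edit-section links nested inside each `div.mw-heading` wrapper, since `tag.text` collects the text of everything inside the wrapper. As a sketch, reading only the heading tag inside the wrapper avoids them and also reports the heading level. The HTML below is a simplified, assumed version of that markup, and `clean_subheadings` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Simplified, assumed markup: one heading with an edit link, one without.
html = """
<div class="mw-heading mw-heading2"><h2 id="Data_collection">Data collection</h2>
<span class="mw-editsection">[<a href="#">edit</a>]</span></div>
<div class="mw-heading mw-heading3"><h3 id="Data_cleaning">Data cleaning</h3></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")


def clean_subheadings(soup):
    """Return (level, text) pairs, reading only the heading tag in each wrapper."""
    headings = []
    for wrapper in soup.find_all("div", class_="mw-heading"):
        heading = wrapper.find(["h2", "h3", "h4", "h5", "h6"])
        if heading is not None:
            headings.append((heading.name, heading.get_text(strip=True)))
    return headings


subheadings = clean_subheadings(demo_soup)
print(subheadings)
```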

In [53]:

paras2 = scraper2.get_paragraphs()
print(paras2[0:3])

["\nData analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.[1] Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.[2] In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.[3]\n", 'Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information. In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA).[4] EDA focuses on d

In [54]:
img2 = scraper2.get_img_urls()
img2

['https://en.wikipedia.org/wiki/File:Data_visualization_process_v1.png',
 'https://en.wikipedia.org/wiki/File:Relationship_of_data,_information_and_intelligence.png',
 'https://en.wikipedia.org/wiki/File:Social_Network_Analysis_Visualization.png',
 'https://en.wikipedia.org/wiki/File:Total_Revenues_and_Outlays_as_Percent_GDP_2013.png',
 'https://en.wikipedia.org/wiki/File:U.S._Phillips_Curve_2000_to_2013.png',
 'https://en.wikipedia.org/wiki/File:US_Employment_Statistics_-_March_2015.png',
 'https://en.wikipedia.org/wiki/File:User-activities.png',
 'https://en.wikipedia.org/wiki/File:Wikiversity_logo_2017.svg',
 'https://en.wikipedia.org/wiki/File:Rayleigh-Taylor_instability.jpg']

In [55]:
dfs2 = scraper2.get_tables()
dfs2

[     #                       Task  \
 0    1             Retrieve Value   
 1    2                     Filter   
 2    3      Compute Derived Value   
 3    4              Find Extremum   
 4    5                       Sort   
 5    6            Determine Range   
 6    7  Characterize Distribution   
 7    8             Find Anomalies   
 8    9                    Cluster   
 9   10                  Correlate   
 10  11          Contextualization   
 
                                   General description  \
 0   Given a set of specific cases, find attributes...   
 1   Given some concrete conditions on attribute va...   
 2   Given a set of data cases, compute an aggregat...   
 3   Find data cases possessing an extreme value of...   
 4   Given a set of data cases, rank them according...   
 5   Given a set of data cases and an attribute of ...   
 6   Given a set of data cases and a quantitative a...   
 7   Identify any anomalies within a given set of d...   
 8   Given a set of 

In [56]:
dfs2[0]

Unnamed: 0,#,Task,General description,Pro forma abstract,Examples
0,1,Retrieve Value,"Given a set of specific cases, find attributes...","What are the values of attributes {X, Y, Z, .....",- What is the mileage per gallon of the Ford M...
1,2,Filter,Given some concrete conditions on attribute va...,"Which data cases satisfy conditions {A, B, C...}?",- What Kellogg's cereals have high fiber? - Wh...
2,3,Compute Derived Value,"Given a set of data cases, compute an aggregat...",What is the value of aggregation function F ov...,- What is the average calorie content of Post ...
3,4,Find Extremum,Find data cases possessing an extreme value of...,What are the top/bottom N data cases with resp...,- What is the car with the highest MPG? - What...
4,5,Sort,"Given a set of data cases, rank them according...",What is the sorted order of a set S of data ca...,- Order the cars by weight. - Rank the cereals...
5,6,Determine Range,Given a set of data cases and an attribute of ...,What is the range of values of attribute A in ...,- What is the range of film lengths? - What is...
6,7,Characterize Distribution,Given a set of data cases and a quantitative a...,What is the distribution of values of attribut...,- What is the distribution of carbohydrates in...
7,8,Find Anomalies,Identify any anomalies within a given set of d...,Which data cases in a set S of data cases have...,- Are there exceptions to the relationship bet...
8,9,Cluster,"Given a set of data cases, find clusters of si...",Which data cases in a set S of data cases are ...,- Are there groups of cereals w/ similar fat/c...
9,10,Correlate,"Given a set of data cases and two attributes, ...",What is the correlation between attributes X a...,- Is there a correlation between carbohydrates...


In [57]:
dfs2[0].to_csv("data_analysis.csv", index=False)

# Scrape all image URLs from 5 links

In [58]:
urls = [
    "https://en.wikipedia.org/wiki/Data_analysis",
    "https://en.wikipedia.org/wiki/Data_science",
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://en.wikipedia.org/wiki/Rust_(programming_language)",
    "https://en.wikipedia.org/wiki/Machine_learning"
]

In [59]:
imgs = {}

for i in urls:
    print(f"Scraping url : {i}")
    scraper = WikiScraper(url = i)
    h1 = scraper.get_h1()
    print(f"Scraping image urls from : {h1}")
    img_urls = scraper.get_img_urls()
    print(f"Image Urls : {img_urls}")
    imgs[h1] = img_urls
    print("=============================")

Scraping url : https://en.wikipedia.org/wiki/Data_analysis
Scraping image urls from : Data analysis
Image Urls : ['https://en.wikipedia.org/wiki/File:Data_visualization_process_v1.png', 'https://en.wikipedia.org/wiki/File:Relationship_of_data,_information_and_intelligence.png', 'https://en.wikipedia.org/wiki/File:Social_Network_Analysis_Visualization.png', 'https://en.wikipedia.org/wiki/File:Total_Revenues_and_Outlays_as_Percent_GDP_2013.png', 'https://en.wikipedia.org/wiki/File:U.S._Phillips_Curve_2000_to_2013.png', 'https://en.wikipedia.org/wiki/File:US_Employment_Statistics_-_March_2015.png', 'https://en.wikipedia.org/wiki/File:User-activities.png', 'https://en.wikipedia.org/wiki/File:Wikiversity_logo_2017.svg', 'https://en.wikipedia.org/wiki/File:Rayleigh-Taylor_instability.jpg']
Scraping url : https://en.wikipedia.org/wiki/Data_science
Scraping image urls from : Data science
Image Urls : ['https://en.wikipedia.org/wiki/File:PIA23792-1600x1200(1).jpg', 'https://en.wikipedia.org/wik
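In the loop above, a single failed request would stop the whole run, and the requests go out back to back. A minimal sketch of a more forgiving loop follows: it pauses between pages and records failures instead of raising. The `scrape_many` function, its `scrape_one` callback, and the one-second default delay are all assumptions for illustration.

```python
import time


def scrape_many(urls, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url) for each URL, pausing between calls.

    Returns (results, errors): results maps url -> scraped value,
    errors maps url -> error message for URLs that failed.
    """
    results, errors = {}, {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # be polite: wait before the next request
        try:
            results[url] = scrape_one(url)
        except Exception as exc:  # broad on purpose: keep going after any failure
            errors[url] = str(exc)
    return results, errors
```

With the class defined earlier, `scrape_one` could be `lambda url: WikiScraper(url).get_img_urls()`.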

In [60]:
imgs

{'Data analysis': ['https://en.wikipedia.org/wiki/File:Data_visualization_process_v1.png',
  'https://en.wikipedia.org/wiki/File:Relationship_of_data,_information_and_intelligence.png',
  'https://en.wikipedia.org/wiki/File:Social_Network_Analysis_Visualization.png',
  'https://en.wikipedia.org/wiki/File:Total_Revenues_and_Outlays_as_Percent_GDP_2013.png',
  'https://en.wikipedia.org/wiki/File:U.S._Phillips_Curve_2000_to_2013.png',
  'https://en.wikipedia.org/wiki/File:US_Employment_Statistics_-_March_2015.png',
  'https://en.wikipedia.org/wiki/File:User-activities.png',
  'https://en.wikipedia.org/wiki/File:Wikiversity_logo_2017.svg',
  'https://en.wikipedia.org/wiki/File:Rayleigh-Taylor_instability.jpg'],
 'Data science': ['https://en.wikipedia.org/wiki/File:PIA23792-1600x1200(1).jpg',
  'https://en.wikipedia.org/wiki/File:EDA_example_-_Always_plot_your_data.jpg',
  'https://en.wikipedia.org/wiki/File:Cloud_computing_in_enabling_data_science_at_scale.jpg',
  'https://en.wikipedia.org

In [61]:
import json
with open("images.json", "w", encoding="utf-8") as f:
    # json.dump writes straight to the file handle
    json.dump(imgs, f, indent=4)
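To confirm the file was written correctly, it can be read back with `json.load` and compared with the original dictionary. The sketch below uses a small sample dictionary and a temporary directory so it runs on its own; `imgs_demo` is a stand-in for `imgs`.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the imgs dictionary built above.
imgs_demo = {
    "Data analysis": ["https://en.wikipedia.org/wiki/File:User-activities.png"],
    "Data science": [],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "images.json"

    with open(path, "w", encoding="utf-8") as f:
        json.dump(imgs_demo, f, indent=4)

    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == imgs_demo)
```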