## Day 18 - DIY Solution

**Q1. Problem Statement: Web Scraping using BeautifulSoup** <br>
Write a Python program that extracts data from a website using web scraping techniques and performs the following tasks:
1. Use the `requests` library and the given link to fetch the page.
2. Use BeautifulSoup to parse the website's source code, then find a table in the parsed page.
3. After finding the table, extract the data from all available columns and store it in a DataFrame.


#### Install the library with pip if it is not already installed

In [None]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py): started
  Building wheel for bs4 (setup.py): finished with status 'done'
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1277 sha256=f27a42d979fa6c134df5adffbe71a91eab2a98d2e9123fa194846cc2bbfcc4dc
  Stored in directory: c:\users\abhishek\appdata\local\pip\cache\wheels\75\78\21\68b124549c9bdc94f822c02fb9aa3578a669843f9767776bca
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


**Step-1:** Importing Libraries.

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import dateutil

**Step-2:** Using the `requests` library, fetch the data from the given link. <br>
Call the `get` method of the `requests` library and pass the given link as its parameter.

In [2]:
result = requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")

In [3]:
# Stop here if the server did not answer with HTTP 200 (OK).
assert result.status_code == 200
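
The fetch-and-check pattern above can also be wrapped in a small helper. The following is only a sketch: the function name `fetch_html`, the `session` parameter, and the User-Agent string are made up for illustration and are not part of the original solution. It adds a timeout so the call cannot hang indefinitely and uses `raise_for_status()` so that HTTP errors raise an exception instead of relying on an `assert`.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the response body as bytes; raise on network or HTTP errors."""
    # `session` is a hypothetical hook: pass a requests.Session to reuse
    # connections, or leave it as None to use the module-level requests.get.
    http = session if session is not None else requests
    response = http.get(
        url,
        timeout=timeout,  # seconds to wait before giving up
        headers={"User-Agent": "day18-diy-scraper/0.1"},  # made-up identifier
    )
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content
```

With this helper, `src = fetch_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")` would play the role of `result.content` in the next step.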

**Step-3:** Parse the source code of the website

In [4]:
src = result.content
document = BeautifulSoup(src, "html.parser")
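
As a rough illustration of what this step produces, the sketch below parses a small inline HTML snippet instead of the live page. The snippet and the names `sample_html` and `sample_doc` are invented for the example; only the `BeautifulSoup(..., "html.parser")` call comes from the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for result.content (bytes, as returned by requests).
sample_html = b"""
<html>
  <head><title>Demo page</title></head>
  <body>
    <p>See the <a href="/wiki/Example">example</a> article.</p>
  </body>
</html>
"""

# BeautifulSoup accepts bytes or str and builds a navigable tree.
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Text inside the <title> tag.
page_title = sample_doc.title.get_text()
# The href attribute of every <a> tag in the document.
link_targets = [a["href"] for a in sample_doc.find_all("a")]

print(page_title)
print(link_targets)
```

Once the page is parsed, tags can be reached by name (`sample_doc.title`) or searched with `find` and `find_all`, which is what the next step relies on.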

**Step-4:** Find the `table` tag in the parsed document

In [5]:
table = document.find("table", class_="wikitable")
table

<table class="wikitable sortable">
<tbody><tr class="is-sticky">
<th rowspan="2">Rank</th>
<th rowspan="2"><a href="/wiki/List_of_sovereign_states" title="List of sovereign states">Country</a> / <a href="/wiki/Dependent_territory" title="Dependent territory">Dependency</a></th>
<th colspan="2">Population</th>
<th rowspan="2">Date</th>
<th rowspan="2"><span class="nowrap">Source (official or from</span> the <a href="/wiki/United_Nations" title="United Nations">United Nations</a>)</th>
<th class="unsortable" rowspan="2">Notes
</th></tr>
<tr class="is-sticky">
<th>Numbers</th>
<th>% of the world
</th></tr>
<tr>
<td style="text-align:center"><b>–</b>
</td>
<td><b>World</b>
</td>
<td style="text-align:center"><b> 8,032,197,000</b></td>
<td style="text-align:right"><b>100%</b></td>
<td><b><span data-sort-value="000000002023-05-25-0000" style="white-space:nowrap">25 May 2023</span></b></td>
<td style="text-align:left"><b>UN projection<sup class="reference" id="cite_ref-unpop_4-0"><a href="#ci

In [6]:
assert table.find("th").get_text() == "Rank"
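
The lookup above works even though the tag's class attribute is `wikitable sortable`, because `class_` is compared against each CSS class of a tag separately. A minimal sketch on an invented two-table snippet (the HTML and the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up page fragment with one unrelated table and one "wikitable".
sample_html = """
<table class="infobox"><tr><th>Ignored</th></tr></table>
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>Exampleland</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
first_match = soup.find("table", class_="wikitable")
# find_all() returns every matching tag as a list.
all_matches = soup.find_all("table", class_="wikitable")

header_cells = [th.get_text(strip=True) for th in first_match.find_all("th")]
print(first_match["class"])
print(header_cells)
```

If a page contains several tables with the same class, `find_all` followed by indexing is one way to pick the one you need.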

**Step-5:** Read the table from the parsed document and store the extracted data in a DataFrame.

In [9]:
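from io import StringIO

# read_html returns a list of DataFrames, one per <table> it finds.
# Wrapping the HTML in StringIO avoids the deprecation warning that
# newer pandas versions raise for literal HTML strings.
df = pd.read_html(StringIO(str(table)))

df1 = df[0]
# Rename the columns with flat, readable names.
df1.columns = ["Rank", "Country / Dependency", "Population", "% of the world", "Date", "Source (official or from the United Nations)", "Notes"]
df1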

| | Rank | Country / Dependency | Population | % of the world | Date | Source (official or from the United Nations) | Notes |
|---|---|---|---|---|---|---|---|
| 0 | – | World | 8032197000 | 100% | 25 May 2023 | UN projection[3] | |
| 1 | 1 | China | 1411750000 | | 31 Dec 2022 | Official estimate[4] | [b] |
| 2 | 2 | India | 1392329000 | | 1 Mar 2023 | Official projection[5] | [c] |
| 3 | 3 | United States | 334800000 | | 25 May 2023 | National population clock[7] | [d] |
| 4 | 4 | Indonesia | 277749853 | | 31 Dec 2022 | Official estimate[8] | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 237 | – | Tokelau (New Zealand) | 1647 | | 1 Jan 2019 | 2019 Census [208] | |
| 238 | – | Niue | 1549 | | 1 Jul 2021 | National annual projection[95] | |
| 239 | 195 | Vatican City | 825 | | 1 Feb 2019 | Monthly national estimate[209] | [af] |
| 240 | – | Cocos (Keeling) Islands (Australia) | 593 | | 30 Jun 2020 | 2021 Census[210] | |
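
`pd.read_html` does the extraction in one call. As an alternative, the cells can be collected by hand with BeautifulSoup, which makes it easy to clean values on the way in. The sketch below is a simplified illustration, not the original solution: it uses an invented miniature table with a single header row (the real Wikipedia table has a two-row header with `rowspan` and `colspan`, which this approach does not handle), and the helper `clean_cell` is a hypothetical name.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table; the rows are invented.
sample_html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Country</th><th>Population</th><th>Source</th></tr>
  <tr><td>1</td><td>Exampleland</td><td>1,234,567</td><td>Official estimate<sup>[1]</sup></td></tr>
  <tr><td>2</td><td>Samplestan</td><td>89,000</td><td>2020 Census<sup>[2]</sup></td></tr>
</table>
"""


def clean_cell(text):
    """Strip footnote markers such as [1] or [b] and surrounding whitespace."""
    return re.sub(r"\[\w+\]", "", text).strip()


sample_table = BeautifulSoup(sample_html, "html.parser").find("table", class_="wikitable")
rows = sample_table.find_all("tr")

# The first row holds the column names; the remaining rows hold the data.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[clean_cell(td.get_text()) for td in tr.find_all("td")] for tr in rows[1:]]

frame = pd.DataFrame(records, columns=header)
# Convert the numeric columns from text to integers.
frame["Population"] = frame["Population"].str.replace(",", "", regex=False).astype(int)
frame["Rank"] = frame["Rank"].astype(int)
print(frame)
```

The same `clean_cell` idea could be applied to the `Source` and `Notes` columns of `df1` to remove markers like `[3]` or `[b]`, assuming those markers are not needed downstream.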
