## Web Scraping Tutorial
For this assignment, use the techniques learnt in the previous session to scrape the following page: "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"

For web scraping, use the following libraries:
1. BeautifulSoup
2. requests 
3. pandas

Objective: 
* Create a DataFrame containing all countries listed on the Wikipedia page

Steps:
1. Import the libraries
   * pandas
   * requests
   * BeautifulSoup
2. Ping the website and return the HTML of the website
3. Use the prettify function to view how the tags are nested in the document
4. Find class 'sortable wikitable sticky-header col2left' in the HTML script
5. Extract all the links within a tag using find_all().
6. From the links found earlier, extract the titles using the 'get' method
   * Note: Create a list to append the countries to and name the list variable 'countries'.
7. Create the dataframe called df_countries
8. Set the column ‘Country’ in df_countries to countries

### 1. Import Libraries

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### 2. Ping the website and return the HTML of the website

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"
page = requests.get(url)
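
The request above assumes the server answers successfully. A minimal sketch of a more defensive fetch follows; the function name `fetch_html`, the timeout value, and the User-Agent string are all hypothetical choices, not part of the assignment. Wikimedia's User-Agent policy asks scripts to identify themselves, so a default `requests` header may be rejected.

```python
import requests

URL = "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"


def fetch_html(url, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    # Hypothetical User-Agent string; replace it with your own details.
    headers = {"User-Agent": "scraping-tutorial/0.1 (student exercise)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Usage (requires network access):
# html = fetch_html(URL)
```

Failing early on a bad status code avoids parsing an error page as if it were the article.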

### 3. Use the prettify function to view how the tags are nested in the document

In [None]:
parse = BeautifulSoup(page.content, 'html.parser')
print(parse.prettify())
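
Printing the whole prettified page produces a very long output. As a small sketch of what `prettify()` does, the example below runs it on a hypothetical three-tag snippet; slicing the returned string is one possible way to preview only the start of a large document.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the full Wikipedia page.
snippet = "<table><tr><td>1</td></tr></table>"
soup = BeautifulSoup(snippet, "html.parser")

pretty = soup.prettify()  # one tag per line, indented by nesting depth
print(pretty[:200])       # preview only the first 200 characters
```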

### 4. Find class 'sortable wikitable sticky-header col2left' in the HTML script

In [62]:
parse.find_all(class_='sortable wikitable sticky-header col2left')

[<table class="sortable wikitable sticky-header col2left" style="text-align: center">
 <tbody><tr>
 <th></th>
 <th>Country / dependency</th>
 <th>%<br/>total</th>
 <th>Asia area<br/>in km<sup>2</sup> (mi<sup>2</sup>)</th>
 <th class="unsortable">
 </th></tr>
 <tr>
 <td>1</td>
 <td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/Russia" title="Russia">Russia</a></td>
 <td>29.3%</td>
 <td><span data-sort-value="7013130831000000000♠"></span>13,083,100 (5,051,400)</td>
 <td><sup class="reference" id="cite_
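
Passing the full class string to `class_` matches only when the attribute is exactly that string, so the search may return nothing if Wikipedia reorders or renames a class. A sketch of a more tolerant alternative uses a CSS selector on two of the classes. The HTML below is a hypothetical, trimmed-down sample modelled on the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one table with the target classes and one other table.
sample_html = """
<table class="sortable wikitable sticky-header col2left">
<tr><th></th><th>Country / dependency</th></tr>
<tr><td>1</td><td><a href="/wiki/Russia" title="Russia">Russia</a></td></tr>
</table>
<table class="wikitable"><tr><td>other</td></tr></table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match: the class attribute must be exactly this string.
exact = soup.find_all(class_="sortable wikitable sticky-header col2left")

# CSS selector: any <table> carrying both classes, in any order.
table = soup.select_one("table.wikitable.sortable")

print(len(exact), table["class"])
```

Using `select_one` (or `find`) also returns a single `Tag` instead of a list, which makes the later `find_all` calls on the table simpler.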

### 5. Extract all the links within a tag using find_all().

In [113]:
tag = parse.find_all('tr')

tag

[<tr>
 <th></th>
 <th>Country / dependency</th>
 <th>%<br/>total</th>
 <th>Asia area<br/>in km<sup>2</sup> (mi<sup>2</sup>)</th>
 <th class="unsortable">
 </th></tr>,
 <tr>
 <td>1</td>
 <td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/Russia" title="Russia">Russia</a></td>
 <td>29.3%</td>
 <td><span data-sort-value="7013130831000000000♠"></span>13,083,100 (5,051,400)</td>
 <td><sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span class="cite-bracket">[</span>a<span class="cite-bracket"
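
The cell above collects every `<tr>` on the page rather than the links themselves. A minimal sketch of pulling only the country links follows, assuming the country name always sits in the second cell of each row (as in the output above). The sample HTML and the name `country_links` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample modelled on the table output above.
sample_html = """
<table class="sortable wikitable sticky-header col2left">
<tr><th></th><th>Country / dependency</th><th>% total</th><th></th></tr>
<tr><td>1</td><td><a href="/wiki/Russia" title="Russia">Russia</a></td>
    <td>29.3%</td><td><a href="#cite_note-3">[a]</a></td></tr>
<tr><td>2</td><td><a href="/wiki/China" title="China">China</a></td>
    <td>21.5%</td><td><a href="#cite_note-4">[b]</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="wikitable")

country_links = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:           # header rows contain only <th> cells
        continue
    link = cells[1].find("a")    # first link in the country cell
    if link is not None:
        country_links.append(link)

print([link["href"] for link in country_links])
```

Calling `table.find_all("a")` directly would also return the footnote links in the last column, which is why the sketch narrows the search to one cell per row.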

### 6. From the links found earlier, extract the titles using the 'get' method

In [152]:
countries_draft = [co.get_text() for co in tag]
countries_draft

['\n\nCountry / dependency\n%total\nAsia areain km2 (mi2)\n\n',
 '\n1\n\xa0Russia\n29.3%\n13,083,100 (5,051,400)\n[a]\n',
 '\n2\n\xa0China\n21.5%\n9,596,961 (3,705,407)\n[b]\n',
 '\n3\n\xa0India\n7.4%\n3,287,263 (1,269,219)\n\n',
 '\n4\n\xa0Kazakhstan\n5.8%\n2,600,000 (1,000,000)\n[c]\n',
 '\n5\n\xa0Saudi Arabia\n4.8%\n2,149,690 (830,000)\n\n',
 '\n6\n\xa0Iran\n3.7%\n1,648,195 (636,372)\n\n',
 '\n7\n\xa0Mongolia\n3.5%\n1,564,110 (603,910)\n\n',
 '\n8\n\xa0Indonesia\n3.3%\n1,488,509 (574,717)\n[d]\n',
 '\n9\n\xa0Pakistan\n2.0%\n881,913 (340,509)\n\n',
 '\n10\n\xa0Turkey\n1.7%\n759,805 (293,362)\n[e]\n',
 '\n11\n\xa0Myanmar\n1.5%\n676,578 (261,228)\n\n',
 '\n12\n\xa0Afghanistan\n1.5%\n652,867 (252,073)\n\n',
 '\n13\n\xa0Yemen\n1.2%\n555,000 (214,000)\n\n',
 '\n14\n\xa0Thailand\n1.2%\n513,120 (198,120)\n\n',
 '\n15\n\xa0Turkmenistan\n1.1%\n488,100 (188,500)\n\n',
 '\n16\n\xa0Uzbekistan\n1.0%\n447,400 (172,700)\n\n',
 '\n17\n\xa0Iraq\n1.0%\n438,317 (169,235)\n\n',
 '\n18\n\xa0Japan\n0.8%\n

In [153]:
# Split on the non-breaking space character ('\xa0') that precedes each country
# name, then keep only the text up to the next newline.
countries = [item.split('\xa0')[1].split('\n')[0] if '\xa0' in item else '' for item in countries_draft]
countries

['',
 'Russia',
 'China',
 'India',
 'Kazakhstan',
 'Saudi Arabia',
 'Iran',
 'Mongolia',
 'Indonesia',
 'Pakistan',
 'Turkey',
 'Myanmar',
 'Afghanistan',
 'Yemen',
 'Thailand',
 'Turkmenistan',
 'Uzbekistan',
 'Iraq',
 'Japan',
 ...

In [130]:
Sr = []
countries = []
area = []

for row in countries_draft:
    # Each row is a single string, so split it into newline-separated fields first.
    fields = row.strip().split('\n')
    # Skip header rows and any row whose second field is not a country name.
    if len(fields) < 4 or not fields[1].startswith('\xa0'):
        continue
    Sr.append(fields[0])                 # rank
    countries.append(fields[1].strip())  # country name without the leading '\xa0'
    area.append(fields[3])               # area in km2 (mi2)

countries

['Russia',
 'China',
 'India',
 'Kazakhstan',
 'Saudi Arabia',
 'Iran',
 'Mongolia',
 'Indonesia',
 'Pakistan',
 'Turkey',
 'Myanmar',
 'Afghanistan',
 'Yemen',
 'Thailand',
 'Turkmenistan',
 'Uzbekistan',
 'Iraq',
 'Japan',
 ...
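
The step title asks for the 'get' method, which reads an attribute from a tag and returns `None` when the attribute is missing. A minimal sketch on a hypothetical two-row sample follows. On the real page other links also carry a `title`, so the search would need to be limited to the country cells first.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two country links plus one footnote link without a title.
sample_html = """
<table class="wikitable">
<tr><td>1</td><td><a href="/wiki/Russia" title="Russia">Russia</a></td>
    <td><a href="#cite_note-3">[a]</a></td></tr>
<tr><td>2</td><td><a href="/wiki/China" title="China">China</a></td><td></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
links = soup.find_all("a")

countries = []
for link in links:
    title = link.get("title")  # None for the footnote link
    if title:
        countries.append(title)

print(countries)
```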

### 7. Create the dataframe called df_countries
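
No cell was provided for this step. One possible sketch, assuming an empty DataFrame is wanted so that the column can be attached in the next step:

```python
import pandas as pd

# Start with an empty DataFrame; columns are added afterwards.
df_countries = pd.DataFrame()
print(df_countries.shape)
```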

### 8. Set the column ‘Country’ in df_countries to countries
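
No cell was provided for this step either. A minimal sketch, using a hypothetical three-item `countries` list in place of the full scraped list:

```python
import pandas as pd

# Hypothetical short list standing in for the scraped country names.
countries = ["Russia", "China", "India"]

df_countries = pd.DataFrame()
df_countries["Country"] = countries  # the list becomes the 'Country' column

print(df_countries)
```

Passing the list directly, as in `pd.DataFrame({"Country": countries})`, would build the same frame in one call.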