## Web Scraping Tutorial
In this assignment, use the techniques learnt in the previous session to scrape the following page: "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"

Use the following libraries for web scraping:
1. BeautifulSoup
2. requests 
3. pandas

Objective:
* Create a DataFrame containing all countries listed on the Wikipedia page

Steps:
1. Import the libraries
   * pandas
   * requests
   * BeautifulSoup
2. Request the page and retrieve its HTML
3. Use the prettify function to view how the tags are nested in the document
4. Find the table with class 'sortable wikitable sticky-header col2left' in the HTML
5. Extract all the links within the table using find_all()
6. Extract the title of each link using the 'get' method
   * Note: Create a list named 'countries' and append the titles to it.
7. Create a dataframe called df_countries
8. Set the column 'Country' in df_countries to countries

### 1. Import Libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### 2. Request the page and retrieve its HTML

In [2]:
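# Download the page and keep its HTML as a string
wiki_url = "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"
page = requests.get(wiki_url)
page.raise_for_status()  # stop early if the server returned an error status
website = page.text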
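
A request can fail or hang, and some sites (Wikipedia among them) ask automated clients to identify themselves. The sketch below wraps the download in a small helper that sets a timeout, sends a `User-Agent` header, and raises on HTTP errors. The helper name `fetch_html`, the `HEADERS` value, and the 10-second timeout are assumptions made for illustration; they are not part of the original assignment.

```python
import requests

# Hypothetical identifier; replace it with your own project name and contact.
HEADERS = {"User-Agent": "asia-area-tutorial/0.1 (contact: you@example.com)"}


def fetch_html(url, timeout=10):
    """Return the HTML of `url`, raising an exception if the request fails."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Calling `fetch_html(wiki_url)` would then play the role of `page.text` in the cell above.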

### 3. Use the prettify function to view how the tags are nested in the document

In [3]:
# Parse the HTML; the 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(website, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Asian countries by area - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-
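
The full page is long, so the nesting is easier to see on a small fragment. The sketch below runs `prettify()` on a made-up snippet (the HTML string is an illustrative assumption, not taken from the page) and uses Python's built-in `html.parser` so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
html = '<ul><li><a href="/wiki/India" title="India">India</a></li></ul>'

# html.parser ships with Python, so this sketch does not need lxml.
small_soup = BeautifulSoup(html, "html.parser")

# prettify() puts each tag and each string on its own line, indented by depth.
pretty = small_soup.prettify()
print(pretty)
```

Each level of indentation in the printed result corresponds to one level of nesting: `ul` contains `li`, which contains `a`.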

### 4. Find the table with class 'sortable wikitable sticky-header col2left' in the HTML

In [4]:
class_table = soup.find('table', {'class':'sortable wikitable sticky-header col2left'})
class_table

<table class="sortable wikitable sticky-header col2left" style="text-align: center">
<tbody><tr>
<th></th>
<th>Country / dependency</th>
<th>%<br/>total</th>
<th>Asia area<br/>in km<sup>2</sup> (mi<sup>2</sup>)</th>
<th class="unsortable">
</th></tr>
<tr>
<td>1</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/Russia" title="Russia">Russia</a></td>
<td>29.3%</td>
<td><span data-sort-value="7013130831000000000♠"></span>13,083,100 (5,051,400)</td>
<td><sup class="reference" id="cite_ref-3"><a href
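
Passing the whole class string to `find` only matches when it equals the tag's `class` attribute exactly, so the lookup breaks if the page adds, removes, or reorders a class. As a possible alternative, a CSS selector via `select_one` matches on individual classes. The HTML below is a shortened stand-in for the real table, and the choice of `wikitable` and `sortable` as the identifying classes is an assumption.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real table markup.
html = """
<table class="sortable wikitable sticky-header col2left">
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string only matches when it equals the attribute value exactly,
# so the same classes in a different order find nothing.
exact = soup.find("table", {"class": "wikitable sortable"})

# A CSS selector matches any table carrying both classes, in any order.
by_selector = soup.select_one("table.wikitable.sortable")

print(exact is None)
print(by_selector is not None)
```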

### 5. Extract all the links within the table using find_all()

In [5]:
links = class_table.find_all('a')
links

[<a href="/wiki/Russia" title="Russia">Russia</a>,
 <a href="#cite_note-3"><span class="cite-bracket">[</span>a<span class="cite-bracket">]</span></a>,
 <a href="/wiki/China" title="China">China</a>,
 <a href="#cite_note-4"><span class="cite-bracket">[</span>b<span class="cite-bracket">]</span></a>,
 <a href="/wiki/India" title="India">India</a>,
 <a href="/wiki/Kazakhstan" title="Kazakhstan">Kazakhstan</a>,
 <a href="#cite_note-6"><span class="cite-bracket">[</span>c<span class="cite-bracket">]</span></a>,
 <a href="/wiki/Saudi_Arabia" title="Saudi Arabia">Saudi Arabia</a>,
 <a href="/wiki/Iran" title="Iran">Iran</a>,
 <a href="/wiki/Mongolia" title="Mongolia">Mongolia</a>,
 <a href="/wiki/Indonesia" title="Indonesia">Indonesia</a>,
 <a href="#cite_note-8"><span class="cite-bracket">[</span>d<span class="cite-bracket">]</span></a>,
 <a href="/wiki/Pakistan" title="Pakistan">Pakistan</a>,
 <a href="/wiki/Turkey" title="Turkey">Turkey</a>,
 <a href="#cite_note-10"><span class="cite-brac
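
The list above mixes country links with footnote links such as `#cite_note-3`, which have no `title` attribute. One way to skip them at this stage is to pass `title=True` to `find_all`, which keeps only tags that have that attribute. The sketch uses a small made-up table rather than the real page.

```python
from bs4 import BeautifulSoup

# Made-up table with two country links and one footnote link.
html = """
<table>
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a></td>
      <td><a href="#cite_note-3">[a]</a></td></tr>
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

all_links = table.find_all("a")                 # every <a> tag in the table
titled_links = table.find_all("a", title=True)  # only <a> tags with a title

print(len(all_links), len(titled_links))
```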

### 6. Extract the title of each link using the 'get' method
* Note: Create a list named 'countries' and append the titles to it.

In [6]:
countries = []
for link in links:
  # Footnote links have no 'title' attribute, so get() returns None for them
  title = link.get('title')
  if title is not None:
    countries.append(title)
print(countries)

['Russia', 'China', 'India', 'Kazakhstan', 'Saudi Arabia', 'Iran', 'Mongolia', 'Indonesia', 'Pakistan', 'Turkey', 'Myanmar', 'Afghanistan', 'Yemen', 'Thailand', 'Turkmenistan', 'Uzbekistan', 'Iraq', 'Japan', 'Vietnam', 'Malaysia', 'Oman', 'Philippines', 'Laos', 'Kyrgyzstan', 'Syria', 'Cambodia', 'Bangladesh', 'Nepal', 'Tajikistan', 'North Korea', 'South Korea', 'Jordan', 'United Arab Emirates', 'Azerbaijan', 'Georgia (country)', 'Sri Lanka', 'Egypt', 'Bhutan', 'Taiwan', 'Armenia', 'Israel', 'Kuwait', 'East Timor', 'Qatar', 'Lebanon', 'Cyprus', 'State of Palestine', 'Brunei', 'Hong Kong', 'China', 'Bahrain', 'Singapore', 'Maldives', 'Macau', 'China']
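
In the output above, 'China' appears again right after 'Hong Kong' and 'Macau', presumably because those rows link to both the territory and China. A possible refinement, sketched here on made-up markup, is to walk the table row by row and keep only the first titled link in each row. The row structure shown is an assumption about the page, so check it against the real HTML before relying on it.

```python
from bs4 import BeautifulSoup

# Made-up rows; the second row carries two titled links.
html = """
<table>
  <tr><th>Country / dependency</th></tr>
  <tr><td><a href="/wiki/Hong_Kong" title="Hong Kong">Hong Kong</a>
          (<a href="/wiki/China" title="China">China</a>)</td></tr>
  <tr><td><a href="/wiki/Bahrain" title="Bahrain">Bahrain</a></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

countries = []
for row in table.find_all("tr"):
    # find() returns the first match, or None for rows without a titled link
    first_link = row.find("a", title=True)
    if first_link is not None:
        countries.append(first_link["title"])

print(countries)
```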


### 7. Create a dataframe called df_countries

In [7]:
df_countries = pd.DataFrame()

### 8. Set the column 'Country' in df_countries to countries

In [8]:
df_countries['Country'] = countries

In [9]:
df_countries

| | Country |
|---|---|
| 0 | Russia |
| 1 | China |
| 2 | India |
| 3 | Kazakhstan |
| 4 | Saudi Arabia |
| 5 | Iran |
| 6 | Mongolia |
| 7 | Indonesia |
| 8 | Pakistan |
| 9 | Turkey |
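
Steps 7 and 8 can also be combined by passing a dictionary to `pd.DataFrame`, where the key becomes the column name. This is a minimal sketch with a shortened sample list standing in for the scraped `countries`.

```python
import pandas as pd

# Shortened sample list standing in for the scraped titles.
countries = ["Russia", "China", "India"]

# The dictionary key becomes the column name; the list becomes its values.
df_countries = pd.DataFrame({"Country": countries})

print(df_countries.shape)
```

Either approach gives a single-column dataframe; the two-step version in the assignment makes each stage visible.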
