**Web Scraping From Wikipedia Using Python**

---

This is an essential project for understanding web scraping as a whole. The goal is to scrape data from the Wikipedia home page and parse it using various web scraping techniques.

The project covers web scraping techniques, Python modules for web scraping, and the process of extracting and processing data.

---

**Understanding The Essentials**

**What is Web Scraping:** It is a technique in which a program, written in a programming language of your choice, automatically extracts large amounts of data from websites.

The result is structured data that can be saved locally, typically as spreadsheets (such as Excel files), CSV, or JSON.

It removes the need to manually copy and paste data from websites; a scraper performs that task in a few seconds.
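
To make the idea of turning a page into structured data concrete, here is a minimal sketch that uses only the standard library's `html.parser` module rather than the scraping libraries introduced later. The HTML snippet and the `LinkExtractor` class are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Python_(programming_language)">Python</a></li>
  <li><a href="/wiki/Web_scraping">Web scraping</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect every <a href=...> tag as a {"text": ..., "href": ...} record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current_href = None  # set only while inside an <a> tag with an href
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave this as None and are skipped.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.records.append(
                {"text": "".join(self._text_parts).strip(), "href": self._current_href}
            )
            self._current_href = None


extractor = LinkExtractor()
extractor.feed(SAMPLE_HTML)
records = extractor.records
print(records)
```

Each link becomes one dictionary, which is the kind of structured record that can later be written to a spreadsheet or a JSON file.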

**What's the Overall Purpose:** Python, the language used here, is designed to help programmers write clear, logical code for small- and large-scale projects.

**Essential Frameworks:** Python is a popular choice for web scraping, and *Scrapy* and *Beautiful Soup* simplify the process, making Python easy to use for the task.

---

**Essential Libraries**

The following are the libraries used for Web Scraping in Python:

- **Requests (HTTP for Humans) Library for Web Scraping** - Used for making HTTP requests such as GET and POST. It's the most basic yet **most essential** of all the libraries.
- **lxml Library for Web Scraping** - The lxml library provides fast, high-performance parsing of HTML and XML content from websites. It is a strong choice for scraping **large datasets**.
- **Beautiful Soup Library for Web Scraping** - This library builds a **Parse Tree** from the page content, which can then be searched and navigated. It is the most beginner-friendly of these libraries, as it is very easy to work with.
- **Selenium Library for Web Scraping** - Originally made for automated testing of web applications, this library overcomes a limitation that the libraries above share: they cannot scrape content from dynamically populated websites. Because it drives a real browser to do so, it is slower and less suitable for industry-level projects.
- **Scrapy for Web Scraping** - The **boss** of all libraries. It's an entire web scraping framework that handles requests asynchronously, which makes it very fast.
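
As a small illustration of the GET and POST request types mentioned for Requests, the sketch below builds both kinds of request without sending them, so it runs offline. The search URL and the `example.com` endpoint are illustrative assumptions, not part of the original project.

```python
import requests

# Build (but do not send) a GET request with query parameters.
get_request = requests.Request(
    "GET",
    "https://en.wikipedia.org/w/index.php",
    params={"search": "web scraping"},
).prepare()

# Build (but do not send) a POST request with form data.
# example.com is a placeholder endpoint, used only for illustration.
post_request = requests.Request(
    "POST",
    "https://example.com/form",
    data={"query": "python"},
).prepare()

# A GET request carries its parameters in the URL;
# a POST request carries them in the request body.
print(get_request.method, get_request.url)
print(post_request.method, post_request.body)
```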

---

**Practical Implementation - Scraping Wikipedia**

Step 1: How is Python used for web scraping?
- *Virtualenv*, a tool for creating isolated Python environments, is useful here: it creates a folder containing all the executables needed to use the packages a Python project requires, without affecting the global installation.
- I did not use it in this project, as it is not needed with a Jupyter Notebook.
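
For readers who do want an isolated environment, here is a minimal sketch. It swaps in Python's built-in `venv` module in place of the third-party Virtualenv tool named above; the two serve the same purpose, but this is not the Virtualenv API. The helper names and the `scraper-env` folder name are hypothetical.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_environment(target_dir: Path) -> Path:
    """Create an isolated environment at target_dir and return its config file path."""
    # with_pip=False keeps the sketch fast; pass with_pip=True to bootstrap pip.
    venv.create(target_dir, with_pip=False)
    return target_dir / "pyvenv.cfg"


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    config_file = create_environment(Path(tmp) / "scraper-env")
    print("Environment created:", config_file.exists())

print("Running inside a virtual environment:", in_virtual_environment())
```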

**Required installations**
- **Requests:** An efficient HTTP library used for accessing web pages.
- **Urllib3:** An HTTP client library used for retrieving data from URLs. Requests is built on top of it.
- **Selenium:** An open-source automated testing suite for web applications across different browsers and platforms.
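
Before starting, it can help to confirm that these packages are importable. The following is a small sketch using the standard library's `importlib`; the `missing_modules` helper is a hypothetical name introduced here.

```python
from importlib.util import find_spec

# Import names of the packages listed above.
REQUIRED_MODULES = ["requests", "urllib3", "selenium"]


def missing_modules(names):
    """Return the subset of names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```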

Step 2: Intro to the Requests library
- There are various Python modules that are used to fetch data from the web.
- The Python Requests library makes a **GET** request to a web server. This is used to download the **HTML** contents of the webpage we want to scrape.

In [1]:
# import required modules
import requests

# get URL
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")

In [2]:
# display status code
print(page.status_code)

200


In [3]:
# display scraped data
print(page.content)

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-disabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-not-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Wikipedia, the free encyclopedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-d
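
The cells above can be wrapped into a slightly more defensive helper. This is a sketch under a few assumptions: the `fetch_html` name, the 10-second timeout, and the User-Agent string are choices made here for illustration, not part of the original notebook. The descriptive User-Agent follows Wikimedia's request that automated clients identify themselves.

```python
import requests

# Placeholder identity; replace with your own project name and contact details.
HEADERS = {"User-Agent": "wiki-scraping-tutorial/0.1 (contact: you@example.com)"}


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so the caller never parses an error page by mistake.
    response.raise_for_status()
    # .content is raw bytes (the b'...' output above); .text is the decoded string.
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html("https://en.wikipedia.org/wiki/Main_Page")
# print(html[:200])
```

Checking the status explicitly replaces the manual `print(page.status_code)` step, and the timeout prevents the notebook from hanging if the server does not respond.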

Step 3: Intro to Beautiful Soup for page parsing
- There are several Python modules used for data extraction. BeautifulSoup is one of the most popular for this purpose.
- BeautifulSoup is a Python library for pulling data out of HTML and XML files.
- It needs an input (a document or the downloaded contents of a URL) to create a 'soup object', as it cannot fetch a web page by itself.
- Regular expressions or lxml could be used to do the same.
- Finally, we save the processed data in a format such as CSV or JSON, or in a database such as MySQL.
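
The final saving step might look like the sketch below, which covers the CSV and JSON cases with the standard library only (a MySQL example would need a running database and is omitted). The records, file names, and helper functions are hypothetical placeholders for data extracted from a parsed page.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical records, standing in for data extracted from a parsed page.
records = [
    {"title": "Python", "url": "https://en.wikipedia.org/wiki/Python_(programming_language)"},
    {"title": "Web scraping", "url": "https://en.wikipedia.org/wiki/Web_scraping"},
]


def save_json(rows, path):
    """Write the rows as a JSON array."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle, indent=2, ensure_ascii=False)


def save_csv(rows, path):
    """Write the rows as CSV, using the keys of the first row as the header."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


output_dir = Path(tempfile.mkdtemp())  # temporary folder for the demonstration
save_json(records, output_dir / "pages.json")
save_csv(records, output_dir / "pages.csv")
print("Saved files to", output_dir)
```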

In [4]:
# import required modules
from bs4 import BeautifulSoup
import requests

# get URL
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")

# Scrape webpage
soup = BeautifulSoup(page.content, 'html.parser')

# display scraped data
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-disabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-not-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Wikipedia, the free encyclopedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-wid

- This outputs an HTML document.
- The BeautifulSoup library parses this document, so that elements such as the text of the p tags can be extracted from it.
- The page is formatted nicely using the 'prettify' method on the BeautifulSoup object.
- Since all the tags are nested, we can move through the structure one level at a time.
- We can first select all elements at the top level of the page using the 'children' property of soup.
- NOTE: 'children' returns an iterator rather than a list, so it's necessary to call the list function on it.
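
The navigation described in these points can be sketched on a small inline document, so the example runs without a network request. The sample HTML is hypothetical; the same calls apply to the soup object built from the Wikipedia page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Small hypothetical document standing in for the downloaded page.
html = """<!DOCTYPE html>
<html>
 <head><title>Sample page</title></head>
 <body>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
 </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# .children is an iterator, so wrap it in list() to index or inspect it.
top_level = list(soup.children)
print([type(node).__name__ for node in top_level])

# Move one level down: keep only the tags nested directly inside <html>,
# skipping the whitespace strings that sit between them.
html_tag = soup.html
child_tags = [child.name for child in html_tag.children if isinstance(child, Tag)]
print(child_tags)

# Extract the text of every <p> tag.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```

The top-level list mixes tags with other node types, such as the doctype and whitespace strings, which is why the sketch filters on `Tag` before reading names.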

Step 4: Digging deeper in BeautifulSoup