<h3>Web Scraping to Create a Dataset</h3>

The datasets you find online are either published by companies and organizations or collected from websites. You may have scraped data from web pages with Python libraries before, but gotten stuck when preparing the scraped data to create a dataset.

Many libraries, frameworks, and tools are available for web scraping. Some of the most common Python libraries and modules for the task are:

 - Scrapy
 - Selenium
 - BeautifulSoup
 - urllib.request

All of these libraries and modules can scrape data from websites. After scraping, the data is prepared so that it can be stored in a CSV file to create a dataset.
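
To make the "store it in a CSV file" step concrete, here is a minimal sketch that uses only the standard `csv` module. The rows are made-up sample values (`ExampleLang` is hypothetical), and an in-memory buffer stands in for a real file so the result can be inspected directly.

```python
import csv
import io

# Made-up sample rows standing in for scraped table cells
rows = [
    ["Language", "Original purpose", "Imperative"],
    ["ExampleLang", "Application, web", "Yes"],
]

# newline="" lets the csv module control line endings itself
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back should give the same rows, even though
# one cell contains a comma (the writer quotes it automatically)
restored = list(csv.reader(io.StringIO(csv_text, newline="")))
```

Because the writer quotes any cell that contains a comma, values such as "Application, web" stay in a single column when the file is read back.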

Now let’s see how to create a dataset by scraping the web using Python. For this task, I will use the BeautifulSoup library. I am going to search for a term on Google and then collect data from the first result that Google shows me.

So, I searched for “comparison of programming languages” on Google and got this article as the first result: https://en.wikipedia.org/wiki/Comparison_of_programming_languages. Below is how we can use the BeautifulSoup library in Python to scrape data from this web page and create a dataset:

In [9]:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [10]:
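# Download the page and parse it with Python's built-in HTML parser
html = urlopen("https://en.wikipedia.org/wiki/Comparison_of_programming_languages")
soup = BeautifulSoup(html, "html.parser")
# Take the first table with the "wikitable" class, then collect its rows
table = soup.find_all("table", {"class": "wikitable"})[0]
rows = table.find_all("tr")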
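
The cell above relies on the defaults of `urlopen`. Some sites expect a descriptive User-Agent header and may reject requests without one, so a slightly more defensive fetch might look like the sketch below. The helper names `build_request` and `fetch_html`, the `USER_AGENT` string, and the 10-second timeout are my own assumptions for illustration, not part of the original code.

```python
from urllib.request import Request, urlopen

URL = "https://en.wikipedia.org/wiki/Comparison_of_programming_languages"
# Hypothetical identifier; replace it with one that describes your own script
USER_AGENT = "dataset-tutorial-example/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html(URL)
# soup = BeautifulSoup(html, "html.parser")
```

The bytes returned by `fetch_html` can be passed to `BeautifulSoup` in the same way as the `urlopen` result above.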

In [11]:
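# Write every table row to a CSV file; newline="" avoids blank lines on Windows
with open("language.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for tr in rows:
        # Header cells are <th>, data cells are <td>
        row = [cell.get_text() for cell in tr.find_all(["td", "th"])]
        writer.writerow(row)

In [12]:
import pandas as pd
# Read the file back with the same encoding it was written in
df = pd.read_csv("language.csv", encoding="utf-8")
df.head()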

| | Language\n | Original purpose\n | Imperative\n | Object-oriented\n | Functional\n | Procedural\n | Generic\n | Reflective\n | Other paradigms\n | Standardized?\n |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1C:Enterprise programming language\n | Application, RAD, business, general, web, mobi... | Yes\n | No\n | Yes\n | Yes\n | Yes\n | Yes\n | Object-based, Prototype-based programming\n | No\n |
| 1 | ActionScript\n | Application, client-side, web\n | Yes\n | Yes\n | Yes\n | Yes\n | No\n | No\n | prototype-based\n | Yes1999-2003, ActionScript 1.0 with ES3, Actio... |
| 2 | Ada\n | Application, embedded, realtime, system\n | Yes\n | Yes[2]\n | No\n | Yes[3]\n | Yes[4]\n | No\n | Concurrent,[5] distributed[6]\n | Yes1983, 2005, 2012, ANSI, ISO, GOST 27831-88[... |
| 3 | Aldor\n | Highly domain-specific, symbolic computing\n | Yes\n | Yes\n | Yes\n | No\n | No\n | No\n | \n | No\n |
| 4 | ALGOL 58\n | Application\n | Yes\n | No\n | No\n | No\n | No\n | No\n | \n | No\n |
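
The output still carries trailing `\n` characters and footnote markers such as `[2]` from the Wikipedia table. One possible way to tidy a cell before writing it is sketched below. The `clean_cell` helper and the `FOOTNOTE` pattern are hypothetical names of my own, and the sample row reuses values from the output above.

```python
import re

# Matches numeric footnote markers such as [2] or [15]
FOOTNOTE = re.compile(r"\[\d+\]")


def clean_cell(text):
    """Remove footnote markers, then strip surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()


# Sample cells taken from the Ada row shown above
raw_row = ["Ada\n", "Yes[2]\n", "Concurrent,[5] distributed[6]\n", "\n"]
cleaned_row = [clean_cell(cell) for cell in raw_row]
# cleaned_row == ["Ada", "Yes", "Concurrent, distributed", ""]
```

Applying a helper like this to each `cell.get_text()` result before `writer.writerow(row)` would also clean the header names, so the columns would read `Language` instead of `Language\n`.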
