<a href="https://colab.research.google.com/github/GuCuChiara/Web-Scraping-to-Create-a-Dataset-using-Python/blob/main/Web_Scraping_to_Create_a_Dataset_using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping to Create a Dataset using Python

# How are Datasets Created by Scraping the Web?
Many libraries, frameworks, and tools can be used for web scraping.

### Some of the most common libraries and modules in Python used for web scraping are:

* Scrapy
* Selenium
* BeautifulSoup
* urllib.request

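Each of these can be used to collect data from websites: Scrapy is a crawling framework, Selenium automates a real browser, BeautifulSoup parses HTML and XML, and urllib.request is the HTTP client in Python's standard library.

After scraping, the data is prepared so that it can be stored in a CSV file to create a dataset.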

### We link our Google Drive to Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### For this task, we will be using the **BeautifulSoup library** in Python.

We import the necessary libraries:

In [None]:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

### We open the **Wikipedia** URL and go through its HTML tables:

In [None]:
html = urlopen("https://en.wikipedia.org/wiki/Comparison_of_programming_languages")
soup = BeautifulSoup(html, "html.parser")
# Take the first table with the "wikitable" class and collect its rows
table = soup.find_all("table", {"class": "wikitable"})[0]
rows = table.find_all("tr")
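Some servers reject or throttle requests that carry the default `Python-urllib` user agent, so it can help to send a descriptive `User-Agent` header and a timeout. The sketch below is one way to do that with the standard library. The helper names `build_request` and `fetch_html`, the user-agent string, and the 30-second timeout are illustrative assumptions, not part of the original notebook.

```python
from urllib.request import Request, urlopen

# Hypothetical identifier for this notebook; replace it with your own details.
DEFAULT_USER_AGENT = "dataset-scraper-example/0.1 (educational notebook)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=30):
    """Download the page and return its raw bytes (performs a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect.
request = build_request("https://en.wikipedia.org/wiki/Comparison_of_programming_languages")
print(request.full_url)
print(request.get_header("User-agent"))
```

The bytes returned by `fetch_html` can be passed to `BeautifulSoup(..., "html.parser")` in the same way as `html` in the cell above.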

### After scraping, the data is prepared and stored in a CSV file to create a local dataset:
```
programing_languages.csv
```



In [None]:
with open("/content/drive/MyDrive/Colab Notebooks/Python scripts/programing_languages.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for tr in rows:
        # Collect the text of every data and header cell in this table row
        row = []
        for cell in tr.find_all(["td", "th"]):
            row.append(cell.get_text())
        writer.writerow(row)
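The loop above writes each cell's text exactly as `get_text()` returns it, including any line breaks inside the cell. If cleaner values are preferred, one option is to collapse whitespace before writing. This is a minimal sketch using hypothetical helper names (`normalize_cell`, `rows_to_csv_text`) and a made-up two-row sample instead of the live page:

```python
import csv
import io


def normalize_cell(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def rows_to_csv_text(rows):
    """Write already-extracted rows (lists of strings) to CSV text in memory."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([normalize_cell(cell) for cell in row])
    return buffer.getvalue()


# Made-up sample that mimics the raw strings returned by cell.get_text().
sample_rows = [["Language\n", "Imperative\n"], ["Ada\n", "Yes\n"]]
print(rows_to_csv_text(sample_rows))
```

In the notebook's loop, the same effect could be had by appending `normalize_cell(cell.get_text())` instead of `cell.get_text()`.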

### We preview the first rows of our dataset with pandas:

In [None]:
import pandas as pd
a = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Python scripts/programing_languages.csv")
a.head()

Unnamed: 0,Language\n,Original purpose\n,Imperative\n,Object-oriented\n,Functional\n,Procedural\n,Generic\n,Reflective\n,Event-driven\n,Other paradigms\n,Standardized?\n
0,1C:Enterprise programming language\n,"Application, RAD, business, general, web, mobi...",Yes\n,No\n,Yes\n,Yes\n,Yes\n,Yes\n,Yes\n,"Object-based, Prototype-based programming\n",No\n
1,ActionScript 3.0\n,"Application, client-side, web\n",Yes\n,Yes\n,Yes\n,No\n,No\n,No\n,Yes\n,\n,"Yes1996, ECMA\n"
2,Ada\n,"Application, embedded, realtime, system\n",Yes\n,Yes[2]\n,No\n,Yes[3]\n,Yes[4]\n,No\n,No\n,"Concurrent,[5] distributed,[6]\n","Yes1983, 2005, 2012, ANSI, ISO, GOST 27831-88[..."
3,Aldor\n,"Highly domain-specific, symbolic computing\n",Yes\n,Yes\n,Yes\n,No\n,No\n,No\n,No\n,\n,No\n
4,ALGOL 58\n,Application\n,Yes\n,No\n,No\n,No\n,No\n,No\n,No\n,\n,No\n
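The preview shows trailing `\n` characters in the column names and values, plus footnote markers such as `[2]`. One possible cleanup after loading is sketched below. `tidy_scraped_frame` is a hypothetical helper, and the small frame is made-up sample data that imitates the output above.

```python
import re

import pandas as pd

# Matches numeric footnote markers such as "[2]" or "[15]".
FOOTNOTE = re.compile(r"\[\d+\]")


def _clean(value):
    """Strip footnote markers and surrounding whitespace from strings only."""
    if isinstance(value, str):
        return FOOTNOTE.sub("", value).strip()
    return value


def tidy_scraped_frame(df):
    """Return a copy with cleaned column names and cleaned string cells."""
    out = df.copy()
    out.columns = [_clean(name) for name in out.columns]
    return out.apply(lambda column: column.map(_clean))


sample = pd.DataFrame({"Language\n": ["Ada\n"], "Object-oriented\n": ["Yes[2]\n"]})
print(tidy_scraped_frame(sample))
```

The original frame is left untouched, so the raw scraped values remain available for comparison.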


### Another example with the **worldometers** web page containing information about COVID-19:

### COVID-19 Coronavirus Pandemic
https://www.worldometers.info/coronavirus/



In [None]:
url = 'https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

### We look for the table with the id **main_table_countries_today** that contains the statistical information of interest:

In [None]:
table = soup.find('table', id='main_table_countries_today')
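`soup.find` returns `None` when no element matches, for example if the site renames the table or serves a different page to scripts. Any later call on `table` would then fail with an `AttributeError`. A small guard makes that failure easier to read. The sketch below uses a hypothetical helper, `find_table_by_id`, and a made-up HTML snippet rather than the live page:

```python
from bs4 import BeautifulSoup


def find_table_by_id(html, table_id):
    """Return the <table> with the given id, or raise a descriptive error."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"No <table> with id={table_id!r} found in the page")
    return table


# Made-up snippet standing in for the downloaded page.
sample_html = """
<table id="main_table_countries_today">
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>Exampleland</td><td>42</td></tr>
</table>
"""

sample_table = find_table_by_id(sample_html, "main_table_countries_today")
print(len(sample_table.find_all("tr")))
```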

### After scraping, the data is prepared and stored in a CSV file to create a local dataset:


```
coronavirus.csv
```




In [None]:
rows = table.find_all("tr")

with open("/content/drive/MyDrive/Colab Notebooks/Python scripts/coronavirus.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for tr in rows:
        # Collect the text of every data and header cell in this table row
        row = []
        for cell in tr.find_all(["td", "th"]):
            row.append(cell.get_text())
        writer.writerow(row)

### We preview the first rows of our local **coronavirus.csv** dataset with pandas:

In [None]:
a = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Python scripts/coronavirus.csv")
a.head()

Unnamed: 0,#,"Country,Other",TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",...,TotalTests,Tests/\n1M pop\n,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,New Cases/1M pop,New Deaths/1M pop,Active Cases/1M pop
0,,\nNorth America\n,123822984,8168.0,1602516,109.0,118877067,12634.0,3343401,8343,...,,,,North America,\n,,,,,
1,,\nAsia\n,213437083,71043.0,1532261,339.0,197923518,59535.0,13981304,15751,...,,,,Asia,\n,,,,,
2,,\nEurope\n,245173554,19292.0,2009753,76.0,240857241,25514.0,2306560,6467,...,,,,Europe,\n,,,,,
3,,\nSouth America\n,67812576,2256.0,1348472,36.0,66007415,9519.0,456689,10215,...,,,,South America,\n,,,,,
4,,\nOceania\n,13914877,,25471,,13754315,,135091,78,...,,,,Australia/Oceania,\n,,,,,


### We can save our table in HTML format to use it on our website:

In [None]:
with open('/content/drive/MyDrive/Colab Notebooks/Python scripts/tabla.html', 'w') as f:
    f.write(str(table))
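The saved file contains only the `<table>` element. If the goal is a page that can be opened directly in a browser, the fragment can be wrapped in a minimal HTML document that declares its encoding. This is a sketch, assuming a hypothetical helper `wrap_table_html` and a made-up table fragment:

```python
import html


def wrap_table_html(table_html, title):
    """Embed an HTML table fragment in a minimal standalone HTML5 page."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>{html.escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{table_html}\n"
        "</body>\n"
        "</html>\n"
    )


# Made-up fragment; in the notebook this would be str(table).
fragment = "<table><tr><td>Exampleland</td><td>42</td></tr></table>"
page = wrap_table_html(fragment, "COVID-19 <today>")
print(page)
```

Writing `page` with `open(path, "w", encoding="utf-8")` keeps the file's bytes consistent with the declared charset.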


In [None]:
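import re
from io import StringIO

import pandas as pd

r = requests.get('https://www.worldometers.info/coronavirus/').text

soup = BeautifulSoup(r, "lxml")
today = str(soup.find('div', attrs={'id': 'nav-today'}))
# Convert the Tag to a string because re.sub works on str, not on Tag objects

today = re.sub(r'<.*?>', lambda g: g.group(0).upper(), today)
# Change the characters inside <> to uppercase

dfs = pd.read_html(StringIO(today))[0]
# Wrapped in StringIO because recent pandas versions deprecate passing literal HTML
#print(dfs)
dfs.head
# Note: without parentheses this displays the bound method together with the whole frame; use dfs.head() for the first five rows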
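To see what the `re.sub` call in the cell above does, the same pattern can be applied to a small fragment. The helper name `uppercase_tags` and the sample markup are made up for illustration. Note that everything between `<` and `>` is uppercased, attribute values included, while the text between tags is left alone.

```python
import re


def uppercase_tags(markup):
    """Uppercase everything between '<' and '>' and leave other text alone."""
    return re.sub(r"<.*?>", lambda match: match.group(0).upper(), markup)


# Made-up fragment used only to illustrate the substitution.
sample = '<td class="name">usa</td>'
print(uppercase_tags(sample))  # <TD CLASS="NAME">usa</TD>
```

Because `.` does not match a newline by default, a tag that spans several lines would not be changed by this pattern.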

<bound method NDFrame.head of       #  Country,Other  TotalCases  NewCases  TotalDeaths  NewDeaths  \
0   NaN  North America   123825061   10245.0    1602555.0      148.0   
1   NaN           Asia   213437083   71043.0    1532261.0      339.0   
2   NaN         Europe   245173554   19292.0    2009753.0       76.0   
3   NaN  South America    67812576    2256.0    1348472.0       36.0   
4   NaN        Oceania    13914877       NaN      25471.0        NaN   
..   ..            ...         ...       ...          ...        ...   
242 NaN         Total:    67812576    2256.0    1348472.0       36.0   
243 NaN         Total:    13914877       NaN      25471.0        NaN   
244 NaN         Total:    12782188      44.0     258535.0        NaN   
245 NaN         Total:         721       NaN         15.0        NaN   
246 NaN         Total:   676946060  102880.0    6777062.0      599.0   

     TotalRecovered  NewRecovered  ActiveCases  Serious,Critical  ...  \
0       118891433.0       27000.