In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
import os
import json
import logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

#### The cell above imports the packages needed to run the scraper. Scrapy is the web-crawling framework used for this assignment. `os` and `json` are used to write the scraped data directly into a JSON Lines file, and `logging` is used to set the spider's log level. The last two imports (`multiprocessing` and the Twisted `reactor`) are only needed in the Jupyter notebook version of the code, where they help with restarting the reactor if the crawl is rerun. They are not used in the .py file, because each run of the script starts in a fresh Python interpreter.

In [2]:
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Create the output directory if it is missing, then open the JSON Lines file.
        os.makedirs('./data', exist_ok=True)
        self.file = open('./data/coronacases.jsonl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


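#### The pipeline above handles the items scraped by the spider. When the spider opens, a JSON Lines file (`./data/coronacases.jsonl`) is created. Each item the spider yields is then written to it as one JSON object per line, and the file is closed when the spider finishes.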
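#### As a quick check of what the pipeline wrote, the file can be read back line by line. The helper below is a minimal sketch; `read_jsonl` is a hypothetical name that is not part of the original notebook, and it assumes the file contains one JSON object per line, as produced by `process_item`.

```python
import json


def read_jsonl(path):
    """Return a list with one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

#### After a crawl, `read_jsonl('./data/coronacases.jsonl')` should return one dictionary per country row.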

In [3]:
class Coronacases(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/coronavirus/']

    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
        'FEED_FORMAT': 'json',
        'FEED_URI': './data/coronacases.json'
    }

    def parse(self, response):

        rows = response.xpath('(.//table[@id="main_table_countries_today"])[1]/tbody/tr[@style=""]')
        for row in rows:
            name = row.xpath('.//td[2]/a/text() | .//td[2]/span/text() ').get()
            totcases = row.xpath(".//td[3]/text()").get()
            totdeaths = row.xpath(".//td[5]/text()").get()
            actcases = row.xpath(".//td[9]/text()").get()
            totpop = row.xpath(".//td[15]/a/text() | .//td[15]/text() ").get()
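            # Reset the share for every row so a previous country's value is never
            # reused, and compute it only when both values are present and numeric.
            percpop = None
            if (actcases and totpop
                    and actcases.strip() not in ('', 'N/A')
                    and totpop.strip() not in ('', 'N/A')):
                percpop = round(float(actcases.replace(',', '')) / float(totpop.replace(',', '')) * 100, 5)
            yield {
                "country_name": name,
                "Total_cases": totcases,
                "Total_deaths": totdeaths,
                "Active_Cases": actcases,
                "Total_Population": totpop,
                "% Active Cases per capita": None if percpop is None else str(percpop) + " %"
            }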

#### For this assignment, the Worldometers coronavirus dataset was scraped for data on every country that still has active cases of coronavirus. The class defined above is the spider that scrapes the data. First, a name is assigned to the spider. The allowed domains and the starting URL are then defined. The custom settings register the pipeline class defined earlier, so that every scraped item is passed through it, and additionally configure a JSON feed export.
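#### The log output further down reports Scrapy 2.3.0. From Scrapy 2.1 onwards, the `FEED_FORMAT` and `FEED_URI` settings are deprecated in favour of a single `FEEDS` dictionary. The fragment below is a sketch of how the same settings could be written in that style; it is an alternative spelling, not the configuration the notebook was actually run with.

```python
import logging

# Sketch only: the same settings expressed with the newer FEEDS setting
# (assumes Scrapy 2.1 or later). Keys of FEEDS are output paths, values
# are per-feed options.
custom_settings = {
    'LOG_LEVEL': logging.WARNING,
    'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
    'FEEDS': {
        './data/coronacases.json': {'format': 'json'},
    },
}
```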

#### Next, the parse function is defined. Every row of the table on the page represents the data for a different country. Therefore, the set of all rows is extracted from the response first. The rows are then parsed one by one, selecting the columns that hold the required values. For each country, the name, total number of coronavirus cases, total number of deaths, number of active cases and total population were extracted. The position of each value was found by examining the HTML page with "inspect element", and the corresponding tag was used in the XPath expression. Using the active cases and the total population, a new parameter that is not found on the webpage (% of active cases per capita) was calculated and added to each item. This was done simply to test whether the raw scraped data can be used to create new features when added to a database. Each item is then written to the JSON and JSON Lines files that were configured earlier.
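#### The per-capita calculation can also be tried in isolation, without running a crawl. The function below is a minimal sketch of that step; `active_share` and its inner helper are hypothetical names, and the sample numbers are made up for illustration.

```python
def active_share(active_cases, population, digits=5):
    """Return active cases as a percentage of the population, or None if unavailable."""

    def to_number(text):
        # The scraped values are strings such as "1,234", "N/A" or " ".
        if text is None:
            return None
        cleaned = text.replace(",", "").strip()
        if cleaned in ("", "N/A"):
            return None
        return float(cleaned)

    active = to_number(active_cases)
    total = to_number(population)
    if active is None or not total:  # missing value or zero population
        return None
    return round(active / total * 100, digits)


print(active_share("1,000", "100,000"))  # 1.0
print(active_share("N/A", "100,000"))    # None
```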



In [4]:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})


2020-10-06 12:49:51 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-06 12:49:51 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-10-06 12:49:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor


In [5]:
process.crawl(Coronacases) 
process.start()


2020-10-06 12:49:51 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
  exporter = cls(crawler)



#### The crawler process is initialized and called with the Spider that was created.

#### This process can be run as is, either in a Jupyter notebook or with a standard Python interpreter. The last two imports are useful when running in a Jupyter notebook, as they help avoid the errors that commonly occur when the crawl is rerun in the same kernel. Once the crawl has run, the JSON and JSON Lines files are placed in the `./data` directory chosen in the pipeline class and the spider settings.

#### All the values obtained for each country are attributes of that country's coronavirus statistics. A relational database such as PostgreSQL or MySQL would therefore be well suited to storing the data, with one row per country and one column per attribute. As mentioned above, an additional parameter was added to the scraped data to show how it can be used to create new features. Deriving such features would be faster and easier in a relational database, where they can be computed with a query instead of inside the spider. A relational database would also allow more interconnected features to be created and would provide easily accessible storage for many more attributes.
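#### To illustrate the idea, the sketch below loads two records into a relational table and derives the per-capita feature with a query. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL or MySQL so that it runs without a database server. The table name, column names and sample records are assumptions made for this example and do not come from the scraped data.

```python
import sqlite3

# Made-up records in the same shape as the items yielded by the spider.
records = [
    {"country_name": "Exampleland", "Total_cases": "12,000", "Total_deaths": "300",
     "Active_Cases": "1,000", "Total_Population": "100,000"},
    {"country_name": "Samplestan", "Total_cases": "5,000", "Total_deaths": "50",
     "Active_Cases": "N/A", "Total_Population": "200,000"},
]


def to_int(text):
    """Convert a string such as '1,234' to an int; return None for 'N/A' or blanks."""
    if text is None:
        return None
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE country_stats (
           country_name     TEXT PRIMARY KEY,
           total_cases      INTEGER,
           total_deaths     INTEGER,
           active_cases     INTEGER,
           total_population INTEGER
       )"""
)
conn.executemany(
    "INSERT INTO country_stats VALUES (?, ?, ?, ?, ?)",
    [
        (r["country_name"], to_int(r["Total_cases"]), to_int(r["Total_deaths"]),
         to_int(r["Active_Cases"]), to_int(r["Total_Population"]))
        for r in records
    ],
)

# The derived feature is computed by the database; NULL inputs give a NULL result.
rows = conn.execute(
    """SELECT country_name,
              ROUND(100.0 * active_cases / total_population, 5) AS active_pct
       FROM country_stats
       ORDER BY country_name"""
).fetchall()
print(rows)
```

#### Storing the numeric columns as integers, with missing values as NULL, is what lets the query compute the percentage directly instead of parsing strings each time.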