![UKDS Logo](./media/UKDS_Logos_Col_Grey_300dpi.png)

# Web-scraping for Social Science Research

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. We provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
April 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Guide-to-using-this-notebook" data-toc-modified-id="Guide-to-using-this-notebook-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this notebook</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Table-of-Contents" data-toc-modified-id="Table-of-Contents-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Table of Contents</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Collecting-data-from-web-pages-(web-scraping)" data-toc-modified-id="Collecting-data-from-web-pages-(web-scraping)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Collecting data from web pages (web-scraping)</a></span><ul class="toc-item"><li><span><a href="#Reasons-to-engage-in-web-scraping" data-toc-modified-id="Reasons-to-engage-in-web-scraping-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Reasons to engage in web-scraping</a></span></li><li><span><a href="#Logic-of-web-scraping" data-toc-modified-id="Logic-of-web-scraping-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Logic of web-scraping</a></span></li></ul></li><li><span><a href="#Example:-Capturing-Covid-19-data" data-toc-modified-id="Example:-Capturing-Covid-19-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Example: Capturing Covid-19 data</a></span><ul class="toc-item"><li><span><a href="#Identifying-URL-of-web-page" data-toc-modified-id="Identifying-URL-of-web-page-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Identifying URL of web page</a></span></li><li><span><a href="#Locating-information" 
data-toc-modified-id="Locating-information-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Locating information</a></span></li><li><span><a href="#Requesting-the-web-page" data-toc-modified-id="Requesting-the-web-page-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Requesting the web page</a></span></li><li><span><a href="#Parsing-the-web-page" data-toc-modified-id="Parsing-the-web-page-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Parsing the web page</a></span></li><li><span><a href="#Extracting-information" data-toc-modified-id="Extracting-information-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Extracting information</a></span></li><li><span><a href="#Saving-results-from-the-scrape" data-toc-modified-id="Saving-results-from-the-scrape-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Saving results from the scrape</a></span></li><li><span><a href="#Country-level-Covid-19-data" data-toc-modified-id="Country-level-Covid-19-data-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Country-level Covid-19 data</a></span></li><li><span><a href="#A-social-research-example:-charity-data" data-toc-modified-id="A-social-research-example:-charity-data-4.8"><span class="toc-item-num">4.8&nbsp;&nbsp;</span>A social research example: charity data</a></span></li></ul></li></ul></div>

## Introduction

In this training series we cover some of the essential skills needed to collect data from the web. In particular we focus on two different approaches:
1. Collecting data stored on web pages. [Focus of this notebook]
2. Downloading data from online databases using Application Programming Interfaces (APIs). <a href="https://github.com/UKDataServiceOpen/new-forms-of-data/tree/master/web-scraping/notebooks" target=_blank>[LINK]</a>
    
Do not be alarmed by the technical aspects of it: both approaches can be implemented using simple code, a standard desktop or laptop, and a decent internet connection.    

Given the Covid-19 public health crisis in which this programme of work occurred, we will examine ways in which web-scraping techniques can provide valuable data for studying this phenomenon. This is a fast-moving, evolving public health emergency that will shape research agendas across the sciences for years to come. Therefore it is important to learn how we, as social scientists, can access or generate data that will provide a better understanding of this disease.

We will also indulge the research interests of one of the authors of these materials (Diarmuid) by drawing on examples relating to the UK charity sector. Rest assured that this field offers an excellent example of what is possible using web-based data collection techniques: charity regulators increasingly share the data they hold through multiple channels e.g. monthly data downloads, tables stored on web pages and, in some cases, an API. In order to build a (near) complete picture of a given charity sector, it is necessary to interact with all of these data resources (e.g., scrape data from the regulator's website, connect to the API, download files from a data portal etc).

## Guide to using this notebook

This is a <a href="https://jupyter.org/" target=_blank>Jupyter notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (Section 3). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**To interact with this notebook, you only need to execute the code cells which are marked by `In []`.** <br>These cells contain Python code that you can execute in real time.

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself: execute the cell below to produce a short video guide to Jupyter notebooks.

In [None]:
from IPython.display import Video

Video("./media/jupyter-notebook-quick-demo-2020-04-20.mp4", width=600, height=600)

### Table of Contents

There is a table of contents provided at the top of the notebook, but you can also access this menu at any point by clicking the Navigate button on the top toolbar.

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Collecting data from web pages (web-scraping)


### Reasons to engage in web-scraping

Websites can be an important source of publicly available information on phenomena of interest - for instance, they are used to store and disseminate files, text, photos, videos, tables etc. However, the data stored on websites are typically not structured or formatted for ease of use by researchers: for example, it may not be possible to perform a bulk download of all the files you need (think of needing the annual accounts of all registered companies in London for your research...), or the information may not even be held in a file and instead spread across paragraphs and tables throughout a web page (or worse, web pages). Luckily, web-scraping provides a means of quickly and accurately capturing and formatting data stored on web pages.

Before we delve into writing code to capture data from the web, let's clearly state the logic underpinning the technique.

### Logic of web-scraping

We begin by identifying a web page containing information we are interested in collecting. Then we need to **know** the following:
1. The location (i.e., web address) where the web page can be accessed. For example, the UK Data Service homepage can be accessed via <a href="https://ukdataservice.ac.uk/" target=_blank>https://ukdataservice.ac.uk/</a>.
2. The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.

And **do** the following:
3. Request the web page using its web address.
4. Parse the structure of the web page so your programming language can work with its contents.
5. Extract the information we are interested in.
6. Write this information to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed.
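As a rough illustration of what pseudo-code can look like in practice, the sketch below writes the four "do" steps as comments inside a Python function. The function name `scrape_outline` and the short step labels are our own inventions for illustration; no scraping happens here.

```python
# Pseudo-code for a scrape, written as comments before any real code exists.
# Each comment is one task; the order of the comments is the order of execution.

def scrape_outline():
    """Return the planned steps in order (a hypothetical helper, nothing is scraped)."""
    steps = []
    # 1. Request the web page using its web address
    steps.append("request")
    # 2. Parse the structure of the web page
    steps.append("parse")
    # 3. Extract the information we are interested in
    steps.append("extract")
    # 4. Write this information to a file for future use
    steps.append("write")
    return steps

print(scrape_outline())
```

Once the outline reads sensibly, each comment can be replaced with working code, one step at a time.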

For our first example, let's convert the steps above into executable Python code for capturing data about Covid-19 cases.

## Example: Capturing Covid-19 data

Worldometer is a website that provides up-to-date statistics about: the global population; food, water and energy consumption; environmental degradation etc (known as its Real Time Statistics Project). In its own words:<sup>[1]</sup>
> Worldometer is run by an international team of developers, researchers, and volunteers with the goal of making world statistics available in a thought-provoking and time relevant format to a wide audience around the world. Worldometer is owned by Dadax, an independent company. We have no political, governmental, or corporate affiliation.

Since the outbreak of Covid-19 it has provided regular daily snapshots on the progress of this disease, both globally and at a country level.

[1]: https://www.worldometers.info/about/

### Identifying URL of web page

The website can be accessed here: <a href="https://www.worldometers.info/coronavirus/" target=_blank>https://www.worldometers.info/coronavirus/</a>

Let's work through the steps necessary to collect data about the number of Covid-19 cases, deaths and recoveries globally.

First, let's use Python to view this website in our notebook. We can do this by using the `IFrame` function to display external content in the notebook.

In [None]:
from IPython.display import IFrame

IFrame("https://www.worldometers.info/coronavirus/", width="600", height="700")

### Locating information

In the above example, we can see multiple tags enclosing elements of text. For instance, the Covid-19 statistics are enclosed in `<span></span>` tags, which are themselves located within `<div></div>` tags.

Navigating and locating the contents of a web page remains a manual and visual process, and in Brooker's estimation (2020, 252):
> Hence, more so than the actual Python, it's the detective work of unpicking the internal structure of a webpage that is probably the most vital skill here.
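To make the idea of tags nested within other tags more concrete, here is a minimal sketch that prints the hierarchy of a small HTML fragment. The fragment is made up to resemble the structure described above, and the `NestingPrinter` class is a hypothetical helper built on Python's standard `html.parser` module (not the approach used in the rest of this notebook).

```python
from html.parser import HTMLParser

# A small, made-up fragment that mimics the nesting described above
SNIPPET = """
<div id="maincounter-wrap">
  <h1>Coronavirus Cases:</h1>
  <div class="maincounter-number">
    <span>2,503,383</span>
  </div>
</div>
"""

class NestingPrinter(HTMLParser):
    """Record each opening tag together with its depth in the tag hierarchy."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then step one level deeper
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag means we step back up one level
        self.depth -= 1

printer = NestingPrinter()
printer.feed(SNIPPET)

# Indent each tag according to how deeply it is nested
for depth, tag in printer.outline:
    print("  " * depth + "<" + tag + ">")
```

The indentation in the printed outline mirrors what you see when inspecting a page in a web browser: the `<span>` holding the statistic sits inside a `<div>`, which sits inside another `<div>`.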

### Requesting the web page

Now that we possess the necessary information, let's begin the process of scraping the web page. There is a preliminary step, which is setting up Python with the modules it needs to perform the web-scrape.

In [4]:
# Import modules

import os
import requests
import csv
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup as soup

print("Successfully imported necessary modules")

Successfully imported necessary modules


Modules are collections of additional functions that are not available when you launch Python (remember: we are using Python through this notebook). Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.
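If you are unsure whether a module is installed on your machine, one possible check is sketched below using `importlib.util.find_spec()` from Python's standard library. The helper name `is_installed` is our own; the module names are those imported in this notebook (`bs4` is the import name of `BeautifulSoup`).

```python
import importlib.util

# Modules used in this notebook
modules = ["os", "csv", "datetime", "requests", "pandas", "bs4"]

def is_installed(module_name):
    """Return True if Python can locate the module on this machine."""
    return importlib.util.find_spec(module_name) is not None

for name in modules:
    status = "available" if is_installed(name) else "missing (needs to be installed first)"
    print("{}: {}".format(name, status))
```

Modules reported as missing would need to be installed (for example with `pip`) before the import statements above can succeed.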

Now, let's implement the process of scraping the page. First, we need to request the web page using Python; this is analogous to opening a web browser and entering the web address manually. We refer to a page's location on the internet as its web address or Uniform Resource Locator (URL).

In [5]:
# Define the URL where the webpage can be accessed

url = "https://www.worldometers.info/coronavirus/"

# Request the webpage from the URL

response = requests.get(url, allow_redirects=True)
response.status_code

200

First, we declare a variable (also known as an 'object') called `url` that contains the web address of the web page we want to request. Next, we use the `get()` method of the `requests` module to request the web page, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling on the `status_code` attribute of the `response` variable.

Confused? Don't worry, the conventions of Python and using its modules take a bit of getting used to. At this point, just understand that you can store the results of commands in variables, and a variable can have different attributes that can be accessed when needed. Also note that you have a lot of freedom in how you name your variables (subject to certain restrictions - see <a href="https://www.python.org/dev/peps/pep-0008/" target=_blank>here for some guidance</a>).

For example, the following would also work:

In [6]:
web_address = "https://www.worldometers.info/coronavirus/"

scrape_result = requests.get(web_address, allow_redirects=True)
scrape_result.status_code

200

Back to the request:

Good, we get a status code of _200_ - this means we successfully requested the web page. <a href="https://www.textbook.ds100.org/ch/07/web_http.html" target=_blank>Lau, Gonzalez and Nolan</a> provide a succinct description of different types of response status codes:

* **100s** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)
* **200s** - Success: The client's request was successful (e.g. 200 OK, 202 Accepted)
* **300s** - Redirection: Requested URL is located elsewhere; May need user's further action (e.g. 300 Multiple Choices, 301 Moved Permanently)
* **400s** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)
* **500s** - Server Error: Server-side error or server is incapable of performing the request (e.g. 500 Internal Server Error, 503 Service Unavailable)

For clarity:
* **Client**: your machine
* **Server**: the machine you are requesting the web page from
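The grouping of status codes above follows a simple rule: the first digit identifies the category. As a minimal sketch (the function name `describe_status` is hypothetical), this could be expressed in Python as:

```python
def describe_status(status_code):
    """Map an HTTP status code to the broad category listed above."""
    categories = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    # Integer division by 100 keeps only the first digit, e.g. 404 // 100 == 4
    return categories.get(status_code // 100, "Unknown")

for code in [200, 301, 404, 503]:
    print(code, describe_status(code))
```

In a real scrape you would pass `response.status_code` to such a function, or simply check that the code equals 200 before continuing.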

You may be wondering exactly what it is we requested: if you were to type the URL (https://www.worldometers.info/coronavirus/) into your browser and hit `enter`, the web page should appear on your screen. This is not the case when we request the URL through Python but rest assured, we have successfully requested the web page. To see the content of our request, we can call the `text` attribute of the `response` variable:

In [7]:
response.text

'\n<!DOCTYPE html>\n<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->\n<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->\n<!--[if !IE]><!-->\n<html lang="en">\n<!--<![endif]-->\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>Coronavirus Update (Live): 2,503,383 Cases and 171,796 Deaths from COVID-19 Virus Pandemic - Worldometer</title>\n<meta name="description" content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates">\n\n<link rel="shortcut icon" href="/favicon/favicon.ico" type="image/x-icon">\n<link rel="apple-touch-icon" sizes="57x57" href="/favicon/apple-icon-57x57.png">\n<link re

This shows us the underlying code (HTML) of the web page we requested. It should be obvious that in its current form, it will be difficult to work with. This is where the `BeautifulSoup` module comes in handy.

(See Appendix C for more examples of how `requests` works and what information it returns.)
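Because `response.text` is an ordinary (if very long) Python string, standard string operations can be used to inspect it without printing everything. The sketch below uses a short, made-up string called `page_text` as a stand-in for `response.text`, so it runs without an internet connection.

```python
# Stand-in for response.text: a short, made-up HTML string
page_text = "<!DOCTYPE html>\n<html>\n<head>\n<title>Example page</title>\n</head>\n</html>"

# Total number of characters in the page source
print(len(page_text))

# Preview only the first 30 characters rather than the whole string
print(page_text[:30])

# Check whether a piece of text appears anywhere in the source
print("<title>" in page_text)
```

Slicing in this way is handy for a quick look at a page, but it does not help us locate specific elements, which is the job of the parsing step.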

### Parsing the web page

Now it's time to identify and understand the structure of the web page we requested. We do this by converting the content contained in the `response.text` attribute into a `BeautifulSoup` variable. `BeautifulSoup` is a Python module that provides a systematic way of navigating the elements of a web page and extracting its contents. Let's see how it works in practice:

In [8]:
# Extract the contents of the webpage from the response

soup_response = soup(response.text, "html.parser")
soup_response


<!DOCTYPE html>

<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Coronavirus Update (Live): 2,503,383 Cases and 171,796 Deaths from COVID-19 Virus Pandemic - Worldometer</title>
<meta content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates" name="description"/>
<link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="/favico

The mass of text that is produced should look familiar: it is the full version of the source code [we examined earlier](#source_code_example). Note again how we call on a function from a module (here `soup()`, the name we gave to `BeautifulSoup` when importing it from `bs4`) and store the results in a variable (`soup_response`).

How do we navigate such voluminous results? Thankfully the `BeautifulSoup` module provides some intuitive methods for doing so.

In [9]:
# Find the sections containing the data of interest

sections = soup_response.find_all("div", id="maincounter-wrap")
sections

[<div id="maincounter-wrap" style="margin-top:15px">
 <h1>Coronavirus Cases:</h1>
 <div class="maincounter-number">
 <span style="color:#aaa">2,503,383 </span>
 </div>
 </div>,
 <div id="maincounter-wrap" style="margin-top:15px">
 <h1>Deaths:</h1>
 <div class="maincounter-number">
 <span>171,796</span>
 </div>
 </div>,
 <div id="maincounter-wrap" style="margin-top:15px;">
 <h1>Recovered:</h1>
 <div class="maincounter-number" style="color:#8ACA2B ">
 <span>659,300</span>
 </div>
 </div>]

We used the `find_all()` method to search for all `<div>` tags where the id="maincounter-wrap". And because there is more than one set of tags matching this id, we get a list of results. We can check how many tags match this id by calling on the `len()` function:

In [10]:
len(sections)

3

We can view each element in the list of results as follows:

In [11]:
for section in sections:
    print("--------")
    print(section)
    print("--------")
    print("\r") # print some blank space for better formatting

--------
<div id="maincounter-wrap" style="margin-top:15px">
<h1>Coronavirus Cases:</h1>
<div class="maincounter-number">
<span style="color:#aaa">2,503,383 </span>
</div>
</div>
--------

--------
<div id="maincounter-wrap" style="margin-top:15px">
<h1>Deaths:</h1>
<div class="maincounter-number">
<span>171,796</span>
</div>
</div>
--------

--------
<div id="maincounter-wrap" style="margin-top:15px;">
<h1>Recovered:</h1>
<div class="maincounter-number" style="color:#8ACA2B ">
<span>659,300</span>
</div>
</div>
--------
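To see the difference between searching for one tag and searching for all matching tags, the sketch below applies `find()` and `find_all()` to a small, made-up HTML fragment modelled on the output above. It assumes the `bs4` module is installed; the variable names are our own.

```python
from bs4 import BeautifulSoup

# A made-up fragment modelled on the output shown above
html = """
<div id="maincounter-wrap"><h1>Coronavirus Cases:</h1>
  <div class="maincounter-number"><span>2,503,383 </span></div></div>
<div id="maincounter-wrap"><h1>Deaths:</h1>
  <div class="maincounter-number"><span>171,796</span></div></div>
"""

page = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches)
first = page.find("div", id="maincounter-wrap")
print(first.find("h1").text)

# find_all() returns every matching tag as a list
matches = page.find_all("div", id="maincounter-wrap")
print(len(matches))

# class_ (with a trailing underscore) filters on the HTML class attribute
numbers = page.find_all("div", class_="maincounter-number")
print([n.find("span").text.strip() for n in numbers])
```

Searching by `class_` is a possible alternative when a page does not provide a convenient `id` for the elements you need.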



### Extracting information

We are nearing the end of our scrape. The penultimate task is to extract the statistics within the `<span>` tags and store them in some variables. We do this by accessing each item in the _sections_ list using its positional value (index).

In [12]:
cases = sections[0].find("span").text.replace(" ", "").replace(",", "")
deaths = sections[1].find("span").text.replace(",", "")
recov = sections[2].find("span").text.replace(",", "")
print("Number of cases: {}; deaths: {}; and recoveries: {}.".format(cases, deaths, recov))

Number of cases: 2503383; deaths: 171796; and recoveries: 659300.


The above code performs a couple of operations:
* For each item (i.e., set of `<div>` tags) in the list, it finds the `<span>` tags and extracts the text enclosed within them.
* We clean the text by removing blank spaces and commas.

In this example, referring to an item's positional index works because our list of `<div>` tags stored in the `sections` variable is ordered: the tag containing the number of cases appears before the tag containing the number of deaths, which appears before the tag containing the number of recovered patients.

In Python, indexing begins at zero (in R indexing begins at 1). Therefore, the first item in the list is accessed using `sections[0]`, the second using `sections[1]` etc.
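The cleaning and indexing steps can be tried out on their own, without requesting the web page. In this sketch the helper `clean_count` and the `labels` list are hypothetical; the raw values are the text that appeared inside the `<span>` tags above.

```python
def clean_count(raw_text):
    """Convert text such as '2,503,383 ' into the integer 2503383."""
    # strip() removes surrounding whitespace; replace() removes the commas
    return int(raw_text.strip().replace(",", ""))

raw_values = ["2,503,383 ", "171,796", "659,300"]  # text as it appears in the <span> tags
labels = ["cases", "deaths", "recoveries"]

# Pair each label with its cleaned value instead of relying on position alone
statistics = {label: clean_count(value) for label, value in zip(labels, raw_values)}

print(statistics["cases"])  # access by name
print(raw_values[0])        # access by position: index 0 is the first item
```

Converting the text to integers is optional when the values are only being saved to a file, but it is necessary before doing any arithmetic with them.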

### Saving results from the scrape

The final task is to save the variables to a file that we can use in the future. We'll write to a Comma-Separated Values (CSV) file for this purpose, as it is an open, text-based file format that is very common for sharing data on the web.

In [13]:
# Create a downloads folder

try:
    os.mkdir("./downloads")
except FileExistsError:
    # Only report this message when the folder is genuinely already there
    print("Unable to create folder: already exists")

Unable to create folder: already exists


The use of "./" tells the `os.mkdir()` command that the "downloads" folder should be created inside the current working directory, which is normally the directory where this notebook is located. So if this notebook was stored in a directory located at "C:/Users/joebloggs/notebooks", the `os.mkdir()` command would result in a new folder located at "C:/Users/joebloggs/notebooks/downloads".
   
Technically the "./" is not needed and you could just write `os.mkdir("downloads")` but it's good practice to be explicit.
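To check where a relative path such as "./downloads" actually points, and to create a folder without needing to handle an error when it already exists, something like the following could be used. This is a sketch rather than the approach taken in this notebook: it uses `os.makedirs()` with `exist_ok=True`, and creates the folder inside a temporary directory so that nothing is left behind on your machine.

```python
import os
import tempfile

# "./downloads" is resolved relative to the current working directory
print(os.getcwd())
print(os.path.abspath("./downloads"))

# Demonstrate folder creation inside a temporary directory
with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "downloads")
    os.makedirs(folder, exist_ok=True)  # creates the folder
    os.makedirs(folder, exist_ok=True)  # second call: no error is raised
    created = os.path.isdir(folder)

print(created)
```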

In [14]:
# Write the results to a CSV file

date = datetime.now().strftime("%Y-%m-%d")

variables = ["Cases", "Deaths", "Recoveries"]
outfile = "./downloads/covid-19-statistics-" + date + ".csv" 
obs = cases, deaths, recov

with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(variables)
    writer.writerow(obs)

The code above defines some headers and a name and location for the file which will store the results of the scrape. We then open the file in *write* mode, write the headers to the first row, and write the statistics to the second row.

How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it.

In [15]:
# Check presence of file in "downloads" folder

os.listdir("./downloads")

['annual-accounts-211535-2014.pdf',
 'annual-accounts-211535-2015.pdf',
 'annual-accounts-211535-2016.pdf',
 'annual-accounts-211535-2017.pdf',
 'annual-accounts-211535-2018.pdf',
 'annual-accounts-211535-2019.pdf',
 'charity-beneficiaries-2020-04-02.csv',
 'charity-beneficiaries-2020-04-15.csv',
 'charity-beneficiaries-log-2020-04-02.csv',
 'covid-19-country-statistics-2020-04-15.csv',
 'covid-19-country-statistics-2020-04-20.csv',
 'covid-19-statistics-2020-04-15.csv',
 'covid-19-statistics-2020-04-21.csv']

In [16]:
# Open file and read (import) its contents

with open(outfile, "r") as f:
    data = f.read()
    
print(data)    

Cases,Deaths,Recoveries
2503383,171796,659300
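Reading the file back as plain text confirms its contents. As an alternative sketch, `csv.DictReader` returns each row as a dictionary keyed by the headers, which makes individual values easier to pick out. Here `file_contents` is a stand-in for the saved file, holding the same two lines printed above.

```python
import csv
import io

# Stand-in for the saved file: the same two lines printed above
file_contents = "Cases,Deaths,Recoveries\n2503383,171796,659300\n"

# DictReader uses the first row as keys for every following row
reader = csv.DictReader(io.StringIO(file_contents))
rows = list(reader)

print(rows[0]["Cases"])

# Values are read as text, so convert them before doing arithmetic
print(int(rows[0]["Cases"]) - int(rows[0]["Recoveries"]))
```

With a real file you would pass the open file object (`f`) to `csv.DictReader` instead of the `io.StringIO` wrapper.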



And voilà, we have successfully carried out a web-scrape!

### Country-level Covid-19 data

We will complete our work gathering data on the Covid-19 pandemic by employing the techniques we learned previously to capture country-level statistics.

**TASK**: Document the code below so that a person new to Python and web-scraping could understand what is happening in each block. Once you've written your notes, execute the code to see the results. (If you get stuck, see Appendix A for an annotated copy of the script)

In [17]:
from IPython.display import IFrame

IFrame("https://www.worldometers.info/coronavirus/", width="600", height="700")

In [20]:
import os
import requests
import csv
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup as soup

date = datetime.now().strftime("%Y-%m-%d")

url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url, allow_redirects=True)
print(response.status_code)

soup_response = soup(response.text, "html.parser")

table = soup_response.find("table", id="main_table_countries_today").find("tbody")
rows = table.find_all("tr", style="")

200


In [21]:
global_info = []
for row in rows:
    columns = row.find_all("td")
    country_info = [column.text.strip() for column in columns]
    global_info.append(country_info)

print(global_info[0:5])
print("\r")
print("Number of rows in table: {}".format(len(global_info)))
print("\r")

[['World', '2,503,392', '+22,889', '171,796', '+1,399', '659,457', '1,672,139', '57,603', '321', '22.0', '', '', 'All'], ['USA', '792,938', '+179', '42,518', '+4', '72,389', '678,031', '13,951', '2,396', '128', '4,027,367', '12,167', 'North America'], ['Spain', '204,178', '+3,968', '21,282', '+430', '82,514', '100,382', '7,705', '4,367', '455', '930,230', '19,896', 'Europe'], ['Italy', '181,228', '', '24,114', '', '48,877', '108,237', '2,573', '2,997', '399', '1,398,024', '23,122', 'Europe'], ['France', '155,383', '', '20,265', '', '37,409', '97,709', '5,683', '2,380', '310', '463,662', '7,103', 'Europe']]

Number of rows in table: 210



In [24]:
try:
    os.mkdir("./downloads")
except OSError as error:
    print("Folder already exists")

variables = ["Country", "Total Cases", "New Cases", "Total Deaths", 
            "New Deaths", "Total Recovered", "Active Cases", 
            "Serious_Critical", "Total Cases Per 1m Pop", "Deaths Per 1m Pop",
            "Total Tests", "Tests Per 1m Pop"]
outfile = "./downloads/covid-19-country-statistics-" + date + ".csv"
print(outfile)

with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(variables)
    for country in global_info:
        writer.writerow(country)

Folder already exists
./downloads/covid-19-country-statistics-2020-04-21.csv


In [29]:
data = pd.read_csv(outfile, encoding = "ISO-8859-1", index_col=False)
data.sample(5)

Unnamed: 0,Country,Total Cases,New Cases,Total Deaths,New Deaths,Total Recovered,Active Cases,Serious_Critical,Total Cases Per 1m Pop,Deaths Per 1m Pop,Total Tests,Tests Per 1m Pop
0,World,2503392,22889.0,171796.0,1399.0,659457,1672139,57603.0,321,22.0,,
59,Kuwait,2080,85.0,11.0,2.0,412,1657,46.0,487,3.0,,
169,Nepal,32,1.0,,,4,28,,1,,29567.0,1015.0
177,Belize,18,,2.0,,2,14,1.0,45,5.0,651.0,1637.0
75,Slovakia,1199,26.0,14.0,1.0,258,927,7.0,220,3.0,49428.0,9053.0


In [30]:
data[data["Country"]=="China"]

Unnamed: 0,Country,Total Cases,New Cases,Total Deaths,New Deaths,Total Recovered,Active Cases,Serious_Critical,Total Cases Per 1m Pop,Deaths Per 1m Pop,Total Tests,Tests Per 1m Pop
209,China,82758,11,4632,,77123,1003,82,57,3,,


In [31]:
data[data["Country"]=="Ireland"]

Unnamed: 0,Country,Total Cases,New Cases,Total Deaths,New Deaths,Total Recovered,Active Cases,Serious_Critical,Total Cases Per 1m Pop,Deaths Per 1m Pop,Total Tests,Tests Per 1m Pop
18,Ireland,15652,,687,,77,14888,294,3170,139,90646,18358
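One detail worth checking when documenting the script: each scraped row shown above contains 13 values, while the `variables` list names only 12 columns. The sketch below compares the two lengths before pairing headers with values. The label "Continent" is our assumption about what the final, unlabelled value represents (the rows shown end with values such as 'All', 'North America' and 'Europe').

```python
headers = ["Country", "Total Cases", "New Cases", "Total Deaths",
           "New Deaths", "Total Recovered", "Active Cases",
           "Serious_Critical", "Total Cases Per 1m Pop", "Deaths Per 1m Pop",
           "Total Tests", "Tests Per 1m Pop"]

# First scraped row, copied from the output above
row = ['World', '2,503,392', '+22,889', '171,796', '+1,399', '659,457',
       '1,672,139', '57,603', '321', '22.0', '', '', 'All']

print(len(headers), len(row))

# Assumption: the unlabelled final value is the continent or region
if len(row) == len(headers) + 1:
    headers = headers + ["Continent"]

# Pair each header with the value in the same position
record = dict(zip(headers, row))
print(record["Country"], record["Continent"])
```

Comparing the number of headers with the number of values in a row is a quick way to spot columns that would otherwise be dropped or mislabelled when the file is read back in.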


### A social research example: charity data

This example introduces slightly more complicated web pages, and techniques for handling exceptions, downloading files and more. If you feel comfortable with what you've learned so far then we highly recommend completing this lesson; if not take some more time to digest the Covid-19 example and return to it at a later date.

[Charity Data Example - Appendix C](#section_9_3)
