# Fetching files and working with them: `requests` and `os`

There are alternatives to `pandas` when it comes to fetching data. We might, for example, want to download a number of files and then work with those files directly (as opposed to using dataframes).

Journalistic scenarios where this might come in useful include:
* You've been sent multiple files and need to find ones that contain key words
* You need to combine multiple files or extract the same piece of data from each

The [`os` library](https://docs.python.org/3/library/os.html) allows you to [use 'command line'-type commands](https://www.geeksforgeeks.org/os-module-python-examples/) to work with files that have been downloaded to the computer. 

This approach can have the advantage of being faster than using software interfaces like Excel.

However, remember that in *Colab* the operating system isn't on your computer - it's on a remote server that is hosted by Google. 

You can navigate through files and folders in that system using the *Files* area in the left-hand column in Colab. Make sure this is expanded while you run the code below.

Start by importing the `os` library.

In [None]:
#import the library
import os

## Find out what the 'current working directory' is: `.getcwd()`

The current working directory is basically the folder that Python is working in right now. 

If we want to access any files and do things with them, we need to either make sure they're in that folder, or use a path that's *relative to* that folder (e.g. 'go into another folder and find the file there').

To find out what the current working directory is you can use the `.getcwd()` function from `os`.

Running it in Colab should tell us that we're in a folder called 'content'.

In [None]:
os.getcwd()

'/content'
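The idea of a path being *relative to* the working directory can be sketched as follows. This is only an illustration: the file name 'myfile.csv' is made up, and the code just builds the paths without checking that anything exists there.

```python
import os

#the folder Python is working in right now
cwd = os.getcwd()

#a relative path: 'go into sample_data and find myfile.csv' (a hypothetical file)
relative_path = os.path.join("sample_data", "myfile.csv")

#the same location written out in full, starting from the top of the file system
full_path = os.path.abspath(relative_path)

print(relative_path)
print(full_path)
```

In Colab, where the working directory is '/content', the second line printed should be the relative path with '/content/' in front of it.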

If we look at the *Files* area in Colab, we can't see any folder called 'content'. But then, we don't know what the folder we are in is called.

We have a couple of options here: 

* We can move around to try to see where `os` is
* We can move `os` around 

## List files in the current directory: `.listdir()`

A good way to work out where `os` is is to use its `listdir()` function.

This will show you what files and folders it can see in its current directory.

In [None]:
#list files in this directory
os.listdir()

['.config',
 'UK Gender Pay Gap Data - 2021 to 2022.csv',
 'UK Gender Pay Gap Data - 2020 to 2021.csv',
 'sample_data']

Now we can see the 'sample_data' folder that we can also see in the *Files* view. That tells us that we *are* in the same place as `os` - we just didn't know that the folder we were in was called 'content'.

You can check this yourself by clicking on the 'up one level' button in *Files* (the folder icon with an arrow on it). 

This will take you to the folder containing the folder you were in before - and you'll be able to see all the other folders, too. 

This can be disorientating, though - which of those folders were you in? 

It's at this point that you remember `os` saying the current working directory was 'content', so you can click on the arrow next to that to expand its contents.

### Listing files in a subdirectory

The `.listdir()` function defaults to the current directory - but you can also specify a directory, or a path to a certain directory, to find out what's there.

In [None]:
#list the files in the 'sample_data' folder
os.listdir("sample_data")

['anscombe.json',
 'README.md',
 'mnist_test.csv',
 'california_housing_train.csv',
 'mnist_train_small.csv',
 'california_housing_test.csv']

## Moving about in the folders: `.chdir()`

Once you know where you are by using `.getcwd()`, and what's in that folder by using `.listdir()`, you can navigate using `.chdir()` (change directory).

Unlike the other functions, `.chdir()` needs an ingredient: which directory you want to change to. 

To remind yourself which directories you could move to, use `.listdir()` again. The results come in a list, which includes the folders `'.config'` and `'sample_data'`.

Let's change into the 'sample_data' directory, then, and see what's there.

In [None]:
#change into the 'sample_data' directory
os.chdir('sample_data')
#show the current working directory now
print(os.getcwd())
#show what files and folders are in here
print(os.listdir())

/content/sample_data
['anscombe.json', 'README.md', 'mnist_test.csv', 'california_housing_train.csv', 'mnist_train_small.csv', 'california_housing_test.csv']


This directory has a number of files - we can guess they're files and not folders because they have file extensions, like .json, .md and .csv.
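File extensions are a helpful clue rather than a guarantee: a folder can have a dot in its name, and a file can have no extension at all. A more reliable check, sketched below, is to ask `os` directly using `os.path.isdir()`. The folder and file names here are invented for the demonstration, and they are created in a temporary directory so nothing in your own folders is touched.

```python
import os
import tempfile

#make a temporary folder to experiment in (it is removed automatically afterwards)
with tempfile.TemporaryDirectory() as tmp:
    #create one subfolder and one empty file - both names are made up
    os.mkdir(os.path.join(tmp, "folder.with.dots"))
    open(os.path.join(tmp, "notes"), "w").close()

    #label each item as a folder or a file
    labels = {}
    for name in sorted(os.listdir(tmp)):
        if os.path.isdir(os.path.join(tmp, name)):
            labels[name] = "folder"
        else:
            labels[name] = "file"

print(labels)
```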

This command is especially useful if you want to loop through a number of files and repeat an action with each (e.g. extracting data or combining them).
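As a rough sketch of that kind of loop, the code below searches every .txt file in a folder for a key word, which is the first journalistic scenario mentioned at the start. The file names, their contents and the key word are all invented for the example, and the files are created in a temporary folder.

```python
import os
import tempfile

keyword = "contract"  #hypothetical search term

with tempfile.TemporaryDirectory() as tmp:
    #create some small example files - names and contents are made up
    examples = {
        "memo1.txt": "The contract was signed in March.",
        "memo2.txt": "Nothing of interest here.",
        "memo3.txt": "A second CONTRACT followed.",
    }
    for name, text in examples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)

    #loop through the files and note which ones mention the key word
    matches = []
    for name in sorted(os.listdir(tmp)):
        #skip anything that isn't a .txt file
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(tmp, name), encoding="utf-8") as f:
            #compare in lower case so 'CONTRACT' and 'contract' both count
            if keyword.lower() in f.read().lower():
                matches.append(name)

print(matches)
```

With your own files you would swap the temporary folder for the folder you have navigated to, and change the extension and key word to suit.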

### Moving back 'up' a level: `".."`

What if you don't want to move to a folder in this directory - but want to move back up to the folder you came from? 

In that case you can use two periods - `".."` - to indicate 'the folder one level up'.

Here's that being used in practice, to get us back to the 'content' folder we started in.

In [None]:
#change directory to one level above this one
os.chdir("..")
#show the current working directory
os.getcwd()

'/content'
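Using `".."` relies on knowing how many levels down you have gone. An alternative, sketched here on the assumption that you want to end up exactly where you began, is to store the starting directory in a variable and change back to it at the end. The temporary folder stands in for any folder you might move into.

```python
import os
import tempfile

#remember where we started
start = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    #move into another folder and do some work there
    os.chdir(tmp)
    inside = os.getcwd()
    #go straight back to the starting folder, however deep we went
    os.chdir(start)

#check that we are back where we began
print(os.getcwd() == start)
```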

## Putting this into practice: combining gender pay gap data

There are a few good sources of data that we can use to test this out: 

* [MPs' expenses in CSVs for each year](https://www.theipsa.org.uk/mp-staffing-business-costs/annual-publications)
* [Food ratings in XML format for each authority](https://ratings.food.gov.uk/open-data/en-gb)
* And [gender pay gap data for each year](https://gender-pay-gap.service.gov.uk/viewing/download)

Download the two most recent gender pay gap CSV files and upload them to the same folder ('content') we are in (the default folder in the Files view).

We can now use something called **command line** to combine those.

## Combining files using `.system()` and command line

The `.system()` function in `os` allows you to run **command line** within Python. 

Command line is actually a separate language to Python - it's used on Linux and Mac computers to interact with files directly on the computer. 

We aren't going to learn it here but there is one *very* useful command in command line which we can use to combine files: `cat` (short for 'concatenate').

To use `cat` to combine all the CSV files into one, you can just reuse this code:

`cat *.csv > alldata.csv`

This breaks down like so:

* Use the `cat` command
* On any files ending in '.csv' (the asterisk is a wildcard meaning 'any sequence of characters')
* Put the results... (the `>` operator)
* ...in a file called 'alldata.csv'
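To check which file names a pattern like `*.csv` would pick up before running anything, Python's built-in `fnmatch` module applies a similar style of wildcard to a list of names. The names below are taken from the folder listing earlier, plus 'notes.txt', which is invented for contrast.

```python
from fnmatch import fnmatch

#file and folder names to test against the pattern
names = [
    ".config",
    "UK Gender Pay Gap Data - 2021 to 2022.csv",
    "UK Gender Pay Gap Data - 2020 to 2021.csv",
    "sample_data",
    "notes.txt",
]

#keep only the names that the wildcard pattern matches
csv_names = [name for name in names if fnmatch(name, "*.csv")]
print(csv_names)
```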

To use that command in Python you just put it in quotation marks inside `os.system()` like so:

In [None]:
#combine all CSV files in the folder into one
os.system("cat *.csv > alldata.csv")

0

Ignore the output above (a `0` just means the command ran without reporting an error) - instead look in the *Files* view on the left in Colab and refresh the view. You should see the new file 'alldata.csv' that has been created. 

You can now download that to check it's worked.

Note that the two files will have been combined *including the header row*. In other words, you will find an extra header row in the middle of the data where one CSV has been joined to the other. Try sorting your data to get all those rows together, and then delete the duplicates (or use the built-in remove duplicate rows option in Excel).

Remember also that this code will work on any collection of CSV files as long as:

* You have navigated to the same folder as the files using `.chdir()`
* All the files end in .csv (make sure there aren't any other CSV files in there you don't want to include)
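If you would rather avoid the repeated header rows in the first place, one option is to do the combining in Python instead of with `cat`. The sketch below is one possible approach, not the only one. It assumes every file has the same single-line header and ends with a line break, and it uses two tiny made-up CSV files (with invented column names and values) in a temporary folder.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    #two small example files - names, columns and values are made up
    with open(os.path.join(tmp, "year1.csv"), "w", encoding="utf-8") as f:
        f.write("Employer,Gap\nAlpha Ltd,10\n")
    with open(os.path.join(tmp, "year2.csv"), "w", encoding="utf-8") as f:
        f.write("Employer,Gap\nBeta Ltd,5\n")

    #collect the input files BEFORE creating the output file,
    #so the output is not treated as one of the inputs
    csv_paths = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    out_path = os.path.join(tmp, "alldata.csv")

    with open(out_path, "w", encoding="utf-8") as out:
        for i, path in enumerate(csv_paths):
            with open(path, encoding="utf-8") as f:
                header = f.readline()
                #write the header from the first file only
                if i == 0:
                    out.write(header)
                #copy the remaining rows across
                for line in f:
                    out.write(line)

    #read the combined file back in to check it
    with open(out_path, encoding="utf-8") as f:
        combined = f.read()

print(combined)
```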

## Fetching files using `requests`

A [useful library for fetching files](https://betterprogramming.pub/3-simple-ways-to-download-files-with-python-569cb91acae6) is `requests`. Let's import that.

In [None]:
import requests

### The `.get()` method

The `requests` library has a method called `.get()` which will fetch a page and store it in a requests 'object'.

Say, for example, you wanted to get the webpage for a [politician's register of interests](https://publications.parliament.uk/pa/cm/cmregmem/220117/hancock_matt.htm). Here is the code to do it for one page:

In [None]:
#get a webpage and store in a variable called 'rqobject'. This is a 'requests object'.
rqobject = requests.get("https://publications.parliament.uk/pa/cm/cmregmem/220117/hunt_jeremy.htm")

When you try to print that variable you get something unexpected, however.

In [None]:
#print the object you've created
print(rqobject)

<Response [403]>


### Showing the status of your request

In order to work with a requests object you need to know how to call its various properties. 

For example, one of its properties is a status code: that tells you whether your attempt to fetch it was successful. 

(One status code you'll have seen when browsing the web is a '404 error' which means "page not found", but there are [lots of others too](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_client_errors))

You fetch that code by adding `.status_code` to the name of the requests object like this:

In [None]:
#show the 'status_code' property of that object
rqobject.status_code

403
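If you want a quick reminder of what a status code means, Python's built-in `http` module includes the standard names. This sketch just looks up a few common codes; it doesn't fetch anything.

```python
from http import HTTPStatus

#look up the standard phrase for a few common status codes
phrases = {}
for code in [200, 403, 404]:
    phrases[code] = HTTPStatus(code).phrase
    print(code, phrases[code])
```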

### Showing the 'content' of a request

A 403 status code [means](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403) "the server understands the request but refuses to authorize it."

In other words, our attempt to access the webpage has been blocked. 

But we can still see the *content* that our request managed to get hold of, by adding `.content` to the requests object.

In [None]:
#print the content of the object
print(rqobject.content)



### Showing the text of a request object

An alternative to `.content` is `.text`. This is easier to read, even if it's not massively clear what the difference is. 

[One very succinct explanation puts it like this:](https://stackoverflow.com/questions/17011357/what-is-the-difference-between-content-and-text) "r.text is the content of the response in Unicode, and r.content is the content of the response in bytes."

It's not worth going into the difference right now, but this can come in useful later.
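For the curious, the difference can be sketched without fetching anything. Bytes are the raw data that travels over the network; text is what you get once those bytes have been decoded. The example string is made up, and the comparison with `.content` and `.text` is a simplification.

```python
#some text containing an accented character
text = "café"

#encode it into bytes - roughly the kind of thing .content holds
as_bytes = text.encode("utf-8")
print(as_bytes)

#decode the bytes back into text - roughly the kind of thing .text gives you
as_text = as_bytes.decode("utf-8")
print(as_text)
```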

In [None]:
#print the text
rqobject.text



### Using 'pretty printer' to make it easier to read

Although the `.text` is easier to read than the `.content`, there is a small built-in library that makes long strings like this more readable still - it's called 'Data pretty printer', or `pprint`.

We can import just one part of that library like this:

In [None]:
#import pretty printer to show the text in a more readable way
from pprint import pprint

And we can then use `pprint` instead of `print`

In [None]:
#show the content of our object
pprint(rqobject.text)

('<!DOCTYPE html>\n'
 '<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->\n'
 '<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->\n'
 '<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->\n'
 '<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->\n'
 '<head>\n'
 '\n'
 '<title>Please Wait... | Cloudflare</title>\n'
 '  \n'
 '<meta name="captcha-bypass" id="captcha-bypass" />\n'
 '<meta charset="UTF-8" />\n'
 '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n'
 '<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />\n'
 '<meta name="robots" content="noindex, nofollow" />\n'
 '<meta name="viewport" content="width=device-width,initial-scale=1" />\n'
 '<link rel="stylesheet" id="cf_styles-css" '
 'href="/cdn-cgi/styles/cf.errors.css" type="text/css" '
 'media="screen,projection" />\n'
 '<!--[if lt IE 9]><link rel="stylesheet" id=\'cf_styles-ie-css\' '
 'href="/cdn-cgi/st

There's a lot here, but scan through it and at some point you might notice that there's text here for a CAPTCHA. In other words, what we've got is a page that's asking us to prove that we're human.

And, of course, 'we' are not. We are a Python script, and that's why we've triggered this 403.

### An example of a successful request

Before we explore how to handle problems like the 403 status code, it's useful to see a successful request. 

The status code for a successful request is 200. 

In [None]:
#fetch another url, and store in an object
ojb = requests.get("https://onlinejournalismblog.com/")
#print the status code - 200 means 'success'
print(ojb.status_code)

200


In [None]:
#print the content of the requests object 'ojb'
pprint(ojb.text)

('<!DOCTYPE html>\n'
 '<!--[if IE 7]>\n'
 '<html class="ie ie7" lang="en">\n'
 '<![endif]-->\n'
 '<!--[if IE 8]>\n'
 '<html class="ie ie8" lang="en">\n'
 '<![endif]-->\n'
 '<!--[if !(IE 7) & !(IE 8)]><!-->\n'
 '<html lang="en">\n'
 '<!--<![endif]-->\n'
 '<head>\n'
 '<meta charset="UTF-8" />\n'
 '<meta name="viewport" content="width=device-width" />\n'
 '<title>Online Journalism Blog | Comment, analysis and links covering online '
 'journalism and online news, citizen journalism, blogging, vlogging, '
 'photoblogging, podcasts, vodcasts, interactive storytelling, publishing, '
 'Computer Assisted Reporting, User Generated Content, searching and all '
 'things internet.</title>\n'
 '<link rel="profile" href="https://gmpg.org/xfn/11" />\n'
 '<link rel="pingback" href="https://onlinejournalismblog.com/xmlrpc.php">\n'
 '<!--[if lt IE 9]>\n'
 '<script '
 'src="https://s2.wp.com/wp-content/themes/pub/twentytwelve/js/html5.js?ver=3.7.0" '
 'type="text/javascript"></script>\n'
 '<![endif]-->\n'

Again, there's a whole bunch of HTML but this time you can see we get the page we expected.

## Saving your fetched object as a local file using `open` and `.write()`

So far this requests object only exists in a variable in Python. 

But we can use the `open` function [to create](https://www.w3schools.com/python/python_file_write.asp) a new HTML file and then `.write()` the text property of our requests object into it. 

In [None]:
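#open a file in append mode - if it doesn't exist yet, it is created empty
fd = open("thisisanewpage.html","a")
#write the .text property of the variable 'ojb' into that object
#.write() returns the number of characters written, which we store
chars_written = fd.write(ojb.text)
#close the file so that everything is saved to disk
fd.close()
#show how many characters were written
chars_written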

214098

Refresh the *Files* view on the left in Colab and you should see the new file. Download it and open it to test it's worked. 

The `open` function is [part of basic Python](https://www.w3schools.com/python/ref_func_open.asp). It "opens a file, and returns it as a file object".

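You need to specify the name of the file even if it doesn't exist (in which case it will create it), but it also needs a second ingredient: the 'mode' you want to open it in. Specifying `"a"` means it's in 'append' mode, so anything we write is added to the end of the file. That also means that running the cell a second time adds a second copy of the page; use `"w"` ('write' mode) instead if you want to replace the file's contents each time.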

Once created, the 'file object' is stored in a variable - in this case, we store the results of the `open` function in a variable called 'fd'. 

Like many objects created by functions, it has certain properties, including the method `.write()` which allows you to write information into that file object. 

So, `fd.write(ojb.text)` uses the `.write()` method of the object `fd` to write the results of using `ojb.text` into it. 

The empty HTML file, then, now has that text appended to it. 
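A common alternative to opening and closing the file yourself is a `with` block, which closes the file automatically when the indented code finishes. The sketch below also shows the difference between append mode and write mode. The file name and text are placeholders, and a temporary folder is used so it doesn't touch your own files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.html")  #placeholder file name

    #append mode: each write is added to the end of the file
    for _ in range(2):
        with open(path, "a", encoding="utf-8") as f:
            f.write("<p>hello</p>")
    with open(path, encoding="utf-8") as f:
        appended = f.read()

    #write mode: the file is emptied first, so only the latest text remains
    for _ in range(2):
        with open(path, "w", encoding="utf-8") as f:
            f.write("<p>hello</p>")
    with open(path, encoding="utf-8") as f:
        overwritten = f.read()

print(appended)
print(overwritten)
```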

## Repeating that process for a CSV

The process is pretty much the same for a CSV - here's how we might do it for the gender pay gap data. 

In [None]:
#fetch the CSV from its link URL - store in a variable called gpg23
gpg23 = requests.get("https://gender-pay-gap.service.gov.uk/viewing/download-data/2022")
#open a file in append mode (creating it if needed) - make sure it ends in .csv
fd = open("gpg23.csv","a")
#write the .text property of the variable 'gpg23' into that object - a CSV file is just text 
fd.write(gpg23.text)
#close the file so that everything is saved to disk
fd.close()

## For PDFs: use `.content`

The PDF format isn't a text format like .html or .csv or .txt - so [instead of calling the `.text` property of the requests object, use `.content`](https://stackoverflow.com/questions/34503412/download-and-save-pdf-file-with-python-requests-module).

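We also use slightly different code when saving it: the file is opened in `'wb'` ('write binary') mode, inside a `with` block that closes the file automatically once the writing is done.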

In [None]:
#here's a PDF URL
pdfurl = "https://dataharvest.eu/wp-content/uploads/2019/06/2019-05-The-Bureau-Local-Building-the-Bureau-Local-a-user-guide.pdf"
#fetch it
pdfobj = requests.get(pdfurl)
#show the content
pdfobj.content


In [None]:
#create a new file in 'write binary' mode and write the PDF content into it
with open('thisisanewpdf.pdf', 'wb') as f:
    f.write(pdfobj.content)
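Because a blocked request can return an HTML error page instead of the file you asked for, it can be worth checking that what you downloaded really is a PDF before saving it. PDF files normally begin with the characters `%PDF`, so one rough check is to look at the first few bytes. The two byte strings below are made-up stand-ins for `pdfobj.content`, and `looks_like_pdf` is a helper name invented for this sketch.

```python
def looks_like_pdf(content):
    """Return True if the bytes start with the usual PDF signature."""
    return content.startswith(b"%PDF")

#stand-ins for the .content of two different responses
real_pdf = b"%PDF-1.4 rest of the file..."
error_page = b"<!DOCTYPE html><html>...</html>"

print(looks_like_pdf(real_pdf))
print(looks_like_pdf(error_page))
```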

## Dealing with problems

When we tried to fetch the MP's register of interests earlier, we got a 403 status. Fixing these problems can involve a bit of googling around, but [this article provides a useful overview](https://flutterq.com/solved-python-requests-403-forbidden/) of possible causes and solutions.
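One commonly suggested thing to try is sending a `User-Agent` header, so that the request describes itself the way a browser would rather than as a Python script. This is a sketch of the technique, not a guaranteed fix: sites protected by a CAPTCHA, like the one above, may still refuse the request. The header value is an illustrative placeholder. To avoid depending on the network, the code only prepares the request and shows the header that would be sent; the commented-out lines show how you might send it for real.

```python
import requests

url = "https://publications.parliament.uk/pa/cm/cmregmem/220117/hunt_jeremy.htm"

#a placeholder User-Agent string - replace it with one that suits your project
headers = {"User-Agent": "Mozilla/5.0 (compatible; example-research-script)"}

#build the request without sending it, to inspect what would go out
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])

#to actually send it, you could use:
#response = requests.get(url, headers=headers, timeout=10)
#print(response.status_code)
```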