# Lecture 12: Interacting with files
 
In this lecture, we will learn how to interact with files on your local machine and with the internet.

## Local files
We start by interacting with files on your computer. First, we create a file "text_file.txt" and store the string `Hello world!` in it.

In [1]:
with open('text_file.txt', 'w') as f:
    f.write('Hello world!')

Now, we can open the file again and read the message.

In [2]:
with open('text_file.txt', 'r') as f:
    print(f.read())

Hello world!


Next, we want to add a second line, saying hello back.

In [3]:
with open('text_file.txt', 'w') as f:
    f.write('Hello back :)')

Let us now see if we successfully managed to add two lines to the file.

In [4]:
with open('text_file.txt', 'r') as f:
    print(f.read())

Hello back :)


Wait, why is there only one line here?

When we open a file in write mode we delete everything in it. Let us see that in action!

In [5]:
with open('text_file.txt', 'w') as f:
    pass

In [6]:
with open('text_file.txt', 'r') as f:
    print(f.read())




So how can we then modify a file!? To do that, we need to open it in *append* mode:

In [7]:
with open('text_file.txt', 'w') as f:
    f.write('Hello world!')

In [8]:
with open('text_file.txt', 'a') as f:
    f.write('Hello back :)')

In [9]:
with open('text_file.txt', 'r') as f:
    print(f.read())

Hello world!Hello back :)


Now, we at least have the correct text, but we are forgetting a line break. It is good practice to end all text files with a line break!

In [10]:
with open('text_file.txt', 'w') as f:
    f.write('Hello world!\n')

with open('text_file.txt', 'a') as f:
    f.write('Hello back :)\n')

with open('text_file.txt', 'r') as f:
    print(f.read())

Hello world!
Hello back :)



Now we see that we have the file as we want it. Let us think about what the code did.

First, the file is opened in write mode, which deletes any existing contents (or creates the file if it does not exist), and the first line is written. Then, we open it again, but this time in append mode, so that the previous file content is not deleted. Finally, we open it in read mode and print the contents.
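
The three modes can be compared side by side in a short sketch. It writes to a temporary directory so that it does not touch `text_file.txt`; the file name `modes_demo.txt` is made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'modes_demo.txt')  # hypothetical file name

    # 'w' creates the file, or empties it if it already exists
    with open(path, 'w') as f:
        f.write('first\n')

    # 'a' keeps the existing contents and writes at the end
    with open(path, 'a') as f:
        f.write('second\n')

    with open(path, 'r') as f:
        after_append = f.read()

    # Opening in 'w' again throws away both lines
    with open(path, 'w') as f:
        f.write('third\n')

    with open(path, 'r') as f:
        after_rewrite = f.read()

print(repr(after_append))
print(repr(after_rewrite))
```

After the append, both lines are in the file. After the second write, only the new line is left.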

There is another thing I want you to take notice of, namely that the file we create is a text file. It stores plain text, and as you may remember from the second lecture, plain text is usually encoded using UTF-8 (an encoding of Unicode). That is, we have a table of number-letter pairs that is used to figure out what text the file contains.
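
As a small sketch of that table in action, `str.encode` turns text into the bytes that end up on disk, and `bytes.decode` turns them back. The example strings are chosen for illustration only.

```python
text = 'Skøyen'

# Encoding turns each character into one or more bytes
encoded = text.encode('utf-8')
print(encoded)                  # the letter ø takes two bytes in UTF-8
print(len(text), len(encoded))  # number of characters, number of bytes

# Decoding with the same table gives back the original text
decoded = encoded.decode('utf-8')
print(decoded == text)

# Plain ASCII letters are stored as a single byte each
ascii_bytes = 'Hello world!'.encode('utf-8')
print(len(ascii_bytes))
```

When writing files that contain non-ASCII text, passing `encoding='utf-8'` to `open` makes the choice explicit, since the default encoding may depend on the platform.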

However, text-files are not well suited for data-storage. Consider, for example, if we want to store a 10-digit number in a text file. Then, to load the number, we first need to parse each of the 10 digits and then convert it into an integer. Let us see an example.

In [11]:
with open('number_file.txt', 'w') as f:
    f.write('1002010100')

with open('number_file.txt') as f:  # Read mode is default
    digit = int(f.read())

### Downsides with text-files for data storage:
Here, we store the number using 10 bytes (one per digit) instead of the 8 bytes we usually spend to store an integer. This is suboptimal both with respect to space and parse time. Moreover, if we store tabular data and we want to extract a single row, then we need to read all rows before the one we are interested in, since each line may take a different amount of space. With binary data, we can ensure that each row takes the same amount of disk space, which makes it easier to read from the middle of the file.
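
To illustrate the point about fixed-size rows, here is a minimal sketch using the standard-library `struct` module (the lecture itself does not use it). Every number is stored as exactly 8 bytes, so row `k` starts at byte `8 * k` and we can jump straight to it. The numbers and the in-memory file are made up for the example.

```python
import io
import struct

numbers = [1002010100, 42, 7, 123456789012]
ROW_FORMAT = '<q'                        # one signed 8-byte integer, little-endian
ROW_SIZE = struct.calcsize(ROW_FORMAT)   # number of bytes per row

# Write all rows to an in-memory binary file (stands in for a file opened with 'wb')
binary_file = io.BytesIO()
for number in numbers:
    binary_file.write(struct.pack(ROW_FORMAT, number))


def read_row(f, row_index):
    # Jump directly to the row without reading the rows before it
    f.seek(row_index * ROW_SIZE)
    return struct.unpack(ROW_FORMAT, f.read(ROW_SIZE))[0]


print(ROW_SIZE)
print(read_row(binary_file, 2))
print(read_row(binary_file, 0))
```

With a text file, finding the third row would require reading past the first two line breaks first.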

### Downsides with binary-files for data storage
However, there are also downsides with binary files. Firstly, it is not clear solely from the file how it should be read. We need some metadata as well. In Windows, this is the file extension. For binary files, we therefore need special programs or libraries to read and write them. Figuring out the contents of a binary file is therefore much more difficult (we cannot simply open it in Notepad and view its contents).

In Python, we have the pickle library, which can store (almost) arbitrary Python objects to disk.

In [12]:
import pickle

with open('binary_file.pickle', 'wb') as f:
    pickle.dump(1002010100, f)

with open('binary_file.pickle', 'rb') as f:
    number = pickle.load(f)

In [13]:
number

1002010100

This is a neat way to store Python objects. Be careful, however: pickle files can contain arbitrary code, including viruses, which is run when the file is loaded. So don't open pickle files from sources you don't trust.

In [14]:
with open('binary_file.pickle', 'rb') as f:
    print(f.read())

b'\x80\x03J\xf4u\xb9;.'
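
These eight bytes can be picked apart by hand. The sketch below assumes exactly the output shown above (pickle protocol 3): the first two bytes state the protocol version, `J` announces a 4-byte little-endian integer, and the final `.` marks the end of the pickle.

```python
import pickle
import struct

raw = b'\x80\x03J\xf4u\xb9;.'   # the bytes printed above

protocol_version = raw[1]       # the byte after the \x80 marker
payload = raw[3:7]              # the four bytes after the 'J' marker

# Interpret the payload as a signed 4-byte little-endian integer
number = struct.unpack('<i', payload)[0]

print(protocol_version)
print(number)
print(pickle.loads(raw) == number)
```

Note that the number itself takes only four bytes here, compared to ten in the text file.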


## Storing data on disk -- serialising objects
Often, we wish to save the state of our program to disk to view it later. There are two ways to do this, either as binary files (for example with pickle), or with a text-serialisation language, such as JSON.

JSON, or JavaScript Object Notation, is one of the most popular ways to store data on disk. With JSON files, we usually store the data as a dictionary whose keys are strings and whose values are numbers, strings, lists and more dictionaries (with the same options for keys and values).

JSON is very common, and it is therefore often a good idea to implement a way to store your data as JSON and load your data from a JSON file.
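
Here is a small sketch of what survives a round trip through JSON. The dictionary is made up for the example. Note that JSON keys are always strings, so the integer key comes back as the string `'1'`, and the tuple comes back as a list.

```python
import json

data = {
    'name': 'triangle',
    'points': [[0, 0], [1, 0], [0, 1]],
    'closed': True,
    'corner': (0, 0),   # tuples are stored as JSON lists
    1: 'integer key',   # non-string keys are converted to strings
}

json_string = json.dumps(data)
print(json_string)

loaded = json.loads(json_string)
print(loaded['1'])
print(loaded['corner'])
print(loaded['points'] == data['points'])
```

If the exact Python types matter, they have to be restored by hand after loading.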

In [15]:
import json

class Polygon:
    def __init__(self, points):
        self.points = points
    
    def to_json(self):
        return json.dumps(self.points)
    
    def load_json(self, json_string):
        self.points = json.loads(json_string)

In [16]:
polygon = Polygon([1, 2, 3])

with open('polygon.json', 'w') as f:
    f.write(polygon.to_json())

In [17]:
new_polygon = Polygon([])

with open('polygon.json', 'r') as f:
    new_polygon.load_json(f.read())

In [18]:
new_polygon.points

[1, 2, 3]

This works well. Now, let us take it one step further... Let us use class methods

In [19]:
import json

class Polygon:
    def __init__(self, points):
        self.points = points
    
    def to_json(self):
        return json.dumps(self.points)
    
    @classmethod
    def load_json(cls, json_string):
        return cls(json.loads(json_string))  # cls becomes Polygon

In [20]:
with open('polygon.json', 'r') as f:
    new_polygon = Polygon.load_json(f.read())

In [21]:
new_polygon.points

[1, 2, 3]

What happened here?

A class method is similar to a normal method, but instead of being called on an instance of the class, it is called on the class itself. For instance methods, we use `self` to represent the instance, in our case a specific polygon. Similarly, for class methods, we use `cls` to represent the class: the input is the class `Polygon` itself, not a specific instance of it. This lets `load_json` create and return a new polygon, so we no longer need to create an empty one first.
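
One practical consequence, sketched below, is that a class method also works for subclasses: `cls` is whichever class the method was called on. The `Triangle` subclass is a hypothetical addition, not part of the lecture code.

```python
import json


class Polygon:
    def __init__(self, points):
        self.points = points

    @classmethod
    def load_json(cls, json_string):
        # cls is the class the method was called on
        return cls(json.loads(json_string))


class Triangle(Polygon):  # hypothetical subclass for illustration
    pass


polygon = Polygon.load_json('[1, 2, 3]')
triangle = Triangle.load_json('[1, 2, 3]')

print(type(polygon).__name__)
print(type(triangle).__name__)
```

Had `load_json` returned `Polygon(...)` directly instead of `cls(...)`, the second call would have produced a plain `Polygon`.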

For another explanation, see here: https://stackoverflow.com/questions/12179271/meaning-of-classmethod-and-staticmethod-for-beginner

## Interacting with web pages
### Requests and responses

Have you ever wondered what happens when you connect to a webpage? Well, first you say that you wish to visit a specific webpage, for example, http://nmbu.no. Then, your computer creates a data structure called a packet. This packet has an address and some information about what you want to do. Then, you send this packet to your router, which sends it to your internet service provider (ISP), which finds out where to send it next. Once the packet reaches the server for http://nmbu.no, it is read, and the server creates a response packet (or more realistically, many response packets, but we'll pretend that all info is sent in one) that it addresses to you. This packet is then sent to the server's ISP, which figures out how to get it to you.

I want you to think of each packet as a tiny text file which contains four fields: a destination address (http://nmbu.no), a return address (your personal IP address), what you wish to do, and the contents of the packet. We call the packets you send requests. There are several kinds of requests; the two most common are GET requests and POST requests.

A GET-request is simply you saying that you want some information from the server. An example of a GET-request is: "I want to see your frontpage". A POST-request is the reverse: you wish to give the server some information. An example of a POST-request is: "I want to log in, here is my username and password". (A PUT-request also sends information to the server, and is typically used to store or replace something at a given address.)

In this lecture, we will only consider GET-requests.
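
To make the fields of a request concrete, the sketch below builds a GET-request without sending it. It uses `urllib.request.Request` from the standard library purely for illustration. The return address is filled in by your operating system and network when the request is actually sent, so it does not appear here. The header value is an arbitrary example.

```python
from urllib.request import Request

# Build (but do not send) a request for the front page
get_request = Request(
    'http://nmbu.no',
    headers={'Client-Identifier': 'Programming class at NMBU'},
    method='GET',
)

print(get_request.get_method())    # what we wish to do
print(get_request.host)            # where the request should go
print(get_request.header_items())  # extra information sent along with the request
print(get_request.data)            # the contents; this GET-request has none
```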

To send web-requests in Python, we will use the requests library. This library is not part of the standard library, but it has become the de-facto standard for web-requests with Python.

In [22]:
import requests

Let us start with a simple example

In [23]:
request = requests.get('http://nmbu.no')

In [24]:
request

<Response [200]>

Here we see the response code: 200 means success! Likewise, 404 means not found.

In [25]:
requests.get('http://nmbu.no/python_is_cool')

<Response [404]>

There is a whole list of response codes. You can read about them here: https://realpython.com/python-requests/
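
The standard library also knows the names of the response codes. As a small sketch, `http.HTTPStatus` can translate a code into its official phrase:

```python
from http import HTTPStatus

for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

# Codes from 400 and up signal that something went wrong
is_error = HTTPStatus.NOT_FOUND >= 400
print(is_error)
```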

Now, back to the first, successful, request.

In [26]:
request.content

b'<!DOCTYPE html>\n<html lang="nb" dir="ltr">\n  <head>\n    <!-- META FOR IOS & HANDHELD -->\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"/>\n    <meta name="HandheldFriendly" content="true" />\n    <meta name="apple-touch-fullscreen" content="YES" />\n    <meta name="msvalidate.01" content="B2B560DCA480DA106B5ED009F4B70148" />\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="shortcut icon" href="https://www.nmbu.no/sites/default/files/favicon_0.ico" type="image/vnd.microsoft.icon" />\n<meta name="generator" content="Drupal 7 (http://drupal.org)" />\n<link rel="canonical" href="https://www.nmbu.no/" />\n<link rel="shortlink" href="https://www.nmbu.no/" />\n<meta property="og:title" content="Forsiden" />\n<meta property="og:updated_time" content="2019-10-11T12:33:36+02:00" />\n<meta name="twitter:title" content="Forsiden" />\n<meta property="article:published_time" content="2019-06-05T13:04:00+02:00" />\

Here, we see the content of the response. It is a byte string (remember from lecture 2), so we should convert it to a normal string.

In [27]:
content = request.content.decode()

In [28]:
content

'<!DOCTYPE html>\n<html lang="nb" dir="ltr">\n  <head>\n    <!-- META FOR IOS & HANDHELD -->\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"/>\n    <meta name="HandheldFriendly" content="true" />\n    <meta name="apple-touch-fullscreen" content="YES" />\n    <meta name="msvalidate.01" content="B2B560DCA480DA106B5ED009F4B70148" />\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="shortcut icon" href="https://www.nmbu.no/sites/default/files/favicon_0.ico" type="image/vnd.microsoft.icon" />\n<meta name="generator" content="Drupal 7 (http://drupal.org)" />\n<link rel="canonical" href="https://www.nmbu.no/" />\n<link rel="shortlink" href="https://www.nmbu.no/" />\n<meta property="og:title" content="Forsiden" />\n<meta property="og:updated_time" content="2019-10-11T12:33:36+02:00" />\n<meta name="twitter:title" content="Forsiden" />\n<meta property="article:published_time" content="2019-06-05T13:04:00+02:00" />\n

## Enough of a demo, let us see it in action.

A good way to get information from the internet is through public data APIs. For example, Oslo Bysykkel (Oslo city bikes) share anonymous travel information from their webpage https://oslobysykkel.no/apne-data

In [29]:
headers = {'Client-Identifier': 'Programming class at NMBU'}

bike_data = requests.get(
    'https://gbfs.urbansharing.com/oslobysykkel.no/gbfs.json',
    headers=headers
)

In [30]:
bike_data.content.decode()

'{"last_updated": 1574685720, "ttl": 10, "data": {"nb": {"feeds": [{"name": "system_information", "url": "http://gbfs.urbansharing.com/oslobysykkel.no/system_information.json"}, {"name": "station_information", "url": "http://gbfs.urbansharing.com/oslobysykkel.no/station_information.json"}, {"name": "station_status", "url": "http://gbfs.urbansharing.com/oslobysykkel.no/station_status.json"}]}}}\n'

Let us see some cooler examples

In [31]:
status_endpoint = 'https://gbfs.urbansharing.com/oslobysykkel.no/station_status.json'

station_information_text = requests.get(
    status_endpoint,
    headers=headers
).content.decode()

In [32]:
station_information_text

'{"last_updated": 1574685724, "ttl": 10, "data": {"stations": [{"station_id": "1101", "is_installed": 1, "is_renting": 1, "is_returning": 1, "last_reported": 1574685724, "num_bikes_available": 2, "num_docks_available": 20}, {"station_id": "1023", "is_installed": 1, "is_renting": 1, "is_returning": 1, "last_reported": 1574685724, "num_bikes_available": 10, "num_docks_available": 2}, {"station_id": "1009", "is_installed": 1, "is_renting": 1, "is_returning": 1, "last_reported": 1574685724, "num_bikes_available": 0, "num_docks_available": 10}, {"station_id": "970", "is_installed": 1, "is_renting": 1, "is_returning": 1, "last_reported": 1574685724, "num_bikes_available": 0, "num_docks_available": 23}, {"station_id": "787", "is_installed": 1, "is_renting": 1, "is_returning": 1, "last_reported": 1574685724, "num_bikes_available": 8, "num_docks_available": 0}, {"station_id": "627", "is_installed": 1, "is_renting": 1, "is_returning": 1, "last_reported": 1574685724, "num_bikes_available": 16, "n

Does this data-format look familiar? It is JSON!

In [33]:
station_status = json.loads(station_information_text)
station_status

{'last_updated': 1574685724,
 'ttl': 10,
 'data': {'stations': [{'station_id': '1101',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 2,
    'num_docks_available': 20},
   {'station_id': '1023',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 10,
    'num_docks_available': 2},
   {'station_id': '1009',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 0,
    'num_docks_available': 10},
   {'station_id': '970',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 0,
    'num_docks_available': 23},
   {'station_id': '787',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 8,
    'num_docks_available': 0},
  

Ok, this code is somewhat cumbersome to work with. Let us therefore create a function that does everything above

In [34]:
def request_json(address, headers):
    # Send a GET-request and parse the JSON body of the response
    response = requests.get(address, headers=headers)
    return json.loads(response.content.decode())

station_status = request_json(status_endpoint, headers)

In [35]:
station_status

{'last_updated': 1574685724,
 'ttl': 10,
 'data': {'stations': [{'station_id': '1101',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 2,
    'num_docks_available': 20},
   {'station_id': '1023',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 10,
    'num_docks_available': 2},
   {'station_id': '1009',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 0,
    'num_docks_available': 10},
   {'station_id': '970',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 0,
    'num_docks_available': 23},
   {'station_id': '787',
    'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1574685724,
    'num_bikes_available': 8,
    'num_docks_available': 0},
  

In [36]:
station_status['data'].keys()

dict_keys(['stations'])
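
Since the parsed JSON is just nested dictionaries and lists, ordinary indexing and loops are enough to pull out what we need. The sketch below uses a hand-copied excerpt of the response above (three stations, a few fields each) so that it runs without an internet connection.

```python
import json

# Excerpt of the station status response shown above
excerpt = '''
{"last_updated": 1574685724, "ttl": 10, "data": {"stations": [
  {"station_id": "1101", "num_bikes_available": 2, "num_docks_available": 20},
  {"station_id": "1023", "num_bikes_available": 10, "num_docks_available": 2},
  {"station_id": "1009", "num_bikes_available": 0, "num_docks_available": 10}
]}}
'''

status = json.loads(excerpt)
station_list = status['data']['stations']

# IDs of the stations that currently have at least one bike
with_bikes = [
    station['station_id']
    for station in station_list
    if station['num_bikes_available'] > 0
]
total_bikes = sum(station['num_bikes_available'] for station in station_list)

print(with_bikes)
print(total_bikes)
```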

Let us now create a station-class that we can use to wrap these station-dictionaries in

In [39]:
class Station:
    def __init__(
        self,
        station_id,
        is_installed,
        is_renting,
        is_returning,
        last_reported,
        num_bikes_available,
        num_docks_available
    ):
        self.station_id = station_id
        self.is_installed = is_installed
        self.is_renting = is_renting
        self.is_returning = is_returning
        self.last_reported = last_reported
        self.num_bikes_available = num_bikes_available
        self.num_docks_available = num_docks_available

    def __repr__(self):
        return f'<Station {self.station_id}>'
    
    def __str__(self):
        return f"""\
            ID: {self.station_id}
            Installed: {self.is_installed}
            Renting: {self.is_renting}
            Returning: {self.is_returning}
            Last reported: {self.last_reported}
            Number of available bikes: {self.num_bikes_available}
            Number of available docking stations: {self.num_docks_available}"""
    
    
stations = [
    Station(**station_info) for station_info in station_status['data']['stations']
]
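
The `**station_info` syntax unpacks a dictionary into keyword arguments, so every key must match a parameter name in `__init__`. A minimal sketch with a made-up `Point` class:

```python
class Point:  # hypothetical class for illustration
    def __init__(self, x, y):
        self.x = x
        self.y = y


point_info = {'x': 1, 'y': 2}

# Point(**point_info) is the same as Point(x=1, y=2)
point = Point(**point_info)
print(point.x, point.y)

# A key without a matching parameter raises a TypeError
try:
    Point(**{'x': 1, 'y': 2, 'z': 3})
    unpacking_failed = False
except TypeError as error:
    unpacking_failed = True
    print('TypeError:', error)
```

This is why the `Station` class needs one parameter for every key in the dictionaries we pass to it.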

In [40]:
print(str(stations[0]))

            ID: 1101
            Installed: 1
            Renting: 1
            Returning: 1
            Last reported: 1574685724
            Number of available bikes: 2
            Number of available docking stations: 20


Now, what is this last-reported information? It is a time stamp giving the number of seconds since 1 January 1970 (UTC), also known as time since the epoch.

In [41]:
import time

def parse_time(time_since_epoch):
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time_since_epoch))
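
`time.localtime` converts to the time zone of the machine running the code, so the same time stamp prints differently on computers in different time zones. As a sketch of an alternative, the `datetime` module can convert to an explicit time zone, here UTC. The function name `parse_time_utc` is made up for this example.

```python
from datetime import datetime, timezone


def parse_time_utc(time_since_epoch):
    # Convert seconds since 1 January 1970 (UTC) to a readable UTC time
    moment = datetime.fromtimestamp(time_since_epoch, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(parse_time_utc(0))
print(parse_time_utc(1574685724))
```

The second time stamp is a `last_reported` value from the response above. In UTC it reads 12:42:04, one hour behind Norwegian winter time.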

In [45]:
class Station:
    def __init__(
        self,
        station_id,
        is_installed,
        is_renting,
        is_returning,
        last_reported,
        num_bikes_available,
        num_docks_available
    ):
        self.station_id = station_id
        self.is_installed = is_installed
        self.is_renting = is_renting
        self.is_returning = is_returning
        self.last_reported = parse_time(last_reported)
        self.num_bikes_available = num_bikes_available
        self.num_docks_available = num_docks_available

    def __repr__(self):
        return f'<Station {self.station_id}>'
    
    def __str__(self):
        return f"""\
            ID: {self.station_id}
            Installed: {self.is_installed}
            Renting: {self.is_renting}
            Returning: {self.is_returning}
            Last reported: {self.last_reported}
            Number of available bikes: {self.num_bikes_available}
            Number of available docking stations: {self.num_docks_available}"""
    
    
stations = [
    Station(**station_info) for station_info in station_status['data']['stations']
]

In [46]:
print(stations[0])

            ID: 1101
            Installed: 1
            Renting: 1
            Returning: 1
            Last reported: 2019-11-25 13:42:04
            Number of available bikes: 2
            Number of available docking stations: 20


In [47]:
print(stations[1])

            ID: 1023
            Installed: 1
            Renting: 1
            Returning: 1
            Last reported: 2019-11-25 13:42:04
            Number of available bikes: 10
            Number of available docking stations: 2


Next, we want a map between IDs and station names

In [48]:
station_metadata = request_json('https://gbfs.urbansharing.com/oslobysykkel.no/station_information.json', headers)
station_metadata

{'last_updated': 1574685758,
 'ttl': 10,
 'data': {'stations': [{'station_id': '1101',
    'name': 'Stortingstunellen',
    'address': 'Rådhusgata 34',
    'lat': 59.91065301806209,
    'lon': 10.737365277561025,
    'capacity': 24},
   {'station_id': '1023',
    'name': 'Professor Aschehougs plass',
    'address': 'Professor Aschehougs plass',
    'lat': 59.9147672,
    'lon': 10.740971,
    'capacity': 12},
   {'station_id': '1009',
    'name': 'Borgenveien',
    'address': 'Borgenveien',
    'lat': 59.942742106473666,
    'lon': 10.703833031254021,
    'capacity': 10},
   {'station_id': '970',
    'name': 'Enerhaugen',
    'address': 'ved Sørligata',
    'lat': 59.91320242563816,
    'lon': 10.767579386407874,
    'capacity': 25},
   {'station_id': '787',
    'name': 'Kirkegata 15',
    'address': 'Kirkegata 15, Oslo',
    'lat': 59.91015615055511,
    'lon': 10.743456971511705,
    'capacity': 12},
   {'station_id': '627',
    'name': 'Skøyen Stasjon',
    'address': 'Skøyen Stasjo

Ok, so our class from above was not ideal: it doesn't contain the metadata. To include it, we need to modify our code so that we have a mapping from ID to information. A good data structure for this is a dictionary.

In [49]:
station_data = {}


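def join_dicts(*dicts):
    # Merge the dictionaries; later ones overwrite earlier ones on shared keys
    joined = {}
    for single_dict in dicts:  # avoid shadowing the builtin dict
        joined = {**joined, **single_dict}
    return joined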


for single_station_metadata in station_metadata['data']['stations']:
    station_id = single_station_metadata['station_id']
    station_data[station_id] = single_station_metadata
    
for single_station_status in station_status['data']['stations']:
    station_id = single_station_status['station_id']
    station_data[station_id] = join_dicts(station_data[station_id], single_station_status)
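
The `{**a, **b}` pattern used in `join_dicts` builds a new dictionary with the keys of both. If a key occurs in both, the value from the right-hand dictionary wins. A minimal sketch using a few values copied from the two responses above for station 1101:

```python
metadata = {'station_id': '1101', 'name': 'Stortingstunellen', 'capacity': 24}
status = {'station_id': '1101', 'num_bikes_available': 2, 'num_docks_available': 20}

# Keys from both dictionaries end up in the result
merged = {**metadata, **status}
print(merged)

# When a key occurs in both, the right-most value is kept
overlap = {**{'a': 1, 'b': 2}, **{'b': 3}}
print(overlap)
```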

In [50]:
station_data

{'1101': {'station_id': '1101',
  'name': 'Stortingstunellen',
  'address': 'Rådhusgata 34',
  'lat': 59.91065301806209,
  'lon': 10.737365277561025,
  'capacity': 24,
  'is_installed': 1,
  'is_renting': 1,
  'is_returning': 1,
  'last_reported': 1574685724,
  'num_bikes_available': 2,
  'num_docks_available': 20},
 '1023': {'station_id': '1023',
  'name': 'Professor Aschehougs plass',
  'address': 'Professor Aschehougs plass',
  'lat': 59.9147672,
  'lon': 10.740971,
  'capacity': 12,
  'is_installed': 1,
  'is_renting': 1,
  'is_returning': 1,
  'last_reported': 1574685724,
  'num_bikes_available': 10,
  'num_docks_available': 2},
 '1009': {'station_id': '1009',
  'name': 'Borgenveien',
  'address': 'Borgenveien',
  'lat': 59.942742106473666,
  'lon': 10.703833031254021,
  'capacity': 10,
  'is_installed': 1,
  'is_renting': 1,
  'is_returning': 1,
  'last_reported': 1574685724,
  'num_bikes_available': 0,
  'num_docks_available': 10},
 '970': {'station_id': '970',
  'name': 'Enerha

In [51]:
class Station:
    def __init__(
        self,
        station_id,
        is_installed,
        is_renting,
        is_returning,
        last_reported,
        num_bikes_available,
        num_docks_available,
        name,
        address,
        lat,
        lon,
        capacity
    ):
        self.station_id = station_id
        self.is_installed = is_installed
        self.is_renting = is_renting
        self.is_returning = is_returning
        self.last_reported = parse_time(last_reported)
        self.num_bikes_available = num_bikes_available
        self.num_docks_available = num_docks_available
        self.name = name
        self.address = address
        self.lat = lat
        self.lon = lon
        self.capacity = capacity

    def __repr__(self):
        return f'<Station {self.station_id}>'
    
    def __str__(self):
        return f"""\
        Station:
            ID: {self.station_id}
            Name: {self.name}
            Address: {self.address}
            Installed: {self.is_installed}
            Renting: {self.is_renting}
            Returning: {self.is_returning}
            Number of available bikes: {self.num_bikes_available}
            Number of available docking stations: {self.num_docks_available}
            Capcity: {self.capacity}
            Latitude: {self.lat}
            Longitude: {self.lon}
            Last reported: {self.last_reported}
            """
    
    
stations = {
    station_id: Station(**station_info) for station_id, station_info in station_data.items()
}

Are the station names unique?

In [52]:
names = {station_info['name'] for station_info in station_data.values()}
len(names), len(station_data)

(244, 245)

There is one fewer unique name than there are stations, so two stations share a name.
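
To find out which name is shared, one option is `collections.Counter`, which counts how often each value occurs. The sketch below uses a short hand-written list in which one name is repeated on purpose, so the duplicate it reports is not the real one.

```python
from collections import Counter

# Hand-written list with a deliberate repeat; the real names come from station_data
names = ['Salt', 'Enerhaugen', 'Borgenveien', 'Salt']

name_counts = Counter(names)
duplicates = [name for name, count in name_counts.items() if count > 1]

print(len(set(names)), len(names))
print(duplicates)
```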

In [56]:
from collections import defaultdict


class StationsDatabase:
    def __init__(self, stations):
        self.names_to_id = defaultdict(list)
        for station_id, station_data in stations.items():
            self.names_to_id[station_data.name].append(station_id)
        
        self.stations = stations
    
    def show_location_info(self, name):
        if name not in self.names_to_id:
            raise NameError(f'{name} is not a valid name for a station')
        for station_id in self.names_to_id[name]:
            print(self.stations[station_id])
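
Two details of `defaultdict(list)` matter here, sketched below with made-up names and IDs. Appending to a missing key creates the list automatically, and the `in` test does not create a key, whereas indexing with `[]` does. That is why `show_location_info` checks with `in` before looking the name up.

```python
from collections import defaultdict

names_to_id = defaultdict(list)

# Made-up (name, ID) pairs; two stations share the name 'A'
for name, station_id in [('A', '1'), ('B', '2'), ('A', '3')]:
    names_to_id[name].append(station_id)   # missing keys start as []

ids_for_a = list(names_to_id['A'])
print(ids_for_a)

# Membership tests leave the dictionary unchanged ...
present_before = 'C' in names_to_id
size_before = len(names_to_id)
print(present_before, size_before)

# ... but indexing a missing key silently adds it
names_to_id['C']
present_after = 'C' in names_to_id
size_after = len(names_to_id)
print(present_after, size_after)
```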

In [57]:
stations_database = StationsDatabase(stations)
stations_database.show_location_info('Bislettgata')

        Station:
            ID: 620
            Name: Bislettgata
            Address: Bislettgata
            Installed: 1
            Renting: 1
            Returning: 1
            Number of available bikes: 0
            Number of available docking stations: 38
            Capcity: 40
            Latitude: 59.9238336
            Longitude: 10.7346377
            Last reported: 2019-11-25 13:42:04
            


In [58]:
stations_database.names_to_id

defaultdict(list,
            {'Stortingstunellen': ['1101'],
             'Professor Aschehougs plass': ['1023'],
             'Borgenveien': ['1009'],
             'Enerhaugen': ['970'],
             'Kirkegata 15': ['787'],
             'Skøyen Stasjon': ['627'],
             '7 Juni Plassen': ['623'],
             'Dælenenggata': ['624'],
             'Drammensveien': ['626'],
             'Sinsen T-bane': ['614'],
             'Salt': ['616'],
             'Bak Niels Treschows hus sør': ['618'],
             'Torshovdalen øst': ['621'],
             'Pilestredet 63': ['622'],
             'Schives gate': ['613'],
             'Bislettgata': ['620'],
             'Idioten': ['596'],
             'Bjerregaardsgate Øst': ['617'],
             'Fred Olsens gate': ['609'],
             'Marcus Thranes gate': ['607'],
             'Frogner plass': ['603'],
             'Majorstua skole': ['590'],
             'Sotahjørnet': ['610'],
             'Colletts gate': ['608'],
             'B