# Reading and writing data

**September 08 2020**  
*Vincenzo Perri*

In the fifth unit, we show how we can read data from and write data to files and databases.

## Reading and writing data from and to files

Without data there is no data science, so we had better learn how to read and write data in different formats in `python`. The basic, low-level interface to read and write data from/to the filesystem is provided by the function `open`. It returns a handle to a file that can be used to read/write text or binary data. If we only want to read data, we can pass the path to an existing file and open the file in read mode by specifying the access mode `r` as the second argument. In general, we would have to manually close a file that we opened by calling `f.close()`. If we fail to do so, the file might remain locked or the contents we intended to write may not be fully written when the process exits. To save us the hassle of remembering to manually close the file, we can use `python`'s `with` construct. It allows us to group statements and couple them to a so-called context manager, which will automatically close the file `f` for us as soon as we leave the scope of the compound statement.
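
To make the role of the context manager concrete, the following minimal sketch contrasts the manual pattern with the `with` construct. The scratch file created via `tempfile` and the names `path`, `h` and `content` are assumptions for illustration only, chosen so that the example does not depend on the data folder of this lecture.

```python
import os
import tempfile

# Hypothetical scratch file, independent of the lecture's data folder.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Manual variant: we have to call close() ourselves, even if write() fails.
f = open(path, 'w')
try:
    f.write('hello\n')
finally:
    f.close()

# Context manager variant: the file is closed when the block is left.
with open(path, 'r') as h:
    content = h.read()

print(f.closed, h.closed)
print(repr(content))
```

Both handles report `closed` as `True` afterwards, but only the manual variant required us to write the cleanup code ourselves.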

Let us read a text file in (default) text mode by calling the `read` function. This returns a single string object that contains the whole contents of the file.

Let me highlight that I use the line magic `%cd` to change the current directory of the `python` interpreter relative to the root of the folder that I usually open in Visual Studio Code. This is a very useful feature if you have a hierarchy of `jupyter` notebooks, and within these notebooks you want to access files relative to the notebook location.

In [1]:
# %cd "P02 - data and collections"

with open('data/posterior_analytics.txt', 'r') as f:
    text = f.read()
print(type(text))
print(len(text))
print(text[0])

<class 'str'>
277
E


Apart from reading the file contents into a single string, it is often convenient to read the individual lines as separate strings, which gives us a list of strings that contain the lines of the file. We can do this using the function `readlines`:

In [2]:
with open('data/posterior_analytics.txt', 'r') as f:
    lines = f.readlines()
print(type(lines))
print(len(lines))
print(lines[0])

<class 'list'>
11
ESTRAGON:
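
Note that `readlines` keeps the trailing newline character of each line, which is why `print` outputs an additional empty line above. As a hedged alternative for large files, we can also iterate over the file handle itself, which yields one line at a time instead of loading all lines into a list. The scratch file and the names `path` and `stripped` below are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical scratch file with three short lines.
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\nthird line\n')

stripped = []
with open(path, 'r') as f:
    for line in f:                          # the handle yields one line at a time
        stripped.append(line.rstrip('\n'))  # drop the trailing newline

print(stripped)
```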



Above, we have opened the file in default text mode, which assumes a default character encoding that can be changed via the `encoding` argument. Alternatively, we can open a file in binary mode, in which case the `read` function returns a `bytes` object. Each entry in the iterable `bytes` object is a single byte, represented by an integer value in the range from 0 to 255. Looking up the value `69` in an ASCII encoding table confirms that the first character in the file is an `E`.

In [3]:
with open('data/posterior_analytics.txt', 'rb') as f:
    binary = f.read()
print(type(binary))
print(len(binary))
print(binary[0])

<class 'bytes'>
288
69
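
The number of bytes can differ from the number of characters. One possible reason, which we cannot confirm for the file above from the output alone, is that a character outside the ASCII range takes more than one byte in an encoding like UTF-8. The following minimal sketch illustrates this with a hypothetical scratch file; the sample string and the names `path`, `as_text` and `as_bytes` are assumptions.

```python
import os
import tempfile

# Hypothetical scratch file; the sample contains two non-ASCII characters.
path = os.path.join(tempfile.mkdtemp(), 'encoded.txt')
sample = 'na\u00efve caf\u00e9'   # "naive cafe" written with i-diaeresis and e-acute

with open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# Text mode decodes the bytes into characters ...
with open(path, 'r', encoding='utf-8') as f:
    as_text = f.read()

# ... while binary mode returns the raw bytes.
with open(path, 'rb') as f:
    as_bytes = f.read()

print(len(as_text), len(as_bytes))            # 10 characters, 12 bytes
print(as_bytes[0], chr(as_bytes[0]))          # 110 is the ASCII code of 'n'
print(as_bytes.decode('utf-8') == as_text)    # decoding recovers the text
```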


Let us perform the simplest possible data analytics on the text file, i.e. counting the frequency of words. Since this lecture is not about text mining and natural language processing, here we take a maximally simple approach to split the text into words. We simply split the string based on a regular expression that matches space, newline, tab and hyphen characters. We can specify this using `python`'s regular expression module `re`. We can then use the `Counter` class introduced in the previous unit to efficiently count word frequencies and display the ten most common words:

In [4]:
from collections import Counter
import re

freq = Counter([ x.strip() for x in re.split(" |\n|\t|-", text.lower())])
freq.most_common(10)

[('estragon:', 3),
 ('(', 3),
 (')', 3),
 ('he', 2),
 ('to', 2),
 ('vladimir:', 2),
 ('charming', 1),
 ('spot.', 1),
 ('turns,', 1),
 ('advances', 1)]
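
Splitting on single delimiter characters produces empty strings whenever two delimiters follow each other. A slightly more robust variant, sketched here under the assumption that runs of whitespace and hyphens should count as a single separator, uses a character class and filters out empty tokens. The sample string and the names `tokens` and `freq_sample` are made up for illustration.

```python
from collections import Counter
import re

# Made-up sample text with a newline, a tab and a hyphen.
sample = "Charming spot.\n\tInspiring prospects. Let's go - let's go."

# [\s-]+ matches one or more whitespace or hyphen characters in a row.
tokens = [t for t in re.split(r'[\s-]+', sample.lower()) if t]
freq_sample = Counter(tokens)

print(tokens)
print(freq_sample.most_common(1))
```

As in the lecture's approach, punctuation stays attached to the words, so `go` and `go.` are still counted separately.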

Let us assume we wish to store these statistics for later use. A low-level way to do this is to format the word-frequency pairs as strings, and write individual lines to a file. For this, we can simply use the `open` function again to obtain a file handle to a new file in write mode (`w`). In this case, the file does not need to exist beforehand, and it will be overwritten whenever we open it again. If we instead want to append to an already existing file, we need to specify the access mode `a`. Each call to the `write` function adds the string to the end of the current file stream. Moreover, we use `python`'s `format` function to format the string and integer values as a line of comma-separated values.

In [5]:
with open('data/word_frequencies.dat', 'w') as f:
    for word in freq:
        f.write('{0},{1}\n'.format(word, freq[word]))

With this, we have manually created a file of comma-separated values (CSV), which is an important general text-based format to exchange data between applications and systems. In fact, we should not do this manually, as it leaves a lot of room for errors. Even our simple file will not be easy to parse, because some of the words actually contain commas, and we failed to escape those special characters. A better idea is to use the support for the import and export of `CSV` data integrated in `python`. For this, we can simply use the `reader` and `writer` classes in the module `csv`. We can set the value delimiter as well as the (OS-dependent) newline format, and the `writer` and `reader` classes will automatically take care of values that contain the delimiter character upon writing and reading. We can even export data in formats that are easy to interpret by third-party applications like Excel.

In [6]:
import csv

with open("data/word_frequencies.csv", 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for word in freq:
        writer.writerow([word, freq[word]])

In the following, we show how we can use the `reader` class in the `csv` module to read CSV data with a given `delimiter` character. We can directly access data in the different columns of a row by indexing the `row` entries of the iterable `reader`.

In [7]:
counts = {}

with open("data/word_frequencies.csv", 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        counts[row[0]] = row[1]  # note: csv yields every field as a string

print(counts['to'])

2
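
To see how the `csv` module handles the problematic values mentioned above, the following minimal sketch writes a few rows into an in-memory buffer and reads them back. The rows, the use of `io.StringIO` instead of a real file, and the names `buffer` and `counts_sample` are assumptions for illustration. Since the `reader` returns every field as a string, the sketch converts the counts back to integers.

```python
import csv
import io

# Made-up rows: one word contains the delimiter, one contains quote characters.
rows = [('turns,', 1), ('he', 2), ('say "hi"', 3)]

# In-memory text buffer standing in for a file opened with newline=''.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter=',')
writer.writerows(rows)
print(buffer.getvalue())   # problematic fields are wrapped in quotes

# Read the rows back and convert the counts from str to int.
buffer.seek(0)
counts_sample = {word: int(count) for word, count in csv.reader(buffer, delimiter=',')}
print(counts_sample)
```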


## Serialising `python` objects

In the examples above, we have manually taken care of the format in which we want to store our data. Another approach is to directly write `python` objects into files, a process that is called serialisation because it requires us to convert a potentially nested object structure to a sequence of bytes. Due to the importance of the World Wide Web and JavaScript, over the past years `JSON` (short for JavaScript Object Notation) has become a universal format to exchange objects with nested data fields. You can basically think about this as a simple way to store a nested dictionary structure in an easily interpretable and human-readable text file. In fact, you are using `JSON` right now since `jupyter` notebooks are just `JSON` files. `python` provides an easy way to store objects that consist of dictionaries, lists, strings, numbers, booleans and `None` as a `JSON` file. We just need to import the `json` module and call the `dumps` function to obtain a `JSON`-formatted string representation of an object, or the `loads` function to turn such a string back into an object. We can then write or read this string using the functions above.

In [8]:
import json

j = json.dumps(freq)

with open('data/word_frequencies.json', 'w') as f:
    f.write(j)

In [9]:
with open('data/word_frequencies.json', 'r') as f:
    j = f.read()

loaded_freq = json.loads(j)
print(type(loaded_freq))

<class 'dict'>


Rather than calling the read/write function ourselves, we can also use the functions without the `s` suffix (which stands for `string`) to directly operate on a file handle:

In [10]:
with open('data/word_frequencies.json', 'r') as f:
    loaded_freq = json.load(f)
type(loaded_freq)

dict
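
The writing counterpart `json.dump` works in the same way. The following minimal sketch stores a small nested structure and loads it again; the scratch file, the `record` dictionary and the optional `indent` argument used for a more readable file are assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical scratch file and made-up nested data.
path = os.path.join(tempfile.mkdtemp(), 'nested.json')
record = {
    'title': 'word frequencies',
    'counts': {'he': 2, 'to': 2},
    'tags': ['demo', 1.5, None],
}

# dump writes directly to the file handle; indent only affects readability.
with open(path, 'w') as f:
    json.dump(record, f, indent=2)

with open(path, 'r') as f:
    loaded = json.load(f)

print(loaded == record)
print(loaded['counts']['he'])
```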

In any case, you see that we retrieve a `dict` object that only contains the words and frequencies stored in the original `Counter` object. This is due to the fact that the specific `python` classes and data types cannot (and should not!) be stored in the language-independent `JSON` format. If we want to store a full copy of an arbitrary object or data type while maintaining this information, we can use the `pickle` module. As the name indicates, this preserves an object in a file for later use. It also uses a binary file format, which makes it more efficient to store large objects. Using `pickle` is very similar to using the `json` module. We can simply `dump` an object into a (binary) file and read it again using the `load` function.

In [11]:
import pickle

with open('data/word_frequencies.pickled', 'wb') as f:
    pickle.dump(freq, f)

Unlike with `JSON` files, here we retain all information about the object, so if we load the object we obtain an instance of the `Counter` class and we can directly use its function `most_common` to extract the most common words.

In [12]:
with open('data/word_frequencies.pickled', 'rb') as f:
    loaded_freq = pickle.load(f)
    
print(type(loaded_freq))
loaded_freq.most_common(10)

<class 'collections.Counter'>


[('estragon:', 3),
 ('(', 3),
 (')', 3),
 ('he', 2),
 ('to', 2),
 ('vladimir:', 2),
 ('charming', 1),
 ('spot.', 1),
 ('turns,', 1),
 ('advances', 1)]

As a general rule, using `pickle` is preferable if you do not care about interoperability or a future-proof archival of data, while using `JSON` is a good idea for data that you intend to use in environments other than `python`, or across different `python` versions.
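
The difference between the two formats can be seen in a small in-memory comparison. This is a minimal sketch that uses the string-based functions `dumps` and `loads` of both modules instead of files; the sample `Counter` and the variable names are assumptions for illustration.

```python
from collections import Counter
import json
import pickle

# Made-up sample data.
original = Counter({'he': 2, 'to': 2, 'charming': 1})

# Round trip through JSON: the contents survive, the class does not.
via_json = json.loads(json.dumps(original))

# Round trip through pickle: the class is preserved.
via_pickle = pickle.loads(pickle.dumps(original))

print(type(via_json).__name__, type(via_pickle).__name__)

# With JSON we have to rebuild the Counter ourselves.
restored = Counter(via_json)
print(restored == original)
```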

## Basic data management with sqlite

We now introduce some basics on advanced data management in `python` using `SQL`-based relational database management systems. Here we focus on the simplest possible setup, using the in-process, file-based database `sqlite`. This means you will neither have to install a database system and start a server, nor do you need to set up users or privileges. We can simply create databases with multiple tables that are stored in a single stand-alone file, and that can easily be exchanged between systems. For simple data analytics applications, which often do not require concurrent write access, authentication, or transaction support, `sqlite` is a very efficient Open Source solution to data management that is, moreover, available on any machine that can run `python`. As a side note, `sqlite` files are also used as the base data format in a number of commercial applications, such as for the catalogue files of `Adobe Lightroom`, the local database of `Evernote`, or as data store within `Google Chrome` and `Mozilla Thunderbird`.

Creating a new database or connecting to an existing one is extremely easy. All we have to do is (1) import the standard `python` module `sqlite3`, (2) connect to a local (possibly existing) database file, and (3) obtain a `cursor` object that we can use to create tables and to manipulate or query data.

In [13]:
import sqlite3

con = sqlite3.connect('data/example.db')
c = con.cursor()

Let us first create a table that can store our word frequency data. We then create a list of tuples consisting of the words and their frequencies and add all of them at once by calling the `executemany` function. As you see in the query below, we use two placeholder characters `?` that will be automatically filled with the corresponding values in the (ordered) tuple. While we could also manually create and execute SQL query strings that contain the data, I strongly advise against this approach as it makes your code vulnerable to SQL injections. Moreover, you would need to be careful to escape any special characters in your data. The placeholder mechanism of `execute` and `executemany` automatically handles such situations, making sure that your code is both robust and secure!

Once we have executed the queries, we need to issue a `commit` command to actually write the changes to the underlying database.

In [14]:
c.execute('CREATE TABLE word_freq (word text, count real)')

data = [ (word, freq[word]) for word in freq ]
c.executemany('INSERT INTO word_freq VALUES (?,?)', data)
con.commit()
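
To illustrate why placeholders are the safer choice, the following minimal sketch stores a word that contains a quote character and an SQL fragment. It assumes an in-memory database (`:memory:`) so that the file `data/example.db` is not touched, and it declares the count column as `INTEGER`, which differs from the table created above. The names `con_demo`, `cur` and `tricky` are made up for illustration.

```python
import sqlite3

# In-memory database, independent of data/example.db.
con_demo = sqlite3.connect(':memory:')
cur = con_demo.cursor()
cur.execute('CREATE TABLE word_freq (word TEXT, count INTEGER)')

# A value that would break (or abuse) a query built by string concatenation.
tricky = "it's; DROP TABLE word_freq;--"

# The placeholders pass the values as data, never as part of the SQL text.
cur.executemany('INSERT INTO word_freq VALUES (?, ?)', [(tricky, 1), ('he', 2)])
con_demo.commit()

cur.execute('SELECT word, count FROM word_freq WHERE word = ?', (tricky,))
row = cur.fetchone()
print(row)
con_demo.close()
```

The table still exists after the insert and the word is stored verbatim, including the quote character.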

This structured approach to storing and managing data in a relational database allows us to use the full power of `SQL` queries, i.e. rather than searching the data ourselves, we can leave this to the database system, which does so efficiently. The following `SQL` query returns the ten most frequent words:

In [15]:
c.execute('SELECT word, count FROM word_freq ORDER BY count DESC LIMIT 10')
print(c.fetchall())

[('estragon:', 3.0), ('(', 3.0), (')', 3.0), ('he', 2.0), ('to', 2.0), ('vladimir:', 2.0), ('charming', 1.0), ('spot.', 1.0), ('turns,', 1.0), ('advances', 1.0)]


The following query returns the count of the word `he`, again using the placeholder mechanism of the `execute` function.

In [16]:
c.execute('SELECT * FROM word_freq WHERE word=?', ('he',))
print(c.fetchone())
con.close()

('he', 2.0)
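
Beyond looking up single rows, we can also let the database aggregate the data for us. The following minimal sketch again assumes an in-memory database with a few made-up rows and an `INTEGER` count column; the names `con_demo`, `cur`, `n_words`, `total` and `frequent` are hypothetical.

```python
import sqlite3

# In-memory database with a few made-up rows.
con_demo = sqlite3.connect(':memory:')
cur = con_demo.cursor()
cur.execute('CREATE TABLE word_freq (word TEXT, count INTEGER)')
cur.executemany('INSERT INTO word_freq VALUES (?, ?)',
                [('estragon:', 3), ('he', 2), ('to', 2), ('charming', 1)])
con_demo.commit()

# Aggregate functions: number of distinct rows and sum of all counts.
cur.execute('SELECT COUNT(*), SUM(count) FROM word_freq')
n_words, total = cur.fetchone()

# A cursor is iterable, so we can loop over the result rows directly.
cur.execute('SELECT word FROM word_freq WHERE count >= ? ORDER BY word', (2,))
frequent = [row[0] for row in cur]

con_demo.close()
print(n_words, total, frequent)
```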
