There is a natural dichotomy between data files that are easy for a computer to read and data files that are easy for a human to read. In this article, we'll look at three common kinds of data files that you are likely to encounter, that are fairly human-readable, and that are very easily manipulated in Python.

## CSV files

CSV is a very simple file format. The initials stand for "comma-separated values", and the format is used to encode a table of data. All spreadsheets and most databases can be exported into CSV format, and spreadsheet software such as Excel can import CSV files, so this is a nice format to be familiar with. Line-breaks (that is, <code>\n</code> or <code>\r</code> characters) separate the rows, and commas separate the values in each row. Well, it's a bit more lenient than that; you can choose any character to separate the values, but the comma is the one that stuck around in the name. So, this is a perfectly reasonable .csv file:

but so is

As we should have come to expect in Python, there is a standard library module for handling .csv files, called (wait for it) <code>csv</code>. The two main functions offered by this module are <code>reader()</code> and <code>writer()</code>, respectively used for reading and writing <code>csv</code> files. Let's look at reading, first.

In [14]:
import csv
with open('topmovies.csv', newline='') as f: # file contains highest grossing films of 2016
    moviescsv = csv.reader(f, delimiter=',')

    for line in moviescsv:
        print(line)
    

['movie name', ' gross', 'IMDB rating', 'IMDB votes']
['Captain America: Civil War', ' 1153304495', '7.9', '418614']
['Rogue One', ' 1056057273', '7.9', '334132']
['Finding Dory', ' 1028570889', '7.4', '161213']
['Zootopia', ' 1023784195', '8.1', '302479']
['The Jungle Book', ' 966550600', '7.5', '200798']
['The Secret Life Of Pets', ' 875457937', '6.6', '124027']
['Batman v Superman', ' 873260194', '6.7', '479880']
['Deadpool', ' 783112979', '8.0', '637685']
['Suicide Squad', ' 745600054', '6.2', '404862']
['Sing', ' 632443719', '7.2', '66129']


As we can see, each line is turned into a list of strings (by default). In this use of <code>reader</code>, we specified a delimiter character -- the character that separates each entry. We chose a comma, although this is redundant since it is the default anyway.

We can see a problem with this import. Most entries have a space following the comma, which we don't want. This is easily fixed:

In [2]:
with open('topmovies.csv', newline='') as f:
    moviescsv = csv.reader(f, delimiter=',', skipinitialspace=True)

    for line in moviescsv:
        print(line)
    

['movie name', 'gross', 'IMDB rating', 'IMDB votes']
['Captain America: Civil War', '1153304495', '7.9', '418614']
['Rogue One', '1056057273', '7.9', '334132']
['Finding Dory', '1028570889', '7.4', '161213']
['Zootopia', '1023784195', '8.1', '302479']
['The Jungle Book', '966550600', '7.5', '200798']
['The Secret Life Of Pets', '875457937', '6.6', '124027']
['Batman v Superman', '873260194', '6.7', '479880']
['Deadpool', '783112979', '8.0', '637685']
['Suicide Squad', '745600054', '6.2', '404862']
['Sing', '632443719', '7.2', '66129']


The <code>csv.reader</code> object provides a large number of options and fixes to get the import just right. Another consideration with csvs is "what if the entry contains the delimiter?" For example, what if one of the movie titles in the file I imported contained a comma? For this, we use quotes: a delimiter that appears inside a quoted entry is treated as part of that entry rather than as a separator. Hence, the <code>csv</code> reader also allows us to specify which character is used for quoting, via the <code>quotechar</code> argument.
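To make the quoting rule concrete, here is a minimal sketch. It assumes a made-up two-line table held in memory with <code>io.StringIO</code> (standing in for a real file), and the movie title in it is invented purely for illustration.

```python
import csv
import io

# Made-up sample data: the first entry of the second row contains a comma,
# so it is wrapped in double quotes (the default quote character).
sample = 'movie name,gross\n"Some Movie, Part 2",12345\n'

rows = list(csv.reader(io.StringIO(sample)))
print(rows)

# The same table, but quoted with single quotes; we tell the reader
# about this through the quotechar argument.
sample_alt = "movie name,gross\n'Some Movie, Part 2',12345\n"
rows_alt = list(csv.reader(io.StringIO(sample_alt), quotechar="'"))
print(rows_alt)
```

In both cases the comma inside the quoted title stays part of the title, so each row still has exactly two entries.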

The <code>csv.reader</code> object has not created a copy of the csv file in Python's memory; it is still reading from the file. Therefore, the file must still be open when we access the lines via the reader. If we want to close the file, we should move the entries into another structure such as a list or dictionary:

In [3]:
with open('topmovies.csv', newline='') as f:
    moviescsv = csv.reader(f, delimiter=',', skipinitialspace=True)

    movielist = [line for line in moviescsv]

print(movielist)
    

[['movie name', 'gross', 'IMDB rating', 'IMDB votes'], ['Captain America: Civil War', '1153304495', '7.9', '418614'], ['Rogue One', '1056057273', '7.9', '334132'], ['Finding Dory', '1028570889', '7.4', '161213'], ['Zootopia', '1023784195', '8.1', '302479'], ['The Jungle Book', '966550600', '7.5', '200798'], ['The Secret Life Of Pets', '875457937', '6.6', '124027'], ['Batman v Superman', '873260194', '6.7', '479880'], ['Deadpool', '783112979', '8.0', '637685'], ['Suicide Squad', '745600054', '6.2', '404862'], ['Sing', '632443719', '7.2', '66129']]


This is, of course, a rather crude demonstration (in fact it breaks our rule of avoiding lists of mixed types); we'd likely want to be more organized with how we structure the data we obtain from a csv, but that will depend on your individual needs.
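As one possible way of being more organized, the <code>csv</code> module also offers <code>csv.DictReader</code>, which uses the header row as dictionary keys. The sketch below is only an illustration: the table is a made-up sample held in memory, and the choice of <code>int</code> and <code>float</code> for each column is an assumption about what we'd want.

```python
import csv
import io

# Made-up sample in the same layout as the movie file (note the spaces after commas).
sample = "movie name, gross, rating\nFilm A, 100, 7.5\nFilm B, 50, 6.0\n"

# DictReader reads the first row as field names and yields one dict per row.
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)

movies = [
    {
        "name": row["movie name"],
        "gross": int(row["gross"]),       # whole number of dollars
        "rating": float(row["rating"]),   # decimal score
    }
    for row in reader
]

print(movies)
```

Each movie is now a dictionary with named fields, so we no longer have to remember which position in the list holds which value.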

Writing to a csv is basically just as easy, and follows practically the same format. Let's put the <code>movielist</code> into a new csv, with different delimiters, and quotation marks around non-numeric data (this will make it easier to import in the future, since with this style, the <code>reader</code> can distinguish between numbers and strings when it reads the file). First, we should actually convert the numeric data into numbers:

In [16]:
for row in movielist:
    for i, entry in enumerate(row):
        try:
            row[i] = float(entry)
        except ValueError:
            pass

print(movielist)

[['movie name', 'gross', 'IMDB rating', 'IMDB votes'], ['Captain America: Civil War', 1153304495.0, 7.9, 418614.0], ['Rogue One', 1056057273.0, 7.9, 334132.0], ['Finding Dory', 1028570889.0, 7.4, 161213.0], ['Zootopia', 1023784195.0, 8.1, 302479.0], ['The Jungle Book', 966550600.0, 7.5, 200798.0], ['The Secret Life Of Pets', 875457937.0, 6.6, 124027.0], ['Batman v Superman', 873260194.0, 6.7, 479880.0], ['Deadpool', 783112979.0, 8.0, 637685.0], ['Suicide Squad', 745600054.0, 6.2, 404862.0], ['Sing', 632443719.0, 7.2, 66129.0]]


Exactly how this code works will be explained in a future article, but try to guess what it does anyway! (Clue: <code>float("movie name")</code> raises a <code>ValueError</code>. What do <code>try</code> and <code>except</code> do?)

Now that our table is in a nicer format, let's write it to a csv.

In [8]:
with open("movies-new.csv", "w", newline='') as f:
    # creates a "writer" that is ready to write whatever we give to it
    moviecsv = csv.writer(f, delimiter='|', quotechar="`", quoting=csv.QUOTE_NONNUMERIC)
    
    # give the list to the writer
    moviecsv.writerows(movielist)

The resulting file on my computer looks like this:

Each argument here should be fairly self-explanatory. Rows can also be written one at a time with the writer's <code>writerow(row)</code> method (here, <code>moviecsv.writerow(row)</code>), instead of providing a full list all at once.
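Here is a small sketch of writing row by row. It writes to an in-memory buffer instead of a real file so the result can be inspected directly, and the rows are made up rather than taken from the movie data. It then reads the text back with the same options, to show how <code>csv.QUOTE_NONNUMERIC</code> lets the reader recover the numbers.

```python
import csv
import io

# Made-up rows: a header and one data row with numeric entries.
rows = [
    ["movie name", "gross", "IMDB rating"],
    ["Film A", 100.0, 7.5],
]

buffer = io.StringIO()  # in-memory stand-in for a file opened with newline=''
writer = csv.writer(buffer, delimiter="|", quotechar="`", quoting=csv.QUOTE_NONNUMERIC)

# Write one row at a time instead of calling writerows once.
for row in rows:
    writer.writerow(row)

text = buffer.getvalue()
print(text)

# Reading back with the same options: unquoted entries are converted to floats.
read_back = list(
    csv.reader(io.StringIO(text), delimiter="|", quotechar="`", quoting=csv.QUOTE_NONNUMERIC)
)
print(read_back)
```

Strings are wrapped in the chosen quote character and numbers are left bare, which is what allows the reader to tell the two apart.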

## JSON

JSON stands for "JavaScript Object Notation". JavaScript is a programming language used to make interactive and dynamic webpages. This means there needs to be a way for a JavaScript script in a webpage to send and receive bundles of data without reloading the whole page. The JSON file format is one such solution. However, JSON files can be used to store many kinds of data, and are commonly encountered when grabbing information from the web. In fact, Jupyter notebook files such as this very article are really just JSON files with a different suffix; try opening one in your text editor to see for yourself!

Thankfully for us, JSON files are easy for any Python programmer to understand. Here is an example of a JSON file containing an Amazon review for a piece of audio equipment (and yes, Amazon reviews really are sent over to your computer as JSON - in fact most of the things you see on any given Amazon page are):

Look familiar? It's basically just a dictionary. This is a good reason for Python programmers to know about JSON files, as it means we can save the data in our dictionaries in an external file, so long as the values are of a suitable type (strings, numbers, etc).

Of course, it's not precisely a dictionary. There are some translations that need to be made from the JSON format (and terminology). The dictionary-like structure of a JSON file is called a JSON object. Since objects can have objects as their values, this gives JSON objects a tree-like structure. Booleans in JSON files use all lower case (so Python's <code>True</code> is <code>true</code> in JSON). Lists and tuples are rolled into one type called an array, and all floats and ints are rolled into one type called a number. Let's see how we can import JSONs, using the <code>json</code> module, which does all the necessary translations for us (phew!).

The main functions of the module are <code>dump</code> and <code>dumps</code> (for writing JSON), and <code>load</code> and <code>loads</code> (for reading JSON). The <code>s</code> appearing in two of these functions simply stands for string: <code>dumps</code> and <code>loads</code> work with strings, while <code>dump</code> and <code>load</code> work with file objects. We'll see the difference in a second.
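Before turning to real files, the translations described above can be seen in a tiny sketch using the string versions of the functions. The JSON text and the dictionary here are made up for illustration.

```python
import json

# A made-up JSON object written as a string: note the lower-case true,
# the array in square brackets, and null for a missing value.
text = '{"name": "pop filter", "in_stock": true, "ratings": [5, 4.5], "notes": null}'

obj = json.loads(text)  # JSON text -> Python dictionary
print(obj)
print(type(obj["ratings"]))

# Going the other way, a tuple becomes a JSON array and True becomes true.
encoded = json.dumps({"pair": (1, 2), "ok": True})
print(encoded)
```

The JSON array comes back as a Python list, and since JSON has no separate tuple type, a tuple that is written out will also come back as a list when it is read again.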

In [10]:
import json
with open("Musical_Instruments_5.json", "r") as f:
    reviews = [json.loads(line) for line in f]

for review in reviews[:5]:
    print(review)

{'reviewTime': '02 28, 2014', 'reviewerID': 'A2IBPI20UZIR0U', 'reviewerName': 'cassandra tu "Yeah, well, that\'s just like, u...', 'overall': 5.0, 'summary': 'good', 'asin': '1384719342', 'helpful': [0, 0], 'reviewText': "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,", 'unixReviewTime': 1393545600}
{'reviewTime': '03 16, 2013', 'reviewerID': 'A14VAT5EAX3D9S', 'reviewerName': 'Jake', 'overall': 5.0, 'summary': 'Jake', 'asin': '1384719342', 'helpful': [13, 14], 'reviewText': "The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake

Let's break it down. As usual, we open the file using <code>open</code> to give us a file object. This file contains a different JSON object on each line. Because we can read through a file line-by-line, and those lines are read <i>as text</i>, we use <code>loads</code> because each line is a string.

Once we have loaded it, each JSON object is now precisely a dictionary:

In [17]:
print(type(reviews[0]))
print("This person thought the product they purchased was: {}".format(reviews[0]["summary"]))

<class 'dict'>
This person thought the product they purchased was: good


This strategy is very good when we have a file containing lots of JSON objects. We read each object (line) as a string and then pass it to <code>loads</code> to convert it. If we have a file containing only a single JSON object, we can pass the whole file to <code>load</code> instead:

In [12]:
with open("book_review.json") as f:
    bookreview = json.load(f)

print(bookreview["reviewText"])

Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!


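Writing JSON objects to files works in much the same way: <code>dump</code> writes an object to an open file, while <code>dumps</code> returns it as a string.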
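As a minimal sketch of both approaches, the code below writes a made-up review dictionary to a temporary directory in two ways: as a single JSON object with <code>dump</code>, and as one object per line with <code>dumps</code>. The file names and the review contents are assumptions chosen for the example. It then reads both files back to check that nothing was lost.

```python
import json
import os
import tempfile

# Made-up review data, loosely modelled on the fields seen earlier.
review = {"summary": "good", "overall": 5.0, "helpful": [0, 0]}

with tempfile.TemporaryDirectory() as tmp:
    # One JSON object in the whole file: hand the file object to dump.
    single_path = os.path.join(tmp, "single_review.json")
    with open(single_path, "w") as f:
        json.dump(review, f)

    # One JSON object per line: turn each into a string with dumps.
    lines_path = os.path.join(tmp, "many_reviews.json")
    with open(lines_path, "w") as f:
        for r in [review, review]:
            f.write(json.dumps(r) + "\n")

    # Read both files back, mirroring the load / loads usage above.
    with open(single_path) as f:
        loaded = json.load(f)
    with open(lines_path) as f:
        loaded_lines = [json.loads(line) for line in f]

print(loaded == review)
print(len(loaded_lines))
```

The one-object-per-line layout is the same shape as the reviews file read earlier, which is why it pairs naturally with <code>loads</code>.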