# Data formats 1 - introduction

## [Download exercises zip](../_static/generated/formats.zip)

[Browse files online](https://github.com/DavidLeoni/softpython-en/tree/master/formats)

## Introduction

In these tutorials we will see how to load and write tabular data such as CSV files, and we will briefly mention tree-like data such as JSON files. We will also say a few words about open data catalogs and licenses (Creative Commons).


In these tutorials we will review the main data formats:

Textual formats

* Line files
* CSV (tabular data)
* JSON (tree-like data, brief mention only)

Binary formats (brief mention only)

* Excel sheets

We will also mention open data catalogs and licenses (Creative Commons).

### What to do

1. Unzip the exercises into a folder. You should get something like this:

```
formats
    formats1-lines.ipynb     
    formats1-lines-sol.ipynb
    formats2-csv.ipynb     
    formats2-csv-sol.ipynb    
    formats3-json.ipynb     
    formats3-json-sol.ipynb    
    formats4-chal.ipynb
    jupman.py
```

<div class="alert alert-warning">

**WARNING**: to display the notebook correctly, it MUST be in an unzipped folder!
</div>

2. Open Jupyter Notebook from that folder. Two things should open: first a console, then a browser. The browser should show a file list: navigate the list and open the notebook `formats/formats1-lines.ipynb`
3. Go on reading that notebook, and follow the instructions inside.


Shortcut keys:

- to execute Python code inside a Jupyter cell, press `Control + Enter`
- to execute Python code inside a Jupyter cell AND select the next cell, press `Shift + Enter`
- to execute Python code inside a Jupyter cell AND create a new cell afterwards, press `Alt + Enter`
- if the notebook looks stuck, try selecting `Kernel -> Restart`

## Line files

Line files are typically text files which contain information grouped by lines. An example using historical figures might look like the following:

```
Leonardo
da Vinci
Sandro
Botticelli
Niccolò 
Macchiavelli
```
We can immediately see a regularity: the first two lines contain the data of Leonardo da Vinci, first the name and then the surname. The following lines hold the data of Sandro Botticelli, again with the name first and then the surname, and so on.

We might want to write a program that reads the lines and prints names and surnames on the terminal, like this:

```
Leonardo da Vinci 
Sandro Botticelli
Niccolò Macchiavelli
```

To get a first approximation of the final result, we can open the file, read only the first line and print it:


In [1]:
with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    print(line)


Leonardo



What happened? Let's examine the first lines of the code:



### open command

The command

```python
open('people-simple.txt', encoding='utf-8')
```

allows us to open the text file by telling Python the file path `'people-simple.txt'` and the encoding in which it was written (`encoding='utf-8'`).

### The encoding

The encoding depends on the operating system and on the editor used to write the file. When we open a file, Python cannot guess the encoding, and if we do not specify one Python might open the file assuming an encoding different from the original. In other words, if we omit the encoding (or specify the wrong one) we might end up seeing weird characters (like little squares instead of accented letters).

In general, when you open a file, first try specifying the encoding `utf-8`, which is the most common one. If it doesn't work, try others: for example, for files written in southern Europe with Windows you might try `encoding='latin-1'`. If you open a file written elsewhere, you might need other encodings. For more in-depth information, you can read [Dive into Python - Chapter 4 - Strings](https://diveintopython3.problemsolving.io/strings.html), and [Dive into Python - Chapter 11 - File](https://diveintopython3.problemsolving.io/files.html), **both of which are highly recommended readings**.
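
To see why the encoding matters, here is a minimal sketch that does not touch any file: it encodes a string to bytes with one encoding and then decodes the same bytes with another. The variable names are only illustrative.

```python
# Encode a name containing an accented letter as UTF-8 bytes
text = 'Niccolò'
raw = text.encode('utf-8')

# Decoding with the same encoding gives back the original text
right = raw.decode('utf-8')

# Decoding with a different encoding raises no error,
# but silently produces wrong characters
wrong = raw.decode('latin-1')

print(right)   # Niccolò
print(wrong)   # NiccolÃ²
```

Notice that the wrong encoding does not necessarily cause an error, so it is worth looking at the accented letters whenever you open a new file.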

### with block

The `with` statement defines a block with instructions inside:

```python
with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    print(line)
```

We used `with` to tell Python that in any case, even if errors occur, once the instructions in the inner block (`line=f.readline()` and `print(line)`) have been executed, Python must automatically close the file. Properly closing a file avoids wasting system resources and creating hard-to-find paranormal errors. If you want to avoid hunting for never-closed zombie files, always remember to open all files in `with` blocks! Furthermore, at the end of the line, in the part `as f:`, we assigned the file to a variable called `f`, but we could have used any other name we liked.
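
As a rough sketch of what `with` does for us, the code below reads a line in two ways: once with a `with` block, and once with an explicit `try` / `finally` that closes the file by hand. To keep it self-contained it first writes a throwaway file `demo.txt` in a temporary folder; that file and the names `out`, `g`, `first` and `second` are assumptions made only for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a small throwaway file to read from
    path = os.path.join(folder, 'demo.txt')
    with open(path, 'w', encoding='utf-8') as out:
        out.write('Leonardo\n')

    # Version 1: the with block closes the file for us
    with open(path, encoding='utf-8') as f:
        first = f.readline()
    print(f.closed)    # True

    # Version 2: roughly the same thing, written by hand
    g = open(path, encoding='utf-8')
    try:
        second = g.readline()
    finally:
        g.close()      # executed even if readline raised an error
    print(g.closed)    # True
```

The second version is longer and easier to get wrong, which is why the `with` form is preferred.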


<div class="alert alert-warning">

**WARNING**: To indent the code, ALWAYS use sequences of four white spaces. Sequences of only 2 spaces, even if allowed, are not recommended.
</div>

<div class="alert alert-warning">

**WARNING**: Depending on the editor you use, by pressing TAB you might get a sequence of white spaces like it happens in Jupyter (4 spaces, which is the recommended length), or a special tabulation character (to avoid)! As annoying as this distinction might appear, remember it because it might generate very hard-to-find errors.

</div>

<div class="alert alert-warning">

**WARNING**: In the commands that create blocks such as `with`, always remember to put the colon character `:` at the end of the line!
</div>



The command

```
    line=f.readline()
```
puts the entire line in the variable `line`, as a string. Warning: the string will contain the special newline character at the end!

You might wonder where that `readline` comes from. Like everything in Python, our variable `f` which represents the file we just opened is an object, and like any object, depending on its type, it has particular methods we can use on it. In this case the method is `readline`. 
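
A quick way to see the hidden newline character is to print the string with `repr`. The sketch below assumes an in-memory `io.StringIO` object as a stand-in for the real file, so it can run without `people-simple.txt`; it offers the same `readline` method.

```python
import io

# In-memory stand-in for the first two lines of the file
f = io.StringIO('Leonardo\nda Vinci\n')

line = f.readline()

print(repr(line))   # 'Leonardo\n'
print(len(line))    # 9, that is 8 letters plus the newline character
```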

The following command prints the string content:

```python
    print(line) 
```


**✪ 1.1 EXERCISE**: Try to rewrite here the block we've just seen, and execute the cell by pressing Control-Enter. Type the code by hand, don't copy-paste! Pay attention to correct indentation with spaces in the block.

In [2]:
# write here

with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    print(line)


Leonardo




**✪ 1.2 EXERCISE**: you might be wondering what exactly that `f` is, and what exactly the method `readline` should be doing. When you find yourself in these situations, you can get help from the functions `type` and `help`. This time, directly copy and paste the same code here, but insert inside the `with` block the commands:

* `print(type(f))`
* `help(f)`
* `help(f.readline)`      # Attention: remember the `f.` before the `readline`!

Every time you add something, try to execute with Control+Enter and see what happens.

In [3]:
# write here the code (copy and paste)
with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    print(line)
    print(type(f))    
    help(f.readline)
    help(f)

Leonardo

<class '_io.TextIOWrapper'>
Help on built-in function readline:

readline(size=-1, /) method of _io.TextIOWrapper instance
    Read until newline or EOF.
    
    Returns an empty string if EOF is hit immediately.

Help on TextIOWrapper object:

class TextIOWrapper(_TextIOBase)
 |  TextIOWrapper(buffer, encoding=None, errors=None, newline=None, line_buffering=False, write_through=False)
 |  
 |  Character and line based layer over a BufferedIOBase object, buffer.
 |  
 |  encoding gives the name of the encoding that the stream will be
 |  decoded or encoded with. It defaults to locale.getpreferredencoding(False).
 |  
 |  errors determines the strictness of encoding and decoding (see
 |  help(codecs.Codec) or the documentation for codecs.register) and
 |  defaults to "strict".
 |  
 |  newline controls how line endings are handled. It can be None, '',
 |  '\n', '\r', and '\r\n'.  It works as follows:
 |  
 |  * On input, if newline is None, universal newlines mode is
 |    en



First we put the content of the first line into the variable `line`; now we can put it in a variable with a more meaningful name, like `name`. Also, we can directly read the next line into the variable `surname` and then print the concatenation of both:


In [4]:
with open('people-simple.txt', encoding='utf-8') as f:
    name=f.readline()
    surname=f.readline()
    print(name + ' ' + surname)


Leonardo
 da Vinci



**PROBLEM!** The output contains an unwanted line break. Why is that? As mentioned earlier, `readline` reads the line content into a string which also includes the special newline character at the end. To remove it, you can use the method `rstrip()`:


In [5]:
with open('people-simple.txt', encoding='utf-8') as f:
    name=f.readline().rstrip()
    surname=f.readline().rstrip()
    print(name + ' ' + surname)


Leonardo da Vinci


**✪ 1.3 EXERCISE**: Again, rewrite the block above in the cell below, and execute the cell with Control+Enter. Question: what happens if you use `strip()` instead of `rstrip()`? What about `lstrip()`? Can you deduce the meaning of `r` and `l`? If you can't manage it, try the Python command `help` by calling `help(str.rstrip)`

In [6]:
# write here

with open('people-simple.txt', encoding='utf-8') as f:
    name=f.readline().rstrip()
    surname=f.readline().rstrip()
    print(name + ' ' + surname)

Leonardo da Vinci



Very good, we have the first line! Now we can read all the lines in sequence. To this end, we can use a `while` loop:

In [7]:
with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    while line != "":                
        name = line.rstrip()
        surname=f.readline().rstrip()
        print(name + ' ' + surname)
        line=f.readline()

Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli


<div class="alert alert-info">

**NOTE**: In Python there are [shorter ways](https://thispointer.com/5-different-ways-to-read-a-file-line-by-line-in-python/) 
to read a text file line by line; we used this approach to make every step explicit.

</div>
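
As an illustration of one such shorter way, a file object can be used directly in a `for` loop, which yields one line at a time. This is only a sketch: it assumes an in-memory `io.StringIO` object in place of the real file, and the names `lines` and `people` are made up for the example.

```python
import io

# In-memory stand-in for people-simple.txt (first two people only)
f = io.StringIO('Leonardo\nda Vinci\nSandro\nBotticelli\n')

# Iterating over the file gives one line per round, newline included
lines = []
for line in f:
    lines.append(line.rstrip())

# Lines come in pairs: name at even positions, surname right after
people = []
for i in range(0, len(lines), 2):
    people.append(lines[i] + ' ' + lines[i + 1])

print(people)   # ['Leonardo da Vinci', 'Sandro Botticelli']
```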

What did we do? First, we added a `while` loop in a new block

<div class="alert alert-warning">

**WARNING**: In the new block, since it is already inside the outer `with`, the instructions are indented by 8 spaces and not 4! If you use the wrong number of spaces, bad things happen!

</div>

We first read a line, and two cases are possible: 

a. we are at the end of the file (or the file is empty): in this case the `readline()` call returns an empty string

b. we are not at the end of the file: the first line is put as a string inside the variable `line`. Since Python internally uses a pointer to keep track of the position we have reached while reading the file, after the read this pointer is moved to the beginning of the next line. This way the next call to `readline()` will read a line from the new position.
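
The two cases can be observed with a small sketch. It assumes an in-memory `io.StringIO` object holding just two lines instead of the real file; the variable names are illustrative.

```python
import io

# In-memory stand-in for a file with exactly two lines
f = io.StringIO('Leonardo\nda Vinci\n')

first = f.readline()    # reads the first line, position moves forward
second = f.readline()   # reads from the new position
third = f.readline()    # nothing left: we are at the end of the file

print(repr(first))      # 'Leonardo\n'
print(repr(second))     # 'da Vinci\n'
print(repr(third))      # ''
```

Only the end of the file gives an empty string: an empty line in the middle of a file would still be read as `'\n'`.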

In the `while` block we tell Python to continue the loop as long as `line` is _not_ empty. If this is the case, inside the `while` block we parse the name from the line and put it in the variable `name` (removing the extra newline character with `rstrip()` as we did before), then we proceed to read the next line and store the parsed result in the `surname` variable. Finally, we read a line again into the `line` variable so it will be ready for the next round of name extraction. If the line is empty the loop will terminate:


```python
while line != "":                   # enter cycle if line contains characters
    name = line.rstrip()            # parses the name
    surname=f.readline().rstrip()   # reads next line and parses surname
    print(name + ' ' + surname)     
    line=f.readline()               # read next line
```

**✪ 1.4 EXERCISE**: As before, rewrite the code with the `while` in the cell below, paying attention to the indentation (for the outer `with` line use copy-and-paste):

In [8]:
# write here the code of internal while

with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    while line != "":                
        name = line.rstrip()
        surname=f.readline().rstrip()
        print(name + ' ' + surname)
        line=f.readline()

Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli



## people-complex line file

Look at the file `people-complex.txt`:

```
name: Leonardo
surname: da Vinci
birthdate: 1452-04-15
name: Sandro
surname: Botticelli
birthdate: 1445-03-01
name: Niccolò 
surname: Macchiavelli
birthdate: 1469-05-03
```
Suppose you want to read the file and print the following output. How would you do it?

```
Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03
```


**Hint 1**: to obtain from the string `'abcde'` the substring `'cde'`, which starts at index 2, you can use the square brackets operator, with the index followed by a colon `:`


In [9]:
x = 'abcde'
x[2:]

'cde'

In [10]:
x[3:]

'de'

**Hint 2**: To know the length of a string, use the function `len`:

In [11]:
len('abcde')

5

**✪ 1.5 EXERCISE**: Write here the solution of the exercise 'People complex':

In [12]:
# write here  
        
with open('people-complex.txt', encoding='utf-8') as f:
    line=f.readline()
    while line != "":                
        name = line.rstrip()[len("name: "):]
        surname= f.readline().rstrip()[len("surname: "):]
        born = f.readline().rstrip()[len("birthdate: "):]
        print(name + ' ' + surname + ', ' + born)
        line=f.readline()        

Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03
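
As a possible alternative to slicing with `len`, each line can be split at the first `': '` with the string method `partition`. This is just a sketch on a single hard-coded line, not the solution used in this tutorial; the names `key`, `sep` and `value` are illustrative.

```python
# One line as it would be returned by readline()
line = 'birthdate: 1452-04-15\n'

# partition splits at the first occurrence of the separator and
# returns three pieces: before, separator, after
key, sep, value = line.rstrip().partition(': ')

print(key)     # birthdate
print(value)   # 1452-04-15
```

With this approach there is no need to know the field name in advance in order to cut it away.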



## Exercise - line file immersione-in-python-toc

✪✪✪ This exercise is more challenging; if you are a beginner you might skip it and go on to CSVs.

The book Dive into Python is nice, and for the Italian version there is a PDF, which has a problem though: if you try to print it, you will discover that the index is missing. Without despairing, we found a program to extract the titles into a file as follows, but you will discover it is not exactly nice to look at. Since we are Python ninjas, we decided to transform the raw titles into a [real table of contents](http://softpython.readthedocs.io/it/latest/_static/toc-immersione-in-python-3.txt). Sure enough there are smarter ways to do this, like loading the PDF in Python with an appropriate module for PDFs, but this still makes for an interesting exercise.

You are given the file `immersione-in-python-toc.txt`:

```
BookmarkBegin
BookmarkTitle: Il vostro primo programma Python
BookmarkLevel: 1
BookmarkPageNumber: 38
BookmarkBegin
BookmarkTitle: Immersione!
BookmarkLevel: 2
BookmarkPageNumber: 38
BookmarkBegin
BookmarkTitle: Dichiarare funzioni
BookmarkLevel: 2
BookmarkPageNumber: 41
BookmarkBegin
BookmarkTitle: Argomenti opzionali e con nome
BookmarkLevel: 3
BookmarkPageNumber: 42
BookmarkBegin
BookmarkTitle: Scrivere codice leggibile
BookmarkLevel: 2
BookmarkPageNumber: 44
BookmarkBegin
BookmarkTitle: Stringhe di documentazione
BookmarkLevel: 3
BookmarkPageNumber: 44
BookmarkBegin
BookmarkTitle: Il percorso di ricerca di import
BookmarkLevel: 2
BookmarkPageNumber: 46
BookmarkBegin
BookmarkTitle: Ogni cosa &#232; un oggetto
BookmarkLevel: 2
BookmarkPageNumber: 47
```

Write a Python program to print the following output:

```
   Il vostro primo programma Python  38
      Immersione!  38
      Dichiarare funzioni  41
         Argomenti opzionali e con nome  42
      Scrivere codice leggibile  44
         Stringhe di documentazione  44
      Il percorso di ricerca di import  46
      Ogni cosa è un oggetto  47
```

For this exercise, you will need to insert artificial spaces in the output, in a quantity determined by the `BookmarkLevel` rows.


**QUESTION**: what's that weird value `&#232;` at the end of the original file? Should we reproduce it in the output?

**HINT 1**: To convert a string into an integer number, use the function `int`:


In [13]:
x = '5'

In [14]:
x

'5'

In [15]:
int(x)

5


<div class="alert alert-warning">

**Warning**: `int(x)` returns a value, and never modifies the argument `x`!  
</div>


**HINT 2**: To substitute a substring in a string, you can use the method `.replace`:

In [16]:
x = 'abcde'
x.replace('cd', 'HELLO' )

'abHELLOe'

**HINT 3**: as long as there is only one sequence to substitute, `replace` is fine, but if we had a million horrible sequences like `&gt;`, `&#62;`, `&#x3e;`, what should we do? As good data cleaners, we recognize these are [HTML escape sequences](https://corsidia.com/materia/web-design/caratterispecialihtml), so we could use methods specific to such sequences, like [html.unescape](https://docs.python.org/3/library/html.html#html.unescape). Try it instead of `replace` and check if it works!


NOTE: Before using `html.unescape`, import the module `html` with the command: 

```python
import html
```
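
A minimal sketch of `html.unescape` on a few hard-coded strings (the variable names are only illustrative):

```python
import html

# A numeric escape sequence like the one found in the file
title = html.unescape('Ogni cosa &#232; un oggetto')

# Three different ways to write the same character '>'
signs = html.unescape('&gt; &#62; &#x3e;')

print(title)   # Ogni cosa è un oggetto
print(signs)   # > > >
```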

**HINT 4**: To write _n_ copies of a character, use `*` like this:

In [17]:
"b" * 3

'bbb'

In [18]:
"b" * 7

'bbbbbbb'

**IMPLEMENTATION**: Write here the solution for the line file `immersione-in-python-toc.txt`, and try to execute it by pressing Control + Enter:


In [19]:
# write here 

import html

with open("immersione-in-python-toc.txt", encoding='utf-8') as f:

    line=f.readline()  
    while line != "":
        line = f.readline().strip()
        title = html.unescape(line[len("BookmarkTitle: "):])
        line=f.readline().strip()
        level = int(line[len("BookmarkLevel: "):])
        line=f.readline().strip()
        page = line[len("BookmarkPageNumber: "):]
        print(("   " * level) + title + "  " + page)
        line=f.readline()

   Il vostro primo programma Python  38
      Immersione!  38
      Dichiarare funzioni  41
         Argomenti opzionali e con nome  42
      Scrivere codice leggibile  44
         Stringhe di documentazione  44
      Il percorso di ricerca di import  46
      Ogni cosa è un oggetto  47
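
The solution above relies on every entry having exactly four lines in a fixed order. As a hedged sketch of a variant that looks at the field names instead, the code below reads each line, splits it at the first `': '` and prints an entry when the page number arrives. It assumes an in-memory `io.StringIO` object with only two entries in place of the real file, and that `BookmarkPageNumber` is always the last field of an entry; the names `sample`, `toc` and `entry` are made up for the example.

```python
import html
import io

# In-memory stand-in for immersione-in-python-toc.txt (two entries only)
sample = io.StringIO(
    'BookmarkBegin\n'
    'BookmarkTitle: Il vostro primo programma Python\n'
    'BookmarkLevel: 1\n'
    'BookmarkPageNumber: 38\n'
    'BookmarkBegin\n'
    'BookmarkTitle: Ogni cosa &#232; un oggetto\n'
    'BookmarkLevel: 2\n'
    'BookmarkPageNumber: 47\n'
)

toc = []     # formatted lines of the table of contents
entry = {}   # fields collected so far for the current bookmark

for line in sample:
    key, sep, value = line.strip().partition(': ')
    if sep == '':
        continue   # marker lines such as BookmarkBegin carry no value
    entry[key] = html.unescape(value)
    if key == 'BookmarkPageNumber':   # assumed to close each entry
        indent = '   ' * int(entry['BookmarkLevel'])
        toc.append(indent + entry['BookmarkTitle'] + '  ' + entry['BookmarkPageNumber'])
        entry = {}

print('\n'.join(toc))
```

Because lines without a `': '` separator are simply skipped, this variant does not depend on the exact spelling of the marker lines.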




## Continue

Continue with [CSV tabular files](https://en.softpython.org/formats/formats2-csv-sol.html)