# Operating system and files

Working with files is important when we are dealing with more data than we want to type into our code or ask a user to enter. It is also important when the output of a program should be saved and easily shared.

### Before we start... `import`
To interact with the operating system and read files, we need to import the module `os`:
```python
import os
```

In [None]:
import os
print(os.name)

If this prints `posix`, we are running a Unix-like OS (such as Linux or macOS). On Windows, `os.name` is `nt`.

In [None]:
import platform
print(platform.system())

## Paths
Paths identify locations (including of files) on a filesystem.

Examples:
- Windows path: `C:\Users\User\Documents\file.ext`
- Linux path: `/home/user/file`
- Mac path: `/Users/User/Documents/file.ext`


In Windows, most filenames have extensions and extensions are how the OS determines the file type.
In Linux, file type and extension are unrelated at a fundamental level but extensions are of help to the user and applications.

While it is technically possible to manipulate paths as strings, **don't do it**. It's messy, ugly and does not play well across different operating systems!

### Python paths, the old way


In [None]:
# Let's find out our current directory
base_dir = os.getcwd()
print(base_dir)

Now let's define a new directory...

In [None]:
new_dir = os.path.join(base_dir, 'work')
print(new_dir)

We have our path, let's create it!

In [None]:
# if we try to create a directory with an existing name, we get an error
if not os.path.exists(new_dir):
    os.makedirs(new_dir)

print(new_dir)

### Python paths using `pathlib`
We have noticed that, after all, we are still manipulating a path as a string. Can we do better?

In [None]:
from pathlib import Path

In [None]:
base_dir = Path(base_dir)
print(base_dir)
type(base_dir)

We can make a new directory with the `mkdir()` method.

In [None]:
new_dir = base_dir / "work"

new_dir.mkdir(exist_ok=True)  # no error if the directory already exists
print(new_dir)

- We can manipulate paths as objects.
- Better functionality in the form of class and instance methods of `Path`. We will see some examples in a moment.
- Useful `/` operator! 
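
As a minimal sketch of what "paths as objects" means in practice, the cell below builds a path with `/` and inspects its pieces. It uses `PurePosixPath`, a `pathlib` class that only handles the path text and never touches the disk, so the output is the same on every OS. The path itself is a made-up example.

```python
from pathlib import PurePosixPath

# A hypothetical path, built piece by piece with the / operator
p = PurePosixPath("/home/user") / "work" / "file1.dat"

print(p)                      # /home/user/work/file1.dat
print(p.parent)               # /home/user/work
print(p.parts)                # ('/', 'home', 'user', 'work', 'file1.dat')
print(p.with_suffix(".txt"))  # /home/user/work/file1.txt
```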

In [None]:
print(Path.home())

In [None]:
print(Path.cwd())

In [None]:
print(new_dir)

In [None]:
print(new_dir.exists())

In [None]:
second_dir = base_dir / "play"
print(second_dir.exists())

`Path.mkdir()` allows you to make a new directory. 

In [None]:
second_dir.mkdir(exist_ok=True)

In [None]:
file_path = second_dir / 'file1.dat'
file_path.touch()

In [None]:
print(file_path.is_file())
print(file_path.is_dir())
print(second_dir.is_dir())

In [None]:
file_path.name

In [None]:
file_path.stem

In [None]:
file_path.suffix

`glob()` allows you to search a directory for files that meet a condition.

In [None]:
print(new_dir)

In [None]:
for path in second_dir.glob("*.dat"):
    print(path)

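If you want to move or delete a file, you can use `replace()` and `unlink()` respectively. An empty directory is removed with `rmdir()`. But be careful! `replace()` silently overwrites an existing target, and deleted files do not go to a recycle bin.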
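
A minimal sketch of moving and deleting, run inside a temporary directory so that nothing in your working directory is touched. The file and directory names are placeholders, and `rmdir()` is included to show how an empty directory is removed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    src = tmp_dir / "old_name.dat"
    dst = tmp_dir / "new_name.dat"
    src.touch()

    src.replace(dst)  # move/rename; an existing dst would be overwritten
    moved = dst.exists() and not src.exists()

    dst.unlink()      # delete the file
    deleted = not dst.exists()

    empty_dir = tmp_dir / "empty"
    empty_dir.mkdir()
    empty_dir.rmdir()  # only works on empty directories
    dir_removed = not empty_dir.exists()

print(moved, deleted, dir_removed)  # True True True
```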

## Preliminaries: what is a file?
A file generally has a
- Header
- Data
- End of file (EOF)

Types of files
- Text files
- Buffered binary files
- Raw binary files (not typically used)

## First file: a text file

In [None]:
# first, some data
names = ["NGC 5128", "TXS 0506+056", "NGC 1068", "GB6 J1040+0617", "TXS 2226-184"]
distances = [3.7, 1.75e3, 14.4, 1.51e4, 107.1]  # Mpc
luminosities = [1e40, 3e46, 4.9e38, 6.2e45, 5.5e41] # erg/s

dataset = { 'names' : names, 'distances' : distances, 'luminosities' : luminosities }

### Opening, writing and reading files

In [None]:
filename = 'galaxy_names.dat'
filepath = new_dir / filename

with open(filepath, 'w') as f:
    for string in names:
        f.write(string + '\n')

Note that `write` writes a single **string**. You can also use `writelines`, which writes a **sequence** of strings (without adding line endings).

You should close files after you are done with them. This is done automatically with the `with` syntax.
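
A small sketch of both points, assuming a throwaway file in a temporary directory: `writelines` takes an iterable of strings and adds no line endings itself, and the file is already closed once the `with` block ends.

```python
import tempfile
from pathlib import Path

lines = ["NGC 5128", "NGC 1068"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.dat"  # placeholder file name
    with open(path, "w") as f:
        # writelines() does not add "\n", so we append it ourselves
        f.writelines(name + "\n" for name in lines)
    is_closed = f.closed  # the with block has closed the file for us
    content = path.read_text()

print(is_closed)      # True
print(repr(content))  # 'NGC 5128\nNGC 1068\n'
```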

In [None]:
with open(filepath, 'r') as f:
    data = f.read()

print(data)

If the file is really big, this is not ideal because the entire file content is loaded into a single variable (in memory). It is better to read line by line:

In [None]:
with open(filepath, 'r') as f:
    for line in f:
        print(line, end='')  # each line already ends with '\n'

# you can also use f.readline() to read one line at a time

By default, the file is opened for reading in text mode (mode `r`), meaning that:
- everything is read as characters (`str`), and only strings can be written when a writing mode is used;
- the file can only be read; its contents cannot be modified.

Other options:
- `w` - write (an existing file is overwritten)
- `a` - append to the end of the file
- `rb` - read in binary mode
- `wb` - write in binary mode
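
The effect of the most common modes can be sketched as follows, again using a throwaway file in a temporary directory (the file name and contents are placeholders).

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "modes.dat"

    with open(path, "w") as f:   # create (or truncate) and write text
        f.write("first\n")
    with open(path, "w") as f:   # "w" again: the previous content is lost
        f.write("second\n")
    with open(path, "a") as f:   # "a": add to the end instead
        f.write("third\n")

    with open(path, "r") as f:   # text mode returns str
        text = f.read()
    with open(path, "rb") as f:  # binary mode returns bytes
        raw = f.read()

print(repr(text))  # 'second\nthird\n'
print(type(raw))   # <class 'bytes'>
```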

### Other reading and writing options

It is possible to read from one file and write to another at the same time!

In [None]:
filepath_r = new_dir / 'galaxy_names_reversed.dat'
with open(filepath, 'r') as reader, open(filepath_r, 'w') as writer:
    galaxy_names = reader.readlines()
    writer.writelines(reversed(galaxy_names))

### Appending to a file

In [None]:
with open(filepath, 'a') as a_writer:
    a_writer.write('M87\n')

### Line endings
There are unfortunately different ways of signaling the end of a line.
- `\r\n` used by Windows
- `\n` used by Unix and Mac
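
In text mode, Python hides most of this difference. As a sketch, the `newline` argument of `open()` can force Windows-style endings on any platform, and reading the file back in the default text mode translates them to `\n` again. The file name is a placeholder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "windows_style.dat"

    # newline="\r\n": every "\n" we write is stored as "\r\n"
    with open(path, "w", newline="\r\n") as f:
        f.write("NGC 5128\nNGC 1068\n")

    raw = path.read_bytes()     # what is actually stored on disk
    with open(path, "r") as f:  # default text mode maps "\r\n" back to "\n"
        text = f.read()

print(raw)         # b'NGC 5128\r\nNGC 1068\r\n'
print(repr(text))  # 'NGC 5128\nNGC 1068\n'
```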

### Character encodings
**Encoding** refers to the mapping between characters and bytes. Bytes are integers with values between 0 and 255. Data is stored as bytes, which are read in sequence when the file is accessed and translated to characters according to the encoding. ASCII and Unicode encodings (such as UTF-8) are both common. ASCII is a (small) subset of Unicode and can only represent 128 characters (compared to more than 1 million). What can happen if you try to parse a Unicode-formatted file as ASCII?
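
One possible answer, sketched with an in-memory string instead of a file: characters outside the ASCII range take more than one byte in UTF-8, and decoding those bytes as ASCII raises an error. The example word is arbitrary.

```python
text = "Ångström"
raw = text.encode("utf-8")  # str -> bytes

print(len(text), len(raw))  # 8 10 (Å and ö take two bytes each)

try:
    raw.decode("ascii")     # bytes above 127 are not valid ASCII
except UnicodeDecodeError as err:
    decode_failed = True
    print("ASCII decoding failed:", err.reason)
else:
    decode_failed = False

# errors="replace" avoids the exception but loses information
lossy = raw.decode("ascii", errors="replace")
print(lossy)
```

The same applies to files: `open(path, encoding="utf-8")` states the encoding explicitly instead of relying on the platform default.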

## Serialization and deserialization
**Serialization** means to take an object and transform it into a stream of bytes, for storage or transmission. **Deserialization** means to reconstruct the object from the stream of bytes.

## Binary files
- Writing binary content by hand is complicated and messy.
- In `python` we can use `pickle` to dump an arbitrary object into a file.

When is `pickle` useful?

`pickle` has four main functions:
- `dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)`
- `dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)`
- `load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)`
- `loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)`
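
The difference between the file-based and the `s`-suffixed functions can be sketched without touching the disk: `dumps` returns a `bytes` object and `loads` rebuilds the object from it. The record below is a made-up example.

```python
import pickle

# hypothetical record, only for illustration
record = {"name": "NGC 1068", "distance": 14.4, "tags": ("a", "b")}

blob = pickle.dumps(record)    # serialize to a bytes object
restored = pickle.loads(blob)  # deserialize from bytes

print(type(blob))              # <class 'bytes'>
print(restored == record)      # True
print(type(restored["tags"]))  # <class 'tuple'>: Python types survive the round trip
```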

In [None]:
import pickle

filename = 'galaxy_binary.dat'
filepath = new_dir / filename

with open(filepath, 'wb') as f:
    pickle.dump(dataset, f)

In [None]:
with open(filepath, 'rb') as f:
    obj = pickle.load(f)

print(obj)

It works with basically any object (even your own classes), but it is also very opaque:
- `python` specific, no cross-language standard;
- basically you need to know in advance what's inside the file;
- writing and reading iteratively is possible but complicated.

The behavior depends strongly on the protocol version (new protocol versions have been added with new `python` versions).

## Using JSON
- JSON (JavaScript Object Notation) is a standard encoding format that allows multiple data types to be written to a text file.
- You can think of a JSON file as a big nested dictionary.
- Most `python` native data types can be translated to JSON objects.
    - `dict` -> `object`
    - `list, tuple` -> `array`
    - `str` -> `string`
    - `int, float` -> `number`
    - `True` -> `true`
    - `False` -> `false`
    - `None` -> `null`
 
Serialization methods are `dump()`, which writes the data to a file in JSON format, and `dumps()`, which returns a string of the data in JSON format.

Deserialization methods are `load()`, which loads a file in JSON format, and `loads()`, which loads a string of the data in JSON format.

In [None]:
import json

filename = 'galaxy_json.dat'
filepath = new_dir / filename

with open(filepath, 'w') as f:
    json_data = json.dumps(dataset) # dumps() returns a string
    json.dump(dataset, f) # dump() writes to file!

In [None]:
print(json_data)
type(json_data)

- It looks like Python syntax, but this is JSON.
- The file is human-readable!

In [None]:
with open(filepath, 'r') as f:
    obj = json.load(f)

print(obj)
type(obj) # original type is restored!

### Additional options

In [None]:
# no file is needed here: dumps() only builds strings
json_data1 = json.dumps(dataset)
json_data2 = json.dumps(dataset, indent=4)  # indent=4 pretty-prints the output

In [None]:
print(json_data1)
print(json_data2)

### Lost in translation
One has to be a bit careful due to the inexact mapping between python types and JSON objects.

In [None]:
simple_tuple = (1,2,3)
encoded_tuple = json.dumps(simple_tuple)
decoded_tuple = json.loads(encoded_tuple)

In [None]:
simple_tuple == decoded_tuple

In [None]:
type(simple_tuple)

In [None]:
type(decoded_tuple)

In [None]:
simple_tuple == tuple(decoded_tuple)

In [None]:
new_tuple = tuple(decoded_tuple)

In [None]:
simple_tuple == new_tuple
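
Tuples are not the only case. As a further sketch, JSON object keys are always strings, so integer dictionary keys come back as `str` and have to be converted by hand. The dictionary below is a made-up example.

```python
import json

by_id = {1: "NGC 5128", 2: "NGC 1068"}  # integer keys

decoded = json.loads(json.dumps(by_id))
print(decoded)           # {'1': 'NGC 5128', '2': 'NGC 1068'}
print(decoded == by_id)  # False

# convert the keys back explicitly
restored = {int(key): value for key, value in decoded.items()}
print(restored == by_id)  # True
```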

### Custom types

You can write objects from your own classes to JSON; however, they first need to be broken down into types that JSON supports.
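
One way to do this, as a sketch rather than the only approach, is to pass a function through the `default` argument of `json.dumps()` and another through the `object_hook` argument of `json.loads()`. The `Galaxy` class, the two helper functions and the `"__galaxy__"` marker key are all hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Galaxy:  # hypothetical class, for illustration only
    name: str
    distance: float  # Mpc


def encode_galaxy(obj):
    """Called by json.dumps() for objects it cannot serialize itself."""
    if isinstance(obj, Galaxy):
        return {"__galaxy__": True, **asdict(obj)}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def decode_galaxy(d):
    """Called by json.loads() for every decoded JSON object (dict)."""
    if d.get("__galaxy__"):
        return Galaxy(name=d["name"], distance=d["distance"])
    return d


galaxy = Galaxy("NGC 1068", 14.4)
encoded = json.dumps(galaxy, default=encode_galaxy)
decoded = json.loads(encoded, object_hook=decode_galaxy)

print(encoded)  # {"__galaxy__": true, "name": "NGC 1068", "distance": 14.4}
print(decoded == galaxy)  # True
```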

## CSV
CSV is an acronym for "comma-separated values"; it is the format of choice for tabular data. A CSV file consists of lines (entries) in which different values (fields) are separated, usually by commas. The content is text in ASCII or Unicode format.

Fields can also be separated by tab (`\t`), colon (`:`) and semicolon (`;`) characters.
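
A short sketch of reading semicolon-separated data with the `delimiter` argument of `csv.reader`. To keep it self-contained, the data lives in an in-memory `io.StringIO` buffer instead of a file. Note that every field is returned as a string.

```python
import csv
import io

# in-memory stand-in for a semicolon-separated file
text = "name;distance\nNGC 5128;3.7\nNGC 1068;14.4\n"

rows = list(csv.reader(io.StringIO(text), delimiter=";"))
print(rows)
# [['name', 'distance'], ['NGC 5128', '3.7'], ['NGC 1068', '14.4']]

# numbers have to be converted explicitly
distances_mpc = [float(row[1]) for row in rows[1:]]
print(distances_mpc)  # [3.7, 14.4]
```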

In [None]:
import csv

### Reading CSV files 
This can be done with the `reader` object.

In [None]:
filename = 'galaxy_csvformat.dat'
filepath = new_dir / filename

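# this file is not created in this notebook: it must already exist in the
# work directory, with a header row followed by name, distance, luminosity
with open(filepath, 'r', newline='') as f:
    csv_reader = csv.reader(f, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # the first row holds the column names
            print(f'Column names are {", ".join(row)}')
        else:
            print(f'\t{row[0]} is located at distance {row[1]} Mpc, and has luminosity {row[2]} erg/s.')
        line_count += 1
    print(f'Processed {line_count} lines.')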

### Writing CSV files
This can be done with the `writer` object and `writerow` method.

In [None]:
filename = 'galaxy_file.csv'
filepath = new_dir / filename

# newline='' lets the csv module control line endings itself
with open(filepath, mode='w', newline='') as f:
    galaxy_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    
    for name, dist, lum in zip(names, distances, luminosities):
        galaxy_writer.writerow([name, dist, lum])

### Reading/writing CSV files with dictionaries

In [None]:
filename = 'galaxy_dict_file.csv'
filepath = new_dir / filename

with open(filepath, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=["name","distance", "luminosity"])
    for name, dist, lum in zip(names, distances, luminosities):
        writer.writerow({"name": name, "distance": dist, "luminosity": lum})

In [None]:
with open(filepath, 'r', newline='') as f:
    reader = csv.DictReader(f, fieldnames=["name","distance", "luminosity"])
    for row in reader:
        print(row)
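
As an alternative sketch, `DictWriter.writeheader()` stores the field names as the first row, and `DictReader` then picks them up automatically when `fieldnames` is omitted. An in-memory `io.StringIO` buffer stands in for the file here.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for a file

writer = csv.DictWriter(buffer, fieldnames=["name", "distance"])
writer.writeheader()  # first row: name,distance
writer.writerow({"name": "NGC 5128", "distance": 3.7})

buffer.seek(0)  # rewind before reading
rows = list(csv.DictReader(buffer))  # field names come from the header row
print(rows)  # values are read back as strings
```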

## Final Note

We will cover working with ***large*** datasets in the last lecture of the course, on working with `pandas`.