# Files are the main source of data

In most data science scripts and analysis workflows, data will enter via files. To be more precise: via text files.  
Fortunately, reading from a file is really simple in Python.

And if you have structured data in the form of csv, tsv, xml or Excel, the Python ecosystem provides a wealth of dedicated data reading functions. If you are going to work a lot with Excel-style data (data organized with examples in rows and variables in columns), it is recommended to have a look at the Pandas library (we'll have a peek at that at the end of this chapter).
In this chapter, however, we are going to check out the basics of file reading and writing, I/O for short.


Suppose we have some data file named `lengths.csv` which contains the body lengths (in centimeters) of a sample of male and female subjects:

```
1,m,180
2,m,188
3,f,178
4,f,182
5,f,172
6,m,189
```

This file, in *csv format* (for Comma-Separated Values), can be found [here](./data/lengths.csv) (at `./data/lengths.csv`).

To read this data in the most convenient way possible, we can read its contents in one operation:

```python
file = open("data/lengths.csv", "r")
data = file.read()
print(data)
```

```
1,m,180
2,m,188
3,f,178
4,f,182
5,f,172
6,m,189
```



The statement 
```python
file = open("data/lengths.csv", "r")
```
opens the file in read mode (the second argument is the `mode` argument, which defaults to `'r'`, so it could have been omitted). The function returns a stream, or handle, on the file, **not the actual contents yet**.  

Reading the contents happens with the `file.read()` method call, which returns the entire file as a single string.
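
The difference between the handle and the contents can be made visible with a small sketch. To stay self-contained it first writes a few sample rows to a temporary file; that temporary location and the variable names `sample`, `mode` and `rest` are assumptions for illustration, not part of the chapter's data folder.

```python
import pathlib
import tempfile

# Assumption: recreate some sample rows in a temporary folder so the sketch runs anywhere
sample = "1,m,180\n2,m,188\n3,f,178\n"
path = pathlib.Path(tempfile.mkdtemp()) / "lengths.csv"
path.write_text(sample)

file = open(path)        # no mode given, so the default "r" is used
mode = file.mode         # the handle knows how it was opened
data = file.read()       # only now are the contents read, as one string
rest = file.read()       # the stream is at its end, so nothing is left to read
file.close()

print(mode)
print(type(data).__name__)
print(repr(rest))
```

A second call to `read()` returns an empty string because the stream remembers its position in the file.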

## Iterating contents

Usually you want to iterate over file contents line by line, without the need to store them all in memory. This is done by applying a for-loop to the file stream:

```python
file = open("data/lengths.csv", "r")
for line in file:
    print(line.strip().split(','))  # of course you want to split the data into separate values
```

```
['1', 'm', '180']
['2', 'm', '188']
['3', 'f', '178']
['4', 'f', '182']
['5', 'f', '172']
['6', 'm', '189']
```


The file stream object returned by the `open()` function supports iteration. Note that line endings are data in the file and are included when reading the lines. To remove any leading and trailing whitespace we use the `strip()` method.  
To only remove whitespace characters at the end, use `rstrip()`, which takes an optional argument specifying which characters to strip off.
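
The effect of the different stripping calls can be compared on a single string. This is a minimal sketch; the value of `raw` is a made-up line with stray spaces, not a line from `lengths.csv`.

```python
raw = "  3,f,178 \n"                # hypothetical line with extra spaces and a newline

both = raw.strip()                  # leading and trailing whitespace removed
right = raw.rstrip()                # only trailing whitespace removed
newline_only = raw.rstrip("\n")     # only trailing newline characters removed

print(repr(both))
print(repr(right))
print(repr(newline_only))
```

Using `repr()` when printing makes the remaining spaces and newlines visible.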

## Closing files
It is good practice to close the file streams that you open. In read mode this is not essential, but in write mode it is, because buffered data may not reach the file until the stream is closed. You do this using the `close()` method. The above fragment is better written like this:

```python
file = open("data/lengths.csv", "r")
for line in file:
    print(line.strip())
file.close()            # explicitly closing resources is always a good idea
```

```
1,m,180
2,m,188
3,f,178
4,f,182
5,f,172
6,m,189
```
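
Whether a stream is still open can be checked through its `closed` attribute. The sketch below also wraps the loop in `try`/`finally`, which is one way to make sure `close()` runs even if processing a line fails. The temporary file and the names `before`, `after` and `lines` are assumptions for illustration.

```python
import pathlib
import tempfile

# Assumption: a small temporary copy of the data so the sketch is self-contained
path = pathlib.Path(tempfile.mkdtemp()) / "lengths.csv"
path.write_text("1,m,180\n2,m,188\n")

file = open(path, "r")
before = file.closed                 # the stream is open at this point
try:
    lines = [line.strip() for line in file]
finally:
    file.close()                     # runs whether or not the loop raised an error
after = file.closed                  # the stream is closed now

print(before, after)
print(lines)
```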


## The best way: using `with`

Because programmers forget to close their files all the time, the `with open(...)` syntax was introduced. It closes the file automatically when the block ends, even if an error occurs inside it. If you simply always use this form you will never go wrong.

```python
with open("data/lengths.csv", "r") as file:
    for line in file:
        print(line.strip())
# no need to close since that is assured by using with
```

```
1,m,180
2,m,188
3,f,178
4,f,182
5,f,172
6,m,189
```
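
Combining `with` and line-by-line iteration, the rows can be turned into typed values and summarized. The following is a sketch under assumptions: it writes the sample rows to a temporary file first, and the names `lengths`, `means` and `closed_after_with` are chosen here for illustration.

```python
import pathlib
import tempfile

# Assumption: recreate the sample rows in a temporary folder so the sketch runs anywhere
path = pathlib.Path(tempfile.mkdtemp()) / "lengths.csv"
path.write_text("1,m,180\n2,m,188\n3,f,178\n4,f,182\n5,f,172\n6,m,189\n")

lengths = {}                                   # maps sex to a list of lengths in cm
with open(path, "r") as file:
    for line in file:
        subject_id, sex, length = line.strip().split(",")
        lengths.setdefault(sex, []).append(int(length))   # convert text to a number

closed_after_with = file.closed                # the with block has closed the stream
means = {sex: sum(values) / len(values) for sex, values in lengths.items()}

print(closed_after_with)
print(means)
```

Everything read from a text file is a string, so the explicit `int()` conversion is needed before doing arithmetic.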


## Writing to file

To open a stream for writing you need to set the mode to one of these:  

- "w" (open for writing, truncating the file first)
- "a" (open for writing, appending to the end of the file if it exists, or creating it otherwise)

When writing to file, using the `with` syntax is the best way.

```python
my_data = ["Better safe\n", "than sorry\n"]  # note the newlines already present!

with open("data/saying.txt", "w") as sayings:
    for line in my_data:
        sayings.write(line)

# or, in one operation:
# with open("data/saying.txt", "w") as sayings:
#     sayings.writelines(my_data)
```

Both operations will result in a file with these contents:

```
Better safe
than sorry
```

And no matter how often the code is run, the file will end up with the same contents, because mode `"w"` truncates it first.
If the mode `"a"` had been used, the saying would be added to the file every time the code was run.
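
The difference between the two modes can be checked by writing the same lines twice with each of them. This is a minimal sketch: the helper `write_twice`, the file names and the two example lines are hypothetical, and a temporary folder is used so no existing file is touched.

```python
import pathlib
import tempfile

example_lines = ["first line\n", "second line\n"]   # hypothetical data
folder = pathlib.Path(tempfile.mkdtemp())

def write_twice(path, mode):
    """Write example_lines to path two times with the given mode; return the line count."""
    for _ in range(2):
        with open(path, mode) as out:
            out.writelines(example_lines)
    with open(path, "r") as result:
        return len(result.readlines())

truncated = write_twice(folder / "saying_w.txt", "w")   # second run overwrites the first
appended = write_twice(folder / "saying_a.txt", "a")    # second run adds to the first

print(truncated, appended)
```

With `"w"` the file holds two lines after both runs, while with `"a"` it holds four.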