# Files and Directories

Data rarely spends its entire lifespan loaded in memory as Python objects. The lifecycle of data in a scientific Python project usually involves one or more cycles that look like this:

1. load some data from a file into Python
2. process the data in Python
3. save the processed data to another file

In this notebook, we are going to cover the basics of steps 1 and 3.

### Be cautious (and a little afraid)

Before this section, any problems caused by bugs in your code were limited to that code: once you had detected and fixed the bugs, rerunning your code eliminated any further issues. Now that we are dealing with external files, programming errors may alter or delete those files. To reduce the chances of this happening:

1. Back up your data
2. Reduce cross-contamination by putting individual projects in their own directories
3. Give your files stereotyped names and locations
4. Try writing one file first. Then do the others.

### Paths

Directories (folders) and files on your hard drive(s) are organized in a tree structure. Nodes are directories and files, while edges indicate containment. As you navigate from directory to directory, you trace out a path through this tree. Paths can be represented as strings like this:

    "/drive_name/folder_a/file.format"
    
on Linux and like this:

    "drive_name:\\folder_a\\file.format"

on Windows. You can use representations like these to tell Python which locations ought to be read from and written to.

Dealing with paths will be substantially easier if you first import the <code>os</code> module:

In [2]:
import os

Now that you have done that, we can use one of the module's more useful features - automatic system-appropriate path construction:

<code>
os.path.join('name_1', 'name_2', 'name_3')
</code>

where the names might be the names of files or directories. Give it a shot:
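As a minimal sketch of what such an attempt might look like (the names are the same placeholders used above, not real files or directories):

```python
import os

# 'name_1', 'name_2' and 'name_3' are placeholder names, as in the text above.
rel_path = os.path.join('name_1', 'name_2', 'name_3')
print(rel_path)

# os.sep is the separator that os.path.join inserts on this system
# ('/' on Linux and macOS, '\\' on Windows).
print(repr(os.sep))
```

The printed path depends on your operating system, which is the point: the same line of code produces a path with the appropriate separator wherever it runs.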

You will notice that this path does not quite look like the ones from above - look at the left-hand side. This is because it is a relative path rather than an absolute one: it starts from some directory other than the root of your file system (or a drive). The directory that it is relative to is your Current Working Directory, which you can find with:

<code>
os.getcwd()
</code>

In [3]:
print(os.getcwd())

/local1/code/python_bootcamp/pythonbootcamp2016


You can always convert a relative path to an absolute path by using:

<code>
os.path.abspath(rel_path)
</code>

To find out if a path actually exists on your drive, use:


<code>
os.path.exists(path)
</code>
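The two calls above can be combined as in the following sketch. The relative path here is a hypothetical one built from placeholder names, so whether it exists depends on your machine:

```python
import os

# Hypothetical relative path built from placeholder names.
rel_path = os.path.join('name_1', 'name_2')

# abspath resolves the relative path against the current working directory.
abs_path = os.path.abspath(rel_path)
print(abs_path)
print(os.path.isabs(abs_path))

# The current working directory itself should exist ...
print(os.path.exists(os.getcwd()))

# ... while the placeholder path only exists if you happen to have created it.
print(os.path.exists(abs_path))
```

Note that <code>os.path.abspath</code> only manipulates the string; it does not check whether anything is actually present at that location, which is why <code>os.path.exists</code> is a separate call.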

To make a directory (or directories), use:

<code>
os.makedirs(directory_path)
</code>
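A cautious way to experiment with this, sketched below, is to work inside a temporary directory so that nothing you care about can be affected. The use of <code>tempfile.mkdtemp</code> and the directory names <code>level_1</code> and <code>level_2</code> are assumptions made for illustration only:

```python
import os
import tempfile

# Work inside a fresh temporary directory so nothing existing is touched.
base_dir = tempfile.mkdtemp()

# Hypothetical nested directory names, chosen for illustration.
nested_dir = os.path.join(base_dir, 'level_1', 'level_2')

# os.makedirs raises an error if the target directory already exists,
# so check before creating it.
if not os.path.exists(nested_dir):
    os.makedirs(nested_dir)

# Both the intermediate and the final directory were created in one call.
print(os.path.isdir(os.path.join(base_dir, 'level_1')))
print(os.path.isdir(nested_dir))
```

Checking with <code>os.path.exists</code> first keeps the cell safe to rerun, since a second call to <code>os.makedirs</code> on the same path would otherwise fail.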

In [4]:
# Try making a directory called e.g. 'io_stuff' (unless that already exists on your machine). Store the path in a variable.
# You can use this directory to do the remaining exercises in this section.

io_dir_path = #

### A basic text file

Now we are going to make a basic text file.

In [None]:
# First we must specify a path. If you define a variable called text_file_path, the code below should run just fine.
# Make sure that the resulting path is in io_dir_path and that it ends in 'some_file_name.txt'.

text_file_path = #

In [None]:
with open(text_file_path, 'w') as text_file:
    # This syntax makes a file object - text_file - in memory. This file's path is text_file_path.
    # Because we used 'w' as the second argument to open(), text_file is open for writing text.
    # The with statement is followed by an indented code block. When this block ends, text_file will be safely closed.

    text_file.write('Hello disk!')
    text_file.write('\n') # newline character
    text_file.write('line 2')

In [None]:
# Now we can read our text back out

with open(text_file_path, 'r') as text_file: # 'r' for read
    print(text_file.read()) # this reads the whole file into a string

# we can also read the lines from the file into a list
with open(text_file_path, 'r') as text_file:

    lines = text_file.readlines()
    print(lines)

### IOError

Try opening a nonexistent (check this!) text file for reading:
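Once you have seen the error, one way to deal with it is to catch it. The sketch below assumes a made-up file name, <code>no_such_file.txt</code>, placed in a new temporary directory so that it is certain not to exist. In Python 3 the error raised is a <code>FileNotFoundError</code>, which <code>except IOError</code> still catches:

```python
import os
import tempfile

# A path inside a brand-new temporary directory, so the file cannot exist yet.
# 'no_such_file.txt' is a hypothetical name used only for this illustration.
missing_path = os.path.join(tempfile.mkdtemp(), 'no_such_file.txt')
print(os.path.exists(missing_path))

try:
    with open(missing_path, 'r') as text_file:
        contents = text_file.read()
except IOError as error:
    # Opening a missing file for reading lands here instead of crashing.
    contents = None
    print('Could not open file: %s' % error)
```

Checking <code>os.path.exists</code> before opening is a reasonable alternative, but catching the error also covers the case where the file disappears between the check and the call to <code>open()</code>.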

### A slightly more advanced text file

CSV files are similar to the file types used by Excel and similar programs. CSV is an initialism for 'comma-separated values'. Basically, we are talking about a text file that has a tabular schema - a delimiter character divides columns while a newline character divides rows.

The easiest way to interact with CSVs is probably pandas.

In [None]:
import pandas as pd

In [None]:
# make a dataframe
data = pd.DataFrame([{'color': 'red', 'float': 8.7}, {'color': 'green', 'float': 4.29}])
print(data)

In [None]:
# write it to a file
csv_file_path = # you will need to specify the path
data.to_csv(csv_file_path)

In [None]:
# read it back out
read_data = pd.read_csv(csv_file_path, index_col=0) # the first column holds the row index
print(read_data)

Check out the resulting CSV in Excel, LibreOffice, or Numbers (on a Mac). What happens if you add another row and then read it back into Python?

### Exercise: Really just text

CSV files are really just text files with a particular data schema. They consist of values, which are divided into columns by a delimiter (such as a comma) and into rows by a different delimiter (such as a newline: \n). Such a file might look like:

a,b\nc,d

which would correspond to a matrix like:
<code>
a b
c d
</code>
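The correspondence between the text and the matrix can be sketched with plain string methods. This is only an illustration of the format applied to the example string above, not a full solution to the exercise, which still needs the file reading and writing:

```python
# The example text from above: commas divide columns, '\n' divides rows.
csv_text = 'a,b\nc,d'

# Split into rows first, then split each row into its values.
rows = csv_text.split('\n')
matrix = [row.split(',') for row in rows]
print(matrix)  # [['a', 'b'], ['c', 'd']]

# Going the other way: join values with commas, then rows with newlines.
rebuilt = '\n'.join(','.join(row) for row in matrix)
print(rebuilt == csv_text)  # True
```

Note that every value comes back as a string, so numerical data would need to be converted (for example with <code>float()</code>) after reading.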

Make a function that writes 2D arrays to CSV files as text. Make another function that reads them back. Use only the tools appropriate for text files.


### Exercise: Collation

1. Programmatically generate 50 numerical arrays. Try using numpy with <code>np.random.randn</code>
2. Create a directory. Save each array in a separate 1-row CSV file in that directory.
3. Read the data from each such file. Create a new CSV file in a separate directory that contains all of the data as rows.