# Input and Output

In [85]:
import numpy as np

There are many ways to write files in Python. The default way, while general, is not the most suitable for most scientific applications. In science, we tend to have very structured data, and we will take advantage of that to simplify our IO (input/output).

## The standard way

Here, we first open a _file handle_, then write some data to it, and finally close it. We will not use this approach again for the rest of this course.

In [86]:
# Open a file called "03_standard_way.dat", where we will write some data
f = open("03_standard_way.dat", "w") # "w" means we are going to *write* the file. Use "r" to read and "a" to append

In [87]:
a = np.linspace(0, 1) #Generate some data 
for x in a:
    f.write(str(x) + "\n") # We need to convert the number into a string. We will also add an "end-of-line" character

In [88]:
f.close() # The file needs to be closed to guarantee all data has been written
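
As a small aside, the mode string decides what `open` does with an existing file. The sketch below writes with `"w"`, adds a line with `"a"`, and reads everything back with `"r"`. The file name `03_modes_demo.dat` is made up for illustration and placed in a temporary directory.

```python
import os
import tempfile

# Hypothetical file name, placed in a temporary directory to avoid clutter
path = os.path.join(tempfile.mkdtemp(), "03_modes_demo.dat")

f = open(path, "w")   # "w": create the file, or erase it if it already exists
f.write("first line\n")
f.close()

f = open(path, "a")   # "a": keep the existing content and append at the end
f.write("second line\n")
f.close()

f = open(path, "r")   # "r": read only; fails if the file does not exist
lines = f.readlines()
f.close()

print(lines)
```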

Now, let's read this back into our program. We will first read the whole file into a list of strings, and then convert that data into floats.

In [89]:
f_in = open("03_standard_way.dat", "r")
a = f_in.readlines()
print(a)

['0.0\n', '0.02040816326530612\n', '0.04081632653061224\n', '0.061224489795918366\n', '0.08163265306122448\n', '0.1020408163265306\n', '0.12244897959183673\n', '0.14285714285714285\n', '0.16326530612244897\n', '0.18367346938775508\n', '0.2040816326530612\n', '0.22448979591836732\n', '0.24489795918367346\n', '0.26530612244897955\n', '0.2857142857142857\n', '0.3061224489795918\n', '0.32653061224489793\n', '0.3469387755102041\n', '0.36734693877551017\n', '0.3877551020408163\n', '0.4081632653061224\n', '0.42857142857142855\n', '0.44897959183673464\n', '0.4693877551020408\n', '0.4897959183673469\n', '0.5102040816326531\n', '0.5306122448979591\n', '0.5510204081632653\n', '0.5714285714285714\n', '0.5918367346938775\n', '0.6122448979591836\n', '0.6326530612244897\n', '0.6530612244897959\n', '0.673469387755102\n', '0.6938775510204082\n', '0.7142857142857142\n', '0.7346938775510203\n', '0.7551020408163265\n', '0.7755102040816326\n', '0.7959183673469387\n', '0.8163265306122448\n', '0.836734693877

Currently, the data is stored as strings. Normally we want numbers instead, so we have some converting to do!

In [90]:
a_float = []
for x in a:
    a_float.append(float(x)) #Convert x to float and then add to the new list
    
a_float = np.array(a_float) #Convert a_float into a numpy array
print(a_float)

[0.         0.02040816 0.04081633 0.06122449 0.08163265 0.10204082
 0.12244898 0.14285714 0.16326531 0.18367347 0.20408163 0.2244898
 0.24489796 0.26530612 0.28571429 0.30612245 0.32653061 0.34693878
 0.36734694 0.3877551  0.40816327 0.42857143 0.44897959 0.46938776
 0.48979592 0.51020408 0.53061224 0.55102041 0.57142857 0.59183673
 0.6122449  0.63265306 0.65306122 0.67346939 0.69387755 0.71428571
 0.73469388 0.75510204 0.7755102  0.79591837 0.81632653 0.83673469
 0.85714286 0.87755102 0.89795918 0.91836735 0.93877551 0.95918367
 0.97959184 1.        ]
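
The conversion loop can be written more compactly. As a sketch, with a short hand-made list `lines` standing in for the output of `readlines()`: `float` already ignores the trailing `"\n"`, so a list comprehension is enough.

```python
import numpy as np

# Stand-in for the list returned by f_in.readlines()
lines = ["0.0\n", "0.25\n", "0.5\n", "1.0\n"]

# float() ignores surrounding whitespace, including the trailing "\n"
values = np.array([float(s) for s in lines])
print(values)
```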


### Task: Write a cell that writes a matrix

1. Generate a matrix using `m = np.random.rand(5,6)`
2. Open a file called `03_standard_matrix.dat`
3. Write a loop to write a single row of the matrix. Remember to add an empty space (" ") between entries!
4. Write an external loop to finish each line with an end-of-line ("\n")
5. Close the file



In [91]:
m = np.random.rand(5,6)
f = open("03_standard_matrix.dat", "w")
for i in range(5):
    for j in range(6):
        f.write(str(m[i,j]) + " ")
    f.write("\n")
f.close()

**Bonus task**: Read the matrix back

#### Bonus content:

As this open-use-close structure is so common, Python has the `with` statement, which simplifies this pattern a bit.

In [92]:
m = np.random.rand(5,6)
with open("03_standard_matrix.dat", "w") as f:
    for i in range(5):
        for j in range(6):
            f.write(str(m[i,j]) + " ")
        f.write("\n")

Now, the file usage is clearly local, and it is closed automatically! Slightly better, but still a lot of work.
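
The same pattern works for reading, which also gives one possible answer to the bonus task above. This is only a sketch: it writes a small matrix first so that it runs on its own, uses a temporary directory, and relies on `str.split()` to cut each line at the spaces.

```python
import os
import tempfile
import numpy as np

m = np.random.rand(5, 6)
path = os.path.join(tempfile.mkdtemp(), "03_standard_matrix.dat")

# Write the matrix, one row per line, entries separated by spaces
with open(path, "w") as f:
    for row in m:
        for value in row:
            f.write(str(float(value)) + " ")
        f.write("\n")

# Read it back: each line becomes one row of floats
rows = []
with open(path, "r") as f:
    for line in f:
        rows.append([float(entry) for entry in line.split()])

m_back = np.array(rows)
print(m_back.shape)
```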


----- End of bonus content -----

That was quite a lot of work to write and read a file! We can greatly simplify this by using the Numpy functions `savetxt` and `loadtxt`!

## The Numpy way: `savetxt` and `loadtxt`

Numpy has two sibling functions, called `savetxt` and `loadtxt`. The names are rather self-explanatory: they save and load data to/from plain text files.

Plain text has a serious advantage over most other file formats: it is extremely simple to read. A plain-text file can be opened on virtually any computer with nothing more than a text editor, and this is unlikely to change. This comes with some disadvantages, though. First and foremost, it is relatively unstructured and carries little metadata. Additionally, it can be a surprisingly inefficient way to store data. We will talk about more data formats later.
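
To make the efficiency remark concrete, here is a rough sketch comparing the size of a text file with the memory the array itself occupies. It assumes the default `savetxt` number format and uses a temporary file whose name is made up for the example.

```python
import os
import tempfile
import numpy as np

a = np.linspace(0, 1, 1000)  # 1000 double-precision numbers
path = os.path.join(tempfile.mkdtemp(), "03_size_demo.dat")
np.savetxt(path, a)

text_bytes = os.path.getsize(path)  # size of the plain-text file on disk
raw_bytes = a.nbytes                # 8 bytes per number in memory

print("text file:", text_bytes, "bytes")
print("raw data: ", raw_bytes, "bytes")
```

With the default format each number is written out with many digits, so the text file ends up noticeably larger than the raw data.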

In [93]:
a = np.linspace(0, 1) #Generate some data 
np.savetxt("03_python_way.dat", a) #Save our data into a file
print(a)

[0.         0.02040816 0.04081633 0.06122449 0.08163265 0.10204082
 0.12244898 0.14285714 0.16326531 0.18367347 0.20408163 0.2244898
 0.24489796 0.26530612 0.28571429 0.30612245 0.32653061 0.34693878
 0.36734694 0.3877551  0.40816327 0.42857143 0.44897959 0.46938776
 0.48979592 0.51020408 0.53061224 0.55102041 0.57142857 0.59183673
 0.6122449  0.63265306 0.65306122 0.67346939 0.69387755 0.71428571
 0.73469388 0.75510204 0.7755102  0.79591837 0.81632653 0.83673469
 0.85714286 0.87755102 0.89795918 0.91836735 0.93877551 0.95918367
 0.97959184 1.        ]


In [94]:
b = np.loadtxt("03_python_way.dat")
print(b)

[0.         0.02040816 0.04081633 0.06122449 0.08163265 0.10204082
 0.12244898 0.14285714 0.16326531 0.18367347 0.20408163 0.2244898
 0.24489796 0.26530612 0.28571429 0.30612245 0.32653061 0.34693878
 0.36734694 0.3877551  0.40816327 0.42857143 0.44897959 0.46938776
 0.48979592 0.51020408 0.53061224 0.55102041 0.57142857 0.59183673
 0.6122449  0.63265306 0.65306122 0.67346939 0.69387755 0.71428571
 0.73469388 0.75510204 0.7755102  0.79591837 0.81632653 0.83673469
 0.85714286 0.87755102 0.89795918 0.91836735 0.93877551 0.95918367
 0.97959184 1.        ]


Much easier than before! A single command now does what took us five lines earlier. This also carries over to other array shapes: if you want to do the same thing for a matrix, little changes!

In [95]:
m = np.random.rand(5, 6)
np.savetxt("03_python_way_matrix.dat", m) #Save our data into a file
print(m)

[[0.30858072 0.14943803 0.72530661 0.23654462 0.90237324 0.26844437]
 [0.58850649 0.114349   0.04627874 0.26475038 0.66429396 0.68570908]
 [0.49651972 0.63779162 0.81619905 0.64085929 0.3121698  0.20809521]
 [0.64026878 0.85821003 0.80721423 0.55544774 0.95304777 0.58110973]
 [0.62464309 0.15294107 0.94310223 0.81161621 0.55600159 0.10457878]]


In [96]:
m2 = np.loadtxt("03_python_way_matrix.dat")

In [97]:
print(m2)

[[0.30858072 0.14943803 0.72530661 0.23654462 0.90237324 0.26844437]
 [0.58850649 0.114349   0.04627874 0.26475038 0.66429396 0.68570908]
 [0.49651972 0.63779162 0.81619905 0.64085929 0.3121698  0.20809521]
 [0.64026878 0.85821003 0.80721423 0.55544774 0.95304777 0.58110973]
 [0.62464309 0.15294107 0.94310223 0.81161621 0.55600159 0.10457878]]


### Multiple datasets per file

One of the main disadvantages of the plain-text format is the lack of data organization/metadata. When adding multiple arrays, one can organize them into columns and add a header. This is a good solution, even if not an ideal one. But what about the case when one has several matrices, perhaps of different sizes?
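
For the column-plus-header approach, a minimal sketch could look as follows. The arrays `t` and `y`, the header text, and the file name are placeholders chosen for the example. `savetxt` writes the header as a comment line, which `loadtxt` skips by default.

```python
import os
import tempfile
import numpy as np

t = np.linspace(0, 1, 5)
y = t**2
path = os.path.join(tempfile.mkdtemp(), "03_columns_demo.dat")

# Stack the 1D arrays as columns and describe them in a header line
np.savetxt(path, np.column_stack([t, y]), header="t y")

# unpack=True returns one array per column
t_back, y_back = np.loadtxt(path, unpack=True)
print(t_back)
print(y_back)
```

This only works when all arrays have the same length, which is exactly the limitation raised above.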

When the arrays have different shapes, we commonly use the npz format. This format has some advantages over plain text:

1. It can store a name for each array along with the data itself
2. It can be compressed, taking less space
3. It is binary, so numbers are stored without loss of precision

They are mostly associated with numpy, but there are several libraries to read them in other programming languages as well. 

In [98]:
matrix1 = np.random.rand(100, 100)
np.savez("03_numpy_way.npz", matrix1) #Write a npz file
print(matrix1)

[[0.23223099 0.86668783 0.49068717 ... 0.56915889 0.64084644 0.76511881]
 [0.40130754 0.12511403 0.80029882 ... 0.65556164 0.65682571 0.73830427]
 [0.50577201 0.29973007 0.9664485  ... 0.05695182 0.21375511 0.23768066]
 ...
 [0.01430463 0.06911239 0.09709622 ... 0.31333414 0.79959621 0.55203082]
 [0.45222739 0.6006339  0.81559152 ... 0.12883778 0.12604453 0.86456412]
 [0.35289879 0.03496491 0.71341582 ... 0.72865455 0.52379088 0.20824996]]


In [99]:
data = np.load("03_numpy_way.npz")

The object returned by `np.load` behaves like a dictionary. We can check which datasets it contains with

In [100]:
print(list(data.keys()))

['arr_0']


But that is not the original name! Unfortunately, Python cannot automatically get the names of the variables. Thus, we need to specify the name when saving.

In [101]:
np.savez("03_numpy_way.npz", saved_matrix=matrix1) #Write a npz file
data = np.load("03_numpy_way.npz")
print(list(data.keys()))

['saved_matrix']


Now we are talking! All that is left is to print the saved matrix.

In [102]:
matrix = data['saved_matrix']
print(matrix)

[[0.23223099 0.86668783 0.49068717 ... 0.56915889 0.64084644 0.76511881]
 [0.40130754 0.12511403 0.80029882 ... 0.65556164 0.65682571 0.73830427]
 [0.50577201 0.29973007 0.9664485  ... 0.05695182 0.21375511 0.23768066]
 ...
 [0.01430463 0.06911239 0.09709622 ... 0.31333414 0.79959621 0.55203082]
 [0.45222739 0.6006339  0.81559152 ... 0.12883778 0.12604453 0.86456412]
 [0.35289879 0.03496491 0.71341582 ... 0.72865455 0.52379088 0.20824996]]
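
Regarding the compression advantage listed earlier: `np.savez` itself stores the arrays uncompressed, and Numpy provides `np.savez_compressed` for the compressed variant. The sketch below compares the two on an array of zeros, which compresses very well. The file names are invented for the example, and random data such as `matrix1` would shrink far less.

```python
import os
import tempfile
import numpy as np

zeros = np.zeros((100, 100))  # highly repetitive data, easy to compress
folder = tempfile.mkdtemp()
plain_path = os.path.join(folder, "03_plain_demo.npz")
small_path = os.path.join(folder, "03_compressed_demo.npz")

np.savez(plain_path, zeros=zeros)             # no compression
np.savez_compressed(small_path, zeros=zeros)  # same content, compressed

print(os.path.getsize(plain_path), "bytes without compression")
print(os.path.getsize(small_path), "bytes with compression")

# Both files are loaded in exactly the same way
with np.load(small_path) as data:
    restored = data["zeros"]
```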


Let's now do this for several matrices.

In [103]:
matrix1 = np.random.rand(100, 100)
matrix2 = np.random.rand(30, 52)
matrix3 = np.random.rand(99, 128)
matrix4 = np.random.rand(300, 1000)
np.savez("03_numpy_way_multi.npz", matrix1=matrix1, matrix2=matrix2, matrix3=matrix3, matrix4=matrix4)

In [104]:
data = np.load("03_numpy_way_multi.npz")
print(list(data.keys()))

['matrix1', 'matrix2', 'matrix3', 'matrix4']
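
To look at every stored array without typing the names by hand, one can loop over the stored names. A small sketch, using smaller hypothetical matrices and a temporary file so that it runs on its own:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "03_multi_demo.npz")
np.savez(path, matrix1=np.random.rand(10, 10), matrix2=np.random.rand(3, 5))

shapes = {}
with np.load(path) as data:
    for name in data.files:      # data.files lists the stored names
        shapes[name] = data[name].shape

print(shapes)
```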


The gang is all here. If you are using only Python, I strongly recommend that you consistently use plain-text files and/or npz files. In reality, however, another data format is very common: Excel files. They are not completely trivial to import in Python, but we can do it. For this, we will use the Pandas library.

## Reading excel files

In order to read/write data from Excel, we will use the `Pandas` library. This works by leveraging the `DataFrame` structure of Pandas. You can think of this data structure as a mix between the arrays from Numpy and Excel data. The largest change is that the `DataFrames` can have named rows/columns, which make them really useful for data science. As always, we want to keep the metadata close to the data.

This is a good moment to open the file `03_scaling.xlsx` and explore its contents.

In [105]:
import pandas as pd
import numpy as np

In [106]:
data = pd.read_excel("./03_scaling.xlsx")

In [107]:
data

   Cores  Speedup
0      1      1.0
1      2      2.0
2      3      2.8
3      4      3.7
4      8      7.1
5     12      9.0
6     16     13.2


Now, we can select any given column using its name.

In [108]:
cores = data["Cores"]

Sometimes, we only need a subsection of the full data. For this, we can use the `loc` indexer. It works much like array indexing in Numpy, except that rows and columns are selected by their labels.

In [109]:
data.loc[0, :]

Cores      1.0
Speedup    1.0
Name: 0, dtype: float64
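
One difference from Numpy indexing is worth a quick sketch: `loc` selects by label and includes the end of a slice, while `iloc` selects by position and excludes it, like Numpy. The table is rebuilt by hand here from the values shown above, so the example does not need the Excel file.

```python
import pandas as pd

# Hand-made copy of the table read from 03_scaling.xlsx
data = pd.DataFrame({
    "Cores": [1, 2, 3, 4, 8, 12, 16],
    "Speedup": [1.0, 2.0, 2.8, 3.7, 7.1, 9.0, 13.2],
})

by_label = data.loc[0:2, "Speedup"]   # rows labelled 0, 1 and 2
by_position = data.iloc[0:2, 1]       # rows at positions 0 and 1 only

print(by_label)
print(by_position)
```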

### Task: Select the speedup for `cores <= 4`

In [110]:
data.loc[cores <= 4, "Speedup"]

0    1.0
1    2.0
2    2.8
3    3.7
Name: Speedup, dtype: float64

In [111]:
data.loc[data["Cores"] <= 4, "Speedup"]

0    1.0
1    2.0
2    2.8
3    3.7
Name: Speedup, dtype: float64
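
Conditions can also be combined. As a sketch (again with the table rebuilt by hand), each condition goes in parentheses and they are joined with `&` (and) or `|` (or):

```python
import pandas as pd

# Hand-made copy of the table read from 03_scaling.xlsx
data = pd.DataFrame({
    "Cores": [1, 2, 3, 4, 8, 12, 16],
    "Speedup": [1.0, 2.0, 2.8, 3.7, 7.1, 9.0, 13.2],
})

# Keep the rows with between 2 and 8 cores (both ends included)
mask = (data["Cores"] >= 2) & (data["Cores"] <= 8)
selected = data.loc[mask, "Speedup"]
print(selected)
```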

### Files with more than one sheet

Frequently, we aggregate several sheets into a single Excel file. By default, Pandas will read the first one of these, which is not necessarily what we want. There are several ways to handle this. 

First, we can tell Pandas to read the sheet at a given position. Positions start at zero, so `sheet_name=1` reads the second sheet.

In [112]:
data = pd.read_excel("./03_scaling_multiple.xlsx", sheet_name=1)

In [113]:
data

   Cores  Speedup
0      1    1.000
1      2    1.100
2      3    1.200
3      4    1.500
4      8    1.800
5     12    2.001
6     16    3.010


We can also explicitly choose the sheet by its name

In [114]:
data = pd.read_excel("./03_scaling_multiple.xlsx", sheet_name="System2")

In [115]:
data

   Cores  Speedup
0      1    1.000
1      2    1.100
2      3    1.200
3      4    1.500
4      8    1.800
5     12    2.001
6     16    3.010


Even better, we can read several sheets at once!

In [116]:
datas = pd.read_excel("./03_scaling_multiple.xlsx", sheet_name=[0,1])

In [117]:
data0 = datas[0]
data0

   Cores  Speedup
0      1      1.0
1      2      2.0
2      3      2.8
3      4      3.7
4      8      7.1
5     12      9.0
6     16     13.2


In [118]:
data1 = datas[1]
data1

   Cores  Speedup
0      1    1.000
1      2    1.100
2      3    1.200
3      4    1.500
4      8    1.800
5     12    2.001
6     16    3.010


### Task: Try to load both sheets at once, but using the names of the sheets: "System1" and "System2"

In [119]:
datas = pd.read_excel("./03_scaling_multiple.xlsx", sheet_name=["System1", "System2"])

In [120]:
datas["System1"]

   Cores  Speedup
0      1      1.0
1      2      2.0
2      3      2.8
3      4      3.7
4      8      7.1
5     12      9.0
6     16     13.2


In [121]:
datas["System2"]

   Cores  Speedup
0      1    1.000
1      2    1.100
2      3    1.200
3      4    1.500
4      8    1.800
5     12    2.001
6     16    3.010


**Bonus**: Try passing `None` to `sheet_name`

In [122]:
datas = pd.read_excel("./03_scaling_multiple.xlsx", sheet_name=None)
datas

OrderedDict([('System1',    Cores  Speedup
              0      1      1.0
              1      2      2.0
              2      3      2.8
              3      4      3.7
              4      8      7.1
              5     12      9.0
              6     16     13.2), ('System2',    Cores  Speedup
              0      1    1.000
              1      2    1.100
              2      3    1.200
              3      4    1.500
              4      8    1.800
              5     12    2.001
              6     16    3.010)])
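
Since the result is a dictionary that maps sheet names to `DataFrame`s, it can be looped over like any other dictionary. The sketch below builds such a dictionary by hand from the values printed above (so it runs without the Excel file) and collects the best speedup of each system.

```python
import pandas as pd

# Hand-made stand-in for pd.read_excel(..., sheet_name=None)
cores = [1, 2, 3, 4, 8, 12, 16]
datas = {
    "System1": pd.DataFrame({"Cores": cores,
                             "Speedup": [1.0, 2.0, 2.8, 3.7, 7.1, 9.0, 13.2]}),
    "System2": pd.DataFrame({"Cores": cores,
                             "Speedup": [1.0, 1.1, 1.2, 1.5, 1.8, 2.001, 3.01]}),
}

# Loop over (sheet name, DataFrame) pairs
best = {}
for name, frame in datas.items():
    best[name] = frame["Speedup"].max()

print(best)
```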

## Manipulating the DataFrame

As mentioned before, `DataFrame`s support arithmetic in much the same way as normal Numpy arrays

In [123]:
data = pd.read_excel("./03_scaling.xlsx")
data["Cores"]+= 1

In [124]:
data

   Cores  Speedup
0      2      1.0
1      3      2.0
2      4      2.8
3      5      3.7
4      9      7.1
5     13      9.0
6     17     13.2
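
Whole columns can be combined as well. As an illustration, the sketch below adds a new column holding the speedup divided by the number of cores. The column name `Efficiency` is made up for this example, and the table is rebuilt by hand from the original values.

```python
import pandas as pd

# Hand-made copy of the table read from 03_scaling.xlsx
data = pd.DataFrame({
    "Cores": [1, 2, 3, 4, 8, 12, 16],
    "Speedup": [1.0, 2.0, 2.8, 3.7, 7.1, 9.0, 13.2],
})

# Element-wise division of two columns, just like with Numpy arrays
data["Efficiency"] = data["Speedup"] / data["Cores"]
print(data)
```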


## Saving an Excel file

Now that we manipulated the data, we can save it into a new file

In [125]:
data.to_excel("03_scaling_extracores.xlsx", sheet_name="System1")

However, with this method we can only save a single sheet per file. We can overcome this issue with the help of `ExcelWriter`, a utility class from Pandas

In [126]:
# Read some data, just to be sure we have something to work with.
# Normally, one would have manipulated the data before simply
# rewriting it
datas = pd.read_excel("./03_scaling_multiple.xlsx", sheet_name=["System1", "System2"]) 

In [127]:
with pd.ExcelWriter('03_scaling_extracores_multiple.xlsx') as writer:
    datas["System1"].to_excel(writer, sheet_name="CoolNewSystem1")
    datas["System2"].to_excel(writer, sheet_name="NotSoCoolSystem2")