# File manipulation

- Files are used to store processed data so that it persists after each execution. There are two types of files: text and binary.
- We will learn some of the most common functions for manipulating files.
- Although each operating system has its own way of creating and accessing files, Python code is independent of it because it works through a "file handle".

## File Handling

### Open


The key function for working with files in Python is the open() function. Its two main parameters are the file name and the mode.

There are four basic modes for opening a file. Adding '+' to any of them also enables the complementary operation:

| Indicator | Opening mode | Added by + | Pointer |
| --- | --- | --- | --- |
| r/r+ | Read only. FileNotFoundError if the file does not exist | +writing | Beginning |
| w/w+ | Write only. Overwrites the file if it already exists. Creates the file otherwise | +reading | Beginning |
| x/x+ | Write only. FileExistsError if the file already exists. Creates the file otherwise | +reading | Beginning |
| a/a+ | Append (write at the end) if the file exists. Creates the file otherwise | +reading | End |


Let's see an example:

In [2]:
nameHandle = open("File.txt","w")

where:
- nameHandle stands for the name of the file handle
- open() is the function to open a file
- "File.txt" is the name (string) of the file we want to open
- "w" indicates we want to write on this file

Moreover, it is possible to specify whether the file should be handled in binary or text mode:

| Indicator | Opening mode | Example of use |
| --- | --- | --- |
| t | Text | Texts |
| b | Binary | Images |

The default mode is 'rt', meaning the file is opened for reading in text mode.
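
To make the difference between text and binary mode concrete, here is a minimal sketch. The file name modes_demo.txt and the temporary directory are assumptions made only for this illustration; they are not part of the course files.

```python
import os
import tempfile

# Hypothetical demo file, created in a temporary directory that is
# removed automatically at the end of the block.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "modes_demo.txt")

    # "w" is equivalent to "wt": text mode, so we write a str.
    handle = open(file_path, "w", encoding="utf-8")
    handle.write("Hi!\n")
    handle.close()

    # "r" is equivalent to "rt": reading returns a str.
    handle = open(file_path, "r", encoding="utf-8")
    text_content = handle.read()
    handle.close()

    # "rb": reading returns raw bytes instead of a str.
    handle = open(file_path, "rb")
    binary_content = handle.read()
    handle.close()

print(type(text_content).__name__, repr(text_content))
print(type(binary_content).__name__, repr(binary_content))
```

The same file is read twice; only the type of the returned object changes (str in text mode, bytes in binary mode).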

### Write and Close

We can think of the file handle as a variable with associated functions that allow the user to manipulate the file. One of these functions is write(). Let's see an example.

In [2]:
fileHandle = open("File.txt","w") #Creation of the file
fileHandle.write("Hi!\nWelcome to the python course.\n")
fileHandle.write("Enjoy!\n")
fileHandle.close()

You may have noticed '\n'. The character '\\' is an escape character, meaning that the character following it must be treated in a special way. In this case, for example, the string '\n' indicates the beginning of a new line.

After editing the file, we want to save the changes so that other programs can access its contents. To do so, we use the close() function, which writes any pending data to disk and releases the file.
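
As a small sketch of what close() does to the handle, the example below checks the handle's closed attribute before and after closing, and shows that writing to a closed handle fails. The file name close_demo.txt is hypothetical and lives in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "close_demo.txt")  # hypothetical name

    handle = open(file_path, "w")
    handle.write("Enjoy!\n")
    closed_before = handle.closed   # the handle is still open here

    handle.close()
    closed_after = handle.closed    # the handle is now closed

    # Any further write on a closed handle raises ValueError.
    try:
        handle.write("too late")
        error_message = None
    except ValueError as exc:
        error_message = str(exc)

print(closed_before, closed_after)
print(error_message)
```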



The result of this operation will be the creation of the file in this same directory. Is it possible to create files and other subdirectories in other locations? Absolutely! 

To create a file in another location, we can use the $\texttt{pathlib}$ library to find our current location and build the path where we want to save the file. In addition, the $\texttt{os}$ library allows us to work with the files and directories of our operating system.

Note: Both $\texttt{os}$ and $\texttt{pathlib}$ are part of the Python standard library, so they do not need to be installed with $\texttt{pip}$.

In [3]:
import os
import pathlib

path = pathlib.Path().absolute()
print(path)

try:
    os.mkdir("aux_directory/")
    print("The directory has been created successfully!")

except FileExistsError as exc:
    print(exc)

/home/janska/Documents/UAB/PythonCourse/Theory
The directory has been created successfully!


In [4]:
new_path = str(path) + "/aux_directory/"

fileHandle = open(new_path + "File.txt","w")
fileHandle.write("This file is located in a different location\n")
fileHandle.close()
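
The same steps can also be sketched with $\texttt{pathlib}$ alone, which joins paths with the / operator instead of string concatenation. This is an alternative to the cells above, not a replacement for them; the temporary directory used as the base location is an assumption made so that the example leaves no files behind.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    base_path = pathlib.Path(tmp_dir)          # stand-in for the current location

    # The / operator builds a new path; mkdir creates the directory.
    new_dir = base_path / "aux_directory"
    new_dir.mkdir(exist_ok=True)
    new_dir.mkdir(exist_ok=True)               # no error if it already exists

    file_path = new_dir / "File.txt"
    fileHandle = open(file_path, "w")
    fileHandle.write("This file is located in a different location\n")
    fileHandle.close()

    file_exists = file_path.exists()
    file_text = file_path.read_text()

print(file_exists)
print(file_text)
```

With exist_ok=True there is no need for a try/except block around the directory creation.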

It is also possible to create files and directories elsewhere in the file system, for example inside a parent directory, by giving an absolute path:

In [5]:
new_path = "/home/janska/Documents/UAB/auxPythonCourse/"

try:
    os.mkdir(new_path)
    print("The directory has been created successfully!")

except FileExistsError as exc:
    print(exc)

[Errno 17] File exists: '/home/janska/Documents/UAB/auxPythonCourse/'


In [6]:
fileHandle = open(new_path + "File.txt","w")
fileHandle.write("This file is located in another different location: auxPythonCourse\n")
fileHandle.close()

### Read

The instruction $\texttt{read}()$ allows us to read the contents of a file. Let's see the following example:

In [7]:
fileHandle = open("File.txt","r") #read only
print(fileHandle.read())
fileHandle.close()

Hi!
Welcome to the python course.
Enjoy!



It also accepts an integer as a parameter to indicate the number of characters to read:

In [8]:
fileHandle = open("File.txt","r") #read only
print(fileHandle.read(5))
fileHandle.close()

Hi!
W


One way to interpret files in Python is as a sequence of lines. Consequently, we can use a for loop to iterate over their contents.

In [9]:
fileHandle = open("File.txt","r")
for line in fileHandle:
    print(line)
fileHandle.close()

Hi!

Welcome to the python course.

Enjoy!



Notice the blank line between lines: each line already ends with '\n', and print() adds another line break of its own. As each line is treated as a string, it is possible to drop the '\n' by not taking the last character of the string.

In [10]:
fileHandle = open("File.txt","r")
for line in fileHandle:
    print(line[:-1])
fileHandle.close()

Hi!
Welcome to the python course.
Enjoy!
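
One caveat with slicing is that it removes the last character whatever it is. The sketch below compares it with the string method rstrip("\n"), which only removes a trailing line break. The list of lines is a made-up example in which the last line has no '\n', as happens when a file does not end with a line break.

```python
# Example lines as they would come from a file whose last line
# does not end with a line break.
lines = ["Hi!\n", "Welcome to the python course.\n", "Enjoy!"]

# Slicing always drops the last character, even when it is not "\n".
sliced = [line[:-1] for line in lines]

# rstrip("\n") only removes trailing line breaks.
stripped = [line.rstrip("\n") for line in lines]

print(sliced)    # ['Hi!', 'Welcome to the python course.', 'Enjoy']
print(stripped)  # ['Hi!', 'Welcome to the python course.', 'Enjoy!']
```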


#### Readline

The readline() function allows us to read just one line; which line it reads depends on the pointer's position.

In [11]:
fileHandle = open("File.txt","r")
print(fileHandle.readline())
fileHandle.close()

Hi!



Therefore, it is possible to read the first two lines by running:

In [12]:
fileHandle = open("File.txt","r")
print(fileHandle.readline())
print(fileHandle.readline())
fileHandle.close()

Hi!

Welcome to the python course.
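
Since every call to readline() moves the pointer forward, it can be called repeatedly until the file ends. A minimal sketch follows, relying on the fact that readline() returns an empty string once the end of the file is reached; the file readline_demo.txt is hypothetical and is created in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "readline_demo.txt")  # hypothetical name

    fileHandle = open(file_path, "w")
    fileHandle.write("Hi!\nWelcome to the python course.\n")
    fileHandle.close()

    fileHandle = open(file_path, "r")
    collected = []
    while True:
        line = fileHandle.readline()
        if line == "":          # empty string: end of file reached
            break
        collected.append(line)
    fileHandle.close()

print(collected)
```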



#### Readlines

The readlines() function returns a list containing the lines of the file in order:

In [13]:
fileHandle = open("File.txt","r")
print(fileHandle.readlines())
fileHandle.close()

['Hi!\n', 'Welcome to the python course.\n', 'Enjoy!\n']


If we wanted to print only a specific line in the file, let's say the second line, we could use the following instruction.

In [14]:
fileHandle = open("File.txt","r")
print(fileHandle.readlines()[1])
fileHandle.close()

Welcome to the python course.



### Append

Each time an existing file is opened in "w" mode, its content is completely overwritten. To avoid this, we use "a" mode. As an example, we are going to modify the file using "w", and then we are going to add another line without deleting anything.

In [15]:
print("Result in writing mode:")
#Overwrite file
fileHandle = open("File.txt","w")
fileHandle.write("Everything has been erased!\n")
fileHandle.close()
#Print result of "w"
fileHandle = open("File.txt","r")
print(fileHandle.read())
fileHandle.close()

print("Result with append mode:")
#Use of "a"
fileHandle = open("File.txt","a")
fileHandle.write("Use 'a' parameter to avoid overwriting it!")
fileHandle.close()
#Print result of "a"
fileHandle = open("File.txt","r")
print(fileHandle.read())
fileHandle.close()

Result in writing mode:
Everything has been erased!

Result with append mode:
Everything has been erased!
Use 'a' parameter to avoid overwriting it!


If we want to create a new file for writing and make sure that we do not overwrite an existing one, it is possible to use 'x':

In [16]:
#Raises FileExistsError if the file already exists
fileHandle = open("File.txt","x") 
fileHandle.write("Let's try x mode\n")
fileHandle.close()

FileExistsError: [Errno 17] File exists: 'File.txt'

In [17]:
fileHandle = open("File2.txt","x") 
fileHandle.write("Let's try x mode\n")
fileHandle.close()
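
In a script, the error raised by 'x' mode can be handled with try/except instead of stopping the program. The sketch below attempts to create the same file twice; the path is a hypothetical File2.txt inside a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "File2.txt")  # hypothetical location

    results = []
    for attempt in range(2):
        try:
            fileHandle = open(file_path, "x")   # fails if the file exists
            fileHandle.write("Let's try x mode\n")
            fileHandle.close()
            results.append("created")
        except FileExistsError:
            results.append("already exists")

print(results)  # ['created', 'already exists']
```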

### Delete a file or directory

Sometimes we may want to delete an existing file or directory because it is no longer needed. As before, the $\texttt{os}$ library helps us with that:

In [18]:
import os

try:
    os.remove("File2.txt")

except FileNotFoundError:
    print("File not found.")

Similarly, an empty directory can be deleted. First, let's create an empty one. To observe the changes as the code runs, we can use $\texttt{os.listdir}$, a function that lists the subdirectories and files at a given path (the current directory by default):

In [19]:
import os

print("Before creating the empty directory:\n")
print(str(os.listdir())+'\n')
os.mkdir("empty_directory")

print("After creating the empty directory:\n")
print(os.listdir())


Before creating the empty directory:

['File.txt', 'aux_directory', 'SesionFive.ipynb', '.ipynb_checkpoints']

After creating the empty directory:

['File.txt', 'empty_directory', 'aux_directory', 'SesionFive.ipynb', '.ipynb_checkpoints']


Check that the directory has been created and then run:

In [20]:
os.rmdir("empty_directory")
print(str(os.listdir()))

One possible way to delete a non-empty directory is to use the $\texttt{shutil}$ library:

In [22]:
import shutil

dir_path = str(pathlib.Path().absolute()) + "/aux_directory"

shutil.rmtree(dir_path)
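
The difference between $\texttt{os.rmdir}$ and $\texttt{shutil.rmtree}$ can be sketched on a throwaway directory. The directory layout below is made up for this illustration and is created under a temporary location, because rmtree deletes everything inside the given path without asking.

```python
import os
import shutil
import tempfile

# Throwaway base directory for the demonstration.
base_dir = tempfile.mkdtemp()
target_dir = os.path.join(base_dir, "aux_directory")
os.mkdir(target_dir)

fileHandle = open(os.path.join(target_dir, "File.txt"), "w")
fileHandle.write("temporary content\n")
fileHandle.close()

# os.rmdir only works on empty directories, so this attempt fails.
try:
    os.rmdir(target_dir)
    rmdir_failed = False
except OSError:
    rmdir_failed = True

# shutil.rmtree removes the directory together with its contents.
shutil.rmtree(target_dir)
exists_after = os.path.exists(target_dir)

shutil.rmtree(base_dir)  # clean up the throwaway base directory

print(rmdir_failed, exists_after)
```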

### Tell and Seek

Files store data sequentially: by default, each read or write continues from where the previous one ended. However, there are methods that let us control the pointer: seek() moves it to a chosen position so that we can read or write from there, and tell() returns its current position.

There are many occasions on which we might need an instruction such as seek(). For example, let's suppose we want to read a specific part of a file. A possible way to do it would be:

In [48]:
fileHandle = open("File.txt","r")
fileHandle.seek(len(fileHandle.readlines()[0]))
print(fileHandle.read(17))
print(fileHandle.tell())
fileHandle.close()

Use 'a' parameter
45
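
The positions involved can be followed step by step in the sketch below. It opens the file in binary mode, where positions are plain byte offsets; this is a choice made for the illustration, and the file seek_demo.txt is a hypothetical copy of the content used above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "seek_demo.txt")  # hypothetical name

    fileHandle = open(file_path, "wb")
    fileHandle.write(b"Everything has been erased!\n"
                     b"Use 'a' parameter to avoid overwriting it!")
    fileHandle.close()

    fileHandle = open(file_path, "rb")
    start_position = fileHandle.tell()   # pointer at the beginning
    fileHandle.seek(28)                  # skip the first line (27 characters + '\n')
    chunk = fileHandle.read(17)          # read 17 bytes from there
    end_position = fileHandle.tell()     # 28 + 17
    fileHandle.seek(0)                   # go back to the beginning
    first_word = fileHandle.read(10)
    fileHandle.close()

print(start_position, chunk, end_position, first_word)
```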


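### With

The with statement opens a file and closes it automatically when the indented block ends, even if an error occurs inside it, so there is no need to call close():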

In [1]:
with open("File.txt", "r") as fileHandle:
    print(fileHandle.read())

Everything has been erased!
Use 'a' parameter to avoid overwriting it!
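
A short sketch of why with is convenient: the handle is closed when the block ends, and also when the block is interrupted by an error. The RuntimeError is raised on purpose to simulate a failure, and with_demo.txt is a hypothetical file in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "with_demo.txt")  # hypothetical name

    with open(file_path, "w") as fileHandle:
        fileHandle.write("Hi!\n")
    closed_after_block = fileHandle.closed    # closed without calling close()

    try:
        with open(file_path, "r") as fileHandle:
            content = fileHandle.read()
            raise RuntimeError("simulated failure")
    except RuntimeError:
        closed_after_error = fileHandle.closed  # closed despite the error

print(closed_after_block, closed_after_error, repr(content))
```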


## Numpy

With numpy it is possible to read files in order to extract their information and build arrays to work with. Basically, we can work with text files or with .npy and .npz files, which turn out to be more efficient when it comes to loading data.

### Reading data from .txt files: .write() and .loadtxt()

Let's look at text files first. To do so, let's create a new file in which we will write some example values. Important: for now, the values must be written as strings and not as integers, floats, etc., since they will be written to a text file and the only data types supported there are $\texttt{char}$ or $\texttt{strings}$.

In [4]:
import numpy as np

In [9]:
with open("np.txt","w") as npHandle:
    npHandle.write("1 2 3")

Instead of the $\texttt{read}()$ function, numpy uses $\texttt{loadtxt}()$ to read information from text files.

In [10]:
np.loadtxt("np.txt")

array([1., 2., 3.])

This is also possible with multidimensional arrays, where the '\n' character indicates the end of a row by default.

In [7]:
with open("np.txt","w") as npHandle:
    npHandle.write("1 0 0\n0 1 0\n0 0 1\n")

np.loadtxt("np.txt")

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Notice how the data is written in np.txt and how the loadtxt function reads it. We can change this behavior by modifying some parameters. For example, imagine we want an array of integers and the data we receive uses ',' as the delimiter:

In [61]:
with open("npex.txt","w") as npHandle:
    npHandle.write("1,0,0\n0,1,0\n0,0,1\n")

np.loadtxt("npex.txt", dtype='int', delimiter=',')

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
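
$\texttt{loadtxt}()$ also accepts file-like objects, which makes it easy to try out parameters without creating a file. The sketch below reads made-up comma-separated text from an in-memory $\texttt{io.StringIO}$ object; the line starting with '#' is skipped because '#' is the default comment marker.

```python
import io

import numpy as np

# Made-up CSV text held in memory instead of a file on disk.
csv_text = "# x,y\n1,2\n3,4\n"

# delimiter=',' splits the columns; dtype=int converts the values to integers.
int_array = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int)

print(int_array)
print(int_array.dtype, int_array.shape)
```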

#### savetxt()

Until now we have been writing 'arrays' as strings in a text file, but with $\texttt{savetxt}()$ we can also save numpy arrays in a .txt or .csv file and they will be automatically converted to strings:

In [15]:
a = np.array([1,2,3])

np.savetxt("np.txt", a, delimiter=',') #the default delimiter is ' ' (a space)

np.loadtxt("np.txt")

array([1., 2., 3.])
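
For a two-dimensional array the delimiter becomes visible in the file. The sketch below saves a small made-up matrix with $\texttt{savetxt}()$, using the fmt parameter to write integers instead of the default scientific notation, and then looks at the raw text. The file matrix.csv is hypothetical and is created in a temporary directory.

```python
import os
import tempfile

import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "matrix.csv")  # hypothetical name

    # fmt="%d" writes each value as an integer; delimiter="," separates columns.
    np.savetxt(file_path, matrix, delimiter=",", fmt="%d")

    # Raw text as stored in the file.
    fileHandle = open(file_path, "r")
    raw_text = fileHandle.read()
    fileHandle.close()

    # Reading it back needs the same delimiter.
    loaded = np.loadtxt(file_path, delimiter=",", dtype=int)

print(repr(raw_text))
print(loaded)
```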

There are more useful parameters we can use to adapt the data to what we want. Let's see some of them.

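##### skiprows and max_rows

skiprows skips the given number of lines at the beginning of the file, and max_rows limits how many rows are read after that.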

In [16]:
np.savetxt("np.txt", np.identity(4))
np.loadtxt("np.txt")

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [17]:
np.loadtxt("np.txt", skiprows=1, max_rows=2)

array([[0., 1., 0., 0.],
       [0., 0., 1., 0.]])

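##### usecols

usecols selects which columns to read (column indices start at 0). Here we read the first and last columns of the 4x4 identity matrix saved in np.txt.

In [78]:
np.loadtxt("np.txt", usecols=[0,3])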

array([[1., 0.],
       [0., 0.],
       [0., 0.],
       [0., 1.]])

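##### ndmin (0 as default)

ndmin sets the minimum number of dimensions of the returned array. With the default value, a file containing a single row is loaded as a one-dimensional array.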

In [87]:
npHandle = open("npex.txt","w")
npHandle.write("1 2 3 4\n")
npHandle.close()

In [89]:
np.loadtxt("npex.txt")

array([1., 2., 3., 4.])

In [93]:
np.loadtxt("npex.txt", ndmin=2).shape

(1, 4)
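
The parameters above can be compared side by side in one sketch. It uses in-memory text through $\texttt{io.StringIO}$ instead of files, which is an assumption made to keep the example self-contained; the text reproduces a 4x4 identity matrix and a single row.

```python
import io

import numpy as np

table_text = "1 0 0 0\n0 1 0 0\n0 0 1 0\n0 0 0 1\n"
row_text = "1 2 3 4\n"

# Skip the first row, then read at most two rows.
middle_rows = np.loadtxt(io.StringIO(table_text), skiprows=1, max_rows=2)

# Keep only the first and the last column.
outer_columns = np.loadtxt(io.StringIO(table_text), usecols=[0, 3])

# A single row is 1-D by default and 2-D with ndmin=2.
row_as_1d = np.loadtxt(io.StringIO(row_text))
row_as_2d = np.loadtxt(io.StringIO(row_text), ndmin=2)

print(middle_rows.shape, outer_columns.shape)
print(row_as_1d.shape, row_as_2d.shape)
```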

### Reading data from .npy or .npz files

A .npy file is a binary file from which we can load the data we want to study, just as we do with .txt or .csv files. The difference is that, for big datasets, .npy files turn out to be much faster to load because they are binary files, so no conversion from text to numbers is needed. An example of use is with datasets prepared for machine learning algorithms.

Let's see how they work.

#### save() and load()

A numpy array can be saved into a .npy file by using the $\texttt{save}()$ function, and it can be loaded with $\texttt{load}()$.

In [42]:
a = np.random.randint(low = 0, high = 100, size = (4,6))

np.save("npyfile.npy",a)
np.load("npyfile.npy")

array([[83,  7, 25,  0, 32, 33],
       [68, 10, 60, 38, 43, 77],
       [90, 91, 48, 41, 70, 67],
       [56, 83, 53, 94, 78, 39]])

On the other hand, we can save several arrays into a single compressed .npz file with:

In [43]:
a = np.random.randint(low = 0, high = 100, size = (4,6))
b = np.random.randint(low = 0, high = 100, size = (4,6))

np.savez_compressed("npzfile.npz",a,b)
np.load("npzfile.npz")

<numpy.lib.npyio.NpzFile at 0x7f2ed4692af0>

As you can see, it does not print the arrays as expected. The reason is that, for .npz files, the $\texttt{load}()$ function returns a dictionary-like object containing the arrays. To access each of them, we can use $\textit{data_dict['arr_i']}$, where 'i' stands for the index of the array, starting at 0:

In [44]:
data_dict = np.load("npzfile.npz")
print("First array:\n")
print(data_dict["arr_0"])

print("\nSecond array:\n")
print(data_dict["arr_1"])

First array:

[[92 42 81 13 29 87]
 [76 59  3 15 84 67]
 [ 9 26 84 29 27 88]
 [80 71 28 49  2 68]]

Second array:

[[66  2  0 27 13 33]
 [18 24 76 39 39 48]
 [37 65 94 49 81 31]
 [55 78 20 84 83 93]]
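
Instead of relying on the automatic names arr_0, arr_1, ..., the arrays can be passed as keyword arguments so that they are stored under names of our choice. In the sketch below, the names first and second and the file named_arrays.npz are made up for the illustration; the files attribute lists the names stored in the archive.

```python
import os
import tempfile

import numpy as np

first = np.arange(6).reshape(2, 3)
second = np.ones((2, 2))

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "named_arrays.npz")  # hypothetical name

    # Keyword arguments become the keys of the stored arrays.
    np.savez_compressed(file_path, first=first, second=second)

    # The loaded object can be used in a with block, which closes the file.
    with np.load(file_path) as data_dict:
        stored_names = sorted(data_dict.files)
        loaded_first = data_dict["first"]
        loaded_second = data_dict["second"]

print(stored_names)
print(loaded_first)
print(loaded_second)
```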


If you want to, you can see the difference in loading time by executing the next cells.

In [36]:
from time import time

In [37]:
N = 10000000  # random datapoints
with open('data.txt', 'w') as data:  # the with block closes the file by itself
    for _ in range(N):
        data.write(str(10*np.random.random())+',')

In [38]:
start = time()

with open('data.txt', 'r') as data:
    string_data = data.read()
    
end = time()
 
list_data = string_data.split(',')
list_data.pop()
data_array = np.array(list_data, dtype=float).reshape(10000, 1000)


print("### 10 million points of data ###")
print("\nData summary:\n", data_array)
print("\nData shape:\n", data_array.shape)
print(f"\nTime to read: {round(end-start,5)} seconds.")

### 10 million points of data ###

Data summary:
 [[4.09144984 3.84302398 7.16627551 ... 2.01428061 2.79600872 3.30568972]
 [2.56943462 1.89302601 0.32476652 ... 0.82489118 5.30315169 1.1540871 ]
 [7.37108472 0.90150474 6.49690544 ... 7.12834269 7.51508654 9.27592637]
 ...
 [9.36692545 2.33595017 1.21974856 ... 2.17260358 1.37939516 3.70853589]
 [8.78659462 1.00940435 0.25049989 ... 3.405163   5.45185069 6.97280555]
 [2.71625771 0.0726352  9.87130631 ... 3.27120981 3.12439974 4.9982651 ]]

Data shape:
 (10000, 1000)

Time to read: 0.12279 seconds.


Let's see the .npy version. Note that the timing above covers only reading the raw text, not converting it into an array.

In [40]:
np.save('data.npy', data_array)

In [41]:
start=time()

data_array = np.load('data.npy')

end=time()

print("### 10 million points of data ###")
print("\nData summary:\n", data_array)
print("\nData shape:\n", data_array.shape)
print(f"\nTime to read: {round(end-start,5)} seconds.")

### 10 million points of data ###

Data summary:
 [[4.09144984 3.84302398 7.16627551 ... 2.01428061 2.79600872 3.30568972]
 [2.56943462 1.89302601 0.32476652 ... 0.82489118 5.30315169 1.1540871 ]
 [7.37108472 0.90150474 6.49690544 ... 7.12834269 7.51508654 9.27592637]
 ...
 [9.36692545 2.33595017 1.21974856 ... 2.17260358 1.37939516 3.70853589]
 [8.78659462 1.00940435 0.25049989 ... 3.405163   5.45185069 6.97280555]
 [2.71625771 0.0726352  9.87130631 ... 3.27120981 3.12439974 4.9982651 ]]

Data shape:
 (10000, 1000)

Time to read: 0.0135 seconds.
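
A lighter version of this comparison is sketched below. It is not the measurement shown above: it uses a smaller made-up dataset (100,000 values), temporary files, $\texttt{time.perf\_counter}$ as the timer, and it times the full load into an array in both cases. The actual times depend on the machine, so no expected values are given.

```python
import os
import tempfile
from time import perf_counter

import numpy as np

values = np.random.random(100_000)  # smaller dataset than in the cells above

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = os.path.join(tmp_dir, "data.txt")
    npy_path = os.path.join(tmp_dir, "data.npy")
    np.savetxt(txt_path, values)
    np.save(npy_path, values)

    # Text file: the timing includes parsing the text into numbers.
    start = perf_counter()
    from_txt = np.loadtxt(txt_path)
    txt_seconds = perf_counter() - start

    # Binary file: the array is read directly.
    start = perf_counter()
    from_npy = np.load(npy_path)
    npy_seconds = perf_counter() - start

print(f"loadtxt: {txt_seconds:.5f} seconds")
print(f"load:    {npy_seconds:.5f} seconds")
```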


For further information about numpy files, you can visit:

https://towardsdatascience.com/why-you-should-start-using-npy-file-more-often-df2a13cc0161

## Further topics

Apart from the functions and topics we've seen in the course, Python has many more features. Some of them are:
- Object-oriented programming
- Assertions and error handling
- Map and lambda functions
- Machine learning and deep learning
- ...