# P08 - File Operations

## Syllabus
2.3.4 Store data in and retrieve data from serial and sequential text files.

## Understanding Goals

At the end of this chapter, you should be able to:
- Read data from a text file and process the data meaningfully
- Write data into a text file


## Section 1 - Introduction to File Operations

So far we have learnt many useful algorithms and data types. Often we also need to store our processed data in a file, or retrieve data from a file for later use. In this chapter, we will learn how to work with simple text files.

### _1.1 Access modes_

Whenever we open a file, we need to indicate our purpose by choosing the appropriate access mode. A few commonly used access modes are listed below. The default mode is `"r"`.

<table class="table table-bordered" style="width:90%">
    <!-- Header Row -->
    <tr>
        <th style="width:10%; text-align:left">Mode</th>
        <th style="width:90%; text-align:left">Description</th>
    </tr>
    <tr>
        <td style="text-align:left">r</td>
        <td style="text-align:left">Opens a file for reading only. The file pointer is placed at the beginning of the file. This is the default mode.</td>
    </tr>
    <tr>
        <td style="text-align:left">rb</td>
        <td style="text-align:left">Opens a file for reading only in binary format. The file pointer is placed at the beginning of the file.</td>
    </tr>
    <tr>
        <td style="text-align:left">r+</td>
        <td style="text-align:left">Opens a file for both reading and writing. The file pointer is placed at the beginning of the file.</td>
    </tr>
    <tr>
        <td style="text-align:left">w</td>
        <td style="text-align:left">Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.</td>
    </tr>
    <tr>
        <td style="text-align:left">a</td>
        <td style="text-align:left">Opens a file for appending. The file pointer is at the end of the file if the file exists. That is, the file is in the append mode. If the file does not exist, it creates a new file for writing.</td>
    </tr>
</table>
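
To see how the `"w"`, `"a"` and `"r"` modes in the table behave, here is a minimal sketch. It assumes a throwaway file in a temporary folder (the name `modes_demo.txt` is made up for illustration); `open()` and `close()` themselves are explained in 1.2.

```python
import os
import tempfile

# Hypothetical file inside a temporary folder, used only for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "w" creates the file (or overwrites it if it already exists)
f = open(demo_path, "w")
f.write("first line\n")
f.close()

# "a" keeps the existing contents and adds to the end of the file
f = open(demo_path, "a")
f.write("second line\n")
f.close()

# "r" (the default) only allows reading
f = open(demo_path, "r")
modes_content = f.read()
f.close()

print(repr(modes_content))  # 'first line\nsecond line\n'
```

Replacing the `"a"` with `"w"` in the second step would leave only `'second line\n'` in the file, since `"w"` discards the existing contents.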

### _1.2 Opening a file_

There are 2 ways to open a file in Python. Let's take a look at the differences.

\* For the following examples, you do not need to worry about the `read()` function; we will discuss it in 2.1.

#### ~ Example ~

In [3]:
# 1st method

# file_obj = open("file_name" [, access mode])
f = open("data_files/file_hello_world.txt", "r")

# processing statements
print(f.read())

# method 1 must be followed by a file_obj.close() call to properly close the file.
f.close()

# uncommenting the next line raises a ValueError, because the file is already closed
# print(f.read())

Hello World!


In [4]:
# 2nd method

# with open("file_name" [, access mode]) as file_obj:
with open("data_files/file_hello_world.txt", "r") as f:
    # processing statements
    print(f.read())

# method 2 does not require the .close() call, as the `with` statement closes the file automatically

# uncommenting the next line raises a ValueError, because the file is already closed
# print(f.read())

Hello World!


**Given the above observations, we would NORMALLY recommend the `with open()` method, as we do not need to worry about closing the file.**

**HOWEVER...**
The A-Level requirement is to close the file manually, so make sure you can also use the first method and remember to call `close()`.
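
When a file has to be closed manually, one way to make sure `close()` always runs is a `try`/`finally` block. This is only a sketch of one possible pattern, not a required style; the file `close_demo.txt` is a made-up name, created in a temporary folder so that the example can run on its own.

```python
import os
import tempfile

# Hypothetical file, created only so that this sketch is self-contained
demo_path = os.path.join(tempfile.mkdtemp(), "close_demo.txt")
f = open(demo_path, "w")
f.write("Hello World!")
f.close()

f = open(demo_path, "r")
try:
    close_demo_content = f.read()
finally:
    # this part runs whether or not read() raised an error
    f.close()

print(close_demo_content)  # Hello World!
print(f.closed)            # True
```

The `closed` attribute of a file object is a quick way to check whether the file has really been closed.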

## Section 2 - File Reading Operations


### _2.1 Reading contents from a file_

Let's take a closer look at how to read contents from a file.

We can do so by using the `read()` and the `readline()` functions.

#### ~ Example ~

In [2]:
# `read()` with `split()` function

with open("data_files/file_01.txt", "r") as f:
    # processing statements
    content = f.read()
    
print(content)
data = content.split("\n")
print(data)

# effectively, content is a str containing '\n' as new line characters
# like this '12\n32\n134\n123\n21\n12\n5\n6\n21'
# the following line shows this; you may search online about the `repr` function on your own
print(repr(content))

12
32
134
123
21
12
5
6
21
['12', '32', '134', '123', '21', '12', '5', '6', '21']
'12\n32\n134\n123\n21\n12\n5\n6\n21'


In [5]:
# `readline()` with `while` loop

with open("data_files/file_01.txt", "r") as f:
    # processing statements
    line = f.readline()
    data = []
    while line: # while `line` is not empty, we will continue reading
        data.append(line)
        line = f.readline()
        
print(data)

# `line` is now an empty string, as we have reached the end of the file
# uncomment the following line to check
# print(repr(line))

['12\n', '32\n', '134\n', '123\n', '21\n', '12\n', '5\n', '6\n', '21']


In [15]:
# `for` loop over the file object (reads one line at a time, like `readline()`)

with open("data_files/file_01.txt", "r") as f:
    # processing statements
    data = []
    
    for line in f:
        data.append(line)
        
print(data)

['12\n', '32\n', '134\n', '123\n', '21\n', '12\n', '5\n', '6\n', '21']


So which way is better? `read()` or `readline()`?

We acknowledge that `read()` directly gives us a single string, and we can use a simple `split()` call to turn it into a list of `str` values. However, when the file becomes very large, for example twice as large as the machine's memory, reading the whole file at once with `read()` may not be efficient.

**Hence, we recommend reading the file line by line with a loop, using `readline()` or a `for` loop over the file object.**
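
To illustrate why reading line by line helps with large files, the sketch below keeps only a running total in memory instead of storing every line. It assumes a small made-up file (`numbers_demo.txt`) with one number per line, created in a temporary folder so that the example can run on its own.

```python
import os
import tempfile

# Hypothetical file with one number per line, created for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "numbers_demo.txt")
with open(demo_path, "w") as f:
    f.write("12\n32\n134\n")

total = 0
with open(demo_path, "r") as f:
    line = f.readline()
    while line:  # an empty string means we have reached the end of the file
        total += int(line)  # int() ignores the trailing "\n"
        line = f.readline()

print(total)  # 178
```

Only one line is held in memory at a time, so the same loop works no matter how long the file is.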

### _2.2 Process contents read from file_

Once we have read the contents of the file into a list of strings, we need to process them meaningfully.

#### ~ Example ~

In [14]:
# file_01

# read file content into a list of strings
with open("data_files/file_01.txt", "r") as f:
    # processing statements
    raw_data = []
    
    for line in f:
        raw_data.append(line)

print("raw data: ", raw_data)

# remove `\n` in strings
processed_strings = []
for raw_s in raw_data:
    if raw_s[-1] == "\n":
        processed_strings.append(raw_s[:-1])
    else:
        processed_strings.append(raw_s)
        
print("processed list of strings: ", processed_strings)

raw data:  ['12\n', '32\n', '134\n', '123\n', '21\n', '12\n', '5\n', '6\n', '21']
processed list of strings:  ['12', '32', '134', '123', '21', '12', '5', '6', '21']
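
The cleaned values are still strings. As a small sketch of one possible next step, the block below converts them to integers so that they can be used in calculations. The list is copied from the output above, and the name `numbers` is chosen only for illustration.

```python
# the cleaned strings from the output above
processed_strings = ['12', '32', '134', '123', '21', '12', '5', '6', '21']

# convert each string to an int
numbers = []
for s in processed_strings:
    numbers.append(int(s))

print("total: ", sum(numbers))    # total:  366
print("largest: ", max(numbers))  # largest:  134
```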


In [7]:
# file_02

# read file content into a list of strings
with open("data_files/file_02.txt", "r") as f:
    # processing statements
    raw_data = []
    
    for line in f:
        raw_data.append(line)

print("raw data: ", raw_data)
        
# remove `\n` in strings, and remove empty lines
processed_strings = []
for raw_s in raw_data:
    if raw_s != "\n":
        if raw_s[-1] == "\n":
            processed_strings.append(raw_s[:-1])
        else:
            processed_strings.append(raw_s)
        
print("processed list of strings: ", processed_strings)

raw data:  ['\n', '12\n', '32\n', '\n', '134\n', '123\n', '21\n', '12\n', '5\n', '6\n', '21\n', '\n', '\n', '\n']
processed list of strings:  ['12', '32', '134', '123', '21', '12', '5', '6', '21']


In [9]:
# file_03

# read file content into a list of strings
with open("data_files/file_03.csv", "r") as f:
    # processing statements
    raw_data = []
    
    for line in f:
        raw_data.append(line)

print("raw data: ", raw_data)

# remove `\n` in strings, and remove empty lines
processed_strings = []
for raw_s in raw_data:
    if raw_s != "\n":
        if raw_s[-1] == "\n":
            processed_strings.append(raw_s[:-1])
        else:
            processed_strings.append(raw_s)
            
print("processed list of strings: ", processed_strings)

# convert strings to sublists
header_string = processed_strings.pop(0)
headers = header_string.split(", ")

processed_data = []
for s in processed_strings:
    processed_data.append(s.split(", "))
        
print("heads: ", headers)
print("processed data: ", processed_data)

raw data:  ['Name, Class, Score\n', 'Xiao Ming, 20J13, 89\n', 'Xiao Hong, 20J13, 76\n', 'Xiao Qiang, 20J12, 56\n', '\n']
processed list of strings:  ['Name, Class, Score', 'Xiao Ming, 20J13, 89', 'Xiao Hong, 20J13, 76', 'Xiao Qiang, 20J12, 56']
heads:  ['Name', 'Class', 'Score']
processed data:  [['Xiao Ming', '20J13', '89'], ['Xiao Hong', '20J13', '76'], ['Xiao Qiang', '20J12', '56']]


#### - Small Challenge -

As you may have noticed, "file_03.csv" has a standard format with a space `" "` after every comma `","`.

However, in real life this may not always be the case. Take a look at "file_03s.csv": how can we process that file into the above format?

#### Hint

You may consider using the `strip()` function. If we have a string `s` which contains `"  haha "`, calling `s.strip()` returns a new string `"haha"` with the leading and trailing whitespace removed.
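
A quick sketch of how `strip()` and its one-sided variants `lstrip()` and `rstrip()` behave. The variable names are chosen only for illustration.

```python
s = "  haha "

stripped = s.strip()         # removes whitespace on both sides
left_stripped = s.lstrip()   # removes whitespace on the left only
right_stripped = s.rstrip()  # removes whitespace on the right only

print(repr(stripped))        # 'haha'
print(repr(left_stripped))   # 'haha '
print(repr(right_stripped))  # '  haha'

# "\n" also counts as whitespace, so strip() removes it as well
line_stripped = " 21\n".strip()
print(repr(line_stripped))   # '21'
```

Note that `s` itself is unchanged, because strings are immutable and each call returns a new string.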

In [5]:
# Your answers here

file_path = "data_files/file_03s.csv"
with open(file_path, "r") as f:
    raw_data = []
    for row in f:
        raw_data.append(row)

# print(raw_data)

processed_data = []
for row in raw_data:
    if row == "\n": # skip empty rows
        continue
    temp_row = row.split(",")
    processed_row = []
    for ele in temp_row:
        processed_row.append(ele.strip())
    processed_data.append(processed_row)
print(processed_data)

[['Name', 'Class', 'Score'], ['Xiao Ming', '20J13', '89'], ['Xiao Hong', '20J13', '76'], ['Xiao Qiang', '20J12', '56']]


## Section 3 - File Writing Operations


### _3.1 Write contents to files_

Opening a file with access mode `"w"` creates the file if it doesn't exist. We can then use the `write()` function to write strings into the file.

#### ~ Example ~

In [10]:
# Write contents into a new file
with open("data_files/file_04.txt", "w") as f:
    f.write("Hello World!\n")
    f.write("My name is Xiao Ming, nice to meet you!")
    
# Open "file_04.txt" to check its content

In [11]:
# Opening an existing file with access mode "w" will overwrite all existing contents
with open("data_files/file_04.txt", "w") as f:
    f.write("Name, Class, Score\n")
    
# Open "file_04.txt" to check its content
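
Note that `write()` only accepts strings. The sketch below shows one way to write numbers, by converting each value with `str()` first. It assumes a made-up file (`scores_demo.txt`) in a temporary folder, so that it does not touch "file_04.txt".

```python
import os
import tempfile

# Hypothetical output file, created in a temporary folder for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "scores_demo.txt")

scores = [89, 76, 56]

with open(demo_path, "w") as f:
    for score in scores:
        # f.write(score) would raise a TypeError, as score is an int
        f.write(str(score) + "\n")

# read the file back to check what was written
with open(demo_path, "r") as f:
    written = f.read()

print(repr(written))  # '89\n76\n56\n'
```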

### _3.2 Append contents to files_

We can append contents to a file by using the access mode `"a"`.

#### ~ Example ~

In [12]:
# Writing header line
# (opening an existing file with access mode "w" overwrites all existing contents)
with open("data_files/file_04.txt", "w") as f:
    f.write("Name, Class, Score\n")

results = [['Xiao Ming', '20J13', '89'], ['Xiao Hong', '20J13', '76'], ['Xiao Qiang', '20J12', '56']]

# access mode "a" keeps the header line and adds each result to the end of the file
with open("data_files/file_04.txt", "a") as f:
    for personal_result in results:
        for s in personal_result:
            f.write(s + ", ")  # note: this also writes ", " after the last value
        f.write("\n")
    
# Open "file_04.txt" to check its content

#### - Small Challenge -

As you may have noticed, "file_04.txt" now has a trailing ", " at the end of each personal result line.
How can we change the above code to avoid this problem?

In [15]:
# Your answers here

# Writing header line
# (opening an existing file with access mode "w" overwrites all existing contents)
with open("data_files/file_04.txt", "w") as f:
    f.write("Name, Class, Score\n")

results = [['Xiao Ming', '20J13', '89'], ['Xiao Hong', '20J13', '76'], ['Xiao Qiang', '20J12', '56']]

with open("data_files/file_04.txt", "a") as f:
    for personal_result in results:
        # join() puts ", " only between the values, so there is no trailing comma
        f.write(", ".join(personal_result))
        f.write("\n")

# Open "file_04.txt" to check its content

## Section 4 - Other File Operations Tips

### _4.1 `csv` library_

The `csv` library is a very useful library to process **well-formatted** csv files.

#### ~ Example ~

In [16]:
import csv

filename = "data_files/file_05.csv"

# initializing the headers (field names) and data rows lists
headers = []
processed_data = []

# reading csv file (the csv documentation recommends opening the file with newline='')
with open(filename, 'r', newline='') as csvfile:
    # creating a csv reader object 
    csvreader = csv.reader(csvfile) 

    # extracting field names through first row 
    headers = next(csvreader) 

    # extracting each data row one by one 
    for row in csvreader: 
        processed_data.append(row)
        
print("heads: ", headers)
print("processed data: ", processed_data)

heads:  ['Name', 'Class', 'Score']
processed data:  [['Xiao Ming', '20J13', '89'], ['Xiao Hong', '20J13', '76'], ['Xiao Qiang', '20J12', '56']]
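
The `csv` library can also write files. Below is a minimal sketch using `csv.writer`, which takes care of placing the commas. It assumes a made-up output file (`results_demo.csv`) in a temporary folder, and the data is copied from the example above.

```python
import csv
import os
import tempfile

# Hypothetical output file, created in a temporary folder for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "results_demo.csv")

headers = ['Name', 'Class', 'Score']
results = [['Xiao Ming', '20J13', '89'], ['Xiao Hong', '20J13', '76'], ['Xiao Qiang', '20J12', '56']]

# newline="" is recommended by the csv documentation when opening csv files
with open(demo_path, "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(headers)   # write one row
    csvwriter.writerows(results)  # write many rows at once

# read the file back to check what was written
with open(demo_path, "r", newline="") as csvfile:
    rows_read = list(csv.reader(csvfile))

print(rows_read[0])  # ['Name', 'Class', 'Score']
print(rows_read[1])  # ['Xiao Ming', '20J13', '89']
```

Unlike the manual approach in Section 3, `csv.writer` separates the values with `","` only, without a space after each comma.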


### _4.2 `try-except` block_

Sometimes we may be unsure whether a file exists in the directory, so it is good practice to wrap the file operation in a `try-except` block to handle potential errors. This prevents the programme from terminating due to an `IOError`.

#### ~ Example ~

In [17]:
try:
    with open("somefile.txt") as f:
        print(f.read())
except IOError:
    print("File not found or path is incorrect")

File not found or path is incorrect
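
In Python 3, `IOError` is an alias of `OSError`, and a missing file raises the more specific `FileNotFoundError`. The sketch below catches only that case and returns a fallback value. The function name `read_or_default` and the file name `missing.txt` are made up for illustration.

```python
import os
import tempfile

def read_or_default(file_name, default=""):
    """Return the contents of file_name, or default if the file does not exist."""
    try:
        f = open(file_name, "r")
    except FileNotFoundError:
        return default
    content = f.read()
    f.close()
    return content

# a path inside a new, empty temporary folder, so the file cannot exist
missing_path = os.path.join(tempfile.mkdtemp(), "missing.txt")

fallback_result = read_or_default(missing_path, "nothing to read")
print(fallback_result)  # nothing to read
```

Catching the most specific error means other problems, such as a permission error, are not silently reported as "file not found".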


### _4.3 `os.path` library_

When building large applications, we may create multiple Python files in a complex folder structure. If a file operation uses a relative path and the programme is run from a different folder, we may face an `IOError` even though the file exists.

**Hence, it is good practice to use the absolute path of a data file, as shown in the following example.**

#### ~ Example ~

In [21]:
import os

#! [jupyter notebook specific] Use "" inside the `dirname()`
basedir = os.path.abspath(os.path.dirname(""))

#! for normal implementation, we should use the following line instead!
# basedir = os.path.abspath(os.path.dirname(__file__))


# os.path.join to combine the directory with the file name to form the abs path of the data file
data_file_name = os.path.join(basedir, "static/raw_data/my_data.txt")

def read_data_file(file_name):
    try:
        with open(file_name) as f:
            return f.read()
    except IOError:
        print("File not found or path is incorrect.")
        
print(read_data_file(data_file_name))

Some data.
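
The `os.path` library also provides functions to inspect a path before opening it. The following is a small sketch, assuming a made-up file name (`my_data.txt`) inside a temporary folder so that it can run anywhere.

```python
import os
import tempfile

# Hypothetical folder and file, used only for this demonstration
base_dir = tempfile.mkdtemp()
data_path = os.path.join(base_dir, "my_data.txt")

exists_before = os.path.exists(data_path)  # the file has not been created yet
print(exists_before)  # False

with open(data_path, "w") as f:
    f.write("Some data.")

exists_after = os.path.isfile(data_path)  # True only for an existing file
print(exists_after)  # True

print(os.path.basename(data_path))  # my_data.txt
print(os.path.isabs(data_path))     # True
```

Checking with `os.path.exists()` first is optional; a `try-except` block as in 4.2 is still useful, because the file could be removed between the check and the `open()` call.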
