# File IO


## What is a file?

- A file is a named location on disk to store related information. 

- A file consists of a contiguous set of bytes used to store data organized in a specific format.


## Why do we use files?

- Files are used to permanently store data in **non-volatile** memory (e.g. a hard disk).

- We use files to keep data for future use. The information is still available after the computer has been turned off and on again, and variables stored in a file can be used by another run of a Python program.

- Random Access Memory (RAM) is **volatile**: data stored in RAM is lost when the computer is turned off!
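
The difference between volatile and non-volatile storage can be sketched in a few lines. This is only an illustration: the file name `counter.txt` and the variable names are made up, and a temporary directory is used so that nothing on disk is overwritten.

```python
import os
import tempfile

# Hypothetical file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "counter.txt")

# "First run": the value lives in RAM until it is written to disk
counter = 41
with open(path, mode="w") as f:
    f.write(str(counter))

del counter  # simulate the end of the program: the variable is gone

# "Second run": the value is restored from the file
with open(path) as f:
    restored_counter = int(f.read())

print(restored_counter + 1)
```

The value is written as text, so it has to be converted back to an integer after reading.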



## How is a file structured?

1. **Header**:  metadata about the contents of the file (file name, size, type, creation date, etc.)

2. **Data**: contents of the file as written by the creator or editor

3. **End of file (EOF)**: marker that indicates the end of the file (historically a special character; modern file systems record the file size instead)
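
How these three parts show up in Python can be sketched as follows. The sketch assumes a small made-up file `structure_demo.txt`. The metadata is queried with `os.stat`, and the end of the file is recognized because `read` returns an empty string, not because a visible character appears in the data.

```python
import os
import tempfile

# Hypothetical file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "structure_demo.txt")
with open(path, mode="w", encoding="utf-8") as f:
    f.write("hello")

# Metadata is kept by the file system, not inside the data bytes
info = os.stat(path)
print("size in bytes:", info.st_size)

with open(path, encoding="utf-8") as f:
    data = f.read()      # reads all remaining content
    at_end = f.read()    # nothing is left, so an empty string is returned

print(repr(data), repr(at_end))
```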


## How are file operations performed?

1. Open the file
2. Perform a read or write operation on the file
3. Close the file

## 1. Open the file

- Python offers the **built-in open() function** to open a file.
- The function returns a **file object**.

In [None]:
f = open("say_hello.txt")    # open a file in the current directory; "f" is the file object

In [None]:
help(open)

In [None]:
help(f)

<div class="alert alert-block alert-warning">
    <b>Warning:</b> You should always make sure that an open file is properly <b>closed</b>. 
</div>

- Files are **usually** closed when an application or a script finishes or is terminated.
- Files that are not closed, combined with unexpected or unwanted behavior, can lead to **resource** or memory **leaks**.

## 3. Close the file

Use the close method of the file object.

In [None]:
f.close()

Good Python practice: write code that behaves in a well-defined manner and minimizes unwanted behavior, so that the file is closed even if an error occurs.

Two ways:

1. **try-finally** block (error-handling)
2. **with** statement 


In [None]:
# First way using try-finally block

f_try_finally = open('say_hello.txt')
try:
    pass # Do something here.
finally:
    f_try_finally.close()
    

In [None]:
# Second way
# With statement automatically takes care of closing the file 
# once it leaves the with block, even in cases of error 

with open('say_hello.txt') as f_with:
    pass


In [None]:
# Let us check whether the file is closed inside and outside the with block
with open('say_hello.txt') as f_with:
    print(f_with.closed) 
print(f_with.closed)
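
The claim that the with statement closes the file "even in cases of error" can be checked with a small sketch. The file `closing_demo.txt` and the deliberately raised `ValueError` are made up for this illustration.

```python
import os
import tempfile

# Hypothetical file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "closing_demo.txt")
with open(path, mode="w") as f:
    f.write("hello\n")

try:
    with open(path) as f_with:
        # Simulate a failure while the file is still open
        raise ValueError("something went wrong inside the block")
except ValueError as error:
    print("caught:", error)

# The file object still exists, but it was closed when the block was left
print(f_with.closed)
```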

## 2. Perform operations

A detailed look at the open function:

open(**file**, **mode**='r', **buffering**=-1, **encoding**=None, **errors**=None, **newline**=None, **closefd**=True, **opener**=None)

The **mode** argument is a string of characters that defines the **reading** or **writing** behavior.

| Character | Meaning |
|  :-: |  :- | 
| 'r' | open for reading (default) | 
| 'w' | open for writing, truncating the file first | 
| 'x' | open for exclusive creation, failing if the file already exists | 
| 'a' | open for writing, appending to the end of the file if it exists | 
| 'b' | binary mode | 
| 't' | text mode (default) | 
| '+' | open for updating (reading and writing) | 
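
A minimal sketch of how some of these mode characters behave, assuming a made-up file `modes_demo.txt` in a temporary directory:

```python
import os
import tempfile

# Hypothetical file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, mode="w") as f:       # 'w': create the file (or empty it)
    f.write("first\n")

with open(path, mode="a") as f:       # 'a': keep the content, write at the end
    f.write("second\n")
with open(path) as f:                 # 'r' and 't' are the defaults
    after_append = f.read()

with open(path, mode="r+") as f:      # 'r+': update in place, nothing is truncated
    f.write("FIRST")                  # overwrites the first five characters
with open(path) as f:
    after_update = f.read()

with open(path, mode="w") as f:       # 'w' again: the old content is discarded
    f.write("third\n")
with open(path) as f:
    after_write = f.read()

print(repr(after_append), repr(after_update), repr(after_write))
```

The characters can be combined, for example `'rb'` for reading in binary mode or `'r+'` for reading and writing.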



In [None]:
# Let us try to read the first line using the first method

method_1 = open('say_hello.txt')
try:
    line = method_1.readline()
    print(line)
finally:
    method_1.close()

In [None]:
# Let us try to read the first line using the second method

with open('say_hello.txt') as method_2:
    line = method_2.readline()
    print(line)


In [None]:
with open('say_hello.txt') as method_2:
    line = method_2.readline()
    print(iter(method_2)) # The file object is iterable if the built-in iter() does not raise an error
                          # The output also shows the encoding: the default depends on the platform (often cp1252 on Windows, utf-8 on Linux)

In [None]:
# You can call readline repeatedly to step through the method_2 file object
with open('say_hello.txt') as method_2:
    print(method_2.readline()) # Using readline()
    print(method_2.readline()) # each call moves on to the next line
    print(method_2.readline())

In [None]:
# Using the iterator property, you can use a for loop to iterate over the file object content
with open('say_hello.txt') as method_2:
    for line in method_2:
        print(line)

In [None]:
# Alternatively, you can break down the lines into a list
with open('say_hello.txt') as method_2:
    output_list = method_2.readlines()
    print(output_list)                # Notably, the newline character is contained
    
for item in output_list:
    print(item.strip())

In [None]:
with open('say_hello.txt') as method_2:
    output_text = method_2.read()
    print(output_text)  
print(type(output_text))       # read returns a single string, not a list

# methods:
# read: entire text
# readline: just the current line
# readlines: list containing every line
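
The three methods can be compared side by side on a small made-up file (`three_lines.txt` is a hypothetical name). Note that `readlines` continues from the current position, so it returns only the lines that have not been read yet.

```python
import os
import tempfile

# Hypothetical file with three lines inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "three_lines.txt")
with open(path, mode="w") as f:
    f.write("line 1\nline 2\nline 3\n")

with open(path) as f:
    whole_text = f.read()             # one string with the entire content

with open(path) as f:
    first_line = f.readline()         # only the current line, including "\n"
    remaining_lines = f.readlines()   # list of the lines that are left

print(repr(whole_text))
print(repr(first_line))
print(remaining_lines)
```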

In [None]:
help(item.strip)  # help prints the documentation itself and returns None, so print is not needed

In [None]:
with open('utf-8.txt',encoding='utf-8') as utf_reader:
    read_text = utf_reader.read()

print(read_text)
print(type(read_text))

In [None]:
# What happens if we use the wrong encoding or do not know it ahead of time
with open('say_hello.txt', mode="rb") as reading_in_binary:
    binary_text = reading_in_binary.read()
print(binary_text)
print(binary_text.hex())

#### How do we read the binary data?

- Check the encoding 
- We did not pass an encoding argument => the platform default is used => cp1252 on the Windows machine used here
- Look up the proper encoding table
- Find information on the encoding size: "single-byte character encoding" (Wikipedia)
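
Why the encoding matters can be sketched without any file at all. The sample word below is made up; the codec names `utf-8` and `cp1252` are the standard Python names for the two encodings mentioned in this section.

```python
text = "Grüße"

# The same text is stored as different bytes, depending on the encoding
utf8_bytes = text.encode("utf-8")       # ü and ß need two bytes each
cp1252_bytes = text.encode("cp1252")    # single-byte encoding: one byte per character
print(utf8_bytes.hex(), len(utf8_bytes))
print(cp1252_bytes.hex(), len(cp1252_bytes))

# Decoding with the wrong table raises no error here, but the text is garbled
wrong = utf8_bytes.decode("cp1252")
right = utf8_bytes.decode("utf-8")
print(wrong, right)
```

Plain ASCII characters such as `G` or `r` are stored identically in both encodings, which is why the problem often goes unnoticed until umlauts or other special characters appear.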

<img src="images/cp-1252.png" width= 550>


In [None]:
# Let us move on and check how to write files.
with open('store_information.dat', mode='w') as f:
    f.write('this is important data')


In [None]:
# What happens if I just write to the file again?
with open('store_information.dat', mode='wb') as f:
    f.write(b'Oops, I overwrote the data') # Binary mode requires a bytes object!

In [None]:
# How do I append data to a file?
with open('store_information.dat', mode='ab') as f:
    f.write(b'\nI am attaching the data') # Binary mode requires a bytes object!

In [None]:
# When you want to store a lot of data, you can save space via binary storage
with open("binfile.bin","wb") as binary_file:
    X = [1,  2,  3,  4,  5]
    Y = [1,  4,  9, 16, 25]
    arr = bytearray(X)+bytearray(Y)
    print(type(arr),arr)
    binary_file.write(arr)
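
How much space binary storage saves depends on the data. A rough sketch, assuming 100 made-up integers that each fit into a single byte (0 to 255):

```python
values = list(range(100, 200))     # 100 integers, each between 0 and 255

# Text storage: every digit and every separator costs one byte
as_text = ",".join(str(v) for v in values).encode("utf-8")

# Binary storage: exactly one byte per value
as_binary = bytes(values)

print(len(as_text), len(as_binary))
print(list(as_binary) == values)   # the values can be recovered without parsing
```

`bytes` and `bytearray` only accept integers from 0 to 255; larger numbers or floats need a fixed multi-byte layout.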
    
    

In [None]:
hex(25)

In [None]:
# Let us read the binary file without the binary mode
with open("binfile.bin","r") as binary_file:
    text = binary_file.read()
print(text)

In [None]:
# Now let us read with the binary mode
with open("binfile.bin","rb") as binary_file:
    text = binary_file.read()
print(text) # You get the bytes representation (hex escapes for non-printable bytes)
print(text.decode()) # You get the decoded (string) representation

In [None]:
help(text) # Documentation of the bytes type, including the decode method

In [None]:
# Better handling of binary data with the struct module from the standard library
import struct # https://docs.python.org/3/library/struct.html

with open("binfile.bin","rb") as binary_file:
    binary_string = binary_file.read()

# The struct module is useful when you know the binary format exactly
# We stored 10 units of data, each containing 1 byte of information

print(struct.unpack(10*"s",binary_string))  #  s: string (1 byte units)

data_tuple = struct.unpack(10*"b",binary_string)
print(struct.unpack(10*"b",binary_string))  #  b: signed char (1 byte units)

print(struct.unpack( 5*"h",binary_string))  #  h: short integer (2 byte units)

print(type(data_tuple))

X = list(data_tuple[0:5])
Y = list(data_tuple[5:10])
print(X,Y)


In [None]:
print(5*"h")


In [None]:
# Now let us use struct.pack and write binary files again
import struct

X = [1,  2,  3,  4,  5]
Y = [1,  4,  9, 16, 25]

with open("binfile2.bin","wb") as binary_file:    # first asterisk (*) for string multiplication
    binary_file.write(struct.pack(10*'b', *(X+Y)))# second asterisk(*) for list unpacking

In [None]:
# Now let us use struct.pack and write binary files again, 
# this time we are going to specify our own format:
# 1-byte character 'X', double-precision float values for X,
# 1-byte character 'Y', double-precision float values for Y
import struct

X = ["X" , 1.,  2.,  3.,  4.,  5.]
Y = ["Y" , 1.,  4.,  9., 16., 25.]
X[0] = "X".encode("utf-8") # struct expects bytes for the 's' format, a str raises an error
Y[0] = "Y".encode("utf-8")

n = len(X)-1 # Number of float values; the label "X" or "Y" is not counted
print(*(X+Y))

with open("binfile2.bin","wb") as binary_file:  
    fmt = 's'+n*'d'+'s'+n*'d'  # Our specified format
    print(fmt)
    binary_file.write(struct.pack(fmt, *(X+Y)))
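
Data packed with a format string can be read back with the same format string. A minimal round-trip sketch, kept in memory instead of a file; the format `"s5d"` is a shortened version of the format built above (one 1-byte string followed by five doubles).

```python
import struct

fmt = "s5d"   # "5d" is shorthand for "ddddd"
packed = struct.pack(fmt, b"X", 1.0, 2.0, 3.0, 4.0, 5.0)

# In native mode, padding bytes may be inserted so that the doubles are aligned
print(len(packed))

# unpack returns a tuple: the label comes first, the numbers follow
label, *numbers = struct.unpack(fmt, packed)
print(label.decode("utf-8"), numbers)
```

The reader has to know the format string in advance: the bytes themselves carry no description of their layout.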

In [None]:
import sys

# Check endianness of system
print(sys.byteorder)

# Inspect the file with a hex editor, e.g. https://hexed.it/

**Endianness**: the order of the bytes of a word of digital data in computer memory
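
The byte order can be made visible with the struct module, which accepts `<` (little endian) and `>` (big endian) as the first character of the format string. The value 258 is chosen here only because its two bytes differ (0x01 and 0x02).

```python
import struct

value = 258                            # hexadecimal 0x0102

little = struct.pack("<h", value)      # least significant byte first
big = struct.pack(">h", value)         # most significant byte first
print(little.hex(), big.hex())

# Reading with the wrong byte order gives a different number
print(struct.unpack(">h", big)[0], struct.unpack("<h", big)[0])
```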

<img src="images/endianness.png" width=500>


In [None]:
# Analogously, there is a "writelines" method
# Let us write a file in an existing folder named "folder"

with open(".\\folder\\hello_nrw.txt",mode="w") as file: 
    file.writelines(["Hello\n","ReDi NRW", "!!!"])
    print(file.readable())
    print(file.writable())
    # print(file.tell()) # Display current file location.
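
One detail of writelines is easy to miss: despite its name, it does not add any newline characters. A short sketch with a made-up file `writelines_demo.txt`:

```python
import os
import tempfile

# Hypothetical file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "writelines_demo.txt")
words = ["Hello", "ReDi NRW", "!!!"]

with open(path, mode="w") as f:
    f.writelines(words)                           # no separators are added
with open(path) as f:
    glued = f.read()

with open(path, mode="w") as f:
    f.writelines(word + "\n" for word in words)   # add the newline explicitly
with open(path) as f:
    lines = f.read().splitlines()

print(repr(glued))
print(lines)
```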



### Insights on how to work with some data formats

- csv  (today)
- json (today)
- xlsx
- xml, html
- png, jpg, svg
- wav, mp4

In [None]:
# Let us read a csv (comma separated values)
with open('.\\folder\\grocery_list.csv') as csv_file: # Why did I use a double backslash?
    file_content = csv_file.read()
print(file_content)

In [None]:
# Script to calculate the total expense from the list and check whether the budget of 30 € has been exceeded
with open(".\\folder\\grocery_list.csv") as csv_file:
    
    # Break down line into parts with ";" separator
    raw_labels = csv_file.readline().split(";")
    print("raw labels",raw_labels)
    
    # strip removes leading and trailing spaces and newlines
    labels = [ label.strip() for label in raw_labels] 
    print("processed labels", labels)
    
    key = "price"
    
    # Check if key price is contained in labels
    if key in labels:
        
        # get list index of 'price'
        price_idx = labels.index("price")
        print("list index of prices",price_idx)
        
        prices = []

        for line in csv_file:
            
            # Break down line into parts with ";" separator
            raw_data = line.split(";")
            
            # "strip" removes leading and trailing spaces and newlines
            data = [item.strip() for item in raw_data]
            
            # Retrieve the price value 
            price_value_comma = data[price_idx]
            
            # Replace the German decimal comma "," with the decimal point "."
            price_value = price_value_comma.replace(",",".")
            
            prices.append(float(price_value))
        
        total_expenses = sum(prices)
        print("My groceries cost",str(total_expenses), "€")
        
        # Check if budget has been exceeded
        budget = 30
        if total_expenses >= budget:
            print("I paid too much, but it was worth the chocolate. :)")
        elif 0 < total_expenses < budget:
            print("I bought the chocolate and have",str(round(budget-total_expenses,2)),"€ left.:D")
        else:
            print("Oops. How did that happen? :o")
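
As an alternative to splitting each line by hand, the csv module from the standard library can do the parsing. The sketch below does not read the real "grocery_list.csv": it uses made-up sample data in the same style (";" as separator, "," as decimal mark), and the column names are assumptions.

```python
import csv
import io

# Made-up sample data; io.StringIO lets us treat the string like an open file
sample = "item;price;type\neggs;1,30;dairy\nchocolate;5,70;sweets\n"

# DictReader uses the first line as keys for every following row
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
prices = [float(row["price"].replace(",", ".")) for row in reader]

total = sum(prices)
print("My groceries cost", round(total, 2), "€")
```

With a real file, the `io.StringIO(sample)` part would be replaced by the file object from `open`.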
        

### JSON file format

- Abbreviation for **J**ava**S**cript **O**bject **N**otation ("self-describing" and easy to understand)

- A lightweight format for storing and transporting (meta-)data

- JSON is often used when data is sent from a server to a web page

#### Format rules
- Data is in key/value pairs
- Data is separated by commas
- Curly braces hold objects
- Square brackets hold arrays
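
How these rules map to Python types can be sketched with the json module from the standard library. The JSON text below is a made-up example.

```python
import json

# Made-up JSON text: an object that contains a number, an array, true and null
json_text = '{"name": "eggs", "price": 1.3, "tags": ["fresh", "organic"], "available": true, "note": null}'

data = json.loads(json_text)

print(type(data))            # JSON object -> Python dict
print(type(data["tags"]))    # JSON array  -> Python list
print(data["available"])     # JSON true   -> Python True
print(data["note"])          # JSON null   -> Python None
```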

#### Example in JSON vs XML

Data for a

JSON:

In [None]:
{

    "widget": {
        "debug": "on",
        "window": {
            "title": "Sample Konfabulator Widget",
            "name": "main_window",
            "width": 500,
            "height": 500
        },
        "image": { 
            "src": "Images/Sun.png",
            "name": "sun1",
            "hOffset": 250,
            "vOffset": 250,
            "alignment": "center"
        },
        "text": {
            "data": "Click Here",
            "size": 36,
            "style": "bold",
            "name": "text1",
            "hOffset": 250,
            "vOffset": 100,
            "alignment": "center",
            "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
        }
    }

}  

XML:

In [None]:
<widget>
    <debug>on</debug>
    <window title="Sample Konfabulator Widget">
        <name>main_window</name>
        <width>500</width>
        <height>500</height>
    </window>
    <image src="Images/Sun.png" name="sun1">
        <hOffset>250</hOffset>
        <vOffset>250</vOffset>
        <alignment>center</alignment>
    </image>
    <text data="Click Here" size="36" style="bold">
        <name>text1</name>
        <hOffset>250</hOffset>
        <vOffset>100</vOffset>
        <alignment>center</alignment>
        <onMouseUp>
            sun1.opacity = (sun1.opacity / 100) * 90;
        </onMouseUp>
    </text>
</widget>

In [None]:
# Read a JSON file into a Python dictionary with the loads method
import json

with open(".//folder/widget_metadata.json") as json_file:
    # 1. Read content as string
    json_string = json_file.read()
    print("JSON string (formatted):\n")
    print(json_string)
    # 2. Parse the string into a Python dictionary
    dictionary = json.loads(json_string)

print("Dictionary version (less formatted):\n")
print(dictionary)

In [None]:
# Write a Python list of dictionaries to a JSON file with the dumps method

short_grocery_list = [{'item': 'eggs', "price": 1.3, "availability": True}, 
                     {"item": "chocolate", "price": 5.7, "availability": False}]

grocery_json_string = json.dumps(short_grocery_list)
print(grocery_json_string)

with open(".//folder/grocery_metadata.json",mode="w") as json_file:
    json_file.write(grocery_json_string)
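
The json module also offers `json.dump` and `json.load` (without the "s"), which work directly on file objects and skip the intermediate string. A sketch under the assumption that a temporary directory is used instead of "folder":

```python
import json
import os
import tempfile

# Hypothetical location inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "grocery_metadata.json")

groceries = [{"item": "eggs", "price": 1.3, "availability": True},
             {"item": "chocolate", "price": 5.7, "availability": False}]

with open(path, mode="w", encoding="utf-8") as json_file:
    json.dump(groceries, json_file, indent=4)   # indent makes the file easier to read

with open(path, encoding="utf-8") as json_file:
    loaded = json.load(json_file)

print(loaded == groceries)
```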

**Exercises**:

- Write a function "generate_temp" that adds an empty file "temp.tmp" to a directory "temp" using the "x" mode option. The function should use proper error handling (no failures) and return the string "temp file created" or "file already existing", depending on the case. Execute the function twice and print each return value.

- Write a script which stores the grocery list data of "expenses.csv"
- The script should then store the data in csv format in "expenses_swapped_format.csv", but with the rows and columns swapped
- Write a script to read the file "grocery_list.csv" and print out the <ins>total</ins> expense per type using csv format:
  
  type 1,       type 2, ...
  
  expense 1, expense 2, ...
  
  
- Store the content in the file "expenses.csv" using the format given above
- Store the content in a JSON-structured list of dictionaries, iterate over the list and print each key-value pair per dictionary entry:
  
  [{"item": item_1_name, "price": price_1, "category": type_1}, {"item": item_2_name, "price": price_2, "category": type_2}, ...]
  
  
- Store this structure in a JSON file "expenses.json"
- Store the content in a binary file "expenses.bin" using (big endian) memory blocks: 
  
  |item 1| price 1| item 2 | price 2 | ...
  
  Format: 20-byte character field per item key (pad with spaces if necessary) and a single-precision float value per price.
- Print the number of bytes required for the buffer. Research if there is a method.
- Write a script to open "wasser.xyz" using the first method. The file format specification can be found in the link https://de.wikipedia.org/wiki/Xyz-Format
- install numpy https://numpy.org/install/
- Iterating over each (relevant coordinate) row in the file,

  Store the values of the fifth column in numpy array Q
  
  Store the values of columns two through four in a 2D numpy array R; R should look like
  
  R = array([[0.000000, 0.000000, 0.000000],
  
           [0.9611, 0.000000, 0.000000],
      
           [-0.224986, 0.934448, 0.000000]])
  
- Assign the product (*) of Q and R to P
- Create file "wasser_analytik.dat"
- Write entire content of "wasser.xyz" to file "wasser_analytik.dat"
- Append line \*\*\*Electrostatic Properties\*\*\*
- Append line "Net charge": sum(Q) 

  (Summing all partial charges in Q should give zero net charge)
  
  
- Append line "Dipole moment": numpy.linalg.norm(sum(P)) 
  
  (Summing up each dipole moment contribution in P and determining the magnitude of this should give a non-zero value)
  
  https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
  
  
- Append an empty line (often important in files)
