# 2. File Handling in Python

Artificial intelligence relies heavily on data, and that data has to be stored somewhere. 
This is done through files of various formats. 
Whether a file contains a dataset for training a model or the results of your AI algorithms, 
knowing how to handle files is an essential skill for any data scientist, engineer, or researcher.

In this chapter, we dive into the practical world of file handling in Python. We will explore how to read from and write to different types of files, so that you can load, manipulate, and save data. 

## 2.1 Text Files
One of the most common ways to store data is in a text file. 
Unlike binary files, which may contain data in a format only interpretable by specific software, text 
files hold *human-readable* content.

**Key Characteristics of Text Files:**
- **Plain Text**: Text files consist of unformatted plain text. They do not include any special 
formatting, styles, or embedded objects commonly found in word processor documents.
- **Encoding**: Text files are encoded using a character encoding such as UTF-8 or ASCII. An encoding defines how characters are represented as sequences of bytes, that is, a mapping between the characters of a character set and their binary representations. The details are not important for now; just keep in mind that if you open a dataset and it looks like it contains many weird-looking characters, the file was most likely written with a different encoding than the one used to read it.
- **Line-based Structure**: Information in a text file is organized into lines, with each line typically representing a separate piece of information or record. Lines are usually delimited by newline characters (`\n`).
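
To make the encoding and line-structure points concrete, here is a minimal sketch that uses only built-in string methods. The sample text is made up for illustration.

```python
# Made-up sample text with one non-ASCII character and two newline characters
text = "café\nsecond line\n"

# Encoding maps characters to bytes; 'é' takes two bytes in UTF-8
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))

# Decoding with a different encoding does not fail here,
# but it produces the "weird-looking characters" mentioned above
garbled = utf8_bytes.decode("latin-1")
print(garbled.splitlines()[0])

# Lines are delimited by '\n'; splitlines() drops the delimiters
lines = text.splitlines()
print(lines)
```

The byte count is one larger than the character count because only the `é` needs more than one byte in UTF-8.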

### 2.1.1 Text File Formats

**Plain Text File (.txt)**

A `.txt` file is a type of computer file that stores plain text information without any formatting or styles. It is a simple, human-readable text document that can be opened and edited with a basic text editor.

Example:
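
As a small illustration, the sketch below writes two lines of plain text to a `.txt` file and reads them back. The file name and content are hypothetical, and a temporary directory is used so that nothing in your project is overwritten.

```python
import os
import tempfile

# Hypothetical file name and content, chosen only for this illustration
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'w' opens the file for writing (creating or overwriting it)
with open(path, "w", encoding="utf-8") as file:
    file.write("Experiment 1\n")
    file.write("learning rate: 0.01\n")

# Reading the file back returns the text that was written
with open(path, "r", encoding="utf-8") as file:
    content = file.read()

print(content)
```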


**Comma-Separated Values File (.csv)**

A `.csv` file is a plain text file format commonly used to store tabular data, where each line represents a row of values, and fields within each row are separated by commas. It is widely used for its simplicity and ease of import/export with spreadsheet software and databases.

Example:
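
The sketch below shows what CSV text looks like by writing a made-up table to an in-memory buffer with the built-in `csv` module. The table contents are invented for illustration.

```python
import csv
import io

# Made-up table: a header row followed by two data rows
table = [["name", "accuracy"], ["model_a", 0.91], ["model_b", 0.87]]

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(table)

# Each row becomes one line, with the fields separated by commas
csv_text = buffer.getvalue()
print(csv_text)
```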

**JavaScript Object Notation File (.json)**


A `.json` file is a lightweight data interchange format commonly used for storing and exchanging structured data in a human-readable format using key-value pairs. In AI, `.json` files are frequently employed for configuring settings, storing parameters for machine learning models, or exchanging data between different components of an AI system, due to their simplicity and ease of parsing, but also to the flexibility allowed by their hierarchical structure.

Example:
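
As a hedged illustration of the key-value and hierarchical structure, the sketch below converts a made-up configuration to JSON text and back using the built-in `json` module. All keys and values are invented for this example.

```python
import json

# Made-up configuration with a nested (hierarchical) structure
config = {
    "model": "classifier",
    "training": {"epochs": 10, "learning_rate": 0.01},
    "labels": ["cat", "dog"],
}

# Serialize to JSON text; indent=2 makes it easier to read
json_text = json.dumps(config, indent=2)
print(json_text)

# Parse the text back into Python dictionaries and lists
loaded = json.loads(json_text)
print(loaded["training"]["epochs"])
```

To work with a `.json` file on disk instead of a string, the same module offers `json.dump(obj, file)` and `json.load(file)`.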

In your journey exploring and working with Python and AI, you will almost certainly encounter more file types, but these three are some of the most common ones.



### 2.1.2 Reading Text Files

Python provides simple and effective ways to read the contents of text files. We'll learn how to open a file, read its content, and perform operations such as reading line by line or reading the entire file at once.

The following syntax allows you to open a file for reading (`'r'`):  

```python
with open(filename, 'r') as file:
    content = file.read()
# The with statement closes the file automatically when the block ends
```

Run the following example:

In [None]:
with open('assets/readfile.txt', 'r') as file:
    content = file.read()
# No file.close() is needed: the with statement closes the file automatically

print(content)

You can open the `readfile.txt` file in the assets folder if you want to check its contents. 
We can check the type of the content variable like this:

In [None]:
print(type(content))

Note that Python reads the file as a string. That means we can access individual characters by index, or a range of characters with a slice. For example, the first 12 characters are:

In [None]:
print(content[0:12])

It is also possible to read a text file into a list of lines by using the `readlines()` method. See the following example:

In [None]:
with open('assets/readfile.txt', 'r') as file:
    lines = file.readlines()  # list of strings, each ending with '\n'

print(f'First line: {lines[0]}')
print(f'Second line: {lines[1]}')
print(f'Third line: {lines[2]}')
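
`readlines()` loads every line into memory at once. For larger files, a common alternative is to loop over the file object, which yields one line at a time. The sketch below assumes a small stand-in file created in a temporary directory, so it runs on its own.

```python
import os
import tempfile

# Create a small stand-in file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as file:
    file.write("first\nsecond\nthird\n")

stripped_lines = []
with open(path, "r", encoding="utf-8") as file:
    # Iterating over the file object yields one line at a time
    for number, line in enumerate(file, start=1):
        clean = line.rstrip("\n")  # drop the trailing newline
        stripped_lines.append(clean)
        print(f"Line {number}: {clean}")
```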

**Example: Reading a CSV**

A very common way to store data is in a CSV file. It is a simple format that resembles a spreadsheet such as an Excel file: each **row** is on its own line, and commas separate the **columns**. A `.csv` file also generally contains a **header** with some descriptive information and the names of the columns. Here is an example:

```
EPPLER 67 AIRFOIL

x,y
0.00084,0.00369
0.00638,0.01188
0.01639,0.02086
0.03072,0.03016
0.0493,0.03945
0.07199,0.04847
0.09862,0.05701
```

The full file can be found in `assets/eppler-airfoil.csv`. We can read such a file using the `readlines()` method:

In [None]:
with open('assets/eppler-airfoil.csv', 'r') as file:
    lines = file.readlines()

header = lines[:3]
rows = lines[3:]

print(header)
print(rows)

We can now convert the list of rows into data using the `strip()` method (to get rid of the `\n`) and the `split()` method (to split on the commas):

In [None]:
rows = [row.strip() for row in rows] # Using list comprehension to remove the \n

rows = [row.split(',') for row in rows] # Using list comprehension to split the rows on the commas

print(rows)

Each value can now be accessed using:

In [None]:
print(rows[0][1]) # first line, second column (y-coordinate)

As you can see, this is not a very fast or elegant way to read a CSV file; it takes quite a few steps to get to the data. Luckily, Python has a built-in module that makes our lives (and our code!) easier:

In [None]:
import csv

with open('assets/eppler-airfoil.csv', 'r', newline='') as file:  # newline='' is recommended by the csv docs
    reader = csv.reader(file) # Create a reader object
    rows = list(reader) # Convert the reader object to a list

rows = rows[3:] # Remove the header
print(rows)

Here the `csv` module makes the code more readable. It also takes care of parsing details such as delimiters and quoted fields. The full documentation can be found here: https://docs.python.org/3/library/csv.html.
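
Note that `csv.reader` returns every field as a string. Before doing any calculations, the values have to be converted to numbers. The sketch below inlines the first lines of the airfoil example above so that it is self-contained; with the real file you would pass the opened file object to `csv.reader` instead.

```python
import csv
import io

# First lines of the airfoil example, inlined so the sketch runs on its own
sample = """EPPLER 67 AIRFOIL

x,y
0.00084,0.00369
0.00638,0.01188
0.01639,0.02086
"""

reader = csv.reader(io.StringIO(sample))
rows = list(reader)[3:]  # skip the title line, the blank line, and the column names

# Every field is a string, so convert each one to a float
points = [(float(x), float(y)) for x, y in rows]
print(points[0])
```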

## 2.2 Files with Images
Another very useful skill for people working in machine learning is reading images. Computer vision, one of the most important fields within AI, trains models on images, and to do so the images must first be loaded and converted into numerical data. In this section, we will cover how to read image files.

Before reading images, it is important to realize how images are stored. An image is a visual representation of information, and it is composed of individual picture elements, commonly known as **pixels**. Pixels are the smallest unit of an image, and each pixel holds information about the color of a tiny square in the overall image.

Colors in digital images are created by combining different intensities of three primary colors: **Red**, **Green**, and **Blue**. This color model is known as *RGB*. Each pixel in an RGB image has three values representing the intensity of Red, Green, and Blue, respectively. The combination of these three colors in varying intensities produces a wide spectrum of colors.

For example, an RGB triplet `(200, 50, 100)` represents a pixel with a high intensity of red (`200`), a low intensity of green (`50`), and a moderate intensity of blue (`100`). Combining these three intensities results in a specific color for that pixel. With one byte (8 bits) per channel, each value ranges from 0 to 255, which gives 256 possible levels per color.
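
The sketch below takes the triplet from the text and, assuming the common representation of one byte per channel, converts it to a hexadecimal color code and to raw bytes.

```python
# The triplet from the text, assuming 8 bits (one byte) per channel
red, green, blue = 200, 50, 100

# Each channel fits in one byte, so it lies between 0 and 255
in_range = all(0 <= value <= 255 for value in (red, green, blue))
print(in_range)

# Hexadecimal color code, as used in web pages
hex_code = f"#{red:02x}{green:02x}{blue:02x}"
print(hex_code)

# One pixel packed into three raw bytes
pixel_bytes = bytes([red, green, blue])
print(len(pixel_bytes))
```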

### 2.2.1 Image File Formats
An image format is a standardized way of representing and storing visual information in a digital form. It defines how data about colors, pixels, and other image characteristics are organized and encoded, allowing computers and software to interpret and display the visual content.

Different image formats have varying characteristics, including compression methods, color representations, and support for features like transparency and animation. The choice of image format depends on the specific requirements of the application and the type of visual content being stored or transmitted. 

The most popular image formats are **JPEG and PNG**:

- **JPEG** is a popular image compression standard developed by the *Joint Photographic Experts Group*. It is a lossy compression format, meaning it reduces file size by discarding some information that the human eye may not easily notice. JPEG is well-suited for photographs and images with complex color gradients. However, due to the lossy compression, repeatedly saving a JPEG image can result in a degradation of image quality.

- **PNG** is another widely-used image format, but unlike JPEG, it uses *lossless* compression. This means that no information is lost during compression, resulting in higher image quality. PNG is well-suited for images with sharp edges, transparency, or simple graphics. It supports a full range of colors and has become a standard for web graphics, logos, and images where preserving fine details is crucial.
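
The difference between lossy and lossless compression can be checked with a small experiment. The sketch below assumes the `Pillow` and `numpy` packages are installed and uses a synthetic noise image (made up for this illustration) that is saved to memory in both formats and loaded back.

```python
import io

import numpy as np
from PIL import Image

# Synthetic 32x32 RGB noise image, made up for this illustration
rng = np.random.default_rng(seed=0)
original = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
img = Image.fromarray(original)


def round_trip(image, fmt):
    """Save the image to an in-memory buffer in the given format and load it back."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    buffer.seek(0)
    return np.array(Image.open(buffer))


png_copy = round_trip(img, "PNG")
jpeg_copy = round_trip(img, "JPEG")

print("PNG identical: ", np.array_equal(original, png_copy))
print("JPEG identical:", np.array_equal(original, jpeg_copy))
```

The PNG copy matches the original pixel for pixel, while the JPEG copy typically does not, because some information is discarded during compression.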

### 2.2.2 Reading Images
As with `csv` files, there are many ways to read images. However, reading images becomes 
a lot easier when using a package such as `Pillow` (imported as `PIL`). 

The following block shows how to read a `jpeg` file:

In [None]:
from PIL import Image

# Specify the path to your JPEG file
jpeg_file_path = 'assets/image.jpg'

# Open the JPEG file
with Image.open(jpeg_file_path) as img:
    # Display information about the image
    print(f"Image format: {img.format}")
    print(f"Image mode: {img.mode}")
    print(f"Image resolution: {img.size}")

    # You can perform various operations on the image here
    # For example, you can show the image with
    display(img)

    # Note: in a regular Python script, the line img.show() could have also been used instead of
    # display(img)

Pillow, the maintained fork of the Python Imaging Library (PIL), takes care of opening various image formats. The full documentation can be found here: https://pillow.readthedocs.io/en/stable/reference/Image.html. 

To apply machine learning to images, we need to convert them to numerical values. 
Using the `numpy` package (more details in the next notebook), we can convert an image into an `array` (a datatype very similar to a list). 

Running the code block below first prints the array shape. The array is three-dimensional with shape 1080x1920x3: 1080x1920 is the pixel resolution of the image (rows by columns) and 3 is the number of color channels (RGB). It then prints an abbreviated display of the array itself, in which we can see the RGB values (integers ranging from 0 to 255). 

In [None]:
import numpy as np

with Image.open(jpeg_file_path) as img:
    # Convert the image to a NumPy array
    array = np.array(img)
    print(f"Array shape: {array.shape}")

print(array)
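
Once the image is an array, individual pixels and color channels can be accessed by indexing. The sketch below uses a tiny made-up array instead of the real image, so it runs without the image file.

```python
import numpy as np

# Made-up 2x3 image (2 rows, 3 columns) with 3 color channels
array = np.zeros((2, 3, 3), dtype=np.uint8)
array[0, 1] = [200, 50, 100]  # set the pixel in row 0, column 1

print(array.shape)           # (rows, columns, channels)
print(array[0, 1])           # RGB values of one pixel
print(array[0, 1, 0])        # red value of that pixel
print(array[:, :, 0].shape)  # the full red channel is a 2D array
```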

## 2.3 Practice File Handling
The best way to practice your file handling skills is to try it yourself. 
1. Open the image in the `assets` folder and load it using `Pillow`. 
2. Apply a Gaussian Blur filter with radius 10 using `Pillow` (how to do this is shown in the [Pillow documentation](https://pillow.readthedocs.io/en/stable/reference/ImageFilter.html)). 
3. Convert the blurred image to a 3D array and save it as a `csv` file in the `assets` folder.

Hint: you have to convert the NumPy array to a list using the `array.tolist()` method; only then can it be saved as a CSV.

In [None]:
# Implement your solution here.

## Solutions

#### Gaussian Blur filter

In [None]:
# Answer cell

import csv

import numpy as np
from PIL import Image, ImageFilter

jpeg_file_path = 'assets/image.jpg'

with Image.open(jpeg_file_path) as img:
    # Apply a Gaussian blur with radius 10, then convert the result to an array
    blurred = img.filter(ImageFilter.GaussianBlur(radius=10))
    image_array = np.array(blurred)

# Convert to a (nested) list
image_list = image_array.tolist()

# Write the list to a CSV file; newline='' is recommended by the csv module
# Each cell holds the [R, G, B] list of one pixel
with open('assets/image.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(image_list)
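
In the solution above, each CSV cell contains the text of a whole pixel list, such as `[200, 50, 100]`, which is awkward to parse when reading the file back. One possible alternative, sketched below under the assumption that a plain numeric CSV is preferred, is to reshape the array to two dimensions before writing and to restore the shape after reading. A tiny made-up array and an in-memory buffer stand in for the blurred image and the file.

```python
import csv
import io

import numpy as np

# Made-up 2x3 RGB image standing in for the blurred image array
image_array = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)
height, width, channels = image_array.shape

# Flatten to 2D: one CSV row per image row, width * channels values per row
flat = image_array.reshape(height, width * channels)

buffer = io.StringIO()  # in-memory stand-in for a file opened with newline=''
csv.writer(buffer).writerows(flat.tolist())

# Read the values back and restore the original 3D shape
buffer.seek(0)
values = [[int(v) for v in row] for row in csv.reader(buffer)]
restored = np.array(values, dtype=np.uint8).reshape(height, width, channels)
print(np.array_equal(image_array, restored))
```

The shape is not stored in the CSV itself, so the height, width, and number of channels must be known when reading the file back.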