# 9.1 An Introduction to Files

In this section we'll cover the background you need before working with files.  In this submodule you'll learn:

- What files are, where they live, and how they're made.
- File paths.
- Line endings and why they're important.
- Character encodings used for files.

## What is a File?

Remember back in module 1, when we discussed the basic components that make up a computer?  Files are created by moving data from primary memory into secondary memory.  Because files live in secondary memory, they persist (that is, they aren't lost) beyond program executions and power cycles.

<div>
<img src="attachment:image.png" width="800" />
</div>

### Anatomy of a File

Files come in all sorts of shapes and sizes, but the purpose of every file is to store data for later use.  Most of the time we think of files as things that are human readable, such as text files, JSON, CSV, or even docx.  But almost everything that is persistent in a computer is a file, including .mp3, .exe, .png, and so on.  Formally speaking, a **file** is a contiguous collection of bytes used to record data for storage on a device.  Most files have this format:

- **[Header/Signature (Optional)](https://en.wikipedia.org/wiki/List_of_file_signatures):** A "signature" that indicates the file type.  This is optional and depends on the file type.
- **Contents:** The data desired to be stored within the file.
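- **End of File (EOF):** A marker that indicates the end of the file.  On modern systems this is usually a condition the operating system reports once every byte has been read, rather than a special character stored in the file itself.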

<img src="attachment:image.png" width=300/>

### File Types

Every file has a type associated with it.  File types are typically designated by a file extension such as .txt, .csv, .json, and so forth.  A specific extension tells the computer and its programs how to translate the file into something the user can understand and work with.  Let's take a look at the Portable Network Graphics (PNG) file format.  [Wikipedia](https://en.wikipedia.org/wiki/Portable_Network_Graphics) describes the format of the file.  Looking at the file side by side with its hexdump (a raw data inspector), we can see the header `89 50 4E 47 0D 0A 1A 0A`.  If we keep inspecting the data, we'll see the rest of the file's contents, which are used to render the image.

![image.png](attachment:image.png)
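
As a rough sketch, the same signature check can be done in Python by comparing the first eight bytes of some data with the PNG header shown above. The helper name `looks_like_png` and the sample byte strings are made up for illustration; they are not part of any library.

```python
# The eight-byte PNG signature from the hexdump above.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def looks_like_png(data: bytes) -> bool:
    """Return True if the data starts with the PNG signature (hypothetical helper)."""
    return data.startswith(PNG_SIGNATURE)


# Made-up sample data: the PNG signature followed by placeholder bytes,
# and a plain text payload for comparison.
fake_png = PNG_SIGNATURE + b"placeholder image data"
plain_text = b"Just some text"

print(looks_like_png(fake_png))    # True
print(looks_like_png(plain_text))  # False
```

To check a real file, you could read its first eight bytes in binary mode, for example with `open(some_path, "rb").read(8)`, and pass them to the helper.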

## File Paths

Files are accessed by navigating to their file path.  Just like a "path" or trail in real life, a file path is the steps you must take to access the file.  File paths have three major components to them:

![image.png](attachment:image.png)

1. **Folder path:** The folder tree that describes where on the file system the file is.
2. **File name:** The name of the file.
3. **File extension:** The file extension that determines the file type.

<div class="alert alert-info">
    <b>Note:</b> Linux/Unix and macOS use forward slashes (<code>&#47;</code>) and Windows uses backslashes (<code>&#92;</code>) between folders.
</div>
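
As a small illustration, Python's standard `pathlib` module can split a path into these three components. The example paths below are invented for this sketch (including the `C:` drive on the Windows path), and the "pure" path classes are used so the result is the same no matter which operating system runs the code.

```python
from pathlib import PurePosixPath, PureWindowsPath

# A Linux/macOS style path (forward slashes). The path itself is made up.
posix_path = PurePosixPath("/path/to/real_life_doodle.gif")
print(posix_path.parent)  # /path/to            (folder path)
print(posix_path.stem)    # real_life_doodle    (file name)
print(posix_path.suffix)  # .gif                (file extension)

# The same idea with a Windows style path (backslashes).
windows_path = PureWindowsPath(r"C:\path\to\real_life_doodle.gif")
print(windows_path.parent)  # C:\path\to
print(windows_path.name)    # real_life_doodle.gif (name plus extension)
```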

### Absolute vs. Relative Paths

There are two types of paths:

1. **Relative Path:** A file path that is in relation to the current working directory (cwd).
2. **Absolute Path:** A file path that is in relation to the root system folder.

The **current working directory (cwd)** is the folder that the system or program is currently working in.
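
In Python, one way to look up the current working directory is `Path.cwd()` from the standard `pathlib` module. This is only a quick sketch; the folder it prints depends entirely on where the program is run.

```python
from pathlib import Path

# Ask the operating system for the current working directory.
cwd = Path.cwd()

print(cwd)                # output depends on where the program is run
print(cwd.is_absolute())  # True: the cwd is always reported as an absolute path
```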

### Absolute Paths

An absolute path, also called a full path, is the exact path that must be traversed from the root folder to get to the file.  Let's go over an example.  Given the following folder structure:
```text
/
│
├── path/
│   │
│   ├── to/
│   │   └── real_life_doodle.gif
│   │
│   └── tale_darth_plageous.txt
│
└── pokedex.csv
```

How would we reference the file `real_life_doodle.gif` using absolute pathing?

 `/path/to/real_life_doodle.gif`

### Relative Paths

Relative paths are paths that are relative to the current working directory.  Let's go over an example.  Given the following folder structure:

```text
/
│
├── path/
│   │
│   ├── to/ ← Your current working directory (cwd) is here
│   │   └── real_life_doodle.gif
│   │
│   └── tale_darth_plageous.txt
│
└── pokedex.csv
```

Assuming that you're currently in the `to/` folder, how would we reference `real_life_doodle.gif`?

`real_life_doodle.gif`

Same situation, what about `tale_darth_plageous.txt`?

```text
/
│
├── path/
│   │
│   ├── to/ ← Your current working directory (cwd) is here
│   │   └── real_life_doodle.gif
│   │
│   └── tale_darth_plageous.txt
│
└── pokedex.csv
```

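`../tale_darth_plageous.txt`

Here `..` refers to the parent of the current working directory, so the path steps up from `to/` into `path/` before naming the file.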
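
The relationship between the two kinds of paths can be sketched in Python by joining a relative path onto the current working directory and simplifying the result. This sketch assumes the cwd is `/path/to`, as in the examples above; the helper name `to_absolute` is made up, and `posixpath.normpath` only tidies the text of the path without touching the real file system.

```python
import posixpath
from pathlib import PurePosixPath

# Assumed current working directory, matching the folder tree above.
cwd = PurePosixPath("/path/to")


def to_absolute(relative: str) -> str:
    """Join a relative path onto the assumed cwd and collapse any '..' steps."""
    return posixpath.normpath(str(cwd / relative))


print(to_absolute("real_life_doodle.gif"))        # /path/to/real_life_doodle.gif
print(to_absolute("../tale_darth_plageous.txt"))  # /path/tale_darth_plageous.txt
print(to_absolute("../../pokedex.csv"))           # /pokedex.csv
```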

## Line Endings

Time for a small history lesson!  During the height of its usage, Morse code operators used different combinations of characters to indicate special characters such as spaces and new lines.  Soon, teleprinters were invented to automatically take the electrical signals and print out the Morse code results.  The need to standardize these whitespace characters became apparent.

The International Organization for Standardization (ISO) and the American Standards Association (ASA), a predecessor to the American National Standards Institute (ANSI), came up with two different standards.  The ASA stated that the end-of-line character sequence was the Carriage Return (`CR` or `\r`) combined with the Line Feed (`LF` or `\n`), resulting in `CR+LF` or `\r\n`.  The ISO standard, on the other hand, allowed for either `CR+LF` or just `LF`.  The Carriage Return operation moves the cursor back to the beginning of the current line, while the Line Feed advances the cursor to the next line.

Later, as computers were developed, they adopted the different standards.  MS-DOS, which was later used within Windows, followed the strict `CR+LF` usage of the ASA (later ANSI) specification.  The Multics OS, on the other hand, used just the `LF` character, as allowed by the ISO standard.  Multics would go on to inspire Unix, which in turn was the basis for Linux and modern macOS.

So, why does this all matter?  Windows uses `CRLF` (`\r\n`) for its line endings, while Unix, Linux, and modern macOS use just `LF` (`\n`).  This can lead to surprising side effects when working with files that were created on a different operating system.

A file created on Windows looks like this, with the normally invisible line-ending characters written out:

```text
Pug\r\n
Jack Russell Terrier\r\n
English Springer Spaniel\r\n
German Shepherd\r\n
Staffordshire Bull Terrier\r\n
Cavalier King Charles Spaniel\r\n
Golden Retriever\r\n
West Highland White Terrier\r\n
Boxer\r\n
Border Terrier\r\n
```

But open that file on Linux or macOS and you may see something different:

```text
Pug\r
\n
Jack Russell Terrier\r
\n
English Springer Spaniel\r
\n
German Shepherd\r
\n
Staffordshire Bull Terrier\r
\n
Cavalier King Charles Spaniel\r
\n
Golden Retriever\r
\n
West Highland White Terrier\r
\n
Boxer\r
\n
Border Terrier\r
\n
```

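That's because some programs interpret the `\r` and the `\n` each as its own end-of-line character, so every line appears to be followed by an empty one.  Other programs split on `\n` only and leave a stray `\r` at the end of each line.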
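
The effect is easy to reproduce in Python. The sketch below uses a shortened, made-up version of the dog breed list as raw bytes with Windows line endings, then shows what happens when the data is split on `LF` alone compared with a method that understands all three line-ending styles.

```python
# Raw bytes as they might be stored by a Windows program (CRLF line endings).
windows_bytes = b"Pug\r\nBoxer\r\nBorder Terrier\r\n"

# Splitting on LF only leaves a stray carriage return on every line.
lf_only = windows_bytes.split(b"\n")
print(lf_only)  # [b'Pug\r', b'Boxer\r', b'Border Terrier\r', b'']

# str.splitlines() treats \r\n, \n, and \r as line boundaries.
text = windows_bytes.decode("ascii")
print(text.splitlines())  # ['Pug', 'Boxer', 'Border Terrier']

# Converting CRLF to LF gives the Unix/Linux/macOS form of the same text.
normalised = text.replace("\r\n", "\n")
print(repr(normalised))  # 'Pug\nBoxer\nBorder Terrier\n'
```

When reading files in text mode, Python's built-in `open()` performs a similar translation by default, which is why this problem tends to show up mainly when working with raw bytes or with tools that do not normalise line endings.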

## Character Encodings

Another common problem that you may face is the encoding of the byte data. An encoding is a translation between byte data and human readable characters. This is typically done by assigning a numerical value to represent each character. The two most common encodings are the <a href="https://www.ascii-code.com/">ASCII</a> and <a href="https://unicode.org/">Unicode</a> formats. <a href="https://en.wikipedia.org/wiki/ASCII">ASCII can only store 128 characters</a>, while <a href="https://en.wikipedia.org/wiki/Unicode">Unicode can contain up to 1,114,112 characters</a>.

ASCII is actually a subset of Unicode: the first 128 Unicode values are the ASCII characters, and the UTF-8 encoding stores them as the same single bytes that ASCII uses. It’s important to note that parsing a file with the incorrect character encoding can lead to failures or misrepresentation of the characters. For example, if a file was created using the UTF-8 encoding and you try to parse it using the ASCII encoding, any character outside of those 128 values will cause an error.
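
Both the overlap and the failure can be seen with Python's built-in `str.encode` and `bytes.decode`. The word `café` is just a sample chosen for this sketch because it contains one character outside the ASCII range.

```python
# Text containing one non-ASCII character (é).
text = "café"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                  # b'caf\xc3\xa9'
print(len(text), len(utf8_bytes))  # 4 5  (é takes two bytes in UTF-8)

# Pure ASCII text produces identical bytes in both encodings.
print("cafe".encode("ascii") == "cafe".encode("utf-8"))  # True

# Decoding UTF-8 bytes as ASCII fails on the first byte above 127.
decode_failed = False
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as error:
    decode_failed = True
    print("Decoding failed:", error)

# Decoding with a different wrong encoding does not fail, but it
# misrepresents the character instead.
print(utf8_bytes.decode("latin-1"))  # cafÃ©
```

When opening a text file, passing the encoding explicitly, for example `open(some_path, encoding="utf-8")`, avoids relying on a platform default that may not match how the file was written.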