<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#File-Loading-in-Python" data-toc-modified-id="File-Loading-in-Python-1">File Loading in Python</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-2">Learning Outcomes</a></span></li><li><span><a href="#File-loading" data-toc-modified-id="File-loading-3">File loading</a></span></li><li><span><a href="#Using-with:" data-toc-modified-id="Using-with:-4">Using <code>with</code>:</a></span></li><li><span><a href="#Why-use-with?" data-toc-modified-id="Why-use-with?-5">Why use <code>with</code>?</a></span></li><li><span><a href="#with-/-context-managers-in-PyTorch" data-toc-modified-id="with-/-context-managers-in-PyTorch-6"><code>with</code> / context managers in PyTorch</a></span></li><li><span><a href="#3-Parts-of-Working-with-Files" data-toc-modified-id="3-Parts-of-Working-with-Files-7">3 Parts of Working with Files</a></span></li><li><span><a href="#Different-ways-to-read-data-into-memory" data-toc-modified-id="Different-ways-to-read-data-into-memory-8">Different ways to read data into memory</a></span></li><li><span><a href="#How-to-navigate-the-file-system" data-toc-modified-id="How-to-navigate-the-file-system-9">How to navigate the file system</a></span></li><li><span><a href="#pathlib-module" data-toc-modified-id="pathlib-module-10"><code>pathlib</code> module</a></span></li><li><span><a href="#Data-scientists-primarily-needs-to-load-files" data-toc-modified-id="Data-scientists-primarily-needs-to-load-files-11">Data scientists primarily needs to load files</a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-12">Takeaways</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-13">Bonus Material</a></span></li><li><span><a href="#Why-is--everything-not-in-memory?" 
data-toc-modified-id="Why-is--everything-not-in-memory?-14">Why is  everything not in memory?</a></span></li><li><span><a href="#Path-components" data-toc-modified-id="Path-components-15">Path components</a></span></li><li><span><a href="#Writing-files" data-toc-modified-id="Writing-files-16">Writing files</a></span></li><li><span><a href="#Further-Study" data-toc-modified-id="Further-Study-17">Further Study</a></span></li></ul></div>

<center><h2>File Loading in Python</h2></center>


<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Load files using a `with` context manager.
- Process the contents of the files.
- Programmatically load files with the `Path` class.

In [5]:
# All data is just bytes. 
# We must tell the computer how to handle those bytes.

# hexdump shows the bytes in hexadecimal (base 16)
# `hexdump` ships with macOS
# -n limits the output to the given number of bytes
! hexdump -n 96 one_fish.txt

# All data is just bytes. 
! hexdump -n 96 i_am_not_text.jpg

0000000 4f 6e 65 20 46 69 73 68 2c 20 54 77 6f 20 46 69
0000010 73 68 2c 20 52 65 64 20 46 69 73 68 2c 20 42 6c
0000020 75 65 20 46 69 73 68 0a 42 79 20 44 72 2e 20 53
0000030 65 75 73 73 2e 0a 4f 6e 65 20 66 69 73 68 0a 54
0000040 77 6f 20 66 69 73 68 0a 52 65 64 20 66 69 73 68
0000050 0a 42 6c 75 65 20 66 69 73 68 2e 0a 42 6c 61 63
0000060
0000000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 01
0000010 00 01 00 00 ff db 00 43 00 06 04 05 06 05 04 06
0000020 06 05 06 07 07 06 08 0a 10 0a 0a 09 09 0a 14 0e
0000030 0f 0c 10 17 14 18 18 17 14 16 16 1a 1d 25 1f 1a
0000040 1b 23 1c 16 16 20 2c 20 23 26 27 29 2a 29 19 1f
0000050 2d 30 2d 28 30 25 28 29 28 ff db 00 43 01 07 07
0000060


In [6]:
# `xxd` is for linux

# ! xxd -l 96 one_fish.txt

# ! xxd -l 96 i_am_not_text.jpg
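
The same idea can be checked from Python itself, without a shell tool. This is a minimal sketch: the file name `sample_fish.txt` and the temporary directory are made up for the example. Opening a file in binary mode (`"rb"`) returns raw bytes, which we can print as hexadecimal or decode into text.

```python
import tempfile
from pathlib import Path

# Hypothetical sample file, created in a temporary directory for this sketch
sample = Path(tempfile.mkdtemp()) / "sample_fish.txt"
sample.write_bytes(b"One Fish")

# "rb" opens the file in binary mode, so we get raw bytes instead of a str
with open(sample, "rb") as f:
    raw = f.read(8)

hex_view = raw.hex(" ")        # a separator argument needs Python 3.8+
as_text = raw.decode("ascii")  # here we tell Python how to interpret the bytes

print(hex_view)
print(as_text)
```

The hexadecimal values should line up with the first eight bytes of the `hexdump` output for `one_fish.txt` above.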

File loading
------

Files are stored data. 

Let's grab some data:

In [7]:
reset -fs

In [8]:
with open("students.txt") as f:
    names = f.read()

print(names[:99])

Alex
Amee
Aneri
Annie
Audrey
Kai
Binaya 
Boliang
Chris
Nguyen
Catie
Daniel 
Dash 
Efrem
Elyse
Emre



Using `with`:
-----

In Python, we use context managers to handle files. 

- `with` sets up the file for the indented block and automatically closes it when the block ends.
- `open` is the built-in function that opens the file.
- "students.txt" is the file name.
- `as` assigns an alias or nickname.
- `f` is the file handler. Technically, it is a buffered text stream of data.
- `f.read()` is a method that reads the entire file into a single string.


Why use `with`?
----

1. It automatically keeps track of the open file and releases it when the block ends, so we never have to call `f.close()` ourselves.

1. It is guaranteed to close the file no matter how the block exits, even if an exception is raised.
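
Here is a small sketch of that guarantee. The file `demo.txt` is hypothetical and created only for this example. The first part raises an error inside the `with` block and then checks that the file was still closed. The second part shows roughly what `with` saves us from writing by hand.

```python
import tempfile
from pathlib import Path

# Hypothetical file, created only for this sketch
demo_file = Path(tempfile.mkdtemp()) / "demo.txt"
demo_file.write_text("Alex\nAmee\n")

try:
    with open(demo_file) as f:
        first = f.readline()
        raise ValueError("something went wrong mid-block")
except ValueError:
    pass

closed_after_with = f.closed   # the file was closed despite the error

# Roughly what `with` does for us, written by hand
g = open(demo_file)
try:
    second = g.readline()
finally:
    g.close()                  # runs whether or not an error occurred

closed_after_finally = g.closed
print(closed_after_with, closed_after_finally)
```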

`with` / context managers in PyTorch
-----

The deep learning framework PyTorch uses context managers when training neural networks.

```python
with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after updating weights
    w1.grad.zero_()
    w2.grad.zero_()
```

3 Parts of Working with Files
-----

1. File name
    
    The file name has to be the location of the file according to the file system. It needs to be the exact location and name, something like "./data/training_data.txt".
    
    
1. File handler

    `f` is a "file object" that wraps a handle or descriptor the operating system gives us. The descriptor is a unique identifier, and it is how the operating system keeps track of each open file we work with.

    **The file object is not the filename and is also not the file itself on the disk.** 

    It's really just a descriptor and a reference to the file.

1. File contents

    In the default text mode, Python loads the file contents as a string. The file contents can be processed as a string and cast to other data types.
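
The three parts can be seen side by side in a short sketch. The file and its contents are hypothetical, written to a temporary directory so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# 1. File name: a location on the file system (hypothetical file for this sketch)
filename = Path(tempfile.mkdtemp()) / "training_data.txt"
filename.write_text("3\n1\n2\n")

with open(filename) as f:
    # 2. File handler: a file object, not the name and not the data itself
    handler_type = type(f).__name__
    # 3. File contents: arrive as one string in text mode
    contents = f.read()

# Cast each line from str to int
numbers = [int(line) for line in contents.splitlines()]

print(handler_type)
print(repr(contents))
print(numbers)
```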

Different ways to read data into memory
----

In [10]:
filename = "students.txt"

In [30]:
# A single line

with open(filename) as f:    
    one_line = f.readline()

one_line

'Alex\n'

In [12]:
# Get rid of line break

with open(filename) as f:    
    single_line = f.readline().strip() 

single_line

'Alex'

In [13]:
# Read file and remove \n from each line
with open(filename) as f:
    names = f.read().splitlines()
names

['Alex',
 'Amee',
 'Aneri',
 'Annie',
 'Audrey',
 'Kai',
 'Binaya ',
 'Boliang',
 'Chris',
 'Nguyen',
 'Catie',
 'Daniel ',
 'Dash ',
 'Efrem',
 'Elyse',
 'Emre',
 'Eriko',
 'Evan',
 'Eileen',
 'Flora',
 'Grant',
 'Hashneet',
 'Danny',
 'Huidong',
 'Veronica',
 'Janson',
 'Jared',
 'Jiahui',
 'Jordan',
 'Josh',
 'Kexin',
 'Kyle',
 'Lucia',
 'Matt ',
 'Michelle',
 'Ming',
 'Moh',
 'Nicolas',
 'Okeefe ',
 'Yue',
 'April',
 'Phillip ',
 'Rahul',
 'Sherry',
 'Adam',
 'Stephen',
 'Hsuanyu',
 'Veeral',
 'Tony',
 'Siwei',
 'Somya',
 'Sophie',
 'Sophie ',
 'Tako',
 'Tao',
 'Tiance',
 'Tian',
 'Trevor',
 'Luke',
 'Victor',
 'Victor',
 'Vaishnavi',
 'Wenyao',
 'Yi',
 'Yingtong',
 'Yixuan',
 'Yueling',
 'Zach',
 'Zixi']

In [14]:
# Read line-by-line
with open(filename) as f:    
    for line in f:        
        print(line.strip())
        break

Alex


In [15]:
# Read in numbers
filename = 'primes.txt'
with open(filename) as f: 
    primes = f.read()

primes

'2\n3\n5\n7\n11\n13\n17\n19\n23\n29\n31\n37\n41\n43\n47\n53\n59\n61\n67\n71\n73\n79\n83\n89\n97\n101\n103\n107\n109\n113\n127\n131\n137\n139\n149\n151\n157\n163\n167\n173\n179\n181\n191\n193\n197\n199'

In [16]:
# A better way to read in numbers
with open(filename) as f: 
    primes = {int(x) for x in f} # Use a comprehension to cast each item (remember, sets are awesome!)

primes

{2,
 3,
 5,
 7,
 11,
 13,
 17,
 19,
 23,
 29,
 31,
 37,
 41,
 43,
 47,
 53,
 59,
 61,
 67,
 71,
 73,
 79,
 83,
 89,
 97,
 101,
 103,
 107,
 109,
 113,
 127,
 131,
 137,
 139,
 149,
 151,
 157,
 163,
 167,
 173,
 179,
 181,
 191,
 193,
 197,
 199}
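
A set does not keep the order of the lines and drops duplicates. If either matters, a list comprehension is one alternative. The sketch below uses a hypothetical file with a trailing blank line, which would make a bare `int(line)` fail, so blank lines are skipped.

```python
import tempfile
from pathlib import Path

# Hypothetical file with a trailing blank line, created for this sketch
numbers_file = Path(tempfile.mkdtemp()) / "small_primes.txt"
numbers_file.write_text("2\n3\n5\n7\n\n")

with open(numbers_file) as f:
    # line.strip() is empty (falsy) for blank lines, so they are skipped
    primes_list = [int(line) for line in f if line.strip()]

print(primes_list)
```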

How to navigate the file system
-----

| Symbol | Meaning |  
|:-------:|:------:|
| `.` |  the current directory   |
| `..` | up one directory level |  
| `~` | your home directory |  


```python
path =  "./"       # Relative path to current directory   
path = "./data/"   # Relative path to subfolder   
path = "../data/"   # Relative path to up one directory level and into a folder called data   

path = "~/Desktop" # `~` is shorthand for the home directory; open() will not expand it, so call os.path.expanduser() or Path.expanduser() first

path = "/Users/brian/Desktop" # Absolute path / hard coded
```

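Relative paths keep working on other people's machines (including the cloud), as long as the project's folder layout is the same.

Absolute paths are easier to debug because they do not depend on the current working directory.

In general, let's use relative paths rather than absolute paths.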
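
When a relative path fails, it helps to see what it resolves to. This sketch uses only the standard `os` module. The printed values depend on your machine and on where the notebook is running.

```python
import os

relative = os.path.join("..", "data", "one_fish.txt")

# A relative path is resolved against the current working directory
absolute = os.path.abspath(relative)

# open() does not expand "~", so expand it explicitly before using it
desktop = os.path.expanduser(os.path.join("~", "Desktop"))

print(os.getcwd())
print(absolute)
print(desktop)
print(os.path.isabs(relative), os.path.isabs(absolute))
```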

In [28]:
with open("../data/one_fish.txt") as f:
    one_fish = f.read()

print(one_fish[:103])

One Fish, Two Fish, Red Fish, Blue Fish
By Dr. Seuss.
One fish
Two fish
Red fish
Blue fish.
Black fish



`pathlib` module
-----

Let's let Python do the work for us.

In [18]:
from pathlib import Path

`Path` lets us write code that works across different operating systems (macOS, Linux, and Windows).

In [19]:
help(Path)

In [20]:
# Easy to read in a complete file with path 
Path("students.txt").read_text()

'Alex\nAmee\nAneri\nAnnie\nAudrey\nKai\nBinaya \nBoliang\nChris\nNguyen\nCatie\nDaniel \nDash \nEfrem\nElyse\nEmre\nEriko\nEvan\nEileen\nFlora\nGrant\nHashneet\nDanny\nHuidong\nVeronica\nJanson\nJared\nJiahui\nJordan\nJosh\nKexin\nKyle\nLucia\nMatt \nMichelle\nMing\nMoh\nNicolas\nOkeefe \nYue\nApril\nPhillip \nRahul\nSherry\nAdam\nStephen\nHsuanyu\nVeeral\nTony\nSiwei\nSomya\nSophie\nSophie \nTako\nTao\nTiance\nTian\nTrevor\nLuke\nVictor\nVictor\nVaishnavi\nWenyao\nYi\nYingtong\nYixuan\nYueling\nZach\nZixi'

In [21]:
# Define the path as up one level in the file directory, then into a folder named data
path = Path("../data/")

In [22]:
# Kinda like a special str: not actually a str, but open() and friends accept it
type(path)

pathlib.PosixPath

In [31]:
# glob is a method that performs wildcard pattern matching on filenames
# The * in a pattern is a wildcard that matches any run of characters
# "**/" means search recursively through every subdirectory
# '*.*' matches every filename that contains a dot, whatever the file type

# from pathlib import Path

# path = Path("../")

# for filename in path.glob('**/*.*'):
#     print(filename)
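
To try `glob` without depending on your own folders, the sketch below builds a small throwaway directory tree first. The layout is hypothetical and only reuses file names from this lesson. It then compares a top-level pattern with a recursive one.

```python
import tempfile
from pathlib import Path

# Build a small hypothetical directory tree to search
root = Path(tempfile.mkdtemp())
(root / "data").mkdir()
(root / "students.txt").write_text("Alex\n")
(root / "data" / "one_fish.txt").write_text("One fish\n")
(root / "data" / "i_am_not_text.jpg").write_bytes(b"\xff\xd8\xff\xe0")

# "*.txt" matches only the top level; "**/*.txt" also searches subdirectories
top_level = sorted(p.name for p in root.glob("*.txt"))
recursive = sorted(p.name for p in root.glob("**/*.txt"))

print(top_level)
print(recursive)
```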

Data scientists primarily needs to load files
-----

Most data science work is loading and processing a collection of files for analysis.

Data scientists rarely have to write plain text files. Most of the time, data scientists write out data as:

- Comma-separated values (CSV) file
- Dataframe
- JSON 
- Python object using [`pickle`](https://docs.python.org/3/library/pickle.html)
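
As a rough sketch of three of these formats using only the standard library: the records, scores, and file names below are made up for illustration. Note that `pickle` is a binary format, so its file is opened with `"wb"` and `"rb"`.

```python
import csv
import json
import pickle
import tempfile
from pathlib import Path

out_dir = Path(tempfile.mkdtemp())   # hypothetical output folder
rows = [{"name": "Alex", "score": 90}, {"name": "Amee", "score": 85}]  # made-up data

# CSV: newline="" lets the csv module control line endings
with open(out_dir / "scores.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "score"])
    writer.writeheader()
    writer.writerows(rows)

# JSON: a text format that keeps numbers as numbers
with open(out_dir / "scores.json", "w") as f:
    json.dump(rows, f)

# pickle: a binary format for Python objects
with open(out_dir / "scores.pkl", "wb") as f:
    pickle.dump(rows, f)

# Read two of the files back to check the round trip
with open(out_dir / "scores.json") as f:
    from_json = json.load(f)
with open(out_dir / "scores.pkl", "rb") as f:
    from_pickle = pickle.load(f)

print(from_json == rows, from_pickle == rows)
```

Only unpickle files you trust, since loading a pickle can run arbitrary code.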

<center><h2>Takeaways</h2></center>


- We need to read (input) and write (output) data to files.
- Typically, we are just reading plain text files.
- It is most common to use a `with open()` block.
- `Path` enables programmatic file access across all operating systems.

Bonus Material
----

Why is  everything not in memory?
----

Why do we have a separation between memory and storage?

There is __no__ single computing substrate that is fast, nonvolatile, and cheap.

Path components
-----

- name: the file name without any directory

- parent: the directory that contains the file, or the parent directory if the path is a directory

- stem: the file name without the suffix

- suffix: the file extension

- anchor: the part of the path before the directories (the drive and root, such as `/`)
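
These components are attributes of a path object. The sketch below uses `PurePosixPath`, which only parses the string and never touches the disk, so the hypothetical path does not need to exist.

```python
from pathlib import PurePosixPath

# Hypothetical path, used only to show how it is split into components
p = PurePosixPath("/Users/brian/data/training_data.txt")

print(p.name)     # training_data.txt
print(p.parent)   # /Users/brian/data
print(p.stem)     # training_data
print(p.suffix)   # .txt
print(p.anchor)   # /
```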

Writing files
-----


3 modes:

1. `r` - read, default

1. `w` - write, deletes whatever was already there

1. `a` - append, adds to the end of what was already there
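
A small sketch of how the three modes behave, using a hypothetical scratch file in a temporary directory so nothing of yours gets overwritten:

```python
import tempfile
from pathlib import Path

temp_file = Path(tempfile.mkdtemp()) / "temp.txt"   # hypothetical scratch file

with open(temp_file, "w") as f:   # "w" creates the file
    f.write("hi!\n")

with open(temp_file, "w") as f:   # "w" again: the old contents are erased
    f.write("hello\n")

with open(temp_file, "a") as f:   # "a": new text goes at the end
    f.write("goodbye\n")

with open(temp_file) as f:        # "r" is the default mode
    final_contents = f.read()

print(repr(final_contents))
```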


In [24]:
# filename = 'temp.txt'

# with open(filename, 'w') as f:
#     f.write("hi!")

In [25]:
# One-liner to get a random line (read_text() closes the file for us)
import random
from pathlib import Path

print(random.choice(Path("students.txt").read_text().splitlines()))

Victor



Further Study
------

- Real Python
    - https://realpython.com/courses/reading-and-writing-files-python/
    - https://realpython.com/courses/practical-recipes-files/
    - https://realpython.com/python-pathlib/
- Stackabuse
    - https://stackabuse.com/read-a-file-line-by-line-in-python/
    - https://stackabuse.com/file-handling-in-python/
    - https://stackabuse.com/reading-files-with-python/
    - https://stackabuse.com/writing-files-using-python/
    - https://stackabuse.com/reading-and-writing-lists-to-a-file-in-python/
    - https://stackabuse.com/how-to-create-move-and-delete-files-in-python/
