<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-1">Learning Outcomes</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#By-the-end-of-this-session,-you-should-be-able-to:" data-toc-modified-id="By-the-end-of-this-session,-you-should-be-able-to:-1.0.1">By the end of this session, you should be able to:</a></span></li></ul></li></ul></li><li><span><a href="#File-Handling-in-Python" data-toc-modified-id="File-Handling-in-Python-2">File Handling in Python</a></span></li><li><span><a href="#Let's-break-down-that-command" data-toc-modified-id="Let's-break-down-that-command-3">Let's break down that command</a></span></li><li><span><a href="#Why-use-with?" data-toc-modified-id="Why-use-with?-4">Why use <code>with</code>?</a></span></li><li><span><a href="#3-Aspects-of-Working-with-Files" data-toc-modified-id="3-Aspects-of-Working-with-Files-5">3 Aspects of Working with Files</a></span></li><li><span><a href="#Different-ways-to-read-data" data-toc-modified-id="Different-ways-to-read-data-6">Different ways to read data</a></span></li><li><span><a href="#Remember-navigating-file-system-…" data-toc-modified-id="Remember-navigating-file-system-…-7">Remember navigating file system …</a></span></li><li><span><a href="#pathlib-module" data-toc-modified-id="pathlib-module-8"><code>pathlib</code> module</a></span></li><li><span><a href="#Student-Activities" data-toc-modified-id="Student-Activities-9">Student Activities</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-10">Summary</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-11">Bonus Material</a></span></li><li><span><a href="#Why-is--everything-not-in-memory-all-the-time?" 
data-toc-modified-id="Why-is--everything-not-in-memory-all-the-time?-12">Why is  everything not in memory all the time?</a></span></li><li><span><a href="#Path-components" data-toc-modified-id="Path-components-13">Path components</a></span></li><li><span><a href="#Writing-files" data-toc-modified-id="Writing-files-14">Writing files</a></span></li><li><span><a href="#Further-Study" data-toc-modified-id="Further-Study-15">Further Study</a></span></li></ul></div>

<center><h2>Learning Outcomes</h2></center>

#### By the end of this session, you should be able to:

- Read a variety of data files with Python.
- Use `with` to manage file handling.
- Use `pathlib` to programmatically handle paths.


File Handling in Python
-----

Thus far we have only created data in memory and had to recreate it each time we wanted to use it. It is straightforward in Python to load and save data to persist it from session to session.

In Data Science, files are stored data. 

Let's grab some data:

In [2]:
reset -fs

In [3]:
with open('students.txt') as f:
    names = f.read()

print(names[:99])

Kamron
Mourya
Nithish
Shane 
Himanshu
Min  
Jiaqi
Sihan
Xiao 
Lisa 
Ruizhe
Wenjie
Alan 
Lea 
Jacob



Let's break down that command
-----

- `with` starts a context manager block; used with `open`, it guarantees the file is closed when the block ends.
- `open` is the function that opens a file and returns a file object.
- "students.txt" is the file name.
- `as` binds the opened file to a name (an alias or nickname).
- `f` is the file handle. Technically, a buffered text stream of data.
- `f.read()` is the method that reads the entire file at once.

In Python, we use context managers to handle files. `with` is the keyword for context managers.

You'll see context managers again in TensorFlow.

Why use `with`?
----

1. It keeps track of the file for us, so we never have to remember to call `f.close()`.

1. The file is guaranteed to be closed no matter how the block exits, even if an exception is raised.
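
To make the second point concrete, here is a minimal sketch that raises an error in the middle of a `with` block and then checks whether the file was closed. The file name `demo.txt` and the temporary directory are assumptions made only so the example runs anywhere; they are not part of the course data.

```python
import tempfile
from pathlib import Path

# Create a throwaway file so the example is self-contained.
tmp_dir = tempfile.TemporaryDirectory()
demo_file = Path(tmp_dir.name) / "demo.txt"   # hypothetical file name
demo_file.write_text("Kamron\nMourya\n")

try:
    with open(demo_file) as f:
        first = f.readline()
        # Simulate a failure while the file is still open.
        raise ValueError("something went wrong mid-read")
except ValueError:
    pass

# The with block closed the file on the way out, despite the exception.
closed_after_error = f.closed
print(closed_after_error)   # True

tmp_dir.cleanup()
```

Without `with`, we would need a `try` / `finally` block and an explicit `f.close()` to get the same guarantee.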

3 Aspects of Working with Files
-----

1. File contents
1. File name
1. File object 

`f` is a "file object": Python's wrapper around the handle (file descriptor) that the operating system gives us. The descriptor is a unique identifier and is how the operating system refers to a file that we work with. **The file object is not the filename and is also not the file itself on the disk.** It is a reference to the open file.

In [4]:
whos

Variable   Type             Data/Info
-------------------------------------
f          TextIOWrapper    <_io.TextIOWrapper name='<...>ode='r' encoding='UTF-8'>
names      str              Kamron\nMourya\nNithish\n<...>gwen\nZiyang \nYunzheng\n


In [5]:
# Let's check out the attributes and methods of this file object
# f.<tab>
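
Tab completion is interactive, so here is a small sketch of a few attributes worth looking at. It assumes a throwaway file called `students_demo.txt` (a hypothetical name) rather than the real `students.txt`, so that it runs on its own.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "students_demo.txt"   # hypothetical file
    demo_file.write_text("Kamron\nMourya\n", encoding="utf-8")

    with open(demo_file, encoding="utf-8") as f:
        # Inside the block: the file's name, its mode, and whether it is closed.
        info_inside = (Path(f.name).name, f.mode, f.closed)
        contents = f.read()

    # Outside the block the file object still exists, but it is closed.
    closed_outside = f.closed

print(info_inside)      # ('students_demo.txt', 'r', False)
print(closed_outside)   # True
```

Note that `f` outlives the `with` block as a variable, which is why it still shows up in `whos`, but it can no longer be read from.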

Different ways to read data
----

In [6]:
filename = 'students.txt'

# Single line
with open(filename) as f:    
    one_line = f.readline()

one_line

'Kamron\n'

In [7]:
# Get rid of linebreak with fluent interface
with open(filename) as f:    
    single_line = f.readline().strip() 

single_line

'Kamron'

In [8]:
# Read file and remove \n from each line
with open(filename) as f:
    names = f.read().splitlines()
names[:10]

['Kamron',
 'Mourya',
 'Nithish',
 'Shane ',
 'Himanshu',
 'Min  ',
 'Jiaqi',
 'Sihan',
 'Xiao ',
 'Lisa ']

In [9]:
# Read line-by-line
with open(filename) as f:    
    for line in f:        
        print(line.strip())
        break

Kamron


In [10]:
# Read in numbers (as one long string)
filename = 'primes.txt'
with open(filename) as f: 
    primes = f.read()

primes

'2\n3\n5\n7\n11\n13\n17\n19\n23\n29\n31\n37\n41\n43\n47\n53\n59\n61\n67\n71\n73\n79\n83\n89\n97\n101\n103\n107\n109\n113\n127\n131\n137\n139\n149\n151\n157\n163\n167\n173\n179\n181\n191\n193\n197\n199'

In [11]:
# Read in numbers (as a list of strings, one per line)
with open(filename) as f: 
    primes = f.readlines()

primes[:10]

['2\n', '3\n', '5\n', '7\n', '11\n', '13\n', '17\n', '19\n', '23\n', '29\n']

In [12]:
with open(filename) as f:
    primes = f.read().splitlines()
primes[:10]

['2', '3', '5', '7', '11', '13', '17', '19', '23', '29']

In [13]:
with open(filename) as f: 
    primes = {int(x) for x in f} # Use a comprehension (remember sets are awesome!)

primes

{2,
 3,
 5,
 7,
 11,
 13,
 17,
 19,
 23,
 29,
 31,
 37,
 41,
 43,
 47,
 53,
 59,
 61,
 67,
 71,
 73,
 79,
 83,
 89,
 97,
 101,
 103,
 107,
 109,
 113,
 127,
 131,
 137,
 139,
 149,
 151,
 157,
 163,
 167,
 173,
 179,
 181,
 191,
 193,
 197,
 199}
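
The comprehension works even though every line still ends in `\n`, because `int()` ignores surrounding whitespace. The sketch below shows that, and also contrasts a list (keeps file order and duplicates) with a set (fast membership tests, no guaranteed order). It uses `io.StringIO` as a stand-in for an opened file, which is an assumption made so the example needs no file on disk.

```python
import io

# io.StringIO stands in for an opened text file so the sketch runs anywhere.
fake_file = io.StringIO("2\n3\n5\n7\n11\n")

# int() ignores surrounding whitespace, so the trailing "\n" is harmless.
primes_list = [int(line) for line in fake_file]

fake_file.seek(0)   # rewind to read the same data again
primes_set = {int(line) for line in fake_file}

print(primes_list)        # [2, 3, 5, 7, 11]
print(7 in primes_set)    # True
```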

Remember navigating file system …
-----

Python understands the same path shorthand as the command line:

`.` - current directory     
`..` - parent directory  
`~` - home directory (the shell expands this; in Python use `Path.home()` or `Path.expanduser()`)  

If possible, let's use relative paths rather than absolute paths.

- path =  "./"       # Relative path to current directory  
- path = "./data/"   # Relative path to subfolder  
- path = "../data/"  # Relative path up a directory to subfolder  
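
One caveat: `.` and `..` work directly in `open`, but `~` is expanded by the shell, not by Python. A minimal sketch of how to resolve these locations from Python, using `pathlib` (the `data` folder name is only an illustration):

```python
from pathlib import Path

home = Path.home()                       # the current user's home directory
expanded = Path("~/data").expanduser()   # "~" replaced with the home directory
cwd = Path.cwd()                         # the directory that "." refers to

print(expanded == home / "data")   # True
print(cwd.is_absolute())           # True
```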

In [14]:
# Open file in another relative directory
with open("../data/one_fish.txt") as f:
    one_fish = f.read()

print(one_fish[:90])

One Fish, Two Fish, Red Fish, Blue Fish
By Dr. Seuss.
One fish
Two fish
Red fish
Blue fish


`pathlib` module
-----

Let's let Python do the work of generating paths programmatically.

In [15]:
from pathlib import Path

`Path` lets us write code that works everywhere (Unix and Windows systems).

In [16]:
Path?

In [17]:
path = Path('../data/')

In [18]:
# Kinda like a special str
type(path)

pathlib.PosixPath

In [19]:
# Concatenate directories programmatically
path = Path('../')
folder = "data"
filename = "one_fish.txt"
q = path / folder / filename

with open(q) as f:
    one_fish = f.read()
print(one_fish[:90])

One Fish, Two Fish, Red Fish, Blue Fish
By Dr. Seuss.
One fish
Two fish
Red fish
Blue fish
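
A `Path` can also open or read itself, which saves passing it to `open`. The sketch below builds a path with `/` and reads it two ways. The folder and file names are hypothetical and live in a temporary directory so the example runs without the course data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build the path piece by piece with the / operator.
    data_dir = Path(tmp_dir) / "data"            # hypothetical folder
    data_dir.mkdir()
    fish_file = data_dir / "one_fish_demo.txt"   # hypothetical file
    fish_file.write_text("One fish\nTwo fish\n")

    with fish_file.open() as f:   # same as open(fish_file)
        via_open = f.read()

    # read_text() opens, reads, and closes the file in one call.
    via_read_text = fish_file.read_text()

print(via_open == via_read_text)   # True
```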


Student Activities
------

1. Find out what your current directory is.
1. Define an instance of `pathlib.Path` that points to the data for this course.
1. Load `success_by_emily_dickinson.txt`
1. Load `primes.txt` 

In [20]:
path = Path('../data/')
with open( path / "success_by_emily_dickinson.txt") as f:
    success = f.read().splitlines() 
success

['Success is counted sweetest',
 'By those who never succeed.',
 'To comprehend a nectar',
 'Requires sorest need.',
 '',
 'Not one of all the purple host',
 'Who took the flag to-day',
 'Can tell the definition,',
 'So clear, of victory,',
 '',
 'As he, defeated, dying,',
 'On whose forbidden ear',
 'The distant strains of triumph',
 'Break, agonized and clear!']

In [21]:
with open(path / "primes.txt") as f:  # path already points at the data folder
    primes = {int(x) for x in f} 

primes

{2,
 3,
 5,
 7,
 11,
 13,
 17,
 19,
 23,
 29,
 31,
 37,
 41,
 43,
 47,
 53,
 59,
 61,
 67,
 71,
 73,
 79,
 83,
 89,
 97,
 101,
 103,
 107,
 109,
 113,
 127,
 131,
 137,
 139,
 149,
 151,
 157,
 163,
 167,
 173,
 179,
 181,
 191,
 193,
 197,
 199}

In [37]:
path = Path("/")
# All entries directly under `path` (here the filesystem root) whose names contain a dot
for filename in path.glob('*.*'):
    print(filename)

/.HFS+ Private Directory Data
/.Spotlight-V100
/.DS_Store
/.PKInstallSandboxManager-SystemSoftware
/installer.failurerequests
/.file
/.Trashes
/.fseventsd
/.DocumentRevisions-V100
/.vol
/.dbfseventsd


In [34]:
# All files of a certain type under `path`, including all subdirectories
for filename in path.glob('**/*.md'):
    print(filename)

In [None]:
# All entries with a dot in their name under `path`, including all subdirectories (slow when `path` is the root)
for filename in path.glob('**/*.*'):
    print(filename)

Read [pathlib docs](https://docs.python.org/3/library/pathlib.html) for more ideas.
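
Globbing from `/` can take a long time, so here is a small sketch of the same patterns on a tiny temporary directory tree. The file and folder names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "notes").mkdir()
    (root / "readme.md").write_text("top level")
    (root / "notes" / "week1.md").write_text("nested")
    (root / "notes" / "week1.txt").write_text("nested text")

    # "*.md" only looks directly inside root.
    top_level_md = sorted(p.name for p in root.glob("*.md"))
    # "**/*.md" also searches every subdirectory.
    all_md = sorted(p.name for p in root.glob("**/*.md"))
    # rglob("*.md") is shorthand for glob("**/*.md").
    all_md_rglob = sorted(p.name for p in root.rglob("*.md"))

print(top_level_md)   # ['readme.md']
print(all_md)         # ['readme.md', 'week1.md']
```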

Summary
------

- When reading data from plain text files, use `with` to automatically open and close files.
- There are a handful of file-reading methods: `read`, `readline`, `readlines`.
- Use `pathlib` to define pathnames that will work across operating systems.

Bonus Material
----

Why is  everything not in memory all the time?
----

Why do we have a separation between memory and storage?

There is __no__ single computing substrate that is fast, nonvolatile, and cheap.

Local computers can be turned off. Cloud computers can crash. Persisting data is a good idea.

Path components
-----

- name: the file name without any directory

- parent: the directory containing the file, or the parent directory if the path is itself a directory

- stem: the file name without the suffix

- suffix: the file extension

- anchor: the part of the path before the directories (the drive and root, e.g. `/` on Unix)
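
Each of these components is an attribute on a path object. A quick sketch, using a made-up path and `PurePosixPath` so the output is the same on every operating system:

```python
from pathlib import PurePosixPath

p = PurePosixPath("/home/student/data/one_fish.txt")   # hypothetical path

print(p.name)     # one_fish.txt
print(p.parent)   # /home/student/data
print(p.stem)     # one_fish
print(p.suffix)   # .txt
print(p.anchor)   # /
```

`PurePosixPath` only manipulates the text of the path and never touches the disk, so the file does not need to exist.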

Writing files
-----


3 common modes:

1. `r` - read, the default

1. `w` - write; deletes whatever was already there

1. `a` - append; adds to the end of what was already there


In [24]:
# filename = 'temp.txt'

# with open(filename, 'w') as f:
#     f.write("hi!")

In [25]:
# with open(filename, 'a') as f:
#     f.write("there!")
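
Here is a runnable sketch of the difference between `w` and `a`. It writes to `temp.txt` inside a temporary directory (an assumption, so nothing is left behind in your working folder).

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    temp_file = Path(tmp_dir) / "temp.txt"

    with open(temp_file, "w") as f:   # creates the file
        f.write("hi!\n")

    with open(temp_file, "w") as f:   # "w" again: the old contents are wiped
        f.write("hello!\n")

    with open(temp_file, "a") as f:   # "a": added after the existing contents
        f.write("there!\n")

    with open(temp_file) as f:        # default mode is "r"
        lines = f.read().splitlines()

print(lines)   # ['hello!', 'there!']
```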

In [26]:
# One-liner to get a random line (quick and dirty: the file is never explicitly closed)
import random

print(random.choice(open("students.txt").readlines()))

Ziyang 



Further Study
------

- https://realpython.com/courses/reading-and-writing-files-python/
- https://stackabuse.com/read-a-file-line-by-line-in-python/
- https://stackabuse.com/file-handling-in-python/
- https://stackabuse.com/reading-files-with-python/
- https://stackabuse.com/writing-files-using-python/
- https://stackabuse.com/reading-and-writing-lists-to-a-file-in-python/
- https://stackabuse.com/how-to-create-move-and-delete-files-in-python/
