In [4]:
import os
import sys
import pandas as pd

# File Management 2

## Pandas File Access

We've already looked at our main tool for reading data from disk, the file read/write functionality in Pandas. When reading datasets this is normally all we need. Inside the read_csv function the Pandas people have either created all the file access stuff that they need to read a CSV or, more likely, they repurposed and extended some os library functions to do the work for them.

In [5]:
df = pd.read_csv("../data/chipotle.tsv", sep="\t")
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


## OS and File Read/Write

We can also use the `os` module to do some basic file management. `os` is a library that allows us to interact with the operating system on our local computer. Recall that our Python programs run inside an environment set up by the Python install on our computer. This means that as we work we are "inside" that separate environment and we can't directly interact with the underlying computer. The os library, and functions such as read_csv from other libraries, are tools that allow us to bridge this gap. The os library gives us an assortment of commands to do things like delete files or change the directory we're using. When the code runs, the library translates those commands into the correct actions for the actual computer and passes them on.

This also allows Python code to be portable, or able to run with few to no changes, on different types of computers - I am using a Mac, most of you are probably using a PC, and we can also use a Unix/Linux based system like Google Colab. The code we write can work in all of those environments because of this abstraction, and in each case the actions triggered by the os module will differ depending on the underlying operating system of the machine. In practice, many of the user-friendly libraries that we might use to access files or folders are built on top of the os module, so we often avoid needing to get into the weeds ourselves.

We can use the `os` module to do things like:
<ul>
    <li>Get the current working directory</li>
    <li>Change the current working directory</li>
    <li>Get a list of files in a directory</li>
    <li>Create a new directory</li>
    <li>Remove a directory</li>
    <li>Remove a file</li>
</ul>

First though, let's get some info about our system. Everyone will get totally different results here - I'm on a MacBook Air running macOS, and I assume most of you have some variety of a PC running Windows. If you've ever seen one of those websites that tells you, "You are running Windows 10 in Edmonton, AB..." this is a similar idea. The os.uname() function reaches out to the computer and retrieves some of its identifying information for us.

In [6]:
##for a in os.uname():
##    print(a)
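
`os.uname()` is only available on Unix-like systems (macOS and Linux included), so calling it on Windows raises an error. A minimal cross-platform sketch, assuming we only need the basic identifying fields, uses the standard library's `platform` module instead:

```python
import platform

# platform.uname() works on Windows, macOS, and Linux alike
info = platform.uname()

# The result is a named tuple; some fields may be empty strings on a given machine
print(f"System:  {info.system}")
print(f"Node:    {info.node}")
print(f"Release: {info.release}")
print(f"Machine: {info.machine}")
```

The values printed will differ on every computer, just like the results of `os.uname()` would.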

We can also get much more info by looking at the environment variables. These are variables that are set in the operating system and are available to any program running on the computer. These are admittedly not frequently relevant to us doing some data science work, but they are a good example of connecting to the system. One thing that can matter to us is the PATH variable, which (roughly) lists all the places the system will look for programs to run. This is how the system knows where to find Python, for example - it's in one of those places - as are many of the programs installed on your machine.

In [7]:
os.environ

environ{'ALLUSERSPROFILE': 'C:\\ProgramData',
        'AMDRMPATH': 'C:\\Program Files\\AMD\\RyzenMaster\\',
        'AMDRMSDKPATH': 'C:\\Program Files\\AMD\\RyzenMasterSDK\\',
        'APPDATA': 'C:\\Users\\Kier Vincent\\AppData\\Roaming',
        'ASW_SERVER_STATE': '1',
        'CHROME_CRASHPAD_PIPE_NAME': '\\\\.\\pipe\\crashpad_20412_RICIPBPFHTPZDCPY',
        'CHROME_RESTART': 'Google Chrome|Whoa! Google Chrome has crashed. Relaunch now?|LEFT_TO_RIGHT',
        'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files',
        'COMMONPROGRAMFILES(X86)': 'C:\\Program Files (x86)\\Common Files',
        'COMMONPROGRAMW6432': 'C:\\Program Files\\Common Files',
        'COMPUTERNAME': 'DESKTOP-5OCOO89',
        'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe',
        'CONDA_ALLOW_SOFTLINKS': 'false',
        'CONDA_DEFAULT_ENV': 'base',
        'CONDA_EXE': 'C:\\Anaconda3\\Scripts\\conda.exe',
        'CONDA_PREFIX': 'C:\\Anaconda3',
        'CONDA_PROMPT_MODIFIER': '(base) ',
        'CONDA

In [8]:
os.environ["PATH"].split(":") 

['c',
 '\\Anaconda3;C',
 '\\Anaconda3;C',
 '\\Anaconda3\\Library\\mingw-w64\\bin;C',
 '\\Anaconda3\\Library\\usr\\bin;C',
 '\\Anaconda3\\Library\\bin;C',
 '\\Anaconda3\\Scripts;C',
 '\\Anaconda3\\bin;C',
 '\\Anaconda3\\condabin;C',
 '\\Program Files\\Google\\Chrome\\Application;C',
 '\\Program Files (x86)\\Razer\\ChromaBroadcast\\bin;C',
 '\\Program Files\\Razer\\ChromaBroadcast\\bin;C',
 '\\Program Files (x86)\\Razer Chroma SDK\\bin;C',
 '\\Program Files\\Razer Chroma SDK\\bin;C',
 '\\WINDOWS\\system32;C',
 '\\WINDOWS;C',
 '\\WINDOWS\\System32\\Wbem;C',
 '\\WINDOWS\\System32\\WindowsPowerShell\\v1.0;C',
 '\\WINDOWS\\System32\\OpenSSH;C',
 '\\Program Files\\NVIDIA Corporation\\NVIDIA NvDLISR;C',
 '\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C',
 '\\WINDOWS\\system32\\config\\systemprofile\\AppData\\Local\\Microsoft\\WindowsApps;C',
 '\\Users\\Kier Vincent\\AppData\\Local\\Microsoft\\WindowsApps;C',
 '\\WINDOWS\\system32;C',
 '\\WINDOWS;C',
 '\\WINDOWS\\System32\\Wbem;C',
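
The output above looks chopped up because the entries in PATH are separated by `;` on Windows and by `:` on macOS and Linux. Splitting on `:` on a Windows machine breaks each entry right after its drive letter. A small sketch that should behave the same on any system uses `os.pathsep`, which holds whichever separator the current operating system uses:

```python
import os

# os.pathsep is ";" on Windows and ":" on macOS/Linux
path_entries = os.environ.get("PATH", "").split(os.pathsep)

print(f"Separator on this system: {os.pathsep!r}")
for entry in path_entries[:5]:
    print(entry)
```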
 

The shutil module is a partner to the os one, and provides some other functions. Among those, we can see disk usage. 

In [9]:
import shutil
shutil.disk_usage("/")

usage(total=999006662656, used=424339136512, free=574667526144)
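
The numbers that come back are byte counts, which are hard to read at this size. As a rough sketch, we can convert them into gigabytes; the choice of decimal gigabytes here is an assumption, and the total shown above works out to about 999 GB.

```python
import shutil

usage = shutil.disk_usage("/")   # named tuple of byte counts: total, used, free

GB = 1000 ** 3                   # decimal gigabytes; use 1024 ** 3 for GiB instead
print(f"Total: {usage.total / GB:.1f} GB")
print(f"Used:  {usage.used / GB:.1f} GB")
print(f"Free:  {usage.free / GB:.1f} GB ({usage.free / usage.total:.0%} of the disk)")
```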

## Folder Management

The os library also provides an assortment of folder management functions. We can use these to create, rename, and delete folders.

When using large datasets it is very common to have our data distributed over several folders. For example, if we are doing image recognition we might have a folder for "dogs", another with "cars", and another with "rutabagas". To construct our dataset we need to navigate over all of these folders and read in the files, using os and/or some similar libraries. 

Some common folder actions are:
<ul>
<li> os.mkdir() - create a new folder </li>
<li> os.rename() - rename a folder </li>
<li> os.rmdir() - remove a folder </li>
<li> os.getcwd() - get the current working directory </li>
<li> os.chdir() - change the current working directory </li>
<li> os.listdir() - list the contents of a directory </li>
</ul>
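
Here is a small sketch of the create, rename, and remove functions from the list above. It works inside a temporary folder so nothing in the repository is touched; the folder names are made up for illustration.

```python
import os
import tempfile

# Work inside a throwaway folder that is deleted when the with block ends
with tempfile.TemporaryDirectory() as base:
    draft = os.path.join(base, "draft_images")   # hypothetical folder name
    final = os.path.join(base, "final_images")   # hypothetical folder name

    os.mkdir(draft)                     # create a new folder
    after_mkdir = os.listdir(base)

    os.rename(draft, final)             # rename it
    after_rename = os.listdir(base)

    os.rmdir(final)                     # remove it (only works on empty folders)
    after_rmdir = os.listdir(base)

print(after_mkdir)
print(after_rename)
print(after_rmdir)
```

Note that `os.rmdir()` refuses to remove a folder that still has anything in it; `shutil.rmtree()` is the tool for removing a folder along with its contents.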

In [10]:
os.getcwd()

'd:\\NAIT - Data Analytics\\Fall 2024\\DATA3550 - Data Programming Fundamentals\\Main Repository\\Programming_Basics_for_Machine_Learning_Students\\workbooks'

<b>Note:</b> when using these functions, you need to be careful about where you are in the file system. In particular, the folder location doesn't reset itself automatically if you rerun everything. We need to restart the environment to reset the working directory, or navigate ourselves back to the correct location. The above command got the current working directory back from the Python environment, which in turn got it from the operating system. If we make a change to that directory, then rerun the above cell, we aren't "reset" to the original where we were when we started the program. To do that, we'd need to restart the environment, which would kill this current world in which our program is running, and generate a brand new one. 

Here I'm going to capture the current folder name and then move one level up, and back down. 

In [11]:
org_fold = os.getcwd()              # remember where we started
tmp = os.path.basename(org_fold)    # name of the current folder, on any operating system
print(tmp)
os.chdir("../")                     # move up one level
os.getcwd()                         # now the parent folder (not displayed mid-cell)
os.chdir(tmp)                       # move back down into the original folder

workbooks


#### Handling File System Data

We can also capture some of this information and store it in variables that our code can use to navigate the file system. For example, when grabbing the working directory above we stored it as a string and pulled the name of the current folder out of it. Below we pull the files in a folder, and that data is returned in a list. If we have code where we need to jump around from folder to folder, we can use this list to help us navigate. For example, if we have a folder structure like this:

```
data
    - dogs
        - dog1.jpg
        - dog2.jpg
        - dog3.jpg
    - cats
        - cat1.jpg
        - cat2.jpg
        - cat3.jpg
    - rutabagas
        - rutabaga1.jpg
        - rutabaga2.jpg
        - rutabaga3.jpg
```
Our "base" folder is the data folder, and the subfolders inside are where we'll likely need to do all of our work. We can keep this map of the folder structure in some data structure, then we can use that info to move around. For example, we can find out where we are, visit each subfolder to do some work, and then move back to the base folder.
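
As a rough sketch of that idea, the code below builds a throwaway copy of the folder structure above (with empty placeholder files, since we don't have the real images) and stores the map in a dictionary. Using a dictionary is just one assumption about how to hold the map; a list of tuples would work too. Note that it builds full paths with `os.path.join()` rather than changing directories, so there is nothing to "move back" from afterwards.

```python
import os
import tempfile

folder_map = {}
with tempfile.TemporaryDirectory() as base:
    # Build a stand-in for the data folder with empty placeholder files
    for label, stem in [("dogs", "dog"), ("cats", "cat"), ("rutabagas", "rutabaga")]:
        os.mkdir(os.path.join(base, label))
        for i in range(1, 4):
            open(os.path.join(base, label, f"{stem}{i}.jpg"), "w").close()

    # Visit each subfolder and record the files it holds
    for label in sorted(os.listdir(base)):
        sub_path = os.path.join(base, label)
        if os.path.isdir(sub_path):
            folder_map[label] = sorted(os.listdir(sub_path))

for label, names in folder_map.items():
    print(label, names)
```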

In [12]:
# Capture file list
# Print 5
# Bonus: at some point when we looked at strings, someone asked about putting a newline in an f string. I was wrong, it can work like this, no workarounds needed.
file_list = os.listdir()
print(f"Number of files: {len(file_list)}\n")
for name in file_list[:5]:
    print(name)


Number of files: 71

000_Setup.ipynb
001a_python.py
001a_python_example.py
001_programming_basics.ipynb
001_programming_basics_sol.ipynb


### Get Your Walk On

![Walk](../images/walk.gif "Walk")

Manually navigating folder structures is pretty useful, but it also sucks and is error prone. The os.walk() function makes this a bit easier for us by doing the "walking" from one folder to another seamlessly. 

The os.walk() function returns a generator, which is a special type of object that produces its results one at a time as we loop over it. Each item it produces is a tuple describing one folder: the first element is the path of the current folder, the second is a list of its subfolders, and the third is a list of its files. Looping over the generator visits the starting folder, then each of its subfolders, and so on.

For each folder it visits, the walk function gives us a 3 part tuple:
<ul>
    <li> The current folder </li>
    <li> A list of subfolders </li>
    <li> A list of files </li>
</ul>

We can also use a flag to set whether our walker goes top down or bottom up. The default is top down (topdown=True), which means each folder is reported before its subfolders: we start at the top level folder and work our way down. If we set the flag to False, each folder is reported only after all of its subfolders, so the deepest folders come first and the starting folder comes last. Either way, the walk only covers the starting folder and what is inside it; it never climbs above the starting point into the rest of the computer. In practice, we pretty much always want top down - we have a folder inside our repository, such as "data" or "logs", and we want to navigate into it. Bottom up is mostly useful when cleaning up, because a folder has to be emptied before os.rmdir() can remove it, so its contents need to be handled first.

In [13]:
walker = os.walk("../")
moonwalker = os.walk("../", topdown=False)

In [14]:
for root, directories, files in walker:
    print(f"Root: {root}")
    print(f"Directories: {directories}")
    print(f"Files: {files}")

Root: ../
Directories: ['.git', '.github', '.vscode', 'data', 'images', 'reference_material', 'supplementary_practice_workbooks', 'workbooks']
Files: ['.gitignore', 'README.md']
Root: ../.git
Directories: ['hooks', 'info', 'logs', 'objects', 'refs']
Files: ['COMMIT_EDITMSG', 'config', 'description', 'FETCH_HEAD', 'HEAD', 'index', 'ORIG_HEAD', 'packed-refs']
Root: ../.git\hooks
Directories: []
Files: ['applypatch-msg.sample', 'commit-msg.sample', 'fsmonitor-watchman.sample', 'post-update.sample', 'pre-applypatch.sample', 'pre-commit.sample', 'pre-merge-commit.sample', 'pre-push.sample', 'pre-rebase.sample', 'pre-receive.sample', 'prepare-commit-msg.sample', 'push-to-checkout.sample', 'sendemail-validate.sample', 'update.sample']
Root: ../.git\info
Directories: []
Files: ['exclude']
Root: ../.git\logs
Directories: ['refs']
Files: ['HEAD']
Root: ../.git\logs\refs
Directories: ['heads', 'remotes']
Files: []
Root: ../.git\logs\refs\heads
Directories: []
Files: ['main']
Root: ../.git\logs\re

In [15]:
for root, directories, files in moonwalker:
    print(f"Root: {root}")
    print(f"Directories: {directories}")
    print(f"Files: {files}")

Root: ../.git\hooks
Directories: []
Files: ['applypatch-msg.sample', 'commit-msg.sample', 'fsmonitor-watchman.sample', 'post-update.sample', 'pre-applypatch.sample', 'pre-commit.sample', 'pre-merge-commit.sample', 'pre-push.sample', 'pre-rebase.sample', 'pre-receive.sample', 'prepare-commit-msg.sample', 'push-to-checkout.sample', 'sendemail-validate.sample', 'update.sample']
Root: ../.git\info
Directories: []
Files: ['exclude']
Root: ../.git\logs\refs\heads
Directories: []
Files: ['main']
Root: ../.git\logs\refs\remotes\origin
Directories: []
Files: ['main']
Root: ../.git\logs\refs\remotes\upstream
Directories: []
Files: ['HEAD', 'main']
Root: ../.git\logs\refs\remotes
Directories: ['origin', 'upstream']
Files: []
Root: ../.git\logs\refs
Directories: ['heads', 'remotes']
Files: []
Root: ../.git\logs
Directories: ['refs']
Files: ['HEAD']
Root: ../.git\objects\01
Directories: []
Files: ['be988b50c7657c7702d90d9b1f8f8579c958e6']
Root: ../.git\objects\02
Directories: []
Files: ['74a81f8ee9
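
The two orders are hard to compare in output this long. As a smaller sketch, here is the same comparison on a throwaway tree with one folder nested inside another (the folder names `a` and `b` are made up). Both walks visit exactly the same folders and stay inside the starting folder; only the order changes.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Hypothetical two-level tree: base/a/b
    os.makedirs(os.path.join(base, "a", "b"))

    # relpath rewrites each root relative to base, to keep the output readable
    top_down = [os.path.relpath(root, base)
                for root, dirs, files in os.walk(base)]
    bottom_up = [os.path.relpath(root, base)
                 for root, dirs, files in os.walk(base, topdown=False)]

print(top_down)    # starting folder first, deepest folder last
print(bottom_up)   # deepest folder first, starting folder last
```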

## Exercise

Attempt to list the folders and files in the level above where this notebook is located. Create a list of folders and a list of files. Return them in a tuple.

In [34]:
def genList(path="../"):
    folders = []
    files = []
    for root, directories, files in os.walk(path):
        #folders.append(root)
        folders.append(directories)
        files.append(files)
    return folders, files

In [35]:
genList()[0]

[['.git',
  '.github',
  '.vscode',
  'data',
  'images',
  'reference_material',
  'supplementary_practice_workbooks',
  'workbooks'],
 ['hooks', 'info', 'logs', 'objects', 'refs'],
 [],
 [],
 ['refs'],
 ['heads', 'remotes'],
 [],
 ['origin', 'upstream'],
 [],
 [],
 ['01',
  '02',
  '06',
  '09',
  '0a',
  '0b',
  '0d',
  '0f',
  '13',
  '15',
  '16',
  '1c',
  '1d',
  '1e',
  '21',
  '24',
  '2b',
  '2d',
  '2f',
  '32',
  '34',
  '39',
  '40',
  '42',
  '44',
  '45',
  '49',
  '50',
  '51',
  '52',
  '56',
  '5c',
  '5d',
  '5e',
  '60',
  '61',
  '63',
  '64',
  '66',
  '6a',
  '6f',
  '72',
  '74',
  '77',
  '78',
  '7a',
  '7e',
  '81',
  '82',
  '83',
  '85',
  '86',
  '8c',
  '90',
  '92',
  '94',
  '97',
  '99',
  '9b',
  '9e',
  '9f',
  'a4',
  'a6',
  'a7',
  'a9',
  'aa',
  'ac',
  'b1',
  'b6',
  'b9',
  'ba',
  'bc',
  'bf',
  'c2',
  'c7',
  'cb',
  'ce',
  'cf',
  'd0',
  'd2',
  'd3',
  'd4',
  'd6',
  'd7',
  'd9',
  'e0',
  'e1',
  'e2',
  'e4',
  'e9',
  'ea',
  'ee

In [18]:
genList()[1]

['helper.cpython-312.pyc', [...]]
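
The `[...]` in that output is how Python displays a list that contains itself. It shows up because the loop variable `files` has the same name as the result list, so on each pass `files` points at the current folder's file list and `files.append(files)` appends that list to itself. A corrected sketch is below, assuming we want one flat list of folder names and one flat list of file names; the function name and the small demo tree are made up for illustration.

```python
import os
import tempfile

def gen_list_fixed(path="../"):
    """Return (folder_names, file_names) for everything found under path."""
    folder_names = []
    file_names = []
    # The loop variables use different names from the result lists
    for root, directories, files in os.walk(path):
        folder_names.extend(directories)   # extend keeps the result flat
        file_names.extend(files)
    return folder_names, file_names

# Try it on a small throwaway tree: base/x/y plus one file in x
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "x", "y"))
    open(os.path.join(base, "x", "notes.txt"), "w").close()
    demo_folders, demo_files = gen_list_fixed(base)

print(demo_folders)
print(demo_files)
```

If the nested, one-list-per-folder shape shown in the folder output is preferred, swapping `extend` back to `append` gives that, as long as the names stay distinct.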

#### Exists

We can also check if a file or folder exists. This is useful if we're going to be creating a new file or folder, and we want to make sure we don't overwrite something that already exists. 

In [19]:
print(os.path.exists("chipotle.tsv"))
print(os.path.exists("mad_bitcoin_keys.tsv"))

True
False
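
One way to put this check to work is a small guard that refuses to write over something that is already there. This is only a sketch; `safe_write` and the file name are hypothetical, not part of any library.

```python
import os
import tempfile

def safe_write(path, text):
    """Write text to path only if nothing exists there yet (hypothetical helper)."""
    if os.path.exists(path):
        print(f"{path} already exists, leaving it alone")
        return False
    with open(path, "w") as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "notes.txt")       # hypothetical file name
    first = safe_write(target, "original\n")       # nothing there yet, so it writes
    second = safe_write(target, "replacement\n")   # refused, the file now exists
    with open(target) as f:
        content = f.read()

print(first, second, repr(content))
```

`os.path.exists()` is true for both files and folders; `os.path.isfile()` and `os.path.isdir()` are available when the difference matters.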


### File Paths

When navigating file systems we necessarily need to deal with file paths, or the locations in the file structure where each file is located. There are two types of file paths, absolute and relative.

![Paths](../images/paths.png "Paths")

#### Absolute Paths

An absolute path is the full path to a file or folder. It starts at the root of the file system and lists each folder in the hierarchy, separated by a slash. For example, on my Mac, the root of the file system is "/". If I want to get to my home folder, I need to go to "/Users/". If I want to get to my Documents folder, I need to go to "/Users/akeem/Documents". If I want to get to my "downloads" folder, I need to go to "/Users/akeem/Downloads".

Absolute file paths will rarely be relevant for us; they matter more if you are writing lower level programs that deal more directly with a file system. Note that different operating systems lay out their file systems differently, so the absolute path on a Mac is different than the absolute path on a PC - you can't rely on any particular configuration.

<b>Note:</b> those environment variables that we mentioned above are some of the things that tell programs where to look for stuff, they indicate the root directories of things like the user's home folder, or the location of the program files.

#### Relative Paths

Relative paths are relative to the current working directory. If I am in the root directory of this repository, the relative path to each workbook is workbooks\workbook_name.ipynb. If I am in the workbooks folder, the relative path to each workbook is just the name of the workbook. If I am in the data folder, the relative path to each workbook is ../workbooks/workbook_name.ipynb.

When working within a repository we use relative paths to refer to other files. This allows us to pass around the repository, or move it to a different machine, and things still work. For example, all the data and image references in these workbooks use relative paths, so they work just fine on my Mac and your different computers without any issue. One important value when using relative paths is ../ which means "go up one level", so if we are in the "workbooks" folder here, we can use ../ to get to the root of the repository.
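
The `os.path` module can convert between the two kinds of path. A short sketch, using the relative path to the chipotle file as the example (the file doesn't need to exist for these functions to work, they only manipulate the path text):

```python
import os

relative = os.path.join("..", "data", "chipotle.tsv")

# abspath resolves a relative path against the current working directory
absolute = os.path.abspath(relative)

print(os.path.isabs(relative))   # False
print(os.path.isabs(absolute))   # True
print(absolute)

# relpath goes the other way: an absolute path rewritten relative to where we are now
print(os.path.relpath(absolute))
```

The absolute path printed will be different on every machine, which is exactly why the workbooks store the relative version.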

### Joining Paths

When we need to deal with file paths we can use the os.path.join() function to join together the parts of the path. This is useful because it will automatically handle the differences between operating systems for us. For example, on a Mac the path separator is "/", while on a PC it is "\". The os.path.join() function will automatically use the correct separator for the operating system we are using. 

For example, to get the absolute path to a file, we can use the os.path.join() function to join together the current working directory and the relative path to the file.

In [20]:
file_name = "016_file_management_2.ipynb"
fold_path = os.getcwd()
print(f"File path: {os.path.join(fold_path, file_name)}")

File path: d:\NAIT - Data Analytics\Fall 2024\DATA3550 - Data Programming Fundamentals\Main Repository\Programming_Basics_for_Machine_Learning_Students\workbooks\016_file_management_2.ipynb


## Reading and Writing Text Files

Now that we have an idea what is in our folders, we can start recklessly changing things. For example, we can use Python's built-in open() function to make a new CSV file and write some data to it. One thing that is revealed to us here, which might not be visible if you're used to Windows machines, is what a "text" file really is. Text files are .txt, but also .csv, .tsv, .py, etc... meaning all of these types of files are made up of just plain text, and we can edit them in any text editor on a computer - the file extension doesn't dictate it, that's for our convenience. The structure of our code will:
<ul>
<li> Open a connection to the file, if it doesn't exist this will create it. </li>
<li> Perform the contents of the loop - writing one line of data to the file on each pass. </li>
<li> When that writing task is complete, it will close the connection automatically thanks to the with. </li>
</ul>

In the open() function call to connect to the file we provide the second argument that defines what type of access we get to the file we are opening:
<ul>
<li> <b>'r' - read only</b> </li>
<li> <b>'w' - write only</b> </li>
<li> <b>'a' - append to the end of the file</b> </li>
<li> <b>'r+' - read and write</b> </li>
<li> 'w+' - write and read, overwrites existing files</li>
</ul>

These are mostly pretty simple and self-explanatory, with the exception of the distinction between `r+` and `w`. The `r+` option opens the file for reading and writing, but it will not create the file if it doesn't exist; if you try to open a file that doesn't exist with `r+` you will get an error. The `w` option will create the file if it doesn't exist, but it will also overwrite the file if it does exist. So if we are making something brand new, we want `w`, but if we are attempting to update an existing file, we want `r+`. This is an easy place to make an error, so we should be careful. We can also check to see if the file we want to make already exists, then make a decision. `w+` is another weird option: it is read/write, but like `w` it will overwrite the file if it exists.

![File Permissions](../images/file_permissions.png "File Permissions" )

Choosing the level of access of a file that we open is important in terms of writing our code to prevent errors. We want to open the file with the least impactful level of access that we need to have to do what we want. So if we are just reading data from a file, opening it as read only will prevent us accidentally changing that file in any way, as we don't even have the ability to do so. If we want a brand-new file, opening it as write only will prevent any old data that may have been hanging around from persisting. The more flexible the level of access, the more options we have for what we can do, the more likely it is that we may do something unintentional.
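
To see the differences between the modes in action, here is a small sketch that runs each one against a throwaway file (the file name is made up) and records what the file contains afterwards.

```python
import os
import tempfile

results = {}
with tempfile.TemporaryDirectory() as base:
    path = os.path.join(base, "demo.txt")   # hypothetical file name

    # 'r+' refuses to create a missing file
    try:
        open(path, "r+").close()
    except FileNotFoundError:
        results["r+ missing"] = "FileNotFoundError"

    # 'w' creates the file, and wipes it if it already exists
    with open(path, "w") as f:
        f.write("first\n")
    with open(path, "w") as f:
        f.write("second\n")
    with open(path, "r") as f:
        results["w twice"] = f.read()

    # 'a' keeps what is there and adds to the end
    with open(path, "a") as f:
        f.write("third\n")
    with open(path, "r") as f:
        results["a"] = f.read()

    # 'r+' writes over existing content from the start, without wiping the rest
    with open(path, "r+") as f:
        f.write("SECOND")
    with open(path, "r") as f:
        results["r+"] = f.read()

for mode, outcome in results.items():
    print(f"{mode}: {outcome!r}")
```

There is also an `'x'` mode, which creates a new file and raises `FileExistsError` if the file is already there, for cases where overwriting should never happen silently.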

<b>Note:</b> the file is written to our current working directory, wherever that happens to be. If you removed the line above that moves us back to the original folder, this file would be written to whatever folder you ended up in.

#### Create a CSV with Fibonacci Numbers

Here we'll make a CSV, fib.csv, that has two columns - an index and the Fibonacci number at that index.

In [21]:
# write fibonacci series to a file
end_fib = 30

def fib(n):
    if n < 2:
        return n
    else:
        return fib(n-2) + fib(n-1)

with open("fib.csv", "w") as f:
    f.write("Index,Fibonacci Series\n")
    i = 0
    while i < end_fib:
        f.write(f"{i},{fib(i)}\n")
        #print(f"{i},{fib(i)}")
        i += 1
print("Done")

Done


#### Reading Our File

Now that our file is written, we can read it and see what we got. We need to specify the `r` here, since we are only reading. When reading from a file, there's a few main options:
<ul>
<li> read() - read the entire file into a single string </li>
<li> readline() - read the file one line at a time </li>
<li> readlines() - read the file into a list of strings, one per line </li>
</ul>

The first, read, takes in the entire file into one string. This is fine for small files, but for things that are large it is unruly. For most things of size we probably want to navigate the file one line at a time, using the readline() option. There's also a common shortcut that we can do with a for loop that does this for us easily: `for line in file:`

If we are reading in a large file, one line at a time is a better choice. Depending on what we are doing, we may be able to process our data and "deal with it" - whether that be saving it to another file or loading it into some dataset - on the fly. When we get to neural networks towards the end of machine learning, we'll try to read enough data so that our processor is busy - so the computer is never waiting for either data or a free processor. Loading batches allows us to make the most of the power of our computer, as we can minimize the amount of time any part of it spends waiting for something else to finish.

In [22]:
with open("fib.csv", "r") as f:
   for line in f:
      print(line)

Index,Fibonacci Series

0,0

1,1

2,1

3,2

4,3

5,5

6,8

7,13

8,21

9,34

10,55

11,89

12,144

13,233

14,377

15,610

16,987

17,1597

18,2584

19,4181

20,6765

21,10946

22,17711

23,28657

24,46368

25,75025

26,121393

27,196418

28,317811

29,514229



In [23]:
with open("fib.csv", "r") as f:
    fib_list = f.readlines()
print(fib_list)

['Index,Fibonacci Series\n', '0,0\n', '1,1\n', '2,1\n', '3,2\n', '4,3\n', '5,5\n', '6,8\n', '7,13\n', '8,21\n', '9,34\n', '10,55\n', '11,89\n', '12,144\n', '13,233\n', '14,377\n', '15,610\n', '16,987\n', '17,1597\n', '18,2584\n', '19,4181\n', '20,6765\n', '21,10946\n', '22,17711\n', '23,28657\n', '24,46368\n', '25,75025\n', '26,121393\n', '27,196418\n', '28,317811\n', '29,514229\n']
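
Every line we read keeps its trailing newline, which is why `print(line)` leaves a blank line between rows and why each string in the list ends in `\n`. To actually use the values we usually strip that off and split on the comma. A minimal sketch, using a small stand-in file rather than the real fib.csv so that it runs on its own:

```python
import os
import tempfile

rows = []
total = 0
with tempfile.TemporaryDirectory() as base:
    path = os.path.join(base, "fib_sample.csv")   # hypothetical stand-in for fib.csv
    with open(path, "w") as f:
        f.write("Index,Fibonacci Series\n0,0\n1,1\n2,1\n3,2\n4,3\n")

    with open(path, "r") as f:
        next(f)                                       # skip the header line
        for line in f:
            index, value = line.strip().split(",")    # strip() drops the newline
            rows.append((int(index), int(value)))
            total += int(value)                       # process each row as it arrives

print(rows)
print(total)
```

Because the total is updated inside the loop, only one line needs to be in memory at a time, which is the "deal with it on the fly" idea described above.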


We can also read it into a Pandas DataFrame, for fun!

In [24]:
df2 = pd.read_csv("fib.csv", index_col=None)
df2.head()

Unnamed: 0,Index,Fibonacci Series
0,0,0
1,1,1
2,2,1
3,3,2
4,4,3


#### Seek

When in a file, we can use the seek() function to move around. This is useful if we want to go back to the beginning of the file, or jump to a specific location. The seek() function takes in two arguments, the first is the location we want to go to, and the second is the reference point. The reference point can be 0, 1, or 2.
<ul>
<li>0 is the beginning of the file</li>
<li>1 is the current location</li>
<li>2 is the end of the file</li>
</ul>

So if we want to go to the beginning of the file, we can use seek(0,0). If we want to go to the end of the file, we can use seek(0,2). If we want to go 100 bytes into the file, we can use seek(100,0). If we want to go to 100 bytes before the end of the file, we can use seek(-100,2).

Note that in the examples below I've changed a couple of things. First, the open uses "rb" or "read binary" instead of "r" or "read". This allows us to use offsets from the end or from the current position - in text mode Python only allows non-zero offsets from the start of the file. Second, I've added a decode call to the print statements. Because the file is opened as binary, each line comes back as bytes, so we decode it to get a string that prints nicely. Note the last one, where I skipped the decode, so the raw bytes are printed. Text encoding is something we need to deal with, but it is minor - we basically need to make sure that we are using the encoding that matches the file, or we need to re-encode it.

#### Tell

We can also use the tell() function to find out where we are in the file. This is useful if we want to jump around, but we don't know where we are. We can use tell() to find out where we are, then use seek() to go somewhere else.

<b>Note:</b> the readline calls advance the pointer, so each readline is moving us ahead in the file from where the seek() command dropped us off. 

In [25]:
# Open fib and seek/tell
with open("fib.csv", "rb") as f:
    print("At the start of the file:")
    print("Position", f.tell())
    print(f.readline().decode("utf-8"))

    print("At position 150:")
    f.seek(150, 0)
    print("Position", f.tell())
    print(f.readline().decode("utf-8"))

    # An offset of 0 from the current position leaves the pointer where readline left it
    print("At offset 0, relative to the current position:")
    f.seek(0, os.SEEK_CUR)
    print("Position", f.tell())
    print(f.readline().decode("utf-8"))

    print("At position 5:")
    f.seek(5, 0)
    print("Position", f.tell())
    print(f.readline().decode("utf-8"))

    print("At the end:")
    f.seek(0, os.SEEK_END)
    print("Position", f.tell())
    print(f.readline().decode("utf-8"))

    print("Moving back from the end:")
    f.seek(-5, 2)
    print("Position", f.tell())
    print(f.readline())

At the start of the file:
Position 0
Index,Fibonacci Series

At position 150:
Position 150
9,4181

At offset 0, relative to the current position:
Position 158
20,6765

At position 5:
Position 5
,Fibonacci Series

At the end:
Position 261

Moving back from the end:
Position 256
b'229\r\n'
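
The `\r\n` at the end of that last line is the Windows line ending; a file written on macOS or Linux would show only `\n`, and all of the positions would shift accordingly. To make the arithmetic easier to follow, here is a sketch on a tiny file written in binary mode, so the bytes on disk are exactly what we typed regardless of operating system. The file name and contents are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    path = os.path.join(base, "letters.txt")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"abc\ndef\nghi\n")             # 12 bytes, three 4-byte lines

    with open(path, "rb") as f:
        f.seek(4, os.SEEK_SET)                  # 4 bytes from the start
        from_start = f.readline()               # the second line
        after_read = f.tell()                   # readline moved the pointer to 8

        f.seek(-4, os.SEEK_END)                 # 4 bytes back from the end
        from_end = f.readline()                 # the last line, pointer now at 12

        f.seek(-8, os.SEEK_CUR)                 # 8 bytes back from 12 lands on 4
        relative = f.readline().decode("utf-8")

print(from_start, after_read, from_end, repr(relative))
```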


#### Appending a File and the In-File Pointer

Another of the options above that is a little odd is the `a`, for append. This will open the file and add new data to the end of it, but it will not overwrite the existing data. This is useful if we want to add new data to an existing file, but we don't want to lose the old data. This is very useful for things like logs - we likely have a pretty substantial amount of data accumulated in a big text file and we want to add new stuff without losing the old data or having to deal with the old data at all. We can use this to basically tack some entries onto the end of a file easily.

This issue of opening an existing file normally vs appending seems pretty minor, but it can have larger performance implications than we might expect. For example, server logs can be many, many GB of text that lists errors or warnings going back years. We want to keep the log, and we also want to add today's entries. Opening a 2GB file "normally", navigating to the end, then spitting it back out can be slow; appending directly to the end of that same file is fast. This is because, just like navigating a file system, a text file itself has a pointer that maintains your position - think of it as an invisible cursor just like we have in any program where we type. Append puts that position cursor directly at the end, and just starts writing. Personally, I once had a job where I remade a little program that went to approximately 150 servers, grabbed their log file, and looked for last night's entry at the end of the file to see if a backup failed. By changing it from opening the files normally to appending (roughly, the language wasn't Python), I cut the runtime from 4 or 5 hours to under 10 minutes - without making any actual improvements to the logic of the code, just by jumping directly to the end of the text. When someone wrote the original, all the log files were probably tiny, as the system was new, so it didn't matter for performance; as things grew, this became an issue.

<b>Note:</b> the "it's slow to open a large file and write to the end" thing is obviously a common issue for computers in general. File access packages know this, and are built to be fast no matter what. This idea is still true, just less true than it is with older software.

In [26]:
with open("fib.csv", "a") as f:
    print(f.tell())

261
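
The position printed above is the size of the file, which confirms that append mode starts us at the very end. A small sketch of the log idea, using a made-up log file and made-up entries:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    log_path = os.path.join(base, "backup.log")   # hypothetical log file

    # Day 1: create the log
    with open(log_path, "w") as f:
        f.write("day 1: backup ok\n")

    # Day 2: add an entry without touching what is already there
    with open(log_path, "a") as f:
        start = f.tell()                           # already at the end of the file
        f.write("day 2: backup FAILED\n")

    with open(log_path, "r") as f:
        log_lines = f.readlines()

print(start)
print(log_lines)
```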


## Exercise

Complete the spotReader function to take in a file and a spot in that file, and return that line. 

In [36]:
def spotReader(filepath, spot):
    ret_val = ""
    with open(filepath, "r") as f:
        f.seek(spot, 0)
        ret_val = f.readline()
    return ret_val

In [39]:
spotReader("fib.csv", 150)

'9,4181\n'

### Remote Data

We can also use some code to programmatically download data from the internet. This can save us from having to download large files by hand, and it can also help us to build automated pipelines for getting data.

One thing that I've worked on a lot in industry is importing data from other systems into LMS systems like Moodle. A common process to do this is for the other system to export a CSV file to a specific location on a file server, then a script that we created will grab the new file from the pre-defined location and feed it into our import work. Applications for personal use are also broad - we could automate downloads of files that are regularly updated. 

For pretty much any data source that we might want to be able to access, there is likely a library that will do so. So we can access FTP servers, different file share protocols, and so on - we just need to look up the correct tool for whichever data source we want to access.

<b>Note:</b> there are many tools that download files, and for simple downloads they're pretty much interchangeable. I'm using `urllib.request` here because it's built into Python and is the "basic standard", but you could also use the `requests` library, command line tools like `wget` or `curl`, or any number of others. For this, and the others, look at the documentation to see the options and how to use the functions - they are generally similar to this, just provide a URL and a destination.

In [29]:
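# Download
import urllib.error
import urllib.request

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
try:
    # Save the file at url into the current working directory
    urllib.request.urlretrieve(url, 'chipotle.tsv')
except (urllib.error.URLError, OSError) as err:
    # URLError covers network and HTTP problems, OSError covers problems writing the file
    print(f"Error downloading file: {err}")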


### Compression

Many files that we deal with, particularly when downloading datasets, may be compressed. Several libraries provide tools for us to programmatically deal with these files, including decompressing them and moving their files into our working directory.

<b>Note:</b> Pandas' read_csv function can also read compressed files directly, so you can load my_data.zip or similar with no interim decompression step, as long as the archive contains a single data file.

In [30]:
# Compress some files
import zipfile

# Zip first 5 files in file_list
with zipfile.ZipFile('file_list.zip', 'w') as myzip:
    for file in range(5):
        myzip.write(file_list[file])

Decompress into a folder

In [31]:
# Decompress file_list.zip into file_list folder
with zipfile.ZipFile('file_list.zip', 'r') as myzip:
    myzip.extractall('file_list')
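
Two details are worth knowing when working with `zipfile`. First, by default it only bundles the files together without shrinking them; passing `compression=zipfile.ZIP_DEFLATED` turns on actual compression. Second, `namelist()` lets us look inside an archive before extracting anything. The sketch below shows both on a couple of made-up files in a throwaway folder.

```python
import os
import tempfile
import zipfile

names = ["a.txt", "b.txt"]   # hypothetical file names
with tempfile.TemporaryDirectory() as base:
    for name in names:
        with open(os.path.join(base, name), "w") as f:
            f.write(f"contents of {name}\n")

    zip_path = os.path.join(base, "bundle.zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as myzip:
        for name in names:
            # arcname sets the name stored inside the archive
            myzip.write(os.path.join(base, name), arcname=name)

    with zipfile.ZipFile(zip_path, "r") as myzip:
        stored = myzip.namelist()                          # peek without extracting
        myzip.extractall(os.path.join(base, "unpacked"))

    unpacked = sorted(os.listdir(os.path.join(base, "unpacked")))

print(stored)
print(unpacked)
```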

## Exercise

Create the function ipynbCrawler. When given a directory, assemble and return a list of all .ipynb files in that directory and all subdirectories.

In [32]:
def ipynbCrawler(directory_targ="../"):
    """
    Crawl a directory and return a list of all ipynb files
    """
    ipynb_list = []
    for root, directories, files in os.walk(directory_targ):
        for file in files:
            if file.endswith(".ipynb"):
                ipynb_list.append(os.path.join(root, file))
    return ipynb_list

In [33]:
ipynbCrawler()

['../supplementary_practice_workbooks\\data_dictionary.ipynb',
 '../supplementary_practice_workbooks\\Lecture02_basics.ipynb',
 '../supplementary_practice_workbooks\\Lecture02_workbook.ipynb',
 '../supplementary_practice_workbooks\\Lecture03_datatypes.ipynb',
 '../supplementary_practice_workbooks\\Lecture03_workbook.ipynb',
 '../supplementary_practice_workbooks\\Lecture04_arrays_files.ipynb',
 '../supplementary_practice_workbooks\\Lecture04_workbook.ipynb',
 '../supplementary_practice_workbooks\\Lecture05_conditions_loops.ipynb',
 '../supplementary_practice_workbooks\\Lecture05_workbook.ipynb',
 '../supplementary_practice_workbooks\\Lecture06_numpy.ipynb',
 '../supplementary_practice_workbooks\\Lecture06_workbook.ipynb',
 '../supplementary_practice_workbooks\\Lecture07-08_Pandas.ipynb',
 '../supplementary_practice_workbooks\\Lecture07-08_workbook.ipynb',
 '../supplementary_practice_workbooks\\Lecture09_data_wrangling_pandas_I.ipynb',
 '../supplementary_practice_workbooks\\Lecture10_dat
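
For comparison, the standard library's `pathlib` module can do the same crawl in one line with `rglob()`, which searches a folder and all of its subfolders for names matching a pattern. This is an alternative sketch rather than a replacement for the `os.walk()` solution; the function name and the small test tree are made up.

```python
import tempfile
from pathlib import Path

def ipynb_crawler_pathlib(directory_targ="../"):
    """Return every .ipynb file under directory_targ as a sorted list of path strings."""
    return sorted(str(p) for p in Path(directory_targ).rglob("*.ipynb"))

# Try it on a throwaway tree with two notebooks and one other file
with tempfile.TemporaryDirectory() as base:
    (Path(base) / "sub").mkdir()
    (Path(base) / "top.ipynb").touch()
    (Path(base) / "sub" / "nested.ipynb").touch()
    (Path(base) / "sub" / "notes.txt").touch()
    found = sorted(Path(p).name for p in ipynb_crawler_pathlib(base))

print(found)
```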