Go through the basics of creating a Python script, and then create a Python file so the script can be run from the terminal. In this practice notebook, you'll create the building blocks for a script that finds large files on the filesystem.

## Get the logic right 
Start by defining some of the requirements of the script. In this case, we need to:
- _Walk_ the filesystem looking at files, directories and sub-directories
- Capture file information: is it a file? a directory? what size? what path?
- Store that information in a suitable data structure
- Report the largest files by sorting the data held in that structure

In [2]:
# The os module is well suited to filesystem operations like "walking" through directories and files
# Although there are many ways of achieving the same effect, a good way to loop over the filesystem is using `os.walk()`
import os
for root, directories, files in os.walk('.'):
    for _file in files:
        print(f"File found: {_file}")


File found: exercise.ipynb
File found: large-files.ipynb
File found: README.md
File found: COMMIT_EDITMSG
File found: config
File found: description
File found: HEAD
File found: index
File found: packed-refs
File found: applypatch-msg.sample
File found: commit-msg.sample
File found: fsmonitor-watchman.sample
File found: post-update.sample
File found: pre-applypatch.sample
File found: pre-commit.sample
File found: pre-merge-commit.sample
File found: pre-push.sample
File found: pre-rebase.sample
File found: pre-receive.sample
File found: prepare-commit-msg.sample
File found: update.sample
File found: exclude
File found: HEAD
File found: main
File found: HEAD
File found: main
File found: 559a080b5b52c9570bb07d937a56ffc8636d09
File found: 6e65b0cd95355100f1825baf5281bf43e645fb
File found: ca46bbe3e099ece07cb45a12150034d3f814ef
File found: ff4b7885550548caed283aee4499f40576fa8a
File found: a5480406e468eefaffd72a48735c0f8d1e1ef0
File found: d7ac5ef872ab345f18f50174f042f5a6553e60
File found: 

In [3]:
# Update the loop so that it shows the full path of each file, relative to the starting directory
# Directories themselves are not printed because we aren't going to track them
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        print(f"File found: {full_path}")


File found: .\exercise.ipynb
File found: .\large-files.ipynb
File found: .\README.md
File found: .\.git\COMMIT_EDITMSG
File found: .\.git\config
File found: .\.git\description
File found: .\.git\HEAD
File found: .\.git\index
File found: .\.git\packed-refs
File found: .\.git\hooks\applypatch-msg.sample
File found: .\.git\hooks\commit-msg.sample
File found: .\.git\hooks\fsmonitor-watchman.sample
File found: .\.git\hooks\post-update.sample
File found: .\.git\hooks\pre-applypatch.sample
File found: .\.git\hooks\pre-commit.sample
File found: .\.git\hooks\pre-merge-commit.sample
File found: .\.git\hooks\pre-push.sample
File found: .\.git\hooks\pre-rebase.sample
File found: .\.git\hooks\pre-receive.sample
File found: .\.git\hooks\prepare-commit-msg.sample
File found: .\.git\hooks\update.sample
File found: .\.git\info\exclude
File found: .\.git\logs\HEAD
File found: .\.git\logs\refs\heads\main
File found: .\.git\logs\refs\remotes\origin\HEAD
File found: .\.git\logs\refs\remotes\origin\main
Fil

So now we have a few objectives completed:
- Files are detected
- Full paths are being collected
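
The paths collected so far are relative to the starting directory, and the walk also descends into directories such as `.git`. The following is a minimal sketch of one way to get absolute paths and skip unwanted directories. It is not part of the original notebook: the function name `collect_paths` and its `skip` parameter are hypothetical. It relies on `os.path.abspath()` to make a path absolute, and on the fact that removing names from the `directories` list in place stops `os.walk()` from descending into them when walking top-down, which is the default.

```python
import os


def collect_paths(start=".", skip=(".git", ".ipynb_checkpoints")):
    """Return absolute paths of files under `start`, skipping some directories.

    `collect_paths` and `skip` are hypothetical names used for illustration.
    """
    paths = []
    for root, directories, files in os.walk(start):
        # Modify the list in place so os.walk() does not descend into
        # the skipped directories (this works because topdown=True by default)
        directories[:] = [d for d in directories if d not in skip]
        for _file in files:
            paths.append(os.path.abspath(os.path.join(root, _file)))
    return paths


for path in collect_paths("."):
    print(f"File found: {path}")
```

Whether to skip `.git` at all is a design choice: large Git objects can be exactly what a "large files" report is meant to surface, so the default used for `skip` here is only an assumption.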

Next, we need to find size information. `os.path.getsize()` reports sizes in bytes, so in addition to capturing the size, we'll need a way to convert bytes to megabytes or gigabytes to make the numbers easier to read.
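
One possible way to do that conversion is sketched here. The helper name `human_readable` is hypothetical, and the sketch assumes binary units (1 KB = 1024 bytes); some tools use 1000 instead, so adjust the divisor if that convention is preferred.

```python
def human_readable(size_in_bytes):
    """Convert a byte count to a short string using 1024-based units.

    `human_readable` is a hypothetical helper, not part of the notebook.
    """
    size = float(size_in_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        # Move up to the next unit
        size /= 1024
    return f"{size:.1f} TB"


# 146403 is the size in bytes reported for exercise.ipynb in this notebook
print(human_readable(146403))
```

The helper could then replace the raw `{size}b` value in the `print()` call of the loop.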

In [7]:
# Update the loop to include the file size
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        print(f"Size: {size}b - File: {full_path}")

Size: 146403b - File: .\exercise.ipynb
Size: 26174b - File: .\large-files.ipynb
Size: 228b - File: .\README.md
Size: 14b - File: .\.git\COMMIT_EDITMSG
Size: 303b - File: .\.git\config
Size: 73b - File: .\.git\description
Size: 21b - File: .\.git\HEAD
Size: 297b - File: .\.git\index
Size: 112b - File: .\.git\packed-refs
Size: 478b - File: .\.git\hooks\applypatch-msg.sample
Size: 896b - File: .\.git\hooks\commit-msg.sample
Size: 4655b - File: .\.git\hooks\fsmonitor-watchman.sample
Size: 189b - File: .\.git\hooks\post-update.sample
Size: 424b - File: .\.git\hooks\pre-applypatch.sample
Size: 1643b - File: .\.git\hooks\pre-commit.sample
Size: 416b - File: .\.git\hooks\pre-merge-commit.sample
Size: 1348b - File: .\.git\hooks\pre-push.sample
Size: 4898b - File: .\.git\hooks\pre-rebase.sample
Size: 544b - File: .\.git\hooks\pre-receive.sample
Size: 1492b - File: .\.git\hooks\prepare-commit-msg.sample
Size: 3635b - File: .\.git\hooks\update.sample
Size: 240b - File: .\.git\info\exclude
Size: 70

In [8]:
# Persist the data into a dictionary. Since file paths are unique you can use those as dictionary keys
file_metadata = {}
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        file_metadata[full_path] = size
print(file_metadata)

{'.\\exercise.ipynb': 146403, '.\\large-files.ipynb': 26174, '.\\README.md': 228, '.\\.git\\COMMIT_EDITMSG': 14, '.\\.git\\config': 303, '.\\.git\\description': 73, '.\\.git\\HEAD': 21, '.\\.git\\index': 297, '.\\.git\\packed-refs': 112, '.\\.git\\hooks\\applypatch-msg.sample': 478, '.\\.git\\hooks\\commit-msg.sample': 896, '.\\.git\\hooks\\fsmonitor-watchman.sample': 4655, '.\\.git\\hooks\\post-update.sample': 189, '.\\.git\\hooks\\pre-applypatch.sample': 424, '.\\.git\\hooks\\pre-commit.sample': 1643, '.\\.git\\hooks\\pre-merge-commit.sample': 416, '.\\.git\\hooks\\pre-push.sample': 1348, '.\\.git\\hooks\\pre-rebase.sample': 4898, '.\\.git\\hooks\\pre-receive.sample': 544, '.\\.git\\hooks\\prepare-commit-msg.sample': 1492, '.\\.git\\hooks\\update.sample': 3635, '.\\.git\\info\\exclude': 240, '.\\.git\\logs\\HEAD': 700, '.\\.git\\logs\\refs\\heads\\main': 700, '.\\.git\\logs\\refs\\remotes\\origin\\HEAD': 205, '.\\.git\\logs\\refs\\remotes\\origin\\main': 465, '.\\.git\\objects\\14\\5

**Exercise:** Now that the metadata is captured and stored in a suitable data structure like a dictionary, report back the results with only the four largest files. Try using other quantities to report on, like the 10 largest files instead of 4.

In [19]:
items_shown = 0

for path, size in sorted(file_metadata.items(), key=lambda x:x[1], reverse=True):
    # Stop once four files have been printed
    if items_shown >= 4:
        break
    print(f"Size: {size} Path: {path}")
    items_shown += 1


Size: 146403 Path: .\exercise.ipynb
Size: 29009 Path: .\.git\objects\ab\126ce4f73ddbafcc17f4b25bd4fa296bfa9298
Size: 26174 Path: .\large-files.ipynb
Size: 5343 Path: .\.ipynb_checkpoints\large-files-checkpoint.ipynb


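There is a lot happening in the previous cell. `sorted()` is a built-in function that accepts any iterable and returns a new sorted list. Here it receives `file_metadata.items()`, which yields `(path, size)` pairs, and we need to sort those pairs by the _value_ (the size). This is done with the `key` parameter, which accepts any function of one argument, in this case a `lambda`. Passing `reverse=True` puts the largest sizes first.
A `lambda` lets you write a small anonymous function as a single expression, without a `def` statement or a name. That `lambda` expression is the same as defining a function like: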

```python
def by_value(x):
    return x[1]
```

`x` is a tuple holding two items, the path and the size. The function returns only the size because that is what we want to sort by. Try changing the `lambda` expression to use `x[0]` instead and see what happens.
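
To see the effect of the `key` function in isolation, here is a small sketch that uses made-up data instead of the real filesystem. The `sample` dictionary and its file names are hypothetical; it only mimics the shape of `file_metadata` (path mapped to size in bytes).

```python
# Hypothetical sample data, shaped like `file_metadata` (path -> size in bytes)
sample = {"b.txt": 10, "a.txt": 300, "c.txt": 25}

# Each item is a (path, size) tuple, so x[1] is the size and x[0] is the path
by_size = sorted(sample.items(), key=lambda x: x[1], reverse=True)
by_path = sorted(sample.items(), key=lambda x: x[0])

print(by_size)
print(by_path)
```

Sorting with `x[1]` orders the pairs by size, while `x[0]` orders them alphabetically by path, which is rarely useful for finding large files.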

**Exercise:** Try using a regular function instead of a `lambda` and achieve the same result.
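
One possible approach to this exercise is sketched below; it is not the only solution. The names `by_size` and `largest_files` and the `sample` data are hypothetical. The sketch also replaces the manual counter with list slicing, and it shows that `operator.itemgetter(1)` from the standard library behaves like the named function.

```python
from operator import itemgetter


def by_size(item):
    """Return the size from a (path, size) tuple."""
    return item[1]


def largest_files(metadata, count=4):
    """Return the `count` largest (path, size) pairs, biggest first."""
    ranked = sorted(metadata.items(), key=by_size, reverse=True)
    # Slicing keeps at most `count` items, so no manual counter is needed
    return ranked[:count]


# Hypothetical sample data, shaped like `file_metadata`
sample = {"a.txt": 300, "b.txt": 10, "c.txt": 25, "d.txt": 7, "e.txt": 90}

for path, size in largest_files(sample, count=4):
    print(f"Size: {size} Path: {path}")

# itemgetter(1) picks the second element of each tuple, just like by_size
same = sorted(sample.items(), key=itemgetter(1), reverse=True)[:4]
print(same == largest_files(sample))
```

Passing `count=10` reports the ten largest files instead of four, assuming the dictionary holds at least that many entries.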