Go through the basics of creating a Python script, and then create a Python file so the script can be run from the terminal. In this practice notebook, you'll create the building blocks for a script that finds large files on the filesystem.

## Get the logic right 
Start by defining some of the requirements of the script. In this case, we need to:
- _Walk_ the filesystem looking at files, directories and sub-directories
- Capture file information: is it a file? a directory? what size? what path?
- Store that information in a suitable data structure
- Report the largest files by sorting the data in that structure

In [5]:
# The os module is perfect for filesystem operations like "walking" through directories and files
# Although there are many ways of achieving the same effect, a good way to loop over the filesystem is using `os.walk()`
import os
for root, directories, files in os.walk('.'):
    for _file in files:
        print(f"File found: {_file}")


File found: README.md
File found: .gitignore
File found: config
File found: HEAD
File found: index
File found: packed-refs
File found: pack-734108524c45bcbe06adc07ac04de3821b9505bd.pack
File found: pack-734108524c45bcbe06adc07ac04de3821b9505bd.idx
File found: HEAD
File found: main
File found: HEAD
File found: main
File found: HEAD


In [6]:
# Update the loop so that it shows the full path of each file, relative to the starting directory
# Directories are ignored because we aren't going to track them
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        print(f"File found: {full_path}")


File found: ./README.md
File found: ./.gitignore
File found: ./.git/config
File found: ./.git/HEAD
File found: ./.git/index
File found: ./.git/packed-refs
File found: ./.git/objects/pack/pack-734108524c45bcbe06adc07ac04de3821b9505bd.pack
File found: ./.git/objects/pack/pack-734108524c45bcbe06adc07ac04de3821b9505bd.idx
File found: ./.git/logs/HEAD
File found: ./.git/logs/refs/heads/main
File found: ./.git/logs/refs/remotes/origin/HEAD
File found: ./.git/refs/heads/main
File found: ./.git/refs/remotes/origin/HEAD


So now we have a few objectives completed:
- Files are detected
- Full paths are being collected

Next, we need to find size information. `os.path.getsize()` reports sizes in bytes, so in addition to capturing the size, we'll need a way to convert bytes to megabytes or gigabytes to make the output easier to read.
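
One possible way to do that conversion is sketched below. The helper name `human_readable` and the choice of 1024 bytes per kilobyte are assumptions made for illustration, not something this notebook defines; some tools divide by 1000 instead.

```python
def human_readable(size_in_bytes):
    """Return a size in bytes as a short string such as '5.0 MB'."""
    # Assumption: binary units, where each step up is a factor of 1024
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(size_in_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the last unit if the value is very large
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


print(human_readable(1654))           # 1.6 KB
print(human_readable(5 * 1024 ** 2))  # 5.0 MB
```

Keeping the raw byte count in the data structure and converting only when printing means sorting still works on plain integers.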

In [8]:
# Update the loop to include the file size
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        print(f"Size: {size}b - File: {full_path}")

Size: 61b - File: ./README.md
Size: 1799b - File: ./.gitignore
Size: 328b - File: ./.git/config
Size: 21b - File: ./.git/HEAD
Size: 217b - File: ./.git/index
Size: 112b - File: ./.git/packed-refs
Size: 1654b - File: ./.git/objects/pack/pack-734108524c45bcbe06adc07ac04de3821b9505bd.pack
Size: 1184b - File: ./.git/objects/pack/pack-734108524c45bcbe06adc07ac04de3821b9505bd.idx
Size: 186b - File: ./.git/logs/HEAD
Size: 186b - File: ./.git/logs/refs/heads/main
Size: 186b - File: ./.git/logs/refs/remotes/origin/HEAD
Size: 41b - File: ./.git/refs/heads/main
Size: 30b - File: ./.git/refs/remotes/origin/HEAD
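
On a larger filesystem, `os.path.getsize()` raises `OSError` when a file does not exist or is inaccessible, for example a broken symbolic link. The following is a minimal sketch of one way to skip such files; the helper names `safe_size` and `walk_sizes` are hypothetical and not part of this notebook.

```python
import os


def safe_size(path):
    """Return the size of `path` in bytes, or None if it can't be read."""
    try:
        return os.path.getsize(path)
    except OSError:
        # Raised when the file does not exist or is inaccessible
        return None


def walk_sizes(top):
    """Yield (full_path, size) for every readable file under `top`."""
    for root, directories, files in os.walk(top):
        for _file in files:
            full_path = os.path.join(root, _file)
            size = safe_size(full_path)
            if size is not None:
                yield full_path, size
```

Looping over `walk_sizes('.')` and printing each pair would produce the same kind of listing as the cell above, minus any files that could not be read.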


In [1]:
# Persist the data in a dictionary. Since full file paths are unique, you can use them as dictionary keys
import os

file_metadata = {}
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        file_metadata[full_path] = size
print(file_metadata)

{'./large-files.ipynb': 5358, './README.md': 61, './.gitignore': 1799, './.git/config': 328, './.git/HEAD': 21, './.git/index': 217, './.git/packed-refs': 112, './.git/objects/pack/pack-734108524c45bcbe06adc07ac04de3821b9505bd.pack': 1654, './.git/objects/pack/pack-734108524c45bcbe06adc07ac04de3821b9505bd.idx': 1184, './.git/logs/HEAD': 186, './.git/logs/refs/heads/main': 186, './.git/logs/refs/remotes/origin/HEAD': 186, './.git/refs/heads/main': 41, './.git/refs/remotes/origin/HEAD': 30}


**Exercise:** Now that the metadata is captured and stored in a suitable data structure like a dictionary, report back the results with only the four largest files. Try using other quantities to report on, like the 10 largest files instead of 4.

In [26]:
items_shown = 0

# Sort the (path, size) pairs by size, largest first, and stop after four
for path, size in sorted(file_metadata.items(), key=lambda x: x[1], reverse=True):
    if items_shown >= 4:
        break
    print(f"Size: {size} Path: {path}")
    items_shown += 1


Size: 5358 Path: ./large-files.ipynb
Size: 1799 Path: ./.gitignore
Size: 1654 Path: ./.git/objects/pack/pack-734108524c45bcbe06adc07ac04de3821b9505bd.pack
Size: 1184 Path: ./.git/objects/pack/pack-734108524c45bcbe06adc07ac04de3821b9505bd.idx


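There is a lot happening in the previous cell. `sorted()` is a built-in function that returns a sorted list from any iterable. Here the iterable is `file_metadata.items()`, which yields `(path, size)` pairs, and we need to sort them by the _value_ (the size). This is done with the `key` parameter, which accepts any function that takes one item and returns the value to sort by; in this case that function is a `lambda`. Passing `reverse=True` puts the largest sizes first.
A `lambda` lets you write a small anonymous function as a single expression, without a separate `def` statement. That `lambda` expression is equivalent to defining a function like: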

```python
def by_value(x):
    return x[1]
```

`x` is a tuple of two items, the path and the size. The function returns only the size (`x[1]`) because that is what we want to sort by. Try changing the `lambda` expression to use `x[0]` instead and see what happens.
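
To make the shape of `x` concrete, here is a small sketch using made-up data (the `sample` dictionary and its paths are hypothetical). It shows what `.items()` yields and how the key function picks out the size from each pair.

```python
# Hypothetical sample data: path -> size in bytes
sample = {"./b.txt": 300, "./a.txt": 20, "./c.txt": 1000}

# .items() yields (path, size) tuples, in insertion order
pairs = list(sample.items())
print(pairs)  # [('./b.txt', 300), ('./a.txt', 20), ('./c.txt', 1000)]

# The key function receives one tuple at a time and returns its size
by_size = sorted(pairs, key=lambda x: x[1], reverse=True)
print(by_size)  # [('./c.txt', 1000), ('./b.txt', 300), ('./a.txt', 20)]
```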

**Exercise:** Use a named function instead of a `lambda` to achieve the same result.