Go through the basics of creating a Python script, and then create a Python file for the script so you can run it in the terminal. In this practice notebook, you'll create the building blocks for a script that finds large files on the filesystem.

## Get the logic right 
Start by defining some of the requirements of the script. In this case, we need to:
- _Walk_ the filesystem looking at files, directories and sub-directories
- Capture file information: is it a file? a directory? what size? what path?
- Store that information in a suitable data structure
- Report the largest files by sorting the data stored in that structure
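
To make the first requirement concrete, here is a minimal sketch of what `os.walk()` yields at each step. The helper name `list_tree` and the temporary directory layout are assumptions made for illustration; they are not part of the notebook.

```python
import os
import tempfile


def list_tree(top):
    """Collect the (root, directories, files) triples that os.walk yields under `top`."""
    entries = []
    for root, directories, files in os.walk(top):
        # Sort copies of the names so the result is stable across platforms.
        entries.append((root, sorted(directories), sorted(files)))
    return entries


# Build a small throwaway tree so the example does not depend on your own files.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "docs", "drafts"))
    for relative in (
        "notes.txt",
        os.path.join("docs", "cv.pdf"),
        os.path.join("docs", "drafts", "old.txt"),
    ):
        with open(os.path.join(top, relative), "w") as handle:
            handle.write("example")

    for root, directories, files in list_tree(top):
        # Print paths relative to the temporary directory to keep the output short.
        print(os.path.relpath(root, top), directories, files)
```

Each triple describes one directory: its path, the sub-directories directly inside it, and the files directly inside it. That is why the loops in this notebook only need to iterate over `files`.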

In [2]:
# The os module is well suited to filesystem operations like "walking" through directories and files
# Although there are many ways of achieving the same effect, a good way to loop over the filesystem is using `os.walk()`
import os
for root, directories, files in os.walk('.'):
    for _file in files:
        print(f"File found: {_file}")


File found: Adriane_Calaboni_CV.pdf
File found: betweenoper.sql
File found: carnetcovid-ana.pdf
File found: carnetcovid-nerida.pdf
File found: CartaPresentacion-Natura2023-NeridaHV.pdf
File found: Certificado_javacoursera.pdf
File found: Certificado_POO1coursera.pdf
File found: circulo-desafio-colisiones.html
File found: countdistinc.sql
File found: CURRÍCULUM VITAE_2023-Enero.docx.pdf
File found: CV Nerida HV 2022.docx
File found: CV Nerida HV 2022.pdf
File found: CV Nerida HV 2023.pdf
File found: dateformat.sql
File found: Default.rdp
File found: desktop.ini
File found: fivegroupfunction.sql
File found: incondition.sql
File found: lab1sql.sql
File found: lab2sql.sql
File found: lab3sql.sql
File found: lab4sql.sql
File found: lab5sql.sql
File found: large-files.ipynb
File found: likecondition.sql
File found: limitclause.sql
File found: looping-data-structures.ipynb
File found: orderby.sql
File found: querying-databases.ipynb
File found: reportecasa2021.pdf
File found: sqlite-operation

In [3]:
# Update the loop so that it shows the full path of each file (relative to the starting directory), ignoring directories, which we aren't going to track
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        print(f"File found: {full_path}")


File found: .\Adriane_Calaboni_CV.pdf
File found: .\betweenoper.sql
File found: .\carnetcovid-ana.pdf
File found: .\carnetcovid-nerida.pdf
File found: .\CartaPresentacion-Natura2023-NeridaHV.pdf
File found: .\Certificado_javacoursera.pdf
File found: .\Certificado_POO1coursera.pdf
File found: .\circulo-desafio-colisiones.html
File found: .\countdistinc.sql
File found: .\CURRÍCULUM VITAE_2023-Enero.docx.pdf
File found: .\CV Nerida HV 2022.docx
File found: .\CV Nerida HV 2022.pdf
File found: .\CV Nerida HV 2023.pdf
File found: .\dateformat.sql
File found: .\Default.rdp
File found: .\desktop.ini
File found: .\fivegroupfunction.sql
File found: .\incondition.sql
File found: .\lab1sql.sql
File found: .\lab2sql.sql
File found: .\lab3sql.sql
File found: .\lab4sql.sql
File found: .\lab5sql.sql
File found: .\large-files.ipynb
File found: .\likecondition.sql
File found: .\limitclause.sql
File found: .\looping-data-structures.ipynb
File found: .\orderby.sql
File found: .\querying-databases.ipynb
Fi

So now we have a few objectives completed:
- Files are detected
- Full paths are being collected

Next, we need to find size information. `os.path.getsize()` reports sizes in bytes, so in addition to capturing the size, we'll need a way to convert bytes to megabytes or gigabytes to make the numbers easier to read.
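
One possible way to do that conversion is sketched below. The function name `format_size` is hypothetical, and the choice of binary units (1 KB = 1024 bytes) is an assumption; divide by 1000 instead if you prefer decimal units.

```python
def format_size(size_in_bytes):
    """Convert a byte count into a short human-readable string.

    Assumes binary units, where each step up is a factor of 1024.
    """
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(size_in_bytes)
    for unit in units:
        # Stop at the first unit where the number is small enough,
        # or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


print(format_size(459))       # 459.0 B
print(format_size(417881))    # 408.1 KB
print(format_size(13551615))  # 12.9 MB
```

Keeping the raw byte count in the data structure and formatting only when printing means the values can still be sorted numerically.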

In [10]:
# Update the loop to include the file size
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        print(f"Size: {size}b - File: {full_path}")

Size: 159225b - File: .\Adriane_Calaboni_CV.pdf
Size: 459b - File: .\betweenoper.sql
Size: 168563b - File: .\carnetcovid-ana.pdf
Size: 168688b - File: .\carnetcovid-nerida.pdf
Size: 154141b - File: .\CartaPresentacion-Natura2023-NeridaHV.pdf
Size: 321600b - File: .\Certificado_javacoursera.pdf
Size: 417881b - File: .\Certificado_POO1coursera.pdf
Size: 1854b - File: .\circulo-desafio-colisiones.html
Size: 696b - File: .\countdistinc.sql
Size: 170052b - File: .\CURRÍCULUM VITAE_2023-Enero.docx.pdf
Size: 40216b - File: .\CV Nerida HV 2022.docx
Size: 255361b - File: .\CV Nerida HV 2022.pdf
Size: 177399b - File: .\CV Nerida HV 2023.pdf
Size: 2284b - File: .\dateformat.sql
Size: 0b - File: .\Default.rdp
Size: 402b - File: .\desktop.ini
Size: 1725b - File: .\fivegroupfunction.sql
Size: 564b - File: .\incondition.sql
Size: 1289b - File: .\lab1sql.sql
Size: 1559b - File: .\lab2sql.sql
Size: 858b - File: .\lab3sql.sql
Size: 1204b - File: .\lab4sql.sql
Size: 1241b - File: .\lab5sql.sql
Size: 2085

In [9]:
# Persist the data into a dictionary. Since file paths are unique you can use those as dictionary keys
file_metadata = {}
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        file_metadata[full_path] = size
print(file_metadata)

{'.\\Adriane_Calaboni_CV.pdf': 159225, '.\\betweenoper.sql': 459, '.\\carnetcovid-ana.pdf': 168563, '.\\carnetcovid-nerida.pdf': 168688, '.\\CartaPresentacion-Natura2023-NeridaHV.pdf': 154141, '.\\Certificado_javacoursera.pdf': 321600, '.\\Certificado_POO1coursera.pdf': 417881, '.\\circulo-desafio-colisiones.html': 1854, '.\\countdistinc.sql': 696, '.\\CURRÍCULUM VITAE_2023-Enero.docx.pdf': 170052, '.\\CV Nerida HV 2022.docx': 40216, '.\\CV Nerida HV 2022.pdf': 255361, '.\\CV Nerida HV 2023.pdf': 177399, '.\\dateformat.sql': 2284, '.\\Default.rdp': 0, '.\\desktop.ini': 402, '.\\fivegroupfunction.sql': 1725, '.\\incondition.sql': 564, '.\\lab1sql.sql': 1289, '.\\lab2sql.sql': 1559, '.\\lab3sql.sql': 858, '.\\lab4sql.sql': 1204, '.\\lab5sql.sql': 1241, '.\\large-files.ipynb': 207671, '.\\likecondition.sql': 756, '.\\limitclause.sql': 325, '.\\looping-data-structures.ipynb': 27504, '.\\orderby.sql': 822, '.\\querying-databases.ipynb': 16548, '.\\reportecasa2021.pdf': 20997, '.\\sqlite-ope

**Exercise:** Now that the metadata is captured and stored in a suitable data structure like a dictionary, report back the results with only the four largest files. Try using other quantities to report on, like the 10 largest files instead of 4.

In [45]:
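# Stop after this many files; change it to 4 or 10 to report a different quantity
max_items = 20
items_shown = 0

# Sort the (path, size) pairs by size, largest first
for path, size in sorted(file_metadata.items(), key=lambda x: x[1], reverse=True):
    if items_shown >= max_items:
        break
    print(f"Size: {size}b Path: {path}")
    items_shown += 1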
    

Size: 13551615b Path: .\GitHub\mapping-data\sample_data\wine-ratings.csv
Size: 3329302b Path: .\GitHub\mapping-data\.git\objects\pack\pack-ffd21a8f07b814c099608eb290ff26fba263d080.pack
Size: 643868b Path: .\GitHub\mapping-data\exercise_modified.ipynb
Size: 417881b Path: .\Certificado_POO1coursera.pdf
Size: 355744b Path: .\GitHub\mapping-data\sample_data\wine-ratings.json
Size: 322315b Path: .\java_exercises\serie06\.metadata\.plugins\org.eclipse.e4.workbench\workbench.xmi
Size: 321600b Path: .\Certificado_javacoursera.pdf
Size: 315675b Path: .\GitHub\mapping-data\sample_data\wine-ratings-small.csv
Size: 255361b Path: .\CV Nerida HV 2022.pdf
Size: 245484b Path: .\GitHub\mapping-data\sample_data\wine_rating2.json
Size: 207671b Path: .\large-files.ipynb
Size: 177399b Path: .\CV Nerida HV 2023.pdf
Size: 170052b Path: .\CURRÍCULUM VITAE_2023-Enero.docx.pdf
Size: 168688b Path: .\carnetcovid-nerida.pdf
Size: 168563b Path: .\carnetcovid-ana.pdf
Size: 159383b Path: .\GitHub\mapping-data\.git\ob

In [53]:
# Alternative: slice the sorted list to keep only the four largest files
for path, size in sorted(file_metadata.items(), key=lambda x: x[1], reverse=True)[:4]:
    print(f"Size: {size}b Path: {path}")

There is a lot happening in the previous cell. `sorted()` is a built-in function that returns a sorted list from any iterable, here the `(path, size)` pairs produced by `file_metadata.items()`. In this case, we need to sort by the _value_ (the size). This is done using the `key` parameter, which accepts a function such as a `lambda`.
A `lambda` lets you write a small function as a single expression without naming it. That `lambda` expression is equivalent to defining a function like:

```python
def by_value(x):
    return x[1]
```

`x` is a tuple holding two items, the path and the size. The function returns only the size because that is what we want to sort by. Try changing the `lambda` expression to use `x[0]` instead and see what happens.
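
The effect of the index can be seen on a small example. The dictionary `sample_metadata` below is made-up data with the same shape as `file_metadata`, and `operator.itemgetter` is shown only as a standard-library alternative to the `lambda`.

```python
from operator import itemgetter

# Hypothetical sample data shaped like file_metadata: path -> size in bytes.
sample_metadata = {
    "b.txt": 300,
    "a.txt": 100,
    "c.txt": 200,
}

# Sort by size (index 1 of each (path, size) tuple), largest first.
by_size = sorted(sample_metadata.items(), key=lambda x: x[1], reverse=True)
print(by_size)  # [('b.txt', 300), ('c.txt', 200), ('a.txt', 100)]

# Sort by path (index 0) instead, without reverse: the order becomes alphabetical.
by_path = sorted(sample_metadata.items(), key=lambda x: x[0])
print(by_path)  # [('a.txt', 100), ('b.txt', 300), ('c.txt', 200)]

# itemgetter(1) builds a key function that behaves like lambda x: x[1].
assert sorted(sample_metadata.items(), key=itemgetter(1), reverse=True) == by_size
```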

**Exercise:** Try using a named function instead of a `lambda` expression to achieve the same result.
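
Once the building blocks work in the notebook, they could be assembled into a script file along the following lines. This is a sketch under stated assumptions rather than the definitive script: the function names, the `--count` option, and the decision to skip unreadable files are all choices made here for illustration.

```python
import argparse
import os


def collect_file_sizes(top):
    """Walk `top` and return a dictionary mapping each file path to its size in bytes."""
    file_metadata = {}
    for root, directories, files in os.walk(top):
        for _file in files:
            full_path = os.path.join(root, _file)
            try:
                file_metadata[full_path] = os.path.getsize(full_path)
            except OSError:
                # Assumption: silently skip files that vanish or cannot be read.
                continue
    return file_metadata


def largest_files(file_metadata, count):
    """Return the `count` largest (path, size) pairs, largest first."""
    ordered = sorted(file_metadata.items(), key=lambda x: x[1], reverse=True)
    return ordered[:count]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Report the largest files under a directory.")
    parser.add_argument("path", nargs="?", default=".", help="directory to walk")
    parser.add_argument("--count", type=int, default=4, help="number of files to report")
    args = parser.parse_args(argv)

    for path, size in largest_files(collect_file_sizes(args.path), args.count):
        print(f"Size: {size}b Path: {path}")


if __name__ == "__main__":
    main()
```

Saved under a name of your choosing, for example `find_large_files.py`, it could be run from the terminal as `python find_large_files.py . --count 10`.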