Using the Python tools we have discussed in class (i.e., os.path, os, shutil, and specifically os.walk, hashlib, and the csv module), create a file inventory of the networked labs `data` folder. Your inventory should include the path to each file, the file name, file extension, file size, last modified time, and file integrity information (at least one hash digest). Think of this as a sort of deposit list or transfer list that you would share with someone else as a representation of a digital collection. You have to know what is expected in the collection (your list), what the things are (your metadata), and how they relate to each other. In practice you would want more metadata, such as unique identifiers, but the paths give an idea of how things are related and are a solid start.

Your assignment submission should include:

the Python script you write (preferably uploaded to a repository on your personal GitHub account, but it may be submitted as a file);
an inventory in CSV format (if sharing a repo, share a link to the repo). There is some room for variation in the column names, but the elements noted above should give you a pretty clear structure;
an accompanying README document that explains the files, their source (provenance), the metadata fields in the CSV (what they mean, what the datatypes are, where the data comes from or how it was derived), and when and how the information was created. Guides for creating README-style documents are available online; I recommend this one from Cornell Research Data Service: https://data.research.cornell.edu/content/readme
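
To illustrate why the inventory carries a hash digest, here is a minimal sketch of how a recipient might check a received collection against the CSV. The function name `verify_manifest` is hypothetical, and the column names `absolute_path` and `sha256_checksum` are assumptions about how the inventory is laid out; adjust them to match your own headers.

```python
import csv
import hashlib
import os

def verify_manifest(manifest_path, path_column='absolute_path', hash_column='sha256_checksum'):
    """Hypothetical helper: return (path, status) pairs for missing or changed files.

    The column names are assumptions about the inventory layout.
    """
    problems = []
    with open(manifest_path, newline='') as csvfile:
        for row in csv.DictReader(csvfile):
            file_path = row[path_column]
            if not os.path.isfile(file_path):
                problems.append((file_path, 'missing'))
                continue
            # Recompute the digest from the bytes currently on disk.
            with open(file_path, 'rb') as f:
                current = hashlib.sha256(f.read()).hexdigest()
            if current != row[hash_column]:
                problems.append((file_path, 'checksum mismatch'))
    return problems
```

An empty list means every listed file is present and unchanged. Absolute paths only resolve on the machine that produced the inventory, so a transfer to another system would more likely rely on relative paths.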

In [2]:
import os
from datetime import datetime
import hashlib
import csv

In [3]:
path_to_datadir = os.path.join('networked-services-labs-main', 'data')


In [4]:
def get_checksum(filePath, checksum_type):
    '''Helper function to create a checksum, which can be used to check data integrity.

    The filePath argument should be a string representing a valid path.
    The checksum_type argument should be 'MD5' or 'SHA256' (case and spaces are ignored).

    The function returns the hexadecimal digest string for an MD5 or SHA256 checksum.
    Any other checksum type raises a ValueError.'''
    checksum_type = checksum_type.lower().replace(' ', '')

    # Read in binary mode so the digest reflects the exact bytes on disk.
    with open(filePath, 'rb') as f:
        data = f.read()
    if checksum_type == 'md5':
        hash_string = hashlib.md5(data).hexdigest()
    elif checksum_type == 'sha256':
        hash_string = hashlib.sha256(data).hexdigest()
    else:
        raise ValueError('{} is not a hash function supported by this program. You must ask for MD5 or SHA256.'.format(checksum_type))
    return hash_string
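
The helper above reads each file into memory in one go, which may be impractical for very large files. A minimal sketch of an alternative follows, assuming a hypothetical function named `get_checksum_chunked` (not part of the original notebook) that feeds the file to `hashlib.new` in fixed-size blocks. The 64 KiB block size is an arbitrary choice.

```python
import hashlib

def get_checksum_chunked(file_path, checksum_type='sha256', chunk_size=65536):
    """Hypothetical variant of get_checksum that hashes a file block by block."""
    algorithm = checksum_type.lower().replace(' ', '')
    if algorithm not in ('md5', 'sha256'):
        raise ValueError('{} is not supported; use MD5 or SHA256.'.format(checksum_type))
    digest = hashlib.new(algorithm)
    with open(file_path, 'rb') as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest should match what `get_checksum` returns for the same file and algorithm, since both hash the same bytes.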

In [5]:
file_list = list()
headers = ['filename', 'folder', 'extension', 'size(bytes)', 'absolute_path', 'modification_time', 'md5_checksum', 'sha256_checksum'] 

for folderName, subfolders, filenames in os.walk(path_to_datadir):
    for file in filenames:
        filename = file
        folder = folderName
        # Build the path once and reuse it for every lookup below.
        path = os.path.join(folderName, file)
        extension = os.path.splitext(path)[1]
        absolutePath = os.path.abspath(path)
        size = os.path.getsize(path)
        # Last modified time in local time, formatted as YYYY-MM-DDTHH:MM:SS.
        modification_time = datetime.fromtimestamp(os.path.getmtime(path)).strftime("%Y-%m-%dT%H:%M:%S")
        md5_checksum = get_checksum(path, 'md5')
        sha256_checksum = get_checksum(path, 'sha256')
        # The order here must match the order of the headers list.
        file_info = [
            filename,
            folder,
            extension,
            size,
            absolutePath,
            modification_time,
            md5_checksum,
            sha256_checksum
        ]

        file_list.append(file_info)

with open('file-metadata-manifest-from-list.csv', 'w', newline="") as csvfile:
    fileManifest = csv.writer(csvfile)
    print('Writing file manifest CSV')
    fileManifest.writerow(headers)
    for file in file_list:
        print('adding', file[0])
        fileManifest.writerow(file)
    print('Write manifest')



#    print(f" File name: {filename}\n  Stored in {folder} folder\n  Path: {path}\n  Absolute Path: {absolutePath}\n  File size (bytes): {size}")




Writing file manifest CSV
adding animals.csv
adding mbox-short.txt
adding 08-12-1997-items.xls
adding 08-12-1997-items.xlsx
adding books-on-shelves12-3-2002.txt
adding diary-04-23-19.doc
adding diary-04-23-20.docx
adding observations-03-30-2018.csv
adding sightings-202203.jpg
adding html-with-script.html
adding script.js
adding style.css
adding cubane.pdb
adding ethane.pdb
adding methane.pdb
adding octane.pdb
adding pentane.pdb
adding propane.pdb
adding 2014-01-31_JA-africa.tsv
adding 2014-01-31_JA-america.tsv
adding 2014-02-02_JA-britain.tsv
adding 201403160_01_text.json
adding 33504-0.txt
adding 829-0.txt
adding diary.html
adding pg514.txt
adding web-files-small-metadata.csv
adding 000727.ram
adding 11-3250JohnsonvFolinoEtAl.wma
adding mj_telework_exchange_final_100710.mp3
adding NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3
adding 1005107061.tif
adding 13080t.jpg
adding k7989-7x.jpg
adding m237a2f.gif
adding orca.via_.moc_.noaa_.jpg
adding 01-1480.pdf
adding Chapter03.pdf
adding fil