For our first example, let's download a CSV file from the internet and perform
some data processing on it.

In [7]:
from data_as_code import Recipe

r = Recipe()
r.begin()

We've started our recipe with the default settings. As a result, the package
stores all files in a temporary directory created for this recipe, including the
CSV file that we download in the next step.
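
As a general illustration of what working in a temporary directory implies, here is a minimal sketch that uses only Python's standard `tempfile` module. It is not the package's own implementation, and the function name `write_in_temp_dir` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def write_in_temp_dir(file_name: str, text: str) -> tuple:
    """Write a file inside a temporary directory and report whether it existed."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp, file_name)
        target.write_text(text)
        existed = target.exists()
    # The directory and its contents are deleted when the `with` block exits.
    return target, existed


path, existed_inside = write_in_temp_dir('survey.csv', 'a,b\n1,2\n')
print(existed_inside)  # the file was present while the directory was open
print(path.exists())   # the file is gone once the directory is cleaned up
```

The practical consequence is that anything you want to keep should be copied out of a temporary location before it is removed.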

In [8]:
from data_as_code import GetHTTP

url = 'https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-size-bands-csv.csv'
GetHTTP(r, url, name='survey')

survey: 1.16MB [00:00, 28.0MB/s]                 


Downloading from URL:
https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-size-bands-csv.csv


<data_as_code.step.GetHTTP at 0x7f1ed0dee520>

In the cell above, we used the `GetHTTP` class and provided it with the URL of
our CSV file. We also gave it a short name, `survey`, because the file name taken
from the URL is very long
(`annual-enterprise-survey-2019-financial-year-provisional-size-bands-csv.csv`).
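
To make the idea of saving a download under a short name concrete, here is a rough standard-library sketch. It does not show how `GetHTTP` works internally; the `fetch` function is hypothetical, and it uses the name as the local file name, which may differ from how the package uses it. The demonstration reads from a local `file://` URL so that it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch(url: str, dest_dir: Path, name: str) -> Path:
    """Copy the resource at `url` into `dest_dir`, saving it under `name`."""
    target = dest_dir / name
    with urllib.request.urlopen(url) as response, target.open('wb') as out:
        shutil.copyfileobj(response, out)
    return target


# Offline demonstration: the "remote" file is a local file with a long name.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, 'a-very-long-file-name-for-a-csv.csv')
    source.write_text('a,b\n1,2\n')
    saved = fetch(source.as_uri(), Path(tmp), 'survey')
    saved_name = saved.name
    saved_text = saved.read_text()

print(saved_name)
```

The same function accepts an `https://` URL, so `fetch(url, Path('.'), 'survey')` would be the analogous call for the survey file.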

In [9]:
import csv
from data_as_code import Step, InputArtifact
from pathlib import Path

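class CustomStep(Step):
    # Declare the downloaded artifact (named 'survey' above) as an input.
    survey = InputArtifact('survey')

    def process(self) -> Path:
        # Write the output into the recipe's working directory.
        path = Path(r.wd, 'wages.csv')
        with path.open('w', newline='') as wf, \
                self.survey.file_path.open(newline='') as cf:
            writer = csv.writer(wf)
            for ix, row in enumerate(csv.reader(cf)):
                if ix == 0:
                    # Copy the header row unchanged.
                    writer.writerow(row)
                elif row[4] == 'Salaries and wages paid':
                    try:
                        # Scale the value by one million and label it as dollars.
                        writer.writerow(row[:4] + [int(row[5]) * 1_000_000, 'DOLLARS'])
                    except ValueError:
                        # Skip rows whose value is not an integer.
                        pass
        return path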

CustomStep(r, 'wages')
print(r.get_artifact('wages').file_hash.hexdigest())

e31f7e994b2205375f814213265df1d7cf47ebfeced35bea1e68f3a883a130ff
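
To see what the filtering logic in `process` does without running a recipe, the sketch below applies the same rules to a small in-memory CSV. The function name `extract_wages`, the column names, and the sample rows are all hypothetical and invented for illustration; they are not taken from the real survey file.

```python
import csv
import io

TARGET = 'Salaries and wages paid'


def extract_wages(source_text: str) -> str:
    """Return CSV text holding the header plus the rescaled wage rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for ix, row in enumerate(csv.reader(io.StringIO(source_text))):
        if ix == 0:
            # Copy the header row unchanged.
            writer.writerow(row)
        elif row[4] == TARGET:
            try:
                # Keep the first four columns, scale the value, relabel the unit.
                writer.writerow(row[:4] + [int(row[5]) * 1_000_000, 'DOLLARS'])
            except ValueError:
                # Skip rows whose value is not an integer.
                continue
    return out.getvalue()


# Hypothetical sample data (assumed layout: variable in column 4, value in column 5).
sample = (
    'year,code,industry,size,variable,value,unit\n'
    '2019,A,Example industry,a_0,Salaries and wages paid,12,DOLLARS(millions)\n'
    '2019,A,Example industry,a_0,Total income,99,DOLLARS(millions)\n'
    '2019,A,Example industry,a_0,Salaries and wages paid,C,DOLLARS(millions)\n'
)
result = extract_wages(sample)
print(result)
```

With this sample, only the first data row survives: the second has a different variable, and the third has a non-numeric value. Note that the header is copied as-is while each data row drops the variable column, so the header no longer lines up with the data columns.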
