Are you shy about your small data? In this world of distributed file systems, stream processing, and cluster computing, you may think that the best way to do your data science projects is to pay a bunch of money for some machines in the cloud.

But in reality you just want to run a web scraper once a day to populate a dataset with the latest stock market prices. Dataland is perfect for all your small data needs and can be run using only free-tier services.

Deployment

Dataland is designed to run on a Google Cloud free-tier f1-micro instance (0.6 GB RAM).
It combines lazy evaluation with a locally cached Google Cloud Storage bucket in order to perform operations on larger-than-memory datasets.
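
As a rough illustration of why lazy evaluation helps on a machine with so little memory, the sketch below streams a CSV file in fixed-size chunks and keeps only a running aggregate in memory. It uses only the Python standard library. The function names (`iter_chunks`, `mean_price`) and the `price` column are hypothetical and are not part of dataland's API.

```python
import csv
import io
from itertools import islice


def iter_chunks(fileobj, chunk_size=1000):
    """Lazily yield lists of at most chunk_size rows from a CSV file object."""
    reader = csv.DictReader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def mean_price(fileobj, chunk_size=1000):
    """Compute the mean of the hypothetical 'price' column one chunk at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(fileobj, chunk_size):
        total += sum(float(row["price"]) for row in chunk)
        count += len(chunk)
    return total / count if count else None


if __name__ == "__main__":
    sample = io.StringIO("symbol,price\nAAA,10.0\nBBB,20.0\nCCC,30.0\n")
    print(mean_price(sample, chunk_size=2))
```

Only one chunk of rows is held in memory at any time, so the peak memory use depends on `chunk_size` rather than on the size of the file.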

Features

Dataland has 3 noteworthy tools to accelerate development:

Storage is an abstraction layer for a gcloud bucket that handles local caching to reduce the network cost of the system, while providing a backup of datasets that can be accessed by a dataland scheduler or a personal laptop.
Jobs is the framework for writing pipelines to be run on a fixed schedule in order to update and modify datasets.
Notification provides an interface for email notifications to be sent from any job via the Mailgun free tier.
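
To make the local caching idea concrete, here is a minimal read-through cache sketch. It is not dataland's Storage implementation: the `CachedStorage` class, its methods, and the use of a plain dict as a stand-in for the bucket are all assumptions made for illustration, and it handles flat file names only.

```python
import tempfile
from pathlib import Path


class CachedStorage:
    """Read-through cache: serve from local disk, fall back to a remote store."""

    def __init__(self, remote, cache_dir):
        self.remote = remote            # stand-in for the bucket: maps name -> bytes
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.remote_reads = 0           # counts simulated network fetches

    def read(self, name):
        local = self.cache_dir / name
        if local.exists():
            return local.read_bytes()   # cache hit: no network cost
        data = self.remote[name]        # cache miss: fetch from the remote stand-in
        self.remote_reads += 1
        local.write_bytes(data)
        return data

    def write(self, name, data):
        # Write locally and to the remote so the remote copy acts as a backup.
        (self.cache_dir / name).write_bytes(data)
        self.remote[name] = data


if __name__ == "__main__":
    bucket = {"prices.csv": b"symbol,price\nAAA,10.0\n"}
    with tempfile.TemporaryDirectory() as tmp:
        storage = CachedStorage(bucket, tmp)
        storage.read("prices.csv")   # fetched from the remote stand-in
        storage.read("prices.csv")   # served from the local cache
        print(storage.remote_reads)
```

Repeated reads of the same dataset touch the remote only once, which is the network saving the Storage description refers to.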

Writing a Job

There are two important concepts here:

Operation is an atomic action that can be applied to a dataframe. The two main types of operations are:

AppendOperation, which adds data to an existing dataframe
TransformOperation, which takes in one or more dataframes and outputs a new dataframe

Job contains a pipeline of operations which are run in order. Each operation is applied to the state of the dataset left by the operations before it.
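
The ordering rule can be sketched as follows. The class names match the ones above, but the constructors, the `apply` method, and `run_pipeline` are hypothetical, and plain lists of dicts stand in for dataframes. For brevity, this TransformOperation takes a single dataset as input.

```python
class AppendOperation:
    """Adds rows to an existing dataset (rows are plain dicts in this sketch)."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable returning the new rows to append

    def apply(self, rows):
        return rows + list(self.fetch())


class TransformOperation:
    """Builds a new dataset from the current one."""

    def __init__(self, func):
        self.func = func  # callable mapping a list of rows to a new list of rows

    def apply(self, rows):
        return self.func(rows)


def run_pipeline(rows, operations):
    """Apply each operation in order to the result of the previous one."""
    for operation in operations:
        rows = operation.apply(rows)
    return rows


if __name__ == "__main__":
    operations = [
        AppendOperation(lambda: [{"symbol": "BBB", "price": 20.0}]),
        TransformOperation(lambda rows: [r for r in rows if r["price"] > 15]),
    ]
    print(run_pipeline([{"symbol": "AAA", "price": 10.0}], operations))
```

Because the transform runs after the append, it sees the newly appended row as well as the original one.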

from dataland.scheduler import Job, schedule

job = Job(
    sched=schedule.every(1).day.at('09:00'),
    operations=[
        # Pipeline of Operations
    ]
)

A job can be defined as above. Once placed in the jobs/ directory of this project, it will automatically be picked up and run the next time dataland/schedule.py is run (hot reloading is a work in progress).
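
One way such automatic pick-up could work is to import every Python file in the jobs/ directory at startup and collect a module-level `job` variable from each. The sketch below shows that idea using only the standard library. It is a guess at the mechanism, not dataland's actual code, and `discover_jobs` is a hypothetical name.

```python
import importlib.util
import tempfile
from pathlib import Path


def discover_jobs(jobs_dir):
    """Import every .py file in jobs_dir and collect its module-level `job`."""
    jobs = {}
    for path in sorted(Path(jobs_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if hasattr(module, "job"):
            jobs[path.stem] = module.job
    return jobs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A plain string stands in for a real Job object in this demo.
        Path(tmp, "prices.py").write_text("job = 'daily price scrape'\n")
        Path(tmp, "helpers.py").write_text("VALUE = 1\n")
        print(discover_jobs(tmp))
```

Under this approach, discovery happens once when the scheduler starts, which would explain why a new job is only seen on the next run until hot reloading is available.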

NikhilPeri/data-land
