# Sustainable (Small) Data Science

> Tips and tricks to set up, execute, and conclude a data science project for humans

---

Asura Enkhbayar, [ScholCommLab](https://www.scholcommlab.ca/), Simon Fraser University

- Twitter: [@bubblbu_](https://twitter.com/bubblbu_)
- GitHub: [bubblbu](https://github.com/bubblbu)
- Email: asura.enkhbayar@gmail.com

## Outline

1. Introduction
1. Why _sustainable_ data science?
1. Types of considerations for sustainable data practices
1. Tips and tricks

---

Repository: https://github.com/Bubblbu/sustainable-data-sci

### About me

- I am a PhD student
- I usually work on multiple projects at the same time
- I often work in interdisciplinary settings
- I often am the "data" person

### About you

[https://pollev.com/asuraenkhbayar391](https://pollev.com/asuraenkhbayar391)

## Why _sustainable_ data science?

- What's wrong with open science or reproducible science?

- We rarely talk about the non-financial cost of open and reproducible practices
    - Cost of labor
    - Cost of training and education
    - Maintenance cost
    - Opportunity cost

- Sustainable data science also considers the **affordability** of data science practices

<center><img src='../sustainable_data_science.png'></center>

### Unsustainable data science

Unsustainable data science neglects the costs of reproducible and open science:

- Reproducibility/openness as a status; sometimes only addressed after the fact
- Ignoring the technical limitations of open source tools
- Relying on individual commitment and enthusiasm, which can feed into exploitative conditions and burnout

## Considerations for sustainable data practices

### What challenges have you encountered while working with data?
 
https://PollEv.com/free_text_polls/LMWIZtWYsXaPXsuBQthxP/respond

### Project requirements

- What kind of data are you working with?
- What is the scope of the project?
- Who will need to access the data at which stages of research?

### Labor and skills

- I am a PhD student
    - Not a full-time researcher
- I often work in interdisciplinary settings
    - Collaborations are common
    - I often need to communicate results at different stages of the research    
- I usually work on multiple projects at the same time
    - Learning new tools is common
- I have a background in engineering
    - Not a trained software developer, data architect, or data scientist
    - Data collection, cleaning, processing, and analysis

### Tools and processes

- Is it FOSS?
- Is it versatile?
- Can it be quickly tested?
- Does it have a community of users?
- Can it be replaced?

### What tools are you interested in using in your future projects?

https://PollEv.com/free_text_polls/tYVBuilMTG7mxwB4c5xCE/respond

## Tools, tips, processes for sustainable data science

https://github.com/Bubblbu/sustainable-data-sci

### Setting up projects

#### Project structure

- Reuse folder structures (e.g., [Cookiecutter data science template](https://drivendata.github.io/cookiecutter-data-science/))

- My own project structure: https://github.com/ScholCommLab/altmetric-news-quality
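
A reusable folder structure can also be scripted. The following is a minimal sketch using only the Python standard library; the folder names in `DEFAULT_FOLDERS` and the function `create_project_skeleton` are hypothetical choices for illustration, not the layout of the Cookiecutter template or of the repository linked above.

```python
import tempfile
from pathlib import Path

# Hypothetical folder layout for illustration; adapt it to your own projects.
DEFAULT_FOLDERS = (
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "reports/figures",
)


def create_project_skeleton(root, folders=DEFAULT_FOLDERS):
    """Create each folder under `root` and return the list of created paths."""
    root = Path(root)
    created = []
    for folder in folders:
        path = root / folder
        # exist_ok=True makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        # An empty .gitkeep file lets Git track otherwise empty folders.
        (path / ".gitkeep").touch()
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_project_skeleton(Path(tmp) / "my-new-project"):
            print(path)
```

Keeping the layout in one small script (or a template) means every new project starts from the same structure, which lowers the cost of switching between projects.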

### Executing projects

#### Utility
- Progress bars: [tqdm](https://tqdm.github.io/)
- APIs: [Postman](https://www.postman.com/)
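
As a rough illustration of the progress bar tip, wrapping any iterable in `tqdm` is usually enough. The `clean_records` function and its sample data are hypothetical, and the fallback is only there so the sketch still runs where tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch runs without tqdm installed: no progress bar is shown.
    def tqdm(iterable, **kwargs):
        return iterable


def clean_records(records):
    """Hypothetical processing step: strip whitespace and lowercase each record."""
    cleaned = []
    # tqdm wraps the iterable and reports progress while the loop runs.
    for record in tqdm(records, desc="Cleaning records"):
        cleaned.append(record.strip().lower())
    return cleaned


if __name__ == "__main__":
    print(clean_records(["  Alpha", "BETA  ", " Gamma "]))
```

For long-running data collection or cleaning steps, this gives a quick sense of how long a job will take without any extra logging code.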

#### Development

- Stick to a method of dependency management
    - For Python I use: [Poetry](https://python-poetry.org/), [pyenv](https://github.com/pyenv/pyenv), [pipx](https://github.com/pypa/pipx)
- Utilize notebooks and interactive development
- Serious development in notebooks: [nbdev](https://nbdev.fast.ai/)
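
Whatever dependency manager is used, it can help to record which package versions were actually installed when an analysis ran. The sketch below is an assumption-laden complement to tools like Poetry, not a description of how they work: it uses only the standard library, and `snapshot_environment` is a hypothetical helper name.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Return the Python version and the installed package versions as a dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        # Skip distributions with broken or missing metadata.
        if name:
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON, e.g. to paste into a research log.
    print(json.dumps(snapshot_environment(), indent=2))
```

A snapshot like this does not replace a lock file, but it is cheap to produce and can be stored next to results or in a research log.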

#### Research process

- GitHub wiki as a research log: https://github.com/ScholCommLab/altmetric-news-quality/wiki

### Completing projects

#### Publications

- Separate data work from scholarly articles, e.g., https://github.com/ScholCommLab/facebook-hidden-engagement
    1. Versioned code repository with DOI (GitHub + Zenodo)
    1. Versioned data repository with DOI (Dataverse)
    1. Versioned repository with code/data to reproduce the article (GitHub + Zenodo)

# Thanks, y'all