This repo contains materials to study and learn about principled data processing, including:
- onboarding: materials to help you get set up with the right tools and procedures
- demo-tasks: to walk through how to use some of those tools in practice
- templates: to serve as outlines for routine files like Makefiles and scripts in Python or R
- checklists: to refer to as you work and contribute to projects (updating a repo, writing a script, adding a new task)
- languages: language-specific tips to consider when writing scripts (like scalability, missingness)
- notebooks: to walk through various topics in the context of a specific language (e.g. set operations in Python)
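
As a flavor of the kind of topic the notebooks cover, here is a minimal sketch of set operations in Python. The record IDs and variable names are made up for illustration and are not taken from the notebooks themselves.

```python
# Hypothetical record IDs collected from two different sources
source_a = {"r001", "r002", "r003"}
source_b = {"r002", "r003", "r004"}

in_both = source_a & source_b         # intersection: IDs present in both sources
in_either = source_a | source_b       # union: IDs present in at least one source
only_in_a = source_a - source_b       # difference: IDs in A that are missing from B
in_exactly_one = source_a ^ source_b  # symmetric difference: IDs in only one source

print(sorted(in_both))         # ['r002', 'r003']
print(sorted(in_either))       # ['r001', 'r002', 'r003', 'r004']
print(sorted(only_in_a))       # ['r001']
print(sorted(in_exactly_one))  # ['r001', 'r004']
```

Sets are unordered, so the results are sorted before printing to keep the output stable.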
There are a few repos outside of this one that house various tools and/or guidance that may be useful.
- sample-project
  - A dummy repo for testing out git functionality like cloning, pushing, and opening pull requests
- resource-utils/faqs
  - There are a few help articles related to HRDAG workflow, in particular:
    - data-hacking-on-server.md: includes instructions for making ssh keys and running Jupyter notebooks
    - safe-logout.txt: instructions for safely disconnecting from eleanor and notebooks
    - data-work-faq.txt: questions we've asked ourselves enough to write down for others
- resource-utils/notes
  - There's a useful document from a previous intern:
    - internship_notes_2016.md: includes some walk-throughs, suggested tools, frequently used commands
- gnutools
  - A useful guide to using GNU tools more effectively, with examples
- record-hash-comparisons
  - An introduction and overview of creating unique identifiers with hashes
- form-extraction
  - A place for some common tools and code we use to extract info from different kinds of forms
- tool-suite
  - A home for some tools related to performance improvements and benchmarking
- dotfiles
  - Examples of all kinds of dotfiles you might want to explore and use in your working environment, like vimrc, bash_profile, zshrc, and gitconfig
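
To give a sense of what "creating unique identifiers with hashes" can look like, here is a minimal sketch in Python. The function name `record_id`, the field names, the normalization steps, and the choice of SHA-1 with a `|` separator are all assumptions made for illustration; they are not taken from the record-hash-comparisons repo.

```python
import hashlib


def record_id(record, fields, sep="|"):
    """Build a hex digest from the selected fields of a record (hypothetical helper)."""
    parts = []
    for name in fields:
        value = record.get(name)
        # Normalize so trivial differences (case, stray whitespace) do not change the ID
        parts.append("" if value is None else str(value).strip().lower())
    joined = sep.join(parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# Two made-up records that differ only in case and trailing whitespace
a = {"name": "Ana Perez ", "date": "2016-03-01", "place": "Bogota"}
b = {"name": "ana perez", "date": "2016-03-01", "place": "Bogota"}
fields = ["name", "date", "place"]

print(record_id(a, fields) == record_id(b, fields))  # True
```

Which fields go into the hash, and how they are normalized, decides which records count as "the same", so those choices deserve more care than this sketch gives them.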
on vim:
on git:
- Collaboration with version control
- Version control for transparency and collaboration
- Interactive: Learn Git Branching
on workflow:
- Filenames and data science project organization, Integrated development environments
- Introduction to testing code for data science
- Non-interactive scripts
- Data analysis pipelines
- Reproducible reports
- Automated testing and continuous integration
- Using Statistics to Assess Lethal Violence in Civil and Inter-State War (ask Megan for the PDF if you can't download this one for free!)
- Processing scanned documents for investigations of police violence