Storytelling with Data
If you are a student enrolled in the class (or if you'd like to follow along!) you should start by signing up for the course's Slack account (you need to join using your @dartmouth.edu email address).
Next, run through this GitHub tutorial. Then fork this repository so that you can contribute!
All code for this course should be written in Python and organized in a Jupyter notebook. Any data you analyze must be shareable with all other students in the course, and ideally it should be shareable with the public. All code and other student-generated materials will be shared publicly.
We will use Docker as a platform for running code, managing software, etc. Docker is a tool for packaging up everything programs need to run into isolated "containers" that can run on anybody's computer. (You can always attempt to install packages that we use in the course via other means, but Docker is the only supported method.) To get Docker set up on your machine, follow these instructions.
In addition to the GitHub tutorial above, if you're new to computer programming, Python, or Jupyter notebooks, you'll want to run through some tutorials to help you get started. We'll go over the very high-level ideas in class, and you can always ask questions in class or via Slack, but you will likely need to go through these tutorials on your own time to fully experience the delicious learning benefits encapsulated within.
- Introduction to Python (beginner)
- Introduction to Git (beginner)
- Video introduction to Jupyter Notebooks (with code) (beginner)
- Learning to code with Python and Jupyter Notebooks (beginner)
Once you have the basics down, you can move on to learn about some very useful Python packages:
- Introduction to Pandas (intermediate)
- Practice your Pandas skills (intermediate)
- Getting started with Numpy (intermediate)
- Getting started with Scipy (intermediate)
- Exploring high dimensional data with HyperTools (intermediate)
Another really useful technique for doing reproducible open science in the real world is to develop unit tests. I suggest using Travis CI to automatically run your unit tests when you check in new code:
- Tools for analyzing dynamic brain patterns and brain networks (intermediate to advanced)
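If you have never written a unit test, here is a minimal sketch using Python's built-in `unittest` module. The `zscore` function and its test cases are hypothetical examples made up for illustration; they are not part of the course materials.

```python
import math
import unittest


def zscore(values):
    """Standardize numbers to zero mean and unit (population) standard deviation."""
    if not values:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("values must not all be identical")
    return [(v - mean) / std for v in values]


class TestZscore(unittest.TestCase):
    def test_mean_is_zero(self):
        result = zscore([1.0, 2.0, 3.0, 4.0])
        self.assertAlmostEqual(sum(result) / len(result), 0.0)

    def test_std_is_one(self):
        result = zscore([1.0, 2.0, 3.0, 4.0])
        variance = sum(r ** 2 for r in result) / len(result)
        self.assertAlmostEqual(variance, 1.0)

    def test_constant_input_raises(self):
        # A constant list has zero spread, so it cannot be standardized.
        with self.assertRaises(ValueError):
            zscore([5.0, 5.0, 5.0])
```

If you save this as, say, `test_zscore.py` (the filename is just an example), you can run it with `python -m unittest test_zscore`. Inside a Jupyter notebook, `unittest.main(argv=[''], exit=False)` runs the tests without shutting down the kernel.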
Where to find nice datasets
In today's "Big Data" world, there is an abundance of high-quality, free datasets to enjoy and explore. Below is a short list of websites that are great resources for data:
The data-stories folder
Each time you begin a new project, you should (in collaboration with your project team):
- Fork this repository and clone it to your computer
- Come up with a creative, fun, yet descriptive, name for your project
- Create a new project sub-folder under data-stories
- Follow the instructions here to finish setting up your project
- When you have something good or shareable, submit a pull request to merge your fork into master (the "original fork"). If your request is accepted, your project will now become part of the main repository for this course. That's called "sharing."
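You can create the project sub-folder by hand, but here is a rough sketch of doing it from Python. The `make_project_folder` helper and the lowercase-with-hyphens folder naming are assumptions made for illustration; if the project setup instructions say something different, follow those instead.

```python
import re
from pathlib import Path


def make_project_folder(project_name, repo_root="."):
    """Create data-stories/<slug> under repo_root and return its path.

    Hypothetical helper: the slug convention (lowercase, hyphen-separated)
    is an assumption, not a course requirement.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", project_name.lower()).strip("-")
    if not slug:
        raise ValueError("project name must contain letters or digits")
    folder = Path(repo_root) / "data-stories" / slug
    # exist_ok=False raises FileExistsError so you don't clobber
    # a project that already uses this name.
    folder.mkdir(parents=True, exist_ok=False)
    return folder
```

For example, calling `make_project_folder("My Great Story!")` from the top level of your clone would create `data-stories/my-great-story`.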
You can also (when the mood strikes) continue, or fork, someone else's project -- in the real world, this is called (amongst other things) "doing science." Or, as Bernard of Chartres, and more recently Isaac Newton, described it, "standing on the shoulders of giants." To stand on the shoulders of someone else's project, follow a similar procedure to what's described above:
- Create a new fork of this repository, possibly based off of someone else's fork (if they'll share it with you!)
- Decide whether your changes merit a new project name (new folder) or whether you want to add to the existing project (existing folder)
- Follow the regular project setup instructions, modifying as needed
- When you have something to share, submit a pull request so that everyone can benefit from your work. If you're feeling extra nice, it's never a bad idea to acknowledge the original project on which your work is based (e.g. in your project's readme file), and/or to give the original project's authors a heads up that you're about to change the world using something they started!
Data science is a tricky, rewarding, and often frustrating business. Luckily for us data scientists, there are many places to get help! Examples include:
- Google-- searchable portal to all human knowledge. Most Internet things are reachable through here, and it's a great place to start your search. You can often find code that other people have written that solves a similar problem to the one you're working on, or a tutorial that teaches you how to solve a particular class of problems.
- Wikipedia-- community-curated encyclopedia. Wikipedia is a good resource for learning about the background of a technique, looking up equations, etc. It's not a good source for tutorials.
- Slack-- course chatroom. A good place to ask questions, post ideas, etc., to other members of the class.
- The last (but hopefully not least) option if you're feeling stuck, unhappy with how things are progressing, looking for fun new ideas to revitalize your project and get you interested in science again, etc. is to come talk with me. If you're a Dartmouth person you can come to my regular office hours, email me, message me on Slack, or come visit my lab.
- Important-- chances are good that if you're feeling lost, you're not the only one! If you learn something useful, please share it via Slack!