# 01. Modern Data Analysis - Introduction

[Joses Ho](https://twitter.com/jacuzzijo), [Sangyu
Xu](https://xusangyu.com/), and [Adam
Claridge-Chang](http://www.claridgechang.net/)

As research techniques and data collection have become almost completely
digital and analysis methods grow more sophisticated, it is critical
that scientists develop three skills: data visualization, statistics,
and coding. Unfortunately, many undergraduate biology programs emphasize
the memorization of numerous facts, while failing to offer courses in
data graphics, estimation statistics, or scientific programming. In this
session, we offer a basic orientation on these topics.

## Before class

*If you encounter difficulties with the steps below, email Adam or a classmate for help. If any issues can’t be resolved, we can work on them together in class.*

Most of the work will be done before the class.

1.  You'll need to get set up with a [version-control](https://en.wikipedia.org/wiki/Version_control) system. Go to [GitHub](https://github.com/) and get an account. Download and install [GitHub Desktop](https://desktop.github.com/).

2.  Retrieve the course materials from GitHub. Go to the course repository ("repo") at https://github.com/ACCLAB/moda. Click the green <mark style="background-color: lightgreen">Code</mark> button and then select <mark style="background-color: lightgray">Open with GitHub Desktop</mark>.

3.  To get set up with Python and [Jupyter](https://en.wikipedia.org/wiki/Project_Jupyter) notebooks, install the [Anaconda
    Distribution](https://www.anaconda.com/download/) on your laptop.

4.  Launch
    [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/user/interface.html). There are two ways to do this:
    -   Open Anaconda Navigator and launch JupyterLab by clicking on it, *or…*
    -   Start PowerShell (Windows) or Terminal (macOS/Linux) and type `jupyter lab`. This command translates into English as: “Launch the Lab version of the Jupyter application.”

5.  Download this notebook file: [A Quick Tour of The
    Notebook](https://drive.google.com/file/d/17_N5_zRva-hzHKr1g87SautWx0j-v0r0/view?usp=sharing).

6.  In the File Browser panel in JupyterLab, navigate to the folder
    where you are keeping the notebook file. Open it by double-clicking
    on the icon shown in the JupyterLab browser window.

7.  Work through the notebook. Familiarize yourself with basic Python,
    and with working in the JupyterLab environment.

8.  Read about [pandas](https://pandas.pydata.org/),
    [matplotlib](https://matplotlib.org/), and
    [seaborn](https://seaborn.pydata.org/).

9.  Read our papers on estimation statistics
    [here](https://zenodo.org/record/60156) and
    [here](https://doi.org/10.1101/377978).
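
Once steps 3 to 8 are done, a quick check like the one below can confirm that the setup worked. This is only a sketch: the `REQUIRED` list and the `check_packages` helper are our own illustrative names, and we assume the listed packages were installed by the Anaconda Distribution. Paste it into a notebook cell in JupyterLab and run it.

```python
import importlib.util
import sys

# Packages used in this session (an assumed list; edit it to suit your setup).
REQUIRED = ["jupyterlab", "pandas", "matplotlib", "seaborn"]


def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)

# Report the Python version and whether each package was found.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name, found in status.items():
    print(f"{name:12s} {'found' if found else 'MISSING'}")
```

If any package is reported as missing, it is probably easiest to raise it by email or in class rather than to debug alone.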


## In class

10.  An overview of data analysis.

11.  A tour of the estimationstats web app.

12.  Presentation of a Jupyter
    [notebook](https://drive.google.com/file/d/1o4Ou2fHY73l6Nb7MUp2GqwtrJbcQ80Ix/view?usp=sharing)
    that introduces techniques in data analysis using Python.

13.  Try JupyterLite, an experimental web version of JupyterLab, with a
    class notebook
    [here](https://sangyu.github.io/Evidence-Session/lab?path=Notebooks%2F01.+Data+Analysis+with+Jupyter+and+Python.ipynb).

## Further practice and resources

Try using the [estimationstats.com](https://www.estimationstats.com/#/) web app to analyze your own grouped data.
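
To get a feel for what an estimation analysis of two groups reports, here is a minimal sketch that computes a mean difference and a bootstrap confidence interval using only the Python standard library. The function name `bootstrap_mean_diff`, its parameters, and the measurements are all made up for illustration. It uses a simple percentile bootstrap, which should not be assumed to match the exact method used by the web app.

```python
import random
import statistics


def bootstrap_mean_diff(control, test, n_resamples=5000, ci=95, seed=12345):
    """Return (mean difference, CI lower, CI upper) via a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = statistics.mean(test) - statistics.mean(control)

    # Resample each group with replacement and record the mean difference.
    diffs = []
    for _ in range(n_resamples):
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(statistics.mean(t) - statistics.mean(c))
    diffs.sort()

    # Take the central `ci` percent of the resampled differences.
    alpha = (100 - ci) / 2 / 100
    lower = diffs[int(alpha * n_resamples)]
    upper = diffs[int((1 - alpha) * n_resamples) - 1]
    return observed, lower, upper


# Hypothetical measurements, invented for this example.
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
test = [5.6, 6.1, 5.2, 6.4, 5.9, 5.5, 6.0, 5.8]

diff, low, high = bootstrap_mean_diff(control, test)
print(f"Mean difference: {diff:.2f} [95% CI {low:.2f}, {high:.2f}]")
```

The point of the exercise is the form of the result: an effect size with an interval around it, rather than a lone *P* value.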

Open and have a look at the sample multivariate
[data](https://docs.google.com/spreadsheets/d/1F0c5I_S9_NnLKPMQxJkEfzGfhzQeR26SgkiHSTFwKDE/edit?usp=sharing).
Go through the [introductory
notebook](https://drive.google.com/file/d/1m_l4k5ZaUc03hpvcfBd_Riy2nXYDpFXg/view?usp=sharing)
that demonstrates data analysis.

We recommend the following texts to strengthen your data-analysis and
presentation skills. They can be dipped into over the coming months or
years, and used as references. Being familiar with some or all of this
material will help you write your first-author papers and doctoral
thesis.

### *Key resources*

-   Estimation: Our
    [estimationstats.com](https://www.estimationstats.com/#/background)
    site has introductory information on estimation and specific types
    of
    [analyses](https://www.estimationstats.com/#/user-guide/two-independent-groups)
    and [effect
    sizes](https://www.estimationstats.com/#/about-effect-sizes).

-   Datavis: Claus Wilke’s free online
    [book](https://clauswilke.com/dataviz/index.html) is a great
    introduction to data visualization, and a style guide. Its figures
    are produced in R, a language widely used for statistics.

-   Coding: There are many online resources to learn coding. Published
    in 2021, [A Data-Centric Introduction to
    Computing](https://dcic-world.org/) uses a Python-like teaching
    language ([Pyret](https://www.pyret.org/)) to introduce key concepts
    in computer science.

### *Additional resources*

#### *Some are free, some you will need to buy or borrow from the library.*

-   Estimation: If you want to learn about estimation statistics in
    greater depth, Calin-Jageman and Cumming’s
    [textbook](http://thenewstatistics.com/itns/) is well-written,
    funny, and clear. The authors also run a
    [blog](https://thenewstatistics.com/itns/).

-   Estimation: Christophe Bernard’s account of the pioneering experience
    of a major journal (*eNeuro*) recommending estimation as standard:
    the [initial
    announcement](https://www.eneuro.org/content/6/4/ENEURO.0259-19.2019),
    [author
    feedback](https://blog.eneuro.org/2021/02/discussion-est-stats-author-feedback),
    and [after one
    year](https://www.eneuro.org/content/8/2/ENEURO.0091-21.2021).

-   Coding: The paid coding tutorial [Learn Python The Hard
    Way](https://learncodethehardway.org/python/) has a good reputation,
    but there are also many free options (see
    [DCIC](https://dcic-world.org/) above) with great reviews.

-   Coding: It will help to learn to use your computer’s
    [Unix-style](https://youtu.be/tc4ROCJYbm0) command-line
    [shell](https://en.wikipedia.org/wiki/Unix_shell). This interface
    will allow you to use package managers like
    [conda](https://docs.conda.io/en/latest/) and
    [homebrew](https://brew.sh/), version-control tools like
    [git](https://git-scm.com/), and other important tools. There are
    many [books](https://www.linuxcommand.org/index.php) about the
    shell, which differs only slightly between macOS,
    [Windows](https://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/),
    and Linux.

-   Datavis: A brief guide to oral–visual data presentations
    ([talks](http://www.howtogiveatalk.com/)).

-   Datavis: A reader-funded textbook on
    [typography](http://practicaltypography.com/presentations.html),
    including for slides. Since so much communication relies on text,
    typography is an important part of the data interface.

-   Datavis: For historical perspectives, Edward Tufte’s
    [books](https://www.edwardtufte.com/tufte/) are classic texts to
    develop your design skills, and there is Friendly and Wainer’s
    [History of Data
    Visualization](https://friendly.github.io/HistDataVis/).

-   As you progress, you will want to develop your skills in areas like
    bioinformatics, image processing, and/or machine learning. The iris
    dataset is widely used for training in multivariate data analysis,
    with many online tutorials.

The social reasons to learn programming also apply to research programming.

<iframe width="600" height="400"
src="https://www.youtube.com/embed/kgicuytCkoY">
</iframe>