# Coming up with a data plan

Sections:
* Defining a data plan
* An example
* Preregistration

# 1. Defining a data plan

Before you write a single line of code, before you even collect a single data point, and before you do anything else on a project, you want to have a clear vision of your entire process. To clarify this vision, you want to come up with a **data plan**.

* **Data Plan:** A systematic outline of your data analysis from beginning to end.
   
A data plan should be as comprehensive as possible, outlining not only the logic of the analysis but also the assumptions and goals.

When paired with a declaration of a hypothesis and predictions, a data plan becomes a critical part of a [project registration](http://help.osf.io/m/registrations/l/524205-register-your-project). Project registration (or "preregistration") is becoming an increasingly valuable tool for getting studies published in psychology (and, to a smaller extent, neuroscience), particularly studies whose results fail to conform to _a priori_ predictions. We will return to preregistration later in the lecture.

## Questions that your data plan should answer

Your data plan should answer six basic questions that an outside reader will have about your project.

<br>

### 1. What is the data set that you are working with? 

Define the data set that you are working with, including the number of observations (e.g., number of subjects), number of variables, types of variables, and how the data table will be set up.
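
One lightweight way to pin these details down is to write the data set description as a small, machine-readable specification. The sketch below is only an illustration: the `VariableSpec` and `DatasetSpec` classes, the variable names, and the sample size are all hypothetical and describe a made-up study.

```python
from dataclasses import dataclass, field


@dataclass
class VariableSpec:
    """One column of the planned data table (hypothetical example)."""
    name: str
    kind: str        # "quantitative" or "qualitative"
    role: str        # "predictor" or "response"
    units: str = ""  # leave empty for qualitative variables


@dataclass
class DatasetSpec:
    """Planned layout: one row per observation, one column per variable."""
    n_observations: int
    variables: list = field(default_factory=list)

    def summary(self) -> str:
        n_quant = sum(v.kind == "quantitative" for v in self.variables)
        n_qual = sum(v.kind == "qualitative" for v in self.variables)
        return (f"{self.n_observations} observations x "
                f"{len(self.variables)} variables "
                f"({n_quant} quantitative, {n_qual} qualitative)")


# Made-up plan for a study relating exercise to achievement scores.
plan = DatasetSpec(
    n_observations=120,
    variables=[
        VariableSpec("daily_exercise", "quantitative", "predictor", "minutes/day"),
        VariableSpec("achievement_score", "quantitative", "response", "points"),
        VariableSpec("school", "qualitative", "predictor"),
    ],
)
print(plan.summary())
```

Writing the plan this way makes it easy to check later that the collected data table actually matches what was planned.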

<br>


### 2. What are the hypotheses that you are evaluating? 

Define explicitly the hypothesis being evaluated _from a data perspective_ (e.g., "There should be a correlation between daily exercise rate and student achievement scores"). Also consider carefully what null you really want to evaluate.

<br>

### 3. What statistical approaches will you need in order to test your hypotheses? 

Know whether you are looking for differences or associations, simple pairwise evaluations or hierarchical models, univariate or multivariate models, etc.

<br>

### 4. What are the assumptions of the tools you'll be using? 

Identify the data characteristics that your desired statistical methods require in order to function properly (e.g., normality, equal variances).

<br>

### 5. What is the best way to visualize the key results? 

Know the best way to see the relationships you'll be evaluating. Sometimes a simple scatterplot or barplot will do. Other times you may need more complex visualization strategies. (_Note:_ the visualization should complement the statistical tools used.)

<br>

### 6. How will you know when you are done? 

Be specific about which tests are critical for addressing your hypotheses (e.g., p-values, bootstrapped confidence intervals, Bayes factors).
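
As one illustration of a stopping criterion stated in advance, the sketch below computes a percentile bootstrap confidence interval for a correlation and applies a decision rule fixed before looking at the data ("the hypothesis is supported if the 95% interval excludes zero"). The data are simulated, and the function names, the number of resamples, and the decision rule itself are assumptions made for this example.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the correlation of x and y."""
    rng = random.Random(seed)
    n = len(x)
    boot = []
    for _ in range(n_boot):
        # Resample (x, y) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # degenerate resample: correlation is undefined
        boot.append(pearson_r(xs, ys))
    boot.sort()
    lower = boot[int(alpha / 2 * len(boot))]
    upper = boot[int((1 - alpha / 2) * len(boot)) - 1]
    return lower, upper


# Simulated data standing in for exercise rate and achievement scores.
rng = random.Random(1)
exercise = [rng.gauss(0, 1) for _ in range(60)]
achievement = [e + 0.5 * rng.gauss(0, 1) for e in exercise]

low, high = bootstrap_ci(exercise, achievement)
print(f"r = {pearson_r(exercise, achievement):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
print("hypothesis supported" if low > 0 else "hypothesis not supported")
```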

<br>



# 2. Example data plan for class

<br>
This is the data plan put together for the following study:

https://www.mitpressjournals.org/doi/pdf/10.1162/NETN_a_00031 

<br>

__*Title*__: Local connectome phenotypes of individual differences

__*Background*__: Yeh et al. 2016 reported that local white matter architecture (i.e., the integrity of pathways that structurally connect disparate brain regions together) is highly unique to an individual, reflecting a combination of genetic predisposition and sculpting by experiences. It remains unclear whether local white matter architecture also reflects similarities between individuals along specific dimensions (e.g., social factors, cognitive abilities, and health outcomes).

__*Hypothesis*__: Individual differences in social, cognitive, & health factors can be predicted by the integrity of white matter pathways in the brain.

__*Data*__:

*Sample:*
* N=841 (370 male, mean age 28.76)

*Predictor variables:*
* Local connectome fingerprint (433,386 elements/subject)

*Response variables:*

_Qualitative_
* Gender
* Race
* Ethnicity
* Relationship Status

_Quantitative_
* Age (in years)
* Handedness
* Total Household Income
* Years of Education Completed
* Body Mass Index
* Mean Hematocrit Sample
* Diastolic Blood Pressure
* Systolic Blood Pressure
* Systolic-Diastolic Blood Pressure Ratio 
* Hemoglobin A1C
* Pittsburgh Sleep Quality Index
* NIH Picture Sequence Memory Test
* NIH Dimensional Change Card Sort Test
* NIH Flanker Inhibitory Control and Attention Test
* Penn Progressive Matrices: Number of Correct Responses
* Penn Progressive Matrices: Total Skipped Items
* Penn Progressive Matrices: Median Reaction Time for Correct Responses (sec)
* NIH Oral Reading Recognition Test
* NIH Picture Vocabulary Test
* NIH Toolbox Pattern Comparison Processing Speed Test
* Delay Discounting: Area Under the Curve (<200)
* Delay Discounting: Area Under the Curve (<40,000)
* Variable Short Penn Line Orientation: Total Number Correct
* Variable Short Penn Line Orientation: Median Reaction Time Divided by Expected Number of Clicks for Correct (sec)
* Variable Short Penn Line Orientation: Total Positions Off for All Trials
* Penn Word Memory Test:  Total Number of Correct Responses
* Penn Word Memory Test:  Median Reaction Time for Correct Responses (sec)
* NIH List Sorting Working Memory Test


__*Preprocessing Steps*__
* Local connectome fingerprints taken from previous study (Yeh et al. 2016).
* Base rates on all qualitative response variables
* Outlier detection on quantitative response variables (Grubbs' test)
* Summary statistics on all response variables: total sample, mean, median, skew, outliers, 95% confidence intervals.
* Confirmation of low dimensional structure in predictor variable (singular value decomposition).
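
The summary-statistics step in this list can be sketched with the Python standard library alone. This is a minimal illustration rather than the code used in the study: the `summarize` function is hypothetical, the skewness is the simple moment-based estimate, the confidence interval uses a normal approximation, and the BMI values are made up.

```python
import math
import statistics
from statistics import NormalDist


def summarize(values, confidence=0.95):
    """Basic summary statistics for one quantitative response variable."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation

    # Moment-based skewness: third central moment over variance^(3/2).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5

    # Normal-approximation confidence interval for the mean
    # (an assumption; a t-based interval is wider for small samples).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / math.sqrt(n)

    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "skew": skew,
        "ci": (mean - half_width, mean + half_width),
    }


# Made-up body mass index values, for illustration only.
bmi = [21.4, 24.9, 27.3, 22.8, 31.6, 25.1, 23.7, 38.2]
for name, value in summarize(bmi).items():
    print(name, value)
```

A positive skew, as in this toy sample, is the kind of result that would prompt a closer look at outliers before modeling.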

__*Statistical Analyses*__
* **Modeling**: The model for each response variable is ultra-high dimensional, with 433,386 feature variables and only 841 observations. We will need to integrate two high-dimensional analysis methods: __Principal Component Regression__ and __LASSO Regression__.

* **Evaluation**: 1) Hold-out test set prediction using M-fold cross-validation. 2) Significance testing on the predicted vs. observed correlation (on the hold-out test set) using a 10,000-iteration permutation test and a bootstrapped 95% confidence interval.

* **Adjustments**: A false-discovery rate adjustment (q<0.05) will be done for multiple comparisons.
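
To make the evaluation and adjustment steps concrete, here is a minimal sketch of a permutation test on a predicted vs. observed correlation, followed by a false-discovery rate adjustment. Several details are assumptions, since the plan does not specify them: the test is one-sided, the FDR procedure is Benjamini-Hochberg, the function names are hypothetical, and the data and p-values are simulated.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(predicted, observed, n_perm=10_000, seed=0):
    """One-sided p-value: how often does shuffling give r >= the real r?"""
    rng = random.Random(seed)
    r_real = pearson_r(predicted, observed)
    shuffled = list(observed)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between the two lists
        if pearson_r(predicted, shuffled) >= r_real:
            count += 1
    # Add one to numerator and denominator so p is never exactly zero.
    return (count + 1) / (n_perm + 1)


def benjamini_hochberg(p_values, q=0.05):
    """Return True for each test that survives FDR control at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= k_max
    return rejected


# Simulated hold-out predictions for one response variable.
rng = random.Random(2)
observed = [rng.gauss(0, 1) for _ in range(30)]
predicted = [o + 0.5 * rng.gauss(0, 1) for o in observed]
p = permutation_p_value(predicted, observed)
print(f"permutation p = {p:.4f}")

# Made-up p-values, one per response variable.
print(benjamini_hochberg([0.039, 0.001, 0.2, 0.008]))
```

Note that a p-value of 0.039 would pass an uncorrected 0.05 threshold but does not survive the adjustment in this toy example, which is why the correction belongs in the plan from the start.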

__*Figures*__
* Methods outline illustration
* Density-based clustering on the local connectome fingerprint
* Percent variance accounted for by principal components on local connectomes
* Distribution of between-subject correlations in the local connectomes
* Weight maps in brain space
* Correlation between models



# 3. Preregistration

<br>

Why have a data plan at all? Having a clear design for all your steps and stating them up front protects the so-called "hypothetico-deductive model" of the scientific method from the questionable practices that undermine it. This model, and the practices that threaten it, are illustrated here:

![The Research Cycle (taken from https://cos.io/rr/)](imgs/L3DestructiveCycle.png)
(Image borrowed from [Center for Open Science](https://cos.io/rr/) and adapted from [Chambers 2014](https://www.elsevier.com/reviewers-update/story/innovation-in-publishing/registered-reports-a-step-change-in-scientific-publishing))


<br>

When you take an exploratory analysis approach to _a priori_ scientific questions, there are many problems that can arise:

* **1. P-hacking:** Collecting data until analyses return significant results, or selectively reporting the analyses that reveal desirable outcomes.

* **2. Publication bias:** Journals rejecting manuscripts because they report negative or undesirable findings.

* **3. Lack of replication:** The inability to replicate previous findings in a meaningful way, which impedes the elimination of false discoveries.

* **4. HARKing:** Hypothesizing after results are known, where hypotheses are revised after data analyses but presented as if they were the _a priori_ hypotheses.

* **5. Low statistical power:** Increases the chances of missing true discoveries and reduces the likelihood that obtained positive results are real.

<br>

If you have a clear data plan and release it to the community _before_ you start your project, many of these problems can be avoided. This is the idea behind the concept of **preregistration.**

<br>

**Preregistration:** an open research practice where researchers are required to submit their research rationale, hypotheses, design and analytic strategy to a public repository or journal prior to beginning the study.


<br>

A growing number of peer-reviewed journals have preregistration options. (For a full list, visit the [Center for Open Science](https://cos.io/rr/) website.)

<br>

The general workflow for registering your research report (a.k.a. your "data plan") is as follows.

![Workflow (taken from https://cos.io/rr/)](imgs/L3RRWorkflow.png)
(Image reprinted from [Center for Open Science](https://cos.io/rr/))

Thus your study is evaluated at two different stages: design and post-analysis.


# Preregistration Questions (from OSF)

The [Open Science Framework (OSF)](http://www.osf.io) is a popular repository for posting project registrations. The key information you will have to provide in an OSF preregistration includes:

<br>

**1. Have any data been collected for this study already?**

<br>

**2. What's the main question being asked or hypothesis being tested in this study?**

<br>

**3. Describe the key dependent variable(s) specifying how they will be measured.**

<br>

**4. How many and which conditions will participants be assigned to?**

<br> 

**5. Specify exactly which analyses you will conduct to examine the main question/hypothesis.**

<br>

**6. Any secondary analyses?** 

<br> 

**7. How many observations will be collected or what will determine sample size? No need to justify decision, but be precise about exactly how the number will be determined.** 
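
For question 7, one precise way to state how the sample size will be determined is an a priori power calculation. The sketch below uses the standard Fisher z approximation for detecting a correlation with a two-sided test. The function name and the default values (alpha of 0.05, power of 0.80) are assumptions made for this example, not OSF requirements.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a correlation of size r.

    Uses the Fisher z approximation:
        N = ((z_alpha + z_beta) / atanh(r))^2 + 3
    for a two-sided test at significance level alpha.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)


# Smaller expected effects require many more observations.
for effect in (0.1, 0.3, 0.5):
    print(f"r = {effect}: N >= {n_for_correlation(effect)}")
```

Stating the expected effect size, alpha, and power in the registration lets a reader reproduce the planned sample size exactly.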






# Types of preregistration

The example above outlines one type of preregistration (a "Registered Report"). But there are many ways to preregister your study (taken from https://www.psychologicalscience.org/observer/research-preregistration-101).

<br>

**1. Preregistration (a.k.a. unreviewed preregistration).** 

The researcher creates as detailed a description of his or her plans for a study as possible and saves those plans in a time-stamped, uneditable archive. This record can be shared with reviewers, editors, and other researchers.

<br>

**2. Registered Reports (a.k.a. reviewed preregistration).** 

The researcher submits a detailed proposal for a study to a journal before conducting the study. These registered reports have the same virtues as preregistration, but they also address the problem of publication bias because the studies are published regardless of their outcomes. Registered Reports are most useful for well-defined research domains in which reviewers can reasonably assess the likelihood that a proposed study will be informative regardless of its outcome.

<br>

**3. Registered Replication Reports (RRR).** 

A variant of Registered Reports, RRRs are focused on direct replication of one or more original findings. Many labs follow the same preregistered plan, and the results from all of those independent studies are published collectively regardless of the outcomes of individual studies. 

## Pros & Cons of preregistration

(Content adapted from the APA's ["The promise of pre-registration in psychological research"](http://www.apa.org/science/about/psa/2015/08/pre-registration.aspx).)

<br>

### Pros:

* **Improved use of theory and stronger research methods due to the initial review.** Because researchers are forced to formulate a proposal for a new study without seeing results first, there is an expectation that they will do a better job of addressing research motivation and design. The result should be better studies that respond clearly to precisely formulated questions.

* **A decline in false-positive publications.** If publication decisions are based on outcome-neutral criteria — which are largely addressed before conducting the study — then researchers will not be motivated to engage in practices that increase the likelihood of making a type I error. Results from studies that utilize such practices, separate from a pre-registered analytic plan, would be publishable strictly as exploratory research.

<br>

### Cons:

* **Pre-registration could lead to undervaluing exploratory research.**  If pre-registered studies have greater merit because they are less susceptible to publication bias via such practices as p-hacking and HARKing, then studies using exploratory analyses may not be held in the esteem they presently enjoy. This could lead to a decline in funding for such projects, and the discoveries that such research can yield.

* **Negative impacts on students and young investigators.** Without research results to help indicate the value of a study, editors and reviewers for nonblind journals may rely more on researcher prestige to make decisions about accepting articles for pre-registration. 


