# Association tests between variant(s) and continuous phenotype
The phenotype data provided in the multi-omics publication are not categorical, so we have to adapt our approach to continuous data. In the lecture you learned that, instead of a $\chi^2$ test, linear or logistic regression models and correlation tests can be used. We will stick to linear correlation for now.

## Imports

## Task: Load the genotype csv and phenotype spreadsheets
This is routine by now: for the genotype dataframe, use the B=0 / H=1 / D=2 / U=NaN representation, and separate it from the locus information columns. As for the phenotype spreadsheets, always use the `split_cd_hfd(...)`-transformed version from now on.
Don't immediately drop rows or columns with NaNs this time. We will try to avoid wasting precious data, and work around missing values only where necessary.

## Task: Implement Pearson's r-based test for correlation between genotype and phenotype

Choose a phenotype whose genetic associations you want to study; one example is `Glucose_[mmol/L]` from the `Biochemistry` sheet.

For every SNP, perform a Pearson's correlation test between the SNP and the chosen phenotype. Notice that your phenotype dataframe (due to the `split_cd_hfd` transformation) has two entries for each strain now, while your genotype dataframe only has one. Since the `stats.pearsonr` function takes two vectors/Series with directly comparable elements, you will have to "stretch" your genotype data to allow a direct comparison. Pandas' `reindex` method can help you with this: you can force the genotype DF index to match the phenotype DF index.

The `stats.pearsonr` function doesn't just expect two vectors of the same length, it also can't handle NaN values. Make sure you remove the elements from both vectors at every position where either of them is NaN. Do this on a case-by-case basis using boolean masks: a Series' `.isna()` method gives you a boolean vector where `True` elements stand for missing values. You can combine and invert boolean vectors with logical operators (`|`, `&`, `~`), and use the resulting mask to select the same positions from both the genotype and the phenotype Series.
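As a rough illustration of the stretching and masking steps, here is a minimal sketch on made-up toy data. It assumes that the `split_cd_hfd`-transformed phenotype is indexed by a `(strain, diet)` MultiIndex and that the genotype of one locus is a Series indexed by strain; the level names `strain` and `diet` and all values are hypothetical, so adapt them to what your dataframes actually look like.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-ins for the real data (all values are made up for illustration).
strains = ["BXD1", "BXD2", "BXD3", "BXD4", "BXD5"]
geno = pd.Series([0, 2, 0, 2, np.nan], index=pd.Index(strains, name="strain"))

# Assumption: the transformed phenotype has one row per (strain, diet) pair.
pheno_index = pd.MultiIndex.from_product(
    [strains, ["CD", "HFD"]], names=["strain", "diet"]
)
pheno = pd.Series(
    [5.1, 6.0, 7.2, 8.1, 5.3, np.nan, 7.0, 8.4, 6.1, 6.9], index=pheno_index
)

# "Stretch" the genotype: repeat each strain's value for every diet row.
geno_stretched = geno.reindex(pheno.index, level="strain")

# Keep only the positions where both vectors have a value.
mask = ~(geno_stretched.isna() | pheno.isna())

r, p = stats.pearsonr(geno_stretched[mask], pheno[mask])
print(f"n = {mask.sum()}, r = {r:.3f}, p = {p:.3g}")
```

The mask is computed per locus/phenotype pair, so a missing value only costs you the observations it actually affects.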

Once you can do a Pearson's test between a single phenotype and a single locus, you can extend your code to run it for all loci. Combine all p-values into a single Series (indexed by Locus) for the next tasks.
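One possible way to extend the single-locus test to all loci is sketched below. It assumes a genotype dataframe with strains as rows and loci as columns (transpose yours if it is the other way around) and the same hypothetical `(strain, diet)` phenotype index; the helper name `pearson_p` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data, made up for illustration; the strains x loci layout is an assumption.
strains = ["BXD1", "BXD2", "BXD3", "BXD4", "BXD5", "BXD6"]
geno = pd.DataFrame(
    {
        "rs1": [0, 2, 0, 2, 0, 2],
        "rs2": [0, 0, 0, 2, 2, 2],
        "rs3": [2, 0, np.nan, 0, 2, 0],
    },
    index=pd.Index(strains, name="strain"),
)
geno.columns.name = "Locus"

pheno = pd.Series(
    [5.0, 5.5, 5.2, 5.8, 4.9, np.nan, 7.9, 8.5, 8.1, 8.8, 8.0, 8.6],
    index=pd.MultiIndex.from_product(
        [strains, ["CD", "HFD"]], names=["strain", "diet"]
    ),
)


def pearson_p(x, y):
    """Return the Pearson p-value of x vs. y, ignoring positions with NaN."""
    mask = x.notna() & y.notna()
    # Too few points or a constant vector: the correlation is undefined.
    if mask.sum() < 3 or x[mask].nunique() < 2 or y[mask].nunique() < 2:
        return np.nan
    _, p = stats.pearsonr(x[mask], y[mask])
    return p


# Repeat each strain's genotype row for every diet row of the phenotype.
geno_stretched = geno.reindex(pheno.index, level="strain")

# One p-value per locus (column), collected in a Series indexed by Locus.
p_values = geno_stretched.apply(lambda locus: pearson_p(locus, pheno))
print(p_values.sort_values())
```

Returning NaN for degenerate loci keeps the result aligned with the full list of loci; such entries can be dropped before plotting or correcting.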

## Task: Plot the corresponding Manhattan plot

You can re-use your code from the previous notebook.

## Task: Plot p-value histograms before and after multiple hypothesis testing correction

Use the same Benjamini-Hochberg correction you used previously. Did you find any significant SNPs at the 0.05 level after correction?
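If you want to double-check your earlier correction, the Benjamini-Hochberg adjustment can also be written out by hand. The sketch below is a plain NumPy/pandas version on made-up p-values; the function name `bh_adjust` is hypothetical and not necessarily the routine you used before (for example, statsmodels' `multipletests(..., method="fdr_bh")` should give the same adjusted values).

```python
import numpy as np
import pandas as pd


def bh_adjust(p_values: pd.Series) -> pd.Series:
    """Benjamini-Hochberg adjusted p-values; NaN entries are dropped first."""
    p = p_values.dropna()
    n = len(p)
    order = np.argsort(p.to_numpy())
    # Scale the i-th smallest p-value by n / i.
    scaled = p.to_numpy()[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(monotone, 1.0)
    return pd.Series(adjusted, index=p.index)


# Made-up p-values indexed by hypothetical locus names.
p_values = pd.Series([0.01, 0.04, 0.03, 0.005], index=["rs1", "rs2", "rs3", "rs4"])
q_values = bh_adjust(p_values)
n_significant = int((q_values < 0.05).sum())
print(q_values)
print("significant at 0.05:", n_significant)
```

Both `p_values` and `q_values` can then be passed to your histogram code with identical bins, which makes the before/after comparison easier to read.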



### Task: Take the locus with the lowest p-value, and visualize the phenotype values for the B (0) mice and the D (2) mice separately
Are there any differences between them? Did you expect to see any?

# Run association tests for all phenotypes versus a single locus

Moving on to phenotype association studies, we will now test multiple phenotypes for genetic associations. To determine how strongly each trait is associated with a single genetic factor, we have to repeat our analysis for every phenotype of interest and, of course, correct for multiple hypothesis testing afterwards. We then have to figure out which traits (if any) are most strongly associated with that locus.

### Subtask: Take all phenotype sheets, transform and concatenate them
Transformation means `split_cd_hfd` in this case. What is the size of the joint phenotype sheet?
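A minimal sketch of the concatenation step, assuming every transformed sheet shares the same hypothetical `(strain, diet)` row index so that the sheets can be joined column-wise. The toy frames below stand in for the output of `split_cd_hfd`; the sheet name `OtherSheet` and the trait names are made up.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["BXD1", "BXD2"], ["CD", "HFD"]], names=["strain", "diet"]
)

# Toy stand-ins for already transformed sheets (values are made up).
sheets = {
    "Biochemistry": pd.DataFrame(
        {"trait_a": [5.1, 6.0, 7.2, 8.1], "trait_b": [1.0, 1.1, 0.9, 1.3]},
        index=idx,
    ),
    # This sheet lacks one (strain, diet) row on purpose.
    "OtherSheet": pd.DataFrame({"trait_c": [20.0, 25.0, 22.0]}, index=idx[:3]),
}

# Column-wise concatenation; the dict keys become an outer column level.
joint = pd.concat(sheets, axis=1)
print(joint.shape)
print(joint.isna().sum())
```

Rows that are missing from a sheet show up as NaN in the joint dataframe rather than being dropped, which fits the "work around NaNs" strategy used earlier.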

## Task: Pick a locus, and run a Pearson's test between this locus and all phenotypes
This is the mirror image of the previous task, where you fixed a phenotype, and tested it against all loci. Now do it the other way around.
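The mirrored test could look roughly like the sketch below: one stretched locus is tested against every phenotype column, and the correlation coefficient, p-value and number of usable observations are collected per phenotype. The index layout, the helper name `pearson_masked` and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

strains = ["BXD1", "BXD2", "BXD3", "BXD4"]
idx = pd.MultiIndex.from_product([strains, ["CD", "HFD"]], names=["strain", "diet"])

# Genotype of one chosen locus, one value per strain (made up).
locus = pd.Series([0, 0, 2, 2], index=pd.Index(strains, name="strain"))

# Joint phenotype dataframe: rows are (strain, diet), columns are traits (made up).
phenos = pd.DataFrame(
    {
        "trait_a": [1.0, 1.2, 0.9, 1.1, 3.0, 3.3, 2.9, 3.1],
        "trait_b": [2.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0],
        "trait_c": [4.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 5.0],
    },
    index=idx,
)


def pearson_masked(x, y):
    """Pearson r, p-value and sample size, using only rows where both are present."""
    mask = x.notna() & y.notna()
    n = int(mask.sum())
    # Too few points or a constant vector: the correlation is undefined.
    if n < 3 or x[mask].nunique() < 2 or y[mask].nunique() < 2:
        return pd.Series({"r": np.nan, "p": np.nan, "n": n})
    r, p = stats.pearsonr(x[mask], y[mask])
    return pd.Series({"r": r, "p": p, "n": n})


locus_stretched = locus.reindex(phenos.index, level="strain")

# One row per phenotype with its r, p and n.
results = phenos.apply(lambda trait: pearson_masked(locus_stretched, trait)).T
print(results.sort_values("p"))
```

Keeping `n` next to each p-value is useful here, because different phenotypes can have very different amounts of missing data.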

As usual, do multiple testing correction on your p-values. Could you argue that we are overcorrecting them? Were we overcorrecting them previously?

## Task: Extract relevant information (effect size, significance, locus, etc.) about the phenotype with the lowest p-value

## Task: Interpret results
Is the p-value low enough to be considered significant? If yes, can you find genes in the region that may be causal?

## What confounding factors may have been ignored in the current approach? How could they be incorporated?
Remember: the BXD strains were bred in multiple laboratories over several decades.
http://www.genenetwork.org/mouseCross.html