# Hypothesis Group Notes - Plants & Python, 2021

## Group Members 

- Bleich, Andrew
- Herrera-Orozco, Héctor
- Hightower, Asia
- Leuenberger, Wendy
- Mendoza-Galindo, Eddy



## Hypothesis generation

### Broad topic

**We think it would help to have an overall topic to frame all other hypotheses.**

**Topic ideas:**
- Stressed versus unstressed tissues
- Regulatory genes
- Wildtype versus mutant
- Changes in gene expression

### Collaboration

Hypotheses should be generated and updated with regular input from the computational groups. Computational groups can help us develop hypotheses by discussing the data and methods that they have available. In turn, we can help them see how the data they are working on fit into the broad topic and the other hypotheses.

### What do we need from groups?

- Heavily annotated data and code
- What are the values for each column?
- Consistent names for everything
    - Resource for names/nomenclature
    - Consistent structure for comments/notes
- What is each group testing?

### Database questions

- What was each study in the database testing?
- Would similar studies (same experimental design) be repeated/mapped over, or would they serve as good replication?
- Is there bias in the data filtering?
    - We will need to describe the filtering methods in the manuscript
- Do we have the ability to contact the authors of the database if needed?
    - Dan (or another professor?) has been in contact with them

### Cheat Sheet 

**Components of cheat sheet**

- Resource for names/nomenclature
- Guidelines/examples of structure for comments and notes
- Big hypotheses
- How to access the data? 
- The cheat sheet needs to be easily accessed in multiple locations.




## Hypothesis

**After analysing the data, the group discussed which hypotheses could encompass the most information from the database. The conclusion was that the best approach would be to use broad hypotheses, as follows.**

### Hypothesis 1

**Title:** Examining homogeneity of experiment types in the database, starting with a heterogeneous set of gene expression data

**Hypothesis question:** How similar are treatment types/experimental purposes based on gene expression in a tissue? Treat similarly designed experiments as repeats to test variance.
    
**Training data:** 

**Input:** gene expression of tissues from similarly designed experiments (random data from ½ of cleaned data set).

**Factor levels:**
- Treatment/experiment purpose (e.g. light exposure, drought, flooding, heat stress)
- Genotype
- Tissue

**Expectation:** similar experiments should produce similar gene expression patterns based on genotype and tissue - homogeneous groups. 

**Things to consider**

**Mapper filter(s)**

**Machine learning:** do we want to look at all continuous features/variables for ranked list of features or do we want to consider filter variables that are important for the outcome?

**Test data:** 
Other random half of cleaned data set
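
The random ½ split between training and test data could be sketched as below. This is a minimal illustration using only the standard library; the function name `split_in_half`, the seed value, and the sample IDs are placeholders, not part of the actual data set.

```python
import random

def split_in_half(sample_ids, seed=0):
    """Shuffle sample IDs reproducibly and split them into two halves."""
    ids = sorted(sample_ids)    # sort first so the result does not depend on input order
    rng = random.Random(seed)   # local generator, leaves the global random state untouched
    rng.shuffle(ids)
    midpoint = len(ids) // 2
    return ids[:midpoint], ids[midpoint:]

# Placeholder sample IDs standing in for rows of the cleaned data set
sample_ids = [f"sample_{i}" for i in range(10)]
training_ids, test_ids = split_in_half(sample_ids, seed=42)
```

Fixing the seed means every group that runs the split gets the same two halves, which matters if several groups are expected to train and test on identical subsets.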

### Hypothesis 2

**Title:** Above/below ground differences (non-transition)

**Hypothesis question:** What does the shape of gene expression data look like when above ground and below ground tissues are compared to each other, excluding genes known to be involved in the transition from below ground to above ground growth (such as those associated with the hormone GA)?

**Data set:** remove seedling/seed and whole plant samples, aggregate all above ground tissues together and all below ground tissues together, and remove transitional genes (random data from ½ of cleaned data set).
Additionally, consider seed/seedling and whole plant samples independently from the above data set and compare expression.
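
The filtering described above could look roughly like the sketch below. It assumes each sample carries an `AboveBelow` label (as in the tissue categorization later in these notes) and a dictionary of expression values; the gene IDs, the `TRANSITION_GENES` set, and the record layout are made up for illustration.

```python
# Hypothetical gene IDs flagged as involved in the below-to-above-ground transition
TRANSITION_GENES = {"gene_T1", "gene_T2"}

def build_above_below_subset(samples, transition_genes=TRANSITION_GENES):
    """Keep Above/Below samples and drop transition genes from their expression values."""
    subset = []
    for sample in samples:
        if sample["AboveBelow"] not in {"Above", "Below"}:
            continue  # seeds, whole plants, and uncategorized samples are handled separately
        expression = {
            gene: value
            for gene, value in sample["expression"].items()
            if gene not in transition_genes
        }
        subset.append({"AboveBelow": sample["AboveBelow"], "expression": expression})
    return subset

# Made-up records, not real data
samples = [
    {"AboveBelow": "Above", "expression": {"gene_A": 5.0, "gene_T1": 2.0}},
    {"AboveBelow": "Below", "expression": {"gene_A": 1.5, "gene_T1": 7.0}},
    {"AboveBelow": "WholePlant", "expression": {"gene_A": 3.0, "gene_T1": 4.0}},
    {"AboveBelow": "Seed", "expression": {"gene_A": 0.5, "gene_T1": 0.1}},
]
subset = build_above_below_subset(samples)
```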

**Factor level:**
- Above/below ground tissue
- genotype
- treatment

**Expectation:** similar gene expression patterns for tissues based on above/below ground, genotype, and treatment type (e.g. stressed roots may look similar to one another by stress type, and stressed hypocotyls may look similar to each other and to roots under the same stress).

**Things to consider**

**Mapper filter(s)**

**Machine learning:** do we want to look at all continuous features/variables for ranked list of features or do we want to consider filter variables that are important for the outcome?

**Test data:** 
Other random half of cleaned data set

### Hypothesis 3

**Title:** Examining gene expression of controls, particularly housekeeping genes, compared to genes being tested

**Hypothesis question:** For datasets that use similar controls, particularly housekeeping genes, are those controls performing as expected?

**Data set:** housekeeping genes utilized in projects (random data from ½ of cleaned data set).

**Factor level:**
- Housekeeping genes in wild-type vs. mutant samples (i.e. genotypes and treatments) vs. a random sample of test genes

**Expectation:** control (housekeeping) genes should be expressed at the same/similar levels regardless of genotype or treatment to provide a baseline to compare against and therefore should cluster together. 
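
One simple way to check this expectation before any clustering is to compare how much each gene varies across samples. The sketch below uses the coefficient of variation (standard deviation divided by mean) as a stability measure. This is our own suggestion rather than a method the group has settled on, and the gene names and values are invented.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumes a positive mean)."""
    return pstdev(values) / mean(values)

# Invented expression values across samples from different genotypes/treatments
expression = {
    "housekeeping_gene": [10.0, 10.5, 9.5, 10.0],
    "test_gene": [2.0, 15.0, 7.0, 30.0],
}
cv = {gene: coefficient_of_variation(values) for gene, values in expression.items()}

# A well-behaved control should have a much lower value than a typical test gene
stable_controls = [gene for gene, value in cv.items() if value < 0.1]
```

The 0.1 cutoff is arbitrary and would need to be chosen after looking at the real distribution of values.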

**Things to consider**

**Mapper filter(s)**

**Machine learning:** do we want to look at all continuous features/variables for ranked list of features or do we want to consider filter variables that are important for the outcome?

**Test data:**
Other random half of cleaned data set

### Hypothesis 4

**Title:** Changes in regulatory and coding potential gene expression

**Hypothesis question:** How do transcription profiles differ across tissues/genotypes/stress conditions depending on the transcript type (coding or not, regulatory or functional protein)?

**Data set:** divide the dataset into coding transcripts, non-coding transcripts (long or pre-sRNA transcripts), and transcription factors

**Factor level:**
coding transcripts vs. non-coding transcripts vs. pre-sRNA transcripts vs. transcription factors / above vs. below ground tissues / wild-type vs. mutant / stressed vs. non-stressed

**Things to consider**

**Mapper filter(s)**

**Machine learning:** 
tag each transcript by its type
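
Tagging could be as simple as the sketch below, assuming some annotation source provides per-transcript flags. The field names (`is_coding`, `is_transcription_factor`, `is_srna_precursor`) and transcript IDs are hypothetical; the real annotation may be structured quite differently.

```python
def tag_transcript(annotation):
    """Return a single type label for a transcript from hypothetical annotation flags."""
    if annotation.get("is_transcription_factor"):
        return "transcription_factor"  # checked first, since TFs are also protein-coding
    if annotation.get("is_coding"):
        return "coding"
    if annotation.get("is_srna_precursor"):
        return "pre_sRNA"
    return "non_coding"

# Invented annotations, for illustration only
annotations = {
    "tx1": {"is_coding": True, "is_transcription_factor": True},
    "tx2": {"is_coding": True},
    "tx3": {"is_coding": False, "is_srna_precursor": True},
    "tx4": {"is_coding": False},
}
tags = {tx: tag_transcript(info) for tx, info in annotations.items()}
```

The order of the checks matters: because transcription factors are themselves coding transcripts, testing for them first keeps the categories mutually exclusive.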

**Test data:** each subset 

## Hypothesis Survey 

To choose a hypothesis that the whole class could work on, the group decided to make a Google survey listing the proposed hypotheses so that the class could express their preferences.

### Link to Google Survey

https://docs.google.com/forms/d/11Gwd9KIHXhRkYJOU6KslC4cAx3E7Br_aZn8TUMQC83s/edit?ts=618c320b

### Results from Hypothesis Survey 

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## Wrap up from survey 

1) The above ground/below ground and regulatory hypotheses are currently the most supported, with about half of the class in favor.

2) Reading through the responses, the difficulties mentioned were metadata, data messiness, and modeling.

3) We understand that the datasets are messy, with missing data and a lot still to be figured out. The hypotheses help us narrow down and focus on these questions and look into some of these feasibility issues.

4) Some of the computational groups reported finding more categorical variables worth investigating, so the hypothesis group decided to look into the metadata and propose some new categorical variables.


## Link to the complete notes from the Hypothesis group 

https://docs.google.com/document/d/13_fr5Lsj7X0gnePo39X8HMG-Eb3Liq4q9F3bG4GLroE/edit#

## Tissue type categorization

We went through the tissue types to put them in above ground and below ground categories. We focused initially on just categorizing the above ground and below ground groups, but also added some information about whole plants and seeds at the request of the data group. As a result, we have 95% of the data categorized.

The categorizations include 23 different types of tissue in the `Tissue` column. These tissue types standardize the names, as the raw names in the data include various spellings and shorthand. We then categorized tissues into the `VegetativeRepro` column as reproductive, vegetative, or one of a few other categories, as follows:

- Vegetative (8129 samples; 41%)
- Reproductive (1386 samples; 7%)
- Root (2092 samples; 11%)
- Hypocotyl (251 samples; 1%)
- Whole plant (6741 samples; 34%)
- Uncategorized (1043 samples; 5%)

The final categorization is the above ground/below ground classification. This column, `AboveBelow`, is the one that best addresses the hypothesis. Categories include:

- Above (8973 samples; 46%)
- Below (2343 samples; 12%)
- WholePlant (6741 samples; 34%)
- Seed (510 samples; 3%)
- Uncategorized (1075 samples; 5%)
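
As a quick consistency check, the counts listed above can be turned back into percentages. The dictionary below copies the `AboveBelow` counts from these notes; the variable names are only for illustration.

```python
# Sample counts per AboveBelow category, copied from the list above
above_below_counts = {
    "Above": 8973,
    "Below": 2343,
    "WholePlant": 6741,
    "Seed": 510,
    "Uncategorized": 1075,
}

total = sum(above_below_counts.values())  # 19642 samples in total

# Percentage of all samples per category, rounded to whole numbers
percentages = {
    category: round(100 * count / total)
    for category, count in above_below_counts.items()
}
# percentages == {"Above": 46, "Below": 12, "WholePlant": 34, "Seed": 3, "Uncategorized": 5}
```

The total of 19642 matches the sum of the `VegetativeRepro` counts, so both columns cover the same set of samples.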

There is also a column called `Debatable` for a few raw names that we weren't certain how to categorize. Its value is `Yes` if the category was debatable, and `NA` otherwise.
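
The name standardization can be thought of as a lookup from raw names to the new columns. The sketch below shows the idea only: the raw names, the `Tissue` values, and the `categorize` helper are invented and are not the group's actual assignments. Unmatched names fall back to `Uncategorized`, and `None` stands in for `NA`.

```python
# Hypothetical lookup: normalized raw name -> values for the new columns
TISSUE_LOOKUP = {
    "leaf": {"Tissue": "leaf", "VegetativeRepro": "Vegetative", "AboveBelow": "Above"},
    "leaves": {"Tissue": "leaf", "VegetativeRepro": "Vegetative", "AboveBelow": "Above"},
    "root": {"Tissue": "root", "VegetativeRepro": "Root", "AboveBelow": "Below"},
    "roots": {"Tissue": "root", "VegetativeRepro": "Root", "AboveBelow": "Below"},
}

UNCATEGORIZED = {
    "Tissue": None,
    "VegetativeRepro": "Uncategorized",
    "AboveBelow": "Uncategorized",
}

def categorize(raw_name, debatable_names=frozenset()):
    """Map a raw tissue name to the standardized columns, ignoring case and spacing."""
    key = raw_name.strip().lower()
    row = dict(TISSUE_LOOKUP.get(key, UNCATEGORIZED))  # copy so the lookup is not mutated
    row["Debatable"] = "Yes" if key in debatable_names else None
    return row

leaf_row = categorize("  Leaves ")
unknown_row = categorize("mystery tissue")
```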