# Manual for phenotype analysis using phloemfinder
## loading the data

You will load your data as an object of the PhenotypeAnalysis class, so that you can apply the functions of this class to your data. You therefore need to give the object a name, which you will refer to when using these functions. In this example, I will call my object 'example'. To specify the name and location of your data file, enter the path to your data at 'bioassay_csv=', as done in the example.

Make sure your data is in .csv format, uses '_' instead of ' ' (so no spaces), and is in lower case (so no capitals). The data should be in long format and contain columns with the following information:

* Sample identifiers, a unique ID for each sample that could for example be formatted as 'genotype_replicate'. Suggestion for column name: 'sample_id'
* Grouping variables, for example genotypes or treatments. Suggestion for column name: 'genotype'
* The time at which bioassay scoring was performed, for example the number of days after infection. Suggestion for column name: 'day'
* The developmental stages that were scored during the bioassay. Suggestion for column name: 'stage'
* The counts of each stage per timepoint per sample. Suggestion for column name: 'number'
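
As a quick illustration of the format described above, the sketch below builds a small long-format table and checks it for spaces and capital letters. The data values and the helper `find_format_problems` are hypothetical and are not part of phloemfinder; the sketch only uses the Python standard library.

```python
import csv
import io

# Hypothetical example data in long format: one row per sample, day and stage.
EXAMPLE_CSV = """sample_id,genotype,day,stage,number
cv_1,cv,5,eggs,20
cv_1,cv,5,first_instar,3
wild_1,wild,5,eggs,18
wild_1,wild,5,first_instar,0
"""


def find_format_problems(csv_text):
    """Return all headers or values that contain spaces or capital letters."""
    reader = csv.reader(io.StringIO(csv_text))
    problems = []
    for row in reader:
        for value in row:
            if " " in value or value != value.lower():
                problems.append(value)
    return problems


problems = find_format_problems(EXAMPLE_CSV)
print(problems)  # an empty list means no spaces or capitals were found
```

To check your own file, you could read its contents into a string and pass that string to the helper instead of `EXAMPLE_CSV`.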

First load the PhenotypeAnalysis class:

In [None]:
from phloemfinder.phenotype_analysis import PhenotypeAnalysis

Then load your data as object of the class:

In [None]:
example = PhenotypeAnalysis(bioassay_csv="../01_data/bioassay_daya/example_data.csv")

## preparing the data
A few steps are now required to prepare the data before you can start with the analysis.

### reshaping
If you used the column names suggested above, you can immediately run the next function. Otherwise, indicate which column names you used by changing the names between the quotation marks.

In [None]:
example.reshape_to_wide(
    sample_id='sample_id', 
    grouping_variable='genotype', 
    developmental_stages='stage', 
    count_values='number', 
    time='day')
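
To make clear what going from long to wide format means, here is a minimal, dependency-free sketch. It is not the implementation used by phloemfinder; the records and the helper `long_to_wide` are hypothetical. Each stage ends up as its own entry per sample and day, instead of one row per stage.

```python
from collections import defaultdict

# Hypothetical long-format records: (sample_id, genotype, day, stage, number)
long_rows = [
    ("cv_1", "cv", 5, "eggs", 20),
    ("cv_1", "cv", 5, "first_instar", 3),
    ("cv_1", "cv", 7, "eggs", 12),
    ("cv_1", "cv", 7, "first_instar", 9),
]


def long_to_wide(rows):
    """Collect the counts of all stages for one sample and day in one record."""
    wide = defaultdict(dict)
    for sample_id, genotype, day, stage, number in rows:
        wide[(sample_id, genotype, day)][stage] = number
    return dict(wide)


wide_rows = long_to_wide(long_rows)
for key, counts in wide_rows.items():
    print(key, counts)
```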

### correcting for separately counting early and late 4th instars and exuviae
With this step you can correct for counting the exuviae and the early and late 4th instars separately. This will give you the total number of nymphs that reached the 4th instar stage per sample per timepoint, under the name 'fourth_instar'. Check the following before using the function below:

* Check the names of the stages in your data and change them in the function if necessary.
* If you counted the exuviae as a separate stage, keep 'seperate_exuviea' as True. Otherwise, set 'seperate_exuviea' to False.
* If you removed the late 4th instar nymphs after counting them, keep 'late_last_stage_removed' as True. Otherwise, set it to False.
* If you kept the early 4th instar nymphs on the leaf until they developed into late 4th instars, keep 'early_last_stage_kept' as True. Otherwise, set it to False.

In [None]:
example.combine_seperately_counted_versions_of_last_recorded_stage(
    exuviea='exuviea',
    late_last_stage='late_fourth_instar', 
    early_last_stage='early_fourth_instar',
    seperate_exuviea=True, 
    late_last_stage_removed=True, 
    early_last_stage_kept=True)

### calculating the cumulative development
The last step in the data preparation is the calculation of the cumulative development to each stage over time. Check the names of the stages in your data and change them in the function if necessary. If you combined the early and late 4th instars and exuviae in the previous step, these are now treated as one developmental stage. So if you counted 1st, 2nd, 3rd and 4th instars, 'n_developmental_stages' should be 4 (eggs are not included in the developmental stages). If you combined the 2nd and 3rd instars in your count, 'n_developmental_stages' should be 3.

In [None]:
example.convert_counts_to_cumulative(
    n_developmental_stages=4, 
    sample_id='sample_id', 
    eggs='eggs', 
    first_stage='first_instar', 
    second_stage='second_instar', 
    third_stage='third_instar', 
    fourth_stage='fourth_instar')
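
One plausible reading of "cumulative development to each stage" is that a nymph counted in a later stage has also passed through all earlier stages. The sketch below illustrates that idea for a single sample on a single day. This is an assumption made for illustration and not necessarily how phloemfinder computes it; the counts and the helper `reached_at_least` are hypothetical.

```python
# Hypothetical counts for one sample on one day, ordered from earliest to latest stage.
stage_order = ["first_instar", "second_instar", "third_instar", "fourth_instar"]
counts = {"first_instar": 5, "second_instar": 3, "third_instar": 2, "fourth_instar": 1}


def reached_at_least(counts, stage_order):
    """For each stage, sum the counts of that stage and all later stages."""
    cumulative = {}
    running_total = 0
    # Walk from the latest stage back to the earliest one.
    for stage in reversed(stage_order):
        running_total += counts[stage]
        cumulative[stage] = running_total
    return cumulative


cumulative = reached_at_least(counts, stage_order)
print(cumulative)
```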

## plotting
Before plotting, decide in which order the genotypes should appear (from left to right on the x-axis).


In [None]:
example.prepare_for_plotting(order_of_groups=['cv', 'wild'])

You're now ready to have a look at your data!

### boxplots per stage
It is advisable to start with boxplots, to get an overview of all stages. The top row of boxplots displays the absolute number of nymphs that managed to develop to each stage over the course of the experiment per grouping variable (genotype, treatment, etc.). The bottom row shows the relative development to each stage and the hatching rate. You can decide for yourself to which stage the development should be made relative.

In [None]:
example.plot_counts_per_stage(
    grouping_variable='genotype', 
    eggs='eggs', 
    first_stage='first_instar', 
    second_stage='second_instar', 
    third_stage='third_instar', 
    fourth_stage='fourth_instar', 
    make_nymphs_relative_to='first_instar',
    absolute_x_axis_label='genotype',
    absolute_y_axis_label='counts (absolute)',
    relative_x_axis_label='genotype',
    relative_y_axis_label='relative to eggs')
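
To illustrate what making the development relative to a stage means, the sketch below divides the count of every stage by the count of a chosen reference stage. It is a simplified illustration under that assumption and not the phloemfinder implementation; the counts and the helper `make_relative` are hypothetical.

```python
def make_relative(cumulative_counts, reference_stage):
    """Divide each stage count by the count of the reference stage."""
    reference = cumulative_counts[reference_stage]
    if reference == 0:
        # No nymphs reached the reference stage, so a ratio is undefined.
        return {stage: float("nan") for stage in cumulative_counts}
    return {stage: n / reference for stage, n in cumulative_counts.items()}


# Hypothetical cumulative counts for one sample.
sample = {"first_instar": 10, "second_instar": 8, "third_instar": 5, "fourth_instar": 4}
relative = make_relative(sample, reference_stage="first_instar")
print(relative)
```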

### curve of development over time
If you have identified the developmental stage that you want to zoom in on, you can now take the speed of development into account. By fitting the data to a three-parameter log-logistic model (an S-curve that starts at 0), you can compare the different groups in more detail. The function below returns a plot with a curve for each group and a table with the values of the model parameters for each group and the reduced chi-squared.

Explanation of the table:

* slope: the slope of the curve, with the standard deviation
* maximum: the maximum value of the curve, with the standard deviation
* emt50: the EmT50, the timepoint at which 50% of the nymphs have developed to the stage of interest, with the standard deviation
* reduced_chi2: the reduced chi-squared is provided to assess the goodness of fit of the fitted models for each group (genotype, treatment, etc.). Optimally, the reduced chi-squared should approach the number of observation points per sample. A much larger reduced chi-squared indicates a bad fit. A much smaller reduced chi-squared indicates overfitting of the model.

In [None]:
example.plot_development_over_time_in_fitted_model(
    sample_id='sample_id', 
    grouping_variable='genotype',
    time='day',
    stage_of_ineterest='fourth_instar',
    use_relative_data=True, 
    make_nymphs_relative_to='first_instar',
    x_axis_label='days after infection',
    y_axis_label='development to 4th instar stage (relative to 1st instars)',
    predict_for_n_days=10)
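
To give a feeling for the three parameters in the table, here is a minimal sketch of a three-parameter log-logistic curve. The parameterization below is a common one and is an assumption; the exact formula used by phloemfinder may differ (for example in the sign convention of the slope). The function `log_logistic` and the parameter values are hypothetical.

```python
import math


def log_logistic(t, slope, maximum, emt50):
    """Three-parameter log-logistic curve that starts at 0 for t = 0."""
    if t <= 0:
        return 0.0
    # At t == emt50 the exponent is 0, so the curve is at half of its maximum.
    return maximum / (1.0 + math.exp(-slope * (math.log(t) - math.log(emt50))))


# Hypothetical parameter values: the curve levels off at 0.6 and is halfway on day 8.
for day in [0, 4, 8, 12, 16]:
    print(day, round(log_logistic(day, slope=4.0, maximum=0.6, emt50=8.0), 3))
```

In this parameterization, a larger slope gives a steeper S-curve, the maximum is the plateau of the curve, and the emt50 is the day on which half of that plateau is reached.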

### curve of hatching and survival over time
Alternatively, you could compare the total number of living nymphs (so excluding the eggs) over time. The data are fitted to a log-normal model (a right-skewed bell curve). The function below returns a plot with a curve for each group and a table with the values of the model parameters for each group and the reduced chi-squared.

Explanation of the table:

* AUC: the area under the curve, with the standard deviation
* median: the median timepoint, with the standard deviation
* shape: the shape of the curve, with the standard deviation. A larger number means a more strongly skewed curve.
* reduced_chi2: the reduced chi-squared is provided to assess the goodness of fit of the fitted models for each group (genotype, treatment, etc.). Optimally, the reduced chi-squared should approach the number of observation points per sample. A much larger reduced chi-squared indicates a bad fit. A much smaller reduced chi-squared indicates overfitting of the model.

In [None]:
example.plot_survival_over_time_in_fitted_model(
    sample_id='sample_id', 
    grouping_variable='genotype',
    time='day',
    stage_of_ineterest='first_instar', 
    x_axis_label='days after infection',
    y_axis_label='total number of nymphs (relative to first instars)',
    predict_for_n_days=10)
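
To show how the AUC, median and shape parameters relate to the curve, here is a minimal sketch of a log-normal curve scaled by its area. This parameterization is an assumption and may differ from the one used by phloemfinder. The functions `log_normal_curve` and `area_under_curve` and the parameter values are hypothetical.

```python
import math


def log_normal_curve(t, auc, median, shape):
    """Log-normal density scaled so that the total area under the curve is auc."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - math.log(median)) / shape
    return auc / (t * shape * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)


def area_under_curve(auc, median, shape, t_max=200.0, steps=20000):
    """Approximate the area between 0 and t_max with the trapezoidal rule."""
    dt = t_max / steps
    values = [log_normal_curve(i * dt, auc, median, shape) for i in range(steps + 1)]
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))


# Hypothetical parameter values.
area = area_under_curve(auc=30.0, median=10.0, shape=0.5)
print(round(area, 3))
```

With these definitions, half of the area lies before the median timepoint, and a larger shape value stretches the right-hand tail of the curve.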