Review of analysis presentation concepts

                                                                Biological analysis                                                                            


• Amplicon sequencing vs. Whole genome sequencing:

Amplicon sequencing amplifies and sequences specific regions of DNA, usually by using PCR to target genes or loci of interest. This approach is cost-effective and well suited to identifying genetic variation or diversity within specific genes or taxa. However, it gives only a partial view of the genome, because only the selected regions are analyzed. Amplicon sequencing is commonly used in microbial community studies, environmental DNA research, and population genetics when the goal is to examine known genetic markers rather than the entire genome. Whole genome sequencing (WGS), on the other hand, aims to determine the complete DNA sequence of an organism’s genome. It provides information about genes, regulatory regions, and structural variation across the whole genome, making it powerful for studying evolution, adaptation, and functional genomics. WGS requires more data, time, and computational resources than amplicon sequencing, but it captures both known and previously uncharacterized genetic variation. In conservation biology, WGS helps assess genetic health, identify inbreeding or adaptive traits, and guide species management and restoration efforts.


• DNA barcoding:

DNA barcoding is a technique used to identify species based on a short, standardized region of DNA that acts like a genetic barcode. For animals, this region is often a segment of the mitochondrial cytochrome c oxidase I (COI) gene, while plants and fungi use other markers (commonly rbcL and matK for plants and the ITS region for fungi). By comparing the sequence from an unknown sample to a database of known barcodes, researchers can quickly determine the species or detect cryptic diversity. This method is widely used in conservation biology for monitoring biodiversity, detecting illegal wildlife trade, and identifying species from environmental samples such as soil or water. DNA barcoding is fast and, when the reference database covers the taxa in question, accurate. It remains useful even when physical characteristics are damaged or missing.
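The database comparison described above can be sketched in a few lines. This is a minimal illustration, assuming the sequences are already aligned and of equal length. The function names, species labels, and 10-base sequences are hypothetical toy values (real COI barcodes are roughly 650 bases long), and real pipelines use alignment tools and curated reference databases rather than a direct position-by-position comparison.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_match(query: str, reference: dict) -> tuple:
    """Return (species, identity) for the reference barcode closest to the query."""
    scored = ((name, percent_identity(query, seq)) for name, seq in reference.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy reference library, for illustration only.
reference = {
    "Species A": "ACGTACGTAC",
    "Species B": "ACGTTCGTCA",
    "Species C": "TTGTACGAAC",
}
query = "ACGTACGTAA"

species, identity = best_match(query, reference)
print(species, identity)  # Species A 90.0
```

In practice a match is only accepted above some identity threshold, and that threshold depends on the marker and the taxonomic group.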


• Diversity metrics - Alpha vs. beta diversity metrics:

Alpha diversity measures the diversity within a single site, habitat, or sample. It reflects both the number of species (richness) and how evenly individuals are distributed among them (evenness) in that local community. Common alpha diversity indices include the Shannon Index, which accounts for both richness and evenness, and the Simpson Index, which gives more weight to dominant species. In conservation, alpha diversity helps evaluate how diverse a single ecosystem or sample is and can indicate local environmental health or habitat quality. Beta diversity compares the diversity between different sites, habitats, or samples to measure how distinct their species compositions are. It reflects species turnover, or how much one community differs from another, using metrics such as Bray–Curtis dissimilarity, which is based on abundances, or Jaccard distance, which is based on presence or absence. High beta diversity means the communities share few species, while low beta diversity means their compositions are similar. In conservation work, beta diversity helps identify habitat uniqueness and prioritize areas for protection to preserve overall biodiversity.
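The four indices named above can be computed directly from species counts. The sketch below assumes each site is a list of counts in the same species order. The site data are hypothetical, and the Simpson index is shown in its Gini-Simpson form (1 minus the sum of squared proportions), which is only one of several conventions.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over species that are present."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2); higher values mean more diversity."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def bray_curtis(u, v):
    """Abundance-based dissimilarity between two sites (0 = identical)."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(a + b for a, b in zip(u, v))


def jaccard_distance(u, v):
    """Presence/absence dissimilarity: 1 - shared species / all species seen."""
    present_u = {i for i, c in enumerate(u) if c > 0}
    present_v = {i for i, c in enumerate(v) if c > 0}
    return 1.0 - len(present_u & present_v) / len(present_u | present_v)


# Hypothetical counts for four species at three sites.
site_a = [10, 10, 10, 10]   # even community
site_b = [37, 1, 1, 1]      # same richness, one dominant species
site_c = [0, 0, 20, 20]     # shares only two species with site_a

print(shannon(site_a), shannon(site_b))   # the even site scores higher
print(simpson(site_a))                    # 0.75
print(bray_curtis(site_a, site_c))        # 0.5
print(jaccard_distance(site_a, site_c))   # 0.5
```

Sites A and B have the same richness, so the gap between their Shannon values comes entirely from evenness.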




                                                                    Data analysis                                                                              


• Extrapolation/normalizing data:
When is that done/necessary?


Extrapolation and normalization are techniques used to make ecological data more comparable and meaningful across samples or conditions. Normalizing data involves adjusting values to a common scale, such as converting raw species counts into proportions or rarefying samples so each has the same sequencing depth. This is necessary when datasets differ in sample size, sampling effort, or sequencing coverage, so that comparisons of diversity or abundance are fair and not biased by unequal data collection. Extrapolation is used when scientists predict values beyond the observed data, such as estimating the species richness that would be found if more samples were collected. It helps account for undetected species or incomplete sampling, giving a better picture of total biodiversity. Both techniques are important in conservation research for drawing valid, standardized conclusions about ecosystem diversity and health.
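A minimal sketch of both ideas follows, using hypothetical counts. Proportions and rarefaction stand in for normalization. For extrapolation it uses the bias-corrected Chao1 estimator, which is one common richness estimator chosen here as an assumption, not something the notes above prescribe.

```python
import random


def to_proportions(counts):
    """Convert raw counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def rarefy(counts, depth, seed=0):
    """Randomly subsample `depth` individuals without replacement."""
    pool = [species for species, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of individuals in the sample")
    rarefied = [0] * len(counts)
    for species in random.Random(seed).sample(pool, depth):
        rarefied[species] += 1
    return rarefied


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)   # species seen exactly once
    doubletons = sum(1 for c in counts if c == 2)   # species seen exactly twice
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical sample: 9 observed species, 102 individuals.
sample = [50, 30, 10, 5, 2, 2, 1, 1, 1]

print(to_proportions(sample)[:3])
print(rarefy(sample, depth=40))   # same species order, 40 individuals in total
print(chao1(sample))              # 10.0, i.e. about one species likely undetected
```

Rarefying discards data, so it is usually applied at the depth of the smallest sample that is worth keeping.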

• Lag analysis:

Lag analysis examines the delayed relationship between two variables over time to identify how changes in one factor influence another after a certain period. In ecology and conservation, it helps detect time lags between environmental changes, such as habitat loss or temperature shifts, and species responses, like population decline or migration. This is important because ecological effects often do not appear immediately, and failing to account for lags can lead to misleading conclusions. Lag analysis can reveal patterns such as delayed recovery after restoration or gradual decline following disturbance. Understanding these delays helps conservationists design better management strategies and predict long-term ecosystem responses.
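One simple way to look for a delay is to correlate the driver at time t with the response at time t + lag for a range of lags. The sketch below assumes evenly spaced observations. Both series are hypothetical, and the response is constructed to copy the driver two steps later so that the expected answer is known in advance.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlations(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]   # drop the last `lag` driver values
        y = response[lag:]                # drop the first `lag` response values
        result[lag] = pearson(x, y)
    return result


# Hypothetical series; the two leading zeros are placeholder padding.
driver = [3, 5, 2, 8, 6, 1, 7, 4, 9, 2, 5, 6]
response = [0, 0] + driver[:-2]

correlations = lagged_correlations(driver, response, max_lag=4)
best_lag = max(correlations, key=correlations.get)
print(best_lag)  # 2
```

Real ecological series are usually autocorrelated, which can inflate lagged correlations, so a peak like this is a starting point for analysis and not evidence of a causal delay.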


                                                                    Modeling concepts                                                                          


• Generalized linear mixed effect model:

A generalized linear mixed effect model (GLMM) is a statistical approach that extends generalized linear models by including both fixed effects, variables of interest that apply to all observations, and random effects, variables that account for random variation among groups, such as sites or individuals. This makes GLMMs powerful for analyzing ecological data that have hierarchical or repeated measures structures, such as observations taken from multiple locations or species over time. GLMMs can handle different types of response data, including counts, proportions, and binary outcomes, by pairing an appropriate error distribution (such as Poisson or binomial) with a link function like log or logit. In conservation biology, they are commonly used to study factors affecting species abundance, distribution, or survival while accounting for natural variability between habitats or populations.
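The sketch below shows only how fixed and random effects combine on the link scale and are then mapped back to a count or a probability. It does not fit a model. All coefficients, site names, and site effects are hypothetical values chosen for illustration.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))


def expected_count(x, site_effect, intercept, slope):
    """Log link (count data): log(mu) = intercept + slope * x + site_effect."""
    return math.exp(intercept + slope * x + site_effect)


def occurrence_probability(x, site_effect, intercept, slope):
    """Logit link (binary data): logit(p) = intercept + slope * x + site_effect."""
    return inverse_logit(intercept + slope * x + site_effect)


# Hypothetical random intercepts for three sites (deviations from the average site).
site_effects = {"site_1": -0.3, "site_2": 0.0, "site_3": 0.4}

# Same predictor value at every site; only the site-level random effect differs.
counts = {s: expected_count(2.0, u, intercept=1.0, slope=0.2)
          for s, u in site_effects.items()}
probs = {s: occurrence_probability(2.0, u, intercept=-1.0, slope=0.5)
         for s, u in site_effects.items()}

print(counts)
print(probs)
```

Fitting a GLMM means estimating the intercept, the slope, and the variance of the site effects from data, which is normally done with a dedicated statistics package.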


• Linear mixed effect model:

A linear mixed effect model (LMM) is a type of regression model that combines both fixed effects and random effects to analyze data with grouped or repeated observations. Fixed effects represent consistent, predictable influences such as temperature or habitat type, while random effects capture variation due to factors like site, individual, or year that are not the main focus but still influence results. LMMs assume a continuous response whose residuals, like the random effects, are normally distributed, making them suitable for analyzing traits like growth rate, biomass, or nutrient levels. In conservation and ecology, LMMs are often used to study how environmental factors affect populations while accounting for non-independence among repeated measures or nested sampling designs.
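To make the structure concrete, the sketch below simulates data from a random-intercept model, y = intercept + slope * x + u_site + e. It only generates data and does not estimate anything. The function name, parameter values, and the choice of a single random intercept per site are assumptions for illustration.

```python
import random


def simulate_random_intercepts(n_sites, n_per_site, intercept, slope,
                               site_sd, resid_sd, seed=0):
    """Simulate grouped observations with one normally distributed offset per site."""
    rng = random.Random(seed)
    rows = []
    for site in range(n_sites):
        u_site = rng.gauss(0.0, site_sd)      # random effect, shared within a site
        for _ in range(n_per_site):
            x = rng.uniform(0.0, 10.0)        # fixed-effect predictor, e.g. temperature
            e = rng.gauss(0.0, resid_sd)      # observation-level residual
            y = intercept + slope * x + u_site + e
            rows.append({"site": site, "x": x, "y": y, "u": u_site})
    return rows


rows = simulate_random_intercepts(n_sites=4, n_per_site=5,
                                  intercept=2.0, slope=0.5,
                                  site_sd=1.0, resid_sd=0.3)
for row in rows[:3]:
    print(row)
```

Every observation from the same site carries the same offset u_site, which is why repeated measures within a site are not independent and why an ordinary regression would understate the uncertainty.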


• Allometric scale model:

An allometric scale model describes how biological traits change in proportion to the size of an organism, population, or system. It is based on the principle that many physiological, ecological, and metabolic processes scale predictably with body size following power-law relationships of the form Y = aM^b, where M is body size (typically mass), a is a normalization constant, and b is the scaling exponent. In conservation biology, allometric models help estimate traits like population biomass, energy use, or nutrient cycling when direct measurements are difficult. These models are valuable for comparing species, predicting ecosystem function, and understanding how size influences survival, reproduction, and resource use.
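A power law becomes a straight line on log-log axes, since log(trait) = log(a) + b * log(mass), so its parameters can be estimated with ordinary least squares on the logged values. The sketch below assumes that approach. The data are hypothetical and generated exactly from a = 2 and b = 0.75, values chosen only so the result can be checked.

```python
import math


def fit_power_law(mass, trait):
    """Fit trait = a * mass**b by least squares on log-transformed data."""
    log_m = [math.log(m) for m in mass]
    log_t = [math.log(t) for t in trait]
    n = len(log_m)
    mean_m, mean_t = sum(log_m) / n, sum(log_t) / n
    b = (sum((x - mean_m) * (y - mean_t) for x, y in zip(log_m, log_t))
         / sum((x - mean_m) ** 2 for x in log_m))
    a = math.exp(mean_t - b * mean_m)   # back-transform the log-scale intercept
    return a, b


# Hypothetical noise-free data from trait = 2 * mass**0.75.
mass = [1, 10, 100, 1000]
trait = [2 * m ** 0.75 for m in mass]

a, b = fit_power_law(mass, trait)
print(round(a, 6), round(b, 6))  # 2.0 0.75
```

With noisy data, fitting on the log scale and fitting the power law directly can give somewhat different estimates, because the two approaches assume different error structures.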


• Linear vs. logistic regression:

Linear regression models the relationship between a continuous dependent variable and one or more independent variables by fitting a straight line to the data. It assumes that a change in the predictor is associated with a proportional change in the response, making it suitable for outcomes like biomass, temperature, or growth rate. The equation is y = a + bx, where “a” is the intercept and “b” is the slope, showing how much “y” changes for each unit increase in “x”. Linear regression is useful when the response variable can take any value within a range and the relationship is approximately linear. Logistic regression, by contrast, is used when the dependent variable is categorical, typically binary, such as the presence or absence of a species. Instead of fitting a straight line, it models the probability of an event occurring using an S-shaped curve that ranges between 0 and 1. The model uses the logit link function to estimate how predictors affect the likelihood of the outcome. Logistic regression is widely used in conservation to predict habitat suitability, species occurrence, or survival probabilities.
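The contrast can be sketched with the same linear predictor a + bx used in two ways. In the first function the intercept and slope are estimated by ordinary least squares. In the second, the coefficients are hypothetical values picked by hand to show the shape of the curve, not the result of fitting a logistic model.

```python
import math


def fit_line(x, y):
    """Ordinary least squares estimates (a, b) for y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b


def logistic_probability(x, a, b):
    """P(y = 1 | x) = 1 / (1 + exp(-(a + b * x))), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


# Hypothetical continuous response lying exactly on y = 1 + 2x.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0

# Hypothetical presence/absence curve with hand-picked coefficients a = -1, b = 0.5.
for x in (-4, 0, 2, 8):
    print(x, round(logistic_probability(x, -1.0, 0.5), 3))
```

The probability is exactly 0.5 where a + bx = 0 (here at x = 2) and flattens toward 0 and 1 at the extremes, which is the S-shape described above.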


• Non-linear regression:

Non-linear regression is a type of statistical modeling used when the relationship between variables cannot be accurately described by a straight line. Instead of assuming a constant rate of change, it fits a curved relationship, such as exponential, logistic, or power-law functions, to capture more complex biological or ecological patterns. This approach is useful when responses level off, accelerate, or follow saturation trends, such as population growth reaching carrying capacity or enzyme activity approaching a maximum rate. Because the parameters generally have no closed-form solution, non-linear regression estimates them through iterative methods that start from initial guesses and refine them step by step. In conservation biology, it helps model processes like species–area relationships, growth curves, or responses to environmental stress that do not follow linear patterns.
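As a sketch of what iterative estimation looks like, the code below fits a saturating Michaelis-Menten curve, v = vmax * s / (km + s), with a Gauss-Newton update and simple step halving. The choice of curve, algorithm, starting values, and data are all assumptions for illustration. The data are noise-free and generated from vmax = 10 and km = 2.

```python
def michaelis_menten(s, vmax, km):
    """Saturating curve that approaches vmax as s grows."""
    return vmax * s / (km + s)


def sse(s_vals, y_vals, vmax, km):
    """Sum of squared errors for the current parameter values."""
    return sum((y - michaelis_menten(s, vmax, km)) ** 2
               for s, y in zip(s_vals, y_vals))


def fit_gauss_newton(s_vals, y_vals, vmax, km, iterations=50):
    """Refine (vmax, km) from starting guesses by repeated linearization."""
    for _ in range(iterations):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for s, y in zip(s_vals, y_vals):
            r = y - michaelis_menten(s, vmax, km)   # residual
            j1 = s / (km + s)                       # d(prediction)/d(vmax)
            j2 = -vmax * s / (km + s) ** 2          # d(prediction)/d(km)
            a11 += j1 * j1; a12 += j1 * j2; a22 += j2 * j2
            g1 += j1 * r; g2 += j2 * r
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-12:
            break
        # Solve the 2x2 normal equations for the parameter update.
        d_vmax = (a22 * g1 - a12 * g2) / det
        d_km = (a11 * g2 - a12 * g1) / det
        # Halve the step until the fit does not get worse.
        step, current = 1.0, sse(s_vals, y_vals, vmax, km)
        while step > 1e-8 and sse(s_vals, y_vals,
                                  vmax + step * d_vmax, km + step * d_km) > current:
            step /= 2
        vmax += step * d_vmax
        km += step * d_km
    return vmax, km


# Hypothetical noise-free data generated from vmax = 10, km = 2.
s_vals = [0.5, 1, 2, 4, 8, 16]
y_vals = [michaelis_menten(s, 10.0, 2.0) for s in s_vals]

vmax_hat, km_hat = fit_gauss_newton(s_vals, y_vals, vmax=8.0, km=1.0)
print(round(vmax_hat, 4), round(km_hat, 4))
```

The result depends on the starting guesses. Poor starting values can make iterative fits converge slowly or settle on a poor solution, which is a practical difference from linear regression.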


• Random Forests:

Random Forests are a machine learning method that builds many decision trees and combines their results to make more accurate predictions. Each tree is trained on a bootstrap sample of the data (rows drawn at random with replacement), and only a random subset of the features is considered at each split, which helps reduce overfitting and improve generalization. The model then averages the predictions from all trees (for regression) or takes a majority vote (for classification). In ecology and conservation, Random Forests are often used to predict species distributions, habitat suitability, or biodiversity patterns based on environmental variables. They are powerful because they can handle large, complex datasets with nonlinear relationships and interactions between variables without requiring strong distributional assumptions.
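The sketch below covers only the two steps that surround the trees: drawing a bootstrap sample for each tree and combining the trees' outputs. It is not a full Random Forest, because the decision trees themselves are omitted. The per-tree predictions are hypothetical placeholders.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done once per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)
rows = list(range(10))                 # stand-in for ten observations
print(bootstrap_sample(rows, rng))     # some rows repeat, others are left out

# Hypothetical outputs from five trees for a single new site.
tree_labels = ["present", "absent", "present", "present", "absent"]
tree_values = [2.0, 4.0, 6.0]

print(majority_vote(tree_labels))       # present
print(average_prediction(tree_values))  # 4.0
```

Because each tree sees a different resample and different candidate features, the trees make partly independent errors, and aggregating them smooths those errors out.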


• Cross-validation and accuracy of models:


Cross-validation is a method used to test how well a model performs on unseen data by dividing the dataset into multiple subsets, or folds. The model is trained on all but one fold and tested on the held-out fold, and the process is repeated so that every fold is used for testing exactly once. This helps detect overfitting, where a model performs well on training data but poorly on new data. Cross-validation provides a more reliable estimate of how the model will perform in real-world situations than a single evaluation on the training data. The accuracy of models refers to how closely a model’s predictions match the actual observed outcomes. Depending on the type of model, accuracy can be measured using metrics such as R^2 for regression or precision, recall, and overall accuracy for classification. In conservation research, high accuracy on held-out data suggests the model can effectively predict ecological patterns, like species presence or habitat quality, across new locations or time periods.
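The fold bookkeeping and the classification metrics can be sketched as follows. This is a minimal version that assumes the observations have already been shuffled. The function names and the example labels are hypothetical, and the model training step is left out.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists so that each index is tested exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def classification_metrics(actual, predicted, positive=1):
    """Overall accuracy, precision, and recall for a binary outcome."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


folds = list(k_fold_indices(n=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)

# Hypothetical held-out labels (1 = species present) and model predictions.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)  # accuracy 0.6, precision and recall both 2/3
```

In a full workflow the model is refit on each training set and scored on the matching test set, and the per-fold scores are then averaged.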
