-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What is logFC reporting in MRcoefs output? #31
Comments
Hi Carly, Thanks - I should mention, since this isn't an issue really it might be better to post these types of questions on support.bioconductor.org. There are a number of edX courses or intro stat books that can explain the basics to linear models. A nice intro is available here: https://leanpub.com/dataanalysisforthelifesciences/ for free. In your model, the first covariate estimate, the intercept, is the mean for Re: q2. Difference of means of log transformed data. |
Hi, Thank you for the quick response. I understand linear models. I have several basic and advanced statistics books sitting on my desk that I reference often. My question is directed towards what logFC is representing. It is not a normal output of a linear model. I now understand that the output is taking the level 1 factor and comparing it to the level 2 factor, indicating how level 2 value is compared to level 1. This is similar to the general output of a linear model, indicating what the intercept is -- level 1 -- and how the other levels compare to the intercept. My final question still remains on how is metagenomeSeq specifically calculating logFC. I find this to be a question specific for you all. I know some people use log2 fold change and I could not find what you all are using. I feel that it is a documentation issue. You are reporting a value in a table with no indication of what the value represents in any metagenomeSeq help file. That is why I am asking it here on github. I can switch over to Bioconductor if need be. Your answer is: it is the ln(mean of OTU_X at factor 2) - ln(mean of OTU_X at factor 1)?? (note: as log in R is the natural log) Is the mean of the OTU that of the normalized counts or the raw counts? From calculating it myself it appears that it is using the normalized counts, but I could be wrong. So the answer is: ln(mean of normalized(OTU_X at factor 2)) - ln(mean of normalized(OTU_X at factor 1) When I extract these values for OTU_2 I get -2.77 so not too off from the -2.71 calculated (probably due to significant figures). Then, to figure out what logFC actually means in terms of biology. I take the e^(-2.77) in the case of OTU_2 and get the value 0.0626. If I subtract 1 then that gives me the change from factor 1 to factor 2 = -0.9374. In my case, the average count (raw or normalized?) of OTU_2 decreased by 93.74% from Catoctin to Shenandoah. If the value is positive, for instance OTU_79 has a value of 2.42. I take e^(2.42) and get the value 11.246. If I subtract 1 then I get 10.246. So, the average count (raw or normalized?) of OTU_79 has increased by 1024% from Catoctin to Shenandoah. I played around with the data to figure out what the logFC is really telling me biologically, but couldn't exactly tell whether it was raw or normalized counts? In my case OTU_2 raw gave me logFC = -3.02 and OTU_2 normalized gave me logFC = -2.77. Then OTU_79 raw game me logFC = 2.40 and OTU_79 normalized gave me logFC = 2.29. Thank you and I look forward to the publication on the time series analysis. I have a time series experiment that I plan to analyze using metagenomeSeq. Still waiting for a simple biological example on how normalization works (#16). I can repost at bioconductor if that will help the practicing biologist get an explanation. Or, you mention a publication may be coming out that will help a biologist understand what they are doing to their data, Cheers, Carly |
Dear Carly, Thank you for your use of metagenomeSeq, it is much appreciated. We are always keen on improving the usability and usefulness of this package so your comments are welcome. We could of course, always use help in translating what our method does to biological statements as much as possible. So, as the Nature Methods paper describing this method explains, the inferences we produce are from a weighted linear regression on log2 transformed and normalized counts. In fact, in order to not redesign the wheel, we make use of the BioC Limma package to fit the weighted linear regression models, and the estimates reported in metagenomeSeq are directly obtained from that. So, as in that case, you can obtain fold change from 2^logFC. So, in the simple two-level factor design, it would be a difference in mean of log2 normalize counts as you write, except for the fact that weights are included to correct for zero-inflation. HTH |
Great, thank you for the reply. It is great the work you all do. That clarifies things for me as to what logFC is and also why the values I was calculating were a bit off from the values that I found in the output. The nature methods paper is dense for a biologist, especially what exactly css normalization is doing. A biological example with a small dataset say 4 samples by 8 OTUs explaining how the normalization constant, the median scaling factor, and the normalization scaling factor is calculated or chosen, and how they are related and how that actually relates to the code one uses in R. By using an example, it is much easier for a biologist to track what is happening. The written text is challenging to grasp. I am at UMD, and am happy to meet and discuss anytime. |
I wanted to add one more comment here that clarifies how fitFeatureModel calculates the logFC as shown in the MRcoefs output and how you can take the value and actually interpret biologically (as someone asked me recently about this). The calculation behind logFC for each OTU is: log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 2)) - log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 1) ######## CODE ########### Create the metagenomeSeq objectMGS_ele = newMRexperiment(counts = OTU, phenoData = ADF, featureData = TDF) p = cumNormStatFast(MGS_ele) returns normalized factors for each samplehead(normFactors(MGS_ele)) To export normalized count matrixI have not spent time figuring out how to extract the normalized counts corrected for zero-inflation, but it is straightforward to extract the normalized counts.normmybiom <- MRcounts(MGS_ele, norm = T) To use fitFeaturemodel can only do pairwise comparisons. I had three levels of interest, so did each comparison. Below is for comparing two localities. MetagenomeSeq has fitZig function that is supposed to be able to do multiple comparisons, but I got results that did not appear correct based on the data, so I just stuck with fitFeatureModelCan only be pairwise comparisons with fitFeatureModelsubset Catoctin, Shenandoah (remove Mt. Rogers samples)Park_ShenCat = MGS[, -which(pData(MGS)$Park == "Mt.Rogers")] Here are the levels: > levels(pd_shencat$Park)[1] "Catoctin" "Shenandoah" mod_shencat <- model.matrix(~Park, data = pd_shencat) y_5 <- MRcoefs(modres5, number = 30, group = 3) OUTPUT:
OTU_449 4.4101969 0.5785806 2.486900e-14 1.442402e-12 Now we want to interpret the OTUs that are significant -- updated July 2017MATHEMATICALLY -- You can then calculate the mean of the normalized counts for each OTU at each factor level (extracted in code above). BUT, what logFC is calculating is based on the mean of normalized counts corrected for zero-inflation. I did not figure out how to extract those values from the model, but I'm sure there is some way if you really wanted to. Again, the calculation behind logFC for each OTU is: log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 2)) - log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 1) I could never get the same answer in excel as metagenomeseq indicates at logFC, but I am assuming this is because I was not extracting the counts before zero-inflation correction by the model. I am NOT a developer of the package, just a user trying to figure it out. There is probably some way to do it in R. I really just did those calculations to try to figure out what the logFC was indicating and how to interpret as percent change. BIOLOGICALLY -- you can just take the calculated logFC and do a little math to make sense of the number (Percentage change = 100% x ((2^ log2 diff) - 1)). For instance OTU_19 above has a logFC = -2.42. I take the 2^(-2.42) in the case of OTU_2 and get the value 0.19 (In excel = Power(2, -2.42)). Then, subtract 1 (0.19 - 1) = change from factor 1 to factor 2 = -0.81. the average normalized zero-inflated corrected count of OTU_19 decreased by 81% from Catoctin to Shenandoah. If the value is positive, for instance OTU_79 has a value of 2.76. I take 2^(2.76) and get the value 6.77. If I subtract 1 (6.77 - 1) then I get 5.77. So, the average normalized zero-inflated corrected count of OTU_79 has increased by 577% from Catoctin to Shenandoah. |
Can I ask for clarification: @carlyrae mentions above using ln, i.e. the natural log or log(e)X to interrogate her data. Additionally, calling ' From the metagenomeSeq documentation, all functions which use or report log transformations or fold changes seem to use log(2)X (binomial log) to improve standardisation of the data; additionally @hcorrada above indicates one would use 2^logFC to obtain the original fold change (but for weightings etc.). A quick search suggests log2 is standard for fold-change expression. In the absence of an obvious convention for log(e)X in fold-change expression, could the authors confirm which logarithm is used for the reported fold-change given by MRcoefs etc ? Apologies for not asking elsewhere, thought it makes most sense to append here in light of CarlyRae's thorough workthrough above. |
It follows the conventions of the package documentation. The
documentation in the package explicitly mentions log2.
MetagenomeSeq and many other very popular differential expression
softwares use log2 in part because of the double-ing of DNA content
during the PCR process.
On 2017-03-21 08:58, handibles wrote:
Can I ask for clarification:
@carlyrae [1] mentions above using LN, i.e. the NATURAL LOG or LOG(E)X to interrogate her data. Additionally, calling '?log' explains that 'log' defaults to log(e)X in R.
From the metagenomeSeq documentation, all functions which use or report log transformations or fold changes seem to use LOG(2)X (BINOMIAL LOG) to improve standardisation of the data; additionally @hcorrada [2] above indicates one would use 2^LOGFC to obtain the original fold change (but for weightings etc.). A quick search suggests log2 is standard for fold-change expression.
In the absence of an obvious convention for log(e)X in fold-change expression, could the authors confirm which logarithm is used for the reported fold-change given by MRcoefs etc ?
Apologies for not asking elsewhere, thought it makes most sense to append here in light of CarlyRae's thorough workthrough above.
--
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub [3], or mute the thread [4].
|
Well, that might explain why my math was always a little off when I was trying to calculate the logFC by hand. I thought it was off because of the zero-inflated part that the model does as Hector had indicated when I first asked how it was calculated. I will look over this and make any corrections in my above description based on your correction about it being log2 and not ln. |
@jnpaulson @hcorrada @carlyrae
Alcaligenes 1.2377366 0.2355239 1.478256e-07 0.0000190695 |
I updated my workflow above. It would be amazing if one of the developers of the package could confirm the mathematical and biological intrepretation of logFC as I have written below. @hcorrada @jnpaulson Now we want to interpret the OTUs that are significant -- updated July 2017MATHEMATICALLY -- You can then calculate the mean of the normalized counts for each OTU at each factor level (extracted in code above). BUT, what logFC is calculating is based on the mean of normalized counts corrected for zero-inflation. I did not figure out how to extract those values from the model, but I'm sure there is some way if you really wanted to. Again, the calculation behind logFC for each OTU is: log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 2)) - log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 1) I could never get the same answer in excel as metagenomeseq indicates at logFC, but I am assuming this is because I was not extracting the counts before zero-inflation correction by the model. I am NOT a developer of the package, just a user trying to figure it out. There is probably some way to do it in R. I really just did those calculations to try to figure out what the logFC was indicating and how to interpret as percent change. BIOLOGICALLY -- you can just take the calculated logFC and do a little math to make sense of the number (Percentage change = 100% x ((2^ log2 diff) - 1)). For instance OTU_19 above has a logFC = -2.42. I take the 2^(-2.42) in the case of OTU_2 and get the value 0.19 (In excel = Power(2, -2.42)). Then, subtract 1 (0.19 - 1) = change from factor 1 to factor 2 = -0.81. the average normalized zero-inflated corrected count of OTU_19 decreased by 81% from Catoctin to Shenandoah. If the value is positive, for instance OTU_79 has a value of 2.76. I take 2^(2.76) and get the value 6.77. If I subtract 1 (6.77 - 1) then I get 5.77. So, the average normalized zero-inflated corrected count of OTU_79 has increased by 577% from Catoctin to Shenandoah. |
@carlyrae |
I only interpreted adjusted p-values < 0.05. The p-values are adjusted for multiple comparisons using FDR corrections (see MRcoefs help file in R). I filtered my original OTU file to contain OTUs that were present on 5% of individuals. Here is methods from a paper soon to be published in J of Animal Ecology. Will share doi once available.To generate normalized sequence counts, we performed variance-stabilizing normalization on the raw sequence counts (Paulson et al. 2013) using the functions cumNormStatFast and cumNorm in the package ‘metagenomeSeq’ (Paulson et al. 2015). This normalization method corrects for biases associated with uneven sequencing depth (Paulson et al. 2013; McMurdie & Holmes 2014). For bacterial OTU abundance, we interpreted normalized sequence counts as bacterial abundance because sequence counts are approximately quantitative for microbial OTU abundance (Amend et al. 2010). Hereafter, we refer to normalized sequence counts as bacterial abundance. In bacterial abundance analyses, we filtered the data to contain OTUs that occurred in 5% of samples to reduce spurious significance from OTUs with low sequence counts, and reported false discovery rate (FDR) corrected p-values. For bacterial abundance, we determined if OTUs were differentially abundant between species and between sites using zero-inflated log-normal (ZILN) mixture models (Paulson et al. 2013) in the package ‘metagenomeSeq’ (fitFeatureModel function; (Paulson et al. 2015) with the normalized sequence counts as the response variable. The ZILN mixture models remove testing biases associated with undersampling (Paulson et al. 2013). If sites within a locality had no OTUs that were differentially abundant then we pooled the sites together at the locality level to increase statistical power. Here is the top part of my code that I didn't include before:library(phyloseq) #non-rarified tree = read.tree("SMP_rooted_all_Dec2015.tre") map <- import_qiime_sample_data("biom_sides/Site.sp_compare/SMP_runall_Map_meta__Side_left____Site.sp_compare_Y__.txt") run2 <- import_biom(biom_site_sp, tree) site_sp <- merge_phyloseq(run2, map) ##remove all zeros #repair column names in tax table rank_names(site_sp) ##examine OTUs that are present on 5% of individuals, can increase depending on sampling effots. I had 62 individuals in my dataset. ##I will only work with reduced dataset #Convert sample_data to AnnotatedDataFrame #define dummy 'feature' data for OTUs, using their name Helps with |
@carlyrae Awsome! Thanks so much for your reply and the code sharing. Got it! |
Hello,
I am trying to understand what the values are that are being reported in logFC after using fitFeatureModel. My code is below.
I have a metagenomeSeq object called Park_ShenCat that looks like this:
I am assuming that these warning messages indicate that some of these OTUs do not have enough information for the model to converge so I get an error, is this correct?
Now here is my question -- For OTU_2 it says there is a -2.71 logFC from Catoctin to Shenandoah (my levels of my comparison).
I assume that is log2 fold change since that is what most RNAseq folks use, but I couldn't find any documentation of that.
Now, I can go look at OTU_2 and I know it is much more abundant at Catoctin (level 1) compared to Shenandoah (level 2), but why is logFC reporting it as a negative number?
Also, is this the log fold change that log2(sum of the normalized counts for OTU_2 at Catoctin divided by the sum of the normalized counts for OTU_2 at Shenandoah)??
Thank you for your time.
The text was updated successfully, but these errors were encountered: