Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Addition of 01-transcriptomic-analysis-prep.Rmd #83

Merged
merged 20 commits into from
Sep 12, 2019

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Aug 27, 2019

Purpose/implementation

The purpose of this draft PR is to conduct an unsupervised transcriptomic analysis using three dimensionality reduction techniques. The techniques used include PCA, t-SNE, and UMAP.

You can find the html output of this notebook here.

Issue

This draft PR addresses issue #9.

Directions for reviewers

Given the first set of visualizations produced, it was suggested that I backtrack using the Kids_First_Participant_ID and experimental_strategy information to find the sequencing strategies used for each observation. The unique strategies were found to be WGS, WXS, Panel, and RNA-Seq (which is renamed and treated as Unknown).

  1. My attempt to correct for batch effects did not produce significant results according to the visualizations. Should I remove this attempt and open a separate issue for this purpose?
  2. I have made this a draft PR because I believe the code can be further simplified. Please feel free to provide suggestions you believe may be useful on this matter.
  3. Are there any additional steps I should add to this analysis?

Results

  • Visualizations of the reduction scores for each of the dimension reduction techniques executed in this notebook are available on the html output and in the directory analyses/unsupervised-trancriptomic-analysis/plots

  • The .PDF files in the plots directory include visualizations before and after the attempt to correct for batch effects. Those beginning with meta_grid represent the data before batch correction attempt, while those beginning with no_batch_grid represent the data after the batch correction attempt.

Docker and continuous integration

Check all those that apply or remove this section if it is not applicable.

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Pull Request Check List:

  • Run a linter (not done as yet)
  • Set the seed (if applicable)
  • Comments and/or documentation up to date
  • Double checked paths
  • Spell check any Rmd file or md file
  • Restart R and run all notebooks fresh and save
  • Connect the pertinent issues on ZenHub
  • Updated Dockerfile and built Docker image successfully

@jaclyn-taroni
Copy link
Member

My attempt to correct for batch effects did not produce significant results according to the visualizations. Should I remove this attempt and open a separate issue for this purpose?

Yes, I would remove any attempts at batch correction from this notebook. The notebook is 600 lines long in its current state which will make review to find ways to simplify the code more difficult.

@jharenza
Copy link
Collaborator

Hi @cbethell! I think this is a great start and one thing we have been struggling with is that some of the RNA-Seq was a poly-A prep while the majority was stranded. I am not sure anyone has yet to figure out how to normalize across these two batches (eg I think UCSC Treehouse folks are also very interested in this). In some of our discussions, it has been suggested to subset the dataset for only those genes which will be captured in a polyA prep, and see if that helps, but you will lose a lot of data in that way.

@jaclyn-taroni
Copy link
Member

One thing we have been struggling with is that some of the RNA-Seq was a poly-A prep while the majority was stranded.

I don't believe this is currently noted in the histologies file, is that correct @jharenza? We came up with an idea, reflected in this PR, that the other experimental strategies (e.g. WXS, WGS) linked to a participant ID would tell us about RNA-seq prep based on this section of the manuscript: https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#data-generation This may be error-prone.

Can we add the prep information to the histologies file since we expect it will be important for folks analyzing the gene expression data?

@cbethell
Copy link
Contributor Author

Hi @cbethell! I think this is a great start and one thing we have been struggling with is that some of the RNA-Seq was a poly-A prep while the majority was stranded. I am not sure anyone has yet to figure out how to normalize across these two batches (eg I think UCSC Treehouse folks are also very interested in this). In some of our discussions, it has been suggested to subset the dataset for only those genes which will be captured in a polyA prep, and see if that helps, but you will lose a lot of data in that way.

Hi @jharenza, this is good to know! I do hope we can find a more conservative method of solving this issue.

@jharenza
Copy link
Collaborator

@jaclyn-taroni agreed - I will add this in a V3 of the file, along with survival and sex prediction

@jaclyn-taroni
Copy link
Member

Nice, thank you!

@cgreene
Copy link
Collaborator

cgreene commented Aug 27, 2019

ribo-deplete vs poly-a normalization came up on twitter this morning too: https://twitter.com/GreeneScientist/status/1166297770833121280?s=20

@jharenza
Copy link
Collaborator

@jaclyn-taroni I created a V3 clinical file to include predicted sex, overall survival, and RNA library type, which you can access immediately here if it will help with this analysis - we will add this to a V3 release, which may come sooner than extra variant calls cc: @yuankunzhu

- removed the attempt at batch correction from `01-transcriptomic-analysis.Rmd` and its html output
- removed the installation of `limma` package from `Dockerfile` as the need for it was removed
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice start @cbethell! Per your directions for reviewers, I've included comments about how I think things could be simplified/DRY'd up. I've also sketched out how you might change up the arrangement of the plots a bit, which you don't need to accomplish as part of this pull request but may find it fun and interesting to mess around with at this point.


- run:
name: Unsupervised Transcriptomic Analysis
command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('analyses/unsupervised-transcriptomic-analysis/01-transcriptomic-analysis.Rmd', clean = TRUE)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give this notebook, and possibly the analysis folder, a more descriptive name please? Consider that unsupervised analysis is quite broad and you may want to call this transcriptomic-dimension-reduction. I would say if you expect to try unsupervised methods in this module (e.g., hierarchical clustering) then the analysis folder name itself is probably a-okay.

# magrittr pipe
`%>%` <- dplyr::`%>%`

reduction_fn <- function(name, id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given #83 (comment), I would take this out for now. I'll also note that reduction_fn is not particularly descriptive and without reading the code, I expected for this to run the three dimension reduction techniques. In general, I like for functions to begin with a verb (plot_tsne_scatterplot) to be consistent with the style guide used at the CCDL.

# Detect the ".git" folder -- this will in the project root directory.
# Use this as the root directory to ensure proper execution, no matter where
# it is called from.
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is an R notebook and not a script, the working directory will be wherever the file is located. So this becomes less important and you could safely use the symlinked files with ../data. You don't have to change this, more of an FYI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually made this change in the updated version I have yet to commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't note it, but that means you would change the output directory stuff. You may have already caught this, but it's one of those things that I sometimes have trouble spotting during code review (e.g., the paths for plots would not be super obvious to me in the GH interface) and I don't think it'd be caught via CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha 👍

}
```

Read the metadata into a data.frame named `df2`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why df2 and not something more like metadata_df or histologies_df?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I'll make the appropriate change.

respectively.
```{r}
# Read in dataset
df2 <- data.frame(readr::read_tsv(file.path(root_dir, "data", "pbta-histologies.tsv")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference here -- I would set data_dir <- file.path(root_dir, "data") above.

rsem_pca_plot <- ggplot2::ggplot(
rsem_pca_data,
ggplot2::aes(
x = rsem_pca_data[, 1],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This process, where you specify a data.frame that contains the score information and use the first two columns as the x and y variables for your scatterplot, happens quite a bit. It seems that so long as you set up the data.frame correctly and in the same manner, you could supply that data.frame to a function that does your plotting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you should consider making the font size a bit bigger with

ggplot2::theme(text = ggplot2::element_text(size = 15))

ggplot2::aes(
x = rsem_pca_data[, 1],
y = rsem_pca_data[, 2],
color = method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this color argument the only thing that changes between the plot above and what you have here, you can add this as one of the arguments to your function (mentioned above). This is probably going to be tricky because the other arguments for x and y won't be strings so it would not be as easy as using aes_string() instead of aes(). Maybe it would look like, assuming the argument to the function you write is point_color = "method"...

Step 1 Add this to the beginning of the plotting function:

color_sym <- rlang::sym(point_color)

Step 2 the aes call would look something like:

  ggplot2::aes(
    x = score_df[, 1],
    y = score_df[, 2],
    color = !!color_sym
  )

Untested but here's my reference: https://community.rstudio.com/t/how-to-turn-strings-from-function-arguments-into-column-names-with-dplyr/19438

# Make a data.frame with PCA scores
rsem_pca_data <- data.frame(rsem_pca$x[, 1:2])
# Run the reduction_fn which aligns metadata and prepares data.frame for ggplot
rsem_pca_data <- reduction_fn(rsem_pca_data, rsem_ID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove this function as suggested, you can probably replace it with joining whatever 'scores' data.frame you have with the histologies data.frame which contain what you need to color the points. I think this would be the procedure in every case (i.e., RSEM PCA, kallisto UMAP).

It addresses [issue #9](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/9)
in the Open-PBTA analysis repository.

## Output files:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume some of these came out with 0d74cc5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct, I will make the appropriate changes here.

(poly-A versus stranded).

```{r, fig.width = 17, fig.height = 14}
# Plot grid with RSEM data colored by method vs tumor type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an idea about these figures could be laid out so you would be able to letter the panels:

8_28_19, 7_42 AM Office Lens

I would use cowplot for this, which very well may be on the Docker image already as a dependency with the addition of colorblindr here, and I think the way to build this with cowplot is to follow the three steps I laid out here. I haven't worked with cowplot v1.0.0 yet, though, IIRC. Because of the MRAN snapshot constraint noted here in the project README, double check what version, if any, is installed on the image.

You might find the arranging plots in a grid and shared legends section of the cowplot references helpful.

This is not a must have, but it is an opportunity to play around with this kind of thing which will almost certainly pop up again in your work and maybe even in this project!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea!

@jaclyn-taroni
Copy link
Member

Oh, forgot to note, you may want to try filtering out low count genes or genes missing from a large proportion of samples before dimension reduction and plotting.

@jaclyn-taroni
Copy link
Member

I also to forgot to mention that you’ll need to set a seed to make this reproducible.

@cbethell
Copy link
Contributor Author

I created a V3 clinical file to include predicted sex, overall survival, and RNA library type, which you can access immediately here if it will help with this analysis

The link in the comment above requires that I request permission to access. @jharenza would you be able to change the permissions to this clinical file?

@jharenza
Copy link
Collaborator

I created a V3 clinical file to include predicted sex, overall survival, and RNA library type, which you can access immediately here if it will help with this analysis

The link in the comment above requires that I request permission to access. @jharenza would you be able to change the permissions to this clinical file?

Changed permissions, and we just updated that file in the V3 release: #88 !

@cbethell
Copy link
Contributor Author

Changed permissions, and we just updated that file in the V3 release: #88 !

Great, thank you!

cbethell and others added 2 commits August 29, 2019 12:22
- changed analysis directory name and name of `.circleci` run to “transcriptomic-dimension-reduction”
- changed the data_dir file path
- renamed `df2` `metadata_df`
- created a function to specify the scores data.frame and use the first 2 columns as x and y for a ggplot
- removed output files that are no longer produced from documentation 
- `cowplot` is used to arrange plots in a grid (I am still playing around with this to achieve the best possible layout for the plots and legend)
- filtered out the low count genes (I am also still looking into whether or not the method used is the best)
- set a seed
- replaced `reduction_fn` with `align_metadata` (this will also be changed once
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job on 'functionalizing' the plotting! Having looked at the entire thing again with your changes, I have some thoughts on how I would organize this and what functions I would write. These may conflict slightly with the inline comments because I made those before I put it all together.

You essentially have 3 steps that you repeat for gene expression matrix (e.g., kallisto, RSEM) method (e.g., PCA, t-SNE) pairs: the dimension reduction itself that yields a data.frame that contains scores you will plot, joining the scores with the metadata, plotting the first two dimensions and the variable you use to color the points changes

So I propose that you have functions for each of those three steps that are then wrapped up in a function that puts it all together:

plot_dimension_reduction: make suggested tweaks to font size and add arguments for the x-axis and y-axis labels and title

perform_dimension_reduction

  • Arguments:
    • transposed expression matrix
    • method (e.g., PCA, tSNE, UMAP)
    • seed with some default value
  • Returns:
    • scores_df with a column that includes sample identifiers
  • Steps:
    • sets seed using seed argument
    • uses logic to get the correct scores_df based on the methods argument

align_metadata

  • Arguments:
    • metadata_df
    • scores_df
  • Returns:
    • joined data.frame with both
  • Steps:
    • inner_join

dimension_reduction_wrapper

Brings all of this together ☝️ , so we need to account for the individual arguments from these functions

  • Arguments:
    • transposed expression matrix
    • dimension reduction method
    • seed with some default value
    • metadata_df
  • Returns:
    • list that contains n number of plots (however many different ways you end up coloring the scatter plot points
  • Steps:
    • perform_dimension_reduction
    • align_metadata using the scores_df from perform_dimension_reduction
    • plot_dimension_reduction using align_metadata output n times, changing the point_color argument each time - you should be able to tell from the dimension reduction method argument how to label the x-axis, y-axis and title (for PCA I would expect those to be PC1, PC2, PCA, respectively)

What I'm proposing has a weakness in that it doesn't save any of the models (e.g., prcomp output) to file which may be useful. You could put that in perform_dimension_reduction and make that take a file name as an argument -> account for that in dimension_reduction_wrapper. There may also be a better way of breaking out the plotting from the wrapper function, but let's try starting with this.

# Align the metadata with the `Biospecimen_ID` variable in the data.frame
metadata <- metadata_df %>%
dplyr::filter(Kids_First_Biospecimen_ID %in% name$Biospecimen_ID) %>%
dplyr::filter(!duplicated(Kids_First_Biospecimen_ID))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often does this duplication happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just once instance of this duplication in the RSEM and Kallisto data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, believe that's tracked here #95

# metadata's `Kids_First_Biospecimen_ID`
#
# Args:
# name: Name of the data.frame containing dimension reduction scores
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scores_df would be a much more descriptive variable name. I could not intuit what name was without reading your comments.

#
# Args:
# name: Name of the data.frame containing dimension reduction scores
# id: Vector containing Biospecimen IDs extracted from the transposed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is also difficult to intuit.

dplyr::filter(!duplicated(Kids_First_Biospecimen_ID))
# Filter the data.frame to contain only the `Biospecimen_ID` values common
# to the metadata
name <- name %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you should be able to do a dplyr::inner_join somewhere in this function instead. It will do the filtering for you.

name <- name %>%
dplyr::filter(Biospecimen_ID %in% metadata$Kids_First_Biospecimen_ID)
# Assign Kids_First_Participant_ID to column `type` of the data.frame
name$Kids_First_Participant_ID <-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be a factor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not need to be, good point

exp_kallisto <- exp_kallisto[, -1] %>%
dplyr::filter(!duplicated(gene_id)) %>%
tibble::column_to_rownames("gene_id") %>%
na.omit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes it is, I'll use that function in both cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have to, I just was making sure that I understood the use of na.omit() here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. However, I do prefer the use of complete.cases(.) here.

tibble::column_to_rownames("gene_id") %>%
na.omit()

# Filter out low count genes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a constant that would need to be changed in multiple places, so you should assign it to something like

gene_count_threshold <- 100

and then

kallisto_genes_to_keep <- rowSums(exp_kallisto) >= gene_count_threshold

color = !!color_sym
)) +
ggplot2::geom_point() +
ggplot2::theme(legend.position = "none")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the font size bigger please? #83 (comment)


```{r}
# Run PCA on kallisto
kallisto_pca <- prcomp(transposed_kallisto_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're repeating a lot of the same steps. Consider a function where one of the arguments is which dimension reduction method you want used and then you use logic to do the first two steps which are run the dimension reduction and set up the scores data.frame (also takes a transposed expression matrix as an argument)

if (method == "PCA" ){
  prcomp_results <- prcomp(transposed_expression)
  scores_df <- data.frame(prcomp_results$x[, 1:2])
} else if {
...

And then the other steps that you repeat after setting up a score data.frame, which are the same, are 1) joining with metadata 2) plot with disease_type_new 3) plot with experimental_strategy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also have this function take seed as an argument with some default (could be 2019) and use set.seed(seed) within the function. You could then potentially use the function to do something like run t-SNE multiple times with the same seed to see how the plot changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, made this change.

# magrittr pipe
`%>%` <- dplyr::`%>%`

align_metadata <- function(name, id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I think this function is trying to accomplish:

  1. Joining the relevant columns from the metadata to a data.frame that contains the scores
  2. Changing certain columns to be factors

I don't think you need to only select the relevant columns -- you might decide that you want to color points using other columns in the future.

I think this should take the scores data.frame and the metadata data.frame as arguments, possibly with an additional argument that tells you what column the relevant sample identifiers are in. (Don't worry about that last part too much, let's try changing this incrementally.)

So I think you essentially want to do:

scores_meta_df <- scores_df %>%
  # using the by argument this way allows us to join two variables with different names together
  dplyr::inner_join(metadata_df, by = c("<sample identifier column name in scores_df>" = "<sample identifier column name in metadata_df>")) 

Where you make sure that scores_df is set up to have sample identifiers in it as a column before you pass it to this function.

The sample identifiers I'm assuming would be biospecimen ids. The use of inner_join means it would drop out anything that didn't have a sample identifier that matched in both data.frames you are joining, i.e., all of the WXS, WGS biospecimen ids + their metadata should get dropped. (You won't need this information anymore because the selection strategy should be in the updated v3 files.)

This joined metadata, scores_meta_df, is what you want for plotting. I believe that passing a column that is a character and not a factor (provided you don't need to do any reordering of levels) shouldn't give you any trouble with your scatter plots.

- split the notebook into 2 separate notebooks (one for data prep and the other for plotting) 
- made the parallel changes to `.circleci/config.yml`
- used `dply::inner_join` to merge the dimension reduction scores with the metadata 
- added tsv files of the dimension reduction score data.frames 
- Created the suggested functions, including a wrapper function to execute all of the functions 
- saved aligned dimension reduction score data.frames as `RDS` files to be read into the second notebook
@cbethell cbethell changed the title In Progress: Addition of unsupervised-transcriptomic-analysis In Progress: Addition of 01-transcriptomic-analysis-prep.Rmd Sep 9, 2019
- `analysis/transcriptomic-dimension-reduction/results/kallisto_tsne_model.RDS`
- `analysis/transcriptomic-dimension-reduction/results/kallisto_umap_model.RDS`

_Note that the tsv files holding the dimension reduction scores represent the scores for the RSEM (denoted by the output tsv files beginning with 15.17) and Kallisto expression data (denoted by the output tsv files beginning with 109.299)._
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is naming the files this way rather than using rsem or kallisto in the filenames preferable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is a result of the as.name function. This was meant to be a temporary fix, until I find a better method of using the transposed_expression_matrix argument for naming.

umap_results <- umap::umap(transposed_expression_matrix)
dimension_reduction_df <- data.frame(umap_results$layout)
} else {
print ("Error")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use stop() here.

ID <- rownames(transposed_expression_matrix)
# Perform dimension reduction
if (method == "PCA" ){
prcomp_results <- prcomp(transposed_expression_matrix)
Copy link
Member

@jaclyn-taroni jaclyn-taroni Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation here and throughout this function.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recording the outcomes/discussion from pair programming

#

# Run `perform_dimension_reduction` function
dimension_reduction_df <- perform_dimension_reduction(transposed_expression_matrix, method, seed = 2019)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dimension_reduction_df <- perform_dimension_reduction(transposed_expression_matrix, method, seed = 2019)
dimension_reduction_df <- perform_dimension_reduction(transposed_expression_matrix, method, seed = seed)

# magrittr pipe
`%>%` <- dplyr::`%>%`

perform_dimension_reduction <- function(transposed_expression_matrix, method, seed = 2019) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accept an argument that is the file name for the model RDS

# Assign the relevant rownames(Kids_First_Biospecimen_ID) to a column
dimension_reduction_df$Kids_First_Biospecimen_ID <- ID
# Write the resulting data.frame to a tsv file
readr::write_tsv(dimension_reduction_df,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the file written by this function would be the entire model as an RDS object. I would also not use the results_dir from the global environment but instead pass the filename (which contains the full path information) to the function.

dplyr::inner_join(metadata_df, by = c("Kids_First_Biospecimen_ID"))

# Write the resulting metadata aligned data.frame to a RDS file
readr::write_rds(aligned_scores_df,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this to be a TSV file so that folks that are not R users could plot, say, PC1 vs. a variable of their interest.

file.path(results_dir, file = paste(matrix_name, method, "scores.tsv", sep = "_")))

return(dimension_reduction_df)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New line here

-Renamed the output files to better represent their content 
- Replaced `print("error")` with `stop` function 
- Fixed indentation 
- Included `filename` as an argument in the functions executed in this nb
- edited the `.circleci/config.yml` to reflect only the file on this branch
cbethell and others added 10 commits September 11, 2019 16:09
- transformed the .Rmd file to R script in order to pass command line argument to `perform_dimension_reduction` function 
- made the appropriate change to `.circleci/config.yml`
- changed file paths using `rprojroot` function 
- added `optparse` to Dockerfile (also needed to pass command line argument to function) 
- renamed the output files to better represent their content
- also fixed a type and some spacing
@jaclyn-taroni jaclyn-taroni marked this pull request as ready for review September 12, 2019 14:17
@jaclyn-taroni jaclyn-taroni changed the title In Progress: Addition of 01-transcriptomic-analysis-prep.Rmd Addition of 01-transcriptomic-analysis-prep.Rmd Sep 12, 2019
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 excellent job on the progression of this PR @cbethell !

@jaclyn-taroni jaclyn-taroni merged commit b8bbb04 into AlexsLemonade:master Sep 12, 2019
@cbethell cbethell deleted the transcriptomic-analysis branch December 18, 2019 19:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants