Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract brain tissue from gtex gene count #5

Open
ghost opened this issue Feb 15, 2022 · 5 comments
Open

extract brain tissue from gtex gene count #5

ghost opened this issue Feb 15, 2022 · 5 comments

Comments

@ghost
Copy link

ghost commented Feb 15, 2022

Hello,
I have downloaded GTEx rna-seq data from this site:
wget https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz
I want to extract only brain tissue gene expression data from this file.
I checked yarn package and does it has any function that can perform this function?
checkTissuesToMerge(obj, majorGroups, minorGroups, filterFun = NULL,plotFlag = TRUE, ...)
But this function will merge tissue based on gene expression file.
I just specifically want to extract brain tissue gene expression data only.
Can you please provide me some guidance on how to proceed with this?
Thank you

@Xenophong
Copy link

hi, even if I am not the author of this package, I think this is a more general question. You have to download the sample attribute manifest file, find which sample ids belong to the brain tissues, and then extract those samples from your gene expression file.

@marouenbg
Copy link
Contributor

Hi,
we moved the software to netZooR, we will be happy to help you here: https://github.com/netZoo/netZooR

@gmillerscripps
Copy link

I don't know if you've looked into the netZooR package that @marouenbg mentioned (I didn't know that support had transitioned to netZooR until I read this comment today), but yes @Xenophong is correct that you need the sample attribute data. For version 8 it would be:
https://storage.googleapis.com/adult-gtex/annotations/v8/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt

You can then look at column SMTSD and grepl for "Brain" (at least that is what I did), and then match the SAMPIDs in the phenodata to the column names of the gene reads data.

I just downloaded the files from GTEx to local and read them in that way. Most of this is directly from yarn::downloadGTEx but I had to modify some things.


  # Reading in Sample Phenotype Data
  # Data should contain:
  #' SAMPID
  #' SMTSD
  pd <- read.delim(file.path(data.dir, samp_pheno))
  
  message("Sample Phenotype Found")
  
  # Set as matrix and get row names
  pd <- as.matrix(pd)
  rownames(pd) <- pd[, "SAMPID"]
  
  # Subset if tissue subset given
tissue_subset <- "Brain"

    # Look for any matches on the tissue subset
    subset_rows <- which(grepl(paste0(tolower(tissue_subset), collapse  = "|"), tolower(pd[, 'SMTSD'])))
    pd <- pd[subset_rows, ]
  
  options <- unique(pd[, 'SMTSD']) 
  message("Unique Tissues:")
  message(paste0(options, collapse = '\n'))

  # Get IDs
  ids <- sapply(strsplit(pd[, "SAMPID"], "-"), function(i) paste(i[1:2], 
                                                                 collapse = "-"))

  # Read in the counts data
  #' Assuming data is in .gct format
  message("Reading in count data") 
  
  cnts_raw <- data.table::fread(file = file.path(data.dir, gene_file))
  print(dim(cnts_raw))
  # Get unique samples
  cnt_names <- unique(names(cnts_raw))
  
  # Join samples with sample phenotype data
  all <- data.frame(SAMPID = cnt_names) %>%
    left_join(
      as.data.frame(pd[, c("SAMPID","SMTSD")]), by = 'SAMPID'
    )
  
  # Subset to tissue subset if provided
  subset_samples <- all %>%
      filter(.,grepl(paste0(tolower(tissue_subset), collapse = '|'), tolower(SMTSD)))   %>%
    .$SAMPID
  
  # Need to change dashes to "." for comparing against pd data
  subset_samples <- gsub("\\-", ".", subset_samples)
  subset_samples <- subset_samples[grepl("GTEX", subset_samples)]

  # Subset to given subset    
  names(cnts_raw) <- gsub("\\-", "\\.", names(cnts_raw))
  cnts <- cnts_raw[,c("Name", "Description", subset_samples), with = FALSE]

  names(cnts) <- gsub("\\.", "\\-", names(cnts))
  
  # Get unique genes
  genes <- unlist(cnts[, 1])
  geneNames <- unlist(cnts[, 2])
  
  # Get the full names for those genes
  counts <- cnts[, -c(1:3)]
  
  counts <- as.matrix(counts)
  mode(counts) <- "integer"

  # Set the names of the counts to those genes
  rownames(counts) <- genes
  


@marouenbg
Copy link
Contributor

Hi @gmillerscripps ,
Apologies for the confusion, yarn is not maintained in netZooR but rather just called as a dependency. I will have to check which repo is the source code for bioconductor ad will let you know asap.
Best,
Marouen

@marouenbg
Copy link
Contributor

Hi @gmillerscripps and al,
This took too long but we now moved YARN in netZooR to maintain it long term netZoo/netZooR#314
Please let me know if you want anything changed/updated.

Marouen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants