TCGA biolinks recognize already downloaded data? #153

Closed
NatashaJorge opened this issue Oct 3, 2017 · 12 comments

Comments

@NatashaJorge

Hello,

I'm a post doc and just started using TCGAbiolinks, it's been really helpful! Many thanks!! :)

I'm currently working with some data that a colleague already downloaded and saved in the same file structure that GDCdownload and GDCprepare use. Is there a way for TCGAbiolinks to recognize this already downloaded data?

Thanks for your time and attention,
Best wishes!

@tiagochst
Contributor

Hello,

It would be possible if you have the query object.

GDCquery returns a table with the query results (data category, data type, file ID, file name, platform, workflow, etc.) that is used to prepare the data correctly. Several of these fields are used, for example whether the data comes from the legacy archive, the data category, the data type and the platform. To find the files, it uses the following pattern:
Root directory/project/source/data_category/data_type/file_id/file_name

Example: GDCdata/TCGA-GBM/harmonized/DNA_Methylation/Methylation_Beta_Value/079fcaff-3ae6-4150-b2e6-2b7330ffbcd9/jhu-usc.edu_GBM.HumanMethylation450.10.lvl-3.TCGA-19-A6J5-01A-21D-A33U-05.gdc_hg38.txt

Code: https://github.com/BioinformaticsFMRP/TCGAbiolinks/blob/master/R/prepare.R#L90-L94
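
The path pattern above can be sketched in base R. This is an illustration only, not TCGAbiolinks' own code: `expected_gdc_path` is a hypothetical helper, and the replacement of spaces by underscores in the category and type names is an assumption read off the example path.

```r
# Hypothetical helper: builds the path where a downloaded file is expected,
# following root/project/source/data_category/data_type/file_id/file_name.
expected_gdc_path <- function(project, data_category, data_type,
                              file_id, file_name,
                              data_source = "harmonized", root = "GDCdata") {
  # Assumption: spaces in category and type names become underscores,
  # as in the example path ("DNA Methylation" -> "DNA_Methylation").
  underscored <- function(x) gsub(" ", "_", x, fixed = TRUE)
  file.path(root, project, data_source,
            underscored(data_category), underscored(data_type),
            file_id, file_name)
}

p <- expected_gdc_path(
  project       = "TCGA-GBM",
  data_category = "DNA Methylation",
  data_type     = "Methylation Beta Value",
  file_id       = "079fcaff-3ae6-4150-b2e6-2b7330ffbcd9",
  file_name     = "jhu-usc.edu_GBM.HumanMethylation450.10.lvl-3.TCGA-19-A6J5-01A-21D-A33U-05.gdc_hg38.txt"
)

# TRUE only when run from the folder that contains GDCdata
# and the file has actually been downloaded there.
file.exists(p)
```

Comparing such a path with what is on disk is a quick way to see which component, if any, differs.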

If you have the query object it should be easy to read the data already downloaded. You would just need to be in the same directory as the root download folder (default is 'GDCdata').

Recreating the query object by hand would require you to know the data and some of the field values. I don't believe it is worth creating it by hand. I would ask your colleague for the query object that was used.

Best regards,
Tiago

@NatashaJorge
Author

Dear Tiago,

Thank you for your reply.

My file structure is exactly as your example, and I have the command line she used to get the query and download the data.

So, what command lines should I use? Just the GDCquery and GDCprepare?

Thanks again,
Best wishes,
Natasha

@tiagochst
Contributor

You only need to use GDCquery and GDCprepare. But if you use GDCdownload it should say that all samples were already downloaded.

@NatashaJorge
Author

Dear Tiago,

I'm still having problems with TCGAbiolinks recognizing the already downloaded data. I've checked the directory structure and it is the same as the one requested by TCGAbiolinks.

If I use GDCquery and GDCprepare together, it gives me the following error:
"Error in GDCprepare(cnv) :
I couldn't find all the files from the query. Please check if the directory parameter right or GDCdownload downloaded the samples.",

If I try GDCdownload, it starts downloading, but then gives an error saying the file or directory does not exist:
"<simpleWarning in file.create(to[okay]): cannot create file 'GDCdata/TCGA-SKCM/harmonized/Copy_Number_Variation/Copy_Number_Segment/c70811e8-deb7-4a96-9d70-b322c15fe1a4/SPICS_p_TCGA_B_314_315_316_NSP_GenomeWideSNP_6_G06_1361494.grch38.seg.txt', reason 'No such file or directory'>
"

However, when I check the file, it is there and in the same directory requested:
[natasha@lobster MetilTCGA]$ ls GDCdata/TCGA-SKCM/harmonized/Copy_number_variation/Copy_Number_Segment/c70811e8-deb7-4a96-9d70-b322c15fe1a4/
SPICS_p_TCGA_B_314_315_316_NSP_GenomeWideSNP_6_G06_1361494.grch38.seg.txt

Please, do you know where I am going wrong?

Best wishes,
Natasha
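
One detail visible in the two paths quoted above is that they differ in letter case: the error message refers to `Copy_Number_Variation`, while the `ls` command lists `Copy_number_variation`. On a case-sensitive file system these are different folders. Whether this is the cause here is not confirmed in the thread, but a minimal base R sketch for spotting such a mismatch could look like this (`find_case_mismatch` is a hypothetical helper, not part of TCGAbiolinks):

```r
# Hypothetical helper: walks an expected relative path one component at a
# time and reports the first component that is missing on disk, together
# with any entry that differs from it only by letter case.
find_case_mismatch <- function(path, start = ".") {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  current <- start
  for (part in parts) {
    entries <- list.files(current, all.files = TRUE, no.. = TRUE)
    if (part %in% entries) {
      # Exact match: descend and keep checking.
      current <- file.path(current, part)
      next
    }
    # No exact match: look for names that match when case is ignored.
    near <- entries[tolower(entries) == tolower(part)]
    return(list(expected = part, found = near, parent = current))
  }
  NULL  # every component matched exactly
}

# Run from the folder that contains GDCdata.
find_case_mismatch(
  "GDCdata/TCGA-SKCM/harmonized/Copy_Number_Variation/Copy_Number_Segment"
)
```

A `NULL` result means every component exists with exactly the expected spelling. Otherwise `expected` and `found` show the name that was looked for and the differently cased name on disk, if there is one.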

@huwenhuo

Hi Natasha, I just tried this and it works fine for me. One thing to point out: your current working directory must be the folder above GDCdata (the one that contains it), not GDCdata itself. If you want to see the details of GDCprepare, just type GDCprepare at the R prompt to print its source code.
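
The working-directory requirement can be checked before calling GDCprepare. A minimal sketch in base R, assuming the default root folder name `GDCdata` (`has_download_root` is a hypothetical helper, not a TCGAbiolinks function):

```r
# Hypothetical helper: TRUE when `workdir` directly contains the download
# root folder, i.e. when it is the folder "above" GDCdata.
has_download_root <- function(workdir = getwd(), root = "GDCdata") {
  dir.exists(file.path(workdir, root))
}

if (!has_download_root()) {
  message("No GDCdata folder in ", getwd(),
          "; change directory with setwd() or pass `directory` to GDCprepare.")
}
```

If this reports a missing folder while the working directory is GDCdata itself, moving one level up with `setwd("..")` is the likely fix.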

@modarzi

modarzi commented Mar 6, 2019

@NatashaJorge
Hi Natasha

I have this problem too. Were you able to solve it?
Thanks
Mohammad

@latifizadehhabib

Hi, I have the same issue. If you found a solution, could you please help me out? Thank you.

@HuaZou

HuaZou commented Jun 8, 2021

> Hi, I have the same issue. If you found a solution, could you please help me out? Thank you.

After adding directory = Outdir to the GDCprepare call, my problem was solved. The code is as follows:

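library(TCGAbiolinks)

# Assumption: the project ID is inferred from the output file name used below.
cancer_type <- "TCGA-KIRC"

# Query, download and prepare one data type. Outdir selects the query and is
# also used as the download directory for both GDCdownload and GDCprepare.
# Argument values are as posted (June 2021); newer GDC or TCGAbiolinks
# releases may need different ones.
get_OmicsData <- function(project  = cancer_type,
                          Outdir   = "mRNA"){
  if(Outdir == "mRNA"){
    query_Data <- GDCquery(project = project,
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - Counts")
  }else if(Outdir == "miRNA"){
    query_Data <- GDCquery(project = project,
                           data.category = "Transcriptome Profiling",
                           data.type = "miRNA Expression Quantification",
                           workflow.type = "BCGSC miRNA Profiling")
  }else if(Outdir == "CNV"){
    query_Data <- GDCquery(project = project,
                           data.category = "Copy Number Variation",
                           data.type = "Copy Number Segment")
  }else if(Outdir == "DNA_Methylation"){
    query_Data <- GDCquery(project = project,
                           data.category = "DNA methylation",
                           legacy = TRUE)
  }else{
    # Without this branch, query_Data would be undefined below.
    stop("Unknown Outdir: ", Outdir)
  }

  GDCdownload(query = query_Data,
              method = "api",
              files.per.chunk = 60,
              directory = Outdir)

  # Pass the same directory to GDCprepare; otherwise it looks under the
  # default root ("GDCdata") and cannot find the downloaded files.
  expdat <- GDCprepare(query = query_Data,
                       directory = Outdir)
  return(expdat)
}

dat_mRNA <- get_OmicsData(project = cancer_type,
                          Outdir = "mRNA")
saveRDS(dat_mRNA, file = "TCGA-KIRC_mRNA.RDS")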

Hope it would be helpful to you.

@tiagochst
Contributor

@HuaZou Thank you

@DzenisKoca

I am probably late with this answer. Windows limits the length of a path (including the file name) to 260 characters. For example, the maximum path on drive D is "D:\some 256-character path string". To solve this issue, try creating the project folder directly on the drive, or moving it there, e.g. "D:\path_to_project_that_is_less_than_60_characters_long", or use some other shorter path (less than 60 characters, since the path from the project folder to a data file is around 199 characters by default).
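
The length check described above can be sketched in base R. The default of 259 visible characters is an assumption derived from the 260-character limit including a terminating null character; `exceeds_windows_limit` and the `D:/proj` folder are hypothetical names used for illustration.

```r
# Hypothetical helper: flags paths longer than the assumed Windows limit.
# Assumption: 260 characters including the terminating null, so at most
# 259 visible characters.
exceeds_windows_limit <- function(paths, max_chars = 259L) {
  nchar(paths, type = "chars") > max_chars
}

project_dir <- "D:/proj"  # hypothetical short project folder
relative <- paste0(
  "GDCdata/TCGA-SKCM/harmonized/Copy_Number_Variation/Copy_Number_Segment/",
  "c70811e8-deb7-4a96-9d70-b322c15fe1a4/",
  "SPICS_p_TCGA_B_314_315_316_NSP_GenomeWideSNP_6_G06_1361494.grch38.seg.txt"
)
full <- file.path(project_dir, relative)

nchar(full)                  # total length of the full path
exceeds_windows_limit(full)  # FALSE for this short project folder
```

For files that already exist, the same function can be applied to `list.files("GDCdata", recursive = TRUE, full.names = TRUE)` after prefixing the absolute project folder.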

@alopehba

alopehba commented Feb 1, 2023

> I am probably late with this answer. Windows limits the length of a path (including the file name) to 260 characters. For example, the maximum path on drive D is "D:\some 256-character path string". To solve this issue, try creating the project folder directly on the drive, or moving it there, e.g. "D:\path_to_project_that_is_less_than_60_characters_long", or use some other shorter path (less than 60 characters, since the path from the project folder to a data file is around 199 characters by default).

I have the same problem. I tried this but still can't solve it. 555

@alopehba

alopehba commented Feb 2, 2023

> > I am probably late with this answer. Windows limits the length of a path (including the file name) to 260 characters. For example, the maximum path on drive D is "D:\some 256-character path string". To solve this issue, try creating the project folder directly on the drive, or moving it there, e.g. "D:\path_to_project_that_is_less_than_60_characters_long", or use some other shorter path (less than 60 characters, since the path from the project folder to a data file is around 199 characters by default).
>
> I have the same problem. I tried this but still can't solve it. 555

The funny thing is, I went through the whole 'GDCquery, GDCdownload, GDCprepare' sequence again today and found no difference from before in where GDCdata was stored: it was in the same directory as the script, and all the files in GDCdata were exactly the same. The Windows path had also been shortened to within 60 characters.
Yet it didn't work last time, and it did work this time.
The only difference is that last time I did not run GDCquery, GDCdownload and GDCprepare as one continuous sequence: I manually created an empty GDCdata folder in between, and when I ran GDCdownload the sub-folders and files were created inside it one after another. The contents of the files were exactly the same, but it just didn't recognize them, and the download ended with "reason 'No such file or directory'".
This time, I deleted the manually created 'GDCdata' folder and everything in it, and then strictly followed the 'GDCquery, GDCdownload, GDCprepare' process. Then the data was recognized.
