
Error when importing bqfetch #5

Open
marinaperezs opened this issue May 8, 2023 · 7 comments

@marinaperezs

Hi everyone !

I'm trying to read a big table from BigQuery using Python in Google Colab, and I found bqfetch. However, when I try to import BigQueryFetcher and BigQueryTable, I get an error.
I installed it by running:

!pip install bqfetch
!pip install -r requirements.txt

But when running the second command, I get this error:

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

Then, my code is this:

from bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("PROJECT", "DATASET", "TABLE")
fetcher = BigQueryFetcher('/path/to/bq_service_account.json', table)
chunks = fetcher.chunks('id', by_chunk_size_in_GB=2)

for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=1, verbose=True)

Am I doing something wrong? This is what I get:

[Screenshot 2023-05-08 at 16:23:49 showing the error]

Some help would be appreciated! I cannot run anything, so I can't load the table I need as a dataframe in Python :(

Thank you in advance!
Marina

@TristanBilot
Owner

Hello there,
Thank you for using bqfetch!

  1. To fix ERROR: Could not open requirements file:, you should create the requirements file locally to install the dependencies. Please create a requirements.txt file in your Google Colab environment and paste into it the content of the requirements file present in the repo.
  2. The import error is due to the way you are importing the module. Until now, the only way to import a class from bqfetch was from bqfetch.bqfetch import BigQueryFetcher. I created a new release where it is now possible to import more intuitively using from bqfetch import BigQueryFetcher.
    To fix this error, you can thus either use the legacy method with from bqfetch.bqfetch import .. or download and install the 1.1.0 release locally to use the new import system (see the sketch below).
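For reference, a minimal sketch of both import styles (the project, dataset, and table names, as well as the service account path, are placeholders taken from your own snippet):

# Legacy import style: works on every bqfetch version
from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

# New import style: requires the 1.1.0 release or later
# from bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("PROJECT", "DATASET", "TABLE")
fetcher = BigQueryFetcher('/path/to/bq_service_account.json', table)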

@marinaperezs
Author

Thank you so much for your quick answer!!

I need to read a table that has 122,602,779 rows as a dataframe in Google Colab to be able to use it with machine learning algorithms. Which parallel_backend do you recommend: billiard, joblib, or multiprocessing? And do you recommend fetching by number of chunks instead of by chunk size? What size / how many chunks?

Thank you!!

@TristanBilot
Owner

TristanBilot commented May 8, 2023

Before dividing your big dataset into chunks of relatively small size using one of the functions available in bqfetch, did you verify that your data are independent? Namely, that it is possible for you to divide the whole dataset into multiple batches that can be trained on independently.
If it is, then you can leverage this module.

Next, for the backend, I recommend using the default one, so there is no need to set the parallel_backend argument for the moment. If this backend leads to issues, then try one of the other available backends.

Then for the fetching, I recommend fetching by chunk size, as it lets you easily manage your memory and avoid memory overflows in your Colab environment. However, you need to specify an index column in your dataset on which the dataset can be partitioned. I give some examples of index columns in the README.
If it is hard for you to find a good index column, then try to fetch by number of chunks, estimating the chunk size with respect to the size of your dataset and your available memory. You can also start with a small chunk size and increase it gradually until you hit an overflow error.
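As a rough sketch of the chunk-size approach (the 'id' index column, the 1 GB starting size, and the placeholder names/paths are only examples; adapt them to your table and available memory):

from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("PROJECT", "DATASET", "TABLE")
fetcher = BigQueryFetcher('/path/to/bq_service_account.json', table)

# Start with a small chunk size and increase it progressively
# until you approach the memory limit of your Colab environment.
chunks = fetcher.chunks('id', by_chunk_size_in_GB=1, verbose=True)

for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=1, verbose=True)
    # process df here; drop the reference before fetching the next chunk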

Do not hesitate to ask if you have any other questions!

@marinaperezs
Author

marinaperezs commented May 8, 2023

What do you mean by my data being independent? What I wanted to do is read a BigQuery table as a dataframe in Python, but it seems to be too big, and if I connect to BQ and query the table directly it doesn't work. So my idea was to read it "by parts" so that I can then maybe download the parts as CSV and put them together to reconstruct my table and load it into Python.

I don't know if this is possible with bqfetch, but if not, do you know any other way to load a big BQ table into Python?

Also, I just got this error; I don't know if it has something to do with BQ or Colab:
[Screenshot 2023-05-08 at 19:25:23 showing the error]

@TristanBilot
Owner

TristanBilot commented May 8, 2023

I understand that you want to split your dataset into multiple parts; however, there is no point in reconstructing the whole dataframe, as you will still end up with a memory overflow because the dataframe is too large to fit on your machine. What you can do to deal with this large table is run the training loop on the parts of the data instead of the whole dataframe, as in mini-batch training.

To summarize:

  • ❌ Fetch all the data => train: this leads to memory overflow as you have too much data to process.
  • ✅ Fetch chunk 1/n of the data => train, ..., Fetch chunk n/n of the data => train: this is feasible, but each row in your dataset should be independent, meaning that running the training loop on all the data at once or running it on many small batches should produce approximately the same results. If your rows are not independent, then you need to provide a proper index column (see the sketch below).
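A minimal sketch of this chunk-by-chunk pattern, assuming a model that supports incremental training (scikit-learn's SGDRegressor and its partial_fit method are used here purely as an illustration; the 'target' column and the placeholder names/paths are assumptions, not part of bqfetch):

from sklearn.linear_model import SGDRegressor
from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("PROJECT", "DATASET", "TABLE")
fetcher = BigQueryFetcher('/path/to/bq_service_account.json', table)
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=1, verbose=True)

model = SGDRegressor()

for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=1, verbose=True)
    X = df.drop(columns=['target']).to_numpy(dtype=float)  # features must be numeric
    y = df['target'].to_numpy(dtype=float)                 # 'target' is a placeholder label column
    model.partial_fit(X, y)  # incremental training: one chunk at a time, never the full table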

@TristanBilot
Owner

Your error is about the BigQuery API; can you provide the full code you used to fetch? I think your chunk size is too big, so the fetching took more than 10 minutes and raised a timeout error.

@marinaperezs
Author

I'll think about what you said and probably come back with questions hahah, thank you!!!

This is my code:

from bqfetch.bqfetch import BigQueryFetcher
from bqfetch.bqfetch import BigQueryTable

table = BigQueryTable("gcpinv-230419-ifym1p4zudsrz4zu", "prueba_lstm", "results")
fetcher = BigQueryFetcher('/content/drive/MyDrive/TFG mio/gcpinv-230419-ifym1p4zudsrz4zu-699b17de7880.json', table)
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=2, verbose=True)

for chunk in chunks:
  df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)

[Screenshot 2023-05-08 at 19:43:14 showing the error]
