
Error when importing bqfetch #5

Open
marinaperezs opened this issue May 8, 2023 · 7 comments

@marinaperezs

Hi everyone !

I'm trying to read a big table from BigQuery using Python in Google Colab, and I found bqfetch. However, when I try to import BigQueryFetcher and BigQueryTable, I get an error.
I installed it by running:

!pip install bqfetch
!pip install -r requirements.txt

But when running the second command, I get this error:

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

Then, my code is this:

from bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("PROJECT", "DATASET", "TABLE")
fetcher = BigQueryFetcher('/path/to/bq_service_account.json', table)
chunks = fetcher.chunks('id', by_chunk_size_in_GB=2)

for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=1, verbose=True)

Am I doing something wrong? This is what I get:

[Screenshot 2023-05-08 at 16:23:49 showing the error]

Some help would be appreciated! I cannot run anything, so I can't load the table I need as a dataframe in Python :(

Thank you in advance!
Marina

@TristanBilot
Owner

Hello there,
Thank you for using bqfetch!

  1. To fix ERROR: Could not open requirements file:, you should create the requirements file locally to install the dependencies. Please create a requirements.txt file in your Google Colab environment and paste into it the content of the requirements file present in the repo.
  2. The import error is due to the way you are importing the module. Until now, the only way to import a class from bqfetch was from bqfetch.bqfetch import BigQueryFetcher. I created a new release where it is now possible to import more intuitively using from bqfetch import BigQueryFetcher.
    To fix this error, you can thus either use the legacy method with from bqfetch.bqfetch import .. or download and install the 1.1.0 release locally to use the new import system (see the sketch below).
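For reference, a minimal sketch of both import styles (the project, dataset, and table names, as well as the service account path, are placeholders taken from your own snippet):

# Legacy import style: works on every bqfetch version
from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

# New import style: requires the 1.1.0 release or later
# from bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("PROJECT", "DATASET", "TABLE")
fetcher = BigQueryFetcher('/path/to/bq_service_account.json', table)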

@marinaperezs
Author

Thank you so much for your quick answer!!

I need to read a table that has 122,602,779 rows as a dataframe in Google Colab to be able to use it with machine learning algorithms. Which parallel_backend do you recommend: billiard, joblib, or multiprocessing? And do you recommend fetching by number of chunks instead of by chunk size? What size / how many chunks?

Thank you!!

@TristanBilot
Owner

TristanBilot commented May 8, 2023

Before dividing your big dataset into chunks of relatively small size using one of the functions available in bqfetch, did you verify that your data are independent? Namely, that it is possible for you to divide the whole dataset into multiple batches that can be trained on independently.
If it is, then you can leverage this module.

Next, for the backend, I recommend using the default one, so there is no need to set the parallel_backend argument for the moment. If this backend leads to issues, then try one of the other available backends.

Then for the fetching, I recommend fetching by chunk size, as it lets you easily manage your memory and avoid memory overflows in your Colab environment. However, you need to specify an index column in your dataset on which the dataset can be partitioned. I give some examples of index columns in the README.
If it is hard for you to find a good index column, then try to fetch by number of chunks, estimating the chunk size with respect to the size of your dataset and your available memory. You can also start with a small chunk size and increase it gradually until you hit an overflow error.
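As a rough sketch of the chunk-size approach (the 'id' index column, the 1 GB starting size, and the placeholder names/paths are only examples; adapt them to your table and available memory):

from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("PROJECT", "DATASET", "TABLE")
fetcher = BigQueryFetcher('/path/to/bq_service_account.json', table)

# Start with a small chunk size and increase it progressively
# until you approach the memory limit of your Colab environment.
chunks = fetcher.chunks('id', by_chunk_size_in_GB=1, verbose=True)

for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=1, verbose=True)
    # process df here; drop the reference before fetching the next chunk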

Do not hesitate to ask if you have any other questions!

@marinaperezs
Author

marinaperezs commented May 8, 2023

What do you mean by my data being independent? What I wanted to do is read a BigQuery table as a dataframe in Python, but it seems to be too big, and if I connect to BQ and query the table directly it doesn't work. So my idea was to read it "by parts" so that I can then maybe download the parts as CSV and put them together to reconstruct my table and load it into Python.

I don't know if this is possible with bqfetch, but if not, do you know any other way to load a big BQ table into Python?

Also, I just got this error; I don't know if it has something to do with BQ or Colab:
[Screenshot 2023-05-08 at 19:25:23 showing the error]

@TristanBilot
Owner

TristanBilot commented May 8, 2023

I understand that you want to split your dataset into multiple parts; however, there is no point in reconstructing the whole dataframe, as you will still end up with a memory overflow because the dataframe is too large to fit on your machine. What you can do to deal with this large table is run the training loop on the parts of the data instead of the whole dataframe, as in mini-batch training.

To summarize:

  • ❌ Fetch all the data => train: this leads to memory overflow as you have too much data to process.
  • ✅ Fetch chunk 1/n of the data => train, ..., Fetch chunk n/n of the data => train: this is feasible, but each row in your dataset should be independent, meaning that running the training loop on all the data at once or running it on many small batches should produce approximately the same results. If your rows are not independent, then you need to provide a proper index column (see the sketch below).
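A minimal sketch of this chunk-by-chunk pattern, assuming a model that supports incremental training (scikit-learn's SGDRegressor and its partial_fit method are used here purely as an illustration; the 'target' column and the placeholder names/paths are assumptions, not part of bqfetch):

from sklearn.linear_model import SGDRegressor
from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("PROJECT", "DATASET", "TABLE")
fetcher = BigQueryFetcher('/path/to/bq_service_account.json', table)
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=1, verbose=True)

model = SGDRegressor()

for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=1, verbose=True)
    X = df.drop(columns=['target']).to_numpy(dtype=float)  # features must be numeric
    y = df['target'].to_numpy(dtype=float)                 # 'target' is a placeholder label column
    model.partial_fit(X, y)  # incremental training: one chunk at a time, never the full table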

@TristanBilot
Owner

Your error is about the BigQuery API; can you provide the full code you used to fetch? I think your chunk size is too big, so the fetching took more than 10 minutes and raised a timeout error.

@marinaperezs
Author

I'll think about what you said and probably come back with questions hahah, thank you!!!

This is my code:

from bqfetch.bqfetch import BigQueryFetcher
from bqfetch.bqfetch import BigQueryTable

table = BigQueryTable("gcpinv-230419-ifym1p4zudsrz4zu", "prueba_lstm", "results")
fetcher = BigQueryFetcher('/content/drive/MyDrive/TFG mio/gcpinv-230419-ifym1p4zudsrz4zu-699b17de7880.json', table)
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=2, verbose=True)

for chunk in chunks:
  df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)

[Screenshot 2023-05-08 at 19:43:14 showing the error]
