
Please read this carefully, as the download approach has changed slightly.

deweydatapy

Python library for Dewey Data Inc.

Find the release notes here: Release Notes.

Bug report: https://community.deweydata.io/c/help/python/43.

Explore data at https://www.deweydata.io/.

Underlying Amplify API tutorial: https://github.com/amplifydata/amplifydata-public/blob/main/README.md

Library tutorial

1. Create API Key

In the system, click Connections → Add Connection to create your API key.

As the message says, please make a copy of your API key and store it somewhere safe. Also, be sure to hit the Save button before use.

2. Get a product path

Choose your product and click Get / Subscribe → Connect to API to get the API endpoint (product path). Make a copy of it.

3. Install deweydatapy library

You can install this library directly from the GitHub source as follows.

pip install deweydatapy@git+https://github.com/Dewey-Data/deweydatapy

If you use PyCharm, go to [Python Packages] → [Add Package] → [From Version Control], select [Git], and enter

https://github.com/Dewey-Data/deweydatapy

Then import the library:

# Use deweydatapy library
import deweydatapy as ddp

The deweydatapy package has the following functions:

  • get_meta: gets meta information of the dataset, especially the date range, returned in a dict
  • get_file_list: gets the list of files in a DataFrame
  • download_files: downloads files from the file list to a destination folder
  • download_files0: downloads files with an API key and product path to a destination folder
  • download_files1: downloads files with an API key and product path to a destination folder (see the Examples below for the difference between download_files0 and download_files1)
  • read_sample: reads a sample of data from a file download URL
  • read_sample0: reads a sample of data from the first file with an API key and product path
  • read_local: reads data from a locally saved csv.gz file

4. Examples

I am going to use Advan Weekly Pattern as an example.

import deweydatapy as ddp

# API Key
apikey_ = "Paste your API key from step 1 here."

# Advan product path
pp_advan_wp = "Paste product path from step 2 here."

You will have only one API key, but a different product path for each product.
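Rather than pasting the key into your script, you may prefer to load it from an environment variable so it never ends up in version control. This is a general Python pattern, not part of deweydatapy; the variable name DEWEY_API_KEY here is my own choice.

```python
import os

# Read the API key from an environment variable instead of hard-coding it.
# DEWEY_API_KEY is an arbitrary name chosen for this example.
apikey_ = os.environ.get("DEWEY_API_KEY", "")

# Fall back to a visible placeholder so a missing variable is easy to spot.
if not apikey_:
    apikey_ = "<missing DEWEY_API_KEY>"
```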

As a first step, check out the meta information of the dataset by

meta = ddp.get_meta(apikey_, pp_advan_wp, print_meta = True)

This returns the meta information; print_meta = True also prints it.

You can see that the data has a partition column, DATE_RANGE_START. Dewey data is usually huge, and the data is partitioned by this column into multiple files. We can also see that the minimum available date is 2018-01-01 and the maximum available date is 2024-01-08. After checking this, I will download data between 2023-09-03 and 2023-12-31.

Next, collect the list of files to download by

files_df = ddp.get_file_list(apikey_, pp_advan_wp, 
                             start_date = '2023-09-03',
                             end_date = '2023-12-31',
                             print_info = True)



Be careful!! ------------------------------------------------

For a selected date range, the download server assigns file numbering (0, 1, 2, ...) to each file. Thus, if you use different date ranges (different start_date and end_date), the file names will change because of this numbering. For example, the following files_df1 and files_df2 will have different file names due to their different start_date values.

files_df1 = ddp.get_file_list(apikey_, pp_advan_wp, 
                              start_date = '2023-09-03',
                              end_date = '2023-12-31',
                              print_info = True)

files_df2 = ddp.get_file_list(apikey_, pp_advan_wp, 
                              start_date = '2023-07-01',
                              end_date = '2023-12-31',
                              print_info = True)

This also applies to the functions download_files0 and download_files1 (demonstrated below) in the same way.
------------------------------------------------------------
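One way to sidestep the shifting file numbering is to fetch the file list once for the widest range you need, then subselect locally using the partition_key column of files_df, so the server never re-numbers anything between runs. The following is a sketch with a mocked files_df, assuming partition_key holds ISO date strings; in practice files_df comes from ddp.get_file_list(...).

```python
import pandas as pd

# Mocked file list; real columns come from ddp.get_file_list(...).
files_df = pd.DataFrame({
    "partition_key": ["2023-07-03", "2023-09-04", "2023-12-25"],
    "link": ["https://example.com/0",
             "https://example.com/1",
             "https://example.com/2"],
})

# Subselect September onward locally -- no new server request,
# so the file numbering (and file names) stay stable.
sept_onward = files_df[files_df["partition_key"] >= "2023-09-01"]

# Then download only the subset:
# ddp.download_files(sept_onward, "C:/Temp", skip_exists = True)
```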

If you do not specify start_date, files will be collected from the minimum available date; if you do not specify end_date, files will be collected up to the maximum available date.

Most Dewey datasets are very large. Please specify start_date and end_date.

Setting print_info = True prints additional information about the files, like below:

files_df is a DataFrame of file links with the following information:

  • index: file index, ranging from 1 to the number of files
  • page: page of the file
  • link: file download link
  • partition_key: used to subselect files based on date
  • file_name
  • file_extension
  • file_size_bytes
  • modified_at
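Before downloading, the columns above can be used to check how much data you are about to pull. This is a sketch with a mocked files_df; in practice the DataFrame comes from ddp.get_file_list(...).

```python
import pandas as pd

# Mocked file list; real columns come from ddp.get_file_list(...).
files_df = pd.DataFrame({
    "file_name": ["part-0.csv.gz", "part-1.csv.gz"],
    "file_size_bytes": [750_000_000, 1_250_000_000],
})

# Total download size in GB.
total_gb = files_df["file_size_bytes"].sum() / 1e9
print(f"{len(files_df)} files, {total_gb:.1f} GB total")  # 2 files, 2.0 GB total
```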

Finally, you can download the data to a local destination folder by

ddp.download_files(files_df, "C:/Temp", skip_exists = True)

This will download files to C:/Temp directory, with the following progress messages.

If you attempt to download all the files again and want to skip files that have already been downloaded, set skip_exists = True. The default value is False (it was True in versions 0.1.x).

You can also use the filename_prefix option to add a prefix to all the file names. For example, the following will save all the files in the format advan_wp_xxxxxxx.csv.gz.

ddp.download_files(files_df, "C:/Temp", filename_prefix = "advan_wp_", skip_exists = True)

Alternatively, you can skip get_file_list and download files directly by

ddp.download_files0(apikey_, pp_advan_wp, "C:/Temp",
                    start_date = '2023-09-03', end_date = '2023-12-31')

or

ddp.download_files1(apikey_, pp_advan_wp, "C:/Temp",
                    start_date = '2023-09-03', end_date = '2023-12-31')

The difference between download_files0 and download_files1 is that download_files0 collects the full file list (links) upfront and then starts downloading. Since the links are valid for 24 hours, this may cause an interruption if the download takes more than 24 hours. download_files1, on the other hand, collects a small page (group) of file links, downloads them, then moves on to the next page, and so on. This keeps the collected links valid while downloading. It is therefore recommended to use download_files1 for a large number of files that may take more than 24 hours to download.
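As a rough rule of thumb (my own heuristic, not part of the library), you can estimate the wall-clock download time from the total size and your bandwidth, and prefer download_files1 whenever the estimate approaches the 24-hour link validity window:

```python
def estimated_hours(total_bytes: int, mbits_per_sec: float) -> float:
    """Rough wall-clock estimate for downloading total_bytes at a given rate."""
    seconds = total_bytes * 8 / (mbits_per_sec * 1_000_000)
    return seconds / 3600

# e.g. 900 GB at 100 Mbit/s -> 20 hours: close to the 24-hour window,
# so download_files1 is the safer choice here.
hours = estimated_hours(900 * 10**9, 100)
```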

Some datasets do not have a partition column because they are time invariant (SafeGraph Global Places (POI) & Geometry, for example).

meta = ddp.get_meta(apikey_, pp_sg_poipoly, print_meta = True)

There is no partition column, and the minimum and maximum dates are not available. In that case, you can download the data without specifying a date range.

files_df = ddp.get_file_list(apikey_, pp_sg_poipoly, print_info = True)



You can quickly load/see a sample data by

sample_df = ddp.read_sample(files_df['link'][0], nrows = 100)

This loads the first 100 rows of the first file in files_df (files_df['link'][0]). You can sample any file in the list this way.

You can also see the sample of the first file by

sample_data = ddp.read_sample0(apikey_, pp_advan_wp, nrows = 100)

This will load the first 100 rows for the first file of Advan data.

You can open a downloaded local file (a csv.gz or csv file) by

sample_local = ddp.read_local("C:/Temp/Weekly_Patterns_Foot_Traffic_Full_Historical_Data-0-DATE_RANGE_START-2023-09-04.csv.gz",
                              nrows = 100)
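Note that pandas can also read gzip-compressed CSVs directly, so if you need options read_local does not expose (dtype control, chunking, etc.), you can open the same file with plain pandas. A self-contained sketch that writes a tiny csv.gz to a temporary directory and reads it back (the column names are made up for the example):

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny csv.gz so the example is self-contained.
tmp = Path(tempfile.mkdtemp()) / "sample.csv.gz"
with gzip.open(tmp, "wt") as f:
    f.write("placekey,visits\nabc-123,10\ndef-456,25\n")

# pandas infers gzip compression from the .gz extension.
df = pd.read_csv(tmp, nrows=100)
```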

Thanks
