Access raw data via Amazon S3 and Kinesis

Parse.ly Raw Data

This repository contains Python example code for working with raw data delivered by Parse.ly's fully-managed Data Pipeline product.

This repository is a suite of tools, most of them usable from the command line, that make it easy to evaluate and integrate the raw data.

Customers can use this repository to:

  • gain batch and streaming access to the raw data that Parse.ly collects from their sites; streaming access is provided via Amazon Kinesis Streams, and batch access via Amazon S3
  • generate schemas and DDL for common data warehousing tools, such as Redshift, BigQuery, and Apache Spark
  • create data samples that can be evaluated using in-memory analyst tools such as Excel or R Studio (xlsx/csv samples)
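
As a rough illustration of the last point, the sketch below turns a handful of events into a CSV sample that can be opened in Excel or R Studio. It assumes the raw events arrive as newline-delimited JSON, and the field names `action` and `url` are hypothetical placeholders rather than the documented schema; it does not use this package's own `samples` module.

```python
import csv
import io
import json


def events_to_csv(lines, columns):
    """Convert newline-delimited JSON events into CSV text.

    `lines` is any iterable of JSON strings, one event per line.
    `columns` selects which keys become CSV columns; keys missing
    from an event are written as empty cells, extra keys are ignored.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        writer.writerow(json.loads(line))
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical events; real raw data will carry different fields.
    sample = [
        '{"action": "pageview", "url": "http://example.com/a"}',
        '{"action": "heartbeat", "url": "http://example.com/a"}',
    ]
    print(events_to_csv(sample, ["action", "url"]))
```

The same function works on an open file handle, since iterating over a text file yields one line at a time.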

To make use of raw data, you must be a customer of Parse.ly's Data Pipeline product. To gain access for your account, please contact Parse.ly directly.

You can download this repository by cloning it from GitHub, e.g.

$ git clone <repository_url>

Or, you can install it into an environment with pip, e.g.

$ pip install parsely_raw_data

The files in this package are named for the services they interface with. You can run the modules directly to use the command-line tools they provide, or import them into your own Python scripts.
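
A minimal sketch of the import route, assuming only the package name `parsely_raw_data` and the module names listed in this README. The helper `load_tool` is hypothetical (it is not part of the package) and simply reports whether a module can be imported in the current environment; what each module exposes once imported is not shown here.

```python
import importlib

PACKAGE = "parsely_raw_data"


def load_tool(name):
    """Return the submodule `parsely_raw_data.<name>`, or None if it cannot be imported."""
    try:
        return importlib.import_module(f"{PACKAGE}.{name}")
    except ImportError:
        # Covers both "package not installed" and "no such submodule".
        return None


if __name__ == "__main__":
    for name in ("s3", "redshift", "bigquery"):
        module = load_tool(name)
        status = "available" if module is not None else "not importable"
        print(f"{PACKAGE}.{name}: {status}")
```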

Module and CLI Guide

If you have the project installed with pip, you can use the following console scripts anywhere:

  • parsely_redshift = parsely_raw_data.redshift
  • parsely_bigquery = parsely_raw_data.bigquery
  • parsely_s3 = parsely_raw_data.s3
  • parsely_stream =
  • parsely_schema = parsely_raw_data.docgen
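
To check which of these console scripts are actually on your `PATH` after a pip install, something like the following may help. It is a sketch that relies only on the script names listed above; the helper `find_console_scripts` is hypothetical and not part of the package.

```python
import shutil

# Console script names as listed in this README.
CONSOLE_SCRIPTS = (
    "parsely_redshift",
    "parsely_bigquery",
    "parsely_s3",
    "parsely_stream",
    "parsely_schema",
)


def find_console_scripts(names=CONSOLE_SCRIPTS):
    """Map each script name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_console_scripts().items():
        print(f"{name}: {path or 'not found'}")
```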

Alternatively, you can clone the repo and run each module from within the repo directory, like this:

cd <path_to_parsely_raw_data_repo_directory>

  • python -m parsely_raw_data.samples: Generate data samples in CSV and XLSX format
  • python -m parsely_raw_data.s3: Fetch archived event data from S3 Bucket
  • python -m <stream_module>: Consume a Kinesis Stream of real-time event data
  • python -m parsely_raw_data.schema: Inspect schemas for Redshift, BigQuery, and Spark
  • python -m parsely_raw_data.redshift: Create an Amazon Redshift table for events and load data
  • python -m parsely_raw_data.bigquery: Create a Google BigQuery table for events and load data
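
If you would rather drive these modules from a script than type the commands by hand, the `cd` plus `python -m` pattern above can be sketched with `subprocess`. The helpers `module_command` and `run_module` are hypothetical names introduced for illustration, and any extra arguments are passed through untouched, since the options each module accepts are not described here.

```python
import subprocess
import sys


def module_command(module, *extra_args):
    """Build the argv list equivalent to `python -m parsely_raw_data.<module>`."""
    return [sys.executable, "-m", f"parsely_raw_data.{module}", *extra_args]


def run_module(module, repo_dir, *extra_args):
    """Run a module from inside the cloned repo directory.

    Returns the CompletedProcess so the caller can inspect `returncode`.
    """
    return subprocess.run(
        module_command(module, *extra_args),
        cwd=repo_dir,  # same effect as `cd <repo directory>` first
        check=False,
    )


if __name__ == "__main__":
    # Only print the command here; call run_module() with your own repo path.
    print(" ".join(module_command("samples")))
```

Using `sys.executable` keeps the child process on the same interpreter (and virtual environment) as the calling script.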

Creating a New Version

Follow these steps when releasing a new version of this library:

  • Increment the version number according to semantic versioning rules
  • git commit -m 'increment version'
  • git tag x.x.x where x.x.x is the new version number
  • git push origin master --tags
  • Create a new release for the new tag on GitHub, noting any relevant changes
  • Push to PyPI with python setup.py sdist upload
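
The version bump and git steps above can be sketched as a small helper. This assumes plain `MAJOR.MINOR.PATCH` version strings with no pre-release suffix; `bump_version` and `release_commands` are hypothetical names, and the script only builds the git commands from the list above without running them.

```python
def bump_version(version, part):
    """Return `version` with the given part ("major", "minor" or "patch") incremented."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


def release_commands(new_version):
    """List the git commands from the release steps, in order."""
    return [
        ["git", "commit", "-m", "increment version"],
        ["git", "tag", new_version],
        ["git", "push", "origin", "master", "--tags"],
    ]


if __name__ == "__main__":
    # "1.2.3" is an arbitrary example version, not this library's actual version.
    new_version = bump_version("1.2.3", "minor")
    for command in release_commands(new_version):
        print(" ".join(command))
```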