# Notebook Summary 

This is a sample Databricks notebook demonstrating how to use the ETIQ library to run data analyses on a Spark dataset.

### Quickstart

  1. Install and import the etiq library with the Spark extension

  2. Log in to the dashboard so that you can send results to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai).

  3. Create or open a project 
  
### Data Issues


  4. Load the New York Yellow Taxi Trips data
  
  5. Scan for data issues. In this case we limit the scan to ordering issues, i.e. rows where the pickup time is recorded as occurring after the drop-off time.

In [None]:
# Install the spark extension for etiq. This will install the etiq base package as a dependency
%pip install etiq-spark

In [None]:
# Import the spark extensions for etiq
import etiq.spark

In [None]:
# Log in to the etiq dashboard (replace <your-key> with your own API key)
from etiq import login as etiq_login
etiq_login("https://dashboard.etiq.ai/", "<your-key>")

In [None]:
# Create an ETIQ project for our analysis
project = etiq.projects.open(name="NYC Yellow Taxi Trips")


## Load the NY Yellow Taxi Trips Data

In [None]:
# Load the NY yellow taxi trips data into a Spark dataframe
yellow_taxi_trips = spark.read.load("dbfs:/databricks-datasets/nyctaxi/tables/nyctaxi_yellow")
yellow_taxi_trips.show()

In [None]:
# Create etiq dataset from the dataframe
yellow_taxi_trips_dataset = etiq.spark.SimpleSparkDatasetBuilder.datasets(validation_features=yellow_taxi_trips,
                                                                          label='tip_amount',
                                                                          cat_col = ['payment_type', 'rate_code_id', 'store_and_fwd_flag', 'vendor_id'],
                                                                          date_col = ['dropoff_datetime', 'pickup_datetime'],
                                                                          name='NY Yellow Taxi Trips')
# Create a snapshot (containing the dataset) under the previously created project
snapshot = project.snapshots.create(name="Data Issues",
                                    dataset=yellow_taxi_trips_dataset,
                                    model=None)

## Find data issues

In [None]:
# Scan the snapshot for data issues.
# We limit the scan to issues where pickup_datetime is recorded as occurring after dropoff_datetime
(segments, issues, issue_summary) = snapshot.scan_data_issues(orderings=[('pickup_datetime', 'dropoff_datetime')], 
                                                              filter_ids=[], 
                                                              duplicate_features_subset=[])
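
To make the ordering check concrete, here is a minimal pure-Python sketch of what an ordering issue between `pickup_datetime` and `dropoff_datetime` looks like. It is an illustration only, not how etiq implements the scan: the `find_ordering_violations` helper and the sample rows are hypothetical, and treating equal timestamps as valid (a strict comparison) is an assumption.

```python
from datetime import datetime

# Hypothetical sample rows, made up for illustration.
# They do not come from the NYC taxi dataset.
trips = [
    {"pickup_datetime": datetime(2019, 1, 1, 8, 0),
     "dropoff_datetime": datetime(2019, 1, 1, 8, 25)},   # valid ordering
    {"pickup_datetime": datetime(2019, 1, 1, 9, 30),
     "dropoff_datetime": datetime(2019, 1, 1, 9, 10)},   # pickup after drop-off
    {"pickup_datetime": datetime(2019, 1, 1, 10, 0),
     "dropoff_datetime": datetime(2019, 1, 1, 10, 0)},   # equal timestamps
]


def find_ordering_violations(rows, earlier, later):
    """Return rows where the `earlier` column is strictly after the `later` column.

    Hypothetical helper: assumes equal values are not a violation.
    """
    return [row for row in rows if row[earlier] > row[later]]


violations = find_ordering_violations(trips, "pickup_datetime", "dropoff_datetime")
print(f"{len(violations)} of {len(trips)} rows violate the expected ordering")
```

The pair passed to the helper names the column expected to come first, then the column expected to come second, mirroring the `('pickup_datetime', 'dropoff_datetime')` tuple used in the scan above.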