## What Is a Dataset?

A dataset is a collection of data points with a common schema. The Cortex Python SDK provides transformations and visualizations to facilitate data cleaning, feature identification, and feature construction. This notebook demonstrates how to build a dataset and how to view its contents. Install `cortex-python`, along with `cortex-python[builders]` for builder functionality and `cortex-python[viz]` for visualizations. This example also requires `numpy`, which is not installed with `cortex-python`.

In [None]:
!pip install cortex-python[builders,viz]
!pip install numpy
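
To make "a common schema" concrete, here is a minimal sketch in plain Python that does not use the Cortex SDK. The records and the helper names `schema_of` and `has_common_schema` are hypothetical and exist only for illustration: every data point carries the same fields with the same value types.

```python
# Hypothetical records: each dict is one data point.
records = [
    {"id": 1, "name": "alpha", "score": 0.5},
    {"id": 2, "name": "beta", "score": 1.25},
    {"id": 3, "name": "gamma", "score": -0.75},
]


def schema_of(record):
    """Return the record's schema as a mapping of field name to type name."""
    return {field: type(value).__name__ for field, value in record.items()}


def has_common_schema(rows):
    """Return True when every row has the same fields with the same value types."""
    if not rows:
        return True
    first = schema_of(rows[0])
    return all(schema_of(row) == first for row in rows[1:])


schema = schema_of(records[0])
print(schema)
print(has_common_schema(records))
```

A record with a missing field or a differently typed value would make `has_common_schema` return `False`, which is the kind of inconsistency data cleaning is meant to remove.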

## How Is a Dataset Built?
First, import the Cortex library and instantiate a builder.

In [None]:
from cortex import Cortex
import numpy as np
import pandas as pd

builder = Cortex.local().builder()


Builder is the top-level factory object in the Cortex Python SDK. The builder returns a factory object that is customized to handle the context for the particular class it builds. A dataset requires a collection of data to be useful, so the factory object returns a dataset builder that can take data in a number of different forms.

For example, you can associate a CSV file with a dataset:

In [None]:
csv_data_set_builder = builder.dataset('ds01')

csv_example_data_set = csv_data_set_builder.from_csv('./data/sample.csv').build()

Or associate a JSON file with a dataset:

In [None]:
json_data_set_builder = builder.dataset('ds02')

json_example_data_set = json_data_set_builder.from_json('./data/sample.json').build()

Or from a pandas DataFrame:

In [None]:
# two columns of random numbers, indexed a through e
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
q = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

# make a data frame by composing the columns together and labeling them
pdf = pd.DataFrame({'c1':s,'c2':q})

pd_data_set_builder = builder.dataset('ds03')

data_frame_data_set = pd_data_set_builder.from_df(pdf).build()
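
The factory-and-builder arrangement used in the cells above can be sketched in plain Python. This is not the Cortex implementation: the classes `SketchBuilder`, `SketchDatasetBuilder`, and `SketchDataset`, and the methods `from_csv_text` and `from_json_text`, are hypothetical names chosen for illustration. The sketch only shows the shape of the pattern, where a top-level factory hands out a specialized builder that accepts data in several forms and finishes with `build()`.

```python
import csv
import io
import json


class SketchDataset:
    """Hypothetical stand-in for a built dataset: a name plus its rows."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows


class SketchDatasetBuilder:
    """Hypothetical dataset builder that accepts data in several forms."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def from_csv_text(self, text):
        # Parse CSV text with a header row into a list of dicts.
        self.rows = list(csv.DictReader(io.StringIO(text)))
        return self  # returning self allows call chaining

    def from_json_text(self, text):
        # Expect a JSON array of objects.
        self.rows = json.loads(text)
        return self

    def build(self):
        return SketchDataset(self.name, self.rows)


class SketchBuilder:
    """Hypothetical top-level factory that hands out specialized builders."""

    def dataset(self, name):
        return SketchDatasetBuilder(name)


builder_sketch = SketchBuilder()
csv_ds = builder_sketch.dataset("sketch01").from_csv_text("a,b\n1,2\n3,4\n").build()
json_ds = builder_sketch.dataset("sketch02").from_json_text('[{"a": 1, "b": 2}]').build()
print(csv_ds.name, csv_ds.rows)
print(json_ds.name, json_ds.rows)
```

Each `from_...` method returns the builder itself, which is what makes a chained call ending in `.build()` possible.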

You can also set the title and description of your dataset:

In [None]:
csv_example_data_set.title = 'A Title for the example'
csv_example_data_set.description = 'A somewhat longer piece of text that describes the purpose of the dataset.'

Once constructed, the dataset can be explicitly persisted.

In [None]:
csv_example_data_set.save()

Note that with the `Cortex.local()` client, the dataset is persisted to the local disk. When using the Cortex client `Cortex.client()`, the dataset is persisted in Cortex.

## Dataset Feature Construction

Datasets help in feature construction through the use of pipelines. Pipelines allow functions to be chained together to modify and combine columns to create and clarify new features in the dataset. To find out how to create and persist pipelines, see [Pipeline](https://docs.cortex.insights.ai/docs/cortex-python-sdk-guide/pipeline/).
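
As a rough illustration of chaining functions to modify and combine columns, the sketch below uses pandas' own `DataFrame.pipe` rather than the Cortex pipeline API. The step functions `drop_missing` and `add_ratio` and the sample values are hypothetical.

```python
import pandas as pd


def drop_missing(df):
    # Cleaning step: remove rows that contain missing values.
    return df.dropna()


def add_ratio(df):
    # Feature construction step: combine two columns into a new one.
    return df.assign(ratio=df["c1"] / df["c2"])


raw = pd.DataFrame({"c1": [1.0, 4.0, None], "c2": [2.0, 8.0, 3.0]})

# Each step receives the output of the previous one.
features = raw.pipe(drop_missing).pipe(add_ratio)
print(features)
```

Because both steps return new DataFrames, `raw` is left unchanged and the chain can be rerun safely.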

## View Datasets

Datasets can be viewed in tables or through visualizations. 

### Data Dictionary
A dataset can generate a data dictionary.

### pandas DataFrame

Datasets can also generate pandas DataFrames. 

In [None]:
jdf = json_example_data_set.as_pandas()

pandas DataFrames provide several methods for [viewing data](https://pandas.pydata.org/pandas-docs/stable/10min.html#viewing-data).

In [None]:
jdf.tail()
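
A few of these viewing methods are sketched below on a small stand-in DataFrame, since the contents of the sample JSON file are not shown here. The frame `sample_df` and its column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Stand-in data: 8 rows of random numbers in two columns.
rng = np.random.default_rng(0)
sample_df = pd.DataFrame(rng.standard_normal((8, 2)), columns=["c1", "c2"])

first_rows = sample_df.head(3)   # first 3 rows
last_rows = sample_df.tail()     # last 5 rows by default
summary = sample_df.describe()   # count, mean, std, min, quartiles, max

print(first_rows)
print(last_rows)
print(summary)
```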

### With Visualizations 

Datasets come with built-in visualizations. Visualizations require a DataFrame, which is most commonly constructed by running a pipeline on the dataset:

In [None]:
clean_csv_pl = csv_example_data_set.pipeline('clean_csv_pl')

def drop_unused(pipeline, df):
    # Drop the columns that are not needed downstream (modifies df in place)
    df.drop(columns=['b', 'c'], inplace=True)

clean_csv_pl.add_step(drop_unused)
cleaned_csv_df = clean_csv_pl.run(csv_example_data_set.as_pandas())
cleaned_csv_df

In [None]:
v = csv_example_data_set.visuals(cleaned_csv_df)

In [None]:
v.show_corr_heatmap()

In [None]:
v.show_corr('a')

In [None]:
v.show_corr_pairs('e')

In [None]:
v.show_dist('e')

In [None]:
v.show_probplot('e')
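
To see the kind of numbers a correlation plot summarizes, the sketch below computes a Pearson correlation matrix directly with pandas. It is an assumption that the SDK's correlation visualizations are based on the same quantity, and the frame `toy` with its values is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises exactly with a
    "e": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls exactly as a rises
})

corr = toy.corr()                   # Pearson correlation by default
corr_with_a = corr["a"].drop("a")   # correlation of every other column with a

print(corr.round(2))
print(corr_with_a.round(2))
```

Values near 1 or -1 indicate a strong linear relationship between two columns, and values near 0 indicate little linear relationship.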