## Data Understanding
Handling and exploring data efficiently is essential for machine learning projects. In this notebook, we'll read from a table and compute summary statistics to explore and understand the dataset.

### 1. Prerequisites
The table we want to explore is the Delta Sharing data product table **`cashflow`**. To find it, open `Unity Catalog`, navigate to `Delta Shares Received`, and locate **`<BDC_SHARE_NAME>`**.

Run the following code to start a SparkSession and read from the desired table. Replace the placeholder `<TABLE_PATH>` with the full path to the data product table. Table paths in Databricks follow the pattern `catalog.schema.table`.

In [0]:
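# start spark session (in a Databricks notebook this returns the existing session)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()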

In [0]:
# load the data product table into a Spark DataFrame
df = spark.read.table("<TABLE_PATH>")
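
The `catalog.schema.table` pattern described above can be assembled from its three parts. The following is a minimal sketch, not part of the original notebook: the helper `build_table_path` and the names `my_catalog` and `my_schema` are hypothetical placeholders, and only the table name `cashflow` comes from this notebook.

```python
def build_table_path(catalog: str, schema: str, table: str) -> str:
    """Join the three name levels into the form catalog.schema.table."""
    parts = [catalog, schema, table]
    for part in parts:
        # Reject empty values and template placeholders that were not replaced,
        # such as "<TABLE_PATH>" or "<BDC_SHARE_NAME>".
        if not part or part.startswith("<"):
            raise ValueError(f"Invalid name part: {part!r}")
    return ".".join(parts)


# Hypothetical catalog and schema names; replace them with your own values.
table_path = build_table_path("my_catalog", "my_schema", "cashflow")
print(table_path)

# In a Databricks notebook, where `spark` is available:
# df = spark.read.table(table_path)
```

Checking for unreplaced placeholders up front gives a clearer error than the one Spark would raise when it fails to resolve the table.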

Let's have a look at the table itself.

In [0]:
# get preview of data
display(df)

### 2. Explore Data with Summary Stats
In a notebook, you have several options for viewing summary statistics for your dataset, including:
* using the Spark DataFrame's built-in methods (e.g. `summary()`)
* using Databricks' utility methods (e.g. `dbutils.data.summarize()`)
* using Databricks' built-in data profiler and visualizations
* using external libraries such as `matplotlib`

The first and simplest way is to use Spark's `summary` function.

In [0]:
# display summary statistics with spark
display(df.summary())
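
Called without arguments, `summary()` computes count, mean, stddev, min, the 25%, 50% and 75% percentiles, and max. It also accepts the names of the statistics you want. As a rough sketch, the statistics could be limited to a few values and to numeric columns only. The helpers `numeric_columns` and `numeric_summary` below are hypothetical names introduced for illustration, and the type check assumes the type names that Spark reports in `df.dtypes`.

```python
# Type names as they appear in df.dtypes for Spark's numeric types.
# Decimal columns appear as e.g. "decimal(10,2)" and are matched by prefix.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(df):
    """Return the names of all numeric columns, based on df.dtypes."""
    return [
        name
        for name, dtype in df.dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


def numeric_summary(df, stats=("count", "mean", "stddev", "min", "50%", "max")):
    """Compute the selected statistics for the numeric columns only."""
    cols = numeric_columns(df)
    if not cols:
        raise ValueError("the DataFrame has no numeric columns to summarize")
    return df.select(*cols).summary(*stats)


# In a Databricks notebook, with `df` loaded as above:
# display(numeric_summary(df))
```

Restricting the summary to numeric columns keeps the output compact, since statistics such as mean and stddev are null for string columns anyway.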

Another way to display summary statistics is Databricks' `summarize` function. The **`summarize`** function automatically generates a comprehensive report for the DataFrame. This report covers key statistics, data types, and the presence of missing values, providing a holistic view of the dataset.

The generated summary is interactive and lets you sort the features by various criteria:

* Feature Order
* Non-Uniformity
* Alphabetical
* Amount Missing/Zero

Furthermore, you can use the data type information to filter for specific data types for in-depth analysis. This lets you tailor the charts to your analysis, gain a deeper understanding of the DataFrame, and extract valuable insights.

In [0]:
# display summary statistics with Databricks utilities
dbutils.data.summarize(df)