## Data Understanding

> # &#x261D;
> This exercise is **optional**. If you skip it, you can still complete the use case `Cashflow Forecast`. Its purpose is to introduce the built-in tools of SAP Databricks that support a data scientist's
> typical data understanding workflow.

Handling and exploring data efficiently is essential in machine learning projects. In this notebook, we read from tables and compute statistics to explore and understand the dataset, using the shared data product **`Cashflow`**.

### 1. Install and import packages

In [0]:
%pip install ydata-profiling[pyspark]==4.17.0
%restart_python

In [0]:
from ydata_profiling import ProfileReport
import plotly.express as px

### 2. Load data from data product `Cashflow` 
# &#x270D;
The table we want to explore is the Delta-shared data product table `cashflow`. To find it, open the `Unity Catalog` and navigate to `Delta Shares Received`.

Run the following code to start a SparkSession and read from the table. Replace the table path passed to `spark.read.table` (the `<TABLE_PATH>` value) with the full path to the data product table. The path in Databricks follows the structure `share.schema.table`.

![find_cashflow_dataproduct.png](../../images/find_cashflow_dataproduct.png)

In [0]:
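# start Spark session (on Databricks, getOrCreate() returns the notebook's existing session)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()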

In [0]:
# load data --> Shared Data Product : Cashflow 
df = spark.read.table("bdc_share_cash_flow.cashflow.cashflow")
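
As a minimal sketch, the three-part path can be assembled from its components before it is passed to `spark.read.table`. The helper `build_table_path` is hypothetical and not part of any Databricks API; the example values are the ones used in the cell above.

```python
def build_table_path(share: str, schema: str, table: str) -> str:
    """Join the three name parts into a `share.schema.table` path (hypothetical helper)."""
    parts = [share, schema, table]
    # Reject empty parts or parts that already contain a dot, which would corrupt the path.
    for part in parts:
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    return ".".join(parts)


TABLE_PATH = build_table_path("bdc_share_cash_flow", "cashflow", "cashflow")
print(TABLE_PATH)
```

`TABLE_PATH` can then replace the string literal, as in `spark.read.table(TABLE_PATH)`.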

Let's have a look at the table itself.

In [0]:
# get preview of data
display(df)

While using notebooks, you have various options to view summary statistics for your dataset. Some of the options are:
* using the Spark DataFrame's built-in methods (e.g. `summary()`)
* using Databricks' utility methods (e.g. `dbutils.data.summarize()`)
* using Databricks' built-in data profiler/visualizations
* using external libraries such as `matplotlib`

### 3. Explore Data with Summary Stats
The first and simplest way is to use Spark's `summary` function.

In [0]:
# display summary statistics with spark
display(df.summary())
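
To make the rows of the `summary()` output easier to read, the following sketch recomputes the same kinds of statistics (count, mean, sample standard deviation, min, quartiles, max) for a small made-up column in plain Python. The values in `amounts` and the helper `nearest_rank_percentile` are illustrative assumptions; Spark computes approximate percentiles on the real table, so its numbers may differ slightly from an exact calculation.

```python
import math
import statistics

# Made-up sample values standing in for one numeric column (assumption, not real Cashflow data).
amounts = [120.0, 80.0, 100.0, 60.0, 140.0]


def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile, which is always one of the observed values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "stddev": statistics.stdev(amounts),  # sample standard deviation
    "min": min(amounts),
    "25%": nearest_rank_percentile(amounts, 25),
    "50%": nearest_rank_percentile(amounts, 50),
    "75%": nearest_rank_percentile(amounts, 75),
    "max": max(amounts),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

In Spark itself, `summary()` also accepts statistic names as arguments, for example `df.summary("count", "min", "max")`, to restrict the output to those rows.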

### 4. Explore data with Databricks _summarize()_ function
Another way of displaying summary statistics is to use Databricks' `summarize` function. The `summarize` function automatically generates a comprehensive report for the DataFrame. This report covers key statistics, data types, and the presence of missing values, providing a holistic view of the dataset.

Within this generated summary, the interactive features allow you to sort the information by various criteria:

* Feature Order
* Non-Uniformity
* Alphabetical
* Amount Missing/Zero

Furthermore, you can use the data type information to filter the report to specific data types for in-depth analysis. This lets you create charts tailored to your analysis and gain a deeper understanding of the DataFrame.

In [0]:
dbutils.data.summarize(df)
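
As a rough illustration of what a sort by missing or zero values ranks, the sketch below computes the share of missing or zero entries per column for a few made-up rows. The data, the helper `missing_or_zero_share`, and the exact definition of the share are assumptions for illustration; how `summarize` computes its figures internally is not described here.

```python
# Made-up rows standing in for a small table (assumption, not real Cashflow data).
rows = [
    {"amount": 120.0, "currency": "EUR", "discount": 0.0},
    {"amount": None, "currency": "EUR", "discount": 0.0},
    {"amount": 80.0, "currency": None, "discount": 5.0},
    {"amount": 0.0, "currency": "USD", "discount": None},
]


def missing_or_zero_share(rows, column):
    """Fraction of rows in which the column is missing (None) or equal to zero."""
    flagged = sum(1 for row in rows if row[column] is None or row[column] == 0)
    return flagged / len(rows)


shares = {column: missing_or_zero_share(rows, column) for column in rows[0]}

# Sort columns so the ones with the most missing/zero values come first.
ranking = sorted(shares, key=shares.get, reverse=True)
print(ranking)
```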

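### 5. Explore Data with `ydata-profiling`

The `ProfileReport` class from the external library `ydata-profiling` (installed in step 1) generates a detailed HTML profile of the data. Note that `toPandas()` collects the full table onto the driver, so this approach suits tables that fit in memory.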

In [0]:
report = ProfileReport(
    df.toPandas(),
    title="Cashflow",
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    },
)

In [0]:
report_html = report.to_html()
displayHTML(report_html)
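
The report configuration above enables both Pearson and Spearman correlations. The following sketch, on made-up values, illustrates why the two can differ: Pearson measures linear association, while Spearman is the Pearson correlation of the ranks and therefore reaches 1 for any strictly increasing relationship. The helpers `pearson` and `ranks` are illustrative, and `ranks` assumes there are no tied values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


# Made-up data: y grows with x, but not linearly (y = x ** 3).
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_r = pearson(ranks(x), ranks(y))
print(f"pearson={pearson_r:.3f}, spearman={spearman_r:.3f}")
```

A clear gap between the two coefficients in the report can hint at a monotonic but non-linear relationship between two columns.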