# Data Exploration
- This notebook performs exploratory data analysis on the dataset.
- To expand on the analysis, attach this notebook to a cluster with runtime version **14.3.x-cpu-ml-scala2.12**,
edit [the options of ydata-profiling (formerly pandas-profiling)](https://pandas-profiling.ydata.ai/docs/master/rtd/pages/advanced_usage.html), and rerun it.
- Explore completed trials in the [MLflow experiment](#mlflow/experiments/253734152472098).

In [0]:
import mlflow
import os
import uuid
import shutil
import pandas as pd
import databricks.automl_runtime

# Download input data from mlflow into a pandas DataFrame
# Create temporary directory to download data
temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], "tmp", str(uuid.uuid4())[:8])
os.makedirs(temp_dir)

# Download the artifact and read it
training_data_path = mlflow.artifacts.download_artifacts(run_id="8f378f5d11fd41068a6e15c009871525", artifact_path="data", dst_path=temp_dir)
df = pd.read_parquet(os.path.join(training_data_path, "training_data"))

# Delete the temporary data
shutil.rmtree(temp_dir)

target_col = "Severity"

# Drop the helper columns that AutoML added, so they are not profiled
df = df.drop(columns=['_automl_split_col_0000', '_automl_sample_weight_0000'])

# Convert columns detected to be of semantic type datetime
datetime_columns = ["date"]
df[datetime_columns] = df[datetime_columns].apply(pd.to_datetime, errors="coerce")

# Convert columns detected to be of semantic type numeric
numeric_columns = ["current_value", "previous_value"]
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")
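
The download step in the cell above creates a temporary directory by hand and removes it with `shutil.rmtree` once the data is read. As a minimal sketch of an alternative, the standard library's `tempfile.TemporaryDirectory` removes the directory even when reading fails partway. The small frame and the CSV file here are hypothetical stand-ins for the MLflow artifact, so the example runs without a cluster.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the downloaded artifact: a tiny frame written locally.
sample = pd.DataFrame({"current_value": [1.0, 2.0], "previous_value": [0.5, 1.5]})

with tempfile.TemporaryDirectory() as tmp_path:
    data_file = os.path.join(tmp_path, "training_data.csv")
    sample.to_csv(data_file, index=False)  # stands in for the artifact download
    loaded = pd.read_csv(data_file)        # read fully into memory before cleanup

# The directory is removed when the with-block exits, even if reading had raised.
print(os.path.exists(tmp_path))
print(loaded.shape)
```

The DataFrame must be fully loaded inside the `with` block, since the files no longer exist afterwards.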

## Semantic Type Detection Alerts

For details about the definition of the semantic types and how to override the detection, see
[Databricks documentation on semantic type detection](https://docs.microsoft.com/azure/databricks/applications/machine-learning/automl#semantic-type-detection).

- Semantic type `categorical` detected for column `rn`. Training notebooks will encode features based on categorical transformations.
- Semantic type `datetime` detected for column `date`. Training notebooks will convert each column to a datetime type and encode features based on temporal transformations.
- Semantic type `numeric` detected for columns `current_value`, `previous_value`. Training notebooks will convert each column to a numeric type and encode features based on numerical transformations.
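
The conversions described above rely on `errors="coerce"`, which replaces values that cannot be parsed with a missing-value marker instead of raising. A small sketch of that behavior, using the column names from this notebook and hypothetical sample values:

```python
import pandas as pd

# Hypothetical raw values; only the column names come from the notebook.
raw = pd.DataFrame({
    "date": ["2024-01-15", "not a date"],
    "current_value": ["3.5", "n/a"],
})

# Unparseable entries become NaT (datetime) or NaN (numeric) rather than errors.
dates = pd.to_datetime(raw["date"], errors="coerce")
values = pd.to_numeric(raw["current_value"], errors="coerce")

print(dates.isna().tolist())   # [False, True]
print(values.isna().tolist())  # [False, True]
```

Because coerced values turn into missing values silently, the missing-value counts in the profiling report are worth checking for the converted columns.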

## Truncate columns
Only the first 100 columns are considered for profiling, to avoid out-of-memory issues. Special columns, such as the target column, are always included and count toward the 100-column limit. Modify the next cell to rerun the profiling on a different set of columns.

In [0]:
special_cols = ["Severity"]
df = pd.concat([df[special_cols], df.drop(columns=special_cols).iloc[:, :100 - len(special_cols)]], axis=1)
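
To see what the truncation cell does, the same expression can be run on a toy frame. This is a sketch with hypothetical column names `a` to `d` and a hypothetical limit of 3 in place of 100:

```python
import pandas as pd

# Toy frame: the target column sits in the middle of the other columns.
toy = pd.DataFrame({"a": [1], "b": [2], "Severity": [0], "c": [3], "d": [4]})
special_cols = ["Severity"]
max_cols = 3  # hypothetical small limit; the notebook uses 100

# Special columns come first, then as many remaining columns as the limit allows.
truncated = pd.concat(
    [toy[special_cols], toy.drop(columns=special_cols).iloc[:, : max_cols - len(special_cols)]],
    axis=1,
)

print(list(truncated.columns))  # ['Severity', 'a', 'b']
```

The special columns move to the front, and the remaining slots are filled in the original column order.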

## Profiling Results

In [0]:
from ydata_profiling import ProfileReport
df_profile = ProfileReport(df, minimal=True, title="Profiling Report", progress_bar=False, infer_dtypes=False)
profile_html = df_profile.to_html()

displayHTML(profile_html)

In [0]:
# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df)

# Group by the target column 'Severity' and count the rows in each class
severity_counts = spark_df.groupBy("Severity").count()

# Display the results
display(severity_counts)
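
Since the data is already held in a pandas DataFrame, the same class counts can also be computed without Spark. A minimal sketch on hypothetical severity labels (the real values in the dataset are not shown in this notebook):

```python
import pandas as pd

# Hypothetical labels standing in for the notebook's target column.
toy = pd.DataFrame({"Severity": ["High", "Low", "Low", "Medium", "Low"]})

# Count rows per class; dropna=False keeps a separate count for missing labels.
pandas_counts = toy["Severity"].value_counts(dropna=False)

print(pandas_counts)
```

Passing `normalize=True` to `value_counts` would return class proportions instead of counts, which can make class imbalance easier to read.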