# Exploratory Data Analysis

To get started, we will prototype the workflow locally.

**Warning:** this notebook may fail if your local machine does not have sufficient resources. 

## Install requirements

Install required packages.

In [None]:
!pip install --upgrade dask distributed fastparquet adlfs xgboost pandas

## Get Data

The data is modified from a Kaggle competition and hosted publicly.

Start a distributed `Client`.

In [None]:
from distributed import Client

c = Client()
c
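
Since the warning above mentions limited local resources, one option is to cap the size of the local cluster. The following is a minimal sketch, not part of the original notebook: the helper names `local_cluster_kwargs` and `start_local_client` are hypothetical, and the default values (at most 4 workers, 2GB per worker) are illustrative assumptions to adjust for your machine.

```python
import os


def local_cluster_kwargs(max_workers=4, memory_limit="2GB"):
    """Build keyword arguments for a small local Dask cluster (illustrative defaults)."""
    # never ask for more workers than there are CPU cores, and always at least one
    n_workers = max(1, min(max_workers, os.cpu_count() or 1))
    return {
        "n_workers": n_workers,
        "threads_per_worker": 1,
        "memory_limit": memory_limit,  # applies per worker
    }


def start_local_client(**overrides):
    """Start a Client on a local cluster sized by local_cluster_kwargs()."""
    # imported here so the helper above can be used without distributed installed
    from distributed import Client

    return Client(**{**local_cluster_kwargs(), **overrides})
```

Calling `c = start_local_client()` would then stand in for `c = Client()`.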

Initialize the Pythonic filesystem.

**Tip:** if you're not using public data, you need to provide data credentials. These can be retrieved through Azure ML Datastores, e.g.:

```python
from azureml.core import Workspace

ws = Workspace.from_config()
ds = ws.get_default_datastore()  # or a named datastore: ws.datastores["my-datastore-name"]

storage_options = {
    "account_name": ds.account_name,
    "account_key": ds.account_key
}
```

In [None]:
from adlfs import AzureBlobFileSystem

container_name = "malware"
storage_options = {"account_name": "azuremlexamples"}

fs = AzureBlobFileSystem(**storage_options)
fs

List the processed (partitioned) files.

In [None]:
files = fs.ls(f"{container_name}/processed")
files

Read the data into a Dask dataframe. Note that pandas also accepts the ``storage_options`` argument.

In [None]:
import dask.dataframe as dd

for f in files:
    if "train" in f:
        df_train = dd.read_parquet(f"az://{f}", storage_options=storage_options)
    elif "test" in f:
        df_test = dd.read_parquet(f"az://{f}", storage_options=storage_options)

df_train
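
The loop above picks files by substring, so a missing or ambiguous match would only surface later as an undefined or overwritten variable. As a hedged sketch, the selection could be made explicit, and the same path could be read with pandas through ``storage_options``. The helper names and the example listing below are hypothetical; the real paths come from `fs.ls(...)`.

```python
def select_partition(files, keyword):
    """Return the single path containing `keyword`, or raise if there is not exactly one."""
    matches = [f for f in files if keyword in f]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one path containing {keyword!r}, found {len(matches)}"
        )
    return matches[0]


def read_partition_with_pandas(path, storage_options):
    """Read one partitioned dataset into memory with pandas (assumes it fits in RAM)."""
    # imported here so select_partition can be used without pandas installed
    import pandas as pd

    return pd.read_parquet(f"az://{path}", storage_options=storage_options)


# hypothetical listing, shaped like the output of fs.ls(f"{container_name}/processed")
example_files = ["malware/processed/test.parquet", "malware/processed/train.parquet"]
train_path = select_partition(example_files, "train")
```

Unlike the Dask version, the pandas call loads the whole dataset eagerly, so it is only an option when the data fits in local memory.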

## Exploratory Data Analysis (EDA)

Explore the data. For the purpose of this tutorial, we will simply print a sample of the train and test dataframes and compute basic summary statistics.

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
%%time
df_train.describe().compute()

In [None]:
%%time
df_train["HasDetections"].compute().hist()

## Data Preparation

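Prepare data for ML. For the purpose of this tutorial, we will simply ignore non-numeric columns. Columns are selected with `is_numeric_dtype`, since string columns may be loaded with either the `object` dtype or a dedicated string dtype depending on the Dask version.

In [None]:
from pandas.api.types import is_numeric_dtype
cols = [col for col in df_train.columns if is_numeric_dtype(df_train.dtypes[col])]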
cols
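
To illustrate which columns a numeric-only filter keeps, here is a small pandas sketch on a toy frame. All column names except `HasDetections` are made up for the example, and `is_numeric_dtype` is one possible way to express the filter, not necessarily the only one.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# toy data; "version", "ram_mb" and "screen_inches" are hypothetical column names
toy = pd.DataFrame(
    {
        "version": ["1.1", "1.2", "1.1"],      # strings: dropped
        "ram_mb": [4096, 8192, 2048],          # integers: kept
        "screen_inches": [13.3, 15.6, 14.0],   # floats: kept
        "HasDetections": [0, 1, 0],            # label: numeric, removed from features below
    }
)

# keep numeric columns only, preserving the original column order
numeric_cols = [col for col in toy.columns if is_numeric_dtype(toy.dtypes[col])]

# the label is numeric too, so it has to be excluded from the features explicitly
feature_cols = [col for col in numeric_cols if col != "HasDetections"]
```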

In [None]:
X = df_train[cols].drop("HasDetections", axis=1).values.persist()
X

In [None]:
y = df_train["HasDetections"].values.persist()
y

## Train XGBoost

Now, we can use the ``xgboost.dask`` module for distributed XGBoost training through Python.

In [None]:
import xgboost as xgb

dtrain = xgb.dask.DaskDMatrix(c, X, y)
dtrain

In [None]:
num_boost_round = 2  # just see if it works

params = {
    "objective": "binary:logistic",
    "learning_rate": 0.1,
    "gamma": 0,
    "max_depth": 8,
}

In [None]:
%%time
model = xgb.dask.train(c, params, dtrain, num_boost_round=num_boost_round)
model
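
With the `binary:logistic` objective, predictions are probabilities between 0 and 1 rather than class labels. A quick sanity check is to threshold them and compare against the labels. The helper below is a hypothetical sketch, not part of the original notebook; the 0.5 threshold is an assumption.

```python
def accuracy_from_probabilities(probabilities, labels, threshold=0.5):
    """Fraction of rows whose thresholded probability matches the 0/1 label."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    # probabilities at or above the threshold count as the positive class
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return correct / len(labels)


# In this notebook the inputs could come from something like:
#   probabilities = xgb.dask.predict(c, model, X).compute()
#   labels = y.compute()
example_accuracy = accuracy_from_probabilities([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
```

Accuracy on the training data only shows that the pipeline runs end to end; it says little about how well the model generalizes.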

## Save model

Optionally, save the model.

In [None]:
model["booster"].save_model("xgboost.model")
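
To reuse the saved model later, it can be loaded back into a plain `Booster` without a Dask cluster. This is a minimal sketch: the helper name `load_booster` is hypothetical, and it assumes the file was written by the `save_model` call above. Newer XGBoost releases recommend a `.json` or `.ubj` file extension, so the file name may need adjusting depending on the installed version.

```python
def load_booster(path="xgboost.model"):
    """Load a model previously written with Booster.save_model()."""
    # imported here so defining the helper does not require xgboost
    import xgboost as xgb

    booster = xgb.Booster()
    booster.load_model(path)
    return booster
```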