# Exploratory Data Analysis

To get started, we will prototype the workflow locally.

**Warning:** this notebook may fail if your local machine does not have enough memory to hold the data it loads.

## Install requirements

Install required packages.

In [None]:
!pip install --upgrade dask distributed fastparquet adlfs lightgbm pandas python-snappy pyarrow

## Get Data

The data comes from the NYC Taxi and Limousine Commission (TLC) yellow taxi trip records and is hosted publicly in the `azureopendatastorage` storage account.

Start a distributed ``Client``. Called with no arguments, it creates a local cluster on this machine.

In [None]:
from distributed import Client

c = Client()
c

Initialize the filesystem object. ``adlfs`` implements the ``fsspec`` filesystem interface for Azure storage.

**Tip:** if you're not using public data, you need to provide data credentials. These can be retrieved through Azure ML Datastores, e.g.:

```python
from azureml.core import Workspace

ws = Workspace.from_config()
ds = ws.get_default_datastore() # ws.datastores["my-datastore-name"]

storage_options = {
    "account_name": ds.account_name,
    "account_key": ds.account_key
}
```
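
If you prefer not to fetch keys from the workspace in every session, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` and the helper `storage_options_from_env` are assumptions made for this example, not part of the original notebook.

```python
import os


def storage_options_from_env(env=None):
    """Build an adlfs-style storage_options dict from environment variables.

    The variable names used here are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    options = {"account_name": env["AZURE_STORAGE_ACCOUNT_NAME"]}
    # Only include the key when it is set, so public containers
    # can still be read without credentials.
    key = env.get("AZURE_STORAGE_ACCOUNT_KEY")
    if key:
        options["account_key"] = key
    return options


# Example with an explicit mapping instead of the real environment
print(storage_options_from_env({"AZURE_STORAGE_ACCOUNT_NAME": "azureopendatastorage"}))
```

The resulting dict can be passed as `AzureBlobFileSystem(**storage_options)` or as the `storage_options` argument of a reader, in the same way as the dict built from the datastore.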

In [None]:
from adlfs import AzureBlobFileSystem

container_name = "nyctlc"
storage_options = {"account_name": "azureopendatastorage"}

fs = AzureBlobFileSystem(**storage_options)
fs

In [None]:
files = fs.ls(f"{container_name}")
files

In [None]:
files = fs.glob(f"{container_name}/yellow/puYear=2018/puMonth=*/*.parquet")
files[-5:]

In [None]:
len(files)
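
The glob pattern above matches Hive-style partitioned paths (`puYear=.../puMonth=...`). As a rough sketch, the partition values can be parsed out of each path to check how the files are spread across months. The helpers `partition_values` and `files_per_month` and the sample file names are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Matches Hive-style "key=value" path segments such as "puMonth=12".
PARTITION_RE = re.compile(r"([^/=]+)=([^/]+)")


def partition_values(path):
    """Return the key=value segments of a path as a dict of strings."""
    return dict(PARTITION_RE.findall(path))


def files_per_month(paths):
    """Count how many files fall into each puMonth partition."""
    return Counter(int(partition_values(p)["puMonth"]) for p in paths)


# Hypothetical paths shaped like the glob results; the file names are made up.
sample = [
    "nyctlc/yellow/puYear=2018/puMonth=1/part-0000.parquet",
    "nyctlc/yellow/puYear=2018/puMonth=1/part-0001.parquet",
    "nyctlc/yellow/puYear=2018/puMonth=12/part-0000.parquet",
]
print(partition_values(sample[0]))
print(sorted(files_per_month(sample).items()))
```

In the notebook, `files_per_month(files)` would give the same summary for the real listing.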

Read the data into a Dask dataframe. Note that pandas also accepts the ``storage_options`` argument.

In [None]:
import dask.dataframe as dd

df = dd.read_parquet(
    f"az://{container_name}/yellow/puYear=2018/puMonth=12/*.parquet",
    storage_options=storage_options,
).persist()
df

In [None]:
len(df)

## Exploratory Data Analysis (EDA)

Explore the data. For the purpose of this tutorial, we will simply print a sample of the dataframe and compute a few other basic descriptions.

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.dtypes

In [None]:
import matplotlib.pyplot as plt

df["tipAmount"].compute().hist(bins=1000, figsize=(16, 8), color="b")
plt.xlim([0, 20])
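
The cell above spreads 1000 bins over the full range of tip amounts and then zooms in on 0 to 20, so a few large outliers can leave only a small number of bins inside the visible window. A possible alternative is to bin only the range of interest. The sketch below uses NumPy on synthetic values that stand in for the `tipAmount` column; the distribution and the outlier are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for df["tipAmount"]: mostly small tips plus one large outlier.
tips = np.concatenate([rng.exponential(scale=2.0, size=10_000), [500.0]])

# Restrict the bins to [0, 20] so every bin falls inside the plotted window.
counts, edges = np.histogram(tips, bins=100, range=(0, 20))

# Values outside the range are ignored rather than stretching the bins.
in_window = int(((tips >= 0) & (tips <= 20)).sum())
print(len(counts), edges[0], edges[-1])
print(counts.sum() == in_window)
```

With pandas, the same effect should be achievable by passing `range=(0, 20)` to `.hist(...)`, since extra keyword arguments are forwarded to matplotlib.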

## Data Preparation

Prepare data for ML - for the purpose of this tutorial, we will simply ignore non-numeric columns.

In [None]:
cols = [
    col
    for col in df.columns
    if (df.dtypes[col] != "object") and (df.dtypes[col] != "datetime64[ns]")
]
cols
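
As a hedged alternative to comparing dtypes against strings, `select_dtypes` can pick numeric columns directly. The sketch below uses a small pandas frame with made-up values; the column names only mimic the style of the dataset and are assumptions. Dask dataframes expose a `select_dtypes` method as well.

```python
import pandas as pd

# Toy frame with hypothetical values; column names only mimic the dataset's style.
toy = pd.DataFrame(
    {
        "vendorID": ["1", "2"],
        "tpepPickupDateTime": pd.to_datetime(["2018-12-01", "2018-12-02"]),
        "tripDistance": [1.2, 3.4],
        "passengerCount": [1, 2],
        "tipAmount": [0.0, 2.5],
    }
)

# Keep integer and float columns, then drop the target from the feature list.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
feature_cols = [c for c in numeric_cols if c != "tipAmount"]
print(feature_cols)
```

The two approaches are not identical: `include="number"` also drops boolean columns and datetime columns of any resolution or time zone, whereas the string comparison above excludes only `object` and `datetime64[ns]`.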

In [None]:
X = df[cols].drop("tipAmount", axis=1).values.persist()
X

In [None]:
y = df["tipAmount"].values.persist()
y

## Train LightGBM

First, we train with the standard scikit-learn interface as a single-process baseline. Then we use the ``lightgbm.dask`` module for distributed LightGBM training through Python.

In [None]:
import lightgbm as lgbm

params = {
    "objective": "regression",
    "boosting": "gbdt",
    "num_iterations": 1000,
    "learning_rate": 0.1,
    "num_leaves": 16,
}

In [None]:
%%time

model = lgbm.LGBMRegressor(**params).fit(X, y)
model

In [None]:
%%time

# DaskLGBMRegressor is the distributed estimator provided by lightgbm.dask
model = lgbm.DaskLGBMRegressor(**params).fit(X, y)
model

In [None]:
model

## Save model

Optionally, save the model. The cell below only lists the estimator's parameters; it does not write anything to disk.

In [None]:
model.get_params()
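
To actually persist the trained model, one approach is to write the underlying booster to a text file. The helper below is a minimal sketch: `save_model_text` and the file name `model.txt` are assumptions for this example. It relies on the `to_local()` method of LightGBM's Dask estimators and on `Booster.save_model`.

```python
def save_model_text(model, path):
    """Write a fitted LightGBM estimator's booster to a text file.

    Dask estimators are first converted to their local counterpart with
    to_local(); plain scikit-learn style estimators are used as-is.
    """
    local = model.to_local() if hasattr(model, "to_local") else model
    # booster_ is only available after fit() has been called.
    local.booster_.save_model(path)
    return path
```

In the notebook this would be called as `save_model_text(model, "model.txt")`, and the file could later be reloaded with `lgbm.Booster(model_file="model.txt")`.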