<img src="https://unidata.ucar.edu/images/logos/badges/badge_unidata_100.jpg" alt="Unidata Logo" style="float: right; height: 98px;">

# Xarray to Scikit-Learn
___

In [1]:
import numpy as np

import xarray as xr
import pandas as pd

## Load in Xarray Tutorial Dataset

In [2]:
ds_ar = xr.tutorial.load_dataset("air_temperature")
ds_ar

Record the total number of data values now, so we can sanity-check the size of the DataFrame later.

In [3]:
data_length = np.shape(ds_ar.air.values.ravel())
data_length[0]

3869000

Let's add a second data variable to make the dataset a bit more realistic.

In [4]:
ds_ar['air_plus5'] = ds_ar['air'] + 5
ds_ar

## Make a Pandas DataFrame Suitable for Analysis in Scikit-learn

Scikit-learn requires a 2D array, with input variables as columns and each sample as a row. You can build this layout with xarray's `stack` method, which collapses several dimensions into a single one.
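As a toy illustration of what `stack` does, the sketch below collapses a small 2 x 3 array into a single dimension of 6 samples. The array, its coordinate values, and the dimension name `sample` are made up for this example and are not part of the tutorial dataset.

```python
import numpy as np
import xarray as xr

# Tiny made-up array: 2 time steps x 3 longitudes
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("time", "lon"),
    coords={"time": [0, 1], "lon": [10.0, 20.0, 30.0]},
    name="air",
)

# Collapse both dimensions into one; "time" varies slowest because it is listed first
stacked = da.stack(sample=("time", "lon"))

print(stacked.dims)    # ('sample',)
print(stacked.values)  # [0 1 2 3 4 5]
```

Each entry along `sample` corresponds to one (time, lon) pair, which is the "one row per sample" layout described above.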

In [5]:
# First, grab the coordinate names as a tuple
my_tuple = tuple(ds_ar.coords)
# Time coordinate names to move to the front, if present
desired_values = ('time', 'time1')

In [6]:
time_first = tuple(value for value in desired_values if value in my_tuple)
remaining = tuple(value for value in my_tuple if value not in desired_values)
reordered_tuple = time_first + remaining
print(reordered_tuple)

('time', 'lat', 'lon')


In [7]:
df = ds_ar.stack(stacked=reordered_tuple).to_dataframe()
df.head()

| time (index) | lat (index) | lon (index) | air | air_plus5 | time | lat | lon |
|---|---|---|---|---|---|---|---|
| 2013-01-01 | 75.0 | 200.0 | 241.199997 | 246.199997 | 2013-01-01 | 75.0 | 200.0 |
| 2013-01-01 | 75.0 | 202.5 | 242.5 | 247.5 | 2013-01-01 | 75.0 | 202.5 |
| 2013-01-01 | 75.0 | 205.0 | 243.5 | 248.5 | 2013-01-01 | 75.0 | 205.0 |
| 2013-01-01 | 75.0 | 207.5 | 244.0 | 249.0 | 2013-01-01 | 75.0 | 207.5 |
| 2013-01-01 | 75.0 | 210.0 | 244.099991 | 249.099991 | 2013-01-01 | 75.0 | 210.0 |


Sanity check: the DataFrame should have one row per original data value.

In [8]:
print('Did we lose any data points?')
df.shape[0] != data_length[0]

Did we lose any data points?


False

Users might want to drop the multi-index or keep it for their analysis.
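One possible way to drop the multi-index and arrive at the arrays scikit-learn expects is sketched below. It uses a small synthetic stand-in for the tutorial dataset (the values are random and purely illustrative), and the choice of day of year, latitude, and longitude as features is an assumption made for the example, not a recommendation.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic stand-in for the tutorial dataset (values are made up)
time = pd.date_range("2013-01-01", periods=3, freq="D")
lat = [75.0, 72.5]
lon = [200.0, 202.5, 205.0, 207.5]
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"air": (("time", "lat", "lon"), 240.0 + rng.random((3, 2, 4)))},
    coords={"time": time, "lat": lat, "lon": lon},
)
ds["air_plus5"] = ds["air"] + 5

df = ds.stack(stacked=("time", "lat", "lon")).to_dataframe()

# Some xarray versions also emit the index levels as ordinary columns
# (as in the output above); drop those copies before resetting the index.
duplicated = [name for name in df.index.names if name in df.columns]
flat = df.drop(columns=duplicated).reset_index()

# Feature matrix of shape (n_samples, n_features) and target vector of shape (n_samples,)
features = flat.assign(dayofyear=flat["time"].dt.dayofyear)[["dayofyear", "lat", "lon"]]
X = features.to_numpy(dtype=float)
y = flat["air"].to_numpy()

print(X.shape, y.shape)  # (24, 3) (24,)
```

Arrays shaped like `X` and `y` are what a scikit-learn estimator's `fit` method takes. Keeping the multi-index instead preserves the (time, lat, lon) label of every row, which helps when predictions need to be mapped back onto the grid.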

### Question:

Is there a smart way, using pint/MetPy, to put time as the first value in the tuple?

Is it worth the additional overhead within MetPy to simplify this code so that the user can just do:

xarray_dataset.timestack() 
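To make the question concrete, here is a rough sketch of what such a helper might look like as a plain function. The name `timestack` comes from the question above; the function itself is hypothetical and is not an existing MetPy or xarray API. It uses neither pint nor MetPy and simply assumes that time dimensions can be recognized by a `datetime64` dtype, so calendars stored as `cftime` objects would not be detected.

```python
import numpy as np
import pandas as pd
import xarray as xr


def timestack(ds, new_dim="stacked"):
    """Hypothetical helper: stack every dimension, datetime dimensions first."""
    dims = list(ds.sizes)
    # Assumption: a time dimension is one whose coordinate has a datetime64 dtype
    time_dims = [
        d for d in dims
        if d in ds.coords and np.issubdtype(ds[d].dtype, np.datetime64)
    ]
    other_dims = [d for d in dims if d not in time_dims]
    return ds.stack({new_dim: tuple(time_dims + other_dims)}).to_dataframe()


# Made-up dataset whose time dimension is deliberately listed last
ds = xr.Dataset(
    {"air": (("lat", "lon", "time"), np.zeros((2, 3, 4)))},
    coords={
        "lat": [75.0, 72.5],
        "lon": [200.0, 202.5, 205.0],
        "time": pd.date_range("2013-01-01", periods=4, freq="D"),
    },
)

df = timestack(ds)
print(df.index.names[0])  # time
```

Detecting the time dimension by dtype rather than by name avoids maintaining a list such as `('time', 'time1')`, at the cost of missing non-standard calendars.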