# Feature Engineering

At this point, it's assumed that you've explored the [battery charging dataset on Trove](https://trove.apple.com/dataset/aiedu_battery_charging/1.0.0) and are familiar with the variables in that data and what they represent. 

In this notebook, it will be up to you to transform the battery charging dataset into a dataset that can be used for training an ML model, including calculating and adding new columns to the data that can be used as input features. 

> This is a step known as **feature engineering**, where input features are numerical values or categories that will help you to make accurate predictions.

Format-wise your final data should satisfy the following requirements:
* You've created at least 2 input features that you think will be useful for prediction
* Every column is an input feature or a predictive target (targets are typically the _last_ column(s) in a dataset)
* There are no missing (NaN) values in your data
* Your data is split into two sets: training and test

**Your final training and test datasets should be something that can be directly used for training and evaluating a simple NN or a baseline model.**

## Steps for creating features

In the rest of this notebook, you will be tasked with performing the following steps and creating an _initial_ featurized dataset to use for model training: 
>1. **Loading the data**: Create an initial DataFrame to work with 
2. **Define a target**: Define a predictive target and make it one column in your data
3. **Removing noise and null values**: Removing values that are noisy or not useful for the predictive model you want to build
4. **Train/test split**: Splitting the data into train/test sets
5. **Creating features**: Transforming data and creating new input features that you think will have predictive power
6. **Saving data**: Saving your transformed training and test data so you can use it for model training!

Don't worry about making the _best_ features; consider this notebook a starting point that you can revisit later, adding new features and removing others to get the best features for your predictive goals.

Once you've completed this notebook, you'll submit it, alongside any helper functions, for review by the Education team. 

>Your tasks will be marked as **TASK**s in markdown and `##TODOs` in code. 


In [1]:
import turitrove as trove
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



---
# Load (or Mount) the Data

In the last EDA notebook, you might have saved some transformed battery charging data in a binary (pickle) file or a CSV file. You can either load that saved data with pandas `read_pickle()` or `read_csv()`, or start anew by mounting the Trove battery dataset as usual.

> **TASK**: Load in the battery charging data as a DataFrame, and make sure your date-time stamps are formatted correctly: not as generic object types but as date-time stamps.

Recall, you can find the relevant info for mounting a Trove dataset on its [Trove page](https://trove.apple.com/dataset/aiedu_battery_charging/1.0.0).
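As a minimal sketch of the date-time formatting step, the snippet below reads a tiny in-memory CSV and parses `start` and `end` while loading, using the `parse_dates` argument of `pd.read_csv()`. The two rows and the `agh-184` user id are made up for illustration and stand in for your own saved file. Calling `pd.to_datetime()` on each column after loading works equally well.

```python
import io
import pandas as pd

# Two made-up rows standing in for a saved CSV file (illustration only)
csv_text = """start,end,stream,value,user_id
2020-03-03 00:00:00,2020-03-03 00:13:00,/device/batteryPercentage,73.0,agh-184
2020-03-03 04:53:00,2020-03-03 05:32:00,/device/isPluggedIn,1.0,agh-184
"""

# parse_dates converts the named columns to datetime64 while loading,
# so they do not come back as generic object (string) columns
example_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["start", "end"])

print(example_df.dtypes)
```

If you load a pickle file instead, the date-time types are usually preserved as they were saved, but it is still worth checking `dtypes` before moving on.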


In [2]:
## TODO: Load or mount the battery data as a DataFrame
df = pd.read_csv("cleaned.csv")

## TODO: Make sure the date-time variables are formatted as date-time stamps
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

df.dtypes

start            datetime64[ns]
end              datetime64[ns]
stream                   object
value                   float64
user_id                  object
charging_time           float64
dtype: object

In [3]:
df.head()

Unnamed: 0,start,end,stream,value,user_id,charging_time
0,2020-03-03 00:00:00,2020-03-03 00:13:00,/device/batteryPercentage,73.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,
1,2020-03-03 00:13:00,2020-03-03 00:26:00,/device/batteryPercentage,72.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,
2,2020-03-03 00:26:00,2020-03-03 00:26:00,/device/batteryPercentage,71.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,
3,2020-03-03 00:26:00,2020-03-03 00:27:00,/device/batteryPercentage,70.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,
4,2020-03-03 00:27:00,2020-03-03 00:27:00,/device/batteryPercentage,69.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,


---
# Define a Target Column

Since you want to predict the duration that a device will be _plugged in_, you're going to want to **create a target variable** that captures that information.  

If you loaded in your own, explored data, you may have already completed this step. If not, make sure to complete this task:

>**TASK**: Create a DataFrame that represents _only_ when users have plugged-in their devices and the devices are charging, and add a column to the DataFrame that represents the `duration` (in minutes) of these plugged-in events.

**Hints**: 
* Make sure your data is in the correct format for making certain calculations, e.g. with timestamps.
* You can get specific rows of data by using [pandas conditional filtering](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html#how-do-i-filter-specific-rows-from-a-dataframe)
* Use `.copy()`: When making new DataFrames, it's good practice to name them something unique as you modify them _and_ to use the `.copy()` function to make sure you're making a new copy rather than just a reference to the original DataFrame.  See [the Python docs on copy](https://docs.python.org/3/library/copy.html) for more details about that nuance.
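
The hints above can be sketched on a few made-up rows. This is only an illustration of conditional filtering plus `.copy()`, under the assumption (taken from the example rows below) that a plug-in event has a `value` of 1.0. The frame names `events` and `plugged` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only
events = pd.DataFrame({
    "start": pd.to_datetime(["2020-03-03 04:53", "2020-03-03 05:32", "2020-03-03 06:00"]),
    "end": pd.to_datetime(["2020-03-03 05:32", "2020-03-03 06:00", "2020-03-03 06:10"]),
    "stream": ["/device/isPluggedIn", "/device/isPluggedIn", "/device/batteryPercentage"],
    "value": [1.0, 0.0, 55.0],
})

# Boolean mask: rows from the plug-in stream whose value is 1.0
mask = (events["stream"] == "/device/isPluggedIn") & (events["value"] == 1.0)

# .copy() gives an independent DataFrame, so the new column below
# is added to `plugged` only and `events` stays unchanged
plugged = events[mask].copy()

# Subtracting two datetime columns gives a timedelta; convert it to minutes
plugged["duration"] = (plugged["end"] - plugged["start"]).dt.total_seconds() / 60

print(plugged)
```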

After this task is complete, your plug-in events dataset should look a bit like this (these are just a few fake, example rows—with calendar dates removed—which won't _exactly_ match your dataset, except for the stream and value columns):
```
start	   end	      stream	            value  user_id	duration

04:53:00	05:32:00	/device/isPluggedIn	1.0	agh-184	39.0
07:20:00	16:59:00	/device/isPluggedIn	1.0	agh-184	579.0
07:30:00	07:30:00	/device/isPluggedIn	1.0	agh-184	0.0
```

In [4]:
## TODO: Create a dataframe of plug-in events only (using .copy())
df_plugin = df.copy()
df_plugin.drop(['charging_time'], axis = 1, inplace = True)

## TODO: Add a 'duration' column that represents the length of a charge in mins

# Charging Col
df_plugin['duration'] = np.where(df_plugin['stream'] == "/device/isPluggedIn",
                                 (df_plugin['end'] - df_plugin['start']).dt.total_seconds() / 60, 0)

# Filter for all Plug-In Events & Ensure that Value is 1.0
df_plugin = df_plugin[df_plugin['stream'] == "/device/isPluggedIn"]
df_plugin = df_plugin[df_plugin['value'] == 1.0]

# Reset Index
df_plugin.reset_index(inplace = True)

# Display First Few Cols to Verify Accuracy
df_plugin.head()

Unnamed: 0,index,start,end,stream,value,user_id,duration
0,35,2020-03-03 04:53:00,2020-03-03 05:32:00,/device/isPluggedIn,1.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,39.0
1,71,2020-03-03 07:20:00,2020-03-03 16:59:00,/device/isPluggedIn,1.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,579.0
2,160,2020-03-04 05:02:00,2020-03-04 05:51:00,/device/isPluggedIn,1.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,49.0
3,193,2020-03-04 07:30:00,2020-03-04 07:30:00,/device/isPluggedIn,1.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,0.0
4,195,2020-03-04 07:30:00,2020-03-04 13:30:00,/device/isPluggedIn,1.0,cd8014ee-5e62-4e41-9c24-de50178b4a97,360.0


### Test cells

In this notebook, you'll find a few provided test cells; you should pass in your _current_, working DataFrame and these tests will check whether your data format looks about right. These tests are not extensive; they are just meant to tell you whether you are on the right track to proceed!

In [5]:
## --- TEST CELL --- ##
## TODO: replace None function arg with your DataFrame from the above exercise

from checks import test_dtypes_plugin_vals
    
test_dtypes_plugin_vals(df_plugin) ## YOUR DF HERE


Passed format and 'duration' column tests so far!


---
## What about the other rows of data?

For optimized battery charging, we want to predict how long someone is likely to plug-in and charge their device, and so the plug-in events are the most important source of data, but you can still use information about un-plugged events and battery charge levels to create input features that may be useful for prediction. 

This information can be _merged_ into your plug-in events DataFrame. 

### Merging data: example

For example, say you want to record the battery charge level at the `start` of a plug-in event, because you hypothesize that when users have a lower charge level, they are more likely to plug-in their devices for a long `duration`. 

> However, plug-in events and battery charge level information are contained in two different _streams_ of data: '/device/isPluggedIn' and '/device/batteryPercentage'.

Since we're interested in the `start` charge level at the start of a plug-in event, we can do what's called a data **merge** or join on that column by doing the following two steps. 
1. Create two different DataFrames; one for plug-in events and one for battery charge levels
2. Merge the two based on their respective `start` times to create a single DataFrame with _both_ plug-in and battery charge level information. 

### Not-matching `start` times

Pandas provides a few different functions for merging DataFrames, including `merge()` which is helpful for merging on exactly-matching values, but this is not useful in this case because the `start` of a battery charge level event and the `start` of a plug-in event may not match up to the minute. 

Instead we want a merge function that allows us to match on the _closest_ `start` times between the two streams, which we can do with the pandas function `merge_asof()`. 

**Resources**: 
* The [pandas documentation for `merge_asof()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html)
* [This blog post](https://towardsdatascience.com/how-to-merge-not-matching-time-series-with-pandas-7993fcbce063) on merging not-exactly-matching data with `merge_asof()`. 

> **TASK**: Create a DataFrame of _only_ the battery level stream; you should already have a DF of plug-in events from earlier tasks. Define those two DataFrames, then run the provided code to use `merge_asof()` and get the starting battery level (renamed to `start_batt_level` in a later cell) in a new DataFrame, `plug_in_charge_df`.

> The resultant, `plug_in_charge_df` should be the DataFrame you continue to work with, as you proceed in this notebook. 

In the provided code, the two DataFrames that you provide will be merged based on a few parameters:
* on - column name that I want to *near-match* in the left and right df's (`start`)
* by - column(s) that should *exactly-match* between left and right df's (`user_id`)
* direction - whether to search for prior, next, or generally closest matches for the _on_ column values (`forward`)

Combined, these parameters ensure that we get the first recorded battery percentage `value` that _starts_ immediately after the `start` of a given plugged-in event, for a given `user_id`.

**Note**: Merging _can_ introduce some missing values into your data, if a nearest start time is not found upon merging. You will be asked to check for these missing values and deal with them in a bit.
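
To make these parameters concrete, here is a small sketch of `merge_asof()` on made-up data. The frames `plug` and `batt` and the user ids `a` and `b` are hypothetical. The first plug-in event picks up the next battery reading for the same user, while the second finds no later reading for that user and ends up with a missing value, even though another user has one.

```python
import pandas as pd

# Made-up plug-in events (left frame)
plug = pd.DataFrame({
    "start": pd.to_datetime(["2020-03-03 04:53", "2020-03-03 07:20"]),
    "user_id": ["a", "a"],
})

# Made-up battery readings (right frame); the last row belongs to another user
batt = pd.DataFrame({
    "start": pd.to_datetime(["2020-03-03 04:50", "2020-03-03 04:55",
                             "2020-03-03 07:00", "2020-03-03 07:25"]),
    "user_id": ["a", "a", "a", "b"],
    "value": [20.0, 21.0, 60.0, 90.0],
})

# Both frames must be sorted on the `on` column before merging
merged = pd.merge_asof(
    plug.sort_values("start"),
    batt.sort_values("start"),
    on="start",           # near-match on start time
    by="user_id",         # exact match on user
    direction="forward",  # take the first reading at or after the plug-in start
)

print(merged)
```

In this toy case the 04:53 event is matched with the 04:55 reading (21.0), and the 07:20 event gets NaN because user `a` has no reading at or after 07:20.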

In [6]:
## TODO: Define two dataframes for plug-in and battery charge level streams

plug_in_df = df_plugin ## your plug-in events DF, created earlier ##

battery_charge_df = df.copy()
battery_charge_df.drop(['charging_time'], axis = 1, inplace = True)
battery_charge_df = battery_charge_df[battery_charge_df['stream'] == "/device/batteryPercentage"]

In [7]:
## TODO: run provided code

### --- provided code --- ### 

# merge two df's on closest start time + matching user_id's
# sorting df's on `start`
plug_in_charge_df = pd.merge_asof(plug_in_df.sort_values('start'),         ## left df
                                  battery_charge_df.sort_values('start'),  ## right df
                                  on = 'start',
                                  by = 'user_id',                          ## id's should match
                                  direction = 'forward',                   ## right df's start is >= left start
                                  suffixes=('_plugin', '_batt_level'))                   

# view resultant df
plug_in_charge_df.head()

Unnamed: 0,index,start,end_plugin,stream_plugin,value_plugin,user_id,duration,end_batt_level,stream_batt_level,value_batt_level
0,15838622,2020-03-02,2020-03-02 00:26:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,26.0,2020-03-02 00:08:00,/device/batteryPercentage,6.0
1,23335971,2020-03-02,2020-03-02 00:02:00,/device/isPluggedIn,1.0,968dace2-e758-4138-9f95-2e83521bdb36,2.0,2020-03-12 21:10:00,/device/batteryPercentage,74.0
2,12653112,2020-03-02,2020-03-02 04:50:00,/device/isPluggedIn,1.0,2cb0e6eb-e925-4fee-ae84-82113d7c5e91,290.0,2020-03-02 00:03:00,/device/batteryPercentage,83.0
3,20198683,2020-03-02,2020-03-02 00:04:00,/device/isPluggedIn,1.0,d1d8f32e-e50b-42aa-b9d7-a4cc0bdfb4d2,4.0,2020-03-02 00:03:00,/device/batteryPercentage,9.0
4,35091055,2020-03-02,2020-03-02 00:25:00,/device/isPluggedIn,1.0,478146d8-13fd-4f4d-a5c3-a45e24e62ea1,25.0,2020-03-05 00:52:00,/device/batteryPercentage,81.0


### Dropping columns 

At this point, you should see quite a bit of information in the `plug_in_charge_df` DataFrame. There are two columns for the separate streams of data, end times, and values associated with each.

But recall: we just wanted to add the starting battery charge level to your plug-in events DataFrame; this is currently represented as the `value_batt_level` column.

So, next, I provide some code for dropping some extraneous battery level information and renaming the `value_batt_level` column to `start_batt_level`.

In [8]:
### --- provided code --- ### 

# drop some battery_level columns
plug_in_charge_df = plug_in_charge_df.drop(['end_batt_level', 'stream_batt_level'], axis=1)
plug_in_charge_df = plug_in_charge_df.rename(columns={"end_plugin": "end", 
                                                      "stream_plugin": "stream",
                                                      "value_plugin": "value", 
                                                      "value_batt_level": "start_batt_level"})
# see result
plug_in_charge_df.head()

Unnamed: 0,index,start,end,stream,value,user_id,duration,start_batt_level
0,15838622,2020-03-02,2020-03-02 00:26:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,26.0,6.0
1,23335971,2020-03-02,2020-03-02 00:02:00,/device/isPluggedIn,1.0,968dace2-e758-4138-9f95-2e83521bdb36,2.0,74.0
2,12653112,2020-03-02,2020-03-02 04:50:00,/device/isPluggedIn,1.0,2cb0e6eb-e925-4fee-ae84-82113d7c5e91,290.0,83.0
3,20198683,2020-03-02,2020-03-02 00:04:00,/device/isPluggedIn,1.0,d1d8f32e-e50b-42aa-b9d7-a4cc0bdfb4d2,4.0,9.0
4,35091055,2020-03-02,2020-03-02 00:25:00,/device/isPluggedIn,1.0,478146d8-13fd-4f4d-a5c3-a45e24e62ea1,25.0,81.0


### Check your work

At this point, a sample of your data should look something like this; same as your initial plug-in events frame but with a `duration` and `start_batt_level` column. 

```
start	 end	   stream	         value  user_id	duration start_batt_level

04:53:00  05:32:00  /device/isPluggedIn  1.0  agh-184	39.0     74.0
07:20:00  16:59:00  /device/isPluggedIn  1.0  agh-184	579.0    9.0
07:30:00  07:30:00  /device/isPluggedIn  1.0  agh-184	0.0      81.0
```

In addition to checking that you have these columns, here is also a good point to **explore** (again) and check:
1. That the format of your data is as expected; numbers are likely float values and dates are date-times
2. Whether you've introduced any missing values with the merge step. Remember, you'll want to create data without any missing values for training an ML model.

In [9]:
## Check for NA Values in Merged DF
plug_in_charge_df.isna().sum()

index                  0
start                  0
end                    0
stream                 0
value                  0
user_id                0
duration               0
start_batt_level    4073
dtype: int64

In [12]:
plug_in_charge_df = plug_in_charge_df.dropna(axis = 0)
plug_in_charge_df.head()

Unnamed: 0,index,start,end,stream,value,user_id,duration,start_batt_level
0,15838622,2020-03-02,2020-03-02 00:26:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,26.0,6.0
1,23335971,2020-03-02,2020-03-02 00:02:00,/device/isPluggedIn,1.0,968dace2-e758-4138-9f95-2e83521bdb36,2.0,74.0
2,12653112,2020-03-02,2020-03-02 04:50:00,/device/isPluggedIn,1.0,2cb0e6eb-e925-4fee-ae84-82113d7c5e91,290.0,83.0
3,20198683,2020-03-02,2020-03-02 00:04:00,/device/isPluggedIn,1.0,d1d8f32e-e50b-42aa-b9d7-a4cc0bdfb4d2,4.0,9.0
4,35091055,2020-03-02,2020-03-02 00:25:00,/device/isPluggedIn,1.0,478146d8-13fd-4f4d-a5c3-a45e24e62ea1,25.0,81.0


---
# Deal with Noisy Data

Noise is data that is irrelevant or incorrect for a task.

Some of the noise in this data is simply incorrect, which you may have noticed in earlier explorations; it is caused by a somewhat noisy data collection pipeline:
* Occasionally date-times will be incorrectly marked in the far future (e.g. in the year 2170)

There is also some noise with respect to the Optimized Battery Charging feature which is designed to work for predicting plug-in charge durations that are longer than 0 minutes but shorter than 2 days. 
* 0-length charges are typically "blips" that may be caused by a wireless charger breaking contact with a device momentarily, and
* Greater than 48hr charge-lengths will be handled by a rule-based algorithm that pauses charging at 80% indefinitely.

Now, it is up to you to remove this noise.
> **TASK**: Remove noisy data in the `plug_in_charge_df`.
> 1. Remove any rows of data that contain start/end times that were collected past this year (2021)—data could _not_ have been collected in the future, so this is noise that should be removed.
> 2. Remove any plug-in `duration` values that represent charge "blips" or durations that are recorded as 0-length; 0-length charges are too short to engage or benefit from optimized battery charging. 
> 3. Remove any `duration` values that are greater than or equal to 48hrs in length, which will be routed to a different algorithm for pausing charging. 
> 4. Identify and decide whether to remove or replace/impute NaN values in your data.

At the end of this section, you should be left with a DataFrame without any missing values.
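
As a rough sketch of the four steps on made-up rows: the cutoff date below is an assumption (the first instant after 2021), so adjust it to whatever "this year" means for your data. Dropping NaN rows is only one option; imputing a value is another.

```python
import pandas as pd

# Made-up rows: one valid charge, one 0-length blip,
# one far-future end time, and one 3-day charge with a missing battery level
toy = pd.DataFrame({
    "start": pd.to_datetime(["2020-03-03 04:53", "2020-03-04 07:30",
                             "2020-03-05 08:00", "2020-03-06 09:00"]),
    "end": pd.to_datetime(["2020-03-03 05:32", "2020-03-04 07:30",
                           "2170-01-01 00:00", "2020-03-09 09:00"]),
    "start_batt_level": [74.0, 81.0, 50.0, None],
})
toy["duration"] = (toy["end"] - toy["start"]).dt.total_seconds() / 60

cutoff = pd.Timestamp("2022-01-01")  # assumption: nothing was collected after 2021

keep = (
    (toy["start"] < cutoff) & (toy["end"] < cutoff)  # 1. no future timestamps
    & (toy["duration"] > 0)                          # 2. no 0-length blips
    & (toy["duration"] < 48 * 60)                    # 3. shorter than 48 hours
)

# 4. here the remaining rows with missing values are simply dropped
clean = toy[keep].dropna()

print(clean)
```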

In [13]:
## TODO: Remove noisy data and remove/replace missing data

# Step 1: Filter out rows where 'end' is past 2021-01-01
plug_in_charge_df = plug_in_charge_df[plug_in_charge_df['end'] <= "2021-01-01"]

# Steps 2 and 3: Keep only rows where 'duration' is greater than 0 and less than 48 hours (48 hours = 2880 minutes)
# 'duration' is in minutes
plug_in_charge_df = plug_in_charge_df[
    (plug_in_charge_df['duration'] > 0) & 
    (plug_in_charge_df['duration'] < 2880)
]

plug_in_charge_df.head()

Unnamed: 0,index,start,end,stream,value,user_id,duration,start_batt_level
0,15838622,2020-03-02,2020-03-02 00:26:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,26.0,6.0
1,23335971,2020-03-02,2020-03-02 00:02:00,/device/isPluggedIn,1.0,968dace2-e758-4138-9f95-2e83521bdb36,2.0,74.0
2,12653112,2020-03-02,2020-03-02 04:50:00,/device/isPluggedIn,1.0,2cb0e6eb-e925-4fee-ae84-82113d7c5e91,290.0,83.0
3,20198683,2020-03-02,2020-03-02 00:04:00,/device/isPluggedIn,1.0,d1d8f32e-e50b-42aa-b9d7-a4cc0bdfb4d2,4.0,9.0
4,35091055,2020-03-02,2020-03-02 00:25:00,/device/isPluggedIn,1.0,478146d8-13fd-4f4d-a5c3-a45e24e62ea1,25.0,81.0


### Test cell

Run the following tests to see if you're on the right track and have a DataFrame without any missing values. 

In [14]:
## --- TEST cell, replace None with your df --- ##
from checks import noise_tests

noise_tests(plug_in_charge_df) ## YOUR DF HERE

Passed all noise removal tests, great work!


At this point, you _almost_ have data that is ready for training a very simple ML model!

You should have one, potentially-useful input feature: `start_batt_level` and one target, charge `duration` (in minutes). 

The final steps are to: 
* Split this data for training _and_ testing an ML model
* Add more input features of your own design
* Save the datasets in a well-structured form

---
# Train/Test Split

Now that you have de-noised (or "cleaned") your data, next, you'll need to create separate training and test datasets for building and evaluating any ML model you create. 

For time-series data, it is recommended that you do this split in _time_ rather than just randomly selecting data points to create these two datasets. 

>The below code does this for you, as long as you provide your de-noised dataset with a `user_id` column and a `start` column.

This code uses a provided helper function that takes in a DataFrame of plug-in events, sorts it by `start` time and splits the data for each user into about 80/20 train and test data. The function returns a train and test DataFrame with the same columns as in the input DataFrame. You are welcome to take a look at this function in `helpers.py`, if you like, but you are not expected to modify it. 

It may take **up to 10mins to run**, depending on your machine and size of the df you pass in. It is suggested that you take a break or read up on some useful references in this time:
* This wikipedia page describes why we split in time to avoid [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning))
* Kaggle has several pandas-related learning resources including [this short lesson](https://www.kaggle.com/ryanholbrook/creating-features) on pandas and feature engineering.
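
To illustrate what splitting in time means, here is a simplified sketch. It is a hypothetical stand-in written for this example, not the implementation of the provided `split80_20` helper, which may differ in its details. For each user, the earliest 80% of events go to training and the latest 20% go to testing.

```python
import pandas as pd

def split_in_time(df, train_frac=0.8):
    """Hypothetical per-user split in time; NOT the provided split80_20 helper."""
    train_parts, test_parts = [], []
    # Sort once by start time, then split each user's events separately
    for _, user_rows in df.sort_values("start").groupby("user_id"):
        n_train = int(len(user_rows) * train_frac)
        train_parts.append(user_rows.iloc[:n_train])  # earliest events
        test_parts.append(user_rows.iloc[n_train:])   # latest events
    return pd.concat(train_parts), pd.concat(test_parts)

# Made-up data: two users with five daily events each
toy = pd.DataFrame({
    "user_id": ["a"] * 5 + ["b"] * 5,
    "start": pd.date_range("2020-03-02", periods=10, freq="D"),
    "duration": range(10, 20),
})

toy_train, toy_test = split_in_time(toy)
print(len(toy_train), len(toy_test))
```

Because every test event for a user happens after all of that user's training events, a model evaluated this way never gets to "see the future" during training.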

In [16]:
### --- provided code --- ### 
# Replace None with the name of your clean, plug events dataframe, and run this code

from helpers import split80_20
train_df, test_df = split80_20(plug_in_charge_df) ## YOUR DF HERE


In [17]:
### --- provided code --- ### 
# Check that the train/test sizes are about 80/20

print('Train length: ', len(train_df))
print('Test length: ', len(test_df))
print()
print('Decimal % test data of total data: ', len(test_df)/(len(test_df)+len(train_df)))


Train length:  658966
Test length:  169774

Decimal % test data of total data:  0.20485797717016194


---
# Create (More) Features

At this point, you should have training and test sets of data that contain quite a few columns, including a target charge `duration` for an ML model to predict, and the battery level at the start of a plug-in event, `start_batt_level`, which will act as one input feature.

> This is a good point to save your progress by **saving your current train/test datasets** in case you want to re-load them or modify them later on. You'll also be prompted to save this data in a _specific_ format for training an ML model, at the end of this notebook.

**To complete this notebook, you are required to create at least 1 more input feature of your own.**

These features should represent information you think will be useful in making a prediction about charge duration. There is only one rule to heed as you create any time-dependent features:
> The data used in prediction (test data) must be known at the time a prediction is to be made.

What does this mean?  
* You can't create any features in the test data, `test_df`, that aren't known at the time of testing or before.
* Typically this means calculating any statistical input features, like the mean plug-in duration for a user around every hour, using _training_ data, `train_df`, that was recorded in the past.

>**TASK**: Create _at least_ one more column that represents an engineered feature; this column should be added to both the train and test datasets.

You may find it useful to refer back to the example notebook, **Feature Engineering for Household Power** in which statistical features, `Mean_hourly_power` and `Std_hourly_power`, are calculated from _only_ the training data and added as a column to _both_ train and test datasets. 
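
A minimal sketch of that pattern on made-up data follows. The column names `start_hour` and `mean_hourly_duration` are hypothetical, as is the choice to fall back to the overall training mean for (user, hour) pairs that never appear in training. The statistic is computed from the training frame only and then joined onto both frames.

```python
import pandas as pd

# Made-up train/test frames for illustration only
toy_train = pd.DataFrame({
    "user_id": ["a", "a", "b"],
    "start": pd.to_datetime(["2020-03-02 22:00", "2020-03-03 22:30", "2020-03-02 08:00"]),
    "duration": [400.0, 500.0, 30.0],
})
toy_test = pd.DataFrame({
    "user_id": ["a", "b"],
    "start": pd.to_datetime(["2020-03-10 22:15", "2020-03-10 13:00"]),
    "duration": [450.0, 20.0],
})

# The hour of day is known at prediction time, so it is safe to derive in both frames
for frame in (toy_train, toy_test):
    frame["start_hour"] = frame["start"].dt.hour

# Mean duration per user and hour, computed from TRAINING data only
hourly_mean = (
    toy_train.groupby(["user_id", "start_hour"])["duration"]
    .mean()
    .rename("mean_hourly_duration")
    .reset_index()
)

# Join the training-derived statistic onto both frames
toy_train = toy_train.merge(hourly_mean, on=["user_id", "start_hour"], how="left")
toy_test = toy_test.merge(hourly_mean, on=["user_id", "start_hour"], how="left")

# (user, hour) pairs unseen in training have no statistic; fall back to the
# overall training mean so that no missing values remain (an assumed choice)
fallback = toy_train["duration"].mean()
toy_test["mean_hourly_duration"] = toy_test["mean_hourly_duration"].fillna(fallback)

print(toy_test)
```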

In [26]:
## TODO: Create 1+ more input features for the train_df and test_df
# New Feature: Time Of Day
def classify_time_of_day(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

# Derive the feature from each frame's own `start` column so rows stay aligned
train_df['time_of_day'] = train_df['start'].dt.hour.apply(classify_time_of_day)
test_df['time_of_day'] = test_df['start'].dt.hour.apply(classify_time_of_day)

## TODO: (Optional) Save your working train and test sets to revisit later
train_df.head()

Unnamed: 0,start,end,stream,value,user_id,start_batt_level,time_of_day,duration
0,2020-03-02 00:00:00,2020-03-02 00:26:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,6.0,Night,26.0
1,2020-03-02 00:27:00,2020-03-02 01:03:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,19.0,Night,36.0
2,2020-03-02 01:45:00,2020-03-02 02:07:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,37.0,Night,22.0
3,2020-03-02 02:15:00,2020-03-02 02:29:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,41.0,Night,14.0
4,2020-03-02 05:25:00,2020-03-02 05:35:00,/device/isPluggedIn,1.0,fd53a044-3e2f-43dd-a42b-5a6e702fb3d5,29.0,Morning,10.0


---
# Save Your Data

Before you use your data to train a model, it is convention to save it in a specific format:
> **TASK**: Format your final training and test datasets for model training and eval, and save them.
> 1. Your target variable should be the _last_ column in the dataset
> 2. All other columns should be different input features (other information should be _dropped_).
> 3. Save your training and test datasets as two, csv or (recommended) binary pickle files for later loading.

**ALSO**: Un-mount any Trove data you are no longer using; you'll likely use your locally saved data for model training, henceforth.
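
As a sketch of the formatting and saving steps on a made-up frame: the feature list here is just an example, and the file is written to a temporary directory so the snippet cleans up after itself. In your notebook you would save to a path of your choosing.

```python
import os
import tempfile
import pandas as pd

# Made-up training frame with an identifier column that is not a feature
toy_train = pd.DataFrame({
    "user_id": ["a", "b"],
    "duration": [39.0, 579.0],
    "start_batt_level": [74.0, 9.0],
    "time_of_day": ["Night", "Morning"],
})

feature_cols = ["start_batt_level", "time_of_day"]  # example feature list
target_col = "duration"

# Selecting columns in this order drops everything else and puts the target last
formatted = toy_train[feature_cols + [target_col]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_data.pkl")
    formatted.to_pickle(path)       # binary pickle keeps column dtypes
    reloaded = pd.read_pickle(path)

print(reloaded.columns.tolist())
```

Pickle files preserve dtypes such as date-times, whereas a CSV round trip turns them back into strings that must be parsed again on load.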

In [27]:
## TODO: Format train/test data: all cols are features, last col is the target
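# Keep only the input features and the target; identifiers and raw timestamps are
# dropped ('end' together with 'start' would reveal the target 'duration')
col_names = ['start_batt_level', 'time_of_day', 'duration']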
train_df = train_df[col_names]
test_df = test_df[col_names]

In [28]:
## TODO: Save the train and test sets 
train_df.to_csv('train_data.csv', index=False)
test_df.to_csv('test_data.csv', index=False)

---
## Going further and iteration
At this point, you should have initial, featurized datasets to start training and evaluating ML models. Great work! 

> Training and evaluating a baseline is exactly what you'll do in the next notebook. 

You can even add to this notebook and train your first, simple model! Especially if you are familiar with the ML libraries Turicreate or scikit-learn, you can simply pass your training data into a simple regression model and evaluate on the test dataset.
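
As a dependency-light illustration of that idea, the sketch below uses NumPy's `np.polyfit` as a stand-in for a regression model from Turicreate or scikit-learn. It compares a straight-line fit on `start_batt_level` against a "predict the training mean" baseline using mean absolute error. The data is made up and deliberately perfectly linear, so real results will look far less tidy.

```python
import numpy as np
import pandas as pd

# Made-up, perfectly linear data for illustration only
train = pd.DataFrame({"start_batt_level": [10.0, 30.0, 50.0, 70.0],
                      "duration": [400.0, 300.0, 200.0, 100.0]})
test = pd.DataFrame({"start_batt_level": [20.0, 60.0],
                     "duration": [350.0, 150.0]})

y_test = test["duration"].to_numpy()

# Baseline 1: always predict the mean training duration
mean_pred = np.full(len(test), train["duration"].mean())
mean_mae = np.abs(mean_pred - y_test).mean()

# Baseline 2: a straight line fit to the training data only
slope, intercept = np.polyfit(train["start_batt_level"].to_numpy(),
                              train["duration"].to_numpy(), deg=1)
line_pred = slope * test["start_batt_level"].to_numpy() + intercept
line_mae = np.abs(line_pred - y_test).mean()

print("Mean-baseline MAE:", mean_mae)
print("Linear-fit MAE:", line_mae)
```

A feature is worth keeping if a model that uses it beats a simple baseline like the mean prediction on the test set.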

You may end up returning to this notebook or creating a new one to explore creating other features—you are left to your imagination about what will and won't work for this use case and you are encouraged to use your peers and instructors as brainstorming partners!  

Feature engineering is very open-ended, full of exploration, and marked by trial and error. If you can think of a feature your dataset might provide that would be useful in predicting the target value, create it. Then test it out by fitting a model to it and evaluating the fit.  

If you're uncertain where to start, go back to the instructional modules in Canvas. There are ideas for creating different types of features there. Then think about the data you have, and what types of features you might create. There's no "correct" answer, only what you can dream up!