# Predicting spot prices for AWS EC2 Instances

![Containers](https://images.unsplash.com/photo-1508404971049-e37350e9f05c?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=d7711d4685e561c1326b89841ca0db2b&auto=format&fit=crop&w=667&q=80)

# Table of Contents

* Introduction
* Background
* Import libraries
* EDA (Exploratory Data Analysis)
* Cleaning
* Implement Model
* Conclusion on Results

# Introduction

The purpose of this experiment is to train a deep learning model to predict an outcome on time series data. I will be using the Fast.ai library for the model. More specifically, we will be predicting the Spot prices for specific regions.

# Background

Amazon Web Services [(AWS)](https://aws.amazon.com) provides virtual computing environments via their EC2 service. You can launch instances with your favourite operating system, select pre-configured instance images or create your own. This is relevant to data scientists because running deep learning models generally requires a machine with a good GPU. EC2 offers P2 and P3 instances, which can be configured with up to 16 or 8 GPUs respectively!

Instead of paying On-Demand prices, you can request Spot Instances, which charge you the Spot price that is in effect for the duration of your instance's running time. Spot prices are adjusted based on long-term trends in supply and demand for Spot Instance capacity, and Spot Instances can be discounted by up to 90% compared to On-Demand pricing.


Our goal will be to predict Spot pricing for the different global regions on offer:

* US East
* US West
* South America (East)
* EU (European Union) West 
* EU Central
* Canada
* Asia Pacific North East 1
* Asia Pacific North East 2
* Asia Pacific South
* Asia Pacific Southeast 1 
* Asia Pacific Southeast 2
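
For reference, these regions line up with the AWS region codes that name the CSV files loaded later in this notebook. The mapping below is only a sketch: the pairing is my reading of the two lists, and the names `REGION_CODES` and `REGION_NAMES` are made up for illustration.

```python
# Human-readable region names mapped to the AWS region codes used as file names.
# REGION_CODES / REGION_NAMES are hypothetical helper names, not part of the dataset.
REGION_CODES = {
    "US East": "us-east-1",
    "US West": "us-west-1",
    "South America (East)": "sa-east-1",
    "EU West": "eu-west-1",
    "EU Central": "eu-central-1",
    "Canada": "ca-central-1",
    "Asia Pacific North East 1": "ap-northeast-1",
    "Asia Pacific North East 2": "ap-northeast-2",
    "Asia Pacific South": "ap-south-1",
    "Asia Pacific Southeast 1": "ap-southeast-1",
    "Asia Pacific Southeast 2": "ap-southeast-2",
}

# Reverse lookup: from a region code back to a display name.
REGION_NAMES = {code: name for name, code in REGION_CODES.items()}
print(REGION_NAMES["eu-west-1"])
```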





# Import Libraries

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from IPython.display import HTML, display
from fastai.structured import *
from fastai.column_data import *

np.set_printoptions(threshold=50, edgeitems=20)

print(os.listdir("../input"))



Let's import all the tables.

In [None]:
PATH = "../input/"
PATH_WRITE = "/kaggle/working/"

In [None]:
ls {PATH}

In [None]:
table_names = ['ap-southeast-2', 'ap-northeast-2', 'ca-central-1', 'us-east-1', 
               'ap-northeast-1', 'ap-south-1', 'sa-east-1', 'eu-west-1', 
               'ap-southeast-1', 'us-west-1', 'eu-central-1']

In [None]:
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]

# EDA 

Let's call `head` and take a look at what our data looks like.

In [None]:
for t in tables: display(t.head())

Let's call `summary`.

In [None]:
for t in tables: display(DataFrameSummary(t).summary())

I think we need to change some of the column names.

In [None]:
new_labels = ['Date', 'Instance Type', 'OS', 'Region', 'Price ($)']

In [None]:
for t in tables:
    t.columns = new_labels

In [None]:
for t in tables: display(t.head())

In [None]:
for t in tables:
    plt.figure(figsize=(12,8))
    # Pass the column as x= so the call is unambiguous across seaborn versions
    sns.countplot(x=t['Instance Type'], order=t['Instance Type'].value_counts().iloc[:20].index)
    plt.xticks(rotation=90);


List of questions to ask:

* Average price for certain instances in each region
* Most frequent instance types
* Seasonality of instances
* Determine if there are any stationary variables
* Which instance type is most frequently linked with which OS?
* Plot instances in time intervals, e.g. between 21:00 and 22:00
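
A few of these questions can be roughed out with plain pandas. The sketch below uses a tiny made-up frame with the notebook's column names (the values are illustrative only, not taken from the dataset), so treat it as a starting point rather than an analysis.

```python
import pandas as pd

# Tiny made-up sample with the notebook's column names; values are illustrative only.
sample = pd.DataFrame({
    "Date": pd.to_datetime([
        "2017-04-12 21:15:00", "2017-04-12 21:45:00",
        "2017-04-12 23:05:00", "2017-04-13 21:30:00",
    ]),
    "Instance Type": ["d2.xlarge", "d2.xlarge", "m4.large", "m4.large"],
    "OS": ["Windows", "Linux/UNIX", "Linux/UNIX", "Linux/UNIX"],
    "Price ($)": [0.50, 0.10, 0.03, 0.05],
})

# Average price per instance type (run once per region table).
avg_price = sample.groupby("Instance Type")["Price ($)"].mean()

# How often each instance type appears together with each OS.
os_counts = pd.crosstab(sample["Instance Type"], sample["OS"])

# Rows recorded between 21:00 and 22:00, regardless of the day.
evening = sample.set_index("Date").between_time("21:00", "22:00")

print(avg_price, os_counts, evening, sep="\n\n")
```

`between_time` needs a `DatetimeIndex`, which is one more reason to parse the `Date` column before plotting.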

I also need to figure out how to label each region's table by name in the graphs.
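
One possible way to do that is to keep each table paired with its region code. This is a sketch with tiny stand-in DataFrames; `tables_by_region` is a hypothetical name, and in the notebook the real `table_names` and `tables` lists would be used instead.

```python
import pandas as pd

# Region codes as in table_names; the DataFrames are tiny stand-ins
# for the real tables read from CSV.
table_names = ["us-east-1", "eu-west-1"]
tables = [pd.DataFrame({"Instance Type": ["d2.xlarge"]}) for _ in table_names]

# Keep name and table together so each plot can be titled by region.
tables_by_region = dict(zip(table_names, tables))

for region, t in tables_by_region.items():
    title = f"Most frequent instance types in {region}"
    print(title, len(t))
    # In the notebook: call plt.title(title) after sns.countplot(...)
```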

Let's look at the tables separately:

# US East 

In [None]:
us_east = pd.read_csv("../input/us-east-1.csv")
PATH_USEAST = "../input/us-east-1.csv"

In [None]:
us_east.columns = new_labels
us_east.head()

In [None]:
us_east['Date'].head()

We need to parse the dates, otherwise they will not appear on the axis. The format string needs to match the format in the column EXACTLY! For more info look [here](http://strftime.org/) and [here](https://codeburst.io/dealing-with-datetimes-like-a-pro-in-pandas-b80d3d808a7f)

In [None]:
us_east['Date'] = pd.to_datetime(us_east['Date'], format='%Y-%m-%d %H:%M:%S+00:00', utc=False)
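
To see why the format string has to match exactly, here is a small self-contained sketch on made-up timestamps laid out like the `Date` column. The second call deliberately uses a wrong format to show that pandas raises an error rather than guessing.

```python
import pandas as pd

# Made-up timestamps in the same layout as the Date column.
raw = pd.Series(["2017-04-12 21:15:00+00:00", "2017-04-13 08:00:05+00:00"])

# The literal "+00:00" in the format string must match the text exactly.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S+00:00")
print(parsed.dt.hour.tolist())

# A format that does not match raises a ValueError instead of guessing.
try:
    pd.to_datetime(raw, format="%d/%m/%Y")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```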

In [None]:
us_east.info()

In [None]:
us_east['Date'].head(500)

## Instance: d2.xlarge

In [None]:
d2 = us_east[us_east['Instance Type'] == "d2.xlarge"].set_index('Date')

In [None]:
# Filter the d2.xlarge rows (already indexed by Date) by operating system
d2_Unix = d2[d2['OS'] == "Linux/UNIX"]

In [None]:
d2_Suse = d2[d2['OS'] == "SUSE Linux"]

In [None]:
d2_Win = d2[d2['OS'] == "Windows"]

In [None]:
d2.head()

In [None]:
d2.head(100).plot(title="d2.xlarge Instances", figsize=(15,10))

In [None]:
d2_Suse.head(100).plot(title="d2.xlarge Instances OS:SUSE Linux", figsize=(15,10))

In [None]:
d2_Unix.head(100).plot(title="d2.xlarge Instances OS:Linux/UNIX", figsize=(15,10))

In [None]:
d2_Win.head(100).plot(title="d2.xlarge Instances OS:Windows", figsize=(15,10))

Looks like Windows instances can get quite pricey, with highs ranging from roughly `$`7 to `$`29! 🤷‍♀️

# Cleaning

Let's go through some checks for signs that we need to clean our dataset.

In [None]:
us_east['Instance Type'].value_counts(dropna=False)

In [None]:
us_east['OS'].value_counts(dropna=False)

In [None]:
len(us_east), us_east.isnull().sum().sum()  # (row count, total number of null values)

Out of 3 721 999 entries, none have null values.
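
One caveat about this kind of check: `len(df.isnull())` only returns the number of rows, because `isnull()` gives back a True/False mask of the same shape as the frame. Summing the mask is what actually counts missing values. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing price.
df = pd.DataFrame({"OS": ["Windows", "Linux/UNIX", "SUSE Linux"],
                   "Price ($)": [0.5, np.nan, 0.1]})

row_count = len(df.isnull())                # same as len(df): counts rows, not nulls
nulls_per_column = df.isnull().sum()        # missing values in each column
total_nulls = int(nulls_per_column.sum())   # missing values in the whole frame

print(row_count, total_nulls)
```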

Let's train on another dataset.

In [None]:
eu_west = pd.read_csv("../input/eu-west-1.csv")
PATH_euwest = "../input/"

In [None]:
eu_west.columns = new_labels

In [None]:
eu_west.info()

In [None]:
eu_west['Instance Type'].value_counts(dropna=False)

In [None]:
len(eu_west), eu_west.isnull().sum().sum()  # (row count, total number of null values)

# Implement Model

Things to do:

* Count the instance types [done]
* Add date part [done]
* Create categorical and continuous variables [done] (the data has no other continuous variables!)
* Process datasets [done]
* Split dataset via datetime [done]
* Create RMSE metric
* Create model data object
* Calculate embeddings
* Train model

In [None]:
add_datepart(eu_west, 'Date', drop=False)
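
To make it clearer what the date part step contributes, here is a rough pandas-only approximation of the date columns this notebook uses later. It is not fastai's implementation (`add_datepart` also adds further flag columns), and `date_parts` is a hypothetical helper name.

```python
import pandas as pd

def date_parts(df, col):
    """Rough approximation of the date features used later
    (Year, Month, Week, Day, Dayofweek, Dayofyear, Elapsed)."""
    d = df[col].dt
    out = df.copy()
    out["Year"] = d.year
    out["Month"] = d.month
    out["Week"] = d.isocalendar().week.astype("int64")  # ISO week number
    out["Day"] = d.day
    out["Dayofweek"] = d.dayofweek  # Monday=0
    out["Dayofyear"] = d.dayofyear
    # Seconds since the Unix epoch, usable as a continuous variable.
    out["Elapsed"] = (df[col] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    return out

# Made-up timestamp for illustration.
demo = pd.DataFrame({"Date": pd.to_datetime(["2017-04-12 21:15:00"])})
row = date_parts(demo, "Date").iloc[0]
print(row.to_dict())
```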

In [None]:
eu_west.reset_index(inplace=True)
eu_west.to_feather(f'{PATH_WRITE}eu_west')
eu_west.shape

In [None]:
eu_west=pd.read_feather(f'{PATH_WRITE}eu_west')

In [None]:
eu_west.columns

In [None]:
joined = eu_west
joined_test = eu_west

In [None]:
joined.to_feather(f'{PATH_WRITE}joined')
joined_test.to_feather(f'{PATH_WRITE}joined_test')

In [None]:
joined = pd.read_feather(f'{PATH_WRITE}joined')
joined_test = pd.read_feather(f'{PATH_WRITE}joined_test')

In [None]:
joined.head()

In [None]:
cat_vars = [
    'Instance Type',
    'OS',
    'Region',
    'Year',
    'Month',
    'Week',
    'Day',
    'Dayofweek',
    'Dayofyear',
]

contin_vars = ['Elapsed']

n = len(joined); n

In [None]:
dep = 'Price ($)'
joined = joined[cat_vars+contin_vars+[dep,'Date']].copy()

In [None]:
joined_test[dep] = 0
joined_test = joined_test[cat_vars+contin_vars+[dep,'Date',]].copy()

In [None]:
for cat in cat_vars: joined[cat] = joined[cat].astype('category').cat.as_ordered()

# Give the test set the same category mappings as the training set
apply_cats(joined_test, joined)

joined[dep] = joined[dep].astype('float32')
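
The training and test frames need to share the same category-to-code mapping, otherwise the same OS or instance type could end up with different integer codes in each. fastai provides `apply_cats` for this; the plain pandas sketch below, on made-up frames, just shows the idea.

```python
import pandas as pd

# Made-up training and test frames.
train = pd.DataFrame({"OS": ["Windows", "Linux/UNIX", "SUSE Linux"]})
test = pd.DataFrame({"OS": ["SUSE Linux", "Windows"]})

# Fix the category order on the training data...
train["OS"] = train["OS"].astype("category").cat.as_ordered()

# ...and reuse exactly those categories on the test data so the codes agree.
test["OS"] = pd.Categorical(test["OS"],
                            categories=train["OS"].cat.categories,
                            ordered=True)

print(train["OS"].cat.codes.tolist(), test["OS"].cat.codes.tolist())
```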

In [None]:
for contin in contin_vars: 
    joined[contin] = joined[contin].astype('float32')
    joined_test[contin] = joined_test[contin].astype('float32')

In [None]:

idxs = get_cv_idxs(n, val_pct=150000/n)
joined_sample = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_sample); samp_size

In [None]:
samp_size = n


In [None]:
joined_sample.head()

In [None]:
df_train, y, nas, mapper = proc_df(joined_sample,'Price ($)', do_scale=True)
yl = np.log(y)

In [None]:
joined_test = joined_test.set_index("Date")

In [None]:
df_test, _, nas, mapper = proc_df(joined_test, 'Price ($)', do_scale=True, mapper=mapper, na_dict=nas)


In [None]:
df_train.info()

In [None]:
train_val_split = 0.80
# Base the split on the frame actually being split rather than a hard-coded row count
train_size = int(len(df_train) * train_val_split); train_size
val_idx = list(range(train_size, len(df_train)))

In [None]:
val_idx = np.flatnonzero(
         (df_train.index<=datetime.datetime(2017,4,12)) & (df_train.index>=datetime.datetime(2017,4,12)))
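
Note that both comparisons above use the same date, so only rows stamped exactly at midnight on that day can match. A date-based hold-out with a single cut-off might look like the sketch below; the frame and the cut-off date are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame indexed by date, standing in for df_train.
idx = pd.date_range("2017-04-01", periods=10, freq="D")
frame = pd.DataFrame({"x": range(10)}, index=idx)

# Validation rows: everything on or after a cut-off date (hypothetical value).
cutoff = pd.Timestamp("2017-04-09")
val_idx = np.flatnonzero(frame.index >= cutoff)

print(val_idx.tolist())
```

Holding out the most recent rows keeps the validation set in the "future" relative to training, which is usually what you want for time series.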

In [None]:
val_idx=[0]

In [None]:
len(val_idx)

Now we can set up our model.

In [None]:
def inv_y(a): return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)
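
The metric here is a root mean squared percentage error: since the model is trained on log prices, both arguments are exponentiated first, and then the error is sqrt(mean(((actual - predicted) / actual)^2)). A quick sanity check on made-up prices, repeating the two functions so the snippet runs on its own:

```python
import math
import numpy as np

def inv_y(a):
    return np.exp(a)

def exp_rmspe(y_pred, targ):
    # Both arguments are log-prices; convert back before comparing.
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred)) / targ
    return math.sqrt((pct_var ** 2).mean())

# Made-up prices: one prediction is 10% too high, the other 10% too low.
actual = np.array([1.0, 2.0])
predicted = np.array([1.1, 1.8])
score = exp_rmspe(np.log(predicted), np.log(actual))
print(round(score, 6))
```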

In [None]:
md = ColumnarModelData.from_data_frame(PATH_euwest, val_idx, df_train, yl.astype(np.float32), 
                                       cat_flds=cat_vars, bs=128, test_df=df_test)

In [None]:
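# proc_df replaces categories with integer codes, so read the category counts from joined_sample
cat_sz = [(c, len(joined_sample[c].cat.categories)+1) for c in cat_vars]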
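
The next to-do item, calculating embeddings, needs an embedding width for each categorical variable. A rule of thumb used in the fast.ai course notebooks is half the cardinality, capped at 50. The sketch below applies it to made-up cardinalities, not the real ones from this dataset.

```python
# Hypothetical cardinalities (number of categories + 1) for three variables.
cat_sz = [("Instance Type", 80), ("OS", 4), ("Dayofyear", 367)]

# Rule of thumb: about half the cardinality, capped at 50 dimensions.
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
print(emb_szs)
```

Each tuple is (cardinality, embedding width), so a variable with many levels such as `Dayofyear` is capped at 50 while `OS` gets only a couple of dimensions.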


# Conclusion on Results