# Scikit-Learn Linear Regression
This notebook uses `SALES_VIEW` from DWC. The view has 6,291,450 records.

## Install fedml_aws library

In [1]:
pip install fedml_aws-1.0.0-py3-none-any.whl --force-reinstall

Processing ./fedml_aws-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Installing collected packages: hdbcli, fedml-aws
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.13
    Uninstalling hdbcli-2.10.13:
      Successfully uninstalled hdbcli-2.10.13
  Attempting uninstall: fedml-aws
    Found existing installation: fedml-aws 1.0.0
    Uninstalling fedml-aws-1.0.0:
      Successfully uninstalled fedml-aws-1.0.0
Successfully installed fedml-aws-1.0.0 hdbcli-2.10.13
Note: you may need to restart the kernel to use updated packages.


## Import Libraries

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # plotting

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='scikit-learn/linear-regression', bucket_name='fedml-bucket')

## Create DbConnection instance to get data from DWC

Before running the following cell, you should have a `config.json` file in the same directory as this notebook, containing the values needed to access DWC.

You should also have the view `SALES_VIEW` created in your DWC. To gather this data, please refer to https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

Please note that the 2M-record dataset was downloaded and duplicated 3 times to represent a large dataset in DWC.

In [4]:
import json
with open('config.json', 'r') as f:
    config = json.load(f)
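
If `config.json` is missing a value, the failure otherwise shows up only later, when the query is built. The sketch below checks for expected keys right after loading. The helper name `load_config` is hypothetical, and `schema` is the only key this notebook is known to read; any other keys your connection needs are not shown here.

```python
import json


def load_config(path="config.json", required_keys=("schema",)):
    """Load the JSON config and fail early if an expected key is missing.

    `required_keys` defaults to the one key this notebook reads directly;
    extend it with whatever else your setup needs (assumption).
    """
    with open(path, "r") as f:
        config = json.load(f)
    missing = [key for key in required_keys if key not in config]
    if missing:
        raise KeyError(f"{path} is missing required keys: {missing}")
    return config
```

Calling `load_config()` with no arguments reads `config.json` from the notebook's directory, as the cell above does.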

In [5]:
%%time
db = DbConnection()
train_data = db.execute_query('SELECT * FROM ' + config['schema'] +'.SALES_VIEW')
data = pd.DataFrame(train_data[0], columns=train_data[1])
data

CPU times: user 25.1 s, sys: 3.61 s, total: 28.7 s
Wall time: 32.3 s


|  | Region | Country | Order_ID | Item_Type | Sales_Channel | Order_Priority | Units_Sold | Unit_Price | Unit_Cost | Total_Revenue | Total_Cost | Total_Profit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Sub-Saharan Africa | Guinea-Bissau | 197647750 | Beverages | Offline | C | 7216 | 47.45 | 31.79 | 342399.20 | 229396.64 | 113002.56 |
| 1 | Sub-Saharan Africa | Sudan | 321990668 | Beverages | Offline | C | 3049 | 47.45 | 31.79 | 144675.05 | 96927.71 | 47747.34 |
| 2 | Sub-Saharan Africa | Sudan | 982767236 | Beverages | Offline | C | 1519 | 47.45 | 31.79 | 72076.55 | 48289.01 | 23787.54 |
| 3 | Sub-Saharan Africa | Guinea-Bissau | 897898280 | Beverages | Offline | C | 6909 | 47.45 | 31.79 | 327832.05 | 219637.11 | 108194.94 |
| 4 | Sub-Saharan Africa | Sudan | 458928811 | Beverages | Offline | C | 6088 | 47.45 | 31.79 | 288875.60 | 193537.52 | 95338.08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6291445 | Europe | Portugal | 895778262 | Baby Food | Online | L | 9549 | 255.28 | 159.42 | 2437668.72 | 1522301.58 | 915367.14 |
| 6291446 | Asia | Nepal | 201643168 | Office Supplies | Online | H | 9549 | 651.21 | 524.96 | 6218404.29 | 5012843.04 | 1205561.25 |
| 6291447 | Europe | Montenegro | 607594430 | Baby Food | Offline | M | 9549 | 255.28 | 159.42 | 2437668.72 | 1522301.58 | 915367.14 |
| 6291448 | Middle East and North Africa | Qatar | 150846421 | Baby Food | Offline | C | 7663 | 255.28 | 159.42 | 1956210.64 | 1221635.46 | 734575.18 |
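
The query cell concatenates the schema name into the SQL string and then builds a DataFrame from the first two elements of the result. The sketch below isolates both steps so they can be tried without a DWC connection. The helper names are hypothetical, the identifier pattern is an illustrative assumption rather than a rule taken from DWC, and the `(rows, column_names)` shape is inferred only from how the cell indexes `train_data[0]` and `train_data[1]`.

```python
import re

import pandas as pd

# Illustrative pattern only (assumption): adjust it to the identifier
# rules of your own database before relying on it.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_select_all(schema, view):
    """Build 'SELECT * FROM schema.view', rejecting unexpected characters."""
    for name in (schema, view):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unexpected characters in identifier: {name!r}")
    return f"SELECT * FROM {schema}.{view}"


def result_to_frame(result):
    """Turn a (rows, column_names) pair into a DataFrame."""
    rows, columns = result[0], result[1]
    return pd.DataFrame(rows, columns=columns)


# Stand-in for a query result, so the sketch runs without a database.
fake_result = (
    [("Sudan", 3049, 47.45), ("Sudan", 1519, 47.45)],
    ["Country", "Units_Sold", "Unit_Price"],
)
frame = result_to_frame(fake_result)
print(build_select_all("MY_SCHEMA", "SALES_VIEW"))
print(frame)
```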


## Make sure no column contains NA or null values

In [6]:
data.isna().any()

Region            False
Country           False
Order_ID          False
Item_Type         False
Sales_Channel     False
Order_Priority    False
Units_Sold        False
Unit_Price        False
Unit_Cost         False
Total_Revenue     False
Total_Cost        False
Total_Profit      False
dtype: bool

In [7]:
data.isnull().any()

Region            False
Country           False
Order_ID          False
Item_Type         False
Sales_Channel     False
Order_Priority    False
Units_Sold        False
Unit_Price        False
Unit_Cost         False
Total_Revenue     False
Total_Cost        False
Total_Profit      False
dtype: bool
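
In pandas, `isnull()` is an alias of `isna()`, so the two cells above report the same thing. As a small sketch on made-up sample data (not the DWC view), the check below confirms that and shows `isna().sum()`, which counts the missing values per column instead of only flagging them.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value in each column.
sample = pd.DataFrame(
    {
        "Units_Sold": [7216, 3049, None],
        "Unit_Price": [47.45, np.nan, 47.45],
    }
)

# isna() and isnull() produce identical boolean masks.
same_mask = sample.isna().equals(sample.isnull())

# Count the missing values per column.
missing_per_column = sample.isna().sum()

print(same_mask)
print(missing_per_column)
```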

In [8]:
data.columns

Index(['Region', 'Country', 'Order_ID', 'Item_Type', 'Sales_Channel',
       'Order_Priority', 'Units_Sold', 'Unit_Price', 'Unit_Cost',
       'Total_Revenue', 'Total_Cost', 'Total_Profit'],
      dtype='object')

## Correlation

In [9]:
# Create correlation matrix
corr_matrix = data.corr().abs()

# Select upper triangle of correlation matrix
# np.bool was removed in NumPy 1.24; the built-in bool works on all versions
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

In [10]:
corr_matrix

|  | Units_Sold | Unit_Price | Unit_Cost | Total_Revenue | Total_Cost | Total_Profit |
| --- | --- | --- | --- | --- | --- | --- |
| Units_Sold | 1.0 | 0.000807 | 0.000659 | 0.523055 | 0.471242 | 0.59829 |
| Unit_Price | 0.000807 | 1.0 | 0.986049 | 0.738524 | 0.753562 | 0.577423 |
| Unit_Cost | 0.000659 | 0.986049 | 1.0 | 0.728145 | 0.764154 | 0.505104 |
| Total_Revenue | 0.523055 | 0.738524 | 0.728145 | 1.0 | 0.987724 | 0.880793 |
| Total_Cost | 0.471242 | 0.753562 | 0.764154 | 0.987724 | 1.0 | 0.796014 |
| Total_Profit | 0.59829 | 0.577423 | 0.505104 | 0.880793 | 0.796014 | 1.0 |
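
The notebook computes the upper triangle of the correlation matrix but does not use it further. One common use, sketched here as an assumption about intent, is to list each column pair once and keep those above a threshold. The function name `correlated_pairs` and the 0.95 threshold are hypothetical choices. In the matrix above, Unit_Price/Unit_Cost (0.986049) and Total_Revenue/Total_Cost (0.987724) are the pairs that would exceed 0.95.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.95):
    """Return column pairs whose absolute correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep only entries strictly above the diagonal so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked positions) never satisfy the comparison.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Made-up sample: b is exactly 2 * a, c is only weakly related to both.
sample = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}
)
print(correlated_pairs(sample))
```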


In [11]:
print(type(data))

<class 'pandas.core.frame.DataFrame'>


In [12]:
df = data.iloc[:,6:]
df

|  | Units_Sold | Unit_Price | Unit_Cost | Total_Revenue | Total_Cost | Total_Profit |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 7216 | 47.45 | 31.79 | 342399.20 | 229396.64 | 113002.56 |
| 1 | 3049 | 47.45 | 31.79 | 144675.05 | 96927.71 | 47747.34 |
| 2 | 1519 | 47.45 | 31.79 | 72076.55 | 48289.01 | 23787.54 |
| 3 | 6909 | 47.45 | 31.79 | 327832.05 | 219637.11 | 108194.94 |
| 4 | 6088 | 47.45 | 31.79 | 288875.60 | 193537.52 | 95338.08 |
| ... | ... | ... | ... | ... | ... | ... |
| 6291445 | 9549 | 255.28 | 159.42 | 2437668.72 | 1522301.58 | 915367.14 |
| 6291446 | 9549 | 651.21 | 524.96 | 6218404.29 | 5012843.04 | 1205561.25 |
| 6291447 | 9549 | 255.28 | 159.42 | 2437668.72 | 1522301.58 | 915367.14 |
| 6291448 | 7663 | 255.28 | 159.42 | 1956210.64 | 1221635.46 | 734575.18 |
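
`data.iloc[:, 6:]` selects the numeric columns by position, which silently changes meaning if the view's column order changes. As an alternative sketch (not what the notebook runs), the same columns can be selected by name. The two sample rows are copied from the output above.

```python
import pandas as pd

# Column names as listed by data.columns earlier in the notebook.
NUMERIC_COLUMNS = [
    "Units_Sold", "Unit_Price", "Unit_Cost",
    "Total_Revenue", "Total_Cost", "Total_Profit",
]

# Two rows copied from the notebook output, used as a small stand-in.
sample = pd.DataFrame(
    [
        ["Sub-Saharan Africa", "Guinea-Bissau", "197647750", "Beverages",
         "Offline", "C", 7216, 47.45, 31.79, 342399.20, 229396.64, 113002.56],
        ["Sub-Saharan Africa", "Sudan", "321990668", "Beverages",
         "Offline", "C", 3049, 47.45, 31.79, 144675.05, 96927.71, 47747.34],
    ],
    columns=["Region", "Country", "Order_ID", "Item_Type", "Sales_Channel",
             "Order_Priority"] + NUMERIC_COLUMNS,
)

by_position = sample.iloc[:, 6:]    # what the notebook does
by_name = sample[NUMERIC_COLUMNS]   # same columns, robust to reordering

print(by_position.equals(by_name))
```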


In [13]:
for i in df.columns:
    print(df[i])

0          7216
1          3049
2          1519
3          6909
4          6088
           ... 
6291445    9549
6291446    9549
6291447    9549
6291448    7663
6291449    3600
Name: Units_Sold, Length: 6291450, dtype: int64
0           47.45
1           47.45
2           47.45
3           47.45
4           47.45
            ...  
6291445    255.28
6291446    651.21
6291447    255.28
6291448    255.28
6291449    152.58
Name: Unit_Price, Length: 6291450, dtype: float64
0           31.79
1           31.79
2           31.79
3           31.79
4           31.79
            ...  
6291445    159.42
6291446    524.96
6291447    159.42
6291448    159.42
6291449     97.44
Name: Unit_Cost, Length: 6291450, dtype: float64
0           342399.20
1           144675.05
2            72076.55
3           327832.05
4           288875.60
              ...    
6291445    2437668.72
6291446    6218404.29
6291447    2437668.72
6291448    1956210.64
6291449     549288.00
Name: Total_Revenue, Length: 6291450, d

## Train Scikit-learn Model

`train_data` is the data you want to train your model with. Here it is `df`, the numeric columns selected above.

To deploy a model to AWS using the Scikit-learn SageMaker SDK, you must have a script that tells SageMaker how to train and deploy the model. The path to the script is passed to the `train_sklearn_model` function in the `train_script` parameter.

`instance_type` specifies the AWS instance type, and therefore how much computing power AWS allocates for the training job.
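
The contents of `sales_train.py` are not shown in this notebook. As a rough idea of what such a script can look like, here is a minimal sketch following the usual SageMaker scikit-learn script-mode pattern: read the training channel, fit a model, save it to the model directory, and provide `model_fn` so the endpoint can load it. Everything specific is an assumption: the target column `Total_Profit`, training data arriving as CSV files with a header row, and the file name `model.joblib`.

```python
import argparse
import glob
import os

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

TARGET_COLUMN = "Total_Profit"  # assumption: the notebook does not name the target


def train(train_dir, model_dir, target=TARGET_COLUMN):
    """Fit a linear regression on every CSV in train_dir and save it."""
    csv_files = sorted(glob.glob(os.path.join(train_dir, "*.csv")))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {train_dir}")
    # Assumption: each file has a header row with the column names.
    frame = pd.concat((pd.read_csv(p) for p in csv_files), ignore_index=True)
    features = frame.drop(columns=[target])
    model = LinearRegression().fit(features, frame[target])
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Called by the SageMaker scikit-learn container to load the model."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Run only inside a training job, where SageMaker sets these variables.
if __name__ == "__main__" and os.environ.get("SM_CHANNEL_TRAIN"):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    args, _ = parser.parse_known_args()
    train(args.train, args.model_dir)
```

Keeping the fitting logic in a plain `train` function makes it possible to try the script locally on a small CSV before paying for a training instance.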

In [14]:
clf = dwcs.train_sklearn_model(df,
                               train_script='sales_train.py',
                               instance_type='ml.c4.xlarge',
                               wait=True)

Training data uploaded
2021-10-06 23:17:14 Starting - Starting the training job...
2021-10-06 23:17:37 Starting - Launching requested ML instancesProfilerReport-1633562234: InProgress
......
2021-10-06 23:18:37 Starting - Preparing the instances for training.........
2021-10-06 23:19:58 Downloading - Downloading input data...
2021-10-06 23:20:43 Training - Training image download completed. Training in progress..
2021-10-06 23:20:44,415 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-10-06 23:20:44,418 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:20:44,429 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-10-06 23:20:44,967 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:20:44,979 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
20