## Stock Price Prediction using `XGBoost` Algorithm

> 1. Build and train an Amazon SageMaker model<br>
> 2. Deploy and test the Amazon SageMaker model endpoint<br>
> 3. Create an AWS Lambda function
> 4. Build, deploy and test an API Gateway endpoint for the REST API

### 1. Build and train an Amazon SageMaker model<br>
#### 1.1. Create an S3 Bucket
**Boto3** is the **Amazon Web Services (AWS) Software Development Kit (SDK)** for Python, which allows you to directly create, update, and delete AWS resources from your Python scripts.

Boto3 makes it easy to integrate your Python application, library, or script with AWS services including Amazon S3, Amazon EC2, DynamoDB, and more.

In [None]:
import boto3

s3 = boto3.client('s3')

In [17]:
bucket_name = 'yf-stock-price-prediction'

try:
    s3.create_bucket(Bucket=bucket_name)
    print("S3 Bucket has been created successfully")
except Exception as e:
    print('S3 error: ', e)

S3 Bucket has been created successfully
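
One caveat with the call above: `create_bucket` without a `CreateBucketConfiguration` only works when the client's region is `us-east-1`; in any other region S3 expects a `LocationConstraint`. A minimal sketch of a region-aware variant follows. The helper name `bucket_kwargs` is hypothetical and the region in the example is a placeholder.

```python
def bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3.create_bucket for the given region.

    Hypothetical helper: us-east-1 must not receive a LocationConstraint,
    while every other region requires one.
    """
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage with the client from the cell above (not executed here):
#   s3.create_bucket(**bucket_kwargs(bucket_name, s3.meta.region_name))
print(bucket_kwargs("yf-stock-price-prediction", "eu-west-1"))
```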


#### 1.2. Create train and validation csv

In [3]:
from datetime import datetime
import yfinance as yf

# Initialize
start_date = datetime(2021, 1, 1)
end_date = datetime(2024, 11, 15)

# Get the data
df_data = yf.download('AAPL', start = start_date, end = end_date)
df_data.reset_index(inplace=True)
print(df_data)

[*********************100%***********************]  1 of 1 completed

Price                       Date   Adj Close       Close        High  \
Ticker                                  AAPL        AAPL        AAPL   
0      2021-01-04 00:00:00+00:00  126.544205  129.410004  133.610001   
1      2021-01-05 00:00:00+00:00  128.108765  131.009995  131.740005   
2      2021-01-06 00:00:00+00:00  123.796432  126.599998  131.050003   
3      2021-01-07 00:00:00+00:00  128.020782  130.919998  131.630005   
4      2021-01-08 00:00:00+00:00  129.125763  132.050003  132.630005   
..                           ...         ...         ...         ...   
969    2024-11-08 00:00:00+00:00  226.960007  226.960007  228.660004   
970    2024-11-11 00:00:00+00:00  224.229996  224.229996  225.699997   
971    2024-11-12 00:00:00+00:00  224.229996  224.229996  225.589996   
972    2024-11-13 00:00:00+00:00  225.119995  225.119995  226.649994   
973    2024-11-14 00:00:00+00:00  228.220001  228.220001  228.869995   

Price          Low        Open     Volume  
Ticker        AAPL 




#### 1.3. Extract, Load & Transform

In [4]:
# Drop unnecessary columns from the data table.
# The DataFrame columns form a MultiIndex with two levels, Price and Ticker,
# so each column to drop is named by a (Price, Ticker) tuple.
df_data.drop(columns=[('Date', ''), ('Adj Close', 'AAPL')], inplace=True)
print(df_data)


Price        Close        High         Low        Open     Volume
Ticker        AAPL        AAPL        AAPL        AAPL       AAPL
0       129.410004  133.610001  126.760002  133.520004  143301900
1       131.009995  131.740005  128.429993  128.889999   97664900
2       126.599998  131.050003  126.379997  127.720001  155088000
3       130.919998  131.630005  127.860001  128.360001  109578200
4       132.050003  132.630005  130.229996  132.429993  105158200
..             ...         ...         ...         ...        ...
969     226.960007  228.660004  226.410004  227.169998   38328800
970     224.229996  225.699997  221.500000  225.000000   42005600
971     224.229996  225.589996  223.360001  224.550003   40398300
972     225.119995  226.649994  222.759995  224.009995   48566200
973     228.220001  228.869995  225.000000  225.020004   44923900

[974 rows x 5 columns]


In [5]:
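# Drop the last row: its next-day price is what we ultimately want to predict, so it has no target yet

# (df.iloc[rows, columns]); .copy() makes an independent DataFrame for the later column assignment
df_data_features = df_data.iloc[:-1, :].copy()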

print(df_data_features)

Price        Close        High         Low        Open     Volume
Ticker        AAPL        AAPL        AAPL        AAPL       AAPL
0       129.410004  133.610001  126.760002  133.520004  143301900
1       131.009995  131.740005  128.429993  128.889999   97664900
2       126.599998  131.050003  126.379997  127.720001  155088000
3       130.919998  131.630005  127.860001  128.360001  109578200
4       132.050003  132.630005  130.229996  132.429993  105158200
..             ...         ...         ...         ...        ...
968     227.479996  227.880005  224.570007  224.630005   42137700
969     226.960007  228.660004  226.410004  227.169998   38328800
970     224.229996  225.699997  221.500000  225.000000   42005600
971     224.229996  225.589996  223.360001  224.550003   40398300
972     225.119995  226.649994  222.759995  224.009995   48566200

[973 rows x 5 columns]


In [6]:
# Use the next day's 'Open' price as the target: the 'Open' values from row 1 onward line up with the features in rows 0 to n-2
df_data_targets = df_data.loc[1:, ('Open', 'AAPL')].rename('Targets')

print(df_data_targets)

1      128.889999
2      127.720001
3      128.360001
4      132.429993
5      129.190002
          ...    
969    227.169998
970    225.000000
971    224.550003
972    224.009995
973    225.020004
Name: Targets, Length: 973, dtype: float64


In [42]:
# Assign the target column using .loc to avoid the SettingWithCopyWarning
df_data_features.loc[:, 'Target'] = list(df_data_targets)

# The SageMaker built-in `XGBoost` algorithm expects the target in the first column of CSV input
first_column = df_data_features.pop('Target')
df_data_features.insert(loc=0, column='Target', value=first_column)

df_data_final = df_data_features.copy()
print(df_data_final)

Price       Target       Close        High         Low        Open     Volume
Ticker                    AAPL        AAPL        AAPL        AAPL       AAPL
0       128.889999  129.410004  133.610001  126.760002  133.520004  143301900
1       127.720001  131.009995  131.740005  128.429993  128.889999   97664900
2       128.360001  126.599998  131.050003  126.379997  127.720001  155088000
3       132.429993  130.919998  131.630005  127.860001  128.360001  109578200
4       129.190002  132.050003  132.630005  130.229996  132.429993  105158200
..             ...         ...         ...         ...         ...        ...
968     227.169998  227.479996  227.880005  224.570007  224.630005   42137700
969     225.000000  226.960007  228.660004  226.410004  227.169998   38328800
970     224.550003  224.229996  225.699997  221.500000  225.000000   42005600
971     224.009995  224.229996  225.589996  223.360001  224.550003   40398300
972     225.020004  225.119995  226.649994  222.759995  224.0099
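
The alignment built in the cells above can be checked on a toy example. The sketch below uses plain Python lists and made-up prices (not real AAPL data) to show that each feature row is paired with the following day's open, and that the target ends up in the first position. The function name `build_rows` is hypothetical.

```python
# Toy daily records (made-up numbers), oldest first
days = [
    {"close": 10.0, "high": 11.0, "low": 9.0, "open": 9.5, "volume": 100},
    {"close": 12.0, "high": 12.5, "low": 10.0, "open": 10.5, "volume": 120},
    {"close": 11.0, "high": 12.2, "low": 10.8, "open": 12.1, "volume": 90},
]


def build_rows(records):
    """Pair each day's features with the NEXT day's open, target first."""
    rows = []
    for today, tomorrow in zip(records[:-1], records[1:]):
        features = [today["close"], today["high"], today["low"],
                    today["open"], today["volume"]]
        rows.append([tomorrow["open"]] + features)
    return rows


rows = build_rows(days)
for row in rows:
    print(row)
```

Three days of data yield two training rows, because the last day has no following open to serve as its target.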

#### 1.4. Train Test Split

In [8]:
from sklearn.model_selection import train_test_split

# Split the DataFrame into training and testing sets (80% train, 20% test)
train_data, test_data = train_test_split(df_data_final, test_size=0.2, random_state=123)

print(train_data.shape, test_data.shape)

(778, 6) (195, 6)
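
`train_test_split` shuffles rows by default, so the test set above contains days that fall between training days. For time-ordered data, an alternative worth considering (not what this notebook does) is a chronological split that keeps the most recent rows for testing. A minimal sketch, with a hypothetical helper name and a plain list standing in for the DataFrame:

```python
def chronological_split(rows, test_size=0.2):
    """Split time-ordered rows so the last `test_size` fraction is the test set."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    n_test = max(1, round(len(rows) * test_size))
    split_at = len(rows) - n_test
    return rows[:split_at], rows[split_at:]


rows = list(range(973))  # stand-in for the 973 rows of df_data_final
train_rows, test_rows = chronological_split(rows)
print(len(train_rows), len(test_rows))
```

With a DataFrame, the same idea is `df.iloc[:split_at]` and `df.iloc[split_at:]`.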


In [9]:
prefix = 'xgboost-algorithm'

# S3 URIs for the train and test CSV files
train_csv_path = 's3://{}/{}/{}/{}'.format(bucket_name, prefix, 'train', 'train.csv')
test_csv_path = 's3://{}/{}/{}/{}'.format(bucket_name, prefix, 'test', 'test.csv')

print(train_csv_path)
print(test_csv_path)


s3://yf-stock-price-prediction/xgboost-algorithm/train/train.csv
s3://yf-stock-price-prediction/xgboost-algorithm/test/test.csv


In [18]:
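# Writing directly to an s3:// path with pandas requires the s3fs package.
# index=False and header=False: the built-in XGBoost algorithm expects CSV without an index column or header row.
train_data.to_csv(train_csv_path, index=False, header=False)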
test_data.to_csv(test_csv_path, index=False, header=False)
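
If `s3fs` is not installed, writing to an `s3://` path with `to_csv` fails. One alternative, sketched below under that assumption, is to build the CSV in memory and upload it with the Boto3 client's `put_object`. The helper names are hypothetical, and the S3 client is passed in as a parameter so that the function can be exercised without AWS access.

```python
import csv
import io


def rows_to_csv_body(rows):
    """Serialize rows to CSV text with no header and no index column."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def upload_rows(s3_client, bucket, key, rows):
    """Upload rows as a CSV object; `s3_client` is e.g. boto3.client('s3')."""
    body = rows_to_csv_body(rows).encode("utf-8")
    return s3_client.put_object(Bucket=bucket, Key=key, Body=body)


# Target first, then features (made-up values)
print(rows_to_csv_body([[10.5, 10.0, 11.0, 9.0, 9.5, 100]]))
```

With a DataFrame, `train_data.values.tolist()` yields rows in this shape, and the key would be `xgboost-algorithm/train/train.csv`.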

#### 1.5. Build `XGBoost` Model<br>
#### How to Use SageMaker XGBoost
With SageMaker, we can use `XGBoost` as a built-in algorithm or as a framework. By using XGBoost as a framework, we have more flexibility and access to more advanced scenarios, such as k-fold cross-validation, because we can customize our own training scripts.


> #### Use XGBoost as a framework
Use XGBoost as a framework to run our customized training scripts that can incorporate additional data processing into our training jobs

> #### Use XGBoost as a built-in algorithm
Use the XGBoost built-in algorithm to build an XGBoost training container

In [41]:
# This notebook uses XGBoost as a built-in algorithm

import sagemaker
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

#### 1.6. Find an `XGBoost` image uri and build an XGBoost container

In [None]:
xgboost_container = image_uris.retrieve('xgboost', boto3.Session().region_name, '1.5-1')

print(xgboost_container)

#### 1.7. Initialize hyperparameters<br>
#### Booster Parameters:
> **max_depth** - Maximum depth of a tree. Increasing the value makes the model more complex 

> **eta** - Step size shrinkage used in updates to prevent overfitting

> **gamma** - Minimum loss reduction required to make a further partition on a leaf node of the tree

> **min_child_weight** - Minimum sum of instance weight needed in a child

> **subsample** - Subsample ratio of the training instance

#### Learning Task Parameter:
> **objective** - Specifies the learning task and the corresponding learning objective

In [21]:
hyperparameters = {
    "max_depth": '5',
    "eta": '0.2',
    "gamma": '4',
    "min_child_weight": '6',
    "subsample": '0.7',
    "objective": 'reg:squarederror',
    "early_stopping_rounds": 10,
    "num_round": 1000
}
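
Two entries in the dictionary are not described above: `num_round` is the number of boosting rounds to run, and `early_stopping_rounds` stops training once the validation metric has not improved for that many rounds. The sketch below is a simplified illustration of that stopping rule on a made-up list of validation scores; it is not the container's actual implementation, and the function name is hypothetical.

```python
def stopping_round(val_scores, early_stopping_rounds):
    """Return (round at which training stops, best round), both 0-based.

    Simplified rule: stop once the score has failed to improve on the best
    value for `early_stopping_rounds` consecutive rounds (lower is better).
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= early_stopping_rounds:
            return round_idx, best_round
    return len(val_scores) - 1, best_round


# Made-up validation RMSE per boosting round
scores = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.05]
print(stopping_round(scores, early_stopping_rounds=2))
```

With `early_stopping_rounds` set to 10 and `num_round` set to 1000, as in the cell above, training can end well before the 1000th round if the validation error plateaus.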

#### 1.8. Set an output path where the trained model will be saved

In [22]:
output_path = 's3://{}/{}/{}/'.format(bucket_name, prefix, 'output')

print(output_path)

s3://yf-stock-price-prediction/xgboost-algorithm/output/


#### 1.9. Construct a SageMaker estimator that calls the xgboost-container<br>
> **use_spot_instances** - Boolean that enables managed spot training (set to `False` in this notebook)

> **max_wait** - The timeout in seconds for waiting on a spot training job; it applies only when `use_spot_instances` is `True`

> **max_run** - The timeout in seconds for training

### `Run the sagemaker-role-creation CFT, which creates the required role, then use the role ARN in the cell below`

In [26]:
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          output_path=output_path,
                                          role='<sagemaker-role-arn>',
                                          instance_count=1,
                                          instance_type='ml.m5.xlarge',
                                          volume_size_in_gb=5,
                                          use_spot_instances=False,
                                          max_run=300
                                          )
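
To actually use managed spot training, `use_spot_instances` is set to `True` and `max_wait` is added; SageMaker requires `max_wait` to be at least `max_run`. A minimal sketch of building those keyword arguments follows. The helper name and the example timeouts are assumptions, not values from this notebook.

```python
def spot_training_kwargs(max_run, max_wait):
    """Build estimator keyword arguments for managed spot training."""
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {"use_spot_instances": True, "max_run": max_run, "max_wait": max_wait}


spot_kwargs = spot_training_kwargs(max_run=300, max_wait=600)
print(spot_kwargs)
# These could then be unpacked into the Estimator constructor shown above
# in place of use_spot_instances=False and max_run=300.
```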

#### 1.10. Define the data type and paths to the training and validation datasets

In [27]:
content_type = 'csv'
train_input = TrainingInput('s3://{}/{}/{}/'.format(bucket_name, prefix, 'train'), content_type=content_type)
test_input = TrainingInput('s3://{}/{}/{}/'.format(bucket_name, prefix, 'test'), content_type=content_type)

#### 1.11. Execute the `XGBoost` training job

In [28]:
estimator.fit({"train": train_input, "validation": test_input})

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-11-23-04-23-02-384


2024-11-23 04:23:04 Starting - Starting the training job...
2024-11-23 04:23:19 Starting - Preparing the instances for training...
2024-11-23 04:24:06 Downloading - Downloading the training image......
2024-11-23 04:24:51 Training - Training image download completed. Training in progress....
2024-11-23 04:25:25 Uploading - Uploading generated training model
2024-11-23 04:25:25 Completed - Training job completed
..Training seconds: 105
Billable seconds: 105


### 2. Deploy and test the Amazon SageMaker model endpoint<br>
#### 2.1. Deploy trained `XGBoost` model as Endpoint
 1. Environment
 > Within SageMaker - Serialization by User<br>
 > **Outside SageMaker - Serialization by Endpoint**
 
2. Method to invoke the Endpoint
> **API - Single Prediction**<br>
> S3 Bucket - Batch Prediction

3. Data type based on method
> **API - JSON**<br>
> S3 Bucket - CSV

To host a model through Amazon EC2 using Amazon SageMaker, deploy the model trained in section 1.11 by calling the **deploy method** of the **estimator** object.

When you call the deploy method, you need to specify a few key arguments:
> **initial_instance_count (int)** - The number of instances to deploy the model to

> **instance_type (str)** - The type of instance that you want to operate your deployed model

> **serializer** - The serializer object that encodes the input data sent to the endpoint (here, a `CSVSerializer`)

In [29]:
from sagemaker.serializers import CSVSerializer

csv_serializer = CSVSerializer()
xgb_predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large', serializer=csv_serializer)

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2024-11-23-04-28-08-918
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2024-11-23-04-28-08-918
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2024-11-23-04-28-08-918


------!

In [None]:
print(xgb_predictor.endpoint_name)

#### 2.2. Make predictions using the endpoint

In [31]:
# Initialize
start_date = datetime(2024, 11, 14)
end_date = datetime(2024, 11, 15)

# Get the data
df_data = yf.download('AAPL', start = start_date, end = end_date)
df_data.reset_index(inplace=True)
print(df_data)

[*********************100%***********************]  1 of 1 completed

Price                       Date   Adj Close       Close        High    Low  \
Ticker                                  AAPL        AAPL        AAPL   AAPL   
0      2024-11-14 00:00:00+00:00  228.220001  228.220001  228.869995  225.0   

Price         Open    Volume  
Ticker        AAPL      AAPL  
0       225.020004  44923900  





In [32]:
# Drop unwanted columns

df_data.drop(columns=[('Date', ''), ('Adj Close', 'AAPL')], inplace=True)

data_features = df_data.values
print(data_features)

[[2.28220001e+02 2.28869995e+02 2.25000000e+02 2.25020004e+02
  4.49239000e+07]]


#### 2.3. Inference - Serialized Input by built-in function (Lambda function friendly)

In [40]:
Input = [[2.25000000e+02, 2.26919998e+02, 2.24270004e+02, 2.26399994e+02, 4.78322000e+07]]

Serialized_Input = ','.join(map(str, Input[0]))

print('Serialized_Input...:', Serialized_Input)

Y_pred = xgb_predictor.predict(Serialized_Input).decode('utf-8')
print(f"Predicted Stock Price: ${Y_pred}")

Serialized_Input...: 225.0,226.919998,224.270004,226.399994,47832200.0
Predicted Stock Price: $226.04385375976562
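
The serialization step above generalizes to several rows, and the endpoint's reply needs parsing back into numbers. A small sketch with hypothetical helper names follows. It assumes the endpoint returns one prediction per input row, separated by commas or newlines, which should be checked against the actual response.

```python
def serialize_rows(rows):
    """Join each row's values with commas and rows with newlines (text/csv)."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_predictions(body):
    """Parse a response body (bytes) into a list of floats."""
    text = body.decode("utf-8").replace("\n", ",")
    return [float(token) for token in text.split(",") if token.strip()]


payload = serialize_rows([[225.0, 226.919998, 224.270004, 226.399994, 47832200.0]])
print(payload)
print(parse_predictions(b"226.04385375976562\n"))
```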



### 3. Create an AWS Lambda function<br>
#### 3.1. Lambda function handler - Copy and paste the handler code below into a new Lambda function created in the AWS console

In [35]:
import boto3

ENDPOINT_NAME = '<sagemaker-endpoint-name>'
runtime = boto3.client('runtime.sagemaker')
email_client = boto3.client('sns')

def lambda_handler(event, context):
    inputs = event['data']
    
    result = []
    for input_data in inputs:
        serialized_input = ','.join(map(str, input_data))
        response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME, ContentType='text/csv', Body=serialized_input)
        prediction = response['Body'].read().decode().strip()
        result.append(prediction)
        
    # Publish result to a topic which in turn sends the predictions to all the subscribers of the topic
    '''response_sns = email_client.publish(
        TopicArn = 'Arn of the created topic',
        Message = 'Prediction is ' + str(result),
        Subject = 'NK Finance Daily - Price Prediction')
    '''
    
    return result
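
The commented-out `publish` call in the handler only works if the Lambda function's execution role allows `sns:Publish` on the topic. A minimal sketch of such a policy document, built as a Python dictionary; the topic ARN below is a placeholder, not a real resource.

```python
import json

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:price-prediction"  # placeholder

sns_publish_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sns:Publish",
            "Resource": TOPIC_ARN,
        }
    ],
}

# JSON text that can be pasted into the IAM console as an inline policy
policy_json = json.dumps(sns_publish_policy, indent=2)
print(policy_json)
```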

In [36]:
# To test the lambda function locally

Input_json = { 
        "data": [
            [2.25000000e+02, 2.26919998e+02, 2.24270004e+02, 2.26399994e+02, 4.78322000e+07],
            [2.25000000e+02, 2.26919998e+02, 2.24270004e+02, 2.26399994e+02, 4.78322000e+07]
        ]
}

result1 = lambda_handler(Input_json, None)
print(result1)

['226.04385375976562', '226.04385375976562']
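
The local test above still calls the live endpoint. To exercise the handler logic without AWS access, the runtime client can be replaced with a stub. The sketch below restates the handler with the client passed in as a parameter; `make_handler` and `StubRuntime` are hypothetical names, and the stub's fixed reply is made up.

```python
import io


def make_handler(runtime_client, endpoint_name):
    """Return a Lambda-style handler bound to the given runtime client."""
    def handler(event, context):
        results = []
        for input_data in event["data"]:
            body = ",".join(map(str, input_data))
            response = runtime_client.invoke_endpoint(
                EndpointName=endpoint_name, ContentType="text/csv", Body=body)
            results.append(response["Body"].read().decode().strip())
        return results
    return handler


class StubRuntime:
    """Stands in for the SageMaker runtime client and records payloads."""

    def __init__(self):
        self.bodies = []

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.bodies.append(Body)
        return {"Body": io.BytesIO(b"123.45\n")}  # made-up prediction


stub = StubRuntime()
handler = make_handler(stub, "test-endpoint")
print(handler({"data": [[1.0, 2.0], [3.0, 4.0]]}, None))
```

Passing the client in, rather than creating it at module level, is a design choice that keeps the handler testable; the deployed function would call `make_handler` once with the real client.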


#### 3.2. Create SNS Topic to send emails to users with price prediction<br>
> 1. In Amazon SNS, use A2P (application-to-person) messaging to send the predicted price to the user's email
> 2. Create a topic and add the user's email address as a subscription
> 3. The user receives an email to confirm the subscription
> 4. Copy the topic's ARN and provide it as `TopicArn` in the commented-out `publish` call in the Lambda handler above
> 5. Update the Lambda function's IAM role to attach an SNS policy so that the function can access the topic

### 4. Build, deploy and test an API Gateway endpoint for the REST API<br>
> 1. Create a REST API in API Gateway with a POST method and integrate it with the Lambda function
> 2. A POST call to the API Gateway endpoint should now be routed to the Lambda function, and its response returned to the client

In [None]:
import requests

API_ENDPOINT = '<api-gateway-endpoint>'

json_request = { 
        "data": [
            [2.25000000e+02, 2.26919998e+02, 2.24270004e+02, 2.26399994e+02, 4.78322000e+07],
            [2.25000000e+02, 2.26919998e+02, 2.24270004e+02, 2.26399994e+02, 4.78322000e+07]
        ]
}

response = requests.post(url=API_ENDPOINT, json=json_request)

In [None]:
print(f"Status Code: {response.status_code}, Response: {response.json()}")
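
How the Lambda function receives the request depends on the integration type. With a non-proxy (custom) integration, the JSON body is typically passed through as the event itself, which matches `event['data']` in the handler above. With Lambda proxy integration, the body arrives as a JSON string under `event['body']`, and the function must return a dictionary with `statusCode` and `body`. A sketch of helpers covering both cases, with hypothetical names:

```python
import json


def extract_rows(event):
    """Return the list of feature rows from either event shape."""
    if isinstance(event.get("body"), str):  # Lambda proxy integration
        return json.loads(event["body"])["data"]
    return event["data"]  # body passed through as the event


def proxy_response(predictions, status_code=200):
    """Wrap predictions in the response shape that proxy integration expects."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(predictions),
    }


proxy_event = {"body": json.dumps({"data": [[1.0, 2.0]]})}
print(extract_rows(proxy_event))
print(proxy_response(["226.04"]))
```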

### Cleanup and Terminate

In [None]:
#sagemaker.Session().delete_endpoint(endpoint_name=xgb_predictor.endpoint_name)
#sagemaker.Session().delete_endpoint_config(endpoint_config_name=xgb_predictor.endpoint_name)
#sagemaker.Session().delete_model(model_name=xgb_predictor.endpoint_name)

In [None]:
#bucket_to_delete = boto3.resource("s3").Bucket(bucket_name)
#bucket_to_delete.objects.all().delete()
#bucket_to_delete.delete()