## **FOUNDATION PROJECT - GROUP ASSIGNMENT** ##

> **Use Case ::** Predicting Stock Movement - **"Stock Market Copilot"**

> **Dataset Source ::** Yahoo Finance - https://finance.yahoo.com/quote/TSLA/history/?filter=history

> **Group No. ::** 6

## **MODEL MONITORING**

**Install alibi_detect library**

In [1]:
pip install alibi alibi_detect --no-warn-script-location

Collecting alibi
  Downloading alibi-0.9.6-py3-none-any.whl.metadata (22 kB)
Collecting alibi_detect
  Downloading alibi_detect-0.12.0-py3-none-any.whl.metadata (28 kB)
Collecting numpy<2.0.0,>=1.16.2 (from alibi)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting blis<0.8.0 (from alibi)
  Downloading blis-0.7.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting scikit-image<0.23,>=0.17.2 (from alibi)
  Downloading scikit_image-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting Pillow<11.0,>=5.4.1 (from alibi)
  Downloading pillow-10.4.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Collecting attrs<24.0.0,>=19.2.0 (from alibi)
  Downloading attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Collecting dill<0.4.0,>=0.3

In [1]:
pip install fastapi uvicorn pyngrok --no-warn-script-location



In [6]:
!pip install yfinance
!pip install ta

Collecting ta
  Downloading ta-0.11.0.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ta
  Building wheel for ta (setup.py) ... done
  Created wheel for ta: filename=ta-0.11.0-py3-none-any.whl size=29412 sha256=211c025fe7b2f8bba66993e9af7db1e8e4b8d36272b15ec8d716f0f07b633b45
  Stored in directory: /root/.cache/pip/wheels/a1/d7/29/7781cc5eb9a3659d032d7d15bdd0f49d07d2b24fec29f44bc4
Successfully built ta
Installing collected packages: ta
Successfully installed ta-0.11.0


### **Detecting the Drift between the Train Dataset and Production Dataset**

**About Data Drift -**
> Data drift is a change in the statistical properties of a model's input data relative to the data the model was trained on, which can degrade the model's performance. Detecting it matters because it can significantly reduce a model's accuracy and reliability.

> Reasons why data drift is important to detect:
>> **Model performance:** When a model's input data distribution changes, the model's assumptions become invalid, which can lead to suboptimal predictions and inaccurate results.
>> **Model failure:** Data drift is one of the two main reasons for silent model failure.
>> **Identifying causes:** Investigating the characteristics of an observed drift can help identify the causes of any performance change.

**source -**
 - *https://www.evidentlyai.com/ml-in-production/data-drift#:~:text=Data%20drift%20is%20a%20shift,on%20or%20earlier%20production%20data.*
 - https://encord.com/blog/detect-data-drift/#:~:text=Akruti%20Acharya,predictions%20and%20potentially%20inaccurate%20results.
 - https://nannyml.readthedocs.io/en/stable/tutorials/detecting_data_drift.html#:~:text=The%20model%20has%20been%20trained,you%20to%20detect%20data%20drift.

**Steps Taken -**
1. We load the Train and Production datasets, stored in Parquet format, from GitHub (the repository where the data is version controlled).
2. The tabular drift detector (**"TabularDrift"**) is initialized with the training data (trainDataset_df) and a significance level of p=0.05 (the drift threshold) to create a drift detector model.
3. The detector model is then used to **predict the overall drift** and **feature-wise drift** between the Production dataset (prodDataset_df) and the Train dataset.
4. To test the deviation between the feature distributions in the train and production datasets, we use the Chi-square test for categorical features and the K-S test for numerical features.
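The decision rule behind steps 2 and 3 can be sketched in plain Python. This is a minimal illustration rather than the library's own code: it assumes a Bonferroni correction, in which the significance level is divided by the number of features and overall drift is flagged when any feature's p-value falls below the corrected level. The helper name `bonferroni_drift` is hypothetical, and the p-values are the rounded ones from the sample API response recorded at the end of this notebook.

```python
def bonferroni_drift(p_values, p_val=0.05):
    """Return (overall_drift, corrected_threshold, flagged_features).

    p_values: dict mapping feature name -> p-value of its univariate test.
    p_val:    significance level before correction.
    """
    # Bonferroni correction: split the significance level across all features
    threshold = p_val / len(p_values)
    # Features whose p-value is below the corrected level
    flagged = [name for name, p in p_values.items() if p < threshold]
    # Overall drift is reported as 1 if at least one feature is flagged
    overall = int(len(flagged) > 0)
    return overall, threshold, flagged


# Rounded p-values taken from the sample /checkDrift response
p_values = {"Close": 0.210, "Volume": 0.023, "SMA_10": 0.006, "RSI": 0.000, "MACD": 0.379}

overall, threshold, flagged = bonferroni_drift(p_values)
print(f"overall drift: {overall}, corrected threshold: {threshold:.3f}, flagged: {flagged}")
```

With five features the corrected level is 0.05 / 5 = 0.01, which matches the `threshold_value` of 0.01 in the sample response. Under this rule Volume (p = 0.023) is below 0.05 but not below the corrected level.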

In [15]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import alibi
from alibi_detect.cd import ChiSquareDrift, TabularDrift
from alibi_detect.saving import save_detector, load_detector

from contextlib import asynccontextmanager
from pydantic import BaseModel
from multiprocessing import Process
from threading import Thread

import requests

# Setting the no. of records display in output and no. of characters displayed in a column
pd.options.display.max_columns = 20 # Max 20 columns displayed. First and last 10 columns shown if limit exceeded
pd.options.display.max_rows = 20 # Max 20 rows displayed. First and last 10 rows shown if limit exceeded
pd.options.display.max_colwidth = 80 # Max of 80 characters displayed per column. Beyond the limit, text is truncated with an ellipsis (...)
np.set_printoptions(precision=4, suppress=True) # Displays only up to 4 decimals.

# Using "warnings" module to suppress/ignore warnings thrown by methods
import warnings
warnings.filterwarnings('ignore')

In [3]:
from pyngrok import ngrok

In [4]:
# Registering the ngrok auth token, which authenticates this ngrok agent with the ngrok account so that a tunnel to the API service (port 9090) can be opened later
# Replace the placeholder with your own token; do not commit a real token to version control
ngrok.set_auth_token("YOUR_NGROK_AUTHTOKEN")



In [7]:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
import numpy as np
from joblib import load
from sklearn.preprocessing import MinMaxScaler
import yfinance as yf
import ta
import re
import datetime
import logging
import tensorflow as tf
from alibi_detect.cd import TabularDrift
from alibi_detect.utils.saving import save_detector, load_detector

# Initialize FastAPI app
app1 = FastAPI()

# Logger setup
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Input schema
class PredictionRequest(BaseModel):
    Date: str
    range: str
    n_days: int
    stock: str

@app1.post("/checkDrift")
def check_drift(input_data: PredictionRequest):
    try:
        logger.info(f"Received prediction request: {input_data}")

        # Extract ticker
        ticker_match = re.search(r'\((\w+)\)', input_data.stock)
        if not ticker_match:
            raise HTTPException(status_code=400, detail="Invalid stock format")
        ticker_code = ticker_match.group(1)

        # Load historical stock data
        stock = yf.Ticker(ticker_code)
        stock_data = stock.history(period="1y")

        df = stock_data.reset_index()
        df['Days'] = (df['Date'] - df['Date'].min()).dt.days

        data = pd.concat([df[['Date', 'Days']], df[['Close', 'Volume']]], axis=1)

        # Add technical indicators
        data['SMA_10'] = ta.trend.sma_indicator(data['Close'], window=10)
        data['RSI'] = ta.momentum.rsi(data['Close'], window=14)
        data['MACD'] = ta.trend.macd_diff(data['Close'])
        data.dropna(inplace=True)

        # Parse and validate input dates
        start_date = pd.to_datetime(input_data.Date, utc=True)
        if not isinstance(start_date, pd.Timestamp):
            raise HTTPException(status_code=400, detail="Invalid date format")

        n_days = int(input_data.n_days)
        if n_days <= 0:
            raise HTTPException(status_code=400, detail="n_days must be positive")

        VALID_START = pd.to_datetime('2025-04-01', utc=True)

        # Prepare data for prediction
        forecast_days = n_days
        prediction_used_data = data[-max(60, forecast_days * 2):]
        training_data = data.iloc[:-len(prediction_used_data)]

        numerical_features = ['Close', 'Volume', 'SMA_10', 'RSI', 'MACD']
        categorical_features = []
        x_features = numerical_features + categorical_features

        # Prepare for drift detection
        catg_vars = categorical_features
        categories_per_feature = {x_features.index(f): None for f in catg_vars}
        cd = TabularDrift(
            training_data[x_features].values,
            p_val=0.05,
            categories_per_feature=categories_per_feature
        )

        filepath = 'datadrift'
        save_detector(cd, filepath, legacy=True)
        cd = load_detector(filepath)

        # Predict drift
        drift_predictor = cd.predict(prediction_used_data[x_features].to_numpy())
        drift = drift_predictor['data']['is_drift']
        p_val_list = drift_predictor['data']['p_val']
        threshold = drift_predictor['data']['threshold']

        stat = []
        stat_val = []
        fname = []

        for f in range(cd.n_features):
            stat.append('Chi2' if f in categories_per_feature else 'K-S')
            stat_val.append(drift_predictor['data']['distance'][f])
            fname.append(x_features[f])

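        summary_df = pd.DataFrame({
            "Feature Name": fname,
            "Statistical Test": stat,
            "Statistical Value": stat_val,
            "P-Value": [f'{val:.3f}' for val in p_val_list],
            # Per-feature flags use the uncorrected 0.05 level; the overall status uses the `threshold` returned by the detector
            "Drift Detected": ["Yes" if val < 0.05 else "No" for val in p_val_list]
        })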

        return {
            'overall_drift_status': drift,
            'threshold_value': threshold,
            'feature_summary': summary_df.to_dict(orient='records'),
            'numeric_features_train': training_data[x_features].tail(10).to_dict(orient='records'),
            'numeric_feature_predict': prediction_used_data[x_features].head(10).to_dict(orient='records'),
            'num_features': numerical_features,
            'catg_features': categorical_features
        }

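    except HTTPException:
        # Re-raise validation errors (status 400) unchanged instead of wrapping them as 500
        raise
    except Exception as e:
        logger.error(f"Prediction error: {str(e)}", exc_info=True)
        raise HTTPException(status_code=500, detail=f"Prediction error: {str(e)}")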
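Two small steps of the endpoint can be exercised on their own: extracting the ticker from the `stock` field and splitting the rows into a reference window and a recent window. The sketch below mirrors the regular expression and slicing used in `check_drift`; the helper names `extract_ticker` and `split_reference_recent` and the 200-row list are hypothetical and stand in for the real DataFrame.

```python
import re


def extract_ticker(stock_label):
    """Return the ticker found inside parentheses, or None if there is none."""
    match = re.search(r'\((\w+)\)', stock_label)
    return match.group(1) if match else None


def split_reference_recent(rows, n_days):
    """Split rows the way the endpoint does.

    The most recent max(60, 2 * n_days) rows are compared against
    everything that precedes them.
    """
    window = max(60, n_days * 2)
    recent = rows[-window:]
    reference = rows[:-len(recent)]
    return reference, recent


print(extract_ticker("AAPL (AAPL)"))   # ticker inside the parentheses
print(extract_ticker("AAPL"))          # no parentheses, so no match

rows = list(range(200))                # stand-in for 200 rows of indicator data
reference, recent = split_reference_recent(rows, n_days=20)
print(len(reference), len(recent))
```

Because the pattern uses `\w+`, a label whose ticker contains a hyphen, such as `(BRK-B)`, does not match and would be rejected as an invalid stock format.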


#### **Chi Square Test for Categorical and Kolmogorov-Smirnov (K-S) Test for Numerical Features**
Both the Chi-square test and the K-S test are used to determine whether the production values of a feature are likely to come from the same distribution as the training values.

> **Chi Square Test-**
The Chi Square value indicates the deviation between the observed and expected frequencies for categorical features.
>> A **higher** value indicates potential drift and a **lower** value indicates less deviation.

**source -** *https://www.scribbr.com/statistics/chi-square-distribution-table/#:~:text=A%20chi%2Dsquare%20distribution%20is,range%20is%200%20to%20%E2%88%9E.*
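As a rough illustration of the Chi-square value, the sketch below computes a plain Pearson statistic (no continuity correction) from the category counts of two samples. The function name `chi_square_statistic` and the `"up"`/`"down"` labels are hypothetical; the notebook's own feature list contains no categorical features, so this is only a worked example of the formula.

```python
from collections import Counter


def chi_square_statistic(reference, production):
    """Pearson chi-square statistic for a 2 x K table of category counts."""
    ref_counts, prod_counts = Counter(reference), Counter(production)
    n_ref, n_prod = len(reference), len(production)
    total = n_ref + n_prod

    stat = 0.0
    for category in set(ref_counts) | set(prod_counts):
        col_total = ref_counts[category] + prod_counts[category]
        for observed, row_total in ((ref_counts[category], n_ref),
                                    (prod_counts[category], n_prod)):
            # Expected count if both samples shared one distribution
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


train = ["up"] * 50 + ["down"] * 50
same = ["up"] * 50 + ["down"] * 50
shifted = ["up"] * 80 + ["down"] * 20

print(chi_square_statistic(train, same))      # identical counts give 0.0
print(chi_square_statistic(train, shifted))   # a larger value signals deviation
```

For the shifted sample the expected counts are 65 and 35 in each row, so the statistic is 2 × 225/65 + 2 × 225/35 ≈ 19.78.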

> **K-S Test-**
The K-S test measures the maximum difference between two cumulative distributions for numerical features.
>> A **lower** K-S value indicates smaller differences, while a **higher** K-S value suggests a larger deviation.

**source -** *https://towardsdatascience.com/evaluating-classification-models-with-kolmogorov-smirnov-ks-test-e211025f5573*
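The K-S value can be reproduced by hand as the largest gap between the two empirical cumulative distribution functions. The following is a minimal pure-Python sketch; the function name `ks_statistic` and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from bisect import bisect_right


def ks_statistic(reference, production):
    """Two-sample K-S statistic: maximum gap between the empirical CDFs."""
    ref_sorted, prod_sorted = sorted(reference), sorted(production)
    n_ref, n_prod = len(ref_sorted), len(prod_sorted)

    max_gap = 0.0
    # Both empirical CDFs are step functions, so checking the observed points suffices
    for x in ref_sorted + prod_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / n_ref     # share of reference values <= x
        cdf_prod = bisect_right(prod_sorted, x) / n_prod  # share of production values <= x
        max_gap = max(max_gap, abs(cdf_ref - cdf_prod))
    return max_gap


train = [1, 2, 3, 4, 5]
print(ks_statistic(train, [1, 2, 3, 4, 5]))       # identical samples: no gap
print(ks_statistic(train, [4, 5, 6, 7, 8]))       # partial overlap: gap of about 0.6
print(ks_statistic(train, [10, 11, 12, 13, 14]))  # no overlap: gap of 1.0
```

The statistic always lies between 0 and 1, which is why the values in the feature summary (for example 0.49 for RSI) can be compared directly across features.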


#### **Graphical Representation of Drift in dataset for Numerical and Categorical data**
> In this stage we visualize the disparity between the datasets for continuous data (numerical features) and discrete data (categorical features).

> Since the distributions of each feature in the two datasets (Train and Production) are overlaid on the same plot, we can see whether any category or value range (low/high) is concentrated in either dataset.

> **KDE Plot -**
>> A Kernel Density Estimate (KDE) plot visualizes the estimated probability density of a continuous variable at each of its values.

>> The KDE plot visually represents the distribution of data, providing insights into its shape, central tendency, and spread.

> **Bar Plot -**
>> A bar plot uses rectangular bars to represent data categories, with bar length or height proportional to their values. It compares discrete categories, with one axis for categories and the other for values.


**source -**
- *https://www.geeksforgeeks.org/bar-plot-in-matplotlib/*
- *https://www.geeksforgeeks.org/kde-plot-visualization-with-pandas-and-seaborn/*
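To make the KDE curve concrete, the sketch below evaluates a Gaussian kernel density estimate in plain Python for two small samples. It is an illustration of the quantity that a KDE plot (for example seaborn's `kdeplot`) draws, not the plotting code used in this project; the function name `gaussian_kde`, the fixed bandwidth of 5, and the RSI-like sample values are all assumptions.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `samples` at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each sample
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Hypothetical RSI-like values for the two datasets
train_rsi = [35, 42, 48, 51, 55, 60]
prod_rsi = [58, 63, 66, 70, 72, 75]

train_density = gaussian_kde(train_rsi, bandwidth=5)
prod_density = gaussian_kde(prod_rsi, bandwidth=5)

# Evaluating both curves on the same grid is what overlaying two KDE plots shows
for x in range(30, 81, 10):
    print(f"x={x}: train={train_density(x):.4f}  production={prod_density(x):.4f}")
```

Where one curve is clearly higher than the other, values in that range are concentrated in that dataset, which is the visual counterpart of a large K-S value.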

In [10]:
import uvicorn
# Starting a local server at the desired endpoint to host the service/application
def run_server():
    uvicorn.run(app1, host="0.0.0.0", port=9090, log_level="info")

# Threading is used to allow the service to run without interrupting the main program that is being executed
thread = Thread(target=run_server, daemon=True)
thread.start()

In [11]:
# Expose the FastAPI app
public_url = ngrok.connect(9090)
print(f"Public URL: {public_url}")

Public URL: NgrokTunnel: "https://03e6-34-32-182-106.ngrok-free.app" -> "http://localhost:9090"


**Exporting the Public URL**

In [19]:
# Exporting the Ngrok Public (secure link) URL to a file for using in Streamlit
with open("ngrokPublicURL2.txt", "w") as f:
    f.write(public_url.public_url)

#### **Testing the API endpoint**

In [13]:
# Define the parameter values to be passed in the request
payload = {
     "Date": "2025-04-20",
     "n_days": 20,
     "range": "1 month",
     "stock": "AAPL (AAPL)"
}

In [18]:
# Make a POST request to the FastAPI server
resp = requests.post(public_url.public_url + "/checkDrift", json=payload)

# Printing the status code
print(f"Status: {resp.status_code}")

# Printing the headers
print(f"Header Data: {resp.headers}")

# Print the content of the response (the drift report)
print(f"Response Received: {resp.text}")



INFO:     34.32.182.106:0 - "POST /checkDrift HTTP/1.1" 200 OK
Status: 200
Header Data: {'Content-Length': '3348', 'Content-Type': 'application/json', 'Date': 'Sat, 19 Apr 2025 22:59:36 GMT', 'Ngrok-Agent-Ips': '34.32.182.106', 'Server': 'uvicorn'}
Response Received: {"overall_drift_status":1,"threshold_value":0.01,"feature_summary":[{"Feature Name":"Close","Statistical Test":"K-S","Statistical Value":0.1580168753862381,"P-Value":"0.210","Drift Detected":"No"},{"Feature Name":"Volume","Statistical Test":"K-S","Statistical Value":0.222995787858963,"P-Value":"0.023","Drift Detected":"Yes"},{"Feature Name":"SMA_10","Statistical Test":"K-S","Statistical Value":0.25654008984565735,"P-Value":"0.006","Drift Detected":"Yes"},{"Feature Name":"RSI","Statistical Test":"K-S","Statistical Value":0.48881855607032776,"P-Value":"0.000","Drift Detected":"Yes"},{"Feature Name":"MACD","Statistical Test":"K-S","Statistical Value":0.13502109050750732,"P-Value":"0.379","Drift Detected":"No"}],"numeric_featu