Project Description for Handover
Project Title: Hourly Forecasting of Message Volume and Anomaly Classification
Development Environment: WSL/Ubuntu-22.04
Programming Language: Python
1. Hourly Forecasting of Message Volume (NMSG)
Objective: Develop a model using the NeuralProphet algorithm to forecast the hourly volume of incoming messages to the servers based on historical data.
Data Source: Sourced from the S3 production "Performance" bucket, stored in "/Data/train_data/".
Development Process:
- Proof of Concept (POC): Initiated in "notebooks/NeuralProphet_NMSG_testing.ipynb" before modularizing into the "src" folder.
- Folder Structure:
  - "src/components": Houses defined components utilized within specific pipelines.
  - "src/pipelines": Contains the key pipelines for the project.
- Script References:
  - 1. ETL_pipeline.py: This script performs Extract-Transform-Load operations on the data. The date_string variable within the script can be modified to specify the date up to which you intend to train the model; all data beyond this date is reserved for testing purposes.

    ```python
    # Running the ETL pipeline
    if __name__ == "__main__":
        from datetime import datetime

        columns = ["DT", "NMSG"]
        # columns = ["DT", "TO500RT", "SERVER"]
        if "NMSG_1" in columns:
            run_etl_pipeline(columns=columns)
        else:
            date_string = '2023-12-31 23'  # Modify this date as needed
            date_format = '%Y-%m-%d %H'
            datetime_object = datetime.strptime(date_string, date_format)
            run_etl_pipeline(split_date=datetime_object, columns=columns)
    ```
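The date-based split that the pipeline performs can be illustrated with a minimal sketch. Note this is an assumption about the approach, not the actual run_etl_pipeline code: split_by_date is a hypothetical helper, and the DataFrame layout (an hourly "DT" timestamp column) is inferred from the rest of this document.

```python
# Hypothetical sketch of a date-based train/test split on an hourly
# DataFrame with a "DT" timestamp column. Not the actual pipeline code.
from datetime import datetime

import pandas as pd


def split_by_date(df, split_date, dt_col="DT"):
    """Rows up to and including split_date go to train; the rest to test."""
    df = df.copy()
    df[dt_col] = pd.to_datetime(df[dt_col])
    train = df[df[dt_col] <= split_date]
    test = df[df[dt_col] > split_date]
    return train, test


if __name__ == "__main__":
    df = pd.DataFrame({
        "DT": pd.date_range("2023-12-31 20:00", periods=8, freq="h"),
        "NMSG": range(8),
    })
    train, test = split_by_date(df, datetime(2023, 12, 31, 23))
    print(len(train), len(test))  # 4 4
```

Everything after the split date then serves as the held-out test set, matching the date_string convention described above.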
- 2. Hyperparam_tuning.py: Conducts hyperparameter tuning using MLflow for model optimization.
  - 3. Predict_pipeline.py: Used for making predictions with the selected model after hyperparameter tuning. In this script, the training data is reused to make future predictions.

    ```python
    csv_file = "../../artifacts/v2/DT_NMSG/train_data.csv"
    data = predictor.load_csv_data(csv_file)
    # Assuming test_data length is required for forecasting periods
    future_forecast = predictor.make_predictions(data, 1024)
    ```

    The number 1024 represents the quantity of data points for future predictions and can be adjusted as needed.
Future Development:
- The model should be periodically retrained so that it keeps up with new trends and fresh data.
- An endpoint should be developed so that the model can be called to make predictions.
- It should be decided where to deploy the model.
Note: Uncommenting "-e ." for the initial run of requirements.txt activates setup.py, which makes the project's Python modules importable from other locations. After that first install, the line can be commented out again.
2. Anomaly Classification
Objective: Develop an anomaly detection model that utilizes the LSTM RNN algorithm to predict anomalies based on historical patterns, using the last 24 hours of performance data.
Development Process:
- Proof of Concept (POC): Currently in the POC stage, with development documented in the following notebooks.
- Folder Structure:
  - notebooks/LSTM_Classification_data_preparation.ipynb: Details the data preprocessing stage.
  - notebooks/LSTM_Classification_model.ipynb: Focuses on the implementation of the LSTM model itself.
- Script References:
  - notebooks/LSTM_Classification_data_preparation.ipynb:
    - The same raw NMSG data used for the forecasting model above is used here, together with some additional columns (SERVER, TO500RT).
    - There are NMSG and TO500RT values for 3 different SERVERs. The idea was not to separate the servers, which is why the NMSG and TO500RT values have been aggregated (summed) for the same DT.
    - Next, a new binary KPI ANOMALY is calculated from the values of NMSG and TO500RT, flagging an hour when TO500RT exceeds 0.01% of NMSG. This KPI is used as the target column for the model (i.e. it is what we try to predict):

      ```python
      df_hourly['ANOMALY'] = df_hourly.apply(
          lambda row: 'YES' if (row['TO500RT'] / row['NMSG'] * 100 > 0.01) else 'NO',
          axis=1
      )
      ```

    - Before proceeding with the model, more data that could help model building is added, i.e. additional attributes such as CPU_MAX and CPU_AVG. Unlike the NMSG data, these columns were not obtained from the S3 bucket but were downloaded by hand from the Performance KPIs UI and loaded into the ../Data/CPU folder.
    - Next, the tables with CPU_MAX, CPU_AVG and NMSG are joined on the DT column.
    - The final table is written to ../artifacts/DATE_NMSG_CPUMAX_CPUAVG_ANOMALY/df_input_1.csv.
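The preparation steps above (aggregate per DT across servers, derive ANOMALY, join the CPU columns) can be sketched end-to-end. The prepare_input helper name and exact DataFrame shapes are assumptions for illustration; the ANOMALY rule and column names come from the notebook description.

```python
# Hypothetical end-to-end sketch of the data preparation described above.
import pandas as pd


def prepare_input(df_raw, df_cpu):
    # Sum NMSG and TO500RT over the servers for each DT (servers not separated).
    df_hourly = df_raw.groupby("DT", as_index=False)[["NMSG", "TO500RT"]].sum()
    # ANOMALY = 'YES' when TO500RT exceeds 0.01% of NMSG.
    df_hourly["ANOMALY"] = df_hourly.apply(
        lambda row: "YES" if row["TO500RT"] / row["NMSG"] * 100 > 0.01 else "NO",
        axis=1,
    )
    # Join the manually downloaded CPU KPIs on DT.
    return df_hourly.merge(df_cpu[["DT", "CPU_MAX", "CPU_AVG"]], on="DT")
```

The result has one row per DT with NMSG, TO500RT, ANOMALY, CPU_MAX and CPU_AVG, matching the df_input_1.csv layout described above.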
  - notebooks/LSTM_Classification_model.ipynb:
    - Time-series data preparation:

      ```python
      def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
          """
          Frame a time series as a supervised learning dataset.
          Arguments:
              data: Sequence of observations as a list or NumPy array.
              n_in: Number of lag observations as input (X).
              n_out: Number of observations as output (y).
              dropnan: Boolean whether or not to drop rows with NaN values.
          Returns:
              Pandas DataFrame of series framed for supervised learning.
          """
      ```

      This function converts a time series dataset into a supervised learning format suitable for training machine learning models. It frames the dataset by shifting the input and output sequences based on the specified number of lag observations.
      - data: The input time series data, either as a list or NumPy array.
      - n_in: The number of lag observations to use as input features (X).
      - n_out: The number of observations to predict as output (y).
      - dropnan: A boolean indicating whether to drop rows with NaN values after framing.

      The function returns a Pandas DataFrame where each row represents a time step and each column corresponds to a lagged observation or a future time point prediction.
      In the context of time series analysis:
      - Lag Observations (Input Sequence): past observations used as features to predict future values. For example, var1(t-1) represents the value of variable 1 at the previous time step.
      - Output Sequence (Forecast Sequence): future observations to be predicted. For example, var1(t) represents the value of variable 1 at the current time step, while var1(t+1) represents the value at the next time step.

      By specifying n_in and n_out, you control the number of past observations used for prediction (n_in) and the number of future observations to predict (n_out). This technique converts a time series problem into a supervised learning problem suitable for training models like LSTM (Long Short-Term Memory) networks.
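The notebook presumably contains the full function body; for reference, a standard implementation consistent with the docstring and column naming above (the common shift-and-concat framing pattern) looks like this. The notebook's actual body may differ in details.

```python
# Standard frame-as-supervised implementation matching the docstring above.
import pandas as pd


def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a time series as a supervised learning dataset."""
    n_vars = 1 if isinstance(data, list) else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = [], []
    # Input sequence: var(t-n_in) ... var(t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [f"var{j + 1}(t-{i})" for j in range(n_vars)]
    # Forecast sequence: var(t), var(t+1) ... var(t+n_out-1)
    for i in range(n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [f"var{j + 1}(t)" for j in range(n_vars)]
        else:
            names += [f"var{j + 1}(t+{i})" for j in range(n_vars)]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg
```

For example, series_to_supervised([1, 2, 3, 4, 5], n_in=1, n_out=1) yields columns var1(t-1) and var1(t) with four usable rows after the NaN row is dropped.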
    - After analysing the data it was revealed that the ANOMALY column is imbalanced: class "0" has 20k+ rows versus roughly 2k rows for class "1". It was therefore decided to drop 17k rows of the "0" class (this ratio can be adjusted in the future).
    - At the stage of defining the layers for the NN model, it was observed that the more layers you add, the worse the model performs, so avoid adding extra layers.
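The majority-class undersampling described above might be done as follows. This is a hypothetical sketch, not the notebook's code: undersample_majority and its defaults (including the number of rows kept) are assumptions.

```python
# Hypothetical undersampling helper: keep all minority-class rows and a
# fixed-size random sample of majority-class rows.
import pandas as pd


def undersample_majority(df, target="ANOMALY", majority="0", keep=5000, seed=42):
    maj = df[df[target] == majority]
    minority = df[df[target] != majority]
    maj_sampled = maj.sample(n=min(keep, len(maj)), random_state=seed)
    return pd.concat([maj_sampled, minority]).sort_index()
```

Using a fixed random_state keeps the drop reproducible between retraining runs, and the keep parameter is the knob mentioned above for regulating the ratio later.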
Future Development:
- The model should be periodically retrained so that it keeps up with new trends and fresh data.
- It should be decided and defined from where the data can be streamed seamlessly, both for NMSG and for CPU_MAX and CPU_AVG.
- An endpoint should be developed so that the model can be called to make predictions.
- It should be decided where to deploy the model.
3. Dependency Security Audit
A full Dependabot + CVE audit was performed on requirements.txt. All resolvable HIGH severity vulnerabilities were patched. Two CRITICAL issues (torch, tensorflow) and one disputed CRITICAL (ray) remain open and require separate evaluation before upgrading due to potential breaking changes.
Patched packages:

| Package | Before | After | CVE(s) fixed |
|---|---|---|---|
| mlflow | 2.10.2 | 2.13.0 | CVE-2024-2928, CVE-2024-0520, CVE-2024-1560 — path traversal / LFI via URI manipulation |
| aiohttp | 3.9.3 | 3.10.11 | CVE-2024-30251 (DoS), CVE-2024-23334 (dir traversal), CVE-2024-52304 (request smuggling) |
| Jinja2 | 3.1.3 | 3.1.6 | CVE-2024-34064 (HTML injection), CVE-2025-27516 (sandbox breakout / RCE) |
| Werkzeug | 3.0.1 | 3.0.3 | CVE-2024-34069 — debugger RCE via cross-origin interaction |
| gunicorn | 21.2.0 | 23.0.0 | CVE-2024-1135, CVE-2024-6827 — HTTP request smuggling (TE.CL) |
| Pillow | 10.2.0 | 10.3.0 | CVE-2024-28219 — buffer overflow in _imagingcms.c |
| certifi | 2024.2.2 | 2024.7.4 | CVE-2024-39689 — compromised GLOBALTRUST root CA in trust store |
| jupyter_server | 2.12.5 | 2.14.1 | CVE-2024-35178 — NTLMv2 hash leak on Windows |
| jupyterlab | 4.1.1 | 4.2.5 | CVE-2024-43805 — DOM Clobbering XSS via Markdown cells |
| notebook | 7.1.0 | 7.2.2 | CVE-2024-43805 — same as above |
| protobuf | 4.25.2 | 4.27.5 | CVE-2024-7254 — stack overflow DoS via recursive field parsing |
| urllib3 | 2.0.7 | 2.2.2 | CVE-2024-37891 — Proxy-Authorization header leaked on cross-origin redirect |
Open issues requiring separate evaluation:

| Package | Version | CVE | Severity | Notes |
|---|---|---|---|---|
| torch | 2.2.2 | CVE-2025-32434 | CRITICAL | Fix: >=2.6.0. Breaking change — held pending compatibility check. |
| tensorflow | 2.15.0.post1 | CVE-2024-3660 | CRITICAL | Fix: >=2.16.0. Breaking change — held pending compatibility check. |
| ray | 2.21.0 | CVE-2023-48022 | CRITICAL | No official patch (vendor disputed). Mitigation: firewall the Ray Dashboard port, enable token auth. |
Problem: The original requirements.txt pinned all ~200 packages including transitive dependencies at exact versions. Dependabot raised 100+ alerts because every pinned transitive package was flagged individually.
Solution: Replaced the full pinned list with ~35 direct dependencies only, using >= minimum version bounds. Pip now resolves and installs the latest compatible transitive dependencies at install time, so Dependabot only monitors packages the project actually depends on directly.
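As a hypothetical illustration of this layout (the actual file lists ~35 direct dependencies; the entries below are drawn only from the audit table above, and the "-e ." line is the one mentioned in the earlier note):

```text
# requirements.txt — direct dependencies only; pip resolves transitive
# dependencies at install time. Minimum bounds = last tested working versions.
# -e .            # uncomment for the first run to activate setup.py
mlflow>=2.13.0
aiohttp>=3.10.11
Jinja2>=3.1.6
Werkzeug>=3.0.3
gunicorn>=23.0.0
Pillow>=10.3.0
certifi>=2024.7.4
urllib3>=2.2.2
```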
Impact:
- Dependabot alert surface reduced from ~200 packages to ~35.
- All future transitive dependency updates are handled automatically by pip.
- Security patches in transitive deps no longer require manual requirements.txt edits.
- The minimum versions set correspond to the last tested working versions.