Data Layer

Abstract

The data layer handles the project's structured and unstructured OLAP data. It is built on the Databricks platform, and most actions and pipelines are triggered by the Master Agent.
Conceptually, it is divided into two modules: Individual and Batch.
The Individual Module gives quick diagnoses of customers' on-the-spot problems from sensor information and unstructured data. The Batch Module uses deep learning to analyze more complex relationships in the data.

Individual Module

server.py - exposes the Data Diagnosis Agent, which serves multiple models as MCP endpoints for the Customer Support Agent to call. As it matures, it will add layers of security and allow AI agents to access its tools.
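
The idea of exposing models as named tools can be sketched with a small registry and dispatcher. This is a minimal, standard-library-only illustration and does not use the MCP SDK; the names `TOOLS`, `tool`, `call_tool`, and `diagnose_battery`, and the 11.8 V rule, are hypothetical and not taken from server.py.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: tool name -> callable. A real MCP server would
# register tools through the MCP SDK rather than a plain dict.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("diagnose_battery")
def diagnose_battery(voltage: float) -> dict:
    # Placeholder rule standing in for a trained classifier (assumption).
    return {"battery_issue_imminent": voltage < 11.8}


def call_tool(name: str, **kwargs: Any) -> Any:
    """Dispatch a call by tool name; unknown tools raise KeyError."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

A caller such as the Customer Support Agent would then invoke `call_tool("diagnose_battery", voltage=11.0)` without needing to know which model sits behind the name.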

Dataset: Vehicle Maintenance Telemetry Data

Methodology

We use an ensemble of popular models for the regression and classification tasks.

For regression on the columns (failure_year, failure_month, failure_day), we train Linear Regression, Random Forest, LightGBM, and XGBoost regressors.
For classification on the columns (engine_failure_imminent, brake_issue_imminent, battery_issue_imminent), we train Logistic Regression, Random Forest, LightGBM, XGBoost, and Support Vector classifiers.
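
The combinations described above can be enumerated as a training plan. This is a sketch only: it assumes one model per target column and model family, which the text does not state outright, and the constant and function names are illustrative.

```python
from itertools import product

REGRESSION_TARGETS = ["failure_year", "failure_month", "failure_day"]
CLASSIFICATION_TARGETS = [
    "engine_failure_imminent",
    "brake_issue_imminent",
    "battery_issue_imminent",
]
# Model family labels only; no estimator objects are created here.
REGRESSORS = ["LinearRegression", "RandomForest", "LightGBM", "XGBoost"]
CLASSIFIERS = ["LogisticRegression", "RandomForest", "LightGBM", "XGBoost", "SVC"]


def training_plan():
    """Return every (task, target, model) combination to be fitted."""
    plan = [("regression", t, m)
            for t, m in product(REGRESSION_TARGETS, REGRESSORS)]
    plan += [("classification", t, m)
             for t, m in product(CLASSIFICATION_TARGETS, CLASSIFIERS)]
    return plan
```

Under that assumption the plan holds 3 x 4 regression fits plus 3 x 5 classification fits, 27 in total.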

Tentative Results

After experimentation, we found that the entire training run took 35.9 s. The best-performing model for each task is:

XGBoost for failure_date
LightGBM for engine_failure_imminent
Logistic Regression for brake_issue_imminent
XGBoost for battery_issue_imminent

These models were selected for their strong performance without overfitting.
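
One way to encode "best performance without overfitting" is to discard candidates whose training score exceeds their validation score by too much, then take the best validation score among the rest. The sketch below assumes higher scores are better; the `Score` type, `select_model`, and the `max_gap` value are illustrative and not the project's actual selection rule.

```python
from typing import Dict, NamedTuple, Optional


class Score(NamedTuple):
    train: float       # score on the training split (higher is better)
    validation: float  # score on the held-out split (higher is better)


def select_model(scores: Dict[str, Score], max_gap: float = 0.05) -> Optional[str]:
    """Pick the best validation score among models whose train/validation
    gap stays within max_gap; return None if every model exceeds the gap."""
    eligible = {
        name: s for name, s in scores.items()
        if s.train - s.validation <= max_gap
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name].validation)
```

A model that scores very high on training data but noticeably lower on validation data is filtered out before ranking, so a slightly weaker but more stable model can win.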

Batch Module

The batch directory contains code (notebooks) hosted directly on Databricks.
It contains a sample ETL pipeline that retrains the model and saves its weights once enough new data has arrived to meaningfully change the trends.
Data processing follows the medallion architecture (bronze, silver, and gold layers).
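
The medallion flow and the retraining condition can be sketched in plain Python. This is an illustration under stated assumptions, not the notebooks' code: the column names `vehicle_id` and `engine_temp`, the function names, and the 10% growth threshold are all hypothetical, and a Databricks pipeline would normally operate on tables rather than lists of dicts.

```python
from collections import defaultdict
from statistics import mean


def to_silver(bronze_rows):
    """Bronze -> silver: drop incomplete rows and cast fields to fixed types."""
    silver = []
    for row in bronze_rows:
        if row.get("vehicle_id") is None or row.get("engine_temp") in (None, ""):
            continue  # skip rows with a missing key or reading
        silver.append({
            "vehicle_id": str(row["vehicle_id"]),
            "engine_temp": float(row["engine_temp"]),
        })
    return silver


def to_gold(silver_rows):
    """Silver -> gold: aggregate one feature row per vehicle."""
    temps = defaultdict(list)
    for row in silver_rows:
        temps[row["vehicle_id"]].append(row["engine_temp"])
    return {
        vid: {"mean_engine_temp": mean(vals), "n_readings": len(vals)}
        for vid, vals in temps.items()
    }


def should_retrain(new_rows: int, rows_at_last_training: int,
                   min_growth: float = 0.10) -> bool:
    """Retrain once new data reaches a fraction of the last training set size."""
    if rows_at_last_training == 0:
        return new_rows > 0
    return new_rows / rows_at_last_training >= min_growth
```

A row-count threshold is only one possible reading of "enough new data"; a drift test on the feature distributions would be a stricter alternative.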

This pipeline will be updated to add Supabase (Postgres) as the primary data lake.

The model is a Graph Attention Transformer (GAT) optimized for spatio-temporal data. It shows strong results and even exhibits the double descent phenomenon.
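
For readers unfamiliar with graph attention, the sketch below shows a single attention head in plain Python, following the commonly published formulation (score each neighbor, softmax over the neighborhood, take the weighted sum). Whether the project's model uses exactly this form is an assumption; the function names are illustrative, and the temporal component and the final nonlinearity are omitted.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def attention_layer(features, neighbors, W, a):
    """Single-head graph attention.

    features:  node id -> feature vector
    neighbors: node id -> list of neighbor ids (include the node itself)
    W:         projection matrix; a: attention vector of length 2 * output_dim
    Returns (new_features, attention_weights).
    """
    projected = {i: matvec(W, h) for i, h in features.items()}
    outputs, weights = {}, {}
    for i, nbrs in neighbors.items():
        # Unnormalized score: LeakyReLU(a . [W h_i || W h_j]);
        # list "+" concatenates the two projected vectors.
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, projected[i] + projected[j])))
            for j in nbrs
        ]
        # Softmax over the neighborhood, shifted by the max for stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        weights[i] = dict(zip(nbrs, alphas))
        dim = len(projected[i])
        outputs[i] = [
            sum(alpha * projected[j][d] for alpha, j in zip(alphas, nbrs))
            for d in range(dim)
        ]
    return outputs, weights
```

Each node's new representation is a convex combination of its neighbors' projected features, with the mixing weights learned through `W` and `a` during training.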

Figure: loss curves (panels: battery_loss, engine_loss, val_loss, urgency_loss)

Model Architecture Diagram:

(More description of the model can be found here (the report is too interesting to ignore btw))

Current Progress and More to Come

Individual Module

Current Abilities

Can diagnose a problem from relevant sensor data through RAG.
It has yet to reach its full potential. It is not as strong as heavy deep learning models, but it is useful for quick diagnosis of problems, which is exactly the role we intend for it.
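
The retrieval step of RAG can be sketched as a nearest-neighbor search over embeddings. This is a minimal illustration, assuming documents and the sensor-derived query have already been embedded as vectors; the function names and the use of cosine similarity are assumptions, not details taken from the repository.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float],
             docs: Dict[str, Sequence[float]],
             k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query embedding."""
    ranked = sorted(
        ((doc_id, cosine(query, emb)) for doc_id, emb in docs.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]
```

The retrieved passages would then be passed, together with the sensor readings, to the model that produces the diagnosis.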

Concerns

Due to a lack of good data, the results may seem lackluster, but the module still has the potential to meet customer needs.
The server relies entirely on RAG from the Customer Support Agent, which violates the principle of isolation.

Improvements

Although this field is being actively researched, much of the work is not public. The widely accessible literature shows the potential of unstructured data such as text and images for diagnosis (see "Predictive Maintenance in IoT Using NLP Techniques").
We plan to add more embeddings and integrate directly with the database to decrease reliance on RAG.

Batch Module

Current Abilities

Can retrain the model with new data and save the weights. This ensures that it does not become irrelevant as time passes.
Can process thousands of customers' data in parallel and inform them in advance.
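
Scoring many customers concurrently and flagging those who should be warned can be sketched as follows. A thread pool stands in here for whatever engine the notebooks actually use; `score_customers`, the `predict` callable, and the 0.5 threshold are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def score_customers(customer_ids: Iterable[str],
                    predict: Callable[[str], float],
                    threshold: float = 0.5,
                    max_workers: int = 8) -> Dict[str, bool]:
    """Score customers concurrently and flag those that should be notified."""
    ids = list(customer_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so ids and risks line up.
        risks = list(pool.map(predict, ids))
    return {cid: risk >= threshold for cid, risk in zip(ids, risks)}
```

Threads suit the case where `predict` waits on I/O, such as a call to a model endpoint; CPU-bound scoring would need processes or a distributed engine instead.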

Concerns

The model is not yet trained on real data. It is trained on synthetic data for testing purposes.
The External Communication Layer is not yet mature enough to test.

Improvements

The model is being trained on real data and will be deployed to production soon.
The model will be integrated with the database to serve predictions for new data.
The model will be optimized for better performance and accuracy.

Future Work

Complete multiple ETL pipelines for next rounds.
Try better datasets.
Simulate the behaviour of sensors.
Integrate completely with Postgres.
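
As a starting point for the sensor simulation item, a reading stream can be modeled as a random walk with drift. This is a sketch under assumptions: the function name, the default values (chosen to resemble a slowly discharging 12 V battery), and the Gaussian noise model are all illustrative.

```python
import random
from typing import List


def simulate_sensor(n: int, start: float = 12.6, drift: float = -0.001,
                    noise: float = 0.02, seed: int = 0) -> List[float]:
    """Generate n readings as a random walk with linear drift and Gaussian noise."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    readings, value = [], start
    for _ in range(n):
        value += drift + rng.gauss(0.0, noise)
        readings.append(value)
    return readings
```

Fixing the seed makes a simulated trace repeatable, which helps when comparing model versions on the same synthetic data.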
