ProjectMayhemAutomotive/DataLayer

Data Layer

Abstract

The data layer handles the project's structured and unstructured OLAP data. It is built on the Databricks platform. Most actions and pipelines are triggered by the Master Agent.
Conceptually, it is divided into two modules: Individual and Batch.
The Individual Module gives quick diagnoses of customers' on-the-spot problems using sensor information and unstructured data. The Batch Module uses deep learning to analyze more complex relationships in the data.

Contents

Current Progress And More To Come

Individual Module

Current Abilities

Individual Module

Image of MCP Server Serving Tools

server.py exposes the Data Diagnosis Agent, which serves multiple models and endpoints that the Customer Support Agent calls over MCP. As it matures, it will add layers of security and allow AI agents to access its tools.
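
To make the request flow concrete, here is a minimal standard-library sketch of how a tool server might dispatch an MCP-style `tools/call` request (JSON-RPC 2.0). It is not the actual server.py and does not use the MCP SDK. The tool name `diagnose_battery`, its `voltage` parameter, and the 11.8 V threshold are hypothetical, chosen only for illustration.

```python
import json
from typing import Callable, Dict

# Registry of callable tools. Tool names here are hypothetical, not from server.py.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function under a tool name."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@tool("diagnose_battery")
def diagnose_battery(voltage: float) -> str:
    # Placeholder rule standing in for a trained model (assumed threshold).
    return "battery_issue_imminent" if voltage < 11.8 else "ok"


def handle_request(raw: str) -> str:
    """Handle one JSON-RPC 2.0 request shaped like an MCP `tools/call`."""
    request = json.loads(raw)
    params = request.get("params", {})
    name = params.get("name")
    if request.get("method") != "tools/call" or name not in TOOLS:
        # For simplicity, the JSON-RPC "method not found" code covers both cases.
        error = {"code": -32601, "message": "Method or tool not found"}
        return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), "error": error})
    text = TOOLS[name](**params.get("arguments", {}))
    result = {"content": [{"type": "text", "text": text}], "isError": False}
    return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), "result": result})


request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "diagnose_battery", "arguments": {"voltage": 11.2}},
}
print(handle_request(json.dumps(request)))
```

A real deployment would put authentication and input validation in front of the dispatcher, which is where the planned security layers would fit.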

Dataset: Vehicle Maintenance Telemetry Data

Methodology

We use an ensemble of several popular models for regression and classification tasks.

  • For regression on the columns (failure_year, failure_month, failure_day) we train Linear Regression, Random Forest, LightGBM and XGBoost regressors.
  • For classification on the columns (engine_failure_imminent, brake_issue_imminent, battery_issue_imminent) we train Logistic Regression, Random Forest, LightGBM, XGBoost and Support Vector classifiers.
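
Since the regressors predict year, month and day as separate numeric columns, the raw outputs need not form a valid calendar date. The sketch below shows one possible way to round and clamp them into a date. It is an assumption about post-processing, not the project's confirmed method, and `assemble_failure_date` is a hypothetical helper name.

```python
import calendar
from datetime import date


def assemble_failure_date(year: float, month: float, day: float) -> date:
    """Round three regression outputs and clamp them into a valid date."""
    y = min(max(round(year), date.min.year), date.max.year)
    m = min(max(round(month), 1), 12)
    # Clamp the day to the last valid day of the chosen month.
    last_day = calendar.monthrange(y, m)[1]
    d = min(max(round(day), 1), last_day)
    return date(y, m, d)


# A predicted day of 30.7 in February is clamped to the month's last day.
print(assemble_failure_date(2025.4, 2.2, 30.7))
```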

Tentative Results

After experimentation, we found that the entire training run took 35.9 s. The best-performing model for each task is:

  • XGBoost for failure_date
  • LightGBM for engine_failure_imminent
  • Logistic Regression for brake_issue_imminent
  • XGBoost Classifier for battery_issue_imminent

These models were selected because they performed well without overfitting.
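
One way to express "best performance without overfitting" as a rule is to discard candidates whose training score exceeds their validation score by more than a tolerance, then pick the highest validation score among the rest. The sketch below is a minimal illustration of that idea. The 0.05 tolerance and the scores in the example are made-up illustrative values, not the project's results.

```python
from typing import Dict, Optional, Tuple

# (train_score, validation_score) per candidate model; higher is better.
Scores = Dict[str, Tuple[float, float]]


def select_model(scores: Scores, max_gap: float = 0.05) -> Optional[str]:
    """Pick the best validation score among models with a small train/validation gap."""
    eligible = {
        name: val
        for name, (train, val) in scores.items()
        if train - val <= max_gap
    }
    if not eligible:
        return None  # every candidate looks overfit under this tolerance
    return max(eligible, key=eligible.get)


# Illustrative numbers only: Random Forest is excluded for its large gap.
example = {
    "RandomForest": (0.99, 0.85),
    "LightGBM": (0.91, 0.88),
    "LogisticRegression": (0.84, 0.83),
}
print(select_model(example))  # LightGBM
```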

Batch Module

The batch directory contains code (notebooks) hosted directly on Databricks.
It contains a sample ETL pipeline that retrains the model and saves its weights once enough new data has arrived to meaningfully change the trends.
Data processing follows the medallion architecture.
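
As a rough illustration of the flow described above, the sketch below walks raw (bronze) rows through a cleaned (silver) stage and an aggregated (gold) stage, then decides whether the trends have shifted enough to retrain. It is plain Python rather than the Databricks notebooks. The column names, the row-count threshold and the 10% relative tolerance are assumptions, and every row is assumed to carry the same columns.

```python
from statistics import mean
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def to_silver(bronze: List[Record]) -> List[Dict[str, float]]:
    """Silver: drop rows with missing readings and cast values to float."""
    return [
        {k: float(v) for k, v in row.items()}
        for row in bronze
        if all(v is not None for v in row.values())
    ]


def to_gold(silver: List[Dict[str, float]]) -> Dict[str, float]:
    """Gold: aggregate cleaned rows into per-column means."""
    if not silver:
        return {}
    return {col: mean(row[col] for row in silver) for col in silver[0]}


def should_retrain(previous: Dict[str, float], current: Dict[str, float],
                   n_new_rows: int, min_rows: int = 1000,
                   rel_tol: float = 0.10) -> bool:
    """Retrain only if enough rows arrived and some column mean shifted noticeably."""
    if n_new_rows < min_rows:
        return False
    for col, old in previous.items():
        new = current.get(col, old)
        if abs(new - old) > rel_tol * max(abs(old), 1e-9):
            return True
    return False


# Hypothetical sensor columns, for illustration only.
bronze = [
    {"engine_temp": 90.0, "battery_voltage": 12.4},
    {"engine_temp": None, "battery_voltage": 12.1},
    {"engine_temp": 110.0, "battery_voltage": 12.0},
]
gold = to_gold(to_silver(bronze))
print(gold, should_retrain({"engine_temp": 80.0}, gold, n_new_rows=2, min_rows=2))
```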

This pipeline will be updated with the addition of Supabase (Postgres) as the primary data lake.

The model is a Graph Attention Transformer (GAT) optimized for spatio-temporal data. It shows strong results and even exhibits the double descent phenomenon.

Figures: battery_loss, engine_loss, val_loss, urgency_loss
Model Architecture Diagram:

(A fuller description of the model can be found here; the report is too interesting to ignore, btw.)

Current Progress and More to Come

Individual Module

Current Abilities

Can diagnose a problem from relevant sensor data through RAG.
It has yet to reach its full potential. While it is not as strong as heavy deep learning models, it is still very useful for quick diagnosis of problems, which is exactly what we intend it to be.
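
The retrieval step of RAG can be sketched with nothing but the standard library: embed the query and each stored note, rank notes by cosine similarity, and pass the top matches to the model as context. The bag-of-words "embedding" below is a toy stand-in for a learned embedding model, and the diagnostic notes are invented examples.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a learned model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, notes: List[str], k: int = 1) -> List[Tuple[float, str]]:
    """Return the k notes most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(n)), n) for n in notes]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented notes, for illustration only.
notes = [
    "low battery voltage causes slow cranking",
    "worn brake pads cause squealing noise",
    "engine overheating from coolant leak",
]
print(retrieve("battery voltage is low", notes))
```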

Concerns
  • Due to a lack of good data, the results may seem lackluster, but the module still has the potential to meet customer needs.
  • The server is entirely reliant on RAG from the Customer Support Agent, which violates the principle of isolation.
Improvements
  • Although this field is being actively researched, much of the work is not public. The widely accessible literature shows the potential of unstructured data such as text and images for diagnosis. See: Predictive Maintenance in IoT Using NLP Techniques
  • We plan to add more embeddings and integrate directly with the database to decrease reliance on RAG.

Batch Module

Current Abilities
  • Can retrain the model with new data and save the weights. This ensures that it does not become irrelevant as time passes.
  • Can process thousands of customers' data in parallel and inform them in advance.
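
The parallel-processing ability can be illustrated in miniature with a thread pool that fans a batch of customer records out to a scoring function. This is only a sketch: on Databricks the same fan-out would more likely run as a distributed job, and `predict_urgency`, its `mileage_km` input and the 0.8 notification threshold are hypothetical stand-ins for the real model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def predict_urgency(customer: Dict[str, float]) -> Dict[str, object]:
    """Stand-in for model inference; the scoring rule is hypothetical."""
    score = min(1.0, customer["mileage_km"] / 200_000)
    return {"customer_id": customer["customer_id"], "notify": score > 0.8}


def process_batch(customers: List[Dict[str, float]],
                  max_workers: int = 8) -> List[Dict[str, object]]:
    """Score all customers concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict_urgency, customers))


customers = [{"customer_id": i, "mileage_km": 50_000 * i} for i in range(1, 5)]
for result in process_batch(customers):
    print(result)
```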
Concerns
  • The model is not yet trained on real data. It is currently trained on synthetic data for testing purposes.
  • The External Communication Layer is not yet mature enough to test.
Improvements
  • The model is being trained on real data and will be deployed to production soon.
  • The model will be integrated with the database to serve predictions for new data.
  • The model will be optimized for better performance and accuracy.

Future Work

  • Complete multiple ETL pipelines for the next rounds.
  • Try better datasets.
  • Simulate the behaviour of sensors.
  • Integrate completely with Postgres.
