The data layer handles the project's structured and unstructured OLAP data.
It is built on the Databricks platform. Most of its actions and pipelines are triggered by the Master Agent.
Conceptually, it is divided into two modules: Individual and Batch.
The Individual Module gives quick diagnoses of customers' on-the-spot problems from sensor information and unstructured data. The Batch Module uses deep learning to analyze more complex relationships in the data.
Dataset: Vehicle Maintenance Telemetry Data
We use an ensemble of popular models for the regression and classification tasks.
- For regression on the columns (failure_year, failure_month, failure_day) we train Linear Regression, Random Forest regression, LightGBM regression and XGBoost regression.
- For classification on the columns (engine_failure_imminent, brake_issue_imminent, battery_issue_imminent) we train Logistic Regression, Random Forest Classifier, LightGBM Classifier, XGBoost Classifier and Support Vector Classifier.
After experimentation, we found that the entire training took 35.9s. The best-performing model for each task is:
- XGBoost for failure_date
- LightGBM for engine_failure_imminent
- Logistic Regression for brake_issue_imminent
- XGBoost Classifier for battery_issue_imminent
These models were selected because they performed well without overfitting.
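
The selection rule above (good performance without overfitting) can be sketched as follows. This is a minimal illustration, not the project's actual selection code: the `CandidateScore` type, the `max_gap` threshold, and the scores in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateScore:
    """Train/validation scores for one candidate model (higher is better)."""
    name: str
    train_score: float
    val_score: float


def select_model(candidates, max_gap=0.05):
    """Pick the best validation score among candidates that do not overfit.

    A candidate counts as overfit when its train score exceeds its
    validation score by more than max_gap (an assumed threshold).
    If every candidate is overfit, fall back to the smallest gap.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    well_fit = [c for c in candidates if c.train_score - c.val_score <= max_gap]
    if not well_fit:
        return min(candidates, key=lambda c: c.train_score - c.val_score)
    return max(well_fit, key=lambda c: c.val_score)


# Hypothetical scores for illustration only (not measured results).
example = [
    CandidateScore("model_a", train_score=0.99, val_score=0.80),
    CandidateScore("model_b", train_score=0.90, val_score=0.88),
    CandidateScore("model_c", train_score=0.85, val_score=0.84),
]
best = select_model(example)
print(best.name)
```

Here `model_a` has the highest training score but is rejected because of its large train/validation gap, so the rule prefers a candidate that generalizes better.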
The batch directory contains code (notebooks) hosted directly on Databricks.
It contains a sample ETL pipeline that re-trains the model and saves its weights once enough new data has arrived to meaningfully change the trends.
Data processing follows the medallion architecture (bronze, silver and gold layers).
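
As a rough sketch of how such a pipeline fits together, the example below walks raw records through bronze, silver and gold stages and then checks whether retraining is due. It runs on plain Python dictionaries rather than Databricks tables, and everything specific in it is an assumption: the column names `vehicle_id` and `engine_temp`, the `MIN_NEW_ROWS` threshold, and the simple row-count trigger standing in for "enough new data to change the trends".

```python
MIN_NEW_ROWS = 3  # hypothetical retraining threshold


def to_silver(bronze_rows):
    """Clean raw (bronze) rows: drop incomplete records and cast types."""
    silver = []
    for row in bronze_rows:
        if row.get("vehicle_id") is None or row.get("engine_temp") is None:
            continue  # skip records with missing fields
        silver.append({
            "vehicle_id": str(row["vehicle_id"]),
            "engine_temp": float(row["engine_temp"]),
        })
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned rows into one feature per vehicle (mean engine_temp)."""
    totals = {}
    for row in silver_rows:
        total, count = totals.get(row["vehicle_id"], (0.0, 0))
        totals[row["vehicle_id"]] = (total + row["engine_temp"], count + 1)
    return {vid: total / count for vid, (total, count) in totals.items()}


def should_retrain(new_silver_rows, min_new_rows=MIN_NEW_ROWS):
    """Simplified trigger: retrain once enough clean new rows have arrived."""
    return len(new_silver_rows) >= min_new_rows


# Hypothetical raw telemetry, including one unusable record.
bronze = [
    {"vehicle_id": "v1", "engine_temp": "90.0"},
    {"vehicle_id": "v1", "engine_temp": 100},
    {"vehicle_id": "v2", "engine_temp": None},
    {"vehicle_id": "v2", "engine_temp": 80.5},
]
silver = to_silver(bronze)
gold = to_gold(silver)
retrain = should_retrain(silver)
```

A production trigger would more likely compare the distribution of new data against the training data instead of counting rows; the count is used here only to keep the sketch short.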

This pipeline will be updated to use Supabase (Postgres) as the primary data lake.
The model is a Graph Attention Transformer (GAT) optimized for spatio-temporal data. It shows strong results and even exhibits the double descent phenomenon.
Loss curves: battery_loss | engine_loss | val_loss | urgency_loss
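
To give an intuition for graph attention, the sketch below aggregates the features of a node's neighbors, weighting each neighbor by a softmax over dot-product similarity. It is a simplified stand-in and not the model's actual architecture: the node names, the feature vectors, and the choice of plain dot-product scoring (with no learned weights) are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(node, features, neighbors):
    """Return the attention-weighted mean of a node's neighbor features.

    Scores are dot products between the node's feature vector and each
    neighbor's vector; a trained model would use learned projections.
    """
    h_i = features[node]
    nbrs = neighbors[node]
    scores = [sum(a * b for a, b in zip(h_i, features[j])) for j in nbrs]
    weights = softmax(scores)
    out = [
        sum(w * features[j][d] for w, j in zip(weights, nbrs))
        for d in range(len(h_i))
    ]
    return out, weights


# Hypothetical sensor graph: each node is a vehicle subsystem.
features = {
    "engine": [1.0, 0.0],
    "battery": [1.0, 0.0],
    "brake": [0.0, 1.0],
}
neighbors = {"engine": ["battery", "brake"]}
out, weights = attend("engine", features, neighbors)
```

In this toy graph the engine node attends more strongly to the battery node, whose features resemble its own, than to the brake node.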
Can diagnose a problem from relevant sensor data through RAG.
It has yet to reach its full potential. It is not as strong or effective as heavy deep learning models, but it is still very useful for quick diagnosis of problems, which is exactly the role we intend for it.
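
The retrieval step of such a RAG diagnosis can be sketched as below. This is a toy version under stated assumptions: the `KNOWLEDGE_BASE` snippets are invented, bag-of-words cosine similarity stands in for real embeddings, and the resulting prompt would still have to be sent to a language model, which is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge-base snippets (not the project's real data).
KNOWLEDGE_BASE = [
    "Low battery voltage and slow cranking usually indicate a weak battery.",
    "High engine temperature with low coolant level suggests a coolant leak.",
    "Squealing noise and long stopping distance point to worn brake pads.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(sensor_readings, complaint):
    """Combine retrieved context, sensor readings and the complaint."""
    sensor_words = " ".join(name.replace("_", " ") for name in sensor_readings)
    context = retrieve(complaint + " " + sensor_words, KNOWLEDGE_BASE, k=1)
    readings = ", ".join(f"{name}={value}" for name, value in sensor_readings.items())
    return (
        f"Context: {context[0]}\n"
        f"Sensor readings: {readings}\n"
        f"Complaint: {complaint}\n"
        "Diagnosis:"
    )


prompt = build_prompt({"battery_voltage": 11.2}, "car cranking slow in the morning")
print(prompt)
```

Folding the sensor names into the query lets the retriever match on both the customer's words and the sensors involved.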
- Due to a lack of good data, the results may seem lackluster, but the approach still has the potential to solve customer needs.
- The server relies entirely on RAG from the Customer Support Agent, which violates the principle of isolation.
- Although this field is being actively researched, much of the work is not public. The widely accessible literature, for example "Predictive Maintenance in IoT Using NLP Techniques", shows the potential of unstructured data such as text and images for diagnosis.
- We plan to add more embeddings and integrate directly with the database to decrease reliance on RAG.
- Can retrain the model with new data and save the weights. This ensures that it does not become irrelevant as time passes.
- Can process thousands of customers' data in parallel and inform them in advance.
- The model is not yet trained on real data. It is trained on synthetic data for testing purposes.
- The External Communication Layer is not yet mature enough to test.
- The model is being trained on real data and will be deployed to production soon.
- The model will be integrated with database to serve predictions for new data.
- The model will be optimized for better performance and accuracy.
- Complete multiple ETL pipelines for next rounds.
- Try better datasets.
- Simulate the behaviour of sensors.
- Integrate completely with Postgres.









