# Deep Learning for Vision-Based SLAM: Comparing a MonoDepth2 Baseline to a Full SLAM Pipeline

## Introduction
Our project aims to explore and quantify the benefits of integrating deep learning components into a full monocular visual SLAM (Simultaneous Localization and Mapping) system. We plan to implement a baseline depth estimation model inspired by MonoDepth2 and compare its performance to a full SLAM pipeline that additionally estimates camera pose, performs loop closure detection, and builds a global 3D map. 

Given the potential complexity of this task, we have also considered an alternative way to quantify the benefits of deep learning in SLAM: comparing deep learning to traditional methods for monocular visual odometry. This would involve implementing a traditional SIFT-based pose estimator for visual odometry and comparing its performance to that of a deep learning model resembling DeepVO. Time permitting, we could integrate both pose estimation approaches into their respective SLAM systems and compare the results.

This project will not only deepen our understanding of deep learning applications in robotics but also provide insights into how spatial and temporal context can improve depth estimation and localization. We both have a strong interest in computer vision, and we wanted to challenge ourselves with a final project that goes beyond image classification.

## Research Questions and Hypotheses
**Potential Questions:**
- How does the integration of pose estimation and loop closure in a full SLAM pipeline improve the accuracy of depth estimation compared to a baseline MonoDepth2 model?
- What are the differences in performance when evaluating per-frame depth accuracy, overall camera trajectory (pose estimation), and global map consistency?
- How does deep-learning-based monocular visual odometry (such as DeepVO) compare to a more traditional feature-based approach using SIFT?

**Hypotheses:**
- The full SLAM system will produce more accurate depth maps (lower RMSE and absolute relative error) than MonoDepth2 by leveraging multi-frame information.
- The SLAM system will demonstrate significantly improved camera localization (measured by Absolute Trajectory Error and Relative Pose Error) compared to using a single-frame depth estimation model.
- The global 3D map generated by the SLAM pipeline will exhibit higher consistency and reduced drift compared to the baseline depth predictions from MonoDepth2.
- A DeepVO-style model will produce more flexible and robust relative trajectory estimates (lower RPE) than the SIFT-based approach.

## Methods
**Baseline Model – MonoDepth2:**
- Implement an existing MonoDepth2 network using PyTorch to predict per-frame dense depth maps.
- Train the model on publicly available datasets (KITTI) and evaluate using standard depth metrics such as RMSE, Absolute Relative Error, and threshold accuracy.
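
The depth metrics named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the evaluation code we will necessarily use: the function name `depth_metrics`, the 80 m depth cap (a common convention for KITTI evaluation), and the optional median scaling step (often applied to monocular predictions because their scale is ambiguous) are all assumptions made for the example.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0, median_scale=False):
    """Return RMSE, absolute relative error, and delta < 1.25 accuracy.

    pred, gt: arrays of predicted and ground-truth depth (same shape).
    min_depth / max_depth: assumed valid range; pixels outside it are ignored.
    median_scale: if True, rescale pred so its median matches the gt median.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Evaluate only pixels that have valid ground truth.
    valid = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[valid], gt[valid]

    if median_scale:
        pred = pred * (np.median(gt) / np.median(pred))

    # Clamp predictions to the same range as the ground truth.
    pred = np.clip(pred, min_depth, max_depth)

    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)

    # Threshold accuracy: fraction of pixels with max(pred/gt, gt/pred) < 1.25.
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)

    return {"rmse": rmse, "abs_rel": abs_rel, "delta1": delta1}
```

Calling `depth_metrics(pred_depth, gt_depth)` on one frame returns a dictionary of the three scores; averaging these over a test split gives the per-frame numbers we plan to compare between the two systems.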

**Full SLAM Pipeline:**
- **Pose Estimation Module:** Develop a CNN+LSTM-based visual odometry model to estimate relative camera motion between frames.
- **Loop Closure Detection:** Integrate a deep learning module to detect revisited areas and trigger map optimization, or substitute an existing one such as NetVLAD.
- **Mapping Module:** Fuse per-frame depth predictions with pose estimates to build a global 3D map. Apply bundle adjustment techniques to refine both the camera trajectory and the map.
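
To make the pose and mapping steps concrete, the sketch below chains relative camera motions into a global trajectory and lifts a depth map into world-frame 3D points. It is only an illustration under stated assumptions: poses are 4x4 homogeneous camera-to-world matrices, each relative pose maps frame k coordinates into frame k-1, the camera follows a pinhole model with intrinsics `K`, and the function names `accumulate_poses` and `backproject` are hypothetical. Loop closure and bundle adjustment are not shown.

```python
import numpy as np

def accumulate_poses(relative_poses):
    """Chain 4x4 relative transforms into camera-to-world poses.

    The first camera is taken as the world origin (identity pose).
    """
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for T_rel in relative_poses:
        pose = pose @ T_rel  # compose the newest motion on the right
        trajectory.append(pose.copy())
    return trajectory

def backproject(depth, K, T_world_cam):
    """Lift a depth map of shape (H, W) to world-frame points of shape (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Homogeneous pixel coordinates (u, v, 1), one row per pixel.
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    pixels = pixels.astype(np.float64)

    # Rays with unit z in the camera frame, scaled by the predicted depth.
    rays = pixels @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)

    # Move the points into the world frame using the camera pose.
    pts_h = np.hstack([pts_cam, np.ones((pts_cam.shape[0], 1))])
    return (pts_h @ T_world_cam.T)[:, :3]
```

Concatenating the output of `backproject` over all frames, each with its pose from `accumulate_poses`, gives a raw global point cloud. Any error in the relative poses accumulates along the chain, which is the drift that loop closure and bundle adjustment are meant to reduce.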

**Evaluation:**
- Compare per-frame depth outputs from both systems using standard metrics.
- Evaluate SLAM’s pose accuracy using Absolute Trajectory Error (ATE) and Relative Pose Error (RPE).
- Assess global map quality via qualitative visualization and point cloud alignment metrics.
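
The two trajectory metrics can be sketched as follows. This is a simplified illustration, not a replacement for an established evaluation toolkit: ATE is computed after a rigid (rotation plus translation) least-squares alignment of the estimated positions to the ground truth, and RPE reports only the translational part of the relative error. The function names are hypothetical, poses are assumed to be 4x4 camera-to-world matrices, and no scale correction is applied, although monocular estimates usually need one because their scale is ambiguous.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation R and translation t such that gt ~ R @ est + t.

    est, gt: (N, 3) arrays of corresponding camera positions (Kabsch method).
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the SVD solution.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """Absolute Trajectory Error: RMSE of position residuals after alignment."""
    R, t = align_rigid(est, gt)
    aligned = est @ R.T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))

def rpe_trans_rmse(est_poses, gt_poses, delta=1):
    """Relative Pose Error (translation part) over a fixed frame interval."""
    errs = []
    for i in range(len(gt_poses) - delta):
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        err = np.linalg.inv(rel_gt) @ rel_est  # residual relative motion
        errs.append(np.linalg.norm(err[:3, 3]))
    return np.sqrt(np.mean(np.square(errs)))
```

ATE captures global consistency of the whole trajectory, so it should benefit most from loop closure, while RPE measures local drift per step and is the more direct measure of odometry quality.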

**Data & Code Sources:**
- Use datasets like KITTI or TUM RGB-D.
- Use or adapt open-source code for MonoDepth2, visual odometry, and loop closure.

## Division of Work
- **Trevor Chartier:**
  - Set up project repository and environment.
  - Lead MonoDepth2 baseline implementation.
  - Develop visual odometry modules.
  - Implement pose evaluation metrics.

- **Max McLaurin:**
  - Implement loop closure and mapping modules.
  - Integrate global map construction.
  - Integrate all modules and handle synchronization.
  - Manage experiment comparisons and documentation.

## Possible Results
- The full SLAM pipeline is expected to produce lower depth errors compared to MonoDepth2 through the use of additional pose and temporal information.
- SLAM should demonstrate better camera trajectory accuracy (lower ATE and RPE).
- The 3D map from SLAM will be more coherent and less prone to drift.

## Timeline
**Sprint 1 (Setup & Data Preparation):**
   - Project setup, data download, and preprocessing.

**Sprint 2 (Baseline Implementation):**
   - MonoDepth2 model training and evaluation.

**Sprint 3 (SLAM Module Development):**
   - Visual odometry and loop closure integration.

**Sprint 4 (Integration & Evaluation):**
   - Final integration, testing, evaluation, and result visualization.