# Assignment2 - lingyun3@andrew.cmu.edu

This repository contains implementations and results for several approaches to 3D reconstruction from single-view RGB images.

## 1. Loss Function Exploration

### 1.1 Voxel Grid Fitting
Binary cross entropy loss was implemented to fit 3D binary voxel grids.
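
A minimal sketch of such a loss, assuming the source grid holds raw scores (logits) and the target grid holds 0/1 occupancy. The function name `voxel_loss` and the logits convention are assumptions for illustration, not necessarily what the repository uses.

```python
import torch
import torch.nn.functional as F


def voxel_loss(voxel_src: torch.Tensor, voxel_tgt: torch.Tensor) -> torch.Tensor:
    """Mean binary cross entropy between predicted and target occupancy.

    voxel_src: raw scores (logits) of any shape, e.g. (B, D, H, W).
    voxel_tgt: binary occupancy grid of the same shape (0 = empty, 1 = occupied).
    """
    # The *_with_logits variant applies the sigmoid internally, which is
    # numerically more stable than calling sigmoid followed by plain BCE.
    return F.binary_cross_entropy_with_logits(voxel_src, voxel_tgt.float())
```

If the source grid already stores probabilities in [0, 1], `F.binary_cross_entropy` would be the matching call instead.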

<div style="display: flex; justify-content: center">
  <img src="output/1-loss-functions/1.1-voxel/1-vox-src-new.gif" width="200"/>
  <img src="output/1-loss-functions/1.1-voxel/1-vox-tgt-new.gif" width="200"/>
</div>
<p align="center">Source and target voxel grid optimization</p>

### 1.2 Point Cloud Fitting
A custom Chamfer loss was implemented to fit 3D point clouds.
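
For reference, a symmetric Chamfer loss can be sketched in a few lines. This version uses squared distances and averages over points and batch; both choices, and the name `chamfer_loss`, are assumptions rather than a description of the code in this repository.

```python
import torch


def chamfer_loss(points_src: torch.Tensor, points_tgt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer loss between two batched point clouds.

    points_src: (B, N, 3), points_tgt: (B, M, 3).
    """
    # Pairwise squared distances via broadcasting: (B, N, M).
    diff = points_src[:, :, None, :] - points_tgt[:, None, :, :]
    dist2 = (diff ** 2).sum(dim=-1)
    # For every source point, squared distance to its nearest target point.
    src_to_tgt = dist2.min(dim=2).values  # (B, N)
    # For every target point, squared distance to its nearest source point.
    tgt_to_src = dist2.min(dim=1).values  # (B, M)
    return src_to_tgt.mean() + tgt_to_src.mean()
```

The dense (B, N, M) distance tensor is simple but memory hungry; for large clouds a nearest-neighbour routine would be a better fit.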

<div style="display: flex; justify-content: center">
  <img src="output/1-loss-functions/1.2-pc/1-point-src-new.gif" width="200"/>
  <img src="output/1-loss-functions/1.2-pc/1-point-tgt-new.gif" width="200"/>
</div>
<p align="center">Source and target point cloud optimization</p>

### 1.3 Mesh Fitting
A smoothness loss was implemented for mesh optimization.
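
The text does not say which smoothness term is used. One common choice is uniform Laplacian smoothing, which penalizes the offset between each vertex and the centroid of its neighbours. The sketch below is a plain PyTorch illustration of that idea under this assumption; `uniform_laplacian_loss` is a hypothetical name, and the squared-offset form is one of several variants.

```python
import torch


def uniform_laplacian_loss(verts: torch.Tensor, faces: torch.Tensor) -> torch.Tensor:
    """Mean squared offset between each vertex and its neighbour centroid.

    verts: (V, 3) float tensor, faces: (F, 3) long tensor of vertex indices.
    """
    num_verts = verts.shape[0]
    # Collect every triangle edge in both directions, then drop duplicates.
    edges = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    edges = torch.unique(torch.cat([edges, edges.flip(1)], dim=0), dim=0)
    i, j = edges[:, 0], edges[:, 1]
    # Sum neighbour positions and count neighbours per vertex.
    neighbor_sum = torch.zeros_like(verts).index_add(0, i, verts[j])
    ones = torch.ones_like(i, dtype=verts.dtype)
    degree = torch.zeros(num_verts, dtype=verts.dtype, device=verts.device).index_add(0, i, ones)
    centroid = neighbor_sum / degree.clamp(min=1).unsqueeze(1)
    return ((centroid - verts) ** 2).sum(dim=1).mean()
```

A flat, evenly tessellated region gives a value near zero, while spiky vertices are pulled toward their neighbours.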

<div style="display: flex; justify-content: center">
  <img src="output/1-loss-functions/1.3-mesh/1-mesh-src-new.gif" width="200"/>
  <img src="output/1-loss-functions/1.3-mesh/1-mesh-tgt-new.gif" width="200"/>
</div>
<p align="center">Source and target mesh optimization</p>

## 2. Single View to 3D Reconstruction

All models are trained for 1500 epochs, and the 1000-epoch checkpoint is used for inference. I use the `CosineAnnealingLR` scheduler and keep the default initial learning rate. I increase `n_point` for the point cloud decoder to 2000.
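
As an illustration of the scheduler setup, here is a minimal sketch. The Adam optimizer, the learning rate value, and `t_max` are placeholders chosen for the example (the actual defaults live in the training script), and `build_optimizer_and_scheduler` is a hypothetical helper.

```python
import torch


def build_optimizer_and_scheduler(model: torch.nn.Module, lr: float, t_max: int):
    """Create an optimizer and a cosine-annealing schedule over t_max steps."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=t_max)
    return optimizer, scheduler


# Usage with a stand-in model and placeholder values.
model = torch.nn.Linear(4, 2)
optimizer, scheduler = build_optimizer_and_scheduler(model, lr=1e-3, t_max=10)
lrs = []
for _ in range(10):
    optimizer.step()   # the real loop would compute a loss and call backward() first
    scheduler.step()   # decay the learning rate along a half cosine
    lrs.append(scheduler.get_last_lr()[0])
```

With `eta_min` left at its default of 0, the learning rate falls to half its initial value at the midpoint and reaches zero after `t_max` steps.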

### 2.1 Image to Voxel Grid
The model uses a pretrained `resnet18` backbone as the feature extractor and a linear layer to project the 512-dimensional image feature into a [4 * 4 * 4 * 512] voxel feature, so that `nn.ConvTranspose3d` can be applied afterwards. Three deconvolution layers decode this feature, followed by a final convolution layer and a sigmoid that maps the output to [0, 1] as the occupancy grid.
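
The decoder described above can be sketched roughly as follows. The channel widths, kernel sizes, channel-first reshape, and the resulting 32 x 32 x 32 output resolution are assumptions made for the example, and the `resnet18` encoder is omitted so the block takes the 512-dimensional feature as input. `VoxelDecoder` is a hypothetical name.

```python
import torch
import torch.nn as nn


class VoxelDecoder(nn.Module):
    """Image feature (B, 512) -> occupancy grid (B, 1, 32, 32, 32)."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # Linear projection to a 4 x 4 x 4 grid with 512 channels.
        self.project = nn.Linear(feat_dim, 4 * 4 * 4 * 512)

        def up(c_in: int, c_out: int) -> nn.Sequential:
            # kernel 4, stride 2, padding 1 doubles each spatial dimension.
            return nn.Sequential(
                nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
            )

        # Three deconvolution layers: 4 -> 8 -> 16 -> 32.
        self.deconv = nn.Sequential(up(512, 128), up(128, 32), up(32, 8))
        # Final convolution plus sigmoid gives occupancy values in [0, 1].
        self.head = nn.Sequential(nn.Conv3d(8, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        x = self.project(image_feat).view(-1, 512, 4, 4, 4)
        return self.head(self.deconv(x))
```

Calling `VoxelDecoder()(torch.randn(2, 512))` returns a tensor of shape `(2, 1, 32, 32, 32)`.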

Sample results from the voxel reconstruction network:

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/vox/20_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/vox/20_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/vox/20_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Voxels, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/vox/89_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/vox/89_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/vox/89_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Voxels, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/vox/205_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/vox/205_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/vox/205_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Voxels, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/vox/231_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/vox/231_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/vox/231_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Voxels, Ground Truth</p>

### 2.2 Image to Point Cloud
I use almost the same decoder architecture for img2pc and img2mesh; only the size of the final output layer differs:
```Python
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    # NOTE: for img2mesh the final output size is (number of mesh vertices) * 3 instead
    nn.Linear(2048, self.n_point * 3),
    nn.Tanh(),  # NOTE: bound the output to [-1, 1]
)
```

Sample results from the point cloud reconstruction network:

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/point/1_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/point/1_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/point/1_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Points, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/point/5_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/point/5_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/point/5_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Points, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/point/6_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/point/6_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/point/6_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Points, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/point/27_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/point/27_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/point/27_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Points, Ground Truth</p>

### 2.3 Image to Mesh
Sample results from the mesh reconstruction network:

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_1.0/16_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/16_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/16_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Mesh, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_1.0/20_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/20_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/20_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Mesh, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_1.0/115_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/115_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/115_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Mesh, Ground Truth</p>

### 2.4 Quantitative Evaluation
Evaluation metrics comparing different reconstruction methods:
<div style="display: flex; justify-content: center">
<img src="output/2-reconstructing-3d/eval/eval_mesh.png" width="300"/>
<img src="output/2-reconstructing-3d/eval/eval_point.png" width="300"/>
<img src="output/2-reconstructing-3d/eval/eval_vox.png" width="300"/>
</div>
<p align="center">F1 scores at different thresholds</p>
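
For context, an F1 score at a distance threshold can be computed from two point sets as sketched below: precision is the fraction of predicted points that lie within the threshold of some ground-truth point, and recall is the reverse. This is a generic illustration that returns a fraction in [0, 1]; the evaluation script may sample points and scale the score differently, and `f1_at_threshold` is a hypothetical name.

```python
import torch


def f1_at_threshold(pred: torch.Tensor, gt: torch.Tensor, threshold: float) -> float:
    """F1 score between point sets pred (N, 3) and gt (M, 3) at one threshold."""
    dist = torch.cdist(pred, gt)  # (N, M) Euclidean distances
    # Precision: predicted points that have a ground-truth point nearby.
    precision = (dist.min(dim=1).values < threshold).float().mean().item()
    # Recall: ground-truth points that have a predicted point nearby.
    recall = (dist.min(dim=0).values < threshold).float().mean().item()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Sweeping `threshold` over a range of values gives a curve like the ones plotted above.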

### 2.5 Hyperparams

I lower `w_chamfer` in the hope of getting smoother meshes, keeping all other parameters fixed, including the checkpoint epoch.
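
To make the role of this weight concrete, the mesh objective can be thought of as a weighted sum of the Chamfer and smoothness terms. In the sketch below, `w_smooth` and its default value are placeholders introduced for illustration; only `w_chamfer` is named in this README.

```python
def total_mesh_loss(loss_chamfer, loss_smooth, w_chamfer=1.0, w_smooth=0.1):
    """Weighted sum of the Chamfer and smoothness terms.

    Works with plain floats or scalar tensors. Lowering w_chamfer shifts the
    balance toward the smoothness term.
    """
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```

With the smoothness weight held fixed, going from `w_chamfer = 1.0` to `0.6` makes the smoothness term relatively more important, which is the effect the comparison below looks for.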

The 3D reconstruction results for models trained with different Chamfer loss weights are shown below:

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_1.0/16_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/16_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/16_gt_mesh.gif" width="200"/>
</div>
<p align="center">w_chamfer = 1.0, sample_no = 16</p>

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_0.6/16_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_0.6/16_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_0.6/16_gt_mesh.gif" width="200"/>
</div>
<p align="center">w_chamfer = 0.6, sample_no = 16</p>

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_1.0/20_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/20_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_1.0/20_gt_mesh.gif" width="200"/>
</div>
<p align="center">w_chamfer = 1.0, sample_no = 20</p>

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_0.6/20_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_0.6/20_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_0.6/20_gt_mesh.gif" width="200"/>
</div>
<p align="center">w_chamfer = 0.6, sample_no = 20</p>

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_0.8/47_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_0.8/47_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_0.8/47_gt_mesh.gif" width="200"/>
</div>
<p align="center">w_chamfer = 0.8, sample_no = 47</p>

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/mesh_0.6/47_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_0.6/47_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/mesh_0.6/47_gt_mesh.gif" width="200"/>
</div>
<p align="center">w_chamfer = 0.6, sample_no = 47</p>

### 2.6 Interpret your model
I create a Jupyter notebook, `interpret_vis.ipynb`, that loads the voxel reconstruction model weights and visualizes the intermediate output after each deconvolution layer.

As mentioned above, the 512-dimensional image feature is projected into a [4 * 4 * 4 * 512] voxel feature so that `nn.ConvTranspose3d` can be applied afterwards. Three deconvolution layers decode this feature, followed by a final convolution layer and a sigmoid that maps the output to [0, 1] as the occupancy grid.

To visualize the feature maps in 2D, I slice each volume at the midpoint of each dimension (half the resolution) and show the resulting 2D heat maps below:

<img src="output/2-reconstructing-3d/interpretation/feat_map=0.png" />
<img src="output/2-reconstructing-3d/interpretation/feat_map=1.png" />
<img src="output/2-reconstructing-3d/interpretation/feat_map=2.png" />
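
The slicing step can be sketched as follows. How the channels are reduced before plotting is not stated here, so averaging over channels is an assumption, and `middle_slices` is a hypothetical helper rather than the notebook's code.

```python
import torch


def middle_slices(feat: torch.Tensor):
    """Return the three axis-aligned middle slices of a feature volume.

    feat: (C, D, H, W) activation after one deconvolution layer.
    Returns three 2D tensors of shape (H, W), (D, W) and (D, H).
    """
    vol = feat.mean(dim=0)                  # collapse channels (assumption)
    d, h, w = (s // 2 for s in vol.shape)   # midpoint index along each axis
    return vol[d], vol[:, h], vol[:, :, w]
```

Each returned slice can then be passed to `matplotlib.pyplot.imshow` to produce a heat map.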

Observations: 

1. Activation patterns: The 3 maps show different activation patterns, suggesting different feature channels focus on different spatial aspects. 

2. Progressive resolution: The feature maps show clear resolution increase, demonstrating the deconvolution network's upsampling capability.

3. Spatial organization: The gradual organization of features suggests effective learning of 3D spatial relationships.

4. Channel specialization: Different channels (maps) capture different aspects of the 3D structure.

### 3.1 Implicit Reconstruction

I drew inspiration from the Convolutional Occupancy Networks paper and implemented the network as a sequence of `ResBlock` modules. The architecture is implemented in `occupancy_network`, and I added the relevant train/eval logic to the existing `train_model.py` and `eval_model.py` files.
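
As a rough illustration of this kind of decoder, the sketch below conditions fully connected residual blocks on the image feature and predicts one occupancy logit per query point. The layer sizes, the additive conditioning, and the class names are assumptions for the example and may differ from what `occupancy_network` actually does.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Fully connected residual block with pre-activation."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(torch.relu(self.fc1(torch.relu(x))))


class OccupancyDecoder(nn.Module):
    """Query points (B, N, 3) + image feature (B, 512) -> logits (B, N)."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256, n_blocks: int = 3):
        super().__init__()
        self.point_proj = nn.Linear(3, hidden_dim)
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.blocks = nn.Sequential(*[ResBlock(hidden_dim) for _ in range(n_blocks)])
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, points: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # Broadcast the per-image feature over all query points.
        h = self.point_proj(points) + self.feat_proj(image_feat).unsqueeze(1)
        h = self.blocks(h)
        return self.out(torch.relu(h)).squeeze(-1)
```

Applying a sigmoid to the logits gives occupancy probabilities, which can be evaluated on a regular grid of query points and thresholded to extract a surface.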

Sample results from the implicit network:

<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/occupancy_1.0/235_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/235_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/235_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Occupancy, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/occupancy_1.0/281_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/281_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/281_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Occupancy, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/occupancy_1.0/423_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/423_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/423_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Occupancy, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/occupancy_1.0/451_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/451_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/451_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Occupancy, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/occupancy_1.0/547_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/547_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/547_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Occupancy, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/occupancy_1.0/569_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/569_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/569_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Occupancy, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/occupancy_1.0/572_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/572_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/572_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Occupancy, Ground Truth</p>
<div style="display: flex; justify-content: center">
  <img src="output/2-reconstructing-3d/occupancy_1.0/594_gt.png" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/594_pred.gif" width="200"/>
  <img src="output/2-reconstructing-3d/occupancy_1.0/594_gt_mesh.gif" width="200"/>
</div>
<p align="center">Input RGB, Predicted Occupancy, Ground Truth</p>

Quantitative results:
<div style="display: flex; justify-content: center">
<img src="output/2-reconstructing-3d/eval/eval_occupancy_1.0.png"/>
</div>
<p align="center">F1 scores at different thresholds</p>