## What Are Git Submodules?

> 📖 Read the full article: [Managing Shared Data Science Code with Git Submodules](https://codecut.ai/managing-shared-data-science-code-git-submodules/)


Git submodules let you embed one Git repository inside another as a subdirectory. Instead of copying code between projects, you reference a specific commit from a shared repository, ensuring all projects use identical code versions.


```
your-project/
├── main.py
└── shared-utils/        # ← Git submodule
    └── features.py
```

Because the parent repository pins the submodule to an exact commit, every team member checks out the same version of the shared code, which prevents version drift between projects.

> 📚 For comprehensive Git fundamentals and production-ready workflows that complement Git submodule techniques, check out [Production-Ready Data Science](https://codecut.ai/production-ready-data-science/).



## Using Git Submodules in Practice

Consider our fintech company with fraud detection, credit scoring, and trading projects that all need shared ML utilities for risk calculation and feature engineering.

The shared `ml-utils` repository contains common ML functions:

```
ml-utils/
├── __init__.py
├── features.py
└── README.md
```

```python
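# features.py
import pandas as pd


def calculate_risk_score(data):
    # Income-to-debt ratio; max(..., 1) avoids division by zero when debt is 0
    return data['income'] / max(data['debt'], 1)


def extract_time_features(df, time_col):
    df = df.copy()  # avoid mutating the caller's DataFrame
    timestamps = pd.to_datetime(df[time_col])
    df['hour'] = timestamps.dt.hour
    df['is_weekend'] = timestamps.dt.dayofweek.isin([5, 6])  # 5 = Saturday, 6 = Sunday
    ...
    return df


def calculate_velocity(df, user_col, time_col):
    df = df.copy()
    # Per-user transaction count; this line alone does not apply a one-hour window
    df['transaction_count_1h'] = df.groupby(user_col)[time_col].transform('count')
    ...
    return df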
```
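
As a quick sanity check of the first helper, the sketch below restates `calculate_risk_score` on its own so it runs without the rest of the module, then applies it to two applicants whose figures are invented for illustration. It shows how `max(data['debt'], 1)` keeps the ratio defined when debt is zero.

```python
def calculate_risk_score(data):
    # Same formula as in features.py: income divided by debt, with debt floored at 1
    return data['income'] / max(data['debt'], 1)


# Hypothetical applicants; the numbers are made up for this example
with_debt = {'income': 60_000, 'debt': 20_000}
debt_free = {'income': 60_000, 'debt': 0}

score_with_debt = calculate_risk_score(with_debt)
score_debt_free = calculate_risk_score(debt_free)

print(score_with_debt)   # 3.0
print(score_debt_free)   # 60000.0
```

With zero debt the score equals the raw income, because the denominator is floored at 1.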

Imagine your fraud detection project looks like this:

```
fraud-detection/
├── main.py
└── README.md
```

To add the shared utilities to your fraud detection project, you can run:

```bash
git submodule add https://github.com/khuyentran1401/ml-utils.git ml-utils
```

This clones the repository into `ml-utils/`, creates a `.gitmodules` file, and stages both for your next commit. Your project structure becomes:

```
fraud-detection/
├── main.py
├── README.md
├── .gitmodules
└── ml-utils/           # ← Submodule directory
    ├── features.py     #   Shared ML functions
    ├── __init__.py
    └── README.md
```

The `.gitmodules` file tracks the submodule configuration:

```text
[submodule "ml-utils"]
    path = ml-utils
    url = https://github.com/khuyentran1401/ml-utils.git
```
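
Because `.gitmodules` uses an INI-like syntax, a script can read it to list which submodules a project declares. The sketch below is one way to do that with Python's standard `configparser`. The helper name `list_submodules` is hypothetical, and `configparser` is not a full Git config parser, so treat this as an approximation that handles simple files like the one above.

```python
import configparser

# Same content as the .gitmodules file shown above
GITMODULES_TEXT = '''\
[submodule "ml-utils"]
    path = ml-utils
    url = https://github.com/khuyentran1401/ml-utils.git
'''


def list_submodules(text):
    """Return {submodule name: {'path': ..., 'url': ...}} from .gitmodules text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    submodules = {}
    for section in parser.sections():
        # Section headers look like: submodule "ml-utils"
        if section.startswith('submodule '):
            name = section[len('submodule '):].strip('"')
            submodules[name] = dict(parser[section])
    return submodules


submodules = list_submodules(GITMODULES_TEXT)
print(submodules)
```

The commit each submodule is pinned to is not stored in this file; Git records it in the parent repository itself.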

Now you can use the shared utilities in your fraud detection pipeline:

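```python
# fraud_detection/train_model.py
from importlib import import_module

# The submodule directory is named `ml-utils`, and a hyphen is not allowed in a
# plain `import` statement, so the module is loaded by its string name here.
# Adding the submodule under an importable path such as `ml_utils` avoids this.
# Assumes the project root (the folder containing `ml-utils/`) is on `sys.path`.
features = import_module("ml-utils.features")
extract_time_features = features.extract_time_features
calculate_velocity = features.calculate_velocity


def prepare_fraud_features(transactions_df):
    # Extract time-based features for fraud detection
    df = extract_time_features(transactions_df, 'transaction_time')

    # Calculate transaction velocity features
    df = calculate_velocity(df, 'user_id', 'transaction_time')

    return df


# Fraud detection model uses consistent utilities
# (`raw_transactions` is assumed to be a DataFrame loaded earlier)
fraud_features = prepare_fraud_features(raw_transactions)
```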

## Team Collaboration

When a new team member joins the fraud detection team, they get the complete setup including shared ML utilities:

```bash
# Clone the fraud detection project with all ML utilities
git clone --recurse-submodules https://github.com/khuyentran1401/fraud-detection.git
cd fraud-detection
```

If you already cloned without `--recurse-submodules`, the `ml-utils` directory will be empty. Initialize the submodules after cloning instead:

```bash
git clone https://github.com/khuyentran1401/fraud-detection.git
cd fraud-detection
git submodule update --init --recursive
```
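
After a plain `git clone`, each submodule directory exists but stays empty until it is initialized. The sketch below shows one way a setup script might detect that, assuming an initialized submodule contains a `.git` entry at its root. The function name `uninitialized_submodules` is hypothetical, and the temporary folder only simulates a checkout.

```python
import tempfile
from pathlib import Path


def uninitialized_submodules(repo_root, submodule_paths):
    """Return the submodule paths that have no `.git` entry, i.e. look uninitialized."""
    root = Path(repo_root)
    return [p for p in submodule_paths if not (root / p / '.git').exists()]


# Simulate a fresh clone: the submodule folder exists but is empty
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'ml-utils').mkdir()
    before_init = uninitialized_submodules(tmp, ['ml-utils'])

    # Simulate what initialization leaves behind: a `.git` file in the submodule
    (Path(tmp) / 'ml-utils' / '.git').write_text('gitdir: ../.git/modules/ml-utils\n')
    after_init = uninitialized_submodules(tmp, ['ml-utils'])

print(before_init)  # ['ml-utils']
print(after_init)   # []
```

If the returned list is not empty, running `git submodule update --init --recursive` populates the missing directories.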

When the shared utilities change upstream, fetch the latest commit from the submodule's remote branch and check it out:

```bash
# Update to latest ML utilities
git submodule update --remote ml-utils
```

This checks out the newer commit in your working copy, but the parent repository still points at the old one until you commit the change. Commit it so teammates get the same version of the utilities:

```bash
# Commit the submodule update
git add ml-utils
git commit -m "Update ML utilities: improved risk calculation accuracy"
```
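
To check whether a submodule's checked-out commit matches what the parent repository has recorded, `git submodule status` prefixes each line with a marker: a space when they match, `+` when the checkout differs from the recorded commit, `-` when the submodule is not initialized, and `U` when it has merge conflicts. The sketch below parses that output. The function name `parse_submodule_status` is hypothetical, the sample line uses a made-up hash rather than real command output, and paths containing spaces are not handled.

```python
STATE_BY_PREFIX = {
    ' ': 'in sync',
    '+': 'differs from recorded commit',
    '-': 'not initialized',
    'U': 'merge conflict',
}


def parse_submodule_status(output):
    """Map each submodule path to (state, sha) from `git submodule status` text."""
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # The first character is the state marker; the rest is "<sha> <path> ..."
        prefix, rest = line[0], line[1:]
        sha, path = rest.split()[:2]
        result[path] = (STATE_BY_PREFIX.get(prefix, 'unknown'), sha)
    return result


# Illustrative output after `git submodule update --remote`, before committing
sample = '+' + 'a' * 40 + ' ml-utils (heads/main)\n'
status = parse_submodule_status(sample)
print(status['ml-utils'][0])  # differs from recorded commit
```

Once the update is staged and committed, the marker returns to a space.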

For comprehensive version control of both code and data in ML projects, see our [DVC guide](https://codecut.ai/introduction-to-dvc-data-version-control-tool-for-machine-learning-projects-2/).


## Managing Submodules Through VS Code

You can manage submodules from [VS Code's Source Control panel](https://code.visualstudio.com/docs/sourcecontrol/overview) instead of the command line:

1. Open your main project folder in VS Code
2. Open the Source Control panel (Ctrl+Shift+G)
3. You'll see separate sections for the main project and each submodule
4. Stage and commit changes in the submodule first
5. Then commit the submodule update in the main project

![VS Code Submodules](https://codecut.ai/wp-content/uploads/2025/07/vscode.png)

The screenshot shows VS Code's independent submodule management:

- ml-utils submodule (top): Has staged changes ready to commit with its own message
- fraud-detection main project (bottom): Shows submodule as changed, waits for submodule commit

## Submodules vs Python Packaging

Python packaging lets you distribute shared utilities as installable packages:

```bash
pip install company-ml-utils==1.2.3
```

This works well for stable libraries with infrequent changes. However, for internal ML utilities that evolve rapidly, packaging creates bottlenecks:

- Requires build/publish workflow for every change
- Slower iteration during active development
- Package source lives in `site-packages`, outside your project, so stepping into or editing utility functions is less convenient
- Stuck with released versions - can't access latest bug fixes until next release

Git submodules take a different approach: the source code lives directly in your project, which gives you immediate access to changes, full debugging visibility, and precise control over the pinned version.