# 🛠️ Essential Tools for MLE and MLOps: Git, DVC, Jenkins, and Docker


## Why these tools?

When you work on data science or machine learning projects as a team, knowing how to train models is not enough. You also need practices that help you:

- Collaborate with other people
- Version both code and data
- Automate processes (such as training and deployment)
- Package your applications so they work anywhere

This is where tools like these come into play:

- **Git** → Version control for code
- **DVC** → Version control for data and models
- **Jenkins** → Automation (CI/CD)
- **Docker** → Create containers so your code runs the same in any environment


## Git: Version control for code

Git is a system that allows you to **keep track of changes in your code**. It's like a "time machine" for your project files.

With Git you can:

- Save versions of your code (like "checkpoints")
- Work on different branches
- Collaborate with others without overwriting each other's work

Best of all, you can push your project to a hosting service such as GitHub, so the whole team works from a shared remote copy.

**Practical example**

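```bash
# Initialize Git in a project
git init

# Add files to version control
git add .

# Save a change (commit)
git commit -m "First commit"

# Name the current branch "main" (a fresh repository may default to "master")
git branch -M main

# Connect to GitHub and push the project
# (replace tu_usuario/tu_repo with your own user and repository)
git remote add origin https://github.com/tu_usuario/tu_repo.git
git push -u origin main
```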

> For example, this is the repository where I upload these notebooks: [GitHub](https://github.com/Molgol/Data_Science/)

## DVC: Version Control for data and models

DVC (Data Version Control) extends Git so you can also version:

- Large **datasets** (which don't fit well in Git)
- **Trained models**
- Processing pipelines

This is especially useful because in ML the data and the experiments change over time, not only the code.


### What does DVC do?

| Command    | What it does                                                                            |
| ---------- | --------------------------------------------------------------------------------------- |
| `dvc init` | Initializes DVC inside an existing Git repository                                       |
| `dvc add`  | Starts tracking a file (e.g., a CSV or a model) and writes a small `.dvc` pointer file  |
| `dvc push` | Uploads the tracked data to the configured remote storage (such as Google Drive or S3)  |
| `dvc pull` | Downloads the tracked data from the remote storage                                      |


**Practical example**

```bash
# Install DVC
pip install dvc

# Initialize DVC (run this inside an existing Git repository)
dvc init
git commit -m "Initialize DVC"

# Add a dataset
dvc add data/train.csv

# DVC creates a .dvc pointer file and a .gitignore entry next to the data;
# both are versioned with Git
git add data/train.csv.dvc data/.gitignore
git commit -m "Track dataset with DVC"
```
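
To make the idea of the `.dvc` pointer file more concrete: what ends up in Git is a small text file that records a content hash of the dataset, while the data itself stays in DVC's cache or remote storage. The sketch below only illustrates that idea with the standard library. `file_md5` is a hypothetical helper, not part of DVC's API, and the exact hashing details inside DVC may differ between versions.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large datasets never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    dataset = Path("data/train.csv")  # same path as in the example above
    if dataset.exists():
        size = dataset.stat().st_size
        print(f"{dataset}: md5={file_md5(dataset)}, size={size} bytes")
    else:
        print(f"{dataset} not found; run this from the project root")
```

If the dataset changes, its hash changes too, so a small edit to the pointer file in Git is enough to mark a new data version.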

## Jenkins: Automating tasks with CI/CD

Jenkins is a tool used to automate tasks. For example:

- **Automatically test your code** when you make a change
- Retrain the model if there is new data
- **Deploy your app** if it passes the tests

This is part of the **CI/CD cycle**:

> CI = Continuous Integration // CD = Continuous Delivery (or Continuous Deployment)


### How does Jenkins work?

- It has a web interface where you can see your "jobs"
- Each job can perform steps such as installing dependencies, running scripts, or deploying
- You can connect it to GitHub so it triggers every time you push code




### Example of `Jenkinsfile`

A configuration file named `Jenkinsfile` is placed at the root of the repository to define what Jenkins should do.

```groovy
pipeline {
  agent any
  stages {
    stage('Install dependencies') {
      steps {
        sh 'pip install -r requirements.txt'
      }
    }
    stage('Run tests') {
      steps {
        sh 'python -m unittest discover tests'
      }
    }
    stage('Train the model') {
      steps {
        sh 'python train.py'
      }
    }
  }
}
```
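
The `Run tests` stage above calls `python -m unittest discover tests`, which collects files matching `test*.py` inside the `tests/` folder. As a rough sketch of what such a file could look like, here is a hypothetical `tests/test_metrics.py`. The `accuracy` helper is an assumption made for illustration and is defined inline to keep the example self-contained; in a real project it would be imported from your own package.

```python
import unittest


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


class TestAccuracy(unittest.TestCase):
    def test_all_correct(self):
        self.assertEqual(accuracy([1, 0, 1], [1, 0, 1]), 1.0)

    def test_half_correct(self):
        self.assertEqual(accuracy([1, 0, 1, 0], [1, 1, 1, 1]), 0.5)

    def test_length_mismatch_raises(self):
        with self.assertRaises(ValueError):
            accuracy([1, 0], [1])
```

When a test fails, the `unittest` command exits with a non-zero status, the `sh` step fails, and Jenkins stops the pipeline before the training stage runs.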

## Docker: Containers to package your app

Docker allows you to **package your application** (with everything it needs to run: Python, libraries, code, etc.) into a **container**.

That container works the same everywhere: on your computer, on a server, or in the cloud.

### What is Docker used for?

- To avoid the classic “it works on my machine” problem
- To deploy APIs in production (it is one of the most widely used ways to do so)
- To serve apps built with frameworks such as FastAPI and Flask

### 📄 Example of `Dockerfile`

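```dockerfile
# 1. Base image with Python
FROM python:3.9

# 2. Working directory
WORKDIR /app

# 3. Install dependencies first so this layer is cached between builds
#    (requirements.txt must include uvicorn)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 4. Copy the rest of the project
COPY . .

# 5. Start-up command: serve the `app` object defined in main.py
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
```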

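Next, build the image and run a container from it.

```bash
# Build the image
docker build -t mi_api_diabetes .

# Run the container (host port 8000 is mapped to container port 80,
# so the API is reachable at http://localhost:8000)
docker run -p 8000:80 mi_api_diabetes
```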
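
The `CMD` line in the Dockerfile expects a module `main.py` that exposes an ASGI application named `app`. As a minimal sketch, the hypothetical `main.py` below uses a bare ASGI callable so it needs nothing beyond the standard library (uvicorn itself still has to be listed in `requirements.txt`). In a FastAPI project, `app = FastAPI()` would play the same role. The JSON response shape is an assumption chosen for illustration.

```python
import json


async def app(scope, receive, send):
    """Minimal ASGI application: answers every HTTP request with a JSON body."""
    if scope["type"] != "http":
        # Per the ASGI spec, raising here signals that other protocols
        # (e.g. lifespan, websocket) are not supported by this sketch.
        raise NotImplementedError(f"unsupported scope type: {scope['type']}")

    body = json.dumps({"status": "ok", "path": scope["path"]}).encode("utf-8")
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [
            (b"content-type", b"application/json"),
            (b"content-length", str(len(body)).encode("ascii")),
        ],
    })
    await send({"type": "http.response.body", "body": body})
```

With the container running, a request to `http://localhost:8000/` should return the JSON status payload.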

## Final summary

These tools are at the core of the modern workflow for machine learning projects:

| Tool        | What is it for?                            |
| ----------- | ------------------------------------------ |
| **Git**     | Version control for code                   |
| **DVC**     | Version control for data and models        |
| **Jenkins** | Automate processes (CI/CD)                 |
| **Docker**  | Package your app to run in any environment |


### What can I do next?

- Connect Jenkins with Docker to automatically deploy the API
- Use DVC + GitHub + CI/CD to retrain the model when there is new data
- Upload the Docker image to a cloud service (e.g., AWS or Render)
- Version experiments and metrics using DVC
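
As a starting point for the last item, DVC can read metrics from small JSON files once they are declared as metrics in a pipeline (for example, to display them with `dvc metrics show`). The sketch below shows one way a training script might write such a file. `write_metrics` and the default file name `metrics.json` are assumptions made for illustration, not something DVC requires.

```python
import json
from pathlib import Path


def write_metrics(metrics, path="metrics.json"):
    """Write a dict of metric names and values as sorted, indented JSON."""
    out = Path(path)
    # Create the parent folder if needed (e.g. "reports/metrics.json").
    out.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(metrics, indent=2, sort_keys=True) + "\n"
    out.write_text(text, encoding="utf-8")
    return out
```

Calling `write_metrics({"accuracy": acc})` at the end of `train.py`, where `acc` is whatever your evaluation computes, produces a stable, diff-friendly file that can be compared across experiments.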