# CIDC_for_Machine_Learning_v1

ML_CICD_Pipeline


https://github.com/Sobhan-Mohammadi/ML_CICD_Pipeline.git

This guide walks through setting up a full CI/CD pipeline for a machine learning project using GitHub Actions, Docker, and DVC. It covers the project structure, the necessary files with code snippets, and instructions for pushing the project to GitHub and configuring the pipeline.

---

## **Project Overview**

This project is designed to predict house prices using a machine learning model. We will create a CI/CD pipeline that automates:
- Data fetching and preprocessing
- Model training
- Model evaluation
- Model deployment
- Continuous integration with automated testing
- Continuous deployment using Docker and GitHub Actions

## **Project Structure**

Here’s the directory structure for the project:

```
HousePricePrediction/
├── data/
│   ├── raw/                    # Raw data files
│   └── processed/              # Processed data files
├── models/                     # Trained models
├── notebooks/                  # Jupyter notebooks (optional)
├── scripts/                    # Python scripts
│   ├── data_preprocessing.py   # Preprocesses the data
│   ├── train_model.py          # Trains the model
│   ├── evaluate_model.py       # Evaluates the model
│   └── deploy_model.py         # Deploys the model (Flask app)
├── tests/                      # Unit tests
│   ├── test_data_preprocessing.py
│   ├── test_train_model.py
│   └── test_evaluate_model.py
├── .github/
│   └── workflows/
│       └── ci_cd_pipeline.yml  # GitHub Actions configuration
├── Dockerfile                  # Docker configuration
├── Makefile                    # Makefile for common commands
├── README.md                   # Project description
├── dvc.yaml                    # DVC pipeline configuration
├── params.yaml                 # Parameters and configuration
├── requirements.txt            # Python dependencies
└── setup.py                    # Python package configuration
```
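
Later steps (the DVC stages, the Dockerfile, and the workflow) refer to these paths by name, so it can help to check the layout before running anything. The following is a minimal sketch, not part of the original repository; `EXPECTED_PATHS` and `missing_paths` are illustrative names chosen here, and the list covers only a subset of the tree above.

```python
from pathlib import Path

# Paths taken from the directory tree above; trim or extend as needed.
EXPECTED_PATHS = [
    "data/raw",
    "data/processed",
    "models",
    "scripts/data_preprocessing.py",
    "scripts/train_model.py",
    "scripts/evaluate_model.py",
    "scripts/deploy_model.py",
    "tests",
    ".github/workflows/ci_cd_pipeline.yml",
    "Dockerfile",
    "dvc.yaml",
    "params.yaml",
    "requirements.txt",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]
```

Calling `missing_paths(".")` from the project root returns an empty list once every listed file and directory is in place.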

## **Step-by-Step Guide**

### **Step 1: Set Up the Project Locally**

1. **Create and Activate Virtual Environment:**
    ```bash
    python3 -m venv venv
    source venv/bin/activate
    ```

2. **Install Dependencies** (run this once `requirements.txt` has been populated in Step 2):
    ```bash
    pip install -r requirements.txt
    ```

3. **Initialize Git Repository:**
    ```bash
    git init
    ```

4. **Create Initial Files and Directory Structure:**
    ```bash
    mkdir -p data/raw data/processed models notebooks scripts tests .github/workflows
    touch params.yaml dvc.yaml README.md requirements.txt setup.py Dockerfile Makefile
    touch .github/workflows/ci_cd_pipeline.yml
    ```

### **Step 2: Create and Populate Files**

1. **`requirements.txt`** - Python dependencies:

    ```text
    pandas
    numpy
    scikit-learn
    dvc
    pytest
    flask
    gunicorn
    pyyaml
    joblib
    ```

2. **`params.yaml`** - Configuration parameters:

    ```yaml
    data_path: data/raw/housing.csv
    processed_data_path: data/processed/processed_housing.csv
    model_path: models/house_price_model.pkl

    train:
      test_size: 0.2
      random_state: 42
      n_estimators: 100
      max_depth: 5
    ```

3. **Python Scripts in `scripts/`:**

    - **`data_preprocessing.py`:**

        ```python
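        import os

        import pandas as pd
        import yaml

        def load_config(config_path):
            with open(config_path, 'r') as file:
                return yaml.safe_load(file)

        def load_data(config_path):
            # Read the raw CSV from the path configured in params.yaml
            config = load_config(config_path)
            return pd.read_csv(config['data_path'])

        def preprocess_data(data):
            # Fill missing values in numeric columns with the column mean;
            # non-numeric columns are left unchanged
            return data.fillna(data.mean(numeric_only=True))

        if __name__ == "__main__":
            config_path = 'params.yaml'
            output_path = load_config(config_path)['processed_data_path']
            processed_data = preprocess_data(load_data(config_path))
            # Create the output directory if needed, then write to the configured path
            os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
            processed_data.to_csv(output_path, index=False)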
        ```

    - **`train_model.py`:**

        ```python
        import os

        import joblib
        import pandas as pd
        import yaml
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split

        def train_model(config_path):
            with open(config_path, 'r') as file:
                config = yaml.safe_load(file)

            data = pd.read_csv(config['processed_data_path'])
            X = data.drop('price', axis=1)
            y = data['price']

            # Hold out a test split; only the training portion is used to fit the model
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=config['train']['test_size'], random_state=config['train']['random_state']
            )

            model = RandomForestRegressor(
                n_estimators=config['train']['n_estimators'],
                max_depth=config['train']['max_depth'],
                random_state=config['train']['random_state'],  # makes training reproducible
            )
            model.fit(X_train, y_train)

            # Create the output directory if it does not exist yet
            os.makedirs(os.path.dirname(config['model_path']) or '.', exist_ok=True)
            joblib.dump(model, config['model_path'])

        if __name__ == "__main__":
            train_model('params.yaml')
        ```

    - **`evaluate_model.py`:**

        ```python
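        import joblib
        import pandas as pd
        import yaml
        from sklearn.metrics import mean_squared_error
        from sklearn.model_selection import train_test_split

        def evaluate_model(config_path):
            with open(config_path, 'r') as file:
                config = yaml.safe_load(file)

            model = joblib.load(config['model_path'])
            data = pd.read_csv(config['processed_data_path'])
            X = data.drop('price', axis=1)
            y = data['price']

            # Recreate the split used in train_model.py (same test_size and
            # random_state) so the model is scored on rows it was not trained on
            _, X_test, _, y_test = train_test_split(
                X, y, test_size=config['train']['test_size'], random_state=config['train']['random_state']
            )

            predictions = model.predict(X_test)
            mse = mean_squared_error(y_test, predictions)

            print(f"Model Evaluation: Mean Squared Error = {mse}")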

        if __name__ == "__main__":
            evaluate_model('params.yaml')
        ```

    - **`deploy_model.py`:**

        ```python
        from flask import Flask, request, jsonify
        import joblib
        import pandas as pd

        app = Flask(__name__)

        model = joblib.load('models/house_price_model.pkl')

        @app.route('/predict', methods=['POST'])
        def predict():
            data = request.get_json()
            input_data = pd.DataFrame(data, index=[0])
            prediction = model.predict(input_data)
            return jsonify({'prediction': prediction[0]})

        if __name__ == "__main__":
            app.run(host='0.0.0.0', port=8000)
        ```

4. **Unit Tests in `tests/`:**

    - **`test_data_preprocessing.py`:**

        ```python
        import pytest
        import pandas as pd
        from scripts.data_preprocessing import preprocess_data

        def test_preprocess_data():
            data = pd.DataFrame({
                'feature1': [1, 2, 3, None],
                'feature2': [4, None, 6, 7]
            })
            processed_data = preprocess_data(data)
            
            assert processed_data.isnull().sum().sum() == 0
        ```

    - **`test_train_model.py`:**

        ```python
        import pytest
        import os
        from scripts.train_model import train_model

        def test_train_model():
            train_model('params.yaml')
            assert os.path.exists('models/house_price_model.pkl')
        ```
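
The scripts above index into `params.yaml` directly, so a missing or misspelled key only surfaces as a `KeyError` partway through a run. A small validation helper can catch that earlier. This is a sketch under assumptions rather than part of the original project: `validate_params` and the two key tuples are hypothetical names, and the checks cover only the keys shown in the `params.yaml` example.

```python
# Required keys mirror the params.yaml shown earlier in this guide.
REQUIRED_TOP_LEVEL = ("data_path", "processed_data_path", "model_path", "train")
REQUIRED_TRAIN = ("test_size", "random_state", "n_estimators", "max_depth")


def validate_params(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in config:
            problems.append(f"missing top-level key: {key}")

    train = config.get("train")
    if not isinstance(train, dict):
        # Without a mapping under `train`, its keys cannot be checked.
        if "train" in config:
            problems.append("`train` must be a mapping")
        return problems

    for key in REQUIRED_TRAIN:
        if key not in train:
            problems.append(f"missing train key: {key}")

    test_size = train.get("test_size")
    if isinstance(test_size, float) and not 0.0 < test_size < 1.0:
        problems.append("train.test_size as a float must be between 0 and 1")
    return problems


# Example with the values from params.yaml, written as a plain dict.
example = {
    "data_path": "data/raw/housing.csv",
    "processed_data_path": "data/processed/processed_housing.csv",
    "model_path": "models/house_price_model.pkl",
    "train": {"test_size": 0.2, "random_state": 42, "n_estimators": 100, "max_depth": 5},
}
print(validate_params(example))  # prints []
```

In the scripts, the same function could be called on the result of `yaml.safe_load(file)` right after the file is read, raising an error if the returned list is not empty.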

### **Step 3: Docker Configuration**

1. **`Dockerfile`:**

    ```Dockerfile
    # Use an official Python runtime as a parent image
    FROM python:3.8-slim

    # Set the working directory
    WORKDIR /app

    # Copy the current directory contents into the container
    COPY . /app

    # Install any needed packages specified in requirements.txt
    RUN pip install --no-cache-dir -r requirements.txt

    # Make port 8000 available to the world outside this container
    EXPOSE 8000

    # Run the application
    CMD ["gunicorn", "--bind", "0.0.0.0:8000", "scripts.deploy_model:app"]
    ```

2. **Build and Run Docker Container:**

    ```bash
    docker build -t house-price-prediction .
    docker run -p 8000:8000 house-price-prediction
    ```
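
Once the container is running, the `/predict` endpoint defined in `deploy_model.py` can be exercised with a small client. The sketch below uses only the standard library and assumes the container is reachable at `http://localhost:8000`. The helper names and the feature names `area` and `bedrooms` are hypothetical; replace the features with the columns your model was trained on.

```python
import json
import urllib.request

# Assumed address: the container started above, published on local port 8000.
PREDICT_URL = "http://localhost:8000/predict"


def build_predict_request(features, url=PREDICT_URL):
    """Build a POST request carrying one row of features as a flat JSON object."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, url=PREDICT_URL, timeout=10):
    """Send the request and return the `prediction` field of the JSON reply."""
    request = build_predict_request(features, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["prediction"]


# Hypothetical feature names; use the columns your model was trained on.
sample = {"area": 120.0, "bedrooms": 3}
request = build_predict_request(sample)
print(request.get_method(), request.full_url)
```

The payload is a flat JSON object because the Flask handler builds a one-row `DataFrame` from it, and the `Content-Type` header is set so that `request.get_json()` parses the body. Calling `predict(sample)` performs the actual request and requires the container to be up.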

### **Step 4: DVC Configuration**

1. **`dvc.yaml`:**

    ```yaml
    stages:
      preprocess:
        cmd: python scripts/data_preprocessing.py
        deps:
          - scripts/data_preprocessing.py
          - data/raw/housing.csv
          - params.yaml
        outs:
          - data/processed/processed_housing.csv

      train:
        cmd: python scripts/train_model.py
        deps:
          - scripts/train_model.py
          - data/processed/processed_housing.csv
          - params.yaml
        outs:
          - models/house_price_model.pkl

      evaluate:
        cmd: python scripts/evaluate_model.py
        deps:
          - scripts/evaluate_model.py
          - models/house_price_model.pkl
          - data/processed/processed_housing.csv
          - params.yaml
    ```

2. **Initialize and Run DVC:**

    ```bash
    dvc init
    dvc add data/raw/housing.csv
    dvc repro
    ```
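
`dvc repro` reruns a stage only when one of its `deps` has changed since the last run. The sketch below illustrates that idea with content hashes. It is a simplified stand-in written for this guide, not DVC's actual implementation, and `file_digest`, `snapshot`, and `changed_deps` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(deps):
    """Map each existing dependency path to its current digest."""
    return {str(p): file_digest(p) for p in deps if Path(p).is_file()}


def changed_deps(previous, deps):
    """List dependencies whose digest differs from the recorded snapshot."""
    current = snapshot(deps)
    return sorted(p for p in current if previous.get(p) != current[p])
```

Taking a `snapshot` of a stage's dependencies after a successful run and passing it to `changed_deps` later returns an empty list while nothing has been edited, and the names of the edited files otherwise. This is why editing `params.yaml` causes every stage that lists it under `deps` to run again.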

### **Step 5: GitHub Actions CI/CD Pipeline**

1. **Create a GitHub Repository:**
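    - Go to GitHub and create a new, empty repository (without a README or `.gitignore`).
    - There is no need to clone it: the local project from Step 1 is connected to it with `git remote add` in the next step.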

2. **Push the Local Project to GitHub:**

    ```bash
    git add .
    git commit -m "Initial commit"
    git remote add origin https://github.com/yourusername/HousePricePrediction.git
    git push -u origin master
    ```

3. **GitHub Actions Workflow - `.github/workflows/ci_cd_pipeline.yml`:**

    ```yaml
    name: CI/CD Pipeline

    on:
      push:
        branches:
          - master
      pull_request:
        branches:
          - master

    jobs:
      build:
        runs-on: ubuntu-latest

        steps:
        - name: Checkout code
          uses: actions/checkout@v4

        - name: Set up Python
          uses: actions/setup-python@v5
          with:
            python-version: '3.8'

        - name: Install dependencies
          run: |
            python -m venv venv
            source venv/bin/activate
            pip install --upgrade pip
            pip install -r requirements.txt

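        - name: Run tests
          run: |
            source venv/bin/activate
            # `python -m pytest` puts the repository root on sys.path so that
            # the tests can import the `scripts` package
            python -m pytest tests/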

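        - name: Set up Docker Buildx
          uses: docker/setup-buildx-action@v3

        # Secrets are not exposed to pull requests from forks, so the
        # remaining steps run only on pushes to master.
        - name: Log in to DockerHub
          if: github.event_name == 'push'
          uses: docker/login-action@v3
          with:
            username: ${{ secrets.DOCKER_USERNAME }}
            password: ${{ secrets.DOCKER_PASSWORD }}

        - name: Build and push Docker image
          if: github.event_name == 'push'
          run: |
            docker build . -t ${{ secrets.DOCKER_USERNAME }}/house-price-prediction
            docker push ${{ secrets.DOCKER_USERNAME }}/house-price-prediction

        # user@server_ip is a placeholder; the runner also needs an SSH key
        # that the target server accepts.
        - name: Deploy to server
          if: github.event_name == 'push'
          run: |
            ssh user@server_ip 'docker pull ${{ secrets.DOCKER_USERNAME }}/house-price-prediction && docker run -d -p 8000:8000 ${{ secrets.DOCKER_USERNAME }}/house-price-prediction'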
    ```

4. **Set Up GitHub Secrets:**
    - In your GitHub repository, open **Settings → Secrets and variables → Actions** and add `DOCKER_USERNAME` and `DOCKER_PASSWORD`, plus any credentials the deploy step needs to reach your server.
    - Ensure the secret names match those referenced in your GitHub Actions workflow file exactly.

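5. **Test the Workflow:**
    - Push changes to the `master` branch and verify that the CI/CD pipeline runs successfully on GitHub. Note that `test_train_model.py` reads the processed dataset, so the run fails unless that file is available in the CI checkout.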
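
GitHub Actions substitutes an empty string for a secret that has not been defined, so a missing `DOCKER_USERNAME` or `DOCKER_PASSWORD` tends to show up only as a login failure. A preflight check can make the cause explicit. The sketch below assumes the workflow step passes the secrets to the script as environment variables through an `env:` mapping; `missing_secrets` and `check_secrets` are hypothetical helpers, not part of the original project.

```python
import os

# Secret names referenced by the workflow above.
REQUIRED_SECRETS = ("DOCKER_USERNAME", "DOCKER_PASSWORD")


def missing_secrets(environ=None, required=REQUIRED_SECRETS):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def check_secrets(environ=None):
    """Raise a descriptive error if any required secret is missing."""
    missing = missing_secrets(environ)
    if missing:
        raise RuntimeError("Missing required secrets: " + ", ".join(missing))
```

Calling `check_secrets()` at the start of a deployment script stops the run with a message that names the missing secrets instead of failing later during `docker login`.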

### **Step 6: Final Steps**

1. **README.md:**

    ```markdown
    # House Price Prediction

    This project is a machine learning pipeline to predict house prices. The project uses a CI/CD pipeline for continuous integration and continuous deployment with Docker and GitHub Actions.

    ## Project Structure

    - **data/**: Contains raw and processed data.
    - **models/**: Contains trained models.
    - **scripts/**: Python scripts for data preprocessing, model training, evaluation, and deployment.
    - **tests/**: Unit tests for the scripts.
    - **.github/**: GitHub Actions CI/CD pipeline configuration.
    - **Dockerfile**: Docker configuration for the project.
    - **Makefile**: Makefile for common commands.
    - **dvc.yaml**: DVC pipeline configuration.
    - **params.yaml**: Hyperparameters and configuration settings.

    ## Setup

    1. Clone the repository.
    2. Create a virtual environment: `python3 -m venv venv`
    3. Activate the environment: `source venv/bin/activate`
    4. Install dependencies: `pip install -r requirements.txt`
    5. Run the pipeline: `dvc repro`

    ## CI/CD Pipeline

    The CI/CD pipeline is set up using GitHub Actions. It runs the tests, builds a Docker image, and deploys it to a server.

    ## Usage

    - Preprocess data: `python scripts/data_preprocessing.py`
    - Train the model: `python scripts/train_model.py`
    - Evaluate the model: `python scripts/evaluate_model.py`
    ```

2. **Makefile:**

    ```Makefile
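    # Recipe lines must be indented with a tab character, not spaces.
    .PHONY: install test run deploy

    install:
        pip install -r requirements.txt

    test:
        python -m pytest tests/

    run:
        python scripts/train_model.py

    deploy:
        docker build -t house-price-prediction .
        docker run -p 8000:8000 house-price-prediction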
    ```

### **Conclusion**

This setup covers everything from creating the project structure to deploying the ML model through a CI/CD pipeline built on GitHub Actions. Once the secrets and server access are configured, each push to the `master` branch runs the tests, builds and pushes the Docker image, and deploys it to the server.