# CI/CD for Machine Learning – Personal Notes

## 1. Introduction
This course covers **Continuous Integration (CI)** and **Continuous Delivery/Deployment (CD)** techniques tailored for machine learning workflows.

---

## 2. Software Development Life Cycle (SDLC)

### Definition
SDLC is a structured process for developing, deploying, and maintaining software applications.

### Key Stages
- **Build**: Compile source code into executable form.
- **Test**: Validate functionality and quality.
- **Deploy**: Release software into target environments.

---

## 3. SDLC in Machine Learning

### Unique Challenges
- ML models evolve with data; they are not static algorithms.
- Data engineering is resource-intensive: includes collection, transformation, storage, and serving.
- Integration with SDLC requires automation for speed and quality.

### Benefits of CI/CD in ML
- Streamlines delivery of high-quality ML software.
- Enables rapid prototyping and testing.
- Facilitates algorithm and hyperparameter exploration.
- Improves decision-making through faster iteration.

**Reference**: [Google Cloud – ML Lifecycle](https://cloud.google.com/blog/products/ai-machine-learning/making-the-machine-the-machine-learning-lifecycle)

---

## 4. What is CI/CD?

### Continuous Integration (CI)
- Automatically builds and tests code on integration into a shared repo.
- Prevents integration issues and ensures code stability.

### Continuous Delivery (CD)
- Automates delivery of code to production-like environments.
- Requires manual approval before deployment.

### Continuous Deployment (CD)
- Fully automates release to production without manual intervention.

---

## 5. CI/CD in Machine Learning

### Key Differences from Traditional Software
- ML = Code + Data -> both must be versioned.
- Experimentation requires tracking model performance and configurations.
- Reproducibility demands versioning of data, models, and code.

### Testing in ML CI
- Goes beyond unit tests: includes data preprocessing, training, and evaluation.
- Ensures pipeline reliability and model quality.

### Deployment Considerations
- More complex than traditional software.
- Requires:
  - Model serving infrastructure
  - Performance monitoring
  - Update management and rollback strategies

---

## 6. Course Scope

Focus areas:
- Data preparation and versioning
- Model development and evaluation
- Hyperparameter tuning
- CI/CD integration across these stages

---

## 7. Summary

### SDLC Workflow
- Build -> Test -> Deploy

### CI/CD Benefits in ML
- **CI**: Frequent code merging, early bug detection
- **CD (Delivery)**: Manual approval before release
- **CD (Deployment)**: Fully automated release

### ML-Specific Enhancements
- Data/model versioning for reproducibility
- Automation for experimentation
- Full pipeline testing
- Reliable and rapid deployment

Continuous deployment automatically releases every code change to production. Continuous delivery prepares every change for release but keeps a manual approval step before deployment.

![image.png](attachment:3e1638c9-d138-42da-9357-d31700bafc0d.png)

Generic Workflow


# YAML for CI/CD in Machine Learning – Personal Notes

## 1. What is YAML?

### Definition
- YAML stands for **"YAML Ain't Markup Language"**
- A human-readable data serialization format used for:
  - Configuration files
  - Data exchange
  - Structured data representation

### Comparison
- Alternative to XML
- Comparable to JSON in functionality
- Designed for readability and simplicity

### Usage in CI/CD
- YAML is the backbone of configuration in tools like:
  - **GitHub Actions** (workflow orchestration)
  - **DVC** (pipeline stages and metadata)
- File extensions: `.yaml` or `.yml`

---

## 2. YAML Syntax

### Structure Rules
- Uses **indentation** and **line separation** to define hierarchy
- Indentation is space-based (no tabs allowed)
- Syntax errors often stem from inconsistent spacing

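```yaml
name: Santosh
occupation: Instructor
# a full-line comment is valid
programming_languages: R, Python # an inline comment is valid too
# nested keys need a parent key, consistent indentation, and a space after each colon
skills:
  python: advanced
  javascript: advanced
```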

### Best Practices
- Use YAML-aware IDEs with validation support
- Comments begin with `#` and are ignored during parsing

---

## 3. YAML Scalars

### Supported Scalar Types
- **String**: quoted or unquoted; `"Rustam"` and `Rustam` are both strings
- **Number**: integer or float
- **Boolean**: `true` / `false` (unquoted)
- **Null**: `null` or `~`

### Notes
- Booleans and nulls must not be quoted to retain type
- Strings can be wrapped in `'single'` or `"double"` quotes when needed

---

## 4. YAML Collections

### Sequences (Lists)
- Ordered elements
- **a. Block style**: uses hyphens
  ```yaml
  - item1
  - item2
  - item3
  ```
- **b. Flow style**: uses brackets
  ```yaml
  [item1, item2, item3]
  ```

### Mappings (Key-Value Pairs)
- Uniquely keyed values
- Syntax:
  ```yaml
  key1: value1
  key2:
    - nested1
    - nested2
  key3: [val1, val2, val3]
  ```

![image.png](attachment:f089ce70-4ec8-479a-8f0b-a56108899283.png)


![image.png](attachment:9fa891f4-202e-42ca-be75-9262776af238.png)


# GitHub Actions (GHA) – Personal Notes

## 1. What is GitHub Actions?
![image.png](attachment:1742f301-4612-4a96-930c-eafad069f3b9.png)
### Definition
- GitHub Actions (GHA) is GitHub’s built-in automation and CI/CD system.
- Enables automation of build, test, and deployment pipelines directly within GitHub repositories.
- A **pipeline** is a sequence of interconnected steps representing the flow of work and data.

### Analogy
- Similar to a car assembly line: each step performs a specific task (e.g., attach engine, paint).
- In GHA, each step automates a part of the software development lifecycle.

**Reference**: [Medium – CI/CD with GitHub Actions for Android](https://medium.com/empathyco/applying-ci-cd-using-github-actions-for-android-1231e40cc52f)

---

## 2. Core Components of GitHub Actions

### Event
- An **event** triggers the execution of a workflow.
- Examples:
  - `push` to a branch
  - `pull_request` opened
  - `issue` created

### Workflow
- A **workflow** is a YAML-defined automated process.
- Stored in `.github/workflows/` directory.
- Can be triggered by:
  - Events
  - Manual triggers
  - Scheduled intervals
- Multiple workflows can exist in a repo:
  - One for testing PRs
  - One for deployment
  - One for issue labeling

### Steps and Actions
- A **step** is a unit of work executed in sequence.
- Steps share the same runner and can pass data between them.
- Examples:
  - Build application
  - Run tests
  - Execute shell scripts
- An **action** is a reusable application that performs a task.
  - Examples: `actions/checkout`, auto-commenting on PRs

![image.png](attachment:f1a3e195-9a37-435d-a37f-827024b89bfd.png)

### Jobs and Runners
- A **job** is a set of steps.
- Jobs are independent and can run in parallel.
- Jobs can be configured with dependencies.
- All steps in a job run on the same **runner** (compute machine).

---
![image.png](attachment:08478645-eea1-4b6a-a595-ce3aedde42b1.png)

## 3. Example Workflow

### Trigger
- A `push` event initiates the workflow.

### Job
- Runs on an Ubuntu Linux runner.

### Steps
```yaml
- name: Checkout code
  uses: actions/checkout@v3

- name: Run Python app
  run: python app.py
```

![image.png](attachment:7b91431b-6067-4613-a0aa-ac558b56ad8f.png)


![image.png](attachment:760a7523-1175-42a7-b9bd-9fbc12a62494.png)


You can also make one job depend on another job by using the `needs` key, so that it starts only after the other job has finished.
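To make the idea of job dependencies concrete, here is a small Python sketch of how a dependency graph turns into an execution order. It is only an illustration, not how GitHub schedules jobs internally; the job names (`build`, `test`, `deploy`, `lint`) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs, each mapped to the jobs it depends on
# (the role the `needs` key plays in a workflow file).
job_needs = {
    "build": set(),
    "lint": set(),
    "test": {"build"},
    "deploy": {"test"},
}


def execution_waves(needs):
    """Group jobs into waves; jobs in the same wave have no unmet dependencies."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(job_needs)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Jobs in the same wave are independent of each other, which is why they could run in parallel; a dependent job only appears in a later wave.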

# Intermediate YAML 

## 1. Overview

To work effectively with GitHub Actions and other CI/CD tools, a deeper understanding of YAML is required—especially for handling multiline strings, dynamic values, and multi-document structures.

![image.png](attachment:0765f84a-5ab6-4c2f-a796-49bd18aa3f88.png)

---

## 2. Multiline Strings: Block Scalar Format

### Purpose
- Used to represent multi-line strings with preserved formatting.
- Common in:
  - Shell commands
  - Log messages
  - Configuration blocks

### Styles
- **Literal (`|`)**: preserves line breaks and indentation exactly.
- **Folded (`>`)**: collapses line breaks into spaces for wrapped text.

---

## 3. Literal Style (`|`)

### Behavior
- Maintains all line breaks and indentation.
- Ideal for shell scripts or formatted logs.

![image.png](attachment:4a6bc64e-bd27-469f-ac78-303c0e7de61b.png)

### Example
```yaml
script: |
  echo "Starting process"

    indented line
  echo "Done"
```

---

## 4. Folded Style (`>`)

### Behavior
- Converts single line breaks into spaces.
- Preserves blank lines and indented blocks.

![image.png](attachment:3678756a-9751-420f-9c89-b8596fa39c73.png)


![image.png](attachment:6f3e88ff-e229-4776-be75-0b51552c096b.png)


### Example
```yaml
message: >
  This is a long message
  that will be folded into
  a single paragraph.
```

---

## 5. Chomping Indicators

### Purpose
- Control how trailing newlines are handled in block scalars.
- Added directly after the style indicator (`|` or `>`).

### Modes
- **Clip** (default, no symbol needed): keeps one newline at the end.
- **Strip** (`-`): removes all trailing newlines.
![image.png](attachment:b3e30fdd-a7c0-4a8e-9ebe-d9522b250b06.png)

- **Keep** (`+`): retains all trailing newlines.

![image.png](attachment:dff853e9-481b-4681-819f-5b8dfe22aba3.png)

### Example
```yaml
log: |-
  Line one
  Line two
```
```yaml
log: |+
  Line one
  Line two
```
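The three chomping modes can be mimicked on a plain Python string. This is a hedged emulation of the rule as described in these notes, not a YAML parser; the function name `apply_chomping` is made up for illustration, and it assumes the block body is passed in with its trailing newlines intact.

```python
def apply_chomping(body: str, indicator: str = "") -> str:
    """Emulate trailing-newline handling of a YAML block scalar body."""
    without_trailing = body.rstrip("\n")
    if indicator == "-":  # strip: drop every trailing newline
        return without_trailing
    if indicator == "+":  # keep: leave all trailing newlines untouched
        return body
    # clip (default): exactly one trailing newline when there is any content
    return without_trailing + "\n" if without_trailing else ""


body = "Line one\nLine two\n\n\n"
stripped = apply_chomping(body, "-")
clipped = apply_chomping(body)
kept = apply_chomping(body, "+")

for label, value in [("strip", stripped), ("clip", clipped), ("keep", kept)]:
    print(f"{label:5} -> {value!r}")
```

Printing with `!r` makes the trailing `\n` characters visible, which is the only difference between the three results.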

## 6. Dynamic Value Injection

### Description
- Not part of the standard YAML spec.
- Used by specific tools to inject runtime values.
- Syntax: `${{ expression }}` or `$ENV_VAR`

### Use Cases
- Referencing environment variables
- Accessing config values from other YAML sections

### Example
```yaml
host: ${{ secrets.host_url }}
database: ${{ secrets.DB_URL }}
```

Note: Support depends on the tool (e.g., GitHub Actions, Helm).

## 7. Multi-Document YAML

### Purpose
- Store multiple independent YAML documents in one file.
- Useful for grouping related configs or metadata.

### Syntax
- Use `---` to separate documents.

### Example
```yaml
---
name: Alice
age: 30
---
name: Bob
age: 40
occupation: Engineer
---
name: Carol
age: 25
```

### References
- yaml-multiline.info

![image.png](attachment:2306b283-94f8-4bdc-8ecd-3328096e4b50.png)

```python
import yaml  # PyYAML

# safe_load_all yields one Python object per YAML document in the file
with open("yaml_practice/demo.yaml", "r") as f:
    docs = yaml.safe_load_all(f)
    for i, doc in enumerate(docs):
        print(f"Document {i+1}:")
        print(doc)
```


Document 1:
{'string_value': 'hello world', 'integer_value': 42, 'float_value': 3.14, 'boolean_true': True, 'boolean_false': False, 'null_value_1': None, 'null_value_2': None, 'tools': ['GitHub Actions', 'DVC', 'MLflow', 'Great Expectations'], 'mlops_stack': ['GitHub Actions', 'DVC', 'MLflow', 'Great Expectations'], 'pipeline_stage': {'name': 'model_training', 'duration_minutes': 45, 'dependencies': ['data_preprocessing', 'feature_engineering']}, 'bash_script': '#!/bin/bash\necho "Starting model training"\n\n  python train.py --epochs 50\necho "Training complete"\n', 'deployment_notes': 'This deployment includes updated model weights, revised preprocessing logic, and improved monitoring hooks for production stability.\n', 'note_strip': 'This string has no trailing newlines.\nAnother line is added.', 'note_keep': 'This string keeps all trailing newlines.\nAnother line is added.\n\n', 'current_date': datetime.datetime(2024, 6, 15, 12, 0, tzinfo=datetime.timezone.utc), 'database_url': '${

# CI/CD with GitHub Actions

## 1. What is GitHub Actions?

GitHub Actions is GitHub’s built-in automation system for CI/CD. It allows you to define workflows that build, test, and deploy your code based on events in your repository.

- **Workflow**: A YAML-defined automation pipeline
- **Event**: Triggers the workflow (e.g., `push`, `pull_request`)
- **Job**: A set of steps executed on a runner
- **Step**: A single task (e.g., run script, checkout code)
- **Runner**: The compute machine that executes jobs

---

## 2. Anatomy of a GitHub Actions Workflow

### Minimal Example
```yaml
name: CI

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Run script
        run: |
          echo "Running script.py"
          python3 script.py
```

- `run: |` uses the literal block style to preserve line breaks.
- Jobs run in parallel unless dependencies are defined.

## 3. Setting Up GitHub Actions

### Steps
1. Create a repo at github.com/new.
2. Add a Python `.gitignore` and a license.
3. Navigate to Settings > Actions and enable permissions.
4. In the Actions tab, pick a simple workflow; committing it creates the workflow YAML file.

![image.png](attachment:cf2c32b5-c4e6-4c93-9c92-9f1e353104db.png)

5. Create `.github/workflows/ci.yaml`.
6. Commit and push to trigger the workflow.

![image.png](attachment:0def0dd9-1e3a-4304-b86b-302c7ce5dc65.png)

A new workflow run then shows up in the Actions tab.

![image.png](attachment:980f3ebf-51ec-4d75-b45d-cbfd6371b3cc.png)

## 4. Inspecting Workflow Runs
- Go to the Actions tab.
- Click the workflow name.
- View job logs and step outputs.

Click the workflow run that was just added. It contains the `build` job, and opening that job shows its full description.

![image.png](attachment:834e730a-8f01-4305-9308-fbd0d835d652.png)

The job view lists the output logs of each step executed by GitHub Actions.




![image.png](attachment:11b0f4f4-101b-459d-a992-e0785e29c08b.png)






# GitHub Actions – Pull Request Triggered CI Pipeline

Topics covered: branching, workflow configuration, action syntax, and log inspection.

---

## 2. Shared Repository Model


![image.png](attachment:1a8dbbe6-341d-4808-86f7-3a89f45b214e.png)

In collaborative development:
- Developers work on **feature/topic branches**
- Changes are merged via **pull requests (PRs)**
- CI/CD tools run tests on PR creation:
  - Code quality
  - Security vulnerabilities
  - Compatibility checks

This early feedback ensures high-quality code before merging into `main`.

---

## 3. Creating a Feature Branch

### Steps
1. Go to your repo’s landing page
2. Click **Branch**
3. Click **New branch**
4. Name it (e.g., `pr-workflow`)
5. Confirm it’s active

---

## 4. Adding Repository Code

Create a simple Python script:
```python
# hello_world.py
import datetime

print("Hello, World!")
print("Current time:", datetime.datetime.now())
```
## 5. Configuring Workflow Trigger
Update your workflow YAML to trigger on pull requests:

```yaml
name: PR Workflow

on:
  pull_request:
    branches: [main]
```

This runs the workflow when a PR targets the main branch.


## 6. GitHub Actions Syntax
### Action Format
```yaml
uses: org_or_user/repo_name@version
```

### With Arguments
```yaml
with:
  argument_name: value
```
Think of actions as functions and the keys under `with` as their arguments.

## 7. Workflow Steps and Actions
### Full Example
```yaml
name: PR Workflow

on:
  pull_request:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Run script
        run: |
          echo "Running hello_world.py"
          python3 hello_world.py
```

## 8. Creating a Pull Request

### Steps
1. Commit the workflow to `pr-workflow`.
2. Open a PR from `pr-workflow` → `main`.
3. GitHub Actions triggers automatically.

## 9. Inspecting Workflow Logs
1. Go to the Actions tab.
2. Click the workflow run.
3. Click the job (e.g., `build`).
4. View the logs for each step:
   - Checkout repository
   - Setup Python
   - Run script

You'll see output from `hello_world.py`, confirming successful execution.




## 2. Shared Repository Model
In collaborative environments:
- Multiple developers work on the same repository simultaneously.
- Feature or topic branches are created to organize related work.
- When work is complete, developers open a pull request for code review.
- CI/CD tools run tests automatically on PR creation, checking for:
  - Code quality
  - Security vulnerabilities
  - Compatibility with other components  
This early feedback ensures issues are caught before merging into the main branch, maintaining high-quality code.

---

## 3. Creating a Feature Branch
- Navigate to the repository landing page.
- Click on **Branch** → **New branch**.
- Provide a name (e.g., `pr-workflow`).
- Confirm that the new branch is active.

---

## 4. Adding Repository Code
- Add a simple Python script to the branch.
- The script prints a “Hello World” message and the current time.
- Commit the file as `hello_world.py`.

---

## 5. Configuring Workflow Event
- Modify the workflow trigger from `on: push` to `on: pull_request`.
- The `branches` key specifies the target branch of the PR (in this case, `main`).
- This ensures the workflow runs when a PR is opened against `main`.

---

## 6. Actions Syntax
- Actions are defined in workflow steps under the `uses` key.
- Syntax: `organization/repository@version`.
- Arguments can be passed using the `with` key.
- Actions function like reusable modules, with parameters acting as inputs.
- Many ready-to-use Actions are available in the GitHub Marketplace.

---

## 7. Configuring Workflow Steps and Actions
- To run repository code, two key steps are required:
  - **Checkout**: Uses the `checkout` action to retrieve repository code.
  - **Setup Python**: Uses the `setup-python` action to configure the Python environment.
- Each action specifies its version (e.g., `@v3`, `@v4`) and arguments such as `python-version`.

---

## 8. Putting It Together
- The workflow now triggers on pull requests.
- Steps include:
  - Checking out the repository
  - Setting up Python
  - Running the Python script  
This creates a complete CI pipeline for PR validation.

---

## 9. Creating a Pull Request
- Commit workflow changes to the `pr-workflow` branch.
- Open a PR from `pr-workflow` → `main`.
- The workflow triggers immediately upon PR creation.
- Status can be inspected via the **Details** link.

---

## 10. Inspecting Workflow Logs
- Logs show execution of all steps:
  - Repository checkout
  - Python setup
  - Script execution  
The successful run confirms that the workflow is correctly configured.

Put the following in the workflow YAML file. Whenever a pull request targets `main`, it sets up Python and runs the script on GitHub:

```yaml
name: PR

on:
  pull_request:
    branches: ["main"]  # a space is required after the colon

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: setup python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: run test.py
        run: |
          echo "Starting test.py execution"
          python cicd_assets/test.py
          echo "Finished test.py execution"
```

For this to work, the branch (`pr-workflow` in this case) must contain the workflow file and the Python file to execute, here `cicd_assets/test.py`.

The YAML above sets up the Python environment and runs the script. The workflow only runs once a pull request is opened from the branch, and its log is reachable from the pull request page. I had the script print the date and some log messages, and the log showed them as expected.
 
 
![image.png](attachment:2d6bb4ee-4243-45b8-93ed-83b3181d3150.png)
 



## 2. Contexts in GitHub Actions
GitHub provides predefined **contexts**—structured data available during a workflow run. These contexts allow dynamic behavior based on the environment or event.

### Common Contexts:
- `github`: Information about the repository, workflow, and event.
- `env`: Variables defined at workflow, job, or step level.
- `secrets`: Encrypted values like API keys or tokens.
- `job`: Metadata about the current job.
- `runner`: Details about the runner executing the job.

Access syntax: `${{ github.actor }}`, `${{ secrets.MY_SECRET }}`, `${{ env.MY_VAR }}`  
Reference: [GitHub Contexts Documentation](https://docs.github.com/en/actions/learn-github-actions/contexts)
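As a rough mental model of the `${{ ... }}` access syntax, the sketch below resolves dotted paths against nested dictionaries. It is a toy emulation in Python, not GitHub's expression engine; the context values (`octocat`, `staging`) and the `render` helper are made up for illustration.

```python
import re

# Matches ${{ some.dotted.path }} with optional inner whitespace.
EXPRESSION = re.compile(r"\$\{\{\s*([\w.]+)\s*\}\}")


def render(template, contexts):
    """Replace each ${{ path }} with the value found by walking `contexts`."""

    def lookup(match):
        value = contexts
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError for an unknown context or key
        return str(value)

    return EXPRESSION.sub(lookup, template)


# Hypothetical context data for the demonstration.
contexts = {
    "github": {"actor": "octocat"},
    "env": {"MY_VAR": "staging"},
}

rendered = render("Deployed by ${{ github.actor }} to ${{ env.MY_VAR }}", contexts)
print(rendered)
```

The point is only that a context is structured data and an expression is a lookup into it; real expressions also support operators and functions that this sketch ignores.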

---

## 3. Environment Variables
- Used for non-sensitive data (e.g., compiler flags, usernames).
- Declared using the `env` keyword.
- Scope can be workflow-wide, job-specific, or step-specific.
- Accessed via `${{ env.VARIABLE_NAME }}`.

---

## 4. Secrets
- Used for sensitive data (e.g., passwords, API keys).
- Encrypted and masked in logs.
- Accessed via `${{ secrets.SECRET_NAME }}`.
- Can be passed as environment variables or action inputs.
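Log masking can be pictured as a simple string replacement over every log line. The Python sketch below is an assumption-laden illustration of that idea, not GitHub's actual implementation; the secret name, its value, and the `mask_secrets` helper are hypothetical.

```python
def mask_secrets(line, secret_values):
    """Replace every occurrence of a secret value with a placeholder."""
    # Longest values first, so a secret containing another one is masked whole.
    for value in sorted(secret_values, key=len, reverse=True):
        if value:  # never replace the empty string
            line = line.replace(value, "***")
    return line


# Hypothetical secret store for the demonstration.
secrets = {"API_KEY": "s3cr3t-token"}

log_line = "curl -H 'Authorization: s3cr3t-token' https://example.com"
masked = mask_secrets(log_line, secrets.values())
print(masked)
```

Masking of this kind only hides exact matches, which is one reason to avoid printing transformed versions of a secret (for example, encoded or truncated).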

---

## 5. Setting Secrets in GitHub
To add a repository-level secret:
- Go to the repository → Settings → Secrets and Variables → Actions.
- Click the **Secrets** tab → **New repository secret**.

![image.png](attachment:33254306-d10e-47e4-b31d-7ac2b4a82e06.png)

- Provide a name and value → Click **Add secret**.


![image.png](attachment:8aa55c67-c5c0-4880-83b3-ed0674643501.png)


---

## 6. GITHUB_TOKEN Secret
- Built-in secret automatically available in every workflow.
- Enables interaction with GitHub API:
  - Clone repository
  - Open/close issues and PRs
  - Comment on issues and PRs
- Permissions are auto-configured based on the event.
- Can be scoped using:
  ```yaml
  permissions:
    pull-requests: write
  ```

---

## 7. Example: Commenting on a Pull Request
Use the `thollander/actions-comment-pull-request` action to post comments via GitHub Actions.

![image.png](attachment:cff53530-03f0-4b4f-ae82-7641940b5d95.png)

Workflow snippet:
```yaml
jobs:
  comment-on-pr:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - name: Comment on PR
        uses: thollander/actions-comment-pull-request@v2
        with:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          message: |
            Hello world! This is an automated comment.
```

This posts a comment on the pull request using the GitHub Actions bot.

## Topic: Model Training with GitHub Actions and CML

---

## 1. Overview
We explored how to automate machine learning model training using GitHub Actions, with a focus on CI/CD integration via Continuous Machine Learning (CML).

---

## 2. Dataset: Weather Prediction in Australia
- Source: [Kaggle – Weather Data](https://www.kaggle.com/datasets/rever3nd/weather-data)
- Task: Binary classification – predict whether it will rain tomorrow.
- Features:
  - 5 categorical: location, wind directions, rain today, etc.
  - 17 numerical: temperature, wind gusts, rainfall amount, etc.

---

## 3. Modeling Workflow
- Convert categorical features to numerical (target encoding).
- Impute missing values (mean strategy).
- Scale features to zero mean and unit variance.
- Split into train/test sets.
- Train a `RandomForestClassifier` with fixed hyperparameters.
- Report metrics: precision, recall, accuracy, F1 score.

> No hyperparameter tuning yet — deferred to later stages.

---

## 4. Data Preparation: Target Encoding
- Reference: [Target Encoding Blog](https://maxhalford.github.io/blog/target-encoding/)
- Strategy:
  - Replace each categorical value with its average target value.
  - Useful for high-cardinality features.
  - Avoids complexity of one-hot encoding.
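The strategy above can be sketched in plain Python. This is a minimal illustration, not the course's preprocessing code: the function name `target_encode` and the sample locations and labels are made up, and in practice the category means would be computed on the training split only to avoid leaking the target.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {category: totals[category] / counts[category] for category in totals}
    encoded = [means[category] for category in categories]
    return encoded, means


# Hypothetical sample: location per day and whether it rained the next day (1 = yes).
locations = ["Sydney", "Perth", "Sydney", "Perth", "Sydney", "Sydney"]
rain_tomorrow = [1, 0, 0, 0, 1, 1]

encoded, means = target_encode(locations, rain_tomorrow)
print(means)
print(encoded)
```

Each categorical column becomes a single numeric column regardless of how many distinct values it has, which is what keeps it simpler than one-hot encoding for high-cardinality features.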

---

## 5. Imputing and Scaling
- Impute missing values using mean.
- Scale features using `impute_and_scale_data` function:
  - Zero mean
  - Unit standard deviation
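For a single numeric column, mean imputation followed by standardization could look like the sketch below. It is a hypothetical stand-in, not the body of the course's `impute_and_scale_data` function (which is not shown in these notes); missing values are assumed to be represented as `None`, and the population standard deviation is an assumption as well.

```python
from statistics import fmean, pstdev


def impute_and_scale_column(column):
    """Fill missing values with the column mean, then scale to zero mean and unit std."""
    observed = [value for value in column if value is not None]
    mean = fmean(observed)
    filled = [mean if value is None else value for value in column]
    std = pstdev(filled)  # population standard deviation (an assumption)
    if std == 0:
        return [0.0 for _ in filled]  # constant column: nothing to scale
    return [(value - mean) / std for value in filled]


# Hypothetical temperature readings with one missing value.
temperatures = [10.0, None, 14.0, 12.0]
scaled = impute_and_scale_column(temperatures)
print(scaled)
```

An imputed entry equals the mean, so it lands exactly at zero after scaling.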

---

## 6. Model Training
- Split data using `train_test_split` from scikit-learn.
- Train using `RandomForestClassifier`, chosen because it:
  - Usually reaches high accuracy
  - Is fairly robust to overfitting, since it averages many trees
  - Handles large feature sets

---

## 7. Evaluation Metrics
- Reference: [Scikit-learn Classification Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)
- Metrics reported:
  - Accuracy
  - Precision
  - Recall
  - F1 Score

---

## 8. Confusion Matrix Plot
- Visualized as a heatmap.
- Cells show:
  - True Positive
  - False Positive
  - True Negative
  - False Negative
- Diagonal cells = correct predictions.
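The four confusion matrix cells are enough to derive all the reported metrics. The sketch below shows this for binary labels in plain Python; the labels are made-up sample data and the helper names are illustrative, not taken from the course's `metrics_and_plots.py`.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion matrix cells for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and F1 from the four cells."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = rain tomorrow, 0 = no rain.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, fp, tn, fn)
print((tp, fp, tn, fn), metrics)
```

Accuracy uses the diagonal cells (correct predictions) over everything, while precision and recall each focus on one kind of error.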

---

## 9. GitHub Actions Workflow
- Trigger: Pull request from feature branch → main.
- Tool: [Continuous Machine Learning (CML)](https://cml.dev/)
- Purpose:
  - Provision runner
  - Train and evaluate model
  - Compare experiments
  - Monitor dataset changes
  - Auto-generate visual report in PR
![image.png](attachment:f074fdda-fb03-4611-a739-3f2acb4c3c9a.png)

> Reference: [Feature Branch Workflow](https://martinfowler.com/bliki/FeatureBranch.html)

---

## 10. CML Commands in Workflow
- Use `setup-cml` GitHub Action.
- Run training code as shell command.
- Read outputs:
  - `results.txt`
  - Graph image
- Write to markdown file.
- Use:
  ```bash
  cml comment create report.md
  ```
  to post results in the pull request.
- The GitHub token is passed as an environment variable to enable commenting.

---

## 11. Output
- When a PR is opened, the workflow runs.
- CML posts a comment with:
  - Evaluation metrics
  - Confusion matrix plot
  - Summary of model performance


# Training a Model in GitHub Actions
The repository needs the following files and folders:


```
processed_dataset/
raw_dataset/
  weather.csv
metrics_and_plots.py
model.py
preprocess_dataset.py
train.py
utils_and_constants.py
```

All of these must be present in the repository.

GitHub executes them through a workflow file (`blank.yaml` in `.github/preprocess/` in my notes; GitHub only picks up workflows stored under `.github/workflows/`) that contains the commands to run preprocessing and training. The run produces the confusion matrix as a PNG, a preprocessed CSV in `processed_dataset/`, and a `.json` file with the accuracy metric, which `train.py` already computes.


With everything in place, the workflow YAML below sets up the environment, runs the scripts, prints debug information, and posts the report as a comment:
```
name: model-training

on:
  pull_request:
    branches: ['main']

permissions: write-all

jobs:
  train_and_report_eval_performance:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout 
        uses: actions/checkout@v3


      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Install dependencies
        run: |
           python3 -m pip install --upgrade pip
           python3 -m pip install --no-cache-dir --force-reinstall -r requirements.txt
      - name: Debug Python environment
        run: |
          which python3
          python3 --version
          which pip
          pip --version

      # Setup CML GitHub Action
      - name: Setup CML
        uses: iterative/setup-cml@v1
          
      - name: Train model
        run: |
          python3 preprocess_dataset.py
          python3 train.py

      - name: Write CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Add metrics data to markdown
          cat metrics.json >> model_eval_report.md
          
          # Add confusion matrix plot to markdown
          echo "![confusion matrix plot](./confusion_matrix.png)" >> model_eval_report.md

          # Create comment from markdown report
          cml comment create model_eval_report.md
```
After the pull request is opened, the following appears.

GitHub Actions running the automated job:

![image.png](attachment:8e3fe875-746e-46dc-be7b-561a43aa8a5f.png)

Model training inside the GitHub Actions run:

![image.png](attachment:07165680-8a2e-4913-b7e3-77ae37f31540.png)

File structure:

![image.png](attachment:e72337f5-bb19-4740-ba8e-3677e5aabe25.png)

The bot posts a comment using the token, with the accuracy metric logged:

![image.png](attachment:d85ff604-8eef-4657-9b12-48e288cf34f6.png)

The confusion matrix plot, the accuracy JSON, and the preprocessed dataset are not visible in the repository because the workflow never explicitly exposed them as artifacts. Still, this is the gist of the CI/CD pipeline: the code runs on GitHub's virtualized runner.


# Versioning Datasets with DVC

## 1. Why Data Versioning Matters
- Enables reproducible model training and experimentation
- Supports collaboration across teams using consistent datasets
- Helps diagnose performance degradation from specific data versions
- Tracks data changes that may require retraining
- Maintains audit trails for regulated industries (e.g., finance, healthcare)

## 2. What Is DVC?
- DVC (Data Version Control) is an open-source tool for managing datasets and ML artifacts
- Integrates with Git: code tracked by Git, data tracked by DVC
- Enables unified versioning of both code and data

## 3. DVC Storage Setup
- Data is stored remotely (e.g., SSH, cloud, local disk)
- Metadata is tracked via Git; actual data lives in external storage
- Install DVC via `pip install dvc`

## 4. Initializing DVC
- `git init` initializes Git
- `dvc init` initializes DVC

`dvc init` creates:
- `.dvc/config` → DVC settings
- `.gitignore` → excludes DVC cache from Git
- `.dvc/tmp/` → temporary files and logs

    
![image.png](attachment:056407f2-a75a-47ec-877c-53fba9b001f6.png)

## 5. Adding Files to DVC
- `dvc add data.csv` generates `data.csv.dvc`, a metadata file with:
  - `outs`: output file path
  - `md5`: checksum for change tracking
  - `size`: file size
  - `hash`: hash type (e.g., MD5)
  - `path`: relative path to data file
- DVC stores actual data in `.dvc/cache/`, keeping Git lightweight.

## 6. Summary
- DVC enables reproducible, collaborative, and auditable ML workflows
- Tracks data changes without bloating your Git repo
- Use `dvc init` and `dvc add <file>` to start versioning datasets

Hands-on run: `pip install dvc`, then `dvc init`:

![image.png](attachment:f55e8b30-5544-452c-be9b-abdbf0d26256.png)


`dvc add dataset_file.extension`

This is the syntax for tracking a dataset with DVC. The dataset must not already be tracked by Git before it is added to DVC: the two conflict, so only one of them can keep track of the data.

![image.png](attachment:2a3b830f-99be-4c1f-bc60-a48da33101a3.png)

The `.dvc` file shows the MD5 checksum; if the dataset changes, the checksum changes too.
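The link between file content and checksum can be reproduced with Python's `hashlib`. This only illustrates the principle that any change to the bytes changes the MD5 digest; it does not claim to reproduce exactly how DVC computes or stores its hashes, and the sample CSV bytes are made up.

```python
import hashlib


def md5_checksum(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


# Hypothetical dataset content before and after appending one row.
original = b"date,rain_tomorrow\n2024-01-01,0\n"
modified = original + b"2024-01-02,1\n"

checksum_before = md5_checksum(original)
checksum_after = md5_checksum(modified)

print("before:", checksum_before)
print("after: ", checksum_after)
print("changed:", checksum_before != checksum_after)
```

Identical content always yields the same digest, which is what lets a small metadata file stand in for a large dataset in Git.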




## 1. Why Use DVC Remotes

Git-based platforms like GitHub impose storage limits that make tracking large datasets impractical. DVC remotes solve this by enabling external storage for data and models. Benefits include:

- Syncing large files and directories tracked by DVC
- Centralizing data for collaboration
- Archiving multiple versions of datasets and models
- Reducing local storage usage
- Supporting cloud providers (AWS, GCP, Azure) and on-prem storage (SSH, HTTP)

---

## 2. Setting Up DVC Remotes

DVC remotes are configured using the `dvc remote` command. These settings are stored in `.dvc/config`.

### Example: Add an S3 remote

```bash
dvc remote add -d myAWSremote s3://mybucket/path
```

This command:
- Adds a remote named `myAWSremote`
- Sets it as the default remote (`-d` flag)
- Updates the `.dvc/config` file

### Modify remote settings

```bash
dvc remote modify myAWSremote access_key_id <your-access-key>
dvc remote modify myAWSremote secret_access_key <your-secret-key>
```

Use `dvc remote modify` to customize credentials or configuration options.

---

## 3. Local Remotes and Default Configuration

Local remotes are useful for testing, learning, or when external storage is not needed. You can use:

- Local directories
- Mounted drives
- Network-attached storage (NAS)

### Example: Add a local remote

```bash
dvc remote add -d localremote /mnt/data/dvc-storage
```

This sets `localremote` as the default remote. All `dvc push`, `dvc pull`, and `dvc fetch` commands will use it unless overridden.

---

## 4. Uploading and Retrieving Data

Use the following commands to sync data between your workspace and the remote:

### Push data to remote

```bash
dvc push
```

### Pull data from remote

```bash
dvc pull
```

You can target specific files or use the default cache. To specify a remote explicitly:

```bash
dvc push -r myAWSremote
```

---

## 5. Tracking Data Changes

When a data file changes, follow these steps:

1. Stage the change with DVC:
   ```bash
   dvc add data.csv
   ```
2. Stage and commit the `.dvc` metadata file:
   ```bash
   git add data.csv.dvc
   git commit -m "Update data.csv version"
   ```
3. Push metadata to Git:
   ```bash
   git push
   ```
4. Push updated data to the remote:
   ```bash
   dvc push
   ```

![image.png](attachment:9c87f3a7-38a2-48e6-add6-b97c44fcd19a.png)

```
stages:
  preprocess:
    cmd: python3 preprocess_dataset.py
    deps:
    - preprocess_dataset.py
    - raw_dataset/weather.csv
    - utils_and_constants.py
    outs:
    - processed_dataset/weather.csv
  train:
    cmd: python3 train.py
    deps:
    - metrics_and_plots.py
    - model.py
    - processed_dataset/weather.csv
    - train.py
    - utils_and_constants.py
    metrics:
      # Specify the metrics file as target
      - metrics.json:
          cache: false
    plots:
      # Set the target to the file containing predictions data
      - predictions.csv:
          # Write the plot template
          template: confusion_normalized
          x: predicted_label
          y: true_label
          x_label: 'Predicted label'
          y_label: 'True label'
          title: Confusion matrix
          # Set the cache parameter to store
          # plot data in git repository
          cache: false
```


- **metrics target** → `metrics.json` (the metrics file written by training).
- **plot target** → `predictions.csv` (the file containing predicted vs. true labels).
- **template** → `confusion_normalized` (renders a normalized confusion matrix).
- **cache** → `false` (keeps the file out of the DVC cache so that it is tracked in Git instead).
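
As a rough illustration of what a normalized confusion matrix contains, the sketch below builds one from label pairs using only the standard library. The rows are made-up stand-ins for `predictions.csv`, and normalizing per true label (each row sums to 1) is an assumption; the DVC template may normalize differently.

```python
from collections import Counter

# Hypothetical (true_label, predicted_label) pairs, standing in for predictions.csv rows.
rows = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 0), (1, 1), (1, 1), (0, 0)]

counts = Counter(rows)
labels = sorted({label for pair in rows for label in pair})

# Normalize each row (true label) so that its cells sum to 1.
normalized = {}
for true in labels:
    row_total = sum(counts[(true, pred)] for pred in labels)
    for pred in labels:
        normalized[(true, pred)] = counts[(true, pred)] / row_total if row_total else 0.0

for true in labels:
    print(true, [normalized[(true, pred)] for pred in labels])
```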


![image.png](attachment:6bd7cbc4-1fdb-459f-b846-4f713854a9e4.png)

Change the random forest depth hyperparameter from 2 to 4 on the `train` branch, then run `dvc repro` to execute the pipeline.
![image.png](attachment:0222e922-a113-474b-9674-f7ac188b1eb9.png)

Then observe how the change affects the accuracy score.

![image.png](attachment:4e6e1e45-b53a-4854-9571-218f5e019759.png)


Finally, compare the metrics of the `train` branch against `main` with `dvc metrics diff main --md | tee metrics_diff.md`.
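
To show the kind of comparison this produces, here is a small sketch that diffs two metric dictionaries and renders a Markdown table. It is an illustration only: the metric values are made up and the column layout is an assumption, not DVC's exact output format.

```python
def metrics_diff_markdown(old: dict, new: dict) -> str:
    """Render a Markdown table comparing two flat metric dictionaries."""
    lines = ["| Metric | main | workspace | Change |", "|---|---|---|---|"]
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name), new.get(name)
        # A dash marks metrics that are missing on one side.
        change = "-" if a is None or b is None else f"{b - a:+.4f}"
        lines.append(f"| {name} | {a} | {b} | {change} |")
    return "\n".join(lines)


main_metrics = {"accuracy": 0.9, "f1_score": 0.5}     # hypothetical values
branch_metrics = {"accuracy": 0.95, "f1_score": 0.5}  # hypothetical values
table = metrics_diff_markdown(main_metrics, branch_metrics)
print(table)
```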






## 1. The Need for a Data Pipeline

- Versioning raw data files alone is insufficient for ML tasks.
- Real-world ML requires preprocessing: filtering, cleaning, and transforming data before training.
- Not all steps need to be rerun. For example:
  - Changing model hyperparameters does not require rerunning preprocessing.
- Pipelines allow chaining tasks such as:
  - Raw dataset → Preprocessing → Processed dataset → Training → Evaluation → Artifacts (plots, metrics).
- These steps form a **Directed Acyclic Graph (DAG)**, where each stage depends on the outputs of previous stages.
![image.png](attachment:aef2b4f2-3339-4112-afc5-c2f479aacaad.png)
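
The point that a hyperparameter change should not rerun preprocessing can be sketched as follows. This is a toy model of the idea, not DVC's implementation: each stage is rerun only when the fingerprints of its dependencies differ from the ones recorded last time. Stage names follow the notes; the dependency contents are made up.

```python
import hashlib


def fingerprint(deps: dict) -> dict:
    """Map each dependency name to the MD5 of its content."""
    return {name: hashlib.md5(content).hexdigest() for name, content in deps.items()}


def run_if_changed(stage: str, deps: dict, lock: dict) -> bool:
    """Run the stage only when its dependency fingerprints differ from the lock."""
    current = fingerprint(deps)
    if lock.get(stage) == current:
        return False  # nothing changed: skip the stage
    lock[stage] = current  # record the new state, as a lock file would
    return True


lock = {}
preprocess_deps = {"raw_dataset/weather.csv": b"raw rows", "preprocess_dataset.py": b"code v1"}
train_deps = {"train.py": b"code v1", "params": b"max_depth=2"}

first = (run_if_changed("preprocess", preprocess_deps, lock),
         run_if_changed("train", train_deps, lock))

train_deps["params"] = b"max_depth=4"  # change a hyperparameter only
second = (run_if_changed("preprocess", preprocess_deps, lock),
          run_if_changed("train", train_deps, lock))

print(first, second)  # both stages run first; only train reruns afterwards
```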
---

## 2. What Is a DVC Pipeline?

- A DVC pipeline is a sequence of stages defined in `dvc.yaml`.
- Each stage includes:
  - **deps**: Input data and scripts (e.g., preprocessing or training code).
  - **cmd**: Execution command (e.g., running a Python script).
  - **outs**: Output artifacts (e.g., processed dataset).
  - **metrics/plots**: Special outputs for evaluation and visualization.
- Pipelines are similar to GitHub Actions workflows but tailored for ML tasks.
- Pipelines can be executed directly inside GitHub Actions for CI/CD integration.

---

## 3. Defining Pipeline Stages

- Use `dvc stage add` to create stages in `dvc.yaml`.
- Example: Preprocessing stage
  - `-n` → stage name
  - `-d` → dependencies
  - `-o` → outputs
  - `cmd` → command to run
- Each stage is recorded in `dvc.yaml` with its dependencies and outputs.

---

## 4. Dependency Graphs

- Stages can depend on outputs of previous stages.
- Example:
  - **preprocess** → generates processed data
  - **train** → depends on processed data
- Together, these form a dependency graph (DAG).
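
A dependency graph like this determines a valid execution order. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to derive that order for the two stages named above; it illustrates the DAG concept and is not how DVC is implemented.

```python
from graphlib import TopologicalSorter

# Map each stage to the set of stages it depends on.
stage_deps = {
    "preprocess": set(),
    "train": {"preprocess"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(stage_deps).static_order())
print(order)
```

Because `train` depends on `preprocess`, the sorter always places `preprocess` first.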

---

## 5. Reproducing Pipelines

- Run pipelines with:

  ```bash
  dvc repro
  ```

- Executes dependent stages in order.
- Creates a `dvc.lock` file:
  - Captures the current pipeline state.
  - Should be committed to Git for reproducibility.

---

## 6. Cached Results

- DVC uses caching to skip unchanged steps.
- If dependencies remain the same, rerunning the pipeline will not re-execute those stages.
- This saves time in complex DAGs.

---

## 7. Visualizing Pipelines

- Use `dvc dag` to visualize the pipeline structure.
- Displays stages and their dependencies as a graph.

Reference: DVC DAG Command

![image.png](attachment:540186f2-57ad-4240-9761-488d2991f4bf.png)

![image.png](attachment:948ed8f9-0f23-4702-a237-dcff2f57b152.png)


Running the shell commands shown in the images above creates the `dvc.yaml` file automatically. The exercise steps are:

1. Design the `print` stage with `print.sh` as a dependency, `pages` as output, and `./print.sh` as the command.
2. Design the `scan` stage with `scan.sh` and `pages` as dependencies, `signed.pdf` as output, and `./scan.sh` as the command.
3. Verify that `dvc.yaml` is written correctly.
4. Visualize the pipeline with `dvc dag`.

![image.png](attachment:2f6a6352-6e42-43e2-8308-eebf66ff5f6b.png)


```
.
├── raw_dataset/
│   └── weather.csv                 # Raw input dataset
├── processed_dataset/
│   └── weather.csv                 # Preprocessed dataset (DVC tracked)
├── metrics.json                    # Final evaluation metrics
├── predictions.csv                 # True vs Predicted labels
├── roc_curve.csv                   # ROC curve data points
├── confusion_matrix.png            # Confusion matrix plot
├── metrics_compare.md              # Auto-generated PR report
├── requirements.txt                # Python dependencies
├── dvc.yaml                        # DVC pipeline definition
├── dvc_cml.yaml                    # GitHub Actions + CML workflow
├── utils_and_constants.py          # Constants & utilities
├── preprocess_dataset.py           # Data preprocessing logic
├── model.py                        # Model training & evaluation
├── metrics_and_plots.py            # Save metrics/plots
└── train.py                        # Main training script
```

# utils_and_constants.py – Configuration & Utilities
```
# Standard libraries for file and directory operations.
import shutil
from pathlib import Path

# Possible dataset splits (not used in the current pipeline).
DATASET_TYPES = ["test", "train"]
# Columns to drop from the raw data (e.g., Date is non-predictive).
DROP_COLNAMES = ["Date"]
# Target variable: whether it will rain tomorrow (Yes/No).
TARGET_COLUMN = "RainTomorrow"
# Path to the raw input dataset.
RAW_DATASET = "raw_dataset/weather.csv"
# Path where the preprocessed dataset is saved.
PROCESSED_DATASET = "processed_dataset/weather.csv"
# Hyperparameter: maximum depth of the trees in the random forest.
RFC_FOREST_DEPTH = 4


def delete_and_recreate_dir(path):
    """Delete a directory (if present) and recreate it empty.

    Used to ensure clean output folders before saving new files.
    """
    try:
        shutil.rmtree(path)
    except OSError:
        # The directory may not exist yet; it is created below either way.
        pass
    finally:
        Path(path).mkdir(parents=True, exist_ok=True)
```

# preprocess_dataset.py – Data Preprocessing Pipeline
```
from typing import List

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from utils_and_constants import (
    DROP_COLNAMES,
    PROCESSED_DATASET,
    RAW_DATASET,
    TARGET_COLUMN,
)


def read_dataset(filename: str, drop_columns: List[str], target_column: str) -> pd.DataFrame:
    """Read the CSV, drop irrelevant columns (Date), and map the target Yes/No to 1/0."""
    df = pd.read_csv(filename).drop(columns=drop_columns)
    df[target_column] = df[target_column].map({"Yes": 1, "No": 0})
    return df


def target_encode_categorical_features(
    df: pd.DataFrame, categorical_columns: List[str], target_column: str
) -> pd.DataFrame:
    """Target-encode categorical features.

    Each category is replaced with the mean target value for that category,
    which converts all categorical columns into numerical ones.
    """
    encoded_data = df.copy()
    for col in categorical_columns:
        encoding_map = df.groupby(col)[target_column].mean().to_dict()
        encoded_data[col] = encoded_data[col].map(encoding_map)
    return encoded_data


def impute_and_scale_data(df_features: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values with the column mean, then standard-scale.

    Returns the scaled DataFrame (zero mean, unit variance) with the original column names.
    """
    imputer = SimpleImputer(strategy="mean")
    X_preprocessed = imputer.fit_transform(df_features.values)
    scaler = StandardScaler()
    X_preprocessed = scaler.fit_transform(X_preprocessed)
    return pd.DataFrame(X_preprocessed, columns=df_features.columns)


def main():
    """Full preprocessing workflow."""
    # 1. Load the raw data, drop Date, and encode the target.
    weather = read_dataset(
        filename=RAW_DATASET, drop_columns=DROP_COLNAMES, target_column=TARGET_COLUMN
    )
    # 2. Identify the categorical columns and apply target encoding.
    categorical_columns = weather.select_dtypes(include=[object]).columns.to_list()
    weather = target_encode_categorical_features(
        df=weather, categorical_columns=categorical_columns, target_column=TARGET_COLUMN
    )
    # 3. Impute and scale the features.
    weather_features_processed = impute_and_scale_data(weather.drop(columns=TARGET_COLUMN))
    # 4. Reattach the target column and save the processed dataset.
    weather_labels = weather[TARGET_COLUMN]
    weather = pd.concat([weather_features_processed, weather_labels], axis=1)
    weather.to_csv(PROCESSED_DATASET, index=False)


# Allows the script to be run independently.
if __name__ == "__main__":
    main()
```
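
To make target encoding easier to follow, here is a dependency-free sketch of the same idea: each category is replaced by the mean of the target within that category. The rows (wind direction, rain flag) are invented for illustration and do not come from the dataset.

```python
from collections import defaultdict

# Hypothetical rows: (wind direction, rained the next day as 1/0).
rows = [("N", 1), ("N", 0), ("S", 1), ("S", 1), ("E", 0)]

# Accumulate the sum of targets and the row count per category.
totals = defaultdict(lambda: [0, 0])
for category, target in rows:
    totals[category][0] += target
    totals[category][1] += 1

# The encoding map holds the mean target per category.
encoding_map = {cat: total / count for cat, (total, count) in totals.items()}
encoded = [encoding_map[category] for category, _ in rows]
print(encoding_map)
print(encoded)
```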

# model.py – Model Training & Evaluation
```
import json

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from utils_and_constants import RFC_FOREST_DEPTH


def train_model(X_train, y_train):
    """Fit a RandomForestClassifier and return the trained model.

    Uses max_depth=RFC_FOREST_DEPTH, n_estimators=5, and a fixed random_state=1993.
    """
    model = RandomForestClassifier(
        max_depth=RFC_FOREST_DEPTH, n_estimators=5, random_state=1993
    )
    model.fit(X_train, y_train)
    return model


def evaluate_model(model, X_test, y_test, float_precision=4):
    """Return rounded metrics, predictions, and predicted probabilities."""
    # Predict labels and probabilities.
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)

    # Compute the key classification metrics.
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store the metrics in a dictionary.
    metrics = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }
    # Round all float values to float_precision (default: 4).
    metrics = json.loads(
        json.dumps(metrics), parse_float=lambda x: round(float(x), float_precision)
    )
    return metrics, y_pred, y_proba
```
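
The rounding step relies on a JSON round trip: `json.loads` calls `parse_float` on the text of every float it decodes, so passing a rounding function there rounds all float values in one pass. A minimal standalone sketch, with made-up metric values and a hypothetical helper name `round_metrics`:

```python
import json


def round_metrics(metrics: dict, float_precision: int = 4) -> dict:
    """Round every float in a JSON-serializable dict via a dumps/loads round trip."""
    return json.loads(
        json.dumps(metrics),
        parse_float=lambda text: round(float(text), float_precision),
    )


rounded = round_metrics({"accuracy": 0.912345678, "recall": 0.5})
print(rounded)
```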

# metrics_and_plots.py – Save Results & Visualizations
```
import json

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay, roc_curve


def plot_confusion_matrix(model, X_test, y_test):
    """Plot the test-set confusion matrix (blue colormap) and save it as confusion_matrix.png."""
    _ = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues)
    plt.savefig("confusion_matrix.png")


def save_metrics(metrics):
    """Save the evaluation metrics to metrics.json."""
    with open("metrics.json", "w") as fp:
        json.dump(metrics, fp)


def save_predictions(y_test, y_pred):
    """Combine true and predicted labels as integers and save them to predictions.csv."""
    cdf = pd.DataFrame(
        np.column_stack([y_test, y_pred]), columns=["true_label", "predicted_label"]
    ).astype(int)
    cdf.to_csv("predictions.csv", index=False)


def save_roc_curve(y_test, y_pred_proba):
    """Save the ROC curve coordinates (FPR, TPR) to roc_curve.csv."""
    # Use the probability of the positive class ([:, 1]).
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:, 1])
    cdf = pd.DataFrame(np.column_stack([fpr, tpr]), columns=["fpr", "tpr"]).astype(float)
    cdf.to_csv("roc_curve.csv", index=False)
```


# train.py – Main Training Script
```
import json

import pandas as pd
from sklearn.model_selection import train_test_split
from metrics_and_plots import save_metrics, save_predictions, save_roc_curve
from model import evaluate_model, train_model
from utils_and_constants import PROCESSED_DATASET, TARGET_COLUMN


def load_data(file_path):
    """Load the processed dataset and split it into features (X) and target (y)."""
    data = pd.read_csv(file_path)
    X = data.drop(TARGET_COLUMN, axis=1)
    y = data[TARGET_COLUMN]
    return X, y


def main():
    # Load the preprocessed data and split it into train/test
    # (75/25 by default) with a fixed seed.
    X, y = load_data(PROCESSED_DATASET)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1993)

    # Train the model, then evaluate it on the test set to get
    # metrics, predictions, and probabilities.
    model = train_model(X_train, y_train)
    metrics, y_pred, y_pred_proba = evaluate_model(model, X_test, y_test)

    # Print the formatted metrics to the console.
    print("====================Test Set Metrics==================")
    print(json.dumps(metrics, indent=2))
    print("======================================================")

    # Persist all outputs: metrics.json, predictions.csv, roc_curve.csv.
    # (Note: confusion_matrix.png is generated in the DVC stage, not here.)
    save_metrics(metrics)
    save_predictions(y_test, y_pred)
    save_roc_curve(y_test, y_pred_proba)


if __name__ == "__main__":
    main()
```

# dvc_cml.yaml – CI/CD with DVC + CML (GitHub Actions)
```
# This workflow runs on every pull request to main.
name: dvc-pipeline

# Trigger on PRs targeting main.
on:
  pull_request:
    branches: ['main']

# Allow writing to the repo and commenting on PRs.
permissions:
  contents: write
  pull-requests: write

jobs:
  train_and_report_eval_performance:
    # Use the latest Ubuntu runner.
    runs-on: ubuntu-latest
    steps:
      # Clone the repository.
      - name: Checkout
        uses: actions/checkout@v3

      # Set up Python 3.9.
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9

      # Install the project dependencies.
      - name: Install dependencies
        run: |
          python3 -m pip install --upgrade pip
          python3 -m pip install --no-cache-dir -r requirements.txt

      # Install CML (for PR reporting) and DVC (for pipeline orchestration).
      - name: Setup CML
        uses: iterative/setup-cml@v1

      - name: Setup DVC
        uses: iterative/setup-dvc@v1

      # Helps debug branch/checkout issues.
      - name: Debug Git state
        run: |
          echo "=== Current branch ==="
          git branch --show-current
          echo "=== Latest commit ==="
          git log -1 --oneline
          echo "=== Remote branches ==="
          git branch -r
          echo "=== Files in repo ==="
          ls -R

      # Verifies the DVC setup and the pipeline structure.
      - name: Debug DVC state
        run: |
          dvc --version
          dvc list .
          dvc metrics show || echo "No metrics found"
          dvc repro --dry || echo "Pipeline dry-run failed"

      # Executes the full DVC pipeline defined in dvc.yaml:
      # runs preprocess_dataset.py and train.py, and generates confusion_matrix.png.
      - name: Run DVC pipeline
        run: |
          set -x
          dvc repro
          set +x

      # Compares the current branch's metrics against main and writes a
      # Markdown table to metrics_compare.md. The file is always created,
      # even if nothing changed.
      - name: Debug metrics diff
        run: |
          echo "=== Metrics in current branch ==="
          dvc metrics show
          echo "=== Metrics in main branch ==="
          git fetch --prune
          git checkout main
          dvc metrics show || echo "No metrics in main"
          git checkout -
          echo "=== Running metrics diff ==="
          dvc metrics diff --md main > metrics_compare.md || echo "No diff found" > metrics_compare.md
          cat metrics_compare.md

      # Uses CML to post metrics_compare.md as a PR comment,
      # but only if the file has content.
      - name: Write CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          if [ -s metrics_compare.md ]; then
            cml comment create metrics_compare.md
          else
            echo "No metrics differences to report."
          fi
```

# How to Run Locally
```
# 1. Clone repo
git clone <your-repo-url>
cd <repo-name>

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install DVC
pip install dvc

# 5. Run preprocessing
python preprocess_dataset.py

# 6. Run training
python train.py

# 7. (Optional) Run full DVC pipeline
dvc repro
# Note: ensure raw_dataset/weather.csv exists before running!
```

After that, comparing the accuracy of two hyperparameter settings across branches takes three steps: run the DVC pipeline on the `main` branch, commit and push the metrics it generates, then make the code change on a secondary branch and let the GitHub Actions workflow run the DVC pipeline there. This mirrors real-world CI/CD automation: `main` holds the working version, the secondary branch holds the proposed change, and a pull request is opened between them. If the accuracy is better, the change may be merged; otherwise it is not.

The benchmark below compares a tree depth of 4 (from the PR workflow) with a tree depth of 2 (from `main`). One caveat: the higher accuracy at depth 4 could also be a sign of overfitting. That should be ruled out later by checking that the model performs well on both the validation and test sets.

![image.png](attachment:9059d107-1282-453e-9c39-3ee15cfb74e2.png)

# Hyperparameter Tuning with DVC
---


This README documents how to integrate hyperparameter tuning into a DVC pipeline, how to structure training code, and how to trigger stages independently. It also explains how results can be compared and used in CI/CD workflows.

---

## 1. Hyperparameter Tuning Workflow

- Hyperparameter tuning searches a parameter space to optimize model performance (e.g., accuracy).
- The best parameter configuration (e.g., `max_depth = 20`) is written to a file that training code can read.
- Training can be loosely coupled with tuning:
  - Training can run with or without tuning.
  - Both jobs depend on upstream dataset changes.
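
The loose coupling between tuning and training can be sketched without any ML library. In this illustration the tuning side searches a small grid and writes the best configuration to a JSON file, and the training side only reads that file. The scoring function `toy_score` is a made-up stand-in for a cross-validated accuracy, and the grid values are assumptions.

```python
import itertools
import json
import tempfile
from pathlib import Path


def toy_score(max_depth: int, n_estimators: int) -> float:
    """Stand-in for a cross-validated score; peaks at max_depth=20, n_estimators=5."""
    return 1.0 - abs(max_depth - 20) / 100 - abs(n_estimators - 5) / 50


param_grid = {"max_depth": [10, 20, 50], "n_estimators": [2, 4, 5]}

# Tuning: evaluate every combination in the grid and keep the best one.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in itertools.product(*param_grid.values())]
best_params = max(candidates, key=lambda params: toy_score(**params))

with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "best_params.json"
    params_file.write_text(json.dumps(best_params))  # written by the tuning job
    loaded = json.loads(params_file.read_text())     # read by the training job

print(loaded)
```

Because training only depends on the file, the parameters can come from a tuning run or from a manual edit.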

---

## 2. Training Code Changes

- Training code should read hyperparameters from a file (e.g., `params.yaml` or `best_params.json`).
- This allows training to use tuned parameters or manually edited ones.

**Note on permissions:** to let GitHub Actions create pull requests, the repository settings must be changed as follows.

![image.png](attachment:91c8b78a-36a7-4bd6-a04f-d47026353d23.png)

The setting is under Settings → Actions → General → Workflow permissions.

Example snippet for reading the tuned parameters in the training code:

```python
import json
from sklearn.ensemble import RandomForestClassifier

# Read the hyperparameters written by the tuning step (or edited by hand).
with open("best_params.json") as f:
    params = json.load(f)

# X_train and y_train are the training split prepared earlier.
model = RandomForestClassifier(**params)
model.fit(X_train, y_train)
```

---

## 3. Hyperparameter Tuning with GridSearch

- Use `GridSearchCV` to evaluate parameter combinations across folds.
- Example:

```python
import json

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# X_train and y_train are the training split prepared earlier.
param_grid = {"max_depth": [10, 20, 30], "n_estimators": [50, 100]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

best_params = grid.best_params_

with open("best_params.json", "w") as f:
    json.dump(best_params, f)

# Save the results table
pd.DataFrame(grid.cv_results_).to_markdown("tuning_results.md")
```

---

## 4. DVC YAML Changes

Define a hyperparameter tuning stage in `dvc.yaml`:

```yaml
stages:
  preprocess:
    cmd: python3 preprocess_dataset.py
    deps:
      - preprocess_dataset.py
      - raw_dataset/weather.csv
      - utils_and_constants.py
    outs:
      - processed_dataset/weather.csv

  tune:
    cmd: python3 hyperparameter_tuning.py
    deps:
      - hyperparameter_tuning.py
      - processed_dataset/weather.csv
      - params_grid.json
    outs:
      - tuning_results.md

  train:
    cmd: python3 train.py
    deps:
      - train.py
      - model.py
      - utils_and_constants.py
      - processed_dataset/weather.csv
      - best_params.json
    metrics:
      - metrics.json:
          cache: false
    plots:
      - predictions.csv:
          template: confusion_normalized
          x: predicted_label
          y: true_label
          x_label: 'Predicted label'
          y_label: 'True label'
          title: Confusion matrix
          cache: false
```

Notes:

- `tuning_results.md` is tracked to show the performance of the parameter combinations.
- `best_params.json` is not tracked as an output, to avoid forcing tuning reruns when it is edited manually.

---

## 5. Triggering Individual Stages

Force the hyperparameter tuning stage to run:

```bash
dvc repro -f tune
```

Run the training stage:

```bash
dvc repro train
```

Both depend on preprocessing, so data changes will trigger the upstream steps.

---

## 6. Hyperparameter Run Output

- `GridSearchCV` produces tabular results (`cv_results_`).
- Save the results to `tuning_results.md` for later review.

Example output:

| mean_test_score | param_max_depth | param_n_estimators |
|-----------------|-----------------|--------------------|
| 0.82            | 10              | 50                 |
| 0.85            | 20              | 100                |

# Hyperparameter Tuning: Full Steps

# /.github/workflows
*hp_cml.yaml*
```
name: hp-tuning

on:
  pull_request:
    branches: ['main']

permissions: write-all

jobs:
  hp_tune:
    if: startsWith(github.head_ref, 'hp_tune/')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout 
        uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9

      - name: Setup DVC
        uses: iterative/setup-dvc@v1

      - name: Setup CML
        uses: iterative/setup-cml@v1

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run DVC pipeline
        run: |
          dvc pull
          dvc repro -f hp_tune
      
      - name: Create training branch
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          cml comment create hp_tuning_results.md
          export BRANCH_NAME=train/$(git rev-parse --short "${{ github.sha }}")
          cml pr create --user-email hp-bot@cicd.ai --user-name HPBot --message "HP tuning" --branch $BRANCH_NAME rfc_best_params.json
```
*train.yaml*
```
name: train

on:
  pull_request:
    branches: ['main']

permissions: write-all

jobs:
  train_and_publish_report:
    if: startsWith(github.head_ref, 'train/')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout 
        uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9

      - name: Setup DVC
        uses: iterative/setup-dvc@v1

      - name: Setup CML
        uses: iterative/setup-cml@v1

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run DVC pipeline
        run: |
          dvc pull
          dvc repro train
      
      - name: Create CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          git fetch --prune
          dvc metrics diff --md main >> metrics_compare.md
          cml comment create metrics_compare.md
```

*dvc.yaml*
```
stages:
  preprocess:
    cmd: python preprocess_dataset.py
    deps:
    - preprocess_dataset.py
    - raw_dataset/weather.csv
    - utils_and_constants.py
    outs:
    - processed_dataset/weather.csv
  hp_tune:
    cmd: python hp_tuning.py
    deps:
    - processed_dataset/weather.csv
    - hp_config.json
    - hp_tuning.py
    - utils_and_constants.py
    outs:
      - hp_tuning_results.md:
          cache: false
  train:
    cmd: python3 train.py
    deps:
    - metrics_and_plots.py
    - model.py
    - processed_dataset/weather.csv
    - rfc_best_params.json
    - train.py
    - utils_and_constants.py
    metrics:
      - metrics.json:
          cache: false
    plots:
    - predictions.csv:
        template: confusion
        x: predicted_label
        y: true_label
        x_label: 'Predicted label'
        y_label: 'True label'
        title: Confusion matrix
        cache: false
    - roc_curve.csv:
        template: simple
        x: fpr
        y: tpr
        x_label: 'False Positive Rate'
        y_label: 'True Positive Rate'
        title: ROC curve
        cache: false

```

*hp_config.json*
```
{
    "n_estimators": [2, 4, 5],
    "max_depth": [10, 20, 50],
    "random_state": [1993]
}
  
```


*hp_tuning.py*

```
import json

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from utils_and_constants import PROCESSED_DATASET, get_hp_tuning_results, load_data


def main():
    X, y = load_data(PROCESSED_DATASET)
    X_train, _, y_train, _ = train_test_split(X, y, random_state=1993)

    model = RandomForestClassifier()
    # Read the config file to define the hyperparameter search space
    param_grid = json.load(open("hp_config.json", "r"))

    # Perform Grid Search Cross Validation on training data
    grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=1, verbose=2)
    grid_search.fit(X_train, y_train)

    best_params = grid_search.best_params_

    print("====================Best Hyperparameters==================")
    print(json.dumps(best_params, indent=2))
    print("==========================================================")

    with open("rfc_best_params.json", "w") as outfile:
        json.dump(best_params, outfile)

    markdown_table = get_hp_tuning_results(grid_search)
    with open("hp_tuning_results.md", "w") as markdown_file:
        markdown_file.write(markdown_table)


if __name__ == "__main__":
    main()

```

*train.py*
```
import json

from metrics_and_plots import save_metrics, save_predictions, save_roc_curve
from model import evaluate_model, train_model
from sklearn.model_selection import train_test_split
from utils_and_constants import PROCESSED_DATASET, load_data, load_hyperparameters


def main():
    X, y = load_data(PROCESSED_DATASET)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1993)

    # Load hyperparameters from the JSON file
    hyperparameters = load_hyperparameters("rfc_best_params.json")
    model = train_model(X_train, y_train, hyperparameters)
    metrics, y_pred, y_pred_proba = evaluate_model(model, X_test, y_test)

    print("====================Test Set Metrics==================")
    print(json.dumps(metrics, indent=2))
    print("======================================================")

    save_metrics(metrics)
    save_predictions(y_test, y_pred)
    save_roc_curve(y_test, y_pred_proba)


if __name__ == "__main__":
    main()

```


*utils_and_constants.py*
```

import json
import shutil
from pathlib import Path

import pandas as pd
from sklearn.model_selection import GridSearchCV

DATASET_TYPES = ["test", "train"]
DROP_COLNAMES = ["Date"]
TARGET_COLUMN = "RainTomorrow"
RAW_DATASET = "raw_dataset/weather.csv"
PROCESSED_DATASET = "processed_dataset/weather.csv"
RFC_FOREST_DEPTH = 2


def delete_and_recreate_dir(path):
    try:
        shutil.rmtree(path)
    except:
        pass
    finally:
        Path(path).mkdir(parents=True, exist_ok=True)


def load_data(file_path):
    data = pd.read_csv(file_path)
    X = data.drop(TARGET_COLUMN, axis=1)
    y = data[TARGET_COLUMN]
    return X, y


def load_hyperparameters(hyperparameter_file):
    with open(hyperparameter_file, "r") as json_file:
        hyperparameters = json.load(json_file)
    return hyperparameters


def get_hp_tuning_results(grid_search: GridSearchCV) -> str:
    """Get the results of hyperparameter tuning in a Markdown table"""
    cv_results = pd.DataFrame(grid_search.cv_results_)

    # Extract and split the 'params' column into subcolumns
    params_df = pd.json_normalize(cv_results["params"])

    # Concatenate the params_df with the original DataFrame
    cv_results = pd.concat([cv_results, params_df], axis=1)

    # Get the columns to display in the Markdown table
    cv_results = cv_results[
        ["rank_test_score", "mean_test_score", "std_test_score"]
        + list(params_df.columns)
    ]

    cv_results.sort_values(by="mean_test_score", ascending=False, inplace=True)
    return cv_results.to_markdown(index=False)

```


With these files on the `main` branch, run `dvc repro` to train the model, log its accuracy, and find the optimal hyperparameters.

Then commit and push the results with `git add .`, `git commit -m "<message>"`, and `git push`.

Next, check out another branch (creating it if needed) and run the tuning script directly with `python hp_tuning.py`. This populates `hp_tuning_results.md`.

Run the DVC training stage again with `dvc repro train`, then force the hyperparameter tuning stage with `dvc repro -f hp_tune`.

Finally, compare the metrics of the current branch against `main` with `dvc metrics diff main`.

Example:
![image.png](attachment:b3392d6e-8497-4211-aff6-9b1c361cb7a3.png)
![image.png](attachment:5548267c-54df-49f8-9b28-e61a51764372.png)
Commit and push on `main`, then switch branches as follows:
![image.png](attachment:cfb9b816-df67-42e4-9422-6f127436ea56.png)
![image.png](attachment:cdac2a15-95d9-4af2-b562-69cc9eaa7162.png)
![image.png](attachment:20b0a09d-4564-4413-aa39-b955740dccec.png)
![image.png](attachment:d7bb0919-e551-4be4-83d0-6785a7206bfc.png)

The change column shows `-` because both branches were trained with the same hyperparameters, so the metrics are identical and there is no difference to report.











# 1. Branching Workflow
Hyperparameter tuning and training are loosely coupled. Each runs independently in its own branch:

- Branch prefix for tuning: `hp_tune/…`
- Branch prefix for training: `train/…`

# 2. Conditional Execution in GitHub Actions
Use an `if:` condition in the workflow YAML to restrict jobs to specific branch prefixes.

Example:

```yaml
jobs:
  hyperparameter_tuning:
    if: startsWith(github.head_ref, 'hp_tune/')
    runs-on: ubuntu-latest
    steps:
      - name: Run DVC hyperparameter tuning
        run: dvc repro -f hp_tune

  training:
    if: startsWith(github.head_ref, 'train/')
    runs-on: ubuntu-latest
    steps:
      - name: Run DVC training
        run: dvc repro -f train
```

- `github.head_ref` → source branch of the pull request.
- No `${{ }}` is needed inside `if:` because GitHub evaluates it as an expression automatically.
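
The routing logic of these conditions can be mimicked in a few lines of Python. This is only an illustration of the prefix check, not how GitHub evaluates expressions internally. The branch names are hypothetical, and the lower-casing reflects that GitHub's `startsWith` is not case sensitive.

```python
def jobs_for_branch(head_ref: str) -> list:
    """Return the job names whose branch-prefix condition matches head_ref."""
    prefix_to_job = {"hp_tune/": "hyperparameter_tuning", "train/": "training"}
    ref = head_ref.lower()  # GitHub's startsWith ignores case
    return [job for prefix, job in prefix_to_job.items() if ref.startswith(prefix)]


print(jobs_for_branch("hp_tune/wider-grid"))  # hypothetical tuning branch
print(jobs_for_branch("train/abc1234"))       # hypothetical training branch
print(jobs_for_branch("feature/docs"))        # matches neither prefix
```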

# 3. Workflow Permissions
To allow workflows to create PRs:

1. Go to Repository Settings → Actions → General → Workflow permissions.
2. Enable "Allow GitHub Actions to create and approve pull requests."

# 4. Hyperparameter Tuning Job
Triggered when a PR is opened from a branch starting with `hp_tune/`.

Runs the tuning stage via DVC:

```bash
dvc repro -f hp_tune
```

Produces results (e.g., `hp_tuning_results.md`).

Posts the results back to the PR using CML:

```bash
cml comment create hp_tuning_results.md
```

# 5. Creating a Training PR from Hyperparameter Run
After tuning completes, create a new branch for training:

bash
git checkout -b train/<short_SHA>
Commit updated parameter configuration file.

Use CML to open a PR:

bash
cml pr create --user "Rustam" \
              --message "Training with tuned parameters" \
              --branch train/<short_SHA> \
              params.yaml
# 6. Training Branch PR
Once created, GitHub shows a PR with the changes to the parameter file (the git diff).

The training workflow is meant to run automatically because the branch starts with `train/`, subject to the limitation described in section 7.

# 7. Manual Training Run Kickoff
⚠️ Limitation: events created with the `GITHUB_TOKEN` (such as a PR opened by `cml pr create`) do not trigger new workflow runs; GitHub does this to avoid recursive runs.

Alternatives:

- Use a Personal Access Token (PAT) with equivalent permissions.
- Or trigger training manually:

```bash
git checkout train/<short_SHA>
git commit --allow-empty -m "Kick off training run"
git push origin train/<short_SHA> --force
```

# 8. Training Job
Runs independently when a PR is opened from a `train/…` branch.

Executes the training stage:

```bash
dvc repro -f train
```

Can be configured to print and compare metrics and plots.
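
To illustrate what a metrics comparison boils down to, here is a small Python sketch that diffs two metrics dicts into a Markdown table. It is a hand-rolled stand-in and not DVC's implementation or exact output format; `metrics_diff_markdown` and the metric names and values are assumptions. It also shows why two branches trained with identical hyperparameters produce an all-zero change column.

```python
from typing import Dict


def metrics_diff_markdown(old: Dict[str, float], new: Dict[str, float]) -> str:
    """Compare metrics present in both dicts and render a Markdown table."""
    lines = ["| metric | main | branch | change |", "|---|---|---|---|"]
    for name in sorted(set(old) & set(new)):
        change = new[name] - old[name]
        lines.append(
            f"| {name} | {old[name]:.4f} | {new[name]:.4f} | {change:+.4f} |"
        )
    return "\n".join(lines) + "\n"


# Placeholder metrics: same values on both sides, as when nothing changed.
main_metrics = {"accuracy": 0.5, "f1": 0.25}
branch_metrics = {"accuracy": 0.5, "f1": 0.25}
print(metrics_diff_markdown(main_metrics, branch_metrics))
```
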

# Summary of Commands
Run the hyperparameter tuning stage:

```bash
dvc repro -f hp_tune
```

Comment results on the PR:

```bash
cml comment create hp_tuning_results.md
```

Create the training branch:

```bash
git checkout -b train/<short_SHA>
```

Open a PR with the tuned parameters:

```bash
cml pr create --user-name "Rustam" --message "Training with tuned parameters" --branch train/<short_SHA> params.yaml
```

Manual training kickoff:

```bash
git commit --allow-empty -m "Kick off training run"
git push origin train/<short_SHA> --force
```

# Hyperparameter Tuning and Comparison with Full CI/CD Automation in GitHub


# /.github/workflows

*hp_cml.yaml*

```
name: hp-tuning

on:
  pull_request:
    branches: ["main"]

permissions: write-all

jobs:
  hp_tune:
    # Only run this job if the PR's source branch
    # starts with the right prefix
    if: startsWith(github.head_ref, 'hp_tune/')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout 
        uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9

      - name: Setup DVC
        uses: iterative/setup-dvc@v1

      - name: Setup CML
        uses: iterative/setup-cml@v1

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run DVC pipeline
        run: |
          dvc repro -f hp_tune
      
      - name: Create training branch
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Commit rfc_best_params.json to a new train/ branch and open a PR into main
          cml pr create \
            --user-email hp-bot@cicd.ai \
            --user-name HPBot \
            --message "HP tuning" \
            --branch train/${{ github.sha }} \
            --target-branch main \
            rfc_best_params.json
```
*train.yaml*
```
name: train

on:
  pull_request:
    branches: ['main']

permissions: write-all

jobs:
  train_and_publish_report:
    if: startsWith(github.head_ref, 'train/')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout 
        uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9

      - name: Setup DVC
        uses: iterative/setup-dvc@v1

      - name: Setup CML
        uses: iterative/setup-cml@v1

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run DVC pipeline
        run: |
          dvc pull
          dvc repro train
      
      - name: Create CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          git fetch --prune
          dvc metrics diff --md main >> metrics_compare.md
          cml comment create metrics_compare.md
```
*dvc.yaml*
```
stages:
  preprocess:
    cmd: python preprocess_dataset.py
    deps:
    - preprocess_dataset.py
    - raw_dataset/weather.csv
    - utils_and_constants.py
    outs:
    - processed_dataset/weather.csv
  hp_tune:
    cmd: python hp_tuning.py
    deps:
    - processed_dataset/weather.csv
    - hp_config.json
    - hp_tuning.py
    - utils_and_constants.py
    outs:
      - hp_tuning_results.md:
          cache: false
  train:
    cmd: python train.py
    deps:
    - metrics_and_plots.py
    - model.py
    - processed_dataset/weather.csv
    - rfc_best_params.json
    - train.py
    - utils_and_constants.py
    metrics:
      - metrics.json:
          cache: false
    plots:
    - predictions.csv:
        template: confusion
        x: predicted_label
        y: true_label
        x_label: 'Predicted label'
        y_label: 'True label'
        title: Confusion matrix
        cache: false
    - roc_curve.csv:
        template: simple
        x: fpr
        y: tpr
        x_label: 'False Positive Rate'
        y_label: 'True Positive Rate'
        title: ROC curve
        cache: false

```
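
The `train` stage expects `train.py` to produce `metrics.json` and `predictions.csv`, but the script is not included in these notes. The sketch below shows one way those files could be written so they match the `dvc.yaml` above: the column names `predicted_label` and `true_label` come from the plot configuration, while `write_outputs`, the accuracy metric, and the sample labels are assumptions for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Sequence


def write_outputs(y_true: Sequence[int], y_pred: Sequence[int], out_dir: str = ".") -> dict:
    """Write metrics.json and predictions.csv in the layout dvc.yaml refers to."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Accuracy is an assumed metric; the real script may log others.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    metrics = {"accuracy": correct / len(y_true)}
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2) + "\n", encoding="utf-8")

    # Column names match x/y in the confusion-matrix plot definition.
    with open(out / "predictions.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["predicted_label", "true_label"])
        writer.writerows(zip(y_pred, y_true))
    return metrics


# Made-up labels, written to a temporary directory to avoid touching the repo.
with tempfile.TemporaryDirectory() as tmp:
    print(write_outputs([1, 0, 1, 1], [1, 0, 0, 1], tmp))
```
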


That's pretty much it. The pipeline will now do the following:

![image.png](attachment:40e492ef-f09e-4f93-b706-882be9a78a5b.png)

It might look ordinary, but it runs hyperparameter tuning in GitHub Actions. With each push, the pull request workflow is triggered automatically and the GitHub Actions bot posts the results as comments on GitHub, eventually automating the whole workflow.

This concludes the CI/CD pipeline using GitHub Actions and DVC.

To be continued...