# Q1. What is the difference between the delivery and the deployment of a project?
Ans1. "Delivery" and "deployment" are terms often used in the context of software development and project management. While they are related, they refer to different stages in the development and release process.

1) Delivery:

Definition: Delivery refers to the process of handing over the completed and tested software to the client or end-users. It involves providing the product, along with any necessary documentation, to the stakeholders.

##### Key Aspects:

- Completion of Development: The software development process is considered complete, and the product is ready for release.
- Testing and Quality Assurance: The software has undergone thorough testing, and quality assurance processes have been completed to ensure that it meets the specified requirements and is free of critical issues.
- Documentation: Necessary documentation, such as user manuals, release notes, and any other relevant materials, is prepared and provided along with the software.
- Goal: The primary goal of delivery is to deliver a fully functional and tested product that meets the client's requirements.

2) Deployment:

Definition: Deployment is the process of making the software available and operational in the target environment. It involves installing, configuring, and activating the software so that end-users can access and use it.

##### Key Aspects:

- Installation: The software is installed on the servers or devices where it will be running.
- Configuration: Any necessary configuration settings are applied to ensure that the software works correctly in the target environment.
- Activation: The software is made live and accessible to end-users.
- Goal: The primary goal of deployment is to make the software available for use in the production environment.

Summary:

Delivery is about providing the completed and tested software, along with documentation, to the client or end-users. It marks the completion of the development phase.

Deployment is about making the software operational in the target environment, including the installation, configuration, and activation processes. It brings the software into a live and accessible state for end-users.

In many projects, delivery and deployment are closely related, and the terms are sometimes used interchangeably. However, it's important to recognize the distinctions between them, especially in larger and more complex software development processes.

# Q2. What is the difference between the Waterfall technique and the Agile technique in software development?
Ans2. The Waterfall model and the Agile methodology are two different approaches to software development, each with its own set of principles, practices, and characteristics. Here are the key differences between the Waterfall technique and Agile methodology:

1) Development Approach:

<li> Waterfall:

- Sequential and linear development approach.
- Divided into distinct phases (requirements, design, implementation, testing, deployment, and maintenance), with each phase building upon the previous one.
- Progress moves in a strict, predetermined order, and each phase must be completed before the next one begins.

<li> Agile:

- Iterative and incremental development approach.
- Work is divided into small, functional increments called iterations or sprints.
- Allows for flexibility and the ability to revisit and adjust requirements and features throughout the development process.

2) Flexibility and Changes:

<li> Waterfall:

- Less accommodating to changes once the project has started.
- Changes to requirements can be difficult and costly to implement after the project has moved beyond the initial phases.

<li> Agile:

- Embraces change and allows for flexibility.
- Welcomes changing requirements, even late in the development process, to deliver a product that better meets the client's needs.

3) Client Involvement:

<li> Waterfall:

- Limited client involvement after the initial requirements phase.
- Clients typically see the product only after it has been fully developed.

<li> Agile:

- Encourages continuous client involvement throughout the development process.
- Clients are often part of the development team, providing feedback on each iteration and influencing the direction of the project.

4) Testing:

<li> Waterfall:

- Testing is a separate phase that occurs after development is complete.
- Potential for identifying issues late in the development process.

<li> Agile:

- Testing is integrated throughout the development process.
- Continuous testing allows for early identification and resolution of issues.

5) Delivery Time:

<li> Waterfall:

- The entire project is delivered at once after the completion of all phases.
- Longer time to market.

<li> Agile:

- Incremental delivery of functional parts throughout the development process.
- Shorter time to market, with the ability to release usable products at the end of each iteration.

6) Project Visibility:

<li> Waterfall:

- Limited visibility into the project's progress until it reaches the testing phase.
- Potential for surprises late in the project.

<li> Agile:

- Provides continuous visibility and transparency through regular iterations and progress updates.
- Stakeholders have a clear understanding of the project's status and direction.

In summary, the Waterfall model follows a linear and sequential approach, while the Agile methodology embraces flexibility, collaboration, and iterative development. The choice between Waterfall and Agile depends on the project's requirements, complexity, and the level of adaptability needed throughout the development process.

# Note:

#### Testers commonly use Python testing tools such as pytest (a test framework) and tox (which runs a test suite across multiple isolated environments)
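As a minimal sketch of what a pytest-style test looks like: the function `add` and both tests below are hypothetical and only illustrate the convention that pytest collects functions whose names start with `test_` and reports a failure when a plain `assert` does not hold.

```python
# Hypothetical function under test; in a real project it would live in its own module.
def add(a, b):
    """Return the sum of two numbers."""
    return a + b


# pytest collects functions named test_* and treats a failing assert as a test failure.
def test_add_integers():
    assert add(2, 3) == 5


def test_add_floats():
    # Compare floats with a tolerance rather than exact equality.
    assert abs(add(0.1, 0.2) - 0.3) < 1e-9
```

Saved as `test_add.py` (an assumed file name), running `pytest` in the same folder would collect and run both tests. tox could then run the same suite in each environment listed in its configuration.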

# What is MLflow?
Stepping into the world of Machine Learning (ML) is an exciting journey, but it often comes with complexities that can hinder innovation and experimentation.

MLflow is a solution to many of these issues in this dynamic landscape, offering tools and simplifying processes to streamline the ML lifecycle and foster collaboration among ML practitioners.

Whether you’re an individual researcher, a member of a large team, or somewhere in between, MLflow provides a unified platform to navigate the intricate maze of model development, deployment, and management. MLflow aims to enable innovation in ML solution development by streamlining otherwise cumbersome logging, organization, and lineage concerns that are unique to model development. This focus allows you to ensure that your ML projects are robust, transparent, and ready for real-world challenges.

Read on to discover the core components of MLflow and understand the unique advantages it brings to the complex workflows associated with model development and management.

# Core Components of MLflow
MLflow, at its core, provides a suite of tools aimed at simplifying the ML workflow. It is tailored to assist ML practitioners throughout the various stages of ML development and deployment. Despite its expansive offerings, MLflow’s functionalities are rooted in several foundational components:

1. Tracking: MLflow Tracking provides both an API and UI dedicated to the logging of parameters, code versions, metrics, and artifacts during the ML process. This centralized repository captures details such as parameters, metrics, artifacts, data, and environment configurations, giving teams insight into their models’ evolution over time. Whether working in standalone scripts, notebooks, or other environments, Tracking facilitates the logging of results either to local files or a server, making it easier to compare multiple runs across different users.

2. Model Registry: A systematic approach to model management, the Model Registry assists in handling different versions of models, discerning their current state, and ensuring a smooth transition from development to production. It offers a centralized model store, APIs, and UI to collaboratively manage an MLflow Model’s full lifecycle, including model lineage, versioning, stage transitions, and annotations.

3. AI Gateway: This server, equipped with a set of standardized APIs, streamlines access to both SaaS and OSS LLM models. It serves as a unified interface, bolstering security through authenticated access, and offers a common set of APIs for prominent LLMs.

4. Evaluate: Designed for in-depth model analysis, this set of tools facilitates objective model comparison, be it traditional ML algorithms or cutting-edge LLMs.

5. Prompt Engineering UI: A dedicated environment for prompt engineering, this UI-centric component provides a space for prompt experimentation, refinement, evaluation, testing, and deployment.

6. Recipes: Serving as a guide for structuring ML projects, Recipes, while offering recommendations, are focused on ensuring functional end results optimized for real-world deployment scenarios.

7. Projects: MLflow Projects standardize the packaging of ML code, workflows, and artifacts, akin to an executable. Each project, be it a directory with code or a Git repository, employs a descriptor or convention to define its dependencies and execution method.

By integrating these core components, MLflow offers an end-to-end platform, ensuring efficiency, consistency, and traceability throughout the ML lifecycle.

# Why Use MLflow?
The machine learning (ML) process is intricate, comprising various stages, from data preprocessing to model deployment and monitoring. Ensuring productivity and efficiency throughout this lifecycle poses several challenges:

1. Experiment Management: It’s tough to keep track of the myriad experiments, especially when working with files or interactive notebooks. Determining which combination of data, code, and parameters led to a particular result can become a daunting task.

2. Reproducibility: Ensuring consistent results across runs is not trivial. Beyond just tracking code versions and parameters, capturing the entire environment, including library dependencies, is critical. This becomes even more challenging when collaborating with other data scientists or when scaling the code to different platforms.

3. Deployment Consistency: With the plethora of ML libraries available, there’s often no standardized way to package and deploy models. Custom solutions can lead to inconsistencies, and the crucial link between a model and the code and parameters that produced it might be lost.

4. Model Management: As data science teams produce numerous models, managing these models, their versions, and stage transitions becomes a significant hurdle. Without a centralized platform, managing model lifecycles, from development to staging to production, becomes unwieldy.

5. Library Agnosticism: While individual ML libraries might offer solutions to some of the challenges, achieving the best results often involves experimenting across multiple libraries. A platform that offers compatibility with various libraries while ensuring models are usable as reproducible “black boxes” is essential.

MLflow addresses these challenges by offering a unified platform tailored for the entire ML lifecycle. Its benefits include:

1. Traceability: With tools like the Tracking Server, every experiment is logged, ensuring that teams can trace back and understand the evolution of models.

2. Consistency: Be it accessing models through the AI Gateway or structuring projects with MLflow Recipes, MLflow promotes a consistent approach, reducing both the learning curve and potential errors.

3. Flexibility: MLflow’s library-agnostic design ensures compatibility with a wide range of machine learning libraries. It supports multiple programming languages through a REST API, a CLI, and APIs for Python, R, and Java.

By simplifying the complex landscape of ML workflows, MLflow empowers data scientists and developers to focus on building and refining models, ensuring a streamlined path from experimentation to production.

# Use Cases of MLflow
MLflow is versatile, catering to diverse machine learning scenarios. Here are some typical use cases:

1. Experiment Tracking: A data science team leverages MLflow Tracking to log parameters and metrics for experiments within a particular domain. Using the MLflow UI, they can compare results and fine-tune their solution approach. The outcomes of these experiments are preserved as MLflow models.

2. Model Selection and Deployment: MLOps engineers employ the MLflow UI to assess and pick the top-performing models. The chosen model is registered in the MLflow Registry, allowing for monitoring its real-world performance.

3. Model Performance Monitoring: Post deployment, MLOps engineers utilize the MLflow Registry to gauge the model’s efficacy, juxtaposing it against other models in a live environment.

4. Collaborative Projects: Data scientists embarking on new ventures organize their work as an MLflow Project. This structure facilitates easy sharing and parameter modifications, promoting collaboration.

# Q4. What is DVC (Data Version Control) used for?
Ans4. DVC is an open-source version control system for machine learning projects. It works alongside Git: Git versions the code, while DVC versions datasets and models, making it easier to manage and reproduce machine learning experiments. Key features of DVC include:

1. Data Versioning: DVC allows you to version control datasets separately from your code. This is essential for ensuring that the data used in a machine learning experiment is reproducible and consistent across different stages of the project.

2. Dependency Management: DVC helps manage dependencies between code and data. It tracks the relationships between source code, datasets, and the output of experiments, ensuring that changes to one component do not break the reproducibility of the entire project.

3. Storage and Sharing: DVC integrates with various storage backends (local, cloud, or network-attached storage). It enables sharing and collaboration by allowing teams to store and retrieve large datasets efficiently.

4. Reproducibility: With DVC, you can reproduce any experiment by retrieving the exact code and data associated with a specific version. This is crucial for building trust in the results of machine learning experiments and for collaborating on projects.

In summary, DVC is used for versioning datasets, managing dependencies, and ensuring the reproducibility of machine learning projects by integrating with traditional version control systems like Git. Both MLflow and DVC contribute to making machine learning projects more manageable, reproducible, and scalable. They are often used together to address different aspects of the machine learning lifecycle.

# Note:

1. To use DVC in a project, first initialize Git with `git init`; by default, DVC expects to run inside a Git repository.

2. Git is used for version control, also called SCM (Source Code Management).

3. To see all commits and their commit IDs, run `git log`.

4. To view the source code as it was at a given commit, run `git checkout <commit id>`. The first 5-6 characters of the ID are usually enough, as long as they identify the commit uniquely.

5. To see the history of the commands you have run, use `history` in Git Bash or `doskey /history` in cmd.

6. Our code starts out in a local folder. After initializing Git, that folder is called a local repository.

7. Initializing Git automatically creates a hidden `.git` folder.

# .git folder

The `.git` folder is a hidden directory that is automatically created by Git when you initialize a new repository using the `git init` command. This folder is at the root of your Git repository, and it plays a crucial role in tracking changes, managing branches, and storing metadata related to your project.

Here are some key aspects of the `.git` folder:

1. **Repository Configuration:**
   - The `.git` folder contains configuration files and settings for the Git repository. This includes information about the repository's remote (if any), branch settings, and other configuration options.

2. **Object Database:**
   - Inside the `.git` folder, there is an object database that stores all the objects that Git manages. Git uses a content-addressable filesystem to store these objects, including blobs (file content), trees (directory structure), commits, and tags.

3. **Branches and Head:**
   - The `.git` folder includes references to branches and the `HEAD` pointer, which points to the current commit or branch. Information about the commit history is stored in the form of a directed acyclic graph.

4. **Index:**
   - The `.git` folder contains an index file (also known as the staging area) that tracks changes to be included in the next commit. This file helps Git efficiently manage and track modifications before they are committed.

5. **Hooks:**
   - Git hooks, which are scripts that can be executed at certain points in the Git workflow, are stored in the `.git/hooks` directory. A new repository ships with sample hooks there, which you can adapt to customize the behavior of the repository.

6. **Logs:**
   - The `.git/logs` directory records updates to references, such as the movement of `HEAD` and branch tips caused by commits, checkouts, and resets.

7. **Submodules (if used):**
   - If you're using Git submodules (repositories embedded within a parent repository), information about submodules is stored in the `.git/modules` directory.

8. **Alternate Object Stores (if used):**
   - A repository can borrow objects from another object store, which helps with large or shared repositories. The location of that store is listed in the `.git/objects/info/alternates` file.

It's important to note that the `.git` folder is a critical part of the Git version control system, and it should not be modified manually unless you have a deep understanding of Git's internal workings. Modifying the contents of the `.git` folder can result in a corrupted repository.
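To make the idea of a content-addressable object database concrete, here is a small sketch that reproduces how Git derives a blob's ID from file content. It assumes a repository that uses SHA-1 (the default), and it only computes the ID and the loose-object path; it does not write anything into `.git`.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a file's content (SHA-1 repositories)."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Loose objects live under .git/objects/<first 2 hex chars>/<remaining chars>."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


blob_id = git_blob_id(b"hello world\n")
print(blob_id)
print(loose_object_path(blob_id))
```

Because the ID depends only on the content, two files with identical bytes share one object. The printed ID should match what `git hash-object` reports for a file with the same content.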

# What Is Data Version Control?
In standard software engineering, many people need to work on a shared codebase and handle multiple versions of the same code. This can quickly lead to confusion and costly mistakes.

To address this problem, developers use version control systems, such as Git, that help keep team members organized.

In a version control system, there’s a central repository of code that represents the current, official state of the project. A developer can make a copy of that project, make some changes, and request that their new version become the official one. Their code is then reviewed and tested before it’s deployed to production.

These quick feedback cycles can happen many times per day in traditional development projects. But similar conventions and standards are largely missing from commercial data science and machine learning. Data version control is a set of tools and processes that tries to adapt the version control process to the data world.

Having systems in place that allow people to work quickly and pick up where others have left off would increase the speed and quality of delivered results. It would enable people to manage data transparently, run experiments effectively, and collaborate with others.

Note: An experiment in this context means either training a model or running operations on a dataset to learn something from it.

One tool that helps researchers govern their data and models and run reproducible experiments is DVC, which stands for Data Version Control.

# What Is DVC?
DVC is a command-line tool written in Python. It mimics Git commands and workflows to ensure that users can quickly incorporate it into their regular Git practice. 

DVC is meant to be run alongside Git. In fact, the git and dvc commands will often be used in tandem, one after the other. While Git is used to store and version code, DVC does the same for data and model files.

Git can store code locally and also on a hosting service like GitHub, Bitbucket, or GitLab. Likewise, DVC uses a remote repository to store all your data and models. This is the single source of truth, and it can be shared amongst the whole team. You can get a local copy of the remote repository, modify the files, then upload your changes to share with team members.

The remote repository can be on the same computer you’re working on, or it can be in the cloud. DVC supports most major cloud providers, including AWS, GCP, and Azure. But you can set up a DVC remote repository on any server and connect it to your laptop. There are safeguards to keep members from corrupting or deleting the remote data.

When you start tracking a data or model file with DVC (using `dvc add`), a .dvc file is created. A .dvc file is a small text file that records the hash of your actual data file, which DVC uses to locate that file in its cache and in remote storage.

The .dvc file is lightweight and meant to be stored with your code in GitHub. When you download a Git repository, you also get the .dvc files. You can then use those files to get the data associated with that repository. Large data and model files go in your DVC remote storage, and small .dvc files that point to your data go in GitHub.
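The pointer idea can be illustrated with a short sketch. This is a simplified imitation, not DVC's own code: it hashes a hypothetical `data.csv` with MD5 and prints text that resembles a .dvc file. Real .dvc files are written by `dvc add`, may contain additional fields, and their exact format can differ between DVC versions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pointer_text(path: Path) -> str:
    """Build text resembling a .dvc pointer file (simplified, for illustration)."""
    return (
        "outs:\n"
        f"- md5: {file_md5(path)}\n"
        f"  size: {path.stat().st_size}\n"
        f"  path: {path.name}\n"
    )


# Hypothetical dataset written to a temporary folder for the demo.
data_file = Path(tempfile.mkdtemp()) / "data.csv"
data_file.write_text("id,value\n1,10\n2,20\n")
print(pointer_text(data_file))
```

The pointer stays a few lines long no matter how large the dataset is, which is why it is cheap to commit to Git while the data itself lives elsewhere.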

# What is pipeline reproducibility?
Pipeline reproducibility in the context of machine learning refers to the ability to recreate and reproduce the entire end-to-end workflow or pipeline, including data processing, model training, and evaluation, in a consistent and predictable manner. Achieving pipeline reproducibility is essential for several reasons:

1. Consistent Results: Reproducibility ensures that the same set of input data, code, and model training process will yield consistent results across different runs. This is crucial for building trust in the reliability of machine learning models.
2. Debugging and Diagnosing Issues: In the event of unexpected results or errors, having a reproducible pipeline allows developers and data scientists to easily identify the cause of issues by rerunning the same pipeline with the same conditions.
3. Collaboration: When multiple team members collaborate on a machine learning project, pipeline reproducibility ensures that everyone can work with the same consistent set of inputs and conditions. This facilitates collaboration, code sharing, and troubleshooting.
4. Model Versioning: Reproducibility is closely tied to model versioning. By versioning both the code and the data used in model training, it becomes possible to recreate the exact environment and conditions under which a model was trained.
5. Experimentation and Hyperparameter Tuning: In the context of hyperparameter tuning and experimentation, reproducibility is crucial for comparing different model configurations. Researchers can track the performance of various model versions and hyperparameter settings over time.

To achieve pipeline reproducibility, consider the following best practices:

- Version Control: Version control both your code (using tools like Git) and your data (using tools like DVC or a dedicated data version control system).

- Environment Management: Use virtual environments or containerization tools (e.g., Docker) to manage the software dependencies and environment in which your machine learning pipeline runs.

- Dependency Tracking: Explicitly track dependencies between different stages of your pipeline. Tools like DVC can help manage these dependencies.

- Documentation: Document the steps and configurations used in your pipeline. This includes details about data preprocessing, feature engineering, model architecture, and hyperparameters.
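One small ingredient of consistent results is controlling randomness. The sketch below is a hypothetical helper (`split_dataset` is not part of any library mentioned here) showing that a train/test split driven by a seeded generator produces the same split on every run.

```python
import random


def split_dataset(rows, test_fraction=0.25, seed=42):
    """Shuffle and split rows using a private, seeded random generator."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))
train_a, test_a = split_dataset(rows, seed=42)
train_b, test_b = split_dataset(rows, seed=42)

# Both calls use the same seed, so the two splits are compared for equality.
print(train_a == train_b and test_a == test_b)
```

Recording the seed alongside the code and data versions is what makes such a split recoverable later. Libraries with their own random generators need to be seeded separately.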

# dvc steps Notes

1. Initialize DVC first: `dvc init`

2. This automatically creates a `.dvc` folder and a `.dvcignore` file.

3. To start versioning data, run `dvc add <path to the data file or directory>`.

4. The cached versions of the data are stored under `.dvc/cache/files/md5` (the `md5` folder is named after the hash algorithm in use), organized by each file's content hash.

5. The file `data.csv.dvc` (named after the tracked file, here `data.csv`) holds the hash of the currently tracked version of the data.

6. To view the source code at a given commit, run `git checkout <commit id>` (the first 5-6 characters are usually enough). Then run `dvc checkout` to bring the data in line with that commit.

7. To set up remote storage for the data, run `dvc remote add -d remote_storage path/to/your/dvc_remote`. Here `-d` means default (it makes this remote the default one), and the path can be a local folder or a URL for a supported storage backend. The remote's details are saved in `.dvc/config`. Run `dvc push` to upload the data to the remote, and run it again each time a new version of the data should be stored there.

# dvc reproducibility steps (pipeline versioning)

1. Create a file named `dvc.yaml`. Because it is a YAML file, the pipeline is written as key-value pairs.

2. Run `dvc repro` in the terminal. DVC looks for the `dvc.yaml` file and runs the stages defined in it, skipping stages whose dependencies have not changed.

3. DVC automatically generates a `dvc.lock` file, which records the hashes of each stage's dependencies and outputs as of the latest run.

4. `.dvc/cache/runs` keeps a record of previous stage runs (the run cache), so results that DVC has already produced can be restored instead of recomputed.

5. `.dvc/cache/files` stores the outputs produced by the pipeline across those runs.

6. The `dvc dag` command visualizes the dependency graph (DAG) of your machine learning pipeline.

7. To send updates to remote storage, run `dvc push`.
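As a sketch of what step 1 might produce, the snippet below writes a two-stage `dvc.yaml` from Python, which is convenient in a notebook cell; in practice the file is usually created by hand. The stage names, script paths, and data paths are all placeholders, and a temporary folder stands in for the project root.

```python
import tempfile
from pathlib import Path

# Hypothetical two-stage pipeline; script and data names are placeholders.
DVC_YAML = """\
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/clean.csv
    outs:
      - models/model.pkl
"""

project_dir = Path(tempfile.mkdtemp())  # stands in for the project root
(project_dir / "dvc.yaml").write_text(DVC_YAML)
print((project_dir / "dvc.yaml").read_text())
```

Each stage lists the command to run (`cmd`), the files it depends on (`deps`), and the files it produces (`outs`). Because `train` depends on `data/clean.csv`, which `preprocess` outputs, DVC can infer that `preprocess` must run first.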

# dvc dag (directed acyclic graph)

The `dvc dag` command in Data Version Control (DVC) is used to visualize the dependency graph (DAG) of your machine learning pipeline. The DAG represents the relationships and dependencies between different stages or steps in your pipeline. Each stage corresponds to an entry under `stages` in `dvc.yaml`, that is, one step in your workflow.

Here's how you can use `dvc dag`:

1. **Generate DAG:**
   - To generate and visualize the DAG, use the following command:

     ```bash
     dvc dag
     ```

2. **Viewing the DAG:**
   - The `dvc dag` command outputs a graph representation of your pipeline in a textual format. This graph illustrates the dependencies between stages.

   - The nodes in the graph represent DVC stages, and the edges represent dependencies between stages.

3. **Graph Visualization Tools:**
   - If you want a more visually appealing representation, you can use graph visualization tools like `dot` (Graphviz) to render the graph. Pass the `--dot` flag so that `dvc dag` emits DOT format, redirect the output to a file, and then use Graphviz to create an image.

     ```bash
     dvc dag --dot > dag.dot
     dot -Tpng dag.dot -o dag.png
     ```

   - Open the generated `dag.png` file to view the DAG as an image.

Here's a simplified example to illustrate how a DAG might look:

```
         +-------------+
         | load_data   |
         +------+------+
                |
                v
         +------+------+
         | preprocess  |
         +------+------+
                |
                v
     +----------+----------+
     | feature_engineering |
     +----------+----------+
                |
                v
         +------+------+
         | train_model |
         +------+------+
                |
                v
         +------+------+
         | evaluate    |
         +-------------+
```

In this example, each box represents a stage in your DVC pipeline, and arrows indicate dependencies between stages. The `dvc dag` command helps you visualize and understand the structure of your pipeline.

Keep in mind that the actual structure of your DAG will depend on how you've defined your DVC stages and their dependencies in your project. The DAG provides insights into the order in which stages are executed and the relationships between them.
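The execution order implied by a DAG can be derived with a topological sort. The sketch below uses Python's standard `graphlib` module on a hypothetical five-stage pipeline. It illustrates the ordering idea only and is not how DVC itself is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each stage maps to the stages it depends on.
stage_deps = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "feature_engineering": {"preprocess"},
    "train_model": {"feature_engineering"},
    "evaluate": {"train_model"},
}

# static_order() yields every stage after all of its dependencies.
execution_order = list(TopologicalSorter(stage_deps).static_order())
print(execution_order)
```

Because the graph must be acyclic, a stage that depended on its own output, directly or indirectly, would make the sort fail with a `CycleError`. A cycle likewise means the pipeline has no valid execution order.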

![image.png](attachment:image.png)