# GitHub + Google Colab + DVC (with Amazon S3 Storage)

This notebook provides a comprehensive guide to setting up and using DVC (Data Version Control) with Google Colab and Amazon S3 as the remote storage. This combination is ideal for scalable and reproducible machine learning experiments in a cloud environment.

## Prerequisites for Amazon S3

Before starting, you need to set up your AWS account:

1.  **AWS Account:** You must have an active AWS account.
2.  **S3 Bucket:** Create an S3 bucket in your chosen AWS region (e.g., `us-east-1`, `ap-south-1`). This bucket will be your DVC remote storage. Make sure the bucket name is globally unique.
3.  **IAM User with Permissions:** It's highly recommended to create a dedicated IAM (Identity and Access Management) user for DVC with programmatic access.
    * Go to IAM -> Users -> Add user.
    * Give it a name (e.g., `dvc-colab-user`).
    * Select "Access key - Programmatic access" as the credential type (the exact console wording may differ between AWS console versions).
    * For permissions, choose to attach policies directly, then **create a custom policy** that grants access *only* to your specific S3 bucket. A minimal policy would look like this (replace `your-dvc-bucket-name` with your actual bucket name):

    ```json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::your-dvc-bucket-name",
                    "arn:aws:s3:::your-dvc-bucket-name/*"
                ]
            }
        ]
    }
    ```
    * Review and create the user. **Crucially, save the Access Key ID and Secret Access Key when they are displayed.** You will not be able to retrieve the Secret Access Key again. Treat these as highly sensitive credentials.
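
If you set up more than one bucket, generating the policy document from the bucket name avoids copy-paste mistakes. The following is a minimal sketch using only the standard library. The function name `build_dvc_s3_policy` is hypothetical, and the action list is the list/get/put/delete set from the policy above.

```python
import json


def build_dvc_s3_policy(bucket_name: str) -> dict:
    """Return an IAM policy document scoped to a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                ],
                # The bucket ARN covers listing; the /* ARN covers the objects in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


# 'your-dvc-bucket-name' is a placeholder; substitute your own bucket.
policy = build_dvc_s3_policy("your-dvc-bucket-name")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM policy editor's JSON tab.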


## Hands-On Coding Examples (with Amazon S3)

Let's set up our environment and perform DVC operations using Amazon S3.

### Setting Up the Environment

First, we need to install DVC and the AWS CLI, then configure our AWS credentials in Google Colab.

In [1]:
# Cell 1: Install DVC with S3 support and the AWS CLI
# 'dvc[s3]' pulls in the S3 dependencies DVC needs to talk to the bucket.
# awscli is optional: DVC does not call it, but it is handy for checking credentials (Cell 3).
!pip install "dvc[s3]" awscli

Collecting awscli
  Downloading awscli-1.41.4-py3-none-any.whl.metadata (11 kB)
Collecting dvc[s3]
  Downloading dvc-3.61.0-py3-none-any.whl.metadata (17 kB)
Collecting celery (from dvc[s3])
  Downloading celery-5.5.3-py3-none-any.whl.metadata (22 kB)
Collecting colorama>=0.3.9 (from dvc[s3])
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting configobj>=5.0.9 (from dvc[s3])
  Downloading configobj-5.0.9-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting dpath<3,>=2.1.0 (from dvc[s3])
  Downloading dpath-2.2.0-py3-none-any.whl.metadata (15 kB)
Collecting dulwich (from dvc[s3])
  Downloading dulwich-0.23.2-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (5.2 kB)
Collecting dvc-data<3.17,>=3.16.2 (from dvc[s3])
  Downloading dvc_data-3.16.10-py3-none-any.whl.metadata (5.0 kB)
Collecting dvc-http>=2.29.0 (from dvc[s3])
  Downloading dvc_http-2.32.0-py3-none-any.whl.metadata (1.3 kB)
Collecting dvc-objects (from dvc[s3])
  Downloading dvc_objects-5.1.1-py3-none-any.

In [2]:
# Cell 2: Configure AWS Credentials
# We will set AWS credentials as environment variables for the Colab session.
# DO NOT hardcode these in a public notebook.
# In a real project, use Colab's "Secrets" feature or AWS Secrets Manager.
# For this tutorial, we prompt with getpass() so the values are not echoed or saved in the notebook.

import os
import getpass

# Prompt for AWS credentials
# With Colab's "Secrets" panel (key icon in the left sidebar), you would instead read
# them with: from google.colab import userdata; userdata.get('AWS_ACCESS_KEY_ID')
os.environ['AWS_ACCESS_KEY_ID'] = getpass.getpass('Enter your AWS Access Key ID: ')
os.environ['AWS_SECRET_ACCESS_KEY'] = getpass.getpass('Enter your AWS Secret Access Key: ')

# Set your desired AWS region (e.g., 'us-east-1', 'ap-south-1')
# This should match the region where your S3 bucket is located.
AWS_REGION = 'us-east-1' # Example region, change this to your bucket's region
os.environ['AWS_DEFAULT_REGION'] = AWS_REGION

print(f"AWS credentials and region '{AWS_REGION}' configured.")
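
The comments in Cell 2 mention Colab's Secrets feature as the safer option. One way to wire that up is sketched below, under some assumptions: the secrets are stored under the same names as the environment variables, and the helper `export_secrets` is hypothetical. The lookup is passed in as a callable so the same function can be dry-run outside Colab.

```python
from typing import Callable, MutableMapping, Optional

# Names under which the secrets are assumed to be stored (an illustrative choice).
SECRET_NAMES = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def export_secrets(
    get_secret: Callable[[str], Optional[str]],
    env: MutableMapping[str, str],
) -> list:
    """Copy each named secret into `env`; return the names that were missing."""
    missing = []
    for name in SECRET_NAMES:
        value = get_secret(name)
        if value:
            env[name] = value
        else:
            missing.append(name)
    return missing


# In Colab (assumption: the secrets exist and notebook access is enabled):
#   import os
#   from google.colab import userdata
#   export_secrets(userdata.get, os.environ)
# Note that userdata.get may raise rather than return None for an absent secret.

# Dry run outside Colab, using a dict lookup in place of the secrets store:
demo_env = {}
missing = export_secrets({"AWS_ACCESS_KEY_ID": "example-key-id"}.get, demo_env)
print("exported:", sorted(demo_env), "missing:", missing)
```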

In [None]:
# Cell 3: Verify AWS CLI configuration (optional, but good for debugging)
!aws configure list

      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************HCU2              env    
secret_key     ****************5x4U              env    
    region                us-east-1              env    AWS_DEFAULT_REGION


In [None]:
# Cell 4: Create and navigate to our project directory
# We'll work in /content/ for this example as we don't need Google Drive mounting.
%cd /content/
%mkdir -p dvc_s3_colab_project
%cd dvc_s3_colab_project

/content
/content/dvc_s3_colab_project


### Initializing Git and DVC

Now, let's initialize a Git repository and DVC within our project directory.

In [None]:
# Cell 5: Initialize Git repository
!git init

Reinitialized existing Git repository in /content/dvc_s3_colab_project/.git/


In [None]:
# Cell 6: Initialize DVC
!dvc init
# Cell 5 created a Git repository, so plain `dvc init` is used here: it lets DVC
# maintain .gitignore entries for tracked data and its cache.
# Use --no-scm only when the project has no Git repository.

ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.

In [None]:
# Cell 7: Configure DVC remote to Amazon S3
# REPLACE the value of S3_BUCKET_NAME with the name of YOUR S3 bucket.
# The path after the bucket name is optional; here the data is stored under dvc_data_store/.
S3_BUCKET_NAME = 's3-dvc-buck' # <--- IMPORTANT: Replace with your actual S3 bucket name
!dvc remote add -d s3_remote s3://{S3_BUCKET_NAME}/dvc_data_store/

Setting 's3_remote' as a default remote.
ERROR: configuration error - config file error: remote 's3_remote' already exists. Use `-f|--force` to overwrite it.

In [None]:
# Set the Git identity used for commits (replace with your own email and name)
!git config --global user.email "codeboosterstech@gmail.com"
!git config --global user.name "codeboosterstech"

In [None]:
# Cell 8: Add and commit DVC configuration files to Git
!git add .dvc/config .dvcignore
!git commit -m "Initialize Git and DVC for S3"

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.dvc/cache/
	.dvc/tmp/
	my_s3_dataset.csv
	my_s3_dataset.csv.dvc

nothing added to commit but untracked files present (use "git add" to track)


### Tracking Data with DVC

Let's create a dummy dataset and track it with DVC.

In [None]:
# Cell 9: Create a dummy dataset
import pandas as pd
import numpy as np

# Create a simple DataFrame
data = {
    'feature1': np.random.rand(100),
    'feature2': np.random.randint(0, 10, 100),
    'target': np.random.rand(100) * 10
}
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
DATA_FILE_NAME = 'my_s3_dataset.csv'
df.to_csv(DATA_FILE_NAME, index=False)

print(f"'{DATA_FILE_NAME}' created.")
!ls -lh

'my_s3_dataset.csv' created.
total 8.0K
-rw-r--r-- 1 root root 3.9K Jul 11 00:17 my_s3_dataset.csv
-rw-r--r-- 1 root root   97 Jul 11 00:15 my_s3_dataset.csv.dvc


In [None]:
# Cell 10: Add the dataset to DVC
!dvc add $DATA_FILE_NAME

Adding...: 100% 1/1 [00:00<00:00, 17.54file/s]

In [None]:
# Cell 11: Commit the .dvc file to Git
!git add my_s3_dataset.csv.dvc
# DVC writes .gitignore only when it was initialized with Git support, so stage it only if present.
!test -f .gitignore && git add .gitignore
!git commit -m "Add initial S3 dataset"

### Pushing Data to Amazon S3

Now that DVC is tracking our `my_s3_dataset.csv` and Git knows about `my_s3_dataset.csv.dvc`, let's push the actual data to our Amazon S3 remote.

In [None]:
# Cell 12: Push the DVC-tracked data to the remote (Amazon S3)
# This will upload the cached data to your S3 bucket.
!dvc push
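
`dvc push` uploads objects keyed by content hash rather than by file name, which is why the small `.dvc` file in Git is enough to locate the data. As a rough illustration, assuming DVC 3.x (which hashes the raw file bytes with MD5 and lays out its cache and remote as `files/md5/<first two hex characters>/<remainder>`), the sketch below computes where a file would land. The helper names are hypothetical and this is not DVC's own code.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def object_relpath(md5_hex: str) -> str:
    """Relative object path, assuming the DVC 3.x `files/md5` layout."""
    return f"files/md5/{md5_hex[:2]}/{md5_hex[2:]}"


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "hash_demo.csv"
    sample.write_text("feature1,feature2,target\n0.1,3,4.2\n")
    md5_hex = file_md5(str(sample))
    print("md5:", md5_hex)
    print("object path under the remote URL:", object_relpath(md5_hex))
```

Running `file_md5('my_s3_dataset.csv')` in the project directory should give the value stored in the `md5` field of `my_s3_dataset.csv.dvc`, if the layout assumption holds for your DVC version.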

### Simulating a New Session / New User

Let's imagine you close this Colab notebook, or a teammate wants to work on your project. They would clone your Git repository and then `dvc pull` the data.

In [None]:
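# Cell 13: Simulate a clean environment (remove current data)
# This removes the data file from the workspace; the .dvc file and the copies
# in the local cache and on the remote remain.
# Note: while the local cache exists, `dvc pull` restores from it without downloading.
# To force a real download from S3, also run: !rm -rf .dvc/cache
!rm $DATA_FILE_NAME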

# Verify it's gone
!ls -lh

In [None]:
# Cell 14: Pull the data back using DVC from S3
!dvc pull

In [None]:
# Cell 15: Verify data is back
!ls -lh

### Versioning Data Changes

Let's modify our dataset and see how DVC helps us track the changes.

In [None]:
# Cell 16: Modify the dataset
import pandas as pd

# Load the existing dataset
df = pd.read_csv(DATA_FILE_NAME)

# Add a new column
df['new_s3_feature'] = df['feature1'] * df['feature2']

# Save the modified dataset
df.to_csv(DATA_FILE_NAME, index=False)

print(f"'{DATA_FILE_NAME}' modified.")
!ls -lh

In [None]:
# Cell 17: Update DVC with the modified dataset
!dvc add $DATA_FILE_NAME

In [None]:
# Cell 18: Commit the updated .dvc file to Git
!git add my_s3_dataset.csv.dvc
!git commit -m "Update dataset with new_s3_feature"

In [None]:
# Cell 19: Push the updated data to the remote (S3)
!dvc push

### Viewing Data History

`dvc status` reports whether tracked files differ from what their `.dvc` files record, and `dvc diff` compares DVC-tracked data between two Git revisions.

In [None]:
# Cell 20: Check DVC status
!dvc status

## Real-World Use Cases (Same as before, but with S3's benefits)

The real-world use cases are fundamentally the same as with Google Drive (Reproducible ML Experiments, Collaborative Data Science, Model Deployment/Rollback), but S3 offers distinct advantages:

* **Scalability:** S3 is designed for massive scale, handling petabytes of data, making it suitable for very large datasets and models.
* **Performance:** Generally offers higher performance for data transfer compared to Google Drive, especially for programmatic access.
* **Security & Access Control:** AWS IAM provides granular control over who can access your data and how. You can create specific policies for DVC users, services, or roles, ensuring strong security.
* **Integration with AWS Ecosystem:** Seamless integration with other AWS services like EC2, SageMaker, Lambda, etc., which is beneficial if your ML pipeline is already heavily invested in AWS.
* **Industry Standard:** S3 is a de-facto standard for cloud object storage in many enterprise and production environments.

### Example: Reproducible ML Training with S3

**Scenario:** An ML engineer is developing a model to forecast demand for an e-commerce platform. They need to experiment with different versions of historical sales data, and the data volumes are growing rapidly.

**DVC (with S3) Application:**
* **Data Ingestion:** Raw sales data is regularly uploaded to an S3 landing zone.
* **Preprocessing:** A Colab notebook (or an AWS Glue/SageMaker processing job) processes this raw data, and the *processed version* is `dvc add`ed and `dvc push`ed to the DVC-specific S3 bucket.
* **Model Training:** Different Colab notebooks (or EC2/SageMaker instances) can `dvc pull` specific versions of the processed data from S3, train models, and then `dvc add` and `dvc push` the resulting model artifacts (e.g., `model.pkl`, `evaluation_metrics.json`) back to the same S3 DVC remote.
* **Version Control:** The Git repository tracks the code for data processing, model training, and the `.dvc` files that point to the data and model versions in S3.
* **Reproducibility:** If a stakeholder asks, "What data was used for Model v3.1?", the engineer can `git checkout` the v3.1 commit and `dvc pull` to instantly retrieve the exact data and model that produced those results, directly from S3.
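
The reproducibility step can also be done from Python without switching the working tree. The sketch below assumes a Git-tracked DVC project in the current directory and uses `dvc.api.read` to fetch a file as it existed at a given Git revision. The function names and the `v3.1` tag are illustrative assumptions, not part of the project above.

```python
import csv
import io


def parse_csv_text(text: str) -> list:
    """Turn CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def read_dataset_at_revision(path: str, rev: str) -> list:
    """Read a DVC-tracked CSV as it existed at Git revision `rev`."""
    # Imported lazily so the parsing helper works without DVC installed.
    import dvc.api

    # With `repo` omitted, dvc.api.read uses the DVC project in the current directory.
    text = dvc.api.read(path, rev=rev)
    return parse_csv_text(text)


# Hypothetical usage; 'v3.1' is an assumed Git tag in your repository:
#   rows = read_dataset_at_revision("my_s3_dataset.csv", rev="v3.1")
#   print(list(rows[0].keys()))

# The parsing helper on its own:
print(parse_csv_text("feature1,target\n0.5,4.2\n"))
```

Reading by revision leaves the workspace untouched, which is convenient when comparing several data versions side by side in one notebook.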


## Hands-On Practice Tasks (with Amazon S3)

These tasks are identical in concept to the Google Drive ones, but you'll be using your S3 remote.

### Task 1: Create and Track a New Dataset (S3)

1.  Create a new Python script (e.g., `generate_s3_data.py`) that generates a simple CSV file called `another_s3_dataset.csv` with 50 rows and 3 random columns.
2.  Run the script to generate the file.
3.  Use DVC to track `another_s3_dataset.csv`.
4.  Commit the `.dvc` file to Git.
5.  Push the actual data to your DVC Amazon S3 remote.


### Task 2: Simulate a Model Training Workflow (S3)

1.  Create a dummy Python script named `train_s3_model.py`. This script should:
    * Load `my_s3_dataset.csv`.
    * Perform a simple "model training" (e.g., calculate the mean of the `target` column and save it to a text file named `s3_model_output.txt`).
    * Save `s3_model_output.txt` in a new directory named `s3_models/`.
2.  Run `train_s3_model.py`.
3.  Track the `s3_models/` directory (and its contents) with DVC.
4.  Commit the relevant `.dvc` file to Git.
5.  Push the `s3_models/` directory content to your DVC Amazon S3 remote.


### Task 3: Revert Data to a Previous Version (S3)

1.  Modify `my_s3_dataset.csv` again: add a fourth random column.
2.  Track this modified dataset with DVC and commit its `.dvc` file to Git. Do *not* push the data to the remote yet.
3.  Now, use Git and DVC to revert `my_s3_dataset.csv` back to its *first* version (the one without `new_s3_feature` or the fourth column).
    * Hint: You will need to use `git log` to find the commit hash of the first dataset addition, then `git checkout` that commit, and finally `dvc pull`.
4.  Verify that `my_s3_dataset.csv` indeed reverted to its original state (check its columns).


## Detailed Solutions with Explanations (for S3 Tasks)

### Solution to Task 1: Create and Track a New Dataset (S3)

**Explanation:**
This is a direct application of the `dvc add`, `git add`, `git commit`, `dvc push` workflow, now pointing to S3.

**Code:**

In [None]:
# Task 1: Create and Track a New Dataset (S3)

# 1. Create a new Python script (e.g., generate_s3_data.py)


In [None]:
%%writefile generate_s3_data.py
import pandas as pd
import numpy as np

def generate_data(num_rows=50):
    data = {
        f's3_col{i}': np.random.rand(num_rows) for i in range(1, 4)
    }
    df = pd.DataFrame(data)
    df.to_csv('another_s3_dataset.csv', index=False)
    print("Generated 'another_s3_dataset.csv'")

if __name__ == "__main__":
    generate_data()

In [None]:
# 2. Run the script to generate the file.
!python generate_s3_data.py
!ls -lh

In [None]:
# 3. Use DVC to track another_s3_dataset.csv.
!dvc add another_s3_dataset.csv

In [None]:
# 4. Commit the .dvc file to Git.
!git add another_s3_dataset.csv.dvc
!git commit -m "Add another_s3_dataset.csv"

In [None]:
# 5. Push the actual data to your DVC Amazon S3 remote.
!dvc push

### Solution to Task 2: Simulate a Model Training Workflow (S3)

**Explanation:**
Similar to the previous model tracking, but now the `s3_models/` directory and its contents will be stored in your S3 bucket.

**Code:**

In [None]:
# Task 2: Simulate a Model Training Workflow (S3)

# 1. Create a dummy Python script named train_s3_model.py


In [None]:
%%writefile train_s3_model.py
import pandas as pd
import os

def train_and_save_model(data_path='my_s3_dataset.csv', output_dir='s3_models/'):
    os.makedirs(output_dir, exist_ok=True)  # no-op if the directory already exists

    df = pd.read_csv(data_path)

    # Simulate a simple "model" (e.g., calculating mean of a column)
    target_mean = df['target'].mean()

    model_output_path = os.path.join(output_dir, 's3_model_output.txt')
    with open(model_output_path, 'w') as f:
        f.write(f"Mean of target column: {target_mean}\n")

    print(f"Model output saved to '{model_output_path}'")

if __name__ == "__main__":
    train_and_save_model()

In [None]:
# 2. Run train_s3_model.py.
!python train_s3_model.py
!ls -lh
!ls -lh s3_models/

In [None]:
# 3. Track the s3_models/ directory (and its contents) with DVC.
!dvc add s3_models/

In [None]:
# 4. Commit the relevant .dvc file to Git.
!git add s3_models.dvc
!git commit -m "Add trained S3 model output"

In [None]:
# 5. Push the s3_models/ directory content to your DVC Amazon S3 remote.
!dvc push

### Solution to Task 3: Revert Data to a Previous Version (S3)

**Explanation:**
This task reaffirms the Git+DVC revert mechanism, demonstrating that it works seamlessly regardless of the DVC remote type (Google Drive or S3). The core is `git checkout` to select the `.dvc` file version, followed by `dvc pull` to fetch the corresponding large data from S3.

**Code:**

In [None]:
# Task 3: Revert Data to a Previous Version (S3)

# 1. Modify my_s3_dataset.csv again: add a fourth random column.
import pandas as pd
import numpy as np

df = pd.read_csv(DATA_FILE_NAME)
df['fourth_s3_feature'] = np.random.rand(len(df)) * 20
df.to_csv(DATA_FILE_NAME, index=False)

print("my_s3_dataset.csv modified with 'fourth_s3_feature'")
print(df.head())

In [None]:
# 2. Track this modified dataset with DVC and commit its .dvc file to Git.
# Do NOT push the data to the remote yet.
!dvc add $DATA_FILE_NAME
!git add my_s3_dataset.csv.dvc
!git commit -m "Add fourth_s3_feature to my_s3_dataset.csv"

In [None]:
# 3. Now, use Git and DVC to revert my_s3_dataset.csv back to its first version
# (the one without new_s3_feature or the fourth column).

# First, find the commit hash for "Add initial S3 dataset"
# This will show you the commit history. Look for the message.
!git log --oneline

In [None]:
# Replace 'initial_s3_commit_hash' with the actual hash from your git log output
# For example: initial_s3_commit_hash = 'abcdef1'
initial_s3_commit_hash = 'YOUR_INITIAL_S3_COMMIT_HASH_HERE' # <--- REPLACE THIS

# Checkout the specific Git commit
!git checkout $initial_s3_commit_hash

In [None]:
# Now, pull the data associated with this commit using DVC
!dvc pull

In [None]:
# 4. Verify that my_s3_dataset.csv indeed reverted to its original state.
import pandas as pd

df_reverted = pd.read_csv(DATA_FILE_NAME)
print(df_reverted.head())
print(f"Columns after revert: {df_reverted.columns.tolist()}")
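
Printing the columns relies on reading the output by eye. A small check can make the verification explicit. This is a sketch using only the standard library; the helper names are hypothetical, and the expected columns are those created in Cell 9.

```python
import csv

# Columns of the first dataset version, as created in Cell 9.
ORIGINAL_COLUMNS = ["feature1", "feature2", "target"]


def csv_header(path: str) -> list:
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as fh:
        return next(csv.reader(fh), [])


def is_original_version(path: str) -> bool:
    """True when the file has exactly the columns of the first dataset version."""
    return csv_header(path) == ORIGINAL_COLUMNS


# Hypothetical usage after the revert:
#   assert is_original_version("my_s3_dataset.csv"), "dataset was not reverted"
```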

In [None]:
# Go back to the latest state of your master branch
!git checkout master
!dvc checkout # Restore the latest data from the local cache (this version was never pushed to S3)