![image](https://user-images.githubusercontent.com/57321948/196933065-4b16c235-f3b9-4391-9cfe-4affcec87c35.png)

# Submitted by: Mohammad Wasiq

## Email: `gl0427@myamu.ac.in`

# Pre-Placement Training Assignment - `Big Data` 

## Docker, Airflow, Sqoop


## TOPIC: Docker

**Q1. Scenario: You are building a microservices-based application using Docker. Design a Docker Compose file that sets up three containers: a web server container, a database container, and a cache container. Ensure that the containers can communicate with each other properly.**

In [None]:
version: "3"
services:
  web:
    build: ./web
    ports:
      - "80:80"
    depends_on:
      - db
      - cache
  db:
    image: postgres:latest
    environment:
      - POSTGRES_USER=your_username
      - POSTGRES_PASSWORD=your_password
      - POSTGRES_DB=your_database_name
    ports:
      - "5432:5432"
    volumes:
      - ./data:/var/lib/postgresql/data
  cache:
    image: redis:latest
    ports:
      - "6379:6379"
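
Compose attaches all three services to a default network where each service name (`web`, `db`, `cache`) resolves as a hostname. The sketch below shows how code inside the `web` container might build its connection strings from those names. `DB_HOST` and `CACHE_HOST` are hypothetical override variables, not part of the Compose file above, and the credentials fall back to the placeholders used there.

```python
import os


def service_endpoints(env=None):
    """Build connection URLs for the db and cache services.

    Hostnames default to the Compose service names ("db", "cache"),
    which the default Compose network resolves to the right containers.
    """
    env = os.environ if env is None else env
    db_host = env.get("DB_HOST", "db")            # hypothetical override
    cache_host = env.get("CACHE_HOST", "cache")   # hypothetical override
    user = env.get("POSTGRES_USER", "your_username")
    password = env.get("POSTGRES_PASSWORD", "your_password")
    database = env.get("POSTGRES_DB", "your_database_name")
    return {
        "database_url": f"postgresql://{user}:{password}@{db_host}:5432/{database}",
        "cache_url": f"redis://{cache_host}:6379/0",
    }


# With no overrides set, the service names are used as hostnames.
endpoints = service_endpoints({})
print(endpoints["database_url"])
print(endpoints["cache_url"])
```

Containers reach each other on the container ports (5432, 6379) regardless of the host port mappings, which only matter for access from outside Docker.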


**Q2. Scenario: You want to scale your Docker containers dynamically based on the incoming traffic. Write a Python script that utilizes Docker SDK to monitor the CPU usage of a container and automatically scales the number of replicas based on a threshold.**

In [None]:
import time

import docker

client = docker.from_env()

def get_container_cpu_usage(container_id):
    """Return CPU usage in percent, where 100 means one fully used core."""
    container = client.containers.get(container_id)
    stats = container.stats(stream=False)
    cpu_stats = stats['cpu_stats']
    precpu_stats = stats['precpu_stats']
    cpu_delta = cpu_stats['cpu_usage']['total_usage'] - precpu_stats['cpu_usage']['total_usage']
    system_delta = cpu_stats.get('system_cpu_usage', 0) - precpu_stats.get('system_cpu_usage', 0)
    if system_delta <= 0 or cpu_delta < 0:
        return 0.0  # No usable previous sample, so no delta to compute
    online_cpus = cpu_stats.get('online_cpus', 1)
    return cpu_delta / system_delta * online_cpus * 100

def scale_containers(service_name, max_replicas, cpu_threshold):
    """Scale a Swarm service up or down by one replica based on average CPU."""
    # Swarm labels each task container with its service name.
    # Only containers running on the local node are visible here.
    containers = client.containers.list(
        filters={'label': f'com.docker.swarm.service.name={service_name}'}
    )
    if not containers:
        return  # Nothing to measure; also avoids dividing by zero
    total_cpu_usage = sum(get_container_cpu_usage(c.id) for c in containers)
    average_cpu_usage = total_cpu_usage / len(containers)

    # Scaling is a method of the Service object, not of client.services
    service = client.services.get(service_name)
    replicas = service.attrs['Spec']['Mode']['Replicated']['Replicas']
    if average_cpu_usage > cpu_threshold and replicas < max_replicas:
        service.scale(replicas + 1)
    elif average_cpu_usage < cpu_threshold and replicas > 1:
        service.scale(replicas - 1)

if __name__ == "__main__":
    service_name = "your_service_name"  # Name of a replicated Swarm service
    max_replicas = 5
    cpu_threshold = 80.0

    while True:
        scale_containers(service_name, max_replicas, cpu_threshold)
        time.sleep(60)  # Adjust the interval as needed
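
The script compares CPU usage against a single threshold in both directions, so a workload hovering near 80% can gain a replica on one pass and lose it on the next. A possible refinement is to separate the scale-up and scale-down thresholds. The sketch below isolates that decision in a pure function; the `scale_down_at=30.0` default and the function name are assumptions chosen for illustration.

```python
def decide_replicas(current, avg_cpu, scale_up_at=80.0, scale_down_at=30.0,
                    min_replicas=1, max_replicas=5):
    """Return the replica count to aim for, changing by at most one step.

    Usage between scale_down_at and scale_up_at leaves the count unchanged,
    which keeps the service from oscillating around one threshold.
    """
    if avg_cpu > scale_up_at and current < max_replicas:
        return current + 1
    if avg_cpu < scale_down_at and current > min_replicas:
        return current - 1
    return current


print(decide_replicas(2, 90.0))  # 3: above the upper threshold
print(decide_replicas(2, 50.0))  # 2: inside the dead band
print(decide_replicas(2, 10.0))  # 1: below the lower threshold
```

Keeping the decision separate from the Docker calls also makes it easy to test without a running daemon.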

**Q3. Scenario: You have a Docker image stored on a private registry. Develop a script in Bash that authenticates with the registry, pulls the latest version of the image, and runs a container based on that image.**

In [None]:
#!/usr/bin/env bash
set -euo pipefail  # Stop on the first failed command or unset variable

DOCKER_REGISTRY="your-registry-url"
DOCKER_IMAGE="your-image-name"
DOCKER_TAG="latest"

# Authenticate with the Docker registry (prompts for username and password)
docker login "$DOCKER_REGISTRY"

# Pull the latest version of the image
docker pull "$DOCKER_REGISTRY/$DOCKER_IMAGE:$DOCKER_TAG"

# Run a container based on the latest image
docker run -d --name my_container "$DOCKER_REGISTRY/$DOCKER_IMAGE:$DOCKER_TAG"

# Clean up (optional)
docker logout "$DOCKER_REGISTRY"
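
For comparison, the same login, pull, and run sequence can be sketched with the Docker SDK for Python, which suits unattended runs because the credentials come from the environment instead of a prompt. `REGISTRY_USERNAME` and `REGISTRY_PASSWORD` are hypothetical variable names, and `pull_and_run` and `image_reference` are illustrative helpers, not part of the Bash script.

```python
import os


def image_reference(registry, image, tag="latest"):
    """Join registry, image name, and tag into a full image reference."""
    return f"{registry}/{image}:{tag}"


def pull_and_run(registry, image, tag="latest", name="my_container"):
    """Log in to the registry, pull the image, and start a detached container."""
    import docker  # third-party Docker SDK, imported only when actually used

    client = docker.from_env()
    client.login(
        username=os.environ["REGISTRY_USERNAME"],  # hypothetical variable name
        password=os.environ["REGISTRY_PASSWORD"],  # hypothetical variable name
        registry=registry,
    )
    client.images.pull(f"{registry}/{image}", tag=tag)
    return client.containers.run(
        image_reference(registry, image, tag), detach=True, name=name
    )


print(image_reference("your-registry-url", "your-image-name"))
```

Calling `pull_and_run("your-registry-url", "your-image-name")` requires a reachable Docker daemon and valid credentials.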

## TOPIC: Airflow

**Q1. Scenario: You have a data pipeline that requires executing a shell command as part of a task. Create an Airflow DAG that includes a BashOperator to execute a specific shell command.**


In [None]:
from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path
from datetime import datetime

# Define the DAG
dag = DAG(
    dag_id='execute_shell_command',
    start_date=datetime(2023, 7, 1),
    schedule_interval='0 0 * * *',  # Runs once daily at midnight
    catchup=False  # Do not backfill runs between start_date and today
)

# Define the task
execute_command_task = BashOperator(
    task_id='execute_shell_command_task',
    bash_command='your_shell_command_here',
    dag=dag
)

**Q2. Scenario: You want to create dynamic tasks in Airflow based on a list of inputs. Design an Airflow DAG that generates tasks dynamically using PythonOperator, where each task processes an element from the input list.**



In [None]:
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path
from datetime import datetime

def process_input(input_value):
    # Add your processing logic here
    print(f"Processing input: {input_value}")

# Define the DAG
dag = DAG(
    dag_id='dynamic_tasks',
    start_date=datetime(2023, 7, 1),
    schedule_interval=None  # Run manually or triggered by another DAG
)

# Define the list of inputs
input_list = ['input1', 'input2', 'input3']

# Generate tasks dynamically
for input_value in input_list:
    task_id = f'task_{input_value}'
    task = PythonOperator(
        task_id=task_id,
        python_callable=process_input,
        op_kwargs={'input_value': input_value},
        dag=dag
    )
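
When the input list comes from real data, the generated `task_id` values need care: Airflow accepts only alphanumeric characters, dashes, dots, and underscores in a task ID, and rejects duplicate IDs within a DAG. The sketch below shows one way to derive safe IDs before creating the operators. `make_task_id` and `build_task_ids` are hypothetical helpers, and the sample inputs are invented for illustration.

```python
import re

# Anything outside the characters Airflow allows in a task_id
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_.-]+")


def make_task_id(value, prefix="task_"):
    """Turn an arbitrary input value into a task_id-safe string."""
    cleaned = _INVALID_CHARS.sub("_", str(value)).strip("_")
    if not cleaned:
        raise ValueError(f"cannot derive a task_id from {value!r}")
    return f"{prefix}{cleaned}"


def build_task_ids(inputs):
    """Return one task_id per input, failing early on collisions."""
    task_ids = [make_task_id(value) for value in inputs]
    duplicates = sorted({t for t in task_ids if task_ids.count(t) > 1})
    if duplicates:
        raise ValueError(f"duplicate task ids: {duplicates}")
    return task_ids


print(build_task_ids(["input1", "sales 2023/Q1", "eu-west.csv"]))
# ['task_input1', 'task_sales_2023_Q1', 'task_eu-west.csv']
```

Each returned ID can then be passed as `task_id` in the loop, with the original value still supplied through `op_kwargs`.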

**Q3. Scenario: You need to set up a complex task dependency in Airflow, where Task B should start only if Task A has successfully completed. Implement this dependency using the "TriggerDagRunOperator" in Airflow.**

In [None]:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from datetime import datetime

# DAG 1: runs Task A, then triggers the DAG that contains Task B
dag = DAG(
    dag_id='complex_dependency',
    start_date=datetime(2023, 7, 1),
    schedule_interval=None  # Run manually or triggered by another DAG
)

# Placeholder for Task A; replace the command with the real work
task_a = BashOperator(
    task_id='task_a',
    bash_command='echo "Task A"',
    dag=dag
)

# Runs only if Task A succeeds, because the default trigger rule is all_success
trigger_task_b_op = TriggerDagRunOperator(
    task_id='trigger_task_b',
    trigger_dag_id='task_b_dag',  # Hypothetical id; must differ from this DAG's id
    dag=dag
)

# DAG 2: contains Task B and has no schedule, so it runs only when triggered
task_b_dag = DAG(
    dag_id='task_b_dag',
    start_date=datetime(2023, 7, 1),
    schedule_interval=None
)

# Placeholder for Task B; replace the command with the real work
task_b = BashOperator(
    task_id='task_b',
    bash_command='echo "Task B"',
    dag=task_b_dag
)

# Set up the task dependency: Task A must succeed before Task B's DAG is triggered
task_a >> trigger_task_b_op

## TOPIC: Sqoop

**Q1. Scenario: You want to import data from an Oracle database into Hadoop using Sqoop, but you only need to import specific columns from a specific table. Write a Sqoop command that performs the import, including the necessary arguments for column selection and table mapping.**

In [None]:
sqoop import \
  --connect jdbc:oracle:thin:@//host:port/service_name \
  --username your_username \
  --password-file /path/to/password_file \
  --table your_table \
  --columns "column1,column2,column3" \
  --target-dir /path/to/hadoop_directory \
  --as-parquetfile \
  --num-mappers 4
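
If the import is launched from a script, the same command can be assembled as an argument list, which makes the column selection explicit and avoids shell quoting problems. This is a minimal sketch: `build_sqoop_import` is a hypothetical helper, and it only mirrors the flags used in the command above.

```python
def build_sqoop_import(connect, username, password_file, table,
                       columns, target_dir, num_mappers=4):
    """Assemble the sqoop import command above as an argv list."""
    return [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--columns", ",".join(columns),  # Sqoop expects one comma-separated value
        "--target-dir", target_dir,
        "--as-parquetfile",
        "--num-mappers", str(num_mappers),
    ]


argv = build_sqoop_import(
    connect="jdbc:oracle:thin:@//host:port/service_name",
    username="your_username",
    password_file="/path/to/password_file",
    table="your_table",
    columns=["column1", "column2", "column3"],
    target_dir="/path/to/hadoop_directory",
)
print(" ".join(argv))
```

On a machine with Sqoop installed, the list could be passed to `subprocess.run(argv, check=True)`.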

**Q2. Scenario: You have a requirement to perform an incremental import of data from a MySQL database into Hadoop using Sqoop. Design a Sqoop command that imports only the new or updated records since the last import.**

In [None]:
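# lastmodified mode picks up rows inserted or updated since --last-value;
# append mode would only pick up newly inserted rows.
# --merge-key names the primary key column ("id" is a placeholder) so that
# updated rows replace their older copies in the target directory.
sqoop import \
  --connect jdbc:mysql://host:port/database \
  --username your_username \
  --password-file /path/to/password_file \
  --table your_table \
  --target-dir /path/to/hadoop_directory \
  --as-parquetfile \
  --incremental lastmodified \
  --check-column last_modified_date \
  --last-value '2022-01-01 00:00:00' \
  --merge-key id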
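
The idea behind `--check-column` and `--last-value` can be illustrated with a small in-memory model: keep only the rows whose check column is newer than the stored value, then remember the newest value seen for the next run. This is a simplified sketch of the concept, not Sqoop's exact boundary handling, and the sample rows are invented.

```python
from datetime import datetime


def incremental_select(rows, check_column, last_value):
    """Return rows newer than last_value and the value to store for next time."""
    new_rows = [row for row in rows if row[check_column] > last_value]
    next_last_value = max(
        (row[check_column] for row in new_rows), default=last_value
    )
    return new_rows, next_last_value


rows = [
    {"id": 1, "last_modified_date": datetime(2021, 12, 30)},
    {"id": 2, "last_modified_date": datetime(2022, 1, 5)},
    {"id": 3, "last_modified_date": datetime(2022, 2, 1)},
]
selected, next_value = incremental_select(
    rows, "last_modified_date", datetime(2022, 1, 1)
)
print([row["id"] for row in selected])  # [2, 3]
print(next_value)                       # 2022-02-01 00:00:00
```

Sqoop prints the value to use for the next `--last-value` at the end of each incremental import, and a saved Sqoop job tracks it automatically.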

**Q3. Scenario: You need to export data from Hadoop to a Microsoft SQL Server database using Sqoop. Develop a Sqoop command that exports the data, considering factors like database connection details, table mapping, and appropriate data types.**

In [None]:
sqoop export \
  --connect "jdbc:sqlserver://<hostname>:<port>;database=<database>" \
  --username <username> \
  --password-file /path/to/password_file \
  --table <table_name> \
  --export-dir /path/to/hadoop_directory \
  --input-fields-terminated-by ',' \
  --input-lines-terminated-by '\n' \
  --input-null-string '\\N' \
  --input-null-non-string '\\N'
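
The `--input-*` arguments tell Sqoop how to read the files in the export directory: fields are split on commas, records end at a newline, and the token `\N` stands for a SQL `NULL`. The sketch below mimics that interpretation for a single line. It is a simplified model that ignores enclosing and escape characters, and `parse_export_line` is a hypothetical helper.

```python
NULL_TOKEN = r"\N"  # the two characters backslash and N


def parse_export_line(line, field_sep=",", null_token=NULL_TOKEN):
    """Split one record into fields, mapping the null token to None."""
    fields = line.rstrip("\n").split(field_sep)
    return [None if field == null_token else field for field in fields]


print(parse_export_line("42,Alice,\\N\n"))  # ['42', 'Alice', None]
```

Fields that come back as `None` here correspond to the values Sqoop would insert as `NULL`, so the target columns in SQL Server must be nullable.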