# Introduction to Amazon Q Developer

### Overview

Amazon Q Developer is a generative artificial intelligence (AI) powered conversational assistant designed to enhance the software development process, particularly within the AWS ecosystem. It can assist with coding tasks such as providing inline code completions, generating new code, and scanning for security vulnerabilities. Amazon Q Developer can be accessed through an IDE such as VSCode or JupyterLab, or through the command line.

The model behind Amazon Q has been supplemented with high-quality AWS content, allowing users to ask questions about AWS architecture, AWS resources, best practices, documentation, and support.


### Learning Objectives

* Understand the capabilities of Amazon Q Developer and learn how to access the tool through an IDE 
* Engage with Amazon Q Developer and ask code-related queries
* Implement inline code completions 
* Generate new code snippets based on specific requirements 

### Prerequisites

To complete this tutorial, you will need access to SageMaker Studio.
Alternatively, you may install Amazon Q Developer in another IDE such as VSCode or JupyterLab.

### Installing Amazon Q Developer on SageMaker Studio
1. Navigate to SageMaker Studio and create a domain

Note: To make Amazon Q available within JupyterLab, you will need to modify the IAM permissions associated with the SageMaker execution role. You will find the domain name on your SageMaker Studio launch page.


2. Navigate to IAM and search for the execution role pertaining to your SageMaker domain. The execution role will be named AmazonSageMaker-ExecutionRole-(SageMaker domain id)


4. Click on the role and scroll down to the role policies

![alt text](IAM-role.png)

![alt text](role-policy.png)

5. Add the following policy to the IAM role: 

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "q:SendMessage"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "AmazonQDeveloperPermissions",
            "Effect": "Allow",
            "Action": [
                "codewhisperer:GenerateRecommendations"
            ],
            "Resource": "*"
        }
    ]
}
```
6. Review and save your changes

![alt text](iam-policy-review.png)

7. Open your SageMaker Studio UI and create a JupyterLab space

![alt text](jupy-lab.png)

8. Click on the Amazon Q logo on the left panel 

![alt text](amazon-q-jup.png)

Alternatively, you may open a Code Editor application from the SageMaker Studio UI and install Amazon Q as an extension. Please note that when this application is shut down, the extension will be removed.
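
The console steps above for adding the policy can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) is available and that your credentials are allowed to edit the role. The helper names `build_q_policy` and `attach_q_policy` and the policy name `AmazonQDeveloperAccess` are hypothetical and chosen only for illustration.

```python
import json


def build_q_policy() -> dict:
    """Return the Amazon Q permissions from step 5 as a full policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["q:SendMessage"], "Resource": ["*"]},
            {
                "Sid": "AmazonQDeveloperPermissions",
                "Effect": "Allow",
                "Action": ["codewhisperer:GenerateRecommendations"],
                "Resource": "*",
            },
        ],
    }


def attach_q_policy(iam_client, role_name: str,
                    policy_name: str = "AmazonQDeveloperAccess") -> None:
    """Attach the policy inline to the role (policy_name is a hypothetical default)."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_q_policy()),
    )
```

To apply it, pass a client created with `boto3.client("iam")` and the name of your SageMaker execution role to `attach_q_policy`.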

### Let's get started! 


#### Environment Set Up

Download the scripts required to run this tutorial using wget: 

```bash
wget https://raw.githubusercontent.com/STRIDES/NIHCloudLabAWS/refs/heads/drafts/notebooks/GenAI/example_scripts/bioinformatics_testing.py --no-check-certificate 
wget https://raw.githubusercontent.com/STRIDES/NIHCloudLabAWS/refs/heads/drafts/notebooks/GenAI/example_scripts/quick-actions-testing.ipynb --no-check-certificate 

```



### Use Cases

#### Use Case 1: Utilize in-line code completion.

In-line code completion is a feature that helps you write code faster and with fewer errors. As you type, it suggests possible ways to complete your code based on what you've started to write. 

Try it out! Let's add a sixth step to the `bioinformatics_testing.py` script, where we will run samtools sort. As you type, the coding assistant suggests completions inline; press the Tab key to accept a suggestion!

![alt text](code-completion-1.png.png)

Samtools Sorting Example: 
```python
# Step 6: Run Samtools Sort
for index, row in sample_sheet.iterrows():
    bam_file = f"./star_results/{row['sample_id']}.bam"
    sorted_bam_file = f"./star_results/{row['sample_id']}_sorted.bam"
    samtools_sort_command = f"samtools sort {bam_file} -o {sorted_bam_file}"
    subprocess.run(samtools_sort_command, shell=True)
```
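
The example above passes a formatted string to the shell. As an alternative sketch (not part of the downloaded script), the same command can be built as a list of arguments, so that sample IDs containing spaces or shell metacharacters are passed through unchanged. The helper names `build_sort_command` and `sort_sample` are hypothetical.

```python
import subprocess
from typing import List


def build_sort_command(sample_id: str, results_dir: str = "./star_results") -> List[str]:
    """Build the samtools sort command as a list of arguments."""
    bam_file = f"{results_dir}/{sample_id}.bam"
    sorted_bam_file = f"{results_dir}/{sample_id}_sorted.bam"
    return ["samtools", "sort", bam_file, "-o", sorted_bam_file]


def sort_sample(sample_id: str) -> None:
    """Run samtools sort without a shell; raises CalledProcessError on failure."""
    subprocess.run(build_sort_command(sample_id), check=True)
```

Inside the loop, `sort_sample(row['sample_id'])` would then replace the last three lines of the example.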

#### Use Case 2: Python Notebook functionalities

The quick actions menu provides a list of ways that you may prompt the coding assistant. In this use case, we will test the `/fix`, `/optimize`, and `/explain` quick actions.

##### **`/fix` Prompt**
1. Open the `quick-actions-testing.ipynb` file
2. Run the notebook
3. Select the cell that contains an error - Cell 3
4. Navigate to the Amazon Q search bar and type in `/fix`
5. Click on the down arrow next to the send button and select "Send message with selection"

##### **Response** 
I'll help fix this cell. The error occurs because `describe()` is a method that should be called on a DataFrame, but here it's being called directly and the DataFrame variable `df` is incorrectly placed after it.

Here's the corrected code:

```python
# View summary statistics
df.describe()
```

This will generate summary statistics including:
- count
- mean
- std (standard deviation)
- min
- 25% (first quartile)
- 50% (median)
- 75% (third quartile)
- max

For more detailed statistics, you could also specify which statistics to include:
```python
df.describe(include='all', percentiles=[.05, .25, .5, .75, .95])
```

Let me know if you need any clarification!

##### **Response Breakdown**
* The response contains the corrected code, a description of what the code does, and suggestions for improving it.
* You can easily implement the suggested changes by clicking on the three dots at the top of the response and selecting "Replace selection".

##### **`/optimize` Prompt**
1. Select the cell in which data is added to the dataframe - Cell 4
2. Navigate to the Amazon Q search bar and type in `/optimize`
3. Click on the down arrow next to the send button and select "Send message with selection"

##### **Response**

I'll optimize the code for adding multiple columns to the DataFrame more efficiently. Here's the improved version:

```python
# Add multiple columns efficiently using a dictionary and assign()
new_data = {
    'Sample_Type': ['Tissue1', 'Tissue2', 'Tissue1', 'Tissue2'],
    'P_Value': [0.05, 0.01, 0.03, 0.07]
}
df = df.assign(**new_data)

# Optional: Verify the data types are optimal
df = df.astype({
    'Sample_Type': 'category',  # Convert to category type for memory efficiency
    'P_Value': 'float32'       # Use float32 instead of float64 if precision allows
})
```

Key optimizations:
1. Uses `assign()` for a single operation instead of multiple assignments
2. Converts string column to category type for memory efficiency
3. Uses appropriate data type for numerical values
4. Reduces DataFrame copies from 2 to 1

Alternative method if you prefer updating in place:
```python
# Alternative optimization using update
df.update(pd.DataFrame(new_data, index=df.index))
```

These optimizations are particularly beneficial for:
- Large datasets
- Memory management
- Performance
- Code maintainability

##### **Response Breakdown** 
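* The response suggests a few optimizations: adding both columns in a single `assign()` call, converting the columns to more compact data types, and an in-place alternative using `update()`.
* Review suggestions before applying them. `DataFrame.update()` only overwrites values in columns that already exist, so the in-place alternative shown would not add the new columns.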

##### **`/explain` Prompt**
1. Select the cell in which matplotlib is used to create a plot from the dataframe - Cell 5
2. Navigate to the Amazon Q search bar and type in `/explain`
3. Click on the down arrow next to the send button and select "Send message with selection"

##### **Response** 
I'll explain this code block line by line and suggest some enhancements:

```python
# Import matplotlib's pyplot module for plotting
import matplotlib.pyplot as plt

# Create a line plot using pandas DataFrame plotting interface
df.plot(
    x='Sample_ID',          # X-axis: Sample identifiers
    y='Expression_Level',   # Y-axis: Expression level values
    kind='line'            # Type of plot: line graph
)

# Display the plot
plt.show()
```

Let's enhance this visualization with better formatting and features:

```python
# Create an enhanced line plot
plt.figure(figsize=(10, 6))  # Set figure size for better visibility

# Create the plot with additional parameters
df.plot(
    x='Sample_ID', 
    y='Expression_Level',
    kind='line',
    marker='o',           # Add data point markers
    linestyle='-',        # Solid line style
    linewidth=2,          # Thicker line
    color='steelblue',    # Professional color
    title='Gene Expression Levels Across Samples'
)

# Customize the plot
plt.xlabel('Sample ID', fontsize=12)
plt.ylabel('Expression Level', fontsize=12)
plt.grid(True, alpha=0.3)  # Add subtle grid
plt.xticks(rotation=45)    # Rotate x-axis labels for better readability

# Add mean expression line
plt.axhline(
    y=df['Expression_Level'].mean(),
    color='red',
    linestyle='--',
    alpha=0.5,
    label='Mean Expression'
)

plt.legend()
plt.tight_layout()  # Adjust layout to prevent label cutoff
plt.show()
```

Key components explained:
1. **Data Visualization**:
   - Shows expression levels (y-axis) for each sample (x-axis)
   - Line connects points to show trends
   - Markers highlight actual data points

2. **Formatting**:
   - Figure size set for clarity
   - Axis labels and title added
   - Grid helps read values
   - Rotated x-axis labels prevent overlap

3. **Statistical Context**:
   - Mean line adds reference
   - Legend identifies plot elements

This visualization helps:
- Track expression changes across samples
- Identify patterns or trends
- Spot potential outliers
- Compare expression levels between samples
- Reference against mean expression

##### **Response Breakdown** 
* The functions in the code snippet are explained. Additionally, Amazon Q provides suggestions to enhance the code.

#### Use Case 3: Rewrite or add to an existing script. 

Prompting can be used to modify an existing script. We will incorporate parallel processing into a script. Parallel processing allows a program to execute multiple tasks simultaneously, which can significantly speed up the execution time, especially for tasks that are computationally intensive. 
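
As a minimal, tool-agnostic sketch of the idea, the snippet below runs several external commands concurrently with Python's standard `concurrent.futures` module. It uses threads, which is one reasonable choice when each task mostly waits on an external program. The helper names `run_one` and `run_all` and the placeholder commands are assumptions for illustration only.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_one(command: List[str]) -> int:
    """Run one external command and return its exit code."""
    return subprocess.run(command).returncode


def run_all(commands: List[List[str]], max_workers: int = 4) -> List[int]:
    """Run commands concurrently; exit codes come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


# Placeholder commands standing in for tools such as FastQC
commands = [[sys.executable, "-c", f"raise SystemExit({code})"] for code in (0, 0, 1)]
exit_codes = run_all(commands)
```

A non-zero entry in `exit_codes` identifies which command failed, because the results keep the order of the input list.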

Note: In this prompt, we include the script. When utilizing Amazon Q in other IDEs such as VSCode or Code Editor, you may reference files through their filepaths. 

##### **Prompt** 

I would like to use parallel processing in my `bioinformatics_testing.py` script. Can you modify this script to do so?

Script: 
```python
import pandas as pd
import subprocess

# Step 1: Read the sample sheet
sample_sheet = pd.read_csv('samplesheet.csv')

# Step 2: Run FastQC
for index, row in sample_sheet.iterrows():
    fastqc_command = f"fastqc {row['file_path']} -o ./fastqc_results/"
    subprocess.run(fastqc_command, shell=True)

# Step 3: Run MultiQC
multiqc_command = "multiqc ./fastqc_results/ -o ./multiqc_report/"
subprocess.run(multiqc_command, shell=True)

# Step 4: Run STAR aligner
for index, row in sample_sheet.iterrows():
    star_command = f"STAR --genomeDir /path/to/genome --readFilesIn {row['file_path']} --outFileNamePrefix ./star_results/{row['sample_id']}"
    subprocess.run(star_command, shell=True)

# Step 5: Index BAM files with Samtools
for index, row in sample_sheet.iterrows():
    bam_file = f"./star_results/{row['sample_id']}.bam"
    samtools_command = f"samtools index {bam_file}"
    subprocess.run(samtools_command, shell=True)
```


##### **Response**

I'll modify your script to use parallel processing for the FastQC and STAR alignment steps, as these are independent operations that can be run in parallel. Here's the modified version:

```python
import pandas as pd
import subprocess
import multiprocessing as mp
from functools import partial
import os

def run_fastqc(row):
    """Run FastQC for a single sample"""
    fastqc_command = f"fastqc {row['file_path']} -o ./fastqc_results/"
    try:
        subprocess.run(fastqc_command, shell=True, check=True)
        return f"FastQC completed for {row['file_path']}"
    except subprocess.CalledProcessError as e:
        return f"Error running FastQC for {row['file_path']}: {str(e)}"

def run_star(row):
    """Run STAR alignment for a single sample"""
    star_command = f"STAR --genomeDir /path/to/genome --readFilesIn {row['file_path']} --outFileNamePrefix ./star_results/{row['sample_id']}"
    try:
        subprocess.run(star_command, shell=True, check=True)
        return f"STAR alignment completed for {row['sample_id']}"
    except subprocess.CalledProcessError as e:
        return f"Error running STAR for {row['sample_id']}: {str(e)}"

def run_samtools_index(row):
    """Index BAM file for a single sample"""
    bam_file = f"./star_results/{row['sample_id']}.bam"
    samtools_command = f"samtools index {bam_file}"
    try:
        subprocess.run(samtools_command, shell=True, check=True)
        return f"Samtools indexing completed for {row['sample_id']}"
    except subprocess.CalledProcessError as e:
        return f"Error running Samtools for {row['sample_id']}: {str(e)}"

def parallel_process(function, data, num_processes=None):
    """Generic parallel processing function"""
    if num_processes is None:
        num_processes = max(1, mp.cpu_count() - 1)  # Leave one CPU free
    
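    with mp.Pool(processes=num_processes) as pool:
        # Pass each sample-sheet row to the worker function
        results = pool.map(function, [row for _, row in data.iterrows()])

    return results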

if __name__ == "__main__":
    # Create output directories if they don't exist
    os.makedirs("./fastqc_results/", exist_ok=True)
    os.makedirs("./star_results/", exist_ok=True)
    os.makedirs("./multiqc_report/", exist_ok=True)

    # Step 1: Read the sample sheet
    sample_sheet = pd.read_csv('samplesheet.csv')

    # Step 2: Run FastQC in parallel
    print("Running FastQC in parallel...")
    fastqc_results = parallel_process(run_fastqc, sample_sheet)
    for result in fastqc_results:
        print(result)

    # Step 3: Run MultiQC (this needs to run after all FastQC jobs are complete)
    print("\nRunning MultiQC...")
    multiqc_command = "multiqc ./fastqc_results/ -o ./multiqc_report/"
    try:
        subprocess.run(multiqc_command, shell=True, check=True)
        print("MultiQC completed successfully")
    except subprocess.CalledProcessError as e:
        print(f"Error running MultiQC: {str(e)}")

    # Step 4: Run STAR aligner in parallel
    print("\nRunning STAR alignment in parallel...")
    star_results = parallel_process(run_star, sample_sheet)
    for result in star_results:
        print(result)

    # Step 5: Index BAM files with Samtools in parallel
    print("\nIndexing BAM files in parallel...")
    samtools_results = parallel_process(run_samtools_index, sample_sheet)
    for result in samtools_results:
        print(result)
```

Key improvements in this version:

1. **Parallel Processing**: Uses Python's multiprocessing to run FastQC, STAR alignment, and Samtools indexing in parallel.

2. **Error Handling**: Added try-except blocks to catch and report errors for each process.

3. **Progress Reporting**: Added status messages to track progress.

4. **Resource Management**: 
   - Automatically determines optimal number of processes based on CPU cores
   - Uses context managers (`with` statements) for proper resource cleanup

5. **Directory Creation**: Automatically creates output directories if they don't exist.

To customize the parallel processing:
```python
# You can specify the number of processes explicitly:
fastqc_results = parallel_process(run_fastqc, sample_sheet, num_processes=4)
```

Important notes:

1. The script will use (number of CPU cores - 1) processes by default. You can adjust this by specifying `num_processes`.

2. Memory usage will increase with the number of parallel processes. Monitor system resources and adjust accordingly.

3. Make sure your system has enough resources to handle parallel processing of your data.

4. Some tools (like STAR) might have their own parallel processing capabilities. You might need to adjust their thread settings to avoid oversubscription.

5. The MultiQC step remains sequential as it needs to process all FastQC results at once.

To optimize performance, you might want to:
- Adjust the number of processes based on your system's capabilities
- Consider the I/O limitations of your storage system
- Monitor memory usage and adjust accordingly
- Consider using tool-specific parallelization options (e.g., STAR's built-in threading)


##### **Response Breakdown**

The response includes the following elements: 

1. A Modified Python Script: The complete modified script with parallel processing for the FastQC, STAR alignment, and Samtools indexing steps.
2. Key Improvements: A list of the main improvements made in the modified script.
3. Customization Instructions: Instructions on how to customize the parallel processing by specifying the number of processes.
4. Notes: Some considerations and caveats related to the script and parallel processing.
5. Performance Optimization Tips: Tips for optimizing performance when using the modified script. 
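
The oversubscription note in the response can be made concrete with a little arithmetic: if each job starts its own threads, the number of jobs running at once should shrink accordingly. A minimal sketch, assuming a hypothetical helper `plan_workers` and that every job uses the same thread count:

```python
import os
from typing import Optional


def plan_workers(threads_per_job: int, total_cpus: Optional[int] = None) -> int:
    """Return how many jobs can run at once without exceeding the CPU count."""
    if threads_per_job < 1:
        raise ValueError("threads_per_job must be at least 1")
    if total_cpus is None:
        total_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, total_cpus // threads_per_job)
```

For example, on a 16-core machine with STAR set to `--runThreadN 4`, `plan_workers(4, 16)` returns 4, which could then be passed as `num_processes`.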

Alternatively, you can prompt the model to suggest ways to modify your existing scripts to run more efficiently. 

##### **Prompt**

What are some ways that I can optimize this script? 

Script: 
```python
import pandas as pd
import subprocess

# Step 1: Read the sample sheet
sample_sheet = pd.read_csv('samplesheet.csv')

# Step 2: Run FastQC
for index, row in sample_sheet.iterrows():
    fastqc_command = f"fastqc {row['file_path']} -o ./fastqc_results/"
    subprocess.run(fastqc_command, shell=True)

# Step 3: Run MultiQC
multiqc_command = "multiqc ./fastqc_results/ -o ./multiqc_report/"
subprocess.run(multiqc_command, shell=True)

# Step 4: Run STAR aligner
for index, row in sample_sheet.iterrows():
    star_command = f"STAR --genomeDir /path/to/genome --readFilesIn {row['file_path']} --outFileNamePrefix ./star_results/{row['sample_id']}"
    subprocess.run(star_command, shell=True)

# Step 5: Index BAM files with Samtools
for index, row in sample_sheet.iterrows():
    bam_file = f"./star_results/{row['sample_id']}.bam"
    samtools_command = f"samtools index {bam_file}"
    subprocess.run(samtools_command, shell=True)

```

##### **Response** 
Here's an optimized version of your script with several improvements: [[1]](https://stackoverflow.com/questions/48773816)

```python
import pandas as pd
import subprocess
import multiprocessing as mp
import os
import logging
from datetime import datetime
from pathlib import Path
import sys
from typing import List, Dict

# Configure logging
def setup_logging(log_dir: str = "logs") -> None:
    """Set up logging configuration"""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = f"{log_dir}/pipeline_{timestamp}.log"
    
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler(sys.stdout)
        ]
    )

def create_directories(dirs: List[str]) -> None:
    """Create necessary directories if they don't exist"""
    for dir_path in dirs:
        Path(dir_path).mkdir(parents=True, exist_ok=True)
        logging.info(f"Created directory: {dir_path}")

def run_command(cmd: str, description: str) -> bool:
    """Execute a shell command and handle errors"""
    try:
        logging.info(f"Starting {description}")
        result = subprocess.run(
            cmd,
            shell=True,
            check=True,
            capture_output=True,
            text=True
        )
        logging.info(f"Completed {description}")
        return True
    except subprocess.CalledProcessError as e:
        logging.error(f"Error in {description}: {e}")
        logging.error(f"Command output: {e.output}")
        return False

def process_sample(row: pd.Series, config: Dict) -> Dict:
    """Process a single sample through the pipeline"""
    results = {"sample_id": row['sample_id'], "success": True, "errors": []}
    
    # Run FastQC
    fastqc_cmd = f"fastqc {row['file_path']} -t {config['threads_per_sample']} -o {config['fastqc_dir']}"
    if not run_command(fastqc_cmd, f"FastQC for {row['sample_id']}"):
        results["success"] = False
        results["errors"].append("FastQC failed")
    
    # Run STAR
    star_cmd = (f"STAR --genomeDir {config['genome_dir']} "
                f"--readFilesIn {row['file_path']} "
                f"--runThreadN {config['threads_per_sample']} "
                f"--outFileNamePrefix {config['star_dir']}/{row['sample_id']}")
    if not run_command(star_cmd, f"STAR alignment for {row['sample_id']}"):
        results["success"] = False
        results["errors"].append("STAR alignment failed")
    
    # Run Samtools
    bam_file = f"{config['star_dir']}/{row['sample_id']}.bam"
    samtools_cmd = f"samtools index {bam_file}"
    if not run_command(samtools_cmd, f"Samtools indexing for {row['sample_id']}"):
        results["success"] = False
        results["errors"].append("Samtools indexing failed")
    
    return results

def main():
    # Configuration
    config = {
        "fastqc_dir": "./fastqc_results",
        "multiqc_dir": "./multiqc_report",
        "star_dir": "./star_results",
        "genome_dir": "/path/to/genome",
        "log_dir": "./logs",
        "threads_per_sample": 4,
        "max_parallel_samples": max(1, mp.cpu_count() - 1)
    }
    
    # Setup
    setup_logging(config['log_dir'])
    create_directories([
        config['fastqc_dir'],
        config['multiqc_dir'],
        config['star_dir'],
        config['log_dir']
    ])
    
    try:
        # Read sample sheet
        logging.info("Reading sample sheet")
        sample_sheet = pd.read_csv('samplesheet.csv')
        
        # Process samples in parallel
        with mp.Pool(processes=config['max_parallel_samples']) as pool:
            results = pool.starmap(
                process_sample,
                [(row, config) for _, row in sample_sheet.iterrows()]
            )
        
        # Run MultiQC
        multiqc_cmd = f"multiqc {config['fastqc_dir']} -o {config['multiqc_dir']}"
        run_command(multiqc_cmd, "MultiQC")
        
        # Report summary
        successful = sum(1 for r in results if r['success'])
        logging.info(f"Pipeline completed. "
                    f"Successful: {successful}/{len(results)} samples")
        
        # Report errors if any
        for result in results:
            if not result['success']:
                logging.error(f"Sample {result['sample_id']} failed: "
                            f"{', '.join(result['errors'])}")
                
    except Exception as e:
        logging.error(f"Pipeline failed: {str(e)}")
        raise

if __name__ == "__main__":
    main()
```

Key optimizations and improvements:

1. **Parallel Processing**:
   - Uses multiprocessing to run samples in parallel
   - Configurable number of parallel processes
   - Efficient resource utilization

2. **Error Handling**:
   - Comprehensive error catching and logging
   - Detailed error messages
   - Continues processing other samples if one fails

3. **Logging System**:
   - Detailed logging with timestamps
   - Both console and file output
   - Organized log directory structure

4. **Code Organization**:
   - Modular functions for better maintainability
   - Type hints for better code clarity
   - Clear separation of concerns

5. **Resource Management**:
   - Configurable threads per sample
   - Automatic CPU core detection
   - Directory creation handling

6. **Progress Tracking**:
   - Status updates for each step
   - Summary report at completion
   - Error reporting for failed samples

To further optimize the script, you could:

1. Add configuration file support:
```python
import yaml

def load_config(config_file: str) -> Dict:
    with open(config_file) as f:
        return yaml.safe_load(f)
```

2. Add checkpointing to resume interrupted runs:
```python
def check_completion(sample_id: str, output_dir: str) -> bool:
    """Check if sample has already been processed"""
    return Path(f"{output_dir}/{sample_id}.bam.bai").exists()
```

3. Add memory management:
```python
def get_memory_usage() -> float:
    """Monitor memory usage"""
    import psutil
    return psutil.Process().memory_info().rss / 1024 / 1024  # MB
```

4. Add performance monitoring:
```python
from time import time

def timer_decorator(func):
    def wrapper(*args, **kwargs):
        start = time()
        result = func(*args, **kwargs)
        end = time()
        logging.info(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper
```

5. Add input validation:
```python
def validate_sample_sheet(df: pd.DataFrame) -> bool:
    """Validate sample sheet format and contents"""
    required_columns = ['sample_id', 'file_path']
    return all(col in df.columns for col in required_columns)
```

These optimizations make the script more robust, efficient, and maintainable while providing better monitoring and error handling capabilities.

##### **Response Breakdown** 

The response includes:

1. An Optimized Python Script: The script uses multiprocessing to run samples in parallel, handles errors comprehensively, sets up detailed logging, organizes code into modular functions, manages resources efficiently, and tracks progress with status updates and summary reports.
2. Explanation of Optimization Techniques: The response explains each of these optimization techniques.
3. Suggestions for Further Optimizations: Further optimizations such as checkpointing and adding a configuration file are suggested.
4. References: The sources used for the optimization are provided.
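
One detail worth checking when adapting the generated `run_command`: with `capture_output=True`, `subprocess.CalledProcessError` stores standard output in `e.output` (also available as `e.stdout`) and standard error in `e.stderr`, and many command-line tools write their error messages to standard error. The following is a small sketch of a variant that returns both streams; the name `run_and_capture` and the placeholder command are hypothetical.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_capture(command: List[str]) -> Tuple[int, str, str]:
    """Run a command and return (exit code, stdout, stderr)."""
    try:
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        return result.returncode, result.stdout, result.stderr
    except subprocess.CalledProcessError as e:
        # e.output holds stdout; the error text is usually in e.stderr
        return e.returncode, e.output, e.stderr


# Placeholder command that fails and writes only to standard error
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
code, out, err = run_and_capture(failing)
```

Logging `err` alongside `out` in the error branch keeps the tool's own message in the pipeline log.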




#### Use Case 4: Code conversion

Code conversion is often necessary when you want to adapt existing scripts to different environments, workflows, or tools. A common use case is converting scripts into a workflow language.

##### **Prompt** 

Convert the `starAlign.slurm` script into a Snakemake workflow.

Script: 

```bash
#!/bin/bash
#SBATCH --job-name=star_alignment
#SBATCH --output=star_output.txt
#SBATCH --error=star_error.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00

module load star

INPUT_DIR=/path/to/input
OUTPUT_DIR=/path/to/output
GENOME_DIR=/path/to/genome

STAR --genomeDir $GENOME_DIR --readFilesIn $INPUT_DIR/sample_R1.fastq $INPUT_DIR/sample_R2.fastq --outFileNamePrefix $OUTPUT_DIR/sample_

```

##### **Response** 

I'll convert your SLURM script into a Snakemake workflow. Here's the implementation with a Snakefile and configuration file: [[1]](https://stackoverflow.com/questions/77929511)

```python
# Snakefile

# Configuration
configfile: "config.yaml"

# Wildcard constraints
wildcard_constraints:
    sample="[A-Za-z0-9_-]+"

# Rules
rule all:
    input:
        expand("{output_dir}/{sample}_Aligned.sortedByCoord.out.bam",
               output_dir=config["output_dir"],
               sample=config["samples"])

rule star_align:
    input:
        r1 = "{input_dir}/{sample}_R1.fastq",
        r2 = "{input_dir}/{sample}_R2.fastq"
    output:
        bam = "{output_dir}/{sample}_Aligned.sortedByCoord.out.bam"
    params:
        genome_dir = config["genome_dir"],
        out_prefix = "{output_dir}/{sample}_"
    threads: 8
    resources:
        mem_mb = 32000,
        time = "02:00:00"
    log:
        "logs/star/{sample}.log"
    shell:
        """
        STAR --genomeDir {params.genome_dir} \
             --readFilesIn {input.r1} {input.r2} \
             --outFileNamePrefix {params.out_prefix} \
             --runThreadN {threads} \
             --outSAMtype BAM SortedByCoordinate \
             2> {log}
        """
```

And here's the corresponding configuration file:

```yaml
# config.yaml

# Directories
input_dir: "/path/to/input"
output_dir: "/path/to/output"
genome_dir: "/path/to/genome"

# Sample names (without _R1/R2.fastq suffix)
samples:
    - "sample1"
    - "sample2"
    - "sample3"
```

To run this workflow with SLURM, create a cluster configuration file:

```yaml
# cluster.yaml

__default__:
    partition: "default"
    output: "logs/slurm/slurm-%j.out"
    error: "logs/slurm/slurm-%j.err"
    nodes: 1
    ntasks: 1

star_align:
    cpus-per-task: 8
    mem: "32G"
    time: "02:00:00"
```

You can also create a submission script:

```bash
#!/bin/bash
# submit.sh

snakemake \
    --profile slurm \
    --jobs 20 \
    --cluster-config cluster.yaml \
    --use-conda \
    --conda-frontend mamba
```

Key features of this implementation:

1. **Modular Configuration**:
   - Separate config files for workflow, cluster, and samples
   - Easy to modify parameters without changing the workflow

2. **Resource Management**:
   - Specifies memory and time requirements
   - Uses threading appropriately
   - Configurable through cluster.yaml

3. **Logging**:
   - Separate log files for each sample
   - SLURM output and error logs

4. **Scalability**:
   - Handles multiple samples
   - Easy to add more samples in config.yaml

To use this workflow:

1. Create the directory structure:
```bash
mkdir -p logs/star logs/slurm
```

2. Modify the config files:
   - Update paths in config.yaml
   - Add your sample names
   - Adjust resources in cluster.yaml if needed

3. Run the workflow:
```bash
# With SLURM profile
snakemake --profile slurm

# Or with direct SLURM submission
snakemake \
    --cluster "sbatch -p {cluster.partition} -t {cluster.time} \
              --mem {cluster.mem} -c {cluster.cpus-per-task}" \
    --cluster-config cluster.yaml \
    --jobs 20
```

Additional tips:

1. Add more rules for quality control or downstream analysis:
```python
rule fastqc:
    input:
        "{input_dir}/{sample}_{read}.fastq"
    output:
        html = "qc/fastqc/{sample}_{read}_fastqc.html",
        zip = "qc/fastqc/{sample}_{read}_fastqc.zip"
    threads: 1
    resources:
        mem_mb = 2000
    shell:
        "fastqc -o qc/fastqc {input}"
```

2. Add conda environment specification:
```yaml
# env.yaml
channels:
  - bioconda
  - conda-forge
dependencies:
  - star=2.7.10a
  - fastqc=0.11.9
```

3. Include quality control in the workflow:
```python
rule all:
    input:
        expand("qc/fastqc/{sample}_{read}_fastqc.html",
               sample=config["samples"],
               read=["R1", "R2"]),
        expand("{output_dir}/{sample}_Aligned.sortedByCoord.out.bam",
               output_dir=config["output_dir"],
               sample=config["samples"])
```


##### **Response Breakdown** 

The response includes the following elements: 

1. Scripts (Snakefile, config.yaml, cluster.yaml, submit.sh):
    * The `Snakefile` defines the workflow rules, including the `all` rule for final outputs and the `star_align` rule for alignment using STAR.
    * The `config.yaml` file specifies workflow configurations.
    * The `cluster.yaml` file specifies default SLURM settings and resource requirements for specific rules.
    * The `submit.sh` script is used to run the Snakemake workflow on a SLURM cluster.
2. The key features of the workflow are highlighted and explained.
3. Instructions are given to create the necessary directories, modify the configuration files, and run the workflow using either the SLURM profile or direct submission.
4. The response suggests adding more rules for quality control or downstream analysis, defining dependencies in an environment file, and including quality control outputs in the workflow.
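
To see what the `expand(...)` calls in the `rule all` blocks produce, the following plain-Python approximation fills a pattern with every combination of the supplied values. It is only an illustration of the idea, not Snakemake's implementation, and `expand_paths` is a hypothetical name.

```python
from itertools import product
from typing import List


def expand_paths(pattern: str, **values: List[str]) -> List[str]:
    """Fill `pattern` with every combination of the given lists of values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]


targets = expand_paths(
    "{output_dir}/{sample}_Aligned.sortedByCoord.out.bam",
    output_dir=["/path/to/output"],
    sample=["sample1", "sample2"],
)
```

Each entry in `targets` is one BAM file that the workflow is asked to produce, which is why adding a sample name to `config.yaml` is enough to schedule another alignment.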

#### Use Case 5: Cloud migration assistance. 

Cloud migration of bioinformatics pipelines involves moving data and computational workflows to the cloud. This allows researchers to use scalable and powerful cloud resources, making it easier to process large datasets and perform complex analyses efficiently and cost-effectively. Prompting can be used to facilitate the migration of pipelines to the cloud. As Amazon Q specializes in queries and tasks related to AWS, let's prompt the model to facilitate migration to AWS infrastructure. 

##### **Prompt** 

Modify this Snakemake workflow to run using cloud resources in AWS. What are the best practices for securing Snakemake workflows when using AWS cloud resources? 

Note: In this prompt, we continue asking questions in the same chat thread that created the Snakemake script earlier. Building on your chat history and refining the response to best suit your needs is an example of iterative (multi-turn) prompting.


##### **Response**

I'll modify the Snakemake workflow for AWS execution and include security best practices. Here's the implementation: [[1]](https://repost.aws/articles/ARVnzwqlzzSU279RlcKvJU5w/securing-your-cloud-infrastructure)

```python
# Snakefile

# Import AWS utilities
import os  # needed for os.environ below

from snakemake.remote.S3 import RemoteProvider
import boto3

# Configure AWS
S3 = RemoteProvider(
    access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY")
)

# Import configuration
configfile: "config.yaml"

# Wildcard constraints
wildcard_constraints:
    sample="[A-Za-z0-9_-]+"

# Target rule
rule all:
    input:
        expand(
            "s3://{bucket}/{output_prefix}/{sample}_Aligned.sortedByCoord.out.bam",
            bucket=config["aws"]["bucket"],
            output_prefix=config["aws"]["output_prefix"],
            sample=config["samples"]
        )

rule star_align:
    input:
        r1 = S3.remote("s3://{bucket}/{input_prefix}/{sample}_R1.fastq",
                      keep_local=True),
        r2 = S3.remote("s3://{bucket}/{input_prefix}/{sample}_R2.fastq",
                      keep_local=True)
    output:
        bam = S3.remote(
            "s3://{bucket}/{output_prefix}/{sample}_Aligned.sortedByCoord.out.bam"
        )
    params:
        genome_dir = config["aws"]["genome_dir"],
        out_prefix = "temp/{sample}_",
        region = config["aws"]["region"]
    threads: 8
    resources:
        mem_mb = 32000,
        time = "02:00:00"
    log:
        S3.remote("s3://{bucket}/{output_prefix}/logs/{sample}.log")
    container:
        "docker://quay.io/biocontainers/star:2.7.10a--h9ee0642_0"
    shell:
        """
        # Create temporary directory
        mkdir -p temp

        # Run STAR
        STAR --genomeDir {params.genome_dir} \
             --readFilesIn {input.r1} {input.r2} \
             --outFileNamePrefix {params.out_prefix} \
             --runThreadN {threads} \
             --outSAMtype BAM SortedByCoordinate \
             2> {log}

        # Cleanup
        rm -rf temp
        """
```

Configuration file (config.yaml):
```yaml
# config.yaml

aws:
  bucket: "XXXXXXXXXXXXXXXXXX"
  region: "us-west-2"
  input_prefix: "raw-data"
  output_prefix: "aligned-data"
  genome_dir: "s3://my-genomics-bucket/reference/genome"
  instance_type: "m5.4xlarge"
  subnet: "subnet-xxxxx"
  security_group: "sg-xxxxx"

samples:
  - "sample1"
  - "sample2"
  - "sample3"
```

AWS Batch configuration (batch.yaml):
```yaml
# batch.yaml

__default__:
  job_queue: "genomics-queue"
  job_definition: "snakemake-job"
  region: "us-west-2"

star_align:
  instance_type: "m5.4xlarge"
  vcpus: 8
  memory: 32000
  timeout: 7200
```

IAM Policy for Snakemake:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::my-genomics-bucket/*",
                "arn:aws:s3:::my-genomics-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "batch:SubmitJob",
                "batch:DescribeJobs",
                "batch:ListJobs"
            ],
            "Resource": "*"
        }
    ]
}
```

AWS Batch submission script:
```python
#!/usr/bin/env python3
# submit_aws.py

import boto3
import yaml
import os

def submit_workflow():
    session = boto3.Session(
        region_name=config["aws"]["region"]
    )
    
    batch = session.client('batch')
    
    response = batch.submit_job(
        jobName='snakemake-workflow',
        jobQueue=config["aws"]["job_queue"],
        jobDefinition=config["aws"]["job_definition"],
        containerOverrides={
            'command': [
                'snakemake',
                '--use-conda',
                '--conda-frontend', 'mamba',
                '--jobs', '20',
                '--default-remote-provider', 'S3',
                '--default-remote-prefix', config["aws"]["bucket"]
            ]
        }
    )
    
    print(f"Submitted job: {response['jobId']}")

if __name__ == "__main__":
    with open("config.yaml") as f:
        config = yaml.safe_load(f)
    submit_workflow()
```

Security Best Practices:

1. **IAM Configuration**:
```python
# Use AWS SDK with minimal permissions
def get_aws_session():
    return boto3.Session(
        region_name=os.environ.get("AWS_REGION"),
        profile_name=os.environ.get("AWS_PROFILE")
    )
```

2. **Secrets Management**:
```python
# Use AWS Secrets Manager for sensitive data
def get_secret():
    session = get_aws_session()
    client = session.client('secretsmanager')
    response = client.get_secret_value(
        SecretId='genomics/credentials'
    )
    return json.loads(response['SecretString'])
```

3. **Network Security**:
```yaml
# VPC configuration in config.yaml
vpc:
  subnet_id: "subnet-xxxxx"
  security_group_ids: 
    - "sg-xxxxx"
  enable_vpc_endpoints: true
```

4. **Data Encryption**:
```python
# Enable S3 encryption
S3 = RemoteProvider(
    access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    config_kwargs={'s3': {'ServerSideEncryption': 'AES256'}}
)
```

5. **Logging and Monitoring**:
```python
# Add CloudWatch logging
rule star_align:
    # ... other directives ...
    log:
        cloudwatch = "aws_logs"
    shell:
        """
        aws logs create-log-stream \
            --log-group-name genomics-workflow \
            --log-stream-name {wildcards.sample}

        STAR ... 2>&1 | tee >(aws logs put-log-events \
            --log-group-name genomics-workflow \
            --log-stream-name {wildcards.sample})
        """
```

Additional Security Measures:

1. **VPC Endpoint Configuration**:
```bash
# Create VPC endpoints for AWS services
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-xxxxx \
    --service-name com.amazonaws.region.s3 \
    --route-table-ids rtb-xxxxx
```

2. **Data Lifecycle Policy**:
```json
{
    "Rules": [
        {
            "ID": "Delete old data",
            "Status": "Enabled",
            "ExpirationInDays": 30
        }
    ]
}
```

3. **AWS KMS Encryption**:
```python
# Use KMS for encryption
S3 = RemoteProvider(
    config_kwargs={
        's3': {
            'ServerSideEncryption': 'aws:kms',
            'SSEKMSKeyId': 'arn:aws:kms:region:account:key/key-id'
        }
    }
)
```

4. **Access Logging**:
```yaml
# Enable S3 access logging
s3_logging:
  target_bucket: "XXXXXXXXXX"
  target_prefix: "genomics-logs/"
```

These security measures ensure:
- Secure data transmission and storage
- Proper access control
- Audit trails
- Network isolation
- Encryption at rest and in transit
- Compliance with security best practices

Remember to:
- Regularly rotate credentials
- Monitor access logs
- Update security groups
- Patch software dependencies
- Implement least privilege access
- Use private subnets where possible
- Enable AWS CloudTrail
- Implement resource tagging

##### **Response Breakdown** 

1. Modified Script: The new response contains modifications that enable the pipeline to run on AWS infrastructure. For example, the Snakefile now uses S3 URLs and AWS-specific parameters. The response also defines an IAM policy for running in AWS and a `submit_aws.py` script, which submits the Snakemake workflow to AWS Batch using the AWS SDK for Python. 
2. Key Features of the Implementation: The response highlights key features such as AWS integration for input/output storage and job scheduling, security best practices (IAM policies, secrets management, network security, data encryption, logging/monitoring), and resource management through AWS Batch settings.
3. Steps to Use the Workflow: Instructions are given to configure AWS settings in the configuration files, define necessary IAM policies, and run the workflow using the AWS Batch submission script.
4. Additional Security Measures: The response suggests creating VPC endpoints for AWS services, setting up S3 lifecycle policies for data retention and deletion, using AWS KMS for server-side encryption, and enabling S3 access logging for audit trails.
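
Because a generated policy should be reviewed before it is attached to a role, one quick check is to look for statements that allow actions on every resource. The following is a minimal sketch and is not part of the Amazon Q response: the function name `find_wildcard_statements` is hypothetical, and the policy is an abbreviated copy of the one shown in the response above.

```python
import json


def find_wildcard_statements(policy: dict) -> list:
    """Return Allow statements whose Resource includes the "*" wildcard."""
    flagged = []
    statements = policy.get("Statement", [])
    # A policy may hold a single statement object instead of a list
    if isinstance(statements, dict):
        statements = [statements]
    for statement in statements:
        resources = statement.get("Resource", [])
        # Resource may be a single string or a list of strings
        if isinstance(resources, str):
            resources = [resources]
        if statement.get("Effect") == "Allow" and "*" in resources:
            flagged.append(statement)
    return flagged


# Abbreviated copy of the IAM policy from the response
policy = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-genomics-bucket/*",
                "arn:aws:s3:::my-genomics-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": ["batch:SubmitJob", "batch:DescribeJobs", "batch:ListJobs"],
            "Resource": "*"
        }
    ]
}
""")

for statement in find_wildcard_statements(policy):
    print("Review wildcard resource for actions:", statement["Action"])
```

A flagged statement is not necessarily wrong, since some actions do not support resource-level restrictions, but it is worth checking against the service documentation before use.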

#### Use Case 6: Write code from scratch. 

Prompting an AI tool like Amazon Q to write a script can be incredibly useful for several reasons. It significantly enhances efficiency by reducing the time and effort needed to create code from scratch. Additionally, it serves as a valuable learning aid, helping users understand coding practices and library usage through generated examples. The generated scripts provide a flexible starting point that can be easily customized to meet specific requirements, allowing users to quickly adapt and expand their projects.

##### **Prompt** 

Can you assist me in writing an R script that generates a plot of gene expression levels from a dataset? Please use the ggplot2 library for visualization. The script should read a CSV file containing gene expression data and produce a bar plot showing the expression levels of each gene. 

##### **Response**

I'll help you create an R script for visualizing gene expression data using ggplot2. Here's a comprehensive solution: [[1]](https://stackoverflow.com/questions/70765003)

```R
# Load required libraries
library(ggplot2)
library(tidyr)
library(dplyr)
library(RColorBrewer)

# Read the gene expression data
# Assuming your CSV file has genes in columns and samples in rows
gene_data <- read.csv("expression_data.csv")

# Convert data from wide to long format for plotting
gene_data_long <- gene_data %>%
  gather(key = "Gene", value = "Expression", -Sample_ID)

# Calculate mean and standard error for each gene
gene_stats <- gene_data_long %>%
  group_by(Gene) %>%
  summarise(
    mean_expr = mean(Expression),
    se = sd(Expression) / sqrt(n()),
    .groups = 'drop'
  )

# Create the basic bar plot
ggplot(gene_stats, aes(x = Gene, y = mean_expr)) +
  # Add bars
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7, width = 0.7) +
  # Add error bars
  geom_errorbar(aes(ymin = mean_expr - se, ymax = mean_expr + se),
                width = 0.2) +
  # Customize theme
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12, face = "bold"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    panel.grid.major.x = element_blank()
  ) +
  # Add labels
  labs(
    title = "Gene Expression Levels",
    x = "Genes",
    y = "Expression Level"
  )

# Save the plot
ggsave("gene_expression_plot.pdf", width = 10, height = 6)

# Alternative visualization: Create a heatmap
# Reorder genes by expression level
gene_data_heatmap <- gene_data_long %>%
  group_by(Gene) %>%
  mutate(mean_expr = mean(Expression)) %>%
  ungroup() %>%
  arrange(desc(mean_expr))

ggplot(gene_data_heatmap, aes(x = Gene, y = Sample_ID, fill = Expression)) +
  geom_tile() +
  scale_fill_gradientn(colors = brewer.pal(9, "YlOrRd")) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5)
  ) +
  labs(
    title = "Gene Expression Heatmap",
    x = "Genes",
    y = "Samples",
    fill = "Expression\nLevel"
  )

# Save the heatmap
ggsave("gene_expression_heatmap.pdf", width = 12, height = 8)

# Add statistical analysis
# Perform ANOVA to test for differences between genes
anova_result <- aov(Expression ~ Gene, data = gene_data_long)
anova_summary <- summary(anova_result)

# Add significance levels to the plot
gene_stats <- gene_stats %>%
  mutate(
    significance = case_when(
      mean_expr > quantile(mean_expr, 0.75) ~ "***",
      mean_expr > quantile(mean_expr, 0.5) ~ "**",
      mean_expr > quantile(mean_expr, 0.25) ~ "*",
      TRUE ~ "ns"
    )
  )

# Create enhanced bar plot with significance levels
ggplot(gene_stats, aes(x = reorder(Gene, -mean_expr), y = mean_expr)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7, width = 0.7) +
  geom_errorbar(aes(ymin = mean_expr - se, ymax = mean_expr + se),
                width = 0.2) +
  geom_text(aes(label = significance, y = mean_expr + se),
            vjust = -0.5) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12, face = "bold"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    panel.grid.major.x = element_blank()
  ) +
  labs(
    title = "Gene Expression Levels with Significance",
    x = "Genes",
    y = "Expression Level",
    caption = "Significance levels: *** p<0.001, ** p<0.01, * p<0.05, ns: not significant"
  )

# Save the enhanced plot
ggsave("gene_expression_plot_with_significance.pdf", width = 12, height = 8)

# Add interactive features using plotly
library(plotly)

interactive_plot <- ggplotly(
  ggplot(gene_stats, aes(x = reorder(Gene, -mean_expr), y = mean_expr,
                        text = paste("Gene:", Gene,
                                   "\nExpression:", round(mean_expr, 2),
                                   "\nSE:", round(se, 2)))) +
    geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
    geom_errorbar(aes(ymin = mean_expr - se, ymax = mean_expr + se),
                  width = 0.2) +
    theme_minimal() +
    labs(title = "Interactive Gene Expression Plot",
         x = "Genes",
         y = "Expression Level")
)

# Save the interactive plot as HTML
htmlwidgets::saveWidget(interactive_plot, "interactive_gene_expression.html")
```

This script provides:

1. **Basic Visualization**:
   - Bar plot with error bars
   - Clear labels and titles
   - Customized theme

2. **Alternative Visualizations**:
   - Heatmap for multiple samples
   - Interactive plot using plotly

3. **Statistical Analysis**:
   - Calculation of means and standard errors
   - ANOVA testing
   - Significance level indicators

4. **Customization Options**:
   - Color schemes
   - Text formatting
   - Plot dimensions

To use this script:

1. Prepare your data in CSV format with columns:
   - Sample_ID
   - Gene names as column headers
   - Expression values in cells

2. Adjust file paths:
```R
gene_data <- read.csv("your_file_path.csv")
```

3. Customize visualization parameters:
```R
# Change colors
ggplot(...) +
  geom_bar(fill = "your_color")

# Adjust plot dimensions
ggsave("output.pdf", width = your_width, height = your_height)
```

4. Add additional analyses as needed:
```R
# Add correlation analysis
cor_matrix <- cor(gene_data[,-1])
corrplot(cor_matrix, method = "color")
```

The script includes error handling and produces publication-quality figures with statistical annotations.

##### **Response Breakdown**
1. Script: The response includes an R script that follows the prompt instructions. Comments within the script help users understand its logic. 
2. Feature explanation: The features included in the script are highlighted.
3. Usage instructions: The response explains how to run the script and customize it as needed. 
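
One way to validate a generated script is to recompute its summary statistics independently. The sketch below is such a cross-check under stated assumptions and is not part of the Amazon Q response: it is written in Python using only the standard library, and the inline CSV and the function name `summarize_expression` are made up for illustration. It expects the layout the response describes (a `Sample_ID` column followed by one column per gene) and computes the per-gene mean and standard error that the R script stores as `mean_expr` and `se`.

```python
import csv
import io
import math
import statistics


def summarize_expression(csv_text: str) -> dict:
    """Return {gene: (mean, standard_error)} from wide-format CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = {}
    for row in reader:
        for column, cell in row.items():
            if column == "Sample_ID":
                continue  # identifier column, not an expression value
            values.setdefault(column, []).append(float(cell))
    summary = {}
    for gene, expr in values.items():
        mean = statistics.mean(expr)
        # statistics.stdev is the sample standard deviation, like R's sd()
        se = statistics.stdev(expr) / math.sqrt(len(expr))
        summary[gene] = (mean, se)
    return summary


# Made-up example data: three samples, two genes
example_csv = """Sample_ID,GeneA,GeneB
S1,2.0,10.0
S2,4.0,10.0
S3,6.0,13.0
"""

for gene, (mean, se) in summarize_expression(example_csv).items():
    print(f"{gene}: mean={mean:.2f}, se={se:.2f}")
```

Reviewing the script this closely also surfaces details worth questioning. For example, the `significance` labels in the generated R code are assigned from quantiles of `mean_expr`, not from the ANOVA p-values that the plot caption refers to.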

#### Use Case 7: Error debugging

Amazon Q can also be used to identify and fix errors in your code. This can save time, particularly for errors that would otherwise be difficult to resolve. 
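
As a hypothetical illustration (not taken from an Amazon Q session), the snippet below shows the kind of bug you might paste into the chat together with its error message. The function names and example sequences are made up. `gc_content_buggy` raises a `ZeroDivisionError` on an empty sequence, and `gc_content` shows one possible fix.

```python
def gc_content_buggy(sequence: str) -> float:
    """Fraction of G and C bases; fails when the sequence is empty."""
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)  # ZeroDivisionError if sequence == ""


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases; returns 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


try:
    gc_content_buggy("")
except ZeroDivisionError as error:
    # The error text is what you would include in a debugging prompt
    print(f"Error to share in the prompt: {error!r}")

print(gc_content("ATGC"))
print(gc_content(""))
```

Including both the failing code and the exact error message in the prompt gives the model the context it needs to suggest a targeted fix.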

### Prompting Best Practices 

Here are some tips to help you in your prompting journey!

1. Be Specific: Clearly state your request or question. Provide specific details to avoid ambiguity.

2. Provide Context: Provide background information that would be relevant to your prompt.

3. Break Down Complex Requests: Divide complex tasks into smaller, manageable parts.

4. Iterate and Refine: Refine your prompts based on the responses you receive. Provide additional context to fine-tune the responses according to your end goal. 

5. Validate the Response: Always use a human-in-the-loop approach to validate the responses you receive. 
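
The first few tips can be applied mechanically by assembling a prompt from separate parts. The sketch below is only one way to do this. The helper `build_prompt` and its parameter names are hypothetical and are not part of any Amazon Q API; the resulting text is simply what you would paste into the chat panel.

```python
from typing import Sequence


def build_prompt(task: str, context: str = "", constraints: Sequence[str] = ()) -> str:
    """Combine background context, a specific task, and requirements into one prompt."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Task: {task}")
    # Listing requirements one by one breaks a complex request into smaller parts
    for number, constraint in enumerate(constraints, start=1):
        parts.append(f"Requirement {number}: {constraint}")
    return "\n".join(parts)


prompt = build_prompt(
    task="Write an R script that plots gene expression levels as a bar plot.",
    context="The input is a CSV file with a Sample_ID column and one column per gene.",
    constraints=["Use the ggplot2 library", "Save the plot as a PDF"],
)
print(prompt)
```

Keeping the context and requirements separate makes it easy to change one part and resubmit when you iterate on a response.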


### Conclusions

Congrats! You have successfully experimented with the features of Amazon Q. We hope you continue leveraging the powers of GenAI and Amazon Q to drive impactful results in your projects.

### Clean Up

Once you have completed the tutorial, you may stop the JupyterLab application and delete the SageMaker Studio domain. 

#### Stopping domain applications

1. Navigate to Amazon SageMaker 
2. Click on Admin configurations > Domains in the left menu bar
3. Select the SageMaker Studio domain by clicking on the circle found on the right side of the domain name
4. Scroll down to the available applications 
5. Select any applications you have created and click Stop
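
These steps can also be scripted. The following is a rough sketch, assuming the AWS SDK for Python (boto3) is installed and that your credentials permit `sagemaker:ListApps` and `sagemaker:DeleteApp`. The function name `delete_domain_apps` and the domain ID placeholder are hypothetical. In the API, a running application is shut down by deleting it with `delete_app`. The function takes the SageMaker client as an argument so that the logic can be tried without touching a real account.

```python
def delete_domain_apps(sagemaker_client, domain_id: str) -> list:
    """Delete every app in the domain that is not already deleted or being deleted."""
    deleted = []
    request = {"DomainIdEquals": domain_id}
    while True:
        page = sagemaker_client.list_apps(**request)
        for app in page.get("Apps", []):
            if app.get("Status") in ("Deleted", "Deleting"):
                continue
            target = {
                "DomainId": domain_id,
                "AppType": app["AppType"],
                "AppName": app["AppName"],
            }
            # An app belongs to either a space or a user profile
            if app.get("SpaceName"):
                target["SpaceName"] = app["SpaceName"]
            else:
                target["UserProfileName"] = app["UserProfileName"]
            sagemaker_client.delete_app(**target)
            deleted.append(app["AppName"])
        # Keep requesting pages until no continuation token is returned
        if not page.get("NextToken"):
            break
        request["NextToken"] = page["NextToken"]
    return deleted


# Example usage (requires AWS credentials; the domain ID is a placeholder):
# import boto3
# delete_domain_apps(boto3.client("sagemaker"), "d-xxxxxxxxxxxx")
```

Review the list of applications in the console first, since this sketch deletes every active application in the domain without asking for confirmation.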

#### Deleting the SageMaker Studio domain

1. Navigate to Amazon SageMaker 
2. Click on Admin configurations > Domains in the left menu bar
3. Click on the domain name 
4. Scroll down to the Delete domain box and select Delete domain