# Hello Google Colab

IMPORTANT: Please make a copy to edit/run: File > "Save a copy in Drive"

Notes:
- Colab sessions expire after at most 12 hours (sooner if left idle)
- Limited computing resources
- Files are temporary unless saved to Google Drive
- Run cells in order


## Some Keyboard Shortcuts

| Command | Description |
|---------|-------------|
| `Ctrl/Cmd + m` | Initiate command |
| `Ctrl/Cmd + m b` | Insert `code` cell below |
| `Ctrl/Cmd + m a` | Insert `code` cell above |
| `Ctrl/Cmd + m d` | Delete selected cell |
| `Ctrl/Cmd + Enter` | **Run** selected cell |
| `Ctrl/Cmd + m y` | Convert to `code` cell |
| `Ctrl/Cmd + m m` | Convert to **markdown** cell |
| `Ctrl/Cmd + m z` | Undo last cell deletion |
| `Ctrl/Cmd + s` | Save notebook |
| `Ctrl/Cmd + f` | Find and replace |
| `Ctrl/Cmd + Shift + h` | Replace within cell |
| `Ctrl/Cmd + h` | Replace within cell |
| `Tab` | Code completion |
| `Shift + Tab` | Show documentation |
| `Esc` | Enter command mode |
| `Enter` | Enter edit mode |

# Setup

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Install Nextflow and scikit-learn
!pip install nextflow
!pip install scikit-learn

Set up directories

In [None]:
%%bash
mkdir -p data
mkdir -p bin
mkdir -p results

In [None]:
!wget https://raw.githubusercontent.com/PMBB-Informatics-and-Genomics/psb2025-workshop/refs/heads/main/penguin_analysis/data/penguins_size.csv -O data/penguins_size.csv
!wget https://raw.githubusercontent.com/PMBB-Informatics-and-Genomics/psb2025-workshop/refs/heads/main/penguin_analysis/bin/data_cleaning.py -O bin/data_cleaning.py
!wget https://raw.githubusercontent.com/PMBB-Informatics-and-Genomics/psb2025-workshop/refs/heads/main/penguin_analysis/bin/species_analysis.py -O bin/species_analysis.py
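
A quick sanity check after downloading can catch a common pitfall: a GitHub `blob` page URL returns an HTML page rather than the file itself. The sketch below assumes the three paths written by the `wget` commands above; the helper name `looks_like_html` is hypothetical and the check is only a heuristic.

```python
from pathlib import Path

def looks_like_html(path):
    """Return True if the file starts like an HTML page rather than code or CSV."""
    head = Path(path).read_bytes()[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")

# Paths written by the wget commands above
downloads = [
    "data/penguins_size.csv",
    "bin/data_cleaning.py",
    "bin/species_analysis.py",
]

for name in downloads:
    p = Path(name)
    if not p.exists():
        print(f"{name}: missing")
    elif looks_like_html(p):
        print(f"{name}: looks like an HTML page, check the URL")
    else:
        print(f"{name}: ok ({p.stat().st_size} bytes)")
```

If a file is flagged, re-download it from a `raw.githubusercontent.com` URL.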

Project Structure

This is how the directory should look:

```sh
penguin_analysis/
├── main.nf
├── nextflow.config
├── data/
│   └── penguins_size.csv
├── bin/
│   ├── data_cleaning.py
│   ├── species_analysis.py
│   └── visualization.py
└── results/
```

# Hello World (Again)
- let's fill in the script ourselves
- process: declare an input and an output
- workflow:
  - make a channel from the greeting list
  - pass the channel to the process and view the output

In [None]:
%%writefile bin/hello_world.nf
#!/usr/bin/env nextflow

process say_hello {
    input:
      // value channel for greeting

    output:
      // redirect to standard output

    script:
        """
        echo "${greeting}, World!"
        """
}

workflow {
    // greeting list
    greeting_list = ['Hello', 'Hola', 'Bonjour', 'Ciao']

    // make channel from list


    // greetings_stdout is another Channel


    // view the channel
    greetings_stdout | view
}

In [None]:
!nextflow run bin/hello_world.nf
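
As a rough mental model only, the channel, process, and `view` pattern resembles mapping a function over a list in Python. This analogy does not capture what Nextflow adds (parallel tasks, isolated work directories, and no guaranteed output order); the names simply mirror the Nextflow skeleton above.

```python
def say_hello(greeting):
    """Stand-in for the say_hello process: one input value, one output string."""
    return f"{greeting}, World!"

greeting_list = ["Hello", "Hola", "Bonjour", "Ciao"]

# The channel emits one item at a time; the process runs once per item
greetings_stdout = [say_hello(g) for g in greeting_list]

# Rough equivalent of `| view`: print each emitted item
for line in greetings_stdout:
    print(line)
```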

# Penguins EDA (non-Nextflow)
- exploratory data analysis

In [None]:
input_file = "data/penguins_size.csv"
df = pd.read_csv(input_file)
df.head()

In [None]:
# Check the statistics of numerical features
df.describe()

In [None]:
# Check the values of categorical features
# Identify categorical columns (e.g., dtype == 'object' or 'category')
categorical_columns = df.select_dtypes(include=['object', 'category']).columns

# Display unique values for each categorical column
for col in categorical_columns:
    unique_values = df[col].unique()
    print(f"Unique values in '{col}': {unique_values}")
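
Before cleaning, it can also help to see how many values are missing per column. The sketch below is one possible check; the helper name `summarize_missing` is hypothetical and the small frame is made-up illustration data, not the penguin dataset. In the notebook, pass `df` instead.

```python
import pandas as pd

def summarize_missing(frame):
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Tiny made-up frame for illustration only
example = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, None, 3800.0],
    "sex": ["MALE", None, None],
})

print(summarize_missing(example))
```

Note that `isna()` only counts true missing values; placeholder strings in a categorical column show up in the unique-value listing above instead.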

# Let's explore the pre-written scripts
Open the file browser in the left sidebar to view:

1. `bin/data_cleaning.py`
2. `bin/species_analysis.py`

These scripts are configured to run on the command line, independently of Nextflow.

# Write Nextflow Code

Write the parameters:
- data path
- data cleaning script
- data analysis script

Write the `clean_data` process
- syntax for the command:
```sh
python ${cleaning_script} --input_file ${raw_input}
```

Write the `species_analysis` process
- syntax for the command:
```sh
python ${analysis_script} --input_file ${cleaned_data} --species ${species}
```
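
To try the two commands outside Nextflow, the same argument lists can be built in Python. This is a sketch under assumptions: the helper names are hypothetical, `sys.executable` stands in for `python`, and the cleaned file is assumed to be written as `penguins_cleaned.csv` in the working directory. By default the commands are only printed, not executed.

```python
import subprocess
import sys

def clean_cmd(cleaning_script, raw_input):
    """Argument list for the cleaning step."""
    return [sys.executable, cleaning_script, "--input_file", raw_input]

def analysis_cmd(analysis_script, cleaned_data, species):
    """Argument list for the per-species analysis step."""
    return [sys.executable, analysis_script,
            "--input_file", cleaned_data, "--species", species]

def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)

run(clean_cmd("bin/data_cleaning.py", "data/penguins_size.csv"))
# Assumed output location of the cleaning step
run(analysis_cmd("bin/species_analysis.py", "penguins_cleaned.csv", "Adelie"))
```

Passing `dry_run=False` actually runs the scripts, which requires the files downloaded during setup.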





In [None]:
%%writefile bin/penguins.nf
#!/usr/bin/env nextflow

params.data = ''
params.cleaning_script = ''
params.analysis_script = ''

process clean_data {
    publishDir "${launchDir}/data/"
    input:

    output:
        path 'penguins_cleaned.csv'

    script:
    """
    script here
    """
}

process species_analysis {
    publishDir "${launchDir}/results/"
    input:

    output:
        path "${species}_basic_stats.csv"
        path "${species}_correlations.png"
        path "${species}_dimorphism_stats.csv"
        path "${species}_distributions.png"

    script:
        """
        script here
        """
}


workflow {
    // create a species channel
    species_channel = Channel.of('Adelie', 'Gentoo', 'Chinstrap')

    // clean the data

    // run the analysis

}

Run the script!

In [None]:
! nextflow run bin/penguins.nf
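
Once the run finishes, `results/` should hold four files per species, matching the `output:` block of `species_analysis`. The sketch below lists any that are missing; the helper names are hypothetical and the directory name assumes the `publishDir` setting used above.

```python
from pathlib import Path

SPECIES = ["Adelie", "Gentoo", "Chinstrap"]
SUFFIXES = [
    "basic_stats.csv",
    "correlations.png",
    "dimorphism_stats.csv",
    "distributions.png",
]

def expected_outputs(species, results_dir="results"):
    """Paths the species_analysis process is declared to produce for one species."""
    return [Path(results_dir) / f"{species}_{suffix}" for suffix in SUFFIXES]

def missing_outputs(results_dir="results"):
    """Expected output paths that do not exist yet."""
    return [
        p
        for s in SPECIES
        for p in expected_outputs(s, results_dir)
        if not p.exists()
    ]

missing = missing_outputs()
if missing:
    print("missing:", [str(p) for p in missing])
else:
    print("all outputs present")
```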

# Results

In [None]:
species = 'Adelie'
# species = 'Gentoo'
# species = 'Chinstrap'

## Results: Stats

In [None]:
input_file = f"results/{species}_basic_stats.csv"
stats = pd.read_csv(input_file)
stats.head()

## Results: Sexual Dimorphism

In [None]:
input_file = f"results/{species}_dimorphism_stats.csv"
dimorphism = pd.read_csv(input_file)
dimorphism.head()

## Results: Correlation

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

correlations = f"results/{species}_correlations.png"

img = mpimg.imread(correlations)
plt.imshow(img)
plt.axis('off')
plt.show()

## Results: Distribution

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

distributions = f"results/{species}_distributions.png"

img = mpimg.imread(distributions)
plt.imshow(img)
plt.axis('off')
plt.show()