# `Project Outline Berlin`


## Objective:

> Build a `Snakemake pipeline` that outputs a dataframe or an Excel sheet showing which amino acid is present at a given position, per lineage and segment of influenza viruses. The tool should be flexible enough to be applied to lineages other than the ones used here, such as H5N1 or H7N9.
>
> The input FASTA files may contain nucleotide sequences from segments of the influenza virus subtypes `H1N1pdm`, `H3N2`, `B-Vic` and `B-Yam`, downloaded from GISAID.
>
> Each FASTA file is first processed by `Nextclade` (for the `HA` & `NA` segments) or `Nextalign` (for the `PB2`, `PB1`, `PA`, `NP`, `MA` & `NS` segments). In this step, the input nucleotide sequences are aligned to their appropriate reference genomes. The translated alignments are truncated to the sequence length of the functional viral protein.
> 
> `Nextclade` and `Nextalign` both output aligned nucleotide sequences and aligned amino acid sequences that are truncated according to a correct numbering used for annotation.
>
>   **Reference for HA numbering:**<br/>
>Burke DF and Smith DJ (2014) <br/>A standardised numbering for all subtypes of Influenza A hemagglutinin (HA) sequences. <br/>PLoS ONE 9(11): e112302 <br/>DOI: 10.1371/journal.pone.0112302 <br/>[Reference table](https://antigenic-cartography.org/surveillance/evergreen/HAnumbering/)
>
> Python scripts will then process the output files from `Nextclade` and `Nextalign` and finally output tables in which each row is an input sequence name and each column is a correctly numbered amino acid position within the analysed segment.

## Input and output files

#### INPUT FILES

1. Downloaded FASTA files from GISAID with nucleotide sequences and headers that are formatted like:

```
Isolate ID | Isolate name | Type | Lineage | Segment | Passage details/history | Collection date
```



**Example fasta header:**    
```
>EPI_ISL_342146|A/Netherlands/00322/2019|A_/_H1N1|pdm09|HA|Original|2019-02-06
```
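The header layout above can be split into named fields. The following is a minimal sketch, assuming every header has exactly seven pipe-separated fields as in the example; the names `GisaidHeader` and `parse_gisaid_header` are hypothetical and not part of the pipeline.

```python
from typing import NamedTuple


class GisaidHeader(NamedTuple):
    """The seven fields of the GISAID header layout described above."""
    isolate_id: str
    isolate_name: str
    virus_type: str
    lineage: str
    segment: str
    passage: str
    collection_date: str


def parse_gisaid_header(header: str) -> GisaidHeader:
    """Split a pipe-delimited GISAID FASTA header into its seven fields."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    if len(fields) != 7:
        raise ValueError(f"Expected 7 fields, got {len(fields)}: {header!r}")
    return GisaidHeader(*fields)


example = parse_gisaid_header(
    ">EPI_ISL_342146|A/Netherlands/00322/2019|A_/_H1N1|pdm09|HA|Original|2019-02-06"
)
print(example.segment, example.collection_date)
```

Raising an error on an unexpected field count is a deliberate choice here: a header that was exported with a different layout would otherwise be assigned to the wrong lineage or segment without any warning.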



2. Lists of amino acid positions per viral protein sequence, grouped by subtype and gene segment.       
    Providing these as an Excel spreadsheet might make the final pipeline more flexible to use.


    *Example input table:*

lineage | segment | position
--------|---------|---------
H1N1pdm | HA      | 156
H1N1pdm | HA      | 158
H1N1pdm | HA      | 162
H1N1pdm | NA      | 274
H1N1pdm | PA      | 31
H3N2    | HA      | 145
H3N2    | HA      | 184
H3N2    | HA      | 185
etc...  | etc...  | etc...




3. Datasets that are used for `Nextclade` and `Nextalign` to run:

- **Nextalign input files:**
    - `_nextalign/data/flu_h1n1pdm_pa/genemap.gff`
    - `_nextalign/data/flu_h1n1pdm_pa/reference.fasta`
    - `_nextalign/data/flu_h1n1pdm_pa/sequences.fasta`

<br/>

- **Nextclade input files:**
    - `_nextclade/data/flu_h1n1pdm_ha/primers.csv`
    - `_nextclade/data/flu_h1n1pdm_ha/genemap.gff`
    - `_nextclade/data/flu_h1n1pdm_ha/qc.json`
    - `_nextclade/data/flu_h1n1pdm_ha/reference.fasta`
    - `_nextclade/data/flu_h1n1pdm_ha/sequences.fasta`
    - `_nextclade/data/flu_h1n1pdm_ha/tag.json`
    - `_nextclade/data/flu_h1n1pdm_ha/tree.json`
    - `_nextclade/data/flu_h1n1pdm_ha/virus_properties.json`



#### OUTPUT FILES

- `Nextclade` and `Nextalign` should write their outputs to dedicated folders that are named after the lineage and segment (`output/Nextclade/H1N1/PB1/`).

- **Nextclade output files:**
    - `output/Nextclade/H1N1/HA/nextclade.csv` (Contains additional clade information)
    - `output/Nextclade/H1N1/HA/nextclade.aligned.fasta`
    - `output/Nextclade/H1N1/HA/nextclade_gene_HA1.translation.fasta`
    - `output/Nextclade/H1N1/HA/nextclade_gene_HA2.translation.fasta`
    - `output/Nextclade/H1N1/HA/nextclade_gene_SigPep.translation.fasta`

<br/>

 - **Nextalign output files:**
    - `output/Nextalign/H1N1/PA/nextalign.aligned.fasta`
    - `output/Nextalign/H1N1/PA/nextalign_gene_PA.translation.fasta`



- A dataframe or Excel sheet listing which amino acids are present at the given positions for each sequence in the final protein alignments. For HA and NA, clades could be included to complement the per-position amino acid data.

## NOTE  |  Conda subdirectory for `osx-64` packages

### Conda set-up within osx-arm64 installation of conda.
Conda packages are installed within an osx-64 (Intel) subdirectory of a Conda installation that itself runs on the osx-arm64 architecture (Apple M1/M2).
In order for the `Snakemake`, `Nextclade` and `Nextalign` packages to work, they should run using a Conda environment that supports osx-64 packages.

This is where we tell Conda that we want our environment to use x86 architecture rather than the native ARM64 architecture. We can do that with following lines of code, which create and activate a new conda environment called `berlin`:

```
CONDA_SUBDIR=osx-64 conda create --name berlin python=3.10
conda activate berlin
conda config --env --set subdir osx-64
```

- The first line creates the environment. We simply set the CONDA_SUBDIR environment variable to indicate that conda should create the environment with an osx-64 Python executable.
- The second line activates the environment, and the third line ensures that conda installs osx-64 versions of packages into the environment.

> This new environment will now be able to use dependencies with osx-64 builds.
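To check from inside the activated environment which architecture the Python interpreter runs as, a small sketch like the one below could be used. It relies on the standard-library call `platform.machine()`; the expectation that an osx-64 interpreter running under Rosetta reports `x86_64` is an assumption, and the helper name `describe_architecture` is hypothetical.

```python
import platform


def describe_architecture(machine: str) -> str:
    """Classify a machine string as returned by platform.machine()."""
    if machine in ("x86_64", "AMD64"):
        return "x86-64 (the architecture of osx-64 packages on macOS)"
    if machine in ("arm64", "aarch64"):
        return "ARM64 (native Apple silicon build)"
    return f"other ({machine})"


# Report the architecture of the interpreter that runs this script.
print(describe_architecture(platform.machine()))
```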

## Conda packages

#### Git:
```
conda install -c anaconda git
```

#### Jupyter:
```
conda install -c anaconda jupyter
```

#### Snakemake (osx-64):
```
conda install -c bioconda snakemake
```

#### Nextalign (osx-64):
```
conda install -c bioconda nextalign
```

#### Nextclade (osx-64):
```
conda install -c bioconda nextclade  
```

#### Graphviz (extension for Snakemake):
Helper tool to visualize the DAG (Directed Acyclic Graph) and save it as a .png file.

*Source:*
- Youtube | [An introduction to Snakemake tutorial for beginners (CC248)](https://www.youtube.com/watch?v=r9PWnEmz_tc)

```
conda install graphviz
```

To use the tool on the Snakemake file, run:

```
snakemake --dag targets | dot -Tpng > dag.png
```

#### *A better working option:*
```
snakemake --forceall --rulegraph | dot -Tpdf > dag.pdf
```


> NOTE: In the video, a ```rule targets``` was defined that lists all the outputs to be generated when running the basic snakemake command:

```snakemake -c 1``` (runs using one core)

# Code chunks

## Retrieve positions of interest for given lineages and segments.
- Use pandas to read an excel file that lists positions per lineage and segment.

    Reminder: the file `input/positions_by_lineage_and_segment.xlsx` has the following columns:

lineage | segment | position
--------|---------|---------
H1N1pdm | HA      | 156
H1N1pdm | HA      | 158
H1N1pdm | HA      | 162
H1N1pdm | NA      | 274
H1N1pdm | PA      | 31
H3N2    | HA      | 145
H3N2    | HA      | 184
H3N2    | HA      | 185
etc...  | etc...  | etc...

<br>

- Import data from `input/positions_by_lineage_and_segment.xlsx` to get the positions to look up within the AA sequence for a given lineage and segment.

In [56]:
# Import libraries:
import pandas as pd
from Bio import SeqIO

#########################################################################################
#                                                                                       #
# FUNCTION: Get headers and amino acid sequences from Nextaligns output files.          #
#                                                                                       #
#########################################################################################

def get_sequence_names_and_amino_acid_sequences(fasta_file):

    # Read the FASTA file and extract the (translated) amino acid sequence records.
    records = list(SeqIO.parse(fasta_file, "fasta"))

    # If there are no sequences in the FASTA file, return two empty lists so
    # that the caller can still unpack the result into two variables.
    if len(records) < 1:
        print("No sequences found in the FASTA file.")
        return [], []

    # Get a list of sequence names from the headers.
    sequence_names = [record.id for record in records]

    # Get a list of the corresponding amino acid sequences.
    amino_acid_sequences = [record.seq for record in records]

    return sequence_names, amino_acid_sequences


#########################################################################################
#                                                                                       #
# FUNCTION: Get a list of positions of interest for lineage and segment of interest.    #
#                                                                                       #
#########################################################################################

def get_positions_from_excel(excel_file, lineage, segment):

    # Use pandas (imported at the top of the cell) to read the columns from the Excel file.
    position_by_lineage_and_segment = pd.read_excel(excel_file)

    # Make a 'selection' by keeping only the rows that match the given lineage and segment.
    selection = position_by_lineage_and_segment.query(
        f'lineage=="{lineage}" & segment=="{segment}"')

    # Get the selected values from the 'position' column as a list.
    positions = selection['position'].to_list()

    return positions


#########################################################################################
#                                                                                       #
# FUNCTION: Make an Excel file listing headers with amino acids at given positions.     #
#                                                                                       #
#########################################################################################

def generate_excel_output(output, sequence_names, amino_acid_sequences, positions):

    data = []
    for sequence_name, amino_acid_sequence in zip(sequence_names, amino_acid_sequences):
        row = [sequence_name]
        row.extend(amino_acid_sequence[pos - 1] for pos in positions)
        data.append(row)

    column_names = ["Sequence"] + [f"Pos {pos}" for pos in positions]

    df = pd.DataFrame(data, columns=column_names)

    df.to_excel(output, index=False)


#########################################################################################
#                                                                                       #
#  Run the functions above...                                                           #
#                                                                                       #
#########################################################################################



#
# -----------------------------    HEADERS & AA SEQUENCES   ----------------------------
#
# Set input fasta file (SNAKEFILE):
aa_fasta = '_nextclade/output/flu_vic_ha/nextclade_gene_HA1.translation.fasta'

# Get 'aa_names' and 'aa_sequences' from input fasta file:
aa_names, aa_sequences = get_sequence_names_and_amino_acid_sequences(
                            fasta_file=aa_fasta
                            )






#
# --------------------------    FILTER POSITIONS OF INTEREST   ------------------------
#
# Set 'positions' using the input Excel file and by defining filters (SNAKEFILE):
positions_table = 'input/positions_by_lineage_and_segment.xlsx'

filter_lineage = 'vic'

filter_segment = 'ha'

# Get the input positions from the input excel_file that are filtered by lineage and segment:
filtered_positions = get_positions_from_excel(
                        excel_file=positions_table,
                        lineage=filter_lineage,
                        segment=filter_segment
                        )






#
# -------------------------    CREATE OUTPUT EXCEL FILE   -----------------------------
#
# Set the name of the final output file (SNAKEFILE):
excel_output = 'output/aa_at_positions_for_vic_HA1.xlsx'

# Create an Excel file as final output:
if aa_names and aa_sequences:
    generate_excel_output(
        output=excel_output,
        sequence_names=aa_names,
        amino_acid_sequences=aa_sequences,
        positions=filtered_positions
    )
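The lookup `amino_acid_sequence[pos - 1]` in `generate_excel_output()` converts a 1-based protein position to a 0-based Python index. It raises an `IndexError` when a position lies beyond the end of a sequence, and a position of `0` would silently return the last residue. The sketch below shows one possible bounds-checked variant on plain strings; the helper name `residues_at_positions`, the `missing` placeholder and the toy sequence are assumptions made for illustration.

```python
def residues_at_positions(sequence: str, positions: list[int], missing: str = "-") -> list[str]:
    """Return the residues at 1-based positions; out-of-range positions give `missing`."""
    residues = []
    for pos in positions:
        if 1 <= pos <= len(sequence):
            # Convert the 1-based position to a 0-based index.
            residues.append(sequence[pos - 1])
        else:
            residues.append(missing)
    return residues


# Toy sequence of 10 residues (not taken from a real alignment).
toy_sequence = "MKTIIALSYI"
print(residues_at_positions(toy_sequence, [1, 3, 10, 11]))
# → ['M', 'T', 'I', '-']
```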


### Looking at the output of the function `input_func_all()` using a dictionary matching segments and genes.

In [50]:
from pathlib import Path

data_dir=Path('input')
# prefixes=[p.stem for p in data_dir.glob('sequences_*_*.fasta')]

# Lineages and Segments in input files for Nextalign or Nextclade.
# lineages = ['h1n1pdm', 'h3n2', 'vic', 'yam']
# segments = ['pb2', 'pb1', 'pa', 'ha', 'np', 'na', 'ma', 'ns']


# Genes in output files from Nextalign or Nextclade.
# genes    = ['PB2', 'PB1', 'PA', 'HA1', 'NP', 'NA', 'MA', 'NS2'] # any() ????
gene_lookup = {'pb2':'PB2', 'pb1':'PB1', 'pa':'PA', 'ha':'HA1', 'np':'NP', 'na':'NA', 'ma':'MA', 'ns':'NS2'}


def input_func_all():
    l,l2,l3 = [],[],[]
    for path in list(data_dir.glob("*.fasta")):
        _, lineage, segment = path.stem.split('_') # sequences_h1n1pdm_ha.fasta
        l.append(f"output/{lineage}/{segment}/nextalign_gene_{gene_lookup[segment]}.translation.fasta")
        l2.append(f"output/{lineage}/{segment}/nextalign.aligned.fasta" )
        l3.append(f"output/aa_at_positions_for_{lineage}_{gene_lookup[segment]}.xlsx")
    # Return a dictionary
    return {"genes_translated" : l, "nt_segments_aligned" : l2, "excel_output" : l3}

rule_all_input_dict = input_func_all()


print(rule_all_input_dict)
print(rule_all_input_dict["genes_translated"])
print(rule_all_input_dict["nt_segments_aligned"])
print(rule_all_input_dict["excel_output"])

{'genes_translated': ['output/vic/pa/nextalign_gene_PA.translation.fasta', 'output/h1n1pdm/np/nextalign_gene_NP.translation.fasta', 'output/vic/ha/nextalign_gene_HA1.translation.fasta', 'output/h3n2/na/nextalign_gene_NA.translation.fasta', 'output/h1n1pdm/ns/nextalign_gene_NS2.translation.fasta', 'output/vic/na/nextalign_gene_NA.translation.fasta', 'output/h3n2/pa/nextalign_gene_PA.translation.fasta', 'output/h3n2/ha/nextalign_gene_HA1.translation.fasta', 'output/h1n1pdm/ma/nextalign_gene_MA.translation.fasta', 'output/h1n1pdm/ha/nextalign_gene_HA1.translation.fasta', 'output/h1n1pdm/pa/nextalign_gene_PA.translation.fasta', 'output/vic/np/nextalign_gene_NP.translation.fasta', 'output/h1n1pdm/pb2/nextalign_gene_PB2.translation.fasta', 'output/h3n2/ma/nextalign_gene_MA.translation.fasta', 'output/h3n2/pb1/nextalign_gene_PB1.translation.fasta', 'output/vic/pb2/nextalign_gene_PB2.translation.fasta', 'output/h3n2/ns/nextalign_gene_NS2.translation.fasta', 'output/vic/ns/nextalign_gene_NS2.tr

## Flatten the output to a list

In [51]:
from pathlib import Path

data_dir=Path('input')

gene_lookup = {'pb2':'PB2', 'pb1':'PB1', 'pa':'PA', 'ha':'HA1', 'np':'NP', 'na':'NA', 'ma':'MA', 'ns':'NS2'}

def input_func_all():
    l,l2,l3 = [],[],[]
    for path in list(data_dir.glob("*.fasta"))[:1]:
        _, lineage, segment = path.stem.split('_') # sequences_h1n1pdm_ha.fasta
        l.append(f"output/{lineage}/{segment}/nextalign_gene_{gene_lookup[segment]}.translation.fasta")
        l2.append(f"output/{lineage}/{segment}/nextalign.aligned.fasta" )
        l3.append(f"output/aa_at_positions_for_{lineage}_{gene_lookup[segment]}.xlsx")
    return {"genes_translated" : l, "nt_segments_aligned" : l2, "excel_output" : l3}


original_list = input_func_all().values()
flat_list_of_output_files = [item for l in original_list for item in l]
flat_list_of_output_files

# input_func_all().values()




['output/vic/pa/nextalign_gene_PA.translation.fasta',
 'output/vic/pa/nextalign.aligned.fasta',
 'output/aa_at_positions_for_vic_PA.xlsx']
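The nested list comprehension used for flattening can also be written with `itertools.chain.from_iterable` from the standard library. A minimal sketch, using the three paths from the output above as example data (the variable names are hypothetical):

```python
from itertools import chain

# Example dictionary with one list of output files per kind of output.
outputs_by_kind = {
    "genes_translated": ["output/vic/pa/nextalign_gene_PA.translation.fasta"],
    "nt_segments_aligned": ["output/vic/pa/nextalign.aligned.fasta"],
    "excel_output": ["output/aa_at_positions_for_vic_PA.xlsx"],
}

# Chain the lists together in dictionary order to get one flat list.
flat_outputs = list(chain.from_iterable(outputs_by_kind.values()))
print(flat_outputs)
```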

### Added gene variable that might be used as wildcard further along...

In [55]:
from pathlib import Path

data_dir=Path('input')

gene_lookup = {'pb2':'PB2', 'pb1':'PB1', 'pa':'PA', 'ha':'HA1', 'np':'NP', 'na':'NA', 'ma':'MA', 'ns':'NS2'}

def input_func_all():
    l,l2,l3 = [],[],[]
    for path in list(data_dir.glob("*.fasta")):
        _, lineage, segment = path.stem.split('_') # sequences_h1n1pdm_ha.fasta
        # Get gene variable from dict.
        gene = gene_lookup[segment]
        l.append(f"output/{lineage}/{segment}/nextalign_gene_{gene}.translation.fasta")
        l2.append(f"output/{lineage}/{segment}/nextalign.aligned.fasta" )
        l3.append(f"output/aa_at_positions_for_{lineage}_{gene}.xlsx")
    # Return a dictionary
    return {"genes_translated" : l, "nt_segments_aligned" : l2, "excel_output" : l3}

rule_all_input_dict = input_func_all()


print(rule_all_input_dict)
print(rule_all_input_dict["genes_translated"])
print(rule_all_input_dict["nt_segments_aligned"])
print(rule_all_input_dict["excel_output"])

{'genes_translated': ['output/vic/pa/nextalign_gene_PA.translation.fasta', 'output/h1n1pdm/np/nextalign_gene_NP.translation.fasta', 'output/vic/ha/nextalign_gene_HA1.translation.fasta', 'output/h3n2/na/nextalign_gene_NA.translation.fasta', 'output/h1n1pdm/ns/nextalign_gene_NS2.translation.fasta', 'output/vic/na/nextalign_gene_NA.translation.fasta', 'output/h3n2/pa/nextalign_gene_PA.translation.fasta', 'output/h3n2/ha/nextalign_gene_HA1.translation.fasta', 'output/h1n1pdm/ma/nextalign_gene_MA.translation.fasta', 'output/h1n1pdm/ha/nextalign_gene_HA1.translation.fasta', 'output/h1n1pdm/pa/nextalign_gene_PA.translation.fasta', 'output/vic/np/nextalign_gene_NP.translation.fasta', 'output/h1n1pdm/pb2/nextalign_gene_PB2.translation.fasta', 'output/h3n2/ma/nextalign_gene_MA.translation.fasta', 'output/h3n2/pb1/nextalign_gene_PB1.translation.fasta', 'output/vic/pb2/nextalign_gene_PB2.translation.fasta', 'output/h3n2/ns/nextalign_gene_NS2.translation.fasta', 'output/vic/ns/nextalign_gene_NS2.tr

In [23]:
from pathlib import Path

data_dir=Path('input')

gene_lookup = {'pb2':'PB2', 'pb1':'PB1', 'pa':'PA', 'ha':'HA1', 'np':'NP', 'na':'NA', 'ma':'M1', 'ns':'NS2'}

def input_func_all():
    l,l2,l3,l4,l5,l6 = [],[],[],[],[],[]
    for path in list(data_dir.glob("*.fasta")):
        _, lineage, segment = path.stem.split('_') # sequences_h1n1pdm_ha.fasta
        # Get gene variable from dict.
        gene = gene_lookup[segment]
        l.append(f"output/{lineage}/{segment}/nextalign_gene_{gene}.translation.fasta")
        l2.append(f"output/{lineage}/{segment}/nextalign.aligned.fasta" )
        l3.append(f"output/{lineage}/{segment}/aa_at_positions_for_{gene}.xlsx")
        l4.append(gene)
        l5.append(segment)
        l6.append(lineage)
    # Return a dictionary
    return {"genes_translated" : l,
            "nt_segments_aligned" : l2,
            "excel_output" : l3,
            "gene": l4,
            "segment": l5,
            "lineage": l6}


rule_all_input_dict = input_func_all()

segments = rule_all_input_dict["segment"]

print(segments)

lineages = rule_all_input_dict["lineage"]

genes = rule_all_input_dict["gene"]


SyntaxError: invalid syntax (1445278234.py, line 20)

In [1]:
from pathlib import Path

data_dir=Path('input')

gene_lookup = {'pb2':'PB2', 'pb1':'PB1', 'pa':'PA', 'ha':'HA1', 'np':'NP', 'na':'NA', 'ma':'M1', 'ns':'NS2'}

def input_func_all():
    l,l2,l3,l4,l5,l6 = [],[],[],[],[],[]
    for path in list(data_dir.glob("*.fasta")):
        _, lineage, segment = path.stem.split('_') # sequences_h1n1pdm_ha.fasta
        # Get gene variable from dict.
        gene = gene_lookup[segment]
        l.append(f"output/{lineage}/{segment}/nextalign_gene_{gene}.translation.fasta")
        l2.append(f"output/{lineage}/{segment}/nextalign.aligned.fasta" )
        l3.append(f"output/{lineage}/{segment}/aa_at_positions_for_{gene}.xlsx")
        l4.append(gene)
        l5.append(segment)
        l6.append(lineage)
    # Return a dictionary
    return {"genes_translated" : l,
            "nt_segments_aligned" : l2,
            "excel_output" : l3,
            "gene": l4,
            "segment": l5,
            "lineage": l6}

rule_all_input_dict = input_func_all()

segments = rule_all_input_dict["segment"]

print(segments)





['pa', 'np', 'ha', 'na', 'ns', 'na', 'pa', 'ha', 'ma', 'ha', 'pa', 'np', 'pb2', 'ma', 'pb1', 'pb2', 'ns', 'ns', 'pb1', 'na', 'np', 'pb1', 'pb2', 'ma']


In [35]:
from pathlib import Path

input_dir = Path('input')
data_dir = Path('data')

gene_lookup = {'pb2': 'PB2', 'pb1': 'PB1', 'pa': 'PA', 'ha': 'HA1', 'np': 'NP', 'na': 'NA', 'ma': 'M1', 'ns': 'NS2'}

def input_func_by_segment(segment_list):
    filtered_filenames = []
    for path in list(input_dir.glob("*.fasta")):
        _, lineage, segment = path.stem.split('_')  # sequences_h1n1pdm_ha.fasta
        if segment in segment_list:
            filtered_filenames.append(str(path))  # Convert the Path object to a string
    return filtered_filenames

ha_and_na_filenames = input_func_by_segment(['ha', 'na'])
print(ha_and_na_filenames)


def data_func_by_segment(segment_list):
    filtered_foldernames = []
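    # NOTE: a Path object is not iterable, so this line raises the TypeError shown below.
    # data_dir.iterdir() would list the contents of the folder instead.
    for path in list(data_dir):
        _, lineage, segment = path.stem.split('_')  # e.g. flu_h1n1pdm_ha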
        if segment in segment_list:
            filtered_foldernames.append(str(path))  # Convert the Path object to a string
    return filtered_foldernames

ha_and_na_data_folders = data_func_by_segment(['ha', 'na'])
print(ha_and_na_data_folders)




['input/sequences_vic_ha.fasta', 'input/sequences_h3n2_na.fasta', 'input/sequences_vic_na.fasta', 'input/sequences_h3n2_ha.fasta', 'input/sequences_h1n1pdm_ha.fasta', 'input/sequences_h1n1pdm_na.fasta']


TypeError: 'PosixPath' object is not iterable

In [37]:
from pathlib import Path
input_dir = Path('input')

gene_lookup = {'pb2': 'PB2', 'pb1': 'PB1', 'pa': 'PA', 'ha': 'HA1', 'np': 'NP', 'na': 'NA', 'ma': 'M1', 'ns': 'NS2'}

def input_func_by_segment(segment_list):
    filtered_filenames = []
    for path in list(input_dir.glob("*.fasta")):
        _, lineage, segment = path.stem.split('_')  # sequences_h1n1pdm_ha.fasta
        if segment in segment_list:
            filtered_filenames.append(str(path))  # Convert the Path object to a string
    return filtered_filenames

ha_and_na_filenames = input_func_by_segment(['ha', 'na'])
print(ha_and_na_filenames)


import os
data_dir = '_nextclade/data'

def data_func_by_segment(segment_list):
    filtered_folders = []
    for root, dirs, files in os.walk(data_dir):
        for folder in dirs:
            _, lineage, segment = folder.split('_')
            if segment in segment_list:
                filtered_folders.append(os.path.join(root, folder))
    return filtered_folders

ha_and_na_data_folders = data_func_by_segment(['ha', 'na'])
print(ha_and_na_data_folders)

['input/sequences_vic_ha.fasta', 'input/sequences_h3n2_na.fasta', 'input/sequences_vic_na.fasta', 'input/sequences_h3n2_ha.fasta', 'input/sequences_h1n1pdm_ha.fasta', 'input/sequences_h1n1pdm_na.fasta']
['_nextclade/data/flu_h3n2_ha', '_nextclade/data/flu_h1n1pdm_na', '_nextclade/data/flu_vic_ha', '_nextclade/data/flu_yam_ha', '_nextclade/data/flu_h1n1pdm_ha', '_nextclade/data/flu_h3n2_na', '_nextclade/data/flu_vic_na']
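As an alternative to `os.walk`, the dataset folders could also be listed with `Path.iterdir()`, which only looks at the top level of the data folder. The sketch below assumes folder names of the form `flu_<lineage>_<segment>` and skips anything that does not match; the function name and the temporary demo folders are hypothetical.

```python
import tempfile
from pathlib import Path


def dataset_folders_for_segments(data_dir: Path, segments: list[str]) -> list[str]:
    """Return names of folders like flu_<lineage>_<segment> for the given segments."""
    selected = []
    for folder in sorted(data_dir.iterdir()):
        parts = folder.name.split("_")
        # Skip files and anything that does not follow the three-part naming.
        if not folder.is_dir() or len(parts) != 3:
            continue
        _, lineage, segment = parts
        if segment in segments:
            selected.append(folder.name)
    return selected


# Build a throwaway folder structure to demonstrate the function.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["flu_h3n2_ha", "flu_vic_na", "flu_h1n1pdm_pa", "notes"]:
        (root / name).mkdir()
    demo_folders = dataset_folders_for_segments(root, ["ha", "na"])

print(demo_folders)
# → ['flu_h3n2_ha', 'flu_vic_na']
```

Sorting the folder listing makes the result independent of the file system order, which differs between runs in the outputs above.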


In [47]:
def input_func_all(segment_list):
    l,l2,l3,l4,l5,l6 = [],[],[],[],[],[]
    for path in list(input_dir.glob("*.fasta")):
        _, lineage, segment = path.stem.split('_') # sequences_h1n1pdm_ha.fasta
        # Get gene variable from dict.
        if segment in segment_list:
            gene = gene_lookup[segment]
            l.append(f"output/{lineage}/{segment}/nextalign_gene_{gene}.translation.fasta")
            l2.append(f"output/{lineage}/{segment}/nextalign.aligned.fasta" )
            l3.append(f"output/{lineage}/{segment}/aa_at_positions_for_{gene}.xlsx")
            l4.append(gene)
            l5.append(segment)
            l6.append(lineage)
    # Return a dictionary
    return {"genes_translated" : l,
            "nt_segments_aligned" : l2,
            "excel_output" : l3,
            "gene": l4,
            "segment": l5,
            "lineage": l6}



# Dictionary containing the 6 lists defined above in the function `input_func_all()`
input_data = input_func_all(['ha', 'na'])
print(input_data)

# Get the dictionary values as lists to expand on using zip later on...
GENES = input_data["gene"]
SEGMENTS = input_data["segment"]
LINEAGES = input_data["lineage"]

print(GENES)
print(SEGMENTS)
print(LINEAGES)

{'genes_translated': ['output/vic/ha/nextalign_gene_HA1.translation.fasta', 'output/h3n2/na/nextalign_gene_NA.translation.fasta', 'output/vic/na/nextalign_gene_NA.translation.fasta', 'output/h3n2/ha/nextalign_gene_HA1.translation.fasta', 'output/h1n1pdm/ha/nextalign_gene_HA1.translation.fasta', 'output/h1n1pdm/na/nextalign_gene_NA.translation.fasta'], 'nt_segments_aligned': ['output/vic/ha/nextalign.aligned.fasta', 'output/h3n2/na/nextalign.aligned.fasta', 'output/vic/na/nextalign.aligned.fasta', 'output/h3n2/ha/nextalign.aligned.fasta', 'output/h1n1pdm/ha/nextalign.aligned.fasta', 'output/h1n1pdm/na/nextalign.aligned.fasta'], 'excel_output': ['output/vic/ha/aa_at_positions_for_HA1.xlsx', 'output/h3n2/na/aa_at_positions_for_NA.xlsx', 'output/vic/na/aa_at_positions_for_NA.xlsx', 'output/h3n2/ha/aa_at_positions_for_HA1.xlsx', 'output/h1n1pdm/ha/aa_at_positions_for_HA1.xlsx', 'output/h1n1pdm/na/aa_at_positions_for_NA.xlsx'], 'gene': ['HA1', 'NA', 'NA', 'HA1', 'HA1', 'NA'], 'segment': ['ha
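The parallel lists `GENES`, `SEGMENTS` and `LINEAGES` are meant to be combined position by position rather than as a full cross product. The sketch below shows that idea in plain Python; `expand_zip` is a hypothetical helper, and in a Snakefile the built-in `expand()` with `zip` would presumably take its place. The example values are shortened versions of the lists printed above.

```python
def expand_zip(pattern: str, **wildcards: list[str]) -> list[str]:
    """Fill `pattern` once per position across equally long wildcard lists."""
    lengths = {len(values) for values in wildcards.values()}
    if len(lengths) > 1:
        raise ValueError("All wildcard lists must have the same length.")
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in zip(*wildcards.values())
    ]


GENES = ["HA1", "NA"]
SEGMENTS = ["ha", "na"]
LINEAGES = ["vic", "h3n2"]

paths = expand_zip(
    "output/{lineage}/{segment}/aa_at_positions_for_{gene}.xlsx",
    lineage=LINEAGES,
    segment=SEGMENTS,
    gene=GENES,
)
print(paths)
```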

In [4]:
gene_lookup = {
    'ha': ['SigPep', 'HA1', 'HA2'],
    'na': ['NA']
}

from pathlib import Path

input_dir = Path('input')

def input_func_all(segment_list):
    l,l2,l3,l4,l5,l6 = [],[],[],[],[],[]
    for path in list(input_dir.glob("*.fasta")):
        _, lineage, segment = path.stem.split('_') # sequences_h1n1pdm_ha.fasta
        if segment in segment_list:
            # Get the list of genes for this segment from the dict.
            gene = gene_lookup[segment]
            l.append(f"output/nextclade/{lineage}/{segment}/nextclade.tsv")
            l2.append(f"output/nextclade/{lineage}/{segment}/nextclade.aligned.fasta" )
            # 'gene' is a list here, so add one translation file per gene
            # (e.g. SigPep, HA1 and HA2 for 'ha').
            l3.extend(f"output/nextclade/{lineage}/{segment}/nextclade_gene_{g}.translation.fasta" for g in gene)
            l4.append(gene)
            l5.append(segment)
            l6.append(lineage)
    # Return a dictionary
    return {"nextclade_tsv" : l,
            "nextclade_aligned" : l2,
            "nextclade_gene_translated" : l3,
            "gene": l4,
            "segment": l5,
            "lineage": l6}

# Dictionary containing the 6 lists defined above in the function `input_func_all()`
input_data = input_func_all(['ha', 'na'])
GENES = input_data["gene"]
SEGMENTS = input_data["segment"]
LINEAGES = input_data["lineage"]
print(GENES)
print(SEGMENTS)
print(LINEAGES)

[['SigPep', 'HA1', 'HA2'], ['NA'], ['NA'], ['SigPep', 'HA1', 'HA2'], ['SigPep', 'HA1', 'HA2'], ['NA']]
['ha', 'na', 'na', 'ha', 'ha', 'na']
['vic', 'h3n2', 'vic', 'h3n2', 'h1n1pdm', 'h1n1pdm']
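Because `GENES` is now a list of lists, it can no longer be zipped one-to-one with `SEGMENTS` and `LINEAGES` to build file names. One way to handle this, sketched below with shortened example lists, is to flatten the data into (lineage, segment, gene) triples first; the variable names `triples` and `translation_files` are hypothetical.

```python
GENES = [["SigPep", "HA1", "HA2"], ["NA"]]
SEGMENTS = ["ha", "na"]
LINEAGES = ["vic", "h3n2"]

# One triple per gene, repeating the lineage and segment it belongs to.
triples = [
    (lineage, segment, gene)
    for lineage, segment, genes in zip(LINEAGES, SEGMENTS, GENES)
    for gene in genes
]

# Build one translation file name per triple.
translation_files = [
    f"output/nextclade/{lineage}/{segment}/nextclade_gene_{gene}.translation.fasta"
    for lineage, segment, gene in triples
]
print(translation_files)
```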


Here we try to count the values across the position columns of a pandas DataFrame.

In [2]:
import pandas as pd

df = pd.read_table('temp/positions_clades_aa_h3n2_HA1.txt')
df.head()



Unnamed: 0,Sequence,clade,Pos 81,Pos 133,Pos 135,Pos 137,Pos 140,Pos 144,Pos 145,Pos 156,Pos 192,Pos 222,Pos 223
0,EPI_ISL_16600559|A/Netherlands/12245/2022|A_/_...,3C.2a1b.2a.2b,D,N,A,S,K,S,S,H,I,R,I
1,EPI_ISL_16700806|A/Netherlands/10125/2023|A_/_...,3C.2a1b.2a.2a.3a,N,N,A,S,I,S,S,S,F,R,V
2,EPI_ISL_16812568|A/Netherlands/10162/2023|A_/_...,3C.2a1b.2a.2b,N,N,A,S,K,S,S,H,I,R,I
3,EPI_ISL_16699238|A/Netherlands/00245/2023|A_/_...,3C.2a1b.2a.2b,N,N,A,S,K,S,S,H,I,R,I
4,EPI_ISL_16254955|A/Netherlands/01438/2022|A_/_...,3C.2a1b.2a.2b,N,N,A,S,K,S,S,H,I,R,I


In [3]:
pos_columns = df.columns[2:]
pos_columns

Index(['Pos 81', 'Pos 133', 'Pos 135', 'Pos 137', 'Pos 140', 'Pos 144',
       'Pos 145', 'Pos 156', 'Pos 192', 'Pos 222', 'Pos 223'],
      dtype='object')

In [4]:
counts = df[ pos_columns ].value_counts()
counts.to_csv('temp/positions_clades_aa_h3n2_HA1_counts.csv')
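Calling `value_counts()` on several columns at once counts each distinct combination of residues across the selected positions. If a tally per single position is wanted instead, the logic could look like the dependency-free sketch below. It uses `collections.Counter` on plain dictionaries rather than pandas; the function name is hypothetical and the three example rows are taken from a few columns of the table shown above.

```python
from collections import Counter

# Three example rows with a subset of the position columns.
rows = [
    {"Pos 81": "D", "Pos 133": "N", "Pos 140": "K"},
    {"Pos 81": "N", "Pos 133": "N", "Pos 140": "I"},
    {"Pos 81": "N", "Pos 133": "N", "Pos 140": "K"},
]


def count_residues_per_position(rows):
    """Return one Counter of residues for every position column."""
    counts = {}
    for row in rows:
        for column, residue in row.items():
            counts.setdefault(column, Counter())[residue] += 1
    return counts


per_position = count_residues_per_position(rows)
for column, counter in per_position.items():
    print(column, dict(counter))
```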


In [30]:
import pandas as pd

# Import generated Excel file from rule 'get_positions'
excel_output = "output/positions/h1n1pdm/ha/positions_aa_h1n1pdm_HA1.xlsx"

# Import generated csv file from 'nextclade'
nextclade_csv = "output/nextclade/h1n1pdm/ha/nextclade.csv"

# Set the path and name of the final exported Excel output file.
excel_with_clades = "output/positions/h1n1pdm/ha/positions_clades_aa_h1n1pdm_HA1.xlsx"

# Read Excel sheet with pandas and store Dataframe in 'xlsx'
xlsx = pd.read_excel(excel_output, sheet_name="Sheet1")

# Read csv sheet with pandas and store Dataframe in 'csv'
csv = pd.read_csv(nextclade_csv, sep=";")
csv.columns = csv.columns.str.strip()

# Get 'seqName' and 'clade' columns from the csv data
csv[['seqName', 'clade']]

Unnamed: 0,seqName,clade
0,EPI_ISL_17672205|A/Netherlands/01598/2023|A_/_...,6B.1A.5a.2a
1,EPI_ISL_17672203|A/Netherlands/01584/2023|A_/_...,6B.1A.5a.2a.1
2,EPI_ISL_17672201|A/Netherlands/01578/2023|A_/_...,6B.1A.5a.2a
3,EPI_ISL_17672199|A/Netherlands/01568/2023|A_/_...,6B.1A.5a.2a
4,EPI_ISL_17672198|A/Netherlands/01560/2023|A_/_...,6B.1A.5a.2a
...,...,...
600,EPI_ISL_15544289|A/Netherlands/11759/2022|A_/_...,6B.1A.5a.2a.1
601,EPI_ISL_14918898|A/Netherlands/11721/2022|A_/_...,6B.1A.5a.1
602,EPI_ISL_15774528|A/Netherlands/11801/2022|A_/_...,6B.1A.5a.2a.1
603,EPI_ISL_15544291|A/Netherlands/11767/2022|A_/_...,6B.1A.5a.2a.1
