# Overview: Complex Data Structures
-------------------------------------------------------------------------------------------------

In this tutorial we move past simple variables to explore more complex data types, most of them mutable. We have already explored basic data types (string, integer). However, these basic types often fall short of the kinds of structures that are helpful for Python programming in general and bioinformatics processing in particular. We will cover four data structures: two are built into Python (lists and dictionaries), while two are classes imported from Biopython (`Seq` and `SeqRecord`).

## Prerequisites
<ul>
    <li>Submodule Lesson 1:  </li>
    <li>Submodule 1 lesson 2: Variables</li>
</ul>

## Learning objectives

- Define "immutable"
- Work with a **list**  
    - a sequence of objects
- Use a **dict (dictionary)**
    - A collection of objects, each stored with a "key" as a key/value pair
- Use the Biopython **Seq (sequence object)** 
    - A class which can be imported from Biopython. It is like a string, but adds useful bioinformatics methods
- Use the Biopython **SeqRecord**
    - Another class which can be imported from Biopython
    - Combines a sequence variable with additional annotation information
<br>
Let's take a look at these classes of data structures. 

## Getting Started

Run the next code box to load the required packages.

In [None]:
# Run this cell to install the required packages
%pip install jupyterquiz
# Biopython is published on PyPI as "biopython" and imported as "Bio"
%pip install biopython
import json
from jupyterquiz import display_quiz
import os
print("done installing required packages")


# Lists

Lists are basically arrays of information. A list in Python can contain different types of data and *any* Python data type can be an element of a list. Each item in the list is separated by a comma and the entire list is enclosed in square brackets [ ] .
<br>
<br>
The contents of a list can be:
+ edited (they are mutable)
+ concatenated using `+`
+ nested
+ indexed
+ members can be accessed using [] brackets. Counting starts at 0 but indexing can be positive or negative (as we saw with strings.)
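
As a minimal sketch of the properties listed above, the following cell concatenates, indexes, and nests lists. The variable names (`bases`, `more_bases`, `combined`, `nested`) are made up for illustration and are not used elsewhere in this tutorial.

```python
# Hypothetical example lists, not part of the tutorial data
bases = ["A", "C", "G"]
more_bases = ["T", "U"]

# Concatenate with +, which builds a new list
combined = bases + more_bases

# Index from the front (0 is first) or from the back (-1 is last)
first_base = combined[0]
last_base = combined[-1]

# Nest: a list can hold other lists and can mix data types
nested = [combined, 42, "label"]

print(combined)              # ['A', 'C', 'G', 'T', 'U']
print(first_base, last_base) # A U
print(nested[0][1])          # C
```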

## Mutability
First let's define a term that describes the opposite of how lists behave: *immutable*.
<br>
We use the term immutable to describe the *inability* to edit parts of the data in place. You can, however, overwrite the whole variable with new data. Strings are immutable; lists are *mutable*, so their individual elements can be changed.
<br>
Why do we care about this? Immutable types can sometimes let a program run faster, *because* they cannot be changed. However, that is not a significant concern for the small scripts we will use here. 
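
To make the distinction concrete, here is a small sketch contrasting a string (immutable) with a list (mutable). The variable names are illustrative only; the `try`/`except` block is just a way to show the error without stopping the cell.

```python
# A string is immutable: individual characters cannot be reassigned
dna = "AGTC"
try:
    dna[0] = "T"
    string_edit_worked = True
except TypeError:
    string_edit_worked = False

# ...but the whole variable can be overwritten with a new string
dna = "T" + dna[1:]

# A list is mutable: one element can be replaced in place
bases = ["A", "G", "T", "C"]
bases[0] = "T"

print(string_edit_worked)  # False
print(dna)                 # TGTC
print(bases)               # ['T', 'G', 'T', 'C']
```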

In [None]:
# list
list1 = [3,4,5,6]

# the list is mutable, so information may be substituted
list1[2] = 10

print(list1)


Lists have many associated functions which you can use to change and query them:
<table>
<thead>
<tr><th>Function</th><th>Purpose</th><th>Example</th></tr>
</thead>
<tbody>
<tr><td>+</td><td>Concatenates the lists</td><td>list3 = [1, 2, 3] + [4, 5]; Output: [1, 2, 3, 4, 5]</td></tr>
<tr><td>len(x)</td><td>Returns the number of elements in the list</td><td>len([1, 2, 3]) Output: 3</td></tr>
<tr><td>max(x)</td><td>Returns the largest item in the list</td><td>max(["cat", "Fleece", "yogurt"]) Output: 'yogurt'</td></tr>
<tr><td>min(x)</td><td>Returns the smallest item in the list</td><td>min([1, 2, 3]) Output: 1</td></tr>
<tr><td>sum(x)</td><td>Returns the sum of all elements in the list (works only with numbers)</td><td>sum([1, 2, 3]) Output: 6</td></tr>
<tr><td>x.append(item)</td><td>Adds an item to the end of the list</td><td>lst = [1, 2]; lst.append(3) Output: lst becomes [1, 2, 3]</td></tr>
<tr><td>x.index(item, start, end)</td><td>Returns the index of the first occurrence of the item</td><td>lst = [1, 2, 3]; lst.index(2) Output: 1</td></tr>
<tr><td>x.insert(position, item)</td><td>Inserts the item at a location in the list</td><td>lst = [1, 2]; lst.insert(1, "W") Output: lst becomes [1, 'W', 2]</td></tr>
<tr><td>x[start:stop:increment]</td><td>Slicing: returns the elements from start to stop - 1</td><td>a = ["A", "G", "T", "C"]; print(a[0:2]) Output: ['A', 'G']</td></tr>
</tbody>
</table>
<div class="alert alert-block alert-info"> <b>Tip:</b> Try each of these in the Python box below. </div>

In [None]:
# Change the length of a list by appending an element
conditions = ["high","average","low"]
conditions.append("out of range")
print(conditions)

# Find the largest element of a list, here created explicitly in the script.
# For strings, "largest" means last in sort order (uppercase sorts before lowercase), not longest.
max(["AGATTCA", "kidney", "ATG"])

Lists can also be "nested." In other words, a list can contain other lists, which can contain still more lists (or other variable types), and so on. 
<br>
In the code box, we will create a list with two elements.
Each element is itself a list holding a pair of words.

<div class="alert alert-block alert-info"> <b>Tip:</b> Try to add another two-part element to nested_list. </div>

In [None]:
nested_list = [["rs3452789","CYP1"],["rs5900382","MHC1"]]
print(nested_list)
# Add another two-item list to nested_list

print(nested_list)

In [None]:
# Print the list's first element (a sub-list)
print(nested_list[0])
# Print the first sub-list's first element (which is a string)
print(nested_list[0][0])
# Print the third character of the first sub-list's first element
print(nested_list[0][0][2])

<div style="font-size:18px">We can append, insert, and remove list elements.</div>

In [None]:
# Create
list2 = ["a","b","c"]

# Read
print(list2)

# Update
list2.remove("b")
print(list2)

# What about this command? What do you think it will do?
list2.append(["d","e","f"])
# Did it work as you expected?
print(list2)


What happened is that the **append** method adds the whole list as a single element, rather than adding each of its items separately.
<br>
<br>
So one element ends up holding an entire list, rather than just one value.
<br>
<br>
To add the items individually, use **extend** rather than **append**:

In [None]:
list2.extend(["d","e","f"])
print(list2)

list1 = ["a",1,2,(1,2)]
print(list1)
# remove() deletes the first element equal to 2; the (1,2) element is left untouched
list1.remove(2)
print(list1)
# To get all the methods and properties of the list, remove the comment symbol
#help(list)


# Exercise

Test your skill:

- Create a list using the numbers 1 through 5
- Create a list using the letters a-e
- Add the lists together and print the result
- Try the multiplication operator `*`. What happens?

## Slicing

Slicing is a useful way to get a portion of a string or of a list. In bioinformatics, we might want to look at the first 3 bases or amino acids in each string in a FASTA file. 
<br>
Slicing has 3 arguments.

- [ start : stop : increment ]

Notice the same three arguments are also used with **string** data structures.

With slicing, start/stop can be *negative*, implying counting from the end.

The default increment is 1, but it can be any nonzero integer (including negative).
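
As a quick illustration of the three slicing arguments, the sketch below pulls pieces out of a short DNA string and a list. The sequence `ATGGCCTAA` and the variable names are invented for this example.

```python
# Hypothetical DNA string used only for illustration
dna = "ATGGCCTAA"

first_codon = dna[0:3]    # characters at positions 0, 1, 2
last_codon = dna[-3:]     # the final three characters
every_third = dna[::3]    # positions 0, 3, 6
reversed_dna = dna[::-1]  # a negative increment walks backwards

# The same syntax works on a list
numbers = [10, 20, 30, 40, 50]
middle = numbers[1:4]     # positions 1, 2, 3

print(first_codon, last_codon)  # ATG TAA
print(every_third)              # AGT
print(reversed_dna)             # AATCCGGTA
print(middle)                   # [20, 30, 40]
```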

In [None]:
# Slice stepping examples
seq1 = list(range(1,11))  # Ends up as [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(seq1[-1])         # Take the first element counting from the end (the last element).
print(seq1[:9:2])       # Start from position 0, stop before position 9, take every other element.
print(seq1[9:0:-2])     # Start from position 9, move backwards and stop before position 0, take every other element.
print(seq1[-2:-5])      # Start from position -2, try to move to position -5 with the default increment of 1.
                        # A positive increment can't move backwards, so this is empty.
print(seq1[-2:-5:-1])   # Now start from position -2 and move backwards, stopping before position -5.
# How could you print the entire sequence in reverse? Answer in next box

In [None]:
# How would you reverse the entire sequence?
print(seq1[::-1])  

### Negative Indexes
A way to look at negative indexes is to think of the list as wrapping around: index -1 is the last element, -2 is the second to last, and so on.
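
One way to check this picture, sketched with a made-up list named `values`: for a list of length `n`, index `-k` selects the same element as index `n - k`.

```python
values = ["a", "b", "c", "d", "e", "f"]
n = len(values)

# For each k from 1 to n, index -k and index n - k pick the same element
matches = [values[-k] == values[n - k] for k in range(1, n + 1)]

print(values[-1], values[n - 1])  # f f  (both are the last element)
print(values[-n], values[0])      # a a  (both are the first element)
print(all(matches))               # True
```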

In [None]:
# How you look at indexes when using negatives.
# (The variable is named numbers, not list, so the built-in list() function stays usable.)
numbers = [0, 1, 2, 3, 4, 5]
print(numbers)
print(numbers[0:6])
print(numbers[-1:-7:-1])

# Here's how you might want to think of it:
# -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5  # The index of the list
#  0,  1,  2,  3,  4,  5, 0, 1, 2, 3, 4, 5  # The list value at position

Be sure to set your boundary parameters properly.

In [None]:
seq1 = list(range(1,11))

# This next statement gives us an empty list - why?

print(seq1[9:-1:-2])

# Test Your Knowledge
Now it’s your turn to apply what you have learned in the following matching quiz exercise. You **should feel free** to write the code in a Python cell if you cannot predict the outcome. 

In [None]:
from jupyterquiz import display_quiz
listqz="PythonQuizQuestions/ListFunction.json"
display_quiz(listqz)

# Dictionaries

In bioinformatics, you often need a lookup tool, or a way to access parts of a large data set without having to know the data's position in a list. The ideal tool for this is a Python **dictionary**. 
<br>
In Python, a dictionary is a data structure that stores data in **key-value pairs.**  As you can see in the next code box, an example is a dictionary for the genetic code.

The basic properties of a dictionary are:
 + Key-Value Pairs: Each item in a dictionary consists of a unique key and its corresponding value. Think of it like a real-world dictionary where words are the keys and their definitions are the values. Or, for the genetic code, where AUG is the key and Met or Methionine is the value
 + Not position-based: Unlike lists, dictionary items are not accessed by numeric position. (Since Python 3.7, dictionaries do remember the order in which items were inserted.)
 + Mutable: You can add, remove, or modify items in a dictionary after it's created
 + Accessing Values: You retrieve values by using their associated keys
 + Creating dictionaries: use curly braces {}
<br>
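
The properties above can be sketched in a few lines. The dictionary `aa_three` and its three-letter codes are invented for this illustration; `in` and `.get()` are standard Python ways to check for a key without risking a `KeyError`.

```python
# Hypothetical three-letter amino acid codes, for illustration only
aa_three = {"F": "Phe", "P": "Pro"}

aa_three["A"] = "Ala"   # add a new key/value pair
aa_three["P"] = "PRO"   # modify the value for an existing key
del aa_three["F"]       # remove a pair by its key

has_alanine = "A" in aa_three            # membership is tested against the keys
missing = aa_three.get("W", "unknown")   # .get() returns a default instead of raising KeyError

print(aa_three)              # {'P': 'PRO', 'A': 'Ala'}
print(has_alanine, missing)  # True unknown
```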

Let's examine a few common examples

In [None]:
# Ways to Create dicts
aa_names = dict(F="phenylalanine", P="proline", A="alanine") 
aa_Threes={} #an empty dictionary that you would later fill
dict2= {"first": "a", "second": "b", "third": "c"}
codon_table = {
    "UUU": "F", "UUC": "F", "UUA": "L", "UUG": "L",
    "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L",
    "AUU": "I", "AUC": "I", "AUA": "I", "AUG": "M",
    "GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V",
    "UCU": "S", "UCC": "S", "UCA": "S", "UCG": "S",
    "CCU": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACU": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "UAU": "Y", "UAC": "Y", "UAA": "*", "UAG": "*",
    "CAU": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAU": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAU": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "UGU": "C", "UGC": "C", "UGA": "*", "UGG": "W",
    "CGU": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGU": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

# And ways to access key:value pairs
print(aa_names)
print(aa_names["P"])
print(codon_table["ACA"])

Other useful dictionary tools:

- **del** (a keyword that removes an element by its key)
- **update()** (a method that adds new elements or overwrites existing ones)
<br>
<br>
Let's see some examples:

In [None]:
# Remove element
#del aa_names['P']    # Dictionary elements are not indexed by number. Instead, use the key to remove the key:value pair
# Update element
aa_names.update({'G':"glycine"})
print(aa_names)
# And, to give you a taste of loops in coming sections:
for aa in "GGAF":
    print(aa_names[aa], end=" ")
    

## Test your knowledge

In [None]:
print(aa_names)

In [None]:
from jupyterquiz import display_quiz
dict_qz="PythonQuizQuestions/dictTF.json"
display_quiz(dict_qz)

# Creating other data structures 

In Python, you can create new classes to define custom types of variables by using the **class** keyword. 

This is useful because it allows one to group related data (attributes) and behaviors (methods) into a single structure, making your code more organized, reusable, and intuitive. 

For example, if working with genetic data, you might create a **Gene** class to encapsulate information like the gene's name, sequence, and functions, along with methods for analyzing it. 

Even if **you** don't make your own class, it is useful to see how to use these variable classes. 
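
As a rough sketch of the idea, here is what a very small `Gene` class could look like. Everything in it (the attribute names, the `length` and `gc_fraction` methods, and the example gene `demoA`) is hypothetical and chosen only to show how data and behavior are grouped together.

```python
class Gene:
    """A hypothetical container for a gene's name and DNA sequence."""

    def __init__(self, name, sequence):
        # Attributes: the data stored on each Gene object
        self.name = name
        self.sequence = sequence.upper()

    def length(self):
        # Method: behavior that uses the object's own data
        return len(self.sequence)

    def gc_fraction(self):
        # Fraction of bases that are G or C (0.0 for an empty sequence)
        if not self.sequence:
            return 0.0
        gc = self.sequence.count("G") + self.sequence.count("C")
        return gc / len(self.sequence)


example = Gene("demoA", "atggccgatt")
print(example.name, example.length(), example.gc_fraction())  # demoA 10 0.5
```

Attributes are read with a dot (`example.name`) and methods are called with a dot and parentheses (`example.length()`), which is the same pattern used with the Biopython classes.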

## Biopython data structures: Seq and SeqRecord

The open-source library **Biopython** defines two data structures, **`Seq`** and **`SeqRecord`**, which form the backbone for working with biological sequence data. 

The **`Seq`** object represents a sequence (DNA, RNA, or protein) as an immutable string-like object with added biological methods, such as reverse complement (`.reverse_complement()`) and translation (`.translate()`). It enables you to easily manipulate and analyze sequences while maintaining biological accuracy. 

Building on this, the **`SeqRecord`** object is a more versatile structure that not only contains a `Seq` object but also stores rich metadata, such as an identifier (`id`), name, description, and annotations. This makes it particularly useful for handling sequences in the context of FASTA/GenBank files or any scenario where additional information accompanies the sequence. Together, these structures allow biologists to efficiently store, process, and annotate biological sequence data in a programmatic way. We'll look at the SeqRecord in the next tutorial.

## Seq

Biological sequences (typically RNA, DNA, or proteins) can be stored as string variables. But if we know which type of sequence a string holds, there are specific methods that can be applied to it. For example, you might want to generate the transcript of a DNA sequence or find open reading frames. RNA sequences might logically need to be translated into protein sequences. 

Key `Seq` methods are shown below:
<table>
<thead>
<tr><th>Function</th><th>Purpose</th><th>Example</th></tr>
</thead>
<tbody>
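<tr><td>gc_fraction(x) (imported from Bio.SeqUtils)</td><td>Calculates the GC fraction (a value from 0 to 1) of a sequence</td><td>x = Seq("AGTCCAG"); gc_fraction(x) Output: approximately 0.57</td></tr>
<tr><td>x.complement()</td><td>Returns the complement of nucleotide sequence x</td><td>x = Seq("AGTCCAG"); x.complement() Output: Seq('TCAGGTC')</td></tr>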
</tbody>
</table>
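
To see what these operations compute, here is a plain-Python sketch that uses only a dictionary and slicing from earlier in this tutorial. It is not Biopython's implementation; the names `COMPLEMENT`, `plain_complement`, `plain_reverse_complement`, and `plain_gc_fraction` are made up, and the sketch assumes an uppercase DNA string containing only A, C, G, and T.

```python
# Lookup table for complementary DNA bases (assumes uppercase A, C, G, T only)
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def plain_complement(dna):
    # Replace every base with its partner, keeping the order
    return "".join(COMPLEMENT[base] for base in dna)


def plain_reverse_complement(dna):
    # Complement first, then reverse with a negative slice increment
    return plain_complement(dna)[::-1]


def plain_gc_fraction(dna):
    # Share of bases that are G or C, as a value between 0 and 1
    return (dna.count("G") + dna.count("C")) / len(dna)


dna_text = "AGTCCAG"
print(plain_complement(dna_text))          # TCAGGTC
print(plain_reverse_complement(dna_text))  # CTGGACT
print(round(plain_gc_fraction(dna_text), 2))  # 0.57
```

A `Seq` object packages this kind of logic as ready-made methods, so you do not have to write or maintain it yourself.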


In [None]:
from Bio.Seq import Seq
#from Bio.SeqUtils import gc_fraction
dna=Seq("AGTCCAG")
dna.complement() 
#gc_fraction(dna)
#dna.reverse_complement() #returns another sequence object

## SeqRecord

In Biopython, the SeqRecord object is used to hold a <u>biological sequence</u> along with its <u>associated metadata.</u>

If you are familiar with GenBank or EMBL data structures, you can see that the SeqRecord is well-equipped to include the variety of informative data pieces that are used in addition to the simple DNA or Protein sequence alone.

A SeqRecord works a bit like a *dictionary* with some built-in keys that all records will have.

With a SeqRecord, the key attributes you can access directly are:
<ul>
    <li>id: the sequence identifier</li>
    <li>name: the sequence name</li>
    <li>description: a descriptive string</li>
    <li>seq: The biological sequence, stored as a Seq object</li>
    <li>features: A list of SeqFeature objects which provide structured information about features like gene locations, domains, or other relevant biological annotations on the sequence</li>
    <li>dbxrefs: a list of database cross-references</li>
</ul>

In [None]:
#Practice with SeqRec

# Conclusion
In this tutorial, you've learned to work with several data structures that are common in bioinformatics data sets.
<br>
The next tutorial will use tools to import and analyze data sets using [Functions](./Submodule_1_Tutorial4_Functions.ipynb).

## Clean up
Remember to shut down your Jupyter Notebook instance when you are done for the day to avoid unnecessary charges. You can do this by stopping the notebook instance from the Cloud console.